# graphics

# collision detection algorithms
1. Separating-axis theorem (SAT) and manifolds
2. GJK (Gilbert-Johnson-Keerthi, using the Minkowski difference)/EPA

## move layer to a location
### GIMP

## select border
### GIMP
- right-click layer
- alpha to selection

## blur edges
### GIMP
- Layer>Transparency>Add alpha channel
- Layer>Transparency>Alpha to selection
- Select>Shrink
- Select>Feather
- Select>Invert
- Edit>Clear

## Compute unit vs CUDA core
- An AMD compute unit (CU) is a cluster of processing elements (PEs)
- An Nvidia CUDA core is a single processing element

## AMD
### CDNA
- GPU architecture aimed at exascale computing
- Augments scalar and vector processing with Matrix Core Engines
- Adds Infinity Fabric technology to scale up to larger systems
- ROCm: Radeon Open Compute
- MI100 has 7680 PEs arranged into 120 CUs (64 PEs per CU). The CUs build on the foundations of the Graphics Core Next (GCN) architecture and are organized into four compute engines.
- Each CU is re-architected to include a Matrix Core Engine
- No fixed-function hardware for rasterization, tessellation, graphics caches, blending, or the display engine.
- Dedicated logic for HEVC, H.264 and VP9 decoding, which is sometimes used for compute workloads that operate on multimedia data.

![CDNA](images/CDNA.jpg)

- With Infinity Fabric, each GPU connects directly to the other three GPUs in a 4-GPU block

![InfinityFabric](images/InfinityFabric.jpg)
![MI100InfinityFabric](images/MI100InfinityFabric.jpg)

- BERT language model training is 44% faster with Infinity Fabric than with PCIe 4
- The command processor and scheduling logic translate higher-level API commands into compute tasks.
- Compute tasks are in turn implemented as compute arrays and managed by the Asynchronous Compute Engines (ACEs). Each of the four ACEs maintains an independent stream of commands and can dispatch wavefronts to the CUs.
- CDNA builds on GCN's foundation of scalars and vectors and adds matrices as first-class citizens, along with new numerical formats, while preserving backward compatibility with the GCN architecture.
  These Matrix Core Engines add a new family of wavefront-level instructions, the Matrix Fused Multiply-Add (MFMA).
- MFMA instructions perform mixed-precision arithmetic on KxN matrices with INT8, FP16, BF16 (Brain floating point) or FP32 inputs, and output either INT32 or FP32.
  - INT8 - mostly used in ML inference with quantized weights
  - FP32 (8e23m) - ML training and HPC applications
  - FP16 (5e10m) - graphics
  - BF16 (8e7m) - ML training, with few convergence problems

#### Enhanced CU
![EnhanceCU](images/EnhancedCU.png)

#### SIMD view of CDNA architecture
![SIMD view of CDNA architecture](images/SIMDCDNA.jpg)

- The energy consumed by a multiply-accumulate operation scales roughly with the square of the input datatype width, so shifting from FP32 to FP16 or BF16 can save a tremendous amount of energy.

#### ROCm stack
![ROCmStack](images/ROCmStack.jpg)

### Counters and Metrics
#### Various counters
- Computational pipeline counters
- Cache usage counters
- Memory usage counters
- Bus usage counters

#### Types of counters
- Stats counters - instructions/transactions, caches/memory, hit/miss
- Utilization counters - busy cycles, total cycles, latency
- Stalls - core, cache, MC, bus stalls

#### Terminology
- Grid/work-group/wave/work-item
- VALU/SALU - vector/scalar ALU
- SQ - fetch/decoder/scheduler
- SIMD - vector arithmetic pipeline, SIMD/ComputeUnit/ShaderEngine
- LDS - Local Data Share
- GDS - Global Data Share unit
- FLAT - single flat memory space: video/system/LDS/scratch (private) memory
- TCP/TA - L1 data cache/L1 address generation logic
- TCC - L2 data cache
- EA - system memory bus arbiter
- DRM - Direct Rendering Manager

#### Transaction abbreviations:
- VMEM_(WR|RD) - vector memory write/read
- SMEM - scalar memory read
- VWrite - vector memory write
- (V/S)Fetch - vector/scalar memory read
- VMem - video memory (in derived metrics)

#### GCN architecture
![GCNArchitecture](images/GCN.jpg)

#### AMDGPU Address spaces

| Address Space Name | LLVM IR Address Space Number | HSA Segment Name | Hardware Name | Address Size | NULL Value |
|---|---|---|---|---|---|
| Generic | 0 | flat | flat | 64 | 0x0000000000000000 |
| Global | 1 | global | global | 64 | 0x0000000000000000 |
| Region | 2 | N/A | GDS | 32 | Not implemented for AMDHSA |
| Local | 3 | group | LDS | 32 | 0xFFFFFFFF |
| Constant | 4 | constant | same as global | 64 | 0x0000000000000000 |
| Private | 5 | private | scratch | 32 | 0x00000000 |

#### GPU utilization counters
- GRBM_COUNT: Number of GPU clocks
- GRBM_GUI_ACTIVE: Number of GPU busy cycles

#### SQ events
- SQ_WAVES: Number of waves sent to SQs
- SQ_INSTS_(VALU|VMEM_WR|VMEM_RD|SALU|SMEM|FLAT|FLAT_LDS_ONLY|LDS|GDS): Number of instructions issued, per type
- FLAT_LDS_ONLY: FLAT instructions issued that read/wrote only from/to LDS

#### SQ utilization and stalls
- SQ_INST_CYCLES_SALU: Number of cycles needed to execute non-memory-read scalar ops
- SQ_THREAD_CYCLES_VALU: Number of thread-cycles used to execute VALU ops (similar to INST_CYCLES_VALU but multiplied by the number of active threads).
- SQ_WAIT_INST_LDS: Number of wave cycles spent waiting for LDS instruction issue. In units of 4 cycles.
- SQ_ACTIVE_INST_VALU: Number of cycles the SQ instruction arbiter is working on a VALU instruction.
- SQ_LDS_BANK_CONFLICT: Number of cycles LDS is stalled by bank conflicts.

#### TCP/TA counters
- TA_TA_BUSY: TA block is busy.
- TA_FLAT_(READ|WRITE)_WAVEFRONTS: Number of flat opcode reads/writes processed by the TA.
- TCP_TCP_TA_DATA_STALL_CYCLES: TCP stalls the TA data interface.

#### HIP programming
Get the GPU node number by looking for `node:` in the `rocminfo` output.

#### select gpu
Use the `HIP_VISIBLE_DEVICES` environment variable to select the target GPUs for the process at the HIP level.

    export HIP_VISIBLE_DEVICES=0,2

Use the `ROCR_VISIBLE_DEVICES` environment variable to select the target GPUs at the ROCr (ROCm user-mode driver) level.

    export ROCR_VISIBLE_DEVICES=0,2

Pass the selected GPU driver interfaces (`/dev/dri/render#`) to the Docker container.

    ls /dev/dri/render*
    /dev/dri/renderD128 /dev/dri/renderD129 /dev/dri/renderD130 /dev/dri/renderD131
    sudo docker run -it --network=host --device=/dev/kfd \
        --device=/dev/dri/renderD128 --group-add video

## DRM
- Provides memory management, interrupt handling, DMA, a consistent interface, and several other services to graphics drivers, such as:
  - vblank (vertical blanking interval after each frame) event handling
  - memory management
  - output management
  - framebuffer management
  - command submission & fencing
  - suspend/resume support
  - DMA services
- Many of these services are driven by the application interfaces DRM provides through libdrm, which wraps the DRM ioctls.

### DRM tree features
- TTM memory manager
- output configuration and mode setting
- vblank internals

### Driver initialization requirements
- set up command buffers
- create initial output configuration
- initialize core services
- call `drm_dev_alloc()` with a `struct drm_driver *` to allocate the device instance; the `struct drm_driver` contains static information that describes the driver and the features it supports, plus pointers to methods that the DRM core will call to implement the DRM API.
- after the device instance is initialized, register it with `drm_dev_register()`
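
Once a driver has registered its device, user space sees it as nodes under `/dev/dri`, such as the `renderD128` entries listed earlier. The following is a minimal sketch, assuming a hypothetical helper named `drm_device_flags` (it is not part of ROCm, Docker, or the DRM tooling), that turns the render nodes found in a directory into `--device=` flags of the kind passed to `docker run` above.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not a ROCm or DRM tool):
# print one --device= flag per DRM render node found in a directory.
drm_device_flags() {
    dir="${1:-/dev/dri}"
    flags=""
    for node in "$dir"/renderD*; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$node" ] || continue
        # Append the flag, inserting a space only between entries.
        flags="${flags:+$flags }--device=$node"
    done
    printf '%s\n' "$flags"
}

# Print the flags for the default DRM device directory
# (prints an empty line on machines without render nodes).
drm_device_flags /dev/dri
```

The output can be spliced into the `docker run` command line in place of the single hand-written `--device=/dev/dri/renderD128` argument when every render node should be exposed to the container.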