# graphics
# --------

# collision detection algorithms
1. separating-axis theorem (SAT) and manifolds
2. GJK (Gilbert-Johnson-Keerthi, using the Minkowski difference) / EPA
   (Expanding Polytope Algorithm)
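
A minimal sketch of the separating-axis test from item 1, assuming 2D convex polygons given as lists of `(x, y)` vertices in order. The function names are illustrative, not from any library, and contact-manifold generation is not covered.

```python
def _edge_normals(poly):
    """Yield one (unnormalized) normal per polygon edge."""
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        yield (-(y2 - y1), x2 - x1)


def _project(poly, axis):
    """Project every vertex onto the axis and return the (min, max) interval."""
    dots = [x * axis[0] + y * axis[1] for x, y in poly]
    return min(dots), max(dots)


def sat_overlap(a, b):
    """Return True if convex polygons a and b overlap (touching counts)."""
    for axis in list(_edge_normals(a)) + list(_edge_normals(b)):
        min_a, max_a = _project(a, axis)
        min_b, max_b = _project(b, axis)
        if max_a < min_b or max_b < min_a:
            return False  # the projections are disjoint: separating axis found
    return True  # no separating axis among the edge normals


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
shifted = [(1, 1), (3, 1), (3, 3), (1, 3)]
corner_triangle = [(3, 1.5), (1.5, 3), (3, 3)]
print(sat_overlap(square, shifted))          # True
print(sat_overlap(square, corner_triangle))  # False
```

The triangle case is the interesting one: its bounding box overlaps the square's, and only the normal of its slanted edge separates the two shapes.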

## move layer to a location
### GIMP


## select border
### GIMP
- right-click the layer
- alpha to selection

## blur edges
### GIMP
- Layer>Transparency>Add alpha channel
- Layer>Transparency>Alpha to selection
- Select>Shrink
- Select>Feather
- Select>Invert
- Edit>Clear

## Compute unit vs CUDA core
- An AMD Compute unit is a cluster of processing elements(PE)
- An Nvidia CUDA core is a single processing element

## AMD
### CDNA
- GPU architecture aimed at exascale computing
- Augments scalar and vector processing with Matrix Core Engines
- Adds Infinity Fabric technology to scale up to larger systems
- ROCm: Radeon Open Compute
- MI100 has 7680 PEs arranged into 120 CUs (64 PEs per CU). The CUs build on
  the foundations of the Graphics Core Next (GCN) architecture and are
  organized into four compute engines.
- Each CU is re-architected to include a Matrix Core Engine
- No fixed-function hardware for rasterization, tessellation, graphics caches,
  blending, or the display engine.
- Dedicated logic for HEVC, H.264 and VP9 decoding, which is sometimes used by
  compute workloads that operate on multimedia data.
![CDNA](images/CDNA.jpg)
- With Infinity Fabric, each GPU connects directly to the other 3 GPUs in a
  4-GPU block
![InfinityFabric](images/InfinityFabric.jpg)
![MI100InfinityFabric](images/MI100InfinityFabric.jpg)
- BERT language model training is 44% faster with Infinity Fabric than with
  PCIe 4
- The command processor and scheduling logic translate higher-level API
  commands into compute tasks.
- Compute tasks are in turn implemented as compute arrays and managed by the
  Asynchronous Compute Engines (ACEs). Each of the four ACEs maintains an
  independent stream of commands and can dispatch wavefronts to the CUs.
- CDNA builds on GCN's foundation of scalars and vectors and adds matrices
  as a first-class citizen, along with new numerical formats, while preserving
  backward compatibility with the GCN architecture. The Matrix Core Engines
  add a new family of wavefront-level instructions, the Matrix Fused
  Multiply-Add (MFMA).
- MFMA performs mixed-precision arithmetic on KxN matrices with INT8, FP16,
  BF16 (Brain FP) and FP32 inputs, and outputs either INT32 or FP32.
- Typical uses of each format:
  - INT8: mostly ML inference with quantized weights
  - FP32 (8e23m): ML training and HPC applications
  - FP16 (5e10m): graphics
  - BF16 (8e7m): ML training, with few convergence problems
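
To make the "INT8 in, INT32 out" case concrete, here is a rough sketch of the arithmetic D = A x B + C in plain Python. It models only the numerics, not how the hardware tiles matrices across a wavefront. The function name `mfma_int8` is made up for illustration, and raising on INT32 overflow is an assumption of this sketch rather than documented hardware behaviour.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def mfma_int8(a, b, c):
    """Return D = A x B + C with INT8 inputs (a, b) and an INT32 accumulator (c)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have the shape of A x B")
    for mat in (a, b):
        for row in mat:
            for v in row:
                if not INT8_MIN <= v <= INT8_MAX:
                    raise ValueError(f"{v} does not fit in INT8")
    d = []
    for i in range(m):
        out_row = []
        for j in range(n):
            acc = c[i][j]  # accumulate at the wider output precision
            for p in range(k):
                acc += a[i][p] * b[p][j]
            if not INT32_MIN <= acc <= INT32_MAX:
                raise OverflowError("accumulator does not fit in INT32")
            out_row.append(acc)
        d.append(out_row)
    return d


# A single product 127 * 127 = 16129 already overflows INT8,
# which is why the accumulator uses a wider type.
print(mfma_int8([[127, 127]], [[127], [127]], [[0]]))  # [[32258]]
```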
#### Enhanced CU
![EnhanceCU](images/EnhancedCU.png)
#### SIMD view of CDNA architecture
![SIMD view of CDNA architecture](images/SIMDCDNA.jpg)
- The energy consumed by a multiply-accumulate operation scales roughly with
  the square of the input datatype width, so shifting from FP32 to FP16 or
  BF16 can save a tremendous amount of energy.
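
As a back-of-the-envelope check of the quadratic rule above, the sketch below compares formats relative to FP32. Using the total bit width as the "size" of a datatype is an assumption made here for illustration, and the helper name is hypothetical.

```python
def relative_mac_energy(bits, baseline_bits=32):
    """Energy of one multiply-accumulate relative to the baseline width,
    under the assumed rule of thumb: energy ~ (input width) ** 2."""
    return (bits / baseline_bits) ** 2


for name, bits in [("FP32", 32), ("FP16", 16), ("BF16", 16), ("INT8", 8)]:
    print(f"{name}: {relative_mac_energy(bits):.4f} x FP32 energy")
```

Under this model, halving the input width from 32 to 16 bits cuts the energy per operation to about a quarter.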
#### ROCm stack
![ROCmStack](images/ROCmStack.jpg)
### Counters and Metrics
#### Various counters
- Computational pipeline counters
- Caches usage counters
- Memory usage counters
- Bus usage counters
#### Types of counters
- Stats counters - instructions/transactions, caches/memory, hit/miss
- Utilization counters - busy cycles, total cycles, latency
- Stalls - core, cache, MC, bus stalls
#### Terminology
- Grid/work-group/wave/work-item
- VALU/SALU - vector/scalar ALU
- SQ - fetch/decoder/scheduler
- SIMD - vector arithmetic pipeline, SIMD/ComputeUnit/ShaderEngine
- LDS - Local Data Share
- GDS - Global Data Share
- FLAT - single flat memory space: video/system/LDS/scratch(private) memory
- TCP/TA - L1 data cache/L1 address generation logic
- TCC - L2 data cache
- EA - system memory bus arbiter
- DRM - Direct Rendering Manager
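
The grid/work-group/wave/work-item hierarchy can be illustrated with a small calculation. This sketch assumes the grid size is counted in work-items (OpenCL style) and a wavefront width of 64, as on GCN/CDNA. The helper name is hypothetical.

```python
WAVEFRONT_SIZE = 64  # assumption: GCN/CDNA wavefront width


def decompose(grid_size, workgroup_size, wave_size=WAVEFRONT_SIZE):
    """Split a 1D grid of work-items into work-groups and waves."""
    workgroups = -(-grid_size // workgroup_size)    # ceiling division
    waves_per_group = -(-workgroup_size // wave_size)
    return {
        "workgroups": workgroups,
        "waves_per_workgroup": waves_per_group,
        "total_waves": workgroups * waves_per_group,
    }


print(decompose(1024, 256))
# {'workgroups': 4, 'waves_per_workgroup': 4, 'total_waves': 16}
```

A work-group size that is not a multiple of the wave width still occupies a whole number of waves, so the last wave runs with idle lanes.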
#### Transaction abbreviations:
- VMEM_(WR|RD) - vector memory write/read
- SMEM - scalar memory read
- VWrite - vector memory write
- (V/S)Fetch - vector/scalar memory read
- VMem - video memory (in derived metrics)
#### GCN architecture
![GCNArchitecture](images/GCN.jpg)
#### AMDGPU Address spaces

| Address Space Name | LLVM IR Address Space Number | HSA Segment Name | Hardware Name | Address Size | NULL Value |
|---|---|---|---|---|---|
| Generic | 0 | flat | flat | 64 | 0x0000000000000000 |
| Global | 1 | global | global | 64 | 0x0000000000000000 |
| Region | 2 | N/A | GDS | 32 | Not implemented for AMDHSA |
| Local | 3 | group | LDS | 32 | 0xFFFFFFFF |
| Constant | 4 | constant | same as global | 64 | 0x0000000000000000 |
| Private | 5 | private | scratch | 32 | 0x00000000 |

#### GPU utilization counters
- GRBM_COUNT: Number of GPU clocks
- GRBM_GUI_ACTIVE: Number of GPU busy cycles
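
Dividing the two counters above gives a busy percentage. A minimal sketch, with made-up sample values and a hypothetical helper name:

```python
def gpu_busy_percent(grbm_gui_active, grbm_count):
    """Percentage of GPU clocks during which the GPU was busy."""
    if grbm_count <= 0:
        raise ValueError("GRBM_COUNT must be positive")
    return 100.0 * grbm_gui_active / grbm_count


# Sample values are invented for illustration only.
print(gpu_busy_percent(grbm_gui_active=750_000, grbm_count=1_000_000))  # 75.0
```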
#### SQ events
- SQ_WAVES: Number of waves sent to SQs
- SQ_INSTS_(VALU|VMEM_WR|VMEM_RD|SALU|SMEM|FLAT|FLAT_LDS_ONLY|LDS|GDS):
  Number of instructions
  - FLAT_LDS_ONLY: FLAT instructions issued that read/wrote only from/to LDS
#### SQ utilization and stalls
- SQ_INST_CYCLES_SALU: Number of cycles needed to execute non-memory read scalar
  ops
- SQ_THREAD_CYCLES_VALU: Number of thread-cycles used to execute VALU ops
  (similar to INST_CYCLES_VALU but multiplied by the number of active threads).
- SQ_WAIT_INST_LDS: Number of wave cycles spent waiting for LDS instruction
  issue. In units of 4 cycles.
- SQ_ACTIVE_INST_VALU: Number of cycles the SQ instruction arbiter is working
  on a VALU instruction.
- SQ_LDS_BANK_CONFLICT: Number of cycles LDS is stalled by bank conflicts.
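
Since SQ_THREAD_CYCLES_VALU is described as INST_CYCLES_VALU multiplied by the number of active threads, their ratio gives the average number of active threads per VALU cycle. The sketch below assumes both counts are available and a wave width of 64. The helper names and sample values are invented for illustration.

```python
def valu_lane_utilization(thread_cycles_valu, inst_cycles_valu, wave_size=64):
    """Return (average active threads, lane utilization in percent)."""
    if inst_cycles_valu <= 0:
        raise ValueError("inst_cycles_valu must be positive")
    avg_active = thread_cycles_valu / inst_cycles_valu
    return avg_active, 100.0 * avg_active / wave_size


def lds_wait_cycles(sq_wait_inst_lds):
    """SQ_WAIT_INST_LDS is reported in units of 4 cycles."""
    return 4 * sq_wait_inst_lds


avg, pct = valu_lane_utilization(thread_cycles_valu=3200, inst_cycles_valu=100)
print(avg, pct)  # 32.0 50.0
```

A utilization well below 100% suggests that many lanes are masked off, for example by divergent branches.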
#### TCP/TA counters
- TA_TA_BUSY: TA block is busy.
- TA_FLAT_(READ|WRITE)_WAVEFRONTS: Number of flat opcode reads/writes processed
  by the TA.
- TCP_TCP_TA_DATA_STALL_CYCLES: Number of cycles the TCP stalls the TA data
  interface.
#### HIP programming
#Get GPU node number, by looking for `node:` in rocminfo
#### select gpu
#Use HIP_VISIBLE_DEVICES environment variable to select the target GPUs for
#the process from the HIP level.
export HIP_VISIBLE_DEVICES=0,2

#Use ROCR_VISIBLE_DEVICES environment variable to select the target GPUs from
#the ROCr (ROCm user-mode driver) level
export ROCR_VISIBLE_DEVICES=0,2

#Pass selected GPU driver interfaces (/dev/dri/render#) to the Docker container.
ls /dev/dri/render*
/dev/dri/renderD128 /dev/dri/renderD129 /dev/dri/renderD130 /dev/dri/renderD131
sudo docker run -it --network=host --device=/dev/kfd \
--device=/dev/dri/renderD128 --group-add video
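
When several render nodes need to be passed through, the device flags from the command above can be generated instead of typed by hand. A small sketch with a hypothetical helper name; it only builds the argument list and leaves the image name and the rest of the command to the caller.

```python
def docker_gpu_args(render_nodes):
    """Build the docker run flags that expose /dev/kfd and the given render nodes."""
    args = ["--device=/dev/kfd"]
    args += [f"--device={node}" for node in render_nodes]
    args += ["--group-add", "video"]
    return args


print(" ".join(docker_gpu_args(["/dev/dri/renderD128", "/dev/dri/renderD130"])))
```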

## DRM
- Provides memory management, interrupt handling, DMA, a consistent interface
  and several other services to graphics drivers, such as
  - vblank (vertical blanking interval after each frame) event handling
  - memory management
  - output management
  - framebuffer management
  - command submission & fencing
  - suspend/resume support
  - DMA services

  Many of these are driven by the application interfaces DRM exposes via DRM
  ioctls, which libdrm wraps.
### DRM tree features
- TTM memory manager
- output configuration and mode setting
- vblank internals
### Driver initialization requirements
- setup command buffers
- create initial output configuration
- initialize core services
- drm_dev_alloc() takes a struct drm_driver * and allocates the device
  instance; struct drm_driver contains static information that describes the
  driver and the features it supports, plus pointers to methods that the DRM
  core will call to implement the DRM API.
- after the device instance is initialized, it is registered using
  drm_dev_register()