## Linux Graphics Driver ##

GPE - Graphics Processing Engine
GC - Graphics Controller
DRM - Direct Rendering Manager

- Proper pipeline configuration (planes, CRTC, connector)
- Well-balanced abstraction of hardware complexity
- DRM Atomic uAPI lets you group changes and apply them to the display pipeline at once
- DRM GEM/TTM memory management
- DRM also does a bit of cache management
- DRM PRIME for dma-buf zero-copy
- DRM Sync Objects for fences
- Various drivers for graphics cards and embedded devices
- Drivers are registered from buses like PCI, platform, i2c, spi, mipi_dsi
- Drivers register components with the DRM KMS framework:
  - `struct drm_device`, for display controllers
  - `struct drm_bridge`, for bridges
  - `struct drm_panel`, for panels
- Drivers need to identify each other:
  - represent the pipeline topology (e.g. device tree)
  - retrieve remote component structures for API use
  - static description via device-tree graph with port/endpoint (or ACPI)
  - node nesting for buses (DSI)

## DRM Display Controller: Base Driver has

- Driver data static declaration: `struct drm_driver` and its members
  - `driver_features`: bitfield of `DRIVER_MODESET`, `DRIVER_ATOMIC`, `DRIVER_GEM`
  - `fops`: default definitions with `DEFINE_DRM_GEM_FOPS`
  - `name`, `desc`, `date`/`major`/`minor` for information
  - various operation callbacks, default definitions with `DRM_GEM_DMA_DRIVER_OPS` (not used by i915)
- Device data: `struct drm_device`
  - created by the DRM framework, which creates the character device nodes such as card and render
- `struct drm_mode_config` holds the mode-setting configuration of the device. Its members:
  - `{min,max}_{width,height}`: framebuffer dimension limits
  - `preferred_depth`: default framebuffer pixel depth
  - `funcs`: driver-specific `struct drm_mode_config_funcs`, which points to functions:
    - `fb_create`: framebuffer creation, with `drm_gem_fb_create()` if `GEM`
    - `atomic_check`: atomically validates whether the mode is supported by the HW.
    - `atomic_commit`: atomically sets the mode.
There are helper functions `drm_atomic_helper_check()` and `drm_atomic_helper_commit()`.
- It can be initialized with `drmm_mode_config_init()`, which automatically triggers `drm_mode_config_cleanup()` (using the `destroy` functions)
- `drm_mode_config_reset()` resets the state using the `reset` functions
- Register the device with `drm_dev_register()`
- Whenever there is a mode change, a `drm_atomic_state` object is created. It is populated with the desired pipeline component states and is destroyed once modesetting is done.
- There is a legacy non-atomic modesetting API as well.
- `struct drm_plane` members
  - `type`: `DRM_PLANE_TYPE_PRIMARY`, `DRM_PLANE_TYPE_OVERLAY`, `DRM_PLANE_TYPE_CURSOR`
  - `possible_crtcs`: valid CRTCs, built with `drm_crtc_mask()`. A plane can be connected to multiple CRTCs.
  - `formats`: list of supported pixel formats.
  - `modifiers`: specify how pixel data in the framebuffer is laid out in memory. They describe tiling, compression, etc. to optimize storage and access for specific GPU hardware. Tiling: pixels are grouped into tiles to increase cache efficiency.
  - `struct drm_plane_funcs *funcs`: reset, destroy, update, disable
  - `struct drm_plane_state`: created when a plane is configured.
- `struct drm_crtc`: `struct drm_crtc_funcs`: `enable_vblank`, `disable_vblank`: enable or disable interrupts on vblank.
- `struct drm_encoder`: nothing much to configure except to be associated with a CRTC and a connector.
- `struct drm_bridge`: it is a separate driver.

TTM - Translation Table Manager, a graphics memory manager in DRM
GEM - Graphics Execution Manager

GEM is a more lightweight, simpler memory manager focused on UMA architectures, while TTM is a more feature-rich, general-purpose memory manager that can handle more complex memory management tasks, including those involving dedicated video memory.

- provides fops callbacks with `DEFINE_DRM_GEM_FOPS` and driver operation callbacks with `DRM_GEM_DMA_DRIVER_OPS`
- provides the `dumb_create` operation to create a simple framebuffer, ideal for scanout uses such as a boot splash or basic display tasks.
- The `dma_alloc_wc()` function allocates write-combining memory (multiple writes are combined into large bursts). If no IOMMU is present, it allocates contiguous memory.
- Also supports non-coherent memory.
- GEM can use the Contiguous Memory Allocator (CMA). It uses reclaimable reserved pools of DRAM that are reserved early at boot with a static size, set by the `cma` kernel command-line parameter, by `CONFIG_CMA_SIZE_MBYTES`, or per device via device tree: a node with compatible `shared-dma-pool` under the `reserved-memory` node, referenced with the `memory-region` property and attached to `struct device` with `of_reserved_mem_device_init()`.

```DT
/ {
    [...]
    reserved-memory {
        #address-cells = <1>;
        #size-cells = <1>;
        ranges;

        gfx_memory: framebuffer {
            size = <0x01000000>;
            alignment = <0x01000000>;
            compatible = "shared-dma-pool";
            reusable;
        };
    };
};

gfx: display@1e6e6000 {
    compatible = "aspeed,ast2600-gfx", "syscon";
    [...]
    memory-region = <&gfx_memory>;
};
```

Gen - Intel Gen architecture was introduced in 2004.
SGX -
GTT - Graphics Translation Table
PPGTT - Per-Process GTT
SA - System Agent
HWS_PAGE - Hardware Status Page
AMBA - Advanced Microcontroller Bus Architecture
ASB - Advanced System Bus
APB - Advanced Peripheral Bus
AHB - Advanced High-performance Bus
AXI - Advanced eXtensible Interface
ACE - AXI Coherency Extensions
DVM - Distributed Virtual Memory
MOESI - Modified Owned Exclusive Shared Invalid
HST - Hierarchy Scheduling Technology
HSR - Hidden Surface Removal
TBDR - Tile-Based Deferred Rendering
ASIC - Application-Specific IC
RTU - Ray Tracing Unit
TCL/T&L - Transform, Clipping and Lighting / Transform and Lighting
CCI - Cache Coherent Interconnect
TS - Thread Spawner
URB - Unified Return Buffer is the on-chip memory managed/shared by fixed functions in order for a thread to return data that will be consumed either by a fixed function or by other threads.
EU - Execution Unit is a multithreaded processor within the GEN multiprocessor system. Each EU is a fully capable processor containing instruction fetch and decode, register files, source operand swizzle, SIMD ALU, etc. An EU is also referred to as a GEN core.
EUID - The 4-bit field within thread state register SR0 that identifies the row and column location of the EU where a thread is located. A thread can be uniquely identified by the EUID and TID.
ExecWidth - Execution width is the width of each of several data elements that may be processed by a single GEN SIMD instruction.
EM - Extended Math unit. A shared function that performs more complex math operations on behalf of several EUs.
GMHC -
IP - Instruction Pointer holds the address of the instruction currently being fetched by an EU. Each EU has its own IP.
AIP - Application Instruction Pointer holds the address of the instruction that raised the exception. The thread then jumps to the SIP.
SIP - System IP is one global system IP register for all threads. From a thread's point of view, this is a virtual read-only register. Upon an exception, hardware performs some bookkeeping and then jumps to the SIP.
ARF - Architectural Register File is a collection of architecturally visible registers such as address registers, accumulator, flags, notification registers, IP, null, etc.
BTP - Binding Table Pointer is a pointer to a binding table, which holds pointers to surface state blocks. It is specified as an offset from the Surface State Base Address register.
BLT - Block Image Transfer
B - Signed byte integer
NDC - Normalized Device Coordinates
CC - Color Calculator computes pixel color operations such as blending.
CS - Command Streamer is the functional unit of the GPE that fetches commands, parses them and routes them to the appropriate pipeline.
UE - URB Entry
CURBE - Constant URB Entry is a UE that contains constant data for use by various stages of the pipeline.
CR - Control Register is a read/write register used for thread mode control and exception handling for a thread.
DP - Data Port is a shared function unit that performs the majority of the memory access types on behalf of GEN programs. It contains the render cache and the constant cache and performs all memory accesses requested by GEN programs except those performed by the Sampler.
DQ - Double Quad word is a fundamental data type of 16 bytes.
D or DW - Double word is a fundamental data type of 4 bytes.
QQ - Quad Quad word is a fundamental data type of 32 bytes.
QW - Quad word is a fundamental data type of 8 bytes.
EOB - End Of Block is a 1-bit flag in the non-zero DCT coefficient data buffer.
EOT - End Of Thread is a message sideband signal on the output message bus signifying that the message requester thread is terminated. A thread must have at least one SEND instruction with the EOT bit set in the message descriptor field in order to terminate properly.
ExecSize - Number of data elements processed by a GEN SIMD instruction. It is one of the GEN instruction fields and can be changed per instruction.
ExecWidth - The width of each of several data elements that may be processed by a single SIMD instruction.
FF - Fixed Function is a function of the pipeline that is performed by dedicated (vs. programmable) hardware.
FFID - Unique identifier for an FF unit.
GW - Gateway is a message gateway.
GRF - General Register File is a large read/write register file shared by all the EUs for operand sources and destinations. This is the most commonly used read-write register space, organized as an array of 256-bit registers for a thread.
MRF - Message Register File holds the operands of a message.
fmax - FLT_MAX is the magnitude of the maximum representable single-precision floating-point number according to the IEEE-754 standard. FLT_MAX has an exponent of 0xFE and a mantissa of all ones.
GB - Guard Band is the region that may be clipped against to make sure objects do not exceed the limitations of the renderer's coordinate space.
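
As a quick check of the fmax entry above, the bit pattern can be assembled from the stated exponent (0xFE) and an all-ones mantissa. This is a minimal Python sketch; the helper name `float_from_bits` and the constant names are made up for illustration.

```python
import struct


def float_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


SIGN = 0              # positive value
EXPONENT = 0xFE       # largest exponent field that is not Inf/NaN
MANTISSA = 0x7FFFFF   # 23 mantissa bits, all ones

# Layout: 1 sign bit | 8 exponent bits | 23 mantissa bits.
FMAX_BITS = (SIGN << 31) | (EXPONENT << 23) | MANTISSA
FMAX = float_from_bits(FMAX_BITS)
```

Incrementing the bit pattern by one rolls the exponent over to 0xFF with a zero mantissa, which is the encoding of infinity, so fmax is the last finite value.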
HorzStride - The distance in element-sized units between adjacent elements of a region-based GRF access.
V - Immediate integer vector. A numerical data type of 32 bits; an immediate integer vector of type V contains 8 signed integer elements of 4 bits each. Each 4-bit integer element is in 2's complement form. It may be used to specify the type of an immediate operand in an instruction.
VF - Immediate floating-point vector: 4 floating-point values of 8 bits each, with a sign bit, a 3-bit exponent and a 4-bit mantissa.
IB - Index Buffer in memory contains vertex indices.
IA - Intel Architecture
ISA - Instruction Set Architecture describes the instructions supported by an EU.
ISC - Instruction State Cache, on-chip memory that holds recently used instruction and state variable values.
IZ - Intermediate Z, completion of the Z test at the front end of the Windower/Masker unit when certain conditions are met (no alpha, no pixel-shader computed Z values, etc.).
IDCT - Inverse Discrete Cosine Transform, the stage in the video decoding pipe between IQ and MC.
IQ - Inverse Quantization, a stage in the video decoding pipe between IS and IDCT.
LRCA - Logical Ring Context Area, used to store the contents of registers and the state information required for initiating and resuming communication between a software application and the hardware graphics pipeline via ring buffers.
LSB - Least Significant Bit is the bit with the lowest bit position within a group of bits, which could be a bit group, DWord, field, instruction, memory range, register, or structure. For example, bit 0 of a DWord.
MMIO - Memory-Mapped I/O, a method for performing I/O between the CPU/GPU and peripheral devices.
MSB - Most Significant Bit is the bit with the highest bit position.
MC - Motion Compensation. Part of the video decoding pipe.
MPEG - Moving Picture Experts Group
MVFS - A four-bit field selecting reference fields for the motion vectors of the current macroblock.
MRT - Multiple Render Targets are multiple independent surfaces that may be the target of a sequence of 3D or Media commands that use the same surface state.
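
The V immediate type described above (eight signed 4-bit elements packed into 32 bits) can be unpacked as in the following minimal Python sketch. It assumes element 0 sits in the lowest nibble, which the text above does not state; the function name `decode_v_immediate` is hypothetical.

```python
def decode_v_immediate(imm32):
    """Unpack eight signed 4-bit elements from a 32-bit V immediate.

    Assumption: element 0 occupies bits 3:0, element 7 occupies bits 31:28.
    """
    elements = []
    for i in range(8):
        nibble = (imm32 >> (4 * i)) & 0xF
        # Sign-extend the 4-bit two's-complement value to a Python int.
        elements.append(nibble - 16 if nibble & 0x8 else nibble)
    return elements
```

Each element therefore covers the range -8 to 7.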
NDC - Normalized Device Coordinates are Clip Space Coordinates that have been divided by the Clip Space "W" component.
OGL - OpenGL
PSP - Pipelined State Pointers are pointers to state blocks in memory that are passed down the pipeline.
PS - Pixel Shader is supplied by the application, translated by the jitter and dispatched to the EU by the windower (conceptually) once per pixel.
RC - Render Cache is the cache in which pixel color and depth information is written before being written to memory, and where prior pixel destination attributes are read in preparation for blending and Z test.
RT - Render Target, a destination surface in memory where render results are written.
RS - Resource Streamer is the functional unit of the GPE that examines the commands in the ring buffer in an attempt to pre-process certain long-latency items for the remainder of the graphics processing.
VFE - Video Front End is the first fixed function in the generic pipeline; performs fixed-function media operations.
TS - Thread Spawner is the second and last fixed-function stage of the media pipeline; it initiates new threads on behalf of generic/media processing.
SF - Shared Function is a function unit that is shared by EUs. EUs send messages to shared functions, which consume the data and may return results. The Sampler, Data Port and Extended Math unit are all shared functions.
SFID - Shared Function ID is a unique identifier used by kernels and shaders to target shared functions and to identify their returned messages.
SIMD - Single Instruction Multiple Data, a parallel processing architecture that exploits data parallelism at the instruction level. It can also be used to describe the instructions in such an architecture or the amount of data parallelism in a particular instruction (SIMD8, for example).
SR - State Registers are the read-only registers containing the state information of the current thread, including the EUID/TID, Dispatch Mask and SIP.
SV - State Variable is an individual state element that can be varied to change the way given primitives are rendered or media objects are processed. State variables persist only in memory and are cached as needed by rendering/processing operations, except for a small amount of non-pipelined state.
SF - Strips and Fans is the fixed-function unit whose main function is to decompose primitive topologies such as strips and fans into primitives or objects.
TD - Thread Dispatcher is the functional unit that arbitrates thread initiation requests from fixed-function units and instantiates the threads on EUs.
TID - Thread Identifier is the field within thread state register SR0 that identifies which thread slot on an EU a thread occupies. A thread can be uniquely identified by the EUID and TID.
UB - Unsigned Byte integer is a numerical data type of 8 bits.
UD - Unsigned Double word integer is a numerical data type of 32 bits. It may be used to specify the type of an operand in an instruction.
UW - Unsigned Word integer is a numerical data type of 16 bits. It may be used to specify the type of an operand in an instruction.
UE - URB Entry is a logical entity stored in the URB (such as a vertex), referenced via a URB Handle.
VLD - Variable Length Decode is the first stage of the video decoding pipe; it consists mainly of bit-wide operations.
VB - Vertex Buffer is a buffer in memory containing vertex attributes.
VUE - Vertex URB Entry is a URB entry that contains data for a vertex.
VC - Vertex Cache is a cache of Vertex URB Entry handles tagged with vertex indices.
VF - Vertex Fetcher is the first FF unit in the 3D pipeline, responsible for fetching vertex data from memory. Sometimes referred to as the vertex formatter.
VS - Vertex Shader is an API-supplied program that calculates vertex attributes. Also refers to the FF unit that dispatches threads to "shade" (calculate attributes for) vertices.
VertStride - Vertical Stride is the distance in element-sized units between 2 vertically adjacent elements of a region-based GRF access.
VFE - Video Front End is the first fixed function in the generic pipeline; performs fixed-function media operations.
VP - Viewport
WIZ - Windower IZ is the term for the Windower/Masker that encapsulates its early ("intermediate") depth test function.
WM - Windower/Masker is the fixed-function triangle/line rasterizer.
W - Word, a numerical data type of 16 bits. W represents a signed word integer.
DIP - Data Island Packet
ELD - EDID-Like Data
M/CTS -
BCS - Blitter Command Streamer
EIR - Error Identity Register contains the persistent values of hardware-detected error condition bits. Any bit set in this register will cause the master error bit in the ISR to be set.
CLS - Cache Line Size
MSAA - Multi-Sample Anti-Aliasing
VCE/MFX - Video Codec Engine is a fixed-function video decoder and encoder engine. It is also referred to as the multi-format codec (MFX) engine, as a unified fixed-function pipeline is implemented to support multiple video coding standards such as MPEG2, VC1 and AVC.
VCS - VCE Command Streamer unit (also referred to as BCS)
BSD - Bitstream Decoder unit
VDS - Video Dispatcher unit
VMC - Video Motion Compensation unit
VIP - Video Intra Prediction unit
VIT - Video Inverse Transform unit
VLF - Video Loop Filter unit
VFT - Video Forward Transform unit
BSC - Bitstream Encoder unit
CURBE - Constant URB Entry function
SKU -
BDW - Broadwell
VIN -
PTE - Page Table Entry
UC -
WB -
bpp - bits per pixel - guess
ppc - per pixel count - guess
LUT - Look-Up Table
DIB - Device Independent Bitmap is a surface containing "logical" pixel values that are converted (via LUTs) to physical colors.
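
The DIB entry above describes logical pixel values being converted to physical colors through a LUT. A minimal Python sketch of that lookup follows; the 4-entry palette and the name `dib_to_physical` are invented for illustration and do not come from any hardware documentation.

```python
# Hypothetical 4-entry palette: logical pixel value -> physical (R, G, B).
PALETTE = [
    (0, 0, 0),        # 0: black
    (255, 0, 0),      # 1: red
    (0, 255, 0),      # 2: green
    (255, 255, 255),  # 3: white
]


def dib_to_physical(logical_pixels, lut=PALETTE):
    """Convert logical pixel values to physical colors through the LUT."""
    return [lut[value] for value in logical_pixels]
```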
Geom/FF - 3D Geometry / Fixed Function (Geom/FF) block: 3D FF pipeline (CS, VF, VS, HS, TE, DS, GS), VFE, TSG (Global TS), TDG, URBM (URB Manager)
Media/FF - Media fixed-function assets:
VD - Video Decode Box
VE - Video Encode Box
WD - Wireless Display Box
GA - Global Assets block, the primary interface and memory stream gateway to the outside world, consisting of:
GTI - GT Interface
SVM - State Variable Manager
BLT
GAM - Graphics Arbiter
GWY - Gateway
IC - Instruction Cache
TDL - Local Thread Dispatcher
BC - Barycentric Calculator
PSD - Pixel Shader Dispatcher
HDC - Data Cluster
DAPRC - Dataport Render Cache
HZ - Hi-Z
IZ - Intermediate Z
SBE - Setup Backend
RCC -
MSC -
RCZ -
STC -

#1 Command streamer, context switching, memory access (including tiling), memory data formats, Graphics Processing Engine (3D, Media, the subsystem and their memory interface)
#2 3D and Media pipelines, fixed functions, commands processed by the pipelines, VLD (Media Fixed Function), threads with the Thread Spawner (TS); programmable kernels handle media functions such as IDCT, Motion Compensation and Motion Estimation.
#3 Display registers control the overlay and VGA.
#4 GEN Subsystem contains the programmable cores (EUs) and the Shared Functions, which are shared by multiple EUs to perform I/O or math operations. Shared functions include the Extended Math unit, Data Port, URB and the message gateway used by EU threads to signal each other. EUs use messages to send and receive data from the subsystem; the messages are described with the shared functions. Messages are part of the Instruction Set Architecture.

# Command Opcodes

#1 3DPRIMITIVE
DWORD 0
VU TS RQ PO NM LK JI HG FE DC BA 98 76 54 32 10
3h 3h 3h 0h -- -- -- -- RR RR RR
DWORD 1
DWORD 2
DWORD 3
DWORD 4
DWORD 5
DWORD 6

#2 3DSTATE_AA_LINEPARAMETERS is used to specify the slope and bias terms used in the improved alpha coverage computation (specifically for DX WHQL compliance). Note that in these devices the coverage values passed to PS threads are full U0.8 values, versus devices where U0.4 values are passed.
DWORD 0
DWORD 1
DWORD 2

#3 3DSTATE_BINDING_TABLE_EDIT_DS
DWORD 0
DWORD 1
DWORD 2

#4 3DSTATE_BINDING_TABLE_EDIT_HS
DWORD 0
DWORD 1
DWORD 2..N

#5 3DSTATE_BINDING_TABLE_EDIT_PS
DWORD 0
DWORD 1
DWORD 2..N

#6 3DSTATE_BINDING_TABLE_EDIT_VS
DWORD 0
DWORD 1
DWORD 2..N

#7 3DSTATE_BINDING_TABLE_POINTERS_DS command is used to define the location of fixed functions' BINDING_TABLE_STATE. Only some of the fixed functions utilize binding tables.
DWORD 0
DWORD 1
DWORD 2

#8 3DSTATE_BINDING_TABLE_POINTERS_GS command is used to define the location of fixed functions' BINDING_TABLE_STATE. Only some of the fixed functions utilize binding tables.

#9 3DSTATE_BINDING_TABLE_POINTERS_HS command is used to define the location of fixed functions' BINDING_TABLE_STATE. Only some of the fixed functions utilize binding tables.

#10 3DSTATE_BINDING_TABLE_POINTERS_PS

#11 3DSTATE_BINDING_TABLE_POINTERS_VS

#12 3DSTATE_BINDING_TABLE_POOL_ALLOC sets up the binding table pool for HW-generated binding tables. When RS is enabled by an MI_RS_CONTROL or by an MI_BATCH_BUFFER_START with the RS enable bit set, the driver must reprogram 3DSTATE_BINDING_TABLE_POOL_ALLOC to ensure that the resource streamer and the render engine are in sync with the programming of the command. This field specifies the 4GB-aligned base address of the gfx 4GB virtual address space within the host's 64-bit virtual address space.

#13 3DSTATE_BLEND_STATE_POINTERS command is used to set up the pointers to the color calculator state.
When the BLEND_STATE

./images/VertexMemory.png
./images/CommNStream.png
./images/SliceCommon.png

| Nvidia | 2060 | 2070 | 2080 | Radeon 7 |
|---|---|---|---|---|
| GPU Arch | Turing | Turing | Turing | |
| RTX-OPS | 37T | | | |
| CUDA Cores | 1920 | 2304 | 2944 | |
| Giga Rays/s | 5 | | | |
| BoostClock (MHz) | 1680 | 1710 | 1800 | |
| BaseClock (MHz) | 1365 | 1410 | 1515 | |
| MemSpeed (Gbps) | 14 | 14 | 14 | |
| Mem Config | 6GB GDDR6 | 8GB GDDR6 | 8GB GDDR6 | |
| MemInfWidth | 192-bit | 256-bit | 256-bit | |
| MemBW | 336GB/s | 448GB/s | 448GB/s | |
| RTRT | Y | Y | Y | |
| GeForce | Y | Y | Y | |
| Ansel | Y | Y | Y | |
| Highlights | Y | Y | Y | |
| G-SYNC | Y | Y | Y | |
| GR drivers | Y | Y | Y | |
| DP 1.4 | Y | Y | Y | |
| HDCP 2.2 | Y | Y | Y | |
| GPU Boost | 4 | 4 | 4 | |
| VR Ready | Y | Y | Y | |
| USB-C VL | Y | Y | Y | |
| NVENC | Y (Turing) | Y (Turing) | Y (Turing) | |
| MaxRes | 7680x4320 | 7680x4320 | 7680x4320 | |
| DispConnectors | DP, HDMI, USB-C, DVI | DP, HDMI, USB-C | DP, HDMI, USB-C | |
| Max monitor | 4 | 4 | 4 | |
| HDCP | 2.2 | 2.2 | 2.2 | |
| Height (cm) | 11.26 | 11.26 | 11.57 | |
| Length (cm) | 22.86 | 22.86 | 26.67 | |
| Width | 2-Slot | 2-Slot | 2-Slot | |
| Max Temp (C) | 88 | 89 | 88 | |
| Power (W) | 160 | 175 | 225 | |
| SysPow (W) | 500 | 550 | 650 | |
| PowConns | 8-pin | 8-pin | 6+8-pin | |

SLI - Scalable Link Interface

# VideoCore IV

ARB - Architecture Review Board

# VPU

./images/VPU.png

# VideoCore IV 3D System Block Diagram

[ASCII block diagram lost in extraction. Blocks named in it: AXI ARB, Control List Executer [CLE], VPM DMA Writer [VDW], Vertex Cache Manager & DMA [VCM and VCD], Primitive Tile Binner [PTB], Primitive Setup Engine [PSE], Front End Pipe [FEP] (Rasteriser, Early-Z, Z, W interp, 1/W), Coverage Accumulate Pipe [CAP], Scoreboard, QPU Scheduler [QPS], Vertex Pipe Memory [VPM], L2 Cache [L2C], and Slices 0 to 3. Each slice holds an Interpolator [VRI] with Coeffs Mem, a Uniforms Cache [QUC], an Icache [QIC], a Texture & Memory Lookup Unit [TMU], a Special Function Unit [SFU] and a Quad Processor with QPU 0,0 to QPU 0,3.]

./images/QPU.png

1. Highly uniform 64-bit instruction set
2. 4-way physical x 4-way successive-clock multiplexed parallelism
3. 2 hardware threads
4. 2 large single-ported register files
5. I/O mapped into register space
6. Instruction- and register-level coupling to 3D hardware.

DRM
---

DRM ioctls:
1. vblank event handling
2. memory management
3. output management
4. framebuffer management
5. command submission & fencing
6. suspend/resume support
7. DMA services

Branch prediction is crucial in most contemporary processors.

SM - Streaming Multiprocessor
SP - Streaming Processor (a scalar processor)
warp - collection of threads that are guaranteed to execute in parallel. An SM is not necessarily considered a warp; an SM may contain multiple warps into which the SPs are divided.
block - a multiple of warps

# Operation Units

1. Floating-point, integer - add, multiply, multiply-add, minimum, maximum, compare, set predicate and conversion between integer and floating-point numbers
2. Transcendental functions - cosine, sine, binary exponential, binary logarithm, reciprocal and reciprocal square root
3. Bitwise operators - shift left, shift right, logic operators, and move
4. Control flow - branch, call, return, trap and barrier synchronization

Each SM has
1. A large vector register file, divided logically across the SIMD lanes, i.e. SPs
2. Several caches: shared memory, constant, texture, L1, etc.
3. Warp schedulers
4. Scalar processors
5. Special Function Units (SFUs) for single-precision floating-point transcendental functions

Fermi GTX 480 GPU
- 16 SMs of 32 SPs each, 512 CUDA cores in total (this is the full Fermi die; the retail GTX 480 has 15 SMs enabled, i.e. 480 cores)
- each SP or SIMD thread/lane has access to
  - 64 32-bit registers of the register file when operands are integers
  - 32 64-bit registers of the register file when operands are double-precision floats
- each SM has 16 load/store units
- each lane has 2048 registers
- each SM has 4 SFUs; each SP has one FP ALU and one integer ALU
- ALUs also support Boolean, shift, move, compare, convert, bit-field extract

Memory Hierarchy
- Local memory for per-thread, private, temporary data (in external DRAM)
- Shared memory, shared by threads on the same SM
- Global memory, shared by all threads, implemented in external DRAM
- L1 cache + shared memory is private to each SM, along with read-only texture and constant caches
- L2 is unified for all SMs, with 6 high-bandwidth DRAM channels
- compared to a CPU, a GPU has a larger register file and smaller L1/L2 caches with higher bandwidth

PTX instructions - Parallel Thread Execution instructions are the GPU instructions.

## Device tree example

```
tcon0: lcd-controller@1c0c000 {
    compatible = "allwinner,sun50i-a64-tcon-lcd", "allwinner,sun8i-a83t-tcon-lcd";
    [...]
    ports {
        [...]
        tcon0_out: port@1 {
            reg = <1>;
            [...]
            tcon0_out_dsi: endpoint@1 {
                reg = <1>;
                remote-endpoint = <&dsi_in_tcon0>;
                allwinner,tcon-channel = <1>;
            };
        };
    };
};

dsi: dsi@1ca0000 {
    compatible = "allwinner,sun50i-a64-mipi-dsi";
    [...]
    port {
        dsi_in_tcon0: endpoint {
            remote-endpoint = <&tcon0_out_dsi>;
        };
    };

    panel@0 {
        compatible = "xingbangda,xbd599";
        reg = <0>;
        [...]
    };
};
```

The example above shows the tcon0 LCD controller's output endpoint at port@1 (`tcon0_out_dsi`) linked to the DSI input endpoint (`dsi_in_tcon0`), and the DSI port's endpoint pointing back at the tcon0 output endpoint through `remote-endpoint`. The panel is nested as a child node of the DSI host.
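
The `remote-endpoint` links in this example form a small graph that a driver walks to find its neighbour in the pipeline. The following is a toy Python model of that lookup, not kernel code: the dictionary layout and the function names `remote_node` and `is_bidirectional` are invented for illustration, and only the endpoint labels come from the example.

```python
# Toy model of the endpoint graph: each endpoint label maps to the node
# that owns it and to the label named by its remote-endpoint property.
ENDPOINTS = {
    "tcon0_out_dsi": {"node": "tcon0", "remote": "dsi_in_tcon0"},
    "dsi_in_tcon0": {"node": "dsi", "remote": "tcon0_out_dsi"},
}


def remote_node(endpoint_label, endpoints=ENDPOINTS):
    """Return the node that owns the endpoint on the other side of the link."""
    remote_label = endpoints[endpoint_label]["remote"]
    return endpoints[remote_label]["node"]


def is_bidirectional(endpoint_label, endpoints=ENDPOINTS):
    """Check that the remote endpoint points back at this endpoint."""
    remote_label = endpoints[endpoint_label]["remote"]
    return endpoints[remote_label]["remote"] == endpoint_label
```

Starting from `tcon0_out_dsi`, the lookup lands on the `dsi` node, and the reverse lookup from `dsi_in_tcon0` lands on `tcon0`, which is the pairing the two `remote-endpoint` properties are meant to express.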