# Requirements Specification

This document contains the Requirements Specification for the Libre RISC-V
micro-architectural design. It shall meet the target of 5-6 32-bit GFLOPs,
150 M-Pixels/sec, 30 Million Triangles/sec, and minimum video decode
capability of 720p @ 30fps to a 1920x1080 framebuffer, in under 2.5 watts
at an 800MHz clock rate. Exceeding this target is acceptable if the
power budget is not exceeded. Exceeding this target "just because we can"
is also acceptable, as long as it does not disrupt meeting the minimum
performance and power requirements.
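
As a rough sanity check, the headline targets can be restated per clock
cycle. The sketch below is illustrative arithmetic only: the helper name
`per_cycle` is hypothetical, and it assumes that a GFLOP means 10^9
floating-point operations per second, with no position taken on how a
fused multiply-add would be counted.

```python
# Per-clock budget implied by the headline targets (illustrative only).
CLOCK_HZ = 800e6  # 800MHz clock rate, from the specification


def per_cycle(rate_per_second, clock_hz=CLOCK_HZ):
    """Convert a per-second rate into a per-clock-cycle figure."""
    return rate_per_second / clock_hz


flops_low = per_cycle(5e9)     # lower GFLOPs target, FP32 operations per cycle
flops_high = per_cycle(6e9)    # upper GFLOPs target, FP32 operations per cycle
pixels = per_cycle(150e6)      # pixels per cycle
triangles = per_cycle(30e6)    # triangles per cycle

print(flops_low, flops_high, pixels, triangles)
```

On these assumptions the design has to sustain between 6.25 and 7.5 FP32
operations per clock, while the pixel and triangle rates are well below
one per clock.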

# General Architectural Design Principle

The design is based on an augmented and enhanced variant
of the original CDC 6600 scoreboard system. It is not well-known that
the 6600 includes operand forwarding and register renaming. Precise
exceptions, precise in-order commit, branch speculation, "nameless"
registers (results detected that need not be written because they have
been overwritten by another instruction), predication and vectorisation
will all be added by overloading write hazards.

An overview of the design is as follows:

* 3D and Video primitives (operations) will only be added as strictly
necessary to achieve the minimum power and performance target.
* Identified so far is a 4xFP32 ARGB Quad to 1xINT32 ARGB pixel
conversion opcode (part of the Vulkan API). It will write directly
to a separate "tile buffer" (SRAM), not to the integer register
file. The instruction will be scalar and will be inherently and
automatically parallelised by SV, just like all other scalar opcodes.
* xBitManip opcodes will be required to deal with VPU workloads.
* The register files will be stratified into 4-way 2R1W banks,
with *separate* and distinct byte-level write-enable lines on all four
bytes of all four banks.
* 6600-style scoreboards will be augmented with "shadow" wires
and write hazard capability on exceptions, branch speculation,
LD/ST and predication.
* Each "shadow" capability of each type will be provided by a separate
Function Unit. For example, if it is to be possible to roll
ahead through two speculative branches, then two **separate**
Branch-speculative Function Units will be required: each will
hold its own separate and distinct "shadow" (Go-Die wire) and
write-hazard over instructions on which the branch depends.
* Likewise for predication, which shall place a "hold" on
the Function Units that depend on it until the register used
as a predicate mask has been read and decoded: there will be
separate Function Units waiting for each predication mask register.
Bits in the mask that are "zero" will result in "Go-Die" signals being
sent to the Function Units previously (speculatively) allocated for that
(now cancelled) element operation. Bits that are "1" will cancel
their Write-Hazard and allow the Function Unit to proceed with that
element's operation.
* The 6600 "Q-Table", which records, for each register, the last Function
Unit (in instruction issue order) that is to write its result to that
register, shall be augmented with "history" capability that
assists in "rollback" of "nameless" registers, should an exception
or interrupt occur. "History" is simply a (short) queue (stack)
that preserves, in instruction-issue order, a record of the previous
Function Unit(s) that targeted each register as a destination.
* Function Units will have both src and destination Reservation
Stations (latches) in order to buffer incoming and outgoing data.
This is to make best use of (limited) inter-Function-Unit bus bandwidth.
* Crossbar Routing from the Register File will be on the **source**
registers **only**: Function Units will route **directly** to
and be hard-wired to one of four register banks.
* Additional "Operand Forwarding" crossbar(s) will be added that
**bypass** the register file entirely, to be used exclusively
for registers that have specifically been identified as "nameless".
* Function Units will be the *front-end* to **shared** pipelined
concurrent ALUs. The input src registers will come from the
latches associated with the Function Unit, and the ALU will put the
result **back** into the destination latch associated with that
**same** Function Unit.
* **Pairs** of 32-bit Function Units will handle 64-bit operations,
with the 32-bit src Reservation Stations (latches) "teaming up"
to store 64-bit src register values, and likewise the 32-bit
destination latches for the same (paired) Function Units.
* 32-bit Function Units will handle 8 and 16 bit operations in
cases where batches of operations may be (easily, conveniently)
allocated to a 32-bit-wide SIMD-style (predicated) ALU.
* Additional 8-bit Function Units (in groups of 4) will handle
8-bit operations as well as pair up to handle 16-bit operations
in cases where neither 8 nor 16 bit operations can be (conveniently,
easily) allocated to parallel (SIMD-like) ALUs. This is to handle
corner-cases and to avoid jamming up the 32-bit Function Units with
single-byte operations (resulting in only 25% utilisation).
* Allocation of an operation to a 32-bit ALU will block the
corresponding 8/16-bit Function Unit(s) for that register, and vice-versa.
8/16-bit operations will however **not** block the remaining
(unallocated) bytes of the same register from being utilised.
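
The predication behaviour in the overview above can be sketched in a few
lines of Python. This is a behavioural illustration under stated
assumptions, not the hardware design: the function name
`apply_predicate_mask`, the Function Unit names and the convention that
bit i of the mask governs element i are all hypothetical.

```python
def apply_predicate_mask(mask, allocated_fus):
    """Split speculatively allocated Function Units by predicate bit.

    mask          -- integer predicate mask; bit i governs element i (assumed)
    allocated_fus -- list of FU names; index i holds the FU for element i
    Returns (go_die, proceed): the FUs to cancel, and the FUs whose
    write hazard is dropped so that they may proceed.
    """
    go_die, proceed = [], []
    for element, fu in enumerate(allocated_fus):
        if (mask >> element) & 1:
            proceed.append(fu)   # bit is 1: cancel write hazard, FU proceeds
        else:
            go_die.append(fu)    # bit is 0: Go-Die, element operation cancelled
    return go_die, proceed


go_die, proceed = apply_predicate_mask(0b0101, ["FU0", "FU1", "FU2", "FU3"])
print("Go-Die:", go_die, "proceed:", proceed)
```

With a mask of `0b0101`, elements 0 and 2 proceed and elements 1 and 3
receive Go-Die.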

# Register File

There shall be two 127-entry 64-bit register files: one for floating-point,
the other for integer operations. Each shall have byte-level write-enable
lines, and shall be divided into 4-way 2R1W banks that are split into
odd-even register numbers and further split into hi-32 and lo-32 bits.

In this way, 2 simultaneous 64-bit operations may write to the register
file (as long as the destinations have odd and even numbers), or 4
simultaneous 32-bit operations likewise. Byte-level write-enable is
provided so that writes may be performed down to the 16-bit and 8-bit
level without requiring additional reads.
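
A small model may help make the banking concrete. The sketch below maps
one register write onto banks and byte-enable lines; the bank numbering
(`2 * (reg % 2) + half`), the alignment rule and the function names are
assumptions made for illustration, as the specification only states that
the split is odd-even by register number and hi-32/lo-32 within a register.

```python
def write_enables(reg, byte_offset, width_bytes):
    """Return {bank: 4-bit byte-enable mask} for one register write.

    Bank numbering is hypothetical: bank = 2 * (reg % 2) + half,
    where half is 0 for the lo-32 bits and 1 for the hi-32 bits.
    """
    if width_bytes not in (1, 2, 4, 8) or byte_offset % width_bytes:
        raise ValueError("unsupported or misaligned write")
    if byte_offset + width_bytes > 8:
        raise ValueError("write exceeds the 64-bit register")
    enables = {}
    for byte in range(byte_offset, byte_offset + width_bytes):
        half, lane = divmod(byte, 4)      # which 32-bit half, which byte in it
        bank = 2 * (reg % 2) + half
        enables[bank] = enables.get(bank, 0) | (1 << lane)
    return enables


def banks_conflict(write_a, write_b):
    """Two writes conflict if they need the single write port of the same bank."""
    return bool(set(write_a) & set(write_b))


print(write_enables(5, 0, 8))   # 64-bit write to an odd register: two banks
print(write_enables(4, 2, 1))   # single-byte write: one enable line in one bank
```

In this model, 64-bit writes to r4 and r5 touch disjoint banks and may
proceed together, whereas 64-bit writes to r4 and r6 contend for the same
two banks.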

Additionally, if a read is requested for a register that is currently
being written, the written value shall be "passed through" on the same
cycle, such that the register file may effectively be used as an
"Operand Forwarding" Channel.
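
The pass-through requirement can be illustrated with a cycle-level
behavioural model. This is a sketch only: the class name
`PassThroughRegFile` and its `cycle` method are hypothetical, a single
write port is modelled, and banking and byte-enables are left out.

```python
class PassThroughRegFile:
    """Behavioural model of a register file with same-cycle write pass-through."""

    def __init__(self, nregs):
        self.regs = [0] * nregs

    def cycle(self, write=None, reads=()):
        """Model one clock cycle.

        write -- (reg, value) tuple, or None if nothing is written this cycle
        reads -- register numbers to read this cycle
        Returns the read values; a register being written in this same
        cycle yields the new value rather than the stored one.
        """
        values = []
        for reg in reads:
            if write is not None and write[0] == reg:
                values.append(write[1])        # passed through on the same cycle
            else:
                values.append(self.regs[reg])  # normal read of stored contents
        if write is not None:
            self.regs[write[0]] = write[1]     # the write lands at end of cycle
        return values


rf = PassThroughRegFile(8)
same_cycle = rf.cycle(write=(3, 42), reads=(3, 4))
next_cycle = rf.cycle(reads=(3,))
print(same_cycle, next_cycle)
```

The reader of r3 sees the value 42 in the same cycle that it is written,
without waiting for a separate forwarding bus.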

# Function Units

# 6600 Scoreboards

6600 Scoreboards are usually viewed as incomplete: the inability to
perform register renaming and the lack of precise exceptions are two of
the perceived flaws. These flaws do not exist; however, explaining why
takes some effort.

## Q-Table (FU to Register Lookup)

The Q-Table is a lookup table that records the last Function Unit
that, in instruction issue order, is to write to any given
register. The original 6600 stored this in binary form; a unary
bit-wise form (N Function Unit bits by M register bits) can be
recommended instead.
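
One possible reading of the unary form is a one-hot row of N Function
Unit bits for each of the M registers. The sketch below follows that
reading; the class `UnaryQTable` and its method names are hypothetical,
and Python integers stand in for the rows of bits.

```python
class UnaryQTable:
    """Q-Table in unary form: one row of num_fus bits per register."""

    def __init__(self, num_fus, num_regs):
        self.num_fus = num_fus
        self.rows = [0] * num_regs   # rows[reg] has at most one bit set

    def issue(self, fu, dest_reg):
        """Record fu as the last writer (in issue order) of dest_reg."""
        self.rows[dest_reg] = 1 << fu    # replaces any earlier writer

    def is_last_writer(self, fu, reg):
        """True if fu is still the last Function Unit due to write reg."""
        return bool((self.rows[reg] >> fu) & 1)

    def release(self, fu, reg):
        """Clear fu's bit for reg; a no-op if fu is no longer the last writer."""
        self.rows[reg] &= ~(1 << fu)


q = UnaryQTable(num_fus=4, num_regs=8)
q.issue(1, 5)   # FU1 is to write r5
q.issue(2, 5)   # FU2 is issued later and also targets r5
print(bin(q.rows[5]), q.is_last_writer(1, 5), q.is_last_writer(2, 5))
```

A practical attraction of the unary form is that "is this FU the last
writer?" becomes a single-bit test rather than a binary comparison.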

However, to support "nameless" registers, the Q-Table shall support
*multiple* (historical) entries, recording the history of the
*previous* Function Units that were to write to each register.
When historic entries exist (non-empty), the following shall occur:

* All Function Units with historic entries shall **not** commit
their values to the register file, even if they are free to do so.
* All Function Units with historic entries shall hold a "write hazard"
against their dependencies that are waiting for that "nameless" result.
* When a dependent Function Unit has cleared all possibility of an
Exception being raised, it shall **drop** the write hazard on the
"nameless" source.
* If a "nameless" Function Unit needs to generate an Exception, it
does so in the standard way (see "Exceptions"), **however**,
in doing so it will also result in a **roll back** of the Q-Table for
**all and any** cancelled Function Units, to *previous* (historic)
Q-Table values for the relevant destination registers. Once
rolled back, the Function Unit must store its result in the register
file, prior to permitting the Exception to proceed.
* Likewise, if a dependent Function Unit has to generate an exception,
and its source Function Units are "nameless", the "nameless"
Function Units must also "roll back", store their results, and
finally permit the Exception to trigger.
* Likewise, all other "nameless" results must also be "rolled back",
except that, unlike the Function Units triggering the exception, they may
roll back to the newest "nameless" historical Q-Table entry
(if they have not already been cancelled by the FU triggering the
exception).
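
The history and roll-back rules above can be approximated with a
per-register record of writers. The following is a deliberately
simplified sketch: the class `QTableWithHistory` and its methods are
hypothetical, the history depth is unbounded here although the
specification calls for a short one, and commit, write hazards and
in-order exception handling are not modelled.

```python
class QTableWithHistory:
    """Q-Table whose entries keep earlier writers for possible roll-back."""

    def __init__(self, num_regs):
        # writers[reg] is in issue order; the last item is the current
        # Q-Table value and everything before it is "history".
        self.writers = [[] for _ in range(num_regs)]

    def issue(self, fu, dest_reg):
        self.writers[dest_reg].append(fu)

    def current(self, reg):
        return self.writers[reg][-1] if self.writers[reg] else None

    def is_nameless(self, fu, reg):
        """True if fu's result for reg has been superseded by a later writer."""
        return fu in self.writers[reg][:-1]

    def rollback(self, cancelled_fus):
        """Drop cancelled FUs from every register's record.

        Returns (reg, fu) pairs for FUs that have become the current
        writer again and so must now store their result.
        """
        restored = []
        for reg, record in enumerate(self.writers):
            before = self.current(reg)
            record[:] = [fu for fu in record if fu not in cancelled_fus]
            after = self.current(reg)
            if after is not None and after != before:
                restored.append((reg, after))
        return restored


q = QTableWithHistory(num_regs=8)
q.issue("FU0", 3)                    # FU0 is to write r3
q.issue("FU1", 3)                    # FU1 overwrites r3: FU0 becomes "nameless"
was_nameless = q.is_nameless("FU0", 3)
restored = q.rollback({"FU1"})       # FU1 is cancelled by an exception
print(was_nameless, restored, q.current(3))
```

After the roll-back, FU0 is once again the recorded writer of r3 and is no
longer "nameless", so its result has to reach the register file.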

Bear in mind that exceptions (like all operations that are ready to
commit) may only occur in-order (following a FU-to-FU "link" bit),
and may only occur if the Function Unit is entirely free of write hazards.

## FU-to-FU Dependency Matrix

The Function-Unit to Function-Unit Dependency Matrix expresses the
read and write hazards - dependencies - between Function Units.