add conversation notes
[libreriscv.git] / 3d_gpu / microarchitecture.mdwn
1 # High-level architectural Requirements
2
3 * SMP Cache coherency (TileLink?)
4 * Minumum 800mhz
5 * Minimum 2-core SMP, more likely 4-core uniform design,
6 each core with full 4-wide SIMD-style predicated ALUs
7 * 6GFLOPS single-precision FP
8 * 128 64-bit FP and 128 64-bit INT register files
9 * RV64GC compliance for running full GNU/Linux-based OS
10 * SimpleV compliance
11 * xBitManip (required for VPU and ideal for predication)
12 * 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
13 Multi-R x Multi-W for registers 1-31.
14 TODO: consider 2R for registers to be used as predication targets
15 if >= 32.
16 * Idea: generic implementation of ports on register file so as to be able
17 to experiment with different arrangements.
18 * Potentially: Lane-swapping / crossing / data-multiplexing
19 bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D)
20 * Potentially: Registers subdivided into 16-bit, to match
21 elwidth down to 16-bit (for FP16). 8-bit elwidth only
22 goes down as far as twin-SIMD (with predication). This
23 requires registers to have extra hidden bits: register
24 x30 is now "x30:0+x30.1+x30.2+x30.3". have to discuss.
25
26 # Conversation Notes
27
28 ----
29
30 'm thinking about using tilelink (or something similar) internally as
31 having a cache-coherent protocol is required for implementing Vulkan
32 (unless you want to turn off the cache for the GPU memory, which I
33 don't think is a good idea), axi is not a cache-coherent protocol,
34 and tilelink already has atomic rmw operations built into the protocol.
35 We can use an axi to tilelink bridge to interface with the memory.
36
37 I'm thinking we will want to have a dual-core GPU since a single
38 core with 4xSIMD is too slow to achieve 6GFLOPS with a reasonable
39 clock speed. Additionally, that allows us to use an 800MHz core clock
40 instead of the 1.6GHz we would otherwise need, allowing us to lower the
41 core voltage and save power, since the power used is proportional to
42 F\*V^2. (just guessing on clock speeds.)
43
44 ----
45
46 I don't know about power, however I have done some research and a 4Kbyte
47 (or 16, icr) SRAM (what I was thinking of for a tile buffer) takes in the
48 ballpark of 1000 um^2 in 28nm.
49 Using a 4xFMA with a banked register file where the bank is selected by the
50 lower order register number means we could probably get away with 1Rx1W
51 SRAM as the backing memory for the register file, similarly to Hwacha. I
52 would suggest 8 banks allowing us to do more in parallel since we could run
53 other units in parallel with a 4xFMA. 8 banks would also allow us to clock
54 gate the SRAM banks that are not in use for the current clock cycle
55 allowing us to save more power. Note that the 4xFMA could be 4 separately
56 allocated FMA units, it doesn't have to be SIMD style. If we have enough hw
57 parallelism, we can under-volt and under-clock the GPU cores allowing for a
58 more efficient GPU. If we are using the GPU cores as CPU cores as well, I
59 think it would be important to be able to use a faster clock speed when not
60 using the extended registers (similar to how Intel processors use a lower
61 clock rate when AVX512 is in use) so that scalar code is not slowed down
62 too much.
63
64 > > Using a 4xFMA with a banked register file where the bank is selected by
65 > the
66 > > lower order register number means we could probably get away with 1Rx1W
67 > > SRAM as the backing memory for the register file, similarly to Hwacha.
68 >
69 > okaaay.... sooo... we make an assumption that the top higher "banks"
70 > are pretty much always going to be "vectorised", such that, actually,
71 > they genuinely don't need to be 6R-4W (or whatever).
72 >
73 Yeah pretty much, though I had meant the bank number comes from the
74 least-significant bits of the 7-bit register number.
75
76 ----
77
78 Assuming 64-bit operands:
79 If you could organize 2 SRAM macros and use the pair of them to
80 read/write 4 registers at a time (256-bits). The pipeline will allow you to
81 dedicate 3 cycles for reading and 1 cycle for writing (4 registers each).
82
83 <pre>
84 RS1 = Read of operand S1
85 WRd = Write of result Dst
86 FMx = Floating Point Multiplier, x = stage.
87
88 |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
89 |FWD|FM1|FM2|FM3|FM4|
90 |FWD|FM1|FM2|FM3|FM4|
91 |FWD|FM1|FM2|FM3|FM4|WRd|
92 |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
93 |FWD|FM1|FM2|FM3|FM4|
94 |FWD|FM1|FM2|FM3|FM4|
95 |FWD|FM1|FM2|FM3|FM4|WRd|
96 |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
97 |FWD|FM1|FM2|FM3|FM4|
98 |FWD|FM1|FM2|FM3|FM4|
99 |FWD|FM1|FM2|FM3|FM4|WRd|
100 </pre>
101
102 The only trick is getting the read and write dedicated on different clocks.
103 When the RS3 operand is not needed (60% of the time) you can use
104 the time slot for reading or writing on behalf of memory refs; STs read,
105 LDs write.
106
107 You will find doing VRFs a lot more compact this way. In GPU land we
108 called the flip-flops orchestrating the timing "collectors".
109
110 ----
111
112 Justification for Branch Prediction
113
114 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>
115
116 We can combine several branch predictors to make a decent predictor:
117 call/return predictor -- important as it can predict calls and returns
118 with around 99.8% accuracy loop predictor -- basically counts loop
119 iterations some kind of global predictor -- handles everything else
120
121 We will also want a btb, a smaller one will work, it reduces average
122 branch cycle count from 2-3 to 1 since it predicts which instructions
123 are taken branches while the instructions are still being fetched,
124 allowing the fetch to go to the target address on the next clock rather
125 than having to wait for the fetched instructions to be decoded.
126
127 ----
128
129 > https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures
130
131 well, there is this concept:
132 https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf
133
134 it is a 2-level hierarchy for register cacheing. honestly, though, the
135 reservation stations of the tomasulo algorithm are similar to a cache,
136 although only of the intermediate results, not of the initial operands.
137
138 i have a feeling we should investigate putting a 2-level register cache
139 in front of a multiplexed SRAM.
140
141 ----
142
143 For GPU workloads FP64 is not common so I think having 1 FP64 alu would
144 be sufficient. Since indexed loads and stores are not supported, it will
145 be important to support 4x64 integer operations to generate addresses
146 for loads/stores.
147
148 I was thinking we would use scoreboarding to keep track of operations
149 and dependencies since it doesn't need a cam per alu. We should be able
150 to design it to forward past the register file to allow for 0-latency
151 forwarding. If we combined that with register renaming it should prevent
152 most war and waw data hazards.
153
154 I think branch prediction will be essential if only to fetch and decode
155 operations since it will reduce the branch penalty substantially.
156
157 Note that even if we have a zero-overhead loop extension, branch
158 prediction will still be useful as we will want to be able to run code
159 like compilers and standard RV code with decent performance. Additionally,
160 quite a few shaders have branching in their internal loops so
161 zero-overhead loops won't be able to fix all the branching problems.
162
163 ----
164
165 > you would need a 4-wide cdb anyway, since that's the performance we're
166 > trying for.
167
168 if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
169 then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
170 per core, which means only a 2-wide CDB, a heck of a lot better than
171 4.
172
173 oh: i thought of another way to cut the power-impact of the Reorder
174 Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
175 length equal to the number of registers, 2 is because of 2-issue).
176
177 the CAM of a ROB is on the instruction destination register. key:
178 ROBnum, value: instr-dest-reg. if you have a bitfleid that says "this
179 destreg has no ROB tag", it's dead-easy to check that bitfield, first.
180
181 ----
182
183 Avoiding Memory Hazards
184
185 * WAR and WAR hazards through memory are eliminated with speculation
186 because actual updating of memory occurs in order, when a store is at
187 the head of the ROB, and hence, no earlier loads or stores can still
188 be pending
189 * RAW hazards are maintained by two restrictions:
190 1. not allowing a load to initiate the second step of its execution if
191 any active ROB entry occupied by a store has a destination
192 field that matches the value of the A field of the load and
193 2. maintaining the program order for the computation of an effective
194 address of a load with respect to all earlier stores
195 * These restrictions ensure that any load that access a memory location
196 written to by an earlier store cannot perform the memory access until
197 the store has written the data.
198
199 Advantages of Speculation, Load and Store hazards:
200
201 * A store updates memoryy only when it reached the head of the ROB
202 * WAW and WAR type of hazards are eliminated with speculation
203 (actual updating of memory occurs in order)
204 * RAW hazards through memory are maintained by not allowing a load
205 to initiate the second step of its execution
206 * Check if any store has a destination field that matched the
207 value of the load:
208 - SD F1 100(R2)
209 - LD F2 100(R2)
210
211 Exceptions
212
213 * Exceptions are handled by not recognising the exception until
214 instruction that caused it is ready to commit in ROB (reaches head
215 of ROB)
216
217 Reorder Buffer
218
219 * Results of an instruction become visible externally when it leaves
220 the ROB
221 - Registers updated
222 - Memory updated
223
224 Reorder Buffer Entry
225
226 * Instruction type
227 - branch (no destination resutl)
228 - store (has a memory address destination)
229 - register operation (ALU operation or load, which has reg dests)
230 * Destination
231 - register number (for loads and ALU ops) or
232 - memory address (for stores) where the result should be written
233 * Value
234 - value of instruction result, pending a commit
235 * Ready
236 - indicates that the instruction has completed execution: value is ready
237
238 # References
239
240 * <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
241 * <https://en.wikipedia.org/wiki/Reservation_station>
242 * <https://en.wikipedia.org/wiki/Register_renaming> points out that
243 reservation stations take a *lot* of power.
244 * <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
245 * MESI cache protocol, python <https://github.com/sunkarapk/mesi-cache.git>
246 <https://github.com/afwolfe/mesi-simulator>
247 * <https://kshitizdange.github.io/418CacheSim/final-report> report on
248 types of caches
249 * <https://github.com/ssc3?tab=repositories> interesting stuff
250 * <https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing>
251 pipeline bypassing
252 * <http://ece-research.unm.edu/jimp/611/slides/chap4_7.html> Tomasulo / Reorder
253 * Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
254 * Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
255 * <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
256 * <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
257 * <https://www.eda.ncsu.edu/wiki/FreePDK>