# High-level Architectural Requirements

* SMP Cache coherency (TileLink?)
* Minimum 800MHz
* Minimum 2-core SMP, more likely 4-core uniform design,
  each core with full 4-wide SIMD-style predicated ALUs
* 6GFLOPS single-precision FP
* 128 64-bit FP and 128 64-bit INT register files
* RV64GC compliance for running a full GNU/Linux-based OS
* SimpleV compliance
* xBitManip (required for VPU and ideal for predication)
* 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
  Multi-R x Multi-W for registers 1-31.
  TODO: consider 2R for registers to be used as predication targets
  if >= 32.
* Idea: generic implementation of ports on the register file so as to be able
  to experiment with different arrangements.
* Potentially: Lane-swapping / crossing / data-multiplexing
  bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D))
* Potentially: Registers subdivided into 16-bit, to match
  elwidth down to 16-bit (for FP16). 8-bit elwidth only
  goes down as far as twin-SIMD (with predication). This
  requires registers to have extra hidden bits: register
  x30 is now "x30.0+x30.1+x30.2+x30.3". This needs further discussion.

# Conversation Notes

----

I'm thinking about using TileLink (or something similar) internally, as
having a cache-coherent protocol is required for implementing Vulkan
(unless you want to turn off the cache for the GPU memory, which I
don't think is a good idea). AXI is not a cache-coherent protocol,
and TileLink already has atomic RMW operations built into the protocol.
We can use an AXI-to-TileLink bridge to interface with the memory.

I'm thinking we will want to have a dual-core GPU, since a single
core with 4xSIMD is too slow to achieve 6GFLOPS at a reasonable
clock speed. Additionally, that allows us to use an 800MHz core clock
instead of the 1.6GHz we would otherwise need, allowing us to lower the
core voltage and save power, since the power used is proportional to
F\*V^2. (Just guessing on clock speeds.)
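
A quick sanity check of that clock-speed arithmetic (a sketch; the
figure of 4 single-precision FLOPs per cycle per core is an assumption,
counting one op per SIMD lane per cycle):

<pre>
# Back-of-the-envelope check of the dual-core vs single-core trade-off.
# Assumption: each core sustains 4 single-precision FLOPs per cycle
# (4-wide SIMD; counting an FMA as 2 FLOPs would double these numbers).
TARGET_GFLOPS = 6.0
FLOPS_PER_CYCLE_PER_CORE = 4

def min_clock_ghz(cores):
    """Minimum clock (GHz) for `cores` cores to hit the target."""
    return TARGET_GFLOPS / (cores * FLOPS_PER_CYCLE_PER_CORE)

print(min_clock_ghz(1))  # 1.5  GHz -- roughly the 1.6GHz figure above
print(min_clock_ghz(2))  # 0.75 GHz -- comfortably within an 800MHz clock

# Dynamic power scales as F*V^2, and a lower F usually permits a lower V,
# so two slower cores beat one fast core on power for the same GFLOPS.
</pre>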

----

I don't know about power; however, I have done some research, and a 4Kbyte
(or 16, I can't recall) SRAM (what I was thinking of for a tile buffer) takes
in the ballpark of 1000 um^2 in 28nm.
Using a 4xFMA with a banked register file where the bank is selected by the
lower-order register number means we could probably get away with 1Rx1W
SRAM as the backing memory for the register file, similarly to Hwacha. I
would suggest 8 banks, allowing us to do more in parallel, since we could run
other units in parallel with a 4xFMA. 8 banks would also allow us to
clock-gate the SRAM banks that are not in use for the current clock cycle,
allowing us to save more power. Note that the 4xFMA could be 4 separately
allocated FMA units; it doesn't have to be SIMD-style. If we have enough
hardware parallelism, we can under-volt and under-clock the GPU cores,
allowing for a more efficient GPU. If we are using the GPU cores as CPU
cores as well, I think it would be important to be able to use a faster
clock speed when not using the extended registers (similar to how Intel
processors use a lower clock rate when AVX512 is in use) so that scalar
code is not slowed down too much.

> > Using a 4xFMA with a banked register file where the bank is selected
> > by the lower-order register number means we could probably get away
> > with 1Rx1W SRAM as the backing memory for the register file,
> > similarly to Hwacha.
>
> okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).

Yeah, pretty much, though I had meant that the bank number comes from the
least-significant bits of the 7-bit register number.

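A minimal sketch of that bank-selection scheme (the bank count of 8
matches the suggestion above; the conflict check is illustrative):

<pre>
NUM_BANKS = 8  # as suggested above; a power of two, so selection is a mask

def bank_of(regnum):
    """Bank = least-significant bits of the 7-bit register number."""
    return regnum & (NUM_BANKS - 1)

def conflict_free(regnums):
    """With 1Rx1W SRAM per bank, only one access per bank per cycle:
    a group of simultaneous accesses must all hit distinct banks."""
    banks = [bank_of(r) for r in regnums]
    return len(set(banks)) == len(banks)

# A 4xFMA reading a contiguous group of registers never conflicts:
assert conflict_free([64, 65, 66, 67])
# ...whereas two accesses 8 apart land in the same bank and serialise:
assert not conflict_free([64, 72])
</pre>
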
----

Assuming 64-bit operands: organise 2 SRAM macros and use the pair of them to
read/write 4 registers at a time (256 bits). The pipeline will then allow you
to dedicate 3 cycles to reading and 1 cycle to writing (4 registers each).

<pre>
RS1 = Read of operand S1
WRd = Write of result Dst
FMx = Floating Point Multiplier, x = stage.

|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
</pre>

The only trick is getting the reads and the write dedicated to different clocks.
When the RS3 operand is not needed (60% of the time) you can use
that time slot for reading or writing on behalf of memory refs: STs read,
LDs write.

You will find doing VRFs a lot more compact this way. In GPU land we
called the flip-flops orchestrating the timing "collectors".

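The bandwidth arithmetic behind that 3-read/1-write split (a sketch;
the port width and issue rate are as assumed in the note above):

<pre>
# Why 3 read cycles + 1 write cycle exactly fills a 4-cycle issue window.
# Assumption: the SRAM macro pair moves 4 x 64-bit registers (256 bits)
# per cycle, and one group of 4 FMAs issues every 4 cycles.
REGS_PER_CYCLE = 4    # 256-bit port / 64-bit registers
OPS_PER_GROUP = 4     # 4xFMA
READS_PER_OP = 3      # RS1, RS2, RS3
WRITES_PER_OP = 1     # Rd

reads = OPS_PER_GROUP * READS_PER_OP     # 12 registers -> 3 read cycles
writes = OPS_PER_GROUP * WRITES_PER_OP   # 4 registers  -> 1 write cycle
assert (reads + writes) / REGS_PER_CYCLE == 4  # the 4-cycle window, exactly

# When RS3 is unused (~60% of the time), its read slot frees up for
# memory refs: stores read the register file, loads write it.
</pre>
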
----

Justification for Branch Prediction

<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>

We can combine several branch predictors to make a decent predictor:

* call/return predictor -- important, as it can predict calls and returns
  with around 99.8% accuracy
* loop predictor -- basically counts loop iterations
* some kind of global predictor -- handles everything else

We will also want a BTB; a smaller one will work. It reduces the average
branch cycle count from 2-3 to 1, since it predicts which instructions
are taken branches while the instructions are still being fetched,
allowing the fetch to go to the target address on the next clock rather
than having to wait for the fetched instructions to be decoded.
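
A minimal sketch of such a BTB (direct-mapped; the table size and tag
scheme here are illustrative, not decided):

<pre>
# Tiny direct-mapped Branch Target Buffer: consulted during fetch, before
# decode, so a predicted-taken branch can redirect fetch on the next clock.
NUM_ENTRIES = 64  # illustrative; "a smaller one will work"

class BTB:
    def __init__(self):
        self.entries = [None] * NUM_ENTRIES  # each: (tag, target) or None

    def lookup(self, fetch_pc):
        """Predicted target if this PC hit as a taken branch before."""
        entry = self.entries[(fetch_pc >> 2) % NUM_ENTRIES]
        if entry is not None and entry[0] == fetch_pc:
            return entry[1]   # redirect fetch on the next clock
        return None           # no prediction: fall through to PC+4

    def update(self, branch_pc, target):
        """Called at branch-resolve time for taken branches."""
        self.entries[(branch_pc >> 2) % NUM_ENTRIES] = (branch_pc, target)
</pre>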

----

> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures

Well, there is this concept:
https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf

It is a 2-level hierarchy for register caching. Honestly, though, the
reservation stations of the Tomasulo algorithm are similar to a cache,
although only of the intermediate results, not of the initial operands.

I have a feeling we should investigate putting a 2-level register cache
in front of a multiplexed SRAM.

----

For GPU workloads FP64 is not common, so I think having 1 FP64 ALU would
be sufficient. Since indexed loads and stores are not supported, it will
be important to support 4x64 integer operations to generate addresses
for loads/stores.

I was thinking we would use scoreboarding to keep track of operations
and dependencies, since it doesn't need a CAM per ALU. We should be able
to design it to forward past the register file to allow for 0-latency
forwarding. If we combined that with register renaming it should prevent
most WAR and WAW data hazards.

I think branch prediction will be essential, if only for fetch and decode,
since it will reduce the branch penalty substantially.

Note that even if we have a zero-overhead loop extension, branch
prediction will still be useful, as we will want to be able to run code
like compilers and standard RV code with decent performance. Additionally,
quite a few shaders have branching in their internal loops, so
zero-overhead loops won't be able to fix all the branching problems.

----

> you would need a 4-wide CDB anyway, since that's the performance we're
> trying for.

If the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
per core, which means only a 2-wide CDB, a heck of a lot better than
4.

Oh: I thought of another way to cut the power-impact of the Reorder
Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
length equal to the number of registers; the 2 is because of 2-issue).

The CAM of a ROB is on the instruction destination register. Key:
ROBnum, value: instr-dest-reg. If you have a bitfield that says "this
destreg has no ROB tag", it's dead-easy to check that bitfield first.

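A sketch of that bitfield filter in front of the ROB CAM (the structure
names are illustrative; the point is that a 1-bit-per-register check
gates the expensive associative search):

<pre>
# Per-register "has a pending ROB tag" bit in front of the ROB CAM.
# The CAM maps ROB entry -> destination register; searching it is the
# power-hungry part, so skip it when the bit says there is nothing there.
NUM_REGS = 128

has_rob_tag = [False] * NUM_REGS   # the single-bit-per-register memory
rob_dest = {}                      # ROB entry number -> dest reg (the CAM)

def allocate(rob_entry, dest_reg):
    rob_dest[rob_entry] = dest_reg
    has_rob_tag[dest_reg] = True

def lookup_producer(reg):
    """Find a ROB entry that will produce `reg`, if any."""
    if not has_rob_tag[reg]:
        return None                # cheap bitfield check: no CAM search
    for entry, dest in rob_dest.items():   # the associative search
        if dest == reg:
            return entry
    return None
</pre>
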
----

Avoiding Memory Hazards

* WAW and WAR hazards through memory are eliminated with speculation,
  because the actual updating of memory occurs in order, when a store is at
  the head of the ROB; hence, no earlier loads or stores can still
  be pending
* RAW hazards are maintained by two restrictions:
  1. not allowing a load to initiate the second step of its execution if
     any active ROB entry occupied by a store has a destination
     field that matches the value of the A field of the load (see the
     sketch below), and
  2. maintaining the program order for the computation of an effective
     address of a load with respect to all earlier stores
* These restrictions ensure that any load that accesses a memory location
  written to by an earlier store cannot perform the memory access until
  the store has written the data.

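A minimal sketch of restriction 1, checking a load's address against
every active store in the ROB (the ROB scan and field names are
illustrative):

<pre>
# Restriction 1: a load may not read memory while any earlier, still-active
# store in the ROB has a matching (or not-yet-computed) address.
class RobStore:
    def __init__(self, address):
        self.address = address   # None until the store's EA is computed

def load_may_proceed(load_address, active_stores):
    for store in active_stores:
        if store.address is None:
            return False         # EA unknown: conservatively stall
        if store.address == load_address:
            return False         # RAW through memory: wait for the store
    return True

# SD F1, 100(R2) followed by LD F2, 100(R2): the load must wait.
assert not load_may_proceed(100, [RobStore(address=100)])
assert load_may_proceed(200, [RobStore(address=100)])
</pre>
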
Advantages of Speculation, Load and Store hazards:

* A store updates memory only when it reaches the head of the ROB
* WAW and WAR types of hazards are eliminated with speculation
  (actual updating of memory occurs in order)
* RAW hazards through memory are maintained by not allowing a load
  to initiate the second step of its execution
* Check if any store has a destination field that matches the
  value of the load:
  - SD F1 100(R2)
  - LD F2 100(R2)

Exceptions

* Exceptions are handled by not recognising the exception until the
  instruction that caused it is ready to commit in the ROB (reaches the
  head of the ROB)

Reorder Buffer

* Results of an instruction become visible externally when it leaves
  the ROB
  - Registers updated
  - Memory updated

Reorder Buffer Entry

* Instruction type
  - branch (no destination result)
  - store (has a memory address destination)
  - register operation (ALU operation or load, which has reg dests)
* Destination
  - register number (for loads and ALU ops) or
  - memory address (for stores) where the result should be written
* Value
  - value of instruction result, pending a commit
* Ready
  - indicates that the instruction has completed execution: value is ready

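Those four fields map directly onto a record type; a minimal sketch
(the field names simply follow the list above):

<pre>
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class InsnType(Enum):
    BRANCH = auto()   # no destination result
    STORE = auto()    # memory-address destination
    REG_OP = auto()   # ALU op or load: register destination

@dataclass
class RobEntry:
    insn_type: InsnType
    destination: Optional[int] = None  # reg number, or memory address for stores
    value: Optional[int] = None        # result, held here until commit
    ready: bool = False                # execution complete, value valid

# A load that has finished executing but not yet committed:
entry = RobEntry(InsnType.REG_OP, destination=2, value=0xdead, ready=True)
</pre>
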
----

Register Renaming resources

* <https://www.youtube.com/watch?v=p4SdrUhZrBM>
* <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
* ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>

Video @ 3:24, "RAT" table - Register Alias Table:

<img src="/3d_gpu/rat_table.png" />

This scheme looks very much like a Reservation Station.

----

There is another way to get precise ordering of the writes in a scoreboard.
First, one has to implement forwarding in the scoreboard.
Second, the function units need an output queue (of, say, 4 registers).
Now, one can launch an instruction and pick up its operands either
from the RF or from the function unit output, while the result sits
in the function unit waiting for its GO_Write signal.

Thus the launching of instructions is not delayed due to hazards,
but the results are delivered to the RF in program order.

This looks surprisingly like a 'belt' at the end of the function unit.

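A sketch of that output queue (GO_Write and the depth of 4 come from the
note above; the forwarding lookup is illustrative):

<pre>
from collections import deque

class FunctionUnit:
    """Results wait in a small queue at the FU output for GO_Write,
    so issue is not stalled yet the RF is written in program order.
    (A real design would stall issue when the queue is full.)"""
    def __init__(self, depth=4):
        self.outq = deque(maxlen=depth)   # (dest_reg, value), oldest first

    def complete(self, dest_reg, value):
        self.outq.append((dest_reg, value))

    def forward(self, src_reg):
        """Operand pickup from the FU output instead of the RF."""
        for dest, value in reversed(self.outq):   # newest result wins
            if dest == src_reg:
                return value
        return None                               # not here: read the RF

    def go_write(self, regfile):
        """GO_Write: retire the oldest result to the RF, in program order."""
        if self.outq:
            dest, value = self.outq.popleft()
            regfile[dest] = value
</pre>
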
----

> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg

So, coming back to this diagram, I think if we stratify the
Functional Units into lanes as well, we may get a multi-issue
architecture.

The 6600 scoreboard rules - which are awesomely simple and actually
involve D-Latches (3 gates) *not* flip-flops (10 gates) - can be executed
in parallel, because there will be no overlap between stratified registers.

If using that odd-even / msw-lsw division (instead of modulo 4 on the
register number) it will be more like a 2-issue for standard RV
instructions and a 4-issue for when SV 32-bit ops are loop-generated.

By subdividing the registers into odd-even banks we will need a
_pair_ of (completely independent) register-renaming tables:
https://libre-riscv.org/3d_gpu/rat_table.png

For SIMD'd operations, if we have the same type of reservation
station queue as with Tomasulo, it can be augmented with the byte-mask:
if the byte-masks in the queue of both the src and dest registers do
not overlap, the operations may be done in parallel.

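A minimal sketch of that byte-mask overlap test (8-bit masks, one bit
per byte of a 64-bit register, are an assumption):

<pre>
# Two queued SIMD operations touching the same register may run in
# parallel when their byte-masks do not overlap.
def may_run_in_parallel(mask_a, mask_b):
    return (mask_a & mask_b) == 0

# Op A touches bytes 0-3 of the register, op B touches bytes 4-7:
assert may_run_in_parallel(0b00001111, 0b11110000)
# Both touch byte 0: must be serialised (or resolved by renaming):
assert not may_run_in_parallel(0b00000001, 0b00001111)
</pre>
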
I still have not yet thought through how the Reorder Buffer would
work: here, again, I am tempted to recommend that we "stratify"
the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries;
however, the CAM is then only 4-bit or 3-bit wide.

If an instruction's destination register does not meet the modulo
requirements, that ROB entry is *left empty*. This does mean that,
for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
4), and there are 4 sequential instructions that happen e.g. to have
a destination of r4 for insn1, r24 for insn2, r16 for insn3, etc.
(all in the same modulo-4 lane), the ROB will only hold 8 such
instructions before filling up.

And that I think is perfectly fine, because, statistically, it'll balance
out, and SV generates sequentially-incrementing instruction registers,
so *that* is fine, too.

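A sketch of that stratified allocation (modulo-4 lanes over a 32-entry
ROB, as described above; the allocation policy is illustrative, and the
sketch omits freeing entries at commit):

<pre>
NUM_LANES = 4          # modulo-4 stratification
ENTRIES_PER_LANE = 8   # 32-entry ROB / 4 lanes

class StratifiedROB:
    def __init__(self):
        self.next_slot = [0] * NUM_LANES   # per-lane circular pointers
        self.used = [0] * NUM_LANES        # per-lane occupancy

    def allocate(self, dest_reg):
        """An instruction may only occupy an entry in its own lane;
        slots in the other lanes are simply left empty."""
        lane = dest_reg % NUM_LANES
        if self.used[lane] == ENTRIES_PER_LANE:
            return None                    # lane full: stall
        slot = self.next_slot[lane]
        self.next_slot[lane] = (slot + 1) % ENTRIES_PER_LANE
        self.used[lane] += 1
        return (lane, slot)                # CAM within a lane is narrow

rob = StratifiedROB()
# r4, r24, r16, ... all map to lane 0, so only that lane's 8 entries fill:
for reg in (4, 24, 16, 8, 12, 20, 28, 0):
    assert rob.allocate(reg) is not None
assert rob.allocate(32) is None            # lane 0 now full
</pre>
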
I'll keep working on diagrams, and also reading Mitch Alsup's chapters
on the 6600. They're frickin awesome. The 6600 could do multi-issue
LD and ST by way of having registers dedicated to LD and ST: X1-X5 were
for LD, X6 and X7 for ST.

----

I took a shot at explaining this also on comp.arch today, and that
allowed me to identify a problem with the proposed modulo-4 "lanes"
stratification.

When a result is created in one lane, it may need to be passed to the next
lane. That means that each of the other lanes needs to keep a watchful
eye on when another lane updates the other regfiles (all 3 of them).

When an incoming update occurs, there may be up to 3 register writes
(that need to be queued?) that need to be broadcast (written) into
reservation stations.

What I'm not sure of is: can data consistency be preserved, even if
there's a delay? My big concern is that during the time when the data is
being broadcast from one lane, the head of the ROB arrives at that
instruction (which is the "commit" condition), it gets committed, and then,
unfortunately, the same ROB# gets *reused*.

Now that I think about it, as long as the length of the queue is below
the size of the Reorder Buffer (preferably well below), and as long as
it's guaranteed to be emptied by the time the ROB cycles through the
whole buffer, it *should* be okay.

----

> Don't forget that in these days of Spectre and Meltdown, merely
> preventing dead instruction results from being written to registers or
> memory is NOT ENOUGH. You also need to prevent load instructions from
> altering cache and branch instructions from altering branch prediction
> state.

Which, oddly enough, provides a necessity for being able to consume
multiple containers from the cache miss buffers, which in turn are
what make a crucial mechanism in the Virtual Vector Method work.

In the past, one would forward the demand container to the waiting
memref and then write the whole line into the cache. Spectre and Meltdown
mean you have to forward multiple times from the miss buffers and avoid
damaging the cache until the instruction retires. VVM uses this to avoid
having a vector strip-mine the data cache.

# References

* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* <https://en.wikipedia.org/wiki/Register_renaming> points out that
  reservation stations take a *lot* of power.
* <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
* MESI cache protocol, python: <https://github.com/sunkarapk/mesi-cache.git>
  <https://github.com/afwolfe/mesi-simulator>
* <https://kshitizdange.github.io/418CacheSim/final-report> report on
  types of caches
* <https://github.com/ssc3?tab=repositories> interesting stuff
* <https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing>
  pipeline bypassing
* <http://ece-research.unm.edu/jimp/611/slides/chap4_7.html> Tomasulo / Reorder
* Register File Bank Caching <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
* <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
* <https://www.eda.ncsu.edu/wiki/FreePDK>