# High-level architectural Requirements

* SMP Cache coherency (TileLink?)
* Minimum 800MHz
* Minimum 2-core SMP, more likely 4-core uniform design,
  each core with full 4-wide SIMD-style predicated ALUs
* 6GFLOPS single-precision FP
* 128 64-bit FP and 128 64-bit INT register files
* RV64GC compliance for running a full GNU/Linux-based OS
* SimpleV compliance
* xBitManip (required for the VPU and ideal for predication)
* On-chip tile buffer (memory-mapped SRAM), likely shared
  between all cores, for the collaborative creation of pixel "tiles".
* 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
  Multi-R x Multi-W for registers 1-31.
  TODO: consider 2R for registers to be used as predication targets
  if >= 32.
* Idea: generic implementation of ports on the register file so as to be able
  to experiment with different arrangements.
* Potentially: lane-swapping / crossing / data-multiplexing
  bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D))
* Potentially: registers subdivided into 16-bit, to match
  elwidth down to 16-bit (for FP16). 8-bit elwidth only
  goes down as far as twin-SIMD (with predication). This
  requires registers to have extra hidden bits: register
  x30 is now "x30.0 + x30.1 + x30.2 + x30.3". This needs discussion.

See [[requirements_specification]]

# Conversation Notes

----

I'm thinking about using TileLink (or something similar) internally, as
having a cache-coherent protocol is required for implementing Vulkan
(unless you want to turn off the cache for the GPU memory, which I
don't think is a good idea). AXI is not a cache-coherent protocol,
and TileLink already has atomic RMW operations built into the protocol.
We can use an AXI-to-TileLink bridge to interface with the memory.

I'm thinking we will want to have a dual-core GPU, since a single
core with 4xSIMD is too slow to achieve 6GFLOPS with a reasonable
clock speed. Additionally, that allows us to use an 800MHz core clock
instead of the 1.6GHz we would otherwise need, allowing us to lower the
core voltage and save power, since dynamic power is proportional to
F\*V^2. (Just guessing on clock speeds.)
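
As a rough sanity-check of that claim, a minimal sketch (the voltages are
assumed purely for illustration, not measured values):

    # Rough dynamic-power comparison: one 1.6GHz core vs two 800MHz cores.
    # Voltages are illustrative assumptions; real values depend on process.
    def dynamic_power(freq_ghz, volts):
        return freq_ghz * volts ** 2  # P proportional to F * V^2

    single = dynamic_power(1.6, 1.0)      # one fast core at nominal voltage
    dual   = 2 * dynamic_power(0.8, 0.8)  # two slow cores, under-volted

    print(single, dual)  # 1.6 vs ~1.02: same throughput, roughly 36% less power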

----

I don't know about power; however I have done some research, and a 4 Kbyte
(or 16, I can't recall) SRAM (what I was thinking of for a tile buffer) takes in the
ballpark of 1000 um^2 in 28nm.
Using a 4xFMA with a banked register file where the bank is selected by the
low-order register number means we could probably get away with 1Rx1W
SRAM as the backing memory for the register file, similarly to Hwacha. I
would suggest 8 banks, allowing us to do more in parallel since we could run
other units in parallel with a 4xFMA. 8 banks would also allow us to clock
gate the SRAM banks that are not in use for the current clock cycle,
allowing us to save more power. Note that the 4xFMA could be 4 separately
allocated FMA units; it doesn't have to be SIMD style. If we have enough hardware
parallelism, we can under-volt and under-clock the GPU cores, allowing for a
more efficient GPU. If we are using the GPU cores as CPU cores as well, I
think it would be important to be able to use a faster clock speed when not
using the extended registers (similar to how Intel processors use a lower
clock rate when AVX512 is in use) so that scalar code is not slowed down
too much.

> > Using a 4xFMA with a banked register file where the bank is selected by
> > the lower-order register number means we could probably get away with
> > 1Rx1W SRAM as the backing memory for the register file, similarly to Hwacha.
>
> okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).
>
Yeah, pretty much, though I had meant the bank number comes from the
least-significant bits of the 7-bit register number.
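
a minimal sketch of that bank-selection rule in python (the 8-bank count
comes from the suggestion above; the function name is illustrative):

    NUM_BANKS = 8  # suggested above; a power of two allows mask selection

    def bank_of(regnum: int) -> int:
        """Select the SRAM bank from the least-significant bits of the
        7-bit register number (registers 0-127)."""
        assert 0 <= regnum < 128
        return regnum & (NUM_BANKS - 1)

    # A 4xFMA reading 4 sequentially-numbered vector registers touches
    # 4 distinct banks, so a 1Rx1W SRAM per bank suffices:
    print([bank_of(r) for r in (32, 33, 34, 35)])  # [0, 1, 2, 3]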

----

Assuming 64-bit operands:
You could organize 2 SRAM macros and use the pair of them to
read/write 4 registers at a time (256 bits). The pipeline will allow you to
dedicate 3 cycles for reading and 1 cycle for writing (4 registers each).

<pre>
RS1 = Read of operand S1
WRd = Write of result Dst
FMx = Floating Point Multiplier, x = stage.

|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
</pre>

The only trick is getting the read and write dedicated on different clocks.
When the RS3 operand is not needed (60% of the time) you can use
the time slot for reading or writing on behalf of memory refs; STs read,
LDs write.

You will find doing VRFs a lot more compact this way. In GPU land we
called the flip-flops orchestrating the timing "collectors".

----

Justification for Branch Prediction:

<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>

We can combine several branch predictors to make a decent predictor:

* call/return predictor -- important as it can predict calls and returns
  with around 99.8% accuracy (a minimal sketch follows this list)
* loop predictor -- basically counts loop iterations
* some kind of global predictor -- handles everything else
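
a minimal sketch of the call/return part (a plain return-address stack;
the class name and depth are illustrative assumptions, not a committed
design):

    # Return-address-stack (RAS) predictor sketch. On a call, push the
    # fall-through PC; on a return, pop it as the predicted target.
    class ReturnAddressStack:
        def __init__(self, depth: int = 8):
            self.depth = depth
            self.stack = []

        def on_call(self, fallthrough_pc: int):
            if len(self.stack) == self.depth:
                self.stack.pop(0)           # overwrite oldest on overflow
            self.stack.append(fallthrough_pc)

        def predict_return(self):
            return self.stack.pop() if self.stack else None

    ras = ReturnAddressStack()
    ras.on_call(0x1004)               # call at 0x1000, next insn at 0x1004
    print(hex(ras.predict_return()))  # 0x1004: why accuracy is ~99.8%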

We will also want a BTB; a smaller one will work. It reduces the average
branch cycle count from 2-3 to 1, since it predicts which instructions
are taken branches while the instructions are still being fetched,
allowing the fetch to go to the target address on the next clock rather
than having to wait for the fetched instructions to be decoded.

----

> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures

well, there is this concept:
https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf

it is a 2-level hierarchy for register caching. honestly, though, the
reservation stations of the tomasulo algorithm are similar to a cache,
although only of the intermediate results, not of the initial operands.

i have a feeling we should investigate putting a 2-level register cache
in front of a multiplexed SRAM.

----

For GPU workloads FP64 is not common, so I think having 1 FP64 ALU would
be sufficient. Since indexed loads and stores are not supported, it will
be important to support 4x64-bit integer operations to generate addresses
for loads/stores.

I was thinking we would use scoreboarding to keep track of operations
and dependencies, since it doesn't need a CAM per ALU. We should be able
to design it to forward past the register file to allow for 0-latency
forwarding. If we combined that with register renaming it should prevent
most WAR and WAW data hazards.

I think branch prediction will be essential, if only to fetch and decode
operations, since it will reduce the branch penalty substantially.

Note that even if we have a zero-overhead loop extension, branch
prediction will still be useful, as we will want to be able to run code
like compilers and standard RV code with decent performance. Additionally,
quite a few shaders have branching in their internal loops, so
zero-overhead loops won't be able to fix all the branching problems.

----

> you would need a 4-wide cdb anyway, since that's the performance we're
> trying for.

if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
per core, which means only a 2-wide CDB, a heck of a lot better than
4.

oh: i thought of another way to cut the power-impact of the Reorder
Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
length equal to the number of registers; 2 is because of 2-issue).

the CAM of a ROB is on the instruction destination register. key:
ROBnum, value: instr-dest-reg. if you have a bitfield that says "this
destreg has no ROB tag", it's dead-easy to check that bitfield, first.
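
a minimal sketch of that bitfield filter (the register count, the names,
and the `rob_cam.search_dest` interface are all illustrative assumptions):

    NUM_REGS = 128

    # One bit per architectural register: "some ROB entry will write this reg".
    has_rob_tag = [False] * NUM_REGS

    def lookup_source(regnum: int, rob_cam) -> str:
        """Check the cheap bitfield before firing the expensive ROB CAM."""
        if not has_rob_tag[regnum]:
            return "read from register file"     # no CAM search needed at all
        rob_entry = rob_cam.search_dest(regnum)  # hypothetical CAM interface
        return f"wait on / forward from ROB entry {rob_entry}"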

----

Avoiding Memory Hazards

* WAW and WAR hazards through memory are eliminated with speculation,
  because actual updating of memory occurs in order, when a store is at
  the head of the ROB, and hence, no earlier loads or stores can still
  be pending
* RAW hazards are maintained by two restrictions:
  1. not allowing a load to initiate the second step of its execution if
     any active ROB entry occupied by a store has a destination
     field that matches the value of the A field of the load, and
  2. maintaining the program order for the computation of an effective
     address of a load with respect to all earlier stores
* These restrictions ensure that any load that accesses a memory location
  written to by an earlier store cannot perform the memory access until
  the store has written the data (see the sketch below).
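
A minimal sketch of restriction 1 (the entry fields mirror the Reorder
Buffer Entry description below; all names are illustrative):

    # Restriction 1: a load may not read memory while any in-flight store
    # in the ROB targets the same address (the "A field" match).
    def load_may_proceed(load_addr: int, rob_entries) -> bool:
        for entry in rob_entries:            # all currently active entries
            if entry.kind == "store" and entry.dest_addr == load_addr:
                return False                 # must wait for store to commit
        return True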

Advantages of Speculation, Load and Store hazards:

* A store updates memory only when it reaches the head of the ROB
* WAW and WAR types of hazards are eliminated with speculation
  (actual updating of memory occurs in order)
* RAW hazards through memory are maintained by not allowing a load
  to initiate the second step of its execution
* Check if any store has a destination field that matches the
  value of the load:
  - SD F1 100(R2)
  - LD F2 100(R2)

Exceptions

* Exceptions are handled by not recognising the exception until the
  instruction that caused it is ready to commit in the ROB (reaches the
  head of the ROB)

Reorder Buffer

* Results of an instruction become visible externally when it leaves
  the ROB
  - Registers updated
  - Memory updated

Reorder Buffer Entry

* Instruction type
  - branch (no destination result)
  - store (has a memory address destination)
  - register operation (ALU operation or load, which has register destinations)
* Destination
  - register number (for loads and ALU ops) or
  - memory address (for stores) where the result should be written
* Value
  - value of instruction result, pending a commit
* Ready
  - indicates that the instruction has completed execution: value is ready
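
those four fields translate directly into a record; a minimal sketch,
with assumed field types:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ROBEntry:
        kind: str                        # "branch", "store" or "register op"
        dest_reg: Optional[int] = None   # register number (loads / ALU ops)
        dest_addr: Optional[int] = None  # memory address (stores)
        value: Optional[int] = None      # result, held until commit
        ready: bool = False              # execution complete, value is valid

    # e.g. an ALU op writing r4, result not yet computed:
    entry = ROBEntry(kind="register op", dest_reg=4)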

----

Register Renaming resources

* <https://www.youtube.com/watch?v=p4SdrUhZrBM>
* <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
* ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>

Video @ 3:24, "RAT" table - Register Alias Table:

<img src="/3d_gpu/rat_table.png" />

This scheme looks very much like a Reservation Station.

----

There is another way to get precise ordering of the writes in a scoreboard.
First, one has to implement forwarding in the scoreboard.
Second, the function units need an output queue (of, say, 4 registers).
Now, one can launch an instruction and pick up its operand either
from the RF or from the function unit output while the result sits
in the function unit waiting for its GO_Write signal.

Thus the launching of instructions is not delayed due to hazards,
but the results are delivered to the RF in program order.

This looks surprisingly like a "belt" at the end of the function unit.
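
a minimal sketch of such an output queue (the depth of 4 follows the
suggestion above; the class and method names are illustrative assumptions):

    from collections import deque

    class FUOutputQueue:
        """Holds completed results until the scoreboard raises GO_Write, so
        results reach the RF in program order, while later instructions can
        still pick up operands directly from here (forwarding)."""
        def __init__(self, depth: int = 4):
            self.depth = depth
            self.queue = deque()

        def complete(self, regnum: int, value: int):
            assert len(self.queue) < self.depth, "FU must stall: queue full"
            self.queue.append((regnum, value))  # result waits here, not in RF

        def forward(self, regnum: int):
            """Operand pick-up straight from the FU output (forwarding path)."""
            for reg, value in reversed(self.queue):  # newest value of reg wins
                if reg == regnum:
                    return value
            return None                              # not here: read the RF

        def go_write(self):
            """GO_Write: the oldest result may now be delivered to the RF."""
            return self.queue.popleft() if self.queue else None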

----

> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg

so, coming back to this diagram, i think if we stratify the
Functional Units into lanes as well, we may get a multi-issue
architecture.

the 6600 scoreboard rules - which are awesomely simple and actually
involve D-Latches (3 gates) *not* flip-flops (10 gates) - can be executed
in parallel, because there will be no overlap between stratified registers.

if using that odd-even / msw-lsw division (instead of modulo 4 on the
register number) it will be more like a 2-issue for standard RV
instructions and a 4-issue for when SV 32-bit ops are loop-generated.

by subdividing the registers into odd-even banks we will need a
_pair_ of (completely independent) register-renaming tables:
https://libre-riscv.org/3d_gpu/rat_table.png

for SIMD'd operations, if we have the same type of reservation
station queue as with Tomasulo, it can be augmented with the byte-mask:
if the byte-masks in the queue of both the src and dest registers do
not overlap, the operations may be done in parallel.
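
a minimal sketch of that overlap test, assuming masks are held as 8-bit
integers (one bit per byte of a 64-bit register):

    # One bit per byte of a 64-bit register: bit i set => byte i is touched.
    def may_run_in_parallel(mask_a: int, mask_b: int) -> bool:
        """Two queued SIMD ops on the same register are independent
        if their byte-masks do not overlap."""
        return (mask_a & mask_b) == 0

    print(may_run_in_parallel(0b00001111, 0b11110000))  # True: low vs high half
    print(may_run_in_parallel(0b00111100, 0b11110000))  # False: bytes 4-5 clash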

i still have not yet thought through how the Reorder Buffer would
work: here, again, i am tempted to recommend that, again, we "stratify"
the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries,
however the CAM is only 4-bit or 3-bit wide.

if an instruction's destination register does not meet the modulo
requirements, that ROB entry is *left empty*. this does mean that,
for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
4), and there are 4 sequential instructions that happen e.g. to have
a destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
etc.... the ROB will only hold 8 such instructions

and that i think is perfectly fine, because, statistically, it'll balance
out, and SV generates sequentially-incrementing instruction registers,
so *that* is fine, too.

i'll keep working on diagrams, and also reading mitch alsup's chapters
on the 6600. they're frickin awesome. the 6600 could do multi-issue
LD and ST by way of having dedicated registers for LD and ST: X1-X5 were
for LD, X6 and X7 for ST.

----

i took a shot at explaining this also on comp.arch today, and that
allowed me to identify a problem with the proposed modulo-4 "lanes"
stratification.

when a result is created in one lane, it may need to be passed to the next
lane. that means that each of the other lanes needs to keep a watchful
eye on when another lane updates the other regfiles (all 3 of them).

when an incoming update occurs, there may be up to 3 register writes
(that need to be queued?) that need to be broadcast (written) into
reservation stations.

what i'm not sure of is: can data consistency be preserved, even if
there's a delay? my big concern is that during the time where the data is
broadcast from one lane, the head of the ROB arrives at that instruction
(which is the "commit" condition), it gets committed, then, unfortunately,
the same ROB# gets *reused*.

now that i think about it, as long as the length of the queue is below
the size of the Reorder Buffer (preferably well below), and as long as
it's guaranteed to be emptied by the time the ROB cycles through the
whole buffer, it *should* be okay.

----

> Don't forget that in these days of Spectre and Meltdown, merely
> preventing dead instruction results from being written to registers or
> memory is NOT ENOUGH. You also need to prevent load instructions from
> altering cache and branch instructions from altering branch prediction
> state.

Which, oddly enough, provides a necessity for being able to consume
multiple containers from the cache miss buffers, which, oddly enough,
are what makes a crucial mechanism in the Virtual Vector Method work.

In the past, one would forward the demand container to the waiting
memref and then write the whole line into the cache. S&M (Spectre and
Meltdown) means you have to forward multiple times from the miss buffers
and avoid damaging the cache until the instruction retires. VVM uses this
to avoid having a vector strip-mine the data cache.

----

> I meant the renaming done as part of the SV extension, not the
> microarchitectural renaming.

ah ok, yes. right. ok, so i don't know what to name that, and i'd
been thinking of it in terms of "post-renaming", as in my mind, it's
not really renaming, at all, it's... remapping. or, vector
"elements".

as in: architecturally we already have a name (vector "elements").
physically we already have a name: register file.

i was initially thinking that the issue stage would take care of it,
by producing:

* post-remapped elements, which are basically post-remapped register indices
* a byte-mask indicating which *bytes* of the register are to be
  modified and which left alone
* an element-width that is effectively an augmentation of (part of) the opcode

the element width goes into the ALU as an augmentation of the opcode,
because the 64-bit "register" now contains e.g. 16-bit "elements"
indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a
SIMD-style (predicated) operation to take place.
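
a minimal sketch of how the issue stage might derive that byte-mask (the
function and its exact semantics are illustrative assumptions, not a
committed design):

    def element_byte_mask(elwidth_bytes: int, element_index: int) -> int:
        """Byte-mask (bit i = byte i of the 64-bit register) selecting one
        element of the given width at the given index within the register."""
        assert elwidth_bytes in (1, 2, 4, 8)
        base = element_index * elwidth_bytes
        assert base + elwidth_bytes <= 8, "element must fit a 64-bit register"
        return ((1 << elwidth_bytes) - 1) << base

    print(bin(element_byte_mask(2, 1)))  # 0b1100: 16-bit element 1, bytes 2-3
    print(bin(element_byte_mask(1, 7)))  # 0b10000000: 8-bit element 7, byte 7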

now that i think about it, i think we may need to have the three
phases be part of a pipeline, in a single dependency matrix.

----

I had a state machine in one chip that could come up out of power-on in a
state it could not get out of. Since that experience, I have a rule for
state machines: a state machine must be able to go from any state to idle
when the reset line is asserted.

You have to prove that the logic can never create a circular dependency:
not a proof with test vectors, but a logical proof like what we do with FP
arithmetic these days.
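
a minimal sketch of the reset rule, in python rather than HDL (the states
and transitions are illustrative):

    from enum import Enum, auto

    class State(Enum):
        IDLE = auto()
        BUSY = auto()
        DONE = auto()

    def next_state(state: State, reset: bool, start: bool) -> State:
        # The rule: reset dominates everything and works from *any* state,
        # so no power-on state can ever be inescapable.
        if reset:
            return State.IDLE
        if state is State.IDLE and start:
            return State.BUSY
        if state is State.BUSY:
            return State.DONE
        return state

    # reset recovers even from an arbitrary power-on state:
    assert next_state(State.DONE, reset=True, start=False) is State.IDLE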

----

> however... we don't mind that, as the vectorisation engine will
> be, for the most part, generating sequentially-increasing index
> dest *and* src registers, so we kinda get away with it.

In this case: you could simply design a 1R or 1W file (a.k.a. SRAM)
and read 4 registers at a time or write 4 registers at a time. Timing
looks like:

<pre>
|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
|F123|F123|F123|F123|
|EsK1|EsK2|EsK3|EsK4|
|EfK1|EfK2|EfK3|EfK4|
</pre>

4-cycle FU shown. Read as much as you need in 4 cycles for one operand,
read as much as you need in 4 cycles for another operand, read as much
as you need in 4 cycles for the last operand, then write as much as you
can for the result. This simply requires flip-flops to capture the width
and then deliver operands in parallel (serial-to-parallel converter), and
similarly for writing.

----

* <https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/32aNXIzeDQAJ>

discussion of how to do dest-latches rather than src-latches.

also includes the need for forwarding to achieve it (synonymous with the
Tomasulo CDB).

also, assigning a result number at issue time allows multiple results
to be stored-and-forwarded, meaning that multiplying up the FUs is
not needed.

also, discussion of how to have multiple instructions issued even with
the same dest reg: drop the reg-store and effectively rename them
to "R.FU#". exceptions under discussion.

# Register File having same-cycle "forwarding"

discussion about the CDC 6600 Register File: it was capable of forwarding
operands being written out to "reads", *in the same cycle*. this
effectively turns the Reg File *into* a "Forwarding Bus".

we aim to only have (4 banks of) 2R1W ported register files,
with *additional* Forwarding Multiplexers (which look exactly
like multi-port regfile gate logic).

suggestion by Mitch is to have a "demon" on the front of the regfile,
<https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/qY2SYjd2DgAJ>,
which:

basically, you are going to end up with a "demon" at the RF, and when
all read reservations have been satisfied the demon determines if the
result needs to be written to the RF or discarded. The demon sees
the instruction issue process, the branch resolutions, and the FU
exceptions, and keeps track of whether the result needs to be written.
It then forwards the result from the FU and clears the slot, then writes
the result to the RF if needed.

# Design Layout

ok, so continuing some thoughts-in-order notes:

## Scoreboards

scoreboards are not just scoreboards, they are dependency matrices,
and there are several of them:

* one for LOAD/STORE-to-LOAD/STORE
  - most recent LOADs prevent later STOREs
  - most recent STOREs prevent later LOADs.
  - a separate process analyses LOAD-STORE addresses for
    conflicts, based on sufficient bits to assess uniqueness
    as opposed to precise and exact matches
* one for Function-Unit to Function-Unit.
  - it expresses both RAW and WAW hazards through "Go_Write"
    and "Go_Read" signals, which are stopped from proceeding by
    dependent 1-bit CAM latches
  - exceptions may ALSO be made "precise" by holding a "Write prevention"
    signal. only when the Function Unit knows that an exception is
    not going to occur (memory has been fetched, for example) does
    it release the signal
  - speculative branch execution likewise may hold a "Write prevention",
    however it also needs a "Go die" signal, to clear out the
    incorrectly-taken branch.
  - LOADs/STOREs *also* must be considered as "Functional Units" and thus
    must also have corresponding entries (plural) in the FU-to-FU Matrix
  - it is permitted for ALUs to *BEGIN* execution (read operands are
    valid) without being permitted to *COMMIT*. thus, each FU must
    store (buffer) results, until such time as a "commit" signal is
    received
  - we may need to express an inter-dependence on the instruction order
    (raising the WAW hazard line to do so) as a way to preserve execution
    order. only the oldest instructions will have this flag dropped,
    permitting execution that has *begun* to also reach "commit" phase.
* one for Function-Unit to Registers (sketched after this list).
  - it expresses the read and write requirements: the source
    and destination registers on which the operation depends. source
    registers are marked "need read", dest registers marked
    "need write".
  - by having *more than one* Functional Unit matrix row per ALU
    it becomes possible to effectively achieve "Reservation Stations"
    orthogonality with the Tomasulo Algorithm. the FU row must, like
    RS's, take and store a copy of the src register values.
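
a minimal sketch of the FU-to-Registers matrix (the sizes and names are
illustrative assumptions):

    NUM_FUS, NUM_REGS = 16, 128

    # One row per Function Unit, one column per register; each cell holds
    # the two dependency bits described above.
    need_read  = [[False] * NUM_REGS for _ in range(NUM_FUS)]
    need_write = [[False] * NUM_REGS for _ in range(NUM_FUS)]

    def issue(fu, srcs, dest):
        """Record an instruction's register dependencies in the matrix."""
        for s in srcs:
            need_read[fu][s] = True      # sources marked "need read"
        need_write[fu][dest] = True      # destination marked "need write"

    def reg_is_busy(reg):
        """A pending write anywhere in the column is a hazard for readers."""
        return any(need_write[fu][reg] for fu in range(NUM_FUS))

    issue(fu=3, srcs=[2, 3], dest=1)       # e.g. MUL r1, r2, r3
    print(reg_is_busy(1), reg_is_busy(2))  # True False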

## Register Renaming

There are several potential well-known schemes for register-renaming:
*none of them will be used here*. The scheme below is a new form of
renaming that is a topologically and functionally **direct** equivalent
of the Tomasulo Algorithm with a Reorder Buffer, derived from the
"Register Alias Table" concept, that is better suited to Scoreboards.
It works by flattening out Reservation Stations to one per FU (requiring
more FUs as a result). On top of this, the function normally carried
out by the "tags" of the RAT table may be merged-morphed into the role
carried out by the ROB Destination Register CAM, which may be merged-morphed
into a single vector (per register) of 1-bit mutually-exclusive "CAMs"
that are added, very simply, to the FU-Register Dependency Matrix.

In this way, exactly as in the Tomasulo Algorithm, there is absolutely no
need whatsoever for a separate PRF-ARF scheme. The PRF *is* the ARF.

Register-renaming will be done with a single extra mutually-exclusive bit
in the FUxReg Dependency Matrix, which may be set on only one FU (per register).
This bit indicates which of the FUs has the **most recent** destination
register value pending. It is **directly** functionally equivalent to
the Reorder Buffer Dest Reg# CAM value, except that now it is a
string of 1-bit "CAMs".

When an FU needs a src reg and finds that it needs to create a
dependency waiting for a result to be created, it must use this
bit to determine which FU it creates a dependency on.

If there is a destination register that already has a bit set
(anywhere in the column), it is **cleared** and **replaced**
with a bit in the FU's row and the destination register's column.

See https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/c80jRn4PCQAJ

MUL r1, r2, r3

<pre>
FU name   Reg name
          12345678
add-0     ........
add-1     ........
mul-0     X.......
mul-1     ........
</pre>

ADD r4, r1, r3

<pre>
FU name   Reg name
          12345678
add-0     ...X....
add-1     ........
mul-0     X.......
mul-1     ........
</pre>

ADD r1, r5, r6

<pre>
FU name   Reg name
          12345678
add-0     ...X....
add-1     X.......
mul-0     ........
mul-1     ........
</pre>

note how on the 3rd instruction, the (mul-0, R1) entry is **cleared**
and **replaced** with an (add-1, R1) entry. future instructions now
know that if their src operands require R1, they are to place a
RAW dependency on **add-1**, not mul-0.
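
a minimal sketch of the clear-and-replace rule (4 FUs and 8 registers to
match the tables above; the names are illustrative):

    FUS = ["add-0", "add-1", "mul-0", "mul-1"]
    NUM_REGS = 8

    # pending[fu][reg]: this FU holds the most recent pending write to reg.
    # At most one bit may ever be set per register column.
    pending = {fu: [False] * NUM_REGS for fu in FUS}

    def set_dest(fu, reg):
        for other in FUS:
            pending[other][reg] = False  # clear any existing bit in the column
        pending[fu][reg] = True          # replace with this FU's bit

    def raw_dependency(src_reg):
        """Which FU (if any) must a reader of src_reg wait on?"""
        return next((fu for fu in FUS if pending[fu][src_reg]), None)

    set_dest("mul-0", 1)      # MUL r1, r2, r3
    set_dest("add-0", 4)      # ADD r4, r1, r3
    set_dest("add-1", 1)      # ADD r1, r5, r6 -- clears mul-0's bit on r1
    print(raw_dependency(1))  # add-1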

## Multi-issue

we may potentially have 2-issue (or 4-issue), with simpler issue and
detection, by "striping" the register file according to modulo 2 (or 4)
on the destination register number

* the Function Unit rows are multiplied up by 2 (or 4), however they are
  actually connected to the same ALUs (pipelined and with both src and
  dest register buffers/latches).
* the Register Read and Write signals are then "striped" such that
  read/write requests for every 2nd (or 4th) register are "grouped" and
  will have to fight for access to a multiplexer in order to access
  registers that do not have the same modulo 2 (or 4) match.
* we MAY potentially be able to drop the destination (write) multiplexer(s)
  by only permitting FU rows with the same modulo to write to that
  destination bank. FUs with indices 0,4,8,12 may only write to registers
  similarly numbered.
* there will therefore be FOUR separate register-data buses, with (at least)
  the Read buses multiplexed so that all FU banks may read all src registers
  (even if there is contention for the multiplexers)

## FU-to-Register address de-muxed already

an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
write/read enable signals already exist as single bits. "normal" processors
store the src/dest registers as an index (5 bits == 0-31), where in this
design, that has been expanded out to 32 individual Read/Write wires,
already.

* the register file Verilog implementation therefore must take in an
  array of 128-bit write-enable and 128-bit read-enable signals.
* however the data buses will be multiplexed modulo 2 (or 4) according
  to the lower bits of the register number, in order to cross "lanes".
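
a minimal sketch of that index-to-wires expansion (the names are
illustrative):

    def one_hot_enable(regnum: int, width: int = 128) -> int:
        """Expand a register index into individual per-register enable
        wires: bit regnum of the result is that register's enable."""
        assert 0 <= regnum < width
        return 1 << regnum

    # a 5-bit index 0-31 becomes 32 wires; here, 8 regs -> 8 wires:
    print(bin(one_hot_enable(3, width=8)))  # 0b1000: only reg 3's wire high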

## FU "Grouping"

with so many Function Units in RISC-V (dozens of instructions, times 2
to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
we almost certainly are going to have to deploy a "grouping" scheme:

* rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
  to MUL etc., instead we group the FUs by how many src and dest
  registers are required, and *pass the opcode down to them*
* only FUs with the exact same number (and type) of register profile
  will receive like-minded opcodes.
* when src and dest are free for a particular op (and an ALU pipeline is
  not stalled) the FU is at liberty to push the operands into the
  appropriate free ALU.
* FUs therefore only really express the register, memory, and execution
  dependencies: they don't actually do the execution.

## Recommendations

* Include a merged address-generator in the INT ALU
* Have simple ALU units duplicated and allow more than one FU to
  receive (and process) the src operands.

## Register file workloads

Note: vectorisation also includes predication, which is one extra integer read.

Integer workloads:

* 43% Integer
* 21% Load
* 12% Store
* 24% Branch

* 100% of the instruction stream can be integer instructions
* 75% utilize two source operand registers
* 50% of the instruction stream can be Load instructions
* 25% can be Store instructions
* 25% can be Branch instructions

FP workloads:

* 30% Integer
* 25% Load
* 10% Store
* 13% Multiplication
* 17% Addition
* 5% Branch

----

> in particular i found it fascinating that analysis of INT
> instructions found a 50% LD, 25% ST and 25% branch, and that
> 70% were 2-src ops. therefore you made sure that the number
> of read and write ports matched these, to ensure no bottlenecks,
> bearing in mind that ST requires reading an address *and*
> a data register.

I never had a problem in "reading the write slot" in any of my pipelines.
That is, take a pipeline where LD (cache hit) has a latency of 3 cycles
(AGEN, Cache, Align). Align would be in the cycle where the data was being
forwarded, and in the subsequent cycle, data could be written into the RF.

<pre>
|dec|AGN|$$$|ALN|LDW|
</pre>

For stores I would read the LD's write slot, align the store data, and merge
into the cache as:

<pre>
|dec|AGEN|tag|---|STR|ALN|$$$|
</pre>

You know 4 cycles in advance that a store is coming, 2 cycles after hit,
so there is easy logic to decide to read the write slot (or not), and it
costs 2 address comparators to disambiguate this short shadow in the pipeline.

This is a lower expense than building another read port into the RF, in
both area and power, and uses the pipeline efficiently.

# References

* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* <https://en.wikipedia.org/wiki/Register_renaming> points out that
  reservation stations take a *lot* of power.
* <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
* MESI cache protocol, python <https://github.com/sunkarapk/mesi-cache.git>
  <https://github.com/afwolfe/mesi-simulator>
* <https://kshitizdange.github.io/418CacheSim/final-report> report on
  types of caches
* <https://github.com/ssc3?tab=repositories> interesting stuff
* <https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing>
  pipeline bypassing
* <http://ece-research.unm.edu/jimp/611/slides/chap4_7.html> Tomasulo / Reorder
* Register File Bank Caching <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
* <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
* <https://www.eda.ncsu.edu/wiki/FreePDK>