# High-level architectural Requirements

* SMP cache coherency (TileLink?)
* Minimum 800MHz
* Minimum 2-core SMP, more likely 4-core uniform design,
  each core with full 4-wide SIMD-style predicated ALUs
* 6GFLOPS single-precision FP
* 128 64-bit FP and 128 64-bit INT register files
* RV64GC compliance for running a full GNU/Linux-based OS
* SimpleV compliance
* xBitManip (required for VPU and ideal for predication)
* On-chip tile buffer (memory-mapped SRAM), likely shared
  between all cores, for the collaborative creation of pixel "tiles".
* 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
  Multi-R x Multi-W for registers 1-31.
  TODO: consider 2R for registers to be used as predication targets
  if >= 32.
* Idea: generic implementation of ports on the register file so as to be able
  to experiment with different arrangements.
* Potentially: lane-swapping / crossing / data-multiplexing
  bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D))
* Potentially: registers subdivided into 16-bit, to match
  elwidth down to 16-bit (for FP16). 8-bit elwidth only
  goes down as far as twin-SIMD (with predication). This
  requires registers to have extra hidden bits: register
  x30 is now "x30.0 + x30.1 + x30.2 + x30.3". To be discussed.

# Conversation Notes

----

I'm thinking about using TileLink (or something similar) internally, as
having a cache-coherent protocol is required for implementing Vulkan
(unless you want to turn off the cache for the GPU memory, which I
don't think is a good idea). AXI is not a cache-coherent protocol,
and TileLink already has atomic RMW operations built into the protocol.
We can use an AXI-to-TileLink bridge to interface with the memory.

I'm thinking we will want to have a dual-core GPU, since a single
core with 4xSIMD is too slow to achieve 6GFLOPS with a reasonable
clock speed. Additionally, that allows us to use an 800MHz core clock
instead of the 1.6GHz we would otherwise need, allowing us to lower the
core voltage and save power, since the power used is proportional to
F\*V^2. (Just guessing on clock speeds.)

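To put rough numbers on that trade-off, here is a minimal sketch of the
F\*V^2 argument (Python; the voltage figures are purely illustrative
assumptions, not measured or decided values):

<pre>
# Minimal sketch of the P ~ F*V^2 argument. Voltages are illustrative
# assumptions: the premise is that halving the clock permits a lower Vdd.
f_hi, v_hi = 1.6e9, 1.0    # one core at 1.6GHz, nominal 1.0V (assumed)
f_lo, v_lo = 0.8e9, 0.8    # two cores at 800MHz, under-volted to 0.8V (assumed)

p_single = f_hi * v_hi**2        # relative dynamic power, single fast core
p_dual   = 2 * f_lo * v_lo**2    # two slower cores, same total throughput

print(p_dual / p_single)         # 0.64: ~36% less dynamic power
</pre>
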
----

I don't know about power; however, I have done some research, and a 4KByte
(or 16KByte, I don't recall which) SRAM (what I was thinking of for a tile
buffer) takes in the ballpark of 1000 um^2 in 28nm.
Using a 4xFMA with a banked register file where the bank is selected by the
low-order register number means we could probably get away with 1Rx1W
SRAM as the backing memory for the register file, similarly to Hwacha. I
would suggest 8 banks, allowing us to do more in parallel, since we could run
other units in parallel with a 4xFMA. 8 banks would also allow us to
clock-gate the SRAM banks that are not in use for the current clock cycle,
allowing us to save more power. Note that the 4xFMA could be 4 separately
allocated FMA units; it doesn't have to be SIMD-style. If we have enough
hardware parallelism, we can under-volt and under-clock the GPU cores,
allowing for a more efficient GPU. If we are using the GPU cores as CPU cores
as well, I think it would be important to be able to use a faster clock speed
when not using the extended registers (similar to how Intel processors use a
lower clock rate when AVX512 is in use) so that scalar code is not slowed
down too much.

> > Using a 4xFMA with a banked register file where the bank is selected by the
> > lower-order register number means we could probably get away with 1Rx1W
> > SRAM as the backing memory for the register file, similarly to Hwacha.
>
> okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).

Yeah, pretty much, though I had meant that the bank number comes from the
least-significant bits of the 7-bit register number.

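A minimal sketch of that bank-selection rule (Python; the 8-bank count is
the suggestion above, the register numbers are arbitrary examples): the bank
index is just the least-significant bits of the 7-bit register number, so
four sequential registers always land in four distinct 1Rx1W banks and can
be accessed in the same cycle.

<pre>
# Bank selection for a banked 1Rx1W register file: the bank index comes
# from the least-significant bits of the 7-bit register number.
NUM_BANKS = 8   # suggested above; must be a power of two for this to work

def bank_of(regnum):
    return regnum & (NUM_BANKS - 1)

# Four sequential registers (e.g. one 4xFMA operand group) map to four
# different banks, so all four accesses can proceed in parallel:
assert len({bank_of(r) for r in range(36, 40)}) == 4
</pre>
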
----

Assuming 64-bit operands:
if you can organize 2 SRAM macros and use the pair of them to
read/write 4 registers at a time (256 bits), the pipeline will allow you to
dedicate 3 cycles for reading and 1 cycle for writing (4 registers each).

<pre>
RS1 = Read of operand S1
WRd = Write of result Dst
FMx = Floating Point Multiplier, x = stage.

|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
|RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|
|FWD|FM1|FM2|FM3|FM4|WRd|
</pre>

The only trick is getting the reads and the write dedicated to different
clocks. When the RS3 operand is not needed (60% of the time) you can use
the time slot for reading or writing on behalf of memory refs: STs read,
LDs write.

You will find doing VRFs a lot more compact this way. In GPU land we
called the flip-flops orchestrating the timing "collectors".

----

Justification for Branch Prediction

<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>

We can combine several branch predictors to make a decent predictor:

* call/return predictor -- important as it can predict calls and returns
  with around 99.8% accuracy
* loop predictor -- basically counts loop iterations
* some kind of global predictor -- handles everything else

We will also want a BTB; a smaller one will work. It reduces the average
branch cycle count from 2-3 to 1, since it predicts which instructions
are taken branches while the instructions are still being fetched,
allowing the fetch to go to the target address on the next clock rather
than having to wait for the fetched instructions to be decoded. A sketch
of the mechanism follows.

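A minimal direct-mapped BTB model (Python; the entry count and indexing
scheme are illustrative assumptions, not decided parameters), showing why
fetch can redirect one cycle after a hit:

<pre>
# Minimal direct-mapped BTB sketch. Size and indexing are assumptions.
BTB_ENTRIES = 64
btb = [None] * BTB_ENTRIES          # each entry: (branch_pc, target) or None

def next_fetch_pc(pc):
    entry = btb[(pc >> 2) % BTB_ENTRIES]
    if entry is not None and entry[0] == pc:
        return entry[1]             # predicted taken: redirect next clock
    return pc + 4                   # no hit: fall through

def train(pc, target):              # called when a branch resolves as taken
    btb[(pc >> 2) % BTB_ENTRIES] = (pc, target)
</pre>
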
----

> <https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures>

well, there is this concept:
<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>

it is a 2-level hierarchy for register caching. honestly, though, the
reservation stations of the Tomasulo algorithm are similar to a cache,
although only of the intermediate results, not of the initial operands.

i have a feeling we should investigate putting a 2-level register cache
in front of a multiplexed SRAM.

----

For GPU workloads FP64 is not common, so I think having one FP64 ALU would
be sufficient. Since indexed loads and stores are not supported, it will
be important to support 4x 64-bit integer operations to generate addresses
for loads/stores.

I was thinking we would use scoreboarding to keep track of operations
and dependencies, since it doesn't need a CAM per ALU. We should be able
to design it to forward past the register file to allow for 0-latency
forwarding. If we combined that with register renaming it should prevent
most WAR and WAW data hazards.

I think branch prediction will be essential, if only for fetch and decode,
since it will reduce the branch penalty substantially.

Note that even if we have a zero-overhead loop extension, branch
prediction will still be useful, as we will want to be able to run code
like compilers and standard RV code with decent performance. Additionally,
quite a few shaders have branching in their internal loops, so
zero-overhead loops won't be able to fix all the branching problems.

----

> you would need a 4-wide cdb anyway, since that's the performance we're
> trying for.

if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
per core, which means only a 2-wide CDB: a heck of a lot better than
4.

oh: i thought of another way to cut the power-impact of the Reorder
Buffer CAMs: a simple bit-field (a single-bit 2R2W memory, of address
length equal to the number of registers; the 2 is because of 2-issue).

the CAM of a ROB is on the instruction's destination register. key:
ROBnum, value: instr-dest-reg. if you have a bitfield that says "this
destreg has no ROB tag", it's dead-easy to check that bitfield, first.

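A minimal sketch of that filter (Python; the register count comes from the
requirements above, and the ROB is modelled as a plain list purely for
illustration): the per-register bit is checked first, and the expensive
CAM match on the ROB's dest-reg field only happens when the bit says a
tag actually exists.

<pre>
# One bit per architectural register: "this destreg currently has a ROB tag".
# (Set when a ROB entry for that reg is allocated; cleared at commit if the
# committing entry still owns the register.)
NUM_REGS = 128
has_rob_tag = [False] * NUM_REGS

def find_producer(reg, rob):
    """rob: list of (robnum, destreg) pairs (stand-in for the real CAM)."""
    if not has_rob_tag[reg]:
        return None                 # cheap early-out: skip the CAM entirely
    for robnum, destreg in rob:     # only now pay for the CAM-equivalent search
        if destreg == reg:
            return robnum
</pre>
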
----

Avoiding Memory Hazards

* WAW and WAR hazards through memory are eliminated with speculation,
  because the actual updating of memory occurs in order, when a store is at
  the head of the ROB, and hence no earlier loads or stores can still
  be pending.
* RAW hazards are maintained by two restrictions:
  1. not allowing a load to initiate the second step of its execution if
     any active ROB entry occupied by a store has a destination
     field that matches the value of the A field of the load, and
  2. maintaining the program order for the computation of the effective
     address of a load with respect to all earlier stores.
* These restrictions ensure that any load that accesses a memory location
  written to by an earlier store cannot perform the memory access until
  the store has written the data.

Advantages of speculation for load and store hazards:

* A store updates memory only when it reaches the head of the ROB
* WAW and WAR types of hazards are eliminated with speculation
  (actual updating of memory occurs in order)
* RAW hazards through memory are maintained by not allowing a load
  to initiate the second step of its execution
* Check if any store has a destination field that matches the
  address of the load (see the sketch below), e.g.:
  - SD F1 100(R2)
  - LD F2 100(R2)

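A minimal sketch of restriction 1 above (Python; the entry fields are named
purely for illustration): the load's effective address is compared against
the address field of every active, uncommitted store in the ROB before the
load may access memory.

<pre>
# Restriction 1: a load may not start its memory access while any active
# ROB store entry's address field matches the load's effective address.
def load_may_proceed(load_addr, rob_entries):
    for is_store, addr, committed in rob_entries:
        if is_store and not committed and addr == load_addr:
            return False            # wait: the store must write its data first
    return True
</pre>
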
Exceptions

* Exceptions are handled by not recognising the exception until the
  instruction that caused it is ready to commit in the ROB (reaches the
  head of the ROB)

Reorder Buffer

* Results of an instruction become visible externally when it leaves
  the ROB
  - Registers updated
  - Memory updated

Reorder Buffer Entry

* Instruction type
  - branch (no destination result)
  - store (has a memory address destination)
  - register operation (ALU operation or load, which has register destinations)
* Destination
  - register number (for loads and ALU ops) or
  - memory address (for stores) where the result should be written
* Value
  - value of instruction result, pending a commit
* Ready
  - indicates that the instruction has completed execution: value is ready

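The four fields above, transcribed as a minimal sketch (a Python dataclass,
used purely as notation for the entry layout):

<pre>
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    instr_type: str             # "branch" | "store" | "register-op"
    dest: Optional[int]         # reg number (loads/ALU) or mem address (stores)
    value: Optional[int] = None # instruction result, held pending commit
    ready: bool = False         # execution complete: value is valid
</pre>
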
----

Register Renaming resources

* <https://www.youtube.com/watch?v=p4SdrUhZrBM>
* <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
* ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>

Video @ 3:24, "RAT" table - Register Alias Table:

<img src="/3d_gpu/rat_table.png" />

This scheme looks very much like a Reservation Station.

----

There is another way to get precise ordering of the writes in a scoreboard.
First, one has to implement forwarding in the scoreboard.
Second, the function units need an output queue (of, say, 4 registers).
Now, one can launch an instruction and pick up its operand either
from the RF or from the function unit output, while the result sits
in the function unit waiting for its GO_Write signal.

Thus the launching of instructions is not delayed due to hazards,
but the results are delivered to the RF in program order.

This looks surprisingly like a "belt" at the end of the function unit.

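A minimal sketch of that output queue (Python; the GO_Write name and the
queue depth of 4 come from the note above, everything else is an
illustrative assumption): results wait in the FU's queue, operands can be
forwarded from it past the RF, and writes drain to the RF only when
GO_Write arrives, i.e. in program order.

<pre>
from collections import deque

class FUOutputQueue:
    def __init__(self, depth=4):
        self.depth = depth
        self.q = deque()                   # (dest_reg, value), completion order

    def complete(self, dest_reg, value):
        assert len(self.q) < self.depth    # FU stalls if its queue is full
        self.q.append((dest_reg, value))   # result parked; RF not yet written

    def forward(self, src_reg):
        for dest, value in reversed(self.q):  # most recent value wins
            if dest == src_reg:
                return value               # operand picked up past the RF
        return None                        # not here: read the RF instead

    def go_write(self, rf):
        dest, value = self.q.popleft()     # GO_Write: oldest result leaves first
        rf[dest] = value                   # delivered to the RF in program order
</pre>
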
----

> <https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg>

so, coming back to this diagram, i think if we stratify the
Functional Units into lanes as well, we may get a multi-issue
architecture.

the 6600 scoreboard rules - which are awesomely simple, and actually
involve D-Latches (3 gates) *not* flip-flops (10 gates) - can be executed
in parallel because there will be no overlap between stratified registers.

if using that odd-even / msw-lsw division (instead of modulo 4 on the
register number) it will be more like a 2-issue for standard RV
instructions and a 4-issue for when SV 32-bit ops are loop-generated.

by subdividing the registers into odd-even banks we will need a
_pair_ of (completely independent) register-renaming tables:
<https://libre-riscv.org/3d_gpu/rat_table.png>

for SIMD'd operations, if we have the same type of reservation
station queue as with Tomasulo, it can be augmented with the byte-mask:
if the byte-masks in the queue of both the src and dest registers do
not overlap, the operations may be done in parallel (see the sketch below).

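A minimal sketch of that byte-mask overlap test (Python; masks shown as
8-bit integers, one bit per byte of a 64-bit register, which is an assumed
encoding):

<pre>
# One bit per byte of a 64-bit register. Two queued SIMD operations on the
# same register may proceed in parallel if their byte-masks do not overlap.
def may_run_in_parallel(mask_a, mask_b):
    return (mask_a & mask_b) == 0

# e.g. op A touches bytes 0-3, op B touches bytes 4-7 of the same register:
assert may_run_in_parallel(0b00001111, 0b11110000)
</pre>
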
i still have not yet thought through how the Reorder Buffer would
work: here, again, i am tempted to recommend that, again, we "stratify"
the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries,
however the CAM is only 4-bit or 3-bit wide.

if an instruction's destination register does not meet the modulo
requirements, that ROB entry is *left empty*. this does mean that,
for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
4), and there are 4 sequential instructions that happen e.g. to have
a destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
etc.... the ROB will only hold 8 such instructions.

and that i think is perfectly fine, because, statistically, it'll balance
out, and SV generates sequentially-incrementing destination register
numbers, so *that* is fine, too.

i'll keep working on diagrams, and also reading Mitch Alsup's chapters
on the 6600. they're frickin awesome. the 6600 could do multi-issue
LD and ST by way of having dedicated registers for LD and ST: X1-X5 were
for LD, X6 and X7 for ST.

----

i took a shot at explaining this also on comp.arch today, and that
allowed me to identify a problem with the proposed modulo-4 "lanes"
stratification.

when a result is created in one lane, it may need to be passed to the next
lane. that means that each of the other lanes needs to keep a watchful
eye on when another lane updates the other regfiles (all 3 of them).

when an incoming update occurs, there may be up to 3 register writes
(that need to be queued?) that need to be broadcast (written) into
reservation stations.

what i'm not sure of is: can data consistency be preserved, even if
there's a delay? my big concern is that during the time where the data is
broadcast from one lane, the head of the ROB arrives at that instruction
(which is the "commit" condition), it gets committed, then, unfortunately,
the same ROB# gets *reused*.

now that i think about it, as long as the length of the queue is below
the size of the Reorder Buffer (preferably well below), and as long as
it's guaranteed to be emptied by the time the ROB cycles through the
whole buffer, it *should* be okay.

----

> Don't forget that in these days of Spectre and Meltdown, merely
> preventing dead instruction results from being written to registers or
> memory is NOT ENOUGH. You also need to prevent load instructions from
> altering cache and branch instructions from altering branch prediction
> state.

Which, oddly enough, provides a necessity for being able to consume
multiple containers from the cache miss buffers, which, oddly enough,
are what makes a crucial mechanism in the Virtual Vector Method work.

In the past, one would forward the demand container to the waiting
memref and then write the whole line into the cache. Spectre and Meltdown
mean you have to forward multiple times from the miss buffers and avoid
damaging the cache until the instruction retires. VVM uses this to avoid
having a vector strip-mine the data cache.

----

> I meant the renaming done as part of the SV extension, not the
> microarchitectural renaming.

ah ok, yes. right. ok, so i don't know what to name that, and i'd
been thinking of it in terms of "post-renaming", as in my mind, it's
not really renaming at all, it's... remapping. or, vector
"elements".

as in: architecturally we already have a name (vector "elements").
physically we already have a name: register file.

i was initially thinking that the issue stage would take care of it,
by producing:

* post-remapped elements, which are basically post-remapped register indices
* a byte-mask indicating which *bytes* of the register are to be
  modified and which left alone
* an element-width that is effectively an augmentation of (part of) the opcode

the element width goes into the ALU as an augmentation of the opcode
because the 64-bit "register" now contains e.g. 16-bit "elements"
indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a
SIMD-style (predicated) operation to take place (see the sketch below).

now that i think about it, i think we may need to have the three
phases be part of a pipeline, in a single dependency matrix.

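A minimal sketch of that predicated, element-width-augmented write (Python;
the packing and mask conventions are illustrative assumptions): a 64-bit
register treated as four 16-bit elements, with the byte-mask deciding which
bytes are modified and which are left alone.

<pre>
# 64-bit register as 4x 16-bit elements; byte_mask has one bit per byte.
def write_elements(reg64, elwidth, elements, byte_mask):
    for i, el in enumerate(elements):           # e.g. elwidth=16: indices 0-3
        for b in range(elwidth // 8):           # bytes of this element
            byte_index = i * (elwidth // 8) + b
            if (byte_mask >> byte_index) & 1:   # predicated: modify this byte
                byte = (el >> (8 * b)) & 0xFF
                reg64 &= ~(0xFF << (8 * byte_index))
                reg64 |= byte << (8 * byte_index)
    return reg64

# elements 0 and 2 enabled (mask bytes 0-1 and 4-5), 1 and 3 left alone:
r = write_elements(0, 16, [0x1111, 0x2222, 0x3333, 0x4444], 0b00110011)
assert r == 0x0000_3333_0000_1111
</pre>
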
----

I had a state machine in one chip that could come up out of power-on in a
state it could not get out of. Since that experience, I have a rule with
state machines: a state machine must be able to go from any state to idle
when the reset line is asserted.

You have to prove that the logic can never create a circular dependency:
not a proof with test vectors, but a logical proof like what we do with FP
arithmetic these days.

----

> however... we don't mind that, as the vectorisation engine will
> be, for the most part, generating sequentially-increasing index
> dest *and* src registers, so we kinda get away with it.

In this case you could simply design a 1R or 1W file (a.k.a. SRAM)
and read 4 registers at a time or write 4 registers at a time. Timing
looks like:

<pre>
|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
|F123|F123|F123|F123|
|EsK1|EsK2|EsK3|EsK4|
|EfK1|EfK2|EfK3|EfK4|
</pre>

A 4-cycle FU is shown. Read as much as you need in 4 cycles for one operand,
read as much as you need in 4 cycles for another operand, read as much
as you need in 4 cycles for the last operand, then write as much as you
can for the result. This simply requires flip-flops to capture the width
and then deliver operands in parallel (a serial-to-parallel converter), and
similarly for writing.

# Design Layout

ok, so continuing some thoughts-in-order notes:

## Scoreboards

scoreboards are not just scoreboards, they are dependency matrices,
and there are several of them:

* one for LOAD/STORE-to-LOAD/STORE
  - most recent LOADs prevent later STOREs
  - most recent STOREs prevent later LOADs
  - a separate process analyses LOAD-STORE addresses for
    conflicts, based on sufficient bits to assess uniqueness
    as opposed to precise and exact matches
* one for Function-Unit to Function-Unit
  - it expresses both RAW and WAW hazards through "Go_Write"
    and "Go_Read" signals, which are stopped from proceeding by
    dependent 1-bit CAM latches
  - exceptions may ALSO be made "precise" by holding a "Write prevention"
    signal. only when the Function Unit knows that an exception is
    not going to occur (memory has been fetched, for example) does
    it release the signal
  - speculative branch execution likewise may hold a "Write prevention",
    however it also needs a "Go die" signal, to clear out the
    incorrectly-taken branch
  - LOADs/STOREs *also* must be considered as "Functional Units" and thus
    must also have corresponding entries (plural) in the FU-to-FU Matrix
  - it is permitted for ALUs to *BEGIN* execution (read operands are
    valid) without being permitted to *COMMIT*. thus, each FU must
    store (buffer) results, until such time as a "commit" signal is
    received
  - we may need to express an inter-dependence on the instruction order
    (raising the WAW hazard line to do so) as a way to preserve execution
    order. only the oldest instructions will have this flag dropped,
    permitting execution that has *begun* to also reach "commit" phase
* one for Function-Unit to Registers (see the dependency-cell sketch below)
  - it expresses the read and write requirements: the source
    and destination registers on which the operation depends. source
    registers are marked "need read", dest registers marked
    "need write".
  - by having *more than one* Functional Unit matrix row per ALU
    it becomes possible to effectively achieve "Reservation Stations"
    orthogonality with the Tomasulo Algorithm. the FU row must, like
    RS's, take and store a copy of the src register values.

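A minimal behavioural sketch of one FU-to-Registers dependency row (Python;
the "need read"/"need write" names are from the notes above, while the way
the blocking condition is computed is an illustrative assumption rather
than the actual latch-level design):

<pre>
# Behavioural model of the FU-to-Registers matrix: one row per FU,
# "need read"/"need write" latches per register.
NUM_FUS, NUM_REGS = 4, 8

need_read  = [[False] * NUM_REGS for _ in range(NUM_FUS)]
need_write = [[False] * NUM_REGS for _ in range(NUM_FUS)]

def issue(fu, srcs, dest):
    for s in srcs:
        need_read[fu][s] = True     # source regs marked "need read"
    need_write[fu][dest] = True     # dest reg marked "need write"

def go_read_allowed(fu):
    # blocked (RAW) while any needed src is still pending a write elsewhere
    return not any(need_read[fu][r] and need_write[other][r]
                   for r in range(NUM_REGS)
                   for other in range(NUM_FUS) if other != fu)
</pre>
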
## Register Renaming

There are several potential well-known schemes for register renaming:
*none of them will be used here*. The scheme below is a new form of
renaming that is a topologically and functionally **direct** equivalent
of the Tomasulo Algorithm with a Reorder Buffer, and that came from the
"Register Alias Table" concept, which is better suited to scoreboards.
It works by flattening out Reservation Stations to one per FU (requiring
more FUs as a result). On top of this, the function normally carried
out by the "tags" of the RAT table may be merged-morphed into the role
carried out by the ROB Destination Register CAM, which may in turn be
merged into a single vector (per register) of 1-bit mutually-exclusive
"CAMs" that are added, very simply, to the FU-Register Dependency Matrix.

In this way, exactly as in the Tomasulo Algorithm, there is absolutely no
need whatsoever for a separate PRF-ARF scheme. The PRF *is* the ARF.

Register renaming will be done with a single extra mutually-exclusive bit
in the FUxReg Dependency Matrix, which may be set on only one FU (per
register). This bit indicates which of the FUs has the **most recent**
destination register value pending. It is **directly** functionally
equivalent to the Reorder Buffer Dest Reg# CAM value, except that now it
is a string of 1-bit "CAMs".

When an FU needs a src reg and finds that it needs to create a
dependency waiting for a result to be created, it must use this
bit to determine which FU it creates a dependency on.

If there is a destination register that already has a bit set
(anywhere in the column), it is **cleared** and **replaced**
with a bit in the FU's row and the destination register's column.

See <https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/c80jRn4PCQAJ>

MUL r1, r2, r3

<pre>
FU name   Reg name
          12345678
add-0     ........
add-1     ........
mul-0     X.......
mul-1     ........
</pre>

ADD r4, r1, r3

<pre>
FU name   Reg name
          12345678
add-0     ...X....
add-1     ........
mul-0     X.......
mul-1     ........
</pre>

ADD r1, r5, r6

<pre>
FU name   Reg name
          12345678
add-0     ...X....
add-1     X.......
mul-0     ........
mul-1     ........
</pre>

note how on the 3rd instruction, the (mul-0, R1) entry is **cleared**
and **replaced** with an (add-1, R1) entry. future instructions now
know that if their src operands require R1, they are to place a
RAW dependency on **add-1**, not mul-0 (see the sketch below).

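A minimal sketch of that mutually-exclusive rename bit (Python; the matrix
sizes match the small example above, everything else is illustrative):
setting a new owner for a destination register clears any previous owner in
that column, and src-operand lookups walk the column to find the FU to
depend on.

<pre>
# One rename bit per (FU, register): at most one bit set per register column.
# Registers r1-r8 map to column indices 0-7.
FUS = ["add-0", "add-1", "mul-0", "mul-1"]
NUM_REGS = 8
owner = {fu: [False] * NUM_REGS for fu in FUS}

def set_dest(fu, reg):
    for row in owner.values():      # clear any previous owner of this column
        row[reg] = False
    owner[fu][reg] = True           # this FU now holds the most recent value

def raw_dependency(reg):
    for fu in FUS:                  # column scan: the 1-bit "CAM" lookup
        if owner[fu][reg]:
            return fu               # src operand must wait on this FU
    return None                     # no pending writer: read the register file

set_dest("mul-0", 0)                # MUL r1, r2, r3
set_dest("add-0", 3)                # ADD r4, r1, r3
set_dest("add-1", 0)                # ADD r1, r5, r6: clears mul-0's bit
assert raw_dependency(0) == "add-1"
</pre>
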
## Multi-issue

we may potentially have 2-issue (or 4-issue), with simpler issue and
detection, by "striping" the register file according to modulo 2 (or 4)
on the destination register number:

* the Function Unit rows are multiplied up by 2 (or 4), however they are
  actually connected to the same ALUs (pipelined and with both src and
  dest register buffers/latches).
* the Register Read and Write signals are then "striped" such that
  read/write requests for every 2nd (or 4th) register are "grouped" and
  will have to fight for access to a multiplexer in order to access
  registers that do not have the same modulo 2 (or 4) match.
* we MAY potentially be able to drop the destination (write) multiplexer(s)
  by only permitting FU rows with the same modulo to write to that
  destination bank. FUs with indices 0,4,8,12 may only write to registers
  similarly numbered.
* there will therefore be FOUR separate register-data buses, with (at least)
  the Read buses multiplexed so that all FU banks may read all src registers
  (even if there is contention for the multiplexers)

## FU-to-Register address de-muxed already

an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
write/read enable signals already exist as single bits. "normal" processors
store the src/dest registers as an index (5 bits == 0-31), where in this
design, that has been expanded out to 32 individual Read/Write wires,
already.

* the register file verilog implementation therefore must take in an
  array of 128-bit write-enable and 128-bit read-enable signals (a sketch
  of the resulting interface follows below).
* however the data buses will be multiplexed modulo 2 (or 4) according
  to the lower bits of the register number, in order to cross "lanes".

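A minimal behavioural model of that de-muxed interface (Python standing in
for the eventual verilog; the widths come from the notes above, everything
else is illustrative): the register file receives one-hot 128-bit enable
vectors rather than register indices.

<pre>
# Register file with one-hot read/write enables: one wire per register,
# exactly as produced by the FU-to-Registers Dependency Matrix.
NUM_REGS = 128
regs = [0] * NUM_REGS

def regfile_cycle(rd_en, wr_en, wr_data):
    """rd_en, wr_en: 128-bit ints, one bit per register (no 7-bit index)."""
    rd_values = {i: regs[i] for i in range(NUM_REGS) if (rd_en >> i) & 1}
    for i in range(NUM_REGS):
        if (wr_en >> i) & 1:
            regs[i] = wr_data
    return rd_values

# read registers 3 and 64, write register 5, in one "cycle":
out = regfile_cycle(rd_en=(1 << 3) | (1 << 64), wr_en=1 << 5, wr_data=0xFF)
</pre>
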
## FU "Grouping"

with so many Function Units in RISC-V (dozens of instructions, times 2
to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
we almost certainly are going to have to deploy a "grouping" scheme:

* rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
  to MUL etc., instead we group the FUs by how many src and dest
  registers are required, and *pass the opcode down to them*
* only FUs with the exact same number (and type) of register profile
  will receive like-minded opcodes.
* when src and dest are free for a particular op (and an ALU pipeline is
  not stalled) the FU is at liberty to push the operands into the
  appropriate free ALU.
* FUs therefore only really express the register, memory, and execution
  dependencies: they don't actually do the execution.

## Recommendations

* Include a merged address-generator in the INT ALU
* Have simple ALU units duplicated and allow more than one FU to
  receive (and process) the src operands.

## Register file workloads

Note: vectorisation also includes predication, which is one extra integer read.

Integer workloads:

* 43% integer
* 21% load
* 12% store
* 24% branch

Peak capabilities to provision ports for:

* 100% of the instruction stream can be integer instructions
* 75% utilise two source operand registers
* 50% of the instruction stream can be load instructions
* 25% can be store instructions
* 25% can be branch instructions

FP workloads:

* 30% integer
* 25% load
* 10% store
* 13% multiplication
* 17% addition
* 5% branch

----

> in particular i found it fascinating that analysis of INT
> instructions found a 50% LD, 25% ST and 25% branch, and that
> 70% were 2-src ops. therefore you made sure that the number
> of read and write ports matched these, to ensure no bottlenecks,
> bearing in mind that ST requires reading an address *and*
> a data register.

I never had a problem in "reading the write slot" in any of my pipelines.
That is, take a pipeline where a LD (cache hit) has a latency of 3 cycles
(AGEN, Cache, Align). Align would be in the cycle where the data was being
forwarded, and in the subsequent cycle the data could be written into the RF:

<pre>
|dec|AGN|$$$|ALN|LDW|
</pre>

For stores I would read the LD's write slot, align the store data, and
merge it into the cache as:

<pre>
|dec|AGEN|tag|---|STR|ALN|$$$|
</pre>

You know 4 cycles in advance that a store is coming, 2 cycles after hit,
so there is easy logic to decide to read the write slot (or not), and it
costs 2 address comparators to disambiguate this short shadow in the
pipeline.

This is a lower expense than building another read port into the RF, in
both area and power, and uses the pipeline efficiently.

# References

* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* <https://en.wikipedia.org/wiki/Register_renaming> points out that
  reservation stations take a *lot* of power.
* <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
* MESI cache protocol, python <https://github.com/sunkarapk/mesi-cache.git>
  <https://github.com/afwolfe/mesi-simulator>
* <https://kshitizdange.github.io/418CacheSim/final-report> report on
  types of caches
* <https://github.com/ssc3?tab=repositories> interesting stuff
* <https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing>
  pipeline bypassing
* <http://ece-research.unm.edu/jimp/611/slides/chap4_7.html> Tomasulo / Reorder
* Register File Bank Caching <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
* <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
* <https://www.eda.ncsu.edu/wiki/FreePDK>