# Augmented 6600-style Scoreboards

Images reproduced with kind permission from Mitch Alsup. This page
describes *augmentations* made to 6600-style scoreboards for full,
precise, register-renaming (more accurately "nameless" register) capability
directly equivalent to the Tomasulo algorithm. For
an explanation of the full functional equivalence,
see [[tomasulo_transformation]].

# Notes and insights on Scoreboard design

btw one thing that's not obvious - at all - about scoreboards is: there's nothing that seems to "control" how instructions "know" to read, write, or complete.  it's very... weird.  i'll probably put this on the discussion page.

the reason i feel that the weirdness exists is for a few reasons:

* firstly, the Matrices create a Directed Acyclic Graph, using single-bit
SR-Latches.  for a software engineer, being able to express a DAG using
a matrix is itself... weird :)
* secondly: those Matrices preserve time *order* (instruction
dependent order actually), they are not themselves dependent *on* time
itself.  this is especially weird if one is used to an in-order system,
which is very much critically dependent on "time" and on strict observance
of how long results are going to take to get through a pipeline.  we
could do the entire design based around low-gate-count FSMs and it would
still be absolutely fine.
* thirdly, it's the *absence* of blocks that allows a unit to
proceed.  unlike an in-order system, there's nothing saying "you go now,
you go now": it's the opposite.  the unit is told instead, "here's the
resources you need to WAIT for: go when those resources are available".
* fourth (clarifying 3): it's reads that block writes, and writes
that block reads.  although obvious when thought through from first
principles, it can get particularly confusing that it is the *absence*
of read hazards that allows writes to proceed, and the *absence* of write
hazards that allows reads to proceed.
* fifth: the ComputationUnits still need to "manage" the input and output
of those resources to actual pipelines (or FSMs).
 - (a) the CUs are *not* permitted to blithely say, if there is an
expected output that also needs managing, "ok i got the inputs, now throw
them at the pipeline, i'm done".  they *must* wait for that result.  of
course if there is no result to wait for, they're permitted to indicate
"done" without waiting (this actually happens in the case of STORE).
 - (b) there's an apparent disconnect between "fetching of registers"
and "Computational Unit progress".  surely, one feels, there should
be something that, again, "orders the CU to proceed in a set, orderly
progressive fashion?".  instead, because the progress is from the
*absence* of hazards, the CU's FSMs likewise make forward progress from
the "acknowledgement" of each blockage being dropped.
* sixth: one of the incredible but puzzling things is that register
renaming is *automatically* built-in to the design.  the Function Unit's
input and output latches are effectively "nameless" registers.
 - (a) the more Function Units you have, the more nameless registers
exist.  the more nameless registers exist, the further ahead that
in-flight execution can progress, speculatively.
 - (b) whilst the Function Units are devoid of register "name"
information, the FU-Regs Dependency Matrix is *not* devoid of that
information, having latched the read/write register numbers in unary
form, as a "row", one bit in each row representing which register(s)
the instruction originally contained.
 - (c) by virtue of the direct Operand Port connectivity between the FU
and its corresponding FU-Regs DM "row", the Function Unit requesting for
example Operand1 results in the FU-Regs DM *row* triggering a register
file read-enable line, *NOT* the Function Unit itself.
* seventh: the PriorityPickers manage resource contention between the FUs
and the row-information from the FU-Regs Matrix.  the port bandwidth
by nature has to be limited (we cannot have 200 read/write ports on
the regfile).  therefore the connection between the FU and the FU-Regs
"row" in which the actual reg numbers are stored (in unary) is even *less*
direct than it is in an in-order system.

ultimately then, there is:

* an FU-Regs Matrix that, on a per-row basis, captures the instruction's
register numbering (in unary, one SR-Latch raised per register per row)
on a per-operand basis
* an FU-FU Matrix that preserves, as a Directed Acyclic Graph (DAG),
the instruction order.  again, this is a bit-based system (SR Latches)
that records which *read port* of the Function Unit needs a write result
(when available), and which write port needs a *read* result.
* a suite of Function Units with input *and* output latches where the
register information is *removed* (that being back in the FU-Regs row
associated with a given FU)
* a PriorityPicker system that acknowledges the desire for access to the
register file, and, due to the regfile ports being a contended resource,
*only* permits one and only one FunctionUnit at a time to gain access to
that regfile port.  whilst the FunctionUnit knows the Operand number it
requires the input (or output) to come from (or to), it is the FU-Regs
*row* that knows, on a per-operand-number basis, what the actual register
file number is.
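
To make the unary ("nameless") encoding concrete, here is a minimal
behavioural sketch in Python (class and function names are hypothetical,
not from the actual codebase) of a FU-Regs row latching its register
numbers at Issue time, and then driving the read-enable line on the FU's
behalf:

```python
def to_unary(regnum, nregs=32):
    """Encode a binary register number as a one-hot (unary) bit-vector,
    one bit per architectural register."""
    assert 0 <= regnum < nregs
    return 1 << regnum

class FURegsRow:
    """One "row" of the FU-Regs Matrix: one unary vector per operand,
    latched at Issue.  The Function Unit itself never stores these
    numbers - its latches are the "nameless" registers."""
    def __init__(self, nregs=32):
        self.nregs = nregs
        self.dest = 0       # unary dest register
        self.oper1 = 0      # unary src operand 1
        self.oper2 = 0      # unary src operand 2

    def issue(self, dest, src1, src2):
        """Latch the instruction's register numbers, in unary, at Issue."""
        self.dest = to_unary(dest, self.nregs)
        self.oper1 = to_unary(src1, self.nregs)
        self.oper2 = to_unary(src2, self.nregs)

    def read_enable(self, operand):
        """When the FU requests e.g. Operand1, it is this *row* that
        drives the regfile read-enable line, not the FU itself."""
        return self.oper1 if operand == 1 else self.oper2

row = FURegsRow()
row.issue(dest=3, src1=1, src2=7)
assert row.read_enable(1) == 0b10      # register 1, in unary
assert row.read_enable(2) == 1 << 7    # register 7, in unary
```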

Note, again, that whilst academic literature focusses on the patent
(Q-Tables), it is only the *combination* of Q-Tables with the rest of
the algorithm, the PriorityPickers, FU-Regs and FU-FU Matrices,
that gives the capability to perform out-of-order execution.

# Example allocation of Function Units

This is the key diagram showing the relationship between Function Units
and the Register File(s), covering "read". A similar diagram exists
for regfile "write". Observe also that there are *two* Increment
Function Units.

The Dependency Matrices manage the DAG of read-write relationships:
each Function Unit indicates which registers it requires for read
and which it needs permission to write to.

NOTE: **AT NO TIME** is **any** Function Unit permitted "direct" access to
global resources by way of any form of "unmanaged" path. Each Function
Unit may **only** receive incoming data and may **only** pass that data
out via the set determined path, as controlled by the Dependency Matrices.

Thus, the inputs for a given FU absolutely have to cover all resources
that the ALU will need. In the case of POWER9, for the Integer Function
Units this is not just the Integer operands: it's the *Condition* operands
(CR0, XER carry bits etc.) that need to be inputs (and outputs) as well.
In the case of the Branch Function Unit, the input operands (and outputs)
will likewise be not just the Integer operands, but CTR, LR etc. as well.

In the arrangement below (from the CDC 6600), it can be observed that
there are actually separate Register Files (A, B and X). Also observe
that whilst X goes to *all* Function Units as input, B only goes to
Branch, Increment and LongAdd, whereas A goes to Branch and the two Increment
Function Units. Thus, A was 3R3W, B was 4R3W, and X was 5R4W.

An augmentation of this arrangement, for a modern processor using
pipelines, is, rather than have separate FSMs and doubled-up (or greater)
Function Units, to "double up" (or triple, or quadruple etc.) the number
of Function Units, but to *share the same pipeline*. See the "Concurrent
Computational Unit" section below for details.

[[!img multiple_function_units_write.svg]]
[[!img multiple_function_units_read.svg]]

# Modifications needed to Computation Unit and Group Picker

The scoreboard uses two big NOR gates to determine, respectively, when there
are no read/write hazards. These two NOR gates are permanently active
(per Function Unit) even if the Function Unit is idle.

In the case of the Write path, these "permanently-on" signals are gated
by a Write-Release-Request signal: without that gating, the Priority
Picker would be left permanently selecting one of the Function Units (the
highest priority). The same thing has to be done for the read path, as well.

Below are the modifications required to add a read-release path that
will prevent a Function Unit from requesting a GoRead signal when it
has no need to read registers. Note that once the combination of the Busy
and GoRead signals is dropped, the ReadRelease is dropped.

Note that this is a loop: GoRead (ANDed with Busy) goes through
to the priority picker, which generates GoRead, so it is critical
(in a modern design) to use a clock-sync'd latch in this path.
The original 6600 used rising-edge and falling-edge of the clock
to avoid this issue.

[[!img comp_unit_req_rel-new.svg]]

[[!img group_pick_rd_rel.svg]]

[[!img priority_picker_16_yosys.png size="400x"]]

Source:

* [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
* [ALU Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu.py;h=f7b5e411a739e770777ceb71d7bd09fe4e70e8c0;hb=b08dee1c3e8cf0d635820693fe50cd0518caeed2)

# Concurrent Computational Unit

With the original 6600 having only a 2-stage pipelined FPU (a fact which took
many years to notice from examining the now-archaic notation in James
Thornton's book, "Design of a Computer"), there is no actual use of this
pipeline capability at the front-end Function Unit. Instead it is
treated effectively as a Finite State Machine, with only one result
computed at a time.

Mitch Alsup recommends, when using pipelines, to allow multiple
Function Unit "front-ends", each one having its inputs pushed
into a particular stage of the pipeline, and, therefore, those multiple
Function Units also track and store the result as it comes out.

The trick then is to have a method that ensures that FU front-end #1
can get result #1 when it pops out the end of the (serial) pipeline.
Mitch recommends using timing chains, here.

Note in this diagram that there are *multiple* ISSUE, GO\_READ and GO\_WRITE
signals. These link up to the Function Unit's ISSUE, GO\_RD and GO\_WR,
where the latches are, that will (on an available slot) feed the pipeline
with incoming data.

[[!img concurrent_comp_unit.png size="600x"]]

The actual design being used is slightly different, in the following
ways:

* Due to micro-coding and thus external contention, the pipelines
have a ready/valid signalling arrangement that can result in
a stall cascading back down the pipeline. Thus a timing chain
is not appropriate.
* A decision was therefore made to pass a "context" alongside the
operands, which is the "Function Unit Index". It is *this* information
that is used to "reassociate" the result with the correct FU, when
the result is produced.
* With "Shadow cancellation" being in effect, *additional* global
context is passed (combinatorially) to every single stage of the
pipeline, as a unary bitmask. If any Function Unit's "GO_DIE"
signal is asserted, the corresponding bit in the unary mask is
asserted, immediately stopping the intermediary data anywhere
in the pipeline from progressing further, thus saving power.

# Multi-in cascading Priority Picker

Using the Group Picker as a fundamental unit, a cascading chain is created,
with each output "masking" that output from being selected in all down-chain
Pickers. Whilst the input is a single unary array of bits, the output is
*multiple* unary arrays where only one bit in each is set.

This can be used for "port selection", for example when there are multiple
Register File ports or multiple LOAD/STORE cache "ways", and there are many
more devices seeking access to those "ports" than there are actual ports.
(If the number of devices seeking access to ports were equal to the number
of ports, each device could be allocated its own dedicated port.)
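
As a behavioural illustration only (not the actual nmigen implementation
in picker.py), the cascade can be sketched in a few lines of Python, using
the classic `x & -x` lowest-set-bit idiom to stand in for one Group Picker:

```python
def priority_pick(reqs):
    """Pick the lowest set bit of a request bit-vector (one-hot result).
    In two's complement, reqs & -reqs isolates the lowest set bit."""
    return reqs & -reqs

def multi_priority_pick(reqs, nports):
    """Cascade of pickers: each granted bit is masked out of the requests
    seen by all down-chain pickers, so each of the nports outputs is
    one-hot (or zero) and no two outputs ever share a bit."""
    outputs = []
    for _ in range(nports):
        grant = priority_pick(reqs)
        outputs.append(grant)
        reqs &= ~grant     # mask the winner from down-chain pickers
    return outputs

# Five Function Units requesting, two regfile ports available:
# the two lowest-numbered requesters win, one per port.
grants = multi_priority_pick(0b10110, nports=2)
assert grants == [0b00010, 0b00100]
```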

Click on image to see full-sized version:

[[!img multi_priority_picker.png size="800x"]]

Links:

* [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005204.html>

# Modifications to Dependency Cell

Note: this version still requires CLK to operate on a HI-LO cycle.
Further modifications are needed to create an ISSUE-GORD-PAUSE ISSUE-GORD-PAUSE
sequence. For now however it is easier to stick with the original
diagrams produced by Mitch Alsup.

The dependency cell is responsible for recording that a Function Unit
requires the use of a dest or src register, which is given in UNARY.
It is also responsible for "defending" that unary register bit against
read and write hazards, and also, on request (GoRead/GoWrite), for
generating a "Register File Select" signal.

The sequence of operations for determining hazards is as follows:

* Issue goes HI when CLK is HI. If any of Dest / Oper1 / Oper2 are also HI,
the relevant SRLatch will go HI to indicate that this Function Unit requires
the use of this dest/src register.
* Bear in mind that this cell works in conjunction with the FU-FU cells.
* Issue is LOW when CLK is HI. This is where the "defending" comes into
play. There will be *another* Function Unit somewhere that has had
its Issue line raised. This cell needs to know if there is a conflict
(Read Hazard or Write Hazard).
* Therefore, *this* cell must, if either of the Oper1/Oper2 signals are
HI, output a "Read after Write" (RaW) hazard if its Dest Latch (Dest-Q) is HI.
This is the *Read_Pending* signal.
* Likewise, if either of the two SRC Latches (Oper1-Q or Oper2-Q) are HI,
this cell must output a "Write after Read" (WaR) hazard if the (other)
instruction has raised the unary Dest line.
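
The two hazard rules above can be expressed as a small combinational
sketch in Python (signal names hypothetical; in the hardware each of these
is a single AND/OR gate per cell):

```python
def dependency_cell_hazards(dest_q, oper1_q, oper2_q,
                            other_dest, other_oper1, other_oper2):
    """Combinational hazard outputs of one dependency cell, for one
    unary register column.  All arguments are single bits.

    dest_q/oper1_q/oper2_q: this cell's latched register reservations.
    other_*: the unary register lines raised by the instruction whose
    Issue line is currently HI.
    """
    # Read-after-Write: the other instruction wants to READ a register
    # this Function Unit has a pending WRITE to (defending Dest-Q).
    raw = dest_q & (other_oper1 | other_oper2)
    # Write-after-Read: the other instruction wants to WRITE a register
    # this Function Unit has a pending READ from (defending Oper-Qs).
    war = (oper1_q | oper2_q) & other_dest
    return raw, war

# This FU has latched a write (Dest-Q HI); a newly-issued instruction
# raises a read of the same unary register: RaW hazard flagged.
assert dependency_cell_hazards(1, 0, 0, 0, 1, 0) == (1, 0)
# This FU has a pending read (Oper1-Q HI); a newly-issued instruction
# raises a write of the same unary register: WaR hazard flagged.
assert dependency_cell_hazards(0, 1, 0, 1, 0, 0) == (0, 1)
```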

The sequence for determining register select is as follows:

* After Issue+CLK-HI has resulted in the relevant (unary) latches for
dest and src registers being set, at some point a GoRead (or GoWrite)
signal needs to be asserted.
* The GoRead (or GoWrite) is asserted when *CLK is LOW*. The AND gate
on Reset ensures that the SRLatch *remains ENABLED*.
* This gives an opportunity for the Latch Q to be ANDed with the GoRead
(or GoWrite), raising an indicator flag that the register is being
"selected" by this Function Unit.
* The "select" outputs from the entire column (all Function Units for this
unary Register) are ORed together. Given that only one GoRead (or GoWrite)
is guaranteed to be ASSERTed (because that is the Priority Picker's job),
the ORing is acceptable.
* Whilst the GoRead (or GoWrite) signal is still asserted HI, the *CLK*
line goes *HI*. With the Reset-AND-gate now being HI, this *clears* the
latch. This is the desired outcome because in the previous cycle (which
happened to be when CLK was LOW), the register file was read (or written).

The release of the latch happens to have a by-product of releasing the
"reservation", such that future instructions, if they ever test for
Read/Write hazards, will find that this Cell no longer responds: the
hazard has already passed as this Cell already indicated that it was
safe to read (or write) the register file, freeing up future instructions
from hazards in the process.

[[!img dependence_cell_pending.jpg]]

# Shadowing

Shadowing is important as it is the fundamental basis of:

* Precise exceptions
* Write-after-write hazard avoidance
* Correct multi-issue instruction sequencing
* Branch speculation

Modifications to the shadow circuit below allow the shadow flip-flops
to be automatically reset after a Function Unit "dies". Without these
modifications, the shadow unit may spuriously fire on subsequent re-use
due to some of the latches being left in a previous state.

Note that only "success" will cause the latch to reset. Note also
that the introduction of the NOT gate causes the latch to behave more like
a DFF (register).

[[!img shadow.svg]]

# LD/ST Computation Unit

Discussions:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006167.html>
* <https://groups.google.com/forum/#!topic/comp.arch/qeMsE7UxvlI>

Walk-through Videos:

* <https://www.youtube.com/watch?v=idDn1norNl0>
* <https://www.youtube.com/watch?v=ipOe0cLOJWc>

The Load/Store Computation Unit is a little more complex, involving
three functions: LOAD, STORE, and LOAD-UPDATE. The SR Latches create
a forward-progressing Finite State Machine, with three possible paths:

* LD Mode will activate Issue, GoRead1, GoAddr then finally GoWrite1.
* LD-UPDATE Mode will *additionally* activate GoWrite2.
* ST Mode will activate Issue, GoRead1, GoRead2, GoAddr then GoStore.
* ST-UPDATE Mode will *additionally* activate GoWrite2.

These signals will be allowed to activate when the correct "Req" lines
are active. Minor complications are involved (extra latches) that respond
to an external API interface that has a more "traditional" valid/ready
signalling interface, with single-clock responses.

Source:

* [LD/ST Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py)

[[!img ld_st_comp_unit.jpg]]

# Memory-Memory Dependency Matrix

Due to the possibility of more than one LD/ST being in flight, it is necessary
to determine which memory operations are conflicting, and to preserve a
semblance of order. It turns out that as long as there is no *possibility*
of overlaps (note this wording carefully), and that LOADs are done separately
from STOREs, this is sufficient.

The first step then is to ensure that only a mutually-exclusive batch of LDs
*or* STs (not both) is detected, with the order between such batches being
preserved. This is what the memory-memory dependency matrix does.

"WAR" stands for "Write After Read" and is an SR Latch. "RAW" stands for
"Read After Write" and likewise is an SR Latch. Any LD which comes in
when a ST is pending will result in the relevant RAW SR Latch going active.
Likewise, any ST which comes in when a LD is pending results in the
relevant WAR SR Latch going active.

An LD can thus be prevented when it has any dependent RAW hazards active,
and likewise a ST can be prevented when any dependent WAR hazards are active.
The matrix also ensures that ordering is preserved.
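
A behavioural sketch of one cell of this matrix, in Python (names
hypothetical; the real cells are single-bit SR latches in hardware):

```python
class MemDepCell:
    """One cell of the Memory-Memory Dependency Matrix: a pair of SR
    latches recording the ordering between two LD/ST Function Units."""
    def __init__(self):
        self.raw = 0   # Read-After-Write: this LD must wait for that ST
        self.war = 0   # Write-After-Read: this ST must wait for that LD

    def issue(self, this_is_load, other_st_pending, other_ld_pending):
        # Any LD which comes in when a ST is pending: RAW latch goes active.
        if this_is_load and other_st_pending:
            self.raw = 1
        # Any ST which comes in when a LD is pending: WAR latch goes active.
        if not this_is_load and other_ld_pending:
            self.war = 1

    def blocked(self, this_is_load):
        # A LD is held while any of its RAW latches are active;
        # a ST is held while any of its WAR latches are active.
        return self.raw if this_is_load else self.war

cell = MemDepCell()
cell.issue(this_is_load=True, other_st_pending=True, other_ld_pending=False)
assert cell.blocked(this_is_load=True) == 1   # LD waits on the pending ST
```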

Note however that this is the equivalent of an ALU "FU-FU" Matrix. A
separate Register-Mem Dependency Matrix is *still needed* in order to
preserve the **register** read/write dependencies that occur between
instructions, where the Mem-Mem Matrix simply protects against memory
hazards.

Note also that it does not detect address clashes: that is the responsibility
of the Address Match Matrix.

Source:

* [Memory-Dependency Row](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;h=2958d864cec75480b97a0725d9b3c44f53d2e7a0;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
* [Memory-Dependency Matrix](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;h=6b9ce140312290a26babe2e3e3d821ae3036e3ab;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)

[[!img ld_st_dep_matrix.png size="600x"]]

# Address Match Matrix

This is an important adjunct to the Memory Dependency Matrices: it ensures
that no LDs or STs overlap, because if they did it could result in memory
corruption. Example: a 64-bit ST at address 0x0001 comes in at the
same time as a 64-bit ST to address 0x0002: the second write will overwrite
all writes to bytes in memory 0x0002 thru 0x0008 of the first write,
and consequently the order of these two writes absolutely has to be
preserved.

The suggestion from Mitch Alsup was to use a match system based on bits
4 thru 10/11 of the address. The idea being: we don't care if the matching
is "too inclusive", i.e. we don't care if it includes addresses that don't
actually overlap, because this just means "oh dear, some LD/STs do not
happen concurrently, they happen a few cycles later" (translation: Big Deal).

What we care about is if it were to **miss** some addresses that **do**
actually overlap. Therefore it is perfectly acceptable to use only a few
bits of the address. This is fortunate because the matching has to be
done in a huge NxN Pascal's Triangle, and if we were to compare against
the entirety of the address it would consume vast amounts of power and gates.

An enhancement of this idea is to turn the length of the operation
(LD/ST 1 byte, 2 bytes, 4 or 8 bytes) into a byte-map "mask", using the
bottom 4 bits of the address to offset this mask and "line up" with
the Memory byte read/write enable wires on the underlying Memory used
in the L1 Cache.

Then, the bottom 4 bits and the LD/ST length, now turned into a 16-bit unary
mask, can be "matched" using simple AND gate logic (instead of XOR for
binary address matching), with the advantage that it is both trivial to
use these masks as L1 Cache byte read/write enable lines, and furthermore
it is straightforward to detect misaligned LD/STs crossing cache line
boundaries.

Crossing over cache line boundaries is trivial in that the creation of
the byte-map mask is permitted to be 24 bits in length (actually, only
23 are needed). When the bottom 4 bits of the address are 0b1111 and the
LD/ST is an 8-byte operation, 0b1111 1111 (representing the 64-bit LD/ST)
will be shifted up by 15 bits. This can then be chopped into two
segments:

* The first segment is 0b1000 0000 0000 0000 and indicates that the
first byte of the LD/ST is to go into byte 15 of the cache line.
* The second segment is 0b0111 1111 and indicates that bytes 2 through
8 of the LD/ST must go into bytes 0 thru 6 of the **second**
cache line, at an address offset by 16 bytes from the first.

Thus we have actually split the LD/ST operation into two. The AddrSplit
class takes care of synchronising the two, by issuing two *separate*
sets of LD/ST requests, waiting for both of them to complete (or indicate
an error), and (in the case of a LD) merging the two.
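
The mask creation, line-splitting and AND-based matching described above
can be sketched behaviourally as follows (a Python illustration only, not
the addr_match.py / addr_split.py code itself):

```python
def ldst_byte_mask(addr, length):
    """Expand a LD/ST (address, length-in-bytes) into a byte-map mask:
    the length mask, offset by the bottom 4 address bits (23 bits max)."""
    assert length in (1, 2, 4, 8)
    return ((1 << length) - 1) << (addr & 0xF)

def split_across_lines(mask):
    """Chop the byte-map into two 16-byte cache-line segments."""
    return mask & 0xFFFF, mask >> 16

def overlaps(mask_a, mask_b):
    """Unary masks overlap iff they share a byte: a simple AND,
    instead of a full binary XOR address comparison."""
    return (mask_a & mask_b) != 0

# 8-byte LD/ST whose bottom address bits are 0b1111: crosses the line.
first, second = split_across_lines(ldst_byte_mask(0b1111, 8))
assert first == 0b1000_0000_0000_0000   # byte 15 of the first line
assert second == 0b0111_1111            # bytes 0 thru 6 of the second

# The overlapping 64-bit STs at 0x0001 and 0x0002 from the example above
# are caught by the AND of their byte-maps:
assert overlaps(ldst_byte_mask(0x0001, 8), ldst_byte_mask(0x0002, 8))
```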

The big advantage of this approach is that at no time does the L1 Cache
need to know anything about the offsets from which the LD/ST came. All
it needs to know is: which bytes to read/write into which positions
in the cache line(s).

Source:

* [Address Matcher](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=a47f635f4e9c56a7a13329810855576358110339;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
* [Address Splitter](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_split.py;h=bf89e0970e9a8b44c76018660114172f5a3061f4;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)

[[!img ld_st_splitter.png size="600x"]]

# L0 Cache/Buffer

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=216>
* <https://bugs.libre-soc.org/show_bug.cgi?id=257>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006118.html>

The L0 cache/buffer needs to be kept extremely small due to it having
significantly more CAM functionality than a normal L1 cache. However,
crucially, the Memory Dependency Matrices and address-matching
[take care of certain things](https://bugs.libre-soc.org/show_bug.cgi?id=216#c20)
that greatly simplify its role.

The problem is that a standard "queue" in a multi-issue environment would
need to be massively-ported: 8-way read and 8-way write. However that's not
the only problem: the major problem is caused by the fact that we are
overloading "vectorisation" on top of multi-issue execution, where a
"normal" vector system would have a Vector LD/ST operation in which sequences
of consecutive LDs/STs are part of the same operation, and thus a "full
cache line" worth of reads/writes is near-trivial to perform and detect.

Thus, with the "element" LD/STs being farmed out to *individual* LD/ST
Computation Units, a batch of consecutive LD/ST operations arrives at the
LD/ST Buffer which could - hypothetically - be merged into a single
cache line, prior to passing them on to the L1 cache.

This is the primary task of the L0 Cache/Buffer: to resolve multiple
(potentially misaligned) 1/2/4/8-byte LD/ST operations (per cycle) into one
**single** L1 16-byte LD/ST operation.

The amount of wiring involved however is so enormous (3,000+ wires if
"only" 4-in 4-out multiplexing is done from the LD/ST Function Units) that
considerable care has to be taken to not massively overload the ASIC
layout.

To help with this, a recommendation from
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE)
was to use a split odd-even double-L1-cache system: have *two* L1 caches,
one dealing with even-numbered 16-byte cache lines (addressed by bit 4 == 0)
and one dealing with odd-numbered 16-byte cache lines (addr[4] == 1).
This trick doubles the sequential throughput whilst halving the bandwidth
of a drastically-overloaded multiplexer bus.
Thus, we can also have two L0 LD/ST Cache/Buffers (one each looking after
its corresponding L1 cache).

The next task of the L0 Cache/Buffer is to identify and merge
any requests with the same upper address bits (bit 5 and above). This
becomes a trivial task (under certain conditions, already satisfied by
other components), by simply picking the first request, and using that
row's address as a search pattern to match against all upper bits
(5 onwards). When such a match is located, then due to the job(s) carried
out by prior components, the byte-masks for all requests with the same
upper address bits may simply be ORed together.

This requires a little back-tracking to explain. The prerequisite
conditions are as follows:

* The Mask, in each row of the L0 Cache/Buffer, encodes the bottom 4 LSBs
of the address **and** the length of the LD/ST operation (1/2/4/8 bytes),
in a "bitmap" form.
* These "Masks" have already been analysed for overlaps by the Address
Match Matrix: we **know** therefore that there are no overlaps (hence why
addresses with the same MSBs from bits 5 and above may have their
masks ORed together).

[[!img mem_l0_to_l1_bridge.png size="600x"]]

## Twin L0 cache/buffer design

See <https://groups.google.com/d/msg/comp.arch/cbGAlcCjiZE/OPNAvWSHAQAJ>.
[Flaws](https://bugs.libre-soc.org/show_bug.cgi?id=216#c24)
in the above were detected, and needed correction.

Notes:

* The flaw detected above is that for each pair of LD/ST operations
coming from the Function Unit (to cover mis-aligned requests),
the Addr[4] bit is **mutually-exclusive**, i.e. it is **guaranteed**
that Addr[4] for the first FU port's LD/ST request will **never**
equal that of the second.
* Therefore, if the two requests are split into left/right separate L0
Cache/Buffers, the advantages and optimisations for XOR-comparison
of bits 12-48 of the address **may not take place**.
* Solution: merge both L0-left and L0-right into one L0 Cache/Buffer,
with twin left/right banks in the same L0 Cache/Buffer.
* This then means that the number of rows may be reduced to 8.
* It also means that Addr[12-48] may be stored (and compared) only once.
* It does however mean that the reservation on the row has to wait for
*both* ports (left and right) to clear out their LD/ST operation(s).
* Addr[4] still selects whether the request is to go into the left or right bank.
* When the misaligned address bits 4-11 are all 0b11111111, this is not
a case that can be handled, because it implies that Addr[12:48] will
be **different** in the row. This case throws a misaligned exception.

Other than that, the design remains the same, as does the algorithm to
merge the bytemasks. This remains as follows:

* A PriorityPicker selects one row.
* For all rows greater than the selected row, if Addr[5:48] matches
then the bytemask is "merged" into the output-bytemask-selector.
* The output-bytemask-selector is used as a "byte-enable" line on
a single 128-bit byte-level read-or-write (never both).
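
A behavioural sketch of that merge in Python (names hypothetical), relying
on the Address Match Matrix's guarantee that byte-masks with matching upper
address bits never overlap:

```python
def merge_bytemasks(rows):
    """rows: list of (upper_addr, bytemask) requests.  A priority pick
    selects the first row; every later row whose upper address bits
    (5 and above) match has its bytemask ORed into the output-bytemask
    selector.  Safe only because the Address Match Matrix has already
    guaranteed that the masks do not overlap."""
    if not rows:
        return None
    sel_addr, merged = rows[0]          # PriorityPicker: first row wins
    for addr, mask in rows[1:]:
        if addr == sel_addr:
            assert (merged & mask) == 0  # guaranteed by Addr Match Matrix
            merged |= mask               # byte-enable lines simply OR
    return sel_addr, merged

# Three requests, two sharing the same upper address bits: their
# byte-masks merge into one 128-bit (16-byte-enable) operation.
rows = [(0x1000 >> 4, 0x00FF), (0x2000 >> 4, 0x000F), (0x1000 >> 4, 0xFF00)]
assert merge_bytemasks(rows) == (0x100, 0xFFFF)
```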

Twin 128-bit requests (read-or-write) are then passed directly through
to a pair of L1 Caches.

[[!img twin_l0_cache_buffer.jpg size="600x"]]

554 # Multi-input/output Dependency Cell and Computation Unit
555
556 * <https://www.youtube.com/watch?v=ohHbWRLDCfs>
557 * <https://youtu.be/H0Le4ZF0cd0>
558
559 apologies that this is best done using images rather than text.
560 i'm doing a redesign of the (augmented) 6600 engine because there
561 are a couple of design criteria/assumptions that do not fit our
562 requirements:
563
564 1. operations are only 2-in, 1-out
565 2. simultaneous register port read (and write) availability is guaranteed.
566
567 we require:
568
569 1. operations with up to *four* in and up to *three* out
570 2. sporadic availability of far less than 4 Reg-Read ports and 3 Reg-Write
571
572 here are the two associated diagrams which describe the *original*
573 6600 computational unit and FU-to-Regs Dependency Cell:
574
575 1. comp unit https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
576 2. dep cell https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg
577
578 as described here https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
579 we found a signal missing from Mitch's book chapters, and tracked it down
580 from the original Thornton "Design of a Computer": Read_Release. this
581 is a synchronisation / acknowledgement signal for Go_Read which is directly
582 analogous to Req_Rel for Go_Write.
583
584 also in the dependency cell, we found that it is necessary to OR the
585 two "Read" Oper1 and Oper2 signals together and to AND that with the
586 Write_Pending Latch (top latch in diagram 2.) as shown in the wonderfully
587 hand-drawn orange OR gate.
588
589 thus, Read-After-Write hazard occurs if there is a Write_Pending *AND*
590 any Read (oper1 *OR* oper2) is requested.
591
592
593 now onto the additional modifications.
594
595 3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
596 4. dep cell https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg
597
598 firstly, the computation unit modifications:
599
600 * multiple Go_Read signals are present, GoRD1-3
601 * multiple incoming operands are present, Op1-3
602 * multiple Go_Write signals are present, GoWR1-3
603 * multiple outgoing results are present, Out1-2
604
note that these are *NOT* necessarily 64-bit registers: they are in fact
Carry Flags, because we are implementing POWER9. however (as mentioned
yesterday in the huge 250+ message discussion), as far as the Dep Matrices
are concerned, you still have to treat Carry-In and Carry-Out as Read/Write
Hazard-protected *actual* Registers.

in the original 6600 comp unit diagram (1), because "Go_Read" assumes
that *both* registers will be read (and supplied) simultaneously from
the Register File, the sequence - the Finite State Machine - is really
simple:

* ISSUE -> BUSY (latched)
* RD-REQ -> GO_RD
* WR-REQ -> GO_WR
* repeat

[aside: there is a protective "revolving door" loop where the SR latch for
each state in the FSM is guaranteed stable (never reaches "unknown")]

in *this* diagram (3), we instead need:

* ISSUE -> BUSY (latched)
* RD-REQ1 -> GO_RD1 (may occur independently of RD2/3)
* RD-REQ2 -> GO_RD2 (may occur independently of RD1/3)
* RD-REQ3 -> GO_RD3 (may occur independently of RD1/2)
* when all 3 of GO_RD1-3 have been asserted,
  ONLY THEN raise WR-REQ1-2
* WR-REQ1 -> GO_WR1 (may occur independently of WR2)
* WR-REQ2 -> GO_WR2 (may occur independently of WR1)
* when all (2) of GO_WR1-2 have been asserted,
  ONLY THEN reset back to the beginning.

note the crucial difference: the read requests and acknowledgements
(GO_RD) are *all independent* and may occur:

* in any order
* in any combination
* all at the same time

likewise for write-request/go-write.

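the sequencing above can be sketched as a small software model. this is
purely illustrative (the names are invented, and the real unit is built
from SR-latches and gates, not a software FSM):

```python
# illustrative model of the modified Comp Unit sequencing: GO_RD1-3 may
# arrive in any order or combination; WR-REQ is raised only once ALL
# reads are acknowledged; the unit resets only when ALL writes are
# acknowledged.  (invented names, software stand-in for SR-latches.)
class MultiRWCompUnit:
    def __init__(self, n_rd=3, n_wr=2):
        self.busy = False
        self.rd_done = [False] * n_rd
        self.wr_done = [False] * n_wr

    def issue(self):                 # ISSUE -> BUSY (latched)
        self.busy = True
        self.rd_done = [False] * len(self.rd_done)
        self.wr_done = [False] * len(self.wr_done)

    def go_rd(self, i):              # GO_RDi: independent per operand
        assert self.busy
        self.rd_done[i] = True

    @property
    def wr_req(self):                # raised ONLY when all reads acked
        return self.busy and all(self.rd_done)

    def go_wr(self, i):              # GO_WRi: independent per result
        assert self.wr_req           # writes strictly after all reads
        self.wr_done[i] = True
        if all(self.wr_done):        # all writes acked: back to idle
            self.busy = False
```

e.g. the GO_RD acknowledgements may arrive as 2, 0, 1 and the write
requests only go out after the third one.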
thus, if there is only one spare READ Register File port available
(because this particular Computation Unit is a low priority one, but
the other operations need only two Regfile ports and the Regfile
happens to be 3R1W), at least one of Op1-3 may get its operand.

thus, if we have three 2-operand operations and a 3R1W regfile:

* clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
* clock cycle 2: the second grabs one more (Oper2) and the third grabs 2

compare this to the *original* 6600: if there are three 2-operand
operations outstanding, they MUST go:

* clock cycle 1: the first may grab 2 ports; NEITHER the 2nd nor the 3rd proceeds
* clock cycle 2: the second may grab 2 ports; the 3rd may NOT proceed
* clock cycle 3: the 3rd grabs 2 ports

this is because the Comp Unit - and the associated Dependency Matrices -
*FORCE* the Comp Unit to proceed only when *ALL* necessary Register Read
Ports are available (because there is only the one Go_Read signal).

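the two port-allocation policies can be compared with a toy arbitration
model (invented for illustration, not part of the design; priority here
is simply instruction order):

```python
# toy regfile read-port arbitration: each op needs a number of reads.
# partial_ok=True  models the augmented design (grab whatever is free);
# partial_ok=False models the original 6600 (all-or-nothing Go_Read).
def cycles_needed(ops_reads, n_ports, partial_ok):
    remaining = list(ops_reads)
    cycles = 0
    while any(r > 0 for r in remaining):
        free = n_ports
        for i, need in enumerate(remaining):
            if need == 0 or free == 0:
                continue
            # grab partial ports, or insist on all-at-once
            grab = min(need, free) if partial_ok else (need if free >= need else 0)
            remaining[i] -= grab
            free -= grab
        cycles += 1
    return cycles

print(cycles_needed([2, 2, 2], 3, partial_ok=True))   # augmented: 2 cycles
print(cycles_needed([2, 2, 2], 3, partial_ok=False))  # original 6600: 3 cycles
```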
so my questions are:

* does the above look reasonable, both in terms of the DM changes
  and the CompUnit changes?

* the use of the three SR latches looks a little weird to me
  (bottom right corner of (3), which is a rewrite of the middle
  of the page).

  it looks a little weird to have an SR Latch looped back
  "onto itself": when the inversion of both
  WR_REQ1 and WR_REQ2 going low triggers that AND gate
  (the one with the input from Q of an SR Latch), it *resets*
  that very same SR-Latch, which will cause a mini "blip"
  on Reset, doesn't it?

  argh. that doesn't feel right. what should it be replaced with?

[[!img compunit_multi_rw.jpg size="600x"]]

[[!img dependence_cell_multi_pending.svg]]

# Corresponding FU-FU (Function-to-Function) Dependency Cell Modifications

* Video <https://youtu.be/_5fmPpInJ7U>

Original 6600 FU-FU Cell diagram:

[[!img fu_dep_cell_6600.jpg size="600x"]]

Augmented multi-GORD/GOWR 6600 FU-FU Cell diagram:

[[!img fu_dep_cell_multi_6600.jpg size="600x"]]

# FU-Regs Vectors

There are two FU-Regs Vectors. The first is an accumulation of
all row information. This indicates that (on a per-Operand basis
in the Libre-SOC design) there is *a* write pending for that Operand
(note that this is not per **register**, it is per **operand**).
Likewise, the OR-accumulation of every unary-encoded register SR-Latch
bit in the row, for reading for each FU's Operand, indicates that
Function Unit's need to *read* from a given port.

These accumulated signals, coming out on a per-row basis for each
Operand port, are sent straight to every cell in the corresponding
FU-FU Matrix row.

[[!img fu_regs_row_pending_vec.png size="600x"]]

The second vector set accumulates the **column** information. With the
FU-Regs Cells capturing the instruction operand read/write register
numbers (in unary form), ORing those bits per column creates
a "global picture", per register, of the fact that *any* Function Unit
needs to read (or write) a particular Operand latch port.

[[!img fu_regs_global_pending_vec.png size="500x"]]

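The row and column OR-accumulations can be sketched in plain Python (an
illustration of the OR gates only; the names and example matrix are
invented, and the real design is HDL, not software):

```python
# each cell is a unary "this FU wants this register" bit (the SR-Latch
# output); rows OR-accumulate to a per-FU pending signal, columns
# OR-accumulate to a per-register "global picture" signal.
def row_pending(matrix):
    """per Function Unit: any register read (or write) pending?"""
    return [any(row) for row in matrix]

def col_pending(matrix):
    """per register: does ANY Function Unit want this register?"""
    return [any(col) for col in zip(*matrix)]

# example: 3 FUs x 4 registers, unary-encoded read-pending bits
reads = [
    [1, 0, 0, 0],   # FU0 reads r0
    [0, 1, 1, 0],   # FU1 reads r1 and r2
    [0, 0, 0, 0],   # FU2 idle
]
print(row_pending(reads))   # [True, True, False]
print(col_pending(reads))   # [True, True, True, False]
```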
# FU-FU Vectors

Two vectors exist that accumulate row and column information. With the
FU-FU Cell recording whether the Function Unit *wants* to read (or write),
the per-cell information is not so crucial as the *accumulation* of that
information. When all other Function Units in that column no longer
indicate that they are waiting for a read, that FU is clear to **write**.
Correspondingly, when all FUs in the column no longer indicate waiting
for a write, that FU is clear to **read**. With a full NxN matrix of
cells, this inversion preserves the Read-after-Write and Write-after-Read
hazard relationships between **all** Function Units and all
other Function Units.

[[!img fu_fu_readable_writeable.png size="500x"]]

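The clear-to-read / clear-to-write derivation can be sketched as a
boolean model (names and matrices invented for illustration; the actual
design is an NxN matrix of gates):

```python
# read_wait[a][b]  = FU a is still waiting to read FU b's result;
# write_wait[a][b] = FU a is still waiting to write ahead of FU b.
def writeable(read_wait, fu):
    """fu is clear to WRITE when no OTHER FU still waits to read it."""
    return not any(read_wait[other][fu]
                   for other in range(len(read_wait)) if other != fu)

def readable(write_wait, fu):
    """fu is clear to READ when no OTHER FU still waits to write."""
    return not any(write_wait[other][fu]
                   for other in range(len(write_wait)) if other != fu)

# example: FU0 still waits to read FU1's result, so FU1 may not write yet
read_wait = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
print(writeable(read_wait, 1))   # False
print(writeable(read_wait, 0))   # True
```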
# Illustrative diagram of Pipelines vs FSMs

Explanatory note here:
<https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-March/004528.html>

[[!img pipeline_vs_fsms.jpg size="800x"]]