# 6600-style Scoreboards

Images reproduced with kind permission from Mitch Alsup

# Notes and insights on Scoreboard design

btw one thing that's not obvious - at all - about scoreboards is: there's
nothing that seems to "control" how instructions "know" to read, write,
or complete.  it's very... weird.  i'll probably put this on the discussion
page.

the reason i feel that the weirdness exists is for a few reasons:

* firstly, the Matrices create a Directed Acyclic Graph, using single-bit
SR-Latches.  for a software engineer, being able to express a DAG using
a matrix is itself... weird :)
* secondly: those Matrices preserve time *order* (instruction
dependent order actually), they are not themselves dependent *on* time
itself.  this is especially weird if one is used to an in-order system,
which is very much critically dependent on "time" and on strict observance
of how long results are going to take to get through a pipeline.  we
could do the entire design based around low-gate-count FSMs and it would
still be absolutely fine.
* thirdly, it's the *absence* of blocks that allows a unit to
proceed.  unlike an in-order system, there's nothing saying "you go now,
you go now": it's the opposite.  the unit is told instead, "here's the
resources you need to WAIT for: go when those resources are available".
* fourth (clarifying 3): it's reads that block writes, and writes
that block reads.  although obvious when thought through from first
principles, it can get particularly confusing that it is the *absence*
of read hazards that allows writes to proceed, and the *absence* of write
hazards that allows reads to proceed.
* fifth: the ComputationUnits still need to "manage" the input and output
of those resources to actual pipelines (or FSMs).
 - (a) if there is an expected output that also needs managing, the CUs
are *not* permitted to blithely say "ok i got the inputs, now throw
them at the pipeline, i'm done".  they *must* wait for that result.  of
course if there is no result to wait for, they're permitted to indicate
"done" without waiting (this actually happens in the case of STORE).
 - (b) there's an apparent disconnect between "fetching of registers"
and "Computational Unit progress".  surely, one feels, there should
be something that, again, "orders the CU to proceed in a set, orderly
progressive fashion"?  instead, because the progress comes from the
*absence* of hazards, the CU's FSMs likewise make forward progress from
the "acknowledgement" of each blockage being dropped.
* sixth: one of the incredible but puzzling things is that register
renaming is *automatically* built in to the design.  the Function Unit's
input and output latches are effectively "nameless" registers.
 - (a) the more Function Units you have, the more nameless registers
exist.  the more nameless registers exist, the further ahead that
in-flight execution can progress, speculatively.
 - (b) whilst the Function Units are devoid of register "name"
information, the FU-Regs Dependency Matrix is *not* devoid of that
information, having latched the read/write register numbers in unary
form, as a "row", one bit in each row representing which register(s)
the instruction originally contained.
 - (c) by virtue of the direct Operand Port connectivity between the FU
and its corresponding FU-Regs DM "row", the Function Unit requesting for
example Operand1 results in the FU-Regs DM *row* triggering a register
file read-enable line, *NOT* the Function Unit itself.
* seventh: the PriorityPickers manage resource contention between the FUs
and the row-information from the FU-Regs Matrix.  the port bandwidth
by nature has to be limited (we cannot have 200 read/write ports on
the regfile).  therefore the connection between the FU and the FU-Regs
"row" in which the actual reg numbers are stored (in unary) is even *less*
direct than it is in an in-order system.

ultimately then, there is:

* an FU-Regs Matrix that, on a per-row basis, captures the instruction's
register numbering (in unary, one SR-Latch raised per register per row)
on a per-operand basis
* an FU-FU Matrix that preserves, as a Directed Acyclic Graph (DAG),
the instruction order.  again, this is a bit-based system (SR Latches)
that records which *read port* of the Function Unit needs a write result
(when available).
* a suite of Function Units with input *and* output latches where the
register information is *removed* (that being back in the FU-Regs row
associated with a given FU)
* a PriorityPicker system that acknowledges the desire for access to the
register file, and, due to the regfile ports being a contended resource,
permits one and only one FunctionUnit at a time to gain access to
that regfile port.  whilst the FunctionUnit knows the Operand number it
requires the input (or output) to come from (or to), it is the FU-Regs
*row* that knows, on a per-operand-number basis, what the actual register
file number is.

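the relationship between an FU and its FU-Regs row can be sketched in
plain Python (a behavioural model only: the real code in soc.git is
nmigen HDL, and the class/method names here are purely illustrative):

```python
class FURegsRow:
    """One FU-Regs Dependency Matrix row: unary SR-latch capture of
    the register numbers used by the instruction issued to this FU."""
    def __init__(self, n_regs):
        self.n_regs = n_regs
        self.dest = 0    # unary: one SR-latch bit per register
        self.src1 = 0
        self.src2 = 0

    def issue(self, dest_reg, src1_reg, src2_reg):
        # at Issue time, latch the register numbers in unary form
        self.dest |= 1 << dest_reg
        self.src1 |= 1 << src1_reg
        self.src2 |= 1 << src2_reg

    def rd_enable(self, go_rd1, go_rd2):
        # the *row* (not the FU) drives the regfile read-enable lines:
        # the FU only says "i want Operand1"; the row knows which
        # unary register bit that corresponds to
        ena = 0
        if go_rd1:
            ena |= self.src1
        if go_rd2:
            ena |= self.src2
        return ena

row = FURegsRow(32)
row.issue(dest_reg=3, src1_reg=1, src2_reg=7)
```

note how the Function Unit itself never handles register numbers: it
requests operands by port, and the row translates that request into
unary read-enable lines.
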
# Modifications needed to Computation Unit and Group Picker

The scoreboard uses two big NOR gates to determine when there
are no read/write hazards. These two NOR gates are permanently active
(per Function Unit) even if the Function Unit is idle.

In the case of the Write path, these "permanently-on" signals are gated
by a Write-Release-Request signal that would otherwise leave the Priority
Picker permanently selecting one of the Function Units (the highest
priority). However the same thing has to be done for the read path as well.

Below are the modifications required to add a read-release path that
will prevent a Function Unit from requesting a GoRead signal when it
has no need to read registers. Note that once the combination of Busy
and GoRead is dropped, ReadRelease is dropped.

Note that this is a loop: GoRead (ANDed with Busy) goes through
to the priority picker, which generates GoRead, so it is critical
(in a modern design) to use a clock-sync'd latch in this path.
The original 6600 used the rising edge and falling edge of the clock
to avoid this issue.

[[!img comp_unit_req_rel.jpg]]
[[!img group_pick_rd_rel.jpg]]

[[!img priority_picker_16_yosys.png size="400x"]]

Source:

* [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
* [ALU Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu.py;h=f7b5e411a739e770777ceb71d7bd09fe4e70e8c0;hb=b08dee1c3e8cf0d635820693fe50cd0518caeed2)

# Multi-in cascading Priority Picker

Using the Group Picker as a fundamental unit, a cascading chain is created,
with each output "masking" that output from being selected in all down-chain
Pickers. Whilst the input is a single unary array of bits, the output is
*multiple* unary arrays where only one bit in each is set.

This can be used for "port selection", for example when there are multiple
Register File ports or multiple LOAD/STORE cache "ways", and there are many
more devices seeking access to those "ports" than there are actual ports.
(If the number of devices seeking access to ports were equal to the number
of ports, each device could be allocated its own dedicated port.)
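
The cascade can be sketched with plain-Python bit arithmetic (a
behavioural model; the actual implementation is the nmigen code in
picker.py):

```python
def priority_pick(reqs):
    # classic priority picker: isolate the lowest set bit (one-hot)
    return reqs & -reqs

def multi_priority_pick(reqs, n_ports):
    # cascading chain: each stage's winner is masked out of all
    # down-chain Pickers, giving one one-hot unary output per port
    outputs = []
    for _ in range(n_ports):
        grant = priority_pick(reqs)
        outputs.append(grant)
        reqs &= ~grant   # mask this winner from later stages
    return outputs

# requesters on bits 0, 2 and 3, but only two regfile ports:
# port 0 grants bit 0, port 1 grants bit 2; bit 3 must wait a cycle
grants = multi_priority_pick(0b01101, 2)
```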

Click on image to see full-sized version:

[[!img multi_priority_picker.png size="800x"]]

Links:

* [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005204.html>

# Modifications to Dependency Cell

Note: this version still requires CLK to operate on a HI-LO cycle.
Further modifications are needed to create an ISSUE-GORD-PAUSE ISSUE-GORD-PAUSE
sequence. For now however it is easier to stick with the original
diagrams produced by Mitch Alsup.

The dependency cell is responsible for recording that a Function Unit
requires the use of a dest or src register, which is given in UNARY.
It is also responsible for "defending" that unary register bit against
read and write hazards, and also, on request (GoRead/GoWrite), for
generating a "Register File Select" signal.

The sequence of operations for determining hazards is as follows:

* Issue goes HI when CLK is HI. If any of Dest / Oper1 / Oper2 is also HI,
the relevant SRLatch will go HI to indicate that this Function Unit requires
the use of this dest/src register.
* Bear in mind that this cell works in conjunction with the FU-FU cells.
* Issue is LOW when CLK is HI. This is where the "defending" comes into
play. There will be *another* Function Unit somewhere that has had
its Issue line raised. This cell needs to know if there is a conflict
(Read Hazard or Write Hazard).
* Therefore, *this* cell must, if either of the Oper1/Oper2 signals is
HI, output a "Read after Write" (RaW) hazard if its Dest Latch (Dest-Q) is HI.
This is the *Read_Pending* signal.
* Likewise, if either of the two SRC Latches (Oper1-Q or Oper2-Q) is HI,
this cell must output a "Write after Read" (WaR) hazard if the (other)
instruction has raised the unary Dest line.

The sequence for determining register select is as follows:

* After Issue+CLK-HI has resulted in the relevant (unary) latches for
dest and src being set, at some point a GoRead (or GoWrite)
signal needs to be asserted.
* The GoRead (or GoWrite) is asserted when *CLK is LOW*. The AND gate
on Reset ensures that the SRLatch *remains ENABLED*.
* This gives an opportunity for the Latch Q to be ANDed with the GoRead
(or GoWrite), raising an indicator flag that the register is being
"selected" by this Function Unit.
* The "select" outputs from the entire column (all Function Units for this
unary Register) are ORed together. Given that only one GoRead (or GoWrite)
is guaranteed to be ASSERTed (because that is the Priority Picker's job),
the ORing is acceptable.
* Whilst the GoRead (or GoWrite) signal is still asserted HI, the *CLK*
line goes *HI*. With the Reset-AND-gate now being HI, this *clears* the
latch. This is the desired outcome because in the previous cycle (which
happened to be when CLK was LOW), the register file was read (or written).

The release of the latch happens to have a by-product of releasing the
"reservation", such that future instructions, if they ever test for
Read/Write hazards, will find that this Cell no longer responds: the
hazard has already passed, as this Cell already indicated that it was
safe to read (or write) the register file, freeing future instructions
from hazards in the process.

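the cell's behaviour can be summarised in a small plain-Python model
(clock phases and the Reset-AND-gate are deliberately omitted, and the
names are illustrative, not taken from the HDL):

```python
class DepCell:
    """One FU-Regs Dependency Cell: one (Function Unit, register) pair."""
    def __init__(self):
        self.dest_q = False   # SR latch: this FU will write this reg
        self.oper1_q = False  # SR latch: this FU reads this reg (op1)
        self.oper2_q = False  # SR latch: this FU reads this reg (op2)

    def issue(self, dest, oper1, oper2):
        # Issue HI: capture which unary register lines are raised
        self.dest_q |= dest
        self.oper1_q |= oper1
        self.oper2_q |= oper2

    def raw_hazard(self, rd_req):
        # another FU wants to read: hazard while our write is pending
        return rd_req and self.dest_q

    def war_hazard(self, wr_req):
        # another FU wants to write: hazard while our reads are pending
        return wr_req and (self.oper1_q or self.oper2_q)

    def go_read(self):
        # regfile "select" for this column; the latches then clear,
        # releasing the "reservation" as a by-product
        sel = self.oper1_q or self.oper2_q
        self.oper1_q = self.oper2_q = False
        return sel

    def go_write(self):
        sel = self.dest_q
        self.dest_q = False
        return sel
```
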
[[!img dependence_cell_pending.jpg]]

# Shadowing

Shadowing is important as it is the fundamental basis of:

* Precise exceptions
* Write-after-write hazard avoidance
* Correct multi-issue instruction sequencing
* Branch speculation

Modifications to the shadow circuit below allow the shadow flip-flops
to be automatically reset after a Function Unit "dies". Without these
modifications, the shadow unit may spuriously fire on subsequent re-use
due to some of the latches being left in a previous state.

Note that only "success" will cause the latch to reset. Note also
that the introduction of the NOT gate causes the latch to be more like
a DFF (register).

[[!img shadow.jpg]]

# LD/ST Computation Unit

Discussions:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006167.html>
* <https://groups.google.com/forum/#!topic/comp.arch/qeMsE7UxvlI>

Walk-through Videos:

* <https://www.youtube.com/watch?v=idDn1norNl0>
* <https://www.youtube.com/watch?v=ipOe0cLOJWc>

The Load/Store Computation Unit is a little more complex, involving
three functions: LOAD, STORE, and LOAD-UPDATE. The SR Latches create
a forward-progressing Finite State Machine, with three possible paths:

* LD Mode will activate Issue, GoRead1, GoAddr then finally GoWrite1
* LD-UPDATE Mode will *additionally* activate GoWrite2.
* ST Mode will activate Issue, GoRead1, GoRead2, GoAddr then GoStore.
* ST-UPDATE Mode will *additionally* activate GoWrite2.

These signals will be allowed to activate when the correct "Req" lines
are active. Minor complications are involved (extra latches) that respond
to an external API interface that has a more "traditional" valid/ready
signalling interface, with single-clock responses.

Source:

* [LD/ST Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py)

[[!img ld_st_comp_unit.jpg]]

# Memory-Memory Dependency Matrix

Due to the possibility of more than one LD/ST being in flight, it is necessary
to determine which memory operations are conflicting, and to preserve a
semblance of order. It turns out that as long as there is no *possibility*
of overlaps (note this wording carefully), and LOADs are done separately
from STOREs, this is sufficient.

The first step then is to ensure that only a mutually-exclusive batch of LDs
*or* STs (not both) is detected, with the order between such batches being
preserved. This is what the memory-memory dependency matrix does.

"WAR" stands for "Write After Read" and is an SR Latch. "RAW" stands for
"Read After Write" and likewise is an SR Latch. Any LD which comes in
when a ST is pending will result in the relevant RAW SR Latch going active.
Likewise, any ST which comes in when a LD is pending results in the
relevant WAR SR Latch going active.

An LD can thus be prevented when it has any dependent RAW hazards active,
and likewise a ST can be prevented when any dependent WAR hazards are active.
The matrix also ensures that ordering is preserved.

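A behavioural sketch of the batching (plain Python, using sets to stand
in for the SR-latch rows; the real matrix in mem_fu_matrix.py is of
course pure latch/combinatorial logic, and these names are illustrative):

```python
class MemFUMatrix:
    """Behavioural model of the Memory-Memory Dependency Matrix."""
    def __init__(self, n_fus):
        self.raw = [set() for _ in range(n_fus)]  # LD fu: STs it waits on
        self.war = [set() for _ in range(n_fus)]  # ST fu: LDs it waits on
        self.ld = set()   # in-flight LDs
        self.st = set()   # in-flight STs

    def issue_ld(self, fu):
        # RAW SR latches go active against every pending ST
        self.raw[fu] = set(self.st)
        self.ld.add(fu)

    def issue_st(self, fu):
        # WAR SR latches go active against every pending LD
        self.war[fu] = set(self.ld)
        self.st.add(fu)

    def can_go(self, fu):
        # it is the *absence* of hazards that permits progress
        return not self.raw[fu] and not self.war[fu]

    def complete(self, fu):
        # completion drops this FU's row and column of latches
        self.ld.discard(fu)
        self.st.discard(fu)
        for r in self.raw:
            r.discard(fu)
        for w in self.war:
            w.discard(fu)
```

an ST followed by an LD forms two mutually-exclusive batches: the LD is
blocked until the ST completes, preserving order between the batches.
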
Note however that this is the equivalent of an ALU "FU-FU" Matrix. A
separate Register-Mem Dependency Matrix is *still needed* in order to
preserve the **register** read/write dependencies that occur between
instructions, where the Mem-Mem Matrix simply protects against memory
hazards.

Note also that it does not detect address clashes: that is the responsibility
of the Address Match Matrix.

Source:

* [Memory-Dependency Row](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;h=2958d864cec75480b97a0725d9b3c44f53d2e7a0;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
* [Memory-Dependency Matrix](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;h=6b9ce140312290a26babe2e3e3d821ae3036e3ab;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)

[[!img ld_st_dep_matrix.png size="600x"]]

# Address Match Matrix

This is an important adjunct to the Memory Dependency Matrices: it ensures
that no LDs or STs overlap, because if they did it could result in memory
corruption. Example: a 64-bit ST at address 0x0001 comes in at the
same time as a 64-bit ST to address 0x0002: the second write would overwrite
bytes 0x0002 thru 0x0008 of the first write, and consequently the order
of these two writes absolutely has to be preserved.

The suggestion from Mitch Alsup was to use a match system based on bits
4 thru 10/11 of the address. The idea being: we don't care if the matching
is "too inclusive", i.e. we don't care if it includes addresses that don't
actually overlap, because this just means "oh dear some LD/STs do not
happen concurrently, they happen a few cycles later" (translation: Big Deal).

What we care about is if it were to **miss** some addresses that **do**
actually overlap. Therefore it is perfectly acceptable to use only a few
bits of the address. This is fortunate because the matching has to be
done in a huge NxN Pascal's Triangle, and if we were to compare against
the entirety of the address it would consume vast amounts of power and gates.

An enhancement of this idea is to turn the length of the operation
(LD/ST 1 byte, 2 bytes, 4 or 8 bytes) into a byte-map "mask", using the
bottom 4 bits of the address to offset this mask and "line up" with
the Memory byte read/write enable wires on the underlying Memory used
in the L1 Cache.

Then, the bottom 4 bits and the LD/ST length, now turned into a 16-bit unary
mask, can be "matched" using simple AND gate logic (instead of XOR for
binary address matching), with the advantage that it is both trivial to
use these masks as L1 Cache byte read/write enable lines, and furthermore
it is straightforward to detect misaligned LD/STs crossing cache line
boundaries.

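The mask construction and AND-based matching can be illustrated in a few
lines of Python (a sketch: the exact choice of "too inclusive" match
bits below - bits 4-10 - is an assumption for illustration):

```python
def ldst_mask(addr, length):
    # bottom 4 address bits + LD/ST length (1/2/4/8 bytes) become a
    # byte-level unary mask, offset within the 16-byte cache line
    return ((1 << length) - 1) << (addr & 0xF)

def may_overlap(addr1, len1, addr2, len2):
    # "too-inclusive" match on a few middle address bits only
    # (bits 4-10 here, an illustrative assumption), then AND the
    # byte masks together - no full-width XOR comparator needed
    MATCHBITS = 0b111_1111_0000
    if (addr1 & MATCHBITS) != (addr2 & MATCHBITS):
        return False
    return (ldst_mask(addr1, len1) & ldst_mask(addr2, len2)) != 0
```

with this, the two 64-bit STs in the example above (0x0001 and 0x0002)
are correctly flagged as overlapping, whilst adjacent non-overlapping
accesses are not.
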
Crossing over cache line boundaries is trivial in that the creation of
the byte-map mask is permitted to be 24 bits in length (actually, only
23 are needed). When the bottom 4 bits of the address are 0b1111 and the
LD/ST is an 8-byte operation, 0b1111 1111 (representing the 64-bit LD/ST)
is shifted up by 15 bits. This can then be chopped into two
segments:

* The first segment is 0b1000 0000 0000 0000 and indicates that the
first byte of the LD/ST is to go into byte 15 of the cache line
* The second segment is 0b0111 1111 and indicates that bytes 2 through
8 of the LD/ST must go into bytes 0 thru 6 of the **second**
cache line at an address offset by 16 bytes from the first.

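In Python terms the chop is simply (a sketch; the real logic lives in
addr_split.py):

```python
def split_mask(addr, length):
    # build the (up to 24-bit) byte-map, then chop it into the two
    # 16-byte cache lines it touches
    mask = ((1 << length) - 1) << (addr & 0xF)
    first = mask & 0xFFFF   # byte-enables for the first cache line
    second = mask >> 16     # byte-enables for the next line (+16 bytes)
    return first, second

# 8-byte LD/ST at offset 0b1111: one byte lands in byte 15 of the
# first line, the remaining seven in bytes 0-6 of the second
first, second = split_mask(0xF, 8)
```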
Thus we have actually split the LD/ST operation into two. The AddrSplit
class takes care of synchronising the two, by issuing two *separate*
sets of LD/ST requests, waiting for both of them to complete (or indicate
an error), and (in the case of a LD) merging the two.

The big advantage of this approach is that at no time does the L1 Cache
need to know anything about the offsets from which the LD/ST came. All
it needs to know is: which bytes to read/write into which positions
in the cache line(s).

Source:

* [Address Matcher](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=a47f635f4e9c56a7a13329810855576358110339;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
* [Address Splitter](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_split.py;h=bf89e0970e9a8b44c76018660114172f5a3061f4;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)

[[!img ld_st_splitter.png size="600x"]]

# L0 Cache/Buffer

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=216>
* <https://bugs.libre-soc.org/show_bug.cgi?id=257>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006118.html>

The L0 cache/buffer needs to be kept extremely small due to it having
significantly more CAM functionality than a normal L1 cache. However,
crucially, the Memory Dependency Matrices and address-matching
[take care of certain things](https://bugs.libre-soc.org/show_bug.cgi?id=216#c20)
that greatly simplify its role.

The problem is that a standard "queue" in a multi-issue environment would
need to be massively ported: 8-way read and 8-way write. However that is not
the only problem: the major problem is caused by the fact that we are
overloading "vectorisation" on top of multi-issue execution, where a
"normal" vector system would have a Vector LD/ST operation in which sequences
of consecutive LDs/STs are part of the same operation, and thus a "full
cache line" worth of reads/writes is near-trivial to perform and detect.

Thus, with the "element" LD/STs being farmed out to *individual* LD/ST
Computation Units, a batch of consecutive LD/ST operations arrives at the
LD/ST Buffer which could - hypothetically - be merged into a single
cache line, prior to passing them on to the L1 cache.

This is the primary task of the L0 Cache/Buffer: to resolve multiple
(potentially misaligned) 1/2/4/8-byte LD/ST operations (per cycle) into one
**single** L1 16-byte LD/ST operation.

The amount of wiring involved however is so enormous (3,000+ wires if
"only" 4-in 4-out multiplexing is done from the LD/ST Function Units) that
considerable care has to be taken to not massively overload the ASIC
layout.

To help with this, a recommendation came from
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE)
to use a split odd-even double-L1-cache system: have *two* L1 caches,
one dealing with even-numbered 16-byte cache lines (addressed by bit 4 == 0)
and one dealing with odd-numbered 16-byte cache lines (addr[4] == 1).
This trick doubles the sequential throughput whilst halving the bandwidth
of a drastically-overloaded multiplexer bus.
Thus, we can also have two L0 LD/ST Cache/Buffers, one each looking after
its corresponding L1 cache.

The next task of the L0 Cache/Buffer is to identify and merge
any requests with the same upper address bits (bit 5 and above). This
becomes a trivial task (under certain conditions, already satisfied by
other components), by simply picking the first request, and using that
row's address as a search pattern to match against all upper bits (5
onwards). When such a match is located, then due to the job(s) carried
out by prior components, the byte-masks for all requests with the same
upper address bits may simply be ORed together.

This requires a little back-tracking to explain. The prerequisite
conditions are as follows:

* Mask, in each row of the L0 Cache/Buffer, encodes the bottom 4 LSBs
of the address **and** the length of the LD/ST operation (1/2/4/8 bytes),
in a "bitmap" form.
* These "Masks" have already been analysed for overlaps by the Address
Match Matrix: we **know** therefore that there are no overlaps (hence why
addresses with the same MSBs from bit 5 and above may have their
masks ORed together).

[[!img mem_l0_to_l1_bridge.png size="600x"]]

## Twin L0 cache/buffer design

See <https://groups.google.com/d/msg/comp.arch/cbGAlcCjiZE/OPNAvWSHAQAJ>.
[Flaws](https://bugs.libre-soc.org/show_bug.cgi?id=216#c24)
in the above were detected, and needed correction.

Notes:

* The flaw detected above is that for each pair of LD/ST operations
coming from the Function Unit (to cover mis-aligned requests),
the Addr[4] bit is **mutually exclusive**, i.e. it is **guaranteed**
that Addr[4] for the first FU port's LD/ST request will **never**
equal that of the second.
* Therefore, if the two requests are split into left/right separate L0
Cache/Buffers, the advantages and optimisations of XOR-comparison
of bits 12-48 of the address **may not take place**.
* Solution: merge both L0-left and L0-right into one L0 Cache/Buffer,
with twin left/right banks in the same L0 Cache/Buffer.
* This then means that the number of rows may be reduced to 8.
* It also means that Addr[12-48] may be stored (and compared) only once.
* It does however mean that the reservation on the row has to wait for
*both* ports (left and right) to clear out their LD/ST operation(s).
* Addr[4] still selects whether the request is to go into the left or
right bank.
* When the misaligned address bits 4-11 are all 0b11111111, this is not
a case that can be handled, because it implies that Addr[12:48] will
be **different** in the row. This case throws a misaligned exception.

Other than that, the design remains the same, as does the algorithm to
merge the bytemasks. This remains as follows:

* PriorityPicker selects one row
* For all rows greater than the selected row, if Addr[5:48] matches
then the bytemask is "merged" into the output-bytemask-selector
* The output-bytemask-selector is used as a "byte-enable" line on
a single 128-bit byte-level read-or-write (never both).

Twin 128-bit requests (read-or-write) are then passed directly through
to a pair of L1 Caches.

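The merge algorithm above can be expressed in a few lines of Python (a
sketch only; the row format and names here are illustrative):

```python
def merge_bytemasks(rows):
    # rows: list of (upper_addr_bits, bytemask) entries, in priority
    # order.  The Address Match Matrix has already guaranteed that no
    # two masks overlap, so plain ORing of the bytemasks is safe.
    upper, merged = rows[0]          # PriorityPicker selects one row
    for addr, mask in rows[1:]:      # all rows greater than it
        if addr == upper:            # Addr[5:48] matches
            merged |= mask           # merge into output-bytemask
    return upper, merged
```

the returned merged mask is what drives the byte-enable lines of the
single 128-bit read-or-write.
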
[[!img twin_l0_cache_buffer.jpg size="600x"]]

# Multi-input/output Dependency Cell and Computation Unit

* <https://www.youtube.com/watch?v=ohHbWRLDCfs>
* <https://youtu.be/H0Le4ZF0cd0>

apologies that this is best done using images rather than text.
i'm doing a redesign of the (augmented) 6600 engine because there
are a couple of design criteria/assumptions that do not fit our
requirements:

1. operations are only 2-in, 1-out
2. simultaneous register port read (and write) availability is guaranteed.

we require:

1. operations with up to *four* in and up to *three* out
2. sporadic availability of far fewer than 4 Reg-Read ports and 3 Reg-Write ports

here are the two associated diagrams which describe the *original*
6600 computational unit and FU-to-Regs Dependency Cell:

1. comp unit https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
2. dep cell https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg

as described here https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
we found a signal missing from Mitch's book chapters, and tracked it down
in the original Thornton "Design of a Computer": Read_Release. this
is a synchronisation / acknowledgement signal for Go_Read which is directly
analogous to Req_Rel for Go_Write.

also in the dependency cell, we found that it is necessary to OR the
two "Read" Oper1 and Oper2 signals together and to AND that with the
Write_Pending Latch (top latch in diagram 2.) as shown in the wonderfully
hand-drawn orange OR gate.

thus, a Read-After-Write hazard occurs if there is a Write_Pending *AND*
any Read (oper1 *OR* oper2) is requested.

now onto the additional modifications.

3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
4. dep cell https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg

firstly, the computation unit modifications:

* multiple Go_Read signals are present, GoRD1-3
* multiple incoming operands are present, Op1-3
* multiple Go_Write signals are present, GoWR1-3
* multiple outgoing results are present, Out1-2

note that these are *NOT* necessarily 64-bit registers: some are in fact
Carry Flags, because we are implementing POWER9. however (as mentioned
yesterday in the huge 250+ discussion), as far as the Dep Matrices are
concerned you still have to treat Carry-In and Carry-Out as Read/Write
Hazard-protected *actual* Registers.

in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
that *both* registers will be read (and supplied) simultaneously from
the Register File, the sequence - the Finite State Machine - is really
simple:

* ISSUE -> BUSY (latched)
* RD-REQ -> GO_RD
* WR-REQ -> GO_WR
* repeat

[aside: there is a protective "revolving door" loop where the SR latch for
each state in the FSM is guaranteed stable (never reaches "unknown")]

in *this* diagram (3), we instead need:

* ISSUE -> BUSY (latched)
* RD-REQ1 -> GO_RD1 (may occur independently of RD2/3)
* RD-REQ2 -> GO_RD2 (may occur independently of RD1/3)
* RD-REQ3 -> GO_RD3 (may occur independently of RD1/2)
* when all 3 of GO_RD1-3 have been asserted,
ONLY THEN raise WR-REQ1-2
* WR-REQ1 -> GO_WR1 (may occur independently of WR2)
* WR-REQ2 -> GO_WR2 (may occur independently of WR1)
* when all (2) of GO_WR1-2 have been asserted,
ONLY THEN reset back to the beginning.

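the sequence above can be sketched as a behavioural Python model (names
are illustrative; the real Comp Unit is SR-latch logic, not software):

```python
class MultiRWCompUnit:
    """Behavioural sketch of the modified Comp Unit FSM: each read
    (and write) port proceeds independently; writes are requested
    only once *all* reads have been acknowledged."""
    def __init__(self, n_rd=3, n_wr=2):
        self.busy = False
        self.rd_done = [False] * n_rd
        self.wr_done = [False] * n_wr

    def issue(self):
        self.busy = True
        self.rd_done = [False] * len(self.rd_done)
        self.wr_done = [False] * len(self.wr_done)

    def go_rd(self, port):
        # may arrive in any order, any combination, or all at once
        self.rd_done[port] = True

    def wr_req(self):
        # raised ONLY when all of GO_RD1-3 have been acknowledged
        return self.busy and all(self.rd_done)

    def go_wr(self, port):
        self.wr_done[port] = True
        if all(self.wr_done):   # all writes acked: reset to beginning
            self.busy = False
```
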
note the crucial difference is that the read requests and acknowledges
(GO_RD) are *all independent* and may occur:

* in any order
* in any combination
* all at the same time

likewise for write-request/go-write.

thus, if there is only one spare READ Register File port available
(because this particular Computation Unit is a low priority, but
the other operations need only two Regfile Ports and the Regfile
happens to be 3R1W), at least one of OP1-3 may get its operand.

thus, if we have three 2-operand operations and a 3R1W regfile:

* clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
* clock cycle 2: the second grabs one more (Oper2) and the third grabs 2

compare this to the *original* 6600: if there are three 2-operand
operations outstanding, they MUST go:

* clock cycle 1: the first may grab 2 ports, NEITHER the 2nd nor 3rd proceeds
* clock cycle 2: the second may grab 2 ports, the 3rd may NOT proceed
* clock cycle 3: the 3rd grabs 2 ports

this is because the Comp Unit - and associated Dependency Matrices - *FORCE*
the Comp Unit to only proceed when *ALL* necessary Register Read Ports
are available (because there is only the one Go_Read signal).

so my questions are:

* does the above look reasonable? both in terms of the DM changes
and the CompUnit changes.
* the use of the three SR latches looks a little weird to me
(bottom right corner of (3), which is a rewrite of the middle
of the page).

it looks a little weird to have an SR Latch looped back
"onto itself": namely that when the inversion of both
WR_REQ1 and WR_REQ2 going low triggers that AND gate
(the one with the input from Q of an SR Latch), it *resets*
that very same SR-Latch, which will cause a mini "blip"
on Reset, doesn't it?

argh. that doesn't feel right. what should it be replaced with?

[[!img compunit_multi_rw.jpg size="600x"]]

[[!img dependence_cell_multi_pending.jpg size="600x"]]

# Corresponding Function-Unit Dependency Cell Modifications

* Video <https://youtu.be/_5fmPpInJ7U>

Original 6600 FU-FU Cell diagram:

[[!img fu_dep_cell_6600.jpg size="600x"]]

Augmented multi-GORD/GOWR 6600 FU-FU Cell diagram:

[[!img fu_dep_cell_multi_6600.jpg size="600x"]]
608