3d_gpu/architecture/6600scoreboard.mdwn

   1 # 6600-style Scoreboards
   2
   3 Images reproduced with kind permission from Mitch Alsup
   4
   5 # Modifications needed to Computation Unit and Group Picker
   6
   7 The scoreboard uses two big NOR gates respectively to determine when there
   8 are no read/write hazards.  These two NOR gates are permanently active
   9 (per Function Unit) even if the Function Unit is idle.
  10
  11 In the case of the Write path, these "permanently-on" signals are gated
  12 by a Write-Release-Request signal that would otherwise leave the Priority
  13 Picker permanently selecting one of the Function Units (the highest priority).
  14 However the same thing has to be done for the read path, as well.
  15
  16 Below are the modifications required to add a read-release path that
  17 will prevent a Function Unit from requesting a GoRead signal when it
  18 has no need to read registers.  Note that once both the Busy and GoRead
  19 signals combined are dropped, the ReadRelease is dropped.
  20
  21 Note that this is a loop: GoRead (ANDed with Busy) goes through
  22 to the priority picker, which generates GoRead, so it is critical
  23 (in a modern design) to use a clock-sync'd latch in this path.
  24
  25 [[!img comp_unit_req_rel.jpg]]
  26 [[!img group_pick_rd_rel.jpg]]
  27
  28 [[!img priority_picker_16_yosys.png size="400x"]]
  29
  30 Source:
  31
  32 * [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
  33 * [ALU Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu.py;h=f7b5e411a739e770777ceb71d7bd09fe4e70e8c0;hb=b08dee1c3e8cf0d635820693fe50cd0518caeed2)
  34
  35 # Multi-in cascading Priority Picker
  36
  37 Using the Group Picker as a fundamental unit, a cascading chain is created,
  38 with each output "masking" an output from being selected in all down-chain
  39 Pickers.  Whilst the input is a single unary array of bits, the output is
  40 *multiple* unary arrays where only one bit in each is set.
  41
  42 This can be used for "port selection", for example when there are multiple
  43 Register File ports or multiple LOAD/STORE cache "ways", and there are many
  44 more devices seeking access to those "ports" than there are actual ports.
  45 (If the number of devices seeking access to ports were equal to the number
  46 of ports, each device could be allocated its own dedicated port).
  47
  48 Click on image to see full-sized version:
  49
  50 [[!img multi_priority_picker.png size="800x"]]
  51
  52 Links:
  53
  54 * [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
  55 * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005204.html>
  56
  57 # Modifications to Dependency Cell
  58
  59 Note: this version still requires CLK to operate on a HI-LO cycle.
  60 Further modifications are needed to create an ISSUE-GORD-PAUSE ISSUE-GORD-PAUSE
  61 sequence.  For now however it is easier to stick with the original
  62 diagrams produced by Mitch Alsup.
  63
  64 The dependency cell is responsible for recording that a Function Unit
  65 requires the use of a dest or src register, which is given in UNARY.
  66 It is also responsible for "defending" that unary register bit for
  67 read and write hazards, and for also, on request (GoRead/GoWrite)
  68 generating a "Register File Select" signal.
  69
  70 The sequence of operations for determining hazards is as follows:
  71
  72 * Issue goes HI when CLK is HI.  If any of Dest / Oper1 / Oper2 are also HI,
  73   the relevant SRLatch will go HI to indicate that this Function Unit requires
  74   the use of this dest/src register
  75 * Bear in mind that this cell works in conjunction with the FU-FU cells
  76 * Issue is LOW when CLK is HI.  This is where the "defending" comes into
  77   play.  There will be *another* Function Unit somewhere that has had
  78   its Issue line raised.  This cell needs to know if there is a conflict
  79   (Read Hazard or Write Hazard).
  80 * Therefore, *this* cell must, if either of the Oper1/Oper2 signals are
  81   HI, output a "Read after Write" (RaW) hazard if its Dest Latch (Dest-Q) is HI.
  82   This is the *Read_Pending* signal.
  83 * Likewise, if either of the two SRC Latches (Oper1-Q or Oper2-Q) are HI,
  84   this cell must output a "Write after Read" (WaR) hazard if the (other)
  85   instruction has raised the unary Dest line.
  86
  87 The sequence for determining register select is as follows:
  88
  89 * After the Issue+CLK-HI has resulted in the relevant (unary) latches for
  90   dest and src (unary) latches being set, at some point a GoRead (or GoWrite)
  91   signal needs to be asserted
  92 * The GoRead (or GoWrite) is asserted when *CLK is LOW*.  The AND gate
  93   on Reset ensures that the SRLatch *remains ENABLED*.
  94 * This gives an opportunity for the Latch Q to be ANDed with the GoRead
  95   (or GoWrite), raising an indicator flag that the register is being
  96   "selected" by this Function Unit.
  97 * The "select" outputs from the entire column (all Function Units for this
  98   unary Register) are ORed together.  Given that only one GoRead (or GoWrite)
  99   is guaranteed to be ASSERTed (because that is the Priority Picker's job),
 100   the ORing is acceptable.
 101 * Whilst the GoRead (or GoWrite) signal is still asserted HI, the *CLK*
 102   line goes *LOW*.  With the Reset-AND-gate now being HI, this *clears* the
 103   latch.  This is the desired outcome because in the previous cycle (which
 104   happened to be when CLK was LOW), the register file was read (or written)
 105
 106 The release of the latch happens to have a by-product of releasing the
 107 "reservation", such that future instructions, if they ever test for
 108 Read/Write hazards, will find that this Cell no longer responds: the
 109 hazard has already passed as this Cell already indicated that it was
 110 safe to read (or write) the register file, freeing up future instructions
 111 from hazards in the process.
 112
 113 [[!img dependence_cell_pending.jpg]]
 114
 115 # Shadowing
 116
 117 Shadowing is important as it is the fundamental basis of:
 118
 119 * Precise exceptions
 120 * Write-after-write hazard avoidance
 121 * Correct multi-issue instruction sequencing
 122 * Branch speculation
 123
 124 Modifications to the shadow circuit below allow the shadow flip-flops
 125 to be automatically reset after a Function Unit "dies".  Without these
 126 modifications, the shadow unit may spuriously fire on subsequent re-use
 127 due to some of the latches being left in a previous state.
 128
 129 Note that only "success" will cause the latch to reset.  Note also
 130 that the introduction of the NOT gate causes the latch to be more like
 131 a DFF (register).
 132
 133 [[!img shadow.jpg]]
 134
 135 # LD/ST Computation Unit
 136
 137 The Load/Store Computation Unit is a little more complex, involving
 138 three functions: LOAD, STORE, and INT Addition.  The SR Latches create
 139 a cyclic chain (just as with the ALU Computation Unit) however here
 140 there are three possible chains.
 141
 142 * INT Addition mode will activate Issue, GoRead, GoWrite
 143 * LD Mode will activate Issue, GoRead, GoAddr then finally GoWrite
 144 * ST Mode will activate Issue, GoRead, GoAddr then GoStore.
 145
 146 These signals will be allowed to activate when the correct "Req" lines
 147 are active.  Cyclically respecting these request-response signals results in
 148 the SR Latches never going into "unstable / unknown" states.
 149
 150 * Issue will close the opcode latch and OPEN the operand latch AND
 151 trigger "Request-Read" (and set "Busy")
 152 * Go-Read will close the operand latch and OPEN the address latch AND
 153 trigger "Request Address".
 154 * Go-Address will close the address latch and OPEN the result latch
 155 AND trigger "Request Write"
 156 * Go-Write will close the result latch and OPEN the opcode latch, and
 157 reset BUSY back to OFF, ready for a new cycle.
 158
 159 Note: there is an error in the diagram, compared to the source code.
 160 It was necessary to capture src2 (op2) separate from src1 (op1), so that
 161 for the ST, op2 goes into the STORE as the data, not op1.
 162
 163 Source:
 164
 165 * [LD/ST Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py;h=206f44876b00b6c1d94716e624a03e81208120d4;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
 166
 167 [[!img ld_st_comp_unit.png]]
 168
 169 # Memory-Memory Dependency Matrix
 170
 171 Due to the possibility of more than on LD/ST being in flight, it is necessary
 172 to determine which memory operations are conflicting, and to preserve a
 173 semblance of order.  It turns out that as long as there is no *possibility*
 174 of overlaps (note this wording carefully), and that LOADs are done separately
 175 from STOREs, this is sufficient.
 176
 177 The first step then is to ensure that only a mutually-exclusive batch of LDs
 178 *or* STs (not both) is detected, with the order between such batches being
 179 preserved.  This is what the memory-memory dependency matrix does.
 180
 181 "WAR" stands for "Write After Read" and is an SR Latch.  "RAW" stands for
 182 "Read After Write" and likewise is an SR Latch.  Any LD which comes in
 183 when a ST is pending will result in the relevant RAW SR Latch going active.
 184 Likewise, any ST which comes in when a LD is pending results in the
 185 relevant WAR SR Latch going active.
 186
 187 LDs can thus be prevented when it has any dependent RAW hazards active,
 188 and likewise STs can be prevented when any dependent WAR hazards are active.
 189 The matrix also ensures that ordering is preserved.
 190
 191 Note however that this is the equivalent of an ALU "FU-FU" Matrix.  A
 192 separate Register-Mem Dependency Matrix is *still needed* in order to
 193 preserve the **register** read/write dependencies that occur between
 194 instructions, where the Mem-Mem Matrix simply protects against memory
 195 hazards.
 196
 197 Note also that it does not detect address clashes: that is the responsibility
 198 of the Address Match Matrix.
 199
 200 Source:
 201
 202 * [Memory-Dependency Row](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;h=2958d864cec75480b97a0725d9b3c44f53d2e7a0;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
 203 * [Memory-Dependency Matrix](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;h=6b9ce140312290a26babe2e3e3d821ae3036e3ab;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
 204
 205 [[!img ld_st_dep_matrix.png size="600x"]]
 206
 207 # Address Match Matrix
 208
 209 This is an important adjunct to the Memory Dependency Matrices: it ensures
 210 that no LDs or STs overlap, because if they did it could result in memory
 211 corruption.  Example: a 64-bit ST at address 0x0001 comes in at the
 212 same time as a 64-bit ST to address 0x0002: the second write will overwrite
 213 all writes to bytes in memory 0x0002 thru 0x0008 of the first write,
 214 and consequently the order of these two writes absolutely has to be
 215 preserved.
 216
 217 The suggestion from Mitch Alsup was to use a match system based on bits
 218 4 thru 10/11 of the address.  The idea being: we don't care if the matching
 219 is "too inclusive", i.e. we don't care if it includes addresses that don't
 220 actually overlap, because this just means "oh dear some LD/STs do not
 221 happen concurrently, they happen a few cycles later" (translation: Big Deal)
 222
 223 What we care about is if it were to **miss** some addresses that **do**
 224 actually overlap.  Therefore it is perfectly acceptable to use only a few
 225 bits of the address.  This is fortunate because the matching has to be
 226 done in a huge NxN Pascal's Triangle, and if we were to compare against
 227 the entirety of the address it would consume vast amounts of power and gates.
 228
 229 An enhancement of this idea is to turn the length of the operation
 230 (LD/ST 1 byte, 2 bytes, 4 or 8 bytes) into a byte-map "mask", using the
 231 bottom 4 bits of the address to offset this mask and "line up" with
 232 the Memory byte read/write enable wires on the underlying Memory used
 233 in the L1 Cache.
 234
 235 Then, the bottom 4 bits and the LD/ST length, now turned into a 16-bit unary
 236 mask, can be "matched" using simple AND gate logic (instead of XOR for
 237 binary address matching), with the advantage that it is both trivial to
 238 use these masks as L1 Cache byte read/write enable lines, and furthermore
 239 it is straightforward to detect misaligned LD/STs crossing cache line
 240 boundaries.
 241
 242 Crossing over cache line boundaries is trivial in that the creation of
 243 the byte-map mask is permitted to be 24 bits in length (actually, only
 244 23 needed).  When the bottom 4 bits of the address are 0b1111 and the
 245 LD/ST is an 8-byte operation, 0b1111 1111 (representing the 64-bit LD/ST)
 246 will be shifted up by 15 bits.  This can then be chopped into two
 247 segments:
 248
 249 * First segment is 0b1000 0000 0000 0000 and indicates that the
 250   first byte of the LD/ST is to go into byte 15 of the cache line
 251 * Second segment is 0b0111 1111 and indicates that bytes 2 through
 252   8 of the LD/ST must go into bytes 0 thru 7 of the **second**
 253   cache line at an address offset by 16 bytes from the first.
 254
 255 Thus we have actually split the LD/ST operation into two.  The AddrSplit
 256 class takes care of synchronising the two, by issuing two *separate*
 257 sets of LD/ST requests, waiting for both of them to complete (or indicate
 258 an error), and (in the case of a LD) merging the two.
 259
 260 The big advantage of this approach is that at no time does the L1 Cache
 261 need to know anything about the offsets from which the LD/ST came.  All
 262 it needs to know is: which bytes to read/write into which positions
 263 in the cache line(s).
 264
 265 Source:
 266
 267 * [Address Matcher](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=a47f635f4e9c56a7a13329810855576358110339;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
 268 * [Address Splitter](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_split.py;h=bf89e0970e9a8b44c76018660114172f5a3061f4;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
 269
 270 [[!img ld_st_splitter.png size="600x"]]
 271
 272 # Multi-input/output Dependency Cell and Computation Unit
 273
 274 apologies that this is best done using images rather than text.
 275 i'm doing a redesign of the (augmented) 6600 engine because there
 276 are a couple of design criteria/assumptions that do not fit our
 277 requirements:
 278
 279 1. operations are only 2-in, 1-out
 280 2. simultaneous register port read (and write) availability is guaranteed.
 281
 282 we require:
 283
 284 1. operations with up to *four* in and up to *three* out
 285 2. sporadic availability of far less than 4 Reg-Read ports and 3 Reg-Write
 286
 287 here are the two associated diagrams which describe the *original*
 288 6600 computational unit and FU-to-Regs Dependency Cell:
 289
 290 1. comp unit https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg
 291 2. dep cell https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg
 292
 293 as described here https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
 294 we found a signal missing from Mitch's book chapters, and tracked it down
 295 from the original Thornton "Design of a Computer": Read_Release.  this
 296 is a synchronisation / acknowledgement signal for Go_Read which is directly
 297 analogous to Req_Rel for Go_Write.
 298
 299 also in the dependency cell, we found that it is necessary to OR the
 300 two "Read" Oper1 and Oper2 signals together and to AND that with the
 301 Write_Pending Latch (top latch in diagram 2.) as shown in the wonderfully
 302 hand-drawn orange OR gate.
 303
 304 thus, Read-After-Write hazard occurs if there is a Write_Pending *AND*
 305 any Read (oper1 *OR* oper2) is requested.
 306
 307
 308 now onto the additional modifications.
 309
 310 3. comp unit https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg
 311 4. dep cell https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg
 312
 313 firstly, the computation unit modifications:
 314
 315 * multiple Go_Read signals are present, GoRD1-3
 316 * multiple incoming operands are present, Op1-3
 317 * multiple Go_Write signals are present, GoWR1-3
 318 * multiple outgoing results are present, Out1-2
 319
 320 note that these are *NOT* necessarily 64-bit registers: they are in fact
 321 Carry Flags because we are implementing POWER9.  however (as mentioned
 322 yesterday in the huge 250+ discussion, as far as the Dep Matrices are
 323 concerned you still have to treat Carry-In and Carry-out as Read/Write
 324 Hazard-protected *actual* Registers)
 325
 326 in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
 327 that *both* registers will be read (and supplied) simultaneously from
 328 the Register File, the sequence - the Finite State Machine - is real
 329 simple:
 330
 331 * ISSUE  -> BUSY (latched)
 332 * RD-REQ -> GO_RD
 333 * WR-REQ -> GO_WR
 334 * repeat
 335
 336 [aside: there is a protective "revolving door" loop where the SR latch for
 337  each state in the FSM is guaranteed stable (never reaches "unknown") ]
 338
 339 in *this* diagram (3), we instead need:
 340
 341 * ISSUE   -> BUSY (latched)
 342 * RD-REQ1 -> GO_RD1     (may occur independent of RD2/3)
 343 * RD-REQ2 -> GO_RD2     (may occur independent of RD1/3)
 344 * RD-REQ3 -> GO_RD3     (may occur independent of RD1/2)
 345 * when all 3 of GO_RD1-3 have been asserted,
 346   ONLY THEN raise WR-REQ1-2
 347 * WR-REQ1 -> GO_WR1     (may occur independent of WR2)
 348 * WR-REQ2 -> GO_WR2     (may occur independent of WR1)
 349 * when all (2) of GO_WR1-2 have been asserted,
 350   ONLY THEN reset back to the beginning.
 351
 352 note the crucial difference is that read request and acknowledge (GO_RD)
 353 are *all independent* and may occur:
 354
 355 * in any order
 356 * in any combination
 357 * all at the same time
 358
 359 likewise for write-request/go-write.
 360
 361 thus, if there is only one spare READ Register File port available
 362 (because this particular Computation Unit is a low priority, but
 363 the other operations need only two Regfile Ports and the Regfile
 364 happens to be 3R1W), at least one of OP1-3 may get its operation.
 365
 366 thus, if we have three 2-operand operations and a 3R1W regfile:
 367
 368 * clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
 369 * clock cycle 2: the second grabs one more (Oper2) and the third grabs 2
 370
 371 compare this to the *original* 6600: if there are three 2-operand
 372 operations outstanding, they MUST go:
 373
 374 * clock cycle 1: the first may grab 2 ports, NEITHER the 2nd nor 3rd proceed
 375 * clock cycle 2: the second may grab 2 ports, 3rd may NOT proceed
 376 * clock cycle 3: the 3rd grabs 2 ports
 377
 378 this because the Comp Unit - and associated Dependency Matrices - *FORCE*
 379 the Comp Unit to only proceed when *ALL* necessary Register Read Ports
 380 are available (because there is only the one Go_Read signal).
 381
 382
 383 so my questions are:
 384
 385 * does the above look reasonable?  both in terms of the DM changes
 386   and CompUnit changes.
 387
 388 * the use of the three SR latches looks a little weird to me
 389   (bottom right corner of (3) which is a rewrite of the middle
 390   of the page.
 391
 392   it looks a little weird to have an SR Latch looped back
 393   "onto itself".  namely that when the inversion of both
 394   WR_REQ1 and WR_REQ2 going low triggers that AND gate
 395   (the one with the input from Q of an SR Latch), it *resets*
 396   that very same SR-Latch, which will cause a mini "blip"
 397   on Reset, doesn't it?
 398
 399   argh.  that doesn't feel right.  what should it be replaced with?
 400
 401 [[!img compunit_multi_rw.jpg size="600x"]]
 402
 403 [[!img dependence_cell_multi_pending.jpg size="600x"]]