# 6600-style Scoreboards

Images reproduced with kind permission from Mitch Alsup

# Modifications needed to Computation Unit and Group Picker

The scoreboard uses two big NOR gates (one each for the read and write
paths) to determine when there are no read/write hazards. These two
NOR gates are permanently active (per Function Unit) even if the Function
Unit is idle.

In the case of the Write path, these "permanently-on" signals are gated
by a Write-Release-Request signal that would otherwise leave the Priority
Picker permanently selecting one of the Function Units (the highest priority).
The same gating has to be done for the Read path as well.

Below are the modifications required to add a read-release path that
prevents a Function Unit from requesting a GoRead signal when it
has no need to read registers. Note that once both the Busy and GoRead
signals are dropped, ReadRelease is dropped as well.

Note that this is a loop: the read request (GoRead ANDed with Busy) goes
to the Priority Picker, which generates GoRead, so it is critical
(in a modern design) to use a clock-sync'd latch in this path.
The original 6600 used the rising and falling edges of the clock
to avoid this issue.
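
The request/release behaviour can be sketched in plain python (a
behavioural sketch only; the names are hypothetical, not taken from the
actual nmigen source):

```python
def read_request(busy: bool, rd_pending: bool) -> bool:
    """A Function Unit only asserts its GoRead request (to the
    Priority Picker) whilst Busy and whilst reads are still pending."""
    return busy and rd_pending


def next_rd_pending(busy: bool, rd_pending: bool, go_rd: bool) -> bool:
    """Clock-sync'd latch update: once GoRead (ANDed with Busy) has
    been received, the pending latch clears and ReadRelease drops."""
    if busy and go_rd:
        return False
    return rd_pending
```

With both Busy and GoRead asserted, the next clock sees the pending
latch cleared, so the Function Unit stops requesting and the Priority
Picker is free to select another unit.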

[[!img comp_unit_req_rel.jpg]]
[[!img group_pick_rd_rel.jpg]]

[[!img priority_picker_16_yosys.png size="400x"]]

Source:

* [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
* [ALU Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compalu.py;h=f7b5e411a739e770777ceb71d7bd09fe4e70e8c0;hb=b08dee1c3e8cf0d635820693fe50cd0518caeed2)

# Multi-in cascading Priority Picker

Using the Group Picker as a fundamental unit, a cascading chain is created,
with each Picker's chosen output "masking" that bit from selection in all
down-chain Pickers. Whilst the input is a single unary array of bits, the
output is *multiple* unary arrays in which at most one bit in each is set.

This can be used for "port selection", for example when there are multiple
Register File ports or multiple LOAD/STORE cache "ways", and there are many
more devices seeking access to those "ports" than there are actual ports.
(If the number of devices seeking access were equal to the number
of ports, each device could be allocated its own dedicated port.)
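
As an illustration, the cascading chain can be modelled in a few lines
of python (a sketch assuming simple lowest-index-first priority; the
real gate-level implementation is in picker.py):

```python
def priority_pick(reqs):
    """Plain unary priority picker: output is one-hot (or all zero),
    selecting the lowest-indexed active request."""
    out = [False] * len(reqs)
    for i, r in enumerate(reqs):
        if r:
            out[i] = True
            break
    return out


def multi_priority_pick(reqs, n_ports):
    """Cascading chain: each stage's winner is masked out of the
    requests seen by all down-chain stages, giving one one-hot
    output array per port."""
    outputs = []
    remaining = list(reqs)
    for _ in range(n_ports):
        picked = priority_pick(remaining)
        outputs.append(picked)
        # mask the winner from all down-chain pickers
        remaining = [r and not p for r, p in zip(remaining, picked)]
    return outputs
```

With requests `[0,1,1,1]` and two ports, port 0 selects index 1 and
port 1 selects index 2: multiple unary arrays, at most one bit set in
each.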

Click on image to see full-sized version:

[[!img multi_priority_picker.png size="800x"]]

Links:

* [Priority Pickers](https://git.libre-riscv.org/?p=nmutil.git;a=blob;f=src/nmutil/picker.py;hb=HEAD)
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005204.html>

# Modifications to Dependency Cell

Note: this version still requires CLK to operate on a HI-LO cycle.
Further modifications are needed to create an ISSUE-GORD-PAUSE
ISSUE-GORD-PAUSE sequence. For now, however, it is easier to stick with
the original diagrams produced by Mitch Alsup.

The Dependency Cell is responsible for recording that a Function Unit
requires the use of a dest or src register, which is given in UNARY.
It is also responsible for "defending" that unary register bit against
read and write hazards and, on request (GoRead/GoWrite), for
generating a "Register File Select" signal.

The sequence of operations for determining hazards is as follows:

* Issue goes HI when CLK is HI. If any of Dest / Oper1 / Oper2 are also HI,
the relevant SRLatch will go HI to indicate that this Function Unit requires
the use of this dest/src register.
* Bear in mind that this cell works in conjunction with the FU-FU cells.
* Issue is LOW when CLK is HI. This is where the "defending" comes into
play. There will be *another* Function Unit somewhere that has had
its Issue line raised. This cell needs to know if there is a conflict
(Read Hazard or Write Hazard).
* Therefore, *this* cell must, if either of the Oper1/Oper2 signals are
HI, output a "Read after Write" (RaW) hazard if its Dest Latch (Dest-Q) is HI.
This is the *Read_Pending* signal.
* Likewise, if either of the two SRC Latches (Oper1-Q or Oper2-Q) are HI,
this cell must output a "Write after Read" (WaR) hazard if the (other)
instruction has raised the unary Dest line.
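
The two hazard outputs reduce to simple AND/OR combinations of the
latched state (a behavioural python sketch; signal names follow the
description above, not necessarily the diagram):

```python
def read_after_write(dest_q: bool, oper1: bool, oper2: bool) -> bool:
    """RaW hazard (the Read_Pending signal): another FU's read of
    Oper1/Oper2 hits a register this cell has latched as its Dest."""
    return dest_q and (oper1 or oper2)


def write_after_read(oper1_q: bool, oper2_q: bool, dest: bool) -> bool:
    """WaR hazard: another FU's unary Dest line hits a register this
    cell has latched as one of its sources."""
    return (oper1_q or oper2_q) and dest
```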

The sequence for determining register select is as follows:

* After Issue+CLK-HI has resulted in the relevant dest and src (unary)
latches being set, at some point a GoRead (or GoWrite)
signal needs to be asserted.
* The GoRead (or GoWrite) is asserted when *CLK is LOW*. The AND gate
on Reset ensures that the SRLatch *remains ENABLED*.
* This gives an opportunity for the Latch Q to be ANDed with the GoRead
(or GoWrite), raising an indicator flag that the register is being
"selected" by this Function Unit.
* The "select" outputs from the entire column (all Function Units for this
unary Register) are ORed together. Given that only one GoRead (or GoWrite)
is guaranteed to be ASSERTed (because that is the Priority Picker's job),
the ORing is acceptable.
* Whilst the GoRead (or GoWrite) signal is still asserted HI, the *CLK*
line goes *LOW*. With the Reset-AND-gate now being HI, this *clears* the
latch. This is the desired outcome because in the previous cycle (which
happened to be when CLK was LOW), the register file was read (or written).

The release of the latch has the by-product of releasing the
"reservation", such that future instructions, if they ever test for
Read/Write hazards, will find that this Cell no longer responds: the
hazard has already passed, as this Cell has already indicated that it was
safe to read (or write) the register file, freeing future instructions
from hazards in the process.

[[!img dependence_cell_pending.jpg]]

# Shadowing

Shadowing is important as it is the fundamental basis of:

* Precise exceptions
* Write-after-write hazard avoidance
* Correct multi-issue instruction sequencing
* Branch speculation

Modifications to the shadow circuit below allow the shadow flip-flops
to be automatically reset after a Function Unit "dies". Without these
modifications, the shadow unit may spuriously fire on subsequent re-use
due to some of the latches being left in a previous state.

Note that only "success" will cause the latch to reset. Note also
that the introduction of the NOT gate makes the latch behave more like
a DFF (register).

[[!img shadow.jpg]]

# LD/ST Computation Unit

Discussions:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006167.html>
* <https://groups.google.com/forum/#!topic/comp.arch/qeMsE7UxvlI>

Walk-through Videos:

* <https://www.youtube.com/watch?v=idDn1norNl0>
* <https://www.youtube.com/watch?v=ipOe0cLOJWc>

The Load/Store Computation Unit is a little more complex, involving
three functions: LOAD, STORE, and LOAD-UPDATE. The SR Latches create
a forward-progressing Finite State Machine, with the following paths:

* LD Mode will activate Issue, GoRead1, GoAddr then finally GoWrite1.
* LD-UPDATE Mode will *additionally* activate GoWrite2.
* ST Mode will activate Issue, GoRead1, GoRead2, GoAddr then GoStore.
* ST-UPDATE Mode will *additionally* activate GoWrite2.

These signals are only allowed to activate when the corresponding "Req"
lines are active. Minor complications (extra latches) are involved in
responding to an external API that has a more "traditional" valid/ready
signalling interface, with single-clock responses.

Source:

* [LD/ST Comp Units](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/experiment/compldst.py)

[[!img ld_st_comp_unit.jpg]]

# Memory-Memory Dependency Matrix

Due to the possibility of more than one LD/ST being in flight, it is necessary
to determine which memory operations conflict, and to preserve a
semblance of order. It turns out that as long as there is no *possibility*
of overlaps (note this wording carefully), and LOADs are done separately
from STOREs, this is sufficient.

The first step, then, is to ensure that only a mutually-exclusive batch of LDs
*or* STs (not both) is detected, with the order between such batches being
preserved. This is what the Memory-Memory Dependency Matrix does.

"WAR" stands for "Write After Read" and is an SR Latch. "RAW" stands for
"Read After Write" and likewise is an SR Latch. Any LD which comes in
when a ST is pending will result in the relevant RAW SR Latch going active.
Likewise, any ST which comes in when a LD is pending results in the
relevant WAR SR Latch going active.

A LD can thus be blocked when it has any dependent RAW hazards active,
and likewise a ST can be blocked when any of its dependent WAR hazards are
active. The matrix also ensures that ordering is preserved.
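
One row/column cell of the matrix can be sketched behaviourally
(hypothetical python names; the real cells are SR latches, see
mem_dependence_cell.py):

```python
class MemDepCell:
    """Records ordering between one LD/ST and one other LD/ST."""
    def __init__(self):
        self.raw = False   # LD arrived whilst a ST was pending
        self.war = False   # ST arrived whilst a LD was pending

    def issue(self, is_ld, other_st_pending, other_ld_pending):
        if is_ld and other_st_pending:
            self.raw = True       # set RAW SR latch
        if not is_ld and other_ld_pending:
            self.war = True       # set WAR SR latch

    def ld_blocked(self):
        return self.raw

    def st_blocked(self):
        return self.war
```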

Note however that this is the equivalent of an ALU "FU-FU" Matrix. A
separate Register-Mem Dependency Matrix is *still needed* in order to
preserve the **register** read/write dependencies that occur between
instructions; the Mem-Mem Matrix simply protects against memory
hazards.

Note also that it does not detect address clashes: that is the responsibility
of the Address Match Matrix.

Source:

* [Memory-Dependency Row](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_dependence_cell.py;h=2958d864cec75480b97a0725d9b3c44f53d2e7a0;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
* [Memory-Dependency Matrix](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/mem_fu_matrix.py;h=6b9ce140312290a26babe2e3e3d821ae3036e3ab;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)

[[!img ld_st_dep_matrix.png size="600x"]]

# Address Match Matrix

This is an important adjunct to the Memory Dependency Matrices: it ensures
that no LDs or STs overlap, because if they did it could result in memory
corruption. Example: a 64-bit ST at address 0x0001 comes in at the
same time as a 64-bit ST to address 0x0002: the second write would overwrite
bytes 0x0002 thru 0x0008 of the first write, and consequently the order
of these two writes absolutely has to be preserved.

The suggestion from Mitch Alsup was to use a match system based on bits
4 thru 10/11 of the address. The idea being: we don't care if the matching
is "too inclusive", i.e. we don't care if it includes addresses that don't
actually overlap, because this just means "oh dear, some LD/STs do not
happen concurrently, they happen a few cycles later" (translation: Big Deal).

What we care about is if it were to **miss** some addresses that **do**
actually overlap. Therefore it is perfectly acceptable to use only a few
bits of the address. This is fortunate, because the matching has to be
done in a huge NxN Pascal's Triangle, and if we were to compare against
the entirety of the address it would consume vast amounts of power and gates.

An enhancement of this idea is to turn the length of the operation
(LD/ST of 1, 2, 4 or 8 bytes) into a byte-map "mask", using the
bottom 4 bits of the address to offset this mask and "line up" with
the Memory byte read/write enable wires on the underlying Memory used
in the L1 Cache.

Then the bottom 4 bits and the LD/ST length, now turned into a 16-bit unary
mask, can be "matched" using simple AND gate logic (instead of XOR for
binary address matching), with the advantage that it is both trivial to
use these masks as L1 Cache byte read/write enable lines, and
straightforward to detect misaligned LD/STs crossing cache line
boundaries.
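
A sketch of the mask construction and AND-based matching (illustrative
python only; the real matcher is in addr_match.py):

```python
def byte_mask(addr: int, length: int) -> int:
    """Turn the bottom 4 address bits plus the LD/ST length
    (1/2/4/8 bytes) into a byte-map mask, lined up with the
    16 byte-enable wires of the underlying Memory."""
    return ((1 << length) - 1) << (addr & 0xF)


def masks_overlap(mask_a: int, mask_b: int) -> bool:
    """AND-gate matching: any shared byte means a potential clash."""
    return (mask_a & mask_b) != 0
```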

Crossing over cache line boundaries is trivial in that the creation of
the byte-map mask is permitted to be 24 bits in length (actually, only
23 are needed). When the bottom 4 bits of the address are 0b1111 and the
LD/ST is an 8-byte operation, 0b1111 1111 (representing the 64-bit LD/ST)
will be shifted up by 15 bits. This can then be chopped into two
segments:

* The first segment is 0b1000 0000 0000 0000 and indicates that the
first byte of the LD/ST is to go into byte 15 of the cache line.
* The second segment is 0b0111 1111 and indicates that bytes 2 through
8 of the LD/ST must go into bytes 0 thru 6 of the **second**
cache line, at an address offset by 16 bytes from the first.

Thus the LD/ST operation has actually been split into two. The AddrSplit
class takes care of synchronising the two, by issuing two *separate*
sets of LD/ST requests, waiting for both of them to complete (or indicate
an error), and (in the case of a LD) merging the two.

The big advantage of this approach is that at no time does the L1 Cache
need to know anything about the offsets from which the LD/ST came. All
it needs to know is which bytes to read/write into which positions
in the cache line(s).

Source:

* [Address Matcher](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_match.py;h=a47f635f4e9c56a7a13329810855576358110339;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)
* [Address Splitter](https://git.libre-riscv.org/?p=soc.git;a=blob;f=src/soc/scoreboard/addr_split.py;h=bf89e0970e9a8b44c76018660114172f5a3061f4;hb=a0e1af6c5dab5c324a8bf3a7ce6eb665d26a65c1)

[[!img ld_st_splitter.png size="600x"]]

# L0 Cache/Buffer

See:

* <https://bugs.libre-soc.org/show_bug.cgi?id=216>
* <https://bugs.libre-soc.org/show_bug.cgi?id=257>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-April/006118.html>

The L0 cache/buffer needs to be kept extremely small, because it has
significantly more CAM functionality than a normal L1 cache. Crucially,
however, the Memory Dependency Matrices and address-matching
[take care of certain things](https://bugs.libre-soc.org/show_bug.cgi?id=216#c20)
that greatly simplify its role.

The problem is that a standard "queue" in a multi-issue environment would
need to be massively ported: 8-way read and 8-way write. That is not
the only problem, however: the major one is caused by the fact that we are
overloading "vectorisation" on top of multi-issue execution. A
"normal" vector system would have a Vector LD/ST operation in which sequences
of consecutive LDs/STs are part of the same operation, such that a "full
cache line" worth of reads/writes is near-trivial to perform and detect.

By contrast, with the "element" LD/STs being farmed out to *individual* LD/ST
Computation Units, a batch of consecutive LD/ST operations arrives at the
LD/ST Buffer which could - hypothetically - be merged into a single
cache line, prior to passing them on to the L1 cache.

This is the primary task of the L0 Cache/Buffer: to resolve multiple
(potentially misaligned) 1/2/4/8-byte LD/ST operations (per cycle) into one
**single** L1 16-byte LD/ST operation.

The amount of wiring involved, however, is so enormous (3,000+ wires if
"only" 4-in 4-out multiplexing is done from the LD/ST Function Units) that
considerable care has to be taken not to massively overload the ASIC
layout.

To help with this, a recommendation from
[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/cbGAlcCjiZE)
was to use a split odd-even double-L1-cache system: have *two* L1 caches,
one dealing with even-numbered 16-byte cache lines (addressed by bit 4 == 0)
and one dealing with odd-numbered 16-byte cache lines (addr[4] == 1).
This trick doubles the sequential throughput whilst halving the load
on a drastically-overloaded multiplexer bus. Correspondingly, there can
also be two L0 LD/ST Cache/Buffers, one looking after each L1 cache.

The next task of the L0 Cache/Buffer is to identify and merge
any requests whose upper address bits (bit 5 and above) are the same.
This becomes a trivial task (under certain conditions, already satisfied
by other components): simply pick the first request and use that row's
address as a search pattern to match against the upper bits (5 onwards)
of all other rows. When such a match is located, then, due to the job(s)
carried out by prior components, the byte-masks of all requests with the
same upper address bits may simply be ORed together.

This requires a little back-tracking to explain. The prerequisite
conditions are as follows:

* The Mask, in each row of the L0 Cache/Buffer, encodes in "bitmap" form
the bottom 4 LSBs of the address **and** the length of the LD/ST operation
(1/2/4/8 bytes).
* These Masks have already been analysed for overlaps by the Address
Match Matrix: we **know**, therefore, that there are no overlaps (hence why
addresses with the same MSBs from bit 5 and above may have their
masks ORed together).

[[!img mem_l0_to_l1_bridge.png size="600x"]]

## Twin L0 cache/buffer design

See <https://groups.google.com/d/msg/comp.arch/cbGAlcCjiZE/OPNAvWSHAQAJ>.
[Flaws](https://bugs.libre-soc.org/show_bug.cgi?id=216#c24)
in the above were detected, and needed correction.

Notes:

* The flaw detected above is that for each pair of LD/ST operations
coming from the Function Unit (to cover mis-aligned requests),
the Addr[4] bit is **mutually-exclusive**, i.e. it is **guaranteed**
that Addr[4] for the first FU port's LD/ST request will **never**
equal that of the second.
* Therefore, if the two requests were split into left/right separate L0
Cache/Buffers, the advantages and optimisations of XOR-comparison
of bits 12-48 of the address **could not take place**.
* Solution: merge both L0-left and L0-right into one L0 Cache/Buffer,
with twin left/right banks in the same L0 Cache/Buffer.
* This then means that the number of rows may be reduced to 8.
* It also means that Addr[12-48] need be stored (and compared) only once.
* It does however mean that the reservation on the row has to wait for
*both* ports (left and right) to clear out their LD/ST operation(s).
* Addr[4] still selects whether the request is to go into the left or
right bank.
* When the misaligned address bits 4-11 are all 0b11111111, this is not
a case that can be handled, because it implies that Addr[12:48] will
be **different** in the row. This case throws a misaligned exception.

Other than that, the design remains the same, as does the algorithm to
merge the bytemasks. This remains as follows:

* A PriorityPicker selects one row.
* For all rows greater than the selected row, if Addr[5:48] matches,
the bytemask is "merged" into the output-bytemask-selector.
* The output-bytemask-selector is used as a "byte-enable" line on
a single 128-bit byte-level read-or-write (never both).

Twin 128-bit requests (read-or-write) are then passed directly through
to a pair of L1 Caches.
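
The merge algorithm amounts to an OR-reduction over matching rows (a
simplified python model; in hardware the row selection is the
PriorityPicker, and the no-overlap guarantee comes from the Address
Match Matrix):

```python
def merge_rows(rows):
    """rows: list of (upper_addr, bytemask) tuples, upper_addr being
    Addr[5:48].  The first row is the PriorityPicker's selection;
    every later row with a matching upper address has its bytemask
    ORed into the output-bytemask-selector."""
    upper, mask = rows[0]
    for u, m in rows[1:]:
        if u == upper:
            mask |= m      # safe: masks are guaranteed non-overlapping
    return upper, mask
```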

[[!img twin_l0_cache_buffer.jpg size="600x"]]

# Multi-input/output Dependency Cell and Computation Unit

* <https://www.youtube.com/watch?v=ohHbWRLDCfs>
* <https://youtu.be/H0Le4ZF0cd0>

apologies that this is best done using images rather than text.
i'm doing a redesign of the (augmented) 6600 engine because there
are a couple of design criteria/assumptions that do not fit our
requirements:

1. operations are only 2-in, 1-out
2. simultaneous register port read (and write) availability is guaranteed.

we require:

1. operations with up to *four* in and up to *three* out
2. sporadic availability of far fewer than 4 Reg-Read ports and 3 Reg-Write ports

here are the two associated diagrams which describe the *original*
6600 computational unit and FU-to-Regs Dependency Cell:

1. comp unit <https://libre-soc.org/3d_gpu/comp_unit_req_rel.jpg>
2. dep cell <https://libre-soc.org/3d_gpu/dependence_cell_pending.jpg>

as described at <https://libre-soc.org/3d_gpu/architecture/6600scoreboard/>
we found a signal missing from Mitch's book chapters, and tracked it down
from the original Thornton "Design of a Computer": Read_Release. this
is a synchronisation / acknowledgement signal for Go_Read which is directly
analogous to Req_Rel for Go_Write.

also in the dependency cell, we found that it is necessary to OR the
two "Read" Oper1 and Oper2 signals together and to AND that with the
Write_Pending Latch (top latch in diagram 2.), as shown in the wonderfully
hand-drawn orange OR gate.

thus, a Read-After-Write hazard occurs if there is a Write_Pending *AND*
any Read (oper1 *OR* oper2) is requested.

now onto the additional modifications.

3. comp unit <https://libre-soc.org/3d_gpu/compunit_multi_rw.jpg>
4. dep cell <https://libre-soc.org/3d_gpu/dependence_cell_multi_pending.jpg>

firstly, the computation unit modifications:

* multiple Go_Read signals are present, GoRD1-3
* multiple incoming operands are present, Op1-3
* multiple Go_Write signals are present, GoWR1-3
* multiple outgoing results are present, Out1-2

note that these are *NOT* necessarily 64-bit registers: they are in fact
Carry Flags, because we are implementing POWER9. however (as mentioned
yesterday in the huge 250+ message discussion), as far as the Dep Matrices
are concerned you still have to treat Carry-In and Carry-Out as Read/Write
Hazard-protected *actual* Registers.

in the original 6600 comp unit diagram (1), because the "Go_Read" assumes
that *both* registers will be read (and supplied) simultaneously from
the Register File, the sequence - the Finite State Machine - is real
simple:

* ISSUE -> BUSY (latched)
* RD-REQ -> GO_RD
* WR-REQ -> GO_WR
* repeat

[aside: there is a protective "revolving door" loop where the SR latch for
each state in the FSM is guaranteed stable (never reaches "unknown")]

in *this* diagram (3), we instead need:

* ISSUE -> BUSY (latched)
* RD-REQ1 -> GO_RD1 (may occur independent of RD2/3)
* RD-REQ2 -> GO_RD2 (may occur independent of RD1/3)
* RD-REQ3 -> GO_RD3 (may occur independent of RD1/2)
* when all 3 of GO_RD1-3 have been asserted,
ONLY THEN raise WR-REQ1-2
* WR-REQ1 -> GO_WR1 (may occur independent of WR2)
* WR-REQ2 -> GO_WR2 (may occur independent of WR1)
* when all (2) of GO_WR1-2 have been asserted,
ONLY THEN reset back to the beginning.
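
this per-port-independent sequencing can be modelled in python (a toy
model of the rules above only; names are hypothetical, not from the
actual source):

```python
class MultiCompUnitFSM:
    """Reads complete independently, in any order; writes may only be
    requested once every read is acknowledged; full write completion
    resets the FSM back to the beginning."""
    def __init__(self, n_rd=3, n_wr=2):
        self.rd_done = [False] * n_rd
        self.wr_done = [False] * n_wr

    def go_rd(self, port):
        self.rd_done[port] = True          # GO_RDn, any order

    def wr_req(self):
        # WR-REQ1-2 raised ONLY THEN: when all GO_RD1-3 are done
        return all(self.rd_done)

    def go_wr(self, port):
        assert self.wr_req(), "writes must wait for all reads"
        self.wr_done[port] = True
        if all(self.wr_done):              # reset back to the beginning
            self.rd_done = [False] * len(self.rd_done)
            self.wr_done = [False] * len(self.wr_done)
```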

note the crucial difference is that the read requests and acknowledgements
(GO_RD) are *all independent* and may occur:

* in any order
* in any combination
* all at the same time

likewise for write-request/go-write.

thus, if there is only one spare READ Register File port available
(because this particular Computation Unit is low priority, but
the other operations need only two Regfile Ports and the Regfile
happens to be 3R1W), at least one of OP1-3 may get its operand.

thus, if we have three 2-operand operations and a 3R1W regfile:

* clock cycle 1: the first may grab 2 ports and the second grabs 1 (Oper1)
* clock cycle 2: the second grabs one more (Oper2) and the third grabs 2

compare this to the *original* 6600: if there are three 2-operand
operations outstanding, they MUST go:

* clock cycle 1: the first may grab 2 ports; NEITHER the 2nd nor 3rd proceed
* clock cycle 2: the second may grab 2 ports; the 3rd may NOT proceed
* clock cycle 3: the 3rd grabs 2 ports

this is because the Comp Unit - and associated Dependency Matrices - *FORCE*
the Comp Unit to only proceed when *ALL* necessary Register Read Ports
are available (because there is only the one Go_Read signal).

so my questions are:

* does the above look reasonable? both in terms of the DM changes
and the CompUnit changes.
* the use of the three SR latches looks a little weird to me
(bottom right corner of (3), which is a rewrite of the middle
of the page).

it looks a little weird to have an SR Latch looped back
"onto itself": namely that when the inversion of both
WR_REQ1 and WR_REQ2 going low triggers that AND gate
(the one with the input from Q of an SR Latch), it *resets*
that very same SR-Latch, which will cause a mini "blip"
on Reset, doesn't it?

argh. that doesn't feel right. what should it be replaced with?

[[!img compunit_multi_rw.jpg size="600x"]]

[[!img dependence_cell_multi_pending.jpg size="600x"]]

# Corresponding Function-Unit Dependency Cell Modifications

* Video <https://youtu.be/_5fmPpInJ7U>

Original 6600 FU-FU Cell diagram:

[[!img fu_dep_cell_6600.jpg size="600x"]]

Augmented multi-GORD/GOWR 6600 FU-FU Cell diagram:

[[!img fu_dep_cell_multi_6600.jpg size="600x"]]