add conversation note

[libreriscv.git] / 3d_gpu / microarchitecture.mdwn
diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn

index b496631ff1d8be5403cd556a237d4fa28e48fe6e..bd20d476ee14c7d20f899745516e1310956ae965 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -235,6 +235,123 @@ Reorder Buffer Entry
  * Ready
      - indicates that the instruction has completed execution: value is ready
  
+----
+
+Register Renaming resources
+
+* <https://www.youtube.com/watch?v=p4SdrUhZrBM>
+* <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
+* ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>
+
+Video @ 3:24, "RAT" table - Register Aliasing Table:
+
+<img src="/3d_gpu/rat_table.png" />
+
+This scheme looks very much like a Reservation Station.
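A minimal sketch of what a Register Aliasing Table does may help here. It is a generic textbook-style rename step, not a transcription of the table in the video: the class name `RAT`, the `rename` method and the `t0`, `t1`... tag names are all made up for illustration.

```python
class RAT:
    """Hypothetical Register Aliasing Table: architectural reg -> newest tag."""

    def __init__(self, nregs=32):
        # None means "the current value lives in the register file"
        self.alias = [None] * nregs
        self.next_tag = 0

    def rename(self, dest, srcs):
        # Sources are looked up *before* the destination is re-aliased,
        # so "r1 = r1 + 1" reads the old r1 and writes a fresh tag.
        src_names = [
            self.alias[s] if self.alias[s] is not None else f"r{s}"
            for s in srcs
        ]
        tag = f"t{self.next_tag}"
        self.next_tag += 1
        self.alias[dest] = tag
        return tag, src_names
```

Each source either names a register-file entry or the tag of an in-flight producer, which is the same information a Reservation Station entry holds for its operands.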
+
+----
+
+There is another way to get precise ordering of the writes in a scoreboard.
+First, one has to implement forwarding in the scoreboard.
+Second, the function units need an output queue (of, say, 4 registers).
+Now, one can launch an instruction and pick up its operand either
+from the RF or from the function unit output while the result sits
+in the function unit waiting for its GO_Write signal.
+
+Thus the launching of instructions is not delayed due to hazards
+but the results are delivered to the RF in program order.
+
+This looks surprisingly like a 'belt' at the end of the function unit.
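The forwarding-plus-output-queue idea above can be sketched roughly as follows. This is only an illustrative model under stated assumptions: each function unit completes its own results in program order, every instruction carries a hypothetical sequence number `seq`, and the names `FunctionUnit`, `read_operand` and `go_write` are invented for the sketch.

```python
from collections import deque


class FunctionUnit:
    """Hypothetical function unit with a small output queue (the 'belt')."""

    def __init__(self, depth=4):
        self.depth = depth
        self.out = deque()  # entries are (seq, dest_reg, value), oldest first

    def has_room(self):
        return len(self.out) < self.depth

    def complete(self, seq, dest, value):
        # The result waits here until it receives GO_Write.
        self.out.append((seq, dest, value))


def read_operand(reg, regfile, units):
    """Forwarding: prefer the youngest pending result over the RF value."""
    best = None
    for fu in units:
        for seq, dest, value in fu.out:
            if dest == reg and (best is None or seq > best[0]):
                best = (seq, value)
    return best[1] if best is not None else regfile[reg]


def go_write(regfile, units, next_seq):
    """Write back only the result whose turn it is in program order.

    Returns the sequence number that must be written next.
    """
    for fu in units:
        if fu.out and fu.out[0][0] == next_seq:
            _, dest, value = fu.out.popleft()
            regfile[dest] = value
            return next_seq + 1
    return next_seq  # the oldest instruction has not completed yet
```

A later instruction that finishes early can feed its result to consumers through `read_operand` straight away, but `go_write` holds it in the queue until all older results have reached the RF.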
+
+----
+
+> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
+
+ so, coming back to this diagram, i think if we stratify the
+Functional Units into lanes as well, we may get a multi-issue
+architecture.
+
+ the 6600 scoreboard rules - which are awesomely simple and actually
+involve D-Latches (3 gates) *not* flip-flops (10 gates) - can be executed
+in parallel because there will be no overlap between stratified registers.
+
+ if using that odd-even / msw-lsw division (instead of modulo 4 on the
+register number) it will be more like a 2-issue for standard RV
+instructions and a 4-issue for when SV 32-bit ops are loop-generated.
+
+ by subdividing the registers into odd-even banks we will need a
+_pair_ of (completely independent) register-renaming tables:
+  https://libre-riscv.org/3d_gpu/rat_table.png
+
+ for SIMD'd operations, if we have the same type of reservation
+station queue as with Tomasulo, it can be augmented with the byte-mask:
+if the byte-masks in the queue of both the src and dest registers do
+not overlap, the operations may be done in parallel.
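As a rough illustration of the byte-mask test, here is a small sketch. It assumes 64-bit registers with an 8-bit mask (one bit per byte) and treats two queued operations as independent when neither one writes bytes that the other reads or writes; the `Access`, `Op` and `independent` names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Access(NamedTuple):
    reg: int   # register number
    mask: int  # byte mask, bit i set = byte i of the register is touched


class Op(NamedTuple):
    dest: Access
    srcs: Tuple[Access, ...]


def overlaps(a, b):
    """True if both accesses touch at least one common byte of one register."""
    return a.reg == b.reg and (a.mask & b.mask) != 0


def independent(x, y):
    """No WAW, RAW or WAR overlap at byte granularity (read-read is fine)."""
    if overlaps(x.dest, y.dest):
        return False
    if any(overlaps(x.dest, s) for s in y.srcs):
        return False
    if any(overlaps(y.dest, s) for s in x.srcs):
        return False
    return True
```

For example, an operation on the low four bytes of r5 (mask `0x0F`) and one on the high four bytes (mask `0xF0`) pass the test even though they name the same register.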
+
+ i have not yet thought through how the Reorder Buffer would
+work: here, again, i am tempted to recommend that we "stratify"
+the ROB into odd-even (modulo 2) or perhaps modulo 4, still with 32 entries,
+however the CAM is then only 4-bit (modulo 2) or 3-bit (modulo 4) wide.
+
+ if an instruction's destination register does not meet the modulo
+requirements, that ROB entry is *left empty*.  this does mean that,
+for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
+4), and there are sequential instructions that all happen to have
+destinations in the same lane (e.g. r4 for insn1, r24 for insn2, r16
+for insn3, etc.), the ROB will only hold 8 such instructions.
+
+and that i think is perfectly fine, because, statistically, it'll balance
+out, and SV generates sequentially-incrementing register numbers,
+so *that* is fine, too.
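The "left empty" behaviour can be checked with a small model. This is only a sketch under assumptions that are not in the notes: ROB slot `i` belongs to lane `i % lanes`, allocation walks a tail pointer forward and skips slots whose lane does not match, and neither commit nor wrap-around is modelled. The function name `allocate` is made up.

```python
def allocate(dests, rob_entries=32, lanes=4):
    """Place destination registers into a lane-stratified ROB.

    Slot i may only hold an instruction whose dest satisfies
    dest % lanes == i % lanes; non-matching slots are left as None.
    """
    rob = [None] * rob_entries
    tail = 0
    for reg in dests:
        # Skip (leave empty) slots that belong to a different lane.
        while tail < rob_entries and tail % lanes != reg % lanes:
            tail += 1
        if tail == rob_entries:
            break  # ROB full: instruction would have to stall
        rob[tail] = reg
        tail += 1
    return rob


def occupancy(rob):
    return sum(entry is not None for entry in rob)
```

With destinations that all fall in lane 0 (r4, r24, r16, ...) only every fourth slot is used, so 8 instructions fit; with sequentially-incrementing destinations, as SV would generate, all 32 slots are used.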
+
+i'll keep working on diagrams, and also reading mitch alsup's chapters
+on the 6600.  they're frickin awesome.  the 6600 could do multi-issue
+LD and ST by way of having registers dedicated to LD and ST.  X1-X5 were
+for LD, X6 and X7 for ST.
+
+----
+
+i took a shot at explaining this also on comp.arch today, and that
+allowed me to identify a problem with the proposed modulo-4 "lanes"
+stratification.
+
+when a result is created in one lane, it may need to be passed to the next
+lane.  that means that each of the other lanes needs to keep a watchful
+eye on when another lane updates the other regfiles (all 3 of them).
+
+when an incoming update occurs, there may be up to 3 register writes
+(that need to be queued?) that need to be broadcast (written) into
+reservation stations.
+
+what i'm not sure of is: can data consistency be preserved, even if
+there's a delay?  my big concern is that during the time where the data is
+broadcast from one lane, the head of the ROB arrives at that instruction
+(which is the "commit" condition), it gets committed, then, unfortunately,
+the same ROB# gets *reused*.
+
+now that i think about it, as long as the length of the queue is below
+the size of the Reorder Buffer (preferably well below), and as long as
+it's guaranteed to be emptied by the time the ROB cycles through the
+whole buffer, it *should* be okay.
+
+----
+
+> Don't forget that in these days of Spectre and Meltdown, merely
+> preventing dead instruction results from being written to registers or
+> memory is NOT ENOUGH. You also need to prevent load instructions from
+> altering cache and branch instructions from altering branch prediction
+> state.
+
+Which, oddly enough, creates a need to be able to consume
+multiple containers from the cache miss buffers, which, oddly enough,
+is what makes a crucial mechanism in the Virtual Vector Method work.
+
+In the past, one would forward the demand container to the waiting
+memref and then write the whole line into the cache. Spectre and Meltdown
+(S&M) mean you have to forward multiple times from the miss buffers and
+avoid damaging the cache until the instruction retires. VVM uses this to
+avoid having a vector strip mine the data cache.
+
  # References
  
  * <https://en.wikipedia.org/wiki/Tomasulo_algorithm>