From 9cd39bfe350457569f2199623e0c0dbdbebf6ac0 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Fri, 7 Dec 2018 09:30:38 +0000
Subject: [PATCH] add conversation notes

---
 3d_gpu/microarchitecture.mdwn | 46 +++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 04678c6e3..2a1161251 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -263,6 +263,52 @@ but the results are delivered to the RF in program order.
 
 This looks surprisingly like a 'belt' at the end of the function unit.
 
+----
+
+> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
+
+ so, coming back to this diagram, i think if we stratify the
+Functional Units into lanes as well, we may get a multi-issue
+architecture.
+
+ the 6600 scoreboard rules - which are awesomely simple and actually
+involve D-Latches (3 gates) *not* flip-flops (10 gates) - can be executed
+in parallel because there will be no overlap between stratified registers.
+
+ if using that odd-even / msw-lsw division (instead of modulo 4 on the
+register number) it will be more like a 2-issue for standard RV
+instructions and a 4-issue for when SV 32-bit ops are loop-generated.
+
+ by subdividing the registers into odd-even banks we will need a
+_pair_ of (completely independent) register-renaming tables:
+  https://libre-riscv.org/3d_gpu/rat_table.png
+
+ for SIMD'd operations, if we have the same type of reservation
+station queue as with Tomasulo, it can be augmented with the byte-mask:
+if the byte-masks in the queue of both the src and dest registers do
+not overlap, the operations may be done in parallel.
+
+ i still have not yet thought through how the Reorder Buffer would
+work: here, again, i am tempted to recommend that we "stratify" the
+ROB into odd-even (modulo 2) or perhaps modulo 4, keeping 32 entries
+but with a CAM that is only 4-bit or 3-bit wide.
+
+ if an instruction's destination register does not meet the modulo
+requirements, that ROB entry is *left empty*.  this does mean that,
+for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
+4), and there are 4 sequential instructions that happen e.g. to have
+a destination of r4 for insn1, r24 for insn2, r16 for insn3 etc. (all
+in the same modulo-4 lane), the ROB will only hold 8 such instructions.
+
+and that i think is perfectly fine, because, statistically, it'll balance
+out, and SV generates instructions with sequentially-incrementing register
+numbers, so *that* is fine, too.
+
+i'll keep working on diagrams, and also reading mitch alsup's chapters
+on the 6600.  they're frickin awesome.  the 6600 could do multi-issue
+LD and ST by way of having registers dedicated to LD and ST.  X1-X5 were
+for LD, X6 and X7 for ST.
+
 # References
 
 * <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
-- 
2.30.2