From ed61887d13e0029f84a4646cc0a32d8fa34be825 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Mon, 17 Dec 2018 14:46:08 +0000
Subject: [PATCH] add comments

---
 3d_gpu/microarchitecture.mdwn | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 7b9d37fa5..4ab8ce1c2 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -394,6 +394,30 @@ You have to prove that the logic can never create a circular dependency,
 not a proof with test vectors, a logical proof like what we do with FP
 arithmetic these days.
 
+----
+
+
+>  however... we don't mind that, as the vectorisation engine will 
+>  be, for the most part, generating sequentially-increasing index
+>  dest *and* src registers, so we kinda get away with it.
+
+In this case, you could simply design a 1R or 1W register file (a.k.a. SRAM)
+and read 4 registers at a time or write 4 registers at a time. Timing
+looks like:
+
+<pre>
+     |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
+                    |F123|F123|F123|F123|
+                         |EsK1|EsK2|EsK3|EsK4|
+                                        |EfK1|EfK2|EfK3|EfK4|
+</pre>
+
+4-cycle FU shown. Read as much as you need in 4 cycles for one operand,
+read as much as you need in 4 cycles for another operand, read as much
+as you need in 4 cycles for the last operand, then write as much as you
+can for the result. This simply requires flip-flops to capture the width
+and then deliver operands in parallel (serial-to-parallel converter), and
+similarly for writing.
 # Design Layout
 
 ok,so continuing some thoughts-in-order notes:
-- 
2.30.2