From ed61887d13e0029f84a4646cc0a32d8fa34be825 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Mon, 17 Dec 2018 14:46:08 +0000
Subject: [PATCH] add comments

---
 3d_gpu/microarchitecture.mdwn | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 7b9d37fa5..4ab8ce1c2 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -394,6 +394,30 @@ You have to prove that the logic can never create a circular dependency,
 not a proof with test vectors, a logical proof like what we do with FP
 arithmetic these days.
 
+----
+
+
+> however... we don't mind that, as the vectorisation engine will
+> be, for the most part, generating sequentially-increasing index
+> dest *and* src registers, so we kinda get away with it.
+
+In this case, you could simply design a 1R or 1W file (A.K.A. SRAM)
+and read 4 registers at a time or write 4 registers at a time. Timing
+looks like:
+
+
+     |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
+                    |F123|F123|F123|F123|
+                         |EsK1|EsK2|EsK3|EsK4|
+                                        |EfK1|EfK2|EfK3|EfK4|
+
+
+4 cycle FU shown. Read as much as you need in 4 cycles for one operand,
+read as much as you need in 4 cycles for another operand, read as much
+as you need in 4 cycles for the last operand, then write as much as you
+can for the result. This simply requires flip-flops to capture the width
+and then deliver operands in parallel (serial to parallel converter) and
+similarly for writing.
 # Design Layout
 
 ok,so continuing some thoughts-in-order notes:
-- 
2.30.2
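
Reviewer note (not part of the patch): the slot rotation described in the added text can be modelled roughly as below. This is only a behavioural sketch under stated assumptions, not the project's design. The names `SinglePortRegFile`, `run_vector_op` and `WIDTH` are hypothetical, the slot labels are taken from the timing diagram, and the model runs one group at a time rather than overlapping FU execution with the next group's reads as the diagram does.

```python
from dataclasses import dataclass, field

WIDTH = 4  # registers moved per port access (assumed from "4 registers at a time")
SLOTS = ("RdS1", "RdS2", "RdS3", "WtRd")  # slot labels from the timing diagram


@dataclass
class SinglePortRegFile:
    """Hypothetical 1R-or-1W file: each slot is one wide read OR one wide write."""
    regs: list
    log: list = field(default_factory=list)  # (slot, "R" or "W", base index)

    def read_group(self, slot, base):
        # One port access returns WIDTH consecutive registers.
        self.log.append((slot, "R", base))
        return self.regs[base:base + WIDTH]

    def write_group(self, slot, base, values):
        # One port access stores WIDTH consecutive registers.
        self.log.append((slot, "W", base))
        self.regs[base:base + WIDTH] = values


def run_vector_op(rf, dest, src1, src2, src3, vl, op):
    """Apply a 3-operand op over vl sequentially-indexed registers."""
    assert vl % WIDTH == 0, "sketch only handles whole groups"
    for off in range(0, vl, WIDTH):
        # Three read slots: each fills one bank of operand latches (flip-flops).
        latches = [
            rf.read_group(slot, base + off)
            for slot, base in zip(SLOTS[:3], (src1, src2, src3))
        ]
        # Latched operands are delivered to the FU lanes in parallel.
        results = [op(a, b, c) for a, b, c in zip(*latches)]
        # Fourth slot: the same port is used to write the results back.
        rf.write_group(SLOTS[3], dest + off, results)


if __name__ == "__main__":
    rf = SinglePortRegFile(list(range(32)))
    run_vector_op(rf, dest=24, src1=0, src2=8, src3=16, vl=8,
                  op=lambda a, b, c: a * b + c)
    for entry in rf.log:
        print(entry)
```

The access log makes the point of the scheme visible: the port is never asked to read and write in the same slot, and sequentially-increasing register indices are what let each access move a full group of 4.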