X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=3d_gpu%2Fmicroarchitecture.mdwn;h=d26d8948478b9e9c3c5e76b2c6489c2b6cbc50c6;hb=1361c489ba642cbf1395097f10b730edc146b81d;hp=7975be84c063dc5e0eb71d69137156f9bd46ed2f;hpb=152ffba5919446044f3bbb3b8892913d672af3db;p=libreriscv.git

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 7975be84c..d26d89484 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -25,10 +25,21 @@
   requires registers to have extra hidden bits: register
   x30 is now "x30:0+x30.1+x30.2+x30.3".  have to discuss.
 
+See [[requirements_specification]]
+
 # Conversation Notes
 
 ----
 
+http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000310.html
+
+> We will need fast f32 <-> i16 at least since that is used for 16-bit
+> z-buffers. Since we don't have indexed load/store and need to manually
+> construct pointer vectors we will need fast i32 -> i64. We will also need
+> fast i32 <-> f32.
+
+----
+
 'm thinking about using tilelink (or something similar) internally as
 having a cache-coherent protocol is required for implementing Vulkan
 (unless you want to turn off the cache for the GPU memory, which I
@@ -354,6 +365,216 @@ have to forward multiple times from the miss buffers and avoid damaging
 the cache until the instruction retires. VVM uses this to avoid having
 a vector strip mine the data cache.
 
+----
+
+> I meant the renaming done as part of the SV extension, not the
+> microarchitectural renaming.
+
+ah ok, yes. right.  ok, so i don't know what to name that, and i'd
+been thinking of it in terms of "post-renaming", as in my mind, it's
+not really renaming, at all, it's... remapping.  or, vector
+"elements".
+
+as in: architecturally we already have a name (vector "elements").
+physically we already have a name: register file.
+
+i was initially thinking that the issue stage would take care of it,
+by producing:
+
+* post-remapped elements which are basically post-remapped register indices
+* a byte-mask indicating which *bytes* of the register are to be
+  modified and which left alone
+* an element-width that is effectively an augmentation of (part of) the opcode
+
+the element width goes into the ALU as an augmentation of the opcode
+because the 64-bit "register" now contains e.g. 16-bit "elements"
+indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a
+SIMD-style (predicated) operation to take place.
+
+now that i think about it, i think we may need to have the three
+phases be part of a pipeline, in a single dependency matrix.
+
+----
+
+I had a state machine in one chip that could come up out of power on in a
+state it could not get out of. Since this experience, I have a rule with
+state machines, A state machine must be able to go from any state to idle
+when the reset line is asserted.
+
+You have to prove that the logic can never create a circular dependency,
+not a proof with test vectors, a logical proof like what we do with FP
+arithmetic these days.
+
+----
+
+
+>  however... we don't mind that, as the vectorisation engine will
+>  be, for the most part, generating sequentially-increasing index
+>  dest *and* src registers, so we kinda get away with it.
+
+In this case:: you could simply design a 1R or 1W file (A.K.A. SRAM)
+and read 4 registers at a time or write 4 registers at a time. Timing
+looks like:
+
+<pre>
+     |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
+                    |F123|F123|F123|F123|
+                         |Esk1|EsK2|EsK3|EsK4|
+                                        |EfK1|EfK2|EfK3|EfK4|
+</pre>
+
+4 cycle FU shown. Read as much as you need in 4 cycles for one operand,
+Read as much as you need in 4 cycles for another operand, read as much
+as you need in 4 cycles for the last operand, then write as much as you
+can for the result. This simply requires flip-flops to capture the width
+and then deliver operands in parallel (serial to parallel converter) and
+similarly for writing.
+
+----
+
+* <https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/32aNXIzeDQAJ>
+
+discussion of how to do dest-latches rather than src-latches.
+
+also includes need for forwarding to achieve it (synonymous with
+Tomasulo CDB).
+
+also, assigning a result number at issue time allows multiple results
+to be stored-and-forwarded, meaning that multiplying up the FUs is
+not needed.
+
+also, discussion of how to have multiple instructions issued even with
+the same dest reg: drop the reg-store and effectively rename them
+to "R.FU#".  exceptions under discussion.
+
+----
+
+Speculation
+
+<https://groups.google.com/forum/#!topic/comp.arch/mzXXTU2GUSo>
+
+There is a minimal partial order that is immune to SpetrÃ© amd friends,
+You have the dependence matrix that imposes a minimal partial order on
+executing instructions (at least in the architecture you have been
+discussing herein) You just have to prove that your matrix provides
+that minimal partial order for instructions.
+
+Then you have to prove that no cache/tlb state can be updated prior to the
+causing instruction being made retirable (not retired retirable).
+
+As to cache updates, all "reasonable" interfaces that service cache misses
+will have line buffers to deal with the inbound and outbound memory traffic.
+These buffers will provide the appropriate data to the execution stream,
+but not update the cache until the causing instruction becomes transitively
+retirable. This will put "a little" extra pressure on these buffers.
+
+As to the TLB it is easy enough on a TLB miss to fetch the mapping tables
+transitively and arrive at a PTE. This PTE cannot be installed until the
+causing instruction becomes retirable. The miss buffers are probably the
+right place, and if a second TLB miss occurs, you might just as well walk
+the tables again and if it hits the line in the buffer use the data from
+there. When we looked at this a long time ago, there was little benefit
+for being able to walk more than one TLB miss at a time.
+
+----
+
+Register Prefixes <a name="prefixes" />
+
+<pre>
+|           3      |           2      |           1      |           0      |
+| ---------------- | ---------------- | ---------------- | ---------------- |
+|                  | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+</pre>
+
+<pre>
+2x16-bit / 32-bit:
+
+| 9 8   | 7 6 5 |     4 3 |     2 1 | 0 |
+| ----- | ----- | ------- | ------- | - |
+| elwid | VL    | rs[6:5] | rd[6:5] | 0 |
+
+| 9 8 7 6 5 |      4 3 |   2 |   1 | 0 |
+| --------- | -------- | --- | --- | - |
+| predicate | predtarg | end | inv | 1 |
+
+
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+</pre>
+
+# MVX and other reg-shuffling
+
+<pre>
+> Crucial strategic op missing is MVX:
+> regs[rd]= regs[regs[rs1]]
+>
+we could modify the definition slightly:
+for i in 0..VL {
+    let offset = regs[rs1 + i];
+    // we could also limit on out-of-range
+    assert!(offset < VL); // trap on fail
+    regs[rd + i] = regs[rs2 + offset];
+}
+
+The dependency matrix would have the instruction depend on everything from
+rs2 to rs2 + VL and we let the execution unit figure it out. for
+simplicity, we could extend the dependencies to a power of 2 or something.
+
+We should add some constrained swizzle instructions for the more
+pipeline-friendly cases. One that will be important is:
+for i in (0..VL) {
+    let i = i * 4;
+    let s1: [0; 4];
+    for j in 0..4 {
+        s1[j] = regs[rs1 + i + j];
+    }
+    for j in 0..4 {
+        regs[rd + i + j] = s1[(imm >> j * 2) & 0x3];
+    }
+}
+Another is matrix transpose for (2-4)x(2-4) matrices which we can implement
+as similar to a strided ld/st except for registers.
+</pre>
+
+# TLBs / Virtual Memory <a name="tlb" />
+
+----
+
+We were specifically looking for ways to not need large CAMs since they are
+power-hungry when designing the instruction scheduling logic, so it may be
+a good idea to have a smaller L1 TLB and a larger, slower, more
+power-efficient, L2 TLB. I would have the L1 be 4-32 entries and the L2 can
+be 32-128 as long as the L2 cam isn't being activated every clock cycle. We
+can also share the L2 between the instruction and data caches.
+
+# Register File having same-cycle "forwarding"
+
+discussion about CDC 6600 Register File: it was capable of forwarding
+operands being written out to "reads", *in the same cycle*.  this
+effectively turns the Reg File *into* a "Forwarding Bus".
+
+we aim to only have (4 banks of) 2R1W ported register files,
+with *additional* Forwarding Multiplexers (which look exactly
+like multi-port regfile gate logic).
+
+suggestion by Mitch is to have a "demon" on the front of the regfile,
+<https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/qY2SYjd2DgAJ>,
+which:
+
+    basically, you are going to end up with a "demon" at the RF and when
+    all read reservations have been satisfied the demon determines if the
+    result needs to be written to the RF or discarded. The demon sees
+    the instruction issue process, the branch resolutions, and the FU
+    exceptions, and keeps track of whether the result needs to be written.
+    It then forwards the result from the FU and clears the slot, then writes
+    the result to the RF if needed.
+
 # Design Layout
 
 ok,so continuing some thoughts-in-order notes:
@@ -402,6 +623,21 @@ and there are several of them:
 
 ## Register Renaming
 
+There are several potential well-known schemes for register-renaming:
+*none of them will be used here*.  The scheme below is a new form of
+renaming that is a topologically and functionally **direct** equivalent
+of the Tomasulo Algorithm with a Reorder Buffer, that came from the
+"Register Alias Table" concept that is better suited to Scoreboards.
+It works by flattening out Reservation Stations to one per FU (requiring
+more FUs as a result).  On top of this the function normally carried
+out by "tags" of the RAT table may be merged-morphed into the role
+carried out by the ROB Destination Register CAM which may be merged-morphed
+into a single vector (per register) of 1-bit mutually-exclusive "CAMs"
+that are added, very simply, to the FU-Register Dependency Matrix.
+
+In this way, exactly as in the Tomasulo Algorithm, there is absolutely no
+need whatsoever for a separate PRF-ARF scheme.  The PRF *is* the ARF.
+
 Register-renaming will be done with a single extra mutually-exclusive bit
 in the FUxReg Dependency Matrix, which may be set on only one FU (per register).
 This bit indicates which of the FUs has the **most recent** destination
@@ -562,7 +798,6 @@ costs 2 address comparators to disambiguate this short shadow in the pipeline.
 This is a lower expense than building another read port into the RF, in
 both area and power, and uses the pipeline efficiently.
 
-
 # References
 
 * <https://en.wikipedia.org/wiki/Tomasulo_algorithm>