Bug 1244: changes to description pospopcount

[libreriscv.git] / 3d_gpu / microarchitecture.mdwn
diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn

index 80796b63545c669bdd41096b0759e35e62b97ad5..c9473f5d5a78512c434938d77932e14a5c6ea158 100644 (file)
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -25,10 +25,21 @@
    requires registers to have extra hidden bits: register
    x30 is now "x30:0+x30.1+x30.2+x30.3".  have to discuss.
  
+See [[requirements_specification]]
+
  # Conversation Notes
  
  ----
  
+http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000310.html
+
+> We will need fast f32 <-> i16 at least since that is used for 16-bit
+> z-buffers. Since we don't have indexed load/store and need to manually
+> construct pointer vectors we will need fast i32 -> i64. We will also need
+> fast i32 <-> f32.
+
+----
+
  'm thinking about using tilelink (or something similar) internally as
  having a cache-coherent protocol is required for implementing Vulkan
  (unless you want to turn off the cache for the GPU memory, which I
@@ -200,7 +211,7 @@ be pending
  
  Advantages of Speculation, Load and Store hazards:
  
-* A store updates memoryy only when it reached the head of the ROB
+* A store updates memory only when it reached the head of the ROB
  * WAW and WAR type of hazards are eliminated with speculation
    (actual updating of memory occurs in order)
  * RAW hazards through memory are maintained by not allowing a load
@@ -354,6 +365,216 @@ have to forward multiple times from the miss buffers and avoid damaging
  the cache until the instruction retires. VVM uses this to avoid having
  a vector strip mine the data cache.
  
+----
+
+> I meant the renaming done as part of the SV extension, not the
+> microarchitectural renaming.
+
+ah ok, yes. right.  ok, so i don't know what to name that, and i'd
+been thinking of it in terms of "post-renaming", as in my mind, it's
+not really renaming, at all, it's... remapping.  or, vector
+"elements".
+
+as in: architecturally we already have a name (vector "elements").
+physically we already have a name: register file.
+
+i was initially thinking that the issue stage would take care of it,
+by producing:
+
+* post-remapped elements which are basically post-remapped register indices
+* a byte-mask indicating which *bytes* of the register are to be
+  modified and which left alone
+* an element-width that is effectively an augmentation of (part of) the opcode
+
+the element width goes into the ALU as an augmentation of the opcode
+because the 64-bit "register" now contains e.g. 16-bit "elements"
+indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a
+SIMD-style (predicated) operation to take place.
+
+now that i think about it, i think we may need to have the three
+phases be part of a pipeline, in a single dependency matrix.
+
+----
+
+I had a state machine in one chip that could come up out of power on in a
+state it could not get out of. Since this experience, I have a rule with
+state machines, A state machine must be able to go from any state to idle
+when the reset line is asserted.
+
+You have to prove that the logic can never create a circular dependency,
+not a proof with test vectors, a logical proof like what we do with FP
+arithmetic these days.
+
+----
+
+
+>  however... we don't mind that, as the vectorisation engine will
+>  be, for the most part, generating sequentially-increasing index
+>  dest *and* src registers, so we kinda get away with it.
+
+In this case:: you could simply design a 1R or 1W file (A.K.A. SRAM)
+and read 4 registers at a time or write 4 registers at a time. Timing
+looks like:
+
+<pre>
+     |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
+                    |F123|F123|F123|F123|
+                         |Esk1|EsK2|EsK3|EsK4|
+                                        |EfK1|EfK2|EfK3|EfK4|
+</pre>
+
+4 cycle FU shown. Read as much as you need in 4 cycles for one operand,
+Read as much as you need in 4 cycles for another operand, read as much
+as you need in 4 cycles for the last operand, then write as much as you
+can for the result. This simply requires flip-flops to capture the width
+and then deliver operands in parallel (serial to parallel converter) and
+similarly for writing.
+
+----
+
+* <https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/32aNXIzeDQAJ>
+
+discussion of how to do dest-latches rather than src-latches.
+
+also includes need for forwarding to achieve it (synonymous with
+Tomasulo CDB).
+
+also, assigning a result number at issue time allows multiple results
+to be stored-and-forwarded, meaning that multiplying up the FUs is
+not needed.
+
+also, discussion of how to have multiple instructions issued even with
+the same dest reg: drop the reg-store and effectively rename them
+to "R.FU#".  exceptions under discussion.
+
+----
+
+Speculation
+
+<https://groups.google.com/forum/#!topic/comp.arch/mzXXTU2GUSo>
+
+There is a minimal partial order that is immune to Spetré amd friends,
+You have the dependence matrix that imposes a minimal partial order on
+executing instructions (at least in the architecture you have been
+discussing herein) You just have to prove that your matrix provides
+that minimal partial order for instructions.
+
+Then you have to prove that no cache/tlb state can be updated prior to the
+causing instruction being made retirable (not retired retirable).
+
+As to cache updates, all "reasonable" interfaces that service cache misses
+will have line buffers to deal with the inbound and outbound memory traffic.
+These buffers will provide the appropriate data to the execution stream,
+but not update the cache until the causing instruction becomes transitively
+retirable. This will put "a little" extra pressure on these buffers.
+
+As to the TLB it is easy enough on a TLB miss to fetch the mapping tables
+transitively and arrive at a PTE. This PTE cannot be installed until the
+causing instruction becomes retirable. The miss buffers are probably the
+right place, and if a second TLB miss occurs, you might just as well walk
+the tables again and if it hits the line in the buffer use the data from
+there. When we looked at this a long time ago, there was little benefit
+for being able to walk more than one TLB miss at a time.
+
+----
+
+Register Prefixes <a name="prefixes" />
+
+<pre>
+|           3      |           2      |           1      |           0      |
+| ---------------- | ---------------- | ---------------- | ---------------- |
+|                  | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+</pre>
+
+<pre>
+2x16-bit / 32-bit:
+
+| 9 8   | 7 6 5 |     4 3 |     2 1 | 0 |
+| ----- | ----- | ------- | ------- | - |
+| elwid | VL    | rs[6:5] | rd[6:5] | 0 |
+
+| 9 8 7 6 5 |      4 3 |   2 |   1 | 0 |
+| --------- | -------- | --- | --- | - |
+| predicate | predtarg | end | inv | 1 |
+
+
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+</pre>
+
+# MVX and other reg-shuffling
+
+<pre>
+> Crucial strategic op missing is MVX:
+> regs[rd]= regs[regs[rs1]]
+>
+we could modify the definition slightly:
+for i in 0..VL {
+    let offset = regs[rs1 + i];
+    // we could also limit on out-of-range
+    assert!(offset < VL); // trap on fail
+    regs[rd + i] = regs[rs2 + offset];
+}
+
+The dependency matrix would have the instruction depend on everything from
+rs2 to rs2 + VL and we let the execution unit figure it out. for
+simplicity, we could extend the dependencies to a power of 2 or something.
+
+We should add some constrained swizzle instructions for the more
+pipeline-friendly cases. One that will be important is:
+for i in (0..VL) {
+    let i = i * 4;
+    let s1: [0; 4];
+    for j in 0..4 {
+        s1[j] = regs[rs1 + i + j];
+    }
+    for j in 0..4 {
+        regs[rd + i + j] = s1[(imm >> j * 2) & 0x3];
+    }
+}
+Another is matrix transpose for (2-4)x(2-4) matrices which we can implement
+as similar to a strided ld/st except for registers.
+</pre>
+
+# TLBs / Virtual Memory <a name="tlb" />
+
+----
+
+We were specifically looking for ways to not need large CAMs since they are
+power-hungry when designing the instruction scheduling logic, so it may be
+a good idea to have a smaller L1 TLB and a larger, slower, more
+power-efficient, L2 TLB. I would have the L1 be 4-32 entries and the L2 can
+be 32-128 as long as the L2 cam isn't being activated every clock cycle. We
+can also share the L2 between the instruction and data caches.
+
+# Register File having same-cycle "forwarding"
+
+discussion about CDC 6600 Register File: it was capable of forwarding
+operands being written out to "reads", *in the same cycle*.  this
+effectively turns the Reg File *into* a "Forwarding Bus".
+
+we aim to only have (4 banks of) 2R1W ported register files,
+with *additional* Forwarding Multiplexers (which look exactly
+like multi-port regfile gate logic).
+
+suggestion by Mitch is to have a "demon" on the front of the regfile,
+<https://groups.google.com/d/msg/comp.arch/gedwgWzCK4A/qY2SYjd2DgAJ>,
+which:
+
+    basically, you are going to end up with a "demon" at the RF and when
+    all read reservations have been satisfied the demon determines if the
+    result needs to be written to the RF or discarded. The demon sees
+    the instruction issue process, the branch resolutions, and the FU
+    exceptions, and keeps track of whether the result needs to be written.
+    It then forwards the result from the FU and clears the slot, then writes
+    the result to the RF if needed.
+
  # Design Layout
  
  ok,so continuing some thoughts-in-order notes:
@@ -363,42 +584,60 @@ ok,so continuing some thoughts-in-order notes:
  scoreboards are not just scoreboards, they are dependency matrices,
  and there are several of them:
  
-* one for LOAD/STORE-to-LOAD/STORE:
-    1. most recent LOADs prevent later STOREs
-    2. most recent STOREs prevent later LOADs.
+* one for LOAD/STORE-to-LOAD/STORE
+  - most recent LOADs prevent later STOREs
+  - most recent STOREs prevent later LOADs.
+  - a separate process analyses LOAD-STORE addresses for
+    conflicts, based on sufficient bits to assess uniqueness
+    as opposed to precise and exact matches
  * one for Function-Unit to Function-Unit.
-    3. it expresses both RAW and WAW hazards through "Go_Write"
-     and "Go_Read" signals, which are stopped from proceeding by
-     dependent 1-bit CAM latches
-    4. exceptions may ALSO be made "precise" by holding a "Write prevention"
-     signal.  only when the Function Unit knows that an exception is
-     not going to occur (memory has been fetched, for example), does
-     it release the signal
-    5. speculative branch execution likewise may hold a "Write prevention",
-       however it also needs a "Go die" signal, to clear out the
-       incorrectly-taken branch.
-    6. LOADs/STOREs *also* must be considered as "Functional Units" and thus
-       must also have corresponding entries (plural) in the FU-to-FU Matrix
-    7. it is permitted for ALUs to *BEGIN* execution (read operands are
-       valid) without being permitted to *COMMIT*.  thus, each FU must
-       store (buffer) results, until such time as a "commit" signal is
-       received
-    8. we may need to express an inter-dependence on the instruction order
-       (raising the WAW hazard line to do so) as a way to preserve execution
-       order.  only the oldest instructions will have this flag dropped,
-       permitting execution that has *begun* to also reach "commit" phase.
+  - it expresses both RAW and WAW hazards through "Go_Write"
+    and "Go_Read" signals, which are stopped from proceeding by
+    dependent 1-bit CAM latches
+  - exceptions may ALSO be made "precise" by holding a "Write prevention"
+    signal.  only when the Function Unit knows that an exception is
+    not going to occur (memory has been fetched, for example), does
+    it release the signal
+  - speculative branch execution likewise may hold a "Write prevention",
+    however it also needs a "Go die" signal, to clear out the
+    incorrectly-taken branch.
+  - LOADs/STOREs *also* must be considered as "Functional Units" and thus
+    must also have corresponding entries (plural) in the FU-to-FU Matrix
+  - it is permitted for ALUs to *BEGIN* execution (read operands are
+    valid) without being permitted to *COMMIT*.  thus, each FU must
+    store (buffer) results, until such time as a "commit" signal is
+    received
+  - we may need to express an inter-dependence on the instruction order
+    (raising the WAW hazard line to do so) as a way to preserve execution
+    order.  only the oldest instructions will have this flag dropped,
+    permitting execution that has *begun* to also reach "commit" phase.
  * one for Function-Unit to Registers.
-    1. it expresses the read and write requirements: the source
-      and destination registers on which the operation depends.  source
-      registers are marked "need read", dest registers marked
-      "need write".
-    2. by having *more than one* Functional Unit matrix row per ALU
-      it becomes possible to effectively achieve "Reservation Stations"
-      orthogonality with the Tomasulo Algorithm.  the FU row must, like
-      RS's, take and store a copy of the src register values.
+  - it expresses the read and write requirements: the source
+    and destination registers on which the operation depends.  source
+    registers are marked "need read", dest registers marked
+    "need write".
+  - by having *more than one* Functional Unit matrix row per ALU
+    it becomes possible to effectively achieve "Reservation Stations"
+    orthogonality with the Tomasulo Algorithm.  the FU row must, like
+    RS's, take and store a copy of the src register values.
  
  ## Register Renaming
  
+There are several potential well-known schemes for register-renaming:
+*none of them will be used here*.  The scheme below is a new form of
+renaming that is a topologically and functionally **direct** equivalent
+of the Tomasulo Algorithm with a Reorder Buffer, that came from the
+"Register Alias Table" concept that is better suited to Scoreboards.
+It works by flattening out Reservation Stations to one per FU (requiring
+more FUs as a result).  On top of this the function normally carried
+out by "tags" of the RAT table may be merged-morphed into the role
+carried out by the ROB Destination Register CAM which may be merged-morphed
+into a single vector (per register) of 1-bit mutually-exclusive "CAMs"
+that are added, very simply, to the FU-Register Dependency Matrix.
+
+In this way, exactly as in the Tomasulo Algorithm, there is absolutely no
+need whatsoever for a separate PRF-ARF scheme.  The PRF *is* the ARF.
+
  Register-renaming will be done with a single extra mutually-exclusive bit
  in the FUxReg Dependency Matrix, which may be set on only one FU (per register).
  This bit indicates which of the FUs has the **most recent** destination
@@ -460,7 +699,7 @@ on the destination   register number
  * the Register Read and Write signals are then "striped" such that
    read/write requests for every 2nd (or 4th) register are "grouped" and
    will have to fight for access to a multiplexer in order to access
-  registers that do not   have the same modulo 2 (or 4) match.
+  registers that do not have the same modulo 2 (or 4) match.
  * we MAY potentially be able to drop the destination (write) multiplexer(s)
    by only permitting FU rows with the same modulo to write to that
    destination bank.  FUs with indices 0,4,8,12 may only write to registers
@@ -559,6 +798,11 @@ costs 2 address comparators to disambiguate this short shadow in the pipeline.
  This is a lower expense than building another read port into the RF, in
  both area and power, and uses the pipeline efficiently.
  
+# Explicit Vector Length (EVL) extension to LLVM <a name="llvm_evl" />
+
+* <https://reviews.llvm.org/D57504>
+* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html>
+* <http://lists.llvm.org/pipermail/llvm-dev/2019-January/129822.html>
  
  # References