(no commit message)

[libreriscv.git] / 3d_gpu / microarchitecture.mdwn
diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn

index 26b51a32a4b5ffd4e338b96b52706e048168a0c4..c9473f5d5a78512c434938d77932e14a5c6ea158 100644 (file)
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -476,6 +476,83 @@ the tables again and if it hits the line in the buffer use the data from
  there. When we looked at this a long time ago, there was little benefit
  for being able to walk more than one TLB miss at a time.
  
+----
+
+Register Prefixes <a name="prefixes" />
+
+<pre>
+|           3      |           2      |           1      |           0      |
+| ---------------- | ---------------- | ---------------- | ---------------- |
+|                  | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+</pre>
+
+<pre>
+2x16-bit / 32-bit:
+
+| 9 8   | 7 6 5 |     4 3 |     2 1 | 0 |
+| ----- | ----- | ------- | ------- | - |
+| elwid | VL    | rs[6:5] | rd[6:5] | 0 |
+
+| 9 8 7 6 5 |      4 3 |   2 |   1 | 0 |
+| --------- | -------- | --- | --- | - |
+| predicate | predtarg | end | inv | 1 |
+
+
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+</pre>
+
+# MVX and other reg-shuffling
+
+<pre>
+> Crucial strategic op missing is MVX:
+> regs[rd]= regs[regs[rs1]]
+>
+we could modify the definition slightly:
+for i in 0..VL {
+    let offset = regs[rs1 + i];
+    // we could also limit on out-of-range
+    assert!(offset < VL); // trap on fail
+    regs[rd + i] = regs[rs2 + offset];
+}
+
+The dependency matrix would have the instruction depend on everything from
+rs2 to rs2 + VL and we let the execution unit figure it out. for
+simplicity, we could extend the dependencies to a power of 2 or something.
+
+We should add some constrained swizzle instructions for the more
+pipeline-friendly cases. One that will be important is:
+for i in (0..VL) {
+    let i = i * 4;
+    let s1: [0; 4];
+    for j in 0..4 {
+        s1[j] = regs[rs1 + i + j];
+    }
+    for j in 0..4 {
+        regs[rd + i + j] = s1[(imm >> j * 2) & 0x3];
+    }
+}
+Another is matrix transpose for (2-4)x(2-4) matrices which we can implement
+as similar to a strided ld/st except for registers.
+</pre>
+
+# TLBs / Virtual Memory <a name="tlb" />
+
+----
+
+We were specifically looking for ways to not need large CAMs since they are
+power-hungry when designing the instruction scheduling logic, so it may be
+a good idea to have a smaller L1 TLB and a larger, slower, more
+power-efficient, L2 TLB. I would have the L1 be 4-32 entries and the L2 can
+be 32-128 as long as the L2 cam isn't being activated every clock cycle. We
+can also share the L2 between the instruction and data caches.
+
  # Register File having same-cycle "forwarding"
  
  discussion about CDC 6600 Register File: it was capable of forwarding
@@ -721,6 +798,12 @@ costs 2 address comparators to disambiguate this short shadow in the pipeline.
  This is a lower expense than building another read port into the RF, in
  both area and power, and uses the pipeline efficiently.
  
+# Explicit Vector Length (EVL) extension to LLVM <a name="llvm_evl" />
+
+* <https://reviews.llvm.org/D57504>
+* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html>
+* <http://lists.llvm.org/pipermail/llvm-dev/2019-January/129822.html>
+
  # References
  
  * <https://en.wikipedia.org/wiki/Tomasulo_algorithm>