(no commit message)

[libreriscv.git] / simple_v_extension / remap.mdwn
diff --git a/simple_v_extension/remap.mdwn b/simple_v_extension/remap.mdwn

index 1c1742a24c6821250035cc47bfe219b13e87a460..570c0817e38b65cf0241378ac90068505d64c54d 100644 (file)
--- a/simple_v_extension/remap.mdwn
+++ b/simple_v_extension/remap.mdwn
@@ -6,14 +6,13 @@ This section is under revision (and is optional)
  
  # REMAP CSR <a name="remap" />
  
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
  There is one 32-bit CSR which may be used to indicate which registers,
  if used in any operation, must be "reshaped" (re-mapped) from a linear
  form to a 2D or 3D transposed form, or "offset" to permit arbitrary
  access to elements within a register.
  
+The primary use of the REMAP and SHAPE CSRs is Matrix Multiplication and in-place reordering of sequential data.  Three SHAPE CSRs are provided so that a single FMAC instruction, in a single loop, can perform a 4x4 by 4x4 Matrix Multiplication, issuing 64 FMACs.
+
  The 32-bit REMAP CSR may reshape up to 3 registers:
  
  | 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
@@ -34,19 +33,20 @@ whilst the CSR Register entries pointing to them are disabled, instead.
  
  # SHAPE 1D/2D/3D vector-matrix remapping CSRs
  
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
  There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
  which have the same format.  When each SHAPE CSR is set entirely to zeros,
  remapping is disabled: the register's elements are a linear (1D) vector.
  
-| 26..24  | 23      | 22..16  | 15      | 14..8   | 7       | 6..0    |
-| ------- | --      | ------- | --      | ------- | --      | ------- |
-| permute | offs[2] | zdimsz  | offs[1] | ydimsz  | offs[0] | xdimsz  |
+| 29..24 | 23..21 | 20..18  | 17..12 | 11..6  | 5..0   |
+| ------ | ------ | ------- | ------ | ------ | ------ |
+| modulo | invxyz | permute | zdimsz | ydimsz | xdimsz |
+
+modulo causes the remapped index to wrap so that it remains within the range 0 to modulo-1. A value of zero disables the modulus.
  
-offs is a 3-bit field, spread out across bits 7, 15 and 23, which
-is added to the element index during the loop calculation.
+invxyz inverts the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments; otherwise it begins from xdimsz-1 and counts down to zero. Likewise invxyz[1] for y and invxyz[2] for z.
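The field layout in the table above can be unpacked with a few shifts and masks. The following is a minimal sketch only: `decode_shape` is a hypothetical helper name, the dimension fields are taken as offset by 1 (as this document describes for xdimsz, ydimsz and zdimsz), and it is an assumption that invxyz[0] is the least significant bit of the invxyz field.

```python
def decode_shape(csr):
    """Unpack a 32-bit SHAPE CSR value into its fields (hypothetical helper)."""
    xdimsz = csr & 0x3F             # bits 5..0
    ydimsz = (csr >> 6) & 0x3F      # bits 11..6
    zdimsz = (csr >> 12) & 0x3F     # bits 17..12
    permute = (csr >> 18) & 0x7     # bits 20..18 (raw encoding, not interpreted)
    invxyz = (csr >> 21) & 0x7      # bits 23..21
    modulo = (csr >> 24) & 0x3F     # bits 29..24 (zero: modulus disabled)
    return {
        # dimension fields are stored offset by 1
        "xdim": xdimsz + 1,
        "ydim": ydimsz + 1,
        "zdim": zdimsz + 1,
        "permute": permute,
        # assumption: invxyz[0] (x) is the lowest bit of the field
        "invxyz": [(invxyz >> i) & 1 for i in range(3)],
        "modulo": modulo,
    }
```

For example, a value built as `(5 << 24) | (0b010 << 21) | (1 << 18) | (3 << 6) | 3` decodes to xdim=4, ydim=4, zdim=1, with only the y direction inverted and modulo=5.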
+
+offs is a 3-bit field, spread out across bits 7, 15 and 23, which
+is added to the element index during the loop calculation. It is added prior to the dimensional remapping.
  
  xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
  that the array dimensionality for that dimension is 1.  A value of xdimsz=2
@@ -81,11 +81,17 @@ shows this more clearly, and may be executed as a python program:
      lims = [xdim, ydim, zdim]
      idxs = [0,0,0] # starting indices
      order = [1,0,2] # experiment with different permutations, here
-    offs = 0        # experiment with different offsets, here
+    modulo = 64     # experiment with different modulus, here
+    invxyz = [0,0,0] # experiment with inverting x, y or z, here
  
      for idx in range(xdim * ydim * zdim):
-        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
-        print new_idx,
+        ix = [0] * 3
+        for i in range(3):
+            ix[i] = idxs[i]
+            if invxyz[i]:
+                ix[i] = lims[i] - 1 - ix[i]
+        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
+        print(new_idx % modulo)
          for i in range(3):
              idxs[order[i]] = idxs[order[i]] + 1
              if (idxs[order[i]] != lims[order[i]]):
@@ -161,3 +167,38 @@ pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
  register-renaming will have an easier time dealing with this than
  DSP-style SIMD micro-architectures.
  
+# 4x4 Matrix to vec4 Multiply Example
+
+The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
+
+* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
+* FMAC f4, f0, f8, f4
+
+The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
+
+The permutation on SHAPE1 will increment f4 continuously, cycling through f4-f7 every four iterations of the hardware loop.
+
+At the same time, because there is no SHAPE on f8, the hardware loop will increment sequentially through the 16 values f8-f23 of the Matrix. The following equivalent sequence is thus issued:
+
+    fmac f4, f0, f8, f4
+    fmac f5, f0, f9, f5
+    fmac f6, f0, f10, f6
+    fmac f7, f0, f11, f7
+    fmac f4, f1, f12, f4
+    fmac f5, f1, f13, f5
+    fmac f6, f1, f14, f6
+    fmac f7, f1, f15, f7
+    fmac f4, f2, f16, f4
+    fmac f5, f2, f17, f5
+    fmac f6, f2, f18, f6
+    fmac f7, f2, f19, f7
+    fmac f4, f3, f20, f4
+    fmac f5, f3, f21, f5
+    fmac f6, f3, f22, f6
+    fmac f7, f3, f23, f7
+
+The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).
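The register pattern in the listing above can be checked with a short model. This sketch does not model the SHAPE hardware: it only reproduces the three index patterns described in the text (vector index advancing once every four iterations, accumulator index cycling every four, matrix index sequential). The function names are hypothetical, and `fmac rd, a, b, c` is assumed to compute `rd = a * b + c`.

```python
def operand_regs(i):
    """Register numbers (accumulator, vector, matrix) for loop iteration i."""
    acc = 4 + i % 4    # f4-f7, cycling every four iterations
    vec = 0 + i // 4   # f0-f3, advancing once per four iterations
    mat = 8 + i        # f8-f23, plain sequential access (no SHAPE)
    return acc, vec, mat


def fmac_sequence(vl=16):
    """List the equivalent scalar fmac instructions as text."""
    seq = []
    for i in range(vl):
        acc, vec, mat = operand_regs(i)
        seq.append("fmac f%d, f%d, f%d, f%d" % (acc, vec, mat, acc))
    return seq


def run_example(vec4, matrix16):
    """Execute the 16 FMACs on a toy register file and return f4-f7."""
    f = [0.0] * 32
    f[0:4] = vec4          # vector in f0-f3
    f[8:24] = matrix16     # matrix in f8-f23
    # f4-f7 start at zero: the accumulators must be initialised
    for i in range(16):
        acc, vec, mat = operand_regs(i)
        f[acc] = f[vec] * f[mat] + f[acc]   # assumed fmac semantics
    return f[4:8]
```

Under these assumptions each result element j (held in f4+j) is the sum over k of `vec4[k] * matrix16[4*k + j]`, so `run_example([1, 2, 3, 4], list(range(16)))` yields 80, 90, 100 and 110.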
+
+It should be clear that a 4x4 by 4x4 Matrix Multiply is effectively the same technique applied to four independent vectors. It can be done by setting VL=64, using an extra dimension on the SHAPE CSRs, and applying a rotating SHAPE CSR to f8 so that the Matrix is applied four times, computing the four columns' worth of vectors.
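As a rough illustration of the VL=64 case, the single loop below performs 64 multiply-accumulates using only index arithmetic. The index expressions are an assumption about what the REMAPed indices would need to produce; the SHAPE settings that generate them are not given here, and plain Python lists stand in for the register file. `matmul_single_loop` is a hypothetical name.

```python
def matmul_single_loop(vecs, mat):
    """Apply the vec4 technique to four vec4s in one 64-iteration loop.

    vecs: 16 values, four vec4s stored back to back.
    mat:  16 values, the 4x4 matrix as used in the vec4 example.
    """
    acc = [0.0] * 16                  # four accumulators per vector
    for i in range(64):
        v = i // 4                    # vector element: advances every 4 iterations
        m = i % 16                    # matrix element: wraps, so it is applied 4 times
        a = 4 * (i // 16) + i % 4     # accumulator: cycles within the current vector
        acc[a] += vecs[v] * mat[m]
    return acc
```

With an identity matrix (ones at positions 0, 5, 10 and 15) the output equals the input vectors, which is a quick sanity check of the index pattern.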