inversion limit wrong

[libreriscv.git] / simple_v_extension / remap.mdwn
diff --git a/simple_v_extension/remap.mdwn b/simple_v_extension/remap.mdwn

index 1c1742a24c6821250035cc47bfe219b13e87a460..b78280427f253c471116f69c9bd611c27b4542ad 100644 (file)
--- a/simple_v_extension/remap.mdwn
+++ b/simple_v_extension/remap.mdwn
@@ -6,14 +6,13 @@ This section is under revision (and is optional)
  
  # REMAP CSR <a name="remap" />
  
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
  There is one 32-bit CSR which may be used to indicate which registers,
  if used in any operation, must be "reshaped" (re-mapped) from a linear
  form to a 2D or 3D transposed form, or "offset" to permit arbitrary
  access to elements within a register.
  
+Their primary use is for Matrix Multiplication, reordering of sequential data in-place.  Three CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.
+
  The 32-bit REMAP CSR may reshape up to 3 registers:
  
  | 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
@@ -34,63 +33,53 @@ whilst the CSR Register entries pointing to them are disabled, instead.
  
  # SHAPE 1D/2D/3D vector-matrix remapping CSRs
  
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
  There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
-which have the same format.  When each SHAPE CSR is set entirely to zeros,
-remapping is disabled: the register's elements are a linear (1D) vector.
-
-| 26..24  | 23      | 22..16  | 15      | 14..8   | 7       | 6..0    |
-| ------- | --      | ------- | --      | ------- | --      | ------- |
-| permute | offs[2] | zdimsz  | offs[1] | ydimsz  | offs[0] | xdimsz  |
-
-offs is a 3-bit field, spread out across bits 7, 15 and 23, which
-is added to the element index during the loop calculation.
-
-xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
-that the array dimensionality for that dimension is 1.  A value of xdimsz=2
-would indicate that in the first dimension there are 3 elements in the
-array.  The format of the array is therefore as follows:
-
-    array[xdim+1][ydim+1][zdim+1]
-
-However whilst illustrative of the dimensionality, that does not take the
-"permute" setting into account.  "permute" may be any one of six values
-(0-5, with values of 6 and 7 being reserved, and not legal).  The table
-below shows how the permutation dimensionality order works:
-
-| permute | order | array format             |
-| ------- | ----- | ------------------------ |
-| 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
-| 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
-| 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
-| 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
-| 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
-| 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
-
-In other words, the "permute" option changes the order in which
-nested for-loops over the array would be done.  The algorithm below
-shows this more clearly, and may be executed as a python program:
-
-    # mapidx = REMAP.shape2
-    xdim = 3 # SHAPE[mapidx].xdim_sz+1
-    ydim = 4 # SHAPE[mapidx].ydim_sz+1
-    zdim = 5 # SHAPE[mapidx].zdim_sz+1
+which have the same format.  
+
+[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
+
+The algorithm below shows how REMAP works more clearly, and may be
+executed as a python program:
+
+    xdim = 3
+    ydim = 4
+    zdim = 1
  
      lims = [xdim, ydim, zdim]
      idxs = [0,0,0] # starting indices
-    order = [1,0,2] # experiment with different permutations, here
-    offs = 0        # experiment with different offsets, here
+    order = [0,1,2] # experiment with different permutations, here
+    offset = 2     # experiment with different offset, here
+    VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
+    applydim = 0
+    invxyz = [0,0,0]
+
+    # run for offset iterations before actually starting
+    for idx in range(offset):
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            idxs[order[i]] = 0
+
+    break_count = 0
  
-    for idx in range(xdim * ydim * zdim):
-        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
+    for idx in range(VL):
+        ix = [0] * 3
+        for i in range(3):
+            if i >= applydim:
+                ix[i] = idxs[i]
+            if invxyz[i]:
+                ix[i] = lims[i] - 1 - ix[i]
+        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
          print new_idx,
+        break_count += 1
+        if break_count == lims[order[0]]:
+            print
+            break_count = 0
          for i in range(3):
              idxs[order[i]] = idxs[order[i]] + 1
              if (idxs[order[i]] != lims[order[i]]):
                  break
-            print
              idxs[order[i]] = 0
  
  Here, it is assumed that this algorithm be run within all pseudo-code
@@ -161,3 +150,38 @@ pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
  register-renaming will have an easier time dealing with this than
  DSP-style SIMD micro-architectures.
  
+# 4x4 Matrix to vec4 Multiply Example
+
+The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
+
+* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
+* FMAC f4, f0, f8, f4
+
+The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
+
+The permutation on SHAPE1 will increment f4 continuously cycling through f4-f7 every four iterations of the hardware loop.
+
+At the same time, VL will, because there is no SHAPE on f8, increment straight sequentially through the 16 values f8-f23 in the Matrix. The equivalent sequence thus is issued:
+
+    fmac f4, f0, f8, f4
+    fmac f5, f0, f9, f5
+    fmac f6, f0, f10, f6
+    fmac f7, f0, f11, f7
+    fmac f4, f1, f12, f4
+    fmac f5, f1, f13, f5
+    fmac f6, f1, f14, f6
+    fmac f7, f1, f15, f7
+    fmac f4, f2, f16, f4
+    fmac f5, f2, f17, f5
+    fmac f6, f2, f18, f6
+    fmac f7, f2, f19, f7
+    fmac f4, f3, f20, f4
+    fmac f5, f3, f21, f5
+    fmac f6, f3, f22, f6
+    fmac f7, f3, f23, f7
+
+The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).
+
+It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 CSRs, and applying a rotating 1D SHAPE CSR of xdim=16 to f8 in order to get it to apply four times to compute the four columns worth of vectors.