reword SHAPE, reintroduce offset

diff --git a/simple_v_extension/remap.mdwn b/simple_v_extension/remap.mdwn

index 8447b6907400cf766b1774f0c9d85d77ed8ad856..27b183bd2a15d9b381bee10518e63d796a14cf96 100644
--- a/simple_v_extension/remap.mdwn
+++ b/simple_v_extension/remap.mdwn
@@ -11,7 +11,7 @@ if used in any operation, must be "reshaped" (re-mapped) from a linear
  form to a 2D or 3D transposed form, or "offset" to permit arbitrary
  access to elements within a register.
  
-Their primary use is for Matrix Multiplication, reordering of sequential data in-place.  Three CSRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs
+Their primary uses are Matrix Multiplication and in-place reordering of sequential data.  Three CSRs are provided so that a single FMAC instruction may be used in a single loop to perform a 4x4 by 4x4 Matrix multiplication, generating 64 FMACs.
  
  The 32-bit REMAP CSR may reshape up to 3 registers:
  
@@ -34,62 +34,52 @@ whilst the CSR Register entries pointing to them are disabled, instead.
  # SHAPE 1D/2D/3D vector-matrix remapping CSRs
  
  There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
-which have the same format.  When each SHAPE CSR is set entirely to zeros,
-remapping is disabled: the register's elements are a linear (1D) vector.
+which have the same format.  
  
-| 31..25 | 24..22  | 21-18   | 17..12  | 11..6   | 5..0    |
-| ------ | ------- | --      | ------- | ------- | --      | ------- |
-| modulo | permute | offs    | zdimsz  | ydimsz  | xdimsz  |
+[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
  
-modulo is applied to the output, causing it to cycle within the range 0..modulo-1. Note that zero indicates "unlimited". With VL being a maximum of 64, modulo is also 6 bits. Modulo is applied after dimensional remapping.
+The algorithm below shows how REMAP works more clearly, and may be
+executed as a Python 2 program:
  
-offs is a 4-bit field, spread out across bits 7, 15 and 23, which
-is added to the element index during the loop calculation. It is added prior to the dimensional remapping.
-
-xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
-that the array dimensionality for that dimension is 1.  A value of xdimsz=2
-would indicate that in the first dimension there are 3 elements in the
-array.  The format of the array is therefore as follows:
-
-    array[xdim+1][ydim+1][zdim+1]
-
-However whilst illustrative of the dimensionality, that does not take the
-"permute" setting into account.  "permute" may be any one of six values
-(0-5, with values of 6 and 7 being reserved, and not legal).  The table
-below shows how the permutation dimensionality order works:
-
-| permute | order | array format             |
-| ------- | ----- | ------------------------ |
-| 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
-| 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
-| 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
-| 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
-| 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
-| 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
-
-In other words, the "permute" option changes the order in which
-nested for-loops over the array would be done.  The algorithm below
-shows this more clearly, and may be executed as a python program:
-
-    # mapidx = REMAP.shape2
-    xdim = 3 # SHAPE[mapidx].xdim_sz+1
-    ydim = 4 # SHAPE[mapidx].ydim_sz+1
-    zdim = 5 # SHAPE[mapidx].zdim_sz+1
+    xdim = 3
+    ydim = 4
+    zdim = 1
  
      lims = [xdim, ydim, zdim]
      idxs = [0,0,0] # starting indices
-    order = [1,0,2] # experiment with different permutations, here
-    offs = 0        # experiment with different offsets, here
-    modulo = 64     # set different modulus, here
-
-    for idx in range(xdim * ydim * zdim):
-        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
-        print new_idx % modulo
+    order = [0,1,2] # experiment with different permutations, here
+    offset = 2     # experiment with different offset, here
+    VL = xdim * ydim * zdim
+    applydim = 0
+    invxyz = [0,0,0]
+
+    # run for offset iterations before actually starting
+    for idx in range(offset):
          for i in range(3):
              idxs[order[i]] = idxs[order[i]] + 1
              if (idxs[order[i]] != lims[order[i]]):
                  break
+            idxs[order[i]] = 0
+
+    break_count = 0
+
+    for idx in range(VL):
+        ix = [0] * 3
+        for i in range(3):
+            if i >= applydim:
+                ix[i] = idxs[i]
+            if invxyz[i]:
+                ix[i] = lims[i] - ix[i]
+        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
+        print new_idx,
+        break_count += 1
+        if break_count == lims[order[0]]:
              print
+            break_count = 0
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
              idxs[order[i]] = 0
  
  Here, it is assumed that this algorithm be run within all pseudo-code
@@ -160,3 +150,38 @@ pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
  register-renaming will have an easier time dealing with this than
  DSP-style SIMD micro-architectures.
  
+# 4x4 Matrix to vec4 Multiply Example
+
+The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
+
+* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
+* FMAC f4, f0, f8, f4
+
+The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
+
+The permutation on SHAPE1 will increment f4 continuously, cycling through f4-f7 every four iterations of the hardware loop.
+
+At the same time, because there is no SHAPE on f8, the hardware loop will step sequentially through the 16 values f8-f23 of the Matrix. The following equivalent sequence is thus issued:
+
+    fmac f4, f0, f8, f4
+    fmac f5, f0, f9, f5
+    fmac f6, f0, f10, f6
+    fmac f7, f0, f11, f7
+    fmac f4, f1, f12, f4
+    fmac f5, f1, f13, f5
+    fmac f6, f1, f14, f6
+    fmac f7, f1, f15, f7
+    fmac f4, f2, f16, f4
+    fmac f5, f2, f17, f5
+    fmac f6, f2, f18, f6
+    fmac f7, f2, f19, f7
+    fmac f4, f3, f20, f4
+    fmac f5, f3, f21, f5
+    fmac f6, f3, f22, f6
+    fmac f7, f3, f23, f7
+
+The only other requirement is that f4-f7 be initialised (usually to zero) beforehand.
+
+It should be clear that a 4x4 by 4x4 Matrix Multiply is effectively the same technique applied to four independent vectors. It can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 CSRs, and applying a rotating 1D SHAPE CSR of xdim=16 to f8, so that the matrix is re-used four times to compute the four columns' worth of vectors.
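The expanded sequence in the 4x4 Matrix to vec4 example can be cross-checked with a short Python 3 sketch. It does not model the SHAPE CSRs. It reproduces the listed register numbers with plain index arithmetic (destination cycling every element, vector source stepping every four elements, matrix source linear) and then executes them on a toy register file. The helper names `fmac_sequence` and `run`, the sample values, and the semantics `rd = rs1 * rs2 + rs3` are illustrative assumptions, not part of the specification text.

```python
def fmac_sequence(vl=16, n=4, dest_base=4, vec_base=0, mat_base=8):
    """Return (rd, rs1, rs2, rs3) register numbers for each element of the loop.

    Hypothetical helper: mirrors the sequence listed in the example, it does
    not implement the REMAP/SHAPE algorithm itself.
    """
    ops = []
    for i in range(vl):
        rd = dest_base + i % n    # accumulator cycles through f4..f7
        rs1 = vec_base + i // n   # vector element holds for n iterations
        rs2 = mat_base + i        # matrix is walked linearly, f8..f23
        ops.append((rd, rs1, rs2, rd))
    return ops


def run(ops, regs):
    """Execute the ops on a list of floats, assuming fmac rd = rs1*rs2 + rs3."""
    for rd, rs1, rs2, rs3 in ops:
        regs[rd] = regs[rs1] * regs[rs2] + regs[rs3]
    return regs


ops = fmac_sequence()
for rd, rs1, rs2, rs3 in ops:
    print(f"fmac f{rd}, f{rs1}, f{rs2}, f{rs3}")

# Toy data (assumed for illustration): vec4 in f0-f3, accumulators f4-f7
# cleared to zero, 16 matrix values in f8-f23.
regs = [0.0] * 32
regs[0:4] = [1.0, 2.0, 3.0, 4.0]
regs[8:24] = [float(v) for v in range(1, 17)]
regs = run(ops, regs)
print(regs[4:8])
```

Each accumulator ends up as the sum of vec[i] * mat[4*i + j] over i, which makes it easy to confirm which storage order of the matrix a given set of SHAPE settings actually implements.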