From ce1a85a5f0fa63223dee7abeadc160672a69147b Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sat, 9 Jan 2021 22:14:06 +0000
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 161 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 161 insertions(+)
 create mode 100644 openpower/sv/remap.mdwn
diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
new file mode 100644
index 000000000..211afecb2
--- /dev/null
+++ b/openpower/sv/remap.mdwn
@@ -0,0 +1,161 @@
+[[!tag standards]]
+
+# REMAP <a name="remap" />
+
+REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped) from a linear
+form to a 2D or 3D transposed form, or "offset" to permit arbitrary
+access to elements, independently on each Vector src or dest register.
+
+Its primary uses are Matrix Multiplication and in-place reordering of sequential data.  Four SPRs are provided so that a single FMAC may be used in a single loop to perform a 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.  Additional uses include regular "Structure Packing" such as RGB pixel data extraction and reforming.
+
+# SHAPE 1D/2D/3D vector-matrix remapping SPRs
+
+There are four "shape" SPRs, SHAPE0-3, each 32 bits wide,
+all of which have the same format.
+
+[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
+
+The algorithm below shows how REMAP works more clearly, and may be
+executed as a python program:
+
+    xdim = 3
+    ydim = 4
+    zdim = 1
+
+    lims = [xdim, ydim, zdim]
+    idxs = [0,0,0] # starting indices
+    order = [0,1,2] # experiment with different permutations, here
+    offset = 2     # experiment with different offset, here
+    VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
+    applydim = 0
+    invxyz = [0,0,0]
+
+    # run for offset iterations before actually starting
+    for idx in range(offset):
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            idxs[order[i]] = 0
+
+    break_count = 0
+
+    for idx in range(VL):
+        ix = [0] * 3
+        for i in range(3):
+            if i >= applydim:
+                ix[i] = idxs[i]
+            if invxyz[i]:
+                ix[i] = lims[i] - 1 - ix[i]
+        new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
+        print(new_idx, end=" ")
+        break_count += 1
+        if break_count == lims[order[0]]:
+            print()
+            break_count = 0
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            idxs[order[i]] = 0
+
+This algorithm is assumed to run within all pseudo-code throughout
+this document, wherever a (parallelism) for-loop would normally
+run from 0 to VL-1 to refer to contiguous register elements.
+Where REMAP indicates to do so, the element index is instead run
+through the above algorithm to work out the **actual** element
+index.  Given that there are four possible SHAPE entries, up to
+four separate registers in any given operation may be simultaneously
+remapped:
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      ...
+      ...
+      for (i = 0; i < VL; i++)
+        xSTATE.srcoffs = i # save context
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
+                                 ireg[rs2+remap(irs2)];
+           if (!int_vec[rd ].isvector) break;
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+By changing remappings, 2D matrices may be transposed "in-place" for one
+operation, and a different permutation order then set for the next, without
+having to move the values in the registers to or from memory.
+
+Note that:
+
+* Over-running the register file clearly has to be detected and
+  an illegal instruction exception thrown
+* When non-default elwidths are set, the exact same algorithm still
+  applies (i.e. it offsets elements *within* registers rather than
+  entire registers).
+* If permute option 000 is utilised, the actual order of the
+  reindexing does not change.  However, modulo MVL still occurs
+  which will result in repeated operations (use with caution).
+* If two or more dimensions are set to zero, the actual order does not change!
+* The above algorithm is pseudo-code **only**.  Actual implementations
+  will need to take into account the fact that the element for-looping
+  must be **re-entrant**, due to the possibility of exceptions occurring.
+  See SVSTATE SPR, which records the current element index.
+* Twin-predicated operations require **two** separate and distinct
+  element offsets.  The above pseudo-code algorithm will be applied
+  separately and independently to each, should each of the two
+  operands be remapped.  *This even includes C.LDSP* and other operations
+  in that category, where in that case it will be the **offset** that is
+  remapped (see Compressed Stack LOAD/STORE section).
+* Offset is especially useful, on its own, for accessing elements
+  within the middle of a register.  Without offsets, it is necessary
+  to either use a predicated MV, skipping the first elements, or
+  perform a LOAD/STORE cycle to memory.
+  With offsets, the data does not have to be moved.
+* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
+  less than MVL is **perfectly legal**, albeit very obscure.  It permits
+  entries to be regularly presented to operands **more than once**, thus
+  allowing the same underlying registers to act as an accumulator of
+  multiple vector or matrix operations, for example.
+
+Clearly here some considerable care needs to be taken as the remapping
+could hypothetically create arithmetic operations that target the
+exact same underlying registers, resulting in data corruption due to
+pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
+register-renaming will have an easier time dealing with this than
+DSP-style SIMD micro-architectures.
+
+# 4x4 Matrix to vec4 Multiply Example
+
+The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
+
+* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
+* FMAC f4, f0, f8, f4
+
+The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
+
+The permutation on SHAPE1 will increment the f4 index continuously, cycling through f4-f7 every four iterations of the hardware loop.
+
+At the same time, because there is no SHAPE on f8, the element index for f8 will increment sequentially through the 16 values f8-f23 of the Matrix. The following equivalent sequence is thus issued:
+
+    fmac f4, f0, f8, f4
+    fmac f5, f0, f9, f5
+    fmac f6, f0, f10, f6
+    fmac f7, f0, f11, f7
+    fmac f4, f1, f12, f4
+    fmac f5, f1, f13, f5
+    fmac f6, f1, f14, f6
+    fmac f7, f1, f15, f7
+    fmac f4, f2, f16, f4
+    fmac f5, f2, f17, f5
+    fmac f6, f2, f18, f6
+    fmac f7, f2, f19, f7
+    fmac f4, f3, f20, f4
+    fmac f5, f3, f21, f5
+    fmac f6, f3, f22, f6
+    fmac f7, f3, f23, f7
+
+The only other requirement is that f4-f7 be initialised beforehand (usually to zero).
+
+It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs, and applying a rotating 1D SHAPE SPR of xdim=16 to f8 so that it is applied four times, computing the four columns' worth of vectors.
-- 
2.30.2