From c38a2d9661c87453ed3bbc11bc3974f07d5cc130 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Tue, 16 Oct 2018 01:31:36 +0100
Subject: [PATCH] add reshaping section

---
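Note for reviewers: the REMAP/SHAPE index algorithm added in this patch
can be sketched as a reusable function, which makes it easier to check the
permute table by hand. This is only an illustration, not part of the
specification text. The names `remap_indices` and `PERMUTE_ORDER` are
hypothetical, and Python 3 is assumed. The index arithmetic follows the
nested-loop algorithm in the new "SHAPE" section.

```python
# Hypothetical helper names; only the index arithmetic comes from the patch.
PERMUTE_ORDER = {
    0b000: (0, 1, 2),
    0b001: (0, 2, 1),
    0b010: (1, 0, 2),
    0b011: (1, 2, 0),
    0b100: (2, 0, 1),
    0b101: (2, 1, 0),
}


def remap_indices(xdimsz, ydimsz, zdimsz, permute, vl):
    """Return the remapped element index for each of vl loop iterations."""
    if permute not in PERMUTE_ORDER:
        raise ValueError("permute values 6 and 7 are reserved")
    # The SHAPE fields are stored offset by 1: a field value of 0 means size 1.
    lims = [xdimsz + 1, ydimsz + 1, zdimsz + 1]
    order = PERMUTE_ORDER[permute]
    idxs = [0, 0, 0]
    result = []
    for _ in range(vl):
        result.append(idxs[0] + idxs[1] * lims[0] + idxs[2] * lims[0] * lims[1])
        # Advance the dimensions in "permute" order, carrying on wrap-around.
        for dim in order:
            idxs[dim] += 1
            if idxs[dim] != lims[dim]:
                break
            idxs[dim] = 0
    return result


print(remap_indices(2, 1, 0, 0b000, 6))  # permute 000: linear order
print(remap_indices(2, 1, 0, 0b010, 6))  # permute 010: 3x2 block, y first
print(remap_indices(2, 1, 0, 0b010, 8))  # VL larger than the 6 elements
```

With a 3x2 shape, permute 000 leaves the indices in linear order, permute
010 walks the y dimension first, and a VL larger than the six elements
wraps around so that the first elements are presented again.
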
 simple_v_extension/specification.mdwn | 135 +++++++++++++++++++++++---
 1 file changed, 122 insertions(+), 13 deletions(-)

diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn
index fef048cd2..b46d937f1 100644
--- a/simple_v_extension/specification.mdwn
+++ b/simple_v_extension/specification.mdwn
@@ -65,6 +65,7 @@ tables which are used at the register decode phase.
 
 * A register CSR key-value table (8 32-bit CSRs of 2 16-bits each)
 * A predication CSR key-value table (again, 8 32-bit CSRs of 2 16-bits each)
+* A "reshaping" (REMAP) CSR, together with three SHAPE CSRs
 
 There are also four additional CSRs for User-Mode:
 
@@ -416,6 +417,111 @@ for the storage of comparisions: in these specific circumstances
 the requirement for there to be an active CSR *register* entry
 is removed.
 
+## REMAP CSR
+
+There is one 32-bit CSR which may be used to indicate which registers,
+if used in any operation, must be "reshaped" (re-mapped) from a linear
+form to a 2D or 3D transposed form.  The 32-bit REMAP CSR may reshape
+up to 3 registers:
+
+| 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
+| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
+| shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
+
+regidx0-2 refer not to the Register CSR CAM entry but to the underlying
+*real* register (see regidx, the value) and consequently are 7 bits wide.
+shape0-2 each select one of three SHAPE CSRs.  A value of 0x3 is reserved.
+Bits 7, 15, 23, 30 and 31 are reserved, and must be set to zero.
+
+## SHAPE 1D/2D/3D vector-matrix remapping CSRs
+
+There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
+which have the same format.  When a SHAPE CSR is set entirely to zeros,
+remapping is disabled: the register's elements are a linear (1D) vector.
+
+| 26..24  | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
+| ------- | -- | ------- | -- | ------- | -- | ------- |
+| permute | 0  | zdimsz  | 0  | ydimsz  | 0  | xdimsz  |
+
+xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
+that the size of that dimension is 1.  A value of xdimsz=2
+would indicate that in the first dimension there are 3 elements in the
+array.  The format of the array is therefore as follows:
+
+    array[xdimsz+1][ydimsz+1][zdimsz+1]
+
+However, whilst illustrative of the dimensionality, this does not take the
+"permute" setting into account.  "permute" may be any one of six values
+(0-5, with values of 6 and 7 being reserved, and not legal).  The table
+below shows how the permutation dimensionality order works:
+
+| permute | order | array format                   |
+| ------- | ----- | ------------------------------ |
+| 000     | 0,1,2 | (xdimsz+1)(ydimsz+1)(zdimsz+1) |
+| 001     | 0,2,1 | (xdimsz+1)(zdimsz+1)(ydimsz+1) |
+| 010     | 1,0,2 | (ydimsz+1)(xdimsz+1)(zdimsz+1) |
+| 011     | 1,2,0 | (ydimsz+1)(zdimsz+1)(xdimsz+1) |
+| 100     | 2,0,1 | (zdimsz+1)(xdimsz+1)(ydimsz+1) |
+| 101     | 2,1,0 | (zdimsz+1)(ydimsz+1)(xdimsz+1) |
+
+In other words, the "permute" option changes the order in which
+nested for-loops over the array would be done.  The algorithm below
+shows this more clearly, and may be executed as a python program:
+
+    # mapidx = REMAP.shape2
+    xdim = 3 # SHAPE[mapidx].xdimsz+1
+    ydim = 4 # SHAPE[mapidx].ydimsz+1
+    zdim = 5 # SHAPE[mapidx].zdimsz+1
+
+    lims = [xdim, ydim, zdim]
+    idxs = [0,0,0] # current x, y, z position
+    order = [1,0,2] # permute = 010
+
+    for idx in range(xdim * ydim * zdim):
+        new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
+        print(new_idx, end=" ")
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            print()
+            idxs[order[i]] = 0
+
+Throughout this document, wherever pseudo-code contains a (parallelism)
+for-loop that would normally run from 0 to VL-1 and use the loop index
+to refer to contiguous register elements, it is assumed that this
+algorithm is applied: where REMAP indicates to do so, the element index
+is run through the above algorithm to work out the **actual** element
+index.  Given that there are three possible SHAPE entries, up to
+three separate registers in any given operation may be simultaneously
+remapped.
+
+In this way, 2D matrices may be transposed "in-place" for one operation,
+followed by setting a different permutation order without having to
+move the values in the registers to or from memory.  Also, the reason
+for having REMAP separate from the three SHAPE CSRs is so that in a
+chain of matrix multiplications and additions, for example, the SHAPE
+CSRs need only be set up once; only the REMAP CSR need be changed to
+target different registers.
+
+Note that:
+
+* If permute option 000 is utilised, the actual order of the
+  reindexing does not change!
+* If two or more of xdimsz, ydimsz and zdimsz are zero, the order does not change!
+* Setting the total elements (xdimsz+1) times (ydimsz+1) times (zdimsz+1)
+  to less than MVL is **perfectly legal**, albeit very obscure.  It permits
+  entries to be regularly presented to operands **more than once**, thus
+  allowing the same underlying registers to act as an accumulator of
+  multiple vector or matrix operations, for example.
+
+Clearly here some considerable care needs to be taken as the remapping
+could hypothetically create arithmetic operations that target the
+exact same underlying registers, resulting in data corruption due to
+pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
+register-renaming will have an easier time dealing with this than
+DSP-style SIMD micro-architectures.
+
 # Instruction Execution Order
 
 Simple-V behaves as if it is a hardware-level "macro expansion system",
@@ -484,6 +590,22 @@ All other operations using registers are automatically parallelised.
 This includes AMOMAX, AMOSWAP and so on, where particular care and
 attention must be paid.
 
+Example pseudo-code for an integer ADD operation (including scalar operations).
+Floating-point uses the FP CSRs.
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+
 ## Instruction Format
 
 There are **no operations added to SV, at all**.
@@ -881,16 +1003,3 @@ for element-grouping, if there is unused space within a register
   they are set to all ones on write and are ignored on read, matching the
   existing standard for storing smaller FP values in larger registers.
 
-## 2D remapping
-
-idea: have an additional set of register CSRs that indicate that, instead of
-a straight 1D linear relationship, the element index is put through
-a (reasonably simple) 2D processing algorithm.  in this way, 4x3 blocks
-of registers can have the ordering changed to 3x4 for example:
-
-    xdim = CSRtb[reg].x_sz
-    ydim = CSRtb[reg].y_sz
-
-    for idx in range(VL):
-        new_idx = (idx % xdim) * ydim + (idx / xdim)
-
-- 
2.30.2