--- /dev/null
+# NOTE
+
+This section is optional, and is under revision.
+
+# REMAP CSR <a name="remap" />
+
+(Note: both the REMAP and SHAPE sections are best read after the
+ rest of the document has been read)
+
+There is one 32-bit CSR which may be used to indicate which registers,
+if used in any operation, must be "reshaped" (re-mapped) from a linear
+form to a 2D or 3D transposed form, or "offset" to permit arbitrary
+access to elements within a register.
+
+The 32-bit REMAP CSR may reshape up to 3 registers:
+
+| 29..28 | 27..26 | 25..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
+| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
+| shape2 | shape1 | shape0 | 0 | regidx2 | 0 | regidx1 | 0 | regidx0 |
+
+regidx0-2 refer not to the Register CSR CAM entry but to the underlying
+*real* register (see regidx, the value) and are consequently 7 bits wide.
+Since reshaping x0 would be pointless, a regidx of zero (which would
+otherwise refer to x0) is used to indicate "disabled".
+shape0-2 each select one of the three SHAPE CSRs. A value of 0x3 is reserved.
+Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
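+
+As an illustration of the field layout above, the following is a minimal
+Python sketch that splits a REMAP value into its three entries. The name
+`decode_remap`, the dictionary keys and the use of `ValueError` for the
+reserved encodings are assumptions made for this example only: the text
+above does not say how an implementation reports a reserved value.
+
```python
# Reserved bits 7, 15, 23, 30 and 31 must be zero.
RESERVED_REMAP_BITS = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)


def decode_remap(csr):
    """Split a 32-bit REMAP value into three regidx/shape entries."""
    if csr & RESERVED_REMAP_BITS:
        # Assumption: a software model simply rejects reserved bits.
        raise ValueError("reserved REMAP bits must be zero")
    entries = []
    for n in range(3):
        regidx = (csr >> (8 * n)) & 0x7F      # bits 6..0, 14..8, 22..16
        shape = (csr >> (24 + 2 * n)) & 0x3   # bits 25..24, 27..26, 29..28
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")
        entries.append({"regidx": regidx, "shape": shape,
                        "enabled": regidx != 0})  # regidx 0 means disabled
    return entries


# Remap x5 using SHAPE0 and x12 using SHAPE2; the middle entry is disabled.
example = (2 << 28) | (12 << 16) | (0 << 8) | 5
for entry in decode_remap(example):
    print(entry)
```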
+
+It is anticipated that these specialist CSRs will not be used very often.
+Unlike the CSR Register and Predication tables, the REMAP CSR uses
+the full 7-bit regidx so that it can be set once and left alone,
+whilst the CSR Register entries pointing to those registers are disabled, instead.
+
+# SHAPE 1D/2D/3D vector-matrix remapping CSRs
+
+(Note: both the REMAP and SHAPE sections are best read after the
+ rest of the document has been read)
+
+There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
+which all have the same format. When a SHAPE CSR is set entirely to zeros,
+remapping is disabled: the register's elements are a linear (1D) vector.
+
+| 26..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
+| ------- | -- | ------- | -- | ------- | -- | ------- |
+| permute | offs[2] | zdimsz | offs[1] | ydimsz | offs[0] | xdimsz |
+
+offs is a 3-bit field, spread out across bits 7, 15 and 23, which
+is added to the element index during the loop calculation.
+
+xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
+that the size of that dimension is 1. A value of xdimsz=2
+would indicate that in the first dimension there are 3 elements in the
+array. Writing xdim, ydim and zdim for the values of the xdimsz, ydimsz
+and zdimsz fields, the format of the array is therefore as follows:
+
+    array[xdim+1][ydim+1][zdim+1]
+
+However, whilst this illustrates the dimensionality, it does not take the
+"permute" setting into account. "permute" may be any one of six values
+(0-5; values 6 and 7 are reserved, and not legal). The table
+below shows how the permutation dimensionality order works:
+
+| permute | order | array format |
+| ------- | ----- | ------------------------ |
+| 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
+| 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
+| 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
+| 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
+| 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
+| 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
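+
+The SHAPE field layout and the permute table above can be sketched in
+Python as follows. `decode_shape`, `PERMUTE_ORDERS` and the returned
+dictionary keys are hypothetical names chosen for this example, and
+raising `ValueError` on a reserved permute value is an assumption. Bits
+27 to 31 are not described above, so this sketch ignores them.
+
```python
# permute encoding -> order in which the three dimensions are incremented
PERMUTE_ORDERS = {
    0b000: (0, 1, 2),
    0b001: (0, 2, 1),
    0b010: (1, 0, 2),
    0b011: (1, 2, 0),
    0b100: (2, 0, 1),
    0b101: (2, 1, 0),
}


def decode_shape(csr):
    """Decode a SHAPE value into real dimension sizes, offset and order."""
    permute = (csr >> 24) & 0x7
    if permute not in PERMUTE_ORDERS:
        raise ValueError("permute values 6 and 7 are reserved")
    # offs is spread over bits 7, 15 and 23 (offs[0], offs[1], offs[2]).
    offs = (((csr >> 7) & 1)
            | (((csr >> 15) & 1) << 1)
            | (((csr >> 23) & 1) << 2))
    return {
        "xdim": (csr & 0x7F) + 1,          # fields are stored minus one
        "ydim": ((csr >> 8) & 0x7F) + 1,
        "zdim": ((csr >> 16) & 0x7F) + 1,
        "offs": offs,
        "order": PERMUTE_ORDERS[permute],
    }


# xdimsz=2, ydimsz=3, zdimsz=0, offs=1, permute=010
print(decode_shape((0b010 << 24) | (3 << 8) | (1 << 7) | 2))
# An all-zero SHAPE decodes to a 1x1x1 shape, offset 0, order 0,1,2.
print(decode_shape(0))
```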
+
+In other words, the "permute" option changes the order in which
+nested for-loops over the array would be done. The algorithm below
+shows this more clearly, and may be executed as a python program:
+
+    # mapidx = REMAP.shape2
+    xdim = 3 # SHAPE[mapidx].xdimsz+1
+    ydim = 4 # SHAPE[mapidx].ydimsz+1
+    zdim = 5 # SHAPE[mapidx].zdimsz+1
+
+    lims = [xdim, ydim, zdim]
+    idxs = [0,0,0] # starting indices
+    order = [1,0,2] # experiment with different permutations, here
+    offs = 0        # experiment with different offsets, here
+
+    for idx in range(xdim * ydim * zdim):
+        # linear element index for the current x, y, z position
+        new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
+        print(new_idx, end=" ")
+        # increment the dimensions in permuted order, carrying on rollover
+        for i in range(3):
+            idxs[order[i]] = idxs[order[i]] + 1
+            if (idxs[order[i]] != lims[order[i]]):
+                break
+            print()
+            idxs[order[i]] = 0
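+
+For experimentation it can be convenient to have the same algorithm as a
+function that returns the index sequence rather than printing it. The
+following is a sketch only: `remap_indices` and its parameter names are
+not part of this specification, and the dimensions passed in are the real
+sizes (field value plus one).
+
```python
def remap_indices(xdim, ydim, zdim, order=(0, 1, 2), offs=0):
    """Return the remapped element indices for one pass over the shape."""
    lims = [xdim, ydim, zdim]
    idxs = [0, 0, 0]
    result = []
    for _ in range(xdim * ydim * zdim):
        result.append(offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim)
        # Increment in permuted order; carry into the next dimension
        # whenever the current one rolls over.
        for dim in order:
            idxs[dim] += 1
            if idxs[dim] != lims[dim]:
                break
            idxs[dim] = 0
    return result


print(remap_indices(3, 2, 1))                   # [0, 1, 2, 3, 4, 5]
print(remap_indices(3, 2, 1, order=(1, 0, 2)))  # [0, 3, 1, 4, 2, 5]
print(remap_indices(3, 2, 1, offs=2))           # [2, 3, 4, 5, 6, 7]
```
+
+With order 1,0,2 the y index advances fastest, so a matrix stored with
+x as the fastest-varying index is read out in transposed order.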
+
+It is assumed that this algorithm is run within all pseudo-code
+throughout this document where a (parallelism) for-loop would normally
+run from 0 to VL-1 to refer to contiguous register
+elements: where REMAP indicates to do so, the element index
+is instead run through the above algorithm to work out the **actual** element
+index. Given that there are three possible SHAPE entries, up to
+three separate registers in any given operation may be simultaneously
+remapped:
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      ...
+      ...
+      for (i = 0; i < VL; i++)
+        xSTATE.srcoffs = i # save context
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
+                                 ireg[rs2+remap(irs2)];
+           if (!int_vec[rd ].isvector) break;
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
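+
+To make the effect of the pseudo-code above concrete, the following is a
+heavily simplified Python model. It assumes all three operands are
+vectors, ignores predication and element widths, and models the register
+file as a plain list with one element per register. The names
+`remap_indices` and `remapped_add` are hypothetical.
+
```python
def remap_indices(xdim, ydim, zdim, order=(0, 1, 2), offs=0):
    """Index sequence produced by the SHAPE algorithm (one full pass)."""
    lims, idxs, result = [xdim, ydim, zdim], [0, 0, 0], []
    for _ in range(xdim * ydim * zdim):
        result.append(offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim)
        for dim in order:
            idxs[dim] += 1
            if idxs[dim] != lims[dim]:
                break
            idxs[dim] = 0
    return result


def remapped_add(regs, rd, rs1, rs2, vl, map_d, map_s1, map_s2):
    """Element-wise add where each operand has its own index mapping."""
    for i in range(vl):
        regs[rd + map_d[i]] = regs[rs1 + map_s1[i]] + regs[rs2 + map_s2[i]]


regs = [0] * 32                           # toy register file
regs[8:14] = [1, 2, 3, 4, 5, 6]           # A: 2 rows of 3, x fastest
regs[16:22] = [10, 20, 30, 40, 50, 60]    # B: 3 rows of 2, linear

linear = list(range(6))                            # no remapping
transposed = remap_indices(3, 2, 1, order=(1, 0, 2))

# rd and rs2 are accessed linearly; only rs1 is remapped (read transposed).
remapped_add(regs, 24, 8, 16, 6, linear, transposed, linear)
print(regs[24:30])  # [11, 24, 32, 45, 53, 66]
```
+
+The values of A are never moved: only the order in which its elements
+are visited changes.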
+
+By changing remappings, 2D matrices may be transposed "in-place" for one
+operation, and a different permutation order then set for the next, without
+having to move the values in the registers to or from memory. Also,
+the reason for having REMAP separate from the three SHAPE CSRs is so
+that in a chain of matrix multiplications and additions, for example,
+the SHAPE CSRs need only be set up once; only the REMAP CSR need be
+changed to target different registers.
+
+Note that:
+
+* Over-running the register file must be detected and
+  an illegal instruction exception thrown.
+* When non-default elwidths are set, the exact same algorithm still
+ applies (i.e. it offsets elements *within* registers rather than
+ entire registers).
+* If permute option 000 is utilised, the actual order of the
+ reindexing does not change!
+* If two or more dimension fields are set to zero (a size of 1 in each),
+  the actual order does not change!
+* The above algorithm is pseudo-code **only**. Actual implementations
+ will need to take into account the fact that the element for-looping
+ must be **re-entrant**, due to the possibility of exceptions occurring.
+ See MSTATE CSR, which records the current element index.
+* Twin-predicated operations require **two** separate and distinct
+ element offsets. The above pseudo-code algorithm will be applied
+ separately and independently to each, should each of the two
+ operands be remapped. *This even includes C.LDSP* and other operations
+ in that category, where in that case it will be the **offset** that is
+ remapped (see Compressed Stack LOAD/STORE section).
+* Offset is especially useful, on its own, for accessing elements
+  within the middle of a register. Without offsets, it is necessary
+  either to use a predicated MV, skipping the first elements, or
+  to perform a LOAD/STORE cycle to memory.
+  With offsets, the data does not have to be moved.
+* Setting the total number of elements, (xdim+1) times (ydim+1) times
+  (zdim+1), to less than MVL is **perfectly legal**, albeit very obscure. It permits
+ entries to be regularly presented to operands **more than once**, thus
+ allowing the same underlying registers to act as an accumulator of
+ multiple vector or matrix operations, for example.
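+
+The last point can be illustrated with a short Python sketch. It assumes
+that the element loop keeps running for VL iterations and that the
+dimension indices simply wrap back to zero once every dimension has
+rolled over, which is how the algorithm above behaves if its outer loop
+is extended. `remap_stream` is a hypothetical name.
+
```python
def remap_stream(xdim, ydim, zdim, order, offs, vl):
    """Remapped indices for vl elements, wrapping when the shape is smaller."""
    lims, idxs, out = [xdim, ydim, zdim], [0, 0, 0], []
    for _ in range(vl):
        out.append(offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim)
        for dim in order:
            idxs[dim] += 1
            if idxs[dim] != lims[dim]:
                break
            idxs[dim] = 0   # after the last dimension this wraps to 0,0,0
    return out


# A 2x2x1 shape holds 4 elements; with VL=8 each one is presented twice.
dest_map = remap_stream(2, 2, 1, (0, 1, 2), 0, 8)
print(dest_map)  # [0, 1, 2, 3, 0, 1, 2, 3]

# Using the wrapped mapping on the destination turns it into an accumulator.
src = [1, 2, 3, 4, 5, 6, 7, 8]
acc = [0, 0, 0, 0]
for i, d in enumerate(dest_map):
    acc[d] += src[i]
print(acc)  # [6, 8, 10, 12]
```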
+
+Considerable care needs to be taken here, as the remapping
+could hypothetically create arithmetic operations that target the
+exact same underlying registers, resulting in data corruption due to
+pipeline overlaps. Out-of-order / Superscalar micro-architectures with
+register-renaming will have an easier time dealing with this than
+DSP-style SIMD micro-architectures.
+
See [[appendix]] for more details on fail-on-first modes, as well as
pseudo-code, below.
-## REMAP CSR <a name="remap" />
-
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
-There is one 32-bit CSR which may be used to indicate which registers,
-if used in any operation, must be "reshaped" (re-mapped) from a linear
-form to a 2D or 3D transposed form, or "offset" to permit arbitrary
-access to elements within a register.
-
-The 32-bit REMAP CSR may reshape up to 3 registers:
-
-| 29..28 | 27..26 | 25..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
-| ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
-| shape2 | shape1 | shape0 | 0 | regidx2 | 0 | regidx1 | 0 | regidx0 |
-
-regidx0-2 refer not to the Register CSR CAM entry but to the underlying
-*real* register (see regidx, the value) and consequently is 7-bits wide.
-When set to zero (referring to x0), clearly reshaping x0 is pointless,
-so is used to indicate "disabled".
-shape0-2 refers to one of three SHAPE CSRs. A value of 0x3 is reserved.
-Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
-
-It is anticipated that these specialist CSRs not be very often used.
-Unlike the CSR Register and Predication tables, the REMAP CSRs use
-the full 7-bit regidx so that they can be set once and left alone,
-whilst the CSR Register entries pointing to them are disabled, instead.
-
-## SHAPE 1D/2D/3D vector-matrix remapping CSRs
-
-(Note: both the REMAP and SHAPE sections are best read after the
- rest of the document has been read)
-
-There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
-which have the same format. When each SHAPE CSR is set entirely to zeros,
-remapping is disabled: the register's elements are a linear (1D) vector.
-
-| 26..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
-| ------- | -- | ------- | -- | ------- | -- | ------- |
-| permute | offs[2] | zdimsz | offs[1] | ydimsz | offs[0] | xdimsz |
-
-offs is a 3-bit field, spread out across bits 7, 15 and 23, which
-is added to the element index during the loop calculation.
-
-xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
-that the array dimensionality for that dimension is 1. A value of xdimsz=2
-would indicate that in the first dimension there are 3 elements in the
-array. The format of the array is therefore as follows:
-
- array[xdim+1][ydim+1][zdim+1]
-
-However whilst illustrative of the dimensionality, that does not take the
-"permute" setting into account. "permute" may be any one of six values
-(0-5, with values of 6 and 7 being reserved, and not legal). The table
-below shows how the permutation dimensionality order works:
-
-| permute | order | array format |
-| ------- | ----- | ------------------------ |
-| 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
-| 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
-| 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
-| 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
-| 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
-| 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
-
-In other words, the "permute" option changes the order in which
-nested for-loops over the array would be done. The algorithm below
-shows this more clearly, and may be executed as a python program:
-
- # mapidx = REMAP.shape2
- xdim = 3 # SHAPE[mapidx].xdim_sz+1
- ydim = 4 # SHAPE[mapidx].ydim_sz+1
- zdim = 5 # SHAPE[mapidx].zdim_sz+1
-
- lims = [xdim, ydim, zdim]
- idxs = [0,0,0] # starting indices
- order = [1,0,2] # experiment with different permutations, here
- offs = 0 # experiment with different offsets, here
-
- for idx in range(xdim * ydim * zdim):
- new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
- print new_idx,
- for i in range(3):
- idxs[order[i]] = idxs[order[i]] + 1
- if (idxs[order[i]] != lims[order[i]]):
- break
- print
- idxs[order[i]] = 0
-
-Here, it is assumed that this algorithm be run within all pseudo-code
-throughout this document where a (parallelism) for-loop would normally
-run from 0 to VL-1 to refer to contiguous register
-elements; instead, where REMAP indicates to do so, the element index
-is run through the above algorithm to work out the **actual** element
-index, instead. Given that there are three possible SHAPE entries, up to
-three separate registers in any given operation may be simultaneously
-remapped:
-
- function op_add(rd, rs1, rs2) # add not VADD!
- ...
- ...
- for (i = 0; i < VL; i++)
- xSTATE.srcoffs = i # save context
- if (predval & 1<<i) # predication uses intregs
- ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
- ireg[rs2+remap(irs2)];
- if (!int_vec[rd ].isvector) break;
- if (int_vec[rd ].isvector) { id += 1; }
- if (int_vec[rs1].isvector) { irs1 += 1; }
- if (int_vec[rs2].isvector) { irs2 += 1; }
-
-By changing remappings, 2D matrices may be transposed "in-place" for one
-operation, followed by setting a different permutation order without
-having to move the values in the registers to or from memory. Also,
-the reason for having REMAP separate from the three SHAPE CSRs is so
-that in a chain of matrix multiplications and additions, for example,
-the SHAPE CSRs need only be set up once; only the REMAP CSR need be
-changed to target different registers.
-
-Note that:
-
-* Over-running the register file clearly has to be detected and
- an illegal instruction exception thrown
-* When non-default elwidths are set, the exact same algorithm still
- applies (i.e. it offsets elements *within* registers rather than
- entire registers).
-* If permute option 000 is utilised, the actual order of the
- reindexing does not change!
-* If two or more dimensions are set to zero, the actual order does not change!
-* The above algorithm is pseudo-code **only**. Actual implementations
- will need to take into account the fact that the element for-looping
- must be **re-entrant**, due to the possibility of exceptions occurring.
- See MSTATE CSR, which records the current element index.
-* Twin-predicated operations require **two** separate and distinct
- element offsets. The above pseudo-code algorithm will be applied
- separately and independently to each, should each of the two
- operands be remapped. *This even includes C.LDSP* and other operations
- in that category, where in that case it will be the **offset** that is
- remapped (see Compressed Stack LOAD/STORE section).
-* Offset is especially useful, on its own, for accessing elements
- within the middle of a register. Without offsets, it is necessary
- to either use a predicated MV, skipping the first elements, or
- performing a LOAD/STORE cycle to memory.
- With offsets, the data does not have to be moved.
-* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
- less than MVL is **perfectly legal**, albeit very obscure. It permits
- entries to be regularly presented to operands **more than once**, thus
- allowing the same underlying registers to act as an accumulator of
- multiple vector or matrix operations, for example.
-
-Clearly here some considerable care needs to be taken as the remapping
-could hypothetically create arithmetic operations that target the
-exact same underlying registers, resulting in data corruption due to
-pipeline overlaps. Out-of-order / Superscalar micro-architectures with
-register-renaming will have an easier time dealing with this than
-DSP-style SIMD micro-architectures.
+## REMAP and SHAPE CSRs <a name="remap" />
+
+See optional [[remap]] section.
# Instruction Execution Order