+## REMAP Matrix pseudocode
+
+The algorithm below shows how REMAP works more clearly, and may be
+executed as a python program:
+
+```
+[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
+```
+
+An easier-to-read version (using python iterators) is given in
+a later section of this Appendix.
+
+Each element index from the for-loop `0..VL-1`
+is run through the above algorithm to work out the **actual** element
+index, instead. Given that there are four possible SHAPE entries, up to
+four separate registers in any given operation may be simultaneously
+remapped:
+
+```
+ function op_add(RT, RA, RB) # add not VADD!
+ for (i=0,id=0,irs1=0,irs2=0; i < VL; i++)
+ SVSTATE.srcstep = i # save context
+ if (predval & 1<<i) # predication mask
+ GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
+ GPR[RB+remap3(irs2)];
+ if (!RT.isvector) break;
+ if (RT.isvector) { id += 1; }
+ if (RA.isvector) { irs1 += 1; }
+ if (RB.isvector) { irs2 += 1; }
+```
+
+By changing remappings, 2D matrices may be transposed "in-place" for one
+operation, followed by setting a different permutation order without
+having to move the values in the registers to or from memory.
+
+Note that:
+
+* Over-running the register file clearly has to be detected and
+ an illegal instruction exception thrown
+* When non-default elwidths are set, the exact same algorithm still
+ applies (i.e. it offsets *polymorphic* elements *within* registers rather
+ than entire registers).
+* If permute option 000 is utilised, the actual order of the
+ reindexing does not change. However, modulo MVL still occurs
+ which will result in repeated operations (use with caution).
+* If two or more dimensions are set to zero, the actual order does not change!
+* The above algorithm is pseudo-code **only**. Actual implementations
+ will need to take into account the fact that the element for-looping
+ must be **re-entrant**, due to the possibility of exceptions occurring.
+ See SVSTATE SPR, which records the current element index.
+ Continuing after return from an interrupt may introduce latency
+ due to re-computation of the remapped offsets.
+* Twin-predicated operations require **two** separate and distinct
+ element offsets. The above pseudo-code algorithm will be applied
+ separately and independently to each, should each of the two
+ operands be remapped. *This even includes unit-strided LD/ST*
+ and other operations
+ in that category, where in that case it will be the **address offset**
+ that is remapped: `EA <- (RA) + immediate * REMAP(elementoffset)`.
+* Offset is especially useful, on its own, for accessing elements
+ within the middle of a register. Without offsets, it is necessary
+ to either use a predicated MV, skipping the first elements, or
+ performing a LOAD/STORE cycle to memory.
+ With offsets, the data does not have to be moved.
+* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
+ less than MVL is **perfectly legal**, albeit very obscure. It permits
+ entries to be regularly presented to operands **more than once**, thus
+ allowing the same underlying registers to act as an accumulator of
+ multiple vector or matrix operations, for example.
+* Note especially that Program Order **must** still be respected
+ even when overlaps occur that read or write the same register
+ elements *including polymorphic ones*
+
+Clearly here some considerable care needs to be taken as the remapping
+could hypothetically create arithmetic operations that target the
+exact same underlying registers, resulting in data corruption due to
+pipeline overlaps. Out-of-order / Superscalar micro-architectures with
+register-renaming will have an easier time dealing with this than
+DSP-style SIMD micro-architectures.
+
+
+### 4x4 Matrix to vec4 Multiply (4x4 by 1x4)
+
+The following settings will allow a 4x4 matrix (starting at f8), expressed
+as a sequence of 16 numbers first by row then by column, to be multiplied
+by a vector of length 4 (starting at f0), using a single FMAC instruction.
+
+* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
+* FMAC f4, f0, f8, f4
+
+The permutation on SHAPE0 will use f0 as a vec4 source. On the first
+four iterations through the hardware loop, the REMAPed index will not
+increment. On the second four, the index will increase by one. Likewise
+on each subsequent group of four.
+
+The permutation on SHAPE1 will increment f4 continuously cycling through
+f4-f7 every four iterations of the hardware loop.
+
+At the same time, VL will, because there is no SHAPE on f8, increment
+straight sequentially through the 16 values f8-f23 in the Matrix. The
+equivalent sequence thus is issued:
+
+```
+ fmac f4, f0, f8, f4
+ fmac f5, f0, f9, f5
+ fmac f6, f0, f10, f6
+ fmac f7, f0, f11, f7
+ fmac f4, f1, f12, f4
+ fmac f5, f1, f13, f5
+ fmac f6, f1, f14, f6
+ fmac f7, f1, f15, f7
+ fmac f4, f2, f16, f4
+ fmac f5, f2, f17, f5
+ fmac f6, f2, f18, f6
+ fmac f7, f2, f19, f7
+ fmac f4, f3, f20, f4
+ fmac f5, f3, f21, f5
+ fmac f6, f3, f22, f6
+ fmac f7, f3, f23, f7
+```
+
+Hardware should easily pipeline the above FMACs and as long as each FMAC
+completes in 4 cycles or less there should be 100% sustained throughput,
+from the one single Vector FMAC.
+
+The only other instruction required is to ensure that f4-f7 are
+initialised (usually to zero) however obviously if used as part
+of some other computation, which is frequently the case, then
+clearly the zeroing is not needed.
+
+## REMAP FFT, DFT, NTT
+
+The algorithm from a later section of this Appendix shows how FFT REMAP works,
+and it may be executed as a standalone python3 program.
+The executable code is designed to illustrate how a hardware
+implementation may generate Indices which are completely
+independent of the Execution of element-level operations,
+even for something as complex as a Triple-loop Tukey-Cooley
+Schedule. A comprehensive demo and test suite may be found
+[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
+including Complex Number FFT which deploys Vertical-First Mode
+on top of the REMAP Schedules.
+
+Other uses include more than DFT and NTT: as abstracted RISC-paradigm
+the Schedules are not
+restricted in any way or tied to any particular instruction.
+If the programmer can find any algorithm
+which has identical triple nesting then the FFT Schedule may be
+used even there.
+
## svshape pseudocode
```
SVSTATE[46-bit] <- 1
```
-## REMAP Matrix pseudocode
-
-The algorithm below shows how REMAP works more clearly, and may be
-executed as a python program:
-
-```
-[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
-```
-
-An easier-to-read version (using python iterators) is given in
-a later section of this Appendix.
-
-Each element index from the for-loop `0..VL-1`
-is run through the above algorithm to work out the **actual** element
-index, instead. Given that there are four possible SHAPE entries, up to
-four separate registers in any given operation may be simultaneously
-remapped:
-
-```
- function op_add(RT, RA, RB) # add not VADD!
- for (i=0,id=0,irs1=0,irs2=0; i < VL; i++)
- SVSTATE.srcstep = i # save context
- if (predval & 1<<i) # predication mask
- GPR[RT+remap1(id)] <= GPR[RA+remap2(irs1)] +
- GPR[RB+remap3(irs2)];
- if (!RT.isvector) break;
- if (RT.isvector) { id += 1; }
- if (RA.isvector) { irs1 += 1; }
- if (RB.isvector) { irs2 += 1; }
-```
-
-By changing remappings, 2D matrices may be transposed "in-place" for one
-operation, followed by setting a different permutation order without
-having to move the values in the registers to or from memory.
-
-Note that:
-
-* Over-running the register file clearly has to be detected and
- an illegal instruction exception thrown
-* When non-default elwidths are set, the exact same algorithm still
- applies (i.e. it offsets *polymorphic* elements *within* registers rather
- than entire registers).
-* If permute option 000 is utilised, the actual order of the
- reindexing does not change. However, modulo MVL still occurs
- which will result in repeated operations (use with caution).
-* If two or more dimensions are set to zero, the actual order does not change!
-* The above algorithm is pseudo-code **only**. Actual implementations
- will need to take into account the fact that the element for-looping
- must be **re-entrant**, due to the possibility of exceptions occurring.
- See SVSTATE SPR, which records the current element index.
- Continuing after return from an interrupt may introduce latency
- due to re-computation of the remapped offsets.
-* Twin-predicated operations require **two** separate and distinct
- element offsets. The above pseudo-code algorithm will be applied
- separately and independently to each, should each of the two
- operands be remapped. *This even includes unit-strided LD/ST*
- and other operations
- in that category, where in that case it will be the **address offset**
- that is remapped: `EA <- (RA) + immediate * REMAP(elementoffset)`.
-* Offset is especially useful, on its own, for accessing elements
- within the middle of a register. Without offsets, it is necessary
- to either use a predicated MV, skipping the first elements, or
- performing a LOAD/STORE cycle to memory.
- With offsets, the data does not have to be moved.
-* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
- less than MVL is **perfectly legal**, albeit very obscure. It permits
- entries to be regularly presented to operands **more than once**, thus
- allowing the same underlying registers to act as an accumulator of
- multiple vector or matrix operations, for example.
-* Note especially that Program Order **must** still be respected
- even when overlaps occur that read or write the same register
- elements *including polymorphic ones*
-
-Clearly here some considerable care needs to be taken as the remapping
-could hypothetically create arithmetic operations that target the
-exact same underlying registers, resulting in data corruption due to
-pipeline overlaps. Out-of-order / Superscalar micro-architectures with
-register-renaming will have an easier time dealing with this than
-DSP-style SIMD micro-architectures.
-
-
-### 4x4 Matrix to vec4 Multiply (4x4 by 1x4)
-
-The following settings will allow a 4x4 matrix (starting at f8), expressed
-as a sequence of 16 numbers first by row then by column, to be multiplied
-by a vector of length 4 (starting at f0), using a single FMAC instruction.
-
-* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
-* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
-* VL=16, f4=vec, f0=vec, f8=vec
-* FMAC f4, f0, f8, f4
-
-The permutation on SHAPE0 will use f0 as a vec4 source. On the first
-four iterations through the hardware loop, the REMAPed index will not
-increment. On the second four, the index will increase by one. Likewise
-on each subsequent group of four.
-
-The permutation on SHAPE1 will increment f4 continuously cycling through
-f4-f7 every four iterations of the hardware loop.
-
-At the same time, VL will, because there is no SHAPE on f8, increment
-straight sequentially through the 16 values f8-f23 in the Matrix. The
-equivalent sequence thus is issued:
-
-```
- fmac f4, f0, f8, f4
- fmac f5, f0, f9, f5
- fmac f6, f0, f10, f6
- fmac f7, f0, f11, f7
- fmac f4, f1, f12, f4
- fmac f5, f1, f13, f5
- fmac f6, f1, f14, f6
- fmac f7, f1, f15, f7
- fmac f4, f2, f16, f4
- fmac f5, f2, f17, f5
- fmac f6, f2, f18, f6
- fmac f7, f2, f19, f7
- fmac f4, f3, f20, f4
- fmac f5, f3, f21, f5
- fmac f6, f3, f22, f6
- fmac f7, f3, f23, f7
-```
-
-Hardware should easily pipeline the above FMACs and as long as each FMAC
-completes in 4 cycles or less there should be 100% sustained throughput,
-from the one single Vector FMAC.
-
-The only other instruction required is to ensure that f4-f7 are
-initialised (usually to zero) however obviously if used as part
-of some other computation, which is frequently the case, then
-clearly the zeroing is not needed.
-
-## REMAP FFT, DFT, NTT
-
-The algorithm from a later section of this Appendix shows how FFT REMAP works,
-and it may be executed as a standalone python3 program.
-The executable code is designed to illustrate how a hardware
-implementation may generate Indices which are completely
-independent of the Execution of element-level operations,
-even for something as complex as a Triple-loop Tukey-Cooley
-Schedule. A comprehensive demo and test suite may be found
-[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD)
-including Complex Number FFT which deploys Vertical-First Mode
-on top of the REMAP Schedules.
-
-Other uses include more than DFT and NTT: as abstracted RISC-paradigm
-the Schedules are not
-restricted in any way or tied to any particular instruction.
-If the programmer can find any algorithm
-which has identical triple nesting then the FFT Schedule may be
-used even there.
-
[[!tag standards]]
---------