(no commit message)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn

index 0b7d89f59c9dc121088b8f0a8be2af10d22b7ad5..f7f95c5be2cd0dec719ba2e4b92bde5ebc7dd9f9 100644 (file)
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -15,7 +15,8 @@
  REMAP is an advanced form of Vector "Structure Packing" that provides
  hardware-level support for commonly-used *nested* loop patterns that would
  otherwise require full inline loop unrolling.  For more general reordering
-an Indexed REMAP mode is available (an abstracted analog to `xxperm`).
+an Indexed REMAP mode is available (a RISC-paradigm
+abstracted analog to `xxperm`).
  
  REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped"
  (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset"
@@ -39,6 +40,16 @@ Additional uses include regular "Structure Packing" such as RGB pixel
  data extraction and reforming (although less costly vec2/3/4 reshaping
  is achievable with `PACK/UNPACK`).
  
+After REMAP had been designed as an independent RISC-paradigm abstraction
+system, it was realised that Matrix REMAP could be applied to min/max
+instructions to achieve Floyd-Warshall Graph computations, to AND/OR Ternary
+bit-manipulation to compute Warshall Transitive Closure, or to Galois Field
+variants of Multiply-Accumulate to perform Cryptographic Matrix operations,
+with many more uses expected to be discovered. All of this is achieved
+*without adding actual explicit Vector opcodes for any of the
+same*.
+
+Thus it should be very clear:
  REMAP, like all of SV, is abstracted out, meaning that unlike traditional
  Vector ISAs which would typically only have a limited set of instructions
  that can be structure-packed (LD/ST and Move operations
@@ -54,18 +65,26 @@ Swizzle *can* however be applied to the same
  instruction as REMAP, providing re-sequencing of
  Subvector elements which REMAP cannot. Also as explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack Mode bits
  can extend down into Sub-vector elements to influence vec2/vec3/vec4
-sequential reordering, but even here, REMAP is not *individually*
+sequential reordering, but even here, REMAP reordering is not *individually*
  extended down to the actual sub-vector elements themselves.
+This keeps the relevant Predicate Mask bit applicable to the Subvector
+group, just as it does when REMAP is not active.
  
  In its general form, REMAP is quite expensive to set up, and on some
  implementations may introduce latency, so should realistically be used
  only where it is worthwhile.  Given that even with latency the fact
-that up to 127 operations can be requested to be issued (from a single
+that up to 127 operations can be Deterministically issued (from a single
  instruction) it should be clear that REMAP should not be dismissed
  for *possible* latency alone.  Commonly-used patterns such as Matrix
  Multiply, DCT and FFT have helper instruction options which make REMAP
  easier to use.
  
+*Future specification note: future versions of the REMAP Management instructions
+will extend to EXT1xx Prefixed variants. This will overcome some of the limitations
+present in the 32-bit variants of the REMAP Management instructions that at
+present require direct writing to SVSHAPE0-3 SPRs.  Additional
+REMAP Modes may also be introduced at that time.*
+
  There are four types of REMAP:
  
  * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
@@ -74,7 +93,7 @@ There are four types of REMAP:
  * **FFT/DCT**, with full triple-loop in-place support: limited to
    Power-2 RADIX
  * **Indexing**, for any general-purpose reordering, also includes
-  limited 2D reshaping.
+  limited 2D reshaping as well as Element "offsetting".
  * **Parallel Reduction**, for scheduling a sequence of operations
    in a Deterministic fashion, in a way that may be parallelised,
    to reduce a Vector down to a single value.
@@ -100,22 +119,29 @@ results are Deterministically computed they may be useful.
  Additionally, because the intermediate results are always written out
  it is possible to service Precise Interrupts without affecting latency
  (a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions).
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
  
  ## Basic principle
  
+The following illustrates why REMAP was added.
+
  * normal vector element read/write of operands would be sequential
    (0 1 2 3 ....)
  * this is not appropriate for (e.g.) Matrix multiply which requires
    accessing elements in alternative sequences (0 3 6 1 4 7 ...)
  * normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
    with this.  both are expensive (copy large vectors, spill through memory)
-  and very few Packed SIMD ISAs cope with non-Power-2.
+  and very few Packed SIMD ISAs cope with non-Power-2 sizes
+  (duplicate-data inline loop-unrolling is the costly workaround).
  * REMAP **redefines** the order of access according to set
    (Deterministic) "Schedules".
  * Matrix Schedules are not at all restricted to power-of-two boundaries
    making it unnecessary to have for example specialised 3x4 transpose
    instructions of other Vector ISAs.
+* DCT and FFT REMAP are RADIX-2 limited but this is the case in existing Packed/Predicated
+  SIMD ISAs anyway (and Bluestein Convolution is typically deployed to
+  solve that).
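
The alternative access sequence mentioned in the list above (0 3 6 1 4 7 ...) can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `transposed_order` is hypothetical, and it assumes a row-major matrix walked column-first. It is not the REMAP Schedule algorithm itself, which is given in the Appendix.

```python
def transposed_order(rows, cols):
    """Yield row-major element offsets of a rows x cols matrix, column-first."""
    for c in range(cols):
        for r in range(rows):
            yield r * cols + c


# 3x3: the sequence quoted in the text
print(list(transposed_order(3, 3)))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
# 3x5: nothing in the walk depends on power-of-two dimensions
print(list(transposed_order(3, 5)))
```

Every element is visited exactly once whatever the dimensions, which is the property that makes non-power-of-two shapes unproblematic for this kind of index re-ordering.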
  
  Only the most commonly-used algorithms in computer science have REMAP
  support, due to the high cost in both the ISA and in hardware.  For
@@ -136,9 +162,9 @@ matrix to create
  a 5x4 result:
  
  ```
-    svshape 5, 4, 3, 0, 0            # Outer Product
-    svremap 15, 1, 2, 3, 0, 0, 0, 0
-    sv.fmadds *0, *32, *64, *0
+    svshape 5,4,3,0,0         # Outer Product 5x4 by 4x3
+    svremap 15,1,2,3,0,0,0,0  # link Schedule to registers
+    sv.fmadds *0,*32,*64,*0   # 60 FMACs get executed here
  ```
  
  * svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
@@ -149,7 +175,7 @@ a 5x4 result:
    - RC to use SVSHAPE3
    - RT to use SVSHAPE0
    - RS Remapping to not be activated
-* sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v
+* sv.fmadds has vectors RT=0, RA=32, RB=64, RC=0
  * With REMAP being active each register's element index is
    *independently* transformed using the specified SHAPEs.
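
As a rough mental model of "each register's element index is independently transformed", the following Python sketch lists the loop-unrolled element operations that a single `sv.fmadds` would stand in for. It rests on several assumptions that are not taken from the specification: the name `matmul_element_ops` is hypothetical, the `svshape` operands 5, 4, 3 are read as rows, inner dimension and columns, all matrices are row-major, and the reduction loop is placed outermost. The real element ordering is defined by the Schedules in the Appendix and may differ.

```python
def matmul_element_ops(rows, inner, cols):
    """Return (rt, ra, rb) element offsets for every multiply-accumulate.

    Hypothetical layout: row-major matrices, reduction (inner) loop
    outermost so that back-to-back operations hit different accumulators.
    """
    ops = []
    for k in range(inner):
        for i in range(rows):
            for j in range(cols):
                rt = i * cols + j    # accumulator / result element
                ra = i * inner + k   # left-hand matrix element
                rb = k * cols + j    # right-hand matrix element
                ops.append((rt, ra, rb))
    return ops


ops = matmul_element_ops(5, 4, 3)
print(len(ops))  # 60 element-level multiply-accumulates
```

Note that the three offsets in each tuple follow three different sequences, which is why a separate SVSHAPE is linked to each register.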
  
@@ -161,6 +187,37 @@ need to perform additional Transpose or register copy instructions.
  The example above may be executed as a unit test and demo,
  [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
  
+*Hardware Architectural note: because Scheduling is applied Deterministically
+as a Phase between Decode and Issue, the Register Hazards may be easily
+computed and a standard Out-of-Order Micro-Architecture exploited to good
+effect.  Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are particularly
+small an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus: regardless
+of the Micro-Architecture the Hardware Engineer should first consider
+how best to process the exact same equivalent loop-unrolled instruction
+stream.*
+
+## Horizontal-Parallelism Hint
+
+`SVSTATE.hphint` is an indicator to hardware of how many elements are
+fully independent.  Hardware is permitted to assume that groups of elements
+up to `hphint` in size need not have Register (or Memory) Hazards created
+between them (including when `hphint > VL`).
+
+If care is not taken in setting `hphint` correctly it may wreak havoc.
+For example Matrix Outer Product relies on the innermost loop computations
+being independent.  If `hphint` is set to greater than the Outer Product
+depth then data corruption is guaranteed to occur.
+
+Likewise on FFTs it is assumed that each layer of the RADIX2 triple-loop
+is independent, but that there are strict *inter-layer* Register Hazards.
+Therefore if `hphint` is set to greater than the RADIX2 width of the FFT,
+data corruption is guaranteed.
+
+Thus the key message is that setting `hphint` requires in-depth knowledge
+of the REMAP Algorithm Schedules, given in the Appendix.
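
To make the risk concrete, the sketch below checks in software whether a proposed `hphint` would group together operations that actually depend on one another. It is an illustration of the hazard concept only, not a description of any hardware mechanism: the function name `groups_are_hazard_free`, the tuple format, and the assumption that groups are formed from consecutive steps of the Schedule are all hypothetical.

```python
def groups_are_hazard_free(ops, hphint):
    """ops: list of (dest, src, src, ...) register numbers in issue order.

    Returns True only if no group of `hphint` consecutive operations
    contains a write-after-write, write-after-read or read-after-write
    dependency between two different operations in the group.
    """
    for start in range(0, len(ops), hphint):
        written, read = set(), set()
        for dest, *srcs in ops[start:start + hphint]:
            if dest in written or dest in read:
                return False          # WAW or WAR against an earlier op
            if any(s in written for s in srcs):
                return False          # RAW against an earlier op
            written.add(dest)
            read.update(srcs)
    return True


# Two accumulators (registers 0 and 1), each updated twice.
ops = [(0, 32, 64, 0), (1, 32, 65, 1), (0, 33, 66, 0), (1, 33, 67, 1)]
print(groups_are_hazard_free(ops, 2))  # True: each group touches both accumulators once
print(groups_are_hazard_free(ops, 4))  # False: register 0 is accumulated twice in one group
```

A check of this kind can be run against a Schedule during development to confirm that a chosen `hphint` does not exceed the number of genuinely independent operations.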
+
  ## REMAP types
  
  This section summarises the motivation for each REMAP Schedule
@@ -180,7 +237,8 @@ works if one of the dimensions X or Y are power-two. Prime Numbers
  (5x7, 3x5) become deeply problematic to unroll.
  
  Even traditional Scalable Vector ISAs have issues with Matrices, often
-having to perform data Transpose by pushing out through Memory and back,
+having to perform data Transpose by pushing out through Memory and back
+(costly),
  or computing Transposition Indices (costly) then copying to another
  Vector (costly).
  
@@ -196,10 +254,9 @@ restricted to 127: up to 127 FMAs (or other operation)
  may be performed in total.
  Also given that it is in-registers only at present some care has to be
  taken on regfile resource utilisation. However it is perfectly possible
-to utilise Matrix REMAP to perform the three inner-most "kernel"
-("Tiling") loops of
-the usual 6-level large Matrix Multiply, without the usual difficulties
-associated with SIMD.
+to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
+the usual 6-level "Tiled" large Matrix Multiply, without the usual 
+difficulties associated with SIMD.
  
  Also the `svshape` instruction only provides access to part of the
  Matrix REMAP capability. Rotation and mirroring need to be done by
@@ -266,7 +323,8 @@ issued and executed.
  
  The original motivation for Indexed REMAP was to mitigate the need to add
  an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
-a stand-alone instruction.  Usually a Vector ISA would add a non-conflicting
+a stand-alone instruction
+(`GPR(RT) <- GPR(GPR(RA))`).  Usually a Vector ISA would add a non-conflicting
  variant (as in VSX `vperm`) but it is common to need to permute by source,
  with the risk of conflict, that has to be resolved, for example, in AVX-512
  with `conflictd`.
@@ -279,6 +337,10 @@ all *already* critically depend on overlapping Reads/Writes: Matrix
  uses overlapping registers as accumulators.  Thus the Register Hazard
  Management needed by Indexed REMAP *has* to be in place anyway.
  
+*Programmer's Note: `hphint` may be used to help hardware identify
+parallelism opportunities but it is critical to remember that the
+groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
+
  The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
  clearly that of the additional reading of the GPRs to be used as Indices,
  plus the setup cost associated with creating those same Indices.
@@ -296,18 +358,21 @@ with an Index exceeding VL-1.*
  
  Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
  (Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction.
+*appearance* and *effect* of Reduction. Parallel Reduction is not limited
+to Power-of-two but is limited as usual by the total number of
+element operations (127) as well as available register file size.
  
  In Horizontal-First Mode, Vector-result reduction **requires**
  the destination to be a Vector, which will be used to store
-intermediary results.
+intermediary results, in order to achieve a correct final
+result.
  
  Given that the tree-reduction schedule is deterministic,
  Interrupts and exceptions
  can therefore also be precise.  The final result will be in the first
  non-predicate-masked-out destination element, but due again to
  the deterministic schedule programmers may find uses for the intermediate
-results.
+results, even for non-commutative Defined Word operations.
  
  When Rc=1 a corresponding Vector of co-resultant CRs is also
  created.  No special action is taken: the result *and its CR Field*
@@ -348,7 +413,8 @@ completely separate from the actual element-level (scalar) operations,
  Move operations are **not** included in the Schedule.  This means that
  the Schedule leaves the final (scalar) result in the first-non-masked 
  element of the Vector used.  With the predicate mask being dynamic
-(but deterministic) this result could be anywhere.
+(but deterministic) at a superficial glance it seems this result
+could be anywhere.
  
  If that result is needed to be moved to a (single) scalar register
  then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
@@ -371,7 +437,9 @@ It may be better to perform a pre-copy
  of the values, compressing them (VREDUCE-style) into a contiguous block,
  which will guarantee that the result goes into the very first element
  of the destination vector, in which case clearly no follow-up
-predicated vector-to-scalar MV operation is needed.
+predicated vector-to-scalar MV operation is needed. A VREDUCE effect
+is achieved by setting just a source predicate mask on Twin-Predicated
+operations.
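
The compression effect described here can be pictured with the following Python sketch. It assumes that bit `i` of the source predicate governs source element `i` and that the destination predicate is all-ones. The helper name `twin_predicated_copy` is hypothetical and the code is a behavioural illustration, not the specification's pseudocode.

```python
def twin_predicated_copy(src, src_mask, dest):
    """Copy unmasked source elements into consecutive destination slots."""
    out = list(dest)
    d = 0                              # destination index advances only on a copy
    for i, value in enumerate(src):
        if (src_mask >> i) & 1:        # assumed: mask bit i controls element i
            out[d] = value
            d += 1
    return out


print(twin_predicated_copy([10, 20, 30, 40], 0b1010, [0, 0, 0, 0]))
# [20, 40, 0, 0]
```

After such a copy the values to be reduced occupy a contiguous block starting at element 0, so the reduction result lands in the first element.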
  
  **Usage conditions**
  
@@ -408,6 +476,10 @@ which will turn the Schedule around such that issuing of the Scalar
  Defined Words is done with SUBVL looping as the inner loop not the
  outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
  
+*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
+will clearly result in data corruption.  It may be best to perform
+a Pack/Unpack Transposing copy of the data first.*
+
  ## Determining Register Hazards
  
  For high-performance (Multi-Issue, Out-of-Order) systems it is critical
@@ -450,27 +522,29 @@ Predication is used.
  
  The following bits of the SVSTATE SPR are used for REMAP:
  
-|32.33|34.35|36.37|38.39|40.41| 42.46 | 62 |
-| --  | --  | --  | --  | --  | ----- | ------ |
-|mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst    |
+```
+    |32:33|34:35|36:37|38:39|40:41| 42:46 | 62     |
+    | --  | --  | --  | --  | --  | ----- | ------ |
+    |mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst  |
+```
  
  mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
  mi0-2 apply to RA, RB, RC respectively, as input registers, and
  likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
  SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
-SVSHAPE is actively applied or not.
+SVSHAPE is actively applied or not, and if so, to which registers.
  
-* bit 0 of SVme indicates if mi0 is applied to RA / FRA / BA / BFA
-* bit 1 of SVme indicates if mi1 is applied to RB / FRB / BB
-* bit 2 of SVme indicates if mi2 is applied to RC / FRC / BC
-* bit 3 of SVme indicates if mo0 is applied to RT / FRT / BT / BF
-* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
+* bit 4 of SVme indicates if mi0 is applied to source RA / FRA / BA / BFA / RT / FRT
+* bit 3 of SVme indicates if mi1 is applied to source RB / FRB / BB
+* bit 2 of SVme indicates if mi2 is applied to source RC / FRC / BC
+* bit 1 of SVme indicates if mo0 is applied to result RT / FRT / BT / BF
+* bit 0 of SVme indicates if mo1 is applied to result Effective Address / FRS / RS
    (LD/ST-with-update has an implicit 2nd write register, RA)
  
  The "persistence" bit if set will result in all Active REMAPs being applied
  indefinitely.
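
The field layout above can be decoded with a short Python sketch. It assumes the usual Power ISA MSB0 bit numbering for the 64-bit SVSTATE (bit 0 is the most significant) and, as an inference from the `svremap 15,1,2,3,0,0,0,0` example, that the SVme bit numbers are also MSB0 within the 5-bit field, so "bit 4" is its least significant bit. The helper names `msb0_field` and `decode_remap` are hypothetical.

```python
def msb0_field(value, start, end, width=64):
    """Extract bits start..end (inclusive, MSB0 numbering) of a width-bit value."""
    length = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << length) - 1)


def decode_remap(svstate):
    """Return (selected SVSHAPE per operand, active flag per operand, RMpst)."""
    names = ["mi0", "mi1", "mi2", "mo0", "mo1"]
    select = {n: msb0_field(svstate, 32 + 2 * i, 33 + 2 * i)
              for i, n in enumerate(names)}
    svme = msb0_field(svstate, 42, 46)
    # SVme bit 4 (MSB0, i.e. the LSB) is mi0 ... SVme bit 0 is mo1
    active = {n: bool((svme >> i) & 1) for i, n in enumerate(names)}
    return select, active, msb0_field(svstate, 62, 62)


# Fields as they would be set for SVme=15, mi0=1, mi1=2, mi2=3, mo0=0, mo1=0
svstate = (1 << (63 - 33)) | (2 << (63 - 35)) | (3 << (63 - 37)) | (15 << (63 - 46))
select, active, pst = decode_remap(svstate)
print(select)  # {'mi0': 1, 'mi1': 2, 'mi2': 3, 'mo0': 0, 'mo1': 0}
print(active)  # mo1 is the only inactive entry
```

Under these assumptions the decode agrees with the earlier Matrix example: RA, RB, RC and RT are remapped while RS is not.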
  
-----------------
+-----------
  
  \newpage{}
  
@@ -478,14 +552,10 @@ indefinitely.
  
  SVRM-Form:
  
-    svremap SVme,mi0,mi1,mi2,mo0,mo2,pst
-
  |0     |6     |11  |13   |15   |17   |19   |21    | 22:25 |26:31  |
  | --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
  | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
  
-SVRM-Form
-
  * svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
  
  Pseudo-code:
@@ -547,21 +617,25 @@ which have the same format.
  Shape is 32-bits.  When SHAPE is set entirely to zeros, remapping is
  disabled: the register's elements are a linear (1D) vector.
  
-|31.30|29..28 |27..24| 23..21 | 20..18  | 17..12  |11..6 |5..0  | Mode  |
-|---- |------ |------| ------ | ------- | ------- |----- |----- | ----- |
-|mode |skip   |offset| invxyz | permute | zdimsz  |ydimsz|xdimsz|Matrix |
-|0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed|
-|0b01 |submode|offset| invxyz | submode2| zdimsz  |mode  |xdimsz|DCT/FFT|
-|0b10 |submode|offset| invxyz | rsvd    | rsvd    |rsvd  |xdimsz|Preduce|
-|0b11 |       |      |        |         |         |      |      |rsvd   |
+|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
+|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
+|xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |Matrix |
+|xdimsz|ydimsz|SVGPR    | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
+|xdimsz|mode  | zdimsz  | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
+| rsvd |rsvd  |xdimsz   | rsvd    | invxyz  |offset|submode|0b10 |Preduce|
+|      |      |         |         |         |      |       |0b11 |rsvd   |
  
-mode sets different behaviours (straight matrix multiply, FFT, DCT).
+`mode` sets different behaviours (straight matrix multiply, FFT, DCT).
  
  * **mode=0b00** sets straight Matrix Mode
  * **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
  * **mode=0b01** sets "FFT/DCT" mode and activates submodes
  * **mode=0b10** sets "Parallel Reduction" Schedules.
  
+*Architectural Resource Allocation note: the four SVSHAPE SPRs are best
+allocated sequentially and contiguously in order that `sv.mtspr` may
+be used.*
+
  ## Parallel Reduction Mode
  
  Creates the Schedules for Parallel Tree Reduction.
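
A tree-reduction schedule of this kind can be sketched in Python as follows. The loop structure is modelled on the brute-force operation count in the `svshape` pseudocode. The pairing of a left and right element at each step, the function name `reduction_schedule`, and the absence of predicate masking are assumptions made for illustration.

```python
def reduction_schedule(n):
    """Yield (left, right) pairs; each step does v[left] = op(v[left], v[right])."""
    step = 1
    while step < n:
        newstep = step * 2
        j = 0
        while j + step < n:
            yield (j, j + step)
            j += newstep
        step = newstep


values = [1, 2, 3, 4, 5, 6, 7]          # 7 elements: not a power of two
work = list(values)
schedule = list(reduction_schedule(len(work)))
for left, right in schedule:
    work[left] += work[right]           # intermediates stay in the vector

print(schedule)   # [(0, 1), (2, 3), (4, 5), (0, 2), (4, 6), (0, 4)]
print(work[0])    # 28, the sum of all seven elements
```

The number of pairs produced is what the `svshape` pseudocode stores as VL for this mode, and the intermediate sums remain visible in `work` after the loop, mirroring the requirement that the destination be a Vector.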
@@ -763,228 +837,7 @@ SVM-Form
  | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
  |PO  | SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svshape  |
  
-```
-    # for convenience, VL to be calculated and stored in SVSTATE
-    vlen <- [0] * 7
-    mscale[0:5] <- 0b000001 # for scaling MAXVL
-    itercount[0:6] <- [0] * 7
-    SVSTATE[0:31] <- [0] * 32
-    # only overwrite REMAP if "persistence" is zero
-    if (SVSTATE[62] = 0b0) then
-        SVSTATE[32:33] <- 0b00
-        SVSTATE[34:35] <- 0b00
-        SVSTATE[36:37] <- 0b00
-        SVSTATE[38:39] <- 0b00
-        SVSTATE[40:41] <- 0b00
-        SVSTATE[42:46] <- 0b00000
-        SVSTATE[62] <- 0b0
-        SVSTATE[63] <- 0b0
-    # clear out all SVSHAPEs
-    SVSHAPE0[0:31] <- [0] * 32
-    SVSHAPE1[0:31] <- [0] * 32
-    SVSHAPE2[0:31] <- [0] * 32
-    SVSHAPE3[0:31] <- [0] * 32
-
-    # set schedule up for multiply
-    if (SVrm = 0b0000) then
-        # VL in Matrix Multiply is xd*yd*zd
-        xd <- (0b00 || SVxd) + 1
-        yd <- (0b00 || SVyd) + 1
-        zd <- (0b00 || SVzd) + 1
-        n <- xd * yd * zd
-        vlen[0:6] <- n[14:20]
-        # set up template in SVSHAPE0, then copy to 1-3
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[6:11] <- (0b0 || SVyd)   # ydim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim
-        SVSHAPE0[28:29] <- 0b11           # skip z
-        # copy
-        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
-        SVSHAPE2[0:31] <- SVSHAPE0[0:31]
-        SVSHAPE3[0:31] <- SVSHAPE0[0:31]
-        # set up FRA
-        SVSHAPE1[18:20] <- 0b001          # permute x,z,y
-        SVSHAPE1[28:29] <- 0b01           # skip z
-        # FRC
-        SVSHAPE2[18:20] <- 0b001          # permute x,z,y
-        SVSHAPE2[28:29] <- 0b11           # skip y
-
-    # set schedule up for FFT butterfly
-    if (SVrm = 0b0001) then
-        # calculate O(N log2 N)
-        n <- [0] * 3
-        do while n < 5
-           if SVxd[4-n] = 0 then
-               leave
-           n <- n + 1
-        n <- ((0b0 || SVxd) + 1) * n
-        vlen[0:6] <- n[1:7]
-        # set up template in SVSHAPE0, then copy to 1-3
-        # for FRA and FRT
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D FFT)
-        mscale <- (0b0 || SVzd) + 1
-        SVSHAPE0[30:31] <- 0b01          # Butterfly mode
-        # copy
-        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
-        SVSHAPE2[0:31] <- SVSHAPE0[0:31]
-        # set up FRB and FRS
-        SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
-        # FRC (coefficients)
-        SVSHAPE2[28:29] <- 0b10           # k schedule
-
-    # set schedule up for (i)DCT Inner butterfly
-    # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
-    if ((SVrm = 0b0100) |
-        (SVrm = 0b1100)) then
-        # calculate O(N log2 N)
-        n <- [0] * 3
-        do while n < 5
-           if SVxd[4-n] = 0 then
-               leave
-           n <- n + 1
-        n <- ((0b0 || SVxd) + 1) * n
-        vlen[0:6] <- n[1:7]
-        # set up template in SVSHAPE0, then copy to 1-3
-        # set up FRB and FRS
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
-        mscale <- (0b0 || SVzd) + 1
-        if (SVrm = 0b1100) then
-            SVSHAPE0[30:31] <- 0b11          # iDCT mode
-            SVSHAPE0[18:20] <- 0b011         # iDCT Inner Butterfly sub-mode
-        else
-            SVSHAPE0[30:31] <- 0b01          # DCT mode
-            SVSHAPE0[18:20] <- 0b001         # DCT Inner Butterfly sub-mode
-            SVSHAPE0[21:23] <- 0b001         # "inverse" on outer loop
-        SVSHAPE0[6:11] <- 0b000011       # (i)DCT Inner Butterfly mode 4
-        # copy
-        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
-        SVSHAPE2[0:31] <- SVSHAPE0[0:31]
-        if (SVrm != 0b0100) & (SVrm != 0b1100) then
-            SVSHAPE3[0:31] <- SVSHAPE0[0:31]
-        # for FRA and FRT
-        SVSHAPE0[28:29] <- 0b01           # j+halfstep schedule
-        # for cos coefficient
-        SVSHAPE2[28:29] <- 0b10           # ci (k for mode 4) schedule
-        SVSHAPE2[12:17] <- 0b000000       # reset costable "striding" to 1
-        if (SVrm != 0b0100) & (SVrm != 0b1100) then
-            SVSHAPE3[28:29] <- 0b11           # size schedule
-
-    # set schedule up for (i)DCT Outer butterfly
-    if (SVrm = 0b0011) | (SVrm = 0b1011) then
-        # calculate O(N log2 N) number of outer butterfly overlapping adds
-        vlen[0:6] <- [0] * 7
-        n <- 0b000
-        size <- 0b0000001
-        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
-        itercount[0:6] <- (0b0 || itercount[0:5])
-        do while n < 5
-           if SVxd[4-n] = 0 then
-               leave
-           n <- n + 1
-           count <- (itercount - 0b0000001) * size
-           vlen[0:6] <- vlen + count[7:13]
-           size[0:6] <- (size[1:6] || 0b0)
-           itercount[0:6] <- (0b0 || itercount[0:5])
-        # set up template in SVSHAPE0, then copy to 1-3
-        # set up FRB and FRS
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
-        mscale <- (0b0 || SVzd) + 1
-        if (SVrm = 0b1011) then
-            SVSHAPE0[30:31] <- 0b11      # iDCT mode
-            SVSHAPE0[18:20] <- 0b011     # iDCT Outer Butterfly sub-mode
-            SVSHAPE0[21:23] <- 0b101     # "inverse" on outer and inner loop
-        else
-            SVSHAPE0[30:31] <- 0b01      # DCT mode
-            SVSHAPE0[18:20] <- 0b100     # DCT Outer Butterfly sub-mode
-        SVSHAPE0[6:11] <- 0b000010       # DCT Butterfly mode
-        # copy
-        SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
-        SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
-        # for FRA and FRT
-        SVSHAPE1[28:29] <- 0b01           # j+halfstep schedule
-        # reset costable "striding" to 1
-        SVSHAPE2[12:17] <- 0b000000
-
-    # set schedule up for DCT COS table generation
-    if (SVrm = 0b0101) | (SVrm = 0b1101) then
-        # calculate O(N log2 N)
-        vlen[0:6] <- [0] * 7
-        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
-        itercount[0:6] <- (0b0 || itercount[0:5])
-        n <- [0] * 3
-        do while n < 5
-           if SVxd[4-n] = 0 then
-               leave
-           n <- n + 1
-           vlen[0:6] <- vlen + itercount
-           itercount[0:6] <- (0b0 || itercount[0:5])
-        # set up template in SVSHAPE0, then copy to 1-3
-        # set up FRB and FRS
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
-        mscale <- (0b0 || SVzd) + 1
-        SVSHAPE0[30:31] <- 0b01          # DCT/FFT mode
-        SVSHAPE0[6:11] <- 0b000100       # DCT Inner Butterfly COS-gen mode
-        if (SVrm = 0b0101) then
-            SVSHAPE0[21:23] <- 0b001     # "inverse" on outer loop for DCT
-        # copy
-        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
-        SVSHAPE2[0:31] <- SVSHAPE0[0:31]
-        # for cos coefficient
-        SVSHAPE1[28:29] <- 0b10           # ci schedule
-        SVSHAPE2[28:29] <- 0b11           # size schedule
-
-    # set schedule up for iDCT / DCT inverse of half-swapped ordering
-    if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
-        vlen[0:6] <- (0b00 || SVxd) + 0b0000001
-        # set up template in SVSHAPE0
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
-        mscale <- (0b0 || SVzd) + 1
-        if (SVrm = 0b1110) then
-            SVSHAPE0[18:20] <- 0b001     # DCT opposite half-swap
-        if (SVrm = 0b1111) then
-            SVSHAPE0[30:31] <- 0b01          # FFT mode
-        else
-            SVSHAPE0[30:31] <- 0b11          # DCT mode
-        SVSHAPE0[6:11] <- 0b000101       # DCT "half-swap" mode
-
-    # set schedule up for parallel reduction
-    if (SVrm = 0b0111) then
-        # calculate the total number of operations (brute-force)
-        vlen[0:6] <- [0] * 7
-        itercount[0:6] <- (0b00 || SVxd) + 0b0000001
-        step[0:6] <- 0b0000001
-        i[0:6] <- 0b0000000
-        do while step <u itercount
-            newstep <- step[1:6] || 0b0
-            j[0:6] <- 0b0000000
-            do while (j+step <u itercount)
-                j <- j + newstep
-                i <- i + 1
-            step <- newstep
-        # VL in Parallel-Reduce is the number of operations
-        vlen[0:6] <- i
-        # set up template in SVSHAPE0, then copy to 1. only 2 needed
-        SVSHAPE0[0:5] <- (0b0 || SVxd)   # xdim
-        SVSHAPE0[12:17] <- (0b0 || SVzd)   # zdim - "striding" (2D DCT)
-        mscale <- (0b0 || SVzd) + 1
-        SVSHAPE0[30:31] <- 0b10          # parallel reduce submode
-        # copy
-        SVSHAPE1[0:31] <- SVSHAPE0[0:31]
-        # set up right operand (left operand 28:29 is zero)
-        SVSHAPE1[28:29] <- 0b01           # right operand
-
-    # set VL, MVL and Vertical-First
-    m[0:12] <- vlen * mscale
-    maxvl[0:6] <- m[6:12]
-    SVSTATE[0:6] <- maxvl  # MAVXL
-    SVSTATE[7:13] <- vlen  # VL
-    SVSTATE[63] <- vf
-```
+See [[sv/remap/appendix]] for `svshape` pseudocode.
  
  Special Registers Altered:
  
@@ -1070,66 +923,7 @@ SVI-Form
  
  * svindex SVG,rmm,SVd,ew,SVyx,mm,sk
  
-Pseudo-code:
-
-```
-    # based on nearest MAXVL compute other dimension
-    MVL <- SVSTATE[0:6]
-    d <- [0] * 6
-    dim <- SVd+1
-    do while d*dim <u ([0]*4 || MVL)
-       d <- d + 1
-
-    # set up template, then copy once location identified
-    shape <- [0]*32
-    shape[30:31] <- 0b00            # mode
-    if SVyx = 0 then
-        shape[18:20] <- 0b110       # indexed xd/yd
-        shape[0:5] <- (0b0 || SVd)  # xdim
-        if sk = 0 then shape[6:11] <- 0 # ydim
-        else           shape[6:11] <- 0b111111 # ydim max
-    else
-        shape[18:20] <- 0b111       # indexed yd/xd
-        if sk = 1 then shape[6:11] <- 0 # ydim
-        else           shape[6:11] <- d-1 # ydim max
-        shape[0:5] <- (0b0 || SVd) # ydim
-    shape[12:17] <- (0b0 || SVG)        # SVGPR
-    shape[28:29] <- ew                  # element-width override
-    shape[21] <- sk                     # skip 1st dimension
-
-    # select the mode for updating SVSHAPEs
-    SVSTATE[62] <- mm # set or clear persistence
-    if mm = 0 then
-        # clear out all SVSHAPEs first
-        SVSHAPE0[0:31] <- [0] * 32
-        SVSHAPE1[0:31] <- [0] * 32
-        SVSHAPE2[0:31] <- [0] * 32
-        SVSHAPE3[0:31] <- [0] * 32
-        SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
-        SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
-        idx <- 0
-        for bit = 0 to 4
-            if rmm[4-bit] then
-                # activate requested shape
-                if idx = 0 then SVSHAPE0 <- shape
-                if idx = 1 then SVSHAPE1 <- shape
-                if idx = 2 then SVSHAPE2 <- shape
-                if idx = 3 then SVSHAPE3 <- shape
-                SVSTATE[bit*2+32:bit*2+33] <- idx
-                # increment shape index, modulo 4
-                if idx = 3 then idx <- 0
-                else            idx <- idx + 1
-    else
-        # refined SVSHAPE/REMAP update mode
-        bit <- rmm[0:2]
-        idx <- rmm[3:4]
-        if idx = 0 then SVSHAPE0 <- shape
-        if idx = 1 then SVSHAPE1 <- shape
-        if idx = 2 then SVSHAPE2 <- shape
-        if idx = 3 then SVSHAPE3 <- shape
-        SVSTATE[bit*2+32:bit*2+33] <- idx
-        SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svindex` pseudocode.
  
  Special Registers Altered:
  
@@ -1260,64 +1054,7 @@ SVM2-Form
  
  * svshape2 offs,yx,rmm,SVd,sk,mm
  
-Pseudo-code:
-
-```
-    # based on nearest MAXVL compute other dimension
-    MVL <- SVSTATE[0:6]
-    d <- [0] * 6
-    dim <- SVd+1
-    do while d*dim <u ([0]*4 || MVL)
-       d <- d + 1
-    # set up template, then copy once location identified
-    shape <- [0]*32
-    shape[30:31] <- 0b00            # mode
-    shape[0:5] <- (0b0 || SVd)      # x/ydim
-    if SVyx = 0 then
-        shape[18:20] <- 0b000       # ordering xd/yd(/zd)
-        if sk = 0 then shape[6:11] <- 0 # ydim
-        else           shape[6:11] <- 0b111111 # ydim max
-    else
-        shape[18:20] <- 0b010       # ordering yd/xd(/zd)
-        if sk = 1 then shape[6:11] <- 0 # ydim
-        else           shape[6:11] <- d-1 # ydim max
-    # offset (the prime purpose of this instruction)
-    shape[24:27] <- SVo         # offset
-    if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
-    else           shape[28:29] <- 0b00 # no skipping
-    # select the mode for updating SVSHAPEs
-    SVSTATE[62] <- mm # set or clear persistence
-    if mm = 0 then
-        # clear out all SVSHAPEs first
-        SVSHAPE0[0:31] <- [0] * 32
-        SVSHAPE1[0:31] <- [0] * 32
-        SVSHAPE2[0:31] <- [0] * 32
-        SVSHAPE3[0:31] <- [0] * 32
-        SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
-        SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
-        idx <- 0
-        for bit = 0 to 4
-            if rmm[4-bit] then
-                # activate requested shape
-                if idx = 0 then SVSHAPE0 <- shape
-                if idx = 1 then SVSHAPE1 <- shape
-                if idx = 2 then SVSHAPE2 <- shape
-                if idx = 3 then SVSHAPE3 <- shape
-                SVSTATE[bit*2+32:bit*2+33] <- idx
-                # increment shape index, modulo 4
-                if idx = 3 then idx <- 0
-                else            idx <- idx + 1
-    else
-        # refined SVSHAPE/REMAP update mode
-        bit <- rmm[0:2]
-        idx <- rmm[3:4]
-        if idx = 0 then SVSHAPE0 <- shape
-        if idx = 1 then SVSHAPE1 <- shape
-        if idx = 2 then SVSHAPE2 <- shape
-        if idx = 3 then SVSHAPE3 <- shape
-        SVSTATE[bit*2+32:bit*2+33] <- idx
-        SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svshape2` pseudocode.
  
  Special Registers Altered:
  
@@ -1327,18 +1064,18 @@ Special Registers Altered:
  
  `svshape2` is an additional convenience instruction that prioritises
  setting `SVSHAPE.offset`. Its primary purpose is for use when
-element-width overrides are used. It has identical capabilities to `svindex` and
+element-width overrides are used. It has identical capabilities to `svindex`
  in terms of both options (skip, etc.) and ability to activate REMAP
-(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP,
+(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP:
  only a 1D or 2D `svshape`, and
-unlike `svshape` it can set an arbirrary `SVSHAPE.offset` immediate.
+unlike `svshape` it can set an arbitrary `SVSHAPE.offset` immediate.
  
  One of the limitations of Simple-V is that Vector elements start on the boundary
  of the Scalar regfile, which is fine when element-width overrides are not
  needed. If the starting point of a Vector with smaller elwidths must begin
  in the middle of a register, normally there would be no way to do so except
-through LD/ST.  `SVSHAPE.offset` caters for this scenario and `svshape2`is
-makes it easier.
+through costly LD/ST.  `SVSHAPE.offset` caters for this scenario and `svshape2`
+makes it easier to access.
  
  **Operand Fields**: