REMAP is an advanced form of Vector "Structure Packing" that provides
hardware-level support for commonly-used *nested* loop patterns that would
otherwise require full inline loop unrolling. For more general reordering
-an Indexed REMAP mode is available (an abstracted analog to `xxperm`).
+an Indexed REMAP mode is available (a RISC-paradigm
+abstracted analog to `xxperm`).
REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped"
(re-mapped) from a linear form to a 2D or 3D transposed form, or "offset"
data extraction and reforming (although less costly vec2/3/4 reshaping
is achievable with `PACK/UNPACK`).
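
As a rough illustration of what "reshaping" the linear `0..VL-1` loop means, the sketch below walks a row-major block in transposed order. It is a minimal model only: the function name `transpose_schedule` and its parameters are hypothetical, and it does not reproduce the actual REMAP hardware algorithm (which also covers 3D shapes, inversion, skipping and offsets).

```python
def transpose_schedule(xdim, ydim):
    """Yield (step, remapped_index) for a row-major block that is
    xdim elements wide and ydim rows tall, walked column-by-column.
    Illustrative only: not the REMAP algorithm from the specification."""
    for step in range(xdim * ydim):
        x = step // ydim          # column currently being walked
        y = step % ydim           # row within that column
        yield step, y * xdim + x  # row-major element offset


# the linear loop 0..5 is redirected to these element offsets
remapped = [idx for _, idx in transpose_schedule(xdim=3, ydim=2)]
print(remapped)
```

The loop counter itself still runs linearly; only the element offset applied to the register changes, which is why no loop unrolling is needed.
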
+Although REMAP was designed as an independent RISC-paradigm abstraction system,
+it was later realised that Matrix REMAP could be applied to min/max instructions to
+achieve Floyd-Warshall Graph computations, or to AND/OR Ternary
+bitmanipulation to compute Warshall Transitive Closure, or
+to perform Cryptographic Matrix operations with Galois Field
+variants of Multiply-Accumulate, with many more uses expected to be
+discovered. All of this is achieved *without
+adding actual explicit Vector opcodes for any of the same*.
+
+Thus it should be very clear:
REMAP, like all of SV, is abstracted out, meaning that unlike traditional
Vector ISAs which would typically only have a limited set of instructions
that can be structure-packed (LD/ST and Move operations
Multiply, DCT and FFT have helper instruction options which make REMAP
easier to use.
+*Future specification note: future versions of the REMAP Management instructions
+will extend to EXT1xx Prefixed variants. This will overcome some of the limitations
+present in the 32-bit variants of the REMAP Management instructions that at
+present require direct writing to SVSHAPE0-3 SPRs. Additional
+REMAP Modes may also be introduced at that time.*
+
There are four types of REMAP:
* **Matrix**, also known as 2D and 3D reshaping, can perform in-place
effect. Even an In-Order system may observe that for large Outer Product
Schedules there will be no stalls, but if the Matrices are particularly
small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V*.
+the operations were loop-unrolled without Simple-V. Thus, regardless
+of the Micro-Architecture, the Hardware Engineer should first consider
+how best to process the equivalent loop-unrolled instruction
+stream.*
+
+## Horizontal-Parallelism Hint
+
+`SVSTATE.hphint` is an indicator to hardware of how many elements are
+fully independent. Hardware is permitted to assume that groups of elements
+up to `hphint` in size need not have Register (or Memory) Hazards created
+between them (including when `hphint > VL`).
+
+If care is not taken in setting `hphint` correctly it may wreak havoc.
+For example Matrix Outer Product relies on the innermost loop computations
+being independent. If `hphint` is set to greater than the Outer Product
+depth then data corruption is guaranteed to occur.
+
+Likewise on FFTs it is assumed that each layer of the RADIX2 triple-loop
+is independent, but that there are strict *inter-layer* Register Hazards.
+Therefore if `hphint` is set to greater than the RADIX2 width of the FFT,
+data corruption is guaranteed.
+
+Thus the key message is that setting `hphint` requires in-depth knowledge
+of the REMAP Algorithm Schedules, given in the Appendix.
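
The failure mode described above can be shown with a toy model. The sketch below is not the Matrix Outer Product or FFT Schedule: it simply accumulates into `depth` accumulators and lets a hypothetical `run` helper process elements in groups of `hphint`, with all reads in a group happening before any write. It assumes that groups are consecutive runs of `hphint` steps, which is an assumption made for illustration.

```python
def run(values, depth, hphint):
    """Accumulate values[i] into acc[i % depth], processing steps in groups
    of `hphint`. Inside a group every read happens before any write, which
    models hardware that was told no hazards exist within the group."""
    acc = [0] * depth
    for base in range(0, len(values), hphint):
        group = range(base, min(base + hphint, len(values)))
        # read phase: every element in the group sees the same old state
        results = [(i % depth, acc[i % depth] + values[i]) for i in group]
        # write phase: a later write to the same accumulator wins
        for slot, result in results:
            acc[slot] = result
    return acc


values = list(range(1, 9))               # 1..8
safe = run(values, depth=4, hphint=4)    # group size matches independence
unsafe = run(values, depth=4, hphint=8)  # group spans dependent elements
print(safe, unsafe)
```

With `hphint` no larger than the number of genuinely independent elements, the result matches sequential execution. Once the group spans two updates of the same accumulator, the first update is silently lost.
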
## REMAP types
(5x7, 3x5) become deeply problematic to unroll.
Even traditional Scalable Vector ISAs have issues with Matrices, often
-having to perform data Transpose by pushing out through Memory and back,
+having to perform data Transpose by pushing out through Memory and back
+(costly),
or computing Transposition Indices (costly) then copying to another
Vector (costly).
may be performed in total.
Also given that it is in-registers only at present some care has to be
taken on regfile resource utilisation. However it is perfectly possible
-to utilise Matrix REMAP to perform the three inner-most "kernel"
-("Tiling") loops of
-the usual 6-level large Matrix Multiply, without the usual difficulties
-associated with SIMD.
+to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
+the usual 6-level "Tiled" large Matrix Multiply, without the usual
+difficulties associated with SIMD.
Also the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
The original motivation for Indexed REMAP was to mitigate the need to add
an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
-a stand-alone instruction. Usually a Vector ISA would add a non-conflicting
+a stand-alone instruction
+(`GPR(RT) <- GPR(GPR(RA))`). Usually a Vector ISA would add a non-conflicting
variant (as in VSX `vperm`) but it is common to need to permute by source,
with the risk of conflict, that has to be resolved, for example, in AVX-512
with `conflictd`.
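
The `mv.x` semantics quoted above, and the element-by-element form that an index-driven copy takes, can be sketched as follows. This is a minimal model under stated assumptions: the register file is a plain Python list, and the helper names `mv_x` and `indexed_copy` and all register numbers are hypothetical.

```python
def mv_x(gpr, rt, ra):
    """Scalar mv.x as quoted in the text: GPR(RT) <- GPR(GPR(RA))."""
    gpr[rt] = gpr[gpr[ra]]


def indexed_copy(gpr, dest, src, index, vl):
    """dest[i] <- src[index[i]] for i in 0..vl-1, executed in strict
    element order so that overlapping registers behave deterministically."""
    for i in range(vl):
        gpr[dest + i] = gpr[src + gpr[index + i]]


gpr = [0] * 32
gpr[8:12] = [10, 20, 30, 40]     # source vector in r8..r11
gpr[16:20] = [3, 2, 1, 0]        # indices in r16..r19 (reverse order)
indexed_copy(gpr, dest=24, src=8, index=16, vl=4)

gpr[5] = 8                       # r5 holds a register number
mv_x(gpr, rt=1, ra=5)            # r1 <- GPR(GPR(r5)) = r8
print(gpr[24:28], gpr[1])
```

Because the indices are only known at run time, two elements may name the same source or destination register, which is the conflict that the surrounding text says must be resolved.
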
uses overlapping registers as accumulators. Thus the Register Hazard
Management needed by Indexed REMAP *has* to be in place anyway.
+*Programmer's Note: `hphint` may be used to help hardware identify
+parallelism opportunities but it is critical to remember that the
+groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
+
The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
clearly that of the additional reading of the GPRs to be used as Indices,
plus the setup cost associated with creating those same Indices.
Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction.
+*appearance* and *effect* of Reduction. Parallel Reduction is not limited
+to Power-of-two but is limited as usual by the total number of
+element operations (127) as well as available register file size.
In Horizontal-First Mode, Vector-result reduction **requires**
the destination to be a Vector, which will be used to store
-intermediary results.
+intermediary results, in order to achieve a correct final
+result.
Given that the tree-reduction schedule is deterministic,
Interrupts and exceptions
can therefore also be precise. The final result will be in the first
non-predicate-masked-out destination element, but due again to
the deterministic schedule programmers may find uses for the intermediate
-results.
+results, even for non-commutative Defined Word operations.
When Rc=1 a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result *and its CR Field*
Move operations are **not** included in the Schedule. This means that
the Schedule leaves the final (scalar) result in the first-non-masked
element of the Vector used. With the predicate mask being dynamic
-(but deterministic) this result could be anywhere.
+(but deterministic) at a superficial glance it seems this result
+could be anywhere.
If that result is needed to be moved to a (single) scalar register
then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
of the values, compressing them (VREDUCE-style) into a contiguous block,
which will guarantee that the result goes into the very first element
of the destination vector, in which case clearly no follow-up
-predicated vector-to-scalar MV operation is needed.
+predicated vector-to-scalar MV operation is needed. A VREDUCE effect
+is achieved by setting just a source predicate mask on Twin-Predicated
+operations.
**Usage conditions**
Defined Words is done with SUBVL looping as the inner loop not the
outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
+*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
+will clearly result in data corruption. It may be best to perform
+a Pack/Unpack Transposing copy of the data first.*
+
## Determining Register Hazards
For high-performance (Multi-Issue, Out-of-Order) systems it is critical
The following bits of the SVSTATE SPR are used for REMAP:
-|32.33|34.35|36.37|38.39|40.41| 42.46 | 62 |
-| -- | -- | -- | -- | -- | ----- | ------ |
-|mi0 |mi1 |mi2 |mo0 |mo1 | SVme | RMpst |
+```
+ |32:33|34:35|36:37|38:39|40:41| 42:46 | 62 |
+ | -- | -- | -- | -- | -- | ----- | ------ |
+ |mi0 |mi1 |mi2 |mo0 |mo1 | SVme | RMpst |
+```
mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
-SVSHAPE is actively applied or not.
+SVSHAPE is actively applied or not, and if so, to which registers.
-* bit 0 of SVme indicates if mi0 is applied to RA / FRA / BA / BFA
-* bit 1 of SVme indicates if mi1 is applied to RB / FRB / BB
-* bit 2 of SVme indicates if mi2 is applied to RC / FRC / BC
-* bit 3 of SVme indicates if mo0 is applied to RT / FRT / BT / BF
-* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
+* bit 4 of SVme indicates if mi0 is applied to source RA / FRA / BA / BFA / RT / FRT
+* bit 3 of SVme indicates if mi1 is applied to source RB / FRB / BB
+* bit 2 of SVme indicates if mi2 is applied to source RC / FRC / BC
+* bit 1 of SVme indicates if mo0 is applied to result RT / FRT / BT / BF
+* bit 0 of SVme indicates if mo1 is applied to result Effective Address / FRS / RS
(LD/ST-with-update has an implicit 2nd write register, RA)
The "persistence" bit if set will result in all Active REMAPs being applied
indefinitely.
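
The field layout above can be decoded with a few shifts. The sketch below is illustrative only: the helper names are hypothetical, SVSTATE is assumed to be a 64-bit value using MSB0 bit numbering (bit 0 is the most significant), and "bit 4 of SVme" is assumed to mean MSB0 position 4 within the 5-bit field, that is, its least significant bit.

```python
def msb0_field(value, start, end, width=64):
    """Extract bits start..end (inclusive, MSB0 numbering) of a width-bit value."""
    length = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << length) - 1)


def decode_remap(svstate):
    names = ["mi0", "mi1", "mi2", "mo0", "mo1"]
    # mi0 is at 32:33, mi1 at 34:35, ... mo1 at 40:41
    fields = {name: msb0_field(svstate, 32 + 2 * n, 33 + 2 * n)
              for n, name in enumerate(names)}
    svme = msb0_field(svstate, 42, 46)
    # assumed mapping: SVme bit 4 (the field's LSB) enables mi0,
    # bit 3 enables mi1, and so on up to bit 0 enabling mo1
    fields["enabled"] = [name for n, name in enumerate(names)
                         if (svme >> n) & 1]
    fields["persist"] = bool(msb0_field(svstate, 62, 62))
    return fields


# example value: mi0 selects SVSHAPE1, mo0 selects SVSHAPE2,
# both are enabled in SVme, and the persistence bit is set
svstate = (1 << 30) | (2 << 24) | (0b01001 << 17) | (1 << 1)
decoded = decode_remap(svstate)
print(decoded)
```
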
-----------------
+-----------
\newpage{}
SVRM-Form:
- svremap SVme,mi0,mi1,mi2,mo0,mo2,pst
-
|0 |6 |11 |13 |15 |17 |19 |21 | 22:25 |26:31 |
| -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- |
| PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO |
-SVRM-Form
-
* svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
Pseudo-code:
Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is
disabled: the register's elements are a linear (1D) vector.
-|31.30|29..28 |27..24| 23..21 | 20..18 | 17..12 |11..6 |5..0 | Mode |
-|---- |------ |------| ------ | ------- | ------- |----- |----- | ----- |
-|mode |skip |offset| invxyz | permute | zdimsz |ydimsz|xdimsz|Matrix |
-|0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed|
-|0b01 |submode|offset| invxyz | submode2| zdimsz |mode |xdimsz|DCT/FFT|
-|0b10 |submode|offset| invxyz | rsvd | rsvd |rsvd |xdimsz|Preduce|
-|0b11 | | | | | | | |rsvd |
+|0:5 |6:11 | 12:17 | 18:20 | 21:23 |24:27 |28:29 |30:31| Mode |
+|----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
+|xdimsz|ydimsz| zdimsz | permute | invxyz |offset|skip |mode |Matrix |
+|xdimsz|ydimsz|SVGPR | 0b110/0b111 |sk1/invxy|offset|elwidth|0b00 |Indexed|
+|xdimsz|mode | zdimsz | submode2| invxyz |offset|submode|0b01 |DCT/FFT|
+| rsvd |rsvd |xdimsz | rsvd | invxyz |offset|submode|0b10 |Preduce|
+| | | | | | | |0b11 |rsvd |
-mode sets different behaviours (straight matrix multiply, FFT, DCT).
+`mode` sets different behaviours (straight matrix multiply, FFT, DCT).
* **mode=0b00** sets straight Matrix Mode
* **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
* **mode=0b01** sets "FFT/DCT" mode and activates submodes
* **mode=0b10** sets "Parallel Reduction" Schedules.
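
The mode selection rules above can be sketched as a small decoder. It is a minimal sketch: it uses the Matrix row of the MSB0-numbered field layout (`0:5` through `30:31`), treats SVSHAPE as a 32-bit value with bit 0 as the most significant bit, and the names `FIELDS`, `decode_svshape` and `classify` are hypothetical.

```python
# (start, end) bit positions, MSB0 numbering within the 32-bit SVSHAPE
FIELDS = {
    "xdimsz": (0, 5), "ydimsz": (6, 11), "zdimsz": (12, 17),
    "permute": (18, 20), "invxyz": (21, 23), "offset": (24, 27),
    "skip": (28, 29), "mode": (30, 31),
}


def decode_svshape(shape):
    """Split a 32-bit SVSHAPE value into its Matrix-layout fields."""
    out = {}
    for name, (start, end) in FIELDS.items():
        shift = 31 - end
        out[name] = (shape >> shift) & ((1 << (end - start + 1)) - 1)
    return out


def classify(shape):
    """Name the REMAP mode selected by an SVSHAPE value."""
    if shape == 0:
        return "disabled"            # all zeros: linear 1D vector
    f = decode_svshape(shape)
    if f["mode"] == 0b00:
        # Indexed Mode shares mode=0b00, distinguished by permute
        return "Indexed" if f["permute"] in (0b110, 0b111) else "Matrix"
    if f["mode"] == 0b01:
        return "DCT/FFT"
    if f["mode"] == 0b10:
        return "Parallel Reduction"
    return "reserved"


matrix_shape = 3 << 26                     # xdimsz field = 3, mode = 0b00
indexed_shape = (3 << 26) | (0b110 << 11)  # permute = 0b110
print(classify(matrix_shape), classify(indexed_shape))
```
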
+*Architectural Resource Allocation note: the four SVSHAPE SPRs are best
+allocated sequentially and contiguously in order that `sv.mtspr` may
+be used.*
+
## Parallel Reduction Mode
Creates the Schedules for Parallel Tree Reduction.
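
A tree-reduction schedule of this kind can be sketched in a few lines. The loop structure below mirrors the operation-counting loop (`step`, `newstep`, `j`) in the `svshape` pseudocode. The pairing of operands (left at `j`, right at `j+step`) and the in-place overwrite of the left element are assumptions made for illustration, and the helper names are hypothetical.

```python
import operator


def preduce_schedule(count):
    """Yield (left, right) element pairs for a tree reduction over
    `count` elements; `count` need not be a power of two."""
    step = 1
    while step < count:
        newstep = step * 2
        j = 0
        while j + step < count:
            yield j, j + step
            j += newstep
        step = newstep


def reduce_with_schedule(values, op):
    """Apply op along the schedule, storing each result in the left element."""
    vec = list(values)
    for left, right in preduce_schedule(len(vec)):
        vec[left] = op(vec[left], vec[right])
    return vec


schedule = list(preduce_schedule(5))
total = reduce_with_schedule([1, 2, 3, 4, 5], operator.add)[0]
joined = reduce_with_schedule(list("abcde"), operator.add)[0]
print(schedule, total, joined)
```

The number of scheduled operations is always one fewer than the element count. Because the left operand always has the lower index, an order-sensitive operation such as string concatenation still produces its elements in the original order in this model.
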
| -- | -- | --- | ----- | ------ | -- | ------| -------- |
|PO | SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |
-```
- # for convenience, VL to be calculated and stored in SVSTATE
- vlen <- [0] * 7
- mscale[0:5] <- 0b000001 # for scaling MAXVL
- itercount[0:6] <- [0] * 7
- SVSTATE[0:31] <- [0] * 32
- # only overwrite REMAP if "persistence" is zero
- if (SVSTATE[62] = 0b0) then
- SVSTATE[32:33] <- 0b00
- SVSTATE[34:35] <- 0b00
- SVSTATE[36:37] <- 0b00
- SVSTATE[38:39] <- 0b00
- SVSTATE[40:41] <- 0b00
- SVSTATE[42:46] <- 0b00000
- SVSTATE[62] <- 0b0
- SVSTATE[63] <- 0b0
- # clear out all SVSHAPEs
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
-
- # set schedule up for multiply
- if (SVrm = 0b0000) then
- # VL in Matrix Multiply is xd*yd*zd
- xd <- (0b00 || SVxd) + 1
- yd <- (0b00 || SVyd) + 1
- zd <- (0b00 || SVzd) + 1
- n <- xd * yd * zd
- vlen[0:6] <- n[14:20]
- # set up template in SVSHAPE0, then copy to 1-3
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[6:11] <- (0b0 || SVyd) # ydim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim
- SVSHAPE0[28:29] <- 0b11 # skip z
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- SVSHAPE3[0:31] <- SVSHAPE0[0:31]
- # set up FRA
- SVSHAPE1[18:20] <- 0b001 # permute x,z,y
- SVSHAPE1[28:29] <- 0b01 # skip z
- # FRC
- SVSHAPE2[18:20] <- 0b001 # permute x,z,y
- SVSHAPE2[28:29] <- 0b11 # skip y
-
- # set schedule up for FFT butterfly
- if (SVrm = 0b0001) then
- # calculate O(N log2 N)
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- n <- ((0b0 || SVxd) + 1) * n
- vlen[0:6] <- n[1:7]
- # set up template in SVSHAPE0, then copy to 1-3
- # for FRA and FRT
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D FFT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b01 # Butterfly mode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- # set up FRB and FRS
- SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
- # FRC (coefficients)
- SVSHAPE2[28:29] <- 0b10 # k schedule
-
- # set schedule up for (i)DCT Inner butterfly
- # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
- if ((SVrm = 0b0100) |
- (SVrm = 0b1100)) then
- # calculate O(N log2 N)
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- n <- ((0b0 || SVxd) + 1) * n
- vlen[0:6] <- n[1:7]
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1100) then
- SVSHAPE0[30:31] <- 0b11 # iDCT mode
- SVSHAPE0[18:20] <- 0b011 # iDCT Inner Butterfly sub-mode
- else
- SVSHAPE0[30:31] <- 0b01 # DCT mode
- SVSHAPE0[18:20] <- 0b001 # DCT Inner Butterfly sub-mode
- SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop
- SVSHAPE0[6:11] <- 0b000011 # (i)DCT Inner Butterfly mode 4
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- if (SVrm != 0b0100) & (SVrm != 0b1100) then
- SVSHAPE3[0:31] <- SVSHAPE0[0:31]
- # for FRA and FRT
- SVSHAPE0[28:29] <- 0b01 # j+halfstep schedule
- # for cos coefficient
- SVSHAPE2[28:29] <- 0b10 # ci (k for mode 4) schedule
- SVSHAPE2[12:17] <- 0b000000 # reset costable "striding" to 1
- if (SVrm != 0b0100) & (SVrm != 0b1100) then
- SVSHAPE3[28:29] <- 0b11 # size schedule
-
- # set schedule up for (i)DCT Outer butterfly
- if (SVrm = 0b0011) | (SVrm = 0b1011) then
- # calculate O(N log2 N) number of outer butterfly overlapping adds
- vlen[0:6] <- [0] * 7
- n <- 0b000
- size <- 0b0000001
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- itercount[0:6] <- (0b0 || itercount[0:5])
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- count <- (itercount - 0b0000001) * size
- vlen[0:6] <- vlen + count[7:13]
- size[0:6] <- (size[1:6] || 0b0)
- itercount[0:6] <- (0b0 || itercount[0:5])
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1011) then
- SVSHAPE0[30:31] <- 0b11 # iDCT mode
- SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode
- SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop
- else
- SVSHAPE0[30:31] <- 0b01 # DCT mode
- SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode
- SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
- SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
- # for FRA and FRT
- SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
- # reset costable "striding" to 1
- SVSHAPE2[12:17] <- 0b000000
-
- # set schedule up for DCT COS table generation
- if (SVrm = 0b0101) | (SVrm = 0b1101) then
- # calculate O(N log2 N)
- vlen[0:6] <- [0] * 7
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- itercount[0:6] <- (0b0 || itercount[0:5])
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- vlen[0:6] <- vlen + itercount
- itercount[0:6] <- (0b0 || itercount[0:5])
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode
- SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode
- if (SVrm = 0b0101) then
- SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- # for cos coefficient
- SVSHAPE1[28:29] <- 0b10 # ci schedule
- SVSHAPE2[28:29] <- 0b11 # size schedule
-
- # set schedule up for iDCT / DCT inverse of half-swapped ordering
- if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
- vlen[0:6] <- (0b00 || SVxd) + 0b0000001
- # set up template in SVSHAPE0
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1110) then
- SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap
- if (SVrm = 0b1111) then
- SVSHAPE0[30:31] <- 0b01 # FFT mode
- else
- SVSHAPE0[30:31] <- 0b11 # DCT mode
- SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode
-
- # set schedule up for parallel reduction
- if (SVrm = 0b0111) then
- # calculate the total number of operations (brute-force)
- vlen[0:6] <- [0] * 7
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- step[0:6] <- 0b0000001
- i[0:6] <- 0b0000000
- do while step <u itercount
- newstep <- step[1:6] || 0b0
- j[0:6] <- 0b0000000
- do while (j+step <u itercount)
- j <- j + newstep
- i <- i + 1
- step <- newstep
- # VL in Parallel-Reduce is the number of operations
- vlen[0:6] <- i
- # set up template in SVSHAPE0, then copy to 1. only 2 needed
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b10 # parallel reduce submode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- # set up right operand (left operand 28:29 is zero)
- SVSHAPE1[28:29] <- 0b01 # right operand
-
- # set VL, MVL and Vertical-First
- m[0:12] <- vlen * mscale
- maxvl[0:6] <- m[6:12]
- SVSTATE[0:6] <- maxvl # MAVXL
- SVSTATE[7:13] <- vlen # VL
- SVSTATE[63] <- vf
-```
+See [[sv/remap/appendix]] for `svshape` pseudocode
Special Registers Altered:
* svindex SVG,rmm,SVd,ew,SVyx,mm,sk
-Pseudo-code:
-
-```
- # based on nearest MAXVL compute other dimension
- MVL <- SVSTATE[0:6]
- d <- [0] * 6
- dim <- SVd+1
- do while d*dim <u ([0]*4 || MVL)
- d <- d + 1
-
- # set up template, then copy once location identified
- shape <- [0]*32
- shape[30:31] <- 0b00 # mode
- if SVyx = 0 then
- shape[18:20] <- 0b110 # indexed xd/yd
- shape[0:5] <- (0b0 || SVd) # xdim
- if sk = 0 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- 0b111111 # ydim max
- else
- shape[18:20] <- 0b111 # indexed yd/xd
- if sk = 1 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- d-1 # ydim max
- shape[0:5] <- (0b0 || SVd) # ydim
- shape[12:17] <- (0b0 || SVG) # SVGPR
- shape[28:29] <- ew # element-width override
- shape[21] <- sk # skip 1st dimension
-
- # select the mode for updating SVSHAPEs
- SVSTATE[62] <- mm # set or clear persistence
- if mm = 0 then
- # clear out all SVSHAPEs first
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
- SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
- SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
- idx <- 0
- for bit = 0 to 4
- if rmm[4-bit] then
- # activate requested shape
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- # increment shape index, modulo 4
- if idx = 3 then idx <- 0
- else idx <- idx + 1
- else
- # refined SVSHAPE/REMAP update mode
- bit <- rmm[0:2]
- idx <- rmm[3:4]
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svindex` pseudocode
Special Registers Altered:
* svshape2 offs,yx,rmm,SVd,sk,mm
-Pseudo-code:
-
-```
- # based on nearest MAXVL compute other dimension
- MVL <- SVSTATE[0:6]
- d <- [0] * 6
- dim <- SVd+1
- do while d*dim <u ([0]*4 || MVL)
- d <- d + 1
- # set up template, then copy once location identified
- shape <- [0]*32
- shape[30:31] <- 0b00 # mode
- shape[0:5] <- (0b0 || SVd) # x/ydim
- if SVyx = 0 then
- shape[18:20] <- 0b000 # ordering xd/yd(/zd)
- if sk = 0 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- 0b111111 # ydim max
- else
- shape[18:20] <- 0b010 # ordering yd/xd(/zd)
- if sk = 1 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- d-1 # ydim max
- # offset (the prime purpose of this instruction)
- shape[24:27] <- SVo # offset
- if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
- else shape[28:29] <- 0b00 # no skipping
- # select the mode for updating SVSHAPEs
- SVSTATE[62] <- mm # set or clear persistence
- if mm = 0 then
- # clear out all SVSHAPEs first
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
- SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
- SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
- idx <- 0
- for bit = 0 to 4
- if rmm[4-bit] then
- # activate requested shape
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- # increment shape index, modulo 4
- if idx = 3 then idx <- 0
- else idx <- idx + 1
- else
- # refined SVSHAPE/REMAP update mode
- bit <- rmm[0:2]
- idx <- rmm[3:4]
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svshape2` pseudocode
Special Registers Altered:
`svshape2` is an additional convenience instruction that prioritises
setting `SVSHAPE.offset`. Its primary purpose is for use when
-element-width overrides are used. It has identical capabilities to `svindex` and
+element-width overrides are used. It has identical capabilities to `svindex`
in terms of both options (skip, etc.) and ability to activate REMAP
-(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP,
+(rmm, mask mode) but unlike `svindex` it does not set GPR REMAP:
only a 1D or 2D `svshape`, and
-unlike `svshape` it can set an arbirrary `SVSHAPE.offset` immediate.
+unlike `svshape` it can set an arbitrary `SVSHAPE.offset` immediate.
One of the limitations of Simple-V is that Vector elements start on the boundary
of the Scalar regfile, which is fine when element-width overrides are not
needed. If the starting point of a Vector with smaller elwidths must begin
in the middle of a register, normally there would be no way to do so except
-through LD/ST. `SVSHAPE.offset` caters for this scenario and `svshape2`is
-makes it easier.
+through costly LD/ST. `SVSHAPE.offset` caters for this scenario and `svshape2`
+makes it easier to access.
**Operand Fields**: