(a common limitation of Vector ISAs implementing explicit
Parallel Reduction instructions).
-# Basic principle
+## Basic principle
* normal vector element read/write of operands would be sequential
(0 1 2 3 ....)
support, due to the high cost in both the ISA and in hardware. For
arbitrary remapping the `Indexed` REMAP may be used.
-# Example Usage
+## Example Usage
* `svshape` to set the type of reordering to be applied to an
otherwise usual `0..VL-1` hardware for-loop
The example above may be executed as a unit test and demo,
[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
-# REMAP types
+## REMAP types
This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].
-## Matrix (1D/2D/3D shaping)
+### Matrix (1D/2D/3D shaping)
Matrix Multiplication is a huge part of High-Performance Compute,
and 3D.
programming the SVSHAPE SPRs directly, which can take a lot more
instructions.
-## FFT/DCT Triple Loop
+### FFT/DCT Triple Loop
DCT and FFT are some of the most widely used algorithms in
Computer Science: Radar, Audio, Video, R.F. Baseband and dozens more. At least
to compute arbitrary length is demonstrated by
[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
-## Indexed
+### Indexed
The purpose of Indexing is to provide a generalised version of
Vector ISA "Permute" instructions, such as VSX `vperm`. The
CR Fields may then be used as Predicate Masks to exclude those operations
with an Index exceeding VL-1.*
-## Parallel Reduction
+### Parallel Reduction
Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
In either case the result is in the element with the first bit set in
the predicate mask.
-For *some* implementations
+Programmer's Note: For *some* hardware implementations
the vector-to-scalar copy may be a slow operation, as may the Predicated
Parallel Reduction itself.
It may be better to perform a pre-copy
The simplest usage is to perform an overwrite, specifying all three
register operands the same.
+```
svshape parallelreduce, 6
sv.add *8, *8, *8
+```
The Reduction Schedule will issue the Parallel Tree Reduction spanning
registers 8 through 13, by adjusting the offsets to RT, RA and RB as
intermediary computations will be written to: the remaining elements
will **not** be overwritten and will **not** be zero'd.
+```
svshape parallelreduce, 6
sv.add *0, *8, *8
+```
However it is critical to note that if the source and destination are
not the same then the trick of using a follow-up vector-scalar MV will
not work.
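
The overwrite case above (`sv.add *8, *8, *8` with six elements) can be sketched in Python. This is a minimal illustration only: the pairing of left and right operands is an assumption modelled on the loop bounds of the `svshape` operation-count pseudocode, no predicate mask is applied, and the names `parallel_reduce_schedule` and `overwrite_reduce` are hypothetical. The normative schedule is the one in the [[sv/remap/appendix]].

```python
import operator


def parallel_reduce_schedule(vl):
    """Yield (left, right) element offsets in tree order (assumed pairing)."""
    step = 1
    while step < vl:
        newstep = step * 2
        j = 0
        while j + step < vl:
            # element j accumulates element j+step at this tree level
            yield j, j + step
            j += newstep
        step = newstep


def overwrite_reduce(regs, base, vl, op=operator.add):
    """Model of 'sv.add *base, *base, *base' under the schedule above."""
    for left, right in parallel_reduce_schedule(vl):
        regs[base + left] = op(regs[base + left], regs[base + right])
    # with no predicate mask the result lands in the first element
    return regs[base]


regs = [0] * 32
regs[8:14] = [1, 2, 3, 4, 5, 6]   # registers 8 through 13
total = overwrite_reduce(regs, base=8, vl=6)
```

In this sketch six elements take five operations, and the elements used as intermediaries (offsets 0, 2 and 4) are overwritten while the others are left untouched.
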
-## Sub-Vector Horizontal Reduction
+### Sub-Vector Horizontal Reduction
Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed
on all first Subvector elements, followed by another separate independent
the Matrix is transposed (like Pack/Unpack)
before still applying the Parallel Reduction to the **row**.
-# Determining Register Hazards
+## Determining Register Hazards
For high-performance (Multi-Issue, Out-of-Order) systems it is critical
to be able to statically determine the extent of Vectors in order to
with varying degrees of refinement possible at correspondingly
increasing levels of complexity in hardware.
-# REMAP area of SVSTATE
+## REMAP area of SVSTATE
The following bits of the SVSTATE SPR are used for REMAP:
None
+`svremap` determines the relationship between registers and SVSHAPE SPRs.
+The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
+determine which shape is applied to an activated register. The `pst` bit, if
+cleared, indicates that the REMAP operation shall only apply to the immediately-following
+instruction. If set then REMAP remains permanently enabled until such time as it is
+explicitly disabled, either by `setvl` setting a new MAXVL, or with another
+`svremap` instruction.
+
# SHAPE Remapping SPRs
There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
Matrix-style reordering still applies to the indices, except limited
to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
-(Y,X). Only one dimension may optionally be skipped. Inversion of either
-X or Y or both is possible. Pseudocode for Indexed Mode (including elwidth
+(Y,X) for in-place Transposition.
+Only one dimension may optionally be skipped. Inversion of either
+X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth
overrides) may be written in terms of Matrix Mode, specifically
purposed to ensure that the 3rd dimension (Z) has no effect:
# svshape instruction <a name="svshape"> </a>
-`svshape` is a convenience instruction that reduces instruction
-count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
-(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
-including VL and MAXVL. Using `svshape` therefore does not also
-require `setvl`.
-
Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
svshape SVxd,SVyd,SVzd,SVRM,vf
| -- | -- | --- | ----- | ------ | -- | ------| -------- |
|OPCD| SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |
+ # set up FRB and FRS
+ SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
+ SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
+ mscale <- (0b0 || SVzd) + 1
+ if (SVrm = 0b1011) then
+ SVSHAPE0[30:31] <- 0b11 # iDCT mode
+ SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode
+ SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop
+ else
+ SVSHAPE0[30:31] <- 0b01 # DCT mode
+ SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode
+ SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode
+ # copy
+ SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
+ SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
+ # for FRA and FRT
+ SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
+ # reset costable "striding" to 1
+ SVSHAPE2[12:17] <- 0b000000
+ # set schedule up for DCT COS table generation
+ if (SVrm = 0b0101) | (SVrm = 0b1101) then
+ # calculate O(N log2 N)
+ vlen[0:6] <- [0] * 7
+ itercount[0:6] <- (0b00 || SVxd) + 0b0000001
+ itercount[0:6] <- (0b0 || itercount[0:5])
+ n <- [0] * 3
+ do while n < 5
+ if SVxd[4-n] = 0 then
+ leave
+ n <- n + 1
+ vlen[0:6] <- vlen + itercount
+ itercount[0:6] <- (0b0 || itercount[0:5])
+ # set up template in SVSHAPE0, then copy to 1-3
+ # set up FRB and FRS
+ SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
+ SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
+ mscale <- (0b0 || SVzd) + 1
+ SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode
+ SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode
+ if (SVrm = 0b0101) then
+ SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT
+ # copy
+ SVSHAPE1[0:31] <- SVSHAPE0[0:31]
+ SVSHAPE2[0:31] <- SVSHAPE0[0:31]
+ # for cos coefficient
+ SVSHAPE1[28:29] <- 0b10 # ci schedule
+ SVSHAPE2[28:29] <- 0b11 # size schedule
+ # set schedule up for iDCT / DCT inverse of half-swapped ordering
+ if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
+ vlen[0:6] <- (0b00 || SVxd) + 0b0000001
+ # set up template in SVSHAPE0
+ SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
+ SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
+ mscale <- (0b0 || SVzd) + 1
+ if (SVrm = 0b1110) then
+ SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap
+ if (SVrm = 0b1111) then
+ SVSHAPE0[30:31] <- 0b01 # FFT mode
+ else
+ SVSHAPE0[30:31] <- 0b11 # DCT mode
+ SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode
+ # set schedule up for parallel reduction
+ if (SVrm = 0b0111) then
+ # calculate the total number of operations (brute-force)
+ vlen[0:6] <- [0] * 7
+ itercount[0:6] <- (0b00 || SVxd) + 0b0000001
+ step[0:6] <- 0b0000001
+ i[0:6] <- 0b0000000
+ do while step <u itercount
+ newstep <- step[1:6] || 0b0
+ j[0:6] <- 0b0000000
+ do while (j+step <u itercount)
+ j <- j + newstep
+ i <- i + 1
+ step <- newstep
+ # VL in Parallel-Reduce is the number of operations
+ vlen[0:6] <- i
+ # set up template in SVSHAPE0, then copy to 1. only 2 needed
+ SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
+ SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
+ mscale <- (0b0 || SVzd) + 1
+ SVSHAPE0[30:31] <- 0b10 # parallel reduce submode
+ # copy
+ SVSHAPE1[0:31] <- SVSHAPE0[0:31]
+ # set up right operand (left operand 28:29 is zero)
+ SVSHAPE1[28:29] <- 0b01 # right operand
+ # set VL, MVL and Vertical-First
+ m[0:12] <- vlen * mscale
+ maxvl[0:6] <- m[6:12]
+ SVSTATE[0:6] <- maxvl # MAXVL
+ SVSTATE[7:13] <- vlen # VL
+ SVSTATE[63] <- vf
+
+Special Registers Altered:
+
+ None
+
+`svshape` is a convenience instruction that reduces instruction
+count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
+(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
+including VL and MAXVL. Using `svshape` therefore does not also
+require `setvl`.
+
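As an illustration of how `svshape` derives VL and MAXVL without `setvl`, the Parallel Reduction branch of the pseudocode can be mirrored in Python. This is a sketch under stated assumptions: the function names are hypothetical, the arguments are the raw (off-by-one) field values, and the masks stand in for the fixed 7-bit and 13-bit widths used in the pseudocode.

```python
def parallel_reduce_vl(svxd_field):
    """Count tree-reduction operations, mirroring the brute-force loop."""
    itercount = (svxd_field + 1) & 0x7F      # 7-bit itercount
    step, count = 1, 0
    while step < itercount:
        newstep = (step << 1) & 0x7F         # step[1:6] || 0b0
        j = 0
        while j + step < itercount:
            j += newstep
            count += 1
        step = newstep
    return count


def svstate_lengths(vlen, svzd_field):
    """Return (MAXVL, VL) as they would be written to SVSTATE."""
    mscale = svzd_field + 1
    m = (vlen * mscale) & 0x1FFF             # m[0:12], 13 bits
    maxvl = m & 0x7F                         # m[6:12], the low 7 bits
    return maxvl, vlen


vl = parallel_reduce_vl(5)                   # assembler value 6 is stored as 5
maxvl, vl_out = svstate_lengths(vl, svzd_field=0)
```

For a reduction over N elements the loop counts N-1 operations, which is why VL in Parallel Reduction mode is one less than the number of elements.
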
Fields:
* **SVxd** - SV REMAP "xdim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT etc.)
* **vf** - sets "Vertical-First" mode
-* **XO** - standard 6-bit XO field
*Note: SVxd, SVyd and SVzd are all stored "off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
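
The off-by-one storage can be sketched as a pair of helper functions. The names `encode_dim` and `decode_dim` are hypothetical and are not part of any assembler; they only illustrate the mapping between mnemonic values and 5-bit field contents.

```python
def encode_dim(value):
    """Map an assembler dimension (1..32) to its 5-bit field (0..31)."""
    if not 1 <= value <= 32:
        raise ValueError("dimension must be in the range 1..32")
    return value - 1


def decode_dim(field):
    """Map a 5-bit field (0..31) back to the dimension it represents."""
    if not 0 <= field <= 0b11111:
        raise ValueError("field must fit in 5 bits")
    return field + 1
```
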
+There are 14 REMAP Modes (2 bits are RESERVED for `svshape2`)
+
| SVRM | Remap Mode description |
| -- | -- |
| 0b0000 | Matrix 1/2/3D |
| 0b1111 | FFT half-swap |
Examples showing how all of these Modes operate exist in the online
-[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
-and the full pseudocode setting up all SPRs
-is in the [[openpower/isa/simplev]] page.
+[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD). Explaining
+these Modes further in detail is beyond the scope of this document.
In Indexed Mode, there are only 5 bits available to specify the GPR
to use, out of 128 GPRs (7 bit numbering). Therefore, only the top