how best to process the exact same equivalent loop-unrolled instruction
stream.*
+## Horizontal-Parallelism Hint
+
+`SVSTATE.hphint` is an indicator to hardware of how many elements are
+fully independent. Hardware is permitted to assume that groups of elements
+up to `hphint` in size need not have Register (or Memory) Hazards created
+between them (including when `hphint > VL`).
+
+If `hphint` is set incorrectly, hardware may omit Hazards that the
+computation depends on. For example, Matrix Outer Product relies on the
+innermost loop computations being independent. If `hphint` is set to
+greater than the Outer Product depth then data corruption is guaranteed
+to occur.
+
+Likewise on FFTs it is assumed that each layer of the RADIX2 triple-loop
+is independent, but that there are strict *inter-layer* Register Hazards.
+Therefore if `hphint` is set to greater than the RADIX2 width of the FFT,
+data corruption is guaranteed.
+
+Thus the key message is that setting `hphint` requires in-depth knowledge
+of the REMAP Algorithm Schedules, given in the Appendix.
+
## REMAP types
This section summarises the motivation for each REMAP Schedule
uses overlapping registers as accumulators. Thus the Register Hazard
Management needed by Indexed REMAP *has* to be in place anyway.
+*Programmer's Note: `hphint` may be used to help hardware identify
+parallelism opportunities but it is critical to remember that the
+groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
+
The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
clearly that of the additional reading of the GPRs to be used as Indices,
plus the setup cost associated with creating those same Indices.
Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction.
+*appearance* and *effect* of Reduction. Parallel Reduction is not limited
+to Power-of-two Vector lengths but is limited, as usual, by the total
+number of element operations (127) as well as by the available register
+file size.
In Horizontal-First Mode, Vector-result reduction **requires**
the destination to be a Vector, which will be used to store
-intermediary results.
+intermediary results, in order to achieve a correct final
+result.
Given that the tree-reduction schedule is deterministic,
Interrupts and exceptions
can therefore also be precise. The final result will be in the first
non-predicate-masked-out destination element, but due again to
the deterministic schedule programmers may find uses for the intermediate
-results.
+results, even for non-commutative Defined Word operations.
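
The deterministic ordering described above can be illustrated with a minimal Python sketch. It follows the doubling-step loop structure used to count operations in the `svshape` pseudocode, and it *assumes* that each operation combines element `j` (left operand and destination) with element `j+step` (right operand). Predicate masks are ignored, and the function names `preduce_schedule` and `apply_schedule` are hypothetical, not part of the specification.

```python
def preduce_schedule(n):
    """Return (left, right) element index pairs, in issue order, for n elements.

    Assumption: each operation is v[left] = op(v[left], v[right]).
    Predicate masking is not modelled.
    """
    pairs = []
    step = 1
    while step < n:
        newstep = step * 2
        j = 0
        while j + step < n:
            pairs.append((j, j + step))
            j += newstep
        step = newstep
    return pairs


def apply_schedule(values, op):
    """Apply the schedule in place on a copy and return the whole vector,
    so that the intermediate results remain visible."""
    v = list(values)
    for left, right in preduce_schedule(len(v)):
        v[left] = op(v[left], v[right])
    return v


# Seven elements (not a power of two) need six operations.
schedule = preduce_schedule(7)

# String concatenation is associative but not commutative: operand order
# is preserved, so element 0 ends up holding the in-order result.
result = apply_schedule("abcdefg", lambda a, b: a + b)
```

In this sketch element 0 holds the final result while elements 2 and 4 hold partial results, which is the kind of intermediate value a programmer might reuse.
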
When Rc=1 a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result *and its CR Field*
Move operations are **not** included in the Schedule. This means that
the Schedule leaves the final (scalar) result in the first-non-masked
element of the Vector used. With the predicate mask being dynamic
-(but deterministic) this result could be anywhere.
+(but deterministic) at a superficial glance it seems this result
+could be anywhere.
If that result is needed to be moved to a (single) scalar register
then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
of the values, compressing them (VREDUCE-style) into a contiguous block,
which will guarantee that the result goes into the very first element
of the destination vector, in which case clearly no follow-up
-predicated vector-to-scalar MV operation is needed.
+predicated vector-to-scalar MV operation is needed. A VREDUCE effect
+is achieved by setting just a source predicate mask on Twin-Predicated
+operations.
**Usage conditions**
Defined Words is done with SUBVL looping as the inner loop not the
outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
+*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
+will clearly result in data corruption. It may be best to perform
+a Pack/Unpack Transposing copy of the data first.*
+
## Determining Register Hazards
For high-performance (Multi-Issue, Out-of-Order) systems it is critical
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
-SVSHAPE is actively applied or not.
+SVSHAPE is actively applied or not, and if so, to which registers.
-* bit 0 of SVme indicates if mi0 is applied to RA / FRA / BA / BFA
-* bit 1 of SVme indicates if mi1 is applied to RB / FRB / BB
-* bit 2 of SVme indicates if mi2 is applied to RC / FRC / BC
-* bit 3 of SVme indicates if mo0 is applied to RT / FRT / BT / BF
-* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
+* bit 4 of SVme indicates if mi0 is applied to source RA / FRA / BA / BFA / RT / FRT
+* bit 3 of SVme indicates if mi1 is applied to source RB / FRB / BB
+* bit 2 of SVme indicates if mi2 is applied to source RC / FRC / BC
+* bit 1 of SVme indicates if mo0 is applied to result RT / FRT / BT / BF
+* bit 0 of SVme indicates if mo1 is applied to result Effective Address / FRS / RS
(LD/ST-with-update has an implicit 2nd write register, RA)
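
As a rough illustration of the revised numbering, the Python sketch below decodes a 5-bit `SVme` value. It *assumes* the bit numbers in the list above are MSB0 (bit 0 is the most significant of the five bits), which is consistent with the `rmm[4-bit]` test in the `svindex` pseudocode. The names `SVME_BITS` and `decode_svme` are hypothetical.

```python
# Index in this tuple is the MSB0 bit number within the 5-bit SVme field.
SVME_BITS = ("mo1", "mo0", "mi2", "mi1", "mi0")


def decode_svme(svme):
    """Map a 5-bit SVme value to {name: enabled}, assuming MSB0 numbering."""
    if not 0 <= svme < 32:
        raise ValueError("SVme is a 5-bit field")
    return {name: bool((svme >> (4 - bit)) & 1)
            for bit, name in enumerate(SVME_BITS)}


only_mi0 = decode_svme(0b00001)   # MSB0 bit 4 set
only_mo1 = decode_svme(0b10000)   # MSB0 bit 0 set
```
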
The "persistence" bit if set will result in all Active REMAPs being applied
Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is
disabled: the register's elements are a linear (1D) vector.
-|31.30|29..28 |27..24| 23..21 | 20..18 | 17..12 |11..6 |5..0 | Mode |
-|---- |------ |------| ------ | ------- | ------- |----- |----- | ----- |
-|mode |skip |offset| invxyz | permute | zdimsz |ydimsz|xdimsz|Matrix |
-|0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed|
-|0b01 |submode|offset| invxyz | submode2| zdimsz |mode |xdimsz|DCT/FFT|
-|0b10 |submode|offset| invxyz | rsvd | rsvd |rsvd |xdimsz|Preduce|
-|0b11 | | | | | | | |rsvd |
+|0:5 |6:11 | 12:17 | 18:20 | 21:23 |24:27 |28:29 |30:31| Mode |
+|----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
+|xdimsz|ydimsz| zdimsz | permute | invxyz |offset|skip |mode |Matrix |
+|xdimsz|ydimsz|SVGPR |0b110/0b111|sk1/invxy|offset|elwidth|0b00 |Indexed|
+|xdimsz|mode | zdimsz | submode2| invxyz |offset|submode|0b01 |DCT/FFT|
+| rsvd |rsvd |xdimsz | rsvd | invxyz |offset|submode|0b10 |Preduce|
+| | | | | | | |0b11 |rsvd |
-mode sets different behaviours (straight matrix multiply, FFT, DCT).
+`mode` sets different behaviours (straight matrix multiply, FFT, DCT).
* **mode=0b00** sets straight Matrix Mode
* **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
* **mode=0b01** sets "FFT/DCT" mode and activates submodes
* **mode=0b10** sets "Parallel Reduction" Schedules.
+*Architectural Resource Allocation note: the four SVSHAPE SPRs are best
+allocated sequentially and contiguously in order that `sv.mtspr` may
+be used.*
+
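
The field layout and mode selection above can be sketched in Python as follows. This is a minimal illustration, assuming the 32-bit SVSHAPE value uses MSB0 bit numbering (bit 0 is the most significant bit) and using the Matrix-row field names for every mode, even though other modes reinterpret some of those bits. The helper names `msb0_field` and `decode_svshape` are hypothetical.

```python
# Field name -> (first bit, last bit), MSB0 numbering, taken from the Matrix row.
SVSHAPE_FIELDS = {
    "xdimsz": (0, 5),
    "ydimsz": (6, 11),
    "zdimsz": (12, 17),
    "permute": (18, 20),
    "invxyz": (21, 23),
    "offset": (24, 27),
    "skip": (28, 29),
    "mode": (30, 31),
}


def msb0_field(value, start, end, width=32):
    """Extract bits start..end (inclusive, MSB0 numbering) from value."""
    nbits = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << nbits) - 1)


def decode_svshape(value):
    """Return (schedule kind, raw fields) for a 32-bit SVSHAPE value."""
    fields = {name: msb0_field(value, s, e)
              for name, (s, e) in SVSHAPE_FIELDS.items()}
    mode = fields["mode"]
    if mode == 0b00:
        # Indexed Mode is Matrix Mode with permute set to 0b110 or 0b111.
        kind = "Indexed" if fields["permute"] in (0b110, 0b111) else "Matrix"
    elif mode == 0b01:
        kind = "DCT/FFT"
    elif mode == 0b10:
        kind = "Preduce"
    else:
        kind = "rsvd"
    return kind, fields


linear = decode_svshape(0)              # all zeros: remapping disabled
indexed = decode_svshape(0b110 << 11)   # permute (bits 18:20) = 0b110
preduce = decode_svshape(0b10)          # mode (bits 30:31) = 0b10
```
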
## Parallel Reduction Mode
Creates the Schedules for Parallel Tree Reduction.
| -- | -- | --- | ----- | ------ | -- | ------| -------- |
|PO | SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |
-```
- # for convenience, VL to be calculated and stored in SVSTATE
- vlen <- [0] * 7
- mscale[0:5] <- 0b000001 # for scaling MAXVL
- itercount[0:6] <- [0] * 7
- SVSTATE[0:31] <- [0] * 32
- # only overwrite REMAP if "persistence" is zero
- if (SVSTATE[62] = 0b0) then
- SVSTATE[32:33] <- 0b00
- SVSTATE[34:35] <- 0b00
- SVSTATE[36:37] <- 0b00
- SVSTATE[38:39] <- 0b00
- SVSTATE[40:41] <- 0b00
- SVSTATE[42:46] <- 0b00000
- SVSTATE[62] <- 0b0
- SVSTATE[63] <- 0b0
- # clear out all SVSHAPEs
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
-
- # set schedule up for multiply
- if (SVrm = 0b0000) then
- # VL in Matrix Multiply is xd*yd*zd
- xd <- (0b00 || SVxd) + 1
- yd <- (0b00 || SVyd) + 1
- zd <- (0b00 || SVzd) + 1
- n <- xd * yd * zd
- vlen[0:6] <- n[14:20]
- # set up template in SVSHAPE0, then copy to 1-3
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[6:11] <- (0b0 || SVyd) # ydim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim
- SVSHAPE0[28:29] <- 0b11 # skip z
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- SVSHAPE3[0:31] <- SVSHAPE0[0:31]
- # set up FRA
- SVSHAPE1[18:20] <- 0b001 # permute x,z,y
- SVSHAPE1[28:29] <- 0b01 # skip z
- # FRC
- SVSHAPE2[18:20] <- 0b001 # permute x,z,y
- SVSHAPE2[28:29] <- 0b11 # skip y
-
- # set schedule up for FFT butterfly
- if (SVrm = 0b0001) then
- # calculate O(N log2 N)
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- n <- ((0b0 || SVxd) + 1) * n
- vlen[0:6] <- n[1:7]
- # set up template in SVSHAPE0, then copy to 1-3
- # for FRA and FRT
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D FFT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b01 # Butterfly mode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- # set up FRB and FRS
- SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
- # FRC (coefficients)
- SVSHAPE2[28:29] <- 0b10 # k schedule
-
- # set schedule up for (i)DCT Inner butterfly
- # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
- if ((SVrm = 0b0100) |
- (SVrm = 0b1100)) then
- # calculate O(N log2 N)
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- n <- ((0b0 || SVxd) + 1) * n
- vlen[0:6] <- n[1:7]
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1100) then
- SVSHAPE0[30:31] <- 0b11 # iDCT mode
- SVSHAPE0[18:20] <- 0b011 # iDCT Inner Butterfly sub-mode
- else
- SVSHAPE0[30:31] <- 0b01 # DCT mode
- SVSHAPE0[18:20] <- 0b001 # DCT Inner Butterfly sub-mode
- SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop
- SVSHAPE0[6:11] <- 0b000011 # (i)DCT Inner Butterfly mode 4
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- if (SVrm != 0b0100) & (SVrm != 0b1100) then
- SVSHAPE3[0:31] <- SVSHAPE0[0:31]
- # for FRA and FRT
- SVSHAPE0[28:29] <- 0b01 # j+halfstep schedule
- # for cos coefficient
- SVSHAPE2[28:29] <- 0b10 # ci (k for mode 4) schedule
- SVSHAPE2[12:17] <- 0b000000 # reset costable "striding" to 1
- if (SVrm != 0b0100) & (SVrm != 0b1100) then
- SVSHAPE3[28:29] <- 0b11 # size schedule
-
- # set schedule up for (i)DCT Outer butterfly
- if (SVrm = 0b0011) | (SVrm = 0b1011) then
- # calculate O(N log2 N) number of outer butterfly overlapping adds
- vlen[0:6] <- [0] * 7
- n <- 0b000
- size <- 0b0000001
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- itercount[0:6] <- (0b0 || itercount[0:5])
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- count <- (itercount - 0b0000001) * size
- vlen[0:6] <- vlen + count[7:13]
- size[0:6] <- (size[1:6] || 0b0)
- itercount[0:6] <- (0b0 || itercount[0:5])
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1011) then
- SVSHAPE0[30:31] <- 0b11 # iDCT mode
- SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode
- SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop
- else
- SVSHAPE0[30:31] <- 0b01 # DCT mode
- SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode
- SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
- SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
- # for FRA and FRT
- SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
- # reset costable "striding" to 1
- SVSHAPE2[12:17] <- 0b000000
-
- # set schedule up for DCT COS table generation
- if (SVrm = 0b0101) | (SVrm = 0b1101) then
- # calculate O(N log2 N)
- vlen[0:6] <- [0] * 7
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- itercount[0:6] <- (0b0 || itercount[0:5])
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- vlen[0:6] <- vlen + itercount
- itercount[0:6] <- (0b0 || itercount[0:5])
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode
- SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode
- if (SVrm = 0b0101) then
- SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- # for cos coefficient
- SVSHAPE1[28:29] <- 0b10 # ci schedule
- SVSHAPE2[28:29] <- 0b11 # size schedule
-
- # set schedule up for iDCT / DCT inverse of half-swapped ordering
- if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
- vlen[0:6] <- (0b00 || SVxd) + 0b0000001
- # set up template in SVSHAPE0
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1110) then
- SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap
- if (SVrm = 0b1111) then
- SVSHAPE0[30:31] <- 0b01 # FFT mode
- else
- SVSHAPE0[30:31] <- 0b11 # DCT mode
- SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode
-
- # set schedule up for parallel reduction
- if (SVrm = 0b0111) then
- # calculate the total number of operations (brute-force)
- vlen[0:6] <- [0] * 7
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- step[0:6] <- 0b0000001
- i[0:6] <- 0b0000000
- do while step <u itercount
- newstep <- step[1:6] || 0b0
- j[0:6] <- 0b0000000
- do while (j+step <u itercount)
- j <- j + newstep
- i <- i + 1
- step <- newstep
- # VL in Parallel-Reduce is the number of operations
- vlen[0:6] <- i
- # set up template in SVSHAPE0, then copy to 1. only 2 needed
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b10 # parallel reduce submode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- # set up right operand (left operand 28:29 is zero)
- SVSHAPE1[28:29] <- 0b01 # right operand
-
- # set VL, MVL and Vertical-First
- m[0:12] <- vlen * mscale
- maxvl[0:6] <- m[6:12]
- SVSTATE[0:6] <- maxvl # MAVXL
- SVSTATE[7:13] <- vlen # VL
- SVSTATE[63] <- vf
-```
+See [[sv/remap/appendix]] for `svshape` pseudocode
Special Registers Altered:
* svindex SVG,rmm,SVd,ew,SVyx,mm,sk
-Pseudo-code:
-
-```
- # based on nearest MAXVL compute other dimension
- MVL <- SVSTATE[0:6]
- d <- [0] * 6
- dim <- SVd+1
- do while d*dim <u ([0]*4 || MVL)
- d <- d + 1
-
- # set up template, then copy once location identified
- shape <- [0]*32
- shape[30:31] <- 0b00 # mode
- if SVyx = 0 then
- shape[18:20] <- 0b110 # indexed xd/yd
- shape[0:5] <- (0b0 || SVd) # xdim
- if sk = 0 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- 0b111111 # ydim max
- else
- shape[18:20] <- 0b111 # indexed yd/xd
- if sk = 1 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- d-1 # ydim max
- shape[0:5] <- (0b0 || SVd) # ydim
- shape[12:17] <- (0b0 || SVG) # SVGPR
- shape[28:29] <- ew # element-width override
- shape[21] <- sk # skip 1st dimension
-
- # select the mode for updating SVSHAPEs
- SVSTATE[62] <- mm # set or clear persistence
- if mm = 0 then
- # clear out all SVSHAPEs first
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
- SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
- SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
- idx <- 0
- for bit = 0 to 4
- if rmm[4-bit] then
- # activate requested shape
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- # increment shape index, modulo 4
- if idx = 3 then idx <- 0
- else idx <- idx + 1
- else
- # refined SVSHAPE/REMAP update mode
- bit <- rmm[0:2]
- idx <- rmm[3:4]
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svindex` pseudocode
Special Registers Altered:
* svshape2 offs,yx,rmm,SVd,sk,mm
-Pseudo-code:
-
-```
- # based on nearest MAXVL compute other dimension
- MVL <- SVSTATE[0:6]
- d <- [0] * 6
- dim <- SVd+1
- do while d*dim <u ([0]*4 || MVL)
- d <- d + 1
- # set up template, then copy once location identified
- shape <- [0]*32
- shape[30:31] <- 0b00 # mode
- shape[0:5] <- (0b0 || SVd) # x/ydim
- if SVyx = 0 then
- shape[18:20] <- 0b000 # ordering xd/yd(/zd)
- if sk = 0 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- 0b111111 # ydim max
- else
- shape[18:20] <- 0b010 # ordering yd/xd(/zd)
- if sk = 1 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- d-1 # ydim max
- # offset (the prime purpose of this instruction)
- shape[24:27] <- SVo # offset
- if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
- else shape[28:29] <- 0b00 # no skipping
- # select the mode for updating SVSHAPEs
- SVSTATE[62] <- mm # set or clear persistence
- if mm = 0 then
- # clear out all SVSHAPEs first
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
- SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
- SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
- idx <- 0
- for bit = 0 to 4
- if rmm[4-bit] then
- # activate requested shape
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- # increment shape index, modulo 4
- if idx = 3 then idx <- 0
- else idx <- idx + 1
- else
- # refined SVSHAPE/REMAP update mode
- bit <- rmm[0:2]
- idx <- rmm[3:4]
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svshape2` pseudocode
Special Registers Altered: