how best to process the exact same equivalent loop-unrolled instruction
stream.*
+## Horizontal-Parallelism Hint
+
+`SVSTATE.hphint` indicates to hardware how many elements are fully
+independent. Hardware is permitted to assume that groups of elements
+up to `hphint` in size need not have Register (or Memory) Hazards created
+between them (including when `hphint > VL`).
+
+Setting `hphint` incorrectly can cause data corruption.
+For example, Matrix Outer Product relies on the innermost loop computations
+being independent. If `hphint` is set to greater than the Outer Product
+depth then data corruption is guaranteed to occur.
+
+Likewise on FFTs it is assumed that each layer of the RADIX2 triple-loop
+is independent, but that there are strict *inter-layer* Register Hazards.
+Therefore if `hphint` is set to greater than the RADIX2 width of the FFT,
+data corruption is guaranteed.
+
+Thus the key message is that setting `hphint` requires in-depth knowledge
+of the REMAP Algorithm Schedules, given in the Appendix.
+
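As a rough illustration of why an oversized `hphint` is dangerous, the Python sketch below models a toy accumulate schedule and checks whether any group of `hphint` consecutive steps contains a write-after-write or read-after-write dependency. It is not the actual Outer Product or FFT REMAP schedule: the function name, the register lists, and the simplified hazard model are all assumptions made for this example.

```python
def groups_are_hazard_free(dest_regs, src_regs, hphint):
    """Return True if no group of `hphint` consecutive steps has a
    register written by one step and written or read by a later step
    in the same group (simplified model: WAW and RAW only)."""
    if hphint < 1:
        raise ValueError("hphint must be at least 1 in this sketch")
    total = len(dest_regs)
    for start in range(0, total, hphint):
        written = set()
        for step in range(start, min(start + hphint, total)):
            if dest_regs[step] in written:
                return False  # write-after-write inside the group
            if any(src in written for src in src_regs[step]):
                return False  # read-after-write inside the group
            written.add(dest_regs[step])
    return True


# Hypothetical schedule: `depth` independent accumulators, revisited
# every `depth` steps (each step reads and writes its own accumulator).
depth = 4
vl = 8
dest_regs = [step % depth for step in range(vl)]
src_regs = [(dest,) for dest in dest_regs]

safe = groups_are_hazard_free(dest_regs, src_regs, depth)
unsafe = groups_are_hazard_free(dest_regs, src_regs, depth + 1)
```

In this toy model a group size equal to the accumulator depth is safe, while a group size one larger pulls a second write to the same accumulator into the group, which is the situation the text above warns about.
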
## REMAP types
This section summarises the motivation for each REMAP Schedule
uses overlapping registers as accumulators. Thus the Register Hazard
Management needed by Indexed REMAP *has* to be in place anyway.
+*Programmer's Note: `hphint` may be used to help hardware identify
+parallelism opportunities but it is critical to remember that the
+groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
+
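The difference between the two groupings in the note can be seen with a small Python sketch. The remap table below is a hypothetical 2x4 transpose chosen purely for illustration, and `group_size` stands in for the divisor used in the note's formula; neither is taken from the specification.

```python
def groups_by_step(num_steps, group_size):
    """Group number of each element, computed from the step counter."""
    return [step // group_size for step in range(num_steps)]


def groups_by_remapped_index(remap, group_size):
    """Group number computed from the remapped index (NOT what is used)."""
    return [remap[step] // group_size for step in range(len(remap))]


# Hypothetical remap: step -> element index for a transposed 2x4 layout.
transpose_2x4 = [0, 4, 1, 5, 2, 6, 3, 7]

by_step = groups_by_step(len(transpose_2x4), 4)
by_remap = groups_by_remapped_index(transpose_2x4, 4)
```

Here `by_step` keeps the first four steps together regardless of which registers they touch, whereas `by_remap` would interleave the groups. Only the first form matches the note.
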
The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
clearly that of the additional reading of the GPRs to be used as Indices,
plus the setup cost associated with creating those same Indices.
can therefore also be precise. The final result will be in the first
non-predicate-masked-out destination element, but due again to
the deterministic schedule programmers may find uses for the intermediate
-results.
+results, even for non-commutative Defined Word operations.
When Rc=1 a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result *and its CR Field*
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
-SVSHAPE is actively applied or not.
+SVSHAPE is actively applied or not, and if so, to which registers.
-* bit 0 of SVme indicates if mi0 is applied to RA / FRA / BA / BFA
-* bit 1 of SVme indicates if mi1 is applied to RB / FRB / BB
-* bit 2 of SVme indicates if mi2 is applied to RC / FRC / BC
-* bit 3 of SVme indicates if mo0 is applied to RT / FRT / BT / BF
-* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
+* bit 4 of SVme indicates if mi0 is applied to source RA / FRA / BA / BFA / RT / FRT
+* bit 3 of SVme indicates if mi1 is applied to source RB / FRB / BB
+* bit 2 of SVme indicates if mi2 is applied to source RC / FRC / BC
+* bit 1 of SVme indicates if mo0 is applied to result RT / FRT / BT / BF
+* bit 0 of SVme indicates if mo1 is applied to result Effective Address / FRS / RS
(LD/ST-with-update has an implicit 2nd write register, RA)
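
The bit assignments listed above can be sketched as a small Python decoder. This assumes MSB0 bit numbering within the 5-bit field (so "bit 4" is the least-significant bit), which is the Power ISA convention but is an assumption here; the function and table names are illustrative only.

```python
# MSB0 bit number within the 5-bit SVme field -> REMAP slot name
SVME_BITS = {4: "mi0", 3: "mi1", 2: "mi2", 1: "mo0", 0: "mo1"}


def decode_svme(svme):
    """List the REMAP slots enabled by a 5-bit SVme value (MSB0 assumed)."""
    if not 0 <= svme < 32:
        raise ValueError("SVme is a 5-bit field")
    active = []
    for msb0_bit in (4, 3, 2, 1, 0):  # report in mi0..mo1 order
        shift = 4 - msb0_bit  # MSB0 bit 4 is the least-significant bit
        if (svme >> shift) & 1:
            active.append(SVME_BITS[msb0_bit])
    return active
```

For example, under this assumption `decode_svme(0b00111)` reports the three input slots `mi0`, `mi1` and `mi2`.
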
The "persistence" bit if set will result in all Active REMAPs being applied
Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is
disabled: the register's elements are a linear (1D) vector.
-|31.30|29..28 |27..24| 23..21 | 20..18 | 17..12 |11..6 |5..0 | Mode |
-|---- |------ |------| ------ | ------- | ------- |----- |----- | ----- |
-|mode |skip |offset| invxyz | permute | zdimsz |ydimsz|xdimsz|Matrix |
-|0b00 |elwidth|offset|sk1/invxy|0b110/0b111|SVGPR|ydimsz|xdimsz|Indexed|
-|0b01 |submode|offset| invxyz | submode2| zdimsz |mode |xdimsz|DCT/FFT|
-|0b10 |submode|offset| invxyz | rsvd | rsvd |rsvd |xdimsz|Preduce|
-|0b11 | | | | | | | |rsvd |
+|0:5 |6:11 | 12:17 | 18:20 | 21:23 |24:27 |28:29 |30:31| Mode |
+|----- |----- | ------- | ------- | ------ |------|------ |---- | ----- |
+|xdimsz|ydimsz| zdimsz | permute | invxyz |offset|skip |mode |Matrix |
+|xdimsz|ydimsz|SVGPR | 0b110/0b111 |sk1/invxy|offset|elwidth|0b00 |Indexed|
+|xdimsz|mode | zdimsz | submode2| invxyz |offset|submode|0b01 |DCT/FFT|
+|xdimsz|rsvd | rsvd | rsvd | invxyz |offset|submode|0b10 |Preduce|
+| | | | | | | |0b11 |rsvd |
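
A minimal Python sketch of extracting the Matrix-layout fields from a 32-bit SVSHAPE value follows. It assumes MSB0 numbering (bit 0 is the most-significant bit), treats the stored field values as raw numbers without interpreting them, and distinguishes Indexed from Matrix mode by the permute values 0b110 and 0b111. All names are hypothetical.

```python
# Field name -> (first bit, last bit), MSB0 numbering, Matrix layout
SVSHAPE_FIELDS = {
    "xdimsz": (0, 5),
    "ydimsz": (6, 11),
    "zdimsz": (12, 17),
    "permute": (18, 20),
    "invxyz": (21, 23),
    "offset": (24, 27),
    "skip": (28, 29),
    "mode": (30, 31),
}


def field(value, start, end, width=32):
    """Extract MSB0-numbered bits start..end (inclusive) from value."""
    length = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << length) - 1)


def decode_matrix_layout(shape):
    """Split a 32-bit SVSHAPE into its raw Matrix-layout fields."""
    return {name: field(shape, s, e) for name, (s, e) in SVSHAPE_FIELDS.items()}


def shape_mode_name(shape):
    """Name the layout row that applies to this SVSHAPE value."""
    fields = decode_matrix_layout(shape)
    if fields["mode"] == 0b00:
        return "Indexed" if fields["permute"] in (0b110, 0b111) else "Matrix"
    return {0b01: "DCT/FFT", 0b10: "Preduce", 0b11: "rsvd"}[fields["mode"]]
```

The other layouts reuse the same bit positions with different meanings, so a fuller decoder would select a per-mode field table after reading `mode` and `permute`.
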
-mode sets different behaviours (straight matrix multiply, FFT, DCT).
+`mode` sets different behaviours (straight matrix multiply, FFT, DCT).
* **mode=0b00** sets straight Matrix Mode
* **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
| -- | -- | --- | ----- | ------ | -- | ------| -------- |
|PO | SVxd | SVyd | SVzd | SVRM | vf | XO | svshape |
-```
- # for convenience, VL to be calculated and stored in SVSTATE
- vlen <- [0] * 7
- mscale[0:5] <- 0b000001 # for scaling MAXVL
- itercount[0:6] <- [0] * 7
- SVSTATE[0:31] <- [0] * 32
- # only overwrite REMAP if "persistence" is zero
- if (SVSTATE[62] = 0b0) then
- SVSTATE[32:33] <- 0b00
- SVSTATE[34:35] <- 0b00
- SVSTATE[36:37] <- 0b00
- SVSTATE[38:39] <- 0b00
- SVSTATE[40:41] <- 0b00
- SVSTATE[42:46] <- 0b00000
- SVSTATE[62] <- 0b0
- SVSTATE[63] <- 0b0
- # clear out all SVSHAPEs
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
-
- # set schedule up for multiply
- if (SVrm = 0b0000) then
- # VL in Matrix Multiply is xd*yd*zd
- xd <- (0b00 || SVxd) + 1
- yd <- (0b00 || SVyd) + 1
- zd <- (0b00 || SVzd) + 1
- n <- xd * yd * zd
- vlen[0:6] <- n[14:20]
- # set up template in SVSHAPE0, then copy to 1-3
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[6:11] <- (0b0 || SVyd) # ydim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim
- SVSHAPE0[28:29] <- 0b11 # skip z
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- SVSHAPE3[0:31] <- SVSHAPE0[0:31]
- # set up FRA
- SVSHAPE1[18:20] <- 0b001 # permute x,z,y
- SVSHAPE1[28:29] <- 0b01 # skip z
- # FRC
- SVSHAPE2[18:20] <- 0b001 # permute x,z,y
- SVSHAPE2[28:29] <- 0b11 # skip y
-
- # set schedule up for FFT butterfly
- if (SVrm = 0b0001) then
- # calculate O(N log2 N)
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- n <- ((0b0 || SVxd) + 1) * n
- vlen[0:6] <- n[1:7]
- # set up template in SVSHAPE0, then copy to 1-3
- # for FRA and FRT
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D FFT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b01 # Butterfly mode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- # set up FRB and FRS
- SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
- # FRC (coefficients)
- SVSHAPE2[28:29] <- 0b10 # k schedule
-
- # set schedule up for (i)DCT Inner butterfly
- # SVrm Mode 4 (Mode 12 for iDCT) is for on-the-fly (Vertical-First Mode)
- if ((SVrm = 0b0100) |
- (SVrm = 0b1100)) then
- # calculate O(N log2 N)
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- n <- ((0b0 || SVxd) + 1) * n
- vlen[0:6] <- n[1:7]
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1100) then
- SVSHAPE0[30:31] <- 0b11 # iDCT mode
- SVSHAPE0[18:20] <- 0b011 # iDCT Inner Butterfly sub-mode
- else
- SVSHAPE0[30:31] <- 0b01 # DCT mode
- SVSHAPE0[18:20] <- 0b001 # DCT Inner Butterfly sub-mode
- SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop
- SVSHAPE0[6:11] <- 0b000011 # (i)DCT Inner Butterfly mode 4
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- if (SVrm != 0b0100) & (SVrm != 0b1100) then
- SVSHAPE3[0:31] <- SVSHAPE0[0:31]
- # for FRA and FRT
- SVSHAPE0[28:29] <- 0b01 # j+halfstep schedule
- # for cos coefficient
- SVSHAPE2[28:29] <- 0b10 # ci (k for mode 4) schedule
- SVSHAPE2[12:17] <- 0b000000 # reset costable "striding" to 1
- if (SVrm != 0b0100) & (SVrm != 0b1100) then
- SVSHAPE3[28:29] <- 0b11 # size schedule
-
- # set schedule up for (i)DCT Outer butterfly
- if (SVrm = 0b0011) | (SVrm = 0b1011) then
- # calculate O(N log2 N) number of outer butterfly overlapping adds
- vlen[0:6] <- [0] * 7
- n <- 0b000
- size <- 0b0000001
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- itercount[0:6] <- (0b0 || itercount[0:5])
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- count <- (itercount - 0b0000001) * size
- vlen[0:6] <- vlen + count[7:13]
- size[0:6] <- (size[1:6] || 0b0)
- itercount[0:6] <- (0b0 || itercount[0:5])
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1011) then
- SVSHAPE0[30:31] <- 0b11 # iDCT mode
- SVSHAPE0[18:20] <- 0b011 # iDCT Outer Butterfly sub-mode
- SVSHAPE0[21:23] <- 0b101 # "inverse" on outer and inner loop
- else
- SVSHAPE0[30:31] <- 0b01 # DCT mode
- SVSHAPE0[18:20] <- 0b100 # DCT Outer Butterfly sub-mode
- SVSHAPE0[6:11] <- 0b000010 # DCT Butterfly mode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31] # j+halfstep schedule
- SVSHAPE2[0:31] <- SVSHAPE0[0:31] # costable coefficients
- # for FRA and FRT
- SVSHAPE1[28:29] <- 0b01 # j+halfstep schedule
- # reset costable "striding" to 1
- SVSHAPE2[12:17] <- 0b000000
-
- # set schedule up for DCT COS table generation
- if (SVrm = 0b0101) | (SVrm = 0b1101) then
- # calculate O(N log2 N)
- vlen[0:6] <- [0] * 7
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- itercount[0:6] <- (0b0 || itercount[0:5])
- n <- [0] * 3
- do while n < 5
- if SVxd[4-n] = 0 then
- leave
- n <- n + 1
- vlen[0:6] <- vlen + itercount
- itercount[0:6] <- (0b0 || itercount[0:5])
- # set up template in SVSHAPE0, then copy to 1-3
- # set up FRB and FRS
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b01 # DCT/FFT mode
- SVSHAPE0[6:11] <- 0b000100 # DCT Inner Butterfly COS-gen mode
- if (SVrm = 0b0101) then
- SVSHAPE0[21:23] <- 0b001 # "inverse" on outer loop for DCT
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- SVSHAPE2[0:31] <- SVSHAPE0[0:31]
- # for cos coefficient
- SVSHAPE1[28:29] <- 0b10 # ci schedule
- SVSHAPE2[28:29] <- 0b11 # size schedule
-
- # set schedule up for iDCT / DCT inverse of half-swapped ordering
- if (SVrm = 0b0110) | (SVrm = 0b1110) | (SVrm = 0b1111) then
- vlen[0:6] <- (0b00 || SVxd) + 0b0000001
- # set up template in SVSHAPE0
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- if (SVrm = 0b1110) then
- SVSHAPE0[18:20] <- 0b001 # DCT opposite half-swap
- if (SVrm = 0b1111) then
- SVSHAPE0[30:31] <- 0b01 # FFT mode
- else
- SVSHAPE0[30:31] <- 0b11 # DCT mode
- SVSHAPE0[6:11] <- 0b000101 # DCT "half-swap" mode
-
- # set schedule up for parallel reduction
- if (SVrm = 0b0111) then
- # calculate the total number of operations (brute-force)
- vlen[0:6] <- [0] * 7
- itercount[0:6] <- (0b00 || SVxd) + 0b0000001
- step[0:6] <- 0b0000001
- i[0:6] <- 0b0000000
- do while step <u itercount
- newstep <- step[1:6] || 0b0
- j[0:6] <- 0b0000000
- do while (j+step <u itercount)
- j <- j + newstep
- i <- i + 1
- step <- newstep
- # VL in Parallel-Reduce is the number of operations
- vlen[0:6] <- i
- # set up template in SVSHAPE0, then copy to 1. only 2 needed
- SVSHAPE0[0:5] <- (0b0 || SVxd) # xdim
- SVSHAPE0[12:17] <- (0b0 || SVzd) # zdim - "striding" (2D DCT)
- mscale <- (0b0 || SVzd) + 1
- SVSHAPE0[30:31] <- 0b10 # parallel reduce submode
- # copy
- SVSHAPE1[0:31] <- SVSHAPE0[0:31]
- # set up right operand (left operand 28:29 is zero)
- SVSHAPE1[28:29] <- 0b01 # right operand
-
- # set VL, MVL and Vertical-First
- m[0:12] <- vlen * mscale
- maxvl[0:6] <- m[6:12]
- SVSTATE[0:6] <- maxvl # MAVXL
- SVSTATE[7:13] <- vlen # VL
- SVSTATE[63] <- vf
-```
+See [[sv/remap/appendix]] for `svshape` pseudocode
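
For orientation, two of the VL calculations performed by `svshape` can be sketched in Python: the Matrix Multiply case, where VL is the product of the three dimensions, and the Parallel Reduction case, where VL is the number of operations found by a brute-force loop. This is an informal transliteration that ignores the 7-bit truncation applied by the pseudocode, and the function names are assumptions.

```python
def matrix_vl(svxd, svyd, svzd):
    """VL for Matrix Multiply: each operand field stores (dimension - 1)."""
    return (svxd + 1) * (svyd + 1) * (svzd + 1)


def parallel_reduce_op_count(svxd):
    """Count the operations of a parallel (tree) reduction over
    svxd + 1 elements by walking the schedule step by step."""
    itercount = svxd + 1
    step = 1
    ops = 0
    while step < itercount:
        newstep = step * 2
        j = 0
        while j + step < itercount:
            j += newstep
            ops += 1  # one reduction operation per pairing
        step = newstep
    return ops
```

As a sanity check, reducing four elements (`svxd = 3`) takes three operations in this sketch.
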
Special Registers Altered:
* svindex SVG,rmm,SVd,ew,SVyx,mm,sk
-Pseudo-code:
-
-```
- # based on nearest MAXVL compute other dimension
- MVL <- SVSTATE[0:6]
- d <- [0] * 6
- dim <- SVd+1
- do while d*dim <u ([0]*4 || MVL)
- d <- d + 1
-
- # set up template, then copy once location identified
- shape <- [0]*32
- shape[30:31] <- 0b00 # mode
- if SVyx = 0 then
- shape[18:20] <- 0b110 # indexed xd/yd
- shape[0:5] <- (0b0 || SVd) # xdim
- if sk = 0 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- 0b111111 # ydim max
- else
- shape[18:20] <- 0b111 # indexed yd/xd
- if sk = 1 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- d-1 # ydim max
- shape[0:5] <- (0b0 || SVd) # ydim
- shape[12:17] <- (0b0 || SVG) # SVGPR
- shape[28:29] <- ew # element-width override
- shape[21] <- sk # skip 1st dimension
-
- # select the mode for updating SVSHAPEs
- SVSTATE[62] <- mm # set or clear persistence
- if mm = 0 then
- # clear out all SVSHAPEs first
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
- SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
- SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
- idx <- 0
- for bit = 0 to 4
- if rmm[4-bit] then
- # activate requested shape
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- # increment shape index, modulo 4
- if idx = 3 then idx <- 0
- else idx <- idx + 1
- else
- # refined SVSHAPE/REMAP update mode
- bit <- rmm[0:2]
- idx <- rmm[3:4]
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svindex` pseudocode
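
The way `rmm` selects which REMAP slots receive the new shape can be sketched in Python as below. This is an informal reading of the pseudocode under the assumption of MSB0 bit numbering for the 5-bit `rmm` field; it models only the slot-to-SVSHAPE assignment, not the SVSTATE or SVSHAPE registers themselves, and the names are hypothetical.

```python
REMAP_SLOTS = ("mi0", "mi1", "mi2", "mo0", "mo1")


def assign_all(rmm):
    """mm=0: every slot whose rmm bit is set gets the next SVSHAPE
    index, allocated round-robin starting from SVSHAPE0."""
    slots = {}
    idx = 0
    for bit in range(5):
        if (rmm >> bit) & 1:  # rmm[4-bit] in MSB0 numbering
            slots[REMAP_SLOTS[bit]] = idx
            idx = (idx + 1) % 4
    return slots


def assign_one(rmm):
    """mm=1: rmm[0:2] names one slot, rmm[3:4] names the SVSHAPE."""
    bit = (rmm >> 2) & 0b111
    idx = rmm & 0b11
    if bit > 4:
        raise ValueError("only five REMAP slots exist (assumed invalid)")
    return REMAP_SLOTS[bit], idx
```

In the first mode all previous assignments are cleared before the new ones are made, whereas the second mode updates a single slot and leaves the rest alone.
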
Special Registers Altered:
* svshape2 offs,yx,rmm,SVd,sk,mm
-Pseudo-code:
-
-```
- # based on nearest MAXVL compute other dimension
- MVL <- SVSTATE[0:6]
- d <- [0] * 6
- dim <- SVd+1
- do while d*dim <u ([0]*4 || MVL)
- d <- d + 1
- # set up template, then copy once location identified
- shape <- [0]*32
- shape[30:31] <- 0b00 # mode
- shape[0:5] <- (0b0 || SVd) # x/ydim
- if SVyx = 0 then
- shape[18:20] <- 0b000 # ordering xd/yd(/zd)
- if sk = 0 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- 0b111111 # ydim max
- else
- shape[18:20] <- 0b010 # ordering yd/xd(/zd)
- if sk = 1 then shape[6:11] <- 0 # ydim
- else shape[6:11] <- d-1 # ydim max
- # offset (the prime purpose of this instruction)
- shape[24:27] <- SVo # offset
- if sk = 1 then shape[28:29] <- 0b01 # skip 1st dimension
- else shape[28:29] <- 0b00 # no skipping
- # select the mode for updating SVSHAPEs
- SVSTATE[62] <- mm # set or clear persistence
- if mm = 0 then
- # clear out all SVSHAPEs first
- SVSHAPE0[0:31] <- [0] * 32
- SVSHAPE1[0:31] <- [0] * 32
- SVSHAPE2[0:31] <- [0] * 32
- SVSHAPE3[0:31] <- [0] * 32
- SVSTATE[32:41] <- [0] * 10 # clear REMAP.mi/o
- SVSTATE[42:46] <- rmm # rmm exactly REMAP.SVme
- idx <- 0
- for bit = 0 to 4
- if rmm[4-bit] then
- # activate requested shape
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- # increment shape index, modulo 4
- if idx = 3 then idx <- 0
- else idx <- idx + 1
- else
- # refined SVSHAPE/REMAP update mode
- bit <- rmm[0:2]
- idx <- rmm[3:4]
- if idx = 0 then SVSHAPE0 <- shape
- if idx = 1 then SVSHAPE1 <- shape
- if idx = 2 then SVSHAPE2 <- shape
- if idx = 3 then SVSHAPE3 <- shape
- SVSTATE[bit*2+32:bit*2+33] <- idx
- SVSTATE[46-bit] <- 1
-```
+See [[sv/remap/appendix]] for `svshape2` pseudocode
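
Both `svindex` and `svshape2` start by deriving a second dimension from MAXVL and the `SVd` operand. A minimal Python sketch of that loop, with a hypothetical function name, is:

```python
def other_dimension(maxvl, svd):
    """Smallest d such that d * (svd + 1) >= maxvl, found by the same
    incrementing loop the pseudocode uses."""
    dim = svd + 1
    d = 0
    while d * dim < maxvl:
        d += 1
    return d
```

The result is the ceiling of `maxvl / (svd + 1)`, so a MAXVL of 10 with a stored dimension of 3 (`svd = 2`) gives 4.
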
Special Registers Altered: