[[!tag standards]]
# REMAP
*
* add svindex
* see [[sv/propagation]] for a future way to apply
REMAP.
REMAP is an advanced form of Vector "Structure Packing" that
provides hardware-level support for commonly-used *nested* loop patterns.
For more general reordering an Indexed REMAP mode is available.
REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
from a linear form to a 2D or 3D transposed form, or "offset" to permit
arbitrary access to elements (when elwidth overrides are used),
independently on each Vector src or dest
register.
The initial primary motivation of REMAP was for Matrix Multiplication, reordering of sequential
data in-place. Four SPRs are provided so that a single FMAC may be
used in a single loop to perform for example 4x4 times 4x4 Matrix multiplication,
generating 64 FMACs. Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.
REMAP, like all of SV, is abstracted out, meaning that unlike traditional
Vector ISAs which would typically only have a limited set of instructions
that can be structure-packed (LD/ST typically), REMAP may be applied to
literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
Note that REMAP does not *directly* apply to sub-vector elements: that
is what swizzle is for. Swizzle *can* however be applied to the same
instruction as REMAP. As explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack EXTRA Mode bits
can extend down into Sub-vector elements to perform vec2/vec3/vec4
sequential reordering, but even here, REMAP is not extended down to
the actual sub-vector elements themselves.
In its general form, REMAP is quite expensive to set up, and on some
implementations introduce
latency, so should realistically be used only where it is worthwhile.
Commonly-used patterns such as Matrix Multiply, DCT and FFT have
helper instruction options which make REMAP easier to use.
There are three types of REMAP:
* **Matrix**, also known as 2D and 3D reshaping
* **FFT/DCT**, with full triple-loop in-place support: limited to
Power-2 RADIX
* **Indexing**, for any general-purpose reordering, also includes
limited 2D reshaping. Currently
under development.
# Principle
* normal vector element read/write of operands would be sequential
(0 1 2 3 ....)
* this is not appropriate for (e.g.) Matrix multiply which requires
accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
with this. both are expensive (copy large vectors, spill through memory)
and very few Packed SIMD ISAs cope with non-Power-2.
* REMAP **redefines** the order of access according to set "Schedules".
* The Schedules are not necessarily restricted to power-of-two boundaries
making it unnecessary to have for example specialised 3x4 transpose
instructions.
Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware. For
arbitrary remapping the `Indexed` REMAP may be used.
# Executive Summary Usage
* `svshape` to set the type of reordering to be applied to an
otherwise usual `0..VL-1` hardware for-loop
* `svremap` to set which registers a given reordering is to apply to
(RA, RT etc)
* `sv.instruction` where any Vectorised register marked by `svremap`
will have its ordering REMAPPED according to the schedule set
by `svshape`.
The following illustrative example multiplies a 3x4 and a 5x3
matrix to create
a 5x4 result:
svshape 5, 4, 3, 0, 0
svremap 31, 1, 2, 3, 0, 0, 0, 0
sv.fmadds 0.v, 8.v, 16.v, 0.v
* svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
* svremap activates all five registers RA RB RC RT RS (31)
* svremap requests:
- RA to use SVSHAPE1
- RB to use SVSHAPE2
- RC to use SVSHAPE3
- RT to use SVSHAPE0
- RS to use SVSHAPE0
* sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v
* With REMAP being active each register's element index is
*independently* transformed using the specified SHAPEs.
Thus the Vector Loop is arranged such that the use of
the multiply-and-accumulate instruction executes precisely the required
Schedule to perform an in-place in-registers Matrix Multiply with no
need to perform additional Transpose or register copy instructions.
The example above may be executed as a unit test and demo,
[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
# REMAP types
This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.
## Matrix (1D/2D/3D shaping)
Matrix Multiplication is a huge part of High-Performance Compute,
and 3D.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
Aside from the cost of the load on the L1 I-Cache, the trick only
works if one of the dimensions X or Y are power-two. Prime Numbers
(5x7, 3x5) become deeply problematic to unroll.
Even traditional Scalable Vector ISAs have issues with Matrices, often
having to perform data Transpose by pushing out through Memory and back,
or computing Transposition Indices (costly) then copying to another
Vector (costly).
Matrix REMAP was thus designed to solve these issues by providing Hardware
Assisted
"Schedules" that can view what would otherwise be limited to a strictly
linear Vector as instead being 2D (even 3D) *in-place* reordered.
With both Transposition and non-power-two being supported the issues
faced by other ISAs are mitigated.
Limitations of Matrix REMAP are that the Vector Length (VL) is currently
restricted to 127: up to 127 FMAs may be performed in total (potentially
127 vec2/3/4 FMAs may be used but this requires additional research).
Also given that it is in-registers only at present some care has to be
taken on regfile resource utilisation. However it is perfectly possible
to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
the usual 6-level large Matrix Multiply, without the usual difficulties
associated with SIMD.
Also the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
programming the SVSHAPE SPRs directly, which can take a lot more
instructions.
## FFT/DCT Triple Loop
TODO
## Indexed
The purpose of Indexing is to provide a generalised version of
Vector ISA "Permute" instructions, such as VSX `vperm`. The
Indexing is abstracted out and may be applied to much more
than an element move/copy, and is not limited for example
to the number of bytes that can fit into a VSX register.
Indexing may be applied to LD/ST (even on Indexed LD/ST
instructions such as `sv.lbzx`), arithmetic operations,
extsw: there is no artificial limit.
The only major caveat is that the registers to be used as
Indices must not be modified by any instruction after Indexed Mode
is established, and neither must MAXVL be altered. Failure to observe
these conditions results in `UNDEFINED` behaviour.
These conditions allow a Read-After-Write (RAW) Hazard to be created on
the entire range of Indices to be subsequently used, but a corresponding
Write-After-Read Hazard by any instruction that modifies the Indices
**does not have to be created**. Given the large number of registers
involved in Indexing this is a huge resource saving and reduction
in micro-architectural complexity. MAXVL is likewise
included in the RAW Hazards because it is involved in calculating
how many registers are to be considered Indices.
With these restrictions in place high-performance implementations
may read-cache the Indices from the point where a given `svindex` instruction
is called or SVSHAPE SPRs altered.
# REMAP area of SVSTATE
The following bits of the SVSTATE SPR are used for REMAP:
|32.33|34.35|36.37|38.39|40.41| 42.46 | 62 |
| -- | -- | -- | -- | -- | ----- | ------ |
|mi0 |mi1 |mi2 |mo0 |mo1 | SVme | RMpst |
mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
SVSHAPE is actively applied or not.
* bit 0 of SVme indicates if mi0 is applied to RA / FRA
* bit 1 of SVme indicates if mi1 is applied to RB / FRB
* bit 2 of SVme indicates if mi2 is applied to RC / FRC
* bit 3 of SVme indicates if mo0 is applied to RT / FRT
* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
(LD/ST-with-update has an implicit 2nd write register, RA)
# svremap instruction
There is also a corresponding SVRM-Form for the svremap
instruction which matches the above SPR:
svremap SVme,mi0,mi1,mi2,mo0,mo2,pst
|0 |6 |11 |13 |15 |17 |19 |21 | 22.25 |26..31 |
| -- | -- | -- | -- | -- | -- | -- | -- | ---- | ----- |
| PO | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO |
# SHAPE Remapping SPRs
There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
which have the same format.
[[!inline pages="openpower/sv/shape_table_format" raw="yes" ]]
# svshape instruction
`svshape` is a convenience instruction that reduces instruction
count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
including VL and MAXVL. Using `svshape` therefore does not also
require `setvl`.
Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
svshape SVxd,SVyd,SVzd,SVRM,vf
| 0.5|6.10 |11.15 |16..20 | 21..25 | 25 | 26..31| name |
| -- | -- | --- | ----- | ------ | -- | ------| -------- |
|OPCD| SVxd | SVyd | SVzd | SVRM | vf | XO | svstate |
Fields:
* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 6-bit XO field
*Note: SVxd, SVyz and SVzd are all stored "off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
| SVRM | Remap Mode description |
| -- | -- |
| 0b0000 | Matrix 1/2/3D |
| 0b0001 | FFT Butterfly |
| 0b0010 | DCT Inner butterfly, pre-calculated coefficients |
| 0b0011 | DCT Outer butterfly |
| 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
| 0b0101 | DCT COS table index generation |
| 0b0110 | DCT half-swap |
| 0b0111 | reserved |
| 0b1000 | reserved |
| 0b1001 | reserved |
| 0b1010 | iDCT Inner butterfly, pre-calculated coefficients |
| 0b1011 | iDCT Outer butterfly |
| 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
| 0b1101 | iDCT COS table index generation |
| 0b1110 | iDCT half-swap |
| 0b1111 | FFT half-swap |
Examples showing how all of these Modes operate exists in the online
[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
and the full pseudocode setting up all SPRs
is in the [[openpower/isa/simplev]] page.
In Indexed Mode, there are only 5 bits available to specify the GPR
to use, out of 128 GPRs (7 bit numbering). Therefore, only the top
5 bits are given in the `SVxd` field: the bottom two implicit bits
will be zero (`SVxd || 0b00`).
`svshape` has *limited applicability* due to being a 32-bit instruction.
The full capability of SVSHAPE SPRs may be accessed by directly writing
to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
larger than 32, and in-place Transpose. Potentially a future v3.1 Prefixed
instruction, `psvshape`, may extend the capability here.
# svindex instruction
`svindex` is a convenience instruction that reduces instruction
count for Indexed REMAP Mode. It sets up
(overwrites) all required SVSHAPE SPRs and can modify the REMAP
SPR as well. The relevant SPRs *may* be directly programmed with
`mtspr` however it is laborious to do so: svindex saves instructions
covering much of Indexed REMAP capability.
Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]])
svindex RS,rmm,SVd,ew,yx,mr,sk
| 0.5|6.10 |11.15 |16.20 | 21..25 | 26..31| name | Form |
| -- | -- | --- | ---- | ----------- | ------| -------- | ---- |
|OPCD| RS | rmm | SVd | ew/yx/mm/sk | XO | svindex | SVI-Form |
Fields:
* **SVd** - SV REMAP x/y dim
* **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
controlled by mm
* **ew** - sets element width override on RS
* **RS** - GPR RS<<2 to be used for Indexing
* **yx** - 2D reordering to be used if yx=1
* **mm** - mask mode. determines how `rmm` is interpreted.
* **sk** - Dimension skipping enabled
* **XO** - standard 6-bit XO field
*Note: SVd, like SVxd, SVyz and SVzd of `svshape`, are all stored
"off-by-one". In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
When `mm=0`:
* `rmm`, like REMAP.SVme, has bit 0
correspond to mi0, bit 1 to mi1, bit 2 to mi2,
bit 3 to mo0 and bit 4 to mi1
* all SVSHAPEs and the REMAP parts of SVSHAPE are first reset (initialised to zero)
* for each bit set in the 5-bit `rmm`, in order, the first
as-yet-unset SVSHAPE will be updated
with the other operands in the instruction, and the REMAP
SPR set.
* If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
* SVSTATE persistence bit is cleared
* No other alterations to SVSTATE are carried out
Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses mi2. REMAP.SVme is also set to 0b00110, REMAP.mi1=0
(SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
Rough algorithmic form:
marray = [mi0, mi1, mi2, mo0, mo1]
idx = 0
for bit = 0 to 4:
if not rmm[bit]: continue
setup(SVSHAPE[idx])
SVSTATE{marray[bit]} = idx
idx = (idx+1) modulo 4
When `mm=1`:
* bits 0-2 of `rmm` indicate an index selecting mi0-mo1
* bits 3-4 of `rmm` indicate which SVSHAPE 0-3 shall be updated
* only the selected SVSHAPE is overwritten
* only the relevant bits in the REMAP area of SVSTATE are updated
* REMAP persistence bit is set.
Example 1: if `rmm`=0b10011 then mo0 is selected and SVSHAPE2
to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
set to 2 (SVSHAPE2).
Example 2: if `rmm`=0b11100 then mo1 is selected and SVSHAPE3
to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
set to 3 (SVSHAPE3).
Rough algorithmic form:
marray = [mi0, mi1, mi2, mo0, mo1]
bit = rmm[0:2]
idx = rmm[3:4]
setup(SVSHAPE[idx])
SVSTATE{marray[bit]} = idx
SVSTATE.pst = 1
In essence, `mm=0` is intended for use to set as much of the
REMAP State SPRs as practical with a single instruction,
whilst `mm=1` is intended to be a little more refined.
# REMAP Matrix pseudocode
The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:
```
[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
```
An easier-to-read version (using python iterators) shows the loop nesting:
```
[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
```
Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index, instead. Given that there are four possible SHAPE entries, up to
four separate registers in any given operation may be simultaneously
remapped:
function op_add(rd, rs1, rs2) # add not VADD!
...
...
for (i = 0; i < VL; i++)
xSTATE.srcoffs = i # save context
if (predval & 1<
> Just a note: interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in thie order.
this can be done with the ternary instruction which has
an in-place triple boolean input:
RT = RT | (RA & RB)
and also has a CR Field variant of the same
notes from conversations:
> > for y in y_r:
> > for x in x_r:
> > for z in z_r:
> > result[y][x] +=
> > a[y][z] *
> > b[z][x]
> This nesting of loops works for matrix multiply, but not for transitive
> closure.
> > it can be done:
> >
> > for z in z_r:
> > for y in y_r:
> > for x in x_r:
> > result[y][x] +=
> > a[y][z] *
> > b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum. I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already peocessed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on teh subscript that
> bridges a and b (which in this case are teh same matrix, of course).
# SUBVL Remap
Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vectir elements were simply part of the main VL loop. This is the
*complete opposite* of predication which **only** applies to the whole
vec2/3/4. In pseudocode this would be:
for (i = 0; i < VL; i++)
if (predval & 1<
Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
a 2D REMAP allows:
* the column bytes (as a vec4) to be iterated over as an inner loop,
progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
This entirely in-place without special 128-bit opcodes. Below is
the pseudocode for [[!wikipedia Rijndael MixColumns]]
```
void gmix_column(unsigned char *r) {
unsigned char a[4];
unsigned char b[4];
unsigned char c;
unsigned char h;
// no swizzle here but still SUBVL.Remap
// can be done as vec4 byte-level
// elwidth overrides though.
for (c = 0; c < 4; c++) {
a[c] = r[c];
h = (unsigned char)((signed char)r[c] >> 7);
b[c] = r[c] << 1;
b[c] ^= 0x1B & h; /* Rijndael's Galois field */
}
// SUBVL.Remap still needed here
// bytelevel elwidth overrides and vec4
// These may then each be 4x 8bit bit Swizzled
// r0.vec4 = b.vec4
// r0.vec4 ^= a.vec4.WXYZ
// r0.vec4 ^= a.vec4.ZWXY
// r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```
With the assumption made by the above code that the column bytes have
already been turned around (vertical rather than horizontal) SUBVL.REMAP
may transparently fill that role, in-place, without a complex byte-level
mv operation.
The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations. Given wide enough SIMD backends in hardware these 3 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.
The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.
# TODO
* investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
in https://bugs.libre-soc.org/show_bug.cgi?id=653
* Triangular REMAP
* Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)