[[!tag standards]]
# REMAP
* see [[sv/propagation]] because it is currently the only way to apply
  REMAP.
REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
from a linear form to a 2D or 3D transposed form, or "offset" to permit
arbitrary access to elements, independently for each Vector src or dest
register.
The primary use is Matrix Multiplication: reordering of sequential
data in-place. Four SPRs are provided so that a single FMAC may be
used in a single loop to perform 4x4 times 4x4 Matrix multiplication,
generating 64 FMACs. Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.
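As a sketch of the claim above (plain Python, with a hypothetical index
decomposition rather than the actual SVSHAPE hardware Schedules): a 4x4
by 4x4 matrix product is exactly 64 multiply-accumulates, so a single
FMAC in one flat loop covers the whole product when each operand's index
is remapped:

```python
# Hypothetical sketch: 64 FMACs in one flat loop, with each operand's
# element index derived ("remapped") from the single loop counter i.
def matmul4(a, b):
    n = 4
    r = [[0.0] * n for _ in range(n)]
    for i in range(n * n * n):              # 64 iterations: 64 FMACs
        y, x, z = i // 16, (i // 4) % 4, i % 4
        r[y][x] += a[y][z] * b[z][x]        # one FMAC per iteration
    return r
```

In SV the three index streams `y, x, z` would come from three separate
REMAP Schedules applied to the FMAC's dest and src registers.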
REMAP, like all of SV, is abstracted out, meaning that unlike traditional
Vector ISAs which would typically only have a limited set of instructions
that can be structure-packed (LD/ST typically), REMAP may be applied to
literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
Note that REMAP does not apply to sub-vector elements: that is what
swizzle is for. Swizzle *can* however be applied to the same instruction
as REMAP.
REMAP is quite expensive to set up, and on some implementations introduces
latency, so should realistically be used only where it is worthwhile.
# Principle
* normal vector element read/write as operands would be sequential
(0 1 2 3 ....)
* this is not appropriate for (e.g.) Matrix multiply which requires
accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
with this. Both are expensive (copy large vectors, spill through memory)
* REMAP **redefines** the order of access according to set "Schedules"
Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware.
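A minimal sketch (hypothetical, not the reference remap.py) of such a
Schedule in Python: a 2D "transpose" re-orders the linear indices
`0..VL-1` so that a row-major matrix is walked down its columns,
producing exactly the alternative sequence `0 3 6 1 4 7 ...` above:

```python
# Hypothetical 2D "Schedule" sketch: re-order the linear element
# indices 0..VL-1 into transposed (column-major) order.
def remap_2d_transpose(xdim, ydim):
    """Yield element indices of a row-major matrix, column by column."""
    for x in range(xdim):
        for y in range(ydim):
            yield y * xdim + x

# for a 3x3 matrix this yields 0 3 6 1 4 7 2 5 8
indices = list(remap_2d_transpose(3, 3))
```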
# REMAP SPR
| 0..1 | 2..3 | 4..5 | 6..7 | 8..9 | 10..14 | 15..23 |
| ---- | ---- | ---- | ---- | ---- | ------ | ------ |
| mi0  | mi1  | mi2  | mo0  | mo1  | SVme   | rsv    |
mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (FRT, FRS respectively).
SVme is 5 bits, and indicates whether the corresponding
SVSHAPE is actively applied or not.
* bit 0 of SVme indicates if mi0 is applied to RA / FRA
* bit 1 of SVme indicates if mi1 is applied to RB / FRB
* bit 2 of SVme indicates if mi2 is applied to RC / FRC
* bit 3 of SVme indicates if mo0 is applied to RT / FRT
* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
(LD/ST-with-update has an implicit 2nd write register, RA)
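As an illustrative sketch of the layout above (plain Python; for
simplicity the table's bit positions are treated as LSB0 shift amounts,
whereas the Power ISA numbers bits MSB0, so a real decoder must adjust
accordingly):

```python
# Illustrative only: unpack the REMAP SPR fields, treating the table's
# bit positions as LSB0 shift amounts (the Power ISA itself uses MSB0
# bit numbering).
def decode_remap(spr):
    fields = {name: (spr >> shift) & 0b11          # selects SVSHAPE0-3
              for name, shift in (("mi0", 0), ("mi1", 2), ("mi2", 4),
                                  ("mo0", 6), ("mo1", 8))}
    fields["SVme"] = (spr >> 10) & 0b11111         # 5 enable bits
    return fields
```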
There is also a corresponding SVRM-Form for the svremap
instruction which matches the above SPR:
| 0  | 6    | 11  | 13  | 15  | 17  | 19  | 21  | 22   | 26 | 31 |
| -- | ---- | --- | --- | --- | --- | --- | --- | ---- | -- | -- |
| PO | SVme | mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO | /  |
# SHAPE 1D/2D/3D vector-matrix remapping SPRs
There are four "shape" SPRs, SVSHAPE0-3, 32 bits each,
which all have the same format.
[[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:
```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
```
An easier-to-read version (using python iterators) shows the loop nesting:
```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
```
Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index, instead. Given that there are four possible SHAPE entries, up to
four separate registers in any given operation may be simultaneously
remapped:
```
function op_add(rd, rs1, rs2) # add not VADD!
  ...
  ...
  for (i = 0; i < VL; i++)
    xSTATE.srcoffs = i # save context
    if (predval & 1<<i) # predication uses intregs
       ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                              ireg[rs2+remap3(irs2)];
    if (int_vec[rd].isvector)  { id += 1; }
    if (int_vec[rs1].isvector) { irs1 += 1; }
    if (int_vec[rs2].isvector) { irs2 += 1; }
```

> Just a note: interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.
This can be done with the ternary instruction, which has
an in-place triple boolean input:

    RT = RT | (RA & RB)

and also has a CR Field variant of the same.

Notes from conversations:
> >     for y in y_r:
> >         for x in x_r:
> >             for z in z_r:
> >                 result[y][x] +=
> >                     a[y][z] *
> >                     b[z][x]
>
> This nesting of loops works for matrix multiply, but not for transitive
> closure.
>
> > it can be done:
> >
> >     for z in z_r:
> >         for y in y_r:
> >             for x in x_r:
> >                 result[y][x] +=
> >                     a[y][z] *
> >                     b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum. I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).
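The nesting described in the conversation above can be sketched in plain
Python: the bridging subscript `z` is the outer loop, and the result
matrix is the very same object being read while it is updated:

```python
# Warshall transitive closure: + is "or", * is "and", and result, a
# and b are all the same matrix, updated in-place while being used.
def warshall(adj):
    """In-place transitive closure of a boolean adjacency matrix."""
    n = len(adj)
    for z in range(n):               # bridging subscript: OUTER loop
        for y in range(n):
            for x in range(n):
                adj[y][x] = adj[y][x] or (adj[y][z] and adj[z][x])
    return adj
```

Replacing `or`/`and` with `min`/`+` gives the shortest-path variant
mentioned above (Floyd-Warshall).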
# SUBVL Remap
Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop. This is the
*complete opposite* of predication, which **only** applies to the whole
vec2/3/4. In pseudocode this would be:

```
for (i = 0; i < VL; i++)
    if (predval & 1<<i) # predication is per whole vec2/3/4
        for (j = 0; j < SUBVL; j++)
            # remapping is applied to the flattened index
            id = remap(i*SUBVL + j)
            ireg[rd+id] <= ...
```
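As an illustration (plain Python, using a hypothetical 2D transpose
Schedule): with VL=2 and SUBVL=4, the Schedule operates across all
VL*SUBVL flattened indices, interleaving the sub-vector elements rather
than keeping each vec4 together:

```python
# Hypothetical sketch: SUBVL elements are remapped as part of the main
# loop, so a Schedule sees VL*SUBVL flattened indices, not VL vec4s.
def remap_2d(xdim, ydim):            # simple 2D transpose schedule
    for x in range(xdim):
        for y in range(ydim):
            yield y * xdim + x

VL, SUBVL = 2, 4
schedule = list(remap_2d(SUBVL, VL))   # interleaves the two vec4s
```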
Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
a 2D REMAP allows:
* the column bytes (as a vec4) to be iterated over as an inner loop,
progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
This is done entirely in-place, without special 128-bit opcodes. Below is
the pseudocode for [[!wikipedia Rijndael MixColumns]]:
```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here:
    // byte-level elwidth overrides and vec4.
    // These may then each be 4x 8-bit Swizzled:
    //   r0.vec4 = b.vec4
    //   r0.vec4 ^= a.vec4.WXYZ
    //   r0.vec4 ^= a.vec4.ZWXY
    //   r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```
With the assumption made by the above code that the column bytes have
already been turned around (vertical rather than horizontal), SUBVL.REMAP
may transparently fill that role, in-place, without a complex byte-level
mv operation.
The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations. Given wide enough SIMD backends in hardware these 32-bit
XORs may be done as single-cycle operations across the entire 128-bit
Rijndael Matrix.
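As a cross-check in plain Python (not SV code; a lane rotation `rot`
stands in for each swizzle), the vec4 formulation in the comments of the
code above reproduces the byte-level MixColumns expressions:

```python
# Verifies the vec4-swizzle formulation of MixColumns:
#   r = b ^ a.WXYZ ^ a.ZWXY ^ b.YZWX ^ a.YZWX
# with each swizzle expressed as a rotation by n lanes.
def xtime(v):                    # multiply by 2 in Rijndael's GF(256)
    return ((v << 1) ^ (0x1B if v & 0x80 else 0)) & 0xFF

def mixcolumn(col):
    a = list(col)
    b = [xtime(v) for v in a]
    rot = lambda v, n: v[n:] + v[:n]     # lane rotation (swizzle)
    # four parallel XORs, one per output byte of the column
    return [p ^ q ^ s ^ t ^ u
            for p, q, s, t, u in zip(b, rot(a, 3), rot(a, 2),
                                     rot(b, 1), rot(a, 1))]
```

The result matches the byte-level expressions `r[0]..r[3]` in the C code
and the standard FIPS-197 MixColumns test vector.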
The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.
# TODO
investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
in https://bugs.libre-soc.org/show_bug.cgi?id=653