[[!tag standards]]

# REMAP Matrix pseudocode

The algorithm below shows more clearly how Matrix REMAP works, and may be
executed as a Python program:

```
[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
```

An easier-to-read version (using Python iterators) shows the loop nesting:

```
[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
```

Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index to use in its place.  Given that there are four possible SHAPE
entries, up to four separate registers in any given operation may be
simultaneously remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                                  ireg[rs2+remap3(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
having to move the values in the registers to or from memory.
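
The effect of swapping the permutation order can be sketched in a few
lines of Python. This is an illustration only, not the project's
`remap.py`: the function name `remap_2d` and its `transpose` parameter
are hypothetical, and a row-major layout of `xdim` columns by `ydim`
rows is assumed.

```python
def remap_2d(xdim, ydim, transpose=False):
    """Yield the remapped element offset for each hardware-loop step.

    Hypothetical helper: assumes row-major storage, offset = y*xdim + x.
    """
    if transpose:
        # y is the inner loop: walk down each column in turn
        for x in range(xdim):
            for y in range(ydim):
                yield y * xdim + x
    else:
        # x is the inner loop: plain sequential order
        for y in range(ydim):
            for x in range(xdim):
                yield y * xdim + x


# six "registers" holding a 2-row by 3-column matrix
regs = [10, 11, 12, 13, 14, 15]
normal = [regs[i] for i in remap_2d(3, 2)]
transposed = [regs[i] for i in remap_2d(3, 2, transpose=True)]
print(normal)      # [10, 11, 12, 13, 14, 15]
print(transposed)  # [10, 13, 11, 14, 12, 15]
```

The register contents never move: only the order in which the offsets
are presented to the operation changes.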

Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets *polymorphic* elements *within* registers rather 
  than entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change.  However, modulo MVL still occurs
  which will result in repeated operations (use with caution).
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**.  Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See SVSTATE SPR, which records the current element index.
  Continuing after return from an interrupt may introduce latency
  due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets.  The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped.  *This even includes unit-strided LD/ST*
  and other operations in that category, where it will be the
  **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register.  Without offsets, it is necessary
  to either use a predicated MV, skipping the first elements, or
  performing a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure.  It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.

Considerable care clearly needs to be taken here, as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

# REMAP FFT pseudocode

The algorithm below shows how FFT REMAP works, and may be
executed as a python program:

```
[[!inline pages="openpower/sv/remap_fft_yield.py" quick="yes" raw="yes" ]]
```

The executable code above is designed to show how a hardware
implementation may generate Indices which are completely
independent of the Execution of element-level operations,
even for something as complex as a Triple-loop Cooley-Tukey
Schedule. A comprehensive demo and test suite may be found
[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD).

The Schedules are not restricted to DFT and NTT: if the programmer
can find any other algorithm which has identical triple nesting,
then the FFT Schedule may be used for it.
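
The separation of Index generation from element-level execution can be
sketched as below. This is a simplified illustration, not the
project's `remap_fft_yield.py`: the names `butterfly_schedule` and
`fft` are hypothetical, a power-of-two length and a radix-2
decimation-in-time butterfly are assumed, and the bit-reversal of the
input is done separately in plain Python.

```python
import cmath


def butterfly_schedule(n):
    """Yield (lo, hi, k) index triples for a radix-2 DIT FFT of size n.

    Triple loop: butterfly size, block start, position within block.
    No data is touched here: only indices are produced.
    """
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            for j in range(i, i + halfsize):
                yield j, j + halfsize, (j - i) * tablestep
        size *= 2


def fft(values):
    """Consume the schedule: one butterfly per yielded triple."""
    n = len(values)
    bits = n.bit_length() - 1
    # bit-reversed input ordering (assumes n is a power of two, n >= 2)
    vec = [values[int(format(i, "0%db" % bits)[::-1], 2)] for i in range(n)]
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for lo, hi, k in butterfly_schedule(n):
        t = vec[hi] * twiddle[k]
        vec[hi] = vec[lo] - t
        vec[lo] = vec[lo] + t
    return vec


print(list(butterfly_schedule(4)))
# [(0, 1, 0), (2, 3, 0), (0, 2, 0), (1, 3, 1)]
```

Because `butterfly_schedule` never reads the data, the loop body in
`fft` could be replaced by any other operation with the same triple
nesting.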

# 4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed
as a sequence of 16 numbers first by row then by column, to be multiplied
by a vector of length 4 (starting at f0), using a single FMAC instruction.

* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first
four iterations through the hardware loop, the REMAPed index will not
increment. On the second four, the index will increase by one. Likewise
on each subsequent group of four.

The permutation on SHAPE1 will increment the index applied to f4
continuously, cycling through f4-f7 once every four iterations of the
hardware loop.

At the same time, because there is no SHAPE on f8, its element index
increments sequentially with the hardware loop through the 16 values
f8-f23 of the Matrix. The equivalent sequence issued is thus:

    fmac f4, f0, f8, f4
    fmac f5, f0, f9, f5
    fmac f6, f0, f10, f6
    fmac f7, f0, f11, f7
    fmac f4, f1, f12, f4
    fmac f5, f1, f13, f5
    fmac f6, f1, f14, f6
    fmac f7, f1, f15, f7
    fmac f4, f2, f16, f4
    fmac f5, f2, f17, f5
    fmac f6, f2, f18, f6
    fmac f7, f2, f19, f7
    fmac f4, f3, f20, f4
    fmac f5, f3, f21, f5
    fmac f6, f3, f22, f6
    fmac f7, f3, f23, f7

The only other instruction required is to ensure that f4-f7 are
initialised (usually to zero).
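
The sequence above can be checked with a small Python simulation. It
is a sketch only: the index expressions `i % 4` and `i // 4` are read
directly off the listed instruction sequence rather than derived from
the SHAPE encodings, and the names `fmac_schedule` and `fregs` are
hypothetical.

```python
def fmac_schedule(vl=16):
    """Yield (accumulator, vector, matrix) register numbers per element."""
    for i in range(vl):
        acc = 4 + i % 4    # SHAPE1: cycles through f4-f7
        vec = 0 + i // 4   # SHAPE0: advances once every four steps
        mat = 8 + i        # no SHAPE: straight sequential f8-f23
        yield acc, vec, mat


fregs = [0.0] * 32
fregs[0:4] = [1.0, 2.0, 3.0, 4.0]                    # vec4 in f0-f3
fregs[8:24] = [float(n) for n in range(1, 17)]       # 4x4 matrix in f8-f23
# f4-f7 are already zero: the accumulator initialisation

for acc, vec, mat in fmac_schedule():
    # fmac acc, vec, mat, acc
    fregs[acc] = fregs[vec] * fregs[mat] + fregs[acc]

print(fregs[4:8])  # [90.0, 100.0, 110.0, 120.0]
```

Each result element is the sum over `k` of `vec[k] * matrix[4*k + j]`,
which is the vector multiplied by the matrix held in row order.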

It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
the same technique applied to four independent vectors, can be done by
setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
it to apply four times to compute the four columns worth of vectors.

# Warshall transitive closure algorithm

TODO move to [[sv/remap/discussion]] page, copied from here
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>

> Just a note:  interpreting + as 'or', and * as 'and',
> operating on Boolean matrices, 
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.

This can be done with the ternary instruction, which has
an in-place triple boolean input:

    RT = RT | (RA & RB)

There is also a CR Field variant of the same instruction.

Notes from conversations:

> > for y in y_r:
> >  for x in x_r:
> >    for z in z_r:
> >      result[y][x] +=
> >         a[y][z] *
> >         b[z][x]

> This nesting of loops works for matrix multiply, but not for transitive
> closure. 

> > it can be done:
> >
> >   for z in z_r:
> >    for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum.  I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).
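
The loop ordering described in the discussion above can be written
out as a short Python sketch. The function name `transitive_closure`
is hypothetical; the inner statement is the `RT = RT | (RA & RB)`
operation, applied in-place to a single boolean matrix with the
bridging index `z` as the outermost loop.

```python
def transitive_closure(adjacency):
    """Warshall's algorithm on a square 0/1 matrix (list of lists)."""
    n = len(adjacency)
    m = [row[:] for row in adjacency]   # work on a copy, updated in-place
    for z in range(n):                  # bridging vertex: MUST be outermost
        for y in range(n):
            for x in range(n):
                # RT = RT | (RA & RB)
                m[y][x] = m[y][x] | (m[y][z] & m[z][x])
    return m


# a simple chain of edges: 0 -> 1 -> 2 -> 3
chain = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]
for row in transitive_closure(chain):
    print(row)
# [0, 1, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 0]
```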

# SUBVL Remap

Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
itself must be considered to be the "element".  To perform REMAP
on the elements of a vec2/3/4, there are three options: use
[[sv/mv.swizzle]]; treat the sub-elements as ordinary elements (they
are contiguous) and use Indexing; or add one
extra dimension to Matrix REMAP, the inner dimension being the size
of the Subvector (2, 3, or 4).

Note that Swizzle on Sub-vectors may be applied on top of REMAP.
One case where this is appropriate is the Rijndael MixColumns
stage:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />

Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
a 2D REMAP allows:

* the column bytes (as a vec4) to be iterated over as an inner loop,
  progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.

This is all done entirely in-place, without special 128-bit opcodes.
Below is the code for [[!wikipedia Rijndael MixColumns]]:

```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but vec4 byte-level
    // elwidth overrides can be done though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // These may then each be 4x 8bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; 
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations.  Given wide enough SIMD backends in hardware these 32 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.

The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.
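
The swizzled form given in the comments of `gmix_column` can be
checked against a well-known MixColumns test vector with the Python
sketch below. The helper names (`xtime`, `swizzle`, `vxor`,
`gmix_column_swizzled`, `mix_columns`) are hypothetical, the swizzle
letters are assumed to map X, Y, Z, W to lanes 0 to 3, and the 2D
REMAP is modelled simply as a strided index list over the row-major
state.

```python
LANES = {"X": 0, "Y": 1, "Z": 2, "W": 3}  # assumed lane naming


def xtime(byte):
    """Multiply by 2 in GF(2^8), Rijndael polynomial 0x11B."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B
    return byte & 0xFF


def swizzle(vec, pattern):
    """Reorder a vec4 according to a four-letter swizzle pattern."""
    return [vec[LANES[ch]] for ch in pattern]


def vxor(*vecs):
    """Lane-wise XOR of any number of vec4s."""
    out = [0, 0, 0, 0]
    for v in vecs:
        out = [o ^ e for o, e in zip(out, v)]
    return out


def gmix_column_swizzled(col):
    a = list(col)
    b = [xtime(x) for x in a]
    # r = b ^ a.WXYZ ^ a.ZWXY ^ b.YZWX ^ a.YZWX
    return vxor(b, swizzle(a, "WXYZ"), swizzle(a, "ZWXY"),
                swizzle(b, "YZWX"), swizzle(a, "YZWX"))


def mix_columns(state):
    """state: 16 bytes stored a00 a01 a02 a03 a10 .. a33."""
    out = list(state)
    for col in range(4):                            # outer loop: columns
        idx = [row * 4 + col for row in range(4)]   # inner loop: down a column
        for i, value in zip(idx, gmix_column_swizzled([state[i] for i in idx])):
            out[i] = value
    return out


print([hex(v) for v in gmix_column_swizzled([0xDB, 0x13, 0x53, 0x45])])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```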