and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
it to apply four times to compute the four columns worth of vectors.
-# Warshall transitive closure algorithm
-
-TODO move to [[sv/remap/discussion]] page, copied from here
-http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
-
-with thanks to Hendrik.
-
-<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
-
-> Just a note: interpreting + as 'or', and * as 'and',
-> operating on Boolean matrices,
-> and having result, X, and Y be the exact same matrix,
-> updated while being used,
-> gives the traditional Warshall transitive-closure
-> algorithm, if the loops are nested exactly in thie order.
-
-this can be done with the ternary instruction which has
-an in-place triple boolean input:
-
-```
- RT = RT | (RA & RB)
-```
-
-and also has a CR Field variant of the same
-
-notes from conversations:
-
-> > for y in y_r:
-> > for x in x_r:
-> > for z in z_r:
-> > result[y][x] +=
-> > a[y][z] *
-> > b[z][x]
-
-> This nesting of loops works for matrix multiply, but not for transitive
-> closure.
-
-> > it can be done:
-> >
-> > for z in z_r:
-> > for y in y_r:
-> > for x in x_r:
-> > result[y][x] +=
-> > a[y][z] *
-> > b[z][x]
->
-> And this ordering of loops *does* work for transitive closure, when a,
-> b, and result are the very same matrix, updated while being used.
->
-> By the way, I believe there is a graph algorithm that does the
-> transitive closure thing, but instead of using boolean, "and", and "or",
-> they use real numbers, addition, and minimum. I think that one computes
-> shortest paths between vertices.
->
-> By the time the z'th iteration of the z loop begins, the algorithm has
-> already peocessed paths that go through vertices numbered < z, and it
-> adds paths that go through vertices numbered z.
->
-> For this to work, the outer loop has to be the one on the subscript that
-> bridges a and b (which in this case are teh same matrix, of course).
-
-# SUBVL Remap
-
-Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
-itself must be considered to be the "element". To perform REMAP
-on the elements of a vec2/3/4, either use [[sv/mv.swizzle]], or,
-due to the sub-elements themselves being contiguous, treat them as
-such and use Indexing, or add one
-extra dimension to Matrix REMAP, the inner dimension being the size
-of the Subvector (2, 3, or 4).
-
-Note that Swizzle on Sub-vectors may be applied on top of REMAP.
-Where this is appropriate is the Rijndael MixColumns
-stage:
-
-<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
-
-Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
-a 2D REMAP allows:
-
-* the column bytes (as a vec4) to be iterated over as an inner loop,
- progressing vertically (`a00 a10 a20 a30`)
-* the columns themselves to be iterated as an outer loop
-* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
-
-This entirely in-place without special 128-bit opcodes. Below is
-the pseudocode for [[!wikipedia Rijndael MixColumns]]
-
-```
-void gmix_column(unsigned char *r) {
- unsigned char a[4];
- unsigned char b[4];
- unsigned char c;
- unsigned char h;
- // no swizzle here but vec4 byte-level
- // elwidth overrides can be done though.
- for (c = 0; c < 4; c++) {
- a[c] = r[c];
- h = (unsigned char)((signed char)r[c] >> 7);
- b[c] = r[c] << 1;
- b[c] ^= 0x1B & h; /* Rijndael's Galois field */
- }
- // These may then each be 4x 8bit Swizzled
- // r0.vec4 = b.vec4
- // r0.vec4 ^= a.vec4.WXYZ
- // r0.vec4 ^= a.vec4.ZWXY
- // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
- r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
- r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
- r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
- r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
-}
-```
-
-The application of the swizzles allows the remapped vec4 a, b and r
-variables to perform four straight linear 32 bit XOR operations where a
-scalar processor would be required to perform 16 byte-level individual
-operations. Given wide enough SIMD backends in hardware these 3 bit
-XORs may be done as single-cycle operations across the entire 128 bit
-Rijndael Matrix.
-
-The other alternative is to simply perform the actual 4x4 GF(256) Matrix
-Multiply using the MDS Matrix.
+---------
+\newpage{}