* Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
* Convolution REMAP
+# Warshall transitive closure algorithm
+
+TODO move to [[sv/remap/discussion]] page, copied from here
+http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
+
+with thanks to Hendrik.
+
+<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
+
+> Just a note: interpreting + as 'or', and * as 'and',
+> operating on Boolean matrices,
+> and having result, X, and Y be the exact same matrix,
+> updated while being used,
+> gives the traditional Warshall transitive-closure
+> algorithm, if the loops are nested exactly in thie order.
+
+this can be done with the ternary instruction which has
+an in-place triple boolean input:
+
+```
+ RT = RT | (RA & RB)
+```
+
+and also has a CR Field variant of the same
+
+notes from conversations:
+
+> > for y in y_r:
+> > for x in x_r:
+> > for z in z_r:
+> > result[y][x] +=
+> > a[y][z] *
+> > b[z][x]
+
+> This nesting of loops works for matrix multiply, but not for transitive
+> closure.
+
+> > it can be done:
+> >
+> > for z in z_r:
+> > for y in y_r:
+> > for x in x_r:
+> > result[y][x] +=
+> > a[y][z] *
+> > b[z][x]
+>
+> And this ordering of loops *does* work for transitive closure, when a,
+> b, and result are the very same matrix, updated while being used.
+>
+> By the way, I believe there is a graph algorithm that does the
+> transitive closure thing, but instead of using boolean, "and", and "or",
+> they use real numbers, addition, and minimum. I think that one computes
+> shortest paths between vertices.
+>
+> By the time the z'th iteration of the z loop begins, the algorithm has
+> already peocessed paths that go through vertices numbered < z, and it
+> adds paths that go through vertices numbered z.
+>
+> For this to work, the outer loop has to be the one on the subscript that
+> bridges a and b (which in this case are teh same matrix, of course).
+
+# SUBVL Remap
+
+Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
+itself must be considered to be the "element". To perform REMAP
+on the elements of a vec2/3/4, either use [[sv/mv.swizzle]], or,
+due to the sub-elements themselves being contiguous, treat them as
+such and use Indexing, or add one
+extra dimension to Matrix REMAP, the inner dimension being the size
+of the Subvector (2, 3, or 4).
+
+Note that Swizzle on Sub-vectors may be applied on top of REMAP.
+Where this is appropriate is the Rijndael MixColumns
+stage:
+
+<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
+
+Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
+a 2D REMAP allows:
+
+* the column bytes (as a vec4) to be iterated over as an inner loop,
+ progressing vertically (`a00 a10 a20 a30`)
+* the columns themselves to be iterated as an outer loop
+* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
+
+This entirely in-place without special 128-bit opcodes. Below is
+the pseudocode for [[!wikipedia Rijndael MixColumns]]
+
+```
+void gmix_column(unsigned char *r) {
+ unsigned char a[4];
+ unsigned char b[4];
+ unsigned char c;
+ unsigned char h;
+ // no swizzle here but vec4 byte-level
+ // elwidth overrides can be done though.
+ for (c = 0; c < 4; c++) {
+ a[c] = r[c];
+ h = (unsigned char)((signed char)r[c] >> 7);
+ b[c] = r[c] << 1;
+ b[c] ^= 0x1B & h; /* Rijndael's Galois field */
+ }
+ // These may then each be 4x 8bit Swizzled
+ // r0.vec4 = b.vec4
+ // r0.vec4 ^= a.vec4.WXYZ
+ // r0.vec4 ^= a.vec4.ZWXY
+ // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
+ r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
+ r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
+ r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
+ r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
+}
+```
+
+The application of the swizzles allows the remapped vec4 a, b and r
+variables to perform four straight linear 32 bit XOR operations where a
+scalar processor would be required to perform 16 byte-level individual
+operations. Given wide enough SIMD backends in hardware these 3 bit
+XORs may be done as single-cycle operations across the entire 128 bit
+Rijndael Matrix.
+
+The other alternative is to simply perform the actual 4x4 GF(256) Matrix
+Multiply using the MDS Matrix.
+