From 19daf3a96c25ca4bb329555893e877eb8dc09b62 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 16 Apr 2023 11:03:22 +0100
Subject: [PATCH]

---
 openpower/sv/remap/discussion.mdwn | 124 +++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
diff --git a/openpower/sv/remap/discussion.mdwn b/openpower/sv/remap/discussion.mdwn
index 818228ba6..5bf74a751 100644
--- a/openpower/sv/remap/discussion.mdwn
+++ b/openpower/sv/remap/discussion.mdwn
@@ -26,3 +26,127 @@ in https://bugs.libre-soc.org/show_bug.cgi?id=653
 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
 * Convolution REMAP
 
+# Warshall transitive closure algorithm
+
+TODO move to [[sv/remap/discussion]] page, copied from here
+http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
+
+with thanks to Hendrik.
+
+<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
+
+> Just a note:  interpreting + as 'or', and * as 'and',
+> operating on Boolean matrices, 
+> and having result, X, and Y be the exact same matrix,
+> updated while being used,
+> gives the traditional Warshall transitive-closure
+> algorithm, if the loops are nested exactly in thie order.
+
+this can be done with the ternary instruction which has
+an in-place triple boolean input:
+
+```
+    RT = RT | (RA & RB)
+```
+
+and also has a CR Field variant of the same
+
+notes from conversations:
+
+> > for y in y_r:
+> >  for x in x_r:
+> >    for z in z_r:
+> >      result[y][x] +=
+> >         a[y][z] *
+> >         b[z][x]
+
+> This nesting of loops works for matrix multiply, but not for transitive
+> closure. 
+
+> > it can be done:
+> >
+> > Â  for z in z_r:
+> > Â  Â for y in y_r:
+> > Â  Â  for x in x_r:
+> > Â  Â  Â  result[y][x] +=
+> > Â  Â  Â  Â  Â a[y][z] *
+> > Â  Â  Â  Â  Â b[z][x]
+>
+> And this ordering of loops *does* work for transitive closure, when a,
+> b, and result are the very same matrix, updated while being used.
+>
+> By the way, I believe there is a graph algorithm that does the
+> transitive closure thing, but instead of using boolean, "and", and "or",
+> they use real numbers, addition, and minimum. Â I think that one computes
+> shortest paths between vertices.
+>
+> By the time the z'th iteration of the z loop begins, the algorithm has
+> already peocessed paths that go through vertices numbered < z, and it
+> adds paths that go through vertices numbered z.
+>
+> For this to work, the outer loop has to be the one on the subscript that
+> bridges a and b (which in this case are teh same matrix, of course).
+
+# SUBVL Remap
+
+Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4
+itself must be considered to be the "element".  To perform REMAP
+on the elements of a vec2/3/4, either use [[sv/mv.swizzle]], or,
+due to the sub-elements themselves being contiguous, treat them as
+such and use Indexing, or add one
+extra dimension to Matrix REMAP, the inner dimension being the size
+of the Subvector (2, 3, or 4).
+
+Note that Swizzle on Sub-vectors may be applied on top of REMAP.
+Where this is appropriate is the Rijndael MixColumns
+stage:
+
+<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
+
+Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
+a 2D REMAP allows:
+
+* the column bytes (as a vec4) to be iterated over as an inner loop,
+  progressing vertically (`a00 a10 a20 a30`)
+* the columns themselves to be iterated as an outer loop
+* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
+
+This entirely in-place without special 128-bit opcodes.  Below is
+the pseudocode for [[!wikipedia Rijndael MixColumns]]
+
+```
+void gmix_column(unsigned char *r) {
+    unsigned char a[4];
+    unsigned char b[4];
+    unsigned char c;
+    unsigned char h;
+    // no swizzle here but vec4 byte-level
+    // elwidth overrides can be done though.
+    for (c = 0; c < 4; c++) {
+        a[c] = r[c];
+        h = (unsigned char)((signed char)r[c] >> 7);
+        b[c] = r[c] << 1;
+        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
+    }
+    // These may then each be 4x 8bit Swizzled
+    // r0.vec4 = b.vec4
+    // r0.vec4 ^= a.vec4.WXYZ
+    // r0.vec4 ^= a.vec4.ZWXY
+    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
+    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
+    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
+    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; 
+    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
+}
+```
+
+The application of the swizzles allows the remapped vec4 a, b and r
+variables to perform four straight linear 32 bit XOR operations where a
+scalar processor would be required to perform 16 byte-level individual
+operations.  Given wide enough SIMD backends in hardware these 3 bit
+XORs may be done as single-cycle operations across the entire 128 bit
+Rijndael Matrix.
+
+The other alternative is to simply perform the actual 4x4 GF(256) Matrix
+Multiply using the MDS Matrix.
+
-- 
2.30.2