From 19daf3a96c25ca4bb329555893e877eb8dc09b62 Mon Sep 17 00:00:00 2001 From: lkcl Date: Sun, 16 Apr 2023 11:03:22 +0100 Subject: [PATCH] --- openpower/sv/remap/discussion.mdwn | 124 +++++++++++++++++++++++++++++ 1 file changed, 124 insertions(+) diff --git a/openpower/sv/remap/discussion.mdwn b/openpower/sv/remap/discussion.mdwn index 818228ba6..5bf74a751 100644 --- a/openpower/sv/remap/discussion.mdwn +++ b/openpower/sv/remap/discussion.mdwn @@ -26,3 +26,127 @@ in https://bugs.libre-soc.org/show_bug.cgi?id=653 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix) * Convolution REMAP +# Warshall transitive closure algorithm + +TODO move to [[sv/remap/discussion]] page, copied from here +http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html + +with thanks to Hendrik. + + + +> Just a note: interpreting + as 'or', and * as 'and', +> operating on Boolean matrices, +> and having result, X, and Y be the exact same matrix, +> updated while being used, +> gives the traditional Warshall transitive-closure +> algorithm, if the loops are nested exactly in thie order. + +this can be done with the ternary instruction which has +an in-place triple boolean input: + +``` + RT = RT | (RA & RB) +``` + +and also has a CR Field variant of the same + +notes from conversations: + +> > for y in y_r: +> > for x in x_r: +> > for z in z_r: +> > result[y][x] += +> > a[y][z] * +> > b[z][x] + +> This nesting of loops works for matrix multiply, but not for transitive +> closure. + +> > it can be done: +> > +> >   for z in z_r: +> >    for y in y_r: +> >     for x in x_r: +> >       result[y][x] += +> >          a[y][z] * +> >          b[z][x] +> +> And this ordering of loops *does* work for transitive closure, when a, +> b, and result are the very same matrix, updated while being used. +> +> By the way, I believe there is a graph algorithm that does the +> transitive closure thing, but instead of using boolean, "and", and "or", +> they use real numbers, addition, and minimum.  I think that one computes +> shortest paths between vertices. +> +> By the time the z'th iteration of the z loop begins, the algorithm has +> already peocessed paths that go through vertices numbered < z, and it +> adds paths that go through vertices numbered z. +> +> For this to work, the outer loop has to be the one on the subscript that +> bridges a and b (which in this case are teh same matrix, of course). + +# SUBVL Remap + +Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4 +itself must be considered to be the "element". To perform REMAP +on the elements of a vec2/3/4, either use [[sv/mv.swizzle]], or, +due to the sub-elements themselves being contiguous, treat them as +such and use Indexing, or add one +extra dimension to Matrix REMAP, the inner dimension being the size +of the Subvector (2, 3, or 4). + +Note that Swizzle on Sub-vectors may be applied on top of REMAP. +Where this is appropriate is the Rijndael MixColumns +stage: + + + +Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33` +a 2D REMAP allows: + +* the column bytes (as a vec4) to be iterated over as an inner loop, + progressing vertically (`a00 a10 a20 a30`) +* the columns themselves to be iterated as an outer loop +* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed. + +This entirely in-place without special 128-bit opcodes. Below is +the pseudocode for [[!wikipedia Rijndael MixColumns]] + +``` +void gmix_column(unsigned char *r) { + unsigned char a[4]; + unsigned char b[4]; + unsigned char c; + unsigned char h; + // no swizzle here but vec4 byte-level + // elwidth overrides can be done though. + for (c = 0; c < 4; c++) { + a[c] = r[c]; + h = (unsigned char)((signed char)r[c] >> 7); + b[c] = r[c] << 1; + b[c] ^= 0x1B & h; /* Rijndael's Galois field */ + } + // These may then each be 4x 8bit Swizzled + // r0.vec4 = b.vec4 + // r0.vec4 ^= a.vec4.WXYZ + // r0.vec4 ^= a.vec4.ZWXY + // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX + r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1]; + r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2]; + r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; + r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]; +} +``` + +The application of the swizzles allows the remapped vec4 a, b and r +variables to perform four straight linear 32 bit XOR operations where a +scalar processor would be required to perform 16 byte-level individual +operations. Given wide enough SIMD backends in hardware these 3 bit +XORs may be done as single-cycle operations across the entire 128 bit +Rijndael Matrix. + +The other alternative is to simply perform the actual 4x4 GF(256) Matrix +Multiply using the MDS Matrix. + -- 2.30.2