From: Luke Kenneth Casson Leighton Date: Fri, 15 Jul 2022 18:27:07 +0000 (+0100) Subject: split out examples on remap appendix X-Git-Tag: opf_rfc_ls005_v1~1185 X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=62a7d0b0387764a88bd9fdce744416b5f770cce2;p=libreriscv.git split out examples on remap appendix --- diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn index 62f109b09..0e916c8e3 100644 --- a/openpower/sv/remap.mdwn +++ b/openpower/sv/remap.mdwn @@ -5,8 +5,8 @@ * * add svindex * svindex in simulator -* see [[sv/propagation]] for a future way to apply -REMAP. +* see [[sv/propagation]] for a future way to apply REMAP. +* see [[sv/remap/appendix]] for examples and usage REMAP is an advanced form of Vector "Structure Packing" that provides hardware-level support for commonly-used *nested* loop patterns. @@ -420,276 +420,6 @@ whilst `mm=1` is intended to be a little more refined. Beyond these mappings it becomes necessary to write directly to the SVSTATE SPRs manually. -# REMAP Matrix pseudocode - -The algorithm below shows how REMAP works more clearly, and may be -executed as a python program: - -``` -[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]] -``` - -An easier-to-read version (using python iterators) shows the loop nesting: - -``` -[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]] -``` - -Each element index from the for-loop `0..VL-1` -is run through the above algorithm to work out the **actual** element -index, instead. Given that there are four possible SHAPE entries, up to -four separate registers in any given operation may be simultaneously -remapped: - - function op_add(rd, rs1, rs2) # add not VADD! - ... - ... 
-  for (i = 0; i < VL; i++) - xSTATE.srcoffs = i # save context - if (predval & 1< - -> Just a note: interpreting + as 'or', and * as 'and', -> operating on Boolean matrices, -> and having result, X, and Y be the exact same matrix, -> updated while being used, -> gives the traditional Warshall transitive-closure -> algorithm, if the loops are nested exactly in thie order. - -this can be done with the ternary instruction which has -an in-place triple boolean input: - - RT = RT | (RA & RB) - -and also has a CR Field variant of the same - -notes from conversations: - -> > for y in y_r: -> > for x in x_r: -> > for z in z_r: -> > result[y][x] += -> > a[y][z] * -> > b[z][x] - -> This nesting of loops works for matrix multiply, but not for transitive -> closure. - -> > it can be done: -> > -> >   for z in z_r: -> >    for y in y_r: -> >     for x in x_r: -> >       result[y][x] += -> >          a[y][z] * -> >          b[z][x] -> -> And this ordering of loops *does* work for transitive closure, when a, -> b, and result are the very same matrix, updated while being used. -> -> By the way, I believe there is a graph algorithm that does the -> transitive closure thing, but instead of using boolean, "and", and "or", -> they use real numbers, addition, and minimum.  I think that one computes -> shortest paths between vertices. -> -> By the time the z'th iteration of the z loop begins, the algorithm has -> already peocessed paths that go through vertices numbered < z, and it -> adds paths that go through vertices numbered z. -> -> For this to work, the outer loop has to be the one on the subscript that -> bridges a and b (which in this case are teh same matrix, of course). - -# SUBVL Remap - -Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4 -itself must be considered to be the "element". 
To perform REMAP -on the elements of a vec2/3/4, either use Swizzle, or, -due to the sub-elements themselves being contiguous, treat them as -such and use Indexing, or add one -extra dimension to Matrix REMAP, the inner dimension being the size -of the Subvector (2, 3, or 4). - -Note that Swizzle on Sub-vectors may be applied on top of REMAP. -Where this is appropriate is the Rijndael MixColumns -stage: - - - -Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33` -a 2D REMAP allows: - -* the column bytes (as a vec4) to be iterated over as an inner loop, - progressing vertically (`a00 a10 a20 a30`) -* the columns themselves to be iterated as an outer loop -* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed. - -This entirely in-place without special 128-bit opcodes. Below is -the pseudocode for [[!wikipedia Rijndael MixColumns]] - -``` -void gmix_column(unsigned char *r) { - unsigned char a[4]; - unsigned char b[4]; - unsigned char c; - unsigned char h; - // no swizzle here but vec4 byte-level - // elwidth overrides can be done though. - for (c = 0; c < 4; c++) { - a[c] = r[c]; - h = (unsigned char)((signed char)r[c] >> 7); - b[c] = r[c] << 1; - b[c] ^= 0x1B & h; /* Rijndael's Galois field */ - } - // These may then each be 4x 8bit Swizzled - // r0.vec4 = b.vec4 - // r0.vec4 ^= a.vec4.WXYZ - // r0.vec4 ^= a.vec4.ZWXY - // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX - r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1]; - r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2]; - r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; - r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]; -} -``` - -The application of the swizzles allows the remapped vec4 a, b and r -variables to perform four straight linear 32 bit XOR operations where a -scalar processor would be required to perform 16 byte-level individual -operations. Given wide enough SIMD backends in hardware these 3 bit -XORs may be done as single-cycle operations across the entire 128 bit -Rijndael Matrix. 
- -The other alternative is to simply perform the actual 4x4 GF(256) Matrix -Multiply using the MDS Matrix. - # TODO * investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429 diff --git a/openpower/sv/remap/appendix.mdwn b/openpower/sv/remap/appendix.mdwn new file mode 100644 index 000000000..655a8a59a --- /dev/null +++ b/openpower/sv/remap/appendix.mdwn @@ -0,0 +1,272 @@ +[[!tag standards]] + +# REMAP Matrix pseudocode + +The algorithm below shows how REMAP works more clearly, and may be +executed as a python program: + +``` +[[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]] +``` + +An easier-to-read version (using python iterators) shows the loop nesting: + +``` +[[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]] +``` + +Each element index from the for-loop `0..VL-1` +is run through the above algorithm to work out the **actual** element +index, instead. Given that there are four possible SHAPE entries, up to +four separate registers in any given operation may be simultaneously +remapped: + + function op_add(rd, rs1, rs2) # add not VADD! + ... + ... +  for (i = 0; i < VL; i++) + xSTATE.srcoffs = i # save context + if (predval & 1< + +> Just a note: interpreting + as 'or', and * as 'and', +> operating on Boolean matrices, +> and having result, X, and Y be the exact same matrix, +> updated while being used, +> gives the traditional Warshall transitive-closure +> algorithm, if the loops are nested exactly in this order. + +this can be done with the ternary instruction which has +an in-place triple boolean input: + + RT = RT | (RA & RB) + +and also has a CR Field variant of the same + +notes from conversations: + +> > for y in y_r: +> > for x in x_r: +> > for z in z_r: +> > result[y][x] += +> > a[y][z] * +> > b[z][x] + +> This nesting of loops works for matrix multiply, but not for transitive +> closure. 
+ +> > it can be done: +> > +> >   for z in z_r: +> >    for y in y_r: +> >     for x in x_r: +> >       result[y][x] += +> >          a[y][z] * +> >          b[z][x] +> +> And this ordering of loops *does* work for transitive closure, when a, +> b, and result are the very same matrix, updated while being used. +> +> By the way, I believe there is a graph algorithm that does the +> transitive closure thing, but instead of using boolean, "and", and "or", +> they use real numbers, addition, and minimum.  I think that one computes +> shortest paths between vertices. +> +> By the time the z'th iteration of the z loop begins, the algorithm has +> already processed paths that go through vertices numbered < z, and it +> adds paths that go through vertices numbered z. +> +> For this to work, the outer loop has to be the one on the subscript that +> bridges a and b (which in this case are the same matrix, of course). + +# SUBVL Remap + +Remapping of SUBVL (vec2/3/4) elements is not permitted: the vec2/3/4 +itself must be considered to be the "element". To perform REMAP +on the elements of a vec2/3/4, either use Swizzle, or, +due to the sub-elements themselves being contiguous, treat them as +such and use Indexing, or add one +extra dimension to Matrix REMAP, the inner dimension being the size +of the Subvector (2, 3, or 4). + +Note that Swizzle on Sub-vectors may be applied on top of REMAP. +Where this is appropriate is the Rijndael MixColumns +stage: + + + +Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33` +a 2D REMAP allows: + +* the column bytes (as a vec4) to be iterated over as an inner loop, + progressing vertically (`a00 a10 a20 a30`) +* the columns themselves to be iterated as an outer loop +* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed. + +This is done entirely in-place without special 128-bit opcodes. 
Below is +the pseudocode for [[!wikipedia Rijndael MixColumns]] + +``` +void gmix_column(unsigned char *r) { + unsigned char a[4]; + unsigned char b[4]; + unsigned char c; + unsigned char h; + // no swizzle here but vec4 byte-level + // elwidth overrides can be done though. + for (c = 0; c < 4; c++) { + a[c] = r[c]; + h = (unsigned char)((signed char)r[c] >> 7); + b[c] = r[c] << 1; + b[c] ^= 0x1B & h; /* Rijndael's Galois field */ + } + // These may then each be 4x 8bit Swizzled + // r0.vec4 = b.vec4 + // r0.vec4 ^= a.vec4.WXYZ + // r0.vec4 ^= a.vec4.ZWXY + // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX + r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1]; + r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2]; + r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; + r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]; +} +``` + +The application of the swizzles allows the remapped vec4 a, b and r +variables to perform four straight linear 32 bit XOR operations where a +scalar processor would be required to perform 16 byte-level individual +operations. Given wide enough SIMD backends in hardware these 32 bit +XORs may be done as single-cycle operations across the entire 128 bit +Rijndael Matrix. + +The other alternative is to simply perform the actual 4x4 GF(256) Matrix +Multiply using the MDS Matrix. +
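
As a hedged illustration of the last alternative mentioned in the patch (a plain 4x4 GF(256) Matrix Multiply using the MDS Matrix), here is a minimal Python sketch. The names `MDS`, `gf_mul` and `mix_column` are hypothetical and are not part of `remap.py`, `remapyield.py` or any Libre-SOC API. The sketch assumes the standard Rijndael reduction polynomial (0x11B) and the standard MixColumns matrix, and says nothing about how REMAP or Swizzle would schedule the operations in hardware.

```python
# Illustrative sketch only: MDS, gf_mul and mix_column are hypothetical
# names, not part of openpower/sv/remap.py or any Libre-SOC API.

# Standard Rijndael MixColumns MDS matrix (circulant of 2 3 1 1).
MDS = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        carry = a & 0x80
        a = (a << 1) & 0xFF      # multiply a by x
        if carry:
            a ^= 0x1B            # reduce, as in gmix_column's "0x1B & h"
        b >>= 1
    return result


def mix_column(col):
    """Return MDS * col over GF(2^8); col is a list of four bytes."""
    out = []
    for row in MDS:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)   # GF(2^8) addition is XOR
        out.append(acc)
    return out


print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

For the widely published MixColumns test column `db 13 53 45` this prints `['0x8e', '0x4d', '0xa1', '0xbc']`. Each row of `MDS` corresponds to one of the `r[0]`..`r[3]` expressions in `gmix_column`, where `b[c]` is the multiply-by-2 term and `b[c] ^ a[c]` the multiply-by-3 term.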