[[!tag standards]]

# REMAP <a name="remap" />

* <https://bugs.libre-soc.org/show_bug.cgi?id=143>
* see [[sv/propagation]] for a future way to apply
  REMAP.

REMAP is an advanced form of Vector "Structure Packing" that
provides hardware-level support for commonly-used *nested* loop patterns.
For more general reordering an Indexed REMAP mode is available.

REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
from a linear form to a 2D or 3D transposed form, or "offset" to permit
arbitrary access to elements, independently on each Vector src or dest
register.

The initial primary motivation for REMAP was Matrix Multiplication and
the in-place reordering of sequential data.  Four SPRs are provided so
that a single FMAC may be used in a single loop to perform a 4x4 times
4x4 Matrix multiplication, generating 64 FMACs.  Additional uses include
regular "Structure Packing" such as RGB pixel data extraction and reforming.

REMAP, like all of SV, is abstracted out, meaning that unlike traditional
Vector ISAs which would typically only have a limited set of instructions
that can be structure-packed (LD/ST typically), REMAP may be applied to
literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.

Note that REMAP does not apply to sub-vector elements: that is what
swizzle is for.  Swizzle *can* however be applied to the same instruction
as REMAP.

In its general form, REMAP is quite expensive to set up, and on some
implementations may introduce latency, so it should realistically be
used only where it is worthwhile.  Commonly-used patterns such as Matrix
Multiply, DCT and FFT have helper instruction options which make REMAP
easier to use.

There are three types of REMAP:

* **Matrix**, also known as 2D and 3D reshaping
* **FFT/DCT**, with full triple-loop in-place support: limited to
  Power-2 RADIX
* **Indexing**, for any general-purpose reordering. Currently
  under development.

# Principle

* normal vector element read/write of operands would be sequential
  (0 1 2 3 ....)
* this is not appropriate for (e.g.) Matrix multiply which requires
  accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this.  both are expensive (copy large vectors, spill through memory)
  and very few Packed SIMD ISAs cope with non-Power-2.
* REMAP **redefines** the order of access according to set "Schedules".
* The Schedules are not necessarily restricted to power-of-two boundaries
  making it unnecessary to have for example specialised 3x4 transpose
  instructions.
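As a minimal illustration of the difference between the two access
orders, the Python sketch below walks a row-major matrix column by
column, which reproduces the `0 3 6 1 4 7 ...` sequence for a matrix
three elements wide.  The function name `column_order` is hypothetical
and this is not the hardware algorithm, only a picture of what a
Schedule changes.

```python
def column_order(rows, cols):
    """Element offsets of a row-major rows x cols matrix, visited column by column."""
    return [r * cols + c for c in range(cols) for r in range(rows)]

# the usual hardware loop simply counts 0..VL-1
sequential = list(range(9))
# a remapped Schedule for a 3x3 matrix visits the same elements in another order
remapped = column_order(3, 3)
print(sequential)
print(remapped)
```

Every element is still visited exactly once: only the order differs,
which is why no data needs to be copied or spilled.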

Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware.  For
arbitrary remapping the `Indexed` REMAP may be used.

# Executive Summary Usage

* `svshape` to set the type of reordering to be applied to an
  otherwise usual `0..VL-1` hardware for-loop
* `svremap` to set which registers a given reordering is to apply to
  (RA, RT etc)
* `sv.instruction` where any Vectorised register marked by `svremap`
  will have its ordering REMAPPED according to the schedule set
  by `svshape`.

The following illustrative example multiplies a 3x4 and a 5x3
matrix to create a 5x4 result:

    svshape 5, 4, 3, 0, 0
    svremap 31, 1, 2, 3, 0, 0, 0, 0
    sv.fmadds 0.v, 8.v, 16.v, 0.v

The example may be executed as a unit test and demo,
[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
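To make the idea of "one multiply-accumulate in one linear loop"
concrete, here is a rough Python model.  It assumes that "3x4" is read
as width by height (so 4 rows of 3), that matrices are stored row-major,
and it picks one possible loop nesting: the order actually used by the
hardware Schedule is not claimed here.  The names `matmul_schedule` and
`flat_matmul` are hypothetical.

```python
def matmul_schedule(rows, cols, inner):
    """Yield (result, a, b) flat offsets for each step of one linear loop."""
    for i in range(rows * cols * inner):
        k = i % inner              # innermost: position along the shared dimension
        x = (i // inner) % cols    # middle: result column
        y = i // (inner * cols)    # outermost: result row
        yield y * cols + x, y * inner + k, k * cols + x

def flat_matmul(a, b, rows, cols, inner):
    """Multiply row-major a (rows x inner) by b (inner x cols) with one loop."""
    result = [0] * (rows * cols)
    for r_idx, a_idx, b_idx in matmul_schedule(rows, cols, inner):
        result[r_idx] += a[a_idx] * b[b_idx]   # one multiply-accumulate per step
    return result

a = list(range(12))        # 4 rows of 3
b = list(range(15))        # 3 rows of 5
print(flat_matmul(a, b, rows=4, cols=5, inner=3))
```

The loop body never changes: all of the matrix structure lives in the
three index sequences, which is the role the SVSHAPEs play.  With these
dimensions the loop runs 60 times, and a 4x4 by 4x4 multiply runs 64
times.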

# REMAP types

This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.

## Matrix (1D/2D/3D shaping)

TODO

## FFT/DCT Triple Loop

TODO

## Indexed

The purpose of Indexing is to provide a generalised version of
Vector ISA "Permute" instructions, such as VSX `vperm`.  The
Indexing is abstracted out and may be applied to much more
than an element move/copy, and is not limited for example
to the number of bytes that can fit into a VSX register.
Indexing may be applied to LD/ST (even on Indexed LD/ST
instructions such as `sv.lbzx`), arithmetic operations,
extsw: there is no artificial limit.

The only major caveat is that the registers to be used as
Indices must not be modified by any instruction after Indexed Mode
is established, and neither must MAXVL be altered. Failure to observe
these conditions results in `UNDEFINED` behaviour.
These conditions allow a Read-After-Write (RAW) Hazard to be created on
the entire range of Indices to be subsequently used, but a corresponding
Write-After-Read Hazard by any instruction that modifies the Indices
**does not have to be created**. Given the large number of registers
involved in Indexing this is a huge resource saving and reduction
in micro-architectural complexity. MAXVL is likewise
included in the RAW Hazards because it is involved in calculating
how many registers are to be considered Indices.
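A minimal Python sketch of the idea follows, applied to an addition
rather than a move.  It is illustrative only: the register file is
modelled as a plain list, the Indices are held in a separate Python list
rather than in GPRs, and the name `indexed_add` is hypothetical.

```python
def indexed_add(regs, rt, ra, rb, indices, vl):
    """regs[rt+i] = regs[ra+indices[i]] + regs[rb+i] for i in 0..vl-1."""
    for i in range(vl):
        regs[rt + i] = regs[ra + indices[i]] + regs[rb + i]

regs = [0] * 16
regs[4:8] = [10, 20, 30, 40]    # source vector A
regs[8:12] = [1, 2, 3, 4]       # source vector B
indices = [3, 0, 2, 1]          # arbitrary reordering, applied to A only
indexed_add(regs, rt=0, ra=4, rb=8, indices=indices, vl=4)
print(regs[0:4])
```

Note that `indices` is only read inside the loop.  This mirrors the
caveat above: the Indices must stay unchanged for as long as Indexed
Mode is in use.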


# REMAP SPR

| 0  | 2  | 4  | 6  | 8  | 10..14 | 15..23 |
| -- | -- | -- | -- | -- | ------ | ------ |
|mi0 |mi1 |mi2 |mo0 |mo1 | SVme   | rsv    |

mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers (FRT, FRS respectively).
SVme is 5 bits, and indicates whether each selected
SVSHAPE is actively applied or not.

* bit 0 of SVme indicates if mi0 is applied to RA / FRA
* bit 1 of SVme indicates if mi1 is applied to RB / FRB
* bit 2 of SVme indicates if mi2 is applied to RC / FRC
* bit 3 of SVme indicates if mo0 is applied to RT / FRT
* bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
  (LD/ST-with-update has an implicit 2nd write register, RA)
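The relationship between the selectors and the enable mask can be
sketched in Python as below.  This is an assumption-laden model, not a
decoder: bit 0 of SVme is taken to be the least significant bit (which
agrees with the `svindex` mask examples later on this page), and the
name `active_shapes` is hypothetical.

```python
OPERANDS = ["mi0", "mi1", "mi2", "mo0", "mo1"]   # RA, RB, RC, RT, EA/FRS

def active_shapes(svme, selectors):
    """Map each operand slot to its SVSHAPE number, or None if not enabled.

    svme      -- 5-bit enable mask; bit n (assumed LSB0) enables OPERANDS[n]
    selectors -- dict of slot name -> SVSHAPE number (0-3)
    """
    return {name: (selectors[name] if (svme >> n) & 1 else None)
            for n, name in enumerate(OPERANDS)}

sel = {"mi0": 1, "mi1": 2, "mi2": 3, "mo0": 0, "mo1": 0}
print(active_shapes(0b00111, sel))
```

A selector value on its own therefore does nothing: the corresponding
SVme bit must also be set before the SVSHAPE is applied to that operand.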

# svremap instruction

There is also a corresponding SVRM-Form for the svremap
instruction which matches the above SPR:

    svremap SVme,mi0,mi1,mi2,mo0,mo1,pst

|0     |6     |11  |13   |15   |17   |19   |21    | 22   |26     |31 |
| --   | --   | -- | --  | --  | --  | --  | --   | ---- | ----- |-- |
| PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd | XO    | / |

# SHAPE Remapping SPRs

There are four "shape" SPRs, SHAPE0-3, each 32 bits wide,
all of which have the same format.

[[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]

# svshape instruction

`svshape` is a convenience instruction that reduces instruction
count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
(overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
including VL and MAXVL. Using `svshape` therefore does not also
require `setvl`.

Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])

    svshape SVxd,SVyd,SVzd,SVRM,vf

| 0..5 | 6..10 | 11..15 | 16..20 | 21..24 | 25 | 26..31 | name     |
| ---- | ----- | ------ | ------ | ------ | -- | ------ | -------- |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | svshape  |

Fields:

* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT etc.)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 6-bit XO field

| SVRM   | Remap Mode description |
| --     | --              |
| 0b0000 | Matrix 1/2/3D    |
| 0b0001 | FFT Butterfly   |
| 0b0010 | DCT Inner butterfly, pre-calculated coefficients |
| 0b0011 | DCT Outer butterfly  |
| 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
| 0b0101 | DCT COS table index generation |
| 0b0110 | DCT half-swap   |
| 0b0111 | reserved  |
| 0b1000 | reserved |
| 0b1001 | reserved  |
| 0b1010 | iDCT Inner butterfly, pre-calculated coefficients |
| 0b1011 | iDCT Outer butterfly  |
| 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
| 0b1101 | iDCT COS table index generation |
| 0b1110 | iDCT half-swap   |
| 0b1111 | FFT half-swap   |

Examples showing how all of these Modes operate exist in the online
[SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
and the full pseudocode setting up all SPRs
is in the [[openpower/isa/simplev]] page.

In Indexed Mode, there are only 5 bits available to specify the GPR
to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
5 bits are given in the `SVxd` field: the bottom two implicit bits
will be zero (`SVxd || 0b00`).
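Expressed as arithmetic, appending two zero bits is a left shift by two,
so only every fourth GPR can be named.  The short Python sketch below
shows this; the function name `indexed_gpr` is hypothetical.

```python
def indexed_gpr(svxd):
    """Expand a 5-bit SVxd field into a 7-bit GPR number (SVxd || 0b00)."""
    assert 0 <= svxd < 32
    return svxd << 2

# only multiples of four are reachable
print([indexed_gpr(n) for n in (0, 1, 2, 31)])
```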

`svshape` has *limited applicability* due to being a 32-bit instruction.
The full capability of SVSHAPE SPRs may be accessed by directly writing
to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
larger than 32, and in-place Transpose.  Potentially a future v3.1 Prefixed
instruction, `psvshape`, may extend the capability here.

# svindex instruction

`svindex` is a convenience instruction that reduces instruction
count for Indexed REMAP Mode. It sets up
(overwrites) all required SVSHAPE SPRs and can modify the REMAP
SPR as well.

Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]])

    svindex RS,mask,SVd,ew,yx,mm,sk

| 0..5 | 6..10 | 11..15 | 16..20 | 21..25      | 26..31 | name     |
| ---- | ----- | ------ | ------ | ----------- | ------ | -------- |
| OPCD | RS    | mask   | SVd    | ew/yx/mm/sk | XO     | svindex  |

Fields:

* **SVd** - SV REMAP x/y dim
* **mask** - sets remap mi0-2/mo0-1 and SVSHAPEs, controlled by mm
* **ew** - sets element width override
* **RS** - GPR RS<<2 to be used for Indexing
* **yx** - 2D reordering to be used if yx=1
* **mm** - mask mode. determines how mask is interpreted.
* **sk** - Dimension skipping enabled
* **XO** - standard 6-bit XO field

When `mm=0`:

* mask, like REMAP.SVme, has bit 0
  correspond to mi0, bit 1 to mi1, bit 2 to mi2,
  bit 3 to mo0 and bit 4 to mo1
* all SVSHAPEs and the REMAP SPR are first reset (initialised to zero)
* for each bit set in the 5-bit mask, in order, the first
  as-yet-unset SVSHAPE will be updated
  with the other operands in the instruction, and the REMAP
  SPR set.
* If all 5 bits of mask are set then both mi0 and mo1 use SVSHAPE0.

Example 1: if mask=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses SVSHAPE1.  REMAP.SVme is also set to 0b00110, REMAP.mi1=0
(SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)

Example 2: if mask=0b10001 then again SVSHAPE0 and SVSHAPE1
are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1

When `mm=1`:

* bits 0-2 of mask indicate an index selecting mi0-mo1
* bits 3-4 of mask indicate which SVSHAPE 0-3 shall be updated
* only the selected SVSHAPE is overwritten
* only the relevant bits in the REMAP SPR are updated

Example 1: if mask=0b10011 then mo0 is selected and SVSHAPE2
to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
set to 2 (SVSHAPE2).

Example 2: if mask=0b11100 then mo1 is selected and SVSHAPE3
to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
set to 3 (SVSHAPE3).

In essence, `mm=0` is intended for use to set as much of the
REMAP State SPRs as practical with a single instruction,
whilst `mm=1` is intended to be a little more refined.
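The two mask modes can be modelled in a few lines of Python, checked
against the four worked examples above.  This sketch only tracks the
REMAP SPR fields (as a dictionary); it does not model the SVSHAPE
contents.  Bit 0 is assumed to be the least significant bit, which is
the reading under which the examples are self-consistent, and the names
`svindex_mm0` and `svindex_mm1` are hypothetical.

```python
SLOTS = ["mi0", "mi1", "mi2", "mo0", "mo1"]

def svindex_mm0(mask):
    """mm=0: reset everything, then hand out SVSHAPEs to each set mask bit in order."""
    remap = {"SVme": mask & 0b11111}
    next_shape = 0
    for n, slot in enumerate(SLOTS):
        if (mask >> n) & 1:
            remap[slot] = next_shape % 4   # a fifth slot wraps back to SVSHAPE0
            next_shape += 1
    return remap

def svindex_mm1(mask, remap):
    """mm=1: update one slot and one SVSHAPE, leaving the rest of `remap` alone."""
    index = mask & 0b111              # bits 0-2: which of mi0..mo1
    shape = (mask >> 3) & 0b11        # bits 3-4: which SVSHAPE
    slot = SLOTS[index]               # index values 5-7 are not covered by this sketch
    updated = dict(remap)
    updated[slot] = shape
    updated["SVme"] = updated.get("SVme", 0) | (1 << index)
    return updated

print(svindex_mm0(0b00110))
print(svindex_mm1(0b10011, {}))
```

Because `mm=1` leaves unrelated fields untouched, several `mm=1`
instructions may be chained to build up a REMAP configuration one
operand at a time.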

# REMAP Matrix pseudocode

The algorithm below shows how REMAP works more clearly, and may be
executed as a python program:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
```

An easier-to-read version (using python iterators) shows the loop nesting:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
```

Each element index from the for-loop `0..VL-1`
is run through the above algorithm to work out the **actual** element
index, instead.  Given that there are four possible SHAPE entries, up to
four separate registers in any given operation may be simultaneously
remapped:

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                                  ireg[rs2+remap3(irs2)];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

By changing remappings, 2D matrices may be transposed "in-place" for one
operation, followed by setting a different permutation order without
having to move the values in the registers to or from memory.
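A deliberately simplified Python version of the loop above is sketched
here.  It treats every operand as a vector, leaves out the
scalar/vector handling and re-entrancy, and uses a hand-written 2D
transpose in place of the full algorithm, so it is not a substitute for
`remap.py`.  The names `transpose_2d` and `identity` are hypothetical.

```python
def transpose_2d(xdim, ydim):
    """Return a remap function that walks a row-major ydim x xdim matrix by column."""
    def remap(i):
        y = i % ydim
        x = i // ydim
        return y * xdim + x
    return remap

def identity(i):
    return i

def op_add(regs, rd, rs1, rs2, vl, remap1=identity, remap2=identity,
           remap3=identity, predval=~0):
    """Vector add with an independent remap on each operand (all vectors)."""
    for i in range(vl):
        if predval & (1 << i):
            regs[rd + remap1(i)] = regs[rs1 + remap2(i)] + regs[rs2 + remap3(i)]

regs = [0] * 24
regs[8:14] = [0, 1, 2, 3, 4, 5]          # 2 rows of 3: [[0,1,2],[3,4,5]]
# add zeros (regs[16..21]) to a transposed view of regs[8..13]
op_add(regs, rd=0, rs1=8, rs2=16, vl=6, remap2=transpose_2d(3, 2))
print(regs[0:6])
```

The source registers at `regs[8:14]` are never rearranged: the
transposition exists only in the order in which they are read, and a
later operation is free to read them with a different remap.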

Note that:

* Over-running the register file clearly has to be detected and
  an illegal instruction exception thrown
* When non-default elwidths are set, the exact same algorithm still
  applies (i.e. it offsets *polymorphic* elements *within* registers rather
  than entire registers).
* If permute option 000 is utilised, the actual order of the
  reindexing does not change.  However, modulo MVL still occurs
  which will result in repeated operations (use with caution).
* If two or more dimensions are set to zero, the actual order does not change!
* The above algorithm is pseudo-code **only**.  Actual implementations
  will need to take into account the fact that the element for-looping
  must be **re-entrant**, due to the possibility of exceptions occurring.
  See SVSTATE SPR, which records the current element index.
  Continuing after return from an interrupt may introduce latency
  due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets.  The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped.  *This even includes unit-strided LD/ST*
  and other operations
  in that category, where in that case it will be the **offset** that is
  remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register.  Without offsets, it is necessary
  to either use a predicated MV, skipping the first elements, or
  performing a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
* Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
  less than MVL is **perfectly legal**, albeit very obscure.  It permits
  entries to be regularly presented to operands **more than once**, thus
  allowing the same underlying registers to act as an accumulator of
  multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements *including polymorphic ones*

Clearly here some considerable care needs to be taken as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

# 4x4 Matrix to vec4 Multiply Example

The following settings will allow a 4x4 matrix (starting at f8), expressed
as a sequence of 16 numbers first by row then by column, to be multiplied
by a vector of length 4 (starting at f0), using a single FMAC instruction.

* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4

The permutation on SHAPE0 will use f0 as a vec4 source. On the first
four iterations through the hardware loop, the REMAPed index will not
increment. On the second four, the index will increase by one. Likewise
on each subsequent group of four.

The permutation on SHAPE1 will increment f4 continuously, cycling through
f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, the hardware loop
steps straight sequentially through the 16 values f8-f23 in the Matrix.
The equivalent sequence issued is thus:

    fmac f4, f0, f8, f4
    fmac f5, f0, f9, f5
    fmac f6, f0, f10, f6
    fmac f7, f0, f11, f7
    fmac f4, f1, f12, f4
    fmac f5, f1, f13, f5
    fmac f6, f1, f14, f6
    fmac f7, f1, f15, f7
    fmac f4, f2, f16, f4
    fmac f5, f2, f17, f5
    fmac f6, f2, f18, f6
    fmac f7, f2, f19, f7
    fmac f4, f3, f20, f4
    fmac f5, f3, f21, f5
    fmac f6, f3, f22, f6
    fmac f7, f3, f23, f7

The only other instruction required is to ensure that f4-f7 are
initialised (usually to zero).
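The sixteen-instruction sequence can be regenerated from three simple
index rules, which is a useful cross-check of the description.  The
Python sketch below derives the register numbers directly from the loop
counter rather than from the SHAPE settings, so it models the outcome
and not the SHAPE encoding; the name `matrix_vec4_sequence` is
hypothetical.

```python
def matrix_vec4_sequence(vl=16):
    """List the scalar FMACs equivalent to the REMAPed loop described above."""
    ops = []
    for i in range(vl):
        acc = 4 + (i % 4)     # accumulator cycles through f4-f7
        vec = 0 + (i // 4)    # vec4 source advances once every four steps
        mat = 8 + i           # matrix is read straight through, f8-f23
        ops.append(f"fmac f{acc}, f{vec}, f{mat}, f{acc}")
    return ops

for line in matrix_vec4_sequence():
    print(line)
```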

It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
the same technique applied to four independent vectors, can be done by
setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
it to apply four times to compute the four columns worth of vectors.

# Warshall transitive closure algorithm

TODO move to [[sv/remap/discussion]] page, copied from here
http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html

with thanks to Hendrik.

<https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>

> Just a note:  interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.

This can be done with the ternary instruction, which has
an in-place triple boolean input:

    RT = RT | (RA & RB)

and also has a CR Field variant of the same.

notes from conversations:

> > for y in y_r:
> >  for x in x_r:
> >    for z in z_r:
> >      result[y][x] +=
> >         a[y][z] *
> >         b[z][x]

> This nesting of loops works for matrix multiply, but not for transitive
> closure.

> > it can be done:
> >
> >   for z in z_r:
> >    for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >          a[y][z] *
> >          b[z][x]
>
> And this ordering of loops *does* work for transitive closure, when a,
> b, and result are the very same matrix, updated while being used.
>
> By the way, I believe there is a graph algorithm that does the
> transitive closure thing, but instead of using boolean, "and", and "or",
> they use real numbers, addition, and minimum.  I think that one computes
> shortest paths between vertices.
>
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).

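For reference, the loop ordering discussed in the quoted conversation
corresponds to the textbook Warshall algorithm, sketched below in plain
Python on a small made-up graph.  It is a scalar illustration of the
required nesting only, and says nothing about how the loops would be
expressed with REMAP.

```python
def transitive_closure(adj):
    """Warshall's algorithm on a boolean adjacency matrix, updated in place."""
    n = len(adj)
    for z in range(n):            # bridging vertex: must be the outer loop
        for y in range(n):
            for x in range(n):
                # RT = RT | (RA & RB), with all three drawn from the same matrix
                adj[y][x] = adj[y][x] or (adj[y][z] and adj[z][x])
    return adj

graph = [
    [False, True,  False, False],   # 0 -> 1
    [False, False, True,  False],   # 1 -> 2
    [False, False, False, True ],   # 2 -> 3
    [False, False, False, False],
]
closure = transitive_closure(graph)
print(closure[0])
```

After the call, vertex 0 reaches vertices 1, 2 and 3, while vertex 3
still reaches nothing.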
# SUBVL Remap

Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop.  This is the
*complete opposite* of predication which **only** applies to the whole
vec2/3/4.  In pseudocode this would be:

      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # apply to VL not SUBVL
          for (j = 0; j < SUBVL; j++)
             id = i*SUBVL + j # not, "id=i".
             ireg[RT+remap1(id)] ...
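The interaction between the per-vector predicate and the per-element
numbering can be seen by listing the `id` values the loop produces.  The
Python sketch below does only that: each `id` would then be passed
through the remap function.  The name `subvl_element_ids` is
hypothetical.

```python
def subvl_element_ids(vl, subvl, predval):
    """Flat element numbers visited when the predicate covers whole sub-vectors."""
    ids = []
    for i in range(vl):
        if predval & (1 << i):          # one predicate bit per vec2/3/4
            for j in range(subvl):
                ids.append(i * subvl + j)   # not simply "i"
    return ids

# three vec2 groups, with the middle one masked out
print(subvl_element_ids(vl=3, subvl=2, predval=0b101))
```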

The reason for allowing SUBVL Remaps is that some regular patterns using
Swizzle which would otherwise require multiple explicit instructions
with 12 bit swizzles encoded in them may be efficiently encoded with Remap
instead.  Note however that Swizzle is *still permitted to be applied*.

An example where SUBVL Remap is appropriate is the Rijndael MixColumns
stage:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />

Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
a 2D REMAP allows:

* the column bytes (as a vec4) to be iterated over as an inner loop,
  progressing vertically (`a00 a10 a20 a30`)
* the columns themselves to be iterated as an outer loop
* a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.

This is done entirely in-place without special 128-bit opcodes.  Below is
the pseudocode for [[!wikipedia Rijndael MixColumns]]

```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here
    // bytelevel elwidth overrides and vec4
    // These may then each be 4x 8-bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```

With the assumption made by the above code that the column bytes have
already been turned around (vertical rather than horizontal) SUBVL.REMAP
may transparently fill that role, in-place, without a complex byte-level
mv operation.
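The effect of that "turning around" can be modelled in Python by
reaching each column through computed offsets into the row-major state,
instead of copying the bytes first.  This is a software sketch of the
access pattern only: `gmix_column` is a port of the C routine, while
`xtime` and `mix_columns_in_place` are hypothetical helper names.

```python
def xtime(v):
    """Multiply by 2 in Rijndael's GF(2^8)."""
    v <<= 1
    return (v ^ 0x1B) & 0xFF if v & 0x100 else v

def gmix_column(col):
    """Python port of the C routine above, returning the mixed column."""
    a = list(col)
    b = [xtime(v) for v in a]
    return [
        b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
        b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
        b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
        b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0],
    ]

def mix_columns_in_place(state):
    """state holds a00 a01 a02 a03 a10 .. a33; columns are reached by index remap."""
    for x in range(4):                          # outer loop: which column
        idx = [y * 4 + x for y in range(4)]     # inner loop: a0x a1x a2x a3x
        mixed = gmix_column([state[i] for i in idx])
        for i, v in zip(idx, mixed):
            state[i] = v
    return state

# column 0 is db 13 53 45; the other three columns are all 01
state = [0xDB, 1, 1, 1, 0x13, 1, 1, 1, 0x53, 1, 1, 1, 0x45, 1, 1, 1]
print([hex(v) for v in mix_columns_in_place(state)])
```

The index list `idx` plays the part that the 2D REMAP would play in
hardware: the bytes stay where they are and only the order of access
changes.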

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations.  Given wide enough SIMD backends in hardware these 32 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.

The other alternative is to simply perform the actual 4x4 GF(256) Matrix
Multiply using the MDS Matrix.

# TODO

* investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
  in https://bugs.libre-soc.org/show_bug.cgi?id=653
* Triangular REMAP
* Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)