openpower/sv/remap.mdwn

   1 [[!tag standards]]
   2
   3 # REMAP <a name="remap" />
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143>
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
   7 * see  [[sv/propagation]] for a future way to apply
   8 REMAP.
   9
  10 REMAP is an advanced form of Vector "Structure Packing" that
  11 provides hardware-level support for commonly-used *nested* loop patterns.
  12 For more general reordering an Indexed REMAP mode is available.
  13
  14 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
  15 from a linear form to a 2D or 3D transposed form, or "offset" to permit
  16 arbitrary access to elements (when elwidth overrides are used),
  17 independently on each Vector src or dest
  18 register.
  19
  20 The initial primary motivation of REMAP was for Matrix Multiplication, reordering of sequential
  21 data in-place.  Four SPRs are provided so that a single FMAC may be
  22 used in a single loop to perform for example 4x4 times 4x4 Matrix multiplication,
  23 generating 64 FMACs.  Additional uses include regular "Structure Packing"
  24 such as RGB pixel data extraction and reforming.
  25
  26 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
  27 Vector ISAs which would typically only have a limited set of instructions
  28 that can be structure-packed (LD/ST typically), REMAP may be applied to
  29 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
  30
  31 Note that REMAP does not apply to sub-vector elements: that is what
  32 swizzle is for.  Swizzle *can* however be applied to the same instruction
  33 as REMAP.
  34
  35 In its general form, REMAP is quite expensive to set up, and on some
  36 implementations introduce
  37 latency, so should realistically be used only where it is worthwhile.
  38 Commonly-used patterns such as Matrix Multiply, DCT and FFT have
  39 helper instruction options which make REMAP easier to use.
  40
  41 There are three types of REMAP:
  42
  43 * **Matrix**, also known as 2D and 3D reshaping
  44 * **FFT/DCT**, with full triple-loop in-place support: limited to
  45   Power-2 RADIX
  46 * **Indexing**, for any general-purpose reordering. Currently
  47   under development.
  48
  49 # Principle
  50
  51 * normal vector element read/write of operands would be sequential
  52   (0 1 2 3 ....)
  53 * this is not appropriate for (e.g.) Matrix multiply which requires
  54   accessing elements in alternative sequences (0 3 6 1 4 7 ...)
  55 * normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  56   with this.  both are expensive (copy large vectors, spill through memory)
  57   and very few Packed SIMD ISAs cope with non-Power-2.
  58 * REMAP **redefines** the order of access according to set "Schedules".
  59 * The Schedules are not necessarily restricted to power-of-two boundaries
  60   making it unnecessary to have for example specialised 3x4 transpose
  61   instructions.
  62
  63 Only the most commonly-used algorithms in computer science have REMAP
  64 support, due to the high cost in both the ISA and in hardware.  For
  65 arbitrary remapping the `Indexed` REMAP may be used.
  66
  67 # Executive Summary Usage
  68
  69 * `svshape` to set the type of reordering to be applied to an
  70   otherwise usual `0..VL-1` hardware for-loop
  71 * `svremap` to set which registers a given reordering is to apply to
  72   (RA, RT etc)
  73 * `sv.instruction` where any Vectorised register marked by `svremap`
  74   will have its ordering REMAPPED according to the schedule set
  75   by `svshape`.
  76
  77 The following illustrative example multiplies a 3x4 and a 5x3
  78 matrix to create
  79 a 5x4 result:
  80
  81     svshape 5, 4, 3, 0, 0
  82     svremap 31, 1, 2, 3, 0, 0, 0, 0
  83     sv.fmadds 0.v, 8.v, 16.v, 0.v
  84
  85 * svshape sets up the four SVSHAPE SPRS
  86 * svremap activates all five registers RA RB RC RT RS (31)
  87 * svremap requests:
  88   - RA to use SVSHAPE1
  89   - RB to use SVSHAPE2
  90   - RC to use SVSHAPE3
  91   - RT to use SVSHAPE0
  92   - RS to use SVSHAPE0
  93 * sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v
  94 * With REMAP being active each register's element index is
  95   *independently* transformed using the specified SHAPEs.
  96
  97 Thus the Vector Loop is arranged such that the use of
  98 the multiply-and-accumulate instruction executes precisely the required
  99 Schedule to perform an in-place in-registers Matrix Multiply with no
 100 need to perform additional Transpose or register copy instructions.
 101 The example above may be executed as a unit test and demo,
 102 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
 103
 104 # REMAP types
 105
 106 This section summarises the motivation for each REMAP Schedule
 107 and briefly goes over their characteristics and limitations.
 108
 109 ## Matrix (1D/2D/3D shaping)
 110
 111 TODO
 112
 113 ## FFT/DCT Triple Loop
 114
 115 TODO
 116
 117 ## Indexed
 118
 119 The purpose of Indexing is to provide a generalised version of
 120 Vector ISA "Permute" instructions, such as VSX `vperm`.  The
 121 Indexing is abstracted out and may be applied to much more
 122 than an element move/copy, and is not limited for example
 123 to the number of bytes that can fit into a VSX register.
 124 Indexing may be applied to LD/ST (even on Indexed LD/ST
 125 instructions such as `sv.lbzx`), arithmetic operations,
 126 extsw: there is no artificial limit.
 127
 128 The only major caveat is that the registers to be used as
 129 Indices must not be modified by any instruction after Indexed Mode
 130 is established, and neither must MAXVL be altered. Failure to observe
 131 these conditions results in `UNDEFINED` behaviour.
 132 These conditions allow a Read-After-Write (RAW) Hazard to be created on
 133 the entire range of Indices to be subsequently used, but a corresponding
 134 Write-After-Read Hazard by any instruction that modifies the Indices
 135 **does not have to be created**. Given the large number of registers
 136 involved in Indexing this is a huge resource saving and reduction
 137 in micro-architectural complexity. MAXVL is likewise
 138 included in the RAW Hazards because it is involved in calculating
 139 how many registers are to be considered Indices.
 140
 141
 142 # REMAP SPR
 143
 144 | 0  | 2  | 4  | 6  | 8  | 10.14 | 15..23 |
 145 | -- | -- | -- | -- | -- | ----- | ------ |
 146 |mi0 |mi1 |mi2 |mo0 |mo1 | SVme  | rsv    |
 147
 148 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
 149 mi0-2 apply to RA, RB, RC respectively, as input registers, and
 150 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
 151 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
 152 SVSHAPE is actively applied or not.
 153
 154 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
 155 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
 156 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
 157 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
 158 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
 159   (LD/ST-with-update has an implicit 2nd write register, RA)
 160
 161 # svremap instruction
 162
 163 There is also a corresponding SVRM-Form for the svremap
 164 instruction which matches the above SPR:
 165
 166     svremap SVme,mi0,mi1,mi2,mo0,mo2,pst
 167
 168 |0     |6     |11  |13   |15   |17   |19   |21    | 22.25 |26..31 |
 169 | --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
 170 | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
 171
 172 # SHAPE Remapping SPRs
 173
 174 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
 175 which have the same format.
 176
 177 [[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
 178
 179 # svshape instruction
 180
 181 `svshape` is a convenience instruction that reduces instruction
 182 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
 183 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
 184 including VL and MAXVL. Using `svshape` therefore does not also
 185 require `setvl`.
 186
 187 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
 188
 189     svshape SVxd,SVyd,SVzd,SVRM,vf
 190
 191 | 0.5|6.10  |11.15  |16..20 | 21..25 | 25 | 26..31|  name    |
 192 | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
 193 |OPCD| SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svstate  |
 194
 195 Fields:
 196
 197 * **SVxd** - SV REMAP "xdim"
 198 * **SVyd** - SV REMAP "ydim"
 199 * **SVzd** - SV REMAP "zdim"
 200 * **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
 201 * **vf** - sets "Vertical-First" mode
 202 * **XO** - standard 6-bit XO field
 203
 204 | SVRM   | Remap Mode description |
 205 | --     | --              |
 206 | 0b0000 | Matrix 1/2/3D    |
 207 | 0b0001 | FFT Butterfly   |
 208 | 0b0010 | DCT Inner butterfly, pre-calculated coefficients |
 209 | 0b0011 | DCT Outer butterfly  |
 210 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
 211 | 0b0101 | DCT COS table index generation |
 212 | 0b0110 | DCT half-swap   |
 213 | 0b0111 | reserved  |
 214 | 0b1000 | reserved |
 215 | 0b1001 | reserved  |
 216 | 0b1010 | iDCT Inner butterfly, pre-calculated coefficients |
 217 | 0b1011 | iDCT Outer butterfly  |
 218 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
 219 | 0b1101 | iDCT COS table index generation |
 220 | 0b1110 | iDCT half-swap   |
 221 | 0b1111 | FFT half-swap   |
 222
 223 Examples showing how all of these Modes operate exists in the online
 224 [SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
 225 and the full pseudocode setting up all SPRs
 226 is in the [[openpower/isa/simplev]] page.
 227
 228 In Indexed Mode, there are only 5 bits available to specify the GPR
 229 to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
 230 5 bits are given in the `SVxd` field: the bottom two implicit bits
 231 will be zero (`SVxd || 0b00`).
 232
 233 `svshape` has *limited applicability* due to being a 32-bit instruction.
 234 The full capability of SVSHAPE SPRs may be accessed by directly writing
 235 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
 236 larger than 32, and in-place Transpose.  Potentially a future v3.1 Prefixed
 237 instruction, `psvshape`, may extend the capability here.
 238
 239 # svindex instruction
 240
 241 `svindex` is a convenience instruction that reduces instruction
 242 count for Indexed REMAP Mode. It sets up
 243 (overwrites) all required SVSHAPE SPRs and can modify the REMAP
 244 SPR as well.
 245
 246 Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]])
 247
 248     svindex RS,mask,SVd,ew,yz,mr,sk
 249
 250 | 0.5|6.10 |11.15  |16.20 | 21..25      | 26..31|  name    | Form |
 251 | -- | --  | ---   | ---- | ----------- | ------| -------- | ---- |
 252 |OPCD| RS  | mask  | SVd  | ew/yx/mm/sk | XO    | svindex | SVI-Form |
 253
 254 Fields:
 255
 256 * **SVd** - SV REMAP x/y dim
 257 * **mask** - sets remap mi0-2/mo0-1 and SVSHAPEs, controlled by mm
 258 * **ew** - sets element width override
 259 * **RS** - GPR RS<<2 to be used for Indexing
 260 * **yx** - 2D reordering to be used if yx=1
 261 * **mm** - mask mode. determines how mask is interpreted.
 262 * **sk** - Dimension skipping enabled
 263 * **XO** - standard 6-bit XO field
 264
 265 When `mm=0`:
 266
 267 * mask, like REMAP.SVme, has bit 0
 268   correspond to mi0, bit 1 to mi1, bit 2 to mi2,
 269   bit 3 to mo0 and bit 4 to mi1
 270 * all SVSHAPEs and the REMAP SPR are first reset (initialised to zero)
 271 * for each bit set in the 5-bit mask, in order, the first
 272   as-yet-unset SVSHAPE will be updated
 273   with the other operands in the instruction, and the REMAP
 274   SPR set.
 275 * If all 5 bits of mask are set then both mi0 and mo1 use SVSHAPE0.
 276
 277 Example 1: if mask=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
 278 and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
 279 uses mi2.  REMAP.SVme is also set to 0b00110, REMAP.mi1=0
 280 (SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
 281
 282 Example 2: if mask=0b10001 then again SVSHAPE0 and SVSHAPE1
 283 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
 284 and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
 285
 286 When `mm=1`:
 287
 288 * bits 0-2 of mask indicate an index selecting mi0-mo1
 289 * bits 3-4 of mask indicate which SVSHAPE 0-3 shall be updated
 290 * only the selected SVSHAPE is overwritten
 291 * only the relevant bits in the REMAP SPR are updated
 292
 293 Example 1: if mask=0b10011 then mo0 is selected and SVSHAPE2
 294 to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
 295 set to 2 (SVSHAPE2).
 296
 297 Example 2: if mask=0b11100 then mo1 is selected and SVSHAPE3
 298 to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
 299 set to 3 (SVSHAPE3).
 300
 301 In essence, `mm=0` is intended for use to set as much of the
 302 REMAP State SPRs as practical with a single instruction,
 303 whilst `mm=1` is intended to be a little more refined.
 304
 305 # REMAP Matrix pseudocode
 306
 307 The algorithm below shows how REMAP works more clearly, and may be
 308 executed as a python program:
 309
 310 ```
 311 [[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
 312 ```
 313
 314 An easier-to-read version (using python iterators) shows the loop nesting:
 315
 316 ```
 317 [[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
 318 ```
 319
 320 Each element index from the for-loop `0..VL-1`
 321 is run through the above algorithm to work out the **actual** element
 322 index, instead.  Given that there are four possible SHAPE entries, up to
 323 four separate registers in any given operation may be simultaneously
 324 remapped:
 325
 326     function op_add(rd, rs1, rs2) # add not VADD!
 327       ...
 328       ...
 329       for (i = 0; i < VL; i++)
 330         xSTATE.srcoffs = i # save context
 331         if (predval & 1<<i) # predication uses intregs
 332            ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
 333                                   ireg[rs2+remap3(irs2)];
 334            if (!int_vec[rd ].isvector) break;
 335         if (int_vec[rd ].isvector)  { id += 1; }
 336         if (int_vec[rs1].isvector)  { irs1 += 1; }
 337         if (int_vec[rs2].isvector)  { irs2 += 1; }
 338
 339 By changing remappings, 2D matrices may be transposed "in-place" for one
 340 operation, followed by setting a different permutation order without
 341 having to move the values in the registers to or from memory.
 342
 343 Note that:
 344
 345 * Over-running the register file clearly has to be detected and
 346   an illegal instruction exception thrown
 347 * When non-default elwidths are set, the exact same algorithm still
 348   applies (i.e. it offsets *polymorphic* elements *within* registers rather
 349   than entire registers).
 350 * If permute option 000 is utilised, the actual order of the
 351   reindexing does not change.  However, modulo MVL still occurs
 352   which will result in repeated operations (use with caution).
 353 * If two or more dimensions are set to zero, the actual order does not change!
 354 * The above algorithm is pseudo-code **only**.  Actual implementations
 355   will need to take into account the fact that the element for-looping
 356   must be **re-entrant**, due to the possibility of exceptions occurring.
 357   See SVSTATE SPR, which records the current element index.
 358   Continuing after return from an interrupt may introduce latency
 359   due to re-computation of the remapped offsets.
 360 * Twin-predicated operations require **two** separate and distinct
 361   element offsets.  The above pseudo-code algorithm will be applied
 362   separately and independently to each, should each of the two
 363   operands be remapped.  *This even includes unit-strided LD/ST*
 364   and other operations
 365   in that category, where in that case it will be the **offset** that is
 366   remapped.
 367 * Offset is especially useful, on its own, for accessing elements
 368   within the middle of a register.  Without offsets, it is necessary
 369   to either use a predicated MV, skipping the first elements, or
 370   performing a LOAD/STORE cycle to memory.
 371   With offsets, the data does not have to be moved.
 372 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 373   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 374   entries to be regularly presented to operands **more than once**, thus
 375   allowing the same underlying registers to act as an accumulator of
 376   multiple vector or matrix operations, for example.
 377 * Note especially that Program Order **must** still be respected
 378   even when overlaps occur that read or write the same register
 379   elements *including polymorphic ones*
 380
 381 Clearly here some considerable care needs to be taken as the remapping
 382 could hypothetically create arithmetic operations that target the
 383 exact same underlying registers, resulting in data corruption due to
 384 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 385 register-renaming will have an easier time dealing with this than
 386 DSP-style SIMD micro-architectures.
 387
 388 # 4x4 Matrix to vec4 Multiply Example
 389
 390 The following settings will allow a 4x4 matrix (starting at f8), expressed
 391 as a sequence of 16 numbers first by row then by column, to be multiplied
 392 by a vector of length 4 (starting at f0), using a single FMAC instruction.
 393
 394 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 395 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 396 * VL=16, f4=vec, f0=vec, f8=vec
 397 * FMAC f4, f0, f8, f4
 398
 399 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
 400 four iterations through the hardware loop, the REMAPed index will not
 401 increment. On the second four, the index will increase by one. Likewise
 402 on each subsequent group of four.
 403
 404 The permutation on SHAPE1 will increment f4 continuously cycling through
 405 f4-f7 every four iterations of the hardware loop.
 406
 407 At the same time, VL will, because there is no SHAPE on f8, increment
 408 straight sequentially through the 16 values f8-f23 in the Matrix. The
 409 equivalent sequence thus is issued:
 410
 411     fmac f4, f0, f8, f4
 412     fmac f5, f0, f9, f5
 413     fmac f6, f0, f10, f6
 414     fmac f7, f0, f11, f7
 415     fmac f4, f1, f12, f4
 416     fmac f5, f1, f13, f5
 417     fmac f6, f1, f14, f6
 418     fmac f7, f1, f15, f7
 419     fmac f4, f2, f16, f4
 420     fmac f5, f2, f17, f5
 421     fmac f6, f2, f18, f6
 422     fmac f7, f2, f19, f7
 423     fmac f4, f3, f20, f4
 424     fmac f5, f3, f21, f5
 425     fmac f6, f3, f22, f6
 426     fmac f7, f3, f23, f7
 427
 428 The only other instruction required is to ensure that f4-f7 are
 429 initialised (usually to zero).
 430
 431 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
 432 the same technique applied to four independent vectors, can be done by
 433 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
 434 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
 435 it to apply four times to compute the four columns worth of vectors.
 436
 437 # Warshall transitive closure algorithm
 438
 439 TODO move to [[sv/remap/discussion]] page, copied from here
 440 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
 441
 442 with thanks to Hendrik.
 443
 444 <https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
 445
 446 > Just a note:  interpreting + as 'or', and * as 'and',
 447 > operating on Boolean matrices,
 448 > and having result, X, and Y be the exact same matrix,
 449 > updated while being used,
 450 > gives the traditional Warshall transitive-closure
 451 > algorithm, if the loops are nested exactly in thie order.
 452
 453 this can be done with the ternary instruction which has
 454 an in-place triple boolean input:
 455
 456     RT = RT | (RA & RB)
 457
 458 and also has a CR Field variant of the same
 459
 460 notes from conversations:
 461
 462 > > for y in y_r:
 463 > >  for x in x_r:
 464 > >    for z in z_r:
 465 > >      result[y][x] +=
 466 > >         a[y][z] *
 467 > >         b[z][x]
 468
 469 > This nesting of loops works for matrix multiply, but not for transitive
 470 > closure.
 471
 472 > > it can be done:
 473 > >
 474 > >   for z in z_r:
 475 > >    for y in y_r:
 476 > >     for x in x_r:
 477 > >       result[y][x] +=
 478 > >          a[y][z] *
 479 > >          b[z][x]
 480 >
 481 > And this ordering of loops *does* work for transitive closure, when a,
 482 > b, and result are the very same matrix, updated while being used.
 483 >
 484 > By the way, I believe there is a graph algorithm that does the
 485 > transitive closure thing, but instead of using boolean, "and", and "or",
 486 > they use real numbers, addition, and minimum.  I think that one computes
 487 > shortest paths between vertices.
 488 >
 489 > By the time the z'th iteration of the z loop begins, the algorithm has
 490 > already peocessed paths that go through vertices numbered < z, and it
 491 > adds paths that go through vertices numbered z.
 492 >
 493 > For this to work, the outer loop has to be the one on teh subscript that
 494 > bridges a and b (which in this case are teh same matrix, of course).
 495
 496 # SUBVL Remap
 497
 498 Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
 499 sub-vectir elements were simply part of the main VL loop.  This is the
 500 *complete opposite* of predication which **only** applies to the whole
 501 vec2/3/4.  In pseudocode this would be:
 502
 503       for (i = 0; i < VL; i++)
 504         if (predval & 1<<i) # apply to VL not SUBVL
 505           for (j = 0; j < SUBVL; j++)
 506              id = i*SUBVL + j # not, "id=i".
 507              ireg[RT+remap1(id)] ...
 508
 509 The reason for allowing SUBVL Remaps is that some regular patterns using
 510 Swizzle which would otherwise require multiple explicit instructions
 511 with 12 bit swizzles encoded in them may be efficently encoded with Remap
 512 instead.  Not however that Swizzle is *still permitted to be applied*.
 513
 514 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
 515 stage:
 516
 517 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
 518
 519 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
 520 a 2D REMAP allows:
 521
 522 * the column bytes (as a vec4) to be iterated over as an inner loop,
 523   progressing vertically (`a00 a10 a20 a30`)
 524 * the columns themselves to be iterated as an outer loop
 525 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
 526
 527 This entirely in-place without special 128-bit opcodes.  Below is
 528 the pseudocode for [[!wikipedia Rijndael MixColumns]]
 529
 530 ```
 531 void gmix_column(unsigned char *r) {
 532     unsigned char a[4];
 533     unsigned char b[4];
 534     unsigned char c;
 535     unsigned char h;
 536     // no swizzle here but still SUBVL.Remap
 537     // can be done as vec4 byte-level
 538     // elwidth overrides though.
 539     for (c = 0; c < 4; c++) {
 540         a[c] = r[c];
 541         h = (unsigned char)((signed char)r[c] >> 7);
 542         b[c] = r[c] << 1;
 543         b[c] ^= 0x1B & h; /* Rijndael's Galois field */
 544     }
 545     // SUBVL.Remap still needed here
 546     // bytelevel elwidth overrides and vec4
 547     // These may then each be 4x 8bit bit Swizzled
 548     // r0.vec4 = b.vec4
 549     // r0.vec4 ^= a.vec4.WXYZ
 550     // r0.vec4 ^= a.vec4.ZWXY
 551     // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
 552     r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
 553     r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
 554     r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
 555     r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
 556 }
 557 ```
 558
 559 With the assumption made by the above code that the column bytes have
 560 already been turned around (vertical rather than horizontal) SUBVL.REMAP
 561 may transparently fill that role, in-place, without a complex byte-level
 562 mv operation.
 563
 564 The application of the swizzles allows the remapped vec4 a, b and r
 565 variables to perform four straight linear 32 bit XOR operations where a
 566 scalar processor would be required to perform 16 byte-level individual
 567 operations.  Given wide enough SIMD backends in hardware these 3 bit
 568 XORs may be done as single-cycle operations across the entire 128 bit
 569 Rijndael Matrix.
 570
 571 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
 572 Multiply using the MDS Matrix.
 573
 574 # TODO
 575
 576 * investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
 577 in https://bugs.libre-soc.org/show_bug.cgi?id=653
 578 * Triangular REMAP
 579 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
 580