openpower/sv/remap.mdwn

   1 [[!tag standards]]
   2
   3 # REMAP <a name="remap" />
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143>
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
   7 * see  [[sv/propagation]] for a future way to apply
   8 REMAP.
   9
  10 REMAP is an advanced form of Vector "Structure Packing" that
  11 provides hardware-level support for commonly-used *nested* loop patterns.
  12 For more general reordering an Indexed REMAP mode is available.
  13
  14 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
  15 from a linear form to a 2D or 3D transposed form, or "offset" to permit
  16 arbitrary access to elements (when elwidth overrides are used),
  17 independently on each Vector src or dest
  18 register.
  19
  20 The initial primary motivation of REMAP was for Matrix Multiplication, reordering of sequential
  21 data in-place.  Four SPRs are provided so that a single FMAC may be
  22 used in a single loop to perform for example 4x4 times 4x4 Matrix multiplication,
  23 generating 64 FMACs.  Additional uses include regular "Structure Packing"
  24 such as RGB pixel data extraction and reforming.
  25
  26 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
  27 Vector ISAs which would typically only have a limited set of instructions
  28 that can be structure-packed (LD/ST typically), REMAP may be applied to
  29 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
  30
  31 Note that REMAP does not *directly* apply to sub-vector elements: that
  32 is what swizzle is for.  Swizzle *can* however be applied to the same
  33 instruction as REMAP.  As explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack EXTRA Mode bits
  34 can extend down into Sub-vector elements to perform vec2/vec3/vec4
  35 sequential reordering, but even here, REMAP is not extended down to
  36 the actual sub-vector elements themselves.
  37
  38 In its general form, REMAP is quite expensive to set up, and on some
  39 implementations introduce
  40 latency, so should realistically be used only where it is worthwhile.
  41 Commonly-used patterns such as Matrix Multiply, DCT and FFT have
  42 helper instruction options which make REMAP easier to use.
  43
  44 There are three types of REMAP:
  45
  46 * **Matrix**, also known as 2D and 3D reshaping
  47 * **FFT/DCT**, with full triple-loop in-place support: limited to
  48   Power-2 RADIX
  49 * **Indexing**, for any general-purpose reordering, also includes
  50   limited 2D reshaping. Currently
  51   under development.
  52
  53 # Principle
  54
  55 * normal vector element read/write of operands would be sequential
  56   (0 1 2 3 ....)
  57 * this is not appropriate for (e.g.) Matrix multiply which requires
  58   accessing elements in alternative sequences (0 3 6 1 4 7 ...)
  59 * normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  60   with this.  both are expensive (copy large vectors, spill through memory)
  61   and very few Packed SIMD ISAs cope with non-Power-2.
  62 * REMAP **redefines** the order of access according to set "Schedules".
  63 * The Schedules are not necessarily restricted to power-of-two boundaries
  64   making it unnecessary to have for example specialised 3x4 transpose
  65   instructions.
  66
  67 Only the most commonly-used algorithms in computer science have REMAP
  68 support, due to the high cost in both the ISA and in hardware.  For
  69 arbitrary remapping the `Indexed` REMAP may be used.
  70
  71 # Executive Summary Usage
  72
  73 * `svshape` to set the type of reordering to be applied to an
  74   otherwise usual `0..VL-1` hardware for-loop
  75 * `svremap` to set which registers a given reordering is to apply to
  76   (RA, RT etc)
  77 * `sv.instruction` where any Vectorised register marked by `svremap`
  78   will have its ordering REMAPPED according to the schedule set
  79   by `svshape`.
  80
  81 The following illustrative example multiplies a 3x4 and a 5x3
  82 matrix to create
  83 a 5x4 result:
  84
  85     svshape 5, 4, 3, 0, 0
  86     svremap 31, 1, 2, 3, 0, 0, 0, 0
  87     sv.fmadds 0.v, 8.v, 16.v, 0.v
  88
  89 * svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
  90 * svremap activates all five registers RA RB RC RT RS (31)
  91 * svremap requests:
  92   - RA to use SVSHAPE1
  93   - RB to use SVSHAPE2
  94   - RC to use SVSHAPE3
  95   - RT to use SVSHAPE0
  96   - RS to use SVSHAPE0
  97 * sv.fmadds has RT=0.v, RA=8.v, RB=16.v, RC=0.v
  98 * With REMAP being active each register's element index is
  99   *independently* transformed using the specified SHAPEs.
 100
 101 Thus the Vector Loop is arranged such that the use of
 102 the multiply-and-accumulate instruction executes precisely the required
 103 Schedule to perform an in-place in-registers Matrix Multiply with no
 104 need to perform additional Transpose or register copy instructions.
 105 The example above may be executed as a unit test and demo,
 106 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
 107
 108 # REMAP types
 109
 110 This section summarises the motivation for each REMAP Schedule
 111 and briefly goes over their characteristics and limitations.
 112
 113 ## Matrix (1D/2D/3D shaping)
 114
 115 Matrix Multiplication is a huge part of High-Performance Compute,
 116 and 3D.
 117 In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
 118 Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
 119 cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
 120 Aside from the cost of the load on the L1 I-Cache, the trick only
 121 works if one of the dimensions X or Y are power-two. Prime Numbers
 122 (5x7, 3x5) become deeply problematic to unroll.
 123
 124 Even traditional Scalable Vector ISAs have issues with Matrices, often
 125 having to perform data Transpose by pushing out through Memory and back,
 126 or computing Transposition Indices (costly) then copying to another
 127 Vector (costly).
 128
 129 Matrix REMAP was thus designed to solve these issues by providing Hardware
 130 Assisted
 131 "Schedules" that can view what would otherwise be limited to a strictly
 132 linear Vector as instead being 2D (even 3D) *in-place* reordered.
 133 With both Transposition and non-power-two being supported the issues
 134 faced by other ISAs are mitigated.
 135
 136 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
 137 restricted to 127: up to 127 FMAs may be performed in total (potentially
 138 127 vec2/3/4 FMAs may be used but this requires additional research).
 139 Also given that it is in-registers only at present some care has to be
 140 taken on regfile resource utilisation. However it is perfectly possible
 141 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
 142 the usual 6-level large Matrix Multiply, without the usual difficulties
 143 associated with SIMD.
 144
 145 Also the `svshape` instruction only provides access to part of the
 146 Matrix REMAP capability. Rotation and mirroring need to be done by
 147 programming the SVSHAPE SPRs directly, which can take a lot more
 148 instructions.
 149
 150 ## FFT/DCT Triple Loop
 151
 152 TODO
 153
 154 ## Indexed
 155
 156 The purpose of Indexing is to provide a generalised version of
 157 Vector ISA "Permute" instructions, such as VSX `vperm`.  The
 158 Indexing is abstracted out and may be applied to much more
 159 than an element move/copy, and is not limited for example
 160 to the number of bytes that can fit into a VSX register.
 161 Indexing may be applied to LD/ST (even on Indexed LD/ST
 162 instructions such as `sv.lbzx`), arithmetic operations,
 163 extsw: there is no artificial limit.
 164
 165 The only major caveat is that the registers to be used as
 166 Indices must not be modified by any instruction after Indexed Mode
 167 is established, and neither must MAXVL be altered. Failure to observe
 168 these conditions results in `UNDEFINED` behaviour.
 169 These conditions allow a Read-After-Write (RAW) Hazard to be created on
 170 the entire range of Indices to be subsequently used, but a corresponding
 171 Write-After-Read Hazard by any instruction that modifies the Indices
 172 **does not have to be created**. Given the large number of registers
 173 involved in Indexing this is a huge resource saving and reduction
 174 in micro-architectural complexity. MAXVL is likewise
 175 included in the RAW Hazards because it is involved in calculating
 176 how many registers are to be considered Indices.
 177
 178 With these restrictions in place high-performance implementations
 179 may read-cache the Indices from the point where a given `svindex` instruction
 180 is called or SVSHAPE SPRs altered.
 181
 182 # REMAP area of SVSTATE
 183
 184 The following bits of the SVSTATE SPR are used for REMAP:
 185
 186 |32.33|34.35|36.37|38.39|40.41| 42.46 | 62 |
 187 | --  | --  | --  | --  | --  | ----- | ------ |
 188 |mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst    |
 189
 190 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
 191 mi0-2 apply to RA, RB, RC respectively, as input registers, and
 192 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
 193 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
 194 SVSHAPE is actively applied or not.
 195
 196 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
 197 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
 198 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
 199 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
 200 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS / RS
 201   (LD/ST-with-update has an implicit 2nd write register, RA)
 202
 203 # svremap instruction <a name="svremap"> </a>
 204
 205 There is also a corresponding SVRM-Form for the svremap
 206 instruction which matches the above SPR:
 207
 208     svremap SVme,mi0,mi1,mi2,mo0,mo2,pst
 209
 210 |0     |6     |11  |13   |15   |17   |19   |21    | 22.25 |26..31 |
 211 | --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
 212 | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
 213
 214 # SHAPE Remapping SPRs
 215
 216 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
 217 which have the same format.
 218
 219 [[!inline pages="openpower/sv/shape_table_format" raw="yes" ]]
 220
 221 # svshape instruction  <a name="svshape"> </a>
 222
 223 `svshape` is a convenience instruction that reduces instruction
 224 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
 225 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
 226 including VL and MAXVL. Using `svshape` therefore does not also
 227 require `setvl`.
 228
 229 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
 230
 231     svshape SVxd,SVyd,SVzd,SVRM,vf
 232
 233 | 0.5|6.10  |11.15  |16..20 | 21..25 | 25 | 26..31|  name    |
 234 | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
 235 |OPCD| SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svstate  |
 236
 237 Fields:
 238
 239 * **SVxd** - SV REMAP "xdim"
 240 * **SVyd** - SV REMAP "ydim"
 241 * **SVzd** - SV REMAP "zdim"
 242 * **SVRM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT etc.)
 243 * **vf** - sets "Vertical-First" mode
 244 * **XO** - standard 6-bit XO field
 245
 246 *Note: SVxd, SVyz and SVzd are all stored "off-by-one".  In the assembler
 247 mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
 248
 249 | SVRM   | Remap Mode description |
 250 | --     | --              |
 251 | 0b0000 | Matrix 1/2/3D    |
 252 | 0b0001 | FFT Butterfly   |
 253 | 0b0010 | DCT Inner butterfly, pre-calculated coefficients |
 254 | 0b0011 | DCT Outer butterfly  |
 255 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
 256 | 0b0101 | DCT COS table index generation |
 257 | 0b0110 | DCT half-swap   |
 258 | 0b0111 | reserved  |
 259 | 0b1000 | reserved |
 260 | 0b1001 | reserved  |
 261 | 0b1010 | iDCT Inner butterfly, pre-calculated coefficients |
 262 | 0b1011 | iDCT Outer butterfly  |
 263 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
 264 | 0b1101 | iDCT COS table index generation |
 265 | 0b1110 | iDCT half-swap   |
 266 | 0b1111 | FFT half-swap   |
 267
 268 Examples showing how all of these Modes operate exists in the online
 269 [SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD)
 270 and the full pseudocode setting up all SPRs
 271 is in the [[openpower/isa/simplev]] page.
 272
 273 In Indexed Mode, there are only 5 bits available to specify the GPR
 274 to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
 275 5 bits are given in the `SVxd` field: the bottom two implicit bits
 276 will be zero (`SVxd || 0b00`).
 277
 278 `svshape` has *limited applicability* due to being a 32-bit instruction.
 279 The full capability of SVSHAPE SPRs may be accessed by directly writing
 280 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
 281 larger than 32, and in-place Transpose.  Potentially a future v3.1 Prefixed
 282 instruction, `psvshape`, may extend the capability here.
 283
 284 # svindex instruction  <a name="svindex"> </a>
 285
 286 `svindex` is a convenience instruction that reduces instruction
 287 count for Indexed REMAP Mode. It sets up
 288 (overwrites) all required SVSHAPE SPRs and can modify the REMAP
 289 SPR as well.  The relevant SPRs *may* be directly programmed with
 290 `mtspr` however it is laborious to do so: svindex saves instructions
 291 covering much of Indexed REMAP capability.
 292
 293 Form: SVI-Form SV "Indexed" Form (see [[isatables/fields.text]])
 294
 295     svindex SVG,rmm,SVd,ew,yx,mr,sk
 296
 297 | 0.5|6.10 |11.15  |16.20 | 21..25      | 26..31|  name    | Form |
 298 | -- | --  | ---   | ---- | ----------- | ------| -------- | ---- |
 299 |OPCD| SVG | rmm   | SVd  | ew/yx/mm/sk | XO    | svindex | SVI-Form |
 300
 301 Fields:
 302
 303 * **SVd** - SV REMAP x/y dim
 304 * **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
 305   controlled by mm
 306 * **ew** - sets element width override on RS
 307 * **SVG** - GPR SVG<<2 to be used for Indexing
 308 * **yx** - 2D reordering to be used if yx=1
 309 * **mm** - mask mode. determines how `rmm` is interpreted.
 310 * **sk** - Dimension skipping enabled
 311 * **XO** - standard 6-bit XO field
 312
 313 *Note: SVd, like SVxd, SVyz and SVzd of `svshape`, are all stored
 314 "off-by-one".  In the assembler
 315 mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*.
 316
 317 *Note: when `yx=1,sk=0` the second dimension is calculated as
 318 `CEIL(MAXVL/SVd)`*.
 319
 320 When `mm=0`:
 321
 322 * `rmm`, like REMAP.SVme, has bit 0
 323   correspond to mi0, bit 1 to mi1, bit 2 to mi2,
 324   bit 3 to mo0 and bit 4 to mi1
 325 * all SVSHAPEs and the REMAP parts of SVSHAPE are first reset (initialised to zero)
 326 * for each bit set in the 5-bit `rmm`, in order, the first
 327   as-yet-unset SVSHAPE will be updated
 328   with the other operands in the instruction, and the REMAP
 329   SPR set.
 330 * If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
 331 * SVSTATE persistence bit is cleared
 332 * No other alterations to SVSTATE are carried out
 333
 334 Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
 335 and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
 336 uses mi2.  REMAP.SVme is also set to 0b00110, REMAP.mi1=0
 337 (SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
 338
 339 Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
 340 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
 341 and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
 342
 343 Rough algorithmic form:
 344
 345     marray = [mi0, mi1, mi2, mo0, mo1]
 346     idx = 0
 347     for bit = 0 to 4:
 348         if not rmm[bit]: continue
 349         setup(SVSHAPE[idx])
 350         SVSTATE{marray[bit]} = idx
 351         idx = (idx+1) modulo 4
 352
 353 When `mm=1`:
 354
 355 * bits 0-2 of `rmm` indicate an index selecting mi0-mo1
 356 * bits 3-4 of `rmm` indicate which SVSHAPE 0-3 shall be updated
 357 * only the selected SVSHAPE is overwritten
 358 * only the relevant bits in the REMAP area of SVSTATE are updated
 359 * REMAP persistence bit is set.
 360
 361 Example 1: if `rmm`=0b10011 then mo0 is selected and SVSHAPE2
 362 to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
 363 set to 2 (SVSHAPE2).
 364
 365 Example 2: if `rmm`=0b11100 then mo1 is selected and SVSHAPE3
 366 to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
 367 set to 3 (SVSHAPE3).
 368
 369 Rough algorithmic form:
 370
 371     marray = [mi0, mi1, mi2, mo0, mo1]
 372     bit = rmm[0:2]
 373     idx = rmm[3:4]
 374     setup(SVSHAPE[idx])
 375     SVSTATE{marray[bit]} = idx
 376     SVSTATE.pst = 1
 377
 378 In essence, `mm=0` is intended for use to set as much of the
 379 REMAP State SPRs as practical with a single instruction,
 380 whilst `mm=1` is intended to be a little more refined.
 381
 382 **Usage guidelines**
 383
 384 * **Disable 2D mapping**: to only perform Indexing without
 385  reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value larger
 386  or equal to VL)
 387 * **Modulo 1D mapping**: to perform Indexing cycling through the
 388  first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is
 389  no requirement to set VL equal to a multiple of N.
 390 * **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets
 391  `xdim=M,ydim=CEIL(MAXVL/M)`.
 392
 393 Beyond these mappings it becomes necessary to write directly to
 394 the SVSTATE SPRs manually.
 395
 396 # REMAP Matrix pseudocode
 397
 398 The algorithm below shows how REMAP works more clearly, and may be
 399 executed as a python program:
 400
 401 ```
 402 [[!inline pages="openpower/sv/remap.py" quick="yes" raw="yes" ]]
 403 ```
 404
 405 An easier-to-read version (using python iterators) shows the loop nesting:
 406
 407 ```
 408 [[!inline pages="openpower/sv/remapyield.py" quick="yes" raw="yes" ]]
 409 ```
 410
 411 Each element index from the for-loop `0..VL-1`
 412 is run through the above algorithm to work out the **actual** element
 413 index, instead.  Given that there are four possible SHAPE entries, up to
 414 four separate registers in any given operation may be simultaneously
 415 remapped:
 416
 417     function op_add(rd, rs1, rs2) # add not VADD!
 418       ...
 419       ...
 420       for (i = 0; i < VL; i++)
 421         xSTATE.srcoffs = i # save context
 422         if (predval & 1<<i) # predication uses intregs
 423            ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
 424                                   ireg[rs2+remap3(irs2)];
 425            if (!int_vec[rd ].isvector) break;
 426         if (int_vec[rd ].isvector)  { id += 1; }
 427         if (int_vec[rs1].isvector)  { irs1 += 1; }
 428         if (int_vec[rs2].isvector)  { irs2 += 1; }
 429
 430 By changing remappings, 2D matrices may be transposed "in-place" for one
 431 operation, followed by setting a different permutation order without
 432 having to move the values in the registers to or from memory.
 433
 434 Note that:
 435
 436 * Over-running the register file clearly has to be detected and
 437   an illegal instruction exception thrown
 438 * When non-default elwidths are set, the exact same algorithm still
 439   applies (i.e. it offsets *polymorphic* elements *within* registers rather
 440   than entire registers).
 441 * If permute option 000 is utilised, the actual order of the
 442   reindexing does not change.  However, modulo MVL still occurs
 443   which will result in repeated operations (use with caution).
 444 * If two or more dimensions are set to zero, the actual order does not change!
 445 * The above algorithm is pseudo-code **only**.  Actual implementations
 446   will need to take into account the fact that the element for-looping
 447   must be **re-entrant**, due to the possibility of exceptions occurring.
 448   See SVSTATE SPR, which records the current element index.
 449   Continuing after return from an interrupt may introduce latency
 450   due to re-computation of the remapped offsets.
 451 * Twin-predicated operations require **two** separate and distinct
 452   element offsets.  The above pseudo-code algorithm will be applied
 453   separately and independently to each, should each of the two
 454   operands be remapped.  *This even includes unit-strided LD/ST*
 455   and other operations
 456   in that category, where in that case it will be the **offset** that is
 457   remapped.
 458 * Offset is especially useful, on its own, for accessing elements
 459   within the middle of a register.  Without offsets, it is necessary
 460   to either use a predicated MV, skipping the first elements, or
 461   performing a LOAD/STORE cycle to memory.
 462   With offsets, the data does not have to be moved.
 463 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 464   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 465   entries to be regularly presented to operands **more than once**, thus
 466   allowing the same underlying registers to act as an accumulator of
 467   multiple vector or matrix operations, for example.
 468 * Note especially that Program Order **must** still be respected
 469   even when overlaps occur that read or write the same register
 470   elements *including polymorphic ones*
 471
 472 Clearly here some considerable care needs to be taken as the remapping
 473 could hypothetically create arithmetic operations that target the
 474 exact same underlying registers, resulting in data corruption due to
 475 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 476 register-renaming will have an easier time dealing with this than
 477 DSP-style SIMD micro-architectures.
 478
 479 # 4x4 Matrix to vec4 Multiply Example
 480
 481 The following settings will allow a 4x4 matrix (starting at f8), expressed
 482 as a sequence of 16 numbers first by row then by column, to be multiplied
 483 by a vector of length 4 (starting at f0), using a single FMAC instruction.
 484
 485 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 486 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 487 * VL=16, f4=vec, f0=vec, f8=vec
 488 * FMAC f4, f0, f8, f4
 489
 490 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
 491 four iterations through the hardware loop, the REMAPed index will not
 492 increment. On the second four, the index will increase by one. Likewise
 493 on each subsequent group of four.
 494
 495 The permutation on SHAPE1 will increment f4 continuously cycling through
 496 f4-f7 every four iterations of the hardware loop.
 497
 498 At the same time, VL will, because there is no SHAPE on f8, increment
 499 straight sequentially through the 16 values f8-f23 in the Matrix. The
 500 equivalent sequence thus is issued:
 501
 502     fmac f4, f0, f8, f4
 503     fmac f5, f0, f9, f5
 504     fmac f6, f0, f10, f6
 505     fmac f7, f0, f11, f7
 506     fmac f4, f1, f12, f4
 507     fmac f5, f1, f13, f5
 508     fmac f6, f1, f14, f6
 509     fmac f7, f1, f15, f7
 510     fmac f4, f2, f16, f4
 511     fmac f5, f2, f17, f5
 512     fmac f6, f2, f18, f6
 513     fmac f7, f2, f19, f7
 514     fmac f4, f3, f20, f4
 515     fmac f5, f3, f21, f5
 516     fmac f6, f3, f22, f6
 517     fmac f7, f3, f23, f7
 518
 519 The only other instruction required is to ensure that f4-f7 are
 520 initialised (usually to zero).
 521
 522 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
 523 the same technique applied to four independent vectors, can be done by
 524 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
 525 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
 526 it to apply four times to compute the four columns worth of vectors.
 527
 528 # Warshall transitive closure algorithm
 529
 530 TODO move to [[sv/remap/discussion]] page, copied from here
 531 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
 532
 533 with thanks to Hendrik.
 534
 535 <https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm>
 536
 537 > Just a note:  interpreting + as 'or', and * as 'and',
 538 > operating on Boolean matrices,
 539 > and having result, X, and Y be the exact same matrix,
 540 > updated while being used,
 541 > gives the traditional Warshall transitive-closure
 542 > algorithm, if the loops are nested exactly in thie order.
 543
 544 this can be done with the ternary instruction which has
 545 an in-place triple boolean input:
 546
 547     RT = RT | (RA & RB)
 548
 549 and also has a CR Field variant of the same
 550
 551 notes from conversations:
 552
 553 > > for y in y_r:
 554 > >  for x in x_r:
 555 > >    for z in z_r:
 556 > >      result[y][x] +=
 557 > >         a[y][z] *
 558 > >         b[z][x]
 559
 560 > This nesting of loops works for matrix multiply, but not for transitive
 561 > closure.
 562
 563 > > it can be done:
 564 > >
 565 > >   for z in z_r:
 566 > >    for y in y_r:
 567 > >     for x in x_r:
 568 > >       result[y][x] +=
 569 > >          a[y][z] *
 570 > >          b[z][x]
 571 >
 572 > And this ordering of loops *does* work for transitive closure, when a,
 573 > b, and result are the very same matrix, updated while being used.
 574 >
 575 > By the way, I believe there is a graph algorithm that does the
 576 > transitive closure thing, but instead of using boolean, "and", and "or",
 577 > they use real numbers, addition, and minimum.  I think that one computes
 578 > shortest paths between vertices.
 579 >
 580 > By the time the z'th iteration of the z loop begins, the algorithm has
 581 > already peocessed paths that go through vertices numbered < z, and it
 582 > adds paths that go through vertices numbered z.
 583 >
 584 > For this to work, the outer loop has to be the one on teh subscript that
 585 > bridges a and b (which in this case are teh same matrix, of course).
 586
 587 # SUBVL Remap
 588
 589 Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
 590 sub-vectir elements were simply part of the main VL loop.  This is the
 591 *complete opposite* of predication which **only** applies to the whole
 592 vec2/3/4.  In pseudocode this would be:
 593
 594       for (i = 0; i < VL; i++)
 595         if (predval & 1<<i) # apply to VL not SUBVL
 596           for (j = 0; j < SUBVL; j++)
 597              id = i*SUBVL + j # not, "id=i".
 598              ireg[RT+remap1(id)] ...
 599
 600 The reason for allowing SUBVL Remaps is that some regular patterns using
 601 Swizzle which would otherwise require multiple explicit instructions
 602 with 12 bit swizzles encoded in them may be efficently encoded with Remap
 603 instead.  Not however that Swizzle is *still permitted to be applied*.
 604
 605 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
 606 stage:
 607
 608 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
 609
 610 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
 611 a 2D REMAP allows:
 612
 613 * the column bytes (as a vec4) to be iterated over as an inner loop,
 614   progressing vertically (`a00 a10 a20 a30`)
 615 * the columns themselves to be iterated as an outer loop
 616 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
 617
 618 This entirely in-place without special 128-bit opcodes.  Below is
 619 the pseudocode for [[!wikipedia Rijndael MixColumns]]
 620
 621 ```
 622 void gmix_column(unsigned char *r) {
 623     unsigned char a[4];
 624     unsigned char b[4];
 625     unsigned char c;
 626     unsigned char h;
 627     // no swizzle here but still SUBVL.Remap
 628     // can be done as vec4 byte-level
 629     // elwidth overrides though.
 630     for (c = 0; c < 4; c++) {
 631         a[c] = r[c];
 632         h = (unsigned char)((signed char)r[c] >> 7);
 633         b[c] = r[c] << 1;
 634         b[c] ^= 0x1B & h; /* Rijndael's Galois field */
 635     }
 636     // SUBVL.Remap still needed here
 637     // bytelevel elwidth overrides and vec4
 638     // These may then each be 4x 8bit bit Swizzled
 639     // r0.vec4 = b.vec4
 640     // r0.vec4 ^= a.vec4.WXYZ
 641     // r0.vec4 ^= a.vec4.ZWXY
 642     // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
 643     r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
 644     r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
 645     r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
 646     r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
 647 }
 648 ```
 649
 650 With the assumption made by the above code that the column bytes have
 651 already been turned around (vertical rather than horizontal) SUBVL.REMAP
 652 may transparently fill that role, in-place, without a complex byte-level
 653 mv operation.
 654
 655 The application of the swizzles allows the remapped vec4 a, b and r
 656 variables to perform four straight linear 32 bit XOR operations where a
 657 scalar processor would be required to perform 16 byte-level individual
 658 operations.  Given wide enough SIMD backends in hardware these 3 bit
 659 XORs may be done as single-cycle operations across the entire 128 bit
 660 Rijndael Matrix.
 661
 662 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
 663 Multiply using the MDS Matrix.
 664
 665 # TODO
 666
 667 * investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
 668 in https://bugs.libre-soc.org/show_bug.cgi?id=653
 669 * Triangular REMAP
 670 * Cross-Product REMAP (actually, skew Matrix: https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix)
 671