openpower/sv/remap.mdwn

   1 [[!tag standards]]
   2
   3 # REMAP <a name="remap" />
   4
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143>
   6
   7 see [[sv/propagation]] because it is currently the only way to apply
   8 REMAP.
   9
  10 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
  11 from a linear form to a 2D or 3D transposed form, or "offset" to permit
  12 arbitrary access to elements, independently on each Vector src or dest
  13 register.
  14
  15 Their primary use is for Matrix Multiplication, reordering of sequential
  16 data in-place.  Four SPRs are provided so that a single FMAC may be
  17 used in a single loop to perform 4x4 times 4x4 Matrix multiplication,
  18 generating 64 FMACs.  Additional uses include regular "Structure Packing"
  19 such as RGB pixel data extraction and reforming.
  20
  21 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
  22 Vector ISAs which would typically only have a limited set of instructions
  23 that can be structure-packed (LD/ST typically), REMAP may be applied to
  24 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
  25
  26 Note that REMAP does not apply to sub-vector elements: that is what
  27 swizzle is for.  Swizzle *can* however be applied to the same instruction
  28 as REMAP.
  29
  30 REMAP is quite expensive to set up, and on some implementations introduce
  31 latency, so should realistically be used only where it is worthwhile
  32
  33 # Principle
  34
  35 * normal vector element read/write as operands would be sequential
  36   (0 1 2 3 ....)
  37 * this is not appropriate for (e.g.) Matrix multiply which requires
  38   accessing elements in alternative sequences (0 3 6 1 4 7 ...)
  39 * normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  40   with this.  both are expensive (copy large vectors, spill through memory)
  41 * REMAP **redefines** the order of access according to set "Schedules"
  42
  43 Only the most commonly-used algorithms in computer science have REMAP
  44 support, due to the high cost in both the ISA and in hardware.
  45
  46 # REMAP SPR
  47
  48 | 0  | 2  | 4  | 6  | 8  | 10.14 | 15..23 |
  49 | -- | -- | -- | -- | -- | ----- | ------ |
  50 |mi0 |mi1 |mi2 |mo0 |mo1 | SVme  | rsv    |
  51
  52 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
  53 mi0-2 apply to RA, RB, RC respectively, as input registers, and
  54 likewise mo0-1 apply to output registers (FRT, FRS respectively).
  55 SVme is 5 bits, and indicates indicate whether the
  56 SVSHAPE is actively applied or not.
  57
  58 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
  59 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
  60 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
  61 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
  62 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
  63   (LD/ST-with-update has an implicit 2nd write register, RA)
  64
  65 There is also a corresponding SVRM-Form for the svremap
  66 instruction which matches the above SPR:
  67
  68     |0     |6     |11  |13   |15   |17   |19   |21    | 22    |26     |31 |
  69     | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd | XO    | / |
  70
  71 # SHAPE 1D/2D/3D vector-matrix remapping SPRs
  72
  73 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
  74 which have the same format.
  75
  76 [[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
  77
  78 The algorithm below shows how REMAP works more clearly, and may be
  79 executed as a python program:
  80
  81 ```
  82 [[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
  83 ```
  84
  85 An easier-to-read version (using python iterators) shows the loop nesting:
  86
  87 ```
  88 [[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
  89 ```
  90
  91 Each element index from the for-loop `0..VL-1`
  92 is run through the above algorithm to work out the **actual** element
  93 index, instead.  Given that there are four possible SHAPE entries, up to
  94 four separate registers in any given operation may be simultaneously
  95 remapped:
  96
  97     function op_add(rd, rs1, rs2) # add not VADD!
  98       ...
  99       ...
 100       for (i = 0; i < VL; i++)
 101         xSTATE.srcoffs = i # save context
 102         if (predval & 1<<i) # predication uses intregs
 103            ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
 104                                   ireg[rs2+remap3(irs2)];
 105            if (!int_vec[rd ].isvector) break;
 106         if (int_vec[rd ].isvector)  { id += 1; }
 107         if (int_vec[rs1].isvector)  { irs1 += 1; }
 108         if (int_vec[rs2].isvector)  { irs2 += 1; }
 109
 110 By changing remappings, 2D matrices may be transposed "in-place" for one
 111 operation, followed by setting a different permutation order without
 112 having to move the values in the registers to or from memory.
 113
 114 Note that:
 115
 116 * Over-running the register file clearly has to be detected and
 117   an illegal instruction exception thrown
 118 * When non-default elwidths are set, the exact same algorithm still
 119   applies (i.e. it offsets *polymorphic* elements *within* registers rather
 120   than entire registers).
 121 * If permute option 000 is utilised, the actual order of the
 122   reindexing does not change.  However, modulo MVL still occurs
 123   which will result in repeated operations (use with caution).
 124 * If two or more dimensions are set to zero, the actual order does not change!
 125 * The above algorithm is pseudo-code **only**.  Actual implementations
 126   will need to take into account the fact that the element for-looping
 127   must be **re-entrant**, due to the possibility of exceptions occurring.
 128   See SVSTATE SPR, which records the current element index.
 129   Continuing after return from an interrupt may introduce latency
 130   due to re-computation of the remapped offsets.
 131 * Twin-predicated operations require **two** separate and distinct
 132   element offsets.  The above pseudo-code algorithm will be applied
 133   separately and independently to each, should each of the two
 134   operands be remapped.  *This even includes unit-strided LD/ST*
 135   and other operations
 136   in that category, where in that case it will be the **offset** that is
 137   remapped.
 138 * Offset is especially useful, on its own, for accessing elements
 139   within the middle of a register.  Without offsets, it is necessary
 140   to either use a predicated MV, skipping the first elements, or
 141   performing a LOAD/STORE cycle to memory.
 142   With offsets, the data does not have to be moved.
 143 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 144   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 145   entries to be regularly presented to operands **more than once**, thus
 146   allowing the same underlying registers to act as an accumulator of
 147   multiple vector or matrix operations, for example.
 148 * Note especially that Program Order **must** still be respected
 149   even when overlaps occur that read or write the same register
 150   elements *including polymorphic ones*
 151
 152 Clearly here some considerable care needs to be taken as the remapping
 153 could hypothetically create arithmetic operations that target the
 154 exact same underlying registers, resulting in data corruption due to
 155 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 156 register-renaming will have an easier time dealing with this than
 157 DSP-style SIMD micro-architectures.
 158
 159 ## svstate instruction
 160
 161 Please note: this is **not** intended for production.  It sets up
 162 (overwrites) all required SVSHAPE SPRs and indicates that the
 163 *next instruction* shall have those REMAP shapes applied to it,
 164 assuming that instruction is of the form FRT,FRA,FRC,FRB.
 165
 166 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
 167
 168 | 0.5|6.10  |11.15  |16..20 | 21..25 | 25 | 26..30 |31|  name    |
 169 | -- | --   | ---   | ----- | ------ | -- | ------ |--| -------- |
 170 |OPCD| SVxd | SVyd  | SVzd  | SVRM   | vf | XO     |/ | svstate  |
 171
 172
 173 Fields:
 174
 175 * **SVxd** - SV REMAP "xdim"
 176 * **SVyd** - SV REMAP "ydim"
 177 * **SVMM** - SV REMAP Mode (0b00000 for Matrix, 0b00001 for FFT)
 178 * **vf** - sets "Vertical-First" mode
 179 * **XO** - standard 5-bit XO field
 180
 181 # 4x4 Matrix to vec4 Multiply Example
 182
 183 The following settings will allow a 4x4 matrix (starting at f8), expressed
 184 as a sequence of 16 numbers first by row then by column, to be multiplied
 185 by a vector of length 4 (starting at f0), using a single FMAC instruction.
 186
 187 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 188 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 189 * VL=16, f4=vec, f0=vec, f8=vec
 190 * FMAC f4, f0, f8, f4
 191
 192 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
 193 four iterations through the hardware loop, the REMAPed index will not
 194 increment. On the second four, the index will increase by one. Likewise
 195 on each subsequent group of four.
 196
 197 The permutation on SHAPE1 will increment f4 continuously cycling through
 198 f4-f7 every four iterations of the hardware loop.
 199
 200 At the same time, VL will, because there is no SHAPE on f8, increment
 201 straight sequentially through the 16 values f8-f23 in the Matrix. The
 202 equivalent sequence thus is issued:
 203
 204     fmac f4, f0, f8, f4
 205     fmac f5, f0, f9, f5
 206     fmac f6, f0, f10, f6
 207     fmac f7, f0, f11, f7
 208     fmac f4, f1, f12, f4
 209     fmac f5, f1, f13, f5
 210     fmac f6, f1, f14, f6
 211     fmac f7, f1, f15, f7
 212     fmac f4, f2, f16, f4
 213     fmac f5, f2, f17, f5
 214     fmac f6, f2, f18, f6
 215     fmac f7, f2, f19, f7
 216     fmac f4, f3, f20, f4
 217     fmac f5, f3, f21, f5
 218     fmac f6, f3, f22, f6
 219     fmac f7, f3, f23, f7
 220
 221 The only other instruction required is to ensure that f4-f7 are
 222 initialised (usually to zero).
 223
 224 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
 225 the same technique applied to four independent vectors, can be done by
 226 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
 227 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
 228 it to apply four times to compute the four columns worth of vectors.
 229
 230 # Warshall transitive closure algorithm
 231
 232 TODO move to [[sv/remap/discussion]] page, copied from here
 233 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
 234
 235 with thanks to Hendrik.
 236
 237 > Just a note:  interpreting + as 'or', and * as 'and',
 238 > operating on Boolean matrices,
 239 > and having result, X, and Y be the exact same matrix,
 240 > updated while being used,
 241 > gives the traditional Warshall transitive-closure
 242 > algorithm, if the loops are nested exactly in thie order.
 243
 244 this can be done with the ternary instruction which has
 245 an in-place triple boolean input:
 246
 247     RT = RT | (RA & RB)
 248
 249 and also has a CR Field variant of the same
 250
 251 notes from conversations:
 252
 253 > > for y in y_r:
 254 > >  for x in x_r:
 255 > >    for z in z_r:
 256 > >      result[y][x] +=
 257 > >         a[y][z] *
 258 > >         b[z][x]
 259
 260 > This nesting of loops works for matrix multiply, but not for transitive
 261 > closure.
 262
 263 > > it can be done:
 264 > >
 265 > >   for z in z_r:
 266 > >    for y in y_r:
 267 > >     for x in x_r:
 268 > >       result[y][x] +=
 269 > >          a[y][z] *
 270 > >          b[z][x]
 271 >
 272 > And this ordering of loops *does* work for transitive closure, when a,
 273 > b, and result are the very same matrix, updated while being used.
 274 >
 275 > By the way, I believe there is a graph algorithm that does the
 276 > transitive closure thing, but instead of using boolean, "and", and "or",
 277 > they use real numbers, addition, and minimum.  I think that one computes
 278 > shortest paths between vertices.
 279 >
 280 > By the time the z'th iteration of the z loop begins, the algorithm has
 281 > already peocessed paths that go through vertices numbered < z, and it
 282 > adds paths that go through vertices numbered z.
 283 >
 284 > For this to work, the outer loop has to be the one on teh subscript that
 285 > bridges a and b (which in this case are teh same matrix, of course).
 286
 287 # SUBVL Remap
 288
 289 Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
 290 sub-vectir elements were simply part of the main VL loop.  This is the
 291 *complete opposite* of predication which **only** applies to the whole
 292 vec2/3/4.  In pseudocode this would be:
 293
 294       for (i = 0; i < VL; i++)
 295         if (predval & 1<<i) # apply to VL not SUBVL
 296           for (j = 0; j < SUBVL; j++)
 297              id = i*SUBVL + j # not, "id=i".
 298              ireg[RT+remap1(id)] ...
 299
 300 The reason for allowing SUBVL Remaps is that some regular patterns using
 301 Swizzle which would otherwise require multiple explicit instructions
 302 with 12 bit swizzles encoded in them may be efficently encoded with Remap
 303 instead.  Not however that Swizzle is *still permitted to be applied*.
 304
 305 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
 306 stage:
 307
 308 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
 309
 310 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
 311 a 2D REMAP allows:
 312
 313 * the column bytes (as a vec4) to be iterated over as an inner loop,
 314   progressing vertically (`a00 a10 a20 a30`)
 315 * the columns themselves to be iterated as an outer loop
 316 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
 317
 318 This entirely in-place without special 128-bit opcodes.  Below is
 319 the pseudocode for [[!wikipedia Rijndael MixColumns]]
 320
 321 ```
 322 void gmix_column(unsigned char *r) {
 323     unsigned char a[4];
 324     unsigned char b[4];
 325     unsigned char c;
 326     unsigned char h;
 327     // no swizzle here but still SUBVL.Remap
 328     // can be done as vec4 byte-level
 329     // elwidth overrides though.
 330     for (c = 0; c < 4; c++) {
 331         a[c] = r[c];
 332         h = (unsigned char)((signed char)r[c] >> 7);
 333         b[c] = r[c] << 1;
 334         b[c] ^= 0x1B & h; /* Rijndael's Galois field */
 335     }
 336     // SUBVL.Remap still needed here
 337     // bytelevel elwidth overrides and vec4
 338     // These may then each be 4x 8bit bit Swizzled
 339     // r0.vec4 = b.vec4
 340     // r0.vec4 ^= a.vec4.WXYZ
 341     // r0.vec4 ^= a.vec4.ZWXY
 342     // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
 343     r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
 344     r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
 345     r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
 346     r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
 347 }
 348 ```
 349
 350 With the assumption made by the above code that the column bytes have
 351 already been turned around (vertical rather than horizontal) SUBVL.REMAP
 352 may transparently fill that role, in-place, without a complex byte-level
 353 mv operation.
 354
 355 The application of the swizzles allows the remapped vec4 a, b and r
 356 variables to perform four straight linear 32 bit XOR operations where a
 357 scalar processor would be required to perform 16 byte-level individual
 358 operations.  Given wide enough SIMD backends in hardware these 3 bit
 359 XORs may be done as single-cycle operations across the entire 128 bit
 360 Rijndael Matrix.
 361
 362 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
 363 Multiply using the MDS Matrix.
 364
 365 # TODO
 366
 367 investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
 368 in https://bugs.libre-soc.org/show_bug.cgi?id=653