1 [[!tag standards]]
2
3 # REMAP <a name="remap" />
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=143>
6 * see [[sv/propagation]] for a future way to apply
7 REMAP.
8
9 REMAP is an advanced form of Vector "Structure Packing" that
10 provides hardware-level support for commonly-used *nested* loop patterns.
13 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
14 from a linear form to a 2D or 3D transposed form, or "offset" to permit
15 arbitrary access to elements, independently on each Vector src or dest
16 register.
17
The primary use is Matrix Multiplication and the in-place reordering of
sequential data. Four SPRs are provided so that a single FMAC may be
used in a single loop to perform a 4x4 by 4x4 Matrix multiplication,
generating 64 FMACs. Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.
23
24 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
25 Vector ISAs which would typically only have a limited set of instructions
26 that can be structure-packed (LD/ST typically), REMAP may be applied to
27 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
28
29 Note that REMAP does not apply to sub-vector elements: that is what
30 swizzle is for. Swizzle *can* however be applied to the same instruction
31 as REMAP.
32
In its general form, REMAP is quite expensive to set up, and on some
implementations introduces latency, so it should realistically be used
only where it is worthwhile.
36 Commonly-used patterns such as Matrix Multiply, DCT and FFT have
37 helper instruction options which make REMAP easier to use.
38
39 # Principle
40
41 * normal vector element read/write as operands would be sequential
42 (0 1 2 3 ....)
43 * this is not appropriate for (e.g.) Matrix multiply which requires
44 accessing elements in alternative sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  with this. Both are expensive (copying large vectors, spilling through memory)
47 * REMAP **redefines** the order of access according to set "Schedules".
* The Schedules are not necessarily restricted to power-of-two boundaries,
  making it unnecessary to have, for example, specialised 3x4 transpose
  instructions.
51
52 Only the most commonly-used algorithms in computer science have REMAP
53 support, due to the high cost in both the ISA and in hardware.
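
The Principle above can be sketched in a few lines of Python (a
hypothetical helper, not part of the specification), generating a
"transposed" Schedule, including for non-power-of-two shapes:

```python
# Sketch: generate a "transposed" access Schedule for a matrix
# stored row-major with xdim columns and ydim rows.  The element
# indices come out column-by-column instead of row-by-row, with
# no Indexed-MV copy or LD/ST spill through memory required.
def transpose_schedule(xdim, ydim):
    return [y * xdim + x for x in range(xdim) for y in range(ydim)]

# 3x3 reproduces the sequence quoted above: 0 3 6 1 4 7 2 5 8
print(transpose_schedule(3, 3))
# a non-power-of-two 3x4 shape needs no special instruction
print(transpose_schedule(4, 3))
```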
54
55 # REMAP SPR
56
| 0..1 | 2..3 | 4..5 | 6..7 | 8..9 | 10..14 | 15..23 |
| ---- | ---- | ---- | ---- | ---- | ------ | ------ |
| mi0  | mi1  | mi2  | mo0  | mo1  | SVme   | rsvd   |
60
mi0-2 and mo0-1 each select one of SVSHAPE0-3 to apply to a given register.
mi0-2 apply to the input registers RA, RB and RC respectively, and
likewise mo0-1 apply to the output registers (FRT and FRS respectively).
SVme is 5 bits and indicates whether each
SVSHAPE is actively applied or not.
66
67 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
68 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
69 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
70 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
71 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
72 (LD/ST-with-update has an implicit 2nd write register, RA)
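
The five enable bits can be decoded with a short sketch (Python; the
helper name is hypothetical):

```python
# Sketch: decode the SVme enable mask from the REMAP SPR.
# Bit 0 enables remapping of RA/FRA, bit 1 RB/FRB, bit 2 RC/FRC,
# bit 3 RT/FRT, bit 4 EA/FRS (the implicit LD/ST-update write).
OPERANDS = ["RA/FRA", "RB/FRB", "RC/FRC", "RT/FRT", "EA/FRS"]

def decode_svme(svme):
    return [name for bit, name in enumerate(OPERANDS)
            if (svme >> bit) & 1]

# enable remap on the RA source and the RT destination: bits 0 and 3
print(decode_svme(0b01001))
```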
73
74 There is also a corresponding SVRM-Form for the svremap
75 instruction which matches the above SPR:
76
| 0  | 6    | 11  | 13  | 15  | 17  | 19  | 21  | 22   | 26 | 31 |
| -- | ---- | --- | --- | --- | --- | --- | --- | ---- | -- | -- |
| PO | SVme | mi0 | mi1 | mi2 | mo0 | mo1 | pst | rsvd | XO | /  |
79
80 # SHAPE 1D/2D/3D vector-matrix remapping SPRs
81
82 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
83 which have the same format.
84
85 [[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
86
87 The algorithm below shows how REMAP works more clearly, and may be
88 executed as a python program:
89
90 ```
91 [[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
92 ```
93
94 An easier-to-read version (using python iterators) shows the loop nesting:
95
96 ```
97 [[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
98 ```
99
100 Each element index from the for-loop `0..VL-1`
101 is run through the above algorithm to work out the **actual** element
102 index, instead. Given that there are four possible SHAPE entries, up to
103 four separate registers in any given operation may be simultaneously
104 remapped:
105
```
function op_add(rd, rs1, rs2)  # add not VADD!
  int id=0, irs1=0, irs2=0;
  for (i = 0; i < VL; i++)
    xSTATE.srcoffs = i             # save context
    if (predval & 1<<i)            # predication uses intregs
       ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                              ireg[rs2+remap3(irs2)];
       if (!int_vec[rd ].isvector) break;
    if (int_vec[rd ].isvector)  { id   += 1; }
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }
```
118
119 By changing remappings, 2D matrices may be transposed "in-place" for one
120 operation, followed by setting a different permutation order without
121 having to move the values in the registers to or from memory.
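
This can be illustrated with a simplified yield-based iterator in the
spirit of the inlined remapyield.py (a sketch only: the actual file
differs), reading a 2D set of registers transposed, in-place:

```python
# Sketch: a 2D remap iterator.  permute_yx=True swaps the loop
# nesting so a row-major matrix is read column-by-column.
def iterate_indices(xdim, ydim, permute_yx=True):
    if permute_yx:
        for x in range(xdim):
            for y in range(ydim):
                yield y * xdim + x
    else:
        for y in range(ydim):
            for x in range(xdim):
                yield y * xdim + x

# a 2x4 matrix held row-major in eight "registers"
regs = [10, 11, 12, 13, 20, 21, 22, 23]
transposed = [regs[i] for i in iterate_indices(4, 2)]
print(transposed)  # columns read first: 10, 20, 11, 21, ...
```

No data moves: only the order in which the register indices are
generated changes.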
122
123 Note that:
124
125 * Over-running the register file clearly has to be detected and
126 an illegal instruction exception thrown
127 * When non-default elwidths are set, the exact same algorithm still
128 applies (i.e. it offsets *polymorphic* elements *within* registers rather
129 than entire registers).
130 * If permute option 000 is utilised, the actual order of the
131 reindexing does not change. However, modulo MVL still occurs
132 which will result in repeated operations (use with caution).
133 * If two or more dimensions are set to zero, the actual order does not change!
134 * The above algorithm is pseudo-code **only**. Actual implementations
135 will need to take into account the fact that the element for-looping
136 must be **re-entrant**, due to the possibility of exceptions occurring.
137 See SVSTATE SPR, which records the current element index.
138 Continuing after return from an interrupt may introduce latency
139 due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets. The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped. *This even includes unit-strided LD/ST*
  and other operations in that category, in which case it is the
  **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  within the middle of a register. Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or
  to perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
152 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
153 less than MVL is **perfectly legal**, albeit very obscure. It permits
154 entries to be regularly presented to operands **more than once**, thus
155 allowing the same underlying registers to act as an accumulator of
156 multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.
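
The modulo-wrap behaviour noted above (total remapped elements smaller
than VL) can be sketched as follows (hypothetical helper):

```python
# Sketch: when the total shape size (xdim * ydim) is less than VL,
# the remapped index wraps modulo that total, so the same elements
# are presented repeatedly -- usable, deliberately, to accumulate
# several vector operations into the same underlying registers.
def remap_mod(i, xdim, ydim):
    return i % (xdim * ydim)   # wrap: elements repeat past the end

VL = 8
indices = [remap_mod(i, 2, 2) for i in range(VL)]
print(indices)  # each of the four elements is presented twice
```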
160
161 Clearly here some considerable care needs to be taken as the remapping
162 could hypothetically create arithmetic operations that target the
163 exact same underlying registers, resulting in data corruption due to
164 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
165 register-renaming will have an easier time dealing with this than
166 DSP-style SIMD micro-architectures.
167
168 ## svstate instruction
169
170 Please note: this is **not** intended for production. It sets up
171 (overwrites) all required SVSHAPE SPRs and indicates that the
172 *next instruction* shall have those REMAP shapes applied to it,
173 assuming that instruction is of the form FRT,FRA,FRC,FRB.
174
175 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
176
| 0..5 | 6..10 | 11..15 | 16..20 | 21..24 | 25 | 26..30 | 31 | name    |
| ---- | ----- | ------ | ------ | ------ | -- | ------ | -- | ------- |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | /  | svstate |
180
181
182 Fields:
183
* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 5-bit XO field
189
190 # 4x4 Matrix to vec4 Multiply Example
191
192 The following settings will allow a 4x4 matrix (starting at f8), expressed
193 as a sequence of 16 numbers first by row then by column, to be multiplied
194 by a vector of length 4 (starting at f0), using a single FMAC instruction.
195
196 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
197 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
198 * VL=16, f4=vec, f0=vec, f8=vec
199 * FMAC f4, f0, f8, f4
200
201 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
202 four iterations through the hardware loop, the REMAPed index will not
203 increment. On the second four, the index will increase by one. Likewise
204 on each subsequent group of four.
205
206 The permutation on SHAPE1 will increment f4 continuously cycling through
207 f4-f7 every four iterations of the hardware loop.
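
The two index streams just described can be written out directly
(plain Python sketch):

```python
# Sketch: the per-iteration register offsets in the 4x4 example.
# SHAPE0 (permute=yx) holds the f0 offset for four iterations at a
# time; SHAPE1 (permute=xy, ydim=1) cycles the f4 offset 0..3.
VL = 16
shape0 = [i // 4 for i in range(VL)]  # f0 offset: 0 x4, 1 x4, ...
shape1 = [i % 4 for i in range(VL)]   # f4..f7, repeating
print(shape0)
print(shape1)
```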
208
At the same time, because there is no SHAPE on f8, the f8 operand will
step straight sequentially through the 16 values f8-f23 of the Matrix.
The equivalent sequence thus issued is:
212
```
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f0, f10, f6
fmac f7, f0, f11, f7
fmac f4, f1, f12, f4
fmac f5, f1, f13, f5
fmac f6, f1, f14, f6
fmac f7, f1, f15, f7
fmac f4, f2, f16, f4
fmac f5, f2, f17, f5
fmac f6, f2, f18, f6
fmac f7, f2, f19, f7
fmac f4, f3, f20, f4
fmac f5, f3, f21, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
```
229
230 The only other instruction required is to ensure that f4-f7 are
231 initialised (usually to zero).
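
The expansion can be cross-checked with a small simulation (a sketch:
the FP register file is modelled as a dict, and the column-major
placement of the matrix values in f8-f23 is an assumption made to match
the expanded sequence above):

```python
# Sketch: execute the 16-FMAC expansion on a modelled register file
# and check that f4-f7 end up holding matrix times vec4.
f = {i: 0.0 for i in range(32)}
vec = [1.0, 2.0, 3.0, 4.0]
mat = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
for i, v in enumerate(vec):
    f[0 + i] = v                    # vec4 in f0-f3
for c in range(4):                  # assumed layout: f[8+4c+r]
    for r in range(4):              # holds mat[r][c]
        f[8 + 4 * c + r] = float(mat[r][c])
for i in range(16):                 # the hardware loop, VL=16
    rt = 4 + (i % 4)                # SHAPE1: cycles f4-f7
    ra = 0 + (i // 4)               # SHAPE0: holds for four iterations
    rb = 8 + i                      # no SHAPE: sequential f8-f23
    f[rt] = f[ra] * f[rb] + f[rt]   # fmac rt, ra, rb, rt
result = [f[4 + r] for r in range(4)]
expected = [sum(mat[r][c] * vec[c] for c in range(4)) for r in range(4)]
print(result, expected)
```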
232
233 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
234 the same technique applied to four independent vectors, can be done by
235 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
236 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
237 it to apply four times to compute the four columns worth of vectors.
238
239 # Warshall transitive closure algorithm
240
241 TODO move to [[sv/remap/discussion]] page, copied from here
242 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
243
244 with thanks to Hendrik.
245
246 > Just a note: interpreting + as 'or', and * as 'and',
247 > operating on Boolean matrices,
248 > and having result, X, and Y be the exact same matrix,
249 > updated while being used,
250 > gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.
252
253 this can be done with the ternary instruction which has
254 an in-place triple boolean input:
255
256 RT = RT | (RA & RB)
257
258 and also has a CR Field variant of the same
259
260 notes from conversations:
261
> > for y in y_r:
> >   for x in x_r:
> >     for z in z_r:
> >       result[y][x] +=
> >         a[y][z] *
> >         b[z][x]
268
269 > This nesting of loops works for matrix multiply, but not for transitive
270 > closure.
271
> > it can be done:
> >
> > for z in z_r:
> >   for y in y_r:
> >     for x in x_r:
> >       result[y][x] +=
> >         a[y][z] *
> >         b[z][x]
280 >
281 > And this ordering of loops *does* work for transitive closure, when a,
282 > b, and result are the very same matrix, updated while being used.
283 >
284 > By the way, I believe there is a graph algorithm that does the
285 > transitive closure thing, but instead of using boolean, "and", and "or",
286 > they use real numbers, addition, and minimum.  I think that one computes
287 > shortest paths between vertices.
288 >
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).
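
The loop ordering described above can be demonstrated directly (Python
sketch, using booleans with `or` as + and `and` as *):

```python
# Sketch: Warshall transitive closure with the bridging index z as
# the outermost loop; result, a and b are the very same matrix,
# updated in place while being used.
def transitive_closure(m):
    n = len(m)
    for z in range(n):          # outer loop on the bridging subscript
        for y in range(n):
            for x in range(n):
                m[y][x] = m[y][x] or (m[y][z] and m[z][x])
    return m

# edges 0->1 and 1->2; closure must add 0->2
g = [[False, True,  False],
     [False, False, True ],
     [False, False, False]]
print(transitive_closure(g))
```

The inner update is exactly the in-place ternary form `RT = RT | (RA & RB)`
quoted above.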
295
296 # SUBVL Remap
297
Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop. This is the
*complete opposite* of predication, which **only** applies to the whole
vec2/3/4. In pseudocode this would be:
302
```
for (i = 0; i < VL; i++)
    if (predval & 1<<i)           # apply to VL not SUBVL
        for (j = 0; j < SUBVL; j++)
            id = i*SUBVL + j      # note: not "id=i"
            ireg[RT+remap1(id)] ...
```
308
The reason for allowing SUBVL Remaps is that some regular patterns using
Swizzle, which would otherwise require multiple explicit instructions
with 12-bit swizzles encoded in them, may be efficiently encoded with Remap
instead. Note however that Swizzle is *still permitted to be applied*.
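
The flattening of the sub-vector loop into individual element ids can be
sketched as follows (hypothetical helper):

```python
# Sketch: SUBVL Remap flattens the sub-vector loop into the main
# element loop, so every sub-element gets its own remappable id
# (id = i*SUBVL + j), unlike predication which sees only i.
def subvl_ids(VL, SUBVL):
    return [i * SUBVL + j for i in range(VL) for j in range(SUBVL)]

# two vec4 elements: eight individually-remappable ids
print(subvl_ids(2, 4))
```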
313
314 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
315 stage:
316
317 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
318
319 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
320 a 2D REMAP allows:
321
322 * the column bytes (as a vec4) to be iterated over as an inner loop,
323 progressing vertically (`a00 a10 a20 a30`)
324 * the columns themselves to be iterated as an outer loop
325 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
326
This is achieved entirely in-place, without special 128-bit opcodes.
Below is the pseudocode for [[!wikipedia Rijndael MixColumns]]:
329
330 ```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here,
    // byte-level elwidth overrides and vec4.
    // These may then each be 4x 8-bit Swizzled:
    //   r0.vec4 = b.vec4
    //   r0.vec4 ^= a.vec4.WXYZ
    //   r0.vec4 ^= a.vec4.ZWXY
    //   r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
357 ```
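
The C routine above can be cross-checked with a direct Python
transcription against the widely-published MixColumns test vector
`db 13 53 45 -> 8e 4d a1 bc` (a sketch, independent of any SV features):

```python
# Sketch: Python transcription of gmix_column() above.
def gmix_column(r):
    a = list(r)
    # multiply by 2 in GF(256), Rijndael polynomial 0x1b
    b = [((v << 1) ^ (0x1b if v & 0x80 else 0)) & 0xff for v in r]
    return [
        b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1],
        b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2],
        b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3],
        b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0],
    ]

print([hex(v) for v in gmix_column([0xdb, 0x13, 0x53, 0x45])])
```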
358
With the assumption made by the above code that the column bytes have
already been turned around (vertical rather than horizontal), SUBVL.REMAP
may transparently fill that role, in-place, without a complex byte-level
mv operation.

The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32-bit XOR operations where a
scalar processor would be required to perform 16 individual byte-level
operations. Given wide enough SIMD backends in hardware, these XORs
may be done as single-cycle operations across the entire 128-bit
Rijndael Matrix.
370
371 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
372 Multiply using the MDS Matrix.
373
374 # TODO
375
376 investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
377 in https://bugs.libre-soc.org/show_bug.cgi?id=653