openpower/sv/remap.mdwn

   1 [[!tag standards]]
   2
   3 # REMAP <a name="remap" />
   4
See [[sv/propagation]], because it is currently the only way to apply
REMAP.
   7
   8 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped)
   9 from a linear form to a 2D or 3D transposed form, or "offset" to permit
  10 arbitrary access to elements, independently on each Vector src or dest
  11 register.
  12
Its primary use is for Matrix Multiplication and for reordering sequential
data in-place.  Four SPRs are provided so that a single FMAC may be
used in a single loop to perform a 4x4 times 4x4 Matrix multiplication,
generating 64 FMACs.  Additional uses include regular "Structure Packing"
such as RGB pixel data extraction and reforming.
  18
  19 REMAP, like all of SV, is abstracted out, meaning that unlike traditional
  20 Vector ISAs which would typically only have a limited set of instructions
  21 that can be structure-packed (LD/ST typically), REMAP may be applied to
  22 literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
  23
  24 Note that REMAP does not apply to sub-vector elements: that is what
  25 swizzle is for.  Swizzle *can* however be applied to the same instruction
  26 as REMAP.
  27
REMAP is quite expensive to set up and, on some implementations, introduces
latency, so it should realistically be used only where it is worthwhile.
  30
  31 # Principle
  32
  33 * normal vector element read/write as operands would be sequential
  34   (0 1 2 3 ....)
  35 * this is not appropriate for (e.g.) Matrix multiply which requires
  36   accessing elements in alternative sequences (0 3 6 1 4 7 ...)
  37 * normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
  38   with this.  both are expensive (copy large vectors, spill through memory)
  39 * REMAP **redefines** the order of access according to set "Schedules"
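
As a rough illustration of the last point, the sketch below generates a
transposed access order for a 3x3 matrix held as nine consecutive
elements.  The helper name `transposed_order` and its parameters are
made up for this example and are not part of the REMAP specification.

```python
def transposed_order(xdim, ydim):
    """Yield element offsets that walk a row-major matrix (xdim columns,
    ydim rows) column by column.  Hypothetical helper, illustration only."""
    for x in range(xdim):          # outer loop: columns
        for y in range(ydim):      # inner loop: rows
            yield y * xdim + x     # row-major offset of (row y, column x)

linear = list(range(9))                  # normal order: 0 1 2 3 ...
remapped = list(transposed_order(3, 3))  # alternative order: 0 3 6 1 4 7 ...
print(linear)
print(remapped)
```

No data is copied: only the order in which the elements are visited
changes.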
  40
  41 Only the most commonly-used algorithms in computer science have REMAP
  42 support, due to the high cost in both the ISA and in hardware.
  43
  44 # REMAP SPR
  45
| 0  | 2  | 4  | 6  | 8  | 10..14 | 15..23 |
| -- | -- | -- | -- | -- | ------ | ------ |
|mi0 |mi1 |mi2 |mo0 |mo1 | SVme   | rsv    |
  49
mi0-2 and mo0-1 each select one of SVSHAPE0-3 to apply to a given register.
mi0-2 apply to RA, RB and RC respectively, as input registers, and
likewise mo0-1 apply to output registers (FRT and FRS respectively).
SVme is 5 bits wide, and each bit indicates whether the corresponding
SVSHAPE is actively applied or not.
  55
  56 * bit 0 of SVme indicates if mi0 is applied to RA / FRA
  57 * bit 1 of SVme indicates if mi1 is applied to RB / FRB
  58 * bit 2 of SVme indicates if mi2 is applied to RC / FRC
  59 * bit 3 of SVme indicates if mo0 is applied to RT / FRT
  60 * bit 4 of SVme indicates if mo1 is applied to Effective Address / FRS
  61   (LD/ST-with-update has an implicit 2nd write register, RA)
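
The SVme bit assignments listed above can be sketched as a small decoder.
This is only an illustration: it assumes SVme has already been extracted
into a plain integer, and it assumes that "bit 0" is the least significant
bit of that integer, which may not match the bit numbering convention of
the actual SPR.  The names `SVME_OPERANDS` and `active_remaps` are
hypothetical.

```python
# Operand names in SVme bit order, taken from the list above.
SVME_OPERANDS = ["RA/FRA (mi0)", "RB/FRB (mi1)", "RC/FRC (mi2)",
                 "RT/FRT (mo0)", "EA/FRS (mo1)"]

def active_remaps(svme):
    """Return the operands whose SVSHAPE selection is enabled.

    Assumption: bit 0 of SVme is the least significant bit of `svme`."""
    return [name for bit, name in enumerate(SVME_OPERANDS)
            if (svme >> bit) & 1]

# bits 0 and 3 set: remap RA/FRA and RT/FRT only
print(active_remaps(0b01001))
```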
  62
  63 There is also a corresponding SVRM-Form for the svremap
  64 instruction which matches the above SPR:
  65
  66     |0     |6     |11  |13   |15   |17   |19   |21    | 22    |26     |31 |
  67     | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd | XO    | / |
  68
  69 # SHAPE 1D/2D/3D vector-matrix remapping SPRs
  70
There are four "shape" SPRs, SHAPE0-3, each 32 bits wide,
all of which have the same format.
  73
  74 [[!inline raw="yes" pages="openpower/sv/shape_table_format" ]]
  75
  76 The algorithm below shows how REMAP works more clearly, and may be
  77 executed as a python program:
  78
  79 ```
  80 [[!inline quick="yes" raw="yes" pages="openpower/sv/remap.py" ]]
  81 ```
  82
  83 An easier-to-read version (using python iterators) shows the loop nesting:
  84
  85 ```
  86 [[!inline quick="yes" raw="yes" pages="openpower/sv/remapyield.py" ]]
  87 ```
  88
  89 Each element index from the for-loop `0..VL-1`
  90 is run through the above algorithm to work out the **actual** element
  91 index, instead.  Given that there are four possible SHAPE entries, up to
  92 four separate registers in any given operation may be simultaneously
  93 remapped:
  94
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           # each operand has its own, independent, remap
           ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
                                  ireg[rs2+remap3(irs2)];
           if (!int_vec[rd ].isvector) break;
        # only operands marked as vectors advance to the next element
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
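
A minimal runnable sketch of the same idea follows, written in Python.
It is deliberately simplified: every operand is treated as a vector,
only a 2D shape is modelled, and the `remap_2d` and `op_add` helpers
(and their parameters) are hypothetical stand-ins rather than the
algorithm in `remap.py`.

```python
def remap_2d(xdim, ydim, transpose=False):
    """Return a function mapping a linear loop index to an element offset.
    Hypothetical, simplified 2D-only Schedule."""
    def remap(i):
        i %= xdim * ydim               # wrap around the shape
        if not transpose:
            return i                   # linear order is unchanged
        x, y = divmod(i, ydim)         # walk down each column in turn
        return y * xdim + x            # row-major offset of (row y, col x)
    return remap

def op_add(regs, rd, rs1, rs2, vl, remap_d, remap_s1, remap_s2, predval=-1):
    """Element-wise add where each operand has its own remap function."""
    for i in range(vl):
        if predval & (1 << i):         # predval=-1 enables every element
            regs[rd + remap_d(i)] = (regs[rs1 + remap_s1(i)] +
                                     regs[rs2 + remap_s2(i)])

regs = [0] * 16
regs[4:8] = [1, 2, 3, 4]               # 2x2 matrix A, row-major
regs[8:12] = [10, 20, 30, 40]          # 2x2 matrix B, row-major
linear = remap_2d(2, 2)
transposed = remap_2d(2, 2, transpose=True)
# regs[0:4] = transpose(A) + B, without moving A
op_add(regs, 0, 4, 8, 4, linear, transposed, linear)
print(regs[0:4])
```

Only the source operand at `regs[4]` is remapped here, which is why the
first matrix is read in transposed order while the other two operands
are accessed sequentially.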
 107
 108 By changing remappings, 2D matrices may be transposed "in-place" for one
 109 operation, followed by setting a different permutation order without
 110 having to move the values in the registers to or from memory.
 111
 112 Note that:
 113
 114 * Over-running the register file clearly has to be detected and
 115   an illegal instruction exception thrown
 116 * When non-default elwidths are set, the exact same algorithm still
 117   applies (i.e. it offsets *polymorphic* elements *within* registers rather
 118   than entire registers).
 119 * If permute option 000 is utilised, the actual order of the
 120   reindexing does not change.  However, modulo MVL still occurs
 121   which will result in repeated operations (use with caution).
 122 * If two or more dimensions are set to zero, the actual order does not change!
 123 * The above algorithm is pseudo-code **only**.  Actual implementations
 124   will need to take into account the fact that the element for-looping
 125   must be **re-entrant**, due to the possibility of exceptions occurring.
 126   See SVSTATE SPR, which records the current element index.
 127   Continuing after return from an interrupt may introduce latency
 128   due to re-computation of the remapped offsets.
* Twin-predicated operations require **two** separate and distinct
  element offsets.  The above pseudo-code algorithm will be applied
  separately and independently to each, should each of the two
  operands be remapped.  *This even includes unit-strided LD/ST*
  and other operations in that category, in which case it is
  the **offset** that is remapped.
* Offset is especially useful, on its own, for accessing elements
  in the middle of a register.  Without offsets, it is necessary
  either to use a predicated MV, skipping the first elements, or
  to perform a LOAD/STORE cycle to memory.
  With offsets, the data does not have to be moved.
 141 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 142   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 143   entries to be regularly presented to operands **more than once**, thus
 144   allowing the same underlying registers to act as an accumulator of
 145   multiple vector or matrix operations, for example.
* Note especially that Program Order **must** still be respected
  even when overlaps occur that read or write the same register
  elements, *including polymorphic ones*.
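
The point about shapes that cover fewer elements than the loop length can
be sketched as follows.  This is an assumption-laden illustration: the
`wrapped_accumulate` helper is hypothetical, and its `nelems` argument is
the plain element count of a 1D shape rather than the encoded SPR field.

```python
def wrapped_accumulate(acc, src, vl, nelems):
    """Add src[0..vl-1] into acc, with the destination index wrapping
    modulo nelems so that each accumulator element is revisited."""
    for i in range(vl):                # strict Program Order
        acc[i % nelems] += src[i]      # same destination reused every nelems steps
    return acc

# Two vec4s (0..3 and 4..7) summed into one vec4 accumulator.
acc = wrapped_accumulate([0] * 4, list(range(8)), vl=8, nelems=4)
print(acc)
```

Because the same destination elements are written more than once, the
result is only correct if the element operations take effect in order,
which is why Program Order has to be respected.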
 149
 150 Clearly here some considerable care needs to be taken as the remapping
 151 could hypothetically create arithmetic operations that target the
 152 exact same underlying registers, resulting in data corruption due to
 153 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 154 register-renaming will have an easier time dealing with this than
 155 DSP-style SIMD micro-architectures.
 156
 157 ## svstate instruction
 158
 159 Please note: this is **not** intended for production.  It sets up
 160 (overwrites) all required SVSHAPE SPRs and indicates that the
 161 *next instruction* shall have those REMAP shapes applied to it,
 162 assuming that instruction is of the form FRT,FRA,FRC,FRB.
 163
 164 Form: SVM-Form SV "Matrix" Form (see [[isatables/fields.text]])
 165
| 0..5 | 6..10 | 11..15 | 16..20 | 21..24 | 25 | 26..30 | 31 | name     |
| ---- | ----- | ------ | ------ | ------ | -- | ------ | -- | -------- |
| OPCD | SVxd  | SVyd   | SVzd   | SVRM   | vf | XO     | /  | svstate  |


Fields:

* **SVxd** - SV REMAP "xdim"
* **SVyd** - SV REMAP "ydim"
* **SVzd** - SV REMAP "zdim"
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT)
* **vf** - sets "Vertical-First" mode
* **XO** - standard 5-bit XO field
 178
 179 # 4x4 Matrix to vec4 Multiply Example
 180
 181 The following settings will allow a 4x4 matrix (starting at f8), expressed
 182 as a sequence of 16 numbers first by row then by column, to be multiplied
 183 by a vector of length 4 (starting at f0), using a single FMAC instruction.
 184
 185 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 186 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 187 * VL=16, f4=vec, f0=vec, f8=vec
 188 * FMAC f4, f0, f8, f4
 189
 190 The permutation on SHAPE0 will use f0 as a vec4 source. On the first
 191 four iterations through the hardware loop, the REMAPed index will not
 192 increment. On the second four, the index will increase by one. Likewise
 193 on each subsequent group of four.
 194
The permutation on SHAPE1 will increment f4 continuously, cycling through
f4-f7 every four iterations of the hardware loop.

At the same time, because there is no SHAPE on f8, the hardware loop will
increment straight sequentially through the 16 values f8-f23 in the Matrix.
The following equivalent sequence is thus issued:
 201
    fmac f4, f0, f8, f4
    fmac f5, f0, f9, f5
    fmac f6, f0, f10, f6
    fmac f7, f0, f11, f7
    fmac f4, f1, f12, f4
    fmac f5, f1, f13, f5
    fmac f6, f1, f14, f6
    fmac f7, f1, f15, f7
    fmac f4, f2, f16, f4
    fmac f5, f2, f17, f5
    fmac f6, f2, f18, f6
    fmac f7, f2, f19, f7
    fmac f4, f3, f20, f4
    fmac f5, f3, f21, f5
    fmac f6, f3, f22, f6
    fmac f7, f3, f23, f7
 218
 219 The only other instruction required is to ensure that f4-f7 are
 220 initialised (usually to zero).
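
The sequence above can be checked with a short simulation.  The sketch
below models the FP register file as a Python list and derives the three
register numbers from the loop index in the way the text describes; the
index expressions are an interpretation of that description, not output
from the reference `remap.py`.

```python
VL = 16
f = [0.0] * 32                            # stand-in for the FP register file
f[0:4] = [1.0, 2.0, 3.0, 4.0]             # vec4 source in f0-f3
f[8:24] = [float(n) for n in range(16)]   # 4x4 matrix in f8-f23
# f4-f7 accumulate the result and start at zero

issued = []
for i in range(VL):
    ra = 0 + i // 4     # SHAPE0 on f0: advances once every four iterations
    rt = 4 + i % 4      # SHAPE1 on f4: cycles through f4-f7
    rb = 8 + i          # no SHAPE on f8: plain sequential order
    f[rt] = f[ra] * f[rb] + f[rt]         # fmac rt, ra, rb, rt
    issued.append((rt, ra, rb))

# Reference: result[j] = sum over k of vec[k] * matrix[4*k + j]
vec, mat = f[0:4], f[8:24]
expected = [sum(vec[k] * mat[4 * k + j] for k in range(4)) for j in range(4)]
print(f[4:8], expected)
```

The `issued` list holds the (destination, vector, matrix) register
numbers of each step and matches the sixteen `fmac` lines listed above.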
 221
 222 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
 223 the same technique applied to four independent vectors, can be done by
 224 setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
 225 and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
 226 it to apply four times to compute the four columns worth of vectors.
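
A sketch of the 64-step version is given below.  It works on plain Python
lists rather than registers, and the way the loop index is split into
`n`, `k` and `j` is one plausible Schedule chosen for illustration; the
function names are hypothetical.

```python
def matmul_flat(a, b):
    """Multiply two row-major 4x4 matrices using one flat 64-step loop."""
    result = [0.0] * 16
    for i in range(64):            # VL=64
        n = i // 16                # which of the four vectors (row of a)
        k = (i % 16) // 4          # element within that vector
        j = i % 4                  # element within the result vector
        m = i % 16                 # rotating 1D index into b (xdim=16)
        result[4 * n + j] += a[4 * n + k] * b[m]   # one FMAC per step
    return result

def matmul_reference(a, b):
    """Conventional triple-nested row-major 4x4 matrix multiply."""
    result = [0.0] * 16
    for y in range(4):
        for x in range(4):
            for z in range(4):
                result[4 * y + x] += a[4 * y + z] * b[4 * z + x]
    return result

a = [float(n) for n in range(16)]
b = [float((3 * n) % 7) for n in range(16)]
print(matmul_flat(a, b) == matmul_reference(a, b))
```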
 227
 228 # Warshall transitive closure algorithm
 229
 230 TODO move to [[sv/remap/discussion]] page, copied from here
 231 http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-July/003286.html
 232
 233 with thanks to Hendrik.
 234
> Just a note:  interpreting + as 'or', and * as 'and',
> operating on Boolean matrices,
> and having result, X, and Y be the exact same matrix,
> updated while being used,
> gives the traditional Warshall transitive-closure
> algorithm, if the loops are nested exactly in this order.
 241
This can be done with the ternary instruction, which has
an in-place triple boolean input:

    RT = RT | (RA & RB)

and which also has a CR Field variant.
 248
 249 notes from conversations:
 250
 251 > > for y in y_r:
 252 > >  for x in x_r:
 253 > >    for z in z_r:
 254 > >      result[y][x] +=
 255 > >         a[y][z] *
 256 > >         b[z][x]
 257
 258 > This nesting of loops works for matrix multiply, but not for transitive
 259 > closure.
 260
 261 > > it can be done:
 262 > >
 263 > >   for z in z_r:
 264 > >    for y in y_r:
 265 > >     for x in x_r:
 266 > >       result[y][x] +=
 267 > >          a[y][z] *
 268 > >          b[z][x]
 269 >
 270 > And this ordering of loops *does* work for transitive closure, when a,
 271 > b, and result are the very same matrix, updated while being used.
 272 >
 273 > By the way, I believe there is a graph algorithm that does the
 274 > transitive closure thing, but instead of using boolean, "and", and "or",
 275 > they use real numbers, addition, and minimum.  I think that one computes
 276 > shortest paths between vertices.
 277 >
> By the time the z'th iteration of the z loop begins, the algorithm has
> already processed paths that go through vertices numbered < z, and it
> adds paths that go through vertices numbered z.
>
> For this to work, the outer loop has to be the one on the subscript that
> bridges a and b (which in this case are the same matrix, of course).
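
The effect of the loop ordering discussed above can be demonstrated with a
small sketch.  The `closure` function is a hypothetical helper that
updates a Boolean adjacency matrix in place using
`RT = RT | (RA & RB)`, with the loop nesting selectable; the four-vertex
graph is simply an example chosen to expose the difference.

```python
from itertools import product

def closure(m, z_outermost=True):
    """In-place style Boolean update m[y][x] |= m[y][z] & m[z][x].

    z_outermost=True nests the loops z, y, x (Warshall's ordering);
    z_outermost=False nests them y, x, z (the matrix multiply ordering)."""
    n = len(m)
    m = [row[:] for row in m]                  # work on a copy
    if z_outermost:
        order = ((y, x, z) for z, y, x in product(range(n), repeat=3))
    else:
        order = product(range(n), repeat=3)    # yields (y, x, z)
    for y, x, z in order:
        m[y][x] |= m[y][z] & m[z][x]
    return m

# Edges: 0->2, 2->3, 3->1, so vertex 1 is reachable from vertex 0.
edges = [[0, 0, 1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 1, 0, 0]]
print(closure(edges, z_outermost=True)[0])    # row 0 includes vertex 1
print(closure(edges, z_outermost=False)[0])   # row 0 misses vertex 1
```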
 284
 285 # SUBVL Remap
 286
Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the
sub-vector elements were simply part of the main VL loop.  This is the
*complete opposite* of predication which **only** applies to the whole
vec2/3/4.  In pseudocode this would be:
 291
      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # apply to VL not SUBVL
          for (j = 0; j < SUBVL; j++)
             id = i*SUBVL + j # not, "id=i".
             ireg[RT+remap1(id)] ...
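
The pseudocode above can be made concrete with a short sketch that lists
which element offsets are touched.  The `subvl_indices` helper and the
example `remap` function are hypothetical; the point is only that
predication is tested per `i` while the remap sees the flattened index
`i*SUBVL + j`.

```python
def subvl_indices(vl, subvl, predval, remap=lambda idx: idx):
    """Return the remapped element offsets visited by a SUBVL loop."""
    visited = []
    for i in range(vl):
        if predval & (1 << i):             # predicate covers a whole sub-vector
            for j in range(subvl):
                idx = i * subvl + j        # flattened index, not just i
                visited.append(remap(idx))
    return visited

# Three vec2s with the middle one masked out, no remap.
print(subvl_indices(3, 2, 0b101))
# The same loop with an example remap that transposes the 3x2 layout.
print(subvl_indices(3, 2, 0b101, remap=lambda idx: (idx % 2) * 3 + idx // 2))
```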
 297
The reason for allowing SUBVL Remaps is that some regular patterns using
Swizzle which would otherwise require multiple explicit instructions
with 12 bit swizzles encoded in them may be efficiently encoded with Remap
instead.  Note however that Swizzle is *still permitted to be applied*.
 302
 303 An example where SUBVL Remap is appropriate is the Rijndael MixColumns
 304 stage:
 305
 306 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
 307
Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`,
a 2D REMAP allows:
 310
 311 * the column bytes (as a vec4) to be iterated over as an inner loop,
 312   progressing vertically (`a00 a10 a20 a30`)
 313 * the columns themselves to be iterated as an outer loop
 314 * a 32 bit `GF(256)` Matrix Multiply on the vec4 to be performed.
 315
This is done entirely in-place, without special 128-bit opcodes.  Below is
the pseudocode for [[!wikipedia Rijndael MixColumns]]:
 318
```
void gmix_column(unsigned char *r) {
    unsigned char a[4];
    unsigned char b[4];
    unsigned char c;
    unsigned char h;
    // no swizzle here but still SUBVL.Remap
    // can be done as vec4 byte-level
    // elwidth overrides though.
    for (c = 0; c < 4; c++) {
        a[c] = r[c];
        // h is 0xff if the top bit of r[c] is set, otherwise 0
        h = (unsigned char)((signed char)r[c] >> 7);
        b[c] = r[c] << 1;
        b[c] ^= 0x1B & h; /* Rijndael's Galois field */
    }
    // SUBVL.Remap still needed here
    // bytelevel elwidth overrides and vec4
    // These may then each be 4x 8-bit Swizzled
    // r0.vec4 = b.vec4
    // r0.vec4 ^= a.vec4.WXYZ
    // r0.vec4 ^= a.vec4.ZWXY
    // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
}
```
 347
The above code assumes that the column bytes have already been turned
around (vertical rather than horizontal).  SUBVL.REMAP may transparently
fill that role, in-place, without a complex byte-level mv operation.
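
To make the column access concrete, the sketch below applies a Python
port of `gmix_column` to each column of a row-major 16-byte state, using
remapped offsets instead of first copying the columns out.  The
`column_offsets` and `mix_columns` helpers are hypothetical and only
model the index pattern, not the hardware; the input columns are the
commonly published MixColumns test vectors.

```python
def gmix_column(col):
    """Python port of the C gmix_column above, returning a new column."""
    a = list(col)
    # b[c] is a[c] multiplied by 2 in Rijndael's Galois field
    b = [((v << 1) & 0xFF) ^ (0x1B if v & 0x80 else 0) for v in a]
    return [b[i] ^ a[(i + 3) % 4] ^ a[(i + 2) % 4] ^ b[(i + 1) % 4] ^ a[(i + 1) % 4]
            for i in range(4)]

def column_offsets(c):
    """Offsets of column c in a row-major 4x4 byte state (2D remap)."""
    return [4 * row + c for row in range(4)]

def mix_columns(state):
    """Apply gmix_column to every column, in place, via remapped offsets."""
    for c in range(4):                       # outer loop: columns
        offs = column_offsets(c)             # inner loop: a0c a1c a2c a3c
        mixed = gmix_column([state[o] for o in offs])
        for o, v in zip(offs, mixed):
            state[o] = v
    return state

state = [0xdb, 0xf2, 0x01, 0xc6,
         0x13, 0x0a, 0x01, 0xc6,
         0x53, 0x22, 0x01, 0xc6,
         0x45, 0x5c, 0x01, 0xc6]
mix_columns(state)
print([hex(state[o]) for o in column_offsets(0)])
```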
 352
The application of the swizzles allows the remapped vec4 a, b and r
variables to perform four straight linear 32 bit XOR operations where a
scalar processor would be required to perform 16 byte-level individual
operations.  Given wide enough SIMD backends in hardware these 32 bit
XORs may be done as single-cycle operations across the entire 128 bit
Rijndael Matrix.
 359
 360 The other alternative is to simply perform the actual 4x4 GF(256) Matrix
 361 Multiply using the MDS Matrix.
 362
 363 # TODO
 364
 365 investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429
 366 in https://bugs.libre-soc.org/show_bug.cgi?id=653