openpower/sv/remap.mdwn

   1 [[!tag standards]]
   2
   3 # REMAP <a name="remap" />
   4
   5 REMAP allows the usual vector loop `0..VL-1` to be "reshaped" (re-mapped) from a linear
   6 form to a 2D or 3D transposed form, or "offset" to permit arbitrary
   7 access to elements, independently on each Vector src or dest register.
   8
   9 Their primary use is for Matrix Multiplication, reordering of sequential data in-place.  Four SPRs are provided so that a single FMAC may be used in a single loop to perform 4x4 times 4x4 Matrix multiplication, generating 64 FMACs.  Additional uses include regular "Structure Packing" such as RGB pixel data extraction and reforming.
  10
  11 REMAP, like all of SV, is abstracted out, meaning that unlike traditional Vector ISAs which would typically only have a limited set of instructions that can be structure-packed (LD/ST typically), REMAP may be applied to literally any instruction: CRs, Arithmetic, Logical, LD/ST, anything.
  12
  13 Note that REMAP does not apply to sub-vector elements: that is what swizzle is for.  Swizzle *can* however be applied to the same instruction as REMAP.
  14
  15 REMAP is quite expensive to set up, and on some implementations introduce latency, so should realistically be used only where it is worthwhile
  16
  17 # SHAPE 1D/2D/3D vector-matrix remapping SPRs
  18
  19 There are four "shape" SPRs, SHAPE0-3, 32-bits in each,
  20 which have the same format.
  21
  22 [[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]
  23
  24 The algorithm below shows how REMAP works more clearly, and may be
  25 executed as a python program:
  26
  27     xdim = 3 # changeme
  28     ydim = 4
  29     zdim = 1
  30
  31     lims = [xdim, ydim, zdim]
  32     idxs = [0,0,0] # starting indices
  33     order = [0,1,2] # experiment with different permutations, here
  34     offset = 2     # experiment with different offset, here
  35     VL = xdim * ydim * zdim # multiply (or add) to this to get "cycling"
  36     applydim = 0
  37     invxyz = [0,0,0]
  38
  39     # run for offset iterations before actually starting
  40     for idx in range(offset):
  41         for i in range(3):
  42             idxs[order[i]] = idxs[order[i]] + 1
  43             if (idxs[order[i]] != lims[order[i]]):
  44                 break
  45             idxs[order[i]] = 0
  46
  47     break_count = 0
  48
  49     for idx in range(VL):
  50         ix = [0] * 3
  51         for i in range(3):
  52             if i >= applydim:
  53                 ix[i] = idxs[i]
  54             if invxyz[i]:
  55                 ix[i] = lims[i] - 1 - ix[i]
  56         new_idx = ix[0] + ix[1] * xdim + ix[2] * xdim * ydim
  57         print new_idx,
  58         break_count += 1
  59         if break_count == lims[order[0]]:
  60             print
  61             break_count = 0
  62         for i in range(3):
  63             idxs[order[i]] = idxs[order[i]] + 1
  64             if (idxs[order[i]] != lims[order[i]]):
  65                 break
  66             idxs[order[i]] = 0
  67
  68
  69 Each element index from the for-loop `0..VL-1`
  70 is run through the above algorithm to work out the **actual** element
  71 index, instead.  Given that there are four possible SHAPE entries, up to
  72 four separate registers in any given operation may be simultaneously
  73 remapped:
  74
  75     function op_add(rd, rs1, rs2) # add not VADD!
  76       ...
  77       ...
  78       for (i = 0; i < VL; i++)
  79         xSTATE.srcoffs = i # save context
  80         if (predval & 1<<i) # predication uses intregs
  81            ireg[rd+remap1(id)] <= ireg[rs1+remap2(irs1)] +
  82                                   ireg[rs2+remap3(irs2)];
  83            if (!int_vec[rd ].isvector) break;
  84         if (int_vec[rd ].isvector)  { id += 1; }
  85         if (int_vec[rs1].isvector)  { irs1 += 1; }
  86         if (int_vec[rs2].isvector)  { irs2 += 1; }
  87
  88 By changing remappings, 2D matrices may be transposed "in-place" for one
  89 operation, followed by setting a different permutation order without
  90 having to move the values in the registers to or from memory.
  91
  92 Note that:
  93
  94 * Over-running the register file clearly has to be detected and
  95   an illegal instruction exception thrown
  96 * When non-default elwidths are set, the exact same algorithm still
  97   applies (i.e. it offsets *polymorphic* elements *within* registers rather
  98   than entire registers).
  99 * If permute option 000 is utilised, the actual order of the
 100   reindexing does not change.  However, modulo MVL still occurs
 101   which will result in repeated operations (use with caution).
 102 * If two or more dimensions are set to zero, the actual order does not change!
 103 * The above algorithm is pseudo-code **only**.  Actual implementations
 104   will need to take into account the fact that the element for-looping
 105   must be **re-entrant**, due to the possibility of exceptions occurring.
 106   See SVSTATE SPR, which records the current element index.
 107   Continuing after return from an interrupt may introduce latency
 108   due to re-computation of the remapped offsets.
 109 * Twin-predicated operations require **two** separate and distinct
 110   element offsets.  The above pseudo-code algorithm will be applied
 111   separately and independently to each, should each of the two
 112   operands be remapped.  *This even includes unit-strided LD/ST*
 113   and other operations
 114   in that category, where in that case it will be the **offset** that is
 115   remapped.
 116 * Offset is especially useful, on its own, for accessing elements
 117   within the middle of a register.  Without offsets, it is necessary
 118   to either use a predicated MV, skipping the first elements, or
 119   performing a LOAD/STORE cycle to memory.
 120   With offsets, the data does not have to be moved.
 121 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 122   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 123   entries to be regularly presented to operands **more than once**, thus
 124   allowing the same underlying registers to act as an accumulator of
 125   multiple vector or matrix operations, for example.
 126 * Note especially that Program Order **must** still be respected
 127   even when overlaps occur that read or write the same register
 128   elements *including polymorphic ones*
 129
 130 Clearly here some considerable care needs to be taken as the remapping
 131 could hypothetically create arithmetic operations that target the
 132 exact same underlying registers, resulting in data corruption due to
 133 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 134 register-renaming will have an easier time dealing with this than
 135 DSP-style SIMD micro-architectures.
 136
 137 # 4x4 Matrix to vec4 Multiply Example
 138
 139 The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
 140
 141 * SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
 142 * SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
 143 * VL=16, f4=vec, f0=vec, f8=vec
 144 * FMAC f4, f0, f8, f4
 145
 146 The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
 147
 148 The permutation on SHAPE1 will increment f4 continuously cycling through f4-f7 every four iterations of the hardware loop.
 149
 150 At the same time, VL will, because there is no SHAPE on f8, increment straight sequentially through the 16 values f8-f23 in the Matrix. The equivalent sequence thus is issued:
 151
 152     fmac f4, f0, f8, f4
 153     fmac f5, f0, f9, f5
 154     fmac f6, f0, f10, f6
 155     fmac f7, f0, f11, f7
 156     fmac f4, f1, f12, f4
 157     fmac f5, f1, f13, f5
 158     fmac f6, f1, f14, f6
 159     fmac f7, f1, f15, f7
 160     fmac f4, f2, f16, f4
 161     fmac f5, f2, f17, f5
 162     fmac f6, f2, f18, f6
 163     fmac f7, f2, f19, f7
 164     fmac f4, f3, f20, f4
 165     fmac f5, f3, f21, f5
 166     fmac f6, f3, f22, f6
 167     fmac f7, f3, f23, f7
 168
 169 The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).
 170
 171 It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs, and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get it to apply four times to compute the four columns worth of vectors.
 172
 173 # SUBVL Remap
 174
 175 Remapping even of SUBVL (vec2/3/4) elements is permitted, as if the sub-vectir elements were simply part of the main VL loop.  This is the *complete opposite* of predication which **only** applies to the whole vec2/3/4.  In pseudocode this would be:
 176
 177       for (i = 0; i < VL; i++)
 178         if (predval & 1<<i) # apply to VL not SUBVL
 179           for (j = 0; j < SUBVL; j++)
 180              id = i*SUBVL + j # not, "id=i".
 181              ireg[RT+remap1(id)] ...
 182
 183 The reason for allowing SUBVL Remaps is that some regular patterns using Swizzle which would otherwise require multiple explicit instructions with 12 bit swizzles encoded in them may be efficently encoded with Remap instead.  Not however that Swizzle is *still permitted to be applied*.
 184
 185 An example where SUBVL Remap is appropriate is the Rijndael MixColumns stage:
 186
 187 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/AES-MixColumns.svg/600px-AES-MixColumns.svg.png" width="400px" />
 188
 189 Assuming that the bytes are stored `a00 a01 a02 a03 a10 .. a33`
 190 a 2D REMAP allows:
 191
 192 * the column bytes (as a vec4) to be iterated over as an inner loop,
 193   progressing vertically (`a00 a10 a20 a30`)
 194 * the columns themselves to be iterated as an outer loop
 195 * a 32 bit `GF(256)` multiply on the vec4 to be performed.
 196
 197 This entirely in-place without special 128-bit opcodes.  Below is
 198 the pseudocode for [[!wikipedia Rijndael MixColumns]]
 199
 200 ```
 201 void gmix_column(unsigned char *r) {
 202     unsigned char a[4];
 203     unsigned char b[4];
 204     unsigned char c;
 205     unsigned char h;
 206     // no swizzle here but still SUBVL.Remap
 207     // can be done as vec4 byte-level
 208     // elwidth overrides though.
 209     for (c = 0; c < 4; c++) {
 210         a[c] = r[c];
 211         h = (unsigned char)((signed char)r[c] >> 7);
 212         b[c] = r[c] << 1;
 213         b[c] ^= 0x1B & h; /* Rijndael's Galois field */
 214     }
 215     // SUBVL.Remap still needed here
 216     // byyelevel elwidth overrides and vec4
 217     // These may then each be 4x 8bit bit Swizzled
 218     // r0.vec4 = b.vec4
 219     // r0.vec4 ^= a.vec4.WXYZ
 220     // r0.vec4 ^= a.vec4.ZWXY
 221     // r0.vec4 ^= b.vec4.YZWX ^ a.vec4.YZWX
 222     r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1];
 223     r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2];
 224     r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3];
 225     r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0];
 226 }
 227 ```
 228
 229 With the assumption made by the above code that the column bytes have already been turned around (vertical rather than horizontal) SUBVL.REMAP may transparently fill that role, in-place, without a complex byte-level mv operation.
 230
 231 The application of the swizzles allows the remapped vec4 a, b and r variables to perform four straight linear 32 bit XOR operations where a scalar processor would be required to perform 16 byte-level individual operations.  Given wide enough SIMD backends in hardware these 3 bit XORs may be done as single-cycle operations across the entire 128 bit Rijndael Matrix.