openpower/sv/cookbook/remap_matrix.mdwn

   1 # SVP64 REMAP Worked Example: Matrix Multiply
   2
   3 Links
   4
   5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
   6 * [[sv/remap]]
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
   8 * video <https://m.youtube.com/watch?v=BbhOA8ULKt4> slides
   9   <https://ftp.libre-soc.org/openpower_2021.pdf>
  10 * setvl instruction: [[sv/setvl]]
  11 * REMAP python function based on yield (shows how indices are generated):
  12 [[sv/remapyield.py]]
  13
  14 TODO: Include screenshots
  15
  16 One of the most powerful and versatile modes of the REMAP engine (a part of
  17 the SVP64 feature set) is the ability to perform matrix
  18 multiplication with all elements within a scalar register file.
  19
  20 This is done by converting the index used to iterate over the operand and
  21 result matrices to the actual index inside the scalar register file.
  22
  23 ## Worked example - manual (conventional method)
  24
  25 The matrix multiply looks like this:
  26
  27 ```
  28     mat_X * mat_Y = mat_Z
  29 ```
  30
  31 When multiplying non-square matrices (rows != columns), to determine the
  32 dimension of the result when matrix X has `a` rows and `b` columns and matrix
  33 Y has `b` rows and `c` columns:
  34
  35 ```
  36     X_axb * Y_bxc = Z_axc
  37 ```
  38
  39 The result matrix will have number of rows of the first matrix, and number of
  40 columns of the second matrix.
  41
  42
  43 For this example, the following values will be used for the operand matrices X
  44 and Y, result Z shown for completeness.
  45
  46     X =| 1 2 3 |  Y =  |  6  7 |  Z = |  52  58 |
  47        | 3 4 5 |       |  8  9 |      | 100 112 |
  48                        | 10 11 |
  49
  50 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
  51
  52 To determine the final dimensions of the resultant matrix Z, take the number
  53 of rows from matrix X (2) and number of columns from matrix Y (2).
  54
  55 The method usually taught in linear algebra course to students is the
  56 following (outer product):
  57
  58 1. Start with the first row of the first matrix, and first column of the
  59 second matrix.
  60 2. Multiply each element in the row by each element in the column, and sum
  61 with the current value of the result matrix element (multiply-add-accumulate).
  62 Store result in the first row, first column entry.
  63 3. Move to the next column of the second matrix, and next column of the result
  64 matrix. If there are no more columns in the second matrix, go back to first
  65 column (second matrix), and move to next row (first and result matrices).
  66 If there are no more rows left, result matrix has been fully computed.
  67 4. Repeat step 2.
  68
  69 This for-loop uses the indices as shown above
  70
  71 ```
  72     for i in range(0, mat_X_num_rows):
  73         for k in range(0, mat_Y_num_cols):
  74             for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
  75                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
  76 ```
  77
  78 Calculations:
  79
  80 ```
  81     | 1 2 3 |   |  6  7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
  82     | 3 4 5 | * |  8  9 |   | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
  83                 | 10 11 |
  84
  85     | 1 2 3 |   |  6  7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
  86     | 3 4 5 | * |  8  9 |   | (18 + 32 + 50) (21 + 36 + 55) |
  87                 | 10 11 |
  88
  89     | 1 2 3 |   |  6  7 | = |  52  58 |
  90     | 3 4 5 | * |  8  9 |   | 100 112 |
  91                 | 10 11 |
  92 ```
  93
  94 For the algorithm, assign indeces to matrices as follows:
  95
  96 ```
  97     Index | 0 1 2 3 4 5 |
  98     Mat X | 1 2 3 3 4 5 |
  99
 100     Index | 0 1 2 3  4  5 |
 101     Mat Y | 6 7 8 9 10 11 |
 102
 103     Index |  0  1   2   3 |
 104     Mat Z | 52 58 100 112 |
 105 ```
 106
 107 (Start with the first row, then assign index left-to-right, top-to-bottom.)
 108
 109 Index list:
 110
 111 ```
 112     Mat X | Mat Y | Mat Z
 113       0   |   0   |   0
 114       1   |   2   |   0
 115       2   |   4   |   0
 116       0   |   1   |   1
 117       1   |   3   |   1
 118       2   |   5   |   1
 119       3   |   0   |   2
 120       4   |   2   |   2
 121       5   |   4   |   2
 122       3   |   1   |   3
 123       4   |   3   |   3
 124       5   |   5   |   3
 125 ```
 126
 127
 128 The issue with this algorithm is that the result matrix element is the same
 129 for three consecutive operations, and where each element is stored in CPU
 130 registers, the same register will be written to three times and thus causing
 131 consistent stalling.
 132
 133 ## Inner Product
 134
 135 A slight modification to the order of the loops in the algorithm massively
 136 reduces the chance of read-after-write hazards, as the result matrix
 137 element (and thus register) changes with every multiply-add operation.
 138
 139 The code:
 140
 141 ```
 142     for i in range(mat_X_num_rows):
 143         for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
 144             for k in range(0, mat_Y_num_cols):
 145                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
 146 ```
 147
 148 Index list:
 149
 150 ```
 151     Mat X | Mat Y | Mat Z
 152       0   |   0   |   0
 153       0   |   1   |   1
 154       3   |   0   |   2
 155       3   |   1   |   3
 156       1   |   2   |   0
 157       1   |   3   |   1
 158       4   |   2   |   2
 159       4   |   3   |   3
 160       2   |   4   |   0
 161       2   |   5   |   1
 162       5   |   4   |   2
 163       5   |   5   |   3
 164 ```
 165
 166 The index for the result matrix changes with every operation, and thus the
 167 consecutive multiply-add instruction doesn't depend on the previous write
 168 register.
 169
 170 # SVP64 instructions implementing matrix multiply
 171
 172 * SVP64 assembler example:
 173 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
 174 * svremap and svshape instructions defined in:
 175 [[sv/rfc/ls009]
 176 * Multiple-Add Low Doubleword instruction pseudo-code (Power ISA 3.0C
 177 Book I, section 3.3.9): [[openpower/isa/fixedarith]]
 178
 179 *(Need to check if first arg of svremap correct, then one shown works with
 180 ISACaller)*
 181
 182 ```
 183     svshape 2, 2, 3, 0, 0
 184     svremap 31, 1, 2, 3, 0, 0, 0
 185     sv.maddld *0, *16, *32, *0
 186 ```
 187
 188 ## svshape
 189
 190 The `svshape` instruction is a convenient way to access the `SVSHAPE` Special
 191 Purpose Registers (SPRs), which were added alongside the SVP64 looping
 192 system for complex element indexing. Without having "Re-shaping" SPRs, only the most
 193 basic, consecuting indexing of register elements (0,1,2,3...) would
 194 be possible.
 195
 196 ### SVSHAPE Remapping SPRs
 197
 198 * See [[sv/remap]] for the full break down of SPRs `SVSHAPE0-3`.
 199
 200 For Matrix Multiply, SHAPE0 SPR is used:
 201
 202 |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31|
 203 |----- |----- | ------- | ------- | ------  |------|------ |---- |
 204 |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |
 205
 206
 207 skip:
 208
 209 - 0b00 indicates no dimensions to be skipped
 210 - 0b01 - skip '1st dim'
 211 - 0b10 - skip '2nd dim'
 212 - 0b11 - skip '3rd dim'
 213
 214 invxyz (3-bit; 1 for x, 1 for y, 1 for z):
 215
 216 - If corresponding dim bit is zero, start index from zero and increment
 217 - If bit set, start from xdimsz-1 (x dimension size, or whichever dimension
 218 bit is being looked at) and decrement down to zero.
 219
 220 offset is used to offset the result by `offset` elements (important for when
 221 using element width overrides are used).
 222
 223 xdimsz, ydimsz, zdimsz are offset by 1, such that 0-0b111111 correspond to
 224 1-64. A value of xdimsz=2 would indicate that in the first dimension there are
 225 3 elements in the array.
 226
 227 With the example Matrix X (2 rows, 3 columns, or 2x3 matrix), xdimsz=1,
 228 ydimsz=2, zdimsz=0.
 229
 230 permute setting:
 231 | permute | order | array format             |
 232 | ------- | ----- | ------------------------ |
 233 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 234 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 235 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 236 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 237 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 238 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 239 | 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
 240 | 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 241
 242 Permute re-arranges the order of the nested for-loops used to iterate over the
 243 three dimensions. This allows for in-place transpose, in-place rotate, matrix
 244 multiply, convolutions, without the limitation of Power-of-Two matrices.
 245
 246 Limitations of Matrix REMAP are currently:
 247
 248 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
 249 (MAC), or other operations may be performed in total.
 250 For matrix multiply, it means both operand matrices and result matrix can
 251 have no more than 127 elements in total.
 252 (Larger matrices can be split into tiles to circumvent this issue, out
 253 of scope of this document).
 254 - `svshape` instruction only provides part of the Matrix REMAP capability.
 255 For rotation and mirroring, `SVSHAPE` SPRs must be programmed directly (thus
 256 requiring more assembler instructions). Future revisions of SVP64 will
 257 provide more comprehensive capacity, mitigating the need to write to `SVSHAPE`
 258 SPRs directly.
 259
 260 Going back to the assembler instruction used to setup the shape for matrix
 261 multiply:
 262
 263 ```
 264     svshape 2, 2, 3, 0, 0
 265 ```
 266
 267 breakdown:
 268
 269 - SVxd=2, SVyd=2, SVzd=3
 270 - SVRM=0 (Matrix mode, uses `SVSHAPE0` SPR)
 271 -
 272
 273 ## SVREMAP
 274
 275 ```
 276     svremap 31, 1, 2, 3, 0, 0, 0
 277 ```
 278
 279 Assigns the configured SVSHAPEs to the relevant operand/result registers
 280 of the consecutive instruction/s (depending on if REMAP is set to persistent).
 281
 282 ## maddld - Multiply-Add Low Doubleword VA-form
 283
 284 ```
 285     sv.maddld *0, *16, *32, *0
 286 ```
 287
 288 A standard instruction available since version 3.0 of PowerISA.
 289
 290 *Temporary note:* maddld (Multiply-Add Low Doubleword) in the 3.1b version
 291 of the PowerISA spec is in the Linux Compliancy Subset, not SFS or SFFS.
 292 See page 1477 of the document, or page 1503 of the pdf.
 293
 294 This instruction can be used as a multiply-add accumulate by setting the
 295 third operand to be the same as the result register, which functions as
 296 an accumulator.
 297
 298 ## Appendix
 299
 300
 301 [[!tag svp64_cookbook ]]
 302