openpower/sv/cookbook/remap_matrix.mdwn

   1 # SVP64 REMAP Worked Example: Matrix Multiply
   2
   3 Links
   4
   5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
   6 * [[sv/remap]]
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
   8 * video <https://m.youtube.com/watch?v=BbhOA8ULKt4> slides
   9   <https://ftp.libre-soc.org/openpower_2021.pdf>
  10
  11 TODO: Include screenshots
  12
  13 One of the most powerful and versatile modes of the REMAP engine (a part of
  14 the SVP64 feature set) is the ability to perform matrix
  15 multiplication with all elements within a scalar register file.
  16
  17 This is done by converting the index used to iterate over the operand and
  18 result matrices to the actual index inside the scalar register file.
  19
  20 ## Worked example - manual (conventional method)
  21
  22 The matrix multiply looks like this:
  23
  24 ```
  25     mat_X * mat_Y = mat_Z
  26 ```
  27
  28 When multiplying non-square matrices (rows != columns), to determine the
  29 dimension of the result when matrix X has `a` rows and `b` columns and matrix
  30 Y has `b` rows and `c` columns:
  31
  32 ```
  33     X_axb * Y_bxc = Z_axc
  34 ```
  35
  36 The result matrix will have number of rows of the first matrix, and number of
  37 columns of the second matrix.
  38
  39
  40 For this example, the following values will be used for the operand matrices X
  41 and Y, result Z shown for completeness.
  42
  43     X =| 1 2 3 |  Y =  |  6  7 |  Z = |  52  58 |
  44        | 3 4 5 |       |  8  9 |      | 100 112 |
  45                        | 10 11 |
  46
  47 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
  48
  49 To determine the final dimensions of the resultant matrix Z, take the number
  50 of rows from matrix X (2) and number of columns from matrix Y (2).
  51
  52 The method usually taught in linear algebra course to students is the
  53 following (outer product):
  54
  55 1. Start with the first row of the first matrix, and first column of the
  56 second matrix.
  57 2. Multiply each element in the row by each element in the column, and sum
  58 with the current value of the result matrix element (multiply-add-accumulate).
  59 Store result in the first row, first column entry.
  60 3. Move to the next column of the second matrix, and next column of the result
  61 matrix. If there are no more columns in the second matrix, go back to first
  62 column (second matrix), and move to next row (first and result matrices).
  63 If there are no more rows left, result matrix has been fully computed.
  64 4. Repeat step 2.
  65
  66 This for-loop uses the indices as shown above
  67
  68 ```
  69     for i in range(0, mat_X_num_rows):
  70         for k in range(0, mat_Y_num_cols):
  71             for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
  72                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
  73 ```
  74
  75 Calculations:
  76
  77 ```
  78     | 1 2 3 |   |  6  7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
  79     | 3 4 5 | * |  8  9 |   | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
  80                 | 10 11 |
  81
  82     | 1 2 3 |   |  6  7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
  83     | 3 4 5 | * |  8  9 |   | (18 + 32 + 50) (21 + 36 + 55) |
  84                 | 10 11 |
  85
  86     | 1 2 3 |   |  6  7 | = |  52  58 |
  87     | 3 4 5 | * |  8  9 |   | 100 112 |
  88                 | 10 11 |
  89 ```
  90
  91 For the algorithm, assign indeces to matrices as follows:
  92
  93 ```
  94     Index | 0 1 2 3 4 5 |
  95     Mat X | 1 2 3 3 4 5 |
  96
  97     Index | 0 1 2 3  4  5 |
  98     Mat Y | 6 7 8 9 10 11 |
  99
 100     Index |  0  1   2   3 |
 101     Mat Z | 52 58 100 112 |
 102 ```
 103
 104 (Start with the first row, then assign index left-to-right, top-to-bottom.)
 105
 106 Index list:
 107
 108 ```
 109     Mat X | Mat Y | Mat Z
 110       0   |   0   |   0
 111       1   |   2   |   0
 112       2   |   4   |   0
 113       0   |   1   |   1
 114       1   |   3   |   1
 115       2   |   5   |   1
 116       3   |   0   |   2
 117       4   |   2   |   2
 118       5   |   4   |   2
 119       3   |   1   |   3
 120       4   |   3   |   3
 121       5   |   5   |   3
 122 ```
 123
 124
 125 The issue with this algorithm is that the result matrix element is the same
 126 for three consecutive operations, and where each element is stored in CPU
 127 registers, the same register will be written to three times and thus causing
 128 consistent stalling.
 129
 130 ## Inner Product
 131
 132 A slight modification to the order of the loops in the algorithm massively
 133 reduces the chance of read-after-write hazards, as the result matrix
 134 element (and thus register) changes with every multiply-add operation.
 135
 136 The code:
 137
 138 ```
 139     for i in range(mat_X_num_rows):
 140         for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
 141             for k in range(0, mat_Y_num_cols):
 142                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
 143 ```
 144
 145 Index list:
 146
 147 ```
 148     Mat X | Mat Y | Mat Z
 149       0   |   0   |   0
 150       0   |   1   |   1
 151       3   |   0   |   2
 152       3   |   1   |   3
 153       1   |   2   |   0
 154       1   |   3   |   1
 155       4   |   2   |   2
 156       4   |   3   |   3
 157       2   |   4   |   0
 158       2   |   5   |   1
 159       5   |   4   |   2
 160       5   |   5   |   3
 161 ```
 162
 163 The index for the result matrix changes with every operation, and thus the
 164 consecutive multiply-add instruction doesn't depend on the previous write
 165 register.
 166
 167 # SVP64 instructions implementing matrix multiply
 168
 169 * SVP64 assembler example:
 170 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
 171 * svremap and svshape instructions defined in:
 172 [[sv/rfc/ls009]
 173 * Multiple-Add Low Doubleword instruction pseudo-code (Power ISA 3.0C
 174 Book I, section 3.3.9): [[openpower/isa/fixedarith]]
 175
 176 *(Need to check if first arg of svremap correct, then one shown works with
 177 ISACaller)*
 178
 179 ```
 180     svshape 2, 2, 3, 0, 0
 181     svremap 31, 1, 2, 3, 0, 0, 0
 182     sv.maddld *0, *16, *32, *0
 183 ```
 184
 185 ## svshape
 186
 187 The `svshape` instruction is a convenient way to access the `SVSHAPE` Special
 188 Purpose Registers (SPRs), which were added alongside the SVP64 looping
 189 system for complex element indexing. Without having "Re-shaping" SPRs, only the most
 190 basic, consecuting indexing of register elements (0,1,2,3...) would
 191 be possible.
 192
 193 ### SVSHAPE Remapping SPRs
 194
 195 * See [[sv/remap]] for the full break down of SPRs `SVSHAPE0-3`.
 196
 197 For Matrix Multiply, SHAPE0 SPR is used:
 198
 199 |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31|
 200 |----- |----- | ------- | ------- | ------  |------|------ |---- |
 201 |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |
 202
 203
 204 skip:
 205
 206 - 0b00 indicates no dimensions to be skipped
 207 - 0b01 - skip '1st dim'
 208 - 0b10 - skip '2nd dim'
 209 - 0b11 - skip '3rd dim'
 210
 211 invxyz (3-bit; 1 for x, 1 for y, 1 for z):
 212
 213 - If corresponding dim bit is zero, start index from zero and increment
 214 - If bit set, start from xdimsz-1 (x dimension size, or whichever dimension
 215 bit is being looked at) and decrement down to zero.
 216
 217 offset is used to offset the result by `offset` elements (important for when
 218 using element width overrides are used).
 219
 220 xdimsz, ydimsz, zdimsz are offset by 1, such that 0-0b111111 correspond to
 221 1-64. A value of xdimsz=2 would indicate that in the first dimension there are
 222 3 elements in the array.
 223
 224 With the example Matrix X (2 rows, 3 columns, or 2x3 matrix), xdimsz=1,
 225 ydimsz=2, zdimsz=0.
 226
 227 permute setting:
 228 | permute | order | array format             |
 229 | ------- | ----- | ------------------------ |
 230 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 231 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 232 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 233 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 234 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 235 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 236 | 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
 237 | 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 238
 239 Permute re-arranges the order of the nested for-loops used to iterate over the
 240 three dimensions. This allows for in-place transpose, in-place rotate, matrix
 241 multiply, convolutions, without the limitation of Power-of-Two matrices.
 242
 243 Limitations of Matrix REMAP are currently:
 244
 245 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
 246 (MAC), or other operations may be performed in total.
 247 For matrix multiply, it means both operand matrices and result matrix can
 248 have no more than 127 elements in total.
 249 (Larger matrices can be split into tiles to circumvent this issue, out
 250 of scope of this document).
 251 - `svshape` instruction only provides part of the Matrix REMAP capability.
 252 For rotation and mirroring, `SVSHAPE` SPRs must be programmed directly (thus
 253 requiring more assembler instructions). Future revisions of SVP64 will
 254 provide more comprehensive capacity, mitigating the need to write to `SVSHAPE`
 255 SPRs directly.
 256
 257 Going back to the assembler instruction used to setup the shape for matrix
 258 multiply:
 259
 260 ```
 261     svshape 2, 2, 3, 0, 0
 262 ```
 263
 264 breakdown:
 265
 266 - SVxd=2, SVyd=2, SVzd=3
 267 - SVRM=0 (Matrix mode, uses `SVSHAPE0` SPR)
 268 -
 269
 270 ## SVREMAP
 271
 272 ```
 273     svremap 31, 1, 2, 3, 0, 0, 0
 274 ```
 275
 276 Assigns the configured SVSHAPEs to the relevant operand/result registers
 277 of the consecutive instruction/s (depending on if REMAP is set to persistent).
 278
 279 ## maddld - Multiply-Add Low Doubleword VA-form
 280
 281 ```
 282     sv.maddld *0, *16, *32, *0
 283 ```
 284
 285 A standard instruction available since version 3.0 of PowerISA.
 286
 287 *Temporary note:* maddld (Multiply-Add Low Doubleword) in the 3.1b version
 288 of the PowerISA spec is in the Linux Compliancy Subset, not SFS or SFFS.
 289 See page 1477 of the document, or page 1503 of the pdf.
 290
 291 This instruction can be used as a multiply-add accumulate by setting the
 292 third operand to be the same as the result register, which functions as
 293 an accumulator.
 294
 295 ## Appendix
 296
 297
 298 [[!tag svp64_cookbook ]]
 299