openpower/sv/cookbook/remap_matrix.mdwn

   1 # SVP64 REMAP Worked Example: Matrix Multiply
   2
   3 Links
   4
   5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
   6 * [[sv/remap]]
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
   8 * video <https://m.youtube.com/watch?v=BbhOA8ULKt4> slides
   9   <https://ftp.libre-soc.org/openpower_2021.pdf>
  10 * setvl instruction: [[sv/setvl]]
  11 * REMAP python function based on yield (shows how indices are generated):
  12 [[sv/remapyield.py]]
  13
  14 TODO: Include screenshots
  15
  16 One of the most powerful and versatile modes of the REMAP engine (a part of
  17 the SVP64 feature set) is the ability to perform matrix
  18 multiplication with all elements within a scalar register file.
  19
  20 This is done by converting the index used to iterate over the operand and
  21 result matrices to the actual index inside the scalar register file.
  22
  23 ## Worked example - manual (conventional method)
  24
  25 The matrix multiply looks like this:
  26
  27 ```
  28     mat_X * mat_Y = mat_Z
  29 ```
  30
  31 When multiplying non-square matrices (rows != columns), to determine the
  32 dimension of the result when matrix X has `a` rows and `b` columns and matrix
  33 Y has `b` rows and `c` columns:
  34
  35 ```
  36     X_axb * Y_bxc = Z_axc
  37 ```
  38
  39 The result matrix will have number of rows of the first matrix, and number of
  40 columns of the second matrix.
  41
  42
  43 For this example, the following values will be used for the operand matrices X
  44 and Y, result Z shown for completeness.
  45
  46     X =| 1 2 3 |  Y =  |  6  7 |  Z = |  52  58 |
  47        | 3 4 5 |       |  8  9 |      | 100 112 |
  48                        | 10 11 |
  49
  50 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
  51
  52 To determine the final dimensions of the resultant matrix Z, take the number
  53 of rows from matrix X (2) and number of columns from matrix Y (2).
  54
  55 The method usually taught in linear algebra course to students is the
  56 following (outer product):
  57
  58 1. Start with the first row of the first matrix, and first column of the
  59 second matrix.
  60 2. Multiply each element in the row by each element in the column, and sum
  61 with the current value of the result matrix element (multiply-add-accumulate).
  62 Store result in the first row, first column entry.
  63 3. Move to the next column of the second matrix, and next column of the result
  64 matrix. If there are no more columns in the second matrix, go back to first
  65 column (second matrix), and move to next row (first and result matrices).
  66 If there are no more rows left, result matrix has been fully computed.
  67 4. Repeat step 2.
  68
  69 This for-loop uses the indices as shown above
  70
  71 ```
  72     for i in range(0, mat_X_num_rows):
  73         for k in range(0, mat_Y_num_cols):
  74             for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
  75                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
  76 ```
  77
  78 Calculations:
  79
  80 ```
  81     | 1 2 3 |   |  6  7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
  82     | 3 4 5 | * |  8  9 |   | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
  83                 | 10 11 |
  84
  85     | 1 2 3 |   |  6  7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
  86     | 3 4 5 | * |  8  9 |   | (18 + 32 + 50) (21 + 36 + 55) |
  87                 | 10 11 |
  88
  89     | 1 2 3 |   |  6  7 | = |  52  58 |
  90     | 3 4 5 | * |  8  9 |   | 100 112 |
  91                 | 10 11 |
  92 ```
  93
  94 For the algorithm, assign indeces to matrices as follows:
  95
  96 ```
  97     Index | 0 1 2 3 4 5 |
  98     Mat X | 1 2 3 3 4 5 |
  99
 100     Index | 0 1 2 3  4  5 |
 101     Mat Y | 6 7 8 9 10 11 |
 102
 103     Index |  0  1   2   3 |
 104     Mat Z | 52 58 100 112 |
 105 ```
 106
 107 (Start with the first row, then assign index left-to-right, top-to-bottom.)
 108
 109 Index list:
 110
 111 ```
 112     | Mat X | Mat Y | Mat Z |
 113     |   0   |   0   |   0   |
 114     |   1   |   2   |   0   |
 115     |   2   |   4   |   0   |
 116     |   0   |   1   |   1   |
 117     |   1   |   3   |   1   |
 118     |   2   |   5   |   1   |
 119     |   3   |   0   |   2   |
 120     |   4   |   2   |   2   |
 121     |   5   |   4   |   2   |
 122     |   3   |   1   |   3   |
 123     |   4   |   3   |   3   |
 124     |   5   |   5   |   3   |
 125 ```
 126
 127 Worked example broken down into individual multiply-add accumulates:
 128
 129 [[!img outer_product_worked_example.jpg size="600x" ]]
 130
 131 The issue with this algorithm is that the result matrix element is the same
 132 for three consecutive operations, and where each element is stored in CPU
 133 registers, the same register will be written to three times and thus causing
 134 consistent stalling.
 135
 136 ## Inner Product
 137
 138 A slight modification to the order of the loops in the algorithm massively
 139 reduces the chance of read-after-write hazards, as the result matrix
 140 element (and thus register) changes with every multiply-add operation.
 141
 142 The code:
 143
 144 ```
 145     for i in range(mat_X_num_rows):
 146         for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
 147             for k in range(0, mat_Y_num_cols):
 148                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
 149 ```
 150
 151 Index list:
 152
 153 ```
 154     | Mat X | Mat Y | Mat Z |
 155     |   0   |   0   |   0   |
 156     |   0   |   1   |   1   |
 157     |   3   |   0   |   2   |
 158     |   3   |   1   |   3   |
 159     |   1   |   2   |   0   |
 160     |   1   |   3   |   1   |
 161     |   4   |   2   |   2   |
 162     |   4   |   3   |   3   |
 163     |   2   |   4   |   0   |
 164     |   2   |   5   |   1   |
 165     |   5   |   4   |   2   |
 166     |   5   |   5   |   3   |
 167 ```
 168
 169 Worked example for inner product:
 170
 171 [[!img inner_product_worked_example.jpg size="600x" ]]
 172
 173 The index for the result matrix changes with every operation, and thus the
 174 consecutive multiply-add instruction doesn't depend on the previous write
 175 register.
 176
 177 Outer and inner product indices side-by-side:
 178
 179 ```
 180     |   Outer Product       |   Inner Product       |
 181     | Mat X | Mat Y | Mat Z | Mat X | Mat Y | Mat Z |
 182     |   0   |   0   |   0   |   0   |   0   |   0   |
 183     |   1   |   2   |   0   |   0   |   1   |   1   |
 184     |   2   |   4   |   0   |   3   |   0   |   2   |
 185     |   0   |   1   |   1   |   3   |   1   |   3   |
 186     |   1   |   3   |   1   |   1   |   2   |   0   |
 187     |   2   |   5   |   1   |   1   |   3   |   1   |
 188     |   3   |   0   |   2   |   4   |   2   |   2   |
 189     |   4   |   2   |   2   |   4   |   3   |   3   |
 190     |   5   |   4   |   2   |   2   |   4   |   0   |
 191     |   3   |   1   |   3   |   2   |   5   |   1   |
 192     |   4   |   3   |   3   |   5   |   4   |   2   |
 193     |   5   |   5   |   3   |   5   |   5   |   3   |
 194 ```
 195
 196 # SVP64 instructions implementing matrix multiply
 197
 198 * SVP64 assembler example:
 199 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
 200 * svremap and svshape instructions defined in:
 201 [[sv/rfc/ls009]
 202 * Multiple-Add Low Doubleword instruction pseudo-code (Power ISA 3.0C
 203 Book I, section 3.3.9): [[openpower/isa/fixedarith]]
 204
 205 *(Need to check if first arg of svremap correct, then one shown works with
 206 ISACaller)*
 207
 208 ```
 209     svshape 2, 2, 3, 0, 0
 210     svremap 31, 1, 2, 3, 0, 0, 0
 211     sv.maddld *0, *16, *32, *0
 212 ```
 213
 214 ## svshape
 215
 216 The `svshape` instruction is a convenient way to access the `SVSHAPE` Special
 217 Purpose Registers (SPRs), which were added alongside the SVP64 looping
 218 system for complex element indexing. Without having "Re-shaping" SPRs, only the most
 219 basic, consecuting indexing of register elements (0,1,2,3...) would
 220 be possible.
 221
 222 ### SVSHAPE Remapping SPRs
 223
 224 * See [[sv/remap]] for the full break down of SPRs `SVSHAPE0-3`.
 225
 226 For Matrix Multiply, SHAPE0 SPR is used:
 227
 228 |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31|
 229 |----- |----- | ------- | ------- | ------  |------|------ |---- |
 230 |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |
 231
 232
 233 skip:
 234
 235 - 0b00 indicates no dimensions to be skipped
 236 - 0b01 - skip '1st dim'
 237 - 0b10 - skip '2nd dim'
 238 - 0b11 - skip '3rd dim'
 239
 240 invxyz (3-bit; 1 for x, 1 for y, 1 for z):
 241
 242 - If corresponding dim bit is zero, start index from zero and increment
 243 - If bit set, start from xdimsz-1 (x dimension size, or whichever dimension
 244 bit is being looked at) and decrement down to zero.
 245
 246 offset is used to offset the result by `offset` elements (important for when
 247 using element width overrides are used).
 248
 249 xdimsz, ydimsz, zdimsz are offset by 1, such that 0-0b111111 correspond to
 250 1-64. A value of xdimsz=2 would indicate that in the first dimension there are
 251 3 elements in the array.
 252
 253 With the example Matrix X (2 rows, 3 columns, or 2x3 matrix), xdimsz=1,
 254 ydimsz=2, zdimsz=0.
 255
 256 permute setting:
 257
 258 | permute | order | array format             |
 259 | ------- | ----- | ------------------------ |
 260 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 261 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 262 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 263 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 264 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 265 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 266 | 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
 267 | 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 268
 269 Permute re-arranges the order of the nested for-loops used to iterate over the
 270 three dimensions. This allows for in-place transpose, in-place rotate, matrix
 271 multiply, convolutions, without the limitation of Power-of-Two matrices.
 272
 273 For normal matrix multiply, the permute setting is 0b010 (order 1,0,2,
 274 or swap x and y loops).
 275
 276 (*NOTE:* This is done automatically by the Matrix-Multiply REMAP mode, `SVRM=0`.)
 277
 278 Limitations of Matrix REMAP are currently:
 279
 280 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
 281 (MAC), or other operations may be performed in total.
 282 For matrix multiply, it means both operand matrices and result matrix can
 283 have no more than 127 elements in total.
 284 (Larger matrices can be split into tiles to circumvent this issue, out
 285 of scope of this document).
 286 - `svshape` instruction only provides part of the Matrix REMAP capability.
 287 For rotation and mirroring, `SVSHAPE` SPRs must be programmed directly (thus
 288 requiring more assembler instructions). Future revisions of SVP64 will
 289 provide more comprehensive capacity, mitigating the need to write to `SVSHAPE`
 290 SPRs directly.
 291
 292 Going back to the assembler instruction used to setup the shape for matrix
 293 multiply:
 294
 295 ```
 296     svshape 2, 2, 3, 0, 0
 297 ```
 298
 299 breakdown:
 300
 301 - SVxd=2, SVyd=2, SVzd=3
 302 - SVRM=0 (Matrix mode, uses `SVSHAPE0` SPR)
 303 - vf=0 (not using Vertical-First mode)
 304
 305 To determine the `SVxd`/`SVyd`/`SVzd` settings:
 306
 307 - `SVxd` is equal to the number of columns in the second operand matrix.
 308 Matrix Y has 2 columns, so `SVxd=2`.
 309 - `SVyd` is equal to the number of rows in the first operand matrix.
 310 Matrix X has 2 rows, so `SVyd=2`.
 311 - `SVzd` is equal to the number of columns in the first operand matrix,
 312 or the number of rows in the second operand matrix.
 313 Matrix X has 3 columns, matrix Y has 3 rows, so `SVzd=3`.
 314
 315 Table form
 316
 317 ```
 318     SVxd |         mat_Y_num_cols
 319     SVyd |         mat_X_num_rows
 320     SVzd | mat_X_num_cols OR mat_Y_num_rows
 321 ```
 322
 323 ## SVREMAP
 324
 325 SVRM-Form:
 326
 327 |0     |6     |11  |13   |15   |17   |19   |21    | 22:25 |26:31  |
 328 | --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
 329 | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
 330
 331 * svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
 332
 333 ```
 334     svremap 31, 1, 2, 3, 0, 0, 0
 335 ```
 336
 337 breakdown:
 338
 339 - `SVme=31`, 5-bit bitmask determines which registers have REMAP applied.
 340 Bitfields are: `RA|RB|RC|RT|EA/FRS` In this example, all input registers
 341 (RA, RB, RC) of *any* scalar instruction to follow and output register (RT)
 342 will have REMAP applied.
 343 - `mix/mox` fields determine which shape is applied to the activated register
 344 - `mi0=1`, instruction operand RA has SVSHAPE1 applied to it.
 345 - `mi1=2`, instruction operand RB has SVSHAPE2 applied to it.
 346 - `mi2=3`, instruction operand RA has SVSHAPE3 applied to it.
 347 - `mo0=0`, instruction result RT has SVSHAPE0 applied to it.
 348 - `mo1=0`, instruction result EA/FRS has SVSHAPE0 applied to it. *(not applicable
 349 for this example)*
 350 - `pst=0`, if set, REMAP remains enabled until explicitly disabled, or another
 351 REMAP, or setvl is setup.
 352
 353 Assigns the configured SVSHAPEs to the relevant operand/result registers
 354 of the consecutive instruction/s (depending on if REMAP is set to persistent).
 355
 356 ## maddld - Multiply-Add Low Doubleword VA-form
 357
 358 ```
 359     sv.maddld *0, *16, *32, *0
 360 ```
 361
 362 A standard instruction available since version 3.0 of PowerISA.
 363
 364 *Temporary note:* maddld (Multiply-Add Low Doubleword) in the 3.1b version
 365 of the PowerISA spec is in the Linux Compliancy Subset, not SFS or SFFS.
 366 See page 1477 of the document, or page 1503 of the pdf.
 367
 368 This instruction can be used as a multiply-add accumulate by setting the
 369 third operand to be the same as the result register, which functions as
 370 an accumulator.
 371
 372 ## Appendix
 373
 374 ### Running the simulator test case
 375
 376 - Set up the LibreSOC development environment (need to have openpower-isa repo
 377 setup and installed).
 378
 379     $: cd /PATH/TO/src/openpower-isa/src/openpower/decoder/isa/
 380     $: python3 test_caller_svp64_matrix.py >& /tmp/f
 381
 382 (All test cases are enabled by default; extra test cases can be disabled by changing
 383 `test_` to `tst_` or any other prefix.)
 384
 385 The simulator dump will be stored as in the temporary directory.
 386
 387 Particular text strings to to search for:
 388
 389 - `shape remap` - shows the indices corresponding to the operand and result matrices
 390 - `flattened X,Y,expected` - shows the expected matrix result which will be used to
 391 validate the result coming from the simulator.
 392 - `maddld-matrix` - shows the result matrix element values coming from the simulator.
 393 These values must match the expect result calculated before the simulator call.
 394
 395 Screenshots:
 396
 397 [[!img mat_mul_sim_results.png size="600x" ]]
 398
 399 [[!tag svp64_cookbook ]]
 400