openpower/sv/cookbook/remap_matrix.mdwn

   1 # SVP64 REMAP Worked Example: Matrix Multiply
   2
   3 Links
   4
   5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
   6 * [[sv/remap]]
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
   8 * video <https://m.youtube.com/watch?v=BbhOA8ULKt4> slides
   9   <https://ftp.libre-soc.org/openpower_2021.pdf>
  10 * setvl instruction: [[sv/setvl]]
  11 * REMAP python function based on yield (shows how indices are generated):
  12 [[sv/remapyield.py]]
  13
  14 One of the most powerful and versatile modes of the REMAP engine (a part of
  15 the SVP64 feature set) is the ability to perform matrix
  16 multiplication with all elements within a scalar register file.
  17
  18 This is done by converting the index used to iterate over the operand and
  19 result matrices to the actual index inside the scalar register file.
  20
  21 ## Worked example - manual (conventional method)
  22
  23 The matrix multiply looks like this:
  24
  25 ```
  26     mat_X * mat_Y = mat_Z
  27 ```
  28
  29 When multiplying non-square matrices (rows != columns), to determine the
  30 dimension of the result when matrix X has `a` rows and `b` columns and matrix
  31 Y has `b` rows and `c` columns:
  32
  33 ```
  34     X_axb * Y_bxc = Z_axc
  35 ```
  36
  37 The result matrix will have number of rows of the first matrix, and number of
  38 columns of the second matrix.
  39
  40
  41 For this example, the following values will be used for the operand matrices X
  42 and Y, result Z shown for completeness.
  43
  44     X =| 1 2 3 |  Y =  |  6  7 |  Z = |  52  58 |
  45        | 3 4 5 |       |  8  9 |      | 100 112 |
  46                        | 10 11 |
  47
  48 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
  49
  50 To determine the final dimensions of the resultant matrix Z, take the number
  51 of rows from matrix X (2) and number of columns from matrix Y (2).
  52
  53 The method usually taught in linear algebra course to students is the
  54 following (outer product):
  55
  56 1. Start with the first row of the first matrix, and first column of the
  57 second matrix.
  58 2. Multiply each element in the row by each element in the column, and sum
  59 with the current value of the result matrix element (multiply-add-accumulate).
  60 Store result in the row matching first operand row and column matching second
  61 operand column.
  62 3. Move to the next column of the second matrix, and next column of the result
  63 matrix. If there are no more columns in the second matrix, go back to first
  64 column (second matrix), and move to next row (first and result matrices).
  65 If there are no more rows left, result matrix has been fully computed.
  66 4. Repeat step 2.
  67
  68 This for-loop uses the indices as shown above
  69
  70 ```
  71 for i in range(0, mat_X_num_rows):
  72     for k in range(0, mat_Y_num_cols):
  73         for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
  74             mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
  75 ```
  76
  77 Calculations:
  78
  79 ```
  80     | 1 2 3 |   |  6  7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
  81     | 3 4 5 | * |  8  9 |   | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
  82                 | 10 11 |
  83
  84     | 1 2 3 |   |  6  7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
  85     | 3 4 5 | * |  8  9 |   | (18 + 32 + 50) (21 + 36 + 55) |
  86                 | 10 11 |
  87
  88     | 1 2 3 |   |  6  7 | = |  52  58 |
  89     | 3 4 5 | * |  8  9 |   | 100 112 |
  90                 | 10 11 |
  91 ```
  92
  93 For the algorithm, assign indeces to matrices as follows:
  94
  95 ```
  96     Index | 0 1 2 3 4 5 |
  97     Mat X | 1 2 3 3 4 5 |
  98
  99     Index | 0  1  2  3  4  5 |
 100     Mat Y | 6  7  8  9 10 11 |
 101
 102     Index |   0   1   2   3 |
 103     Mat Z |  52  58 100 112 |
 104 ```
 105
 106 (Start with the first row, then assign index left-to-right, top-to-bottom.)
 107
 108 Index list:
 109
 110 ```
 111     | Mat X | Mat Y | Mat Z |
 112     |   0   |   0   |   0   |
 113     |   1   |   2   |   0   |
 114     |   2   |   4   |   0   |
 115     |   0   |   1   |   1   |
 116     |   1   |   3   |   1   |
 117     |   2   |   5   |   1   |
 118     |   3   |   0   |   2   |
 119     |   4   |   2   |   2   |
 120     |   5   |   4   |   2   |
 121     |   3   |   1   |   3   |
 122     |   4   |   3   |   3   |
 123     |   5   |   5   |   3   |
 124 ```
 125
 126 Worked example broken down into individual multiply-add accumulates:
 127
 128 [[!img outer_product_worked_example.jpg size="500x" ]]
 129
 130 The issue with this algorithm is that the result matrix element is the same
 131 for three consecutive operations, and where each element is stored in CPU
 132 registers, the same register will be written to three times and thus causing
 133 consistent stalling.
 134
 135 ## Inner Product
 136
 137 A slight modification to the order of the loops in the algorithm massively
 138 reduces the chance of read-after-write hazards, as the result matrix
 139 element (and thus register) changes with every multiply-add operation.
 140
 141 The code:
 142
 143 ```
 144 for i in range(mat_X_num_rows):
 145     for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
 146         for k in range(0, mat_Y_num_cols):
 147             mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
 148 ```
 149
 150 Index list:
 151
 152 ```
 153     | Mat X | Mat Y | Mat Z |
 154     |   0   |   0   |   0   |
 155     |   0   |   1   |   1   |
 156     |   3   |   0   |   2   |
 157     |   3   |   1   |   3   |
 158     |   1   |   2   |   0   |
 159     |   1   |   3   |   1   |
 160     |   4   |   2   |   2   |
 161     |   4   |   3   |   3   |
 162     |   2   |   4   |   0   |
 163     |   2   |   5   |   1   |
 164     |   5   |   4   |   2   |
 165     |   5   |   5   |   3   |
 166 ```
 167
 168 Worked example for inner product:
 169
 170 [[!img inner_product_worked_example.jpg size="500x" ]]
 171
 172 The index for the result matrix changes with every operation, and thus the
 173 consecutive multiply-add instruction doesn't depend on the previous write
 174 register.
 175
 176 Outer and inner product indices side-by-side:
 177
 178 ```
 179     |   Outer Product       |   Inner Product       |
 180     | Mat X | Mat Y | Mat Z | Mat X | Mat Y | Mat Z |
 181     |   0   |   0   |   0   |   0   |   0   |   0   |
 182     |   1   |   2   |   0   |   0   |   1   |   1   |
 183     |   2   |   4   |   0   |   3   |   0   |   2   |
 184     |   0   |   1   |   1   |   3   |   1   |   3   |
 185     |   1   |   3   |   1   |   1   |   2   |   0   |
 186     |   2   |   5   |   1   |   1   |   3   |   1   |
 187     |   3   |   0   |   2   |   4   |   2   |   2   |
 188     |   4   |   2   |   2   |   4   |   3   |   3   |
 189     |   5   |   4   |   2   |   2   |   4   |   0   |
 190     |   3   |   1   |   3   |   2   |   5   |   1   |
 191     |   4   |   3   |   3   |   5   |   4   |   2   |
 192     |   5   |   5   |   3   |   5   |   5   |   3   |
 193 ```
 194
 195 # SVP64 instructions implementing matrix multiply
 196
 197 * SVP64 assembler example:
 198 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
 199 * svremap and svshape instructions defined in:
 200 [[sv/rfc/ls009]
 201 * Multiple-Add Low Doubleword instruction pseudo-code (Power ISA 3.0C
 202 Book I, section 3.3.9): [[openpower/isa/fixedarith]]
 203
 204 *(Need to check if first arg of svremap correct, then one shown works with
 205 ISACaller)*
 206
 207 ```
 208     svshape 2, 2, 3, 0, 0
 209     svremap 31, 1, 2, 3, 0, 0, 0
 210     sv.maddld *0, *16, *32, *0
 211 ```
 212
 213 ## svshape
 214
 215 The `svshape` instruction is a convenient way to access the `SVSHAPE` Special
 216 Purpose Registers (SPRs), which were added alongside the SVP64 looping
 217 system for complex element indexing. Without having "Re-shaping" SPRs, only the most
 218 basic, consecuting indexing of register elements (0,1,2,3...) would
 219 be possible.
 220
 221 ### SVSHAPE Remapping SPRs
 222
 223 * See [[sv/remap]] for the full break down of SPRs `SVSHAPE0-3`.
 224 * Pseudo-code for the
 225 [svshape instruction](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=33a02e#l120)
 226
 227 Matrix Multiply utilises SVSHAPE0-2 SPRs.
 228
 229 (NOTE: This section is duplicated from the remap spec, and thus will be re-worked.)
 230
 231 SVSHAPE0 SPR:
 232
 233 |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31|
 234 |----- |----- | ------- | ------- | ------  |------|------ |---- |
 235 |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |
 236
 237
 238 skip:
 239
 240 - 0b00 indicates no dimensions to be skipped
 241 - 0b01 - skip '1st dim'
 242 - 0b10 - skip '2nd dim'
 243 - 0b11 - skip '3rd dim'
 244
 245 invxyz (3-bit; 1 for x, 1 for y, 1 for z):
 246
 247 - If corresponding dim bit is zero, start index from zero and increment
 248 - If bit set, start from xdimsz-1 (x dimension size, or whichever dimension
 249 bit is being looked at) and decrement down to zero.
 250
 251 offset is used to offset the result by `offset` elements (important for when
 252 using element width overrides are used).
 253
 254 xdimsz, ydimsz, zdimsz are offset by 1, such that 0-0b111111 correspond to
 255 1-64. A value of xdimsz=2 would indicate that in the first dimension there are
 256 3 elements in the array.
 257
 258 With the example Matrix X (2 rows, 3 columns, or 2x3 matrix), xdimsz=1,
 259 ydimsz=2, zdimsz=0.
 260
 261 permute setting:
 262
 263 | permute | order | array format             |
 264 | ------- | ----- | ------------------------ |
 265 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 266 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 267 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 268 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 269 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 270 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 271 | 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
 272 | 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 273
 274 Permute re-arranges the order of the nested for-loops used to iterate over the
 275 three dimensions. This allows for in-place transpose, in-place rotate, matrix
 276 multiply, convolutions, without the limitation of Power-of-Two matrices.
 277
 278 For normal matrix multiply, the permute setting is 0b010 (order 1,0,2,
 279 or swap x and y loops).
 280
 281 (*NOTE:* This is done automatically by the Matrix-Multiply REMAP mode, `SVRM=0`.)
 282
 283 ### Limitations of Matrix REMAP
 284
 285 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
 286 (MAC), or other operations may be performed in total.
 287 For matrix multiply, it means both operand matrices and result matrix can
 288 have no more than 127 elements in total.
 289 (Larger matrices can be split into tiles to circumvent this issue, out
 290 of scope of this document).
 291 - `svshape` instruction only provides part of the Matrix REMAP capability.
 292 For rotation and mirroring, `SVSHAPE` SPRs must be programmed directly (thus
 293 requiring more assembler instructions). Future revisions of SVP64 will
 294 provide more comprehensive capacity, mitigating the need to write to `SVSHAPE`
 295 SPRs directly.
 296
 297 Going back to the assembler instruction used to setup the shape for matrix
 298 multiply:
 299
 300 ```
 301     svshape 2, 2, 3, 0, 0
 302 ```
 303
 304 breakdown:
 305
 306 - SVxd=2, SVyd=2, SVzd=3
 307 - SVRM=0 (Matrix mode, uses `SVSHAPE0` SPR)
 308 - vf=0 (not using Vertical-First mode)
 309
 310 To determine the `SVxd`/`SVyd`/`SVzd` settings:
 311
 312 - `SVxd` is equal to the number of columns in the second operand matrix.
 313 Matrix Y has 2 columns, so `SVxd=2`.
 314 - `SVyd` is equal to the number of rows in the first operand matrix.
 315 Matrix X has 2 rows, so `SVyd=2`.
 316 - `SVzd` is equal to the number of columns in the first operand matrix,
 317 or the number of rows in the second operand matrix.
 318 Matrix X has 3 columns, matrix Y has 3 rows, so `SVzd=3`.
 319
 320 Table form
 321
 322 ```
 323     SVxd |         mat_Y_num_cols
 324     SVyd |         mat_X_num_rows
 325     SVzd | mat_X_num_cols OR mat_Y_num_rows
 326 ```
 327
 328 ## SVREMAP
 329
 330 SVRM-Form:
 331
 332 |0     |6     |11  |13   |15   |17   |19   |21    | 22:25 |26:31  |
 333 | --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
 334 | PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
 335
 336 * svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
 337
 338 ```
 339     svremap 31, 1, 2, 3, 0, 0, 0
 340 ```
 341
 342 breakdown:
 343
 344 - `SVme=31`, 5-bit bitmask determines which registers have REMAP applied.
 345 Bitfields are: `RA|RB|RC|RT|EA/FRS` In this example, all input registers
 346 (RA, RB, RC) of *any* scalar instruction to follow and output register (RT)
 347 will have REMAP applied.
 348 - `mix/mox` fields determine which shape is applied to the activated register
 349 - `mi0=1`, instruction operand RA has SVSHAPE1 applied to it.
 350 - `mi1=2`, instruction operand RB has SVSHAPE2 applied to it.
 351 - `mi2=3`, instruction operand RA has SVSHAPE3 applied to it.
 352 - `mo0=0`, instruction result RT has SVSHAPE0 applied to it.
 353 - `mo1=0`, instruction result EA/FRS has SVSHAPE0 applied to it. *(not applicable
 354 for this example)*
 355 - `pst=0`, if set, REMAP remains enabled until explicitly disabled, or another
 356 REMAP, or setvl is setup.
 357
 358 Assigns the configured SVSHAPEs to the relevant operand/result registers
 359 of the consecutive instruction/s (depending on if REMAP is set to persistent).
 360
 361 The index table shown for the inner method above shows indices for a 'flattened'
 362 matrix (how it would be arranged in sequential GPR registers), whereas
 363 SVSHAPE0, 1, 2 registers setup the indices in relation to rows and columns
 364 of the matrix.
 365
 366 This is how the indices compare:
 367
 368 ```
 369                              Row/Column Indices
 370     Flattened Indices     | Mat X | Mat Y | Mat Z |
 371 | Mat X | Mat Y | Mat Z | | r   c | r   c | r   c |
 372 |   0   |   0   |   0   | | 0   0 | 0   0 | 0   0 |
 373 |   0   |   1   |   1   | | 0   0 | 0   1 | 0   1 |
 374 |   3   |   0   |   2   | | 1   0 | 0   0 | 1   0 |
 375 |   3   |   1   |   3   | | 1   0 | 0   1 | 1   1 |
 376 |   1   |   2   |   0   | | 0   1 | 1   0 | 0   0 |
 377 |   1   |   3   |   1   | | 0   1 | 1   1 | 0   1 |
 378 |   4   |   2   |   2   | | 1   1 | 1   0 | 1   0 |
 379 |   4   |   3   |   3   | | 1   1 | 1   1 | 1   1 |
 380 |   2   |   4   |   0   | | 0   2 | 2   0 | 0   0 |
 381 |   2   |   5   |   1   | | 0   2 | 2   1 | 0   1 |
 382 |   5   |   4   |   2   | | 1   2 | 2   0 | 1   0 |
 383 |   5   |   5   |   3   | | 1   2 | 2   1 | 1   1 |
 384 ```
 385
 386 See [[openpower/sv/remap]] Section 3.3 Matrix Mode for more information on
 387 the index sequences which can be produced with SVSHAPE SPRs.
 388
 389 ## maddld - Multiply-Add Low Doubleword VA-form
 390
 391 ```
 392     sv.maddld *0, *16, *32, *0
 393 ```
 394
 395 A standard instruction available since version 3.0 of PowerISA.
 396
 397 *Temporary note:* maddld (Multiply-Add Low Doubleword) in the 3.1b version
 398 of the PowerISA spec is in the Linux Compliancy Subset, not SFS or SFFS.
 399 See page 1477 of the document, or page 1503 of the pdf.
 400
 401 This instruction can be used as a multiply-add accumulate by setting the
 402 third operand to be the same as the result register, which functions as
 403 an accumulator.
 404
 405 ## Appendix
 406
 407 ### Running the simulator test case
 408
 409 - Set up the LibreSOC development environment (need to have openpower-isa repo
 410 setup and installed).
 411
 412     $: cd /PATH/TO/src/openpower-isa/src/openpower/decoder/isa/
 413     $: python3 test_caller_svp64_matrix.py >& /tmp/f
 414
 415 (All test cases are enabled by default; extra test cases can be disabled by changing
 416 `test_` to `tst_` or any other prefix.)
 417
 418 The simulator dump will be stored as in the temporary directory.
 419
 420 Particular text strings to to search for:
 421
 422 - `shape remap` - shows the indices corresponding to the operand and result matrices
 423 - `flattened X,Y,expected` - shows the expected matrix result which will be used to
 424 validate the result coming from the simulator.
 425 - `maddld-matrix` - shows the result matrix element values coming from the simulator.
 426 These values must match the expect result calculated before the simulator call.
 427
 428 If the matrix multiply worked successfully, the simulator dump will have an `OK`
 429 string after the `Ran 1 test in...` summary.
 430
 431 Screenshots:
 432
 433 [[!img mat_mul_sim_results.png size="600x" ]]
 434
 435 ### Remapyield showing how the matrix indices are generated
 436
 437 *(more work required not critically important right now)*
 438
 439 This table was drawn up using the `remapyield.py` code found
 440 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;h=d7b4bdeff2ae5d5f319ae036e4dff70de194dc67;hb=HEAD).
 441
 442 The xdim, ydim, and zdim were set to match the values specified in the svshape
 443 instruction for the above matrix multiply example.
 444
 445 `x`, 'y`, and `z` are iterators for the three loops, with a range between 0 and
 446 `SVxd-1`, `SVyd-1`, `SVzd-1` respectively.
 447
 448 For the above `svshape` instruction example:
 449
 450 - `x` takes values between 0 and 1 (`SVxd=2`)
 451 - `y` takes values between 0 and 1 (`SVyd=2`)
 452 - `z` takes values between 0 and 2 (`SVzd=3`)
 453 - `x_end`, `y_end`, and `z_end` correspond to the largest permitted values for
 454 `x`, `y`, and `z` (1, 1, and 2 respectively).
 455
 456 ```
 457                   | (x ==   | (y ==   | (z ==
 458 index | x | y | z |  x_end) |  y_end) |  z_end)
 459    0  | 0 | 0 | 0 |    F    |    F    |   F
 460    1  | 1 | 0 | 0 |    T    |    F    |   F
 461    2  | 0 | 1 | 0 |    F    |    T    |   F
 462    3  | 1 | 1 | 0 |    T    |    T    |   F
 463    4  | 0 | 0 | 1 |    F    |    F    |   F
 464    5  | 1 | 0 | 1 |    T    |    F    |   F
 465    6  | 0 | 1 | 1 |    F    |    T    |   F
 466    7  | 1 | 1 | 1 |    T    |    T    |   F
 467    8  | 0 | 0 | 2 |    F    |    F    |   T
 468    9  | 1 | 0 | 2 |    T    |    F    |   T
 469   10  | 0 | 1 | 2 |    F    |    T    |   T
 470   11  | 1 | 1 | 2 |    T    |    T    |   T
 471 ```
 472
 473 If the `x`, `y`, and `z` sequences are compared with the inner product index
 474 table, there are some resemblances. Swapping the `x` and `y` (permute=0b10)
 475 shows that matrix X indices correspond to `y`, matrix Y indices correspond to
 476 `x` (if the following equation used: `x+z*z_end`), and matrix Z indices
 477 correspond to `z`.
 478
 479 [[!tag svp64_cookbook ]]
 480