openpower/sv/cookbook/remap_matrix.mdwn

   1 # SVP64 REMAP Worked Example: Matrix Multiply
   2
   3 Links
   4
   5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
   6 * [[sv/remap]]
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
   8
   9 TODO: Include screenshots
  10
  11 One of the most powerful and versatile modes of the REMAP engine (a part of
  12 the SVP64 feature set) is the ability to perform matrix
  13 multiplication with all elements within a scalar register file.
  14
  15 This is done by converting the index used to iterate over the operand and
  16 result matrices to the actual index inside the scalar register file.
  17
  18 ## Worked example - manual (conventional method)
  19
  20 The matrix multiply looks like this:
  21
  22     mat_X * mat_Y = mat_Z
  23
  24 When multiplying non-square matrices (rows != columns), to determine the
  25 dimension of the result when matrix X has `a` rows and `b` columns and matrix
  26 Y has `b` rows and `c` columns:
  27
  28     X_axb * Y_bxc = Z_axc
  29
  30 The result matrix will have number of rows of the first matrix, and number of
  31 columns of the second matrix.
  32
  33
  34 For this example, the following values will be used for the operand matrices X
  35 and Y, result Z shown for completeness.
  36
  37     X =| 1 2 3 |  Y =  |  6  7 |  Z = |  52  58 |
  38        | 3 4 5 |       |  8  9 |      | 100 112 |
  39                        | 10 11 |
  40
  41 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
  42
  43 To determine the final dimensions of the resultant matrix Z, take the number
  44 of rows from matrix X (2) and number of columns from matrix Y (2).
  45
  46 The method usually taught in linear algebra course to students is the
  47 following (outer product):
  48
  49 1. Start with the first row of the first matrix, and first column of the
  50 second matrix.
  51 2. Multiply each element in the row by each element in the column, and sum
  52 with the current value of the result matrix element (multiply-add-accumulate).
  53 Store result in the first row, first column entry.
  54 3. Move to the next column of the second matrix, and next column of the result
  55 matrix. If there are no more columns in the second matrix, go back to first
  56 column (second matrix), and move to next row (first and result matrices).
  57 If there are no more rows left, result matrix has been fully computed.
  58 4. Repeat step 2.
  59 4. Move to the next row of the first matrix, and next column of the second matrix.
  60
  61 This for-loop uses the indices as shown above
  62
  63 ```
  64     for i in range(mat_X_num_rows):
  65         for k in range(0, mat_Y_num_cols):
  66             for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
  67                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
  68 ```
  69
  70 Calculations:
  71
  72 ```
  73     | 1 2 3 |   |  6  7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
  74     | 3 4 5 | * |  8  9 |   | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
  75                 | 10 11 |
  76
  77     | 1 2 3 |   |  6  7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
  78     | 3 4 5 | * |  8  9 |   | (18 + 32 + 50) (21 + 36 + 55) |
  79                 | 10 11 |
  80
  81     | 1 2 3 |   |  6  7 | = |  52  58 |
  82     | 3 4 5 | * |  8  9 |   | 100 112 |
  83                 | 10 11 |
  84 ```
  85
  86 For the algorithm, assign indeces to matrices as follows:
  87
  88 ```
  89     Index | 0 1 2 3 4 5 |
  90     Mat X | 1 2 3 3 4 5 |
  91
  92     Index | 0 1 2 3  4  5 |
  93     Mat Y | 6 7 8 9 10 11 |
  94
  95     Index |  0  1   2   3 |
  96     Mat Z | 52 58 100 112 |
  97 ```
  98
  99 (Start with the first row, then assign index left-to-right, top-to-bottom.)
 100
 101 Index list:
 102
 103 ```
 104     Mat X | Mat Y | Mat Z
 105       0   |   0   |   0
 106       1   |   2   |   0
 107       2   |   4   |   0
 108       0   |   1   |   1
 109       1   |   3   |   1
 110       2   |   5   |   1
 111       3   |   0   |   2
 112       4   |   2   |   2
 113       5   |   4   |   2
 114       3   |   1   |   3
 115       4   |   3   |   3
 116       5   |   5   |   3
 117 ```
 118
 119
 120 The issue with this algorithm is that the result matrix element is the same
 121 for three consecutive operations, and where each element is stored in CPU
 122 registers, the same register will be written to three times and thus causing
 123 consistent stalling.
 124
 125 ## Inner Product
 126
 127 A slight modification to the order of the loops in the algorithm massively
 128 reduces the chance of read-after-write hazards, as the result matrix
 129 element (and thus register) changes with every multiply-add operation.
 130
 131 The code:
 132
 133 ```
 134     for i in range(mat_X_num_rows):
 135         for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
 136             for k in range(0, mat_Y_num_cols):
 137                 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
 138 ```
 139
 140 Index list:
 141
 142 ```
 143     Mat X | Mat Y | Mat Z
 144       0   |   0   |   0
 145       0   |   1   |   1
 146       3   |   0   |   2
 147       3   |   1   |   3
 148       1   |   2   |   0
 149       1   |   3   |   1
 150       4   |   2   |   2
 151       4   |   3   |   3
 152       2   |   4   |   0
 153       2   |   5   |   1
 154       5   |   4   |   2
 155       5   |   5   |   3
 156 ```
 157
 158 The index for the result matrix changes with every operation, and thus the
 159 consecutive multiply-add instruction doesn't depend on the previous write
 160 register.
 161
 162 # SVP64 instructions implementing matrix multiply
 163
 164 * SVP64 assembler example:
 165 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
 166 * SVREMAP and SVSHAPE instructions defined in:
 167 [[sv/rfc/ls009]
 168 * Multiple-Add Low Doubleword instruction pseudo-code (OpenPOWER ISA 3.0C
 169 Book I, section 3.3.9): [[openpower/isa/fixedarith]]
 170
 171 *(Need to check if first arg of svremap correct, then one shown works with
 172 ISACaller)*
 173
 174     svshape 2, 2, 3, 0, 0
 175     svremap 31, 1, 2, 3, 0, 0, 0
 176     sv.maddld *0, *16, *32, *0
 177
 178 ## svshape
 179
 180 The `svshape` instruction is a convenient way to access the SHAPE Special
 181 Purpose Registers (SPRs), which were added alongside the SVP64 looping
 182 system for complex element indexing. Without having SHAPE SPRs, only the most
 183 basic, consecuting indexing of register elements (0,1,2,3...) would
 184 be possible.
 185
 186 ### SHAPE Remapping SPRs
 187
 188 * See [[sv/remap]] for the full break down of SPRs SHAPE0-3.
 189
 190 For Matrix Multiply, SHAPE0 SPR is used:
 191
 192 |0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31|
 193 |----- |----- | ------- | ------- | ------  |------|------ |---- |
 194 |xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |
 195
 196
 197 skip:
 198
 199 - 0b00 indicates no dimensions to be skipped
 200 - 0b01 - skip '1st dim'
 201 - 0b10 - skip '2nd dim'
 202 - 0b11 - skip '3rd dim'
 203
 204 invxyz (3-bit; 1 for x, 1 for y, 1 for z):
 205
 206 - If corresponding dim bit is zero, start index from zero and increment
 207 - If bit set, start from xdimsz-1 (x dimension size, or whichever dimension
 208 bit is being looked at) and decrement down to zero.
 209
 210 offset is used to offset the result by `offset` elements (important for when
 211 using element width overrides are used).
 212
 213 xdimsz, ydimsz, zdimsz are offset by 1, such that 0-0b111111 correspond to
 214 1-64. A value of xdimsz=2 would indicate that in the first dimension there are
 215 3 elements in the array.
 216
 217 With the example Matrix X (2 rows, 3 columns, or 2x3 matrix), xdimsz=1,
 218 ydimsz=2, zdimsz=0.
 219
 220 permute setting:
 221 | permute | order | array format             |
 222 | ------- | ----- | ------------------------ |
 223 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 224 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 225 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 226 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 227 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 228 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 229 | 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
 230 | 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 231
 232 Permute re-arranges the order of the nested for-loops used to iterate over the
 233 three dimensions. This allows for in-place transpose, in-place rotate, matrix
 234 multiply, convolutions, without the limitation of Power-of-Two matrices.
 235
 236 Limitations of Matrix REMAP are currently:
 237
 238 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
 239 (MAC), or other operations may be performed in total.
 240 For matrix multiply, it means both operand matrices and result matrix can
 241 have no more than 127 elements in total.
 242 (Larger matrices can be split into tiles to circumvent this issue, out
 243 of scope of this document).
 244 - `svshape` instruction only provides part of the Matrix REMAP capability.
 245 For rotation and mirroring, SVSHAPE SPRs must be programmed directly (thus
 246 requiring more assembler instructions). Future revisions of SVP64 will
 247 provide more comprehensive capacity, mitigating the need to write to SVSHAPE
 248 SPRs directly.
 249
 250 Going back to the assembler instruction used to setup the shape for matrix
 251 multiply:
 252
 253 ```
 254     svshape 2, 2, 3, 0, 0
 255 ```
 256
 257 breakdown:
 258
 259 - SVxd=2, SVyd=2, SVzd=3
 260 - SVRM=0 (Matrix mode, uses SHAPE0 SPR)
 261 -
 262
 263 ## SVREMAP
 264
 265
 266 ## Appendix
 267