71ecad5f1b02e480bd34a0bba06b9322556c9a54
[libreriscv.git] / openpower / sv / cookbook / remap_matrix.mdwn
1 # SVP64 REMAP Worked Example: Matrix Multiply
2
3 Links
4
5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
6 * [[sv/remap]]
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
8 * video <https://m.youtube.com/watch?v=BbhOA8ULKt4> slides
9 <https://ftp.libre-soc.org/openpower_2021.pdf>
10 * setvl instruction: [[sv/setvl]]
11 * REMAP python function based on yield (shows how indices are generated):
12 [[sv/remapyield.py]]
13
14 One of the most powerful and versatile modes of the REMAP engine (a part of
15 the SVP64 feature set) is the ability to perform matrix
16 multiplication with all elements within a scalar register file.
17
18 This is done by converting the index used to iterate over the operand and
19 result matrices to the actual index inside the scalar register file.
20
21 ## Worked example - manual (conventional method)
22
23 The matrix multiply looks like this:
24
25 ```
26 mat_X * mat_Y = mat_Z
27 ```
28
29 When multiplying non-square matrices (rows != columns), to determine the
30 dimension of the result when matrix X has `a` rows and `b` columns and matrix
31 Y has `b` rows and `c` columns:
32
33 ```
34 X_axb * Y_bxc = Z_axc
35 ```
36
37 The result matrix will have number of rows of the first matrix, and number of
38 columns of the second matrix.
39
40
41 For this example, the following values will be used for the operand matrices X
42 and Y, result Z shown for completeness.
43
44 X =| 1 2 3 | Y = | 6 7 | Z = | 52 58 |
45 | 3 4 5 | | 8 9 | | 100 112 |
46 | 10 11 |
47
48 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
49
50 To determine the final dimensions of the resultant matrix Z, take the number
51 of rows from matrix X (2) and number of columns from matrix Y (2).
52
53 The method usually taught in linear algebra course to students is the
54 following (outer product):
55
56 1. Start with the first row of the first matrix, and first column of the
57 second matrix.
58 2. Multiply each element in the row by each element in the column, and sum
59 with the current value of the result matrix element (multiply-add-accumulate).
60 Store result in the row matching first operand row and column matching second
61 operand column.
62 3. Move to the next column of the second matrix, and next column of the result
63 matrix. If there are no more columns in the second matrix, go back to first
64 column (second matrix), and move to next row (first and result matrices).
65 If there are no more rows left, result matrix has been fully computed.
66 4. Repeat step 2.
67
68 This for-loop uses the indices as shown above
69
70 ```
71 for i in range(0, mat_X_num_rows):
72 for k in range(0, mat_Y_num_cols):
73 for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
74 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
75 ```
76
77 Calculations:
78
79 ```
80 | 1 2 3 | | 6 7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
81 | 3 4 5 | * | 8 9 | | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
82 | 10 11 |
83
84 | 1 2 3 | | 6 7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
85 | 3 4 5 | * | 8 9 | | (18 + 32 + 50) (21 + 36 + 55) |
86 | 10 11 |
87
88 | 1 2 3 | | 6 7 | = | 52 58 |
89 | 3 4 5 | * | 8 9 | | 100 112 |
90 | 10 11 |
91 ```
92
93 For the algorithm, assign indeces to matrices as follows:
94
95 ```
96 Index | 0 1 2 3 4 5 |
97 Mat X | 1 2 3 3 4 5 |
98
99 Index | 0 1 2 3 4 5 |
100 Mat Y | 6 7 8 9 10 11 |
101
102 Index | 0 1 2 3 |
103 Mat Z | 52 58 100 112 |
104 ```
105
106 (Start with the first row, then assign index left-to-right, top-to-bottom.)
107
108 Index list:
109
110 ```
111 | Mat X | Mat Y | Mat Z |
112 | 0 | 0 | 0 |
113 | 1 | 2 | 0 |
114 | 2 | 4 | 0 |
115 | 0 | 1 | 1 |
116 | 1 | 3 | 1 |
117 | 2 | 5 | 1 |
118 | 3 | 0 | 2 |
119 | 4 | 2 | 2 |
120 | 5 | 4 | 2 |
121 | 3 | 1 | 3 |
122 | 4 | 3 | 3 |
123 | 5 | 5 | 3 |
124 ```
125
126 Worked example broken down into individual multiply-add accumulates:
127
128 [[!img outer_product_worked_example.jpg size="500x" ]]
129
130 The issue with this algorithm is that the result matrix element is the same
131 for three consecutive operations, and where each element is stored in CPU
132 registers, the same register will be written to three times and thus causing
133 consistent stalling.
134
135 ## Inner Product
136
137 A slight modification to the order of the loops in the algorithm massively
138 reduces the chance of read-after-write hazards, as the result matrix
139 element (and thus register) changes with every multiply-add operation.
140
141 The code:
142
143 ```
144 for i in range(mat_X_num_rows):
145 for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
146 for k in range(0, mat_Y_num_cols):
147 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
148 ```
149
150 Index list:
151
152 ```
153 | Mat X | Mat Y | Mat Z |
154 | 0 | 0 | 0 |
155 | 0 | 1 | 1 |
156 | 3 | 0 | 2 |
157 | 3 | 1 | 3 |
158 | 1 | 2 | 0 |
159 | 1 | 3 | 1 |
160 | 4 | 2 | 2 |
161 | 4 | 3 | 3 |
162 | 2 | 4 | 0 |
163 | 2 | 5 | 1 |
164 | 5 | 4 | 2 |
165 | 5 | 5 | 3 |
166 ```
167
168 Worked example for inner product:
169
170 [[!img inner_product_worked_example.jpg size="500x" ]]
171
172 The index for the result matrix changes with every operation, and thus the
173 consecutive multiply-add instruction doesn't depend on the previous write
174 register.
175
176 Outer and inner product indices side-by-side:
177
178 ```
179 | Outer Product | Inner Product |
180 | Mat X | Mat Y | Mat Z | Mat X | Mat Y | Mat Z |
181 | 0 | 0 | 0 | 0 | 0 | 0 |
182 | 1 | 2 | 0 | 0 | 1 | 1 |
183 | 2 | 4 | 0 | 3 | 0 | 2 |
184 | 0 | 1 | 1 | 3 | 1 | 3 |
185 | 1 | 3 | 1 | 1 | 2 | 0 |
186 | 2 | 5 | 1 | 1 | 3 | 1 |
187 | 3 | 0 | 2 | 4 | 2 | 2 |
188 | 4 | 2 | 2 | 4 | 3 | 3 |
189 | 5 | 4 | 2 | 2 | 4 | 0 |
190 | 3 | 1 | 3 | 2 | 5 | 1 |
191 | 4 | 3 | 3 | 5 | 4 | 2 |
192 | 5 | 5 | 3 | 5 | 5 | 3 |
193 ```
194
195 # SVP64 instructions implementing matrix multiply
196
197 * SVP64 assembler example:
198 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
199 * svremap and svshape instructions defined in:
200 [[sv/rfc/ls009]
201 * Multiple-Add Low Doubleword instruction pseudo-code (Power ISA 3.0C
202 Book I, section 3.3.9): [[openpower/isa/fixedarith]]
203
204 *(Need to check if first arg of svremap correct, then one shown works with
205 ISACaller)*
206
207 ```
208 svshape 2, 2, 3, 0, 0
209 svremap 31, 1, 2, 3, 0, 0, 0
210 sv.maddld *0, *16, *32, *0
211 ```
212
213 ## svshape
214
215 The `svshape` instruction is a convenient way to access the `SVSHAPE` Special
216 Purpose Registers (SPRs), which were added alongside the SVP64 looping
217 system for complex element indexing. Without having "Re-shaping" SPRs,
218 only the most basic, consecuting indexing of register elements (0,1,2,3...)
219 would be possible.
220
221 The REMAP system has 16 modes, all of which are accessible through the
222 `svshape` instruction. However for the purpose of this guide, only SVrm=0
223 (Matrix Multiply) will be covered.
224
225 ### SVSHAPE Remapping SPRs
226
227 * See [[sv/remap]] for the full break down of SPRs `SVSHAPE0-3`.
228 * Pseudo-code for the
229 [svshape instruction](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/simplev.mdwn;h=33a02e#l120)
230 * Pseude-code also available on the wiki: [[openpower/sv/appendix]].
231
232 Matrix Multiply utilises SVSHAPE0-2 SPRs.
233
234 The index table shown for the inner method above shows indices for a 'flattened'
235 matrix (how it would be arranged in sequential GPR registers), whereas
236 SVSHAPE0, 1, 2 registers setup the indices in relation to rows and columns
237 of the matrix.
238
239 This is how the indices compare:
240
241 ```
242 Row/Column Indices
243 Flattened Indices | Mat X | Mat Y | Mat Z |
244 | Mat X | Mat Y | Mat Z | | r c | r c | r c |
245 | 0 | 0 | 0 | | 0 0 | 0 0 | 0 0 |
246 | 0 | 1 | 1 | | 0 0 | 0 1 | 0 1 |
247 | 3 | 0 | 2 | | 1 0 | 0 0 | 1 0 |
248 | 3 | 1 | 3 | | 1 0 | 0 1 | 1 1 |
249 | 1 | 2 | 0 | | 0 1 | 1 0 | 0 0 |
250 | 1 | 3 | 1 | | 0 1 | 1 1 | 0 1 |
251 | 4 | 2 | 2 | | 1 1 | 1 0 | 1 0 |
252 | 4 | 3 | 3 | | 1 1 | 1 1 | 1 1 |
253 | 2 | 4 | 0 | | 0 2 | 2 0 | 0 0 |
254 | 2 | 5 | 1 | | 0 2 | 2 1 | 0 1 |
255 | 5 | 4 | 2 | | 1 2 | 2 0 | 1 0 |
256 | 5 | 5 | 3 | | 1 2 | 2 1 | 1 1 |
257 ```
258
259 These row/column indices are converted to the flattened indices when actually
260 used when SVP64 looping is going on (during the `maddld` hot loop).
261
262 See [[openpower/sv/remap]] Section 3.3 Matrix Mode for more information on
263 the index sequences which can be produced with SVSHAPE SPRs.
264
265
266 ### Limitations of Matrix REMAP
267
268 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
269 (MAC), or other operations may be performed in total.
270 For matrix multiply, it means both operand matrices and result matrix can
271 have no more than 127 elements in total.
272 (Larger matrices can be split into tiles to circumvent this issue, out
273 of scope of this document).
274 - `svshape` instruction only provides part of the Matrix REMAP capability.
275 For rotation and mirroring, `SVSHAPE` SPRs must be programmed directly (thus
276 requiring more assembler instructions). Future revisions of SVP64 will
277 provide more comprehensive capacity, mitigating the need to write to `SVSHAPE`
278 SPRs directly.
279
280 Going back to the assembler instruction used to setup the shape for matrix
281 multiply:
282
283 ```
284 svshape 2, 2, 3, 0, 0
285 ```
286
287 breakdown:
288
289 - `SVxd=2`, `SVyd=2`, `SVzd=3`
290 - `SVrm=0` (Matrix mode)
291 - `vf=0` (not using Vertical-First mode)
292
293 To determine the `SVxd`/`SVyd`/`SVzd` settings:
294
295 - `SVxd` is equal to the number of columns in the second operand matrix.
296 Matrix Y has 2 columns, so `SVxd=2`.
297 - `SVyd` is equal to the number of rows in the first operand matrix.
298 Matrix X has 2 rows, so `SVyd=2`.
299 - `SVzd` is equal to the number of columns in the first operand matrix,
300 or the number of rows in the second operand matrix.
301 Matrix X has 3 columns, matrix Y has 3 rows, so `SVzd=3`.
302
303 Table form
304
305 ```
306 SVxd | mat_Y_num_cols
307 SVyd | mat_X_num_rows
308 SVzd | mat_X_num_cols OR mat_Y_num_rows
309 ```
310
311 The `svshape` instruction will do the following (for Matrix Multiply REMAP):
312
313 - The vector length `VL` of the SVP64 loop is determined based on the three
314 dimensions: `VL <- xd * yd * zd`. For this example should be 12, since there
315 will be 12 multiply-add accumulates to fully compute the result matrix.
316 - `SVSHAPE0`, `SVSHAPE1`, and `SVSHAPE2` SPRs are configured with the x/y/z
317 dimensions. As each SVSHAPE register supports three sets of indices (three
318 loops), the third index z is skipped (because we're dealing with 2d matrices).
319 - Other modifications done by `svshape` (such as `SVSTATE` SPR) are
320 out-of-scope for this document.
321
322 ## SVREMAP
323
324 * See [[sv/remap]] for the `svremap` instruction.
325
326 Assigns the configured SVSHAPEs to the relevant operand/result registers
327 of the consecutive instruction/s (depending on if REMAP is set to persistent).
328
329 * svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
330
331 ```
332 svremap 31, 1, 2, 3, 0, 0, 0
333 ```
334
335 breakdown:
336
337 - `SVme=31`, 5-bit bitmask determines which registers have REMAP applied.
338 Bitfields are: `RA|RB|RC|RT|EA/FRS` In this example, all input registers
339 (RA, RB, RC) of *any* scalar instruction to follow and output register (RT)
340 will have REMAP applied.
341 - `mix/mox` fields determine which shape is applied to the activated register
342 - `mi0=1`, instruction operand RA has SVSHAPE1 applied to it.
343 - `mi1=2`, instruction operand RB has SVSHAPE2 applied to it.
344 - `mi2=3`, instruction operand RC has SVSHAPE3 applied to it.
345 - `mo0=0`, instruction result RT has SVSHAPE0 applied to it.
346 - `mo1=0`, instruction result EA/FRS has SVSHAPE0 applied to it.
347 *(not applicable for this example)*
348 - `pst=0`, if set, REMAP remains enabled until explicitly disabled, or another
349 REMAP, or setvl is setup.
350
351
352 ## maddld - Multiply-Add Low Doubleword VA-form
353
354 A standard instruction available since version 3.0 of PowerISA.
355
356 This instruction can be used as a multiply-add accumulate by setting the
357 third operand to be the same as the result register, which functions as
358 an accumulator.
359
360 ```
361 sv.maddld *0, *16, *32, *0
362 ```
363
364 breakdown:
365
366 - Store result (RT) of the operation starting at register 0.
367 - Operands RA and RB correspond to the two operand matrices, starting at
368 register 16 and register 32 respectively.
369 - The third operand RC is the same as the result register, which gives the
370 multiply-add accumulate behaviour.
371
372 ## Appendix
373
374 ### Running the simulator test case
375
376 - Set up the LibreSOC development environment (need to have openpower-isa repo
377 setup and installed).
378
379 $: cd /PATH/TO/src/openpower-isa/src/openpower/decoder/isa/
380 $: python3 test_caller_svp64_matrix.py >& /tmp/f
381
382 (All test cases are enabled by default; extra test cases can be disabled by changing
383 `test_` to `tst_` or any other prefix.)
384
385 The simulator dump will be stored as in the temporary directory.
386
387 Particular text strings to to search for:
388
389 - `shape remap` - shows the indices corresponding to the operand and result matrices
390 - `flattened X,Y,expected` - shows the expected matrix result which will be used to
391 validate the result coming from the simulator.
392 - `maddld-matrix` - shows the result matrix element values coming from the simulator.
393 These values must match the expect result calculated before the simulator call.
394
395 If the matrix multiply worked successfully, the simulator dump will have an `OK`
396 string after the `Ran 1 test in...` summary.
397
398 Screenshots:
399
400 [[!img mat_mul_sim_results.png size="600x" ]]
401
402 ### Remapyield showing how the matrix indices are generated
403
404 *(more work required not critically important right now)*
405
406 This table was drawn up using the `remapyield.py` code found
407 [here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;h=d7b4bdeff2ae5d5f319ae036e4dff70de194dc67;hb=HEAD).
408
409 The xdim, ydim, and zdim were set to match the values specified in the svshape
410 instruction for the above matrix multiply example.
411
412 `x`, 'y`, and `z` are iterators for the three loops, with a range between 0 and
413 `SVxd-1`, `SVyd-1`, `SVzd-1` respectively.
414
415 For the above `svshape` instruction example:
416
417 - `x` takes values between 0 and 1 (`SVxd=2`)
418 - `y` takes values between 0 and 1 (`SVyd=2`)
419 - `z` takes values between 0 and 2 (`SVzd=3`)
420 - `x_end`, `y_end`, and `z_end` correspond to the largest permitted values for
421 `x`, `y`, and `z` (1, 1, and 2 respectively).
422
423 ```
424 | (x == | (y == | (z ==
425 index | x | y | z | x_end) | y_end) | z_end)
426 0 | 0 | 0 | 0 | F | F | F
427 1 | 1 | 0 | 0 | T | F | F
428 2 | 0 | 1 | 0 | F | T | F
429 3 | 1 | 1 | 0 | T | T | F
430 4 | 0 | 0 | 1 | F | F | F
431 5 | 1 | 0 | 1 | T | F | F
432 6 | 0 | 1 | 1 | F | T | F
433 7 | 1 | 1 | 1 | T | T | F
434 8 | 0 | 0 | 2 | F | F | T
435 9 | 1 | 0 | 2 | T | F | T
436 10 | 0 | 1 | 2 | F | T | T
437 11 | 1 | 1 | 2 | T | T | T
438 ```
439
440 If the `x`, `y`, and `z` sequences are compared with the inner product index
441 table, there are some resemblances. Swapping the `x` and `y` (permute=0b10)
442 shows that matrix X indices correspond to `y`, matrix Y indices correspond to
443 `x` (if the following equation used: `x+z*z_end`), and matrix Z indices
444 correspond to `z`.
445
446 [[!tag svp64_cookbook ]]
447