1 # SVP64 REMAP Worked Example: Matrix Multiply
2
3 Links
4
5 * [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
6 * [[sv/remap]]
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=701>
8 * video <https://m.youtube.com/watch?v=BbhOA8ULKt4> slides
9 <https://ftp.libre-soc.org/openpower_2021.pdf>
10 * setvl instruction: [[sv/setvl]]
11 * REMAP python function based on yield (shows how indices are generated):
12 [[sv/remapyield.py]]
13
14 TODO: Include screenshots
15
16 One of the most powerful and versatile modes of the REMAP engine (a part of
17 the SVP64 feature set) is the ability to perform matrix
18 multiplication with all elements within a scalar register file.
19
20 This is done by converting the index used to iterate over the operand and
21 result matrices to the actual index inside the scalar register file.
22
23 ## Worked example - manual (conventional method)
24
25 The matrix multiply looks like this:
26
27 ```
28 mat_X * mat_Y = mat_Z
29 ```
30
The matrices need not be square (rows != columns). If matrix X has `a` rows
and `b` columns, and matrix Y has `b` rows and `c` columns, the dimensions of
the result are:
34
35 ```
36 X_axb * Y_bxc = Z_axc
37 ```
38
The result matrix has the number of rows of the first matrix and the number
of columns of the second matrix.
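
The dimension rule above can be sketched in a few lines of Python. The
function name `result_shape` is hypothetical and chosen purely for
illustration; it is not part of any Libre-SOC tool.

```python
def result_shape(x_shape, y_shape):
    """Return (rows, cols) of X * Y, or raise if the shapes do not match."""
    x_rows, x_cols = x_shape
    y_rows, y_cols = y_shape
    # The columns of X and the rows of Y are the shared dimension `b`.
    if x_cols != y_rows:
        raise ValueError("columns of X must equal rows of Y")
    return (x_rows, y_cols)


print(result_shape((2, 3), (3, 2)))  # (2, 2)
```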
41
42
43 For this example, the following values will be used for the operand matrices X
44 and Y, result Z shown for completeness.
45
    X = | 1 2 3 |   Y = |  6  7 |   Z = |  52  58 |
        | 3 4 5 |       |  8  9 |       | 100 112 |
                        | 10 11 |
49
50 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
51
52 To determine the final dimensions of the resultant matrix Z, take the number
53 of rows from matrix X (2) and number of columns from matrix Y (2).
54
The method usually taught to students in linear algebra courses is the
following (outer product):
57
58 1. Start with the first row of the first matrix, and first column of the
59 second matrix.
2. Multiply each element in the row by the corresponding element in the
   column, and add each product to the current value of the result matrix
   element (multiply-add-accumulate). Store the result in the first row,
   first column entry.
63 3. Move to the next column of the second matrix, and next column of the result
64 matrix. If there are no more columns in the second matrix, go back to first
65 column (second matrix), and move to next row (first and result matrices).
66 If there are no more rows left, result matrix has been fully computed.
67 4. Repeat step 2.
68
The equivalent nested for-loop, where `i` selects the row of X, `k` the
column of Y, and `j` the position along the shared dimension:
70
```
for i in range(0, mat_X_num_rows):
    for k in range(0, mat_Y_num_cols):
        for j in range(0, mat_X_num_cols):  # or mat_Y_num_rows
            mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
```
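
As a runnable sketch, the same loop applied to the example matrices might
look as follows. The function name `matmul_outer_order` is hypothetical; the
matrices are stored as plain nested Python lists.

```python
mat_X = [[1, 2, 3],
         [3, 4, 5]]
mat_Y = [[6, 7],
         [8, 9],
         [10, 11]]


def matmul_outer_order(x, y):
    """Multiply x by y using the i, k, j loop order described above."""
    x_rows, x_cols, y_cols = len(x), len(x[0]), len(y[0])
    z = [[0] * y_cols for _ in range(x_rows)]
    for i in range(x_rows):
        for k in range(y_cols):
            for j in range(x_cols):  # shared dimension
                z[i][k] += x[i][j] * y[j][k]
    return z


print(matmul_outer_order(mat_X, mat_Y))  # [[52, 58], [100, 112]]
```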
77
78 Calculations:
79
```
| 1 2 3 |   |  6  7 | = | (1*6 + 2*8 + 3*10)  (1*7 + 2*9 + 3*11) |
| 3 4 5 | * |  8  9 |   | (3*6 + 4*8 + 5*10)  (3*7 + 4*9 + 5*11) |
            | 10 11 |

| 1 2 3 |   |  6  7 | = | ( 6 + 16 + 30)  ( 7 + 18 + 33) |
| 3 4 5 | * |  8  9 |   | (18 + 32 + 50)  (21 + 36 + 55) |
            | 10 11 |

| 1 2 3 |   |  6  7 | = |  52  58 |
| 3 4 5 | * |  8  9 |   | 100 112 |
            | 10 11 |
```
93
For the algorithm, assign indices to matrices as follows:
95
96 ```
97 Index | 0 1 2 3 4 5 |
98 Mat X | 1 2 3 3 4 5 |
99
100 Index | 0 1 2 3 4 5 |
101 Mat Y | 6 7 8 9 10 11 |
102
103 Index | 0 1 2 3 |
104 Mat Z | 52 58 100 112 |
105 ```
106
107 (Start with the first row, then assign index left-to-right, top-to-bottom.)
108
109 Index list:
110
111 ```
112 | Mat X | Mat Y | Mat Z |
113 | 0 | 0 | 0 |
114 | 1 | 2 | 0 |
115 | 2 | 4 | 0 |
116 | 0 | 1 | 1 |
117 | 1 | 3 | 1 |
118 | 2 | 5 | 1 |
119 | 3 | 0 | 2 |
120 | 4 | 2 | 2 |
121 | 5 | 4 | 2 |
122 | 3 | 1 | 3 |
123 | 4 | 3 | 3 |
124 | 5 | 5 | 3 |
125 ```
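
The index list above can be reproduced by flattening each matrix in
row-major order (left-to-right, top-to-bottom). The sketch below assumes
that layout; the function name `outer_order_indices` is hypothetical.

```python
def outer_order_indices(x_rows, x_cols, y_cols):
    """Yield (X index, Y index, Z index) for the i, k, j loop order."""
    triples = []
    for i in range(x_rows):
        for k in range(y_cols):
            for j in range(x_cols):
                x_idx = i * x_cols + j  # row i, column j of X
                y_idx = j * y_cols + k  # row j, column k of Y
                z_idx = i * y_cols + k  # row i, column k of Z
                triples.append((x_idx, y_idx, z_idx))
    return triples


for triple in outer_order_indices(2, 3, 2):
    print(triple)
```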
126
127 Worked example broken down into individual multiply-add accumulates:
128
129 [[!img outer_product_worked_example.jpg size="600x" ]]
130
The issue with this algorithm is that the result matrix element is the same
for three consecutive operations. When each element is stored in a CPU
register, the same register is written three times in a row, so each
multiply-add has to wait for the previous one, causing repeated stalls.
135
136 ## Inner Product
137
138 A slight modification to the order of the loops in the algorithm massively
139 reduces the chance of read-after-write hazards, as the result matrix
140 element (and thus register) changes with every multiply-add operation.
141
142 The code:
143
```
for j in range(0, mat_X_num_cols):  # or mat_Y_num_rows
    for i in range(0, mat_X_num_rows):
        for k in range(0, mat_Y_num_cols):
            mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
```
150
151 Index list:
152
153 ```
154 | Mat X | Mat Y | Mat Z |
155 | 0 | 0 | 0 |
156 | 0 | 1 | 1 |
157 | 3 | 0 | 2 |
158 | 3 | 1 | 3 |
159 | 1 | 2 | 0 |
160 | 1 | 3 | 1 |
161 | 4 | 2 | 2 |
162 | 4 | 3 | 3 |
163 | 2 | 4 | 0 |
164 | 2 | 5 | 1 |
165 | 5 | 4 | 2 |
166 | 5 | 5 | 3 |
167 ```
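
The index list above corresponds to iterating with `j` (the shared
dimension) as the outermost loop, then `i`, then `k`. A minimal sketch that
regenerates it, again assuming row-major flattening and using the
hypothetical name `inner_order_indices`:

```python
def inner_order_indices(x_rows, x_cols, y_cols):
    """Yield (X index, Y index, Z index) for the j, i, k loop order."""
    triples = []
    for j in range(x_cols):          # shared dimension, outermost
        for i in range(x_rows):
            for k in range(y_cols):  # innermost: Z index changes each step
                triples.append((i * x_cols + j,
                                j * y_cols + k,
                                i * y_cols + k))
    return triples


for triple in inner_order_indices(2, 3, 2):
    print(triple)
```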
168
169 Worked example for inner product:
170
171 [[!img inner_product_worked_example.jpg size="600x" ]]
172
The index for the result matrix changes with every operation, so each
multiply-add instruction no longer depends on the register written by the
instruction immediately before it.
176
177 Outer and inner product indices side-by-side:
178
179 ```
180 | Outer Product | Inner Product |
181 | Mat X | Mat Y | Mat Z | Mat X | Mat Y | Mat Z |
182 | 0 | 0 | 0 | 0 | 0 | 0 |
183 | 1 | 2 | 0 | 0 | 1 | 1 |
184 | 2 | 4 | 0 | 3 | 0 | 2 |
185 | 0 | 1 | 1 | 3 | 1 | 3 |
186 | 1 | 3 | 1 | 1 | 2 | 0 |
187 | 2 | 5 | 1 | 1 | 3 | 1 |
188 | 3 | 0 | 2 | 4 | 2 | 2 |
189 | 4 | 2 | 2 | 4 | 3 | 3 |
190 | 5 | 4 | 2 | 2 | 4 | 0 |
191 | 3 | 1 | 3 | 2 | 5 | 1 |
192 | 4 | 3 | 3 | 5 | 4 | 2 |
193 | 5 | 5 | 3 | 5 | 5 | 3 |
194 ```
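
One rough way to compare the two schedules is to count how often the same
result index is used by two back-to-back operations. This is only an
illustration using the Mat Z columns of the table above; it is not a model
of any real pipeline, and the function name is hypothetical.

```python
# Mat Z columns copied from the side-by-side table.
outer_z = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
inner_z = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]


def back_to_back_reuse(z_indices):
    """Count adjacent operations that target the same result element."""
    return sum(1 for prev, cur in zip(z_indices, z_indices[1:])
               if prev == cur)


print(back_to_back_reuse(outer_z))  # 8
print(back_to_back_reuse(inner_z))  # 0
```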
195
196 # SVP64 instructions implementing matrix multiply
197
198 * SVP64 assembler example:
199 [unit test](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=30f2d8a8e92ad2939775f19e6a0f387499e9842b#l56)
* svremap and svshape instructions defined in:
  [[sv/rfc/ls009]]
* Multiply-Add Low Doubleword instruction pseudo-code (Power ISA 3.0C
  Book I, section 3.3.9): [[openpower/isa/fixedarith]]
204
*(Need to check whether the first argument of svremap is correct; the one
shown works with ISACaller)*
207
208 ```
209 svshape 2, 2, 3, 0, 0
210 svremap 31, 1, 2, 3, 0, 0, 0
211 sv.maddld *0, *16, *32, *0
212 ```
213
214 ## svshape
215
The `svshape` instruction is a convenient way to access the `SVSHAPE` Special
Purpose Registers (SPRs), which were added alongside the SVP64 looping
system for complex element indexing. Without these "Re-shaping" SPRs, only the
most basic, consecutive indexing of register elements (0,1,2,3...) would
be possible.
221
222 ### SVSHAPE Remapping SPRs
223
224 * See [[sv/remap]] for the full break down of SPRs `SVSHAPE0-3`.
225
For Matrix Multiply, the `SVSHAPE0` SPR is used:
227
228 |0:5 |6:11 | 12:17 | 18:20 | 21:23 |24:27 |28:29 |30:31|
229 |----- |----- | ------- | ------- | ------ |------|------ |---- |
230 |xdimsz|ydimsz| zdimsz | permute | invxyz |offset|skip |mode |
231
232
233 skip:
234
235 - 0b00 indicates no dimensions to be skipped
236 - 0b01 - skip '1st dim'
237 - 0b10 - skip '2nd dim'
238 - 0b11 - skip '3rd dim'
239
240 invxyz (3-bit; 1 for x, 1 for y, 1 for z):
241
242 - If corresponding dim bit is zero, start index from zero and increment
243 - If bit set, start from xdimsz-1 (x dimension size, or whichever dimension
244 bit is being looked at) and decrement down to zero.
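
The effect of one `invxyz` bit on one dimension can be sketched as below.
This is a simplified illustration of the description above, not the
reference implementation in [[sv/remapyield.py]]; the function name is
hypothetical and `size` is the actual number of elements in the dimension.

```python
def dimension_indices(size, invert):
    """Return the index sequence for one dimension of `size` elements."""
    if invert:
        # Bit set: start from size-1 and decrement down to zero.
        return list(range(size - 1, -1, -1))
    # Bit clear: start from zero and increment.
    return list(range(size))


print(dimension_indices(3, invert=False))  # [0, 1, 2]
print(dimension_indices(3, invert=True))   # [2, 1, 0]
```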
245
offset is used to offset the result by `offset` elements (important when
element width overrides are used).
248
249 xdimsz, ydimsz, zdimsz are offset by 1, such that 0-0b111111 correspond to
250 1-64. A value of xdimsz=2 would indicate that in the first dimension there are
251 3 elements in the array.
252
253 With the example Matrix X (2 rows, 3 columns, or 2x3 matrix), xdimsz=1,
254 ydimsz=2, zdimsz=0.
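
The "offset by 1" encoding of the dimension fields can be sketched as
follows. The helper names `encode_dimsz` and `decode_dimsz` are
hypothetical; only the size-minus-one rule and the 1 to 64 range are taken
from the text above.

```python
def encode_dimsz(size):
    """Convert an element count (1..64) to the 6-bit field value (0..63)."""
    if not 1 <= size <= 64:
        raise ValueError("dimension size must be between 1 and 64")
    return size - 1


def decode_dimsz(field):
    """Convert a 6-bit field value (0..63) back to an element count."""
    if not 0 <= field <= 0b111111:
        raise ValueError("field must fit in 6 bits")
    return field + 1


# Sizes 2, 3 and 1 give the field values quoted above for Matrix X.
print([encode_dimsz(n) for n in (2, 3, 1)])  # [1, 2, 0]
```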
255
256 permute setting:
257
258 | permute | order | array format |
259 | ------- | ----- | ------------------------ |
260 | 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
261 | 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
262 | 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
263 | 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
264 | 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
265 | 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
266 | 110 | 0,1 | Indexed (xdim+1)(ydim+1) |
267 | 111 | 1,0 | Indexed (ydim+1)(xdim+1) |
268
269 Permute re-arranges the order of the nested for-loops used to iterate over the
270 three dimensions. This allows for in-place transpose, in-place rotate, matrix
271 multiply, convolutions, without the limitation of Power-of-Two matrices.
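
The six non-Indexed rows of the permute table are the permutations of
(0, 1, 2) in lexicographic order, so the table can be regenerated rather
than memorised. The sketch below only reproduces the table columns; the
names `PERMUTE_ORDER` and `array_format` are hypothetical, and nothing here
models how the hardware schedules the loops.

```python
from itertools import permutations

DIM_NAMES = ("xdim", "ydim", "zdim")

# permutations() yields tuples in lexicographic order, which matches
# permute codes 0b000 to 0b101 in the table above.
PERMUTE_ORDER = dict(enumerate(permutations(range(3))))


def array_format(permute):
    """Return the 'array format' column for a non-Indexed permute code."""
    if permute not in PERMUTE_ORDER:
        raise ValueError("0b110 and 0b111 are Indexed modes, not covered")
    return "".join(f"({DIM_NAMES[d]}+1)" for d in PERMUTE_ORDER[permute])


print(PERMUTE_ORDER[0b010])  # (1, 0, 2)
print(array_format(0b010))   # (ydim+1)(xdim+1)(zdim+1)
```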
272
273 For normal matrix multiply, the permute setting is 0b010 (order 1,0,2,
274 or swap x and y loops).
275
276 (*NOTE:* This is done automatically by the Matrix-Multiply REMAP mode, `SVRM=0`.)
277
278 Limitations of Matrix REMAP are currently:
279
280 - Vector Length (VL) limited to 127, and up to 127 Multiply-Add Accumulates
281 (MAC), or other operations may be performed in total.
282 For matrix multiply, it means both operand matrices and result matrix can
283 have no more than 127 elements in total.
284 (Larger matrices can be split into tiles to circumvent this issue, out
285 of scope of this document).
286 - `svshape` instruction only provides part of the Matrix REMAP capability.
287 For rotation and mirroring, `SVSHAPE` SPRs must be programmed directly (thus
288 requiring more assembler instructions). Future revisions of SVP64 will
289 provide more comprehensive capacity, mitigating the need to write to `SVSHAPE`
290 SPRs directly.
291
292 Going back to the assembler instruction used to setup the shape for matrix
293 multiply:
294
295 ```
296 svshape 2, 2, 3, 0, 0
297 ```
298
299 breakdown:
300
301 - SVxd=2, SVyd=2, SVzd=3
302 - SVRM=0 (Matrix mode, uses `SVSHAPE0` SPR)
303 - vf=0 (not using Vertical-First mode)
304
305 To determine the `SVxd`/`SVyd`/`SVzd` settings:
306
307 - `SVxd` is equal to the number of columns in the second operand matrix.
308 Matrix Y has 2 columns, so `SVxd=2`.
309 - `SVyd` is equal to the number of rows in the first operand matrix.
310 Matrix X has 2 rows, so `SVyd=2`.
311 - `SVzd` is equal to the number of columns in the first operand matrix,
312 or the number of rows in the second operand matrix.
313 Matrix X has 3 columns, matrix Y has 3 rows, so `SVzd=3`.
314
315 Table form
316
317 ```
318 SVxd | mat_Y_num_cols
319 SVyd | mat_X_num_rows
320 SVzd | mat_X_num_cols OR mat_Y_num_rows
321 ```
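
The table above can be turned into a small helper that derives the three
`svshape` dimension operands from the operand matrix shapes. This is a
sketch under the rules stated in this document; the function name
`svshape_dims` is hypothetical and the check against 127 reflects the limit
on the total number of operations quoted earlier.

```python
MAX_OPERATIONS = 127  # limit on total operations quoted in this document


def svshape_dims(x_shape, y_shape):
    """Return (SVxd, SVyd, SVzd) for multiplying X (a x b) by Y (b x c)."""
    x_rows, x_cols = x_shape
    y_rows, y_cols = y_shape
    if x_cols != y_rows:
        raise ValueError("columns of X must equal rows of Y")
    svxd, svyd, svzd = y_cols, x_rows, x_cols
    # One multiply-add is needed per (row of X, column of Y, shared index).
    if svxd * svyd * svzd > MAX_OPERATIONS:
        raise ValueError("too many multiply-adds for a single schedule")
    return (svxd, svyd, svzd)


print(svshape_dims((2, 3), (3, 2)))  # (2, 2, 3)
```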
322
323 ## SVREMAP
324
325 ```
326 svremap 31, 1, 2, 3, 0, 0, 0
327 ```
328
Assigns the configured SVSHAPEs to the relevant operand/result registers
of the following instruction(s) (depending on whether REMAP is set to
persistent).
331
332 ## maddld - Multiply-Add Low Doubleword VA-form
333
334 ```
335 sv.maddld *0, *16, *32, *0
336 ```
337
338 A standard instruction available since version 3.0 of PowerISA.
339
340 *Temporary note:* maddld (Multiply-Add Low Doubleword) in the 3.1b version
341 of the PowerISA spec is in the Linux Compliancy Subset, not SFS or SFFS.
342 See page 1477 of the document, or page 1503 of the pdf.
343
This instruction can be used as a multiply-add accumulate by setting the
third source operand (RC, the addend) to be the same as the result register
(RT), which then functions as an accumulator.
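
To see how the accumulation plays out, the sketch below models `maddld` as
"low 64 bits of RA * RB + RC" and applies it in the order of the inner
product index list. It is a simplified illustration, not ISACaller: the
flat 128-entry register list, the placement of X at r16, Y at r32 and Z at
r0 (read off the `sv.maddld` operands), and the hard-coded schedule are all
assumptions made for this example.

```python
MASK64 = (1 << 64) - 1


def maddld(ra, rb, rc):
    """Model of maddld: low doubleword of ra * rb + rc."""
    return (ra * rb + rc) & MASK64


regs = [0] * 128                         # assumed flat register file
regs[16:22] = [1, 2, 3, 3, 4, 5]         # Matrix X, row-major
regs[32:38] = [6, 7, 8, 9, 10, 11]       # Matrix Y, row-major

# (X index, Y index, Z index) taken from the inner product index list.
schedule = [(0, 0, 0), (0, 1, 1), (3, 0, 2), (3, 1, 3),
            (1, 2, 0), (1, 3, 1), (4, 2, 2), (4, 3, 3),
            (2, 4, 0), (2, 5, 1), (5, 4, 2), (5, 5, 3)]

for x, y, z in schedule:
    # RT and RC are the same register, so it acts as the accumulator.
    regs[0 + z] = maddld(regs[16 + x], regs[32 + y], regs[0 + z])

print(regs[0:4])  # [52, 58, 100, 112]
```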
347
348 ## Appendix
349
350
351 [[!tag svp64_cookbook ]]
352