Add outer/inner loop info
[libreriscv.git] / openpower / sv / cookbook / remap_matrix.mdwn
1 # SVP64/Vector+1 REMAP Worked Example: Matrix Multiply
2
3 <https://bugs.libre-soc.org/show_bug.cgi?id=701>
4
5 TODO: Include screenshots
6
7 One of the most powerful and versatile modes of the REMAP engine (a part of
8 the SVP64/Vector+1 feature set) is the ability to perform matrix
9 multiplication with all elements within a scalar register file.
10
11 This is done by converting the index used to iterate over the operand and
12 result matrices to the actual index inside the scalar register file.
13
14 ## Worked example - manual (conventional method)
15
16 The matrix multiply looks like this:
17
18 mat_X * mat_Y = mat_Z
19
20 When multiplying non-square matrices (rows != columns), to determine the
21 dimension of the result when matrix X has `a` rows and `b` columns and matrix
22 Y has `b` rows and `c` columns:
23
24 X_axb * Y_bxc = Z_axc
25
26 The result matrix will have number of rows of the first matrix, and number of
27 columns of the second matrix.
28
29
30 For this example, the following values will be used for the operand matrices X
31 and Y, result Z shown for completeness.
32
33 X =| 1 2 3 | Y = | 6 7 | Z = | 52 58 |
34 | 3 4 5 | | 8 9 | | 100 112 |
35 | 10 11 |
36
37 Matrix X has 2 rows, 3 columns (2x3), and matrix Y has 3 rows, 2 columns.
38
39 To determine the final dimensions of the resultant matrix Z, take the number
40 of rows from matrix X (2) and number of columns from matrix Y (2).
41
42 The method usually taught in linear algebra course to students is the
43 following (outer product):
44
45 1. Start with the first row of the first matrix, and first column of the
46 second matrix.
47 2. Multiply each element in the row by each element in the column, and sum
48 with the current value of the result matrix element (multiply-add-accumulate).
49 Store result in the first row, first column entry.
50 3. Move to the next column of the second matrix, and next column of the result
51 matrix. If there are no more columns in the second matrix, go back to first
52 column (second matrix), and move to next row (first and result matrices).
53 If there are no more rows left, result matrix has been fully computed.
54 4. Repeat step 2.
55 4. Move to the next row of the first matrix, and next column of the second matrix.
56
57 This for-loop uses the indeces as shown above
58
59 for i in range(mat_X_num_rows):
60 for k in range(0, mat_Y_num_cols):
61 for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
62 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
63
64
65 Calculations:
66
67 | 1 2 3 | | 6 7 | = | (1*6 + 2*8 + 3*10) (1*7 + 2*9 3*11) |
68 | 3 4 5 | * | 8 9 | | (3*6 + 4*8 + 5*10) (3*7 + 4*9 5*11) |
69 | 10 11 |
70
71 | 1 2 3 | | 6 7 | = | ( 6 + 16 + 30) ( 7 + 18 + 33) |
72 | 3 4 5 | * | 8 9 | | (18 + 32 + 50) (21 + 36 + 55) |
73 | 10 11 |
74
75 | 1 2 3 | | 6 7 | = | 52 58 |
76 | 3 4 5 | * | 8 9 | | 100 112 |
77 | 10 11 |
78
79 For the algorithm, assign indeces to matrices as follows:
80
81 Index | 0 1 2 3 4 5 |
82 Mat X | 1 2 3 3 4 5 |
83
84 Index | 0 1 2 3 4 5 |
85 Mat Y | 6 7 8 9 10 11 |
86
87 Index | 0 1 2 3 |
88 Mat Z | 52 58 100 112 |
89
90 (Start with the first row, then assign index left-to-right, top-to-bottom.)
91
92 Index list:
93
94 Mat X | Mat Y | Mat Z
95 0 | 0 | 0
96 1 | 2 | 0
97 2 | 4 | 0
98 0 | 1 | 1
99 1 | 3 | 1
100 2 | 5 | 1
101 3 | 0 | 2
102 4 | 2 | 2
103 5 | 4 | 2
104 3 | 1 | 3
105 4 | 3 | 3
106 5 | 5 | 3
107
108
109 The issue with this algorithm is that the result matrix element is the same
110 for three consecutive operations, and where each element is stored in CPU
111 registers, the same register will be written to three times and thus causing
112 consistent stalling.
113
114 ## Inner Product
115
116 A slight modification to the order of the loops in the algorithm massively
117 reduces the chance of read-after-write hazards, as the result matrix
118 element (and thus register) changes with every multiply-add operation.
119
120 The code:
121
122 for i in range(mat_X_num_rows):
123 for j in range(0, mat_X_num_cols): # or mat_Y_num_rows
124 for k in range(0, mat_Y_num_cols):
125 mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
126
127 Index list:
128
129 Mat X | Mat Y | Mat Z
130 0 | 0 | 0
131 0 | 1 | 1
132 3 | 0 | 2
133 3 | 1 | 3
134 1 | 2 | 0
135 1 | 3 | 1
136 4 | 2 | 2
137 4 | 3 | 3
138 2 | 4 | 0
139 2 | 5 | 1
140 5 | 4 | 2
141 5 | 5 | 3
142
143 The index for the result matrix changes with every operation, and thus the
144 consecutive multiply-add instruction doesn't depend on the previous write
145 register.
146
147 ## Appendix
148
149 ### Links
150
151 - [Online matrix calculator](https://matrix.reshish.com/multCalculation.php)
152 - [LibreSOC matrix multiply REMAP]()