To determine the final dimensions of the resultant matrix Z, take the number
of rows from matrix X (2) and number of columns from matrix Y (2).
-For the algorithm, assign indeces to matrices as follows:
-
- Index | 0 1 2 3 4 5 |
- Mat X | 1 2 3 3 4 5 |
-
- Index | 0 1 2 3 4 5 |
- Mat Y | 6 7 8 9 10 11 |
-
- Index | 0 1 2 3 |
- Mat Z | 52 58 100 112 |
-
-(Start with the first row, then assign index left-to-right, top-to-bottom.)
-
The method usually taught to students in linear algebra courses is the
-following:
+following (outer product):
1. Start with the first row of the first matrix, and first column of the
second matrix.
| 3 4 5 | * | 8 9 | | 100 112 |
| 10 11 |
+For the algorithm, assign indices to the matrices as follows:
+
+    Index | 0  1  2  3  4  5  |
+    Mat X | 1  2  3  3  4  5  |
+
+    Index | 0  1  2  3  4  5  |
+    Mat Y | 6  7  8  9  10 11 |
+
+    Index | 0  1  2   3   |
+    Mat Z | 52 58 100 112 |
+
+(Indices are assigned in row-major order: left to right along each row,
+starting with the first row and moving top to bottom.)
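
The numbering in the tables above can be sketched as a small helper. The name `flat_index` and its parameters are illustrative assumptions, not part of the original algorithm.

```python
def flat_index(row, col, num_cols):
    """Row-major position of element (row, col) in a matrix with num_cols columns."""
    return row * num_cols + col

# Mat X is 2x3 and Mat Y is 3x2, both stored as flat lists in row-major order.
mat_X = [1, 2, 3, 3, 4, 5]
mat_Y = [6, 7, 8, 9, 10, 11]

# Second row, first column of X (value 3) sits at flat index 3.
x_idx = flat_index(1, 0, 3)
# Third row, second column of Y (value 11) sits at flat index 5.
y_idx = flat_index(2, 1, 2)
```

The same formula gives the Mat Z index for row `i` of X and column `k` of Y as `i * 2 + k`, since Z has two columns.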
+
+Index list:
+
+ Mat X | Mat Y | Mat Z
+ 0 | 0 | 0
+ 1 | 2 | 0
+ 2 | 4 | 0
+ 0 | 1 | 1
+ 1 | 3 | 1
+ 2 | 5 | 1
+ 3 | 0 | 2
+ 4 | 2 | 2
+ 5 | 4 | 2
+ 3 | 1 | 3
+ 4 | 3 | 3
+ 5 | 5 | 3
+
+
+The issue with this algorithm is that the same result matrix element is the
+target of three consecutive operations. If each element is held in a CPU
+register, the same register is written three times in a row, so each
+multiply-add has to wait for the previous one to finish, which causes
+consistent stalling.
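
As a rough illustration of the problem, the sketch below walks the row-by-column loop order over the 2x3 and 3x2 example and records which flat Mat Z index each multiply-add writes to. The function name `z_targets_row_by_column` and its parameters are assumptions made for this example.

```python
def z_targets_row_by_column(x_rows, x_cols, y_cols):
    """Return the flat Mat Z index written by each multiply-add, in order."""
    targets = []
    for i in range(x_rows):          # row of X
        for k in range(y_cols):      # column of Y
            for j in range(x_cols):  # position along the row/column pair
                targets.append(i * y_cols + k)
    return targets

targets = z_targets_row_by_column(2, 3, 2)
# Count operations that write the same Z element as the operation before them.
back_to_back = sum(a == b for a, b in zip(targets, targets[1:]))
print(targets)       # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(back_to_back)  # 8
```

Eight of the eleven consecutive pairs of operations write to the same element, which is where the dependency between back-to-back instructions comes from.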
+
+## Inner Product
+
+A slight modification to the order of the loops in the algorithm massively
+reduces the chance of read-after-write hazards, as the result matrix
+element (and thus register) changes with every multiply-add operation.
+
+The code:
+
+    for i in range(mat_X_num_rows):
+        for j in range(mat_X_num_cols):  # or mat_Y_num_rows
+            for k in range(mat_Y_num_cols):
+                mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
+
+Index list:
+
+ Mat X | Mat Y | Mat Z
+ 0 | 0 | 0
+ 0 | 1 | 1
+ 1 | 2 | 0
+ 1 | 3 | 1
+ 2 | 4 | 0
+ 2 | 5 | 1
+ 3 | 0 | 2
+ 3 | 1 | 3
+ 4 | 2 | 2
+ 4 | 3 | 3
+ 5 | 4 | 2
+ 5 | 5 | 3
+
+The result matrix index changes with every operation, so each multiply-add
+instruction writes to a different register than the one before it and does
+not have to wait for the previous write to complete.
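
A minimal runnable sketch of the reordered loops, using flat row-major lists for the example matrices. The function name `matmul_reordered` and the flat-list storage are assumptions for illustration; the loop order follows the code above.

```python
def matmul_reordered(mat_X, mat_Y, x_rows, x_cols, y_cols):
    """Multiply flat row-major matrices; return (mat_Z, Z indices written)."""
    mat_Z = [0] * (x_rows * y_cols)
    written = []
    for i in range(x_rows):
        for j in range(x_cols):      # or the number of rows of Y
            for k in range(y_cols):
                z = i * y_cols + k
                mat_Z[z] += mat_X[i * x_cols + j] * mat_Y[j * y_cols + k]
                written.append(z)
    return mat_Z, written

mat_Z, written = matmul_reordered([1, 2, 3, 3, 4, 5], [6, 7, 8, 9, 10, 11], 2, 3, 2)
print(mat_Z)  # [52, 58, 100, 112]
# Check whether any two consecutive operations write the same Z element.
print(any(a == b for a, b in zip(written, written[1:])))  # False
```

The result matches Mat Z from the worked example, and no operation writes to the element that the operation immediately before it wrote to.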
## Appendix