+For the algorithm, assign indices to the matrix elements as follows:
+
+    Index | 0 1 2 3 4 5 |
+    Mat X | 1 2 3 3 4 5 |
+
+    Index | 0 1 2 3  4  5 |
+    Mat Y | 6 7 8 9 10 11 |
+
+    Index |  0  1   2   3 |
+    Mat Z | 52 58 100 112 |
+
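+(Indices are assigned in row-major order: start with the first row and number
+its elements left-to-right, then continue with the next row, top-to-bottom.
+Here Mat X is 2x3, Mat Y is 3x2 and Mat Z is 2x2.)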
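
The row-major numbering described above could be sketched in Python as follows. This is only an illustration: the 2x3 and 3x2 shapes are inferred from the element counts and the index list, and the names `flat_index` and `flatten` are hypothetical helpers, not part of the original algorithm.

```python
# Assumed shapes (inferred, not stated explicitly in the text):
# Mat X is 2x3 and Mat Y is 3x2.
mat_X = [[1, 2, 3],
         [3, 4, 5]]
mat_Y = [[6, 7],
         [8, 9],
         [10, 11]]


def flat_index(row, col, num_cols):
    """Row-major index: rows are laid out one after another."""
    return row * num_cols + col


def flatten(mat):
    """Return the elements of a 2D list in row-major order."""
    return [elem for row in mat for elem in row]


flat_X = flatten(mat_X)
flat_Y = flatten(mat_Y)

# Element at row 1, column 0 of Mat X lives at flat index 3.
x_idx = flat_index(1, 0, num_cols=3)
```

With this numbering, `flat_X[x_idx]` is the first element of the second row of Mat X.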
+
+Index list:
+
+    Mat X | Mat Y | Mat Z
+      0   |   0   |   0
+      1   |   2   |   0
+      2   |   4   |   0
+      0   |   1   |   1
+      1   |   3   |   1
+      2   |   5   |   1
+      3   |   0   |   2
+      4   |   2   |   2
+      5   |   4   |   2
+      3   |   1   |   3
+      4   |   3   |   3
+      5   |   5   |   3
+
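The index list can be reproduced with a short Python sketch. The loop order used here (row of X, then column of Y, then the shared dimension innermost) is inferred from the list itself, and `index_list` is a hypothetical helper name. The flat values are the ones from the tables above.

```python
def index_list(x_rows, x_cols, y_cols):
    """Return (x_idx, y_idx, z_idx) triples, one per multiply-add.

    Assumed loop order: i (row of X), k (column of Y), j (shared dimension).
    All indices are row-major flat indices.
    """
    triples = []
    for i in range(x_rows):
        for k in range(y_cols):
            for j in range(x_cols):
                triples.append((i * x_cols + j,   # element of Mat X
                                j * y_cols + k,   # element of Mat Y
                                i * y_cols + k))  # element of Mat Z
    return triples


flat_X = [1, 2, 3, 3, 4, 5]
flat_Y = [6, 7, 8, 9, 10, 11]
flat_Z = [0] * 4

schedule = index_list(x_rows=2, x_cols=3, y_cols=2)
for x_idx, y_idx, z_idx in schedule:
    flat_Z[z_idx] += flat_X[x_idx] * flat_Y[y_idx]

print(flat_Z)  # [52, 58, 100, 112]
```

The twelve triples match the rows of the index list, and the accumulated result matches the Mat Z table.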
+
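+The issue with this algorithm is that the same result matrix element is the
+target of three consecutive multiply-add operations (one per element of the
+shared dimension in this example). If each element is held in a CPU register,
+the same register is written three times in a row, so each multiply-add has
+to wait for the previous one to complete, causing repeated stalls.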
+
+## Inner Product
+
+Swapping the two inner loops, so that the column of the result matrix varies
+fastest, massively reduces the chance of read-after-write hazards: the result
+matrix element (and thus register) changes with every multiply-add operation.
+
+The code:
+
+    for i in range(mat_X_num_rows):
+        for j in range(mat_X_num_cols):  # or mat_Y_num_rows
+            for k in range(mat_Y_num_cols):
+                mat_Z[i][k] += mat_X[i][j] * mat_Y[j][k]
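
To see the effect of the reordering, the following Python sketch lists which flat Mat Z index each multiply-add writes, for both loop orders, and counts how often two consecutive operations hit the same element. The helper names (`z_schedule`, `back_to_back_writes`) and the order labels `"ikj"` and `"ijk"` are hypothetical, and the count is only a rough proxy for hazards: actual stall behaviour depends on the pipeline latency of the target CPU.

```python
def z_schedule(x_rows, x_cols, y_cols, order):
    """Return the flat Mat Z index written by each multiply-add, in issue order."""
    if order == "ikj":  # shared dimension innermost: result element stays fixed
        return [i * y_cols + k
                for i in range(x_rows)
                for k in range(y_cols)
                for j in range(x_cols)]
    if order == "ijk":  # reordered loops: result column varies fastest
        return [i * y_cols + k
                for i in range(x_rows)
                for j in range(x_cols)
                for k in range(y_cols)]
    raise ValueError(f"unknown loop order: {order}")


def back_to_back_writes(schedule):
    """Count adjacent operations that write the same result element."""
    return sum(1 for a, b in zip(schedule, schedule[1:]) if a == b)


original = z_schedule(2, 3, 2, "ikj")
reordered = z_schedule(2, 3, 2, "ijk")

print(original, back_to_back_writes(original))
print(reordered, back_to_back_writes(reordered))
```

For the 2x3 by 3x2 example, the original order writes the same element back-to-back 8 times, while the reordered loops never do.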