Worked example broken down into individual multiply-add operations:
-[[!img outer_product_worked_example.jpg size="600x" ]]
+[[!img outer_product_worked_example.jpg size="500x" ]]
The issue with this algorithm is that the result matrix element is the same
for three consecutive operations, and where each element is stored in CPU
Worked example for inner product:
-[[!img inner_product_worked_example.jpg size="600x" ]]
+[[!img inner_product_worked_example.jpg size="500x" ]]
The index for the result matrix changes with every operation, and thus the
consecutive multiply-add instruction doesn't depend on the previous write
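
The two access patterns described in this section can be sketched in Python. This is a minimal illustration, not the page's own implementation: the function names, the `order` labels, the loop nests, and the 2x3 by 3x2 matrix sizes are all assumptions, chosen so that one ordering keeps the same result element for three consecutive operations and the other changes the result index on every operation. The exact operation order in the figures may differ.

```python
def matmul_trace(a, b, order):
    """Multiply a (n x m) by b (m x p) one multiply-add at a time.

    order="reduce_inner": the reduction index k is the innermost loop, so the
        same result element is updated m times in a row.
    order="result_inner": the result column j is the innermost loop, so the
        result index changes on every operation (when p > 1).
    Returns the result matrix and the list of result indices written, in order.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    if order == "reduce_inner":
        steps = [(i, j, k) for i in range(n) for j in range(p) for k in range(m)]
    elif order == "result_inner":
        steps = [(i, j, k) for i in range(n) for k in range(m) for j in range(p)]
    else:
        raise ValueError(f"unknown order: {order}")

    writes = []
    for i, j, k in steps:
        c[i][j] += a[i][k] * b[k][j]   # one multiply-add
        writes.append((i, j))          # which result element was written
    return c, writes


def back_to_back_writes(writes):
    """Count operations that write the same result element as the previous one."""
    return sum(1 for prev, cur in zip(writes, writes[1:]) if prev == cur)


# Hypothetical example data: 2x3 times 3x2, so each result needs 3 multiply-adds.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

c_same, writes_same = matmul_trace(A, B, "reduce_inner")
c_alt, writes_alt = matmul_trace(A, B, "result_inner")

print(c_same == c_alt)                   # both orderings give the same product
print(back_to_back_writes(writes_same))  # repeated consecutive result writes
print(back_to_back_writes(writes_alt))   # none when the result index alternates
```

Both orderings perform the same twelve multiply-adds and produce the same product. Only the sequence of result indices differs, which is the property the two worked examples contrast.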