which has identical triple nesting then the FFT Schedule may be
used even there.
-# 4x4 Matrix to vec4 Multiply Example
+# 4x4 Matrix to vec4 Multiply (4x4 by 1x4)
The following settings will allow a 4x4 matrix (starting at f8), expressed
as a sequence of 16 numbers first by row then by column, to be multiplied
fmac f7, f3, f23, f7
```
-The only other instruction required is to ensure that f4-f7 are
-initialised (usually to zero).
+Hardware should be able to pipeline the above FMACs easily: as long as each
+FMAC completes in 4 cycles or fewer, the single Vector FMAC should sustain
+100% throughput.
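The 4-cycle figure can be checked with a small model. The following is only a sketch, not a description of any real pipeline. It assumes that the sixteen element-level FMACs rotate through the four accumulators (f4, f5, f6, f7, f4, ...), that at most one FMAC issues per cycle, and that an FMAC may only issue once the previous write to its accumulator has completed. The names `issue_cycles` and `stalls` are hypothetical.

```python
def issue_cycles(latency, n_ops=16, n_acc=4):
    """Return the cycle on which each element-level FMAC issues."""
    ready = [0] * n_acc   # cycle at which each accumulator is next available
    cycle = 0
    issued = []
    for k in range(n_ops):
        acc = k % n_acc                 # ASSUMPTION: accumulators rotate f4..f7
        cycle = max(cycle, ready[acc])  # stall until the accumulator is ready
        issued.append(cycle)
        ready[acc] = cycle + latency    # result is written `latency` cycles later
        cycle += 1                      # at most one issue per cycle
    return issued


def stalls(latency):
    """Number of cycles lost to accumulator dependencies."""
    issued = issue_cycles(latency)
    return issued[-1] + 1 - len(issued)
```

Under these assumptions each accumulator is reused every fourth operation, so any latency up to 4 gives `stalls(latency) == 0`, while a latency of 5 introduces a stall each time the rotation wraps around.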
-It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively
-the same technique applied to four independent vectors, can be done by
-setting VL=64, using an extra dimension on the SHAPE0 and SHAPE1 SPRs,
-and applying a rotating 1D SHAPE SPR of xdim=16 to f8 in order to get
-it to apply four times to compute the four columns worth of vectors.
+The only other instruction required is one that ensures f4-f7 are
+initialised (usually to zero). If the multiply is used as part of
+some other computation, which is frequently the case, then the
+zeroing is not needed.
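As an illustration of why the initial contents of f4-f7 matter, here is a minimal Python model of the sixteen FMACs. It is a sketch under stated assumptions: the vector is taken to be in f0-f3, the accumulators in f4-f7 and the matrix in f8-f23 (consistent with the final `fmac f7, f3, f23, f7`), and the flat matrix index `i * 4 + j` and the interleaved ordering of operations are assumptions, not taken from the REMAP settings. The function name `matvec_fmac` is hypothetical.

```python
def matvec_fmac(mat, vec, acc):
    """Model 16 FMACs: mat is f8..f23, vec is f0..f3, acc is f4..f7."""
    acc = list(acc)            # do not modify the caller's accumulators
    for k in range(16):
        i = k % 4              # accumulator index (ASSUMED interleaved order)
        j = k // 4             # vector element index
        # fmac acc[i], vec[j], mat[i*4+j], acc[i]
        acc[i] = vec[j] * mat[i * 4 + j] + acc[i]
    return acc


identity = [1.0 if r == c else 0.0 for r in range(4) for c in range(4)]
zeroed = matvec_fmac(identity, [1.0, 2.0, 3.0, 4.0], [0.0] * 4)
chained = matvec_fmac(identity, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

With zeroed accumulators the result is the plain matrix-vector product. With non-zero accumulators the product is added to whatever was already there, which is the case where the explicit zeroing can be skipped.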
[[!tag standards]]