The following settings will allow a 4x4 matrix (starting at f8), expressed as a sequence of 16 numbers first by row then by column, to be multiplied by a vector of length 4 (starting at f0), using a single FMAC instruction.
* SHAPE0: xdim=4, ydim=4, permute=yx, applied to f0
-* VL=16.
+* SHAPE1: xdim=4, ydim=1, permute=xy, applied to f4
+* VL=16, f4=vec, f0=vec, f8=vec
* FMAC f4, f0, f8, f4
The permutation on SHAPE0 will use f0 as a vec4 source. On the first four iterations through the hardware loop, the REMAPed index will not increment. On the second four, the index will increase by one. Likewise on each subsequent group of four.
-At the same time, VL will increment through the 16 values in the Matrix. The equivalent sequence thus is issued:
+The permutation on SHAPE1 will increment f4 continuously cycling through f4-f7 every four iterations of the hardware loop.
+
+At the same time, VL will, because there is no SHAPE on f8, increment straight sequentially through the 16 values f8-f23 in the Matrix. The equivalent sequence thus is issued:
fmac f4, f0, f8, f4
fmac f5, f0, f9, f5
fmac f6, f3, f22, f6
fmac f7, f3, f23, f7
+The only other instruction required is to ensure that f4-f7 are initialised (usually to zero).
+
+It should be clear that a 4x4 by 4x4 Matrix Multiply, being effectively the same technique applied to four independent vectors, can be done by setting VL=64, using an extra dimension on the SHAPE CSRs and applying a rotating SHAPE CSR to f8 in order to get it to apply four times to compute the four columns worth of vectors.