programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
-that can add literally an order
+that routinely adds literally an order
of magnitude to the number of hand-written lines of assembler
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.
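
To make the contrast concrete, here is a minimal sketch in Python (not assembler, and not taken from the article under discussion) of how an arbitrary element count is split up under each approach. The function names `setvl_strip_mine` and `packed_simd_cascade` are hypothetical and chosen for illustration only. The sketch models the teardown (tail) cascade alone, and assumes the widest SIMD width is a power of two.

```python
def setvl_strip_mine(n, maxvl):
    """Cray-style strip-mining: one loop body handles every pass.

    Each pass processes min(remaining, maxvl) elements, which is the
    value a setvl-like instruction would hand back (an assumption of
    this sketch, not a statement about any specific ISA encoding).
    """
    chunks = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, maxvl)
        chunks.append(vl)
        remaining -= vl
    return chunks


def packed_simd_cascade(n, widest):
    """Fixed-width Packed SIMD: main loop at the widest width, then a
    tail handled by successively narrower power-of-two widths.

    Each distinct width stands for a separate hand-written code path.
    `widest` is assumed to be a power of two.
    """
    chunks = []
    remaining = n
    width = widest
    while width >= 1:
        while remaining >= width:
            chunks.append(width)
            remaining -= width
        width //= 2
    return chunks


if __name__ == "__main__":
    print(setvl_strip_mine(13, 8))     # [8, 5]
    print(packed_simd_cascade(13, 8))  # [8, 4, 1]
```

The point is not the chunk counts, which are similar, but the number of distinct code paths: the `setvl` loop has a single body regardless of `n`, whereas the cascade needs one body per power-of-two width, and a real Packed SIMD routine repeats the same cascade again at setup.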
for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
the case of the IBM POWER9, owing to a little-known design flaw, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
-execution even further.
+execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
+requires loop-unrolling to contend with non-power-of-two Matrix
+sizes: SVP64 does not, as hinted at below.
Additional savings come in the form of `SVREMAP`. This is a hardware
index transformation system where the normally sequentially-linear
2x6 may be performed in as few as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
-instructions. Not only that, but because the "Schedule" is an abstract
+instructions. Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, or Logical