complex and very comprehensive, it is hard to justify creating complex
3-in 2-out variants when a sequence of 3 simple instructions will suffice.
+For shift amounts larger than an element bitwidth, standard register move
+operations may be used or, if the shift amount is static, the instruction
+may simply reference an alternate starting point in
+the registers containing the Vector elements,
+because SVP64 sits on top of a standard Scalar register file.
+
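As an illustration of the point about large shift amounts, here is a minimal Python sketch, not taken from the SVP64 specification. It assumes a big integer held as a little-endian list of 64-bit elements; the names `ELWIDTH`, `to_elements`, `from_elements` and `shift_right` are hypothetical. The whole-element part of the shift is modelled by simply starting at a later element, which is the software analogue of referencing an alternate starting register, and only the remaining sub-element bit count needs an actual shift.

```python
ELWIDTH = 64                      # assumed element width in bits
MASK = (1 << ELWIDTH) - 1


def to_elements(value, count):
    """Split an integer into `count` little-endian 64-bit elements."""
    return [(value >> (ELWIDTH * i)) & MASK for i in range(count)]


def from_elements(elements):
    """Recombine little-endian 64-bit elements into one integer."""
    return sum(el << (ELWIDTH * i) for i, el in enumerate(elements))


def shift_right(elements, amount):
    """Logical right shift of a little-endian vector of elements."""
    n = len(elements)
    skip, bits = divmod(amount, ELWIDTH)
    # Whole-element part: read from an alternate starting point,
    # padding the vacated top elements with zeros.
    src = (elements[skip:] + [0] * n)[:n]
    out = []
    for i in range(n):
        above = src[i + 1] if i + 1 < n else 0
        # Sub-element part: low bits of the next element fill the top.
        out.append((src[i] >> bits) | ((above << (ELWIDTH - bits)) & MASK))
    return out
```

When the shift amount is a static multiple of the element width, `bits` is zero and the loop reduces to a plain copy from the offset starting point.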
# Vector Multiply
Long-multiply, assuming an O(N^2) algorithm, is performed by summing
performance RISC advocates
recommend "macro-op fusion" which is in effect where the second instruction
gains access to the cached copy of the HI half of the
-multiply redult, which had already been
+multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.
Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
-both HI and LO halves of the multiply to reach registers. If done as a
+both HI and LO halves of the multiply to reach registers with a single
+instruction. If, however, this is done as a
multiply-and-accumulate this becomes quite an expensive operation:
(3 64-Bit in, 2 64-bit registers out).
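The "3-in, 2-out" shape can be made concrete with a small Python model. This is only a sketch under stated assumptions: `mul_add_hi_lo` and `mul_vector_by_scalar` are hypothetical helper names, not instructions from any ISA, and Python's arbitrary-precision integers stand in for the 128-bit intermediate. It shows three 64-bit inputs producing a HI and a LO 64-bit output, and how the HI half chains as the carry into the next element of a vector-by-scalar multiply.

```python
MASK64 = (1 << 64) - 1


def mul_add_hi_lo(a, b, c):
    """Compute a * b + c for 64-bit inputs; return (hi, lo) 64-bit halves.

    The result always fits in 128 bits, because
    (2**64 - 1)**2 + (2**64 - 1) == (2**64 - 1) * 2**64 < 2**128.
    """
    total = a * b + c
    return (total >> 64) & MASK64, total & MASK64


def mul_vector_by_scalar(elements, scalar):
    """Multiply a little-endian vector of 64-bit elements by a 64-bit scalar."""
    carry = 0
    out = []
    for el in elements:
        # The HI half of each step becomes the addend of the next step.
        carry, lo = mul_add_hi_lo(el, scalar, carry)
        out.append(lo)
    out.append(carry)             # final HI half is the top element
    return out
```

Each loop iteration reads three 64-bit values (element, scalar, carry) and writes two (the LO result and the new carry), which is the register-port cost the text describes.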