permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.
-# Vector Shift
-
-Like add and subtract, strictly speaking these need no new instructions.
-Keeping the shift amount within the range of the element (64 bit)
-a Vector bit-shift may be synthesised from a pair of shift operations
-and an OR, all of which are standard Scalar Power ISA instructions
-that when Vectorized are exactly what is needed.
-
-```
-void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
- for (int i = 0; i < n - 1; i++)
- r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
- r[n - 1] = un[n - 1] >> s;
-}
-```
-
-With SVP64 being on top of the standard scalar regfile the offset by
-one of the elements may be achieved simply by referencing the same
-vector data offset by one. Given that all three instructions
-(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
-it is possible to reference the full 128 64-bit registers (r0-r127):
-
- subfic t1, t0, 64 # compute 64-s (s in t0)
- sv.srd r8.v, r24.v, t0 # shift each element of r24.v up by s
- sv.sld r16.v, r25.v, t1 # offset start of vector by one (r25)
- sv.or r8.v, r8.v, r16.v # OR two parts together
-
-Predication with zeroing may be utilised on sld to ensure that the
-last element is zero, avoiding over-run.
-
-The reason why three instructions are needed instead of one in the
-case of big-add is because multiple bits chain through to the
-next element, where for add it is a single bit (carry-in, carry-out),
-and this is precisely what `adde` already does.
-For multiply and divide as shown later it is worthwhile to use
-one scalar register effectively as a full 64-bit carry/chain
-but in the case of shift, an OR may glue things together, easily,
-and in parallel, because unlike `sv.adde`, down-chain
-carry-propagation through multiple elements does not occur.
-
-With Scalar shift and rotate operations in the Power ISA already being
-complex and very comprehensive, it is hard to justify creating complex
-3-in 2-out variants when a sequence of 3 simple instructions will suffice.
-However it is reasonably justifiable to have a 3-in 1-out instruction
-with an implicit source, based around the inner operation:
-
-```
- # r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
- t <- ROT128(RA || RA1, RB[58:63])
- RT <- t[64:127]
-```
-
-RA1 is implicitly (or explicitly, RC) greater than RA by one
-scalar register number, and like the other operations below,
-a 128/64 shift is performed, truncating to take the lower
-64 bits. By taking a Vector source RA and assuming lower-numbered
-registers are lower-significant digits in the biginteger operation
-the entire biginteger source may be shifted by a scalar.
-
-For larger shift amounts beyond an element bitwidth standard register move
-operations may be used, or, if the shift amount is static,
-to reference an alternate starting point in
-the registers containing the Vector elements
-because SVP64 sits on top of a standard Scalar register file.
-`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
-by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.
-
# Vector Multiply
Long-multiply, assuming an O(N^2) algorithm, is performed by summing
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.
+# Vector Shift
+
+The decision here was made to follow the same principle as
+for multiply-and-accumulate. Whilst Intel has
+[srd](https://www.felixcloutier.com/x86/shrd)
+which is very similar, the decision was made to create a 3-in 2-out
+instruction that effectively uses one of the inputs and one of
+the outputs as a "sort of 64-bit carry".
+
+Without the `dsrd` instruction, it is necessary to have three
+instructions in an inner loop.
+Keeping the shift amount within the range of the element (64 bit)
+a Vector bit-shift may be synthesised from a pair of shift operations
+and an OR, all of which are standard Scalar Power ISA instructions.
+that when Vectorized are exactly what is needed, but that critically
+requires Vertical-First Mode.
+
+```
+void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
+ for (int i = 0; i < n - 1; i++)
+ r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
+ r[n - 1] = un[n - 1] >> s;
+}
+```
+
+With SVP64 being on top of the standard scalar regfile the offset by
+one of the elements may be achieved simply by referencing the same
+vector data offset by one. Given that all three instructions
+(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
+it is possible to reference the full 128 64-bit registers (r0-r127):
+
+ subfic t1, t0, 64 # compute 64-s (s in t0)
+ sv.srd r8.v, r24.v, t0 # shift each element of r24.v up by s
+ sv.sld r16.v, r25.v, t1 # offset start of vector by one (r25)
+ sv.or r8.v, r8.v, r16.v # OR two parts together
+
+Predication with zeroing may be utilised on sld to ensure that the
+last element is zero, avoiding over-run.
+
+The reason why three instructions are needed instead of one in the
+case of big-add is because multiple bits chain through to the
+next element, where for add it is a single bit (carry-in, carry-out),
+and this is precisely what `adde` already does.
+For multiply, divide and shift it is worthwhile to use
+one scalar register effectively as a full 64-bit carry/chain.
+
+The basic principle of the 3-in 2-out `dsrd` is:
+
+```
+ # r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
+ temp <- ROT128(RA || RC, RB[58:63])
+ RT <- temp[64:127]
+ RS <- temp[0:63]
+```
+
+A 128-bit shift is performed, taking the lower half into RS
+and the upper half into RT. However there is a trick that
+may be applied, which only requires `ROT64`:
+
+```
+ n <- (RB)[58:63]
+ v <- ROTL64((RA), 64-n)
+ mask <- MASK(n, 63)
+ RT <- (v[0:63] & mask) | ((RC) & ¬mask)
+ RS <- v[0:63] & ¬mask
+```
+
+The trick here is that the *entirety* of `RA` is rotated,
+the
+For larger shift amounts beyond an element bitwidth standard register move
+operations may be used, or, if the shift amount is static,
+to reference an alternate starting point in
+the registers containing the Vector elements
+because SVP64 sits on top of a standard Scalar register file.
+`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
+by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.
+
# Vector Divide
The simplest implementation of big-int divide is the standard schoolbook