RS is calculated as RT+VL, where all scalar operations
assume VL=1. With `sv.msubx` *creating* a pair of Vector
results, `sv.weirdaddx` correspondingly has to pick the
-pair up in order to carry on the algorithm.
+pairs up, which hold the split lo/hi halves of the 128-bit products,
+in order to carry on the algorithm.
**msubx RT, RA, RB, RC** (RS=RT+VL for SVP64, RS=RT+1 for scalar)
RT <- sub[64:127]
RS <- sub[0:63]
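In this RTL notation bit 0 is the most significant bit. So `sub[0:63]` is the high half of the 128-bit result and goes to RS, and `sub[64:127]` is the low half and goes to RT. The minimal Python sketch below shows that split. It assumes `sub` is held as an unsigned 128-bit integer. The helper names are hypothetical, and how `sub` itself is computed is not shown in this excerpt.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def msb0_field(value: int, width: int, start: int, end: int) -> int:
    """Extract bits start..end (inclusive, MSB0 numbering) of a width-bit value."""
    nbits = end - start + 1
    shift = width - 1 - end  # distance of bit `end` from the least significant bit
    return (value >> shift) & ((1 << nbits) - 1)

def split_msubx_result(sub: int) -> tuple[int, int]:
    """Return (RT, RS) as the msubx pseudocode assigns them (hypothetical helper)."""
    sub &= MASK128                        # treat sub as an unsigned 128-bit value
    rt = msb0_field(sub, 128, 64, 127)    # RT <- sub[64:127], the low 64 bits
    rs = msb0_field(sub, 128, 0, 63)      # RS <- sub[0:63], the high 64 bits
    return rt, rs
```

For example, `split_msubx_result((0x1122 << 64) | 0x3344)` gives `(0x3344, 0x1122)`: RT receives the low half and RS the high half.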
-**weirdaddx RT, RA, RB** (RS=RT+VL for SVP64, RS=RT+1 for scalar)
+**weirdaddx RT, RA, RB** (RS=RB+VL for SVP64, RS=RB+1 for scalar)
- cat[0:127] = (RB) || (RS)
+ cat[0:127] = (RS) || (RB)
sum[0:127] = cat + EXTZ(RA) + [1]*128
rhi[0:63] = sum[0:63]
if (RA) <= 1 then rhi = rhi + ([0]*63 || 1)
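Only the first lines of the `weirdaddx` pseudocode appear here. The sketch below therefore transliterates just those lines, under these assumptions:

- `||` is concatenation with `(RS)` in the high half (MSB0 bits 0:63).
- `[1]*128` is a 128-bit all-ones constant.
- `(RA) <= 1` is an unsigned comparison.
- The final increment wraps modulo 2^64.

The function name `weirdaddx_rhi` is hypothetical. What is eventually written back to RT and RS is not modelled.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def weirdaddx_rhi(ra: int, rb: int, rs: int) -> int:
    """Transliterate the visible weirdaddx lines and return rhi (hypothetical helper)."""
    # cat[0:127] = (RS) || (RB): RS occupies the high 64 bits in MSB0 order
    cat = ((rs & MASK64) << 64) | (rb & MASK64)
    # sum[0:127] = cat + EXTZ(RA) + [1]*128; adding all-ones is -1 mod 2**128
    total = (cat + (ra & MASK64) + MASK128) & MASK128
    # rhi[0:63] = sum[0:63]: the high 64 bits of the sum
    rhi = total >> 64
    # if (RA) <= 1 then rhi = rhi + ([0]*63 || 1); unsigned compare and
    # wrap-around modulo 2**64 are both assumptions
    if (ra & MASK64) <= 1:
        rhi = (rhi + 1) & MASK64
    return rhi
```

Writing the all-ones term as `MASK128` and masking afterwards keeps the arithmetic inside 128 bits, just as the fixed-width RTL does.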
The two instructions combine simply as:
- # RS=RT+VL, assume VL=8, therefore RS starts at r8.v
+ # assume VL=8, therefore RS starts at r8.v
# q : r16
# dividend: r20.v
# divisor : r28.v
# carry : r40
li r17, 0
sv.msubx r0.v, r16, r20.v, r28.v
- sv.weirdaddx r0.v, r17, r8.v
+ # here, RS=RB+VL, therefore again RS starts at r8.v
+ sv.weirdaddx r0.v, r17, r0.v
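The register comments above follow from the implicit-RS rule. That rule is RS=RT+VL for `msubx` and RS=RB+VL for `weirdaddx`, with scalar forms behaving as VL=1. The small sketch below works out where RS lands for the sequence above. It assumes plain integer register numbers, and the function `implicit_rs` is hypothetical.

```python
def implicit_rs(base: int, vl: int, vector: bool) -> int:
    """Register number of the implicit RS operand (hypothetical helper).

    base is RT for msubx and RB for weirdaddx; scalar forms act as VL=1.
    """
    return base + (vl if vector else 1)

VL = 8
# sv.msubx r0.v, r16, r20.v, r28.v  -> RT = r0..r7, so RS = RT+VL = r8..r15
msubx_rs = implicit_rs(base=0, vl=VL, vector=True)
# sv.weirdaddx r0.v, r17, r0.v      -> RB = r0, so RS = RB+VL = r8 again
weirdaddx_rs = implicit_rs(base=0, vl=VL, vector=True)
print(f"msubx RS starts at r{msubx_rs}, weirdaddx RS starts at r{weirdaddx_rs}")
```

`weirdaddx` takes its RB from the same `r0.v` vector that `msubx` wrote as RT. Because of that, both halves of every 128-bit pair line up without any register moves.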
As a result, a big-integer subtract and multiply may be carried out
in only 3 instructions, one of which is setting a scalar integer to