*All of these can be done with SV elwidth overrides, as long as the dest is no greater than 128. SV specifically does not do 128 bit arithmetic. Instead, vec2.X mul-lo followed by vec2.Y mul-hi can be macro-op fused to get at the full 128 bit internal result. Specifying e.g. src elwidth=8 and dest elwidth=16 will give a widening multiply*
(Now added `maddedu` which is twin-half 64x64->HI64/LO64 in [[sv/biginteger]])
## vec_rl - rotate left
is sufficient for SVP64 Vectorisation of big-integer addition (and subfe
for subtraction) but that big-integer multiply and divide require an
extra 3-in 2-out instruction, similar to Intel's `mulx`, to be efficient.
The same instruction (`maddedu`) is used for both because `maddedu`'s primary
purpose is to perform a fused 64-bit scalar multiply with a large vector,
where that result is Big-Added for Big-Multiply, but Big-Subtracted for
Big-Divide.
The same macro-op fusion may theoretically be deployed even on Scalar
operations.
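For reference, the carry chain that `sv.adde` provides for big-integer addition can be sketched in plain C. This is a minimal model, not the SVP64 semantics themselves; the function name and the use of GCC/Clang's `__uint128_t` are illustrative assumptions:

```c
#include <stdint.h>

/* Minimal model of a Vectorised add-with-carry chain (sv.adde):
 * a[] + b[], with the CA flag rippling from element to element.
 * n is the number of 64-bit limbs; returns the final carry-out. */
static int bigadd(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                  int n, int carry_in)
{
    int ca = carry_in;
    for (int i = 0; i < n; i++) {
        __uint128_t sum = (__uint128_t)a[i] + b[i] + ca;
        dst[i] = (uint64_t)sum;
        ca = (int)(sum >> 64);   /* CA flag for the next element */
    }
    return ca;
}
```

The subtraction case (`sv.subfe`) is the same chain with the carry interpreted as a borrow.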
# maddedu
**DRAFT**
`maddedu` is similar to v3.0 `madd`, and
is VA-Form despite having 2 outputs: the second
destination register is implicit.
| EXT04 | RT | RA | RB | RC | XO |
|-------|----|----|----|----|----|
The pseudocode for `maddedu RT, RA, RB, RC` is:
    prod[0:127] = (RA) * (RB)
    sum[0:127] = EXTZ(RC) + prod
    RT <- sum[64:127]
    RS <- sum[0:63]

The lower 64 bits of the sum are stored in RT, and the upper 64 bits
in RS.
The differences here to `maddhdu` are that `maddhdu` stores the upper
half in RT, where `maddedu` stores the upper half in RS. There is no
equivalent to `maddld` because `maddld` performs sign-extension on RC.
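As a cross-check of the pseudocode, here is a minimal C model of the operation. The function name is hypothetical, and `__uint128_t` is a GCC/Clang extension standing in for the 128-bit intermediate:

```c
#include <stdint.h>

/* C model of the maddedu pseudocode:
 *   prod = RA * RB;  sum = EXTZ(RC) + prod
 * The low 64 bits of sum go to RT, the high 64 bits
 * to the implicit second destination, RS. */
static void maddedu_model(uint64_t ra, uint64_t rb, uint64_t rc,
                          uint64_t *rt, uint64_t *rs)
{
    __uint128_t sum = (__uint128_t)ra * rb + rc;
    *rt = (uint64_t)sum;          /* lower half -> RT */
    *rs = (uint64_t)(sum >> 64);  /* upper half -> implicit RS */
}
```

Note that the sum cannot overflow 128 bits: the maximum product plus a maximum 64-bit addend is still below 2^128.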
*Programmer's Note:
To achieve the same big-integer rolling-accumulation effect
as SVP64: assuming the scalar to multiply is in r0,
the vector to multiply by starts at r4 and the result vector
in r20, instructions may be issued `maddedu r20,r4,r0,r20
maddedu r21,r5,r0,r21` etc. where the first `maddedu` will have
stored the upper half of the 128-bit multiply into r21, such
that it may be picked up by the second `maddedu`. Repeat inline
to construct a larger bigint scalar-vector multiply,
as Scalar GPR register file space permits.*
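The rolling-accumulation pattern in the note above can be sketched as a plain C loop, where the `carry` variable plays the role of each `maddedu`'s RC input and implicit RS output (function name and helper are illustrative; `__uint128_t` is a GCC/Clang extension):

```c
#include <stdint.h>

/* Multiply the n-limb vector v[] by scalar s: each step's upper
 * half becomes the next step's addend, exactly the RC/RS chaining
 * that a sequence of maddedu instructions provides. */
static uint64_t bigmul_scalar(uint64_t *result, const uint64_t *v,
                              int n, uint64_t s)
{
    uint64_t carry = 0;   /* RC in, implicit RS out */
    for (int i = 0; i < n; i++) {
        __uint128_t sum = (__uint128_t)v[i] * s + carry;
        result[i] = (uint64_t)sum;
        carry = (uint64_t)(sum >> 64);
    }
    return carry;         /* final upper half */
}
```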
| Rsrc1\_EXTRA2 | `12:13` | extends RA (R\*\_EXTRA2 Encoding) |
| Rsrc2\_EXTRA2 | `14:15` | extends RB (R\*\_EXTRA2 Encoding) |
| Rsrc3\_EXTRA2 | `16:17` | extends RC (R\*\_EXTRA2 Encoding) |
| EXTRA2_MODE | `18` | used by `maddedu` for determining RS |
When `EXTRA2_MODE` is set to zero, the implicit RS register takes
its Vector/Scalar setting from Rdest_EXTRA2, and takes
* the lower 64 bits of the dividend, instead of being zero, contain a
register, RC.
* it performs a fused divide and modulo in a single instruction, storing
  the modulo in an implicit RS (similar to `maddedu`)
RB, the divisor, remains 64 bit. The instruction is therefore a 128/64
division, producing a (pair) of 64 bit result(s). Overflow conditions
Algorithm D*
For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
the exact same EXTRA format and setting of RS is used as for `sv.maddedu`.
For Scalar usage, just as for `maddedu`, `RS=RT+1` (similar to `lq` and `stq`).
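A minimal C model of the 128/64 divide-and-modulo described above (function name hypothetical; overflow conditions, i.e. a quotient that does not fit in 64 bits, are deliberately not modelled here):

```c
#include <stdint.h>

/* RA holds the upper 64 bits of the dividend, RC the lower 64,
 * RB the 64-bit divisor.  Quotient -> RT, modulo -> implicit RS.
 * Precondition (not checked): ra < rb, so the quotient fits. */
static void divmod2du_model(uint64_t ra, uint64_t rb, uint64_t rc,
                            uint64_t *rt, uint64_t *rs)
{
    __uint128_t dividend = ((__uint128_t)ra << 64) | rc;
    *rt = (uint64_t)(dividend / rb);
    *rs = (uint64_t)(dividend % rb);
}
```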
Pseudo-code:
For the Opcode map (XO Field)
see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.
Proposed is the addition of `maddedu` (**DRAFT, NOT APPROVED**) in `110010`
and `divmod2du` in `110100`
|110000|110001 |110010 |110011|110100 |110101|110110|110111|
|------|-------|----------|------|-------------|------|------|------|
|maddhd|maddhdu|**maddedu**|maddld|**divmod2du**|rsvd |rsvd |rsvd |
**DRAFT SVP64**
* Revision 0.0: 21apr2022 <https://www.youtube.com/watch?v=8hrIG7-E77o>
* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe work
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)
of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>
```
// this becomes the basis for sv.maddedu in RS=RC Mode,
// where k is RC. k takes the upper half of product
// and adds it in on the next iteration
k = 0;
```
uint32_t carry = 0;
// this is just sv.maddedu again
for (int i = 0; i <= n; i++) {
uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
carry = (uint32_t)(value >> 32); // upper half for next loop
**Back to Vector carry-looping**
There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar
(`v[0]`):
}
```
Here, just as with `maddedu` which can put the hi-half of the 128-bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.
or act in loop-back mode for big-int division by a scalar,
or for a single scalar 128/64 div/mod.
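That loop-back mode can be sketched in C: working from the most-significant element downward, each step's modulo becomes the upper half of the next 128/64 dividend (function name illustrative; `__uint128_t` is a GCC/Clang extension):

```c
#include <stdint.h>

/* Divide the n-limb dividend[] (limb 0 least significant) by the
 * scalar d, most-significant limb first: each step's remainder is
 * fed back in as the hi-half of the next 128/64 divide.  Returns
 * the final remainder of the whole big-integer division. */
static uint64_t bigdiv_scalar(uint64_t *quot, const uint64_t *dividend,
                              int n, uint64_t d)
{
    uint64_t mod = 0;    /* RC in, implicit RS out */
    for (int i = n - 1; i >= 0; i--) {
        __uint128_t x = ((__uint128_t)mod << 64) | dividend[i];
        quot[i] = (uint64_t)(x / d);
        mod = (uint64_t)(x % d);
    }
    return mod;
}
```

Because each remainder is strictly less than `d`, the 128/64 divide at every step is guaranteed not to overflow.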
Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
// Multiply and subtract.
uint32_t carry = 0;
// VL = n + 1
// sv.maddedu product.v, vn.v, qhat.s, carry.s
for (int i = 0; i <= n; i++)
{
uint32_t vn_v = i < n ? vn[i] : 0;
{
uint32_t carry = 0;
// VL = n + 1
// sv.maddedu product.v, vn.v, qhat.s, carry.s
for (int i = 0; i <= n; i++)
{
uint32_t vn_v = i < n ? vn[i] : 0;
un[i + j] = (uint32_t)result;
}
bool need_fixup = carry != 0;
#elif defined(MADDEDU_SUBFE)
(void)p, (void)t; // shut up unused variable warning
// Multiply and subtract.
uint32_t carry = 0;
uint32_t product[n + 1];
// VL = n + 1
// sv.maddedu product.v, vn.v, qhat.s, carry.s
for (int i = 0; i <= n; i++)
{
uint32_t vn_v = i < n ? vn[i] : 0;
| Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
| Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
| Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
| EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |
These are for 3 operand in and either 1 or 2 out instructions.
3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
such as `maddedu` have an implicit second destination, RS, the
selection of which is determined by bit 18.
## RM-1P-2S1D
a precedent: `RTp` stands for "RT pair". In other words the result
is stored in RT and RT+1. For Scalar operations, following this
precedent is perfectly reasonable. In Scalar mode,
`maddedu` therefore stores the two halves of the 128-bit multiply
into RT and RT+1.
What, then, of `sv.maddedu`? If the destination is hard-coded to
RT and RT+1 the instruction is not useful when Vectorised because
the output will be overwritten on the next element. To solve this
is easy: define the destination registers as RT and RT+MAXVL
into consideration, the starting point for the implicit destination
is best illustrated in pseudocode:
# demo of maddedu
for (i = 0; i < VL; i++)
if (predval & 1<<i) # predication
src1 = get_polymorphed_reg(RA, srcwid, irs1)