From: Jacob Lifshay
Date: Thu, 29 Sep 2022 03:08:47 +0000 (-0700)
Subject: rename madded->maddedu for consistency with PowerISA maddhdu instruction
X-Git-Tag: opf_rfc_ls005_v1~271
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=6ecafc9cfa1cf13f25fa851d012b20989d770108;p=libreriscv.git

rename madded->maddedu for consistency with PowerISA maddhdu instruction
---

diff --git a/openpower/sv/av_opcodes.mdwn b/openpower/sv/av_opcodes.mdwn
index ce4971fbf..dadcf72d1 100644
--- a/openpower/sv/av_opcodes.mdwn
+++ b/openpower/sv/av_opcodes.mdwn
@@ -216,7 +216,7 @@ For 8,16,32,64, resulting in 8,16,32,64,128.

*All of these can be done with SV elwidth overrides, as long as the dest is no greater than 128. SV specifically does not do 128 bit arithmetic. Instead, vec2.X mul-lo followed by vec2.Y mul-hi can be macro-op fused to get at the full 128 bit internal result. Specifying e.g. src elwidth=8 and dest elwidth=16 will give a widening multiply*

-(Now added `madded` which is twin-half 64x64->HI64/LO64 in [[sv/biginteger]])
+(Now added `maddedu` which is twin-half 64x64->HI64/LO64 in [[sv/biginteger]])

## vec_rl - rotate left

diff --git a/openpower/sv/biginteger.mdwn b/openpower/sv/biginteger.mdwn
index 840541915..cd32936ed 100644
--- a/openpower/sv/biginteger.mdwn
+++ b/openpower/sv/biginteger.mdwn
@@ -28,7 +28,7 @@ Covered in [[biginteger/analysis]] the summary is that standard `adde` is
sufficient for SVP64 Vectorisation of big-integer addition (and subfe
for subtraction) but that big-integer multiply and divide require an
extra 3-in 2-out instruction, similar to Intel's `mulx`, to be efficient.
-The same instruction (`madded`) is used for both because `madded`'s primary
+The same instruction (`maddedu`) is used for both because `maddedu`'s primary
purpose is to perform a fused 64-bit scalar multiply with a large vector,
where that result is Big-Added for Big-Multiply, but Big-Subtracted for
Big-Divide.
@@ -38,11 +38,11 @@ fashion that is hidden from the user, behind a consistent, stable ISA API.
The same macro-op fusion may theoretically be deployed even on Scalar
operations.

-# madded
+# maddedu

**DRAFT**

-`madded` is similar to v3.0 `madd`, and
+`maddedu` is similar to v3.0 `madd`, and
is VA-Form despite having 2 outputs: the second
destination register is implicit.

| EXT04 | RT | RA | RB | RC | XO |
|-------|-----|------|------|------|------|

-The pseudocode for `madded RT, RA, RB, RC` is:
+The pseudocode for `maddedu RT, RA, RB, RC` is:

    prod[0:127] = (RA) * (RB)
    sum[0:127] = EXTZ(RC) + prod

@@ -62,7 +62,7 @@ to it; the lower half of that result stored in RT and the upper half in RS.

The differences here to `maddhdu` are that `maddhdu` stores the upper
-half in RT, where `madded` stores the upper half in RS. There is no
+half in RT, whereas `maddedu` stores the upper half in RS. There is no
equivalent to `maddld` because `maddld` performs sign-extension on RC.

*Programmer's Note:
As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.
To achieve the same big-integer rolling-accumulation effect
as SVP64: assuming the scalar to multiply is in r0,
the vector to multiply by starts at r4 and the result vector
-in r20, instructions may be issued `madded r20,r4,r0,r20
-madded r21,r5,r0,r21` etc. where the first `madded` will have
+in r20, instructions may be issued `maddedu r20,r4,r0,r20
+maddedu r21,r5,r0,r21` etc. where the first `maddedu` will have
stored the upper half of the 128-bit multiply into r21, such
-that it may be picked up by the second `madded`. Repeat inline
+that it may be picked up by the second `maddedu`. Repeat inline
to construct a larger bigint scalar-vector multiply, as Scalar
GPR register file space permits.*
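To make the twin-output semantics concrete, here is a minimal C model of the `maddedu` pseudocode and the Programmer's Note chain above (illustrative only, not part of the patch; it assumes GCC/Clang's `unsigned __int128` extension, and the function and variable names are hypothetical):

```
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// maddedu RT,RA,RB,RC: RT gets the lower 64 bits of (RA)*(RB)+RC,
// the implicit RS gets the upper 64 bits. The sum cannot overflow:
// (2^64-1)*(2^64-1) + (2^64-1) < 2^128.
static void maddedu(uint64_t ra, uint64_t rb, uint64_t rc,
                    uint64_t *rt, uint64_t *rs)
{
    unsigned __int128 prod = (unsigned __int128)ra * rb; // prod[0:127]
    unsigned __int128 sum = prod + rc;                   // EXTZ(RC) + prod
    *rt = (uint64_t)sum;         // lower half -> RT
    *rs = (uint64_t)(sum >> 64); // upper half -> implicit RS
}

int main(void)
{
    uint64_t r0 = 3;                     // scalar multiplier
    uint64_t r4 = 0xffffffffffffffffULL; // vector element 0
    uint64_t r5 = 0x123456789abcdef0ULL; // vector element 1
    uint64_t r20 = 0, r21 = 0, r22 = 0;  // result vector, RS=RT+1

    maddedu(r4, r0, r20, &r20, &r21); // maddedu r20,r4,r0,r20
    maddedu(r5, r0, r21, &r21, &r22); // maddedu r21,r5,r0,r21: picks up r21
    printf("%016" PRIx64 " %016" PRIx64 " %016" PRIx64 "\n", r20, r21, r22);
    return 0;
}
```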
@@ -87,7 +87,7 @@ used with the additional bit set for determining RS.

| Rsrc1\_EXTRA2 | `12:13` | extends RA (R\*\_EXTRA2 Encoding) |
| Rsrc2\_EXTRA2 | `14:15` | extends RB (R\*\_EXTRA2 Encoding) |
| Rsrc3\_EXTRA2 | `16:17` | extends RC (R\*\_EXTRA2 Encoding) |
-| EXTRA2_MODE | `18` | used by `madded` for determining RS |
+| EXTRA2_MODE | `18` | used by `maddedu` for determining RS |

When `EXTRA2_MODE` is set to zero, the implicit RS register
takes its Vector/Scalar setting from Rdest_EXTRA2, and takes
@@ -109,7 +109,7 @@ that is near-identical to `divdeu` except that:

* the lower 64 bits of the dividend, instead of being zero, contain a
  register, RC.
* it performs a fused divide and modulo in a single instruction, storing
-  the modulo in an implicit RS (similar to `madded`)
+  the modulo in an implicit RS (similar to `maddedu`)

RB, the divisor, remains 64 bit. The instruction is therefore a 128/64
division, producing a (pair) of 64 bit result(s). Overflow conditions
@@ -124,8 +124,8 @@ been set to useful values needed as part of implementing Knuth's Algorithm D*

For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
-the exact same EXTRA format and setting of RS is used as for `sv.madded`.
-For Scalar usage, just as for `madded`, `RS=RT+1` (similar to `lq` and `stq`).
+the exact same EXTRA format and setting of RS is used as for `sv.maddedu`.
+For Scalar usage, just as for `maddedu`, `RS=RT+1` (similar to `lq` and `stq`).

Pseudo-code:

For the Opcode map (XO Field) see Power ISA v3.1, Book III, Appendix D,
Table 13 (sheet 7 of 8), p1357.

-Proposed is the addition of `madded` (**DRAFT, NOT APPROVED**) in `110010`
+Proposed is the addition of `maddedu` (**DRAFT, NOT APPROVED**) in `110010`
and `divmod2du` in `110100`

|110000|110001 |110010     |110011|110100       |110101|110110|110111|
|------|-------|-----------|------|-------------|------|------|------|
-|maddhd|maddhdu|**madded** |maddld|**divmod2du**|rsvd  |rsvd  |rsvd  |
+|maddhd|maddhdu|**maddedu**|maddld|**divmod2du**|rsvd  |rsvd  |rsvd  |

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index 7ff374a84..ef4fdf540 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -5,9 +5,9 @@

**DRAFT SVP64**

* Revision 0.0: 21apr2022
-* Revision 0.01: 22apr2022 removal of msubed because sv.madded and sv.subfe works
+* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe work
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
-* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to madded
+* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)

@@ -237,7 +237,7 @@ Adapted from a simple implementation of Knuth M:

```
-    // this becomes the basis for sv.madded in RS=RC Mode,
+    // this becomes the basis for sv.maddedu in RS=RC Mode,
     // where k is RC. k takes the upper half of product
     // and adds it in on the next iteration
     k = 0;

@@ -369,7 +369,7 @@ this time using subtract instead of add.

```
     uint32_t carry = 0;
-    // this is just sv.madded again
+    // this is just sv.maddedu again
     for (int i = 0; i <= n; i++) {
         uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
         carry = (uint32_t)(value >> 32); // upper half for next loop
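The rolling-carry loop in these hunks is worth seeing in the 64-bit element form that `sv.maddedu` actually Vectorises. A C sketch (illustrative only, not part of the patch; assumes GCC/Clang `unsigned __int128`, and the names are hypothetical):

```
#include <stdint.h>

// big = scalar * vector: one maddedu per element. The upper half of
// each 128-bit product is rolled into the next element as RC, exactly
// the "k takes the upper half" pattern above. result[] has n+1 slots.
void bigmul64(uint64_t result[], const uint64_t vec[], int n,
              uint64_t scalar)
{
    uint64_t k = 0; // RC on entry to each element, RS on exit
    for (int i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)vec[i] * scalar + k;
        result[i] = (uint64_t)t; // RT: lower half
        k = (uint64_t)(t >> 64); // RS: upper half, next element's RC
    }
    result[n] = k; // final carry-out
}
```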
@@ -467,7 +467,7 @@ and a subtract.

**Back to Vector carry-looping**

There is however another reason for having a 128/64 division
-instruction, and it's effectively the reverse of `madded`.
+instruction, and it's effectively the reverse of `maddedu`.

Look closely at Algorithm D when the divisor is only a scalar (`v[0]`):

@@ -481,7 +481,7 @@ Look closely at Algorithm D when the divisor is only a scalar
}
```

-Here, just as with `madded` which can put the hi-half of the 128 bit product
+Here, just as with `maddedu` which can put the hi-half of the 128 bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.

@@ -518,7 +518,7 @@ allows the instruction to perform full parallel vector div/mod,
or act in loop-back mode for big-int division by a scalar,
or for a single scalar 128/64 div/mod.

-Again, just as with `sv.madded` and `sv.adde`, adventurous implementors
+Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC
and outputting the last RC. Where efficient algorithms such as Goldschmidt

diff --git a/openpower/sv/biginteger/divgnu64.c b/openpower/sv/biginteger/divgnu64.c
index 649143963..b0f203398 100644
--- a/openpower/sv/biginteger/divgnu64.c
+++ b/openpower/sv/biginteger/divgnu64.c
@@ -88,7 +88,7 @@ bool bigmul(unsigned long long qhat, unsigned product[], unsigned vn[], int m,
     // Multiply and subtract.
     uint32_t carry = 0;
     // VL = n + 1
-    // sv.madded product.v, vn.v, qhat.s, carry.s
+    // sv.maddedu product.v, vn.v, qhat.s, carry.s
     for (int i = 0; i <= n; i++)
     {
         uint32_t vn_v = i < n ? vn[i] : 0;

diff --git a/openpower/sv/biginteger/divmnu64.c b/openpower/sv/biginteger/divmnu64.c
index 10b05d5ab..b4bd10659 100644
--- a/openpower/sv/biginteger/divmnu64.c
+++ b/openpower/sv/biginteger/divmnu64.c
@@ -78,7 +78,7 @@ bool bigmul(uint32_t qhat, unsigned product[], unsigned vn[], int m, int n)
 {
     uint32_t carry = 0;
     // VL = n + 1
-    // sv.madded product.v, vn.v, qhat.s, carry.s
+    // sv.maddedu product.v, vn.v, qhat.s, carry.s
     for (int i = 0; i <= n; i++)
     {
         uint32_t vn_v = i < n ? vn[i] : 0;
@@ -372,14 +372,14 @@ int divmnu(unsigned q[], unsigned r[], const unsigned u[], const unsigned v[],
             un[i + j] = (uint32_t)result;
         }
         bool need_fixup = carry != 0;
-#elif defined(MADDED_SUBFE)
+#elif defined(MADDEDU_SUBFE)
         (void)p, (void)t; // shut up unused variable warning
         // Multiply and subtract.
         uint32_t carry = 0;
         uint32_t product[n + 1];
         // VL = n + 1
-        // sv.madded product.v, vn.v, qhat.s, carry.s
+        // sv.maddedu product.v, vn.v, qhat.s, carry.s
         for (int i = 0; i <= n; i++)
         {
             uint32_t vn_v = i < n ? vn[i] : 0;
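Conversely, the "reverse of `maddedu`" loop that `sv.divmod2du` Vectorises can be sketched in C as well (illustrative only, not part of the patch; 32-bit elements to match the `divmnu64.c` loops above, names hypothetical):

```
#include <stdint.h>

// bigint / scalar: one divmod2du per element, most-significant first.
// The modulo is rolled back in as the upper half of the next 64/32
// dividend, the mirror image of maddedu's carry. Assumes v0 != 0;
// the remainder r stays < v0, so each quotient digit fits in 32 bits.
void bigdiv32(uint32_t q[], const uint32_t un[], int n, uint32_t v0)
{
    uint32_t r = 0; // RC on entry to each element, implicit RS on exit
    for (int i = n - 1; i >= 0; i--) {
        uint64_t d = ((uint64_t)r << 32) | un[i]; // modulo || element
        q[i] = (uint32_t)(d / v0);
        r = (uint32_t)(d % v0); // modulo feeds the next element
    }
}
```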
diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn
index 86b760599..a7d23f106 100644
--- a/openpower/sv/svp64.mdwn
+++ b/openpower/sv/svp64.mdwn
@@ -435,11 +435,11 @@ is based on whether the number of src operands is 2 or 3. With only

| Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
| Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
| Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
-| EXTRA2_MODE | `18` | used by `divmod2du` and `madded` for RS |
+| EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |

These are for 3 operand in and either 1 or 2 out instructions.
3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
-such as `madded` have an implicit second destination, RS, the
+such as `maddedu` have an implicit second destination, RS, the
selection of which is determined by bit 18.

## RM-1P-2S1D

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index ae8a0616f..5df704f4a 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -1082,10 +1082,10 @@ being only 32 bit, 5 operands is quite an ask. `lq` however sets a precedent: `RTp`
stands for "RT pair". In other words the result is stored
in RT and RT+1. For Scalar operations, following this precedent
is perfectly reasonable. In Scalar mode,
-`madded` therefore stores the two halves of the 128-bit multiply
+`maddedu` therefore stores the two halves of the 128-bit multiply
into RT and RT+1.

-What, then, of `sv.madded`? If the destination is hard-coded to
+What, then, of `sv.maddedu`? If the destination is hard-coded to
RT and RT+1 the instruction is not useful when Vectorised because
the output will be overwritten on the next element. To solve this
is easy: define the destination registers as RT and RT+MAXVL
@@ -1097,7 +1097,7 @@ and bear in mind that element-width overrides still have to be taken
into consideration, the starting point for the implicit destination
is best illustrated in pseudocode:

-    # demo of madded
+    # demo of maddedu
     for (i = 0; i < VL; i++)
         if (predval & 1<<i)
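The pseudocode breaks off at this point in this copy of the patch; the destination layout it is illustrating can be sketched in C (illustrative only, not part of the patch; assumes a flat GPR array, a fixed MAXVL, and ignores predication and element-width overrides):

```
#include <stdint.h>

enum { MAXVL = 8 }; // assumption for illustration

// Vectorised sv.maddedu: element i writes its lower half to RT+i and
// its upper half to the implicit RS vector at RT+MAXVL+i, so results
// are not overwritten from one element to the next.
void sv_maddedu_dest(uint64_t regs[], int RT, int VL,
                     const uint64_t lo[], const uint64_t hi[])
{
    for (int i = 0; i < VL; i++) {
        regs[RT + i] = lo[i];         // RT vector
        regs[RT + MAXVL + i] = hi[i]; // implicit RS = RT + MAXVL
    }
}
```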