From 86d62e10761d9c4a9ac7ef5b0650bfdc93a88d2a Mon Sep 17 00:00:00 2001
From: lkcl
Date: Sat, 29 Apr 2023 16:38:17 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls016.mdwn | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/openpower/sv/rfc/ls016.mdwn b/openpower/sv/rfc/ls016.mdwn
index f2f98ee60..1daf67662 100644
--- a/openpower/sv/rfc/ls016.mdwn
+++ b/openpower/sv/rfc/ls016.mdwn
@@ -73,12 +73,32 @@ are often loop-unrolled, resulting in L1 I-Cache stripmining.
 1. Whilst it is easy to justify these high-value instructions they are
    sufficiently complex as to warrant placement as optional SFFS in the
    new EXT2xx area (marked as Vectoriseable).
-2. Although they are 3-in 2-out the actual encoding is as double-overwrite
+2. Although they are 3-in 2-out the actual encoding is as a double-overwrite
    reducing the actual number of operands down to three (RT RA and RB)
-   where RT is a Read-Modify-Write and an additional RS is implicit.
-* Although desirable (particularly to detect overflow) Rc=1 is hard to
-  conceptualise. It is likely that instead, Simple-V "saturation" if
-  enabled will create an Rc=1 CR.SO flag (including SVP64Single).
+   where RT is a Read-Modify-Write and an additional RS (normally RT+1) is implicit.
+3. As with the biginteger set of 3-in 2-out instructions if Power ISA did not
+   already have LD/ST-with-Update, Load/Store-Quad, and other RTp and RAp instructions,
+   these instructions would not be proposed.
+4. The read and write of two overlapping registers normally requires
+   an intermediate register (similar to the justification for CAS - Compare-and-Swap).
+   When Vectorised the situation becomes even worse: an entire *Vector*
+   of intermediate temporaries is required.
+   Thus *even if implemented inefficiently* requiring more cycles to complete
+   (taking an extra cycle to write the second result) these instructions still
+   save on resources.
+5. Macro-op fusion equivalents of these instructions are *not possible* for
+   exactly the same reason that the equivalent CAS sequence may not be macro-op
+   fused. Full in-place Vectorised FFT and DCT algorithms *only* become
+   possible due to these instructions atomically reading **both** operands
+   into internal Reservation Stations (exactly like CAS).
+6. Although desirable (particularly to detect overflow) Rc=1 is hard to
+   conceptualise. It is likely that instead, Simple-V "saturation" if
+   enabled will create an Rc=1 CR.SO flag (including SVP64Single).
+7. Saturated variants are **not** included: that is what SVP64 and SVP64Single
+   provide (SVP64 provides a signed/unsigned saturation enhancement).
+8. Unlike in ARM, (except FP Single), 8 16 and 32 bit variants are **not**
+   included: that is what SVP64 and SVP64Single provide (SVP64 "redefines"
+   "FP Single" to be "half of the register/element width").
 
 **Changes**
 
-- 
2.30.2