From 86d62e10761d9c4a9ac7ef5b0650bfdc93a88d2a Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sat, 29 Apr 2023 16:38:17 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls016.mdwn | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)
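
Reviewer note on item 4 of the added text. The sketch below is a generic
radix-2 butterfly in Python, not the exact semantics of the proposed
instructions. The function names, the list `x`, and the twiddle factor `w`
are all assumptions made for illustration. It tries to show why two
single-result operations that overwrite their own inputs need a temporary
(a whole vector of them when Vectorised), and how reading both operands
before writing either avoids that.

```python
def butterfly_naive(x, i, j, w):
    """Two single-result ops with no temporary (illustrative, incorrect)."""
    x[i] = x[i] + w * x[j]
    x[j] = x[i] - w * x[j]  # reads the already-overwritten x[i]


def butterfly_with_temp(x, i, j, w):
    """Same two ops, with one extra temporary holding the original x[i]."""
    t = x[i]
    x[i] = t + w * x[j]
    x[j] = t - w * x[j]


def butterfly_read_both_first(x, i, j, w):
    """Models an operation that reads both operands before writing either."""
    a, b = x[i], x[j]
    x[i], x[j] = a + w * b, a - w * b


def vector_butterfly_with_temps(xs, ys, w):
    """Two whole-vector passes: needs a full vector of temporaries."""
    temps = list(xs)  # copy of every original xs element
    xs[:] = [t + w * y for t, y in zip(temps, ys)]
    ys[:] = [t - w * y for t, y in zip(temps, ys)]
```

With `x = [3, 5]` and `w = 2`, the naive version leaves `x[1]` wrong, while
the other two scalar versions agree with each other.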

diff --git a/openpower/sv/rfc/ls016.mdwn b/openpower/sv/rfc/ls016.mdwn
index f2f98ee60..1daf67662 100644
--- a/openpower/sv/rfc/ls016.mdwn
+++ b/openpower/sv/rfc/ls016.mdwn
@@ -73,12 +73,32 @@ are often loop-unrolled, resulting in L1 I-Cache stripmining.
 1. Whilst it is easy to justify these high-value instructions they are
    sufficiently complex as to warrant placement as optional SFFS in
    the new EXT2xx area (marked as Vectoriseable).
-2. Although they are 3-in 2-out the actual encoding is as double-overwrite
+2. Although they are 3-in 2-out the actual encoding is as a double-overwrite
    reducing the actual number of operands down to three (RT RA and RB)
-   where RT is a Read-Modify-Write and an additional RS is implicit.
-* Although desirable (particularly to detect overflow) Rc=1 is hard to
-  conceptualise.  It is likely that instead, Simple-V "saturation" if
-  enabled will create an Rc=1 CR.SO flag (including SVP64Single).
+   where RT is a Read-Modify-Write and an additional RS (normally RT+1) is implicit.
+3. As with the biginteger set of 3-in 2-out instructions, if Power ISA did not
+   already have LD/ST-with-Update, Load/Store-Quad, and other RTp and RAp instructions,
+   these instructions would not be proposed.
+4. The read and write of two overlapping registers normally requires
+   an intermediate register (similar to the justification for CAS - Compare-and-Swap).
+   When Vectorised the situation becomes even worse: an entire *Vector*
+   of intermediate temporaries is required.
+   Thus, *even if implemented inefficiently* and requiring more cycles to complete
+   (taking an extra cycle to write the second result), these instructions still
+   save on resources.
+5. Macro-op fusion equivalents of these instructions are *not possible* for
+   exactly the same reason that the equivalent CAS sequence may not be macro-op
+   fused.  Full in-place Vectorised FFT and DCT algorithms *only* become
+   possible due to these instructions atomically reading **both** operands
+   into internal Reservation Stations (exactly like CAS).
+6. Although desirable (particularly to detect overflow), Rc=1 is hard to
+   conceptualise.  It is likely that instead, Simple-V "saturation", if
+   enabled, will create an Rc=1 CR.SO flag (including SVP64Single).
+7. Saturated variants are **not** included: that is what SVP64 and SVP64Single
+   provide (SVP64 provides a signed/unsigned saturation enhancement).
+8. Unlike in ARM, 8, 16 and 32 bit variants (except FP Single) are **not**
+   included: that is what SVP64 and SVP64Single provide (SVP64 "redefines"
+   "FP Single" to be "half of the register/element width").
 
 **Changes**
 
-- 
2.30.2