From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Wed, 19 Apr 2023 15:23:12 +0000 (+0100)
Subject: bug #1063, remove predicate-result
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=888297dcfd7fa442bb5832b31c68ab9d79985075;p=libreriscv.git

bug #1063, remove predicate-result
---

diff --git a/openpower/sv/bitmanip.mdwn b/openpower/sv/bitmanip.mdwn
index 59688e129..2c00e1d77 100644
--- a/openpower/sv/bitmanip.mdwn
+++ b/openpower/sv/bitmanip.mdwn
@@ -178,11 +178,7 @@ writing back to non-masked-out bits of `BF`.
 
 required for the [[sv/av_opcodes]]
 
-signed and unsigned min/max for integer.  this is sort-of partly
-synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg
-is one of the sources, but not both signed and unsigned.  when the dest
-is also one of the srces and the mv fails due to the CR bittest failing
-this will only overwrite the dest where the src is greater (or less).
+signed and unsigned min/max for integer.
 
 signed/unsigned min/max gives more flexibility.
 
diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn
index c82cb592e..d486556ae 100644
--- a/openpower/sv/ldst.mdwn
+++ b/openpower/sv/ldst.mdwn
@@ -72,7 +72,6 @@ alternative table definition for [[sv/svp64]] `RM.MODE`.  The following
 modes make sense:
 
 * saturation
-* predicate-result would be useful but is lower priority than Data-Dependent Fail-First
 * simple (no augmentation)
 * fail-first (where Vector Indexed is banned)
 * Signed Effective Address computation (Vector Indexed only)
diff --git a/openpower/sv/overview.mdwn b/openpower/sv/overview.mdwn
index 7bc2d0a6a..a17afdcb0 100644
--- a/openpower/sv/overview.mdwn
+++ b/openpower/sv/overview.mdwn
@@ -170,7 +170,6 @@ The rest of this document builds on the above simple loop to add:
 * Compacted operations into registers (normally only provided by SIMD)
 * Fail-on-first (introduced in ARM SVE2)
 * A new concept: Data-dependent fail-first
-* Condition-Register based *post-result* predication (also new)
 * A completely new concept: "Twin Predication"
 * vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
 
@@ -803,49 +802,6 @@ tricky because it typically does not exist in standard scalar ISAs.
 If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
 VGATHER/VSCATTER.
 
-# CR predicate result analysis
-
-Power ISA has Condition Registers.  These store an analysis of the result
-of an operation to test it for being greater, less than or equal to zero.
-What if a test could be done, similar to branch BO testing, which hooked
-into the predication system?
-
-    for i in range(VL):
-        # predication test, skip all masked out elements.
-        if predicate_masked_out(i): continue # skip
-        result = op(iregs[RA+i], iregs[RB+i])
-        CRnew = analyse(result) # calculates eq/lt/gt
-        # Rc=1 always stores the CR
-        if RC1 or Rc=1: crregs[offs+i] = CRnew
-        if RC1: continue # RC1 mode skips result store
-        # now test CR, similar to branch
-        if CRnew[BO[0:1]] == BO[2]:
-            # result optionally stored but CR always is
-            iregs[RT+i] = result
-
-Note that whilst the Vector of CRs is always written to the CR regfile,
-only those result elements that pass the BO test get written to the
-integer regfile (when RC1 mode is not set).  In RC1 mode the CR is always
-stored, but the result never is. This effectively turns every arithmetic
-operation into a type of `cmp` instruction.
-
-Here for example if FP overflow occurred, and the CR testing was carried
-out for that, all valid results would be stored but invalid ones would
-not, but in addition the Vector of CRs would contain the indicators of
-which ones failed.  With the invalid results being simply not written
-this could save resources (save on register file writes).
-
-Also expected is, due to the fact that the predicate mask is effectively
-ANDed with the post-result analysis as a secondary type of predication,
-that there would be savings to be had in some types of operations where
-the post-result analysis, if not included in SV, would need a second
-predicate calculation followed by a predicate mask AND operation.
-
-Note, hilariously, that Vectorised Condition Register Operations (crand,
-cror) may also have post-result analysis applied to them.  With Vectors
-of CRs being utilised *for* predication, possibilities for compact and
-elegant code begin to emerge from this innocuous-looking addition to SV.
-
 # Exception-based Fail-on-first
 
 One of the major issues with Vectorised LD/ST operations is when a
@@ -898,9 +854,7 @@ Note: see <https://bugs.libre-soc.org/show_bug.cgi?id=561>
 
 # Data-dependent fail-first
 
-This is a minor variant on the CR-based predicate-result mode.  Where
-pred-result continues with independent element testing (any of which may
-be parallelised), data-dependent fail-first *stops* at the first failure:
+Data-dependent fail-first *stops* at the first failure:
 
     if Rc=0: BO = inv<<2 | 0b00 # test CR.eq bit z/nz
     for i in range(VL):
@@ -924,9 +878,7 @@ that tested for the possibility of that failure, in advance of doing
 the actual calculation.
 
 The only minor downside here though is the change to VL, which in some
-implementations may cause pipeline stalls.  This was one of the reasons
-why CR-based pred-result analysis was added, because that at least is
-entirely paralleliseable.
+implementations may cause pipeline stalls.
 
 # Vertical-First Mode
 
diff --git a/openpower/sv/predication.mdwn b/openpower/sv/predication.mdwn
index 31eecfafc..53a56d22d 100644
--- a/openpower/sv/predication.mdwn
+++ b/openpower/sv/predication.mdwn
@@ -173,7 +173,7 @@ On balance this is a less favourable option than vectorising CRs
 
 ## Scalar (single) integer as predicate, with one DM row
 
-This idea has merit in that to perform predicate bitmanip operations the preficate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away.  Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
+This idea has merit in that to perform predicate bitmanip operations the predicate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away.  Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
 
 This idea has several disadvantages.
 
diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index 65bbb2585..e2765fb5f 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -265,16 +265,6 @@ REMAP is outlined separately.
   truncates VL to that exact point. Useful for implementing algorithms
   such as `strcpy` in around 14 high-performance Vector instructions, the
   option exists to include or exclude the failing element.
-* **Predicate-result**: a strategic mode that effectively turns all and any
-  operations into a type of `cmp`. An `Rc=1 BO test` is performed and if
-  failing that element result is **not** written to the regfile. The `Rc=1`
-  Vector of co-results **is** always written (subject to usual predication).
-  Termed "predicate-result" because the combination of producing then
-  testing a result is as if the test was in a follow-up predicated
-  copy/mv operation, it reduces regfile pressure and instruction count.
-  Also useful on saturated or other overflowing operations, the overflowing
-  elements may be excluded from outputting to the regfile then
-  post-analysed outside of critical hot-loops.
 
 **RM Modes**
 
diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn
index 69da11392..de08ccdba 100644
--- a/openpower/sv/svp64.mdwn
+++ b/openpower/sv/svp64.mdwn
@@ -100,7 +100,7 @@ only 24 bits:
 * Predication on both source and destination
 * Two different sources of predication: INT and CR Fields
 * SV Modes including saturation (for Audio, Video and DSP), mapreduce,
-  fail-first and predicate-result mode.
+  and fail-first mode.
 
 Different classes of operations require different formats. The earlier
 sections cover the common formats and the four separate modes follow:
diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index edb8ecf10..170ab08b0 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -564,24 +564,6 @@ There are two primary different types of CR operations:
 
 More details can be found in [[sv/cr_ops]].
 
-## pred-result mode
-
-Pred-result mode may not be applied on CR-based operations.
-
-Although CR operations (mtcr, crand, cror) may be Vectorised, predicated,
-pred-result mode applies to operations that have an Rc=1 mode, or make
-sense to add an RC1 option.
-
-Predicate-result merges common CR testing with predication, saving
-on instruction count.  In essence, a Condition Register Field test is
-performed, and if it fails it is considered to have been *as if* the
-destination predicate bit was zero.  Given that there are no CR-based
-operations that produce Rc=1 co-results, there can be no pred-result
-mode  for mtcr and other CR-based instructions
-
-Arithmetic and Logical Pred-result, which does have Rc=1 or for which
-RC1 Mode makes sense, is covered in [[sv/normal]]
-
 ## CR Operations
 
 CRs are slightly more involved than INT or FP registers due to the
@@ -848,7 +830,6 @@ Qualifiers:
 * ew={N}: ew=8/16/32 - sets elwidth override
 * sw={N}: sw=8/16/32 - sets source elwidth override
 * ff={xx}: see fail-first mode
-* pr={xx}: see predicate-result mode
 * sat{x}: satu / sats - see saturation mode
 * mr: see map-reduce mode
 * mrr: map-reduce, reverse-gear (VL-1 downto 0)
@@ -860,9 +841,6 @@ Qualifiers:
 
 For modes:
 
-* pred-result:
-  - pm=lt/gt/le/ge/eq/ne/so/ns
-  - RC1 mode
 * fail-first
   - ff=lt/gt/le/ge/eq/ne/so/ns
   - RC1 mode
diff --git a/openpower/sv/svp64_quirks.mdwn b/openpower/sv/svp64_quirks.mdwn
index b55e6b767..ea6d17b0a 100644
--- a/openpower/sv/svp64_quirks.mdwn
+++ b/openpower/sv/svp64_quirks.mdwn
@@ -101,8 +101,8 @@ makes no sense at all, such as `sc` or `mtmsr`). The categories are:
 **Arithmetic**
 
 Arithmetic (known as "normal" mode) is where Scalar and Parallel
-Reduction can be done: Saturation as well, and two new innovative
-modes for Vector ISAs: data-dependent fail-first and predicate result.
+Reduction can be done: Saturation as well, and a new innovative
+mode for Vector ISAs: data-dependent fail-first.
 Reduction and Saturation are common to see in Vector ISAs: it is just
 that they are usually added as explicit instructions,
 and NEC SX Aurora has even more iterative instructions. In SVP64 these
@@ -142,8 +142,7 @@ overrides make absolutely no sense whatsoever. Therefore the elwidth
 override field bits can be used for other purposes when Vectorising
 CR Field instructions.  Moreover, Rc=1 is completely invalid for
 CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
-a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
-such as predicate-result make no sense, and neither does Saturation.
+a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense.
 All of these differences, which require quite a lot of logical
 reasoning and deduction, help explain why there is an entirely different
 CR ops Vectorisation Category.