-# Normal SVP64 Modes, for Arithmetic and Logical Operations
+# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
-Normal SVP64 Mode covers Arithmetic and Logical operations
-to provide suitable additional behaviour. The Mode
-field is bits 19-23 of the [[svp64]] RM Field.
+TODO template from other RFCs
-## Mode
-
-Mode is an augmentation of SV behaviour, providing additional
-functionality. Some of these alterations are element-based (saturation),
-others involve post-analysis (predicate result) and others are
-Vector-based (mapreduce, fail-on-first).
-
-[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately:
-the following Modes apply to Arithmetic and Logical SVP64 operations:
-
-* **simple** mode is straight vectorisation with no augmentations: the
- vector comprises an array of independently created results.
-* **ffirst** or data-dependent fail-on-first: see separate section.
- The vector may be truncated depending on certain criteria.
- *VL is altered as a result*.
-* **sat mode** or saturation: clamps each element result to a min/max
- rather than overflowing or wrapping. Allows signed and unsigned clamping
- for both INT and FP.
-* **reduce mode**: if used correctly, a mapreduce (or a prefix sum)
- is performed. See [[svp64/appendix]].
- Note that there are comprehensive caveats when using this mode.
-* **pred-result** will test the result (CR testing selects a bit of CR
- and inverts it, just like branch conditional testing) and if the
- test fails it is as if the *destination* predicate bit was zero even
- before starting the operation. When Rc=1 the CR element however is
- still stored in the CR regfile, even if the test failed. See appendix
- for details.
-
-Note that ffirst and reduce modes are not anticipated to be
-high-performance in some implementations: ffirst due to interactions
-with VL, and reduce because it requires additional operations to produce
-a result. Simple, saturate and pred-result are however inter-element
-independent and may easily be parallelised to give high performance,
-regardless of the value of VL.
-
-The Mode table for Arithmetic and Logical operations is laid out as
-follows:
-
-| 0-1 | 2 | 3 4 | description |
-| --- | --- |---------|-------------------------- |
-| 00 | 0 | dz sz | simple mode |
-| 00 | 1 | 0 RG | scalar reduce mode (mapreduce) |
-| 00 | 1 | 1 / | reserved |
-| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
-| 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
-| 10 | N | dz sz | sat mode: N=0/1 u/s |
-| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
-| 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
-
-Fields:
-
-* **sz / dz** if predication is enabled will put zeros into the dest
- (or as src in the case of twin pred) when the predicate bit is zero.
- otherwise the element is ignored or skipped, depending on context.
-* **zz**: both sz and dz are set equal to this flag
-* **inv CR bit** just as in branches (BO) these bits allow testing of
- a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **RG** inverts the Vector Loop order (VL-1 downto 0) rather
- than the normal 0..VL-1
-* **N** sets signed/unsigned saturation.
-* **RC1** as if Rc=1, enables access to `VLi`.
-* **VLi** VL inclusive: in fail-first mode, the truncation of
- VL *includes* the current element at the failure point rather
- than excludes it from the count.
-
-For LD/ST Modes, see [[sv/ldst]]. For Condition Registers see
-[[sv/cr_ops]]. For Branch modes, see [[sv/branches]].
-
-## Rounding, clamp and saturate
-
-To help ensure for example that audio quality is not compromised by
-overflow, "saturation" is provided, as well as a way to detect when
-saturation occurred if desired (Rc=1). When Rc=1 there will be a *vector*
-of CRs, one CR per element in the result (Note: this is different from
-VSX which has a single CR per block).
-
-When N=0 the result is saturated to within the maximum range of an
-unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
-logic applies to FP operations, with the result being saturated to
-maximum rather than returning INF, and the minimum to +0.0.
-
-When N=1 the same occurs except that the result is saturated to the min
-or max of a signed result, and for FP to the min and max value rather
-than returning +/- INF.
-
-When Rc=1, the CR "overflow" bit is set on the CR associated with
-the element, to indicate whether saturation occurred. Note that
-due to the hugely detrimental effect it has on parallel processing,
-XER.SO is **ignored** completely and is **not** brought into play here.
-The CR overflow bit is therefore simply set to zero if saturation did
-not occur, and to one if it did. This behaviour (ignoring XER.SO) is
-actually optional in the SFFS Compliancy Subset: for SVP64 it is made
-mandatory *but only on Vectorised instructions*.
-
-Note also that saturate on operations that set OE=1 must raise an Illegal
-Instruction due to the conflicting use of the CR.so bit for storing
-whether saturation occurred. For Vectorised Integer Operations that
-produce a Carry-Out (CA, CA32), these two bits will be `UNDEFINED` if
-saturation is also requested.
-
-Note that the operation takes place at the maximum bitwidth (max of
-src and dest elwidth) and that truncation occurs to the range of the
-dest elwidth.
-
-*Programmer's Note: Post-analysis of the Vector of CRs to find out if any
-given element hit saturation may be done using a mapreduced CR op (cror),
-or by using the new crrweird instruction with Rc=1, which will transfer
-the required CR bits to a scalar integer and update CR0, which will allow
-testing the scalar integer for nonzero. See [[sv/cr_int_predication]].
-Alternatively, a Data-Dependent Fail-First may be used to truncate the
-Vector Length to non-saturated elements, greatly increasing the productivity
-of parallelised inner hot-loops.*
-
-## Reduce mode
-
-Reduction in SVP64 is similar in essence to other Vector Processing ISAs,
-but leverages the underlying scalar Base v3.0B operations. Thus it is
-more a convention that the programmer may utilise to give the appearance
-and effect of a Horizontal Vector Reduction. Due to the unusual decoupling
-it is also possible to perform prefix-sum (Fibonacci Series) in certain
-circumstances. Details are in the SVP64 appendix.
-
-Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
-As explained in the [[sv/appendix]] Reduce Mode switches off the check
-which would normally stop looping if the result register is scalar.
-Thus, the result scalar register, if also used as a source scalar,
-may be used to perform sequential accumulation. This *deliberately*
-sets up a chain of Register Hazard Dependencies, whereas Parallel Reduce
-[[sv/remap]] deliberately issues a Tree-Schedule of operations that may
-be parallelised.
-
-## Data-dependent Fail-on-first
-
-Data-dependent fail-on-first is very different from LD/ST Fail-First
-(also known as Fault-First) and is actually CR-field-driven.
-Vector elements are required to appear
-to be executed in sequential Program Order. When REMAP is not active,
-element 0 would be the first.
-
-Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
-CR-creating operation produces a result (including cmp). Similar to
-branch, an analysis of the CR is performed and if the test fails, the
-vector operation terminates and discards all element operations **at and
-above the current one**, and VL is truncated to either the *previous*
-element or the current one, depending on whether VLi (VL "inclusive")
-is clear or set, respectively.
-
-Thus the new VL comprises a contiguous vector of results, all of which
-pass the testing criteria (equal to zero, less than zero etc as defined
-by the CR-bit test).
-
-*Note: when VLi is clear, the behaviour at first seems counter-intuitive.
-A result is calculated but if the test fails it is prohibited from being
-actually written. This becomes intuitive again when it is remembered
-that the length that VL is set to is the number of *written* elements, and
-only when VLi is set will the current element be included in that count.*
-
-The CR-based data-driven fail-on-first is "new" and not found in ARM SVE
-or RVV. At the same time it is "old" because it is almost identical to
-a generalised form of Z80's `CPIR` instruction. It is extremely useful
-for reducing instruction count, but requires speculative execution
-involving modifications of VL to get high performance implementations.
-An additional mode (RC1=1) effectively turns what would otherwise be an
-arithmetic operation into a type of `cmp`. The CR is stored (and the
-CR.eq bit tested against the `inv` field). If the CR.eq bit is equal to
-`inv` then the Vector is truncated and the loop ends.
-
-VLi is only available as an option when `Rc=0` (or for instructions
-which do not have Rc). When set, the current element is always also
-included in the count (the new length that VL will be set to). This may
-be useful in combination with "inv" to truncate the Vector to *exclude*
-elements that fail a test, or, in the case of implementations of strncpy,
-to include the terminating zero.
-
-In CR-based data-driven fail-on-first there is only the option to select
-and test one bit of each CR (just as with branch BO). For more complex
-tests this may be insufficient. If that is the case, a vectorised crop
-such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
-and ffirst applied to the crop instead of to the arithmetic vector. Note
-that crops are covered by the [[sv/cr_ops]] Mode format.
-
-Use of Fail-on-first with Vertical-First Mode is not prohibited but is
-not really recommended. The effect of truncating VL
-may have unintended and unexpected consequences on subsequent instructions.
-VLi set will be fine: it is when VLi is clear that problems may be faced.
-
-*Programmer's note: `VLi` is only accessible in normal operations which in
-turn limits the CR field bit-testing to only `EQ/NE`. [[sv/cr_ops]] are
-not so limited. Thus it is possible to use for example `sv.cror/ff=gt/vli
-*0,*0,*0`, which is not a `nop` because it allows Fail-First Mode to
-perform a test and truncate VL.*
-
-*Hardware implementor's note: effective Sequential Program Order must
-be preserved. Speculative Execution is perfectly permitted as long as
-the speculative elements are held back from writing to register files
-(kept in Reservation Stations), until such time as the relevant CR Field
-bit(s) has been analysed. All Speculative elements sequentially beyond
-the test-failure point **MUST** be cancelled. This is no different from
-standard Out-of-Order Execution and the modification effort to efficiently
-support Data-Dependent Fail-First within a pre-existing Multi-Issue
-Out-of-Order Engine is anticipated to be minimal. In-Order systems on
-the other hand are expected, unavoidably, to be low-performance*.
-
-Two extremely important aspects of ffirst are:
-
-* LDST ffirst may never set VL equal to zero. This is because on the first
- element an exception must be raised "as normal".
-* CR-based data-dependent ffirst on the other hand **can** set VL equal
- to zero. This is the only means in the entirety of SV that VL may be set
- to zero (with the exception of via the SV.STATE SPR). When VL is set
- zero due to the first element failing the CR bit-test, all subsequent
- vectorised operations are effectively `nops` which is
- *precisely the desired and intended behaviour*.
-
-The second crucial aspect, compared to LDST Ffirst:
-
-* LD/ST Failfirst may (beyond the initial first element
- conditions) truncate VL for any architecturally suitable reason. Beyond
- the first element LD/ST Failfirst is arbitrarily speculative and 100%
- non-deterministic.
-* CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
- arbitrarily to a length decided by the hardware: VL MUST only be
- truncated based explicitly on whether a test fails. This is because it is
- a precise Deterministic test on which algorithms can and will rely.
-
-**Floating-point Exceptions**
-
-When Floating-point exceptions are enabled VL must be truncated at
-the point where the Exception appears not to have occurred. If `VLi`
-is set then VL must include the faulting element, and thus the faulting
-element will always raise its exception. If however `VLi` is clear then
-VL **excludes** the faulting element and thus the exception will **never**
-be raised.
-
-Although very strongly discouraged the Exception Mode that permits
-Floating Point Exception notification to arrive too late to unwind
-is permitted (under protest, due to it violating the otherwise 100%
-Deterministic nature of Data-dependent Fail-first) and is `UNDEFINED`
-behaviour.
-
-**Use of lax FP Exception Notification Mode could result in parallel
-computations proceeding with invalid results that have to be explicitly
-detected, whereas with the strict FP Exception Mode enabled, FFirst
-truncates VL, allowing subsequent parallel computation to avoid the
-exceptions entirely.**
-
-## Data-dependent fail-first on CR operations (crand etc)
-
-Operations that actually produce or alter CR Field as a result have
-their own SVP64 Mode, described in [[sv/cr_ops]].
-
-## pred-result mode
-
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
-
-```
- for i in range(VL):
-     # predication test, skip all masked out elements.
-     if predicate_masked_out(i):
-         continue
-     result = op(iregs[RA+i], iregs[RB+i])
-     CRnew = analyse(result) # calculates eq/lt/gt
-     # Rc=1 always stores the CR field
-     if Rc=1 or RC1:
-         CR.field[offs+i] = CRnew
-     # now test CR, similar to branch
-     if RC1 or CR.field[BO[0:1]] != BO[2]:
-         continue # test failed: cancel store
-     # result optionally stored but CR always is
-     iregs[RT+i] = result
-```
-
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out. For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`. The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
+[[!tag opf_rfc]]
--------
\newpage{}
-
-[[!tag opf_rfc]]
-