Mode is an augmentation of SV behaviour, providing additional
functionality. Some of these alterations are element-based (saturation),
-others involve post-analysis (predicate result) and others are
-Vector-based (mapreduce, fail-on-first).
+others are Vector-based (mapreduce, fail-on-first).
[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately:
the following Modes apply to Arithmetic and Logical SVP64 operations:
is performed. See [[svp64/appendix]].
Note that there are comprehensive caveats when using this mode,
and it should not be confused with the Parallel Reduction [[sv/remap]].
-* **pred-result** will test the result (CR testing selects a bit of CR
- and inverts it, just like branch conditional testing) and if the
- test fails it is as if the *destination* predicate bit was zero even
- before starting the operation. When Rc=1 the CR element however is
- still stored in the CR regfile, even if the test failed. See appendix
- for details.
+ Also care is needed with `hphint`.
Note that ffirst and reduce modes are not anticipated to be
high-performance in some implementations. ffirst due to interactions
-with VL, and reduce due to it requiring additional operations to produce
-a result. simple, saturate and pred-result are however inter-element
+with VL, and reduce due to it creating overlapping operations in
+many of its uses. simple and saturate are however inter-element
independent and may easily be parallelised to give high performance,
regardless of the value of VL.
| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
| 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
| 10 | N | dz sz | sat mode: N=0/1 u/s |
-| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
-| 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
+| 11 | / | / / | reserved |
Fields:
which would normally stop looping if the result register is scalar.
Thus, the result scalar register, if also used as a source scalar,
may be used to perform sequential accumulation. This *deliberately*
-sets up a chain of Register Hazard Dependencies, whereas Parallel Reduce
+sets up a chain of Register Hazard Dependencies
+(which advanced hardware may optimise out), whereas Parallel Reduce
[[sv/remap]] deliberately issues a Tree-Schedule of operations that may
be parallelised.
*Hardware architectural note: implementations may optimise out the Hazard
-Dependency chain as long as Sequential Program Execution Order is preserved.*
+Dependency chain as long as Sequential Program Execution Order is preserved.
+Easy examples include Reduction on Logical OR or AND operations.*
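
The sequential accumulation described above can be pictured with a minimal Python sketch. It is illustrative only: the flat `regs` list, the helper name `scalar_reduce` and the register numbers are assumptions made for the example, not part of the specification.

```python
import operator

def scalar_reduce(op, regs, rt, rb, vl):
    """Hypothetical model: scalar rt is both destination and source,
    rb is a vector source of vl elements."""
    for i in range(vl):
        # Each element reads the value the previous element wrote to rt:
        # this is the deliberate chain of Register Hazard Dependencies.
        regs[rt] = op(regs[rt], regs[rb + i])
    return regs[rt]

regs = [0] * 16
regs[4:8] = [1, 2, 3, 4]
total = scalar_reduce(operator.add, regs, rt=3, rb=4, vl=4)

# OR is associative and commutative, so reversing the element order
# gives the same answer: the kind of case hardware may optimise.
fwd = [0] * 16
fwd[4:8] = [1, 2, 4, 8]
rev = [0] * 16
rev[4:8] = [8, 4, 2, 1]
or_fwd = scalar_reduce(operator.or_, fwd, rt=3, rb=4, vl=4)
or_rev = scalar_reduce(operator.or_, rev, rt=3, rb=4, vl=4)
```

The result must always match what Sequential Program Execution Order would produce; the sketch only shows why Logical OR tolerates reordering.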
+
+**Horizontal Parallelism Hint**
+
+`SVSTATE.hphint` declares to hardware that groups of elements up to this
+size are 100% independent (free of all inter-element Hazards, but not
+inter-group Hazards). Because Reduction creates Dependency Hazards on
+every element-level sub-instruction, setting `hphint` *at all* in that
+case would cause data corruption. However `sv.add *r0, *r4, *r0`,
+for example, clearly leaves room for four parallel elements. Programmers must
+be aware of this and exercise caution.
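
As a rough way to reason about a safe group size, the sketch below computes the smallest element distance at which two element-level operations touch the same register. It assumes a simplified model in which element `i` of a three-operand instruction writes `RT+i` and reads `RA+i` and `RB+i` for vector operands, while scalar operands stay fixed. The function names and the `(base, is_vector)` operand encoding are hypothetical, and the exact semantics of `hphint` are defined by the specification, not by this model.

```python
def reg(operand, i):
    """Register number used by element i; operand is (base, is_vector)."""
    base, is_vector = operand
    return base + i if is_vector else base

def max_independent_group(rt, ra, rb, vl):
    """Smallest distance d at which elements i and i+d conflict.

    Any d consecutive elements are then pairwise hazard-free under
    this simplified model. Returns vl if no conflict is found.
    """
    for d in range(1, vl):
        for i in range(vl - d):
            j = i + d
            wi, wj = reg(rt, i), reg(rt, j)
            ri = {reg(ra, i), reg(rb, i)}
            rj = {reg(ra, j), reg(rb, j)}
            # write/write, read-after-write or write-after-read overlap
            if wi == wj or wi in rj or wj in ri:
                return d
    return vl

# Models sv.add *r0, *r4, *r0 with VL=8
overlapping = max_independent_group((0, True), (4, True), (0, True), 8)
# Scalar accumulator used as both destination and source
reduction = max_independent_group((3, False), (3, False), (4, True), 8)
# Fully disjoint vectors
disjoint = max_independent_group((0, True), (8, True), (16, True), 8)
```

Under these assumptions the overlapping case gives a group of four, matching the example above, and the scalar-accumulator case gives one, meaning no element-level parallelism may be declared.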
## Data-dependent Fail-on-first
Data-driven (CR-field-driven) fail-on-first activates when Rc=1 or other
CR-creating operation produces a result (including cmp). Similar to
-branch, an analysis of the CR is performed and if the test fails, the
+Branch-Conditional,
+an analysis of the CR is performed and if the test fails, the
vector operation terminates and discards all element operations **at and
above the current one**, and VL is truncated to either the *previous*
element or the current one, depending on whether VLi (VL "inclusive")
* LDST ffirst may never set VL equal to zero. This is because on the first
 element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
- to zero. This is the only means in the entirety of SV that VL may be set
- to zero (with the exception of via the SV.STATE SPR). When VL is set
+ to zero. When VL is set
zero due to the first element failing the CR bit-test, all subsequent
vectorised operations are effectively `nops` which is
*precisely the desired and intended behaviour*.
* CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
 arbitrarily to a length decided by the hardware: VL MUST only be
 truncated based explicitly on whether a test fails. This is because it is
- a precise Deterministic test on which algorithms can and will will rely.
+ a precise Deterministic test on which algorithms can and will rely.
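
The truncation rules above can be sketched in Python. This is a simplified model, not the normative pseudocode: `test` stands in for the CR bit-test, predication and elwidth overrides are left out, and it assumes that VLi=1 keeps the failing element inside the new VL while VLi=0 excludes it.

```python
def ddffirst(op, test, src_a, src_b, vl, vli=False):
    """Hypothetical model of CR-based data-dependent fail-on-first.

    Returns (results, new_vl). Truncation happens only when `test`
    fails, never at a length chosen arbitrarily by "hardware".
    """
    results = []
    for i in range(vl):
        result = op(src_a[i], src_b[i])
        if not test(result):
            if vli:
                # Assumption: VLi=1 includes the failing element.
                results.append(result)
                return results, i + 1
            # VLi=0: VL is truncated to the previous element.
            return results, i
        results.append(result)
    return results, vl

sub = lambda a, b: a - b
nonzero = lambda r: r != 0

excl = ddffirst(sub, nonzero, [5, 3, 1, 7], [1, 1, 1, 1], 4)
incl = ddffirst(sub, nonzero, [5, 3, 1, 7], [1, 1, 1, 1], 4, vli=True)
# First element fails with VLi=0: VL becomes zero, so later
# vectorised operations would do nothing.
zero = ddffirst(sub, nonzero, [1, 3, 1, 7], [1, 1, 1, 1], 4)
```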
**Floating-point Exceptions**
Operations that actually produce or alter CR Field as a result have
their own SVP64 Mode, described in [[sv/cr_ops]].
-## pred-result mode
-
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the pseudocode for [[sv/cr_ops]] is slightly
-different.
-
-```
- for i in range(VL):
- # predication test, skip all masked out elements.
- if predicate_masked_out(i):
- continue
- result = op(iregs[RA+i], iregs[RB+i])
- CRnew = analyse(result) # calculates eq/lt/gt
- # Rc=1 always stores the CR field
- if Rc=1 or RC1:
- CR.field[offs+i] = CRnew
- # now test CR, similar to branch
- if RC1 or CRnew[BO[0:1]] != BO[2]:
- continue # test failed: cancel store
- # result optionally stored but CR always is
- iregs[RT+i] = result
-```
-
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out. For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`. The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
-
[[!tag standards]]
--------