Mode is an augmentation of SV behaviour, providing additional
functionality. Some of these alterations are element-based (saturation),
-others involve post-analysis (predicate result) and others are
-Vector-based (mapreduce, fail-on-first).
+others are Vector-based (mapreduce, fail-on-first).
[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately:
the following Modes apply to Arithmetic and Logical SVP64 operations:
is performed. See [[svp64/appendix]].
Note that there are comprehensive caveats when using this mode,
and it should not be confused with the Parallel Reduction [[sv/remap]].
-* **pred-result** will test the result (CR testing selects a bit of CR
- and inverts it, just like branch conditional testing) and if the
- test fails it is as if the *destination* predicate bit was zero even
- before starting the operation. When Rc=1 the CR element however is
- still stored in the CR regfile, even if the test failed. See appendix
- for details.
+ Care is also needed with `hphint`.
Note that ffirst and reduce modes are not anticipated to be
high-performance in some implementations. ffirst due to interactions
-with VL, and reduce due to it requiring additional operations to produce
-a result. simple, saturate and pred-result are however inter-element
+with VL, and reduce due to it creating overlapping operations in
+many of its uses. simple and saturate are however inter-element
independent and may easily be parallelised to give high performance,
regardless of the value of VL.
-The Mode table for Arithmetic and Logical operations is laid out as
+The Mode table for Arithmetic and Logical operations,
+being bits 19-23 of SVP64 `RM`, is laid out as
follows:
| 0-1 | 2 | 3 4 | description |
| --- | --- | --- | --- |
| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
| 01 | inv | VLi RC1 | Rc=0: ffirst z/nonz |
| 10 | N | dz sz | sat mode: N=0/1 u/s |
-| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
-| 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
+| 11 | / | / / | reserved |
Fields:
-* **sz / dz** if predication is enabled will put zeros into the dest
+* **sz / dz** source-zeroing, destination-zeroing.
+ if predication is enabled will put zeros into the dest
(or as src in the case of twin pred) when the predicate bit is zero.
Otherwise the element is ignored or skipped, depending on context.
* **zz**: both sz and dz are set equal to this flag
circumstances. Details are in the [[svp64/appendix]]
Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
-As explained in the [[sv/appendix]] Reduce Mode switches off the check
+As explained in the [[sv/svp64/appendix]] Reduce Mode switches off the check
which would normally stop looping if the result register is scalar.
Thus, the result scalar register, if also used as a source scalar,
may be used to perform sequential accumulation. This *deliberately*
-sets up a chain of Register Hazard Dependencies, whereas Parallel Reduce
+sets up a chain of Register Hazard Dependencies
+(which advanced hardware may optimise out), whereas Parallel Reduce
[[sv/remap]] deliberately issues a Tree-Schedule of operations that may
be parallelised.
+*Hardware architectural note: implementations may optimise out the Hazard
+Dependency chain as long as Sequential Program Execution Order is preserved.
+Easy examples include Reduction on Logical OR or AND operations.*
+
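The sequential accumulation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the specification's pseudocode: the register file is modelled as a plain list, and the name `reduce_mode_add` and its parameters are hypothetical. It assumes an add whose destination and one source are the same scalar register, with the other source a vector.

```python
def reduce_mode_add(regs, rt, ra, vl):
    """Accumulate regs[ra..ra+vl-1] into the scalar regs[rt] (illustrative model)."""
    for i in range(vl):
        # The scalar result is both a source and the destination, so each
        # element-level operation depends on the one before it: this is the
        # chain of Register Hazard Dependencies.
        regs[rt] = regs[rt] + regs[ra + i]
    return regs[rt]


# Hypothetical register file: r4..r7 hold the vector, r0 is the accumulator.
regs = [0] * 16
regs[4:8] = [1, 2, 3, 4]
total = reduce_mode_add(regs, rt=0, ra=4, vl=4)
```

The loop does not stop after the first element even though the destination is scalar, which is the check that Reduce Mode switches off.
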
+**Horizontal Parallelism Hint**
+
+`SVSTATE.hphint` declares to hardware that groups of elements up to this
+size are 100% independent (free of all inter-element Hazards within a group,
+but not between groups).
+Because Reduction creates a Dependency Hazard on every element-level
+sub-instruction, setting `hphint` *at all* in that case would cause data
+corruption. However `sv.add *r0, *r4, *r0`,
+for example, leaves room for four parallel elements. Programmers must
+be aware of this and exercise caution.
+
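The "four parallel elements" figure can be checked with a small hazard model. The sketch below is an assumption-laden illustration rather than a description of any hardware: the helper names are hypothetical, groups are assumed to be aligned runs of consecutive elements, and the search stops conservatively at the first group size that shows a hazard.

```python
def element_regs(i, rt, ra, rb, rt_vec=True, ra_vec=True, rb_vec=True):
    """Return (reads, writes) register-number sets for element i."""
    dest = rt + i if rt_vec else rt
    srcs = {ra + i if ra_vec else ra, rb + i if rb_vec else rb}
    return srcs, {dest}


def has_hazard(a, b):
    """True if two elements conflict (read-after-write, write-after-read or write-after-write)."""
    reads_a, writes_a = a
    reads_b, writes_b = b
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)


def max_independent_group(vl, **operands):
    """Largest aligned group size (assumed grouping) with no hazard inside any group."""
    elems = [element_regs(i, **operands) for i in range(vl)]
    best = 1
    for size in range(2, vl + 1):
        for start in range(0, vl, size):
            group = elems[start:start + size]
            for x in range(len(group)):
                for y in range(x + 1, len(group)):
                    if has_hazard(group[x], group[y]):
                        return best  # conservative: stop at the first failing size
        best = size
    return best


# sv.add *r0, *r4, *r0 with VL=8: element 4 writes r4, which element 0 reads.
vector_case = max_independent_group(8, rt=0, ra=4, rb=0)
# Scalar accumulator (Reduction-style): every element reads and writes r0.
reduce_case = max_independent_group(8, rt=0, ra=4, rb=0, rt_vec=False, rb_vec=False)
```

Under these assumptions `vector_case` is 4 and `reduce_case` is 1, matching the statement that Reduction leaves no room for a parallelism hint while the all-vector form leaves room for four elements.
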
## Data-dependent Fail-on-first
Data-dependent fail-on-first is CR-field-driven and is completely separate
Data-driven (CR-field-driven) fail-on-first activates when an Rc=1 or other
CR-creating operation (including cmp) produces a result. Similar to
-branch, an analysis of the CR is performed and if the test fails, the
+Branch-Conditional,
+an analysis of the CR is performed and if the test fails, the
vector operation terminates and discards all element operations **at and
above the current one**, and VL is truncated to either the *previous*
element or the current one, depending on whether VLi (VL "inclusive")
* LDST ffirst may never set VL equal to zero. This is because on the first
element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
- to zero. This is the only means in the entirety of SV that VL may be set
- to zero (with the exception of via the SV.STATE SPR). When VL is set
+ to zero. When VL is set
zero due to the first element failing the CR bit-test, all subsequent
vectorised operations are effectively `nops` which is
*precisely the desired and intended behaviour*.
* CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails. This is because it is
- a precise Deterministic test on which algorithms can and will will rely.
+ a precise Deterministic test on which algorithms can and will rely.
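
The truncation rules above can be sketched as a small Python model. It is a hedged illustration, not the specification's pseudocode: the function name `ddffirst_add`, the `passes` callback standing in for the CR bit-test, and the choice to keep the failing element's result when VLi is set are all assumptions made for the example.

```python
def ddffirst_add(ra, rb, vl, passes, vli=False):
    """Element-wise add that truncates VL at the first failing test (illustrative)."""
    rt = [None] * vl  # None marks an element that was never written
    for i in range(vl):
        result = ra[i] + rb[i]
        if not passes(result):
            if vli:
                # VL "inclusive": assumed here to keep the failing element
                rt[i] = result
                return i + 1, rt
            # otherwise VL covers only the elements before the failing one
            return i, rt
        rt[i] = result
    return vl, rt


def nonzero(x):
    """Hypothetical stand-in for a CR test: passes when the result is non-zero."""
    return x != 0


new_vl, out = ddffirst_add([1, 2, 3, 4], [1, -2, 5, 6], 4, nonzero)
new_vl_incl, out_incl = ddffirst_add([1, 2, 3, 4], [1, -2, 5, 6], 4, nonzero, vli=True)
zero_vl, _ = ddffirst_add([1], [-1], 1, nonzero)
```

In this model VL depends only on where the test first fails, never on a hardware-chosen length, and a failure on the very first element yields a VL of zero.
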
**Floating-point Exceptions**
Operations that actually produce or alter CR Field as a result have
their own SVP64 Mode, described in [[sv/cr_ops]].
-## pred-result mode
-
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the pseudocode for [[sv/cr_ops]] is slightly
-different.
-
-```
- for i in range(VL):
- # predication test, skip all masked out elements.
- if predicate_masked_out(i):
- continue
- result = op(iregs[RA+i], iregs[RB+i])
- CRnew = analyse(result) # calculates eq/lt/gt
- # Rc=1 always stores the CR field
- if Rc=1 or RC1:
- CR.field[offs+i] = CRnew
- # now test CR, similar to branch
- if RC1 or CRnew[BO[0:1]] != BO[2]:
- continue # test failed: cancel store
- # result optionally stored but CR always is
- iregs[RT+i] = result
-```
-
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out. For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`. The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
-
[[!tag standards]]
--------