interoperability expectations within certain environments. Details in
the [[svp64/appendix]].
-## Strict Program Order
-
-Many Vector ISAs allow interrupts to occur in the middle of
-processing of large Vector operations, only under the condition
-that continuation on return will restart the entire operation.
-The reason is that saving of full Architectural State is
-not practical.
-
-Simple-V operates on an entirely different paradigm from traditional
-Vector ISAs: as a Sub-Program Counter where "Elements" are synonymous
-with Scalar instructions. With this in mind it is critical for
-implementations to observe Strict Element-Level Program Order
-at all times. *Any* element is Interruptible and Simple-V has
-been carefully designed to ensure that Architectural State is
-fully preserved.
-
-Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
-but the full SVP64 Architectural State may be saved and
-restored through manual copying of `SVSTATE` and the four
-REMAP SPRs. Whilst this initially sounds unsafe in reality
-all rhat Trap Handlers (and function call stack save/restore)
-need do is avoid
-use of SVP64 Prefixed instructions to perform the necessary
-save/restore of Simple-V Architectural State.
-This capability also allows nested function calls to be made from
-inside Vector loops, which is very rare for Vector ISAs.
-
-Strict Program Order is also preserved by the Parallel Reduction
-REMAP Schedule, but only at the cost of requiring the destination
-Vector to be used (Deterministically) to store partial progress of the
-Parallel Reduction Schedule.
-
-The only major caveat for REMAP is that
-after an explicit change to
-Architectural State caused by writing to the
-Simple-V SPRs, some implementations may find
-it easier to take longer to calculate where in a given Schedule
-the re-mapping Indices were. Obvious examples include Interrupts occuring
-in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
-for example), which
-will force implementations to perform divide and modulo
-calculations.
-
## SVP64 encoding features
A number of features need to be compacted into a very small space of
* Predication on both source and destination
* Two different sources of predication: INT and CR Fields
* SV Modes including saturation (for Audio, Video and DSP), mapreduce,
- fail-first and predicate-result mode.
+ and fail-first mode.
Different classes of operations require different formats. The earlier
sections cover the common formats and the four separate modes follow:
Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
adding "UnVectorised" to this phase is not unreasonable.*
+## Definition of Strict Program Order
+
+Strict Program Order is defined as giving the appearance, as far
+as programs are concerned, that instructions were executed
+strictly in the sequence that they occurred. A "Precise"
+out-of-order
+Micro-architecture goes to considerable lengths to ensure that
+this is the case.
+
+Many Vector ISAs allow interrupts to occur in the middle of
+processing of large Vector operations, only under the condition
+that partial results are cleanly discarded, and continuation on return
+from the Trap Handler will restart the entire operation.
+The reason is that saving of full Architectural State is
+not practical.
+
+Simple-V operates on an entirely different paradigm from traditional
+Vector ISAs: as a Sub-Program Counter where "Elements" are synonymous
+with Scalar instructions. With this in mind it is critical for
+implementations to observe Strict Element-Level Program Order
+at all times
+(often referred to simply as "Strict Program Order"
+throughout
+this Chapter).
+*Any* element is Interruptible, and Simple-V has
+been carefully designed to guarantee that Architectural State may
+be fully preserved and restored whatever that State happens to be.
+However, recovery is not guaranteed to be low-latency
+(particularly if REMAP
+is active).
+
+Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
+but the full SVP64 Architectural State may be saved and
+restored through manual copying of `SVSTATE` (and the four
+REMAP SPRs if in use at the time).
+Whilst this initially sounds unsafe, in reality
+all that Trap Handlers (and function call stack save/restore)
+need do is avoid
+use of SVP64 Prefixed instructions to perform the necessary
+save/restore of Simple-V Architectural State.
+This capability also allows nested function calls to be made from
+inside Vertical-First Vector loops, which is very rare for Vector ISAs.
+
+Strict Program Order is also preserved by the Parallel Reduction
+REMAP Schedule, but only at the cost of requiring the destination
+Vector to be used (Deterministically) to store partial progress of the
+Parallel Reduction.
+
+The only major caveat for REMAP is that
+after an explicit change to
+Architectural State caused by writing to the
+Simple-V SPRs, some implementations may find
+it easier to take longer to calculate where in a given Schedule
+the re-mapping Indices were. Obvious examples include Interrupts occurring
+in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
+for example), which
+will force implementations to perform divide and modulo
+calculations.
+
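The divide and modulo cost mentioned above can be illustrated with a minimal Python sketch. It assumes a plain `i`-outer, `j`-middle, `k`-inner loop nest for the 5x3 by 3x3 case; the real REMAP Matrix Schedule ordering may differ, and the names `matmul_indices` and `walk` are hypothetical, introduced only for illustration.

```python
def matmul_indices(step, rows=5, cols=3, inner=3):
    """Recover (i, j, k) from a flat step count.

    Assumes an i-outer, j-middle, k-inner loop nest (an assumption
    for illustration, not the specified REMAP ordering).
    """
    if not 0 <= step < rows * cols * inner:
        raise ValueError("step outside the schedule")
    rest, k = divmod(step, inner)   # non-power-of-two: true divide/modulo
    i, j = divmod(rest, cols)
    return i, j, k


def walk(rows=5, cols=3, inner=3):
    """Reference loop nest that the flat step count is assumed to follow."""
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield i, j, k
```

Resuming at step 22, for instance, requires two `divmod` operations to find the indices. When every dimension is a power of two, the same recovery reduces to shifts and masks.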
+An additional caveat involves Condition Register Fields
+when also used as Predicate Masks. An operation that
+overwrites the same CR Fields that are simultaneously
+being used as a Predicate Mask is `UNDEFINED` behaviour
+if the overwritten CR field element was needed by a
+subsequent Element for its Predicate Mask bit.
+This allows implementations to relax some of the
+draconian Register Hazards that would otherwise
+occur, and to consider internal caching of the CR-based
+Predicate
+bits. However, some implementations *may not necessarily
+perform pre-reading*, and consequently the risk of
+overwrite is the responsibility of the Programmer.
+Special care is particularly needed here when using REMAP.
+
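Why overwriting a CR-based Predicate Mask has to be `UNDEFINED` can be shown with a toy Python model. It is a sketch under stated assumptions, not a model of any real implementation: the function `run`, the list representation of predicate bits, and the "element i clears the bit of element i+1" operation are all hypothetical.

```python
def run(cr_bits, pre_read):
    """Toy element loop in which each enabled element clears the
    predicate bit belonging to the next element.

    pre_read=True  models an implementation that snapshots the mask first.
    pre_read=False models one that reads each bit just before its element.
    """
    cr = list(cr_bits)
    mask = list(cr) if pre_read else cr   # snapshot versus live view
    executed = []
    for i in range(len(cr)):
        if mask[i]:
            executed.append(i)
            if i + 1 < len(cr):
                cr[i + 1] = 0   # overwrites a not-yet-used predicate bit
    return executed
```

With an all-ones mask of four elements, the pre-reading model executes every element, whereas the lazily reading model executes only elements 0 and 2. Since both strategies are permitted, the program cannot rely on either outcome.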
## Register files, elements, and Element-width Overrides
The relationship between register files, elements, and element-width
sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127
sv.mfcr cr5, *cr40 # only one source (CR40) copied to CR5
sv.mfcr *cr16, cr40 # Vector-Splat CR40 onto CR16,17,18...
+ sv.mfcr *cr16, cr3 # Vector-Splat CR3 onto CR16,17,18...
```
Examples of prohibited instructions:
sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
this may be considered to be elements 0b00 to 0b01 inclusive.
+Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+element operation issued, SUBVL element operations are issued (as an inner loop).
+The key difference between VL looping and SUBVL looping
+is that predication bits are applied per
+**group**, rather than by individual element.
+
+Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
+
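The difference between VL looping and SUBVL looping can be sketched in Python as follows. This is a minimal illustration only: `element_ops` is a hypothetical helper, the predicate is assumed to be a plain integer bitmask with one bit per group, and the effect of the `pack` and `unpack` bits is not modelled.

```python
def element_ops(vl, subvl, predicate):
    """List the (group, sub-element) operations that would be issued.

    One predicate bit gates a whole SUBVL-sized group, rather than
    each sub-element individually.
    """
    issued = []
    for el in range(vl):                  # outer VL loop
        if not (predicate >> el) & 1:     # one predicate bit per group
            continue
        for sub in range(subvl):          # inner SUBVL loop
            issued.append((el, sub))
    return issued
```

With `vl=4`, `subvl=2` and a predicate of `0b1010`, groups 1 and 3 are issued in full (two sub-element operations each) and groups 0 and 2 are skipped entirely.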
## MASK/MASK_SRC & MASKMODE Encoding
One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
Likewise CR based twin predication has a second set of 3 bits, allowing
a different test to be applied.
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed. In practice (for CR Fields)
-this creates an unnecessary block on parallelism. Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop. Doing so results
+Note that it cannot necessarily be assumed that Predicate Masks
+(whether INT or CR) are read in full *before* the operations proceed. In practice (for CR Fields)
+reading them in full beforehand creates an unnecessary block on parallelism, prohibiting
+"Vector Chaining". Therefore, it is up
+to the programmer to ensure that the CR field Elements used as Predicate Masks
+are not overwritten by any parallel Vector Loop. Overwriting them results
in **UNDEFINED** behaviour, according to the definition outlined in the
Power ISA v3.0B Specification.
needs to take place, safe in the knowledge that no programmer will have
issued a Vector Instruction where previous elements could have overwritten
(destroyed) not-yet-executed CR-Predicated element operations.
+This is particularly an issue when using REMAP, as the order in
+which CR-Field-based Predicate Mask bits could be read on a per-element
+execution basis could well conflict with the order in which prior
+elements wrote to the very same CR Field.
+
+Additionally, Programmers should avoid using r3, r10 or r30
+as destination registers when these are also used as a Predicate
+Mask. Doing so is again UNDEFINED behaviour.
### Integer Predication (MASKMODE=0)
r10 and r30 are at the high end of temporary and unused registers,
so as not to interfere with register allocation from ABIs.
+
### CR-based Predication (MASKMODE=1)
When the predicate mode bit is one the 3 bits are interpreted as below.