Vector systems are expected to be high performance. This is achieved
through parallelism, which requires that elements in the vector be
-independent. XER SO and other global "accumulation" flags (CR.OV) cause
+independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
Read-Write Hazards on single-bit global resources, having a significant
detrimental effect.
-Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
+Consequently in SV, XER.SO and OV behaviour is disregarded (including
in `cmp` instructions). XER is simply neither read nor written.
This includes when `scalar identity behaviour` occurs. If precise
OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
instructions should be used without an SV Prefix.
+Of note here is that XER.SO and OV may already be disregarded in the
+Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
+SVP64 simply makes disregarding them mandatory in all other Subsets
+as well, but only for SVP64 Prefixed Operations.
+
An interesting side-effect of this decision is that the OE flag is now
-free for other uses when SV Prefixing is used.
+free for other uses when SV Prefixing is used, and CR.SO may likewise
+be used for other purposes (saturation for example).
XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
executing in precise Program Order, a little thought shows that a Vectorised
-Carry-In-Out add is in effect a Big Integer Add, taking a single bit CarryIn
+Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
Parallel Carry Lookahead.
- sv.
+ # assume VL=4, this results in 4 sequential ops (below)
+ sv.adde r0.v, r4.v, r8.v
+
+ # instructions that get executed in backend hardware:
+ adde r0, r4, r8 # takes carry-in, produces carry-out
+ adde r1, r5, r9 # takes carry from previous
+ ...
+ adde r3, r7, r11 # likewise
+
+The carry chains from one 64-bit add to the next, so the
+four element operations together perform a single 256-bit
+"Big Integer Add", with CA holding the 257th bit of the
+result once the last element has executed.
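
The carry chaining described above can be checked with a small behavioural model. This is only a sketch in Python, not taken from the specification: the helper names (`adde`, `sv_adde`, `to_limbs`, `from_limbs`) and the flat `regs` list are hypothetical, and element 0 is assumed to hold the least significant 64 bits, which matches the order in which the element operations execute.

```python
MASK64 = (1 << 64) - 1


def adde(ra, rb, ca):
    """Model of one scalar adde: 64-bit add with carry-in, returns (sum, carry-out)."""
    total = ra + rb + ca
    return total & MASK64, total >> 64


def sv_adde(regs, rt, ra, rb, vl, ca=0):
    """Model of sv.adde as a for-loop of scalar adde in Program Order.

    regs is a hypothetical flat register file (list of 64-bit ints);
    rt, ra, rb are starting register numbers. Returns the final CA.
    """
    for i in range(vl):
        regs[rt + i], ca = adde(regs[ra + i], regs[rb + i], ca)
    return ca


def to_limbs(value, vl):
    """Split an integer into vl 64-bit limbs, least significant first."""
    return [(value >> (64 * i)) & MASK64 for i in range(vl)]


def from_limbs(limbs):
    """Recombine 64-bit limbs (least significant first) into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


# VL=4: r4..r7 and r8..r11 hold the two 256-bit operands, r0..r3 the result.
VL = 4
a = (1 << 256) - 1          # all ones: forces a carry through every limb
b = 1
regs = [0] * 32
regs[4:4 + VL] = to_limbs(a, VL)
regs[8:8 + VL] = to_limbs(b, VL)

ca_out = sv_adde(regs, rt=0, ra=4, rb=8, vl=VL, ca=0)
result = from_limbs(regs[0:VL])

# The 4 limbs plus CA together form the full 257-bit sum.
assert result + (ca_out << 256) == a + b
```

The loop makes the serial dependency explicit: each element needs the carry-out of the previous one. A hardware implementation is free to compute the same result with Parallel Carry Lookahead, provided the final registers and CA match this sequential model.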
# v3.0B/v3.1 relevant instructions