**Arithmetic**
Arithmetic (also known as "normal" mode) is where Scalar and Parallel
Reduction can be done, as well as Saturation and a new, innovative
mode for Vector ISAs: data-dependent fail-first.
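To make data-dependent fail-first more concrete, here is a simplified Python model. It is a sketch under stated assumptions, not the specification: `ddffirst` is a hypothetical helper, the per-element `test` stands in for the test of each element's Rc=1 CR Field co-result, and the finer options (whether the failing element itself is kept, predication, zeroing) are deliberately not modelled.

```python
from typing import Callable, List, Tuple


def ddffirst(op: Callable[[int], int], src: List[int],
             test: Callable[[int], bool]) -> Tuple[int, List[int]]:
    """Hypothetical model of data-dependent fail-first.

    Each element result is tested (standing in for the Rc=1 CR Field test).
    At the first element whose test fails, VL is truncated to that index
    and processing stops. Returns (new VL, results within the new VL).
    """
    results: List[int] = []
    for i, x in enumerate(src):
        r = op(x)
        if not test(r):
            return i, results      # element i and everything after it excluded
        results.append(r)
    return len(src), results       # no failure: VL unchanged


# strlen-style scan: stop at the first zero byte.
vl, out = ddffirst(lambda b: b, [0x68, 0x69, 0x00, 0x21], lambda r: r != 0)
print(vl, out)  # 2 [104, 105]
```

In this model the truncated VL is what makes the mode "data-dependent": subsequent instructions only process the valid prefix, without a separate scalar loop to find its length.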
Reduction and Saturation are commonly found in Vector ISAs: it is just
that they are usually added as explicit instructions,
and NEC SX Aurora has even more iterative instructions. In SVP64 these
override field bits can be used for other purposes when Vectorising
CR Field instructions. Moreover, Rc=1 is completely invalid for
CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense for CR operations either.
All of these differences, which require quite a lot of logical
reasoning and deduction, help explain why there is an entirely different
CR ops Vectorisation Category.
BF16 or FP16 operations: there does not exist a BF8 or an IEEE754 FP8
format, so these (`sv.fadds/ew=8`) should be avoided.
# Word frequently becomes "half"

Again, related to "Single" becoming "half of element width": unless there
are compelling reasons otherwise, the same trick applies to Scalar GPR operations.
Where the pseudocode uses "XLEN//2", if XLEN=8 the operation
becomes a 4-bit one.

Similarly, byte operations which use "XLEN//8" actually become
single-bit operations when XLEN=8, which is very useful with `sv.extsb/w=8`,
for example. This instruction copies the LSB of each byte in a sequence of bytes
and expands it to all 8 bits of each result byte.
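A minimal Python sketch of the effect described above, assuming `extsw` and `extsb` are among the instructions whose pseudocode is written in terms of "XLEN//2" and "XLEN//8" respectively. The helper names (`exts`, `extsw`, `extsb`) are hypothetical and model a single element only; with an element-width override, each element would be processed independently with XLEN set to the element width.

```python
def exts(value: int, xlen: int, srcbits: int) -> int:
    """Sign-extend the low `srcbits` bits of `value` to an `xlen`-bit result."""
    mask = (1 << srcbits) - 1
    v = value & mask
    if (v >> (srcbits - 1)) & 1:             # sign bit of the source field
        v |= ((1 << xlen) - 1) & ~mask       # replicate it into the upper bits
    return v


def extsw(rs: int, xlen: int = 64) -> int:
    """The "word" is XLEN//2 bits: 32 at XLEN=64, only 4 at XLEN=8."""
    return exts(rs, xlen, xlen // 2)


def extsb(rs: int, xlen: int = 64) -> int:
    """The "byte" is XLEN//8 bits: 8 at XLEN=64, a single bit at XLEN=8."""
    return exts(rs, xlen, xlen // 8)


# Scalar behaviour at XLEN=64 is unchanged.
print(hex(extsw(0x80000000)))             # 0xffffffff80000000
# At XLEN=8 (elwidth=8), extsw sign-extends the low nibble.
print(hex(extsw(0x3A, xlen=8)))           # 0xfa
# sv.extsb/w=8 over a sequence of bytes: each LSB expanded to all 8 bits.
data = bytes([0x01, 0x02, 0x03, 0xFE])
print(bytes(extsb(b, xlen=8) for b in data).hex())   # ff00ff00
```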
# Vertical-First and Subvectors
Documented in the [[sv/setvl]] page, Vertical-First goes through
* LD/ST Immediate has no individual control over src/dest zeroing,
whereas LD/ST Indexed does.
* Post-Increment is not possible with Saturation or Data-Dependent Fail-First.
* Element-Strided LD/ST Indexed is not possible with Data-Dependent Fail-First.
Also, the LD/ST Indexed Mode can be element-strided (RB as
a Scalar, times
Simple-V is powerful, but it cannot do everything! There is simply not
enough encoding space, so some compromises had to be made.

# sv.mtcr on entire 64-bit Condition Register

Normally, CR operations are either bit-based (where the element numbering actually
applies to the CR Field) or field-based, in which case the elements are still
fields. The `sv.mtcr` and related instructions, however, are full 64-bit Condition
*Register* operations and are therefore classified as Normal/Arithmetic rather than
CR ops.

This saves on Vector Length (a VL of 16 is sufficient, because each 32-bit
Condition Register holds 8 CR Fields and 128/8 = 16) and also reduces
complexity in Hazard Management when context-switching CR Fields: the
entire batch of 128 CR Fields may be transferred to 8 GPRs with a VL of 16
and an elwidth override of 32. Truncation is sufficient, dropping the top 32 bits
of each Condition Register, which are always zero anyway.
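The arithmetic behind "128 CR Fields in 8 GPRs with VL=16" can be checked with a short Python sketch: 128 fields × 4 bits = 512 bits, 16 Condition Registers × 32 bits = 512 bits, and 8 GPRs × 64 bits = 512 bits. The helper names are hypothetical and the bit ordering is an assumption: the lowest-numbered field in each Condition Register is placed in the most significant nibble of its 32 bits (as CR0 is in the Power ISA), and 32-bit elements are packed into each 64-bit GPR starting from the least significant half.

```python
from typing import List


def pack_cr_fields(fields: List[int]) -> List[int]:
    """Group 128 4-bit CR Fields into 16 32-bit Condition Register values (VL=16)."""
    assert len(fields) == 128
    crs = []
    for k in range(16):                        # one element per Condition Register
        word = 0
        for f in fields[8 * k: 8 * k + 8]:     # 8 fields per 32-bit CR
            word = (word << 4) | (f & 0xF)     # lowest-numbered field ends up in the MS nibble
        crs.append(word)                       # top 32 bits of the 64-bit CR are zero: not stored
    return crs


def pack_elements_into_gprs(elems: List[int], elwidth: int = 32) -> List[int]:
    """Pack elwidth-bit elements into 64-bit GPRs, element 0 in the low bits (assumed)."""
    per_reg = 64 // elwidth
    gprs = []
    for r in range(0, len(elems), per_reg):
        value = 0
        for i, e in enumerate(elems[r: r + per_reg]):
            value |= (e & ((1 << elwidth) - 1)) << (i * elwidth)
        gprs.append(value)
    return gprs


fields = [i % 16 for i in range(128)]          # arbitrary test pattern
gprs = pack_elements_into_gprs(pack_cr_fields(fields))
print(len(gprs), hex(gprs[0]))                 # 8 0x89abcdef01234567
```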
# Separate Scalar and Vector Condition Register files

As explained in the introduction [[sv/svp64]] and in [[sv/cr_ops]],
Scalar Power ISA lacks the "Conditional Execution" that has been present in the ARM
Scalar ISA for several decades. When Vectorised, the fact that
Rc=1 Vector results can immediately be used as a Predicate Mask
by the following instruction can result in large latency
unless "Vector Chaining" is used in the Micro-Architecture.
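The dependency described above can be illustrated with a small Python sketch, assuming a simplified model in which an Rc=1 operation produces one CR Field per element (LT, GT, EQ, SO, comparing the result against zero) and the next instruction is predicated on the EQ bit of those fields. The names `CRField`, `rc1_compare` and `predicated_add` are hypothetical; zeroing and the copy of XER SO are not modelled.

```python
from typing import List, NamedTuple


class CRField(NamedTuple):
    lt: bool
    gt: bool
    eq: bool
    so: bool


def rc1_compare(vals: List[int]) -> List[CRField]:
    """Rc=1 style co-result: one CR Field per element, result compared against zero."""
    return [CRField(v < 0, v > 0, v == 0, False) for v in vals]


def predicated_add(a: List[int], b: List[int], mask: List[bool]) -> List[int]:
    """Elements whose mask bit is clear are left unchanged."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]


a = [3, 0, -2, 0]
crs = rc1_compare(a)
# Element i of predicated_add cannot start until crs[i] exists: without
# Vector Chaining it would wait for the entire vector of CR Fields.
mask = [cr.eq for cr in crs]
print(predicated_add(a, [10, 10, 10, 10], mask))  # [3, 10, -2, 10]
```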
That aside, however, is not the main problem faced when introducing
Simple-V to the Power ISA: the main problem is that the existing implementations
(IBM) do not have "Conditional Execution", and adding it to their
existing designs would be too disruptive a first step.

A compromise is to blank out certain entries in the Register Dependency
Matrices by prohibiting some operations that involve both groups
of CR Fields: those that fall into the existing Scalar 32-bit CR
(fields CR0-CR7) and those that fall into the newly-introduced
CR Fields, CR8-CR127.
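One way to picture the compromise is as a decode-time check. The sketch below is hypothetical and deliberately coarse: it assumes the prohibited operations are those whose CR Field operands straddle the two groups, which is an illustrative assumption rather than the exact rule. Under that assumption, any instruction passing the check touches only one CR register file, so its entries in the other file's Dependency Matrix can stay blank.

```python
from typing import Iterable

SCALAR_CR_FIELDS = range(0, 8)     # existing Scalar 32-bit CR: CR0-CR7
VECTOR_CR_FIELDS = range(8, 128)   # newly-introduced CR8-CR127


def crosses_cr_groups(cr_fields: Iterable[int]) -> bool:
    """Hypothetical check: True if the operands use both Scalar and Vector CR Fields."""
    fields = list(cr_fields)       # allow any iterable to be scanned twice
    uses_scalar = any(f in SCALAR_CR_FIELDS for f in fields)
    uses_vector = any(f in VECTOR_CR_FIELDS for f in fields)
    return uses_scalar and uses_vector


print(crosses_cr_groups([0, 1, 7]))    # False: Scalar CR only
print(crosses_cr_groups([8, 64]))      # False: Vector CR only
print(crosses_cr_groups([0, 8]))       # True: prohibited under this assumption
```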
This will drive compiler writers nuts and give assembler writers headaches,
but it gives IBM the opportunity to implement SVP64 without massive
disruption. They can add an entirely new Vector CR register file,
new pipelines etc., safe in the knowledge that the existing Scalar HDL
needs no modification.