+in the decode phase was too great. The lesson was learned, the
+hard way: it would be infinitely preferable
+to add a 32-bit Scalar Load-with-Shift
+instruction *first*, which then inherently becomes Vectorised.
+Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
+both ARM and x86 have it, because it saves greatly on instruction count in
+hot loops.
+
+The other reason for not adding an SVP64-Prefixed instruction without
+also having it as a Scalar un-prefixed instruction is that if the
+32-bit encoding is ever allocated in a future revision
+of the Power ISA
+to a completely unrelated operation
+then how could a Vectorised version of that new instruction ever be added?
+The uniformity and RISC Abstraction would be irreparably damaged.
+The bottom line is that the fundamental RISC Principle is strictly adhered
+to, even though these are Advanced 64-bit Vector instructions.
+Advocates of the RISC Principle will appreciate the uniformity of
+SVP64 and the level of systematic abstraction kept between Prefix and Suffix.
+
+# Instruction Groups
+
+The basic principle of SVP64 is the prefix, which contains mode
+as well as register augmentation and predicates. When thinking of
+instructions and Vectorising them, it is natural for arithmetic
+operations (ADD, OR) to be the first to spring to mind.
+Arithmetic instructions have registers, therefore augmentation
+applies, end of story, right?
+
+Except, Load and Store also deal with Memory, not just registers.
+Power ISA has Condition Register Fields: how can element widths
+apply there? And branches: how can you have Saturation on something
+that does not return an arithmetic result? In short: there are actually
+four different categories (five including those for which Vectorisation
+makes no sense at all, such as `sc` or `mtmsr`). The categories are:
+
+* arithmetic/logical including floating-point
+* Load/Store
+* Condition Register Field operations
+* branch
+
+**Arithmetic**
+
+Arithmetic (known as "normal" mode) is where Scalar and Parallel
+Reduction can be done: Saturation as well, and two new innovative
+modes for Vector ISAs: data-dependent fail-first and predicate result.
+Reduction and Saturation are commonly found in Vector ISAs: it is just
+that they are usually added as explicit instructions,
+and NEC SX Aurora has even more iterative instructions. In SVP64 these
+concepts are applied in the abstract general form, which takes some
+getting used to.
+
+Reduction, when applied incorrectly to non-commutative
+instructions, may produce invalid results, but ultimately
+it is critical to think in terms of the "rules": everything is
+Scalar instructions in strict Program Order. Reduction on non-commutative
+Scalar Operations is not *prohibited*: the strict Program Order allows
+the programmer to think through what would happen and thus potentially
+come up with legitimate uses.
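That rule can be sketched as follows. This is an illustrative model, not the SVP64 specification pseudocode, and the function name is hypothetical:

```python
# Illustrative sketch (not the SVP64 spec pseudocode): a Reduction is
# just a sequence of Scalar operations issued in strict Program Order,
# so even a non-commutative operation such as subtraction has exactly
# one well-defined, predictable result.
def reduce_in_program_order(op, vec):
    acc = vec[0]
    for element in vec[1:]:    # element 1, then 2, then 3: Program Order
        acc = op(acc, element) # each step is an ordinary Scalar op
    return acc

# subtraction is non-commutative, yet the outcome is fully determined:
# ((10 - 3) - 2) - 1 = 4
print(reduce_in_program_order(lambda a, b: a - b, [10, 3, 2, 1]))
```

Because each step is defined to behave as a Scalar instruction in Program Order, the programmer can reason through exactly what a non-commutative reduction will do.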
+
+**Branches**
+
+Branch is the one and only place where the Scalar
+(non-prefixed) operations differ from the Vector (element)
+instructions (as explained in a separate section), although
+a case could be made that they are identical,
+because the defaults for the new parameters in the Scalar case make branch
+identical to Power ISA v3.1 Scalar branches.
+
+The
+RM bits can be used for other purposes because the Arithmetic modes
+make no sense at all for a Branch.
+Almost the entire
+SVP64 RM Field is interpreted differently from other Modes, in
+order to support a wide range of parallel boolean condition options
+which are expected of a Vector / GPU ISA. These save a considerable
+number of instructions in tight inner loop situations.
+
+**CR Field Ops**
+
+Condition Register Fields are 4 bits wide and consequently element-width
+overrides make absolutely no sense whatsoever. Therefore the elwidth
+override field bits can be used for other purposes when Vectorising
+CR Field instructions. Moreover, Rc=1 is completely invalid for
+CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
+a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
+such as predicate-result make no sense, and neither does Saturation.
+All of these differences, which require quite a lot of logical
+reasoning and deduction, help explain why there is an entirely different
+CR ops Vectorisation Category.
+
+A particularly strange quirk of CR-based Vector Operations is that the
+Scalar Power ISA CR Register is 32 bits, but actually comprises eight
+CR Fields, CR0-CR7. With each CR Field being four bits (LT, GT, EQ, SO)
+this makes up 32 bits, and therefore a CR operand referring to one bit
+of the CR will be 5 bits in length (BA, BT).
+*However*, some instructions refer
+to a *CR Field* (CR0-CR7) and consequently these operands
+(BF, BFA etc) are only 3-bits.
+
+(*It helps here to think of the top 3 bits of BA as referring
+to a CR Field, like BFA does, and the bottom 2 bits of BA
+referring to
+EQ/LT/GT/SO within that Field*)
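As a sketch (the helper name is illustrative, not part of the specification), that split looks like this:

```python
# Hedged sketch: split a 5-bit CR bit operand (such as BA) into the
# 3-bit CR Field number it belongs to and the 2-bit position of the
# individual bit (one of the Field's four bits) within that Field.
def decode_cr_bit_operand(ba):
    cr_field = ba >> 2        # top 3 bits: which Field, CR0-CR7
    bit_in_field = ba & 0b11  # bottom 2 bits: which of the Field's 4 bits
    return cr_field, bit_in_field
```

So a 3-bit operand such as BF simply names the Field directly, while a 5-bit operand such as BA carries the Field number in its top 3 bits.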
+
+With SVP64 extending the number of CR *Fields* to 128, the number of
+32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
+(8 per CR Register). Then, it gets even more strange, when it comes
+to Vectorisation, which applies to the CR Field *numbers*. The
+hardware-for-loop for Rc=1 for example starts at CR0 for element 0,
+and moves to CR1 for element 1, and so on. The reason here is quite
+simple: each element result has to have its own CR Field co-result.
+
+In other words, the
+element is the 4-bit CR *Field*, not the bits *of* the 32-bit
+CR Register, and not the CR *Register* (of which there are now 16).
+All quite logical, but a little mind-bending.
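A sketch of that numbering (hypothetical helper name, illustrative only):

```python
# Hedged sketch: in the Rc=1 hardware for-loop, element i's co-result
# goes into CR Field i. With SVP64's 128 CR Fields packed 8 per 32-bit
# CR Register, Field i lives in slot (i % 8) of CR Register (i // 8).
def cr_field_for_element(element):
    cr_field = element           # element 0 -> CR0, element 1 -> CR1, ...
    cr_register = cr_field // 8  # one of the 16 32-bit CR Registers
    slot = cr_field % 8          # which of the 8 Fields in that Register
    return cr_register, slot
```

The element index maps one-to-one onto a 4-bit CR *Field*, never onto a 32-bit CR *Register*.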
+
+**Load/Store**
+
+LOAD/STORE is another area that has different needs: this time it is
+down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
+which simply make no sense in a RISC Scalar ISA: element-stride and
+unit-stride and the entire concept of a stride itself (a spacing
+between elements) has no place at all in a Scalar ISA. The problems
+come when trying to *retrofit* the concept of "Vector Elements" onto
+a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64
+RM Prefix to convey the stride mode, changing the Effective Address
+computation as a result. Interestingly, and worth noting for Hardware
+designers: it did turn out to be possible to perform pre-multiplication
+of the D/DS Immediate by the stride amount, making it possible to avoid
+actually modifying the LD/ST Pipeline itself.
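A sketch of the pre-multiplication idea (illustrative only: one plausible unit-strided form, with a hypothetical helper name; the exact Effective Address formulas are defined in the specification):

```python
# Hedged sketch: if each element's immediate offset is computed
# *outside* the LD/ST pipeline, the pipeline itself still sees nothing
# more than its usual "RA + D" Scalar Effective Address calculation.
def element_ea(ra, d_imm, element, stride_bytes):
    # done before the pipeline, e.g. during decode/issue
    premultiplied = d_imm + element * stride_bytes
    # the LD/ST pipeline's EA adder is completely unmodified
    return ra + premultiplied
```

The stride amount is folded into the immediate per-element, so the LD/ST Pipeline needs no awareness of Vectorisation at all.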
+
+Other areas where LD/ST went quirky: element-width overrides especially
+when combined with Saturation, given that LD/ST operations have byte,
+halfword, word, dword and quad variants. The interaction between these
+widths as part of the actual operation, and the source and destination
+elwidth overrides, was particularly obtuse and hard to derive: some care
+and attention is advised, here, when reading the specification,
+especially on the atomic Load-and-Reserve operations (lbarx, lharx etc.).
+
+**Non-vectorised**
+
+The concept of a Vectorised halt (`attn`) makes no sense. There is never
+going to be a Vector of global MSRs (Machine Status Register). `mtcr`
+on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
+Even `td` and `tdi` make a strange type of sense to permit
+Vectorisation, because a sequence of comparisons could be Vectorised.
+Vectorised System Calls (`sc`), `tlbie`, and other Cache or Virtual
+Memory Management instructions make no sense to Vectorise.
+
+However, it is really quite important to not be tempted to conclude that
+just because these instructions are un-vectoriseable, the Prefix opcode space
+must be free for reinterpretation and use for other purposes. This would
+be a serious mistake because a future revision of the specification
+might *retire* the Scalar instruction, and, worse, replace it with another.
+Again this comes down to being quite strict about the rules: only Scalar
+instructions get Vectorised: there are *no* actual explicit Vector
+instructions.
+
+**Summary**
+
+Where a traditional Vector ISA effectively duplicates the entirety
+of a Scalar ISA and then adds additional instructions which only
+make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
+considerable lengths to keep strictly to augmentation and embedding
+of an entire Scalar ISA's instructions into an abstract Vectorisation
+Context. That abstraction subdivides down into Categories appropriate
+for the type of operation (Branch, CRs, Memory, Arithmetic),
+and each Category has its own relevant but
+ultimately rational quirks.
+
+# Abstraction between Prefix and Suffix
+
+In the introduction paragraph, a great fuss was made emphasising that
+the Prefix is kept separate from the Suffix. The whole idea there is
+that a Multi-issue Decoder and subsequent pipelines would in no way have
+"back-propagation" of state that can only be determined far too late.
+This *has* been preserved; however, there is a hiccup.
+
+Power ISA 3.1 introduced a 64-bit Prefix, EXT001.
+The encoding of the prefix has 6 bits that are dedicated to letting
+the hardware know what the remainder of the Prefix bits mean: how they
+are formatted, even without having to examine the Suffix to which
+they are applied.
+
+SVP64 has such pressure on its 24-bit encoding that it was simply
+not possible to perform the same trick used by Power ISA 3.1 Prefixing.
+Therefore, rather unfortunately, it becomes necessary to perform
+a *partial decoding* of the v3.0 Suffix before the 24-bit SVP64 RM
+Fields may be identified. Fortunately this is straightforward, and
+does not rely on any outside state, and even more fortunately
+for a Multi-Issue Execution decoder, the length 32/64 is also
+easy to identify by looking for the EXT001 pattern. Once identified
+the 32/64 bits may be passed independently to multiple Decoders in
+parallel.
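The length decision can be sketched as follows (hypothetical helper; conventional shift arithmetic on the 32-bit word is used here rather than Power ISA bit numbering):

```python
# Hedged sketch: a Multi-Issue decoder can determine instruction length
# purely from the major opcode of the first 32-bit word. EXT001 (major
# opcode 1) marks a 64-bit prefixed instruction; only *after* that does
# a partial decode of the suffix select how the SVP64 RM is interpreted.
def insn_length_bits(first_word):
    major_opcode = (first_word >> 26) & 0x3F  # top 6 bits of the word
    return 64 if major_opcode == 0b000001 else 32
```

Because this check needs no outside state, length determination can run in parallel across a fetch window before the 32/64-bit chunks are handed to independent decoders.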
+
+# Predication
+
+Predication is entirely missing from the Power ISA.
+Adding it would be a costly mistake because it cannot be retrofitted
+to an ISA without literally duplicating all instructions. Prefixing
+is about the only sane way to go.
+
+CR Fields as predicate masks could be spread across multiple register
+file entries, making them costly to read in one hit. Therefore the
+possibility exists that an instruction element writing to a CR Field
+could *overwrite* the Predicate mask CR Vector during the middle of
+a for-loop.
+
+Clearly this is bad, so don't do it. If there are potential issues
+they can be avoided by using the `crweird` instructions to get CR Field
+bits into an Integer GPR (r3, r10 or r30) and using that GPR as a
+Predicate mask instead.
+
+Even in Vertical-First Mode, which is a single Scalar instruction executed
+with "offset" registers (in effect), the rule still applies: don't write
+to the same register being used as the predicate, it's `UNDEFINED`
+behaviour.
+
+## Single Predication
+
+So named because there is a Twin Predication concept as well, Single
+Predication is also unlike other Vector ISAs because it allows zeroing
+on both the source and destination. This takes some explaining.
+
+In Vector ISAs there is a Predicate Mask; it applies to the
+destination only, and there
+is a choice of actions when a Predicate Mask bit
+is zero:
+
+* set the destination element to zero
+* skip that element operation entirely, leaving the destination unmodified
+
+The problem comes if the underlying register file SRAM has, say, 64-bit
+write granularity but the Vector elements are, say, 8 bits wide.
+Some Vector ISAs strongly advocate Zeroing because to leave one single
+element at a small bitwidth in amongst other elements where the register
+file does not have the prerequisite access granularity is very expensive,
+requiring a Read-Modify-Write cycle to preserve the untouched elements.
+Putting zero into the destination avoids that Read.
+
+This is technically very easy to solve: use a Register File that does
+in fact have the smallest element-level write-enable granularity.
+If the elements are 8 bit then allow 8-bit writes!
+
+With that technical issue solved there is nothing in the way of choosing
+to support both zeroing and non-zeroing (skipping) at the ISA level:
+SV chooses to further support both *on both the source and destination*.
+This can result in the source and destination
+element indices getting "out-of-sync" even though the Predicate Mask
+is the same because the behaviour is different when zeros in the
+Predicate are encountered.
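The two choices can be sketched for a destination mask like so (hypothetical helper name; source-side predication is handled analogously, which is how the indices can drift apart):

```python
# Hedged sketch: the two possible actions for a zero Predicate Mask bit.
# With zeroing, the masked-out element is forced to zero; with skipping
# (non-zeroing), the old destination value survives untouched.
def single_predicated(op, src, dest, mask, zeroing):
    for i, bit in enumerate(mask):
        if bit:
            dest[i] = op(src[i])  # active element: normal Scalar op
        elif zeroing:
            dest[i] = 0           # zeroing: write zero, no Read needed
        # else: skip entirely, dest[i] is left unmodified
    return dest
```

With zeroing every mask bit consumes an element slot, whereas with skipping it does not: choosing one behaviour for the source and the other for the destination is what lets the two element indices diverge.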
+
+## Twin Predication
+
+Twin Predication is an entirely new concept not present in any commercial
+Vector ISA of the past forty years. To explain how normal Single-predication
+is applied in a standard Vector ISA:
+
+* Predication on the **source** of a LOAD instruction creates something
+ called "Vector Compressed Load" (VCOMPRESS).
+* Predication on the **destination** of a STORE instruction creates something
+ called "Vector Expanded Store" (VEXPAND).
+* SVP64 allows the two to be put back-to-back: one on source, one on
+ destination.
+
+The above allows a reader familiar with VCOMPRESS and VEXPAND to
+conceptualise what the effect of Twin Predication is, but it actually
+goes much further: in *any* twin-predicated instruction (extsw, fmv)
+it is possible to apply one predicate to the source register (compressing
+the source element array) and another *completely separate* predicate
+to the destination register, not just on Load/Stores but on *arithmetic*
+operations.
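The effect can be sketched like this (an illustrative model with hypothetical names, showing the non-zeroing/skipping behaviour on both sides):

```python
# Hedged sketch of Twin Predication: the source predicate compresses the
# source element array, the destination predicate expands into the
# destination, and the two element indices advance independently.
def twin_predicated(op, src, dest, src_mask, dst_mask):
    s = d = 0
    while s < len(src) and d < len(dest):
        if not src_mask[s]:
            s += 1               # masked-out source element: compress past it
            continue
        if not dst_mask[d]:
            d += 1               # masked-out destination slot: expand past it
            continue
        dest[d] = op(src[s])     # one Scalar operation, in Program Order
        s += 1
        d += 1
    return dest
```

Note that the operation itself is any ordinary Scalar instruction: the VCOMPRESS/VEXPAND effect comes entirely from the two independent masks, not from the opcode.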
+
+No other Vector ISA in the world has this back-to-back
+capability. All true Vector
+ISAs have Predicate Masks: it is an absolutely essential characteristic.
+However none of them have abstracted dual predicates out to the extent
+where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
+wide range of arithmetic
+instructions, as well as Load/Store.
+
+It is however important to note that not all instructions can be Twin
+Predicated (2P): some remain only Single Predicated (1P), as is normally found
+in other Vector ISAs. Arithmetic operations with
+four registers (3-in, 1-out, VA-Form for example) are Single. The reason
+is that there just wasn't enough space in the 24-bits of the SVP64 Prefix.
+Consequently, when using a given instruction, it is necessary to look
+up in the ISA Tables whether it is 1P or 2P. Caveat emptor!
+
+Also worth a special mention: all Load/Store operations are Twin-Predicated.
+The underlying key to understanding:
+
+* one Predicate effectively applies to the Array of Memory *Addresses*,
+* the other Predicate effectively applies to the Array of Memory *Data*.