[[!toc]]
SVP64 is designed around fundamental and inviolate RISC principles.
This gives a uniformity and regularity to the ISA, making implementation
straightforward, which was why RISC as a concept became popular.
1. There are no actual Vector instructions: Scalar instructions
are the sole exclusive bedrock.
with scalar instructions (the suffix)
How can a Vector ISA even exist when no actual Vector instructions
are permitted to be added? It comes down to the strict RISC abstraction.
First you start from a **scalar** instruction (32-bit). Second, the
Prefixing is applied *in the abstract* to give the *appearance*
and ultimately the same effect as if an explicit Vector instruction
had also been added. Looking at the pseudocode of any Vector ISA
(RVV, NEC SX Aurora, Cray), they always comprise (a) a for-loop
around (b) element-based operations.
It is perfectly reasonable and rational to separate (a) from (b),
then find a powerful pre-existing
Supercomputing-class ISA that qualifies for (b).
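The (a)/(b) separation can be sketched in a few lines of Python. The flat
register-file layout and the operand numbers (RT, RA, RB) here are purely
hypothetical, chosen only to illustrate the abstraction:

```python
import operator

# (a) the Prefix supplies the hardware for-loop; (b) the Suffix is an
# unmodified Scalar operation applied once per element.
def svp64_execute(scalar_op, VL, gpr, RT, RA, RB):
    for i in range(VL):                                    # (a) for-loop
        gpr[RT + i] = scalar_op(gpr[RA + i], gpr[RB + i])  # (b) scalar op
    return gpr

# hypothetical operands: RT=0, RA=8, RB=12, VL=4
regs = [0] * 8 + [1, 2, 3, 4] + [10, 20, 30, 40]
svp64_execute(operator.add, 4, regs, 0, 8, 12)
# regs[0:4] is now [11, 22, 33, 44]
```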
There are a few exceptional places where these rules get
bent, and others where the rules take some explaining,
and this page tracks them all.
The modification caveat in (2) above semantically
exempts element width overrides,
which still do not actually modify the meaning of the instruction:
an add remains an add, even if its override makes it an 8-bit add rather than
a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry
where the 9th bit becomes Carry-out (not the 65th bit).
In other words, elwidth overrides **definitely** do not fundamentally
alter the actual
Scalar v3.0 ISA encoding itself. Consequently we can still, in
the strictest semantic sense, not be breaking rule (2).

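A minimal sketch of the elwidth=8 add-with-carry semantics described above
(the function name and interface are illustrative, not the spec's pseudocode):

```python
# Illustrative only: an add-with-carry is the same operation at any
# elwidth; only the bit position of the Carry-out moves. At elwidth=8
# the carry-out is bit 8 (the "9th bit"), not bit 64 (the "65th bit").
def adde(ra, rb, carry_in, elwidth=64):
    mask = (1 << elwidth) - 1
    total = (ra & mask) + (rb & mask) + carry_in
    return total & mask, (total >> elwidth) & 1  # (result, carry-out)

# 8-bit: 0xFF + 0x01 wraps to 0x00 with carry-out 1
assert adde(0xFF, 0x01, 0, elwidth=8) == (0x00, 1)
```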
Likewise, other "modifications" such as saturation or Data-dependent
Fail-First are actually post-augmentation or post-analysis, and do
not fundamentally change an add operation into a subtract,
for example, and under absolutely no circumstances do the actual 32-bit
Scalar v3.0 operand field bits change or the number of operands change.
In an early Draft of SVP64,
an experiment was attempted, to modify LD-immediate instructions
to include a
third RC register, i.e. reinterpret the normal
encoding to a different one if SVP64-prefixed. It did not go well.
The complexity that resulted
in the decode phase was too great. The lesson was learned, the
hard way: it would be infinitely preferable
to add a 32-bit Scalar Load-with-Shift
instruction *first*, which then inherently becomes Vectorised.
Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
both ARM and x86 have it, because it saves greatly on instruction count in
hot-loops.

The other reason for not adding an SVP64-Prefixed instruction without
also having it as a Scalar un-prefixed instruction is that if the
32-bit encoding is ever allocated, in a future revision
of the Power ISA,
to a completely unrelated operation,
then how can a Vectorised version of that new instruction ever be added?
The uniformity and RISC Abstraction would be irreparably damaged.

The bottom line here is that the fundamental RISC Principle is strictly adhered
to, even though these are Advanced 64-bit Vector instructions.
Advocates of the RISC Principle will appreciate the uniformity of
SVP64 and the level of systematic abstraction kept between Prefix and Suffix.
# Instruction Groups
Reduction can be done: Saturation as well, and two new innovative
modes for Vector ISAs: data-dependent fail-first and predicate result.
Reduction and Saturation are common to see in Vector ISAs: it is just
that they are usually added as explicit instructions,
and NEC SX Aurora has even more iterative instructions. In SVP64 these
concepts are applied in the abstract general form, which takes some
getting used to.

Reduction may, when applied to non-commutative
instructions incorrectly, result in invalid results, but ultimately
it is critical to think in terms of the "rules": that everything is
Scalar instructions in strict Program Order. Reduction on non-commutative
Scalar Operations is not *prohibited*: the strict Program Order allows
the programmer to think through what would happen and thus potentially
come up with a legitimate use.
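The point about strict Program Order can be illustrated with a sketch
(this is not the actual REMAP reduction schedule, which may differ; it
shows only why a fully defined order makes non-commutative reduction
meaningful):

```python
from operator import add, sub

# A reduction expressed as Scalar operations in strict Program Order.
# Because the order is fully defined, even a non-commutative operation
# like subtract has a single predictable result.
def reduce_in_program_order(op, vec):
    acc = vec[0]
    for i in range(1, len(vec)):   # one Scalar instruction per step
        acc = op(acc, vec[i])
    return acc

assert reduce_in_program_order(add, [1, 2, 3, 4]) == 10
assert reduce_in_program_order(sub, [10, 1, 2, 3]) == 4   # well-defined
```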
**Branches**
The Scalar Power ISA CR Register is 32-bits, but actually comprises eight
CR Fields, CR0-CR7. With each CR Field being four bits (LT, GT, EQ, SO)
this makes up 32 bits, and therefore a CR operand referring to one bit
of the CR will be 5 bits in length (BA, BT).
*However*, some instructions refer
to a *CR Field* (CR0-CR7) and consequently these operands
(BF, BFA etc.) are only 3-bits.

(*It helps here to think of the top 3 bits of BA as referring
to a CR Field, like BFA does, and the bottom 2 bits of BA
referring to
LT/GT/EQ/SO within that Field*)
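That way of thinking about BA can be sketched as a small illustrative
helper (not ISA pseudocode):

```python
# Top 3 bits of BA select the CR Field (just as BFA does); bottom
# 2 bits select LT/GT/EQ/SO within that Field.
def decode_ba(ba):
    assert 0 <= ba < 32              # BA is a 5-bit operand
    cr_field = ba >> 2               # which of CR0-CR7
    bit = ("LT", "GT", "EQ", "SO")[ba & 0b11]
    return cr_field, bit

assert decode_ba(6) == (1, "EQ")     # bit 2 of CR Field 1
```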
With SVP64 extending the number of CR *Fields* to 128, the number of
32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
(8 per CR Register). Then, it gets even more strange, when it comes
to Vectorisation, which applies to the CR Field *numbers*. The
hardware-for-loop for Rc=1 for example starts at CR0 for element 0,
and moves to CR1 for element 1, and so on. The reason here is quite
simple: each element result has to have its own CR Field co-result.

In other words, the
element is the 4-bit CR *Field*, not the bits *of* the 32-bit
CR Register, and not the CR *Register* (of which there are now 16).
All quite logical, but a little mind-bending.
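A simplified sketch of that Rc=1 hardware-for-loop (XER.SO handling
omitted, function name hypothetical):

```python
# Each element result gets its own 4-bit CR Field co-result:
# CR Field 0 for element 0, CR Field 1 for element 1, and so on.
def rc1_crfields(results):
    crfields = []
    for r in results:                        # element i -> CR Field i
        lt, gt, eq = int(r < 0), int(r > 0), int(r == 0)
        so = 0                               # XER.SO omitted for brevity
        crfields.append((lt, gt, eq, so))
    return crfields

assert rc1_crfields([-5, 0, 7]) == [(1, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0)]
```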
halfword, word, dword and quad variants. The interaction between these
widths as part of the actual operation, and the source and destination
elwidth overrides, was particularly obtuse and hard to derive: some care
and attention is advised, here, when reading the specification,
especially on the atomic Load-and-Reserve instructions (lbarx, lharx etc.)
**Non-vectorised**
and each Category has its own relevant but
ultimately rational quirks.
# Single Predication

So named because there is a Twin Predication concept as well, Single
Predication is also unlike other Vector ISAs because it allows zeroing
on both the source and destination. This takes some explaining.

In Vector ISAs there is a Predicate Mask; it applies to the
destination only, and there
is a choice of actions when a Predicate Mask bit
is zero:

* set the destination element to zero
* skip that element operation entirely, leaving the destination unmodified

The problem comes if the underlying register file SRAM has, say, 64-bit
write granularity but the Vector elements are, say, 8-bit wide.
Some Vector ISAs strongly advocate Zeroing, because to leave one single
element at a small bitwidth in amongst other elements, where the register
file does not have the prerequisite access granularity, is very expensive,
requiring a Read-Modify-Write cycle to preserve the untouched elements.
Putting zero into the destination avoids that Read.

This is technically very easy to solve: use a Register File that does
in fact have the smallest element-level write-enable granularity.
If the elements are 8-bit then allow 8-bit writes!

With that technical issue solved there is nothing in the way of choosing
to support both zeroing and non-zeroing (skipping) at the ISA level:
SV chooses to further support both *on both the source and destination*.
This can result in the source and destination
element indices getting "out-of-sync", even though the Predicate Mask
is the same, because the behaviour is different when zeros in the
Predicate are encountered.

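The two destination-side choices can be sketched as follows (source-side
zeroing and the index-desynchronisation behaviour are omitted for brevity,
and the function shape is purely illustrative):

```python
# With zeroing, a masked-out element writes zero (no Read needed);
# with skipping (non-zeroing), the destination is left unmodified.
def sv_add(dest, src1, src2, mask, VL, zeroing):
    for i in range(VL):
        if (mask >> i) & 1:
            dest[i] = src1[i] + src2[i]
        elif zeroing:
            dest[i] = 0
        # else: skip entirely, dest[i] untouched
    return dest

assert sv_add([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, 4, True)  == [11, 0, 33, 0]
assert sv_add([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, 4, False) == [11, 9, 33, 9]
```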
# Twin Predication
Twin Predication is an entirely new concept not present in any commercial
Vector ISA of the past forty years. To explain how normal Single-predication
is applied in a standard Vector ISA:
* Predication on the **source** of a LOAD instruction creates something
called "Vector Compressed Load" (VCOMPRESS).
* Predication on the **destination** of a STORE instruction creates something
called "Vector Expanded Store" (VEXPAND).
* SVP64 allows the two to be put back-to-back: one on source, one on
destination.
to the destination register, not just on Load/Stores but on *arithmetic*
operations.
No other Vector ISA in the world has this back-to-back
capability. All true Vector
ISAs have Predicate Masks: it is an absolutely essential characteristic.
However none of them have abstracted dual predicates out to the extent
where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
Also worth a special mention: all Load/Store operations are Twin-Predicated.
The underlying key to understanding:
* one Predicate effectively applies to the Array of Memory *Addresses*,
* the other Predicate effectively applies to the Array of Memory *Data*.
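Twin Predication on a simple move can be sketched as follows (illustrative
only; the real Schedules also handle zeroing, elwidth overrides and VL
truncation):

```python
# Two independent masks: source elements are gathered (compress) and
# scattered into the destination (expand) in a single operation.
def twin_pred_mv(dest, src, srcmask, dstmask, VL):
    s = d = 0
    while s < VL and d < VL:
        # skip masked-out source and destination elements independently
        while s < VL and not ((srcmask >> s) & 1):
            s += 1
        while d < VL and not ((dstmask >> d) & 1):
            d += 1
        if s < VL and d < VL:
            dest[d] = src[s]
            s += 1
            d += 1
    return dest

# compress from src positions 0,2 and expand into dest positions 1,3
assert twin_pred_mv([0, 0, 0, 0], [10, 20, 30, 40], 0b0101, 0b1010, 4) == [0, 10, 0, 30]
```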
# CR weird instructions
[[sv/cr_int_predication]] is by far the biggest violator of the SVP64
rules, for good reasons. Transfers between Vectors of CR Fields and Integers
for use as predicates is very awkward without them.
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management)
# mv.x (vector permute)
[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
indices are isolated behind a single Vector Hazard, there is no
problem at all. `sv.mv.x` is also fraught, precisely because it
sits on top of a Standard Scalar register paradigm, not a Vector
ISA with separate and distinct Vector registers.
To help partly solve this, `sv.mv.x` would have had to be made relative:
```
for i in range(VL):
the number of Register Hazards that need to be raised to a fixed
quantity, at Issue time.
`mv.x` itself would still have to be added as a Scalar instruction,
but the behaviour of `sv.mv.x` would have to be different from that
Scalar version.
Normally, Scalar Instructions have a good justification for being
added as Scalar instructions on their own merit. `mv.x` is the
polar opposite: in the end the idea was thrown out, and Indexed
REMAP added in its place. Indexed REMAP comes with its own quirks,
solving the Hazard problem, and is described in a later section.
# Branch-Conditional
This is sort-of a quirk and sort-of not, because the option to support
XER.SO is already optional from the SFFS Compliancy Level.

# Indexed REMAP and CR Field Predication Hazards

Normal Vector ISAs, and those Packed SIMD ISAs inspired by them, have
Vector "Permute" or "Shuffle" instructions. These provide a Vector of
indices whereby another Vector is reordered (permuted, shuffled) according
to the indices. Register Hazard Management here is trivial because there
are only three registers: indices source vector, elements source vector to
be shuffled, and result vector.

For SVP64, which is based on top of a Scalar Register File paradigm,
combined with the hard requirement to respect full Register Hazard
Management as if element instructions were actual Scalar instructions,
the addition of a Vector permute instruction under these strict
conditions would result in a catastrophic
reduction in performance, due to having to consider Read-after-Write
and Write-after-Read Hazards *at the element level*.

A little leniency and rule-bending is therefore required.

Rather than add explicit Vector permute instructions, the "Indexing"
has been separated out into a REMAP Schedule. When an Indexed
REMAP is requested, it is assumed (required, of software) that
subsequent instructions intending to use those indices *will not*
attempt to modify the indices. It is *Software* that must consider them
to be read-only.

This simple relaxation of the rules releases Hardware from having the
horrendous job of dynamically detecting Write-after-Read Hazards on a
huge range of registers.

A similar Hazard problem exists for CR Field Predicates, in Vertical-First
Mode. Instructions could modify CR Fields currently being used as Predicate
Masks: detecting this is so horrendous for hardware resource utilisation
and hardware complexity that, again, the decision is made to relax these
constraints and for Software to take that into account.
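The resulting software contract for Indexed REMAP can be sketched like
this (hypothetical helper; the real REMAP Schedule is set up via separate
instructions and covers far more than a simple copy):

```python
# Software sets up the index vector once and must then treat it as
# read-only; hardware may therefore use it as a fixed reordering
# Schedule without tracking per-element Write-after-Read hazards on it.
def indexed_remap(dest, src, indices, VL):
    for i in range(VL):
        dest[i] = src[indices[i]]   # element i reads src[indices[i]]
    return dest

assert indexed_remap([0] * 4, [10, 20, 30, 40], [3, 2, 1, 0], 4) == [40, 30, 20, 10]
```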

# Floating-Point "Single" becomes "Half"

In several places in the Power ISA there are operations that act on
32-bit quantities in 64-bit registers. The best example is FP, which
has 64-bit operations (`fadd`) and 32-bit operations (`fadds`, or
FP Add "single"). Element-width overrides would therefore seem to
be unnecessary under these circumstances.

However, it is not possible for `fadds` to fit two elements into
64 bits: bear in mind that the FP32 bits are spread out across a 64-bit
register in FP64 format. The solution here was to consider the
"s" at the end of each instruction
to mean "half of the element's width". Thus, `sv.fadds/ew=32`
actually stores an FP16 value spread out across the 32 bits of an
element, in FP32 format, whereas `sv.fadd/ew=32` stores a full
FP32 result into the full 32 bits.
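The rule can be sketched using IEEE754 bit-packing (the helper is
illustrative; rounding-mode subtleties and NaN handling are ignored):

```python
import struct

# The trailing "s" means "half of the element's width": at ew=32 the
# sum is rounded through FP16, then stored in FP32 format within the
# 32-bit element (cf. scalar fadds: FP32 result stored in FP64 format).
def fadds_ew32(a_bits, b_bits):
    a = struct.unpack("<f", struct.pack("<I", a_bits))[0]
    b = struct.unpack("<f", struct.pack("<I", b_bits))[0]
    half = struct.unpack("<e", struct.pack("<e", a + b))[0]  # FP16 round
    return struct.unpack("<I", struct.pack("<f", half))[0]   # FP32 format

one = struct.unpack("<I", struct.pack("<f", 1.0))[0]
two = struct.unpack("<I", struct.pack("<f", 2.0))[0]
assert fadds_ew32(one, one) == two
```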