# The Rules

[[!toc]]

SVP64 is designed around these fundamental and inviolate principles:

1. There are no actual Vector instructions: Scalar instructions
   are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
   just because it is prefixed (caveats below).
3. A hardware-level for-loop makes vector elements 100% synonymous
   with scalar instructions (the suffix).

How can a Vector ISA even exist when no actual Vector instructions are permitted to be added? It comes down to the strict abstraction. First, a **scalar** instruction (32-bit) is added. Second, the Prefixing is applied *in the abstract* to give the *appearance*, and ultimately the same effect, as if an explicit Vector instruction had also been added.

There are a few exceptional places where these rules get bent, and others where the rules take some explaining, and this page tracks them.

The modification caveat obviously exempts element width overrides, which still do not actually modify the meaning of the instruction: an add remains an add, even if it is only an 8-bit add rather than a 64-bit add. elwidth overrides *definitely* do not alter the v3.0 encoding. Other "modifications" such as Saturation or Data-Dependent Fail-First are likewise post-augmentation or post-analysis, and do not fundamentally change an add operation into, say, a subtract.

An experiment was attempted to modify LD-immediate instructions to include a third RC register, i.e. to reinterpret the normal v3.0 32-bit instruction as a different encoding when SVP64-prefixed: it did not go well. The complexity that resulted in the decode phase was too great. The lesson was learned the hard way: it is infinitely preferable to add a 32-bit Scalar Load-with-Shift instruction *first*, which then inherently becomes Vectorised. Perhaps a future Power ISA spec will have this Load-with-Shift instruction: both ARM and x86 have it, because it saves greatly on instruction count in hot loops.
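The hardware-level for-loop of Rule 3 can be sketched in Python. This is purely illustrative (the function names, the flat `gpr` register-file model, and `VL` as a plain parameter are all hypothetical simplifications, not the specification's pseudocode): a prefixed scalar `add` becomes, in effect, a Vector instruction without any new Vector opcode existing.

```python
# Minimal sketch of the SVP64 hardware-level for-loop.
# `gpr` is a toy flat register file; VL is the Vector Length.

def scalar_add(gpr, rt, ra, rb):
    # the unmodified 32-bit scalar instruction (the "suffix")
    gpr[rt] = gpr[ra] + gpr[rb]

def sv_add(gpr, rt, ra, rb, VL):
    # the prefix contributes only the loop: the scalar add itself
    # is simply executed VL times, in strict Program Order
    for i in range(VL):
        scalar_add(gpr, rt + i, ra + i, rb + i)

gpr = list(range(32))        # toy register file: gpr[i] = i
sv_add(gpr, 0, 8, 16, VL=4)  # gpr[0..3] = gpr[8..11] + gpr[16..19]
```

Note how `sv_add` contains no arithmetic of its own: all semantics live in `scalar_add`, which is exactly the abstraction the rules demand.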
# Instruction Groups

The basic principle of SVP64 is the prefix, which contains mode as well as register augmentation and predicates. When thinking of instructions and Vectorising them, it is natural for arithmetic operations (ADD, OR) to be the first to spring to mind. Arithmetic instructions have registers, therefore augmentation applies, end of story, right?

Except, Load and Store also deals with Memory, not just registers. Power ISA has Condition Register Fields: how can element widths apply there? And branches: how can you have Saturation on something that does not return an arithmetic result? In short: there are actually four different categories (five including those for which Vectorisation makes no sense at all, such as `sc` or `mtmsr`). The categories are:

* arithmetic/logical, including floating-point
* Load/Store
* Condition Register Field operations
* branch

**Arithmetic**

Arithmetic (known as "normal" mode) is where Scalar and Parallel Reduction can be done, and Saturation as well, plus two modes new and innovative for Vector ISAs: Data-Dependent Fail-First and Predicate-Result. Reduction and Saturation are common to see in Vector ISAs: it is just that they are usually added as explicit instructions. In SVP64 these concepts are applied in the abstract general form, which takes some getting used to and may at first appear to produce invalid results; ultimately it is critical to think in terms of the "rules": everything is Scalar instructions, in strict Program Order.

**Branches**

Branch is the one and only place where the Scalar (non-prefixed) operations differ from the Vector (element) instructions, as explained in a separate section. The RM bits can be used for other purposes because the Arithmetic modes make no sense at all for a Branch. Almost the entire SVP64 RM Field is interpreted differently from other Modes, in order to support a wide range of parallel boolean condition options which are expected of a Vector / GPU ISA.
These save a considerable number of instructions in tight inner-loop situations.

**CR Field Ops**

Condition Register Fields are 4 bits wide and consequently element-width overrides make absolutely no sense whatsoever. Therefore the elwidth override field bits can be used for other purposes when Vectorising CR Field instructions. Moreover, Rc=1 is completely invalid for CR operations such as `crand`: Rc=1 is for arithmetic operations, producing a "co-result" that goes into CR0 or CR1. Thus the Arithmetic modes such as Predicate-Result make no sense, and neither does Saturation. All of these differences, which require quite a lot of logical reasoning and deduction, help explain why there is an entirely different CR ops Vectorisation Category.

**Load/Store**

LOAD/STORE is another area that has different needs: this time it is down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes which simply make no sense in a RISC Scalar ISA: element-stride, unit-stride, and the entire concept of a stride itself (a spacing between elements) have no place at all in a Scalar ISA. The problems come when trying to *retrofit* the concept of "Vector Elements" onto a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64 RM Prefix to convey the stride mode, changing the Effective Address computation as a result.

Interestingly, and worth noting for Hardware designers: it did turn out to be possible to perform pre-multiplication of the D/DS Immediate by the stride amount, making it possible to avoid actually modifying the LD/ST Pipeline itself.

Other areas where LD/ST went quirky: element-width overrides, especially when combined with Saturation, given that LD/ST operations have byte, halfword, word, dword and quad variants. The interaction between these widths as part of the actual operation, and the source and destination elwidth overrides, was particularly hard to derive: some care and attention is advised here when reading the specification.
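A sketch of strided Effective Address computation may help make the pre-multiplication remark concrete. This is illustrative only (the function name, the `memory`/`gpr` models, and the exact EA formula shown are assumptions for the sketch, not the specification's pseudocode): because the per-element immediate can be computed up front as a plain offset, the loop body remains an ordinary scalar load with an ordinary immediate, leaving the LD/ST pipeline untouched.

```python
# Illustrative sketch of a strided Vectorised load, assuming a
# bytes-like `memory`, a flat register file `gpr`, and 8-byte
# (dword) elements. All names here are hypothetical.

def sv_ld_strided(gpr, memory, rt, ra, D, VL, elwidth=8):
    for i in range(VL):
        # pre-multiplied immediate: per element the EA is still just
        # "base + immediate", exactly as in the scalar instruction
        EA = gpr[ra] + D * i
        gpr[rt + i] = int.from_bytes(memory[EA:EA + elwidth], "little")

memory = bytearray(64)
memory[0:8] = (111).to_bytes(8, "little")
memory[8:16] = (222).to_bytes(8, "little")
gpr = [0] * 32
gpr[4] = 0                                # base address in RA=4
sv_ld_strided(gpr, memory, rt=8, ra=4, D=8, VL=2)
# gpr[8] and gpr[9] now hold the two consecutive dwords
```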
**Non-vectorised**

The concept of a Vectorised halt (`attn`) makes no sense. There is never going to be a Vector of global MSRs (Machine State Register). `mtcr`, on the other hand, is a grey area: `mtspr` is clearly Vectoriseable. Even `td` and `tdi` make a strange type of sense to permit to be Vectorised, because a sequence of comparisons could be Vectorised. Vectorised System Calls (`sc`) or `tlbie` and other Cache or Virtual Memory Management instructions make no sense to Vectorise.

However, it is really quite important not to be tempted to conclude that just because these instructions are un-vectoriseable, the opcode space must be free for reinterpretation and use for other purposes. This would be a serious mistake, because a future revision of the specification might *retire* the Scalar instruction, replacing it with another. Again this comes down to being quite strict about the rules: only Scalar instructions get Vectorised; there are *no* actual explicit Vector instructions.

**Summary**

Where a traditional Vector ISA effectively duplicates the entirety of a Scalar ISA and then adds additional instructions which only make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to considerable lengths to keep strictly to augmentation and embedding of an entire Scalar ISA's instructions into an abstract Vectorisation Context. That abstraction subdivides into Categories appropriate for the type of operation (Branch, CRs, Memory, Arithmetic), and each Category has its own relevant but ultimately rational quirks.

# Twin Predication

Twin Predication is an entirely new concept not present in any commercial Vector ISA of the past forty years. To explain:

* Predication on the destination of a LOAD instruction creates something called "Vector Compressed Load" (VCOMPRESS).
* Predication on the *source* of a STORE instruction creates something called "Vector Expanded Store" (VEXPAND).
* SVP64 allows the two to be put back-to-back.
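The bullets above can be sketched as a toy model: one predicate compresses the source element array while a completely separate predicate expands into the destination array. This is an illustrative approximation (the function name, the zero-skipping behaviour shown, and the flat register-file model are assumptions for the sketch), demonstrated on a simple Vectorised move.

```python
# Toy model of Twin Predication, assuming masked-out elements are
# skipped on both sides independently. Names are hypothetical.

def sv_mv_twin(gpr, rt, ra, VL, srcmask, dstmask):
    src = dst = 0
    while src < VL and dst < VL:
        # advance each index past its own masked-out elements
        while src < VL and not (srcmask >> src) & 1:
            src += 1
        while dst < VL and not (dstmask >> dst) & 1:
            dst += 1
        if src < VL and dst < VL:
            gpr[rt + dst] = gpr[ra + src]   # move one active element
            src += 1
            dst += 1

gpr = [0] * 16
gpr[0:4] = [10, 20, 30, 40]
# compress source elements 0 and 2; expand into destinations 1 and 3
sv_mv_twin(gpr, rt=8, ra=0, VL=4, srcmask=0b0101, dstmask=0b1010)
# gpr[9] = 10 and gpr[11] = 30; gpr[8] and gpr[10] are untouched
```

The two indices advancing independently is the essence of the idea: with `dstmask` all-ones this degenerates to VCOMPRESS, and with `srcmask` all-ones to VEXPAND.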
The above allows a reader familiar with VCOMPRESS and VEXPAND to conceptualise what the effect of Twin Predication is, but it actually goes much further: in *any* twin-predicated instruction (extsw, fmv) it is possible to apply one predicate to the source register (compressing the source element array) and another *completely separate* predicate to the destination register, not just on Load/Stores but on *arithmetic* operations. No other Vector ISA in the world has this capability.

All true Vector ISAs have Predicate Masks: it is an absolutely essential characteristic. However none of them have abstracted dual predicates out to the extent where they are applicable *in general* to a wide range of arithmetic instructions, as well as Load/Store.

It is however important to note that not all instructions can be Twin-Predicated: some remain only Single-Predicated, as is normally found in other Vector ISAs. Arithmetic operations with four registers (3-in, 1-out, VA-Form for example) are Single-Predicated. The reason is that there simply was not enough space in the 24 bits of the SVP64 Prefix. Consequently, when using a given instruction, it is necessary to look up in the ISA Tables whether it is 1P or 2P. Caveat emptor!

Also worth a special mention: all Load/Store operations are Twin-Predicated. In other words: one Predicate applies to the Array of Memory Addresses, whilst the other Predicate applies to the Array of Memory Data.

# CR weird instructions

[[sv/int_cr_predication]] is by far the biggest violator of the SVP64 rules, for good reasons: transfers between Vectors of CR Fields and Integers for use as predicates are very awkward without these instructions. Normally, element width overrides allow the element width to be specified as 8, 16, 32 or default (64) bit. With CR weird instructions producing or consuming either 1-bit or 4-bit elements (in effect), some adaptation was required.
When this perspective is taken (that results or sources are 1 or 4 bits), the weirdness starts to make sense, because the "elements", such as they are, are still packed sequentially. From a hardware implementation perspective, however, they will need special handling as far as Hazard Dependencies are concerned, due to nonconformance (bit-level management).

# mv.x

[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in terms of Register Hazard Management that its addition to any Scalar ISA is anathema. In a Traditional Vector ISA, however, where the indices are isolated behind a single Vector Hazard, there is no problem at all. `sv.mv.x` is also fraught, precisely because it sits on top of a Standard Scalar register paradigm, not a Vector ISA with separate and distinct Vector registers. To help partly solve this, `sv.mv.x` has to be made relative:

```
for i in range(VL):
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```

The reason for doing so is that MAXVL or VL may be used to limit the number of Register Hazards that need to be raised to a fixed quantity, at Issue time. `mv.x` itself will still have to be added as a Scalar instruction, but the behaviour of `sv.mv.x` will have to be different from that Scalar version.

Normally, Scalar instructions have a good justification for being added as Scalar instructions on their own merit. `mv.x` is the polar opposite, and as such qualifies for a special mention in this section.

# Branch-Conditional

[[sv/branches]] are a very special exception to the rule that there shall be no deviation from the corresponding Scalar instruction. This is because of the tight integration with looping and the application of Boolean Logic manipulation needed for Parallel operations (predicate mask usage). This results in an extremely important observation: `scalar identity behaviour` is violated. The SV Prefixed variant of branch is **not** the same operation as the unprefixed 32-bit scalar version.
One key difference is that LR is only updated if certain additional conditions are met, whereas Scalar `bclrl`, for example, unconditionally overwrites LR. Well over 500 Vectorised branch instructions exist in SVP64 due to the number of options available: close integration and interaction with the base Scalar Branch was unavoidable in order to create Conditional Branching suitable for parallel 3D / CUDA GPU workloads.

# Saturation

The application of Saturation as a retro-fit to a Scalar ISA is challenging. It does help that within the SFFS Compliancy subset there are no Saturated operations at all: they are only added in VSX. Saturation does not inherently change the instruction itself: it does however come with some fundamental implications when applied. For example: a Floating-Point operation that would normally raise an exception will no longer do so, instead setting the CR1.SO Flag. Another quirky example: signed operations which produce a negative result will be truncated to zero if Unsigned Saturation is requested.

One very important aspect for implementors is that the operation in effect has to be considered to be performed at infinite precision, followed by saturation detection. In practice this does not actually require infinite-precision hardware: two 8-bit integers being added can only ever overflow into a 9-bit result. Overall, some care and consideration needs to be applied.

# Fail-First

Fail-First (both the Load/Store and Data-Dependent variants) is worthy of a special mention in its own right.
Where VL is normally forward-looking, and may be part of a pre-decode phase in a (simplified) pipelined architecture with no Read-after-Write Hazards, Fail-First changes that: at any point during the execution of the element-level instructions, one of those elements may not only terminate further continuation of the hardware for-looping but also effect a change of VL:

```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```

This is not exactly a violation of SVP64 Rules, more a breakage of user expectations, particularly for LD/ST, where exceptions would normally be expected to be raised; Fail-First provides for avoidance of those exceptions.
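The pseudocode above can be turned into a small runnable model of the Data-Dependent variant. This is an illustrative sketch (the function name, the flat register-file model, and the choice of zero-termination as the test are assumptions, not specification pseudocode): the element-level test truncates VL at the first "failing" element, the classic strlen/strncpy-style use-case.

```python
# Toy model of Data-Dependent Fail-First: each element's result is
# tested, and the first failure truncates VL, skipping the failing
# element and all subsequent ones. Names are hypothetical.

def ddffirst(gpr, rt, ra, VL, element_op, test):
    for i in range(VL):
        result = element_op(gpr[ra + i])
        if test(result):
            return i              # truncated VL
        gpr[rt + i] = result      # element survives: write it back
    return VL

gpr = [0] * 16
gpr[4:8] = [ord('H'), ord('i'), 0, ord('!')]   # NUL-terminated data
new_VL = ddffirst(gpr, rt=8, ra=4, VL=4,
                  element_op=lambda x: x,
                  test=lambda r: r == 0)
# new_VL is 2: the copy stopped at the zero byte, and the '!' after
# the terminator was never touched (no spurious load, no exception)
```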