thing on OpenPOWER would require a whopping 24 6-bit Major Opcodes which
is clearly impractical: other schemes need to be devised.
-In addition we would like to add SV-C32 which is a Vectorised version
+In addition we would like to add SV-C32 which is a Vectorized version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64, as well.
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. Stable Open ISAs require Standards and Compliance Suites that
take more. A Vector or Packed SIMD ISA to reach stable *general-purpose*
-auto-vectorisation compiler support has never been achieved in the
+auto-vectorization compiler support has never been achieved in the
history of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
Slowly, at this point, a realisation should be sinking in that, actually,
there aren't that many really truly viable Vector ISAs out there, as the
-ones that are evolving in the general direction of Vectorisation are,
+ones that are evolving in the general direction of Vectorization are,
in various completely different ways, flawed.
**Successfully identifying a limitation marks the beginning of an
sequential carry-flag chaining of these scalar instructions.
* The Condition Register Fields of the Power ISA make a great candidate
for use as Predicate Masks, particularly when combined with
- Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
+ Vectorized `cmp` and Vectorized `crand`, `crxor` etc.
It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
so Scalar ones. Examples include that transfer operations between the
Integer and Floating-point Scalar register files were dropped approximately
a decade ago after the Packed SIMD variants were considered to be
-duplicates. With it being completely inappropriate to attempt to Vectorise
+duplicates. With it being completely inappropriate to attempt to Vectorize
a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
-the Scalar ISA, a much better all-round candidate for Vectorisation
+the Scalar ISA, a much better all-round candidate for Vectorization
(the Scalar parts of Power ISA) is left anaemic.
A particular key instruction that is missing is `MV.X` which is
expensive instruction causing a huge swathe of Register Hazards
in one single hit is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
-Vectorised it allows for arbitrary
+Vectorized it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
have to be "massaged" by tools that insert intrinsics into the
source code, in order to identify the Basic Blocks that the Zero-Overhead
Loops can run. Can this be merged into standard gcc and llvm
-compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
+compilers? As intrinsics: of course. Can it become part of auto-vectorization? Probably,
if an infinite supply of money and engineering time is thrown at it.
Is a half-way-house solution of compiler intrinsics good enough?
Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
<img src="/openpower/sv/sv_horizontal_vs_vertical.svg" />
First, some important definitions, because there are two different
-Vectorisation Modes in SVP64:
+Vectorization Modes in SVP64:
* **Horizontal-First**: (aka standard Cray Vectors) walk
through **elements** first before moving to next **instruction**
the L1-L4 Cache and Virtual Memory Barriers is it possible to
ascertain, retrospectively, that time and power had just been wasted.
-SVP64 is able to do what is termed "Vertical-First" Vectorisation,
+SVP64 is able to do what is termed "Vertical-First" Vectorization,
combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
extended, Snitch-style, to perform a deterministic memory-array walk of
a large Matrix.
The reason in this case for the use of Vertical-First Mode is the
conditional execution of the Multiply-and-Accumulate.
-Horizontal-First Mode is the standard Cray-Style Vectorisation:
+Horizontal-First Mode is the standard Cray-Style Vectorization:
loop on all *elements* with the same instruction before moving
on to the next instruction. Horizontal-First
Predication needs to be pre-calculated
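The two traversal orders can be sketched in a few lines (illustrative Python, not the spec's pseudocode; `ops` stands for the instructions in the loop body and `vl` for the Vector Length):

```python
def horizontal_first(ops, vl, trace):
    # Cray-style: complete all VL elements of one instruction
    # before moving on to the next instruction
    for op in ops:
        for i in range(vl):
            trace.append((op, i))

def vertical_first(ops, vl, trace):
    # walk a single element through every instruction, then
    # step to the next element
    for i in range(vl):
        for op in ops:
            trace.append((op, i))
```

With two instructions and VL=2, Horizontal-First visits element 0 then 1 of the first instruction before touching the second; Vertical-First carries element 0 through both instructions first.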
# Scalar OpenPOWER Audio and Video Opcodes
-the fundamental principle of SV is a hardware for-loop. therefore the first (and in nearly 100% of cases only) place to put Vector operations is first and foremost in the *scalar* ISA. However only by analysing those scalar opcodes *in* a SV Vectorisation context does it become clear why they are needed and how they may be designed.
+The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly 100% of cases only) place to put Vector operations is first and foremost in the *scalar* ISA. However only by analysing those scalar opcodes *in* an SV Vectorization context does it become clear why they are needed and how they may be designed.
This page therefore has accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> for evolution of suitable opcodes.
The fundamental principle for these instructions is:
* identify the scalar primitive
-* assume that longer runs of scalars will have Simple-V vectorisatin applied
+* assume that longer runs of scalars will have Simple-V vectorization applied
* assume that "swizzle" may be applied at the (vec2 - SUBVL=2) Vector level,
(even if that involves a mv.swizzle which may be macro-op fused)
in order to perform the necessary HI/LO selection normally hard-coded
their own right without SVP64. Thus the operations here are proposed
first as Scalar Extensions to the Power ISA.
-A secondary focus is that if Vectorised, implementors may choose
+A secondary focus is that if Vectorized, implementors may choose
to deploy macro-op fusion targeting back-end 256-bit or greater
Dynamic SIMD ALUs for maximum performance and effectiveness.
# Analysis
Covered in [[biginteger/analysis]] the summary is that standard `adde`
-is sufficient for SVP64 Vectorisation of big-integer addition (and `subfe`
+is sufficient for SVP64 Vectorization of big-integer addition (and `subfe`
for subtraction) but that big-integer shift, multiply and divide require an
extra 3-in 2-out instruction, similar to Intel's
[shld](https://www.felixcloutier.com/x86/shld)
Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
-and their Vectorisation would reduce those (for small N) to
+and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
-making it easy to demonstrate how they would be Vectorised.
+making it easy to demonstrate how they would be Vectorized.
The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
# Vector Add and Subtract
Surprisingly, no new additional instructions are required to perform
-a straightforward big-integer add or subtract. Vectorised `adde`
+a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce arbitrary-length
big-integer add due to the rules set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
of how SVP64 works!
Thus, due to sequential execution of `adde` both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
-`sv.adde` is in effect an alias for Big-Integer Vectorised add. As such,
+`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such,
implementors are entirely at liberty to recognise Horizontal-First Vector
adds and send the vector of registers to a much larger and wider back-end
ALU, and short-cut the intermediate storage of XER.CA on an element
bnz loop # do more digits
This is not that different from a Scalar Big-Int add, it is
-just that like all Cray-style Vectorisation, a variable number
+just that like all Cray-style Vectorization, a variable number
of elements are covered by one instruction. Of interest
to people unfamiliar with Cray-style Vectors: if VL is not
permitted to exceed 1 (because MAXVL is set to 1) then the above
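As an illustration of the carry chaining (a Python sketch, not normative pseudocode; 64-bit digits assumed), strict Program Order execution of `adde` with a chained XER.CA gives big-integer add for free:

```python
MASK64 = (1 << 64) - 1

def sv_adde(ra, rb, vl, ca=0):
    # sequential element loop: each adde consumes the carry (CA)
    # produced by the previous element, exactly as in Program Order
    rt = []
    for i in range(vl):
        s = ra[i] + rb[i] + ca
        rt.append(s & MASK64)
        ca = s >> 64
    return rt, ca
```

With VL=2, adding [2^64-1, 0] to [1, 0] ripples the carry from the first digit into the second.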
Keeping the shift amount within the range of the element (64 bit)
a Vector bit-shift may be synthesised from a pair of shift operations
and an OR, all of which are standard Scalar Power ISA instructions
-that when Vectorised are exactly what is needed.
+that when Vectorized are exactly what is needed.
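A Python sketch of that synthesis (hedged: little-endian digit order and 0 < s < 64 are assumed here):

```python
MASK64 = (1 << 64) - 1

def big_rshift(v, s):
    # per element: one shift-right, one shift-left of the
    # neighbouring digit, and an OR -- three standard Scalar ops
    n = len(v)
    r = [((v[i] >> s) | (v[i + 1] << (64 - s))) & MASK64
         for i in range(n - 1)]
    r.append(v[n - 1] >> s)
    return r
```

Each output digit depends only on a pair of adjacent input digits, which is what makes the per-element Vectorization straightforward.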
```
void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
RT2, RC2 = RA2 * RB2 + RC1
Following up to add each partially-computed row to what will become
-the final result is achieved with a Vectorised big-int
+the final result is achieved with a Vectorized big-int
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:
bool need_fixup = !ca; // for phase 3 correction
```
-In essence then the primary focus of Vectorised Big-Int divide is in
+In essence then the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.
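To see why, a row of Knuth's Algorithm M reduces to one scalar digit multiplied against a Vector with a chained 64-bit carry, which is what a 3-in 2-out multiply-and-add provides when element-looped. A hedged Python sketch (the function name is illustrative):

```python
MASK64 = (1 << 64) - 1

def mul_row(q, v, carry=0):
    # RT, RC = RA * RB + RC chained across elements:
    # q is the scalar digit, v the vector of 64-bit digits
    out = []
    for d in v:
        prod = q * d + carry
        out.append(prod & MASK64)
        carry = prod >> 64
    return out, carry
```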
Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
-nothing more than a Vectorised big-integer add (one instruction).
+nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.
**128/64-bit divisor**
The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
-advantage of the efficiency of 64-bit scalar multiply when Vectorised
+advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
this extension amalgamates bitmanipulation primitives from many sources,
including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
Also included are DSP/Multimedia operations suitable for Audio/Video.
-Vectorisation and SIMD are removed: these are straight scalar (element)
-operations making them suitable for embedded applications. Vectorisation
+Vectorization and SIMD are removed: these are straight scalar (element)
+operations making them suitable for embedded applications. Vectorization
Context is provided by [[openpower/sv]].
When combined with SV, scalar variants of bitmanip operations found in
for i in range(64):
RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
-When Vectorised with SVP64, as usual both source and destination may be
+When Vectorized with SVP64, as usual both source and destination may be
Vector or Scalar.
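The 2-in LUT can be modelled directly (a sketch, not the spec's pseudocode; the 4-bit truth table read from the CR Field is represented here as a plain integer):

```python
def lut2(table4, b, a):
    # look up the bit-pair (b, a) in a 4-bit truth table
    return (table4 >> ((b << 1) | a)) & 1

def ternlut2(table4, rb, ra, width=64):
    # RT[i] = lut2(table, RB[i], RA[i]) for every bit i
    rt = 0
    for i in range(width):
        rt |= lut2(table4, (rb >> i) & 1, (ra >> i) & 1) << i
    return rt
```

For example, `table4=0b1000` selects only the (b,a)=(1,1) entry, giving bitwise AND, while `0b1110` gives OR.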
*Programmer's note: a dynamic ternary lookup may be synthesised from
a,b = CRs[BF][i], CRs[BFA][i]
if msk[i] CRs[BF][i] = lut2(CRs[BFB], a, b)
-When SVP64 Vectorised any of the 4 operands may be Scalar or
+When SVP64 Vectorized any of the 4 operands may be Scalar or
Vector, including `BFB` meaning that multiple different dynamic
lookups may be performed with a single instruction. Note that
this instruction is deliberately an overwrite in order to reduce
considered completely separate and distinct from standard scalar
OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way
impacted, altered, changed or modified in any way, shape or form by the
-SVP64 Vectorised Variants**.
+SVP64 Vectorized Variants**.
It is also extremely important to note that Branches are the sole
pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches
Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
-analysis of Vectorised Condition Register Fields, in order to reduce
+analysis of Vectorized Condition Register Fields, in order to reduce
the Vector of CR Fields down to one single yes or no decision that a
Scalar-only v3.0B Branch-Conditional could cope with. Such instructions
would be unavoidable, required, and costly by comparison to a single
Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there are
-opportunities to create extremely flexible and compact Vectorised Branch
+opportunities to create extremely flexible and compact Vectorized Branch
behaviour. In addition, the side-effects (updating of CTR, truncation
of VL, described below) make it a useful instruction even if the branch
points to the next instruction (no actual branch).
a Great Big AND of all condition tests. Exit occurs
on the first **failed** test.
-Early-exit is enacted such that the Vectorised Branch does not
+Early-exit is enacted such that the Vectorized Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.
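The effect can be sketched as follows (illustrative Python, assuming each entry of `tests` is one element's Condition test):

```python
def branch_all_early_exit(tests):
    # Great Big AND with early exit: stop reading CR Fields at the
    # first failed test; return (taken, number of CR reads performed)
    reads = 0
    for t in tests:
        reads += 1
        if not t:
            return False, reads
    return True, reads
```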
prudent. This introduces a new immediate field, `SNZ`, which works in
conjunction with `sz`.
-Vectorised Branches can be used in either SVP64 Horizontal-First or
+Vectorized Branches can be used in either SVP64 Horizontal-First or
Vertical-First Mode. Essentially, at an element level, the behaviour
is identical in both Modes, although the `ALL` bit is meaningless in
Vertical-First Mode.
-It is also important to bear in mind that, fundamentally, Vectorised
+It is also important to bear in mind that, fundamentally, Vectorized
Branch-Conditional is still extremely close to the Scalar v3.0B
Branch-Conditional instructions, and that the same v3.0B Scalar
Branch-Conditional instructions are still *completely separate and
to occur because there will be no *successful* Condition Tests to make
it happen.
-## Vectorised CR Field numbering, and Scalar behaviour
+## Vectorized CR Field numbering, and Scalar behaviour
It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction may be
to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is
not used, CTR may still be decremented by the total number of nonmasked
elements, acting in effect as either a popcount or cntlz depending
-on which mode bits are set. In short, Vectorised Branch becomes an
+on which mode bits are set. In short, Vectorized Branch becomes an
extremely powerful tool.
**Micro-Architectural Implementation Note**: *when implemented on top
is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1).
Inline comments highlight the fact that the Scalar Branch behaviour and
-pseudocode is still clearly visible and embedded within the Vectorised
+pseudocode is still clearly visible and embedded within the Vectorized
variant:
```
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
- if BRc = 1 then # CR0 vectorised
+ if BRc = 1 then # CR0 vectorized
CR{SVCRf+srcstep} = CRbits
testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
[^3]: A 2-Dimensional Scalable Vector ISA **specifically designed for the Power ISA** with both Horizontal-First and Vertical-First Modes. See [[sv/vector_isa_comparison]]
[^4]: on specific operations. See [[opcode_regs_deduped]] for full list. Key: 2P - Twin Predication, 1P - Single-Predicate
[^5]: SVP64 provides a Vector concept on top of the **Scalar** GPR, FPR and CR Fields, extended to 128 entries.
-[^6]: SVP64 Vectorises Scalar ops. It is up to the **implementor** to choose (**optionally**) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
-[^7]: big-integer add is just `sv.adde`. For optimal performance Bigint Mul and divide first require addition of two scalar operations (in turn, naturally Vectorised by SVP64). See [[sv/biginteger/analysis]]
+[^6]: SVP64 Vectorizes Scalar ops. It is up to the **implementor** to choose (**optionally**) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
+[^7]: big-integer add is just `sv.adde`. For optimal performance Bigint Mul and divide first require addition of two scalar operations (in turn, naturally Vectorized by SVP64). See [[sv/biginteger/analysis]]
[^8]: LD/ST Fault-First: see [[sv/svp64/appendix]] and [ARM SVE Fault-First](https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf)
[^9]: Data-dependent Fail-First: Based on LD/ST Fail-first, extended to data. Truncates VL based on failing Rc=1 test. Similar to Z80 CPIR. See [[sv/svp64/appendix]]
[^10]: Predicate-result effectively turns any standard op into a type of "cmp". See [[sv/svp64/appendix]]
which are power-2 based on Silicon-partner SIMD width. Non-power-2 not supported but [zero-input masking](https://www.realworldtech.com/forum/?threadid=202688&curpostid=207774) is.
[^x4]: [Advanced matrix Extensions](https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions) supports BF16 and INT8 only. Separate regfile, power-of-two "tiles". Not general-purpose at all.
[^b1]: Although registers may be 128-bit in NEON, SVE2, and AVX, unlike VSX there are very few (or no) actual arithmetic 128-bit operations. Only RVV and SVP64 have the possibility of 128-bit ops
-[^m1]: Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a **Hardware-level auto-vectorisation** LOOP built-in as an extension named VVM. Classified as "Vertical-First".
+[^m1]: Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a **Hardware-level auto-vectorization** LOOP built-in as an extension named VVM. Classified as "Vertical-First".
[^m2]: MyISA 66000 has a CARRY register up to 64-bit. Repeated application of FMA (esp. within Auto-Vectored LOOPS) automatically and inherently creates big-int operations with zero effort.
[^nc]: "Silicon-Partner" Scaling achieved through allowing same instruction to act on different regfile size and bitwidth. This catastrophically results in binary non-interoperability.
Firstly, we analyse the xchacha20 algorithm, showing what operations
are performed and in what order. Secondly, two innovative features
of SVP64 are described which are crucial to understanding of Simple-V
-Vectorisation: Vertical-First Mode and Indexed REMAP. Then we show
+Vectorization: Vertical-First Mode and Indexed REMAP. Then we show
how Index REMAP eliminates the need entirely for inline-loop-unrolling,
but note that in this particular algorithm REMAP is only useful for
us in Vertical-First Mode.
\newpage{}
-# Vectorised versions involving GPRs
+# Vectorized versions involving GPRs
The name "weird" refers to a minor violation of SV rules when it comes
-to deriving the Vectorised versions of these instructions.
+to deriving the Vectorized versions of these instructions.
Normally the progression of the SV for-loop would move on to the
next register. Instead however in the scalar case these instructions
interesting conceptual challenges for SVP64, which was designed
primarily for vectors of arithmetic and logical operations. However
if predicates may be bits of CR Fields it makes sense to extend
-Simple-V to cover CR Operations, especially given that Vectorised Rc=1
-may be processed by Vectorised CR Operations that usefully in turn
+Simple-V to cover CR Operations, especially given that Vectorized Rc=1
+may be processed by Vectorized CR Operations that usefully in turn
may become Predicate Masks to yet more Vector operations, like so:
```
operations are firmly out of scope for this section, being covered fully
by [[sv/normal]].
-* Examples of Vectoriseable Defined Words to which this section does
+* Examples of Vectorizable Defined Words to which this section does
apply are
- `mfcr` and `cmpi` (3 bit operands) and
- `crnor` and `crand` (5 bit operands).
decision. However with CR-based operations that CR Field result to be
tested is provided *by the operation itself*.
-Data-dependent SVP64 Vectorised Operations involving the creation
+Data-dependent SVP64 Vectorized Operations involving the creation
or modification of a CR can require an extra two bits, which are not
available in the compact space of the SVP64 RM `MODE` Field. With the
concept of element width overrides being meaningless for CR Fields it
is a much easier proposition to consider.
The prohibitions utilise the CR Field numbers implicitly to
-split out Vectorised CR operations to be considered completely
+split out Vectorized CR operations to be considered completely
separate and distinct from Scalar CR operations *even though
they both use the same binary encoding*. This does in turn
mean that at the Decode Phase it becomes necessary to examine
not only the operation (`sv.crand`, `sv.cmp`) but also
the CR Field numbers as well as whether, in the EXTRA2/3 Mode
-bits, the operands are Vectorised.
+bits, the operands are Vectorized.
A future version of Power ISA, where SVP64Single is proposed,
would in fact introduce "Conditional Execution", including
* Condition Registers. see note below
* FPR (if present)
-When Rc=1 is encountered in an SVP64 Context the destination is different (TODO) i.e. not CR0 or CR1. Implicit Rc=1 Condition Registers are still Vectorised but do **not** have EXTRA2/3 spec adjustments. The only part if the EXTRA2/3 spec that is observed and respected is whether the CR is Vectorised (isvec).
+When Rc=1 is encountered in an SVP64 Context the destination is different (TODO) i.e. not CR0 or CR1. Implicit Rc=1 Condition Registers are still Vectorized but do **not** have EXTRA2/3 spec adjustments. The only part of the EXTRA2/3 spec that is observed and respected is whether the CR is Vectorized (isvec).
## Increasing register file sizes
* <https://libre-soc.org/openpower/sv/propagation/>
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/svp64.py;hb=HEAD>
-## Vectorised Branches
+## Vectorized Branches
TODO [[sv/branches]]
-## Vectorised LD/ST
+## Vectorized LD/ST
TODO [[sv/ldst]]
# SVP64 polymorphic elwidth overrides
-SimpleV, the Draft Cray-style Vectorisation for OpenPOWER, may
+SimpleV, the Draft Cray-style Vectorization for OpenPOWER, may
independently override both or either of the source or destination
register bitwidth in the base operation used to create the Vector
operation. In the case of IEEE754 FP operands this gives an
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.
-Vectorised Load and Store also presents an extra dimension (literally)
+Vectorized Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar (and
even a SIMD) ISA simply never encounters: not even the complex Addressing
Modes of the 68000 or S/360 resemble Vector Load/Store.
## Modes overview
-Vectorisation of Load and Store requires creation, from scalar operations,
+Vectorization of Load and Store requires creation, from scalar operations,
a number of different modes:
* **fixed aka "unit" stride** - contiguous sequence with no gaps
svctx.ldstmode = elementstride
```
-A summary of the effect of Vectorisation of src or dest:
+A summary of the effect of Vectorization of src or dest:
```
imm(RA) RT.v RA.v no stride allowed
imm(RA) RT.s RA.v no stride allowed
imm(RA) RT.v RA.s stride-select allowed
- imm(RA) RT.s RA.s not vectorised
+ imm(RA) RT.s RA.s not vectorized
RA,RB RT.v {RA|RB}.v Standard Indexed
RA,RB RT.s {RA|RB}.v Indexed but single LD (no VSPLAT)
RA,RB RT.v {RA&RB}.s VSPLAT possible. stride selectable
- RA,RB RT.s {RA&RB}.s not vectorised (scalar identity)
+ RA,RB RT.s {RA&RB}.s not vectorized (scalar identity)
```
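The addressing behind those rows can be sketched as follows (hedged: the function names and the explicit byte-width parameter are illustrative, not the spec's pseudocode; the element-strided form follows the `ofst*elidx` notation used in this document):

```python
def ea_unit(base, offs, elidx, elwidth_bytes):
    # fixed aka "unit" stride: contiguous elements, no gaps
    return base + offs + elidx * elwidth_bytes

def ea_element(base, offs, elidx):
    # element stride: the immediate offset is scaled per element
    return base + offs * elidx

def ea_indexed(base, idx_vec, elidx):
    # Vector Indexed: EA = (RA) + (RB[elidx])
    return base + idx_vec[elidx]
```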
Signed Effective Address computation is only relevant for Vector Indexed
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
-## Vectorisation of Scalar Power ISA v3.0B
+## Vectorization of Scalar Power ISA v3.0B
Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
and [[isa/fixedstore]] pseudocode to be of the form:
named LD/ST Indexed**.
Whilst it may be costly in terms of register reads to allow REMAP Indexed
-Mode to be applied to any Vectorised LD/ST Indexed operation such as
+Mode to be applied to any Vectorized LD/ST Indexed operation such as
`sv.ld *RT,RA,*RB`, or even misleadingly labelled as redundant, firstly
the strict application of the RISC Paradigm that Simple-V follows makes
it awkward to consider *preventing* the application of Indexed REMAP to
to be cancelled. Additionally an implementor may choose to truncate VL
for any arbitrary reason *except for the very first*.
-ffirst LD/ST to multiple pages via a Vectorised Index base is
+ffirst LD/ST to multiple pages via a Vectorized Index base is
considered a security risk due to the abuse of probing multiple
pages in rapid succession and getting speculative feedback on which
pages would fail. Therefore Vector Indexed LD/ST is prohibited
in pairs.
By contrast, in Vertical-First Mode it is in fact possible to issue
-the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is
+the pairs, and consequently allowing Vectorized Data-Dependent Fail-First is
useful.
Programmer's note: Care should be taken when VL is truncated in
Although Rc=1 on LD/ST is a rare occurrence at present, future versions
of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
-with the SVP64 Vectorisation Prefixing being itself a RISC-paradigm that
+with the SVP64 Vectorization Prefixing being itself a RISC-paradigm that
is itself fully-independent of the Scalar Suffix Defined Words, prohibiting
the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
operations is not strategically sound.
REMAP easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
-require additional width-extension, sign-extension and min/max Vectorised
+require additional width-extension, sign-extension and min/max Vectorized
instructions as post-processing stages.
Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
this section covers assembly notation for the immediate and indexed LD/ST.
the summary is that in immediate mode for LD it is not clear that if the
-destination register is Vectorised `RT.v` but the source `imm(RA)` is scalar
+destination register is Vectorized `RT.v` but the source `imm(RA)` is scalar
the memory being read is *still a vector load*, known as "unit or element strides".
This anomaly is made clear with the following notation:
sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
mem@r#2 +0 ... +offs ... +offs*2
destreg r# r#+1 r#+2
- imm(RA) RT.s RA.s not vectorised
+ imm(RA) RT.s RA.s not vectorized
sv.ld r#, ofst(r#2)
indexed mode:
RA,RB RT.s RA.v RB.v
RA,RB RT.s RA.s RB.v
RA,RB RT.s RA.v RB.s
- RA,RB RT.s RA.s RB.s not vectorised
+ RA,RB RT.s RA.s RB.s not vectorized
* Thirdly, just because of the PO9-Prefix it is prohibited to
put an entirely different instruction into the Suffix position.
If `{PO14}` as a 32-bit instruction is defined as "addi", then
- it is **required** that `{PO9}-{PO14}` **be** a Vectorised "addi",
- **not** a Vectorised multiply.
+ it is **required** that `{PO9}-{PO14}` **be** a Vectorized "addi",
+ **not** a Vectorized multiply.
* Fourthly, where PO1-Prefixing of operand fields (often resulting
in "split field" redefinitions such as `si0||si1`) is an arbitrary
manually-hand-crafted procedure,
[and anticipate someone in the future to
define a 128-bit variant to match RISC-V RV128].
-bear in mind that SVP64 *has* to have Scalar Operations first, because by design and by definition *only Scalar operations may be Vectorised*. SVP64 *DOES NOT* add *ANY* Vector Instructions. SVP64 is a generic loop around *Scalar* operations and it us up to the Architecture to take advantage of that, at the back-end.
+bear in mind that SVP64 *has* to have Scalar Operations first, because by design and by definition *only Scalar operations may be Vectorized*. SVP64 *DOES NOT* add *ANY* Vector Instructions. SVP64 is a generic loop around *Scalar* operations and it is up to the Architecture to take advantage of that, at the back-end.
without SVP64 Sub-Looping it would on the face of it seem absolutely mental and a total waste of time and resources to define an 8 or 16 bit General-Purpose ISA in the year 2022 until you recall that:
(in particular, anyone who remembers how hard programming the Cell Processor turned out to be will be having that familiar "lightbulb moment" right about now)
-more than that: what if those 8 and 16 bit cores had a Supercomputing-class Vectorisation option in the ISA, and there were implementations out there with back-end ALUs that could perform 64 or 128 8 or 16 bit operations per clock cycle?
+more than that: what if those 8 and 16 bit cores had a Supercomputing-class Vectorization option in the ISA, and there were implementations out there with back-end ALUs that could perform 64 or 128 8 or 16 bit operations per clock cycle?
Quantity several thousand per processor, all of them capable of adapting to run massive AI number crunching or (at lower IPC than "normal" processors) general-purpose compute?
swizzle-copied to
a contiguous array of vec2. A contiguous array of vec2 sources
may have multiple of each vec2 elements (XY) copied to a contiguous
-vec4 array (YYXX or XYXX). For this reason, *when Vectorised*
+vec4 array (YYXX or XYXX). For this reason, *when Vectorized*
Swizzle Moves support independent subvector lengths for both
source and destination.
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on-par with `lq`, `stq` and Indexed
Load-with-update instructions.
-When part of the Scalar Power ISA (not SVP64 Vectorised)
+When part of the Scalar Power ISA (not SVP64 Vectorized)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pairs of 64-bit registers:
as in `lq` and `stq`. Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.
-Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
+Note that unlike the Vectorized variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.
-**SVP64 Vectorised**
+**SVP64 Vectorized**
-Vectorised Swizzle may be considered to
+Vectorized Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
length, and consequently the destination subvector length is
encoded into the Swizzle.
-When Vectorised, given the use-case is for a High-performance GPU,
+When Vectorized, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
other technique will
be deployed in hardware to issue multiple Scalar MV operations and
Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
-An overlap between all and any Vectorised
+An overlap between all and any Vectorized
sources and destination Elements for the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.
violate that expectation. The exceptions to this, explained
later, are when Pack/Unpack is enabled.
-**Effect of Saturation on Vectorised Swizzle**
+**Effect of Saturation on Vectorized Swizzle**
A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore,
# Pack/Unpack Mode:
-It is possible to apply Pack and Unpack to Vectorised
+It is possible to apply Pack and Unpack to Vectorized
swizzle moves. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
also exist.
In SVP64, Pack and Unpack are achieved *in the abstract* for application on *all*
-Vectoriseable instructions.
+Vectorizeable instructions.
* See <https://bugs.libre-soc.org/show_bug.cgi?id=230#c30>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004911.html>
[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately:
the following Modes apply to Arithmetic and Logical SVP64 operations:
-* **simple** mode is straight vectorisation. No augmentations: the
+* **simple** mode is straight vectorization. No augmentations: the
vector comprises an array of independently created results.
* **ffirst** or data-dependent fail-on-first: see separate section.
The vector may be truncated depending on certain criteria.
The CR overflow bit is therefore simply set to zero if saturation did
not occur, and to one if it did. This behaviour (ignoring XER.SO) is
actually optional in the SFFS Compliancy Subset: for SVP64 it is made
-mandatory *but only on Vectorised instructions*.
+mandatory *but only on Vectorized instructions*.
Note also that saturate on operations that set OE=1 must raise an Illegal
Instruction due to the conflicting use of the CR.so bit for storing
-if saturation occurred. Vectorised Integer Operations that produce a
+if saturation occurred. Vectorized Integer Operations that produce a
Carry-Out (CA, CA32): these two bits will be `UNDEFINED` if saturation
is also requested.
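The interaction of saturation with the overflow-reporting bit can be modelled as follows (a behavioural sketch only; the `saturating_add` helper is hypothetical, not the ISA's formal pseudocode):

```python
def saturating_add(a, b, bits=8):
    """Signed saturating add: clamp to the representable range and report
    whether saturation occurred (the analogue of the CR overflow bit
    being set to one on saturation, zero otherwise)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    result = a + b
    saturated = result < lo or result > hi
    return max(lo, min(hi, result)), saturated

print(saturating_add(100, 100))  # (127, True): clamped, overflow reported
print(saturating_add(10, 20))    # (30, False): no saturation
```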
In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
-tests this may be insufficient. If that is the case, a vectorised crop
+tests this may be insufficient. If that is the case, a vectorized crop
such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
and ffirst applied to the crop instead of to the arithmetic vector. Note
that crops are covered by the [[sv/cr_ops]] Mode format.
* CR-based data-dependent ffirst on the other hand **can** set VL equal
to zero. When VL is set
zero due to the first element failing the CR bit-test, all subsequent
- vectorised operations are effectively `nops` which is
+ vectorized operations are effectively `nops` which is
*precisely the desired and intended behaviour*.
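A behavioural sketch of CR-based data-dependent fail-first, with a generic `test` callback standing in for the selected CR bit-test (illustrative only, not the specification's pseudocode):

```python
def ffirst(values, test, vl):
    """Data-dependent fail-first: process elements until the CR bit-test
    fails, then truncate VL at that point. VL may legitimately become
    zero, making all subsequent vectorized operations effective nops."""
    results = []
    for i in range(vl):
        if not test(values[i]):
            return results, i  # new (possibly zero) VL
        results.append(values[i])
    return results, vl

# truncate at the first non-positive element
print(ffirst([3, 5, -1, 7], lambda x: x > 0, 4))  # ([3, 5], 2)
print(ffirst([-9, 1, 2], lambda x: x > 0, 3))     # ([], 0): VL set to zero
```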
The second crucial aspect, compared to LDST Ffirst:
Links:
* This page: [http://libre-soc.org/openpower/sv/overview](http://libre-soc.org/openpower/sv/overview)
-* [FOSDEM2021 SimpleV for Power ISA](https://fosdem.org/2021/schedule/event/the_libresoc_project_simple_v_vectorisation/)
+* [FOSDEM2021 SimpleV for Power ISA](https://fosdem.org/2021/schedule/event/the_libresoc_project_simple_v_vectorization/)
* FOSDEM2021 presentation <https://www.youtube.com/watch?v=FS6tbfyb2VA>
* [[discussion]] and
[bugreport](https://bugs.libre-soc.org/show_bug.cgi?id=556)
The fundamentals are (just like x86 "REP"):
* The Program Counter (PC) gains a "Sub Counter" context (Sub-PC)
-* Vectorisation pauses the PC and runs a Sub-PC loop from 0 to VL-1
+* Vectorization pauses the PC and runs a Sub-PC loop from 0 to VL-1
(where VL is Vector Length)
* The [[Program Order]] of "Sub-PC" instructions must be preserved,
just as is expected of instructions ordered by the PC.
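The REP-like fundamentals above can be sketched as a conceptual model (a simplification, not the specification's formal pseudocode; register-element addressing here assumes a flat register file):

```python
def execute_sv_instruction(op, regs, RT, RA, RB, VL):
    """Horizontal-First model: the PC pauses while a Sub-PC loop runs
    from 0 to VL-1, applying the scalar operation to successive register
    elements in strict Program Order."""
    for subpc in range(VL):  # Sub-PC loop; the PC advances only afterwards
        regs[RT + subpc] = op(regs[RA + subpc], regs[RB + subpc])

regs = list(range(32))
execute_sv_instruction(lambda a, b: a + b, regs, RT=0, RA=8, RB=16, VL=4)
print(regs[0:4])  # [24, 26, 28, 30]
```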
that loop size to one.
The important insight from the above is that, strictly speaking, Simple-V
-is not really a Vectorisation scheme at all: it is more of a hardware
+is not really a Vectorization scheme at all: it is more of a hardware
ISA "Compression scheme", allowing as it does for what would normally
require multiple sequential instructions to be replaced with just one.
This is where the rule that Program Order must be preserved in Sub-PC
execution derives from. However in other ways, which will emerge below,
the "tagging" concept presents an opportunity to include features
definitely not common outside of Vector ISAs, and in that regard it's
-definitely a class of Vectorisation.
+definitely a class of Vectorization.
## Register "tagging"
The reason for using so few bits is because there are up to *four*
registers to mark in this way (`fma`, `isel`) which starts to be of
concern when there are only 24 available bits to specify the entire SV
-Vectorisation Context. In fact, for a small subset of instructions it
+Vectorization Context. In fact, for a small subset of instructions it
is just not possible to tag every single register. Under these rare
circumstances a tag has to be shared between two registers.
an associated post-result "test", placing this test into an implicit
Condition Register. The original researchers who created the POWER ISA
chose CR0 for Integer, and CR1 for Floating Point. These *also become
-Vectorised* - implicitly - if the associated destination register is
-also Vectorised. This allows for some very interesting savings on
+Vectorized* - implicitly - if the associated destination register is
+also Vectorized. This allows for some very interesting savings on
instruction count due to the very same CR Vectors being predication masks.
# Adding single predication
is VGATHER (and VSCATTER): moving registers by specifying a vector of
register indices (`regs[rd] = regs[regs[rs]]` in a loop). This one is
tricky because it typically does not exist in standard scalar ISAs.
-If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
+If it did it would be called [[sv/mv.x]]. Once Vectorized, it's a
VGATHER/VSCATTER.
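Once Vectorized, the `regs[rd] = regs[regs[rs]]` loop behaves as a gather; a minimal behavioural sketch (the `vgather` helper is illustrative only):

```python
def vgather(regs, rd, rs, VL):
    """Vectorized mv.x model: each destination element is fetched from
    the register whose number is held in the corresponding source
    element (register-indirect addressing, i.e. VGATHER)."""
    for i in range(VL):
        regs[rd + i] = regs[regs[rs + i]]

regs = [0] * 16
regs[8:12] = [3, 1, 2, 0]     # vector of register indices
regs[0:4] = [10, 20, 30, 40]  # data registers
vgather(regs, rd=4, rs=8, VL=4)
print(regs[4:8])  # [40, 20, 30, 10]
```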
# Exception-based Fail-on-first
-One of the major issues with Vectorised LD/ST operations is when a
+One of the major issues with Vectorized LD/ST operations is when a
batch of LDs cross a page-fault boundary. With considerable resources
being taken up with in-flight data, a large Vector LD being cancelled
or unable to roll back is either a detriment to performance or can cause
This is a relatively new addition to SVP64 under development as of
July 2021. Where Horizontal-First is the standard Cray-style for-loop,
Vertical-First typically executes just the **one** scalar element
-in each Vectorised operation. That element is selected by srcstep
+in each Vectorized operation. That element is selected by srcstep
and dststep *neither of which are changed as a side-effect of execution*.
Illustrating this in pseudocode, with a branch/loop.
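A possible illustration in Python (an informal reconstruction, with `svstep` modelled as explicit increments of srcstep/dststep and a plain branch closing the loop):

```python
# Vertical-First model: each sv.-prefixed instruction operates on the
# SINGLE element selected by srcstep/dststep, which are NOT changed as a
# side-effect of execution; svstep advances them explicitly.
srcstep = dststep = 0
VL = 4
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
result = [0] * VL

while True:
    # one "vectorized" add executes on just one element (Vertical-First)
    result[dststep] = a[srcstep] + b[srcstep]
    # svstep: explicitly advance the element counters
    srcstep += 1
    dststep += 1
    if srcstep == VL:  # loop complete; otherwise branch back
        break

print(result)  # [11, 22, 33, 44]
```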
To create loops, a new instruction `svstep` must be called,
by embedding Scalar instructions - unmodified - into a Vector "context"
using "Prefixing". With careful thought, this technique reaches 90%
par with good Vector ISAs, increasing to 95% with the addition of a
-mere handful of additional context-vectoriseable scalar instructions
+mere handful of additional context-vectorizeable scalar instructions
([[sv/mv.x]] amongst them).
What is particularly cool about the SV concept is that custom extensions
and research need not be concerned about inventing new Vector instructions
and how to get them to interact with the Scalar ISA: they are effectively
one and the same. Any new instruction added at the Scalar level is
-inherently and automatically Vectorised, following some simple rules.
+inherently and automatically Vectorized, following some simple rules.
**Definition of Horizontal-First:**
-Normal Cray-style Vectorisation, designated Horizontal-First, performs
+Normal Cray-style Vectorization, designated Horizontal-First, performs
element-level operations (often in parallel) before moving in the usual
fashion to the next instruction. The term "Horizontal-First"
stems from naturally visually listing program instructions vertically,
revoked under any circumstances. A useful way to think of this is that
the Prefix Encoding is, like the 8086 REP instruction, an independent
32-bit Defined Word. The only semi-exceptions are the Post-Increment
-Mode of LD/ST-Update and Vectorised Branch-Conditional.*
+Mode of LD/ST-Update and Vectorized Branch-Conditional.*
Note a particular consequence of the application of the above paragraph:
due to the fact that the Prefix Encodings are independent, **by
Encoding spaces and their potential are illustrated:
-| Encoding |Available bits|Scalar|Vectoriseable | SVP64Single |PO1-Prefixable |
+| Encoding |Available bits|Scalar|Vectorizeable | SVP64Single |PO1-Prefixable |
|----------|--------------|------|--------------|--------------|---------------|
|EXT000-063| 32 | yes | yes |yes |yes |
|EXT100-163| 64 | yes | no |no |not twice |
SVP64Single.
* Considerable care is needed both on Architectural Resource Allocation
as well as instruction design itself. All new Scalar instructions automatically
- and inherently must be designed taking their Vectoriseable potential into
+ and inherently must be designed taking their Vectorizeable potential into
consideration *including VSX* in future.
* Once an instruction is allocated
- in an Unvectorizable area it can never be Vectorised without providing
+ in an Unvectorizable area it can never be Vectorized without providing
an entirely new Encoding.
[[!tag standards]]
XER.SO (sticky overflow) is known to cause massive slowdown in pretty much every microarchitecture and it definitely compromises the performance of out-of-order systems. The reason is that it introduces a READ-MODIFY-WRITE cycle between XER.SO and CR0 (which contains a copy of the SO field after inclusion of the overflow). The result and source registers branch off as RaW and WaR hazards from this RMW chain.
-This is even before predication or vectorisation were to be added on top, i.e. these are existing weaknesses in OpenPOWER as a scalar ISA.
+This is even before predication or vectorization were to be added on top, i.e. these are existing weaknesses in OpenPOWER as a scalar ISA.
-As well-known weaknesses that compromise performance, very little use of OE=1 is actually made, outside of unit tests and Conformance Tests. Consequently it makes very little sense to continue to propagate OE=1 in the Vectorisation context of SV.
+These weaknesses are well known to compromise performance, yet very little use of OE=1 is actually made outside of unit tests and Conformance Tests. Consequently it makes very little sense to continue to propagate OE=1 into the Vectorization context of SV.
### Vector Chaining
In addition, those scalar 64-bit bitmanip operations, although some of them are obscure and unusual in the scalar world, do actually have practical applications outside of a vector context.
-(Hilariously and confusingly those very same scalar bitmanip opcodes may themselves be SV-vectorised however with VL only being up to 64 elements it is not anticipated that SV-bitmanip would be used to generate up to 64 bit predicate masks, when a single 64 bit scalar operation will suffice).
+(Hilariously and confusingly those very same scalar bitmanip opcodes may themselves be SV-vectorized however with VL only being up to 64 elements it is not anticipated that SV-bitmanip would be used to generate up to 64 bit predicate masks, when a single 64 bit scalar operation will suffice).
The summary is that adding a full set special vector opcodes just for manipulating predicate masks and being able to transfer them to other regfiles (a la mfcr) is anomalous, costly, and unnecessary.
type of special virtual register port or datapath that masks out the
required predicate bits closer to the regfile.
-another disadvantage is that the CR regfile needs to be expanded from 8x 4bit CRs to a minimum of 64x or preferably 128x 4-bit CRs. Beyond that they can be transferred using vectorised mfcr and mtcrf into INT regs. this is a huge number of CR regs, each of which will need a DM column in the FU-REGs Matrix. however this cost can be mitigated through regfile cacheing, bringing FU-REGs column numbers back down to "sane".
+Another disadvantage is that the CR regfile needs to be expanded from 8x 4-bit CRs to a minimum of 64x or preferably 128x 4-bit CRs. Beyond that they can be transferred using vectorized mfcr and mtcrf into INT regs. This is a huge number of CR regs, each of which will need a DM column in the FU-REGs Matrix; however this cost can be mitigated through regfile caching, bringing FU-REGs column numbers back down to "sane".
### Predicated SIMD HI32-LO32 FUs
The disadvantages appear on closer analysis:
* Unlike the "full" CR port (which reads 8x CRs CR0-7 in one hit) trying the same trick on the scalar integer regfile, to obtain just 8 predicate bits (each being an LSB of a given 64 bit scalar int), would require a whopping 8x64bit set of reads to the INT regfile instead of a scant 1x32bit read. Resource-wise, then, this idea is expensive.
-* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised-mfcr) are more challenging and costly. Rather than use vectorised mfcr, complex transfers of the LSBs into a single scalar int are required.
+* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorized-mfcr) are more challenging and costly. Rather than use vectorized mfcr, complex transfers of the LSBs into a single scalar int are required.
In a "normal" Vector ISA this would be solved by adding opcodes that perform the kinds of bitmanipulation operations normally needed for predicate masks, as specialist operations *on* those masks. However for SV the rule has been set: "no unnecessary additional Vector Instructions" because it is possible to use existing PowerISA scalar bitmanip opcodes to cover the same job.
The problem is that vectors of LSBs need to be transferred *to* scalar int regs, bitmanip operations carried out, *and then transferred back*, which is exceptionally costly.
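Even the first half of that round-trip is illustrative: gathering predicate LSBs scattered across scalar registers into one mask (a sketch only; `gather_predicate_lsbs` is a hypothetical helper, not an ISA operation):

```python
def gather_predicate_lsbs(regs):
    """Build a predicate mask from the LSB of each 64-bit scalar
    register: the transfer *to* a single scalar int that must precede
    any scalar bitmanip on the mask (and be reversed afterwards)."""
    mask = 0
    for i, r in enumerate(regs):
        mask |= (r & 1) << i
    return mask

# four CMP results scattered as LSBs across four registers
print(bin(gather_predicate_lsbs([1, 0, 1, 1])))  # 0b1101
```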
-On balance this is a less favourable option than vectorising CRs
+On balance this is a less favourable option than vectorizing CRs
## Scalar (single) integer as predicate, with one DM row
-This idea has merit in that to perform predicate bitmanip operations the predicate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away. Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
+This idea has merit in that, to perform predicate bitmanip operations, the predicate is already in scalar INT reg form; consequently standard scalar INT bitmanip operations can be done straight away. Vectorized mfcr can be used to get CMP results or Vectorized Rc=1 CRs into the scalar INT, easily.
This idea has several disadvantages.
The amount of information needed to do so is however quite large: consequently it is only practical to apply indirectly, via Context propagation.
Vectors may be remapped such that Matrix multiply of any arbitrary size
-is performed in one Vectorised `fma` instruction as long as the total
+is performed in one Vectorized `fma` instruction as long as the total
number of elements is less than 64 (maximum for VL).
Additionally, in a fashion known as "Structure Packing" in NEON and RVV, it may be used to perform "zipping" and "unzipping" of
otherwise usual `0..VL-1` hardware for-loop
* `svremap` to set which registers a given reordering is to apply to
(RA, RT etc)
-* `sv.{instruction}` where any Vectorised register marked by `svremap`
+* `sv.{instruction}` where any Vectorized register marked by `svremap`
will have its ordering REMAPPED according to the schedule set
by `svshape`.
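A behavioural model of how a Matrix-multiply REMAP schedule allows a single Vectorized `fma` to compute a full matrix product (the index formulas here are illustrative of the re-ordering concept, not the exact svshape encoding):

```python
def matmul_remap(A, B, X, Y, Z):
    """Model of a REMAP Matrix-multiply schedule: one vectorized fma
    whose element indices for the source and destination registers are
    re-ordered so that a single hardware for-loop of X*Y*Z steps
    computes C = A @ B. Total elements must not exceed 64 (VL maximum).
    A is X*Z (row-major), B is Z*Y, C is X*Y."""
    C = [0.0] * (X * Y)
    for i in range(X * Y * Z):  # the single hardware for-loop
        x, y, z = i // (Y * Z), (i // Z) % Y, i % Z
        # remapped element indices for the one fma operation
        C[x * Y + y] += A[x * Z + z] * B[z * Y + y]
    return C

A = [1, 2, 3, 4]  # 2x2
B = [5, 6, 7, 8]  # 2x2
print(matmul_remap(A, B, 2, 2, 2))  # [19.0, 22.0, 43.0, 50.0]
```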
* <https://bugs.libre-soc.org/show_bug.cgi?id=924>
This proposal is to extend the Power ISA with an Abstract RISC-Paradigm
-Vectorisation Concept that may be orthogonally applied to **all and any**
+Vectorization Concept that may be orthogonally applied to **all and any**
suitable Scalar instructions, present and future, in the Scalar Power ISA.
-The Vectorisation System is called
+The Vectorization System is called
["Simple-V"](https://libre-soc.org/openpower/sv/)
and the Prefix Format is called
["SVP64"](https://libre-soc.org/openpower/sv/).
Audio/Visual DSPs to 3D GPUs and Supercomputing. As it does **not**
add actual Vector Instructions, relying solely and exclusively on the
**Scalar** ISA, it is **Scalar** instructions that need to be added to
-the **Scalar** Power ISA before Simple-V may orthogonally Vectorise them.
+the **Scalar** Power ISA before Simple-V may orthogonally Vectorize them.
The goal of RED Semiconductor Ltd, an OpenPOWER
Stakeholder, is to bring to market mass-volume general-purpose compute
It is also critical to note that Simple-V **does not modify the Scalar
Power ISA**, that **only** Scalar words may be
-Vectorised, and that Vectorised instructions are **not** permitted to be
+Vectorized, and that Vectorized instructions are **not** permitted to be
different from their Scalar words (`addi` must use the same Word encoding
as `sv.addi`, and any new Prefixed instruction added **must** also
be added as Scalar).
-The sole semi-exception is Vectorised
+The sole semi-exception is Vectorized
Branch Conditional, in order to provide the usual Advanced Branching
capability present in every Commercial 3D GPU ISA, but it
-is the *Vectorised* Branch-Conditional that is augmented, not Scalar
+is the *Vectorized* Branch-Conditional that is augmented, not Scalar
Branch.
# Basic principle
**Simple-V SPRs**
-* **SVSTATE** - 64-bit Vectorisation State sufficient for Precise-Interrupt
+* **SVSTATE** - 64-bit Vectorization State sufficient for Precise-Interrupt
Context-switching and no adverse latency, it may be considered to
be a "Sub-PC" and as such absolutely must be treated with the same
respect and priority as MSR and PC.
easily be passed downstream in a fully forward-progressive pipelined fashion
to independent parallel units for further analysis.
-**Vectorised Branch-Conditional**
+**Vectorized Branch-Conditional**
As mentioned in the introduction this is the one sole instruction group
that
its various Mode bits and options can be set such that in the degenerate
case the behaviour becomes identical to Scalar Branch-Conditional.
-The two additional Modes within Vectorised Branch-Conditional, both of
+The two additional Modes within Vectorized Branch-Conditional, both of
which may be combined, are `CTR-Mode` and `VLI-Test` (aka "Data Fail First").
CTR Mode extends the way that CTR may be decremented unconditionally
within Scalar Branch-Conditional, and not only makes it conditional but
and restoring of LR and SVLR may be deferred until the final decision
as to whether to branch. In this way `sv.bclrl` does not corrupt `LR`.
-Vectorised Branch-Conditional due to its side-effects (e.g. reducing CTR
+Vectorized Branch-Conditional due to its side-effects (e.g. reducing CTR
or truncating VL) has practical uses even if the Branch is deliberately
set to the next instruction (CIA+8). For example it may be used to reduce
CTR by the number of bits set in a GPR, if that GPR is given as the predicate
One confusing thing is the unfortunate naming of LD/ST Indexed and
REMAP Indexed: some care is taken in the spec to discern the two.
LD/ST Indexed is Scalar `EA=RA+RB` (where **either** RA or RB
-may be marked as Vectorised), where obviously the order in which
+may be marked as Vectorized), where the
Vector of RA (or RB) is read in the usual linear sequential
fashion. REMAP Indexed affects the
**order** in which the Vector of RA (or RB) is accessed,
through **registers** (or, register *elements* in traditional
Cray-Vector ISAs) in full before moving on to the next *instruction*.
-Mitch Alsup's VVM Extension is a form of hardware-level auto-vectorisation
+Mitch Alsup's VVM Extension is a form of hardware-level auto-vectorization
based around Zero-Overhead Loops. Using a Variable-Length Encoding all
loop-invariant registers are "tagged" such that the Hazard Management
Engine may perform optimally and do less work in automatically identifying
to introduce into compilers, because all looping, as far as programs
is concerned, remains expressed as *Scalar assembler*.[^autovec]
Whilst Mitch Alsup's
-VVM biggest strength is its hardware-level auto-vectorisation
+VVM's biggest strength is its hardware-level auto-vectorization,
it is limited in its ability to call
functions; Simple-V's Vertical-First provides explicit control over the
parallelism ("hphint")[^hphint] and also allows for full state to be stored/restored
Simple-V Vertical-First Looping requires an explicit instruction to
move `SVSTATE` regfile offsets forward: `svstep`. An early version of
-Vectorised
+Vectorized
Branch-Conditional attempted to merge the functionality of `svstep`
into `sv.bc`: it became CISC-like in its complexity and was quickly reverted.
temporary registers to compute results that have a Vector source
or destination or both.
Contrast this with a Standard Horizontal-First Vector ISA where the only
-way to perform Vectorised Complex Arithmetic would be to add Complex Vector
+way to perform Vectorized Complex Arithmetic would be to add Complex Vector
Arithmetic operations, because due to the Horizontal (element-level)
progression there is no way to utilise intermediary temporary (scalar)
variables.[^complex]
be required. The entire 24-bits is **required** for the abstracted
Hardware-Looping Concept **even when these 24-bits are zero**
* Any Scalar 64-bit instruction (regardless of how it is encoded) is unsafe to
- then Vectorise because this creates the situation of Prefixed-Prefixed,
+ then Vectorize because this creates the situation of Prefixed-Prefixed,
resulting in deep complexity in Hardware Decode at a critical juncture, as
well as introducing 96-bit instructions.
-* **All** of these Scalar instructions are candidates for Vectorisation.
+* **All** of these Scalar instructions are candidates for Vectorization.
Thus none of them may be 64-bit-Scalar-only.
**Minor Opcodes to fit candidates above**
The primary point is that once an instruction is defined in Scalar
32-bit form its corresponding space **must** be reserved in the
SVP64 area with the exact same 32-bit form, even if that instruction
-is "Unvectoriseable" (`sc`, `sync`, `rfid` and `mtspr` for example).
+is "Unvectorizeable" (`sc`, `sync`, `rfid` and `mtspr` for example).
Instructions may **not** be added in the Vector space without also
-being added in the Scalar space, and vice-versa, *even if Unvectoriseable*.
+being added in the Scalar space, and vice-versa, *even if Unvectorizeable*.
This is extremely important because the worst possible situation
is if a conflicting Scalar instruction is added by another Stakeholder,
-which then turns out to be Vectoriseable: it would then have to be
+which then turns out to be Vectorizeable: it would then have to be
added to the Vector Space with a *completely different Defined Word*
and things go rapidly downhill in the Decode Phase from there.
Setting a simple inviolate rule helps avoid this scenario but does
need to be borne in mind when discussing potential allocation
-schemes, as well as when new Vectoriseable Opcodes are proposed
+schemes, as well as when new Vectorizeable Opcodes are proposed
for addition by future RFCs: the opcodes **must** be uniformly
added to Scalar **and** Vector spaces, or added in one and reserved
in the other, or
pressure on the EXT000-EXT063 (32-bit) opcode space to such a degree that
it risks jeopardising the Power ISA. These requirements are:
-* all of the scalar operations must be Vectoriseable
-* all of the scalar operations intended for Vectorisation
+* all of the scalar operations must be Vectorizeable
+* all of the scalar operations intended for Vectorization
must be in a 32-bit encoding (not prefixed-prefixed to 96-bit)
* bringing Scalar Power ISA up-to-date from the past 12 years
needs 75% of two Major opcodes all on its own
There exists a potential scheme which meets (exceeds) the above criteria,
-providing plenty of room for both Scalar (and Vectorised) operations,
+providing plenty of room for both Scalar (and Vectorized) operations,
*and* provides SVP64-Single with room to grow. It
is based loosely around Public v3.1 EXT001 Encoding.[^ext001]
If not allocated within the scope of this RFC
then these are requested to be `RESERVED` for a future Simple-V
proposal.
-* **SVP64** - a (well-defined, 2 years) DRAFT Proposal for a Vectorisation
+* **SVP64** - a (well-defined, 2 years) DRAFT Proposal for a Vectorization
Augmentation of suffixes.
For the needs identified by Libre-SOC (75% of 2 POs),
|old bit6=1| `RESERVED2`:{EXT300-363} | `RESERVED4`:SVP64-Single:{EXT000-063} | SVP64:{EXT000-063} |
* **`RESERVED2`:{EXT300-363}** (not strictly necessary to be added) is not
- and **cannot** ever be Vectorised or Augmented by Simple-V or any future
+ and **cannot** ever be Vectorized or Augmented by Simple-V or any future
Simple-V Scheme.
it is a pure **Scalar-only** word-length PO Group. It may remain `RESERVED`.
* **`RESERVED1`:{EXT200-263}** is also a new set of 64 word-length Major
in effect Single-Augmented-Prefixed variants of the v3.0 32-bit Power ISA.
Alternative instruction encodings other than the exact same 32-bit word
from EXT000-EXT063 are likewise prohibited.
-* **`SVP64:{EXT000-063}`** and **`SVP64:{EXT200-263}`** - Full Vectorisation
+* **`SVP64:{EXT000-063}`** and **`SVP64:{EXT200-263}`** - Full Vectorization
of EXT000-063 and EXT200-263 respectively, these Prefixed instructions
are likewise prohibited from being a different encoding from their
32-bit scalar versions.
`SVP64-Reserved` which will have to be achieved with SPRs (PCR or MSR).
*Most importantly what this scheme does not do is provide large areas
-for other (non-Vectoriseable) RFCs.*
+for other (non-Vectorizeable) RFCs.*
# Potential Opcode allocation solution (2)
as a Prefix, which is a new RESERVED encoding.
* when bit 6 is 0b0 and bits 32-33 are 0b11 the encoding is **defined** as also
allocated to Simple-V
-* all other patterns are `RESERVED` for other non-Vectoriseable
+* all other patterns are `RESERVED` for other non-Vectorizeable
purposes (just over 37.5%).
| 0-5 | 6 | 7 | 8-31 | 32:33 | Description |
This ensures that any potential for future conflict over uses of the
EXT009 space, jeopardising Simple-V in the process, is avoided,
yet leaves huge areas (just over 37.5% of the 64-bit space) for other
-(non-Vectoriseable) uses.
+(non-Vectorizeable) uses.
These areas thus need to be Allocated (SVP64 and Scalar EXT248-263):
* SVP64Single (`RESERVED3/4`) is *planned* for a future RFC
(but needs reserving as part of this RFC)
* `RESERVED1/2` is available for new general-purpose
- (non-Vectoriseable) 32-bit encodings (other RFCs)
+ (non-Vectorizeable) 32-bit encodings (other RFCs)
* EXT248-263 is for "new" instructions
which **must** be granted corresponding space
in SVP64.
-* Anything Vectorised-EXT000-063 is **automatically** being
+* Anything Vectorized-EXT000-063 is **automatically** being
requested as 100% Reserved for every single "Defined Word"
- (Public v3.1 1.6.3 definition). Vectorised-EXT001 or EXT009
+ (Public v3.1 1.6.3 definition). Vectorized-EXT001 or EXT009
is defined as illegal.
* Any **future** instruction
added to EXT000-063 likewise, must **automatically** be
assigned corresponding reservations in the SVP64:EXT000-063
and SVP64Single:EXT000-063 area, regardless of whether the
- instruction is Vectoriseable or not.
+ instruction is Vectorizeable or not.
Bit-allocation Summary:
* EXT3nn and other areas provide space for up to
- QTY 4of non-Vectoriseable EXTn00-EXTn47 ranges.
+ QTY 4 of non-Vectorizeable EXTn00-EXTn47 ranges.
* QTY 3 of 55-bit spaces also exist for future use (longer by 3 bits
than opcodes allocated in EXT001)
* Simple-V EXT2nn is restricted to range EXT248-263
-* non-Simple-V (non-Vectoriseable) EXT2nn (if ever requested in any future RFC) is restricted to range EXT200-247
+* non-Simple-V (non-Vectorizeable) EXT2nn (if ever requested in any future RFC) is restricted to range EXT200-247
* Simple-V EXT0nn takes up 50% of PO9 for this and future Simple-V RFCs
**This however potentially puts SVP64 under pressure (in 5-10 years).**
The clear separation between Simple-V and non-Simple-V stops
conflict in future RFCs, both of which get plenty of space.
-EXT000-063 pressure is reduced in both Vectoriseable and
-non-Vectoriseable, and the 100+ Vectoriseable Scalar operations
+EXT000-063 pressure is reduced in both Vectorizeable and
+non-Vectorizeable, and the 100+ Vectorizeable Scalar operations
identified by Libre-SOC may safely be proposed and each evaluated
on their merits.
**SVP64:{EXT000-063}** bit6=old bit7=vector
This encoding is identical to **SVP64:{EXT248-263}** except it
-is the Vectorisation of existing v3.0/3.1 Scalar-words, EXT000-063.
+is the Vectorization of existing v3.0/3.1 Scalar-words, EXT000-063.
All the same rules apply with the addition that
-Vectorisation of EXT001 or EXT009 is prohibited.
+Vectorization of EXT001 or EXT009 is prohibited.
| 0-5 | 6 | 7 | 8-31 | 32-63 |
|--------|---|---|-------|---------|
**SVP64:{EXT248-263}** bit6=new bit7=vector
This encoding, which permits VL to be dynamic (settable from GPR or CTR)
-is the Vectorisation of EXT248-263.
+is the Vectorization of EXT248-263.
Instructions may not be placed in this category without also being
implemented as pure Scalar *and* SVP64Single. Unlike SVP64Single
however, there is **no reserved encoding** (bits 8-24 zero).
| 64bit | ss.fishmv | 0x26!zero | 0x12345678| scalar SVP64Single:EXT0nn |
| 64bit | unallocated | 0x27nnnnnn | 0x12345678| vector SVP64:EXT0nn |
-This is illegal because the instruction is possible to Vectorise,
-therefore it should be **defined** as Vectoriseable.
+This is illegal because the instruction is possible to Vectorize,
+therefore it should be **defined** as Vectorizeable.
-**illegal due to unvectoriseable**
+**illegal due to unvectorizeable**
| width | assembler | prefix? | suffix | description |
|-------|-----------|--------------|-----------|---------------|
| 64bit | ss.mtmsr | 0x26!zero | 0x12345678| scalar SVP64Single:EXT0nn |
| 64bit | sv.mtmsr | 0x27nnnnnn | 0x12345678| vector SVP64:EXT0nn |
-This is illegal because the instruction `mtmsr` is not possible to Vectorise,
+This is illegal because the instruction `mtmsr` is not possible to Vectorize,
at all. This does **not** convey an opportunity to allocate the
space to an alternative instruction.
-**illegal unvectoriseable in EXT2nn**
+**illegal unvectorizeable in EXT2nn**
| width | assembler | prefix? | suffix | description |
|-------|-----------|--------------|-----------|---------------|
| 64bit | ss.mtmsr2 | 0x24!zero | 0x12345678| scalar SVP64Single:EXT2nn |
| 64bit | sv.mtmsr2 | 0x25nnnnnn | 0x12345678| vector SVP64:EXT2nn |
-For a given hypothetical `mtmsr2` which is inherently Unvectoriseable
+For a given hypothetical `mtmsr2` which is inherently Unvectorizeable:
whilst it may be put into the scalar EXT2nn space, it may **not** be
-allocated in the Vector space. As with Unvectoriseable EXT0nn opcodes
+allocated in the Vector space. As with Unvectorizeable EXT0nn opcodes
this does not convey the right to use the 0x24/0x26 space for alternative
-opcodes. This hypothetical Unvectoriseable operation would be better off
+opcodes. This hypothetical Unvectorizeable operation would be better off
being allocated as EXT001 Prefixed, EXT000-063, or hypothetically in
EXT300-363.
The use of 0x12345678 for fredmv in scalar but fishmv in Vector is
illegal. The suffix in both 64-bit locations
-must be allocated to a Vectoriseable EXT000-063
+must be allocated to a Vectorizeable EXT000-063
"Defined Word" (Public v3.1 Section 1.6.3 definition)
or not at all.
legal for Primary Opcodes in the range 232-263, where the top
two MSBs are 0b11. Thus this faulty attempt actually falls
unintentionally
-into `RESERVED` "Non-Vectoriseable" Encoding space.
+into `RESERVED` "Non-Vectorizeable" Encoding space.
**illegal attempt to put Scalar EXT001 into Vector space**
which are illegal due to cost at the Decode Phase (Variable-Length
Encoding). Likewise attempting to embed EXT009 (chained) is also
illegal. The implications are clear unfortunately that all 64-bit
-EXT001 Scalar instructions are Unvectoriseable.
+EXT001 Scalar instructions are Unvectorizeable.
\newpage{}
# Use cases
<https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_bigint.py;hb=HEAD>
\newpage{}
-# Vectorised strncpy
+# Vectorized strncpy
Aside from the `blr` return instruction this is an entire fully-functional
implementation of `strncpy` which demonstrates some of the remarkably
powerful capabilities of Simple-V. Load Fault-First avoids instruction
-traps and page faults in the middle of the Vectorised Load, providing
+traps and page faults in the middle of the Vectorized Load, providing
the *micro-architecture* with the opportunity to notify the program of
the successful Vector Length. `sv.cmpi` is the next strategically-critical
instruction, as it searches for a zero and yet *includes* it in a new
Vector Length - bearing in mind that the previous instruction (the Load)
*also* truncated down to the valid number of LDs performed. Finally,
-a Vectorised Branch-Conditional automatically decrements CTR by the number
+a Vectorized Branch-Conditional automatically decrements CTR by the number
of elements copied (VL), rather than decrementing simply by one.
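The three-instruction interplay described above can be modelled in plain Python. This is a behavioural sketch only, not the assembler itself: the helper names (`ld_ffirst`, `strncpy_model`) and the 64-element chunk size are invented for illustration.

```python
# Behavioural model of the SVP64 strncpy loop (illustrative only).
# ld_ffirst: stop loading at the first inaccessible byte, truncating VL.
# The compare step then truncates VL again, *including* the zero byte.

def ld_ffirst(mem, addr, vl):
    """Load up to vl bytes; truncate VL at the first fault (no trap)."""
    out = []
    for i in range(vl):
        if addr + i not in mem:        # page fault: truncate instead
            break
        out.append(mem[addr + i])
    return out, len(out)               # data, new VL

def strncpy_model(mem, dst, src, n):
    copied = 0
    while n > 0:
        data, vl = ld_ffirst(mem, src, min(n, 64))
        if vl == 0:
            break                      # a fault on the first element traps
        # sv.cmpi with inclusive fail-first: keep elements up to and
        # including the first zero byte.
        new_vl = vl
        for i, b in enumerate(data):
            if b == 0:
                new_vl = i + 1
                break
        for i in range(new_vl):        # store the surviving elements
            mem[dst + i] = data[i]
        src += new_vl; dst += new_vl
        n -= new_vl; copied += new_vl  # the branch decrements CTR by VL
        if 0 in data[:new_vl]:
            break                      # terminating zero copied: done
    return copied
```

Copying `b"hello\x00"` through this model copies six bytes (including the terminator) in one pass, with the fault-first load bounding each chunk.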
```
[^ext001]: Recall that EXT100 to EXT163 is for Public v3.1 64-bit-augmented Operations prefixed by EXT001, for which, from Section 1.6.3, bit 6 is set to 1. This concept is where the above scheme originated. Section 1.6.3 uses the term "defined word" to refer to pre-existing EXT000-EXT063 32-bit instructions so prefixed to create the new numbering EXT100-EXT163, respectively
[^futurevsx]: A future version or other Stakeholder *may* wish to drop Simple-V onto VSX: this would be a separate RFC
[^vsx256]: Imagine a hypothetical future VSX-256 using the exact same instructions as VSX: the binary incompatibility introduced would catastrophically **and retroactively** damage existing IBM POWER8, 9 and 10 hardware's reputation and that of the Power ISA overall.
-[^autovec]: Compiler auto-vectorisation for best exploitation of SIMD and Vector ISAs on Scalar programming languages (c, c++) is an Indusstry-wide known-hard decades-long problem. Cross-reference the number of hand-optimised assembler algorithms.
+[^autovec]: Compiler auto-vectorization for best exploitation of SIMD and Vector ISAs on Scalar programming languages (C, C++) is an Industry-wide known-hard decades-long problem. Cross-reference the number of hand-optimised assembler algorithms.
[^hphint]: intended for use when the compiler has determined the extent of Memory or register aliases in loops: `a[i] += a[i+4]` would necessitate a Vertical-First hphint of 4
[^svshape]: although SVSHAPE0-3 should, realistically, be regarded as high a priority as SVSTATE, and given corresponding SVSRR and SVLR equivalents, it was felt that having to context-switch **five** SPRs on Interrupts and function calls was too much.
[^whoops]: two efforts were made to mix non-uniform encodings into Simple-V space: one deliberate to see how it would go, and one accidental. They both went extremely badly, the deliberate one costing over two months to add then remove.
the additional requirements are:
-* all of the scalar operations must be Vectoriseable
+* all of the scalar operations must be Vectorizeable
* all of the scalar operations must be in a 32-bit encoding (not prefixed-prefixed)
# use 75% of QTY 3 MAJOR ops
having this `RESERVED` encoding in the middle of the
space does complicate multi-issue decoding somewhat,
but it does provide an entire new (independent,
-non-vectorisable) 32-bit opcode space. **two** separate
+non-vectorizable) 32-bit opcode space. **two** separate
RESERVED Major opcode areas can be provided: numbering them
EXT200-263 and EXT300-363 respectively seems sane.
EXT300-363 for `RESERVED1` comes with a caveat that it can
**
It is unlikely that we (Libre-SOC) will initially implement any of v3.1
-64-bit prefixing (it cannot be Vectorised, resulting unacceptably in
+64-bit prefixing (it cannot be Vectorized, resulting unacceptably in
96-bit instructions, which we decided is too much). That said, the LD
addressing immediate extended range is extremely useful
(along with the PC-relative modes and also other instructions
**Keywords**:
```
- Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
+ Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP)
```
**Motivation**
Power ISA is synonymous with Supercomputing and the early Supercomputers
-(ETA-10, ILLIAC-IV, CDC200, Cray) had Vectorisation. It is therefore anomalous
+(ETA-10, ILLIAC-IV, CDC Cyber 205, Cray) had Vectorization. It is therefore anomalous
that Power ISA does not have Scalable Vectors. This presents the opportunity to
modernise Power ISA keeping it at the top of Supercomputing.
**Keywords**:
```
- Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
+ Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP)
```
DCT REMAP is RADIX2 only. Convolutions may be applied as usual
to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly
instructions, the algorithm below (written in python3) becomes part
-of an in-place in-registers Vectorised DCT. The algorithms work
+of an in-place in-registers Vectorized DCT. The algorithms work
by loading data such that as the nested loops progress the result
is sorted into correct sequential order.
**Keywords**:
```
- Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC),
+ Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC),
True-Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model,
Digital Signal Processing (DSP), High-level Assembler
```
The purpose of this RFC is:
* to give a full list of upcoming **Scalar** opcodes developed by Libre-SOC
- (being cognisant that *all* of them are Vectoriseable)
+ (being cognisant that *all* of them are Vectorizeable)
* to give OPF Members and non-Members alike the opportunity to comment and get
involved early in RFC submission
* formally agree a priority order on an iterative basis with new versions
**separate** organisations*.
Worth bearing in mind during evaluation that every "Defined Word" may
-or may not be Vectoriseable, but that every "Defined Word" should have
-merits on its own, not just when Vectorised, precisely because the
+or may not be Vectorizeable, but that every "Defined Word" should have
+merits on its own, not just when Vectorized, precisely because the
instructions are Scalar. An example of a borderline
-Vectoriseable Defined Word is `mv.swizzle` which only really becomes
+Vectorizeable Defined Word is `mv.swizzle` which only really becomes
high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
less merit as a Scalar-only operation, yet when SVP64Single-Prefixed
can be part of an atomic Compare-and-Swap sequence.
Future versions of SVP64 and SVP64Single are expected to be developed
by future Power ISA Stakeholders on top of VSX. The decisions made
-there about the meaning of Prefixed Vectorised VSX may be *completely
+there about the meaning of Prefixed Vectorized VSX may be *completely
different* from those made for Prefixed SFFS instructions. At which
point the lack of SFFS equivalents would penalise SFFS implementors in a
much more severe way, effectively expecting them and SFFS programmers to
These without question have to go in EXT0xx. Future extended variants,
bringing even more powerful capabilities, can be followed up later with
EXT1xx prefixed variants, which is not possible if placed in EXT2xx.
-*Only `svstep` is actually Vectoriseable*, all other Management
-instructions are UnVectoriseable. PO1-Prefixed examples include
+*Only `svstep` is actually Vectorizeable*, all other Management
+instructions are UnVectorizeable. PO1-Prefixed examples include
adding psvshape in order to support both Inner and Outer Product Matrix
Schedules, by providing the option to directly reverse the order of the
triple loops. Outer is used for standard Matrix Multiply (on top of a
standard MAC or FMAC instruction), but Inner is required for Warshall
Transitive Closure (on top of a cumulatively-applied max instruction).
-Excpt for `svstep` which is Vectoriseable the Management Instructions
+Except for `svstep`, which is Vectorizeable, the Management Instructions
themselves are all 32-bit Defined Words (Scalar Operations), so
PO1-Prefixing is perfectly reasonable. SVP64 Management instructions
of which there are only 6 are all 5 or 6 bit XO, meaning that the opcode
Found at [[sv/av_opcodes]] these do not require Saturated variants
because Saturation is added via [[sv/svp64]] (Vector Prefixing) and
via [[sv/svp64-single]] Scalar Prefixing. This is important to note for
-Opcode Allocation because placing these operations in the UnVectoriseable
+Opcode Allocation because placing these operations in the UnVectorizeable
areas would irredeemably damage their value. Unlike PackedSIMD ISAs
the actual number of AV Opcodes is remarkably small once the usual
cascading-option-multipliers (SIMD width, bitwidth, saturation,
operations, typically performing for example one multiply but in-place
subtracting that product from one operand and adding it to the other.
The *in-place* aspect is strategically extremely important for significant
-reductions in Vectorised register usage, particularly for DCT.
+reductions in Vectorized register usage, particularly for DCT.
Further: even without Simple-V the number of instructions saved is huge: 8 for
integer and 4 for floating-point vs one.
Whilst some of these instructions have VSX equivalents they must not
be excluded on that basis. SVP64/VSX may have a different meaning from
-SVP64/SFFS i e. the two *Vectorised* instructions may not be equivalent.
+SVP64/SFFS, i.e. the two *Vectorized* instructions may not be equivalent.
## Bitmanip LUT2/3
SVP64Single Predication, whereupon the end result is the RISC-synthesis
of Compare-and-Swap, in two instructions.
-Where this instruction comes into its full value is when Vectorised.
+Where this instruction comes into its full value is when Vectorized.
3D GPU and HPC numerical workloads astonishingly contain between 10 and 15%
swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
\newpage{}
-# Vectorisation: SVP64 and SVP64Single
+# Vectorization: SVP64 and SVP64Single
To be submitted as part of [[ls001]], [[ls008]], [[ls009]] and [[ls010]],
with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually
becomes a candidate for Vector-Prefixing. This in turn means that when
a new instruction is proposed, it becomes a hard requirement to consider
not only the implications of its inclusion as a Scalar-only instruction,
-but how it will best be utilised as a Vectorised instruction **as well**.
+but how it will best be utilised as a Vectorized instruction **as well**.
Extreme examples of this are the Big-Integer 3-in 2-out instructions
that use one 64-bit register effectively as a Carry-in and Carry-out. The
instructions were designed in a *Scalar* context to be inline-efficient
1-out), but in a *Vector* context it is extremely straightforward to
Micro-code an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD
pipelines, and to perform a large internal Forward-Carry-Propagation on
-for example the Vectorised-Multiply instruction.
+for example the Vectorized-Multiply instruction.
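The carry-propagation claim can be illustrated with a short Python model of what a chain of such 64-bit multiply-add elements computes. This is a sketch only: `bigmul_vector_scalar` is an invented name, and the element semantics (low 64 bits to the result element, high 64 bits forwarded along the chain) follow the maddedu-style description given later in this document.

```python
# Sketch of big-integer "Vector times Scalar" multiply: the operation a
# chained batch of 64-bit multiply-add elements performs when the hi64
# of each element is forward-propagated into the next.

MASK = (1 << 64) - 1

def bigmul_vector_scalar(ra, rb_scalar, rc_in=0):
    """ra: list of 64-bit limbs, least-significant first.
    Returns (limbs of ra * rb_scalar + rc_in, final carry limb)."""
    result = []
    carry = rc_in
    for limb in ra:                    # the hardware for-loop over elements
        prod = limb * rb_scalar + carry
        result.append(prod & MASK)     # low 64 bits -> result element
        carry = prod >> 64             # high 64 bits chain to next element
    return result, carry

# (2**128 - 1) * 2 == 2**129 - 2
limbs = [(1 << 64) - 1, (1 << 64) - 1]
res, top = bigmul_vector_scalar(limbs, 2)
```

A wide SIMD backend can compute several of these elements per cycle precisely because the chain is a pure forward carry propagation.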
Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be
considered as an independent stand-alone instruction (just like `REP`).
upcoming RFCs in development may be found.
*Reading advance Draft RFCs and providing feedback strongly advised*,
it saves time and effort for the OPF ISA Workgroup.
-* **SVP64** - Vectoriseable (SVP64-Prefixable) - also implies that
+* **SVP64** - Vectorizeable (SVP64-Prefixable) - also implies that
SVP64Single is also permitted (required).
* **page** - Libre-SOC wiki page at which further information can
be found. Again: **advance reading strongly advised due to the
sheer volume of information**.
* **PO1** - the instruction is capable of being PO1-Prefixed
(given an EXT1xx Opcode Allocation). Bear in mind that this option
- is **mutually exclusively incompatible** with Vectorisation.
+ is **mutually exclusive** with Vectorization.
* **group** - the Primary Opcode Group recommended for this instruction.
Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx. A third area
- (UnVectoriseable),
+ (UnVectorizeable),
EXT3xx, was available in an early Draft RFC but has been made "RESERVED"
instead. see [[sv/po9_encoding]].
* **Level** - Compliancy Subset and Simple-V Level. `SFFS` indicates "mandatory"
register to selectively target any four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
in one hit.
-* Optional Vectorisation of the same when SVP64 is implemented
+* Optional Vectorization of the same when SVP64 is implemented
Purpose:
* To provide a merged version of what is currently a multi-sequence of
CR operations (crand, cror, crxor) with mfcr and mtcrf, reducing
instruction count.
-* To provide a vectorised version of the same, suitable for advanced
+* To provide a vectorized version of the same, suitable for advanced
predication
Useful side-effects:
RAp instructions, these instructions would not be proposed.
4. The read and write of two overlapping registers normally requires
   an intermediate register (similar to the justification for CAS -
- Compare-and-Swap). When Vectorised the situation becomes even
+ Compare-and-Swap). When Vectorized the situation becomes even
worse: an entire *Vector* of intermediate temporaries is required.
Thus *even if implemented inefficiently* requiring more cycles to
complete (taking an extra cycle to write the second result) these
instructions still save on resources.
5. Macro-op fusion equivalents of these instructions are *not possible* for
exactly the same reason that the equivalent CAS sequence may not be
- macro-op fused. Full in-place Vectorised FFT and DCT algorithms *only*
+ macro-op fused. Full in-place Vectorized FFT and DCT algorithms *only*
become possible due to these instructions atomically reading **both**
Butterfly operands into internal Reservation Stations (exactly like CAS).
5. Although desirable (particularly to detect overflow) Rc=1 is hard to
SV Link Register, exactly analogous to LR (Link Register) may
be used for temporary storage of SVSTATE, and, in particular,
-Vectorised Branch-Conditional instructions may interchange
+Vectorized Branch-Conditional instructions may interchange
SVLR and SVSTATE whenever LR and NIA are.
Note that there is no equivalent Link variant of SVREMAP or
The creation and maintenance of SVP64 Categorisation is an automated
process that uses "Register profiling", reading machine-readable
versions of the Power ISA Specification and tables in order to
-make the Vectorisation Categorisation. To create this information
+make the Vectorization Categorisation. To create this information
by hand is neither sensible nor desirable: it may take far longer
and introduce errors.
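As a rough illustration of the idea (not the actual tooling, which reads the machine-readable Power ISA tables; the operand labels and mnemonic list here are invented), register profiling amounts to something like:

```python
# Invented sketch of register-profile-driven categorisation.  The real
# process derives operand kinds from the machine-readable ISA tables.

UNVECTORIZABLE = {"sc", "mtmsr", "attn", "tlbie"}   # illustrative list

def categorise(mnemonic, operands):
    """operands: set of operand-kind labels for one instruction."""
    if mnemonic in UNVECTORIZABLE:
        return "Unvectorizable"
    if "branch-target" in operands:
        return "Branch-Conditional"
    if "effective-address" in operands:
        return "Load/Store"
    if operands and operands <= {"BT", "BA", "BB", "BF", "BFA"}:
        return "CR-op"                  # only CR-field operands present
    return "Arithmetic/Logical"
```

Driving the categorisation from operand profiles in this way keeps it mechanical and reproducible, which is exactly why doing it by hand is undesirable.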
## Introduction
-Simple-V is a type of Vectorisation best described as a "Prefix Loop
+Simple-V is a type of Vectorization best described as a "Prefix Loop
Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR`[^bib_ldir] instruction and
to the 8086 `REP`[^bib_rep] Prefix instruction. More advanced features are similar
to the Z80 `CPIR`[^bib_cpir] instruction. If naively viewed one-dimensionally as an
(significantly reducing hot-loop instruction count) that one bit in
the Prefix is reserved for it (*Note the intention to release that bit
and move Post-Increment instructions to EXT2xx, as part of [[sv/rfc/ls011]]*).
-Vectorised Branch-Conditional operations "embed" the original Scalar
+Vectorized Branch-Conditional operations "embed" the original Scalar
Branch-Conditional behaviour into a much more advanced variant that is
highly suited to High-Performance Computation (HPC), Supercomputing,
and parallel GPU Workloads.
*Architectural Note: Given that a "pre-classification" Decode Phase is
required (identifying whether the Suffix - Defined Word - is
Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
-adding "Unvectorised" to this phase is not unreasonable.*
+adding "Unvectorized" to this phase is not unreasonable.*
Vectorizable Defined Word-instructions are **required** to be Vectorized,
or they may not be permitted to be added at all to the Power ISA as Defined
* The GPR-numbering is considered LSB0-ordered
* The Element-numbering (result0-result4) is LSB0-ordered
* Each of the results (result0-result4) are 16-bit
-* "same" indicates "no change as a result of the Vectorised add"
+* "same" indicates "no change as a result of the Vectorized add"
```
| MSB0: | 0:15 | 16:31 | 32:47 | 48:63 |
from GPR(1) into GPR(2) - the 5th result modifies **only** the bottom
16 LSBs of GPR(1).
-If the 16-bit operation were to be followed up with a 32-bit Vectorised
+If the 16-bit operation were to be followed up with a 32-bit Vectorized
Operation, the exact same contents would be viewed as follows:
```
## Register Naming and size
As indicated above SV Registers are simply the GPR, FPR and CR register
-files extended linearly to larger sizes; SV Vectorisation iterates
+files extended linearly to larger sizes; SV Vectorization iterates
sequentially through these registers (LSB0 sequential ordering from 0
to VL-1).
| 110 | so/un | `CR[offs+i].FU` is set |
| 111 | ns/nu | `CR[offs+i].FU` is clear |
-`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised
+`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorized
Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
-The CR Predicates chosen must start on a boundary that Vectorised CR
+The CR Predicates chosen must start on a boundary that Vectorized CR
operations can access cleanly, in full. With EXTRA2 restricting starting
-points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and
+points to multiples of 8 (CR0, CR8, CR16...) both Vectorized Rc=1 and
CR Predicate Masks have to be adapted to fit on these boundaries as well.
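A sketch of how the mode table above turns a run of CR Fields into a predicate mask. The bit position of SO/FU within the 4-bit field is an assumption here, and only the so/un (110) and ns/nu (111) rows are modelled.

```python
# Sketch: predicate mask derived from a run of 4-bit CR Fields.
SO_BIT = 0  # assumed LSB0 position of SO/FU within a field (illustrative)

def cr_predicate(cr_fields, offs, vl, invert=False):
    """Bit i of the result is set when element i is active.
    offs must sit on an EXTRA2-compatible boundary (CR0, CR8, CR16...)."""
    assert offs % 8 == 0, "EXTRA2 restricts start points to multiples of 8"
    mask = 0
    for i in range(vl):
        bit = (cr_fields[offs + i] >> SO_BIT) & 1
        mask |= (bit ^ invert) << i    # mode 111 (ns/nu) inverts the test
    return mask
```

The `offs % 8` assertion is the boundary restriction in code form: a predicate run starting at, say, CR35 could not be produced or consumed cleanly by Vectorized CR operations.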
## Extra Remapped Encoding <a name="extra_remap"> </a>
XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
-executing in precise Program Order, a little thought shows that a Vectorised
+executing in precise Program Order, a little thought shows that a Vectorized
Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
-tests this may be insufficient. If that is the case, a vectorised crops
+tests this may be insufficient. If that is the case, vectorized crops
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.
to zero. This is the only means in the entirety of SV by which VL may be set
to zero (with the exception of via the SV.STATE SPR). When VL is set
zero due to the first element failing the CR bit-test, all subsequent
- vectorised operations are effectively `nops` which is
+ vectorized operations are effectively `nops` which is
*precisely the desired and intended behaviour*.
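Why VL=0 turns the remainder of the sequence into `nops` falls straight out of the for-loop model. The helper names below are invented, not real SVP64 mnemonics; this is a behavioural sketch.

```python
# Every Simple-V instruction is a for-loop over elements 0..VL-1, so an
# empty range simply executes nothing: no special "skip" logic needed.

class SVState:
    def __init__(self, vl):
        self.VL = vl

def sv_ffirst_cmp(sv, vec, bit_test):
    """Data-driven fail-first: truncate VL at the first failing test."""
    for i in range(sv.VL):
        if not bit_test(vec[i]):
            sv.VL = i                  # may legitimately become zero
            return

def sv_add(sv, dst, a, b):
    for i in range(sv.VL):             # VL=0: loop body never runs
        dst[i] = a[i] + b[i]
```

With the first element failing the test, VL becomes 0 and every following vectorized operation iterates over an empty range.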
Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
### CR fields as inputs/outputs of vector operations
CRs (or, the arithmetic operations associated with them)
-may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
+may be marked as Vectorized or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorized if the destination is Vectorized. Likewise if the destination is scalar then so is the CR.
When vectorized, the CR inputs/outputs are sequentially read/written
-to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
+to 4-bit CR fields. Vectorized Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:
CR when Rc=1 is written to. This is CR0 for integer operations and CR1
for FP operations.
-Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which
-has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
+Note that yes, the CR Fields are genuinely Vectorized. Unlike in SIMD VSX which
+has a single CR (CR6) for a given SIMD result, SV Vectorized OpenPOWER
v3.0B scalar operations produce a **tuple** of element results: the
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
-"were some results nonzero". The application of mapreduce to Vectorised
+"were some results nonzero". The application of mapreduce to Vectorized
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations see [[sv/cr_int_predication]].
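As a sketch of the kind of analysis meant here (the bit layout within the CR Field is assumed, and `cr_mapreduce` is an invented name): reducing one chosen bit across the Vector of CR Fields with an AND gives "all elements passed", with an OR "some element passed", and other combining ops give richer answers still.

```python
# Sketch: crand/cror-style reduction over a Vector of 4-bit CR Fields.
EQ = 1  # assumed bit position of EQ within a CR Field (illustrative)

def cr_mapreduce(cr_fields, op, bit):
    """Fold op across the chosen bit of every CR Field in the vector."""
    acc = (cr_fields[0] >> bit) & 1
    for f in cr_fields[1:]:
        acc = op(acc, (f >> bit) & 1)
    return acc

fields = [0b0010, 0b0010, 0b0000]      # EQ, EQ, not-EQ
all_eq = cr_mapreduce(fields, lambda a, b: a & b, EQ)   # crand-style
any_eq = cr_mapreduce(fields, lambda a, b: a | b, EQ)   # cror-style
```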
so FP instructions with Rc=1 write to CR1 (n=1).
CRs are not stored in SPRs: they are registers in their own right.
-Therefore context-switching the full set of CRs involves a Vectorised
+Therefore context-switching the full set of CRs involves a Vectorized
mfcr or mtcr, using VL=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.
multiply into RT and RT+1.
What, then, of `sv.maddedu`? If the destination is hard-coded to RT and
-RT+1 the instruction is not useful when Vectorised because the output
+RT+1 the instruction is not useful when Vectorized because the output
will be overwritten on the next element. To solve this is easy: define
the destination registers as RT and RT+MAXVL respectively. This makes
it easy for compilers to statically allocate registers even when VL
| 111 | ~R30 |
-# CR Vectorisation
+# CR Vectorization
Some thoughts on this: the sensible (sane) number of CRs to have is 64. A case could be made for having 128 but it is an awful lot. 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much).
## only 1 src/dest
-Instructions in this category are usually Unvectoriseable
+Instructions in this category are usually Unvectorizeable
or they are Load-Immediates. `fmvis`, for example, is 1-Write,
whilst SV.Branch-Conditional is BI (CR field bit).
in the decode phase was too great. The lesson was learned, the
hard way: it would be infinitely preferable
to add a 32-bit Scalar Load-with-Shift
-instruction *first*, which then inherently becomes Vectorised.
+instruction *first*, which then inherently becomes Vectorized.
Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
both ARM and x86 have it, because it saves greatly on instruction count in
hot-loops.
32-bit encoding is ever allocated in a future revision
of the Power ISA
to a completely unrelated operation
-then how can a Vectorised version of that new instruction ever be added?
+then how can a Vectorized version of that new instruction ever be added?
The uniformity and RISC Abstraction is irreparably damaged.
Bottom line here is that the fundamental RISC Principle is strictly adhered
to, even though these are Advanced 64-bit Vector instructions.
The basic principle of SVP64 is the prefix, which contains mode
as well as register augmentation and predicates. When thinking of
-instructions and Vectorising them, it is natural for arithmetic
+instructions and Vectorizing them, it is natural for arithmetic
operations (ADD, OR) to be the first to spring to mind.
Arithmetic instructions have registers, therefore augmentation
applies, end of story, right?
Power ISA has Condition Register Fields: how can element widths
apply there? And branches: how can you have Saturation on something
that does not return an arithmetic result? In short: there are actually
-four different categories (five including those for which Vectorisation
+four different categories (five including those for which Vectorization
makes no sense at all, such as `sc` or `mtmsr`). The categories are:
* arithmetic/logical including floating-point
Condition Register Fields are 4-bit wide and consequently element-width
overrides make absolutely no sense whatsoever. Therefore the elwidth
-override field bits can be used for other purposes when Vectorising
+override field bits can be used for other purposes when Vectorizing
CR Field instructions. Moreover, Rc=1 is completely invalid for
CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense.
All of these differences, which require quite a lot of logical
reasoning and deduction, help explain why there is an entirely different
-CR ops Vectorisation Category.
+CR ops Vectorization Category.
A particularly strange quirk of CR-based Vector Operations is that the
Scalar Power ISA CR Register is 32-bits, but actually comprises eight
With SVP64 extending the number of CR *Fields* to 128, the number of
32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
(8 per CR Register). Then, it gets even more strange, when it comes
-to Vectorisation, which applies to the CR Field *numbers*. The
+to Vectorization, which applies to the CR Field *numbers*. The
hardware-for-loop for Rc=1 for example starts at CR0 for element 0,
and moves to CR1 for element 1, and so on. The reason here is quite
simple: each element result has to have its own CR Field co-result.
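A sketch of that tuple-producing for-loop (illustrative Python; the CR Field bit layout of LT, GT, EQ, SO and the helper name are assumptions for the example):

```python
# Sketch of the Rc=1 hardware for-loop: each element produces a tuple
# (result, CR Field co-result), the CR Field number tracking the
# element number.  Field bits assumed: LT=bit3, GT=bit2, EQ=bit1 (LSB0).

def sv_addi_rc(cr_fields, dst, src, imm, vl, cr_base=0):
    for i in range(vl):
        dst[i] = src[i] + imm
        lt, gt, eq = dst[i] < 0, dst[i] > 0, dst[i] == 0
        # element i writes its co-result into CR Field cr_base + i
        cr_fields[cr_base + i] = (lt << 3) | (gt << 2) | (eq << 1)
```

Running this over four elements leaves four independent CR Field co-results, one per element, rather than the single shared CR of a Packed SIMD compare.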
and attention is advised, here, when reading the specification,
especially on arithmetic loads (lbarx, lharx etc.)
-**Non-vectorised**
+**Non-vectorized**
-The concept of a Vectorised halt (`attn`) makes no sense. There are never
+The concept of a Vectorized halt (`attn`) makes no sense. There is never
going to be a Vector of global MSRs (Machine State Register). `mtcr`
-on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
+on the other hand is a grey area: `mtspr` is clearly Vectorizeable.
Even `td` and `tdi` make a strange type of sense to permit them to be
-Vectorised, because a sequence of comparisons could be Vectorised.
-Vectorised System Calls (`sc`) or `tlbie` and other Cache or Virtual
+Vectorized, because a sequence of comparisons could be Vectorized.
+Vectorized System Calls (`sc`) or `tlbie` and other Cache or Virtual
Memory Management
-instructions, these make no sense to Vectorise.
+instructions: these make no sense to Vectorize.
However, it is really quite important to not be tempted to conclude that
-just because these instructions are un-vectoriseable, the Prefix opcode space
+just because these instructions are un-vectorizeable, the Prefix opcode space
must be free for reinterpretation and use for other purposes. This would
be a serious mistake because a future revision of the specification
might *retire* the Scalar instruction, and, worse, replace it with another.
Again this comes down to being quite strict about the rules: only Scalar
-instructions get Vectorised: there are *no* actual explicit Vector
+instructions get Vectorized: there are *no* actual explicit Vector
instructions.
**Summary**
of a Scalar ISA and then adds additional instructions which only
make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
considerable lengths to keep strictly to augmentation and embedding
-of an entire Scalar ISA's instructions into an abstract Vectorisation
+of an entire Scalar ISA's instructions into an abstract Vectorization
Context. That abstraction subdivides down into Categories appropriate
for the type of operation (Branch, CRs, Memory, Arithmetic),
and each Category has its own relevant but
conditions are met, whereas Scalar `bclrl` for example unconditionally
overwrites LR.
-Another is that the Vectorised Branch-Conditional instructions are the
+Another is that the Vectorized Branch-Conditional instructions are the
only ones where there are side-effects on predication when skipping
is enabled. This is so as to be able to use CTR to count down
*masked-out* elements.
-Well over 500 Vectorised branch instructions exist in SVP64 due to the
+Well over 500 Vectorized branch instructions exist in SVP64 due to the
number of options available: close integration and interaction with
the base Scalar Branch was unavoidable in order to create Conditional
Branching suitable for parallel 3D / CUDA GPU workloads.
As explained in the introduction [[sv/svp64]] and [[sv/cr_ops]]
Scalar Power ISA lacks "Conditional Execution" present in ARM
-Scalar ISA of several decades. When Vectorised the fact that
+Scalar ISA for several decades. When Vectorized, the fact that
Rc=1 Vector results can immediately be used as a Predicate Mask
back into the following instruction can result in large latency
unless "Vector Chaining" is used in the Micro-Architecture.
**Description**
svstep may be used to enquire about the REMAP Schedule and it may be
-used to alter Vectorisation State. When `vf=1` then stepping occurs.
+used to alter Vectorization State. When `vf=1` then stepping occurs.
When `vf=0` the enquiry is performed without altering internal state.
If `SVi=0, Rc=0, vf=0` the instruction is a `nop`.
* Horizontal-First Mode can be used to return all indices,
i.e. walks through all possible states.
-**Vectorisation of svstep itself**
+**Vectorization of svstep itself**
As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as
`sv.svstep`. This will work perfectly well in Horizontal-First
A mode of srcstep (SVi=0) is called which can move srcstep and dststep
on to the next element, still respecting predicate masks.
-In other words, where normal SVP64 Vectorisation acts "horizontally"
+In other words, where normal SVP64 Vectorization acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC to the
next instruction, Vertical-First moves the PC onwards (vertically)
through multiple instructions **with the same srcstep and dststep**,
* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
Mitch under NDA
on direct contact with him. It is a different approach from the
- others, which may be termed "Cray-Style Horizontal-First" Vectorisation.
+ others, which may be termed "Cray-Style Horizontal-First" Vectorization.
66000 is a *Vertical-First* Vector ISA with hardware-level
- auto-vectorisation.
+ auto-vectorization.
* [ETA-10](http://50.204.185.175/collections/catalog/102641713)
an extremely rare Scalable Vector Architecture from 1986,
similar to the CDC Cyber 205.
The variant of iotacr which is vidcr: it is not appropriate to have BA=0, and it would be pointless anyway. The integer version covers it by not reading the int regfile at all.
-scalar variant which can be Vectorised to give iotacr:
+scalar variant which can be Vectorized to give iotacr:
def crtaddi(RT, RA, BA, BO, D):
if test_CR_bit(BA, BO):