From: Luke Kenneth Casson Leighton
Date: Mon, 29 May 2023 12:27:18 +0000 (+0100)
Subject: replace Vectorised with Vectorized
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=e7b958bd6f3209bbe23d1abc193fe6ffb89847eb;p=libreriscv.git

replace Vectorised with Vectorized
---

diff --git a/openpower/sv/16_bit_compressed.mdwn b/openpower/sv/16_bit_compressed.mdwn
index 2a302ea5d..564138e1b 100644
--- a/openpower/sv/16_bit_compressed.mdwn
+++ b/openpower/sv/16_bit_compressed.mdwn
@@ -41,7 +41,7 @@ standard 32 bit and 16 bit to intermingle cleanly.

To achieve the same thing on OpenPOWER would require a whopping 24 6-bit
Major Opcodes which is clearly impractical: other schemes need to be
devised.

-In addition we would like to add SV-C32 which is a Vectorised version
+In addition we would like to add SV-C32 which is a Vectorized version
of 16 bit Compressed, and ideally have a variant that adds the 27-bit
prefix format from SV-P64, as well.

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index 89ead3407..ea1bf2b9e 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -21,7 +21,7 @@ Inventing a new Scalar ISA from scratch is over a decade-long
task including simulators and compilers: OpenRISC 1200 took 12 years to
mature. Stable Open ISAs require Standards and Compliance Suites that
take more. A Vector or Packed SIMD ISA to reach stable *general-purpose*
-auto-vectorisation compiler support has never been achieved in the
+auto-vectorization compiler support has never been achieved in the
history of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
@@ -129,7 +129,7 @@ performance is concerned.
Slowly, at this point, a realisation should be sinking in that, actually, there aren't as many really truly viable Vector ISAs out there, as the -ones that are evolving in the general direction of Vectorisation are, +ones that are evolving in the general direction of Vectorization are, in various completely different ways, flawed. **Successfully identifying a limitation marks the beginning of an @@ -381,7 +381,7 @@ Remarkably, very little: the devil is in the details though. sequential carry-flag chaining of these scalar instructions. * The Condition Register Fields of the Power ISA make a great candidate for use as Predicate Masks, particularly when combined with - Vectorised `cmp` and Vectorised `crand`, `crxor` etc. + Vectorized `cmp` and Vectorized `crand`, `crxor` etc. It is only when looking slightly deeper into the Power ISA that certain things turn out to be missing, and this is down in part to IBM's @@ -389,9 +389,9 @@ primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or so Scalar ones. Examples include that transfer operations between the Integer and Floating-point Scalar register files were dropped approximately a decade ago after the Packed SIMD variants were considered to be -duplicates. With it being completely inappropriate to attempt to Vectorise +duplicates. With it being completely inappropriate to attempt to Vectorize a Packed SIMD ISA designed 20 years ago with no Predication of any kind, -the Scalar ISA, a much better all-round candidate for Vectorisation +the Scalar ISA, a much better all-round candidate for Vectorization (the Scalar parts of Power ISA) is left anaemic. A particular key instruction that is missing is `MV.X` which is @@ -399,7 +399,7 @@ illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously expensive instruction causing a huge swathe of Register Hazards in one single hit is almost never added to a Scalar ISA but is almost always added to a Vector one. 
When `MV.X` is -Vectorised it allows for arbitrary +Vectorized it allows for arbitrary remapping of elements within a Vector to positions specified by another Vector. A typical Scalar ISA will use Memory to achieve this task, but with Vector ISAs the Vector Register Files are @@ -872,7 +872,7 @@ not that straightforward: programs have to be "massaged" by tools that insert intrinsics into the source code, in order to identify the Basic Blocks that the Zero-Overhead Loops can run. Can this be merged into standard gcc and llvm -compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably, +compilers? As intrinsics: of course. Can it become part of auto-vectorization? Probably, if an infinite supply of money and engineering time is thrown at it. Is a half-way-house solution of compiler intrinsics good enough? Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that, @@ -912,7 +912,7 @@ definitely compelling enough to warrant in-depth investigation. First, some important definitions, because there are two different -Vectorisation Modes in SVP64: +Vectorization Modes in SVP64: * **Horizontal-First**: (aka standard Cray Vectors) walk through **elements** first before moving to next **instruction** @@ -943,7 +943,7 @@ traditional CPU in no way can help: only by loading the data through the L1-L4 Cache and Virtual Memory Barriers is it possible to ascertain, retrospectively, that time and power had just been wasted. -SVP64 is able to do what is termed "Vertical-First" Vectorisation, +SVP64 is able to do what is termed "Vertical-First" Vectorization, combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been extended, Snitch-style, to perform a deterministic memory-array walk of a large Matrix. @@ -977,7 +977,7 @@ L1/L2/L3 Caches only to find, at the CPU, that it is zero. The reason in this case for the use of Vertical-First Mode is the conditional execution of the Multiply-and-Accumulate. 
-Horizontal-First Mode is the standard Cray-Style Vectorisation: +Horizontal-First Mode is the standard Cray-Style Vectorization: loop on all *elements* with the same instruction before moving on to the next instruction. Horizontal-First Predication needs to be pre-calculated diff --git a/openpower/sv/av_opcodes.mdwn b/openpower/sv/av_opcodes.mdwn index 9614d0090..3894a6984 100644 --- a/openpower/sv/av_opcodes.mdwn +++ b/openpower/sv/av_opcodes.mdwn @@ -2,7 +2,7 @@ # Scalar OpenPOWER Audio and Video Opcodes -the fundamental principle of SV is a hardware for-loop. therefore the first (and in nearly 100% of cases only) place to put Vector operations is first and foremost in the *scalar* ISA. However only by analysing those scalar opcodes *in* a SV Vectorisation context does it become clear why they are needed and how they may be designed. +the fundamental principle of SV is a hardware for-loop. therefore the first (and in nearly 100% of cases only) place to put Vector operations is first and foremost in the *scalar* ISA. However only by analysing those scalar opcodes *in* a SV Vectorization context does it become clear why they are needed and how they may be designed. This page therefore has accompanying discussion at for evolution of suitable opcodes. diff --git a/openpower/sv/av_opcodes/analysis.mdwn b/openpower/sv/av_opcodes/analysis.mdwn index a189c5140..a94628498 100644 --- a/openpower/sv/av_opcodes/analysis.mdwn +++ b/openpower/sv/av_opcodes/analysis.mdwn @@ -2,7 +2,7 @@ # Scalar OpenPOWER Audio and Video Opcodes -the fundamental principle of SV is a hardware for-loop. therefore the first (and in nearly 100% of cases only) place to put Vector operations is first and foremost in the *scalar* ISA. However only by analysing those scalar opcodes *in* a SV Vectorisation context does it become clear why they are needed and how they may be designed. +the fundamental principle of SV is a hardware for-loop. 
therefore the first (and in nearly 100% of cases only) place to put Vector operations is first and foremost in the *scalar* ISA. However only by analysing those scalar opcodes *in* a SV Vectorization context does it become clear why they are needed and how they may be designed.

This page therefore has accompanying discussion at for evolution of suitable opcodes.

@@ -21,7 +21,7 @@ Links

The fundamental principle for these instructions is:

* identify the scalar primitive
-* assume that longer runs of scalars will have Simple-V vectorisatin applied
+* assume that longer runs of scalars will have Simple-V vectorization applied
* assume that "swizzle" may be applied at the (vec2 - SUBVL=2) Vector
level, (even if that involves a mv.swizzle which may be macro-op fused)
in order to perform the necessary HI/LO selection normally hard-coded
diff --git a/openpower/sv/biginteger.mdwn b/openpower/sv/biginteger.mdwn
index c84971b82..26b2534c0 100644
--- a/openpower/sv/biginteger.mdwn
+++ b/openpower/sv/biginteger.mdwn
@@ -20,14 +20,14 @@ top of Scalar operations, where those scalar operations are
useful in their own right without SVP64. Thus the operations here are
proposed first as Scalar Extensions to the Power ISA.

-A secondary focus is that if Vectorised, implementors may choose
+A secondary focus is that if Vectorized, implementors may choose
to deploy macro-op fusion targeting back-end 256-bit or greater
Dynamic SIMD ALUs for maximum performance and effectiveness.
# Analysis

Covered in [[biginteger/analysis]] the summary is that standard `adde`
-is sufficient for SVP64 Vectorisation of big-integer addition (and `subfe`
+is sufficient for SVP64 Vectorization of big-integer addition (and `subfe`
for subtraction) but that big-integer shift, multiply and divide
require extra 3-in 2-out instructions, similar to Intel's
[shld](https://www.felixcloutier.com/x86/shld)
diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index 6075a6df4..f970ada96 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -21,10 +21,10 @@ of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
-and their Vectorisation would reduce those (for small N) to
+and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
-making it easy to demonstrate how they would be Vectorised.
+making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
@@ -45,7 +45,7 @@ Links

# Vector Add and Subtract

Surprisingly, no new additional instructions are required to perform
-a straightforward big-integer add or subtract. Vectorised `adde`
+a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce arbitrary-length
big-integer add due to the rules set in SVP64 that all Vector
Operations are directly equivalent to the strict Program Order
Execution of
@@ -69,7 +69,7 @@ with incrementing register numbers - is precisely the very
definition of how SVP64 works!
Thus, due to sequential execution of `adde` both consuming and producing a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA, -`sv.adde` is in effect an alias for Big-Integer Vectorised add. As such, +`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such, implementors are entirely at liberty to recognise Horizontal-First Vector adds and send the vector of registers to a much larger and wider back-end ALU, and short-cut the intermediate storage of XER.CA on an element @@ -107,7 +107,7 @@ Cray-style Vector Loop: bnz loop # do more digits This is not that different from a Scalar Big-Int add, it is -just that like all Cray-style Vectorisation, a variable number +just that like all Cray-style Vectorization, a variable number of elements are covered by one instruction. Of interest to people unfamiliar with Cray-style Vectors: if VL is not permitted to exceed 1 (because MAXVL is set to 1) then the above @@ -119,7 +119,7 @@ Like add and subtract, strictly speaking these need no new instructions. Keeping the shift amount within the range of the element (64 bit) a Vector bit-shift may be synthesised from a pair of shift operations and an OR, all of which are standard Scalar Power ISA instructions -that when Vectorised are exactly what is needed. +that when Vectorized are exactly what is needed. ``` void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) { @@ -286,7 +286,7 @@ setting an additional CA flag. We first cover the chain of RT2, RC2 = RA2 * RB2 + RC1 Following up to add each partially-computed row to what will become -the final result is achieved with a Vectorised big-int +the final result is achieved with a Vectorized big-int `sv.adde`. Thus, the key inner loop of Knuth's Algorithm M may be achieved in four instructions, two of which are scalar initialisation: @@ -385,13 +385,13 @@ this time using subtract instead of add. 
bool need_fixup = !ca; // for phase 3 correction
```

-In essence then the primary focus of Vectorised Big-Int divide is in
+In essence then the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat
estimate was too large and the correction is required, which is, again,
-nothing more than a Vectorised big-integer add (one instruction).
+nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.

**128/64-bit divisor**

@@ -438,7 +438,7 @@ implemented bit-wise, with all that implies.

The irony is, therefore, that attempting
to improve big-integer divide by moving to 64-bit digits in
order to take
-advantage of the efficiency of 64-bit scalar multiply when Vectorised
+advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead lock up CPU time performing a 128/64 scalar division.
With the Vector Multiply operations being critically dependent
on that `qhat` estimate, and
diff --git a/openpower/sv/bitmanip.mdwn b/openpower/sv/bitmanip.mdwn
index 555748209..82155aad2 100644
--- a/openpower/sv/bitmanip.mdwn
+++ b/openpower/sv/bitmanip.mdwn
@@ -19,8 +19,8 @@ pseudocode: [[openpower/isa/bitmanip]]

this extension amalgamates bitmanipulation primitives from many
sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER
VSX. Also included are DSP/Multimedia operations suitable for Audio/Video.
-Vectorisation and SIMD are removed: these are straight scalar (element)
-operations making them suitable for embedded applications. Vectorisation
+Vectorization and SIMD are removed: these are straight scalar (element)
+operations making them suitable for embedded applications. Vectorization
Context is provided by [[openpower/sv]].
When combined with SV, scalar variants of bitmanip operations found in @@ -109,7 +109,7 @@ For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2: for i in range(64): RT[i] = lut2(CRs{BFA}, RB[i], RA[i]) -When Vectorised with SVP64, as usual both source and destination may be +When Vectorized with SVP64, as usual both source and destination may be Vector or Scalar. *Programmer's note: a dynamic ternary lookup may be synthesised from @@ -159,7 +159,7 @@ CRB-Form: a,b = CRs[BF][i], CRs[BF][i]) if msk[i] CRs[BF][i] = lut2(CRs[BFB], a, b) -When SVP64 Vectorised any of the 4 operands may be Scalar or +When SVP64 Vectorized any of the 4 operands may be Scalar or Vector, including `BFB` meaning that multiple different dynamic lookups may be performed with a single instruction. Note that this instruction is deliberately an overwrite in order to reduce diff --git a/openpower/sv/branches.mdwn b/openpower/sv/branches.mdwn index 789ec1d6b..7aea7936f 100644 --- a/openpower/sv/branches.mdwn +++ b/openpower/sv/branches.mdwn @@ -4,7 +4,7 @@ Please note: although similar, SVP64 Branch instructions should be considered completely separate and distinct from standard scalar OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way impacted, altered, changed or modified in any way, shape or form by the -SVP64 Vectorised Variants**. +SVP64 Vectorized Variants**. It is also extremely important to note that Branches are the sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches @@ -45,7 +45,7 @@ are false. Unless Branches are aware and capable of such analysis, additional instructions would be required which perform Horizontal Cumulative -analysis of Vectorised Condition Register Fields, in order to reduce +analysis of Vectorized Condition Register Fields, in order to reduce the Vector of CR Fields down to one single yes or no decision that a Scalar-only v3.0B Branch-Conditional could cope with. 
Such instructions would be unavoidable, required, and costly by comparison to a single @@ -55,7 +55,7 @@ high priority for 3D GPU (and OpenCL-style) workloads. Given that Power ISA v3.0B is already quite powerful, particularly the Condition Registers and their interaction with Branches, there are -opportunities to create extremely flexible and compact Vectorised Branch +opportunities to create extremely flexible and compact Vectorized Branch behaviour. In addition, the side-effects (updating of CTR, truncation of VL, described below) make it a useful instruction even if the branch points to the next instruction (no actual branch). @@ -74,7 +74,7 @@ which just leaves two modes: a Great Big AND of all condition tests. Exit occurs on the first **failed** test. -Early-exit is enacted such that the Vectorised Branch does not +Early-exit is enacted such that the Vectorized Branch does not perform needless extra tests, which will help reduce reads on the Condition Register file. @@ -122,12 +122,12 @@ than testing only against zero, the option to test against one is also prudent. This introduces a new immediate field, `SNZ`, which works in conjunction with `sz`. -Vectorised Branches can be used in either SVP64 Horizontal-First or +Vectorized Branches can be used in either SVP64 Horizontal-First or Vertical-First Mode. Essentially, at an element level, the behaviour is identical in both Modes, although the `ALL` bit is meaningless in Vertical-First Mode. -It is also important to bear in mind that, fundamentally, Vectorised +It is also important to bear in mind that, fundamentally, Vectorized Branch-Conditional is still extremely close to the Scalar v3.0B Branch-Conditional instructions, and that the same v3.0B Scalar Branch-Conditional instructions are still *completely separate and @@ -206,7 +206,7 @@ Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not to occur because there will be no *successful* Condition Tests to make it happen. 
-## Vectorised CR Field numbering, and Scalar behaviour +## Vectorized CR Field numbering, and Scalar behaviour It is important to keep in mind that just like all SVP64 instructions, the `BI` field of the base v3.0B Branch Conditional instruction may be @@ -301,7 +301,7 @@ is applied to the Branch because in `ALL` mode all nonmasked bits have to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is not used, CTR may still be decremented by the total number of nonmasked elements, acting in effect as either a popcount or cntlz depending -on which mode bits are set. In short, Vectorised Branch becomes an +on which mode bits are set. In short, Vectorized Branch becomes an extremely powerful tool. **Micro-Architectural Implementation Note**: *when implemented on top @@ -532,7 +532,7 @@ why CTR is decremented (CTRtest Mode) and whether LR is updated (which is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1). Inline comments highlight the fact that the Scalar Branch behaviour and -pseudocode is still clearly visible and embedded within the Vectorised +pseudocode is still clearly visible and embedded within the Vectorized variant: ``` @@ -640,7 +640,7 @@ Pseudocode for Vertical-First Mode: CRbits = CR{SVCRf} # select predicate bit or zero/one if predicate[srcstep]: - if BRc = 1 then # CR0 vectorised + if BRc = 1 then # CR0 vectorized CR{SVCRf+srcstep} = CRbits testbit = CRbits[BI & 0b11] else if not SVRMmode.sz: diff --git a/openpower/sv/comparison_table.mdwn b/openpower/sv/comparison_table.mdwn index 029638f50..15a73c836 100644 --- a/openpower/sv/comparison_table.mdwn +++ b/openpower/sv/comparison_table.mdwn @@ -16,8 +16,8 @@ [^3]: A 2-Dimensional Scalable Vector ISA **specifically designed for the Power ISA** with both Horizontal-First and Vertical-First Modes. See [[sv/vector_isa_comparison]] [^4]: on specific operations. See [[opcode_regs_deduped]] for full list. 
Key: 2P - Twin Predication, 1P - Single-Predicate [^5]: SVP64 provides a Vector concept on top of the **Scalar** GPR, FPR and CR Fields, extended to 128 entries. -[^6]: SVP64 Vectorises Scalar ops. It is up to the **implementor** to choose (**optionally**) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops. -[^7]: big-integer add is just `sv.adde`. For optimal performance Bigint Mul and divide first require addition of two scalar operations (in turn, naturally Vectorised by SVP64). See [[sv/biginteger/analysis]] +[^6]: SVP64 Vectorizes Scalar ops. It is up to the **implementor** to choose (**optionally**) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops. +[^7]: big-integer add is just `sv.adde`. For optimal performance Bigint Mul and divide first require addition of two scalar operations (in turn, naturally Vectorized by SVP64). See [[sv/biginteger/analysis]] [^8]: LD/ST Fault-First: see [[sv/svp64/appendix]] and [ARM SVE Fault-First](https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf) [^9]: Data-dependent Fail-First: Based on LD/ST Fail-first, extended to data. Truncates VL based on failing Rc=1 test. Similar to Z80 CPIR. See [[sv/svp64/appendix]] [^10]: Predicate-result effectively turns any standard op into a type of "cmp". See [[sv/svp64/appendix]] @@ -51,6 +51,6 @@ which are power-2 based on Silicon-partner SIMD width. Non-power-2 not supported but [zero-input masking](https://www.realworldtech.com/forum/?threadid=202688&curpostid=207774) is. [^x4]: [Advanced matrix Extensions](https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions) supports BF16 and INT8 only. Separate regfile, power-of-two "tiles". Not general-purpose at all. [^b1]: Although registers may be 128-bit in NEON, SVE2, and AVX, unlike VSX there are very few (or no) actual arithmetic 128-bit operations. 
Only RVV and SVP64 have the possibility of 128-bit ops -[^m1]: Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a **Hardware-level auto-vectorisation** LOOP built-in as an extension named VVM. Classified as "Vertical-First". +[^m1]: Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a **Hardware-level auto-vectorization** LOOP built-in as an extension named VVM. Classified as "Vertical-First". [^m2]: MyISA 66000 has a CARRY register up to 64-bit. Repeated application of FMA (esp. within Auto-Vectored LOOPS) automatically and inherently creates big-int operations with zero effort. [^nc]: "Silicon-Partner" Scaling achieved through allowing same instruction to act on different regfile size and bitwidth. This catastrophically results in binary non-interoperability. diff --git a/openpower/sv/cookbook/chacha20.mdwn b/openpower/sv/cookbook/chacha20.mdwn index 34dcae479..d578c98b2 100644 --- a/openpower/sv/cookbook/chacha20.mdwn +++ b/openpower/sv/cookbook/chacha20.mdwn @@ -11,7 +11,7 @@ need to be called again. Firstly, we analyse the xchacha20 algorithm, showing what operations are performed and in what order. Secondly, two innovative features of SVP64 are described which are crucial to understanding of Simple-V -Vectorisation: Vertical-First Mode and Indexed REMAP. Then we show +Vectorization: Vertical-First Mode and Indexed REMAP. Then we show how Index REMAP eliminates the need entirely for inline-loop-unrolling, but note that in this particular algorithm REMAP is only useful for us in Vertical-First Mode. diff --git a/openpower/sv/cr_int_predication.mdwn b/openpower/sv/cr_int_predication.mdwn index f2c510202..8476b30c4 100644 --- a/openpower/sv/cr_int_predication.mdwn +++ b/openpower/sv/cr_int_predication.mdwn @@ -239,10 +239,10 @@ on the CR Field containing `BT`. 
\newpage{}

-# Vectorised versions involving GPRs
+# Vectorized versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
-to deriving the Vectorised versions of these instructions.
+to deriving the Vectorized versions of these instructions.

Normally the progression of the SV for-loop would move on to the next
register. Instead however in the scalar case these instructions
diff --git a/openpower/sv/cr_ops.mdwn b/openpower/sv/cr_ops.mdwn
index bcd5a3be1..ddb54d787 100644
--- a/openpower/sv/cr_ops.mdwn
+++ b/openpower/sv/cr_ops.mdwn
@@ -17,8 +17,8 @@ Condition Register Fields are only 4 bits wide: this presents some
interesting conceptual challenges for SVP64, which was designed
primarily for vectors of arithmetic and logical operations.
However if predicates may be bits of CR Fields it makes sense to extend
-Simple-V to cover CR Operations, especially given that Vectorised Rc=1
-may be processed by Vectorised CR Operations that usefully in turn
+Simple-V to cover CR Operations, especially given that Vectorized Rc=1
+may be processed by Vectorized CR Operations that usefully in turn
may become Predicate Masks to yet more Vector operations, like so:

```
@@ -44,7 +44,7 @@ considered to be a "co-result". Such CR Field "co-result" arithmetic
operations are firmly out of scope for this section, being covered
fully by [[sv/normal]].

-* Examples of Vectoriseable Defined Words to which this section does
+* Examples of Vectorizeable Defined Words to which this section does
apply are
- `mfcr` and `cmpi` (3 bit operands) and
- `crnor` and `crand` (5 bit operands).
@@ -107,7 +107,7 @@ Register Field generated from Rc=1 is used as the basis for the truncation
decision. However with CR-based operations that CR Field result to
be tested is provided *by the operation itself*.
-Data-dependent SVP64 Vectorised Operations involving the creation
+Data-dependent SVP64 Vectorized Operations involving the creation
or modification of a CR can require an extra two bits, which are not
available in the compact space of the SVP64 RM `MODE` Field. With the
concept of element width overrides being meaningless for CR Fields it
@@ -267,13 +267,13 @@ Adding entirely new pipelines and a new Vector CR Register file is
a much easier proposition to consider.

The prohibitions utilise the CR Field numbers implicitly to
-split out Vectorised CR operations to be considered completely
+split out Vectorized CR operations to be considered completely
separate and distinct from Scalar CR operations *even though
they both use the same binary encoding*. This does in turn
mean that at the Decode Phase it becomes necessary to examine
not only the operation (`sv.crand`, `sv.cmp`) but also
the CR Field numbers as well as whether, in the EXTRA2/3 Mode
-bits, the operands are Vectorised.
+bits, the operands are Vectorized.

A future version of Power ISA, where SVP64Single is proposed,
would in fact introduce "Conditional Execution", including
diff --git a/openpower/sv/implementation.mdwn b/openpower/sv/implementation.mdwn
index f55d5c58a..8264127df 100644
--- a/openpower/sv/implementation.mdwn
+++ b/openpower/sv/implementation.mdwn
@@ -183,7 +183,7 @@ unit tests:
* Condition Registers. see note below
* FPR (if present)

-When Rc=1 is encountered in an SVP64 Context the destination is different (TODO) i.e. not CR0 or CR1. Implicit Rc=1 Condition Registers are still Vectorised but do **not** have EXTRA2/3 spec adjustments. The only part if the EXTRA2/3 spec that is observed and respected is whether the CR is Vectorised (isvec).
+When Rc=1 is encountered in an SVP64 Context the destination is different (TODO) i.e. not CR0 or CR1. Implicit Rc=1 Condition Registers are still Vectorized but do **not** have EXTRA2/3 spec adjustments.
The only part of the EXTRA2/3 spec that is observed and respected is whether the CR is Vectorized (isvec).

## Increasing register file sizes

@@ -251,10 +251,10 @@ TODO

*
*

-## Vectorised Branches
+## Vectorized Branches

TODO [[sv/branches]]

-## Vectorised LD/ST
+## Vectorized LD/ST

TODO [[sv/ldst]]
diff --git a/openpower/sv/int_fp_mv/appendix.mdwn b/openpower/sv/int_fp_mv/appendix.mdwn
index 02a002e84..b94fadaa2 100644
--- a/openpower/sv/int_fp_mv/appendix.mdwn
+++ b/openpower/sv/int_fp_mv/appendix.mdwn
@@ -2,7 +2,7 @@

# SVP64 polymorphic elwidth overrides

-SimpleV, the Draft Cray-style Vectorisation for OpenPOWER, may
+SimpleV, the Draft Cray-style Vectorization for OpenPOWER, may
independently override both or either of the source or destination
register bitwidth in the base operation used to create the Vector
operation. In the case of IEEE754 FP operands this gives an
diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn
index 98a194239..cbebc0032 100644
--- a/openpower/sv/ldst.mdwn
+++ b/openpower/sv/ldst.mdwn
@@ -33,7 +33,7 @@ Vector Operations, then in order to keep the ALUs 100% occupied
the Memory infrastructure (and the ISA itself) correspondingly
needs Vector Memory Operations as well.

-Vectorised Load and Store also presents an extra dimension (literally)
+Vectorized Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters: not even the complex
Addressing Modes of the 68,000 or S/360 resemble Vector Load/Store.
@@ -46,7 +46,7 @@ instructions)

## Modes overview

-Vectorisation of Load and Store requires creation, from scalar operations,
+Vectorization of Load and Store requires creation, from scalar operations,
of a number of different modes:

* **fixed aka "unit" stride** - contiguous sequence with no gaps
@@ -192,17 +192,17 @@ Vector Indexed Strided Mode is qualified as follows:

    svctx.ldstmode = elementstride
```

-A summary of the effect of Vectorisation of src or dest:
+A summary of the effect of Vectorization of src or dest:

```
 imm(RA)  RT.v  RA.v  no stride allowed
 imm(RA)  RT.s  RA.v  no stride allowed
 imm(RA)  RT.v  RA.s  stride-select allowed
- imm(RA)  RT.s  RA.s  not vectorised
+ imm(RA)  RT.s  RA.s  not vectorized
 RA,RB  RT.v  {RA|RB}.v  Standard Indexed
 RA,RB  RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
 RA,RB  RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
- RA,RB  RT.s  {RA&RB}.s  not vectorised (scalar identity)
+ RA,RB  RT.s  {RA&RB}.s  not vectorized (scalar identity)
```

Signed Effective Address computation is only relevant for Vector Indexed
@@ -229,7 +229,7 @@ peripheral reads, stopping at the first NULL-terminated character
and truncating VL to that point. No branch is needed to issue that
large burst of LDs, which may be valuable in Embedded scenarios.

-## Vectorisation of Scalar Power ISA v3.0B
+## Vectorization of Scalar Power ISA v3.0B

Scalar Power ISA Load/Store operations may be seen from
[[isa/fixedload]] and [[isa/fixedstore]] pseudocode to be of the form:
@@ -365,7 +365,7 @@ contexts, potentially causing confusion.

named LD/ST Indexed**.
Whilst it may be costly in terms of register reads to allow REMAP Indexed -Mode to be applied to any Vectorised LD/ST Indexed operation such as +Mode to be applied to any Vectorized LD/ST Indexed operation such as `sv.ld *RT,RA,*RB`, or even misleadingly labelled as redundant, firstly the strict application of the RISC Paradigm that Simple-V follows makes it awkward to consider *preventing* the application of Indexed REMAP to @@ -407,7 +407,7 @@ the LD/ST that would otherwise have caused an exception is *required* to be cancelled. Additionally an implementor may choose to truncate VL for any arbitrary reason *except for the very first*. -ffirst LD/ST to multiple pages via a Vectorised Index base is +ffirst LD/ST to multiple pages via a Vectorized Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting speculative feedback on which pages would fail. Therefore Vector Indexed LD/ST is prohibited @@ -555,7 +555,7 @@ Load-Reservation and Store-Conditional are required to be executed in pairs. By contrast, in Vertical-First Mode it is in fact possible to issue -the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is +the pairs, and consequently allowing Vectorized Data-Dependent Fail-First is useful. Programmer's note: Care should be taken when VL is truncated in @@ -565,7 +565,7 @@ Vertical-First Mode. Although Rc=1 on LD/ST is a rare occurrence at present, future versions of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and -with the SVP64 Vectorisation Prefixing being itself a RISC-paradigm that +with the SVP64 Vectorization Prefixing being itself a RISC-paradigm that is itself fully-independent of the Scalar Suffix Defined Words, prohibiting the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST operations is not strategically sound. 
@@ -732,7 +732,7 @@ NEON covers this as shown in the diagram below: REMAP easily covers this capability, and with dest elwidth overrides and saturation may do so with built-in conversion that would normally -require additional width-extension, sign-extension and min/max Vectorised +require additional width-extension, sign-extension and min/max Vectorized instructions as post-processing stages. Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes diff --git a/openpower/sv/ldst/discussion.mdwn b/openpower/sv/ldst/discussion.mdwn index a6a113e3d..b95220234 100644 --- a/openpower/sv/ldst/discussion.mdwn +++ b/openpower/sv/ldst/discussion.mdwn @@ -2,7 +2,7 @@ this section covers assembly notation for the immediate and indexed LD/ST. the summary is that in immediate mode for LD it is not clear that if the -destination register is Vectorised `RT.v` but the source `imm(RA)` is scalar +destination register is Vectorized `RT.v` but the source `imm(RA)` is scalar the memory being read is *still a vector load*, known as "unit or element strides". This anomaly is made clear with the following notation: @@ -39,7 +39,7 @@ permutations of vector selection, to identify above asm-syntax: sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2 mem@r#2 +0 ... +offs ... +offs*2 destreg r# r#+1 r#+2 - imm(RA) RT.s RA.s not vectorised + imm(RA) RT.s RA.s not vectorized sv.ld r#, ofst(r#2) indexed mode: @@ -55,4 +55,4 @@ indexed mode: RA,RB RT.s RA.v RB.v RA,RB RT.s RA.s RB.v RA,RB RT.s RA.v RB.s - RA,RB RT.s RA.s RB.s not vectorised + RA,RB RT.s RA.s RB.s not vectorized diff --git a/openpower/sv/ls010/hypothetical_addi.mdwn b/openpower/sv/ls010/hypothetical_addi.mdwn index 429f899d1..66e52a458 100644 --- a/openpower/sv/ls010/hypothetical_addi.mdwn +++ b/openpower/sv/ls010/hypothetical_addi.mdwn @@ -18,8 +18,8 @@ extension to `addi` / `paddi` (Public v3.1 I 3.3.9 p76). 
* Thirdly, just because of the PO9-Prefix it is prohibited
  to put an entirely different instruction into the Suffix position.
  If `{PO14}` as a 32-bit instruction is defined as "addi", then
-  it is **required** that `{PO9}-{PO14}` **be** a Vectorised "addi",
-  **not** a Vectorised multiply.
+  it is **required** that `{PO9}-{PO14}` **be** a Vectorized "addi",
+  **not** a Vectorized multiply.
* Fourthly, where PO1-Prefixing of operand fields (often
  resulting in "split field" redefinitions such as `si0||si1`)
  is an arbitrary manually-hand-crafted procedure,
diff --git a/openpower/sv/microcontroller_power_isa_for_ai.mdwn b/openpower/sv/microcontroller_power_isa_for_ai.mdwn
index 0b09bff42..c174be2f0 100644
--- a/openpower/sv/microcontroller_power_isa_for_ai.mdwn
+++ b/openpower/sv/microcontroller_power_isa_for_ai.mdwn
@@ -61,7 +61,7 @@ let that sink in a moment because the implications are startling:
[and anticipate someone in the future to define a 128-bit variant to match RISC-V RV128].
-bear in mind that SVP64 *has* to have Scalar Operations first, because by design and by definition *only Scalar operations may be Vectorised*. SVP64 *DOES NOT* add *ANY* Vector Instructions. SVP64 is a generic loop around *Scalar* operations and it us up to the Architecture to take advantage of that, at the back-end.
+bear in mind that SVP64 *has* to have Scalar Operations first, because by design and by definition *only Scalar operations may be Vectorized*. SVP64 *DOES NOT* add *ANY* Vector Instructions. SVP64 is a generic loop around *Scalar* operations and it is up to the Architecture to take advantage of that, at the back-end.
without SVP64 Sub-Looping it would on the face of it seem absolutely mental and a total waste of time and resources to define an 8 or 16 bit General-Purpose ISA in the year 2022 until you recall that: @@ -88,7 +88,7 @@ anyone who has tried either CUDA, 3D Shader programs, deep or wide SIMD Programm (in particular, anyone who remembers how hard programming the Cell Processor turned out to be will be having that familiar "lightbulb moment" right about now) -more than that: what if those 8 and 16 bit cores had a Supercomputing-class Vectorisation option in the ISA, and there were implementations out there with back-end ALUs that could perform 64 or 128 8 or 16 bit operations per clock cycle? +more than that: what if those 8 and 16 bit cores had a Supercomputing-class Vectorization option in the ISA, and there were implementations out there with back-end ALUs that could perform 64 or 128 8 or 16 bit operations per clock cycle? Quantity several thousand per processor, all of them capable of adapting to run massive AI number crunching or (at lower IPC than "normal" processors) general-purpose compute? diff --git a/openpower/sv/mv.swizzle.mdwn b/openpower/sv/mv.swizzle.mdwn index 2d413b6a3..3dc801822 100644 --- a/openpower/sv/mv.swizzle.mdwn +++ b/openpower/sv/mv.swizzle.mdwn @@ -28,7 +28,7 @@ contiguous array of vec3 (XYZ) may only have 2 elements (ZY) swizzle-copied to a contiguous array of vec2. A contiguous array of vec2 sources may have multiple of each vec2 elements (XY) copied to a contiguous -vec4 array (YYXX or XYXX). For this reason, *when Vectorised* +vec4 array (YYXX or XYXX). For this reason, *when Vectorized* Swizzle Moves support independent subvector lengths for both source and destination. @@ -106,7 +106,7 @@ be an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar ISA this not practical. A compromise is to cut the registers required by half, placing it on-par with `lq`, `stq` and Indexed Load-with-update instructions. 
-When part of the Scalar Power ISA (not SVP64 Vectorised) +When part of the Scalar Power ISA (not SVP64 Vectorized) mv.swiz and fmv.swiz operate on four 32-bit quantities, reducing this instruction to a feasible 2-in, 2-out pairs of 64-bit registers: @@ -131,15 +131,15 @@ Also, making life easier, RT and RA are only permitted to be even as in `lq` and `stq`. Scalar Swizzle instructions must be atomically indivisible: an Exception or Interrupt may not occur during the Moves. -Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant +Note that unlike the Vectorized variant, when `RT=RA` the Scalar variant *must* buffer (read) both 64-bit RA registers before writing to the RT pair (in an Out-of-Order Micro-architecture, both of the register pair must be "in-flight"). This ensures that register file corruption does not occur. -**SVP64 Vectorised** +**SVP64 Vectorized** -Vectorised Swizzle may be considered to +Vectorized Swizzle may be considered to contain an extended static predicate mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by the static predication capability, the destination @@ -147,7 +147,7 @@ subvector length can be *different* from the source subvector length, and consequently the destination subvector length is encoded into the Swizzle. -When Vectorised, given the use-case is for a High-performance GPU, +When Vectorized, given the use-case is for a High-performance GPU, the fundamental assumption is that Micro-coding or other technique will be deployed in hardware to issue multiple Scalar MV operations and @@ -159,7 +159,7 @@ quantities as the default is lifted on `sv.mv.swiz`. Additionally, in order to make life easier for implementers, some of whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding, the usual strict Element-level Program Order is relaxed. 
-An overlap between all and any Vectorised +An overlap between all and any Vectorized sources and destination Elements for the entirety of the Vector Loop `0..VL-1` is `UNDEFINED` behaviour. @@ -236,7 +236,7 @@ than operating on the entire vec2/3/4 together would violate that expectation. The exceptions to this, explained later, are when Pack/Unpack is enabled. -**Effect of Saturation on Vectorised Swizzle** +**Effect of Saturation on Vectorized Swizzle** A useful convenience for pixel data is to be able to insert values 0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore, @@ -250,7 +250,7 @@ zero because there is no encoding space to select between -1, 0 and 1, and # Pack/Unpack Mode: -It is possible to apply Pack and Unpack to Vectorised +It is possible to apply Pack and Unpack to Vectorized swizzle moves. The interaction requires specific explanation because it involves the separate SUBVLs (with destination SUBVL being separate). Key to understanding is that the diff --git a/openpower/sv/mv.vec.mdwn b/openpower/sv/mv.vec.mdwn index 9190fcdc7..1f78f939f 100644 --- a/openpower/sv/mv.vec.mdwn +++ b/openpower/sv/mv.vec.mdwn @@ -6,7 +6,7 @@ In the SIMD VSX set, section 6.8.1 and 6.8.2 p254 of v3.0B has a series of pack also exist. In SVP64, Pack and Unpack are achieved *in the abstract* for application on *all* -Vectoriseable instructions. +Vectorizeable instructions. * See * diff --git a/openpower/sv/normal.mdwn b/openpower/sv/normal.mdwn index e104522da..f2ead2270 100644 --- a/openpower/sv/normal.mdwn +++ b/openpower/sv/normal.mdwn @@ -22,7 +22,7 @@ others are Vector-based (mapreduce, fail-on-first). [[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately: the following Modes apply to Arithmetic and Logical SVP64 operations: -* **simple** mode is straight vectorisation. No augmentations: the +* **simple** mode is straight vectorization. No augmentations: the vector comprises an array of independently created results. 
* **ffirst** or data-dependent fail-on-first: see separate section. The vector may be truncated depending on certain criteria. @@ -102,11 +102,11 @@ XER.SO is **ignored** completely and is **not** brought into play here. The CR overflow bit is therefore simply set to zero if saturation did not occur, and to one if it did. This behaviour (ignoring XER.SO) is actually optional in the SFFS Compliancy Subset: for SVP64 it is made -mandatory *but only on Vectorised instructions*. +mandatory *but only on Vectorized instructions*. Note also that saturate on operations that set OE=1 must raise an Illegal Instruction due to the conflicting use of the CR.so bit for storing -if saturation occurred. Vectorised Integer Operations that produce a +if saturation occurred. Vectorized Integer Operations that produce a Carry-Out (CA, CA32): these two bits will be `UNDEFINED` if saturation is also requested. @@ -209,7 +209,7 @@ to include the terminating zero. In CR-based data-driven fail-on-first there is only the option to select and test one bit of each CR (just as with branch BO). For more complex -tests this may be insufficient. If that is the case, a vectorised crop +tests this may be insufficient. If that is the case, a vectorized crop such as crand, cror or [[sv/cr_int_predication]] crweirder may be used, and ffirst applied to the crop instead of to the arithmetic vector. Note that crops are covered by the [[sv/cr_ops]] Mode format. @@ -245,7 +245,7 @@ Two extremely important aspects of ffirst are: * CR-based data-dependent ffirst on the other hand **can** set VL equal to zero. When VL is set zero due to the first element failing the CR bit-test, all subsequent - vectorised operations are effectively `nops` which is + vectorized operations are effectively `nops` which is *precisely the desired and intended behaviour*. 
The second crucial aspect, compared to LDST Ffirst:
diff --git a/openpower/sv/overview.mdwn b/openpower/sv/overview.mdwn
index a17afdcb0..794155bdc 100644
--- a/openpower/sv/overview.mdwn
+++ b/openpower/sv/overview.mdwn
@@ -12,7 +12,7 @@ This document provides an overview and introduction as to why SV (a
Links:
* This page: [http://libre-soc.org/openpower/sv/overview](http://libre-soc.org/openpower/sv/overview)
-* [FOSDEM2021 SimpleV for Power ISA](https://fosdem.org/2021/schedule/event/the_libresoc_project_simple_v_vectorisation/)
+* [FOSDEM2021 SimpleV for Power ISA](https://fosdem.org/2021/schedule/event/the_libresoc_project_simple_v_vectorisation/)
* FOSDEM2021 presentation
* [[discussion]] and [bugreport](https://bugs.libre-soc.org/show_bug.cgi?id=556)
@@ -85,7 +85,7 @@ registers from 32 to 64 bit). The fundamentals are (just like x86 "REP"):
* The Program Counter (PC) gains a "Sub Counter" context (Sub-PC)
-* Vectorisation pauses the PC and runs a Sub-PC loop from 0 to VL-1
+* Vectorization pauses the PC and runs a Sub-PC loop from 0 to VL-1
(where VL is Vector Length)
* The [[Program Order]] of "Sub-PC" instructions must be preserved,
just as is expected of instructions ordered by the PC.
@@ -215,14 +215,14 @@ on the standard register file, just with a loop. Scalar happens to set
that loop size to one.
The important insight from the above is that, strictly speaking, Simple-V
-is not really a Vectorisation scheme at all: it is more of a hardware
+is not really a Vectorization scheme at all: it is more of a hardware
ISA "Compression scheme", allowing as it does for what would normally
require multiple sequential instructions to be replaced with just one.
This is where the rule that Program Order must be preserved in Sub-PC
execution derives from.
However in other ways, which will emerge below, the "tagging" concept presents an opportunity to include features definitely not common outside of Vector ISAs, and in that regard it's -definitely a class of Vectorisation. +definitely a class of Vectorization. ## Register "tagging" @@ -233,7 +233,7 @@ is encoded in two to three bits, depending on the instruction. The reason for using so few bits is because there are up to *four* registers to mark in this way (`fma`, `isel`) which starts to be of concern when there are only 24 available bits to specify the entire SV -Vectorisation Context. In fact, for a small subset of instructions it +Vectorization Context. In fact, for a small subset of instructions it is just not possible to tag every single register. Under these rare circumstances a tag has to be shared between two registers. @@ -261,8 +261,8 @@ Readers familiar with the Power ISA will know of Rc=1 operations that create an associated post-result "test", placing this test into an implicit Condition Register. The original researchers who created the POWER ISA chose CR0 for Integer, and CR1 for Floating Point. These *also become -Vectorised* - implicitly - if the associated destination register is -also Vectorised. This allows for some very interesting savings on +Vectorized* - implicitly - if the associated destination register is +also Vectorized. This allows for some very interesting savings on instruction count due to the very same CR Vectors being predication masks. # Adding single predication @@ -799,12 +799,12 @@ The only one missing from the list here, because it is non-sequential, is VGATHER (and VSCATTER): moving registers by specifying a vector of register indices (`regs[rd] = regs[regs[rs]]` in a loop). This one is tricky because it typically does not exist in standard scalar ISAs. -If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a +If it did it would be called [[sv/mv.x]]. Once Vectorized, it's a VGATHER/VSCATTER. 
# Exception-based Fail-on-first -One of the major issues with Vectorised LD/ST operations is when a +One of the major issues with Vectorized LD/ST operations is when a batch of LDs cross a page-fault boundary. With considerable resources being taken up with in-flight data, a large Vector LD being cancelled or unable to roll back is either a detriment to performance or can cause @@ -887,7 +887,7 @@ implementations may cause pipeline stalls. This is a relatively new addition to SVP64 under development as of July 2021. Where Horizontal-First is the standard Cray-style for-loop, Vertical-First typically executes just the **one** scalar element -in each Vectorised operation. That element is selected by srcstep +in each Vectorized operation. That element is selected by srcstep and dststep *neither of which are changed as a side-effect of execution*. Illustrating this in pseodocode, with a branch/loop. To create loops, a new instruction `svstep` must be called, @@ -952,12 +952,12 @@ with conceptual sub-loops, a Scalar ISA can be turned into a Vector one, by embedding Scalar instructions - unmodified - into a Vector "context" using "Prefixing". With careful thought, this technique reaches 90% par with good Vector ISAs, increasing to 95% with the addition of a -mere handful of additional context-vectoriseable scalar instructions +mere handful of additional context-vectorizeable scalar instructions ([[sv/mv.x]] amongst them). What is particularly cool about the SV concept is that custom extensions and research need not be concerned about inventing new Vector instructions and how to get them to interact with the Scalar ISA: they are effectively one and the same. Any new instruction added at the Scalar level is -inherently and automatically Vectorised, following some simple rules. +inherently and automatically Vectorized, following some simple rules. 
diff --git a/openpower/sv/po9_encoding.mdwn b/openpower/sv/po9_encoding.mdwn index 3ad335b93..f0489c359 100644 --- a/openpower/sv/po9_encoding.mdwn +++ b/openpower/sv/po9_encoding.mdwn @@ -25,7 +25,7 @@ Anything not falling into those five categories is termed "Unvectorizable". **Definition of Horizontal-First:** -Normal Cray-style Vectorisation, designated Horizontal-First, performs +Normal Cray-style Vectorization, designated Horizontal-First, performs element-level operations (often in parallel) before moving in the usual fashion to the next instruction. The term "Horizontal-First" stems from naturally visually listing program instructions vertically, @@ -116,7 +116,7 @@ as an inviolate hard rule governing Primary Opcode 9 that may not be revoked under any circumstances. A useful way to think of this is that the Prefix Encoding is, like the 8086 REP instruction, an independent 32-bit Defined Word. The only semi-exceptions are the Post-Increment -Mode of LD/ST-Update and Vectorised Branch-Conditional.* +Mode of LD/ST-Update and Vectorized Branch-Conditional.* Note a particular consequence of the application of the above paragraph: due to the fact that the Prefix Encodings are independent, **by @@ -132,7 +132,7 @@ area, just as *all* Scalar Defined Words are. Encoding spaces and their potential are illustrated: -| Encoding |Available bits|Scalar|Vectoriseable | SVP64Single |PO1-Prefixable | +| Encoding |Available bits|Scalar|Vectorizeable | SVP64Single |PO1-Prefixable | |----------|--------------|------|--------------|--------------|---------------| |EXT000-063| 32 | yes | yes |yes |yes | |EXT100-163| 64 | yes | no |no |not twice | @@ -164,10 +164,10 @@ Notes: SVP64Single. * Considerable care is needed both on Architectural Resource Allocation as well as instruction design itself. 
All new Scalar instructions automatically - and inherently must be designed taking their Vectoriseable potential into + and inherently must be designed taking their Vectorizeable potential into consideration *including VSX* in future. * Once an instruction is allocated - in an Unvectorizable area it can never be Vectorised without providing + in an Unvectorizable area it can never be Vectorized without providing an entirely new Encoding. [[!tag standards]] diff --git a/openpower/sv/predication.mdwn b/openpower/sv/predication.mdwn index 53a56d22d..da0e42742 100644 --- a/openpower/sv/predication.mdwn +++ b/openpower/sv/predication.mdwn @@ -35,9 +35,9 @@ Implementation note: even in in-order microarchitectures it is strongly adviseab XER.SO (sticky overflow) is known to cause massive slowdown in pretty much every microarchitecture and it definitely compromises the performance of out-of-order systems. The reason is that it introduces a READ-MODIFY-WRITE cycle between XER.SO and CR0 (which contains a copy of the SO field after inclusion of the overflow). The result and source registers branch off as RaW and WaR hazards from this RMW chain. -This is even before predication or vectorisation were to be added on top, i.e. these are existing weaknesses in OpenPOWER as a scalar ISA. +This is even before predication or vectorization were to be added on top, i.e. these are existing weaknesses in OpenPOWER as a scalar ISA. -As well-known weaknesses that compromise performance, very little use of OE=1 is actually made, outside of unit tests and Conformance Tests. Consequently it makes very little sense to continue to propagate OE=1 in the Vectorisation context of SV. +As well-known weaknesses that compromise performance, very little use of OE=1 is actually made, outside of unit tests and Conformance Tests. Consequently it makes very little sense to continue to propagate OE=1 in the Vectorization context of SV. 
### Vector Chaining @@ -66,7 +66,7 @@ They also involve adding extra scalar bitmanip opcodes, such that by utilising In addition those scalar 64-bit bitmanip operations, although some of them are obscure and unusual in the scalar world, do actually have practical applications outside of a vector context. -(Hilariously and confusingly those very same scalar bitmanip opcodes may themselves be SV-vectorised however with VL only being up to 64 elements it is not anticipated that SV-bitmanip would be used to generate up to 64 bit predicate masks, when a single 64 bit scalar operation will suffice). +(Hilariously and confusingly those very same scalar bitmanip opcodes may themselves be SV-vectorized however with VL only being up to 64 elements it is not anticipated that SV-bitmanip would be used to generate up to 64 bit predicate masks, when a single 64 bit scalar operation will suffice). The summary is that adding a full set special vector opcodes just for manipulating predicate masks and being able to transfer them to other regfiles (a la mfcr) is anomalous, costly, and unnecessary. @@ -91,7 +91,7 @@ datapath to the relevant FUs. This could be reduced by adding yet another type of special virtual register port or datapath that masks out the required predicate bits closer to the regfile. -another disadvantage is that the CR regfile needs to be expanded from 8x 4bit CRs to a minimum of 64x or preferably 128x 4-bit CRs. Beyond that they can be transferred using vectorised mfcr and mtcrf into INT regs. this is a huge number of CR regs, each of which will need a DM column in the FU-REGs Matrix. however this cost can be mitigated through regfile cacheing, bringing FU-REGs column numbers back down to "sane". +another disadvantage is that the CR regfile needs to be expanded from 8x 4bit CRs to a minimum of 64x or preferably 128x 4-bit CRs. Beyond that they can be transferred using vectorized mfcr and mtcrf into INT regs. 
this is a huge number of CR regs, each of which will need a DM column in the FU-REGs Matrix. however this cost can be mitigated through regfile cacheing, bringing FU-REGs column numbers back down to "sane". ### Predicated SIMD HI32-LO32 FUs @@ -163,17 +163,17 @@ Implementation-wise just like in the CR-based case a special regfile port could The disadvantages appear on closer analysis: * Unlike the "full" CR port (which reads 8x CRs CR0-7 in one hit) trying the same trick on the scalar integer regfile, to obtain just 8 predicate bits (each being an LSB of a given 64 bit scalar int), would require a whopping 8x64bit set of reads to the INT regfile instead of a scant 1x32bit read. Resource-wise, then, this idea is expensive. -* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised-mfcr) are more challenging and costly. Rather than use vectorised mfcr, complex transfers of the LSBs into a single scalar int are required. +* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorized-mfcr) are more challenging and costly. Rather than use vectorized mfcr, complex transfers of the LSBs into a single scalar int are required. In a "normal" Vector ISA this would be solved by adding opcodes that perform the kinds of bitmanipulation operations normally needed for predicate masks, as specialist operations *on* those masks. However for SV the rule has been set: "no unnecessary additional Vector Instructions" because it is possible to use existing PowerISA scalar bitmanip opcodes to cover the same job. The problem is that vectors of LSBs need to be transferred *to* scalar int regs, bitmanip operations carried out, *and then transferred back*, which is exceptionally costly. 
-On balance this is a less favourable option than vectorising CRs +On balance this is a less favourable option than vectorizing CRs ## Scalar (single) integer as predicate, with one DM row -This idea has merit in that to perform predicate bitmanip operations the predicate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away. Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily. +This idea has merit in that to perform predicate bitmanip operations the predicate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away. Vectorized mfcr can be used to get CMP results or Vectorized Rc=1 CRs into the scalar INT, easily. This idea has several disadvantages. diff --git a/openpower/sv/propagation.mdwn b/openpower/sv/propagation.mdwn index 1d6e8242a..a8d840993 100644 --- a/openpower/sv/propagation.mdwn +++ b/openpower/sv/propagation.mdwn @@ -174,7 +174,7 @@ to be algorithmically arbitrarily remapped via 1D, 2D or 3D reshaping. The amount of information needed to do so is however quite large: consequently it is only practical to apply indirectly, via Context propagation. Vectors may be remapped such that Matrix multiply of any arbitrary size -is performed in one Vectorised `fma` instruction as long as the total +is performed in one Vectorized `fma` instruction as long as the total number of elements is less than 64 (maximum for VL). Additionally, in a fashion known as "Structure Packing" in NEON and RVV, it may be used to perform "zipping" and "unzipping" of diff --git a/openpower/sv/remap/appendix.mdwn b/openpower/sv/remap/appendix.mdwn index 39a9ec8c9..80fcb41ed 100644 --- a/openpower/sv/remap/appendix.mdwn +++ b/openpower/sv/remap/appendix.mdwn @@ -501,7 +501,7 @@ used even there. 
otherwise usual `0..VL-1` hardware for-loop * `svremap` to set which registers a given reordering is to apply to (RA, RT etc) -* `sv.{instruction}` where any Vectorised register marked by `svremap` +* `sv.{instruction}` where any Vectorized register marked by `svremap` will have its ordering REMAPPED according to the schedule set by `svshape`. diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn index 46f60b88e..ea0942180 100644 --- a/openpower/sv/rfc/ls001.mdwn +++ b/openpower/sv/rfc/ls001.mdwn @@ -11,9 +11,9 @@ * This proposal is to extend the Power ISA with an Abstract RISC-Paradigm -Vectorisation Concept that may be orthogonally applied to **all and any** +Vectorization Concept that may be orthogonally applied to **all and any** suitable Scalar instructions, present and future, in the Scalar Power ISA. -The Vectorisation System is called +The Vectorization System is called ["Simple-V"](https://libre-soc.org/openpower/sv/) and the Prefix Format is called ["SVP64"](https://libre-soc.org/openpower/sv/). @@ -27,7 +27,7 @@ Simple-V is designed for Embedded Scenarios right the way through Audio/Visual DSPs to 3D GPUs and Supercomputing. As it does **not** add actual Vector Instructions, relying solely and exclusively on the **Scalar** ISA, it is **Scalar** instructions that need to be added to -the **Scalar** Power ISA before Simple-V may orthogonally Vectorise them. +the **Scalar** Power ISA before Simple-V may orthogonally Vectorize them. The goal of RED Semiconductor Ltd, an OpenPOWER Stakeholder, is to bring to market mass-volume general-purpose compute @@ -58,14 +58,14 @@ at the same time*. 
It is also critical to note that Simple-V **does not modify the Scalar Power ISA**, that **only** Scalar words may be -Vectorised, and that Vectorised instructions are **not** permitted to be +Vectorized, and that Vectorized instructions are **not** permitted to be different from their Scalar words (`addi` must use the same Word encoding as `sv.addi`, and any new Prefixed instruction added **must** also be added as Scalar). -The sole semi-exception is Vectorised +The sole semi-exception is Vectorized Branch Conditional, in order to provide the usual Advanced Branching capability present in every Commercial 3D GPU ISA, but it -is the *Vectorised* Branch-Conditional that is augmented, not Scalar +is the *Vectorized* Branch-Conditional that is augmented, not Scalar Branch. # Basic principle @@ -195,7 +195,7 @@ of the Management Operations are anticipated for a future revision. **Simple-V SPRs** -* **SVSTATE** - 64-bit Vectorisation State sufficient for Precise-Interrupt +* **SVSTATE** - 64-bit Vectorization State sufficient for Precise-Interrupt Context-switching and no adverse latency, it may be considered to be a "Sub-PC" and as such absolutely must be treated with the same respect and priority as MSR and PC. @@ -305,7 +305,7 @@ decode performed above the Vector operations may then easily be passed downstream in a fully forward-progressive piplined fashion to independent parallel units for further analysis. -**Vectorised Branch-Conditional** +**Vectorized Branch-Conditional** As mentioned in the introduction this is the one sole instruction group that @@ -313,7 +313,7 @@ is different pseudocode from its scalar equivalent. However even there its various Mode bits and options can be set such that in the degenerate case the behaviour becomes identical to Scalar Branch-Conditional. 
-The two additional Modes within Vectorised Branch-Conditional, both of +The two additional Modes within Vectorized Branch-Conditional, both of which may be combined, are `CTR-Mode` and `VLI-Test` (aka "Data Fail First"). CTR Mode extends the way that CTR may be decremented unconditionally within Scalar Branch-Conditional, and not only makes it conditional but @@ -334,7 +334,7 @@ Also `SVLR` is introduced, which is a parallel twin of `LR`, and saving and restoring of LR and SVLR may be deferred until the final decision as to whether to branch. In this way `sv.bclrl` does not corrupt `LR`. -Vectorised Branch-Conditional due to its side-effects (e.g. reducing CTR +Vectorized Branch-Conditional due to its side-effects (e.g. reducing CTR or truncating VL) has practical uses even if the Branch is deliberately set to the next instruction (CIA+8). For example it may be used to reduce CTR by the number of bits set in a GPR, if that GPR is given as the predicate @@ -361,7 +361,7 @@ as it is the Base Address. One confusing thing is the unfortunate naming of LD/ST Indexed and REMAP Indexed: some care is taken in the spec to discern the two. LD/ST Indexed is Scalar `EA=RA+RB` (where **either** RA or RB -may be marked as Vectorised), where obviously the order in which +may be marked as Vectorized), where obviously the order in which that Vector of RA (or RB) is read in the usual linear sequential fashion. REMAP Indexed affects the **order** in which the Vector of RA (or RB) is accessed, @@ -478,7 +478,7 @@ it is easier to then conceptualise VF vs HF Mode: through **registers** (or, register *elements* in traditional Cray-Vector ISAs) in full before moving on to the next *instruction*. -Mitch Alsup's VVM Extension is a form of hardware-level auto-vectorisation +Mitch Alsup's VVM Extension is a form of hardware-level auto-vectorization based around Zero-Overhead Loops. 
Using a Variable-Length Encoding all loop-invariant registers are "tagged" such that the Hazard Management Engine may perform optimally and do less work in automatically identifying @@ -490,7 +490,7 @@ The biggest advantage inherent in Vertical-First is that it is very easy to introduce into compilers, because all looping, as far as programs is concerned, remains expressed as *Scalar assembler*.[^autovec] Whilst Mitch Alsup's -VVM biggest strength is its hardware-level auto-vectorisation +VVM biggest strength is its hardware-level auto-vectorization but is limited in its ability to call functions, Simple-V's Vertical-First provides explicit control over the parallelism ("hphint")[^hphint] and also allows for full state to be stored/restored @@ -500,7 +500,7 @@ nested VF Loops. Simple-V Vertical-First Looping requires an explicit instruction to move `SVSTATE` regfile offsets forward: `svstep`. An early version of -Vectorised +Vectorized Branch-Conditional attempted to merge the functionality of `svstep` into `sv.bc`: it became CISC-like in its complexity and was quickly reverted. @@ -528,7 +528,7 @@ REMAP Schedules, such as Complex Number FFTs, by using Scalar intermediary temporary registers to compute results that have a Vector source or destination or both. Contrast this with a Standard Horizontal-First Vector ISA where the only -way to perform Vectorised Complex Arithmetic would be to add Complex Vector +way to perform Vectorized Complex Arithmetic would be to add Complex Vector Arithmetic operations, because due to the Horizontal (element-level) progression there is no way to utilise intermediary temporary (scalar) variables.[^complex] @@ -599,10 +599,10 @@ Note critically that: be required. 
The entire 24-bits is **required** for the abstracted Hardware-Looping Concept **even when these 24-bits are zero** * Any Scalar 64-bit instruction (regardless of how it is encoded) is unsafe to - then Vectorise because this creates the situation of Prefixed-Prefixed, + then Vectorize because this creates the situation of Prefixed-Prefixed, resulting in deep complexity in Hardware Decode at a critical juncture, as well as introducing 96-bit instructions. -* **All** of these Scalar instructions are candidates for Vectorisation. +* **All** of these Scalar instructions are candidates for Vectorization. Thus none of them may be 64-bit-Scalar-only. **Minor Opcodes to fit candidates above** @@ -656,18 +656,18 @@ legal and illegal allocations are given later. The primary point is that once an instruction is defined in Scalar 32-bit form its corresponding space **must** be reserved in the SVP64 area with the exact same 32-bit form, even if that instruction -is "Unvectoriseable" (`sc`, `sync`, `rfid` and `mtspr` for example). +is "Unvectorizeable" (`sc`, `sync`, `rfid` and `mtspr` for example). Instructions may **not** be added in the Vector space without also -being added in the Scalar space, and vice-versa, *even if Unvectoriseable*. +being added in the Scalar space, and vice-versa, *even if Unvectorizeable*. This is extremely important because the worst possible situation is if a conflicting Scalar instruction is added by another Stakeholder, -which then turns out to be Vectoriseable: it would then have to be +which then turns out to be Vectorizeable: it would then have to be added to the Vector Space with a *completely different Defined Word* and things go rapidly downhill in the Decode Phase from there. 
Setting a simple inviolate rule helps avoid this scenario but
does need to be borne in mind when discussing potential allocation
-schemes, as well as when new Vectoriseable Opcodes are proposed
+schemes, as well as when new Vectorizeable Opcodes are proposed
for addition by future RFCs: the opcodes **must** be uniformly
added to Scalar **and** Vector spaces, or added in one and reserved
in the other, or
@@ -682,14 +682,14 @@
There are unfortunately some inviolate requirements that directly place
pressure on the EXT000-EXT063 (32-bit) opcode space to such a degree that
it risks jeopardising the Power ISA. These requirements are:
-* all of the scalar operations must be Vectoriseable
-* all of the scalar operations intended for Vectorisation
+* all of the scalar operations must be Vectorizeable
+* all of the scalar operations intended for Vectorization
  must be in a 32-bit encoding (not prefixed-prefixed to 96-bit)
* bringing Scalar Power ISA up-to-date from the past 12 years needs 75%
  of two Major opcodes all on its own
There exists a potential scheme which meets (exceeds) the above criteria,
-providing plenty of room for both Scalar (and Vectorised) operations,
+providing plenty of room for both Scalar (and Vectorized) operations,
*and* provides SVP64-Single with room to grow. It
is based loosely around Public v3.1 EXT001 Encoding.[^ext001]
@@ -724,7 +724,7 @@ is based loosely around Public v3.1 EXT001 Encoding.[^ext001]
  If not allocated within the scope of this
  RFC then these are requested to be `RESERVED` for a future Simple-V
  proposal.
-* **SVP64** - a (well-defined, 2 years) DRAFT Proposal for a Vectorisation
+* **SVP64** - a (well-defined, 2 years) DRAFT Proposal for a Vectorization
  Augmentation of suffixes. 
For the needs identified by Libre-SOC (75% of 2 POs), @@ -737,7 +737,7 @@ allocation to new POs, `RESERVED2` does not.[^only2] |old bit6=1| `RESERVED2`:{EXT300-363} | `RESERVED4`:SVP64-Single:{EXT000-063} | SVP64:{EXT000-063} | * **`RESERVED2`:{EXT300-363}** (not strictly necessary to be added) is not - and **cannot** ever be Vectorised or Augmented by Simple-V or any future + and **cannot** ever be Vectorized or Augmented by Simple-V or any future Simple-V Scheme. it is a pure **Scalar-only** word-length PO Group. It may remain `RESERVED`. * **`RESERVED1`:{EXT200-263}** is also a new set of 64 word-length Major @@ -756,7 +756,7 @@ allocation to new POs, `RESERVED2` does not.[^only2] in effect Single-Augmented-Prefixed variants of the v3.0 32-bit Power ISA. Alternative instruction encodings other than the exact same 32-bit word from EXT000-EXT063 are likewise prohibited. -* **`SVP64:{EXT000-063}`** and **`SVP64:{EXT200-263}`** - Full Vectorisation +* **`SVP64:{EXT000-063}`** and **`SVP64:{EXT200-263}`** - Full Vectorization of EXT000-063 and EXT200-263 respectively, these Prefixed instructions are likewise prohibited from being a different encoding from their 32-bit scalar versions. @@ -773,7 +773,7 @@ overwhelmingly made moot. The only downside is that there is no `SVP64-Reserved` which will have to be achieved with SPRs (PCR or MSR). *Most importantly what this scheme does not do is provide large areas -for other (non-Vectoriseable) RFCs.* +for other (non-Vectorizeable) RFCs.* # Potential Opcode allocation solution (2) @@ -788,7 +788,7 @@ that is as follows: as a Prefix, which is a new RESERVED encoding. * when bit 6 is 0b0 and bits 32-33 are 0b11 are **defined** as also allocated to Simple-V -* all other patterns are `RESERVED` for other non-Vectoriseable +* all other patterns are `RESERVED` for other non-Vectorizeable purposes (just over 37.5%). 
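The bit-pattern classification just listed can be sketched behaviourally. This is a hedged model, not the specification: MSB0 bit numbering is assumed, with the 64-bit prefix+suffix pair held in a single Python integer, and the returned strings are purely illustrative:

```python
# Behavioural sketch of the PO9 allocation rules described above
# (solution 2).  Power ISA numbers bits MSB0, so bit(word, n) extracts
# bit n counting from the most-significant end of the 64-bit pair.
def bit(word, n):
    return (word >> (63 - n)) & 1

def classify(word):
    po = (word >> 58) & 0x3f               # bits 0-5: Primary Opcode
    if po != 9:                             # only EXT009-prefixed words apply
        return "not PO9"
    if bit(word, 6) == 1:                   # bit 6 = 0b1: Simple-V
        return "Simple-V"
    if bit(word, 32) == 1 and bit(word, 33) == 1:
        return "Simple-V"                   # bit 6 = 0b0, bits 32-33 = 0b11
    return "RESERVED (non-Vectorizeable purposes)"
```

Note that the `RESERVED` branch covers three of the four bit-32-33 patterns in the bit-6=0 half, i.e. 3/8 of the space, matching the "just over 37.5%" figure quoted above.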
| 0-5 | 6 | 7 | 8-31 | 32:33 | Description |
@@ -804,7 +804,7 @@ that is as follows:
This ensures that any potential for future conflict over uses of the
EXT009 space, jeopardising Simple-V in the process, is avoided,
yet leaves huge areas (just over 37.5% of the 64-bit space) for other
-(non-Vectoriseable) uses.
+(non-Vectorizeable) uses.
These areas thus need to be Allocated (SVP64 and Scalar EXT248-263):
@@ -828,28 +828,28 @@ and reserved areas, QTY 1of 32-bit, and QTY 3of 55-bit, are:
* SVP64Single (`RESERVED3/4`) is *planned* for a future RFC
  (but needs reserving as part of this RFC)
* `RESERVED1/2` is available for new general-purpose
-  (non-Vectoriseable) 32-bit encodings (other RFCs)
+  (non-Vectorizeable) 32-bit encodings (other RFCs)
* EXT248-263 is for "new" instructions
  which **must** be granted corresponding space
  in SVP64.
-* Anything Vectorised-EXT000-063 is **automatically** being
+* Anything Vectorized-EXT000-063 is **automatically** being
  requested as 100% Reserved for every single "Defined Word"
-  (Public v3.1 1.6.3 definition). Vectorised-EXT001 or EXT009
+  (Public v3.1 1.6.3 definition). Vectorized-EXT001 or EXT009
  is defined as illegal.
* Any **future** instruction added to EXT000-063 likewise, must
  **automatically** be assigned corresponding reservations in the
  SVP64:EXT000-063 and SVP64Single:EXT000-063 area, regardless of whether the
-  instruction is Vectoriseable or not.
+  instruction is Vectorizeable or not.
Bit-allocation Summary:
* EXT3nn and other areas provide space for up to
-  QTY 4of non-Vectoriseable EXTn00-EXTn47 ranges.
+  QTY 4of non-Vectorizeable EXTn00-EXTn47 ranges.
* QTY 3of 55-bit spaces also exist for future use (longer by 3 bits than opcodes allocated in EXT001) * Simple-V EXT2nn is restricted to range EXT248-263 -* non-Simple-V (non-Vectoriseable) EXT2nn (if ever requested in any future RFC) is restricted to range EXT200-247 +* non-Simple-V (non-Vectorizeable) EXT2nn (if ever requested in any future RFC) is restricted to range EXT200-247 * Simple-V EXT0nn takes up 50% of PO9 for this and future Simple-V RFCs **This however potentially puts SVP64 under pressure (in 5-10 years).** @@ -866,8 +866,8 @@ it may be better to allocate 25% to `RESERVED`: The clear separation between Simple-V and non-Simple-V stops conflict in future RFCs, both of which get plenty of space. -EXT000-063 pressure is reduced in both Vectoriseable and -non-Vectoriseable, and the 100+ Vectoriseable Scalar operations +EXT000-063 pressure is reduced in both Vectorizeable and +non-Vectorizeable, and the 100+ Vectorizeable Scalar operations identified by Libre-SOC may safely be proposed and each evaluated on their merits. @@ -901,9 +901,9 @@ Augmenting EXT001 or EXT009 is prohibited. **SVP64:{EXT000-063}** bit6=old bit7=vector This encoding is identical to **SVP64:{EXT248-263}** except it -is the Vectorisation of existing v3.0/3.1 Scalar-words, EXT000-063. +is the Vectorization of existing v3.0/3.1 Scalar-words, EXT000-063. All the same rules apply with the addition that -Vectorisation of EXT001 or EXT009 is prohibited. +Vectorization of EXT001 or EXT009 is prohibited. | 0-5 | 6 | 7 | 8-31 | 32-63 | |--------|---|---|-------|---------| @@ -935,7 +935,7 @@ Must be allocated under Scalar *and* SVP64 simultaneously. **SVP64:{EXT248-263}** bit6=new bit7=vector This encoding, which permits VL to be dynamic (settable from GPR or CTR) -is the Vectorisation of EXT248-263. +is the Vectorization of EXT248-263. Instructions may not be placed in this category without also being implemented as pure Scalar *and* SVP64Single. 
Unlike SVP64Single however, there is **no reserved encoding** (bits 8-24 zero). @@ -1006,10 +1006,10 @@ to Simple-V, some not. | 64bit | ss.fishmv | 0x26!zero | 0x12345678| scalar SVP64Single:EXT0nn | | 64bit | unallocated | 0x27nnnnnn | 0x12345678| vector SVP64:EXT0nn | -This is illegal because the instruction is possible to Vectorise, -therefore it should be **defined** as Vectoriseable. +This is illegal because the instruction is possible to Vectorize, +therefore it should be **defined** as Vectorizeable. -**illegal due to unvectoriseable** +**illegal due to unvectorizeable** | width | assembler | prefix? | suffix | description | |-------|-----------|--------------|-----------|---------------| @@ -1017,11 +1017,11 @@ therefore it should be **defined** as Vectoriseable. | 64bit | ss.mtmsr | 0x26!zero | 0x12345678| scalar SVP64Single:EXT0nn | | 64bit | sv.mtmsr | 0x27nnnnnn | 0x12345678| vector SVP64:EXT0nn | -This is illegal because the instruction `mtmsr` is not possible to Vectorise, +This is illegal because the instruction `mtmsr` is not possible to Vectorize, at all. This does **not** convey an opportunity to allocate the space to an alternative instruction. -**illegal unvectoriseable in EXT2nn** +**illegal unvectorizeable in EXT2nn** | width | assembler | prefix? | suffix | description | |-------|-----------|--------------|-----------|---------------| @@ -1029,11 +1029,11 @@ space to an alternative instruction. | 64bit | ss.mtmsr2 | 0x24!zero | 0x12345678| scalar SVP64Single:EXT2nn | | 64bit | sv.mtmsr2 | 0x25nnnnnn | 0x12345678| vector SVP64:EXT2nn | -For a given hypothetical `mtmsr2` which is inherently Unvectoriseable +For a given hypothetical `mtmsr2` which is inherently Unvectorizeable whilst it may be put into the scalar EXT2nn space it may **not** be -allocated in the Vector space. As with Unvectoriseable EXT0nn opcodes +allocated in the Vector space. 
As with Unvectorizeable EXT0nn opcodes this does not convey the right to use the 0x24/0x26 space for alternative -opcodes. This hypothetical Unvectoriseable operation would be better off +opcodes. This hypothetical Unvectorizeable operation would be better off being allocated as EXT001 Prefixed, EXT000-063, or hypothetically in EXT300-363. @@ -1047,7 +1047,7 @@ EXT300-363. the use of 0x12345678 for fredmv in scalar but fishmv in Vector is illegal. the suffix in both 64-bit locations -must be allocated to a Vectoriseable EXT000-063 +must be allocated to a Vectorizeable EXT000-063 "Defined Word" (Public v3.1 Section 1.6.3 definition) or not at all. @@ -1090,7 +1090,7 @@ MSBs are actually *zero*, and the Vector EXT2nn space is only legal for Primary Opcodes in the range 232-263, where the top two MSBs are 0b11. Thus this faulty attempt actually falls unintentionally -into `RESERVED` "Non-Vectoriseable" Encoding space. +into `RESERVED` "Non-Vectorizeable" Encoding space. **illegal attempt to put Scalar EXT001 into Vector space** @@ -1104,7 +1104,7 @@ This becomes in effect an effort to define 96-bit instructions, which are illegal due to cost at the Decode Phase (Variable-Length Encoding). Likewise attempting to embed EXT009 (chained) is also illegal. The implications are clear unfortunately that all 64-bit -EXT001 Scalar instructions are Unvectoriseable. +EXT001 Scalar instructions are Unvectorizeable. \newpage{} # Use cases @@ -1293,18 +1293,18 @@ performance and also greatly simplify unlimited-length biginteger algorithms. \newpage{} -# Vectorised strncpy +# Vectorized strncpy Aside from the `blr` return instruction this is an entire fully-functional implementation of `strncpy` which demonstrates some of the remarkably powerful capabilities of Simple-V. 
Load Fault-First avoids instruction
-traps and page faults in the middle of the Vectorised Load, providing
+traps and page faults in the middle of the Vectorized Load, providing
the *micro-architecture* with the opportunity to notify the program
of the successful Vector Length. `sv.cmpi` is the next strategically-critical
instruction, as it searches for a zero and yet *includes* it in a new Vector
Length - bearing in mind that the previous instruction (the Load) *also*
truncated down to the valid number of LDs performed. Finally,
-a Vectorised Branch-Conditional automatically decrements CTR by the number
+a Vectorized Branch-Conditional automatically decrements CTR by the number
of elements copied (VL), rather than decrementing simply by one.

```
@@ -1344,7 +1344,7 @@ of elements copied (VL), rather than decrementing simply by one.

[^ext001]: Recall that EXT100 to EXT163 is for Public v3.1 64-bit-augmented Operations prefixed by EXT001, for which, from Section 1.6.3, bit 6 is set to 1. This concept is where the above scheme originated. Section 1.6.3 uses the term "defined word" to refer to pre-existing EXT000-EXT063 32-bit instructions so prefixed to create the new numbering EXT100-EXT163, respectively
[^futurevsx]: A future version or other Stakeholder *may* wish to drop Simple-V onto VSX: this would be a separate RFC
[^vsx256]: imagine a hypothetical future VSX-256 using the exact same instructions as VSX. the binary incompatibility introduced would catastrophically **and retroactively** damage existing IBM POWER8,9,10 hardware's reputation and that of Power ISA overall.
-[^autovec]: Compiler auto-vectorisation for best exploitation of SIMD and Vector ISAs on Scalar programming languages (c, c++) is an Indusstry-wide known-hard decades-long problem. Cross-reference the number of hand-optimised assembler algorithms. 
+[^autovec]: Compiler auto-vectorization for best exploitation of SIMD and Vector ISAs on Scalar programming languages (C, C++) is an Industry-wide known-hard decades-long problem. Cross-reference the number of hand-optimised assembler algorithms.
[^hphint]: intended for use when the compiler has determined the extent of Memory or register aliases in loops: `a[i] += a[i+4]` would necessitate a Vertical-First hphint of 4
[^svshape]: although SVSHAPE0-3 should, realistically, be regarded as high a priority as SVSTATE, and given corresponding SVSRR and SVLR equivalents, it was felt that having to context-switch **five** SPRs on Interrupts and function calls was too much.
[^whoops]: two efforts were made to mix non-uniform encodings into Simple-V space: one deliberate to see how it would go, and one accidental. They both went extremely badly, the deliberate one costing over two months to add then remove.
diff --git a/openpower/sv/rfc/ls001/discussion.mdwn b/openpower/sv/rfc/ls001/discussion.mdwn
index bc6878430..a1e4453a3 100644
--- a/openpower/sv/rfc/ls001/discussion.mdwn
+++ b/openpower/sv/rfc/ls001/discussion.mdwn
@@ -20,7 +20,7 @@
of this discussion)
the additional requirements are:
-* all of the scalar operations must be Vectoriseable
+* all of the scalar operations must be Vectorizeable
* all of the scalar operations must be in a 32-bit encoding (not
  prefixed-prefixed)
# use 75% of QTY 3 MAJOR ops
@@ -114,7 +114,7 @@ it would be:
having this `RESERVED` encoding in the middle of the
space does complexify multi-issue decoding somewhat, but
it does provide an entire new (independent,
-non-vectorisable) 32-bit opcode space. **two** separate
+non-vectorizable) 32-bit opcode space. **two** separate
RESERVED Major opcode areas can be provided: numbering
them EXT200-263 and EXT300-363 respectively seems sane. 
EXT300-363 for `RESERVED1` comes with a caveat that it can diff --git a/openpower/sv/rfc/ls002.fmi/discussion.mdwn b/openpower/sv/rfc/ls002.fmi/discussion.mdwn index 77705be5b..a92f0bb82 100644 --- a/openpower/sv/rfc/ls002.fmi/discussion.mdwn +++ b/openpower/sv/rfc/ls002.fmi/discussion.mdwn @@ -206,7 +206,7 @@ ack. done. (actually, removed the duplicate sentence/phrase) ** it is unlikely that we (Libre-SOC) will initially implement any of v3.1 -64-bit prefixing (it cannot be Vectorised, resulting unacceptably in +64-bit prefixing (it cannot be Vectorized, resulting unacceptably in 96-bit instructions which we decided is too much). that said, the LD addressing immediate extended range is extremely useful (along with the PC-relative modes and also other instructions diff --git a/openpower/sv/rfc/ls008.mdwn b/openpower/sv/rfc/ls008.mdwn index 7f4884bcf..570c0f8c7 100644 --- a/openpower/sv/rfc/ls008.mdwn +++ b/openpower/sv/rfc/ls008.mdwn @@ -57,7 +57,7 @@ **Keywords**: ``` - Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC), + Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC), Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model, Digital Signal Processing (DSP) ``` @@ -65,7 +65,7 @@ **Motivation** Power ISA is synonymous with Supercomputing and the early Supercomputers -(ETA-10, ILLIAC-IV, CDC200, Cray) had Vectorisation. It is therefore anomalous +(ETA-10, ILLIAC-IV, CDC200, Cray) had Vectorization. It is therefore anomalous that Power ISA does not have Scalable Vectors. This presents the opportunity to modernise Power ISA keeping it at the top of Supercomputing. 
diff --git a/openpower/sv/rfc/ls009.mdwn b/openpower/sv/rfc/ls009.mdwn index 3aa7fe83d..6a3b90941 100644 --- a/openpower/sv/rfc/ls009.mdwn +++ b/openpower/sv/rfc/ls009.mdwn @@ -57,7 +57,7 @@ **Keywords**: ``` - Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC), + Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC), Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model, Digital Signal Processing (DSP) ``` @@ -771,7 +771,7 @@ if __name__ == '__main__': DCT REMAP is RADIX2 only. Convolutions may be applied as usual to create non-RADIX2 DCT. Combined with appropriate Twin-butterfly instructions the algorithm below (written in python3), becomes part -of an in-place in-registers Vectorised DCT. The algorithms work +of an in-place in-registers Vectorized DCT. The algorithms work by loading data such that as the nested loops progress the result is sorted into correct sequential order. diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn index 316347fa0..044e087b7 100644 --- a/openpower/sv/rfc/ls010.mdwn +++ b/openpower/sv/rfc/ls010.mdwn @@ -58,7 +58,7 @@ **Keywords**: ``` - Cray Supercomputing, Vectorisation, Zero-Overhead-Loop-Control (ZOLC), + Cray Supercomputing, Vectorization, Zero-Overhead-Loop-Control (ZOLC), True-Scalable Vectors, Multi-Issue Out-of-Order, Sequential Programming Model, Digital Signal Processing (DSP), High-level Assembler ``` diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn index 07fd9e04f..84ed809fb 100644 --- a/openpower/sv/rfc/ls012.mdwn +++ b/openpower/sv/rfc/ls012.mdwn @@ -11,7 +11,7 @@ The purpose of this RFC is: * to give a full list of upcoming **Scalar** opcodes developed by Libre-SOC - (being cognisant that *all* of them are Vectoriseable) + (being cognisant that *all* of them are Vectorizeable) * to give OPF Members and non-Members alike the opportunity to comment and get involved early in RFC submission * formally agree a priority order on an 
iterative basis with new versions @@ -43,10 +43,10 @@ that "Libre-SOC" != "RED Semiconductor Ltd". The two are completely **separate** organisations*. Worth bearing in mind during evaluation that every "Defined Word" may -or may not be Vectoriseable, but that every "Defined Word" should have -merits on its own, not just when Vectorised, precisely because the +or may not be Vectorizeable, but that every "Defined Word" should have +merits on its own, not just when Vectorized, precisely because the instructions are Scalar. An example of a borderline -Vectoriseable Defined Word is `mv.swizzle` which only really becomes +Vectorizeable Defined Word is `mv.swizzle` which only really becomes high-priority for Audio/Video, Vector GPU and HPC Workloads, but has less merit as a Scalar-only operation, yet when SVP64Single-Prefixed can be part of an atomic Compare-and-Swap sequence. @@ -126,7 +126,7 @@ why the instructions are needed. Future versions of SVP64 and SVP64Single are expected to be developed by future Power ISA Stakeholders on top of VSX. The decisions made -there about the meaning of Prefixed Vectorised VSX may be *completely +there about the meaning of Prefixed Vectorized VSX may be *completely different* from those made for Prefixed SFFS instructions. At which point the lack of SFFS equivalents would penalise SFFS implementors in a much more severe way, effectively expecting them and SFFS programmers to @@ -139,15 +139,15 @@ allow it to be stand-alone on its own merits. These without question have to go in EXT0xx. Future extended variants, bringing even more powerful capabilities, can be followed up later with EXT1xx prefixed variants, which is not possible if placed in EXT2xx. -*Only `svstep` is actually Vectoriseable*, all other Management -instructions are UnVectoriseable. PO1-Prefixed examples include +*Only `svstep` is actually Vectorizeable*, all other Management +instructions are UnVectorizeable. 
PO1-Prefixed examples include
adding psvshape in order to support both Inner and Outer Product Matrix
Schedules, by providing the option to directly reverse the order of the
triple loops. Outer is used for standard Matrix Multiply (on top of a
standard MAC or FMAC instruction), but Inner is required for Warshall
Transitive Closure (on top of a cumulatively-applied max
instruction).
-Excpt for `svstep` which is Vectoriseable the Management Instructions
+Except for `svstep`, which is Vectorizeable, the Management Instructions
themselves are all 32-bit Defined Words (Scalar Operations), so
PO1-Prefixing is perfectly reasonable. SVP64 Management instructions
of which there are only 6 are all 5 or 6 bit XO, meaning that the opcode
@@ -188,7 +188,7 @@ specialist.
Found at [[sv/av_opcodes]] these do not require Saturated variants
because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
[[sv/svp64-single]] Scalar Prefixing. This is important to note for
-Opcode Allocation because placing these operations in the UnVectoriseable
+Opcode Allocation because placing these operations in the UnVectorizeable
areas would irredeemably damage their value. Unlike PackedSIMD ISAs
the actual number of AV Opcodes is remarkably small once the usual
cascading-option-multipliers (SIMD width, bitwidth, saturation,
@@ -215,7 +215,7 @@ the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
operations, typically performing for example one multiply but in-place
subtracting that product from one operand and adding it to the other.
The *in-place* aspect is strategically extremely important for significant
-reductions in Vectorised register usage, particularly for DCT.
+reductions in Vectorized register usage, particularly for DCT.
Further: even without Simple-V the number of instructions saved is huge:
8 for integer and 4 for floating-point vs one.
@@ -308,7 +308,7 @@ as just one example. 
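As a behavioural sketch of the "Twin Butterfly" idea mentioned above — one multiply whose product feeds both an add and a subtract, as in a radix-2 FFT butterfly — the following Python model may help. The function name and operand roles are illustrative assumptions, not the proposed mnemonic:

```python
# Sketch of a "Twin Butterfly": one multiply, two in-place results.
# In hardware both results come from the single product, saving the
# intermediate register that a scalar instruction sequence would need.
def twin_butterfly(a, b, w):
    prod = b * w                 # the single shared multiply
    return a + prod, a - prod    # added to one result, subtracted for the other

hi, lo = twin_butterfly(10, 3, 2)
assert (hi, lo) == (16, 4)
```

Without such an instruction the same work needs the multiply, an explicit temporary for the product, and separate add/subtract operations, which is the instruction-count saving quoted above.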
Whilst some of these instructions have VSX equivalents they must not
be excluded on that basis. SVP64/VSX may have a different meaning from
-SVP64/SFFS i e. the two *Vectorised* instructions may not be equivalent.
+SVP64/SFFS, i.e. the two *Vectorized* instructions may not be equivalent.
## Bitmanip LUT2/3
@@ -348,7 +348,7 @@ as a Scalar instruction is limited *except* if combined with `cmpi` and
SVP64Single Predication, whereupon the end result is the RISC-synthesis
of Compare-and-Swap, in two instructions.
-Where this instruction comes into its full value is when Vectorised.
+Where this instruction comes into its full value is when Vectorized.
3D GPU and HPC numerical workloads astonishingly contain between 10 to 15%
swizzle operations: access to YYZ, XY, of an XYZW Quaternion, performing
balancing of ARGB pixel data. The usage is so high that 3D GPU ISAs make
@@ -458,7 +458,7 @@ EXT2xx).
\newpage{}
-# Vectorisation: SVP64 and SVP64Single
+# Vectorization: SVP64 and SVP64Single
To be submitted as part of [[ls001]], [[ls008]], [[ls009]] and [[ls010]],
with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually
@@ -479,7 +479,7 @@ Secondly: **any** Scalar instruction involving registers **automatically**
becomes a candidate for Vector-Prefixing. This in turn means that when
a new instruction is proposed, it becomes a hard requirement to consider
not only the implications of its inclusion as a Scalar-only instruction,
-but how it will best be utilised as a Vectorised instruction **as well**.
+but how it will best be utilised as a Vectorized instruction **as well**.
Extreme examples of this are the Big-Integer 3-in 2-out instructions
that use one 64-bit register effectively as a Carry-in and Carry-out. 
The instructions were designed in a *Scalar* context to be inline-efficient @@ -487,7 +487,7 @@ in hardware (use of Operand-Forwarding to reduce the chain down to 2-in 1-out), but in a *Vector* context it is extremely straightforward to Micro-code an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD pipelines, and to perform a large internal Forward-Carry-Propagation on -for example the Vectorised-Multiply instruction. +for example the Vectorized-Multiply instruction. Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be considered as an independent stand-alone instruction (just like `REP`). @@ -714,17 +714,17 @@ The key to headings and sections are as follows: upcoming RFCs in development may be found. *Reading advance Draft RFCs and providing feedback strongly advised*, it saves time and effort for the OPF ISA Workgroup. -* **SVP64** - Vectoriseable (SVP64-Prefixable) - also implies that +* **SVP64** - Vectorizeable (SVP64-Prefixable) - also implies that SVP64Single is also permitted (required). * **page** - Libre-SOC wiki page at which further information can be found. Again: **advance reading strongly advised due to the sheer volume of information**. * **PO1** - the instruction is capable of being PO1-Prefixed (given an EXT1xx Opcode Allocation). Bear in mind that this option - is **mutually exclusively incompatible** with Vectorisation. + is **mutually exclusively incompatible** with Vectorization. * **group** - the Primary Opcode Group recommended for this instruction. Options are EXT0xx (EXT000-EXT063), EXT1xx and EXT2xx. A third area - (UnVectoriseable), + (UnVectorizeable), EXT3xx, was available in an early Draft RFC but has been made "RESERVED" instead. see [[sv/po9_encoding]]. * **Level** - Compliancy Subset and Simple-V Level. 
`SFFS` indicates "mandatory"
diff --git a/openpower/sv/rfc/ls015.mdwn b/openpower/sv/rfc/ls015.mdwn
index e85b11702..d7210bdd4 100644
--- a/openpower/sv/rfc/ls015.mdwn
+++ b/openpower/sv/rfc/ls015.mdwn
@@ -118,14 +118,14 @@ Basic concept:
  register to selectively target any four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit.
-* Optional Vectorisation of the same when SVP64 is implemented
+* Optional Vectorization of the same when SVP64 is implemented
Purpose:
* To provide a merged version of what is currently a multi-sequence
  of CR operations (crand, cror, crxor) with mfcr and mtcrf, reducing
  instruction count.
-* To provide a vectorised version of the same, suitable for advanced
+* To provide a vectorized version of the same, suitable for advanced
  predication
Useful side-effects:
diff --git a/openpower/sv/rfc/ls016.mdwn b/openpower/sv/rfc/ls016.mdwn
index ca45e5591..40c28787f 100644
--- a/openpower/sv/rfc/ls016.mdwn
+++ b/openpower/sv/rfc/ls016.mdwn
@@ -85,14 +85,14 @@ get efficient execution.
   RAp instructions, these instructions would not be proposed.
4. The read and write of two overlapping registers normally requires
   an intermediate register (similar to the justification for CAS -
-  Compare-and-Swap). When Vectorised the situation becomes even
+  Compare-and-Swap). When Vectorized the situation becomes even
   worse: an entire *Vector* of intermediate temporaries is required.
   Thus *even if implemented inefficiently* requiring more cycles to
   complete (taking an extra cycle to write the second result) these
   instructions still save on resources.
5. Macro-op fusion equivalents of these instructions are *not possible*
   for exactly the same reason that the equivalent CAS sequence may not be
-  macro-op fused. Full in-place Vectorised FFT and DCT algorithms *only*
+  macro-op fused. 
Full in-place Vectorized FFT and DCT algorithms *only* become possible due to these instructions atomically reading **both** Butterfly operands into internal Reservation Stations (exactly like CAS). 5. Although desirable (particularly to detect overflow) Rc=1 is hard to diff --git a/openpower/sv/sprs.mdwn b/openpower/sv/sprs.mdwn index 243f7ebd4..82467f638 100644 --- a/openpower/sv/sprs.mdwn +++ b/openpower/sv/sprs.mdwn @@ -255,7 +255,7 @@ can be made considerably faster than on other Implementations. SV Link Register, exactly analogous to LR (Link Register) may be used for temporary storage of SVSTATE, and, in particular, -Vectorised Branch-Conditional instructions may interchange +Vectorized Branch-Conditional instructions may interchange SVLR and SVSTATE whenever LR and NIA are. Note that there is no equivalent Link variant of SVREMAP or diff --git a/openpower/sv/sv_analysis.mdwn b/openpower/sv/sv_analysis.mdwn index 317689fa9..f947438a0 100644 --- a/openpower/sv/sv_analysis.mdwn +++ b/openpower/sv/sv_analysis.mdwn @@ -3,7 +3,7 @@ The creation and maintenance of SVP64 Categorisation is an automated process that uses "Register profiling", reading machine-readable versions of the Power ISA Specification and tables in order to -make the Vectorisation Categorisation. To create this information +make the Vectorization Categorisation. To create this information by hand is neither sensible nor desirable: it may take far longer and introduce errors. diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn index d97b24f2f..397f54ab6 100644 --- a/openpower/sv/svp64.mdwn +++ b/openpower/sv/svp64.mdwn @@ -44,7 +44,7 @@ Table of contents ## Introduction -Simple-V is a type of Vectorisation best described as a "Prefix Loop +Simple-V is a type of Vectorization best described as a "Prefix Loop Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR`[^bib_ldir] instruction and to the 8086 `REP`[^bib_rep] Prefix instruction. 
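The "Prefix Loop Subsystem" analogy above (Z80 `LDIR`, 8086 `REP`) can be sketched as a for-loop wrapped around an unchanged scalar operation. This is a behavioural Python model only; the helper name and the VL value are illustrative assumptions:

```python
# Sketch of a Vector prefix: it does not define a new operation, it
# supplies a hardware for-loop around an existing scalar one (cf. REP).
def sv_prefix(scalar_op, vl):
    """Return a version of scalar_op applied to elements 0..VL-1."""
    def vector_op(dst, src):
        for i in range(vl):          # the loop comes from the prefix alone
            dst[i] = scalar_op(src[i])
    return vector_op

src = [1, 2, 3, 4]
dst = [0, 0, 0, 0]
sv_prefix(lambda x: x + 10, vl=3)(dst, src)
assert dst == [11, 12, 13, 0]        # only VL=3 elements are touched
```

The point of the analogy is that the scalar operation's semantics are untouched: only the looping (and, in Simple-V, the register-element stepping) is added by the prefix.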
More advanced features are similar to the Z80 `CPIR`[^bib_cpir] instruction. If naively viewed one-dimensionally as an @@ -88,7 +88,7 @@ Mode. Post-Increment was considered sufficiently high priority (significantly reducing hot-loop instruction count) that one bit in the Prefix is reserved for it (*Note the intention to release that bit and move Post-Increment instructions to EXT2xx, as part of [[sv/rfc/ls011]]*). -Vectorised Branch-Conditional operations "embed" the original Scalar +Vectorized Branch-Conditional operations "embed" the original Scalar Branch-Conditional behaviour into a much more advanced variant that is highly suited to High-Performance Computation (HPC), Supercomputing, and parallel GPU Workloads. @@ -178,7 +178,7 @@ Trap raised. *Architectural Note: Given that a "pre-classification" Decode Phase is required (identifying whether the Suffix - Defined Word - is Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional), -adding "Unvectorised" to this phase is not unreasonable.* +adding "Unvectorized" to this phase is not unreasonable.* Vectorizable Defined Word-instructions are **required** to be Vectorized, or they may not be permitted to be added at all to the Power ISA as Defined @@ -427,7 +427,7 @@ For clarity in the table below: * The GPR-numbering is considered LSB0-ordered * The Element-numbering (result0-result4) is LSB0-ordered * Each of the results (result0-result4) are 16-bit -* "same" indicates "no change as a result of the Vectorised add" +* "same" indicates "no change as a result of the Vectorized add" ``` | MSB0: | 0:15 | 16:31 | 32:47 | 48:63 | @@ -446,7 +446,7 @@ the example having VL=5. Thus on "wrapping" - sequential progression from GPR(1) into GPR(2) - the 5th result modifies **only** the bottom 16 LSBs of GPR(1). 
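The byte-level packing just described can be modelled directly. This sketch assumes little-endian packing into a flat register-file byte array, with the starting register chosen arbitrarily; it shows a VL=5 vector of 16-bit results "wrapping" into the bottom 16 bits of the following register:

```python
# The GPR file as a flat byte space: 16-bit elements pack sequentially,
# so a VL=5 operation fills one 64-bit GPR and the low 16 bits of the next.
regfile = bytearray(32 * 8)            # 32 GPRs of 8 bytes each

def write_elt(start_gpr, idx, value, width=2):
    off = start_gpr * 8 + idx * width  # elements progress linearly, LSB0
    regfile[off:off + width] = value.to_bytes(width, "little")

for i in range(5):                     # a VL=5 vector of 16-bit results
    write_elt(0, i, 0x1000 + i)

# results 0-3 fill the first GPR; result 4 "wraps", modifying only the
# bottom 16 bits of the next GPR
assert int.from_bytes(regfile[8:16], "little") == 0x1004

# re-reading the exact same bytes as 32-bit elements shows the overlap:
# two adjacent 16-bit results appear as one 32-bit element
assert int.from_bytes(regfile[0:4], "little") == 0x10011000
```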
-If the 16-bit operation were to be followed up with a 32-bit Vectorised +If the 16-bit operation were to be followed up with a 32-bit Vectorized Operation, the exact same contents would be viewed as follows: ``` @@ -620,7 +620,7 @@ MAXVL when understanding this key aspect of SimpleV. ## Register Naming and size As indicated above SV Registers are simply the GPR, FPR and CR register -files extended linearly to larger sizes; SV Vectorisation iterates +files extended linearly to larger sizes; SV Vectorization iterates sequentially through these registers (LSB0 sequential ordering from 0 to VL-1). @@ -989,12 +989,12 @@ following meaning: | 110 | so/un | `CR[offs+i].FU` is set | | 111 | ns/nu | `CR[offs+i].FU` is clear | -`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised +`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorized Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD). -The CR Predicates chosen must start on a boundary that Vectorised CR +The CR Predicates chosen must start on a boundary that Vectorized CR operations can access cleanly, in full. With EXTRA2 restricting starting -points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and +points to multiples of 8 (CR0, CR8, CR16...) both Vectorized Rc=1 and CR Predicate Masks have to be adapted to fit on these boundaries as well. ## Extra Remapped Encoding diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn index 170ab08b0..73426d430 100644 --- a/openpower/sv/svp64/appendix.mdwn +++ b/openpower/sv/svp64/appendix.mdwn @@ -56,7 +56,7 @@ but only for SVP64 Prefixed Operations. XER.CA/CA32 on the other hand is expected and required to be implemented according to standard Power ISA Scalar behaviour. 
Interestingly, due to SVP64 being in effect a hardware for-loop around Scalar instructions -executing in precise Program Order, a little thought shows that a Vectorised +executing in precise Program Order, a little thought shows that a Vectorized Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In and producing, at the end, a single bit Carry out. High performance implementations may exploit this observation to deploy efficient @@ -516,7 +516,7 @@ to include the terminating zero. In CR-based data-driven fail-on-first there is only the option to select and test one bit of each CR (just as with branch BO). For more complex -tests this may be insufficient. If that is the case, vectorised crops +tests this may be insufficient. If that is the case, vectorized crops (crand, cror) may be used, and ffirst applied to the crop instead of to the arithmetic vector. @@ -528,7 +528,7 @@ One extremely important aspect of ffirst is: to zero. This is the only means in the entirety of SV that VL may be set to zero (with the exception of via the SV.STATE SPR). When VL is set zero due to the first element failing the CR bit-test, all subsequent - vectorised operations are effectively `nops`, which is + vectorized operations are effectively `nops`, which is *precisely the desired and intended behaviour*. Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily @@ -656,10 +656,10 @@ do not start on a 32-bit aligned boundary, performance may be affected. ### CR fields as inputs/outputs of vector operations CRs (or, the arithmetic operations associated with them) -may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR. +may be marked as Vectorized or Scalar. 
When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorized if the destination is Vectorized. Likewise if the destination is scalar then so is the CR. When vectorized, the CR inputs/outputs are sequentially read/written -to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin +to 4-bit CR fields. Vectorized Integer results, when Rc=1, will begin writing to CR8 (TBD evaluate) and increase sequentially from there. This is so that: @@ -676,8 +676,8 @@ EXTRA field the *standard* v3.0B behaviour applies: the accompanying CR when Rc=1 is written to. This is CR0 for integer operations and CR1 for FP operations. -Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which -has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER +Note that yes, the CR Fields are genuinely Vectorized. Unlike in SIMD VSX which +has a single CR (CR6) for a given SIMD result, SV Vectorized OpenPOWER v3.0B scalar operations produce a **tuple** of element results: the result of the operation as one part of that element *and a corresponding CR element*. Greatly simplified pseudocode: @@ -697,7 +697,7 @@ then a followup instruction must be performed, setting "reduce" mode on the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far more flexibility in analysing vectors than standard Vector ISAs. Normal Vector ISAs are typically restricted to "were all results nonzero" and -"were some results nonzero". The application of mapreduce to Vectorised +"were some results nonzero". The application of mapreduce to Vectorized cr operations allows far more sophisticated analysis, particularly in conjunction with the new crweird operations, see [[sv/cr_int_predication]]. @@ -731,7 +731,7 @@ CRn is the notation used by the OpenPower spec to refer to CR field #n, so FP instructions with Rc=1 write to CR1 (n=1). CRs are not stored in SPRs: they are registers in their own right. 
-Therefore context-switching the full set of CRs involves a Vectorised +Therefore context-switching the full set of CRs involves a Vectorized mfcr or mtcr, using VL=8 to do so. This is exactly how scalar OpenPOWER context-switches CRs: it is just that there are now more of them. @@ -1018,7 +1018,7 @@ In Scalar mode, `maddedu` therefore stores the two halves of the 128-bit multiply into RT and RT+1. What, then, of `sv.maddedu`? If the destination is hard-coded to RT and -RT+1 the instruction is not useful when Vectorised because the output +RT+1 the instruction is not useful when Vectorized because the output will be overwritten on the next element. Solving this is easy: define the destination registers as RT and RT+MAXVL respectively. This makes it easy for compilers to statically allocate registers even when VL diff --git a/openpower/sv/svp64/discussion.mdwn b/openpower/sv/svp64/discussion.mdwn index 78e782c28..f64b79291 100644 --- a/openpower/sv/svp64/discussion.mdwn +++ b/openpower/sv/svp64/discussion.mdwn @@ -153,7 +153,7 @@ i get the idea "r0 to be used therefore it is all zeros" but that makes 001 the | 111 | ~R30 | -# CR Vectorisation +# CR Vectorization Some thoughts on this: the sensible (sane) number of CRs to have is 64. A case could be made for having 128 but it is an awful lot. 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much). @@ -231,7 +231,7 @@ Summary so far: ## only 1 src/dest -Instructions in this category are usually Unvectoriseable +Instructions in this category are usually Unvectorizeable or they are Load-Immediates. `fmvis`, for example, is 1-Write, whilst SV.Branch-Conditional is BI (CR field bit). 
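The RT versus RT+MAXVL point above can be sketched as a toy register-allocation model. This is illustrative only (hypothetical helper, not Power ISA pseudocode): hard-coding the second destination to RT+1 makes element i's high half collide with element i+1's low half, whereas RT+MAXVL yields two disjoint vectors that a compiler can statically allocate.

```python
# Toy model of sv.maddedu destination allocation (illustrative only).
def dest_regs(RT, VL, MAXVL, second="RT+MAXVL"):
    lo = [RT + i for i in range(VL)]              # low halves of products
    if second == "RT+1":
        hi = [RT + 1 + i for i in range(VL)]      # overlaps lo[1:] - bad
    else:
        hi = [RT + MAXVL + i for i in range(VL)]  # disjoint second vector
    return lo, hi

lo, hi = dest_regs(8, 4, 4, second="RT+1")
assert set(lo) & set(hi)        # registers clash: results overwritten
lo, hi = dest_regs(8, 4, 4)
assert not set(lo) & set(hi)    # clean split: easy static allocation
```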
diff --git a/openpower/sv/svp64_quirks.mdwn b/openpower/sv/svp64_quirks.mdwn index 75f3d98bd..b87bf0da4 100644 --- a/openpower/sv/svp64_quirks.mdwn +++ b/openpower/sv/svp64_quirks.mdwn @@ -60,7 +60,7 @@ The complexity that resulted in the decode phase was too great. The lesson was learned, the hard way: it would be infinitely preferable to add a 32-bit Scalar Load-with-Shift -instruction *first*, which then inherently becomes Vectorised. +instruction *first*, which then inherently becomes Vectorized. Perhaps a future Power ISA spec will have this Load-with-Shift instruction: both ARM and x86 have it, because it saves greatly on instruction count in hot-loops. @@ -70,7 +70,7 @@ also having it as a Scalar un-prefixed instruction is that if the 32-bit encoding is ever allocated in a future revision of the Power ISA to a completely unrelated operation -then how can a Vectorised version of that new instruction ever be added? +then how can a Vectorized version of that new instruction ever be added? The uniformity and RISC Abstraction are irreparably damaged. Bottom line here is that the fundamental RISC Principle is strictly adhered to, even though these are Advanced 64-bit Vector instructions. @@ -81,7 +81,7 @@ SVP64 and the level of systematic abstraction kept between Prefix and Suffix. The basic principle of SVP64 is the prefix, which contains mode as well as register augmentation and predicates. When thinking of -instructions and Vectorising them, it is natural for arithmetic +instructions and Vectorizing them, it is natural for arithmetic operations (ADD, OR) to be the first to spring to mind. Arithmetic instructions have registers, therefore augmentation applies, end of story, right? @@ -90,7 +90,7 @@ Except, Load and Store deal also with Memory, not just registers. Power ISA has Condition Register Fields: how can element widths apply there? And branches: how can you have Saturation on something that does not return an arithmetic result? 
In short: there are actually -four different categories (five including those for which Vectorisation +four different categories (five including those for which Vectorization makes no sense at all, such as `sc` or `mtmsr`). The categories are: * arithmetic/logical including floating-point @@ -139,13 +139,13 @@ number of instructions in tight inner loop situations. Condition Register Fields are 4-bit wide and consequently element-width overrides make absolutely no sense whatsoever. Therefore the elwidth -override field bits can be used for other purposes when Vectorising +override field bits can be used for other purposes when Vectorizing CR Field instructions. Moreover, Rc=1 is completely invalid for CR operations such as `crand`: Rc=1 is for arithmetic operations, producing a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense. All of these differences, which require quite a lot of logical reasoning and deduction, help explain why there is an entirely different -CR ops Vectorisation Category. +CR ops Vectorization Category. A particularly strange quirk of CR-based Vector Operations is that the Scalar Power ISA CR Register is 32-bits, but actually comprises eight @@ -164,7 +164,7 @@ EQ/LT/GT/SO within that Field*) With SVP64 extending the number of CR *Fields* to 128, the number of 32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields* (8 per CR Register). Then, it gets even more strange, when it comes -to Vectorisation, which applies to the CR Field *numbers*. The +to Vectorization, which applies to the CR Field *numbers*. The hardware-for-loop for Rc=1 for example starts at CR0 for element 0, and moves to CR1 for element 1, and so on. The reason here is quite simple: each element result has to have its own CR Field co-result. 
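The Field-to-Register packing and the Rc=1 hardware for-loop described above can be sketched as follows. This is an assumed, illustration-only numbering model: 128 CR Fields pack eight to a 32-bit CR Register, and element i's Rc=1 co-result lands in CR Field i.

```python
# Illustrative mapping of SVP64 CR Field numbers (0..127) to a
# 32-bit CR Register number and slot (8 Fields per Register).
def cr_field_location(field):
    assert 0 <= field < 128
    return field // 8, field % 8   # (CR Register, Field slot within it)

assert cr_field_location(0) == (0, 0)     # element 0 -> CR0, in CR Reg 0
assert cr_field_location(9) == (1, 1)     # CR9 lives in CR Register 1
assert cr_field_location(127) == (15, 7)  # last Field, in CR Register 15
```

This is why 16 CR Registers suffice to hold all 128 Fields, and why the Rc=1 loop (element 0 to CR0, element 1 to CR1, ...) walks sequentially through them.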
@@ -197,24 +197,24 @@ elwidth overrides, was particularly obtuse and hard to derive: some care and attention is advised, here, when reading the specification, especially on arithmetic loads (lbarx, lharx etc.) -**Non-vectorised** +**Non-vectorized** -The concept of a Vectorised halt (`attn`) makes no sense. There are never +The concept of a Vectorized halt (`attn`) makes no sense. There are never going to be a Vector of global MSRs (Machine Status Register). `mtcr` -on the other hand is a grey area: `mtspr` is clearly Vectoriseable. +on the other hand is a grey area: `mtspr` is clearly Vectorizeable. Even `td` and `tdi` make a strange type of sense to permit them to be -Vectorised, because a sequence of comparisons could be Vectorised. -Vectorised System Calls (`sc`) or `tlbie` and other Cache or Virtual +Vectorized, because a sequence of comparisons could be Vectorized. +Vectorized System Calls (`sc`) or `tlbie` and other Cache or Virtual Memory Management -instructions, these make no sense to Vectorise. +instructions, these make no sense to Vectorize. However, it is really quite important to not be tempted to conclude that -just because these instructions are un-vectoriseable, the Prefix opcode space +just because these instructions are un-vectorizeable, the Prefix opcode space must be free for reinterpretation and use for other purposes. This would be a serious mistake because a future revision of the specification might *retire* the Scalar instruction, and, worse, replace it with another. Again this comes down to being quite strict about the rules: only Scalar -instructions get Vectorised: there are *no* actual explicit Vector +instructions get Vectorized: there are *no* actual explicit Vector instructions. 
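The rule above can be sketched as a small classification table. The category names follow the text, but the table entries here are hand-picked examples, not the authoritative SVP64 categorisation, and `sv_prefix_allowed` is a hypothetical helper:

```python
# Illustrative-only sketch: every *Scalar* mnemonic carries a
# Vectorization Category, and "un-vectorizable" ones still keep their
# Prefix opcode space reserved (never reinterpreted for other uses).
CATEGORY = {
    "add":   "arithmetic/logical",
    "crand": "CR-op",
    "ld":    "load/store",
    "bc":    "branch-conditional",
    "sc":    "un-vectorizable",   # a Vector of System Calls makes no sense
    "attn":  "un-vectorizable",   # a Vectorized halt makes no sense
}

def sv_prefix_allowed(mnemonic):
    # Prefixing is only *meaningful* for vectorizable categories; the
    # encoding space stays reserved for every Scalar instruction anyway.
    return CATEGORY.get(mnemonic) != "un-vectorizable"

assert sv_prefix_allowed("add")
assert not sv_prefix_allowed("sc")
```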
**Summary** @@ -223,7 +223,7 @@ Where a traditional Vector ISA effectively duplicates the entirety of a Scalar ISA and then adds additional instructions which only make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to considerable lengths to keep strictly to augmentation and embedding -of an entire Scalar ISA's instructions into an abstract Vectorisation +of an entire Scalar ISA's instructions into an abstract Vectorization Context. That abstraction subdivides down into Categories appropriate for the type of operation (Branch, CRs, Memory, Arithmetic), and each Category has its own relevant but @@ -440,12 +440,12 @@ One key difference is that LR is only updated if certain additional conditions are met, whereas Scalar `bclrl` for example unconditionally overwrites LR. -Another is that the Vectorised Branch-Conditional instructions are the +Another is that the Vectorized Branch-Conditional instructions are the only ones where there are side-effects on predication when skipping is enabled. This is so as to be able to use CTR to count down *masked-out* elements. -Well over 500 Vectorised branch instructions exist in SVP64 due to the +Well over 500 Vectorized branch instructions exist in SVP64 due to the number of options available: close integration and interaction with the base Scalar Branch was unavoidable in order to create Conditional Branching suitable for parallel 3D / CUDA GPU workloads. @@ -693,7 +693,7 @@ of the Condition Register(s) which are always zero anyway. As explained in the introduction [[sv/svp64]] and [[sv/cr_ops]] Scalar Power ISA lacks "Conditional Execution" present in the ARM -Scalar ISA for several decades. When Vectorised, the fact that +Scalar ISA for several decades. When Vectorized, the fact that Rc=1 Vector results can immediately be used as a Predicate Mask back into the following instruction can result in large latency unless "Vector Chaining" is used in the Micro-Architecture. 
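The Rc=1-result-as-Predicate-Mask pattern described above can be sketched in Python. The names are illustrative, not Power ISA mnemonics: a Vectorized compare yields one mask bit per element, which the following masked operation consumes, standing in for ARM-style Conditional Execution.

```python
# Illustrative sketch of predicate-mask chaining (not real mnemonics).
def vec_cmp_gt(a, b):
    return [x > y for x, y in zip(a, b)]      # one "CR bit" per element

def vec_add_masked(dst, src, mask):
    # masked-out elements are left unchanged (non-zeroing predication)
    return [d + s if m else d for d, s, m in zip(dst, src, mask)]

mask = vec_cmp_gt([5, 1, 7], [3, 4, 2])       # [True, False, True]
out = vec_add_masked([10, 10, 10], [1, 1, 1], mask)
assert out == [11, 10, 11]
```

The latency concern in the text arises because a hardware pipeline must have `mask` fully resolved (or chained element-by-element) before the dependent masked operation can issue.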
diff --git a/openpower/sv/svstep.mdwn b/openpower/sv/svstep.mdwn index 7356957bc..ecb1d2b47 100644 --- a/openpower/sv/svstep.mdwn +++ b/openpower/sv/svstep.mdwn @@ -30,7 +30,7 @@ Special Registers Altered: **Description** svstep may be used to enquire about the REMAP Schedule and it may be -used to alter Vectorisation State. When `vf=1` then stepping occurs. +used to alter Vectorization State. When `vf=1` then stepping occurs. When `vf=0` the enquiry is performed without altering internal state. If `SVi=0, Rc=0, vf=0` the instruction is a `nop`. @@ -58,7 +58,7 @@ to skip (or zero) elements. * Horizontal-First Mode can be used to return all indices, i.e. walks through all possible states. -**Vectorisation of svstep itself** +**Vectorization of svstep itself** As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as `sv.svstep`. This will work perfectly well in Horizontal-First @@ -104,7 +104,7 @@ any standard scalar v3.0B instruction. A mode of srcstep (SVi=0) is called which can move srcstep and dststep on to the next element, still respecting predicate masks. -In other words, where normal SVP64 Vectorisation acts "horizontally" +In other words, where normal SVP64 Vectorization acts "horizontally" by looping first through 0 to VL-1 and only then moving the PC to the next instruction, Vertical-First moves the PC onwards (vertically) through multiple instructions **with the same srcstep and dststep**, diff --git a/openpower/sv/vector_isa_comparison.mdwn b/openpower/sv/vector_isa_comparison.mdwn index b8c51490c..45cfd3b22 100644 --- a/openpower/sv/vector_isa_comparison.mdwn +++ b/openpower/sv/vector_isa_comparison.mdwn @@ -139,9 +139,9 @@ All of these are not Scalable Vector ISAs, they are SIMD ISAs. * Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from Mitch under NDA on direct contact with him. It is a different approach from the - others, which may be termed "Cray-Style Horizontal-First" Vectorisation. 
+ others, which may be termed "Cray-Style Horizontal-First" Vectorization. 66000 is a *Vertical-First* Vector ISA with hardware-level - auto-vectorisation. + auto-vectorization. * [ETA-10](http://50.204.185.175/collections/catalog/102641713) an extremely rare Scalable Vector Architecture from 1986, similar to the CDC Cyber 205. diff --git a/openpower/sv/vector_ops/discussion.mdwn b/openpower/sv/vector_ops/discussion.mdwn index e09283373..e9d54b19c 100644 --- a/openpower/sv/vector_ops/discussion.mdwn +++ b/openpower/sv/vector_ops/discussion.mdwn @@ -199,7 +199,7 @@ where bit 2 is inv, bits 0:1 select the bit of the CR. the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all. -scalar variant which can be Vectorised to give iotacr: +scalar variant which can be Vectorized to give iotacr: def crtaddi(RT, RA, BA, BO, D): if test_CR_bit(BA, BO):