X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsvp64.mdwn;h=8d732f474f1c1aad0c79fddfaa746fc9c5ecf542;hb=f61a664f43dce738d91ddde396809609c84ae306;hp=5ffc33a7da9d9e41e74e686b85e07bf7a0330442;hpb=a0f7e411243e1eff7dfb80ff88fa966bbe336f0c;p=libreriscv.git diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn index 5ffc33a7d..8d732f474 100644 --- a/openpower/sv/svp64.mdwn +++ b/openpower/sv/svp64.mdwn @@ -1,12 +1,10 @@ -[[!tag standards]] - -# DRAFT SVP64 for OpenPOWER ISA v3.0B +# SVP64 Zero-Overhead Loop Prefix Subsystem + * **DRAFT STATUS v0.1 18sep2021** Release notes + -This document describes [[SV|sv]] augmentation of the [[OpenPOWER|openpower]] v3.0B [[ISA|openpower/isa/]]. It is in Draft Status and -will be submitted to the [[!wikipedia OpenPOWER_Foundation]] ISA WG -via the External RFC Process. +This document describes [[SV|sv]] augmentation of the [[Power|openpower]] v3.0B [[ISA|openpower/isa/]]. Credits and acknowledgements: @@ -19,9 +17,12 @@ Credits and acknowledgements: * NLnet Foundation, for funding * OpenPOWER Foundation * Paul Mackerras +* Brad Frey +* Cathy May * Toshaan Bharvani * IBM for the Power ISA itself + Links: * > @@ -30,166 +31,742 @@ Links: * * * TODO elwidth "infinite" discussion -* Saturating description. +* Saturating description. +* TODO [[sv/svp64-single]] +* External RFC ls010 +* [[sv/branches]] chapter +* [[sv/ldst]] chapter Table of contents [[!toc]] - -# Introduction - -This document focuses on the encoding of [[SV|sv]], and assumes familiarity with the same. It does not cover how SV works (merely the instruction encoding), and is therefore best read in conjunction with the [[sv/overview]], as well as the [[sv/svp64_quirks]] section. - -All bit numbers are in MSB0 form (the bits are numbered from 0 at the MSB -and counting up as you move to the LSB end). All bit ranges are inclusive -(so `4:6` means bits 4, 5, and 6). 
- -64-bit instructions are split into two 32-bit words, the prefix and the -suffix. The prefix always comes before the suffix in PC order. - -| 0:5 | 6:31 | 0:31 | -|--------|--------------|--------------| -| EXT01 | v3.1 Prefix | v3.1 Suffix | - -svp64 fits into the "reserved" portions of the v3.1 prefix, making it possible for svp64, v3.0B (or v3.1 including 64 bit prefixed) instructions to co-exist in the same binary without conflict. + + +## Introduction + +Simple-V is a type of Vectorization best described as a "Prefix Loop +Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR`[^bib_ldir] instruction and +to the 8086 `REP`[^bib_rep] Prefix instruction. More advanced features are similar +to the Z80 `CPIR`[^bib_cpir] instruction. + +[^bib_ldir]: [Zilog Z80 LDIR](http://z80-heaven.wikidot.com/instructions-set:ldir) +[^bib_cpir]: [Zilog Z80 CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir) +[^bib_rep]: [8086 REP](https://www.felixcloutier.com/x86/rep:repe:repz:repne:repnz) + +Except where explicitly stated all bit numbers remain as in the rest of +the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on +the left and counting up as you move rightwards to the LSB end). All bit +ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order). +**All register numbering and element numbering however is LSB0 ordering** +which is a different convention from that used elsewhere in the Power ISA. + +The SVP64 prefix always comes before the suffix in PC order and must be +considered an independent "Defined Word-instruction"[^dwi] that augments the behaviour of +the following instruction (also a Defined Word-instruction), but does **not** change the actual Decoding +of that following instruction just because it is Prefixed. Unlike EXT100-163, +where the Suffix is considered an entirely new Opcode Space, +SVP64-Prefixed instructions must never be treated or regarded +as a different Opcode Space. 
+ +[^dwi]: Defined Word-instruction: Power ISA v3.1 Section 1.6 + +Two apparent exceptions to the above hard rule exist: SV +Branch-Conditional operations and LD/ST-update "Post-Increment" +Mode. Post-Increment was considered sufficiently high priority +(significantly reducing hot-loop instruction count) that one bit in +the Prefix is reserved for it (*Note the intention to release that bit +and move Post-Increment instructions to EXT2xx, as part of [[sv/rfc/ls011]]*). +Vectorized Branch-Conditional operations "embed" the original Scalar +Branch-Conditional behaviour into a much more advanced variant that is +highly suited to High-Performance Computation (HPC), Supercomputing, +and parallel GPU Workloads. + +*Architectural Resource Allocation note: at present it is possible to perform +partial parallel decode of the SVP64 24-bit Encoding Area at the same time +as decoding of the Suffix. Multi-Issue Implementations may even +Decode multiple 32-bit words in parallel and follow up with a second +cycle of joining Prefix and Suffix "after-the-fact". +Mixing and overlaying 64-bit Opcode Encodings into the +{SVP64 24-bit Prefix}{Defined Word-instruction} space creates +a hard dependency that catastrophically damages Multi-Issue Decoding by +greatly complexifying Parallel Instruction-Length Detection. +Therefore it has to be prohibited to accept RFCs +which fundamentally violate the following hard requirement: **under no circumstances** +must the use of SVP64 24-bit Suffixes **also** imply a different Opcode space +from **any** non-prefixed Word. Even RESERVED or Illegal Words must be +Orthogonal.* Subset implementations in hardware are permitted, as long as certain rules are followed, allowing for full soft-emulation including future -revisions. Details in the [[svp64/appendix]]. +revisions. Compliancy Subsets exist to ensure minimum levels of binary +interoperability expectations within certain environments. Details in +the [[svp64/appendix]]. 
## SVP64 encoding features -A number of features need to be compacted into a very small space of only 24 bits: +A number of features need to be compacted into a very small space of +only 24 bits: -* Independent per-register Scalar/Vector tagging and range extension on every register +* Independent per-register Scalar/Vector tagging and range extension on + every register * Element width overrides on both source and destination * Predication on both source and destination * Two different sources of predication: INT and CR Fields -* SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and - predicate-result mode. +* SV Modes including saturation (for Audio, Video and DSP), mapreduce, + and fail-first mode. -This document focusses specifically on how that fits into available space. The [[svp64/appendix]] explains more of the details, whilst the [[sv/overview]] gives the basics. +Different classes of operations require different formats. The earlier +sections cover the common formats and the five separate modes have their own +section later: +* CR operations (crops), +* Arithmetic/Logical (termed "normal"), +* Load/Store Immediate, +* Load/Store Indexed, +* Branch-Conditional. -# Definition of Reserved in this spec. +## Definition of Reserved in this spec. For the new fields added in SVP64, instructions that have any of their fields set to a reserved value must cause an illegal instruction trap, -to allow emulation of future instruction sets, or for subsets of SVP64 -to be implemented in hardware and the rest emulated. -This includes SVP64 SPRs: reading or writing values which are not -supported in hardware must also raise illegal instruction traps -in order to allow emulation. +to allow emulation of future instruction sets, or for subsets of SVP64 to +be implemented in hardware and the rest emulated. 
This includes SVP64 +SPRs: reading or writing values which are not supported in hardware +must also raise illegal instruction traps in order to allow emulation. Unless otherwise stated, reserved values are always all zeros. -This is unlike OpenPower ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard OpenPOWER definition -is intended the red keyword `RESERVED` is used. - -# Scalar Identity Behaviour - -SVP64 is designed so that when the prefix is all zeros, and - VL=1, no effect or -influence occurs (no augmentation) such that all standard OpenPOWER -v3.0/v3 1 instructions covered by the prefix are "unaltered". This is termed `scalar identity behaviour` (based on the mathematical definition for "identity", as in, "identity matrix" or better "identity transformation"). - -Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix) - whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation"). - -# Register Naming and size - -SV Registers are simply the INT, FP and CR register files extended -linearly to larger sizes; SV Vectorisation iterates sequentially through these registers. - -Where the integer regfile in standard scalar -OpenPOWER v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127. -Likewise FP registers are extended to 128 (fp0 to fp127), and CRs are -extended to 128 entries, CR0 thru CR127. +This is unlike OpenPower ISA v3.1, which in many instances does not +require a trap if reserved fields are nonzero, instead relying on software +to avoid use of such fields. Where the standard Power +ISA definition is intended the red keyword `RESERVED` is used. 
+ +## Definition of "PO9-Prefixed" + +Used in the context of "A PO9-Prefixed Word" this is a new area similar to EXT100-163 +that is shared between SVP64-Single, SVP64, 32 Vectorizable new Opcode areas +EXT200-231, one RESERVED 57-bit future Opcode space, and three new Unvectorizable +RESERVED 32-bit future Opcode spaces. See [[sv/po9_encoding]]. + +## Definition of "SVP64-Prefix" + +A 24-bit RISC-Paradigm Encoding area for Loop-Augmentation of the following +"Defined Word-instruction". +Used in the context of "An SVP64-Prefixed Defined Word-instruction", as separate and +distinct from the 32-bit PO9-Prefix that holds a 24-bit SVP64 Prefix. + +## Definition of "Vectorizable" and "Unvectorizable" + +"Vectorizable" Defined Word-instructions are Scalar instructions that +benefit from SVP64 Loop-Prefixing. +Conversely, any operation that inherently makes no sense if repeated in a +Vector Loop is termed +"Unvectorizable" or "Unvectorized". Examples include `sc` and `sync` +which have no registers. `mtmsr` is also classed as Unvectorizable +because there is only one `MSR`. + +Unvectorized instructions are required to be detected as such if +Prefixed (either SVP64 or SVP64Single) and an Illegal Instruction +Trap raised. + +*Architectural Note: Given that a "pre-classification" Decode Phase is +required (identifying whether the Suffix - Defined Word-instruction - is +Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional), +adding "Unvectorized" to this phase is not unreasonable.* + +Vectorizable Defined Word-instructions are **required** to be Vectorized, +or they may not be permitted to be added at all to the Power ISA as Defined +Word-instructions. 
+ +*Engineering note: implementations may not choose to add Defined Word-instructions +without also adding hardware support for SVP64-Prefixing of the same.* + +*ISA Working Group note: Vectorized PackedSIMD instructions if ever proposed +should be considered Unvectorizable and except in extreme mitigating circumstances +rejected outright.* + +## Definition of Strict Element-Level Execution Order + +Where Instruction Execution Order[^ieo] guarantees the appearance of sequential +execution of instructions, Simple-V requires a corresponding guarantee for Elements +because in Simple-V Execution of Elements is synonymous with Execution of +instructions. + +[^ieo]: Strict Instruction Execution Order is defined in Public v3.1 Book I Section 2.2 + +## Precise Interrupt Guarantees + +Strict Instruction Execution Order is defined as giving the appearance, as far +as programs are concerned, that instructions were executed +strictly in the sequence that they occurred. A "Precise" +out-of-order +Micro-architecture goes to considerable lengths to ensure that +this is the case. + +Many Vector ISAs allow interrupts to occur in the middle of +processing of large Vector operations, only under the condition +that partial results are cleanly discarded, and continuation on return +from the Trap Handler will restart the entire operation. +The reason is that saving of full Architectural State is +not practical. An example would be a Floating-Point Horizontal Sum instruction +(very common in Vector ISAs) or a Dot Product instruction +that specifies a higher degree of accuracy for the *internal* +accumulator than the registers. + +Simple-V operates on an entirely different paradigm from traditional +Vector ISAs: as a "Sub-Execution Context", where "Elements" are synonymous +with Scalar instructions. With this in mind +implementations must observe Strict **Element**-Level Execution Order[[#svp64_eeo]] +at all times. 
+*Any* element is Interruptible, and Architectural State may +be fully preserved and restored regardless of that same State. + +*Engineering note: implementations are permitted to have higher latency to +perform context-switching (particularly if REMAP +is active).* + +Interrupts still only save `PC` and `MSR` in `SRR0` and `SRR1` respectively +but the full SVP64 Architectural State may be saved and +restored through manual copying of `SVSTATE` (and the four +REMAP SPRs if in use at the time, which may be determined by +`SVSTATE[32:46]` being non-zero). + +*Programmer's note: Trap Handlers (and any stack-based context save/restore) +must avoid the use of SVP64 Prefixed instructions to perform the necessary +save/restore of Simple-V Architectural State (SPR SVSTATE), +just as use of FPRs and VSRs is presently avoided. +However once saved, and set to known-good, SVP64 Prefixed instructions +may be used to save/restore GPRs, SPRs, FPRs and other state.* + +*Programmer's note: SVSHAPE0-3 alter Element Execution Order, but only +if activated in SVSHAPE. It is therefore technically possible in a Trap +Handler to save SVSTATE (`mfspr t0, SVSTATE`), then clear bits 32-46. +At this point it becomes safe to use SVP64 to save sequential batches +of SPRs (`setvli MAXVL=VL=4; sv.mfspr *t0, *SVSHAPE0`)* + +The only major caveat for REMAP is that +after an explicit change to +Architectural State caused by writing to the +Simple-V SPRs, some implementations may find +it easier to take longer to calculate where in a given Schedule +the re-mapping Indices were. Obvious examples include Interrupts occurring +in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3 +for example), which +will force some implementations to perform divide and modulo +calculations. + +An additional caveat involves Condition Register Fields +when also used as Predicate Masks. 
An operation that +overwrites the same CR Fields that are simultaneously +being used as a Predicate Mask should exercise extreme care +if the overwritten CR field element was needed by a +subsequent Element for its Predicate Mask bit. + +Some implementations may deploy Cray's technique of +"Vector Chaining" (including in this case reading the CR field +containing the Predicate bit until the very last moment), +and consequently avoiding the risk of +overwrite is the responsibility of the Programmer. +`hphint` may be used here to good effect. +Extra Special care is particularly needed here when using REMAP +and also Vertical-First Mode. + +The simplest option is to use Integer Predicate Masks but the +caveats are stricter: + +* In Vertical-First loops Programmers **must not** write to any + Integers (r3, r0, r31) used as Predicate Masks. Doing so + is `UNDEFINED` behaviour. +* An **entire** Vector is held up on Horizontal-First Mode if the + Integer Predicate is still in in-flight Reservation Stations + or pipelines. Speculative Vector Chained Execution mitigates delays + but can be heavy on Reservation Station resources. + +## Register files, elements, and Element-width Overrides + +The relationship between register files, elements, and element-width +overrides is expressed as follows: + +* register files are considered to be *byte-level* contiguous SRAMs, + accessed exclusively in Little-Endian Byte-Order at all times +* elements are sequential contiguous unbounded arrays starting at the "address" + of any given 64-bit GPR or FPR, numbered from 0 as the first, + "spilling" into numerically-sequentially-increasing GPRs +* element-width overrides set the width of the *elements* in the + sequentially-numbered contiguous array. + +The relationship is best defined in Canonical form, below, in ANSI c as a +union data structure. 
A key difference is that VSR elements are bounded +fixed at 128-bit, where SVP64 elements are conceptually unbounded and +only limited by the Maximum Vector Length. + +*Future specification note: SVP64 may be defined on top of VSRs in future. +At which point VSX also gains conceptually unbounded VSR register elements* + +In the Upper Compliancy Levels of SVP64 the size of the GPR and FPR +Register files are expanded from 32 to 128 entries, and the number of +CR Fields expanded from CR0-CR7 to CR0-CR127. (Note: A future version +of SVP64 is anticipated to extend the VSR register file). + +Memory access remains exactly the same: the effects of `MSR.LE` remain +exactly the same, affecting as they already do and remain **only** +on the Load and Store memory-register operation byte-order, and having +nothing to do with the ordering of the contents of register files or +register-register arithmetic or logical operations. + +The only major impact on Arithmetic and Logical operations is that all +Scalar operations are defined, where practical and workable, to have +three new widths: elwidth=32, elwidth=16, elwidth=8. + +*Architectural note: a future revision of SVP64 for VSX may have entirely +different definitions of possible elwidths.* + +The default of +elwidth=64 is the pre-existing (Scalar) behaviour which remains 100% +unchanged. Thus, `addi` is now joined by a 32-bit, 16-bit, and 8-bit +variant of `addi`, but the sole exclusive difference is the width. +*In no way* is the actual `addi` instruction fundamentally altered +to become an entirely different operation (such as a subtract or multiply). +FP Operations elwidth overrides are also defined, as explained in +the [[svp64/appendix]]. 
+ +To be absolutely clear: + +``` + There are no conceptual arithmetic ordering or other changes over the + Scalar Power ISA definitions to registers or register files or to + arithmetic or Logical Operations, beyond element-width subdivision +``` + +Element offset +numbering is naturally **LSB0-sequentially-incrementing from zero, not +MSB0-incrementing** including when element-width overrides are used, +at which point the elements progress through each register +sequentially from the LSB end +(confusingly numbered the highest in MSB0 ordering) and progress +incrementally to the MSB end (confusingly numbered the lowest in +MSB0 ordering). + +When exclusively using MSB0-numbering, SVP64 becomes unnecessarily complex +to both express and subsequently understand: the required conditional +subtractions from 63, 31, 15 and 7 needed to express the fact that +elements are LSB0-sequential unfortunately become a hostile minefield, +obscuring both intent and meaning. Therefore for the purposes of this +section the more natural **LSB0 numbering is assumed** and it is left +to the reader to translate to MSB0 numbering. + +The Canonical specification for how element-sequential numbering and +element-width overrides is defined is expressed in the following c +structure, assuming a Little-Endian system, and naturally using LSB0 +numbering everywhere because the ANSI c specification is inherently LSB0. +Note the deliberate similarity to how VSX register elements are defined, +from Figure 97, Book I, Section 6.3, Page 258: + +``` + #pragma pack + typedef union { + uint8_t actual_bytes[8]; + // all of these are very deliberately unbounded arrays + // that intentionally "wrap" into subsequent actual_bytes... + uint8_t bytes[]; // elwidth 8 + uint16_t hwords[]; // elwidth 16 + uint32_t words[]; // elwidth 32 + uint64_t dwords[]; // elwidth 64 + + } el_reg_t; + + // ... here, as packed statically-defined GPRs. 
+ el_reg_t int_regfile[128]; + + // use element 0 as the destination + void get_register_element(el_reg_t* el, int gpr, int element, int width) { + switch (width) { + case 64: el->dwords[0] = int_regfile[gpr].dwords[element]; break; + case 32: el->words[0] = int_regfile[gpr].words[element]; break; + case 16: el->hwords[0] = int_regfile[gpr].hwords[element]; break; + case 8 : el->bytes[0] = int_regfile[gpr].bytes[element]; break; + } + } + + // use element 0 as the source + void set_register_element(el_reg_t* el, int gpr, int element, int width) { + switch (width) { + case 64: int_regfile[gpr].dwords[element] = el->dwords[0]; break; + case 32: int_regfile[gpr].words[element] = el->words[0]; break; + case 16: int_regfile[gpr].hwords[element] = el->hwords[0]; break; + case 8 : int_regfile[gpr].bytes[element] = el->bytes[0]; break; + } + } +``` + +Example Vector-looped add operation implementation when elwidths are 64-bit: + +``` + # vector-add RT, RA, RB using the "uint64_t" union member "dwords" + for i in range(VL): + int_regfile[RT].dwords[i] = int_regfile[RA].dwords[i] + int_regfile[RB].dwords[i] +``` + +However if elwidth overrides are set to 16 for both source and destination: + +``` + # vector-add RT, RA, RB using the "uint16_t" union member "hwords" + for i in range(VL): + int_regfile[RT].hwords[i] = int_regfile[RA].hwords[i] + int_regfile[RB].hwords[i] +``` + +The most fundamental aspect here to understand is that the wrapping +into subsequent Scalar GPRs that occurs on larger-numbered elements +including and especially on smaller element widths is **deliberate +and intentional**. From this Canonical definition it should be clear +that sequential elements begin at the LSB end of any given underlying +Scalar GPR, progress to the MSB end, and then to the LSB end of the +*next numerically-larger Scalar GPR*. In the example above if VL=5 +and RT=1 then the contents of GPR(1) and GPR(2) will be as follows. 
+For clarity in the table below: + +* Both MSB0-ordered bit-numbering *and* LSB0-ordered bit-numbering are shown +* The GPR-numbering is considered LSB0-ordered +* The Element-numbering (result0-result4) is LSB0-ordered +* Each of the results (result0-result4) is 16-bit +* "same" indicates "no change as a result of the Vectorized add" + +``` + | MSB0: | 0:15 | 16:31 | 32:47 | 48:63 | + | LSB0: | 63:48 | 47:32 | 31:16 | 15:0 | + |--------|---------|---------|---------|---------| + | GPR(0) | same | same | same | same | + | GPR(1) | result3 | result2 | result1 | result0 | + | GPR(2) | same | same | same | result4 | + | GPR(3) | same | same | same | same | + | ... | ... | ... | ... | ... | + | ... | ... | ... | ... | ... | +``` + +Note that the upper 48 bits of GPR(2) would **not** be modified due to +the example having VL=5. Thus on "wrapping" - sequential progression +from GPR(1) into GPR(2) - the 5th result modifies **only** the bottom +16 LSBs of GPR(2). + +If the 16-bit operation were to be followed up with a 32-bit Vectorized +Operation, the exact same contents would be viewed as follows: + +``` + | MSB0: | 0:31 | 32:63 | + | LSB0: | 63:32 | 31:0 | + |--------|----------------------|----------------------| + | GPR(0) | same | same | + | GPR(1) | (result3 || result2) | (result1 || result0) | + | GPR(2) | same | (same || result4) | + | GPR(3) | same | same | + | ... | ... | ... | + | ... | ... | ... | +``` + +In other words, this perspective really is no different from the situation +where the actual Register File is treated as an Industry-standard +byte-level-addressable Little-Endian-addressed SRAM. Note that +this perspective does **not** involve `MSR.LE` in any way shape or +form because `MSR.LE` is directly in control of the Memory-to-Register +byte-ordering. This section is exclusively about how to correctly perceive +Simple-V-Augmented **Register** Files. 
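
The byte-addressable-SRAM view above can be sketched in a few lines of Python. This is only an illustrative model, not part of the specification: the names `regfile`, `get_element`, `set_element` and `vector_add`, and the choice of RA=8 and RB=16, are assumptions made for the example. It reproduces the VL=5, RT=1, 16-bit case from the tables.

```python
# Hypothetical model: 128 GPRs of 64 bits each, held as one flat byte array.
NUM_GPRS = 128
regfile = bytearray(NUM_GPRS * 8)

def get_element(gpr, element, width):
    """Return element `element` (LSB0-numbered) of `width` bits, counting from `gpr`."""
    nbytes = width // 8
    start = gpr * 8 + element * nbytes
    return int.from_bytes(regfile[start:start + nbytes], "little")

def set_element(gpr, element, width, value):
    """Write `value`, truncated to `width` bits, touching only that element's bytes."""
    nbytes = width // 8
    start = gpr * 8 + element * nbytes
    truncated = value & ((1 << width) - 1)
    regfile[start:start + nbytes] = truncated.to_bytes(nbytes, "little")

def vector_add(rt, ra, rb, vl, width):
    """Element-wise add; elements spill into the next GPR once one is full."""
    for i in range(vl):
        total = get_element(ra, i, width) + get_element(rb, i, width)
        set_element(rt, i, width, total)

# Reproduce the example: VL=5, RT=1, 16-bit elements. RA=8, RB=16 are arbitrary.
set_element(2, 0, 64, 0xAAAABBBBCCCCDDDD)   # prior contents of GPR(2)
for i in range(5):
    set_element(8, i, 16, i + 1)            # source elements 1..5
    set_element(16, i, 16, 0x100)
vector_add(1, 8, 16, 5, 16)

gpr1 = get_element(1, 0, 64)    # result3..result0 packed from the LSB end
gpr2 = get_element(2, 0, 64)    # only the bottom 16 bits hold result4
word2 = get_element(1, 2, 32)   # 32-bit view: (same || result4)
```

With these inputs `gpr1` is `0x0104010301020101` and `gpr2` is `0xAAAABBBBCCCC0105`: the fifth result lands in the bottom 16 bits of GPR(2) and the upper 48 bits are untouched. Reading 32-bit element 2 starting from GPR(1) returns `0xCCCC0105`, matching the second table.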
+ +*Engineering note: to avoid a Read-Modify-Write at the register +file it is strongly recommended to implement byte-level write-enable lines +exactly as has been implemented in DRAM ICs for many decades. Additionally +the predicate mask bit is advised to be associated with the element +operation and alongside the result ultimately passed to the register file. +When element-width is set to 64-bit the relevant predicate mask bit +may be repeated eight times and pull all eight write-port byte-level +lines HIGH. Clearly when element-width is set to 8-bit the relevant +predicate mask bit corresponds directly with one single byte-level +write-enable line. It is up to the Hardware Architect to then amortise +(merge) elements together into both PredicatedSIMD Pipelines as well +as simultaneous non-overlapping Register File writes, to achieve High +Performance designs. Overall it helps to think of the GPR and FPR +register files as being much more akin to a 64-bit-wide byte-level-addressable SRAM.* + +**Comparative equivalent using VSR registers** + +For a comparative data point the VSR Registers may be expressed in the +same fashion. The c code below is directly an expression of Figure 97 in +Power ISA Public v3.1 Book I Section 6.3 page 258, *after compensating +for MSB0 numbering in both bits and elements, adapting in full to LSB0 +numbering, and obeying LE ordering*. + +**Crucial to understanding why the subtraction from 1,3,7,15 is present is +because the Power ISA numbers VSX Registers elements also in MSB0 order**. +SVP64 very specifically numbers elements in **LSB0** order with the first +element (numbered zero) being at the bitwise-numbered **LSB** end of the +register, where VSX does the reverse: places the numerically-*highest* +(last-numbered) element at the LSB end of the register. + +``` + #pragma pack + typedef union { + // these do NOT match their Power ISA VSX numbering directly, they are all reversed + // bytes[15] is actually VSR.byte[0] for example. 
if this convention is not + // followed then everything ends up in the wrong place + uint8_t bytes[16]; // elwidth 8, QTY 16 FIXED total + uint16_t hwords[8]; // elwidth 16, QTY 8 FIXED total + uint32_t words[4]; // elwidth 32, QTY 4 FIXED total + uint64_t dwords[2]; // elwidth 64, QTY 2 FIXED total + uint8_t actual_bytes[16]; // totals 128-bit + } el_reg_t; + + el_reg_t VSR_regfile[64]; + + static void check_num_elements(int elt, int width) { + switch (width) { + case 64: assert(elt < 2); break; + case 32: assert(elt < 4); break; + case 16: assert(elt < 8); break; + case 8 : assert(elt < 16); break; + } + } + void get_VSR_element(el_reg_t* el, int gpr, int elt, int width) { + check_num_elements(elt, width); + switch (width) { + case 64: el->dwords[0] = VSR_regfile[gpr].dwords[1-elt]; break; + case 32: el->words[0] = VSR_regfile[gpr].words[3-elt]; break; + case 16: el->hwords[0] = VSR_regfile[gpr].hwords[7-elt]; break; + case 8 : el->bytes[0] = VSR_regfile[gpr].bytes[15-elt]; break; + } + } + void set_VSR_element(el_reg_t* el, int gpr, int elt, int width) { + check_num_elements(elt, width); + switch (width) { + case 64: VSR_regfile[gpr].dwords[1-elt] = el->dwords[0]; break; + case 32: VSR_regfile[gpr].words[3-elt] = el->words[0]; break; + case 16: VSR_regfile[gpr].hwords[7-elt] = el->hwords[0]; break; + case 8 : VSR_regfile[gpr].bytes[15-elt] = el->bytes[0]; break; + } + } +``` + +For VSR Registers one key difference is that the overlay of different +element widths is clearly a *bounded static quantity*, whereas for +Simple-V the elements are unrestrained and permitted to flow into +*successive underlying Scalar registers*. This difference is absolutely +critical to a full understanding of the entire Simple-V paradigm and +why element-ordering, bit-numbering *and register numbering* are all so +strictly defined. + +Implementations are not permitted to violate the Canonical +definition. Software will be critically relying on the wrapped (overflow) +behaviour inherently implied by the unbounded variable-length c arrays. 
+ +Illustrating the exact same loop with the exact same effect as achieved +by Simple-V we are first forced to create wrapper functions, to cater +for the fact that VSR register elements are static bounded: + +``` + int calc_VSR_reg_offs(int elt, int width) { + switch (width) { + case 64: return floor(elt / 2); + case 32: return floor(elt / 4); + case 16: return floor(elt / 8); + case 8 : return floor(elt / 16); + } + } + int calc_VSR_elt_offs(int elt, int width) { + switch (width) { + case 64: return (elt % 2); + case 32: return (elt % 4); + case 16: return (elt % 8); + case 8 : return (elt % 16); + } + } + void _set_VSR_element(el_reg_t* el, int gpr, int elt, int width) { + int new_elt = calc_VSR_elt_offs(elt, width); + int new_reg = calc_VSR_reg_offs(elt, width); + set_VSR_element(el, gpr+new_reg, new_elt, width); + } +``` + +And finally use these functions: + +``` + # VSX-add RT, RA, RB using the "uint64_t" union member "hwords" + for i in range(VL): + el_reg_t result, ra, rb; + _get_VSR_element(&ra, RA, i, 16); + _get_VSR_element(&rb, RB, i, 16); + result.hwords[0] = ra.hwords[0] + rb.hwords[0]; // use array 0 elements + _set_VSR_element(&result, RT, i, 16); + +``` + +## Scalar Identity Behaviour + +SVP64 is designed so that when the prefix is all zeros, and VL=1, no +effect or influence occurs (no augmentation) such that all standard Power +ISA v3.0/v3.1 instructions covered by the prefix are "unaltered". This +is termed `scalar identity behaviour` (based on the mathematical +definition for "identity", as in, "identity matrix" or better "identity +transformation"). + +Note that this is completely different from when VL=0. VL=0 turns all +operations under its influence into `nops` (regardless of the prefix) +whereas when VL=1 and the SV prefix is all zeros, the operation simply +acts as if SV had not been applied at all to the instruction (an +"identity transformation"). 
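
The distinction between VL=0 and VL=1 can be sketched as a plain loop. This is a simplified illustration under stated assumptions, not the specification's pseudocode: `sv_loop` is a hypothetical stand-in for an all-zeros prefix, and `scalar_op` stands for the unmodified Scalar instruction applied to one element index.

```python
def sv_loop(scalar_op, vl):
    """Run scalar_op once per element index 0..vl-1 (hypothetical all-zeros prefix)."""
    for i in range(vl):
        scalar_op(i)

calls = []
sv_loop(calls.append, 0)   # VL=0: the loop body never runs, so the operation is a nop
after_vl0 = list(calls)
sv_loop(calls.append, 1)   # VL=1: exactly one execution, on element 0
```

After the first call `after_vl0` is empty, and after the second `calls` holds the single index 0: with VL=1 the loop degenerates to one execution of the Scalar operation, which is the identity behaviour described above.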
+ +The fact that `VL` is dynamic and can be set to any value at runtime +based on program conditions and behaviour means very specifically that +`scalar identity behaviour` is **not** a redundant encoding. If the only +means by which VL could be set was by way of static-compiled immediates +then this assertion would be false. VL should not be confused with +MAXVL when understanding this key aspect of SimpleV. + +## Register Naming and size + +As indicated above SV Registers are simply the GPR, FPR and CR register +files extended linearly to larger sizes; SV Vectorization iterates +sequentially through these registers (LSB0 sequential ordering from 0 +to VL-1). + +Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is +r0 to r31, SV extends this range (in the Upper Compliancy Levels of SV) +as r0 to r127. Likewise FP registers are +extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries, +CR0 thru CR127. In the Lower SV Compliancy Levels the quantity of registers +remains the same in order to reduce implementation cost for Embedded systems. The names of the registers therefore reflects a simple linear extension -of the OpenPOWER v3.0B / v3.1B register naming, and in hardware this +of the Power ISA v3.0B / v3.1B register naming, and in hardware this would be reflected by a linear increase in the size of the underlying SRAM used for the regfiles. -Note: when an EXTRA field (defined below) is zero, SV is deliberately designed -so that the register fields are identical to as if SV was not in effect -i.e. under these circumstances (EXTRA=0) the register field names RA, -RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of -`scalar identity behaviour` described above. +Note: when an EXTRA field (defined below) is zero, SV is deliberately +designed so that the register fields are identical to as if SV was not in +effect i.e. under these circumstances (EXTRA=0) the register field names +RA, RB etc. 
are interpreted and treated as v3.0B / v3.1B scalar registers. +This is part of `scalar identity behaviour` described above. + +**Condition Register(s)** + +The Scalar Power ISA Condition Register is a 64 bit register where +the top 32 MSBs (numbered 0:31 in MSB0 numbering) are not used. +This convention is *preserved* in SVP64 and an additional 15 Condition +Registers provided in order to store the new CR Fields, CR8-CR15, +CR16-CR23 etc. sequentially. The top 32 MSBs in each new SVP64 Condition +Register are *also* not used: only the bottom 32 bits (numbered 32:63 +in MSB0 numbering). + +*Programmer's note: using `sv.mfcr` without element-width overrides +to take into account the fact that the top 32 MSBs are zero and thus +effectively doubling the number of GPR registers required to hold all 128 +CR Fields would seem the only option because a source elwidth override +to 32-bit would take only the bottom 16 LSBs of the Condition Register +and set the top 16 LSBs to zeros. However in this case it +is possible to use destination element-width overrides (for `sv.mfcr`. +source overrides would be used on the GPR of `sv.mtocrf`), whereupon +truncation of the 64-bit Condition Register(s) occurs, throwing away +the zeros and storing the remaining (valid, desired) 32-bit values +sequentially into (LSB0-convention) lower-numbered and upper-numbered +halves of GPRs respectively. The programmer is expected to be aware +however that the full width of the entire 64-bit Condition Register +is considered to be "an element". This is **not** like any other +Condition-Register instructions because all other CR instructions, +on closer investigation, will be observed to all be CR-bit or CR-Field +related. Thus a `VL` of 16 must be used* + +**Condition Register Fields as Predicate Masks** + +Condition Register Fields perform an additional duty in Simple-V: they are +used for Predicate Masks. 
ARM's Scalar Instruction Set calls single-bit +predication "Conditional Execution", and utilises Condition Codes for +exactly this purpose to solve the problem caused by Branch Speculation. +In a Vector ISA context the concept of Predication is naturally extended +from single-bit to multi-bit, and the (well-known) benefits become all the +more critical given that parallel branches in Vector ISAs are impossible +(even a Vector ISA can only have Scalar branches). + +However the Scalar Power ISA does not have Conditional Execution (for +which, if it had ever been considered, Condition Register bits would be +a perfect natural fit). Thus, when adding Predication using CR Fields +via Simple-V it becomes a somewhat disruptive addition to the Power ISA. + +To ameliorate this situation, particularly for pre-existing Hardware +designs implementing up to Scalar Power ISA v3.1, some rules are set that +allow those pre-existing designs not to require heavy modification to +their existing Scalar pipelines. These rules effectively allow Hardware +Architects to add the additional CR Fields CR8 to CR127 as if they were +an **entirely separate register file**. + +* any instruction involving more than 1 source 1 destination + where one of the operands is a Condition Register is prohibited from + using registers from both the CR0-7 group and the CR8-127 group at + the same time. +* any instruction involving 1 source 1 destination where either the + source or the destination is a Condition Register is prohibited + from setting CR0-7 as a Vector. +* prohibitions are required to be enforced by raising Illegal Instruction + Traps + +Examples of permitted instructions: + +``` + sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127 + sv.mfcr cr5, *cr40 # only one source (CR40) copied to CR5 + sv.mfcr *cr16, cr40 # Vector-Splat CR40 onto CR16,17,18... + sv.mfcr *cr16, cr3 # Vector-Splat CR3 onto CR16,17,18... 
+``` + +Examples of prohibited instructions: + +``` + sv.mfcr *cr0, cr40 # Vector-Splat onto CR0,1,2 + sv.crand cr7, cr9, cr10 # crosses over between CR0-7 and CR8-127 +``` ## Future expansion. -With the way that EXTRA fields are defined and applied to register fields, -future versions of SV may involve 256 or greater registers. To accommodate 256 registers, numbering of Vectors will simply shift up by one bit, without -requiring additional prefix bits. Backwards binary compatibility may be achieved with a PCR bit (Program Compatibility Register). Beyond this, further discussion is out of scope for this version of svp64. - -# Remapped Encoding (`RM[0:23]`) - -To allow relatively easy remapping of which portions of the Prefix Opcode -Map are used for SVP64 without needing to rewrite a large portion of the -SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to -a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]` -at the LSB. - -The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding -is defined in the Prefix Fields section. - -## Prefix Opcode Map (64-bit instruction encoding) +With the way that EXTRA fields are defined and applied to register +fields, future versions of SV may involve 256 or greater registers +in some way as long as the reputation of Power ISA for full backwards +binary interoperability is preserved. Backwards binary compatibility +may be achieved with a PCR bit (Program Compatibility Register) or an +MSR bit analogous to SF. Further discussion is out of scope for this +version of SVP64. -In the original table in the v3.1B OpenPOWER ISA Spec on p1350, Table 12, prefix bits 6:11 are shown, with their allocations to different v3.1B pregix "modes". +Additionally, a future variant of SVP64 will be applied to the Scalar +(Quad-precision and 128-bit) VSX instructions. 
Element-width overrides are +an opportunity to expand a future version of the Power ISA to 256-bit, +512-bit and 1024-bit operations, as well as doubling or quadrupling the +number of VSX registers to 128 or 256. Again, further discussion is out +of scope for this version of SVP64. -The table below hows both PowerISA v3.1 instructions as well as new SVP instructions fit; -empty spaces are yet-to-be-allocated Illegal Instructions. +-------- -| 6:11 | ---000 | ---001 | ---010 | ---011 | ---100 | ---101 | ---110 | ---111 | -|------|--------|--------|--------|--------|--------|--------|--------|--------| -|000---| 8LS | 8LS | 8LS | 8LS | 8LS | 8LS | 8LS | 8LS | -|001---| | | | | | | | | -|010---| 8RR | | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`| -|011---| | | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`| -|100---| MLS | MLS | MLS | MLS | MLS | MLS | MLS | MLS | -|101---| | | | | | | | | -|110---| MRR | | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`| -|111---| | MMIRR | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`| +\newpage{} -Note that by taking up a block of 16, where in every case bits 7 and 9 are set, this allows svp64 to utilise four bits of the v3.1B Prefix space and "allocate" them to svp64's Remapped Encoding field, instead. +## SVP64 Remapped Encoding (`RM[0:23]`) -## Prefix Fields +In the SVP64 Vector Prefix spaces, the 24 bits 8-31 are termed `RM`. Bits +32-37 are the Primary Opcode of the Suffix "Defined Word-instruction". Bits 38-63 are the +remainder of the Defined Word-instruction. Note that for the new EXT232-263 SVP64 area +it is mandatory that bit 32 is set to 1. -To "activate" svp64 (in a way that does not conflict with v3.1B 64 bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set -(see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV. 
-This is achieved by setting bits 7 and 9 to 1: +| 0-5 | 6 | 7 | 8-31 | 32-37 | 38-63 | Description | +|-----|---|---|----------|--------|----------|-----------------------| +| PO | 0 | 1 | RM[0:23] | 1nnnnn | xxxxxxxx | SVP64:EXT232-263 | +| PO | 1 | 1 | RM[0:23] | nnnnnn | xxxxxxxx | SVP64:EXT000-063 | -| Name | Bits | Value | Description | -|------------|---------|-------|--------------------------------| -| EXT01 | `0:5` | `1` | Indicates Prefixed 64-bit | -| `RM[0]` | `6` | | Bit 0 of Remapped Encoding | -| SVP64_7 | `7` | `1` | Indicates this is SVP64 | -| `RM[1]` | `8` | | Bit 1 of Remapped Encoding | -| SVP64_9 | `9` | `1` | Indicates this is SVP64 | -| `RM[2:23]` | `10:31` | | Bits 2-23 of Remapped Encoding | - -Laid out bitwise, this is as follows, showing how the 32-bits of the prefix -are constructed: - -| 0:5 | 6 | 7 | 8 | 9 | 10:31 | -|--------|-------|---|-------|---|----------| -| EXT01 | RM | 1 | RM | 1 | RM | -| 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] | - -Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1 -instruction. That instruction becomes "prefixed" with the SVP context: the -Remapped Encoding field (RM). - -It is important to note that unlike v3.1 64-bit prefixed instructions +It is important to note that unlike EXT1xx 64-bit prefixed instructions there is insufficient space in `RM` to provide identification of -any SVP64 Fields without first partially decoding the -32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the -`RM` format is individually associated with every instruction. +any SVP64 Fields without first partially decoding the 32-bit suffix. +Similar to the "Forms" (X-Form, D-Form) the `RM` format is individually +associated with every instruction. 
However this still does not adversely +affect Multi-Issue Decoding because the identification of the *length* +of anything in the 64-bit space has been kept brutally simple (EXT009), +and further decoding of any number of 64-bit Encodings in parallel at +that point is fully independent. -Extreme caution and care must therefore be taken -when extending SVP64 in future, to not create unnecessary relationships -between prefix and suffix that could complicate decoding, adding latency. +Extreme caution and care must be taken when extending SVP64 +in future, to not create unnecessary relationships between prefix and +suffix that could complicate decoding, adding latency. -# Common RM fields +## Common RM fields The following fields are common to all Remapped Encodings: @@ -197,7 +774,7 @@ The following fields are common to all Remapped Encodings: |------------|------------|----------------------------------------| | MASKMODE | `0` | Execution (predication) Mask Kind | | MASK | `1:3` | Execution Mask | -| SUBVL | `8:9` | Sub-vector length | +| SUBVL | `8:9` | Sub-vector length | The following fields are optional or encoded differently depending on context after decoding of the Scalar suffix: @@ -205,47 +782,56 @@ on context after decoding of the Scalar suffix: | Field Name | Field bits | Description | |------------|------------|----------------------------------------| | ELWIDTH | `4:5` | Element Width | -| ELWIDTH_SRC | `6:7` | Element Width for Source | -| EXTRA | `10:18` | Register Extra encoding | +| ELWIDTH_SRC | `6:7` | Element Width for Source (or MASK_SRC in 2PM) | +| EXTRA | `10:18` | Register Extra encoding | | MODE | `19:23` | changes Vector behaviour | -* MODE changes the behaviour of the SV operation (result saturation, mapreduce) -* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work -* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and source operand width -* MASK (and MASK_SRC) and MASKMODE provide 
predication (two types of sources: scalar INT and Vector CR). -* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix. - -Similar to OpenPOWER `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag. - -Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. - -# Mode - -Mode is an augmentation of SV behaviour. Different types of -instructions have different needs, similar to Power ISA -v3.1 64 bit prefix 8LS and MTRR formats apply to different -instruction types. Modes include Reduction, Iteration, arithmetic -saturation, and Fail-First. More specific details in each -section and in the [[svp64/appendix]] +* MODE changes the behaviour of the SV operation (result saturation, + mapreduce) +* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D + and Audio/Video DSP work +* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and + source operand width +* MASK (and MASK_SRC) and MASKMODE provide predication (two types of + sources: scalar INT and Vector CR). +* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category + for the instruction, which is determined only by decoding the Scalar 32 + bit suffix. + +Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, +such as `RM-1P-3S1D` which indicates for this example that the operation +is to be single-predicated and that there are 3 source operand EXTRA +tags and one destination operand tag. + +Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance +or increased latency in some implementations due to lane-crossing. + +## Mode + +Mode is an augmentation of SV behaviour. 
Different types of instructions +have different needs, similar to Power ISA v3.1 64 bit prefix 8LS and MTRR +formats apply to different instruction types. Modes include Reduction, +Iteration, arithmetic saturation, and Fail-First. More specific details +in each section and in the [[svp64/appendix]] * For condition register operations see [[sv/cr_ops]] * For LD/ST Modes, see [[sv/ldst]]. * For Branch modes, see [[sv/branches]] * For arithmetic and logical, see [[sv/normal]] -# ELWIDTH Encoding +## ELWIDTH Encoding -Default behaviour is set to 0b00 so that zeros follow the convention of -`scalar identity behaviour`. In this case it means that elwidth overrides -are not applicable. Thus if a 32 bit instruction operates on 32 bit, -`elwidth=0b00` specifies that this behaviour is unmodified. Likewise -when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00` -states that, again, the behaviour is not to be modified. +Default behaviour is set to 0b00 so that zeros follow the convention +of `scalar identity behaviour`. In this case it means that elwidth +overrides are not applicable. Thus if a 32 bit instruction operates +on 32 bit, `elwidth=0b00` specifies that this behaviour is unmodified. +Likewise when a processor is switched from 64 bit to 32 bit mode, +`elwidth=0b00` states that, again, the behaviour is not to be modified. Only when elwidth is nonzero is the element width overridden to the explicitly required value. -## Elwidth for Integers: +### Elwidth for Integers: | Value | Mnemonic | Description | |-------|----------------|------------------------------------| @@ -254,9 +840,10 @@ explicitly required value. 
| 10 | `ELWIDTH=h` | Halfword: 16-bit integer | | 11 | `ELWIDTH=b` | Byte: 8-bit integer | -This encoding is chosen such that the byte width may be computed as `(3-ew)<<8` +This encoding is chosen such that the element width in bits may be computed as +`8<<(3-ew)` -## Elwidth for FP Registers: +### Elwidth for FP Registers: | Value | Mnemonic | Description | |-------|----------------|------------------------------------| @@ -269,25 +856,26 @@ Note: [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) is reserved for a future implementation of SV -Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall -perform its operation at **half** the ELWIDTH then padded back out -to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT +Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) +shall perform its operation at **half** the ELWIDTH then padded back out +to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation +that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy then padded back out to fit in IEEE754 FP64, exactly as for Scalar -v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16 -or ELWIDTH=bf16 is reserved and must raise an illegal instruction -(IEEE754 FP8 or BF8 are not defined). +v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16 or +ELWIDTH=bf16 is reserved and must raise an illegal instruction (IEEE754 +FP8 or BF8 are not defined). -## Elwidth for CRs: +### Elwidth for CRs (no meaning) Element-width overrides for CR Fields have no meaning. The bits are therefore used for other purposes, or when Rc=1, the Elwidth -applies to the result being tested, but not to the Vector of CR Fields. - +applies to the result being tested (a GPR or FPR), but not to the +Vector of CR Fields. 
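As a quick check of the integer element-width encoding above, the expression `8<<(3-ew)` can be evaluated for all four encodings. The sketch below is illustrative only: the function names are hypothetical, and the result is read as a width in bits, which is what matches the 32/16/8-bit rows of the integer table. For `ew=0b00` the value 64 matches only the default of a 64-bit operation; strictly, `0b00` means "no override".

```python
def int_elwidth_bits(ew):
    """Width in bits selected by a 2-bit integer ELWIDTH field.

    Hypothetical helper: evaluates the 8 << (3 - ew) expression from the text.
    """
    if not 0 <= ew <= 3:
        raise ValueError("ELWIDTH is a 2-bit field")
    return 8 << (3 - ew)


def truncate_to_elwidth(value, ew):
    """Keep only the low-order bits of value that fit in the selected width."""
    return value & ((1 << int_elwidth_bits(ew)) - 1)


for ew in range(4):
    print(f"ew=0b{ew:02b} -> {int_elwidth_bits(ew)} bits")
```

The truncation helper shows one way an implementation model might narrow a result to the overridden width; saturation and sign-extension rules are out of scope for this sketch.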
-# SUBVL Encoding +## SUBVL Encoding -the default for SUBVL is 1 and its encoding is 0b00 to indicate that +The default for SUBVL is 1 and its encoding is 0b00 to indicate that SUBVL is effectively disabled (a SUBVL for-loop of only one element). this lines up in combination with all other "default is all zeros" behaviour. @@ -302,19 +890,23 @@ The SUBVL encoding value may be thought of as an inclusive range of a sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore this may be considered to be elements 0b00 to 0b01 inclusive. -# MASK/MASK_SRC & MASKMODE Encoding +Effectively, SUBVL is like a SIMD multiplier: instead of just 1 +element operation issued, SUBVL element operations are issued (as an inner loop). +The key difference between VL looping and SUBVL looping +is that predication bits are applied per +**group**, rather than by individual element. -TODO: rename MASK_KIND to MASKMODE +Directly related to `subvl` is the `pack` and `unpack` Mode bits of `SVSTATE`. + +## MASK/MASK_SRC & MASKMODE Encoding One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two types may not be mixed. -Special note: to disable predication this field must -be set to zero in combination with Integer Predication also being set -to 0b000. this has the effect of enabling "all 1s" in the predicate -mask, which is equivalent to "not having any predication at all" -and consequently, in combination with all other default zeros, fully -disables SV (`scalar identity behaviour`). +Special note: to disable predication this field must be set to zero in +combination with Integer Predication also being set to 0b000. this has the +effect of enabling "all 1s" in the predicate mask, which is equivalent to +"not having any predication at all". `MASKMODE` may be set to one of 2 values: @@ -324,32 +916,45 @@ disables SV (`scalar identity behaviour`). 
| 1 | MASK/MASK_SRC are encoded using CR-based Predication | Integer Twin predication has a second set of 3 bits that uses the same -encoding thus allowing either the same register (r3 or r10) to be used -for both src and dest, or different regs (one for src, one for dest). +encoding thus allowing either the same register (r3, r10 or r30) to be +used for both src and dest, or different regs (one for src, one for dest). Likewise CR based twin predication has a second set of 3 bits, allowing a different test to be applied. -Note that it is assumed that Predicate Masks (whether INT or CR) -are read *before* the operations proceed. In practice (for CR Fields) -this creates an unnecessary block on parallelism. Therefore, -it is up to the programmer to ensure that the CR fields used as -Predicate Masks are not being written to by any parallel Vector Loop. -Doing so results in **UNDEFINED** behaviour, according to the definition -outlined in the OpenPOWER v3.0B Specification. +Note that it cannot necessarily be assumed that Predicate Masks +(whether INT or CR) are read in full *before* the operations proceed. In practice (for CR Fields) +reading the whole mask up-front would create an unnecessary block on parallelism, prohibiting +"Vector Chaining". Therefore, it is up +to the programmer to ensure that the CR field Elements used as Predicate Masks +are not overwritten by any parallel Vector Loop. Doing so results +in **UNDEFINED** behaviour, according to the definition outlined in the +Power ISA v3.0B Specification. Hardware Implementations are therefore free and clear to delay reading of individual CR fields until the actual predicated element operation -needs to take place, safe in the knowledge that no programmer will -have issued a Vector Instruction where previous elements could have -overwritten (destroyed) not-yet-executed CR-Predicated element operations. 
+needs to take place, safe in the knowledge that no programmer will have +issued a Vector Instruction where previous elements could have overwritten +(destroyed) not-yet-executed CR-Predicated element operations. +This particularly is an issue when using REMAP, as the order in +which CR-Field-based Predicate Mask bits could be read on a per-element +execution basis could well conflict with the order in which prior +elements wrote to the very same CR Field. + +Additionally Programmers should avoid using r3 r10 or r30 +as destination registers when these are also used as a Predicate +Mask. Doing so is again UNDEFINED behaviour. -## Integer Predication (MASKMODE=0) +Usually in 2P `MASK_SRC` is exclusively in the EXTRA area. However for +LD/ST-Indexed a different Encoding is required, designated `2PM`. + +### Integer Predication (MASKMODE=0) When the predicate mode bit is zero the 3 bits are interpreted as below. Twin predication has an identical 3 bit field similarly encoded. -`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning: +`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the +following meaning: | Value | Mnemonic | Element `i` enabled if: | |-------|----------|------------------------------| @@ -362,14 +967,16 @@ Twin predication has an identical 3 bit field similarly encoded. | 110 | R30 | `R30 & (1 << i)` is non-zero | | 111 | ~R30 | `R30 & (1 << i)` is zero | -r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs. +r10 and r30 are at the high end of temporary and unused registers, +so as not to interfere with register allocation from ABIs. -## CR-based Predication (MASKMODE=1) +### CR-based Predication (MASKMODE=1) When the predicate mode bit is one the 3 bits are interpreted as below. Twin predication has an identical 3 bit field similarly encoded. 
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning: +`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the +following meaning: | Value | Mnemonic | Element `i` is enabled if | |-------|----------|--------------------------| @@ -382,38 +989,64 @@ Twin predication has an identical 3 bit field similarly encoded. | 110 | so/un | `CR[offs+i].FU` is set | | 111 | ns/nu | `CR[offs+i].FU` is clear | -CR based predication. TODO: select alternate CR for twin predication? see -[[discussion]] Overlap of the two CR based predicates must be taken -into account, so the starting point for one of them must be suitably -high, or accept that for twin predication VL must not exceed the range -where overlap will occur, *or* that they use the same starting point -but select different *bits* of the same CRs +`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorized +Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD). + +The CR Predicates chosen must start on a boundary that Vectorized CR +operations can access cleanly, in full. With EXTRA2 restricting starting +points to multiples of 8 (CR0, CR8, CR16...) both Vectorized Rc=1 and +CR Predicate Masks have to be adapted to fit on these boundaries as well. -`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD). +## Extra Remapped Encoding -Notes from Jacob: CR6-7 allows Scalar ops to refer to these without having to do a transfer (v3.0B). Another idea: the DepMatrices treat scalar CRs as one "thing" and treat the Vectors as a completely separate "thing". also: do modulo arithmetic on allocation of CRs. +Shows all instruction-specific fields in the Remapped Encoding +`RM[10:18]` for all instruction variants. Note that due to the very +tight space, the encoding mode is *not* included in the prefix itself. 
+The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) +on a per-instruction basis, and, like "Forms" are given a designation +(below) of the form `RM-nP-nSnD`. The full list of which instructions +use which remaps is here [[opcode_regs_deduped]]. -# Extra Remapped Encoding +**Please note the following**: -Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to OpenPOWER "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*). +``` + Machine-readable CSV files have been autogenerated which will make the + task of creating SV-aware ISA decoders, documentation, assembler tools + compiler tools Simulators documentation all aspects of SVP64 easier + and less prone to mistakes. Please avoid manual re-creation of + information from the written specification wording in this chapter, + and use the CSV files or use the Canonical tool which creates the CSV + files, named sv_analysis.py. The information contained within + sv_analysis.py is considered to be part of this Specification, even + encoded as it is in python3. +``` -These mappings are part of the SVP64 Specification in exactly the same + +The mappings are part of the SVP64 Specification in exactly the same way as X-Form, D-Form. New Scalar instructions added to the Power ISA will need a corresponding SVP64 Mapping, which can be derived by-rote from examining the Register "Profile" of the instruction. -There are two categories: Single and Twin Predication. 
-Due to space considerations further subdivision of Single Predication -is based on whether the number of src operands is 2 or 3. With only -9 bits available some compromises have to be made. +There are two categories: Single and Twin Predication. Due to space +considerations further subdivision of Single Predication is based on +whether the number of src operands is 2 or 3. With only 9 bits available +some compromises have to be made. -* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd). -* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest) +* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand + instructions (fmadd, isel, madd). +* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand + instructions (src1 src2 dest) * `RM-2P-1S1D` Twin Predication (src=1, dest=1) * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed) * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update +* `RM-2PM-2S1D` Twin Predication (src=2, dest=1) for LD/ST Update (Indexed) + +The `2PM` designation uses bits 6 and 7 as well as the 9 EXTRA bits +in order to extend two registers to +EXTRA3, sacrificing destination elwidths in the process. +`MASK_SRC` has a different encoding in `2PM`. -## RM-1P-3S1D +### RM-1P-3S1D | Field Name | Field bits | Description | |------------|------------|----------------------------------------| @@ -421,14 +1054,14 @@ is based on whether the number of src operands is 2 or 3. With only | Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) | | Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) | | Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) | -| EXTRA2_MODE | `18` | used by `divmod2du` and `madded` for RS | +| EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS | These are for 3 operand in and either 1 or 2 out instructions. 
3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions -such as `madded` have an implicit second destination, RS, the +such as `maddedu` have an implicit second destination, RS, the selection of which is determined by bit 18. -## RM-1P-2S1D +### RM-1P-2S1D | Field Name | Field bits | Description | |------------|------------|-------------------------------------------| @@ -437,24 +1070,25 @@ selection of which is determined by bit 18. | Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2 | These are for 2 operand 1 dest instructions, such as `add RT, RA, -RB`. However also included are unusual instructions with an implicit dest -that is identical to its src reg, such as `rlwinmi`. +RB`. However also included are unusual instructions with an implicit +dest that is identical to its src reg, such as `rlwimi`. -Normally, with instructions such as `rlwinmi`, the scalar v3.0B ISA would not have sufficient bit fields to allow -an alternative destination. With SV however this becomes possible. -Therefore, the fact that the dest is implicitly also a src should not -mislead: due to the *prefix* they are different SV regs. +Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would +not have sufficient bit fields to allow an alternative destination. +With SV however this becomes possible. Therefore, the fact that the +dest is implicitly also a src should not mislead: due to the *prefix* +they are different SV regs. * `rlwimi RA, RS, ...` * Rsrc1_EXTRA3 applies to RS as the first src -* Rsrc2_EXTRA3 applies to RA as the secomd src +* Rsrc2_EXTRA3 applies to RA as the second src * Rdest_EXTRA3 applies to RA to create an **independent** dest. With the addition of the EXTRA bits, the three registers each may be *independently* made vector or scalar, and be independently augmented to 7 bits in length. 
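To make the MSB0 field positions concrete, the following sketch pulls the `RM-1P-2S1D` fields out of a 24-bit `RM` value. It is a minimal illustration under stated assumptions rather than a reference decoder: the helper names (`rm_from_prefix`, `rm_field`, `decode_rm_1p_2s1d`) are hypothetical, `RM` is taken to be bits 8-31 of the 32-bit prefix word, and the field ranges are copied from the Common RM fields and `RM-1P-2S1D` tables in this chapter. In practice the CSV files generated by `sv_analysis.py` remain the canonical source.

```python
# Field name -> (first bit, last bit), inclusive, in MSB0 order within RM[0:23].
RM_1P_2S1D_FIELDS = {
    "MASKMODE": (0, 0),
    "MASK": (1, 3),
    "ELWIDTH": (4, 5),
    "ELWIDTH_SRC": (6, 7),
    "SUBVL": (8, 9),
    "Rdest_EXTRA3": (10, 12),
    "Rsrc1_EXTRA3": (13, 15),
    "Rsrc2_EXTRA3": (16, 18),
    "MODE": (19, 23),
}


def rm_from_prefix(prefix):
    """RM is MSB0 bits 8-31 of the 32-bit prefix, i.e. its low 24 bits."""
    return prefix & 0xFFFFFF


def rm_field(rm, start, end):
    """Extract MSB0 bits start..end (inclusive) of a 24-bit RM value."""
    width = end - start + 1
    shift = 23 - end  # RM[23] is the least significant bit
    return (rm >> shift) & ((1 << width) - 1)


def decode_rm_1p_2s1d(rm):
    """Split RM into named fields, assuming the RM-1P-2S1D layout."""
    return {name: rm_field(rm, s, e) for name, (s, e) in RM_1P_2S1D_FIELDS.items()}


# Example: CR-based predication, SUBVL encoding 0b01 (vec2), Rdest_EXTRA3=0b101.
example_rm = (1 << 23) | (0b01 << 14) | (0b101 << 11)
for name, value in decode_rm_1p_2s1d(example_rm).items():
    print(f"{name:13s} = {value:#b}")
```

Note that the layout can only be chosen after the 32-bit suffix has been partially decoded, so a real decoder would select the field table per instruction rather than hard-coding one as done here.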
-## RM-2P-1S1D/2S +### RM-2P-1S1D/2S | Field Name | Field bits | Description | |------------|------------|----------------------------| @@ -464,22 +1098,28 @@ augmented to 7 bits in length. `RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2. -## RM-1P-2S1D +| Field Name | Field bits | Description | +|------------|------------|----------------------------| +| Rsrc1_EXTRA3 | `10:12` | extends Rsrc1 | +| Rsrc2_EXTRA3 | `13:15` | extends Rsrc2 | +| MASK_SRC | `16:18` | Execution Mask for Source | + +### RM-1P-2S1D single-predicate, three registers (2 read, 1 write) - + | Field Name | Field bits | Description | |------------|------------|----------------------------| | Rdest_EXTRA3 | `10:12` | extends Rdest | | Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 | | Rsrc2_EXTRA3 | `16:18` | extends Rsrc2 | -## RM-2P-2S1D/1S2D/3S +### RM-2P-2S1D/1S2D/3S The primary purpose for this encoding is for Twin Predication on LOAD -and STORE operations. see [[sv/ldst]] for detailed anslysis. +and STORE operations. See [[sv/ldst]] for detailed analysis. -RM-2P-2S1D: +**RM-2P-2S1D:** | Field Name | Field bits | Description | |------------|------------|----------------------------| @@ -488,19 +1128,56 @@ RM-2P-2S1D: | Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) | | MASK_SRC | `16:18` | Execution Mask for Source | -Note that for 1S2P the EXTRA2 dest and src names are switched (Rsrc_EXTRA2 -is in bits 10:11, Rdest1_EXTRA2 in 12:13) +**RM-2P-1S2D:** + +For RM-2P-1S2D dest2 is in bits 14:15. + +| Field Name | Field bits | Description | +|------------|------------|----------------------------| +| Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) | +| Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) | +| Rdest2_EXTRA2 | `14:15` | extends Rdest2 (R\*\_EXTRA2 Encoding) | +| MASK_SRC | `16:18` | Execution Mask for Source | + +**RM-2P-3S:** -Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2. 
+Also that for RM-2P-3S (to cover `stdx` etc.) the names are switched to 3 src: +Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2. -Note also that LD with update indexed, which takes 2 src and 2 dest -(e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also -Twin Predication. therefore these are treated as RM-2P-2S1D and the -src spec for RA is also used for the same RA as a dest. +| Field Name | Field bits | Description | +|------------|------------|----------------------------| +| Rsrc1_EXTRA2 | `10:11` | extends Rsrc1 (R\*\_EXTRA2 Encoding) | +| Rsrc2_EXTRA2 | `12:13` | extends Rsrc2 (R\*\_EXTRA2 Encoding) | +| Rsrc3_EXTRA2 | `14:15` | extends Rsrc3 (R\*\_EXTRA2 Encoding) | +| MASK_SRC | `16:18` | Execution Mask for Source | -Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. +Note also that LD with update indexed, which takes 2 src and +creates 2 dest registers (e.g. `lhaux RT,RA,RB`), does not have room +for 4 registers and also Twin Predication. Therefore these are treated as +RM-2P-2S1D and the src spec for RA is also used for the same RA as a dest. + +Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance +or increased latency in some implementations due to lane-crossing. + +### RM-2PM-2S1D/1S2D/3S + +The primary purpose for this encoding is for Twin Predication on LOAD +and STORE operations providing EXTRA3 for RT, RA and RS. +see [[sv/ldst]] for detailed analysis. -# R\*\_EXTRA2/3 +**RM-2PM-2S1D:** + +RT or RS requires EXTRA3, RA requires EXTRA3, but for RB EXTRA2 will +suffice. `MASK_SRC` may be read from the bits normally used for dest-elwidth. 
+ +| Field Name | Field bits | Description | +|------------|------------|----------------------------| +| Rdest_EXTRA3 | `10:12` | extends Rdest (R\*\_EXTRA3 Encoding) | +| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 (R\*\_EXTRA3 Encoding) | +| Rsrc2_EXTRA2 | `16:17` | extends Rsrc2 (R\*\_EXTRA2 Encoding) | +| MASK_SRC | `6:7,18` | Execution Mask for Source | + +## R\*\_EXTRA2/3 EXTRA is the means by which two things are achieved: @@ -510,20 +1187,25 @@ EXTRA is the means by which two things are achieved: The register files are therefore extended: -* INT is extended from r0-31 to r0-127 -* FP is extended from fp0-32 to fp0-fp127 +* INT (GPR) is extended from r0-31 to r0-127 +* FP (FPR) is extended from fp0-fp31 to fp0-fp127 * CR Fields are extended from CR0-7 to CR0-127 +However due to pressure in `RM.EXTRA` not all these registers +are accessible by all instructions, particularly those with +a large number of operands (`madd`, `isel`). + In the following tables register numbers are constructed from the -standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 -or EXTRA3 field from the SV Prefix, determined by the specific -RM-xx-yyyy designation for a given instruction. -The prefixing is arranged so that +standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 or +EXTRA3 field from the SV Prefix, determined by the specific RM-xx-yyyy +designation for a given instruction. The prefixing is arranged so that interoperability between prefixing and nonprefixing of scalar registers is direct and convenient (when the EXTRA field is all zeros). 
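As an illustration only, the register-number construction can be sketched in Python by following the INT/FP EXTRA3 and EXTRA2 table rows printed in this section. The function name `decode_int_fp_reg` and its argument names are hypothetical and not part of the specification, and the EXTRA3 values `010` to `101`, which this diff does not show, are assumed to follow the same pattern as the rows that are shown.

```python
from typing import Tuple


def decode_int_fp_reg(ra: int, extra: int, extra3_mode: bool) -> Tuple[int, bool]:
    """Hypothetical helper: return (7-bit register number, is_vector).

    `ra` is the 5-bit RA/FRA field of the suffix, `extra` is the EXTRA2 or
    EXTRA3 value read as a plain integer (e.g. 0b110 for the row "110").
    """
    if not 0 <= ra < 32:
        raise ValueError("RA must be a 5-bit field")
    if extra3_mode:
        if not 0 <= extra < 8:
            raise ValueError("EXTRA3 must be a 3-bit value")
        is_vector = bool(extra & 0b100)   # top bit selects Vector
        low = extra & 0b11                # remaining two bits augment RA
    else:
        if not 0 <= extra < 4:
            raise ValueError("EXTRA2 must be a 2-bit value")
        is_vector = bool(extra & 0b10)    # top bit selects Vector
        bit = extra & 0b1
        # Table rows: 00 -> 0b00 RA, 01 -> 0b01 RA, 10 -> RA 0b00, 11 -> RA 0b10
        low = (bit << 1) if is_vector else bit
    if is_vector:
        # Vector: RA forms the upper bits, EXTRA supplies the two lowest bits
        return (ra << 2) | low, True
    # Scalar: EXTRA supplies the two upper bits, RA the lower five
    return (low << 5) | ra, False
```

With an all-zero EXTRA field the function returns `ra` unchanged as a scalar, which is the "scalar identity" behaviour the tables describe.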
-A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs) +A pseudocode algorithm explains the relationship, for INT/FP (see +[[svp64/appendix]] for CRs) +``` if extra3_mode: spec = EXTRA3 else: @@ -532,17 +1214,18 @@ A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/append return (RA << 2) | spec[1:2] else: # scalar return (spec[1:2] << 5) | RA +``` Future versions may extend to 256 by shifting Vector numbering up. Scalar will not be altered. Note that in some cases the range of starting points for Vectors -is limited. +is limited. -## INT/FP EXTRA3 +### INT/FP EXTRA3 -If EXTRA3 is zero, maps to -"scalar identity" (scalar OpenPOWER ISA field naming). +If EXTRA3 is zero, maps to "scalar identity" (scalar Power ISA field +naming). Fields are as follows: @@ -555,7 +1238,9 @@ Fields are as follows: * MSB..LSB: the bit field showing how the register opcode field combines with EXTRA to give (extend) the register number (GPR) -| Value | Mode | Range/Inc | 6..0 | +Encoding shown in LSB0: MSB down to LSB (MSB 6..0 LSB) + +| Value | Mode | Range/Inc | 6..0 | |-----------|-------|---------------|---------------------| | 000 | Scalar | `r0-r31`/1 | `0b00 RA` | | 001 | Scalar | `r32-r63`/1 | `0b01 RA` | @@ -566,37 +1251,51 @@ Fields are as follows: | 110 | Vector | `r2-r126`/4 | `RA 0b10` | | 111 | Vector | `r3-r127`/4 | `RA 0b11` | -## INT/FP EXTRA2 +### INT/FP EXTRA2 -If EXTRA2 is zero will map to -"scalar identity behaviour" i.e Scalar OpenPOWER register naming: +If EXTRA2 is zero will map to "scalar identity behaviour" i.e Scalar +Power ISA register naming: -| Value | Mode | Range/inc | 6..0 | -|-----------|-------|---------------|-----------| +Encoding shown in LSB0: MSB down to LSB (MSB 6..0 LSB) + +| Value | Mode | Range/inc | 6..0 | +|----------|-------|---------------|-----------| | 00 | Scalar | `r0-r31`/1 | `0b00 RA` | | 01 | Scalar | `r32-r63`/1 | `0b01 RA` | | 10 | Vector | `r0-r124`/4 | `RA 0b00` | | 11 
| Vector | `r2-r126`/4 | `RA 0b10` | -## CR Field EXTRA3 +**Note that unlike in EXTRA3, in EXTRA2**: + +* the GPR Vectors may only start from + `r0, r2, r4, r6, r8` and likewise FPR Vectors. +* the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars. -CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode. -Note that Vectors may only start from CR0, CR4, CR8, CR12, CR16... +as there are insufficient bits to cover the full range. -Encoding shown MSB down to LSB +### CR Field EXTRA3 + +CR Field encoding is essentially the same but made more complex due to CRs +being bit-based, because the application of SVP64 element-numbering applies +to the CR *Field* numbering not the CR register *bit* numbering. +Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`... +and Scalars may only go from `CR0, CR1, ... CR31` + +Encoding shown in LSB0: MSB down to LSB (MSB 8..5 4..2 1..0 LSB), +BA ranges are in MSB0. 
For a 5-bit operand (BA, BB, BT): | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 | |-------|------|---------------|-----------| --------|---------| -| 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] | -| 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] | -| 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[4:2] | BA[1:0] | -| 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[4:2] | BA[1:0] | -| 100 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] | -| 101 | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100 | BA[1:0] | -| 110 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] | -| 111 | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100 | BA[1:0] | +| 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[0:2] | BA[3:4] | +| 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[0:2] | BA[3:4] | +| 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[0:2] | BA[3:4] | +| 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[0:2] | BA[3:4] | +| 100 | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000 | BA[3:4] | +| 101 | Vector | `CR4-CR116`/16 | BA[0:2] 0 | 0b100 | BA[3:4] | +| 110 | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000 | BA[3:4] | +| 111 | Vector | `CR12-CR124`/16 | BA[0:2] 1 | 0b100 | BA[3:4] | For a 3-bit operand (e.g. BFA): @@ -611,22 +1310,24 @@ For a 3-bit operand (e.g. BFA): | 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 | | 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 | -## CR EXTRA2 +### CR EXTRA2 -CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode. +CR encoding is essentially the same but made more complex due to CRs +being bit-based, because the application of SVP64 element-numbering applies +to the CR *Field* numbering not the CR register *bit* numbering. Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32... - -Encoding shown MSB down to LSB +Encoding shown in LSB0: MSB down to LSB (MSB 8..5 4..2 1..0 LSB), +BA ranges are in MSB0. 
For a 5-bit operand (BA, BB, BC): | Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 | |-------|--------|----------------|---------|---------|---------| -| 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] | -| 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] | -| 10 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] | -| 11 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] | +| 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[0:2] | BA[3:4] | +| 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[0:2] | BA[3:4] | +| 10 | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000 | BA[3:4] | +| 11 | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000 | BA[3:4] | For a 3-bit operand (e.g. BFA): @@ -637,7 +1338,16 @@ For a 3-bit operand (e.g. BFA): | 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 | | 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 | -# Appendix + +## Appendix Now at its own page: [[svp64/appendix]] + +[[!tag standards]] + + + +-------- + +\newpage{}