From c373d930523b861164b6bc4f5b5ee43a5938e09d Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Wed, 29 Mar 2023 17:19:38 +0100 Subject: [PATCH] very large whitespace cleanup --- openpower/sv/rfc/ls010.mdwn | 1624 ++++++++++++++++++----------------- 1 file changed, 820 insertions(+), 804 deletions(-) diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn index 264b096a2..15bf1fa8b 100644 --- a/openpower/sv/rfc/ls010.mdwn +++ b/openpower/sv/rfc/ls010.mdwn @@ -20,31 +20,34 @@ Links: # Introduction -Simple-V is a type of Vectorisation best described as a "Prefix Loop Subsystem" -similar to the Z80 `LDIR` instruction and to the x86 `REP` Prefix instruction. -More advanced features are similar to the Z80 `CPIR` instruction. If viewed -as an actual Vector ISA it introduces over 1.5 million 64-bit Vector instructions. -SVP64, the instruction format, is therefore best viewed as an orthogonal -RISC-style "Prefixing" subsystem instead. - -Except where explicitly stated all bit numbers remain as in the rest of the Power ISA: -in MSB0 form (the bits are numbered from 0 at the MSB on the left -and counting up as you move rightwards to the LSB end). All bit ranges are inclusive -(so `4:6` means bits 4, 5, and 6, in MSB0 order). **All register numbering and -element numbering however is LSB0 ordering** which is a different convention from that used -elsewhere in the Power ISA. - -The SVP64 prefix always comes before the suffix in PC order and must be considered -an independent "Defined word" that augments the behaviour of the following instruction, -but does **not** change the actual Decoding of that following instruction. -**All prefixed instructions retain their non-prefixed encoding and definition**. - -*Architectural Resource Allocation note: it is prohibited to accept RFCs which -fundamentally violate this hard requirement. 
Under no circumstances must the -Suffix space have an alternate instruction encoding allocated within SVP64 that is -entirely different from the non-prefixed Defined Word. Hardware Implementors -critically rely on this inviolate guarantee to implement High-Performance Multi-Issue -micro-architectures that can sustain 100% throughput* +Simple-V is a type of Vectorisation best described as a "Prefix Loop +Subsystem" similar to the Z80 `LDIR` instruction and to the x86 `REP` +Prefix instruction. More advanced features are similar to the Z80 +`CPIR` instruction. If viewed as an actual Vector ISA it introduces +over 1.5 million 64-bit Vector instructions. SVP64, the instruction +format, is therefore best viewed as an orthogonal RISC-style "Prefixing" +subsystem instead. + +Except where explicitly stated all bit numbers remain as in the rest of +the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on +the left and counting up as you move rightwards to the LSB end). All bit +ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order). +**All register numbering and element numbering however is LSB0 ordering** +which is a different convention from that used elsewhere in the Power ISA. + +The SVP64 prefix always comes before the suffix in PC order and must be +considered an independent "Defined word" that augments the behaviour of +the following instruction, but does **not** change the actual Decoding +of that following instruction. **All prefixed instructions retain their +non-prefixed encoding and definition**. + +*Architectural Resource Allocation note: it is prohibited to accept RFCs +which fundamentally violate this hard requirement. Under no circumstances +must the Suffix space have an alternate instruction encoding allocated +within SVP64 that is entirely different from the non-prefixed Defined +Word. 
Hardware Implementors critically rely on this inviolate guarantee +to implement High-Performance Multi-Issue micro-architectures that can +sustain 100% throughput* | 0:5 | 6:31 | 32:63 | |--------|--------------|--------------| @@ -57,26 +60,29 @@ interoperability expectations within certain environments. ## Register files, elements, and Element-width Overrides -In the Upper Compliancy Levels the size of the GPR and FPR Register files are expanded -from 32 to 128 entries, and the number of CR Fields expanded from CR0-CR7 to CR0-CR127. - -Memory access remains exactly the same: the effects of `MSR.LE` remain exactly the same, -affecting as they already do and remain **only** on the Load and Store memory-register -operation byte-order, and having nothing to do with the -ordering of the contents of register files or register-register operations. - -Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered and for -numbering to be sequentially incremental the element offset numbering is naturally -**LSB0-sequentially-incrementing from zero not MSB0-incrementing.** Expressed exclusively in -MSB0-numbering, SVP64 is unnecessarily complex to understand: the required -subtractions from 63, 31, 15 and 7 unfortunately become a hostile minefield. -Therefore for the purposes of this section the more natural -**LSB0 numbering is assumed** and it is up to the reader to translate to MSB0 numbering. - -The Canonical specification for how element-sequential numbering and element-width -overrides is defined is expressed in the following c structure, assuming a Little-Endian -system, and naturally using LSB0 numbering everywhere because the ANSI c specification -is inherently LSB0: +In the Upper Compliancy Levels the size of the GPR and FPR Register +files are expanded from 32 to 128 entries, and the number of CR Fields +expanded from CR0-CR7 to CR0-CR127. 
+ +Memory access remains exactly the same: the effects of `MSR.LE` remain +exactly the same, affecting as they already do and remain **only** +on the Load and Store memory-register operation byte-order, and having +nothing to do with the ordering of the contents of register files or +register-register operations. + +Whilst the bits within the GPRs and FPRs are expected to be MSB0-ordered +and for numbering to be sequentially incremental the element offset +numbering is naturally **LSB0-sequentially-incrementing from zero not +MSB0-incrementing.** Expressed exclusively in MSB0-numbering, SVP64 is +unnecessarily complex to understand: the required subtractions from 63, +31, 15 and 7 unfortunately become a hostile minefield. Therefore for the +purposes of this section the more natural **LSB0 numbering is assumed** +and it is up to the reader to translate to MSB0 numbering. + +The Canonical specification for how element-sequential numbering and +element-width overrides is defined is expressed in the following c +structure, assuming a Little-Endian system, and naturally using LSB0 +numbering everywhere because the ANSI c specification is inherently LSB0: ``` #pragma pack @@ -124,90 +130,104 @@ However if elwidth overrides are set to 16 for both source and destination: int_regfile[RT].s[i] = int_regfile[RA].s[i] + int_regfile[RB].s[i] ``` -Hardware Architectural note: to avoid a Read-Modify-Write at the register file it is -strongly recommended to implement byte-level write-enable lines exactly as has been -implemented in DRAM ICs for many decades. Additionally the predicate mask bit is advised -to be associated with the element operation and alongside the result ultimately -passed to the register file. -When element-width is set to 64-bit the relevant predicate mask bit may be repeated -eight times and pull all eight write-port byte-level lines HIGH. 
Clearly when element-width -is set to 8-bit the relevant predicate mask bit corresponds directly with one single -byte-level write-enable line. It is up to the Hardware Architect to then amortise (merge) -elements together into both PredicatedSIMD Pipelines as well as simultaneous non-overlapping -Register File writes, to achieve High Performance designs. +Hardware Architectural note: to avoid a Read-Modify-Write at the register +file it is strongly recommended to implement byte-level write-enable lines +exactly as has been implemented in DRAM ICs for many decades. Additionally +the predicate mask bit is advised to be associated with the element +operation and alongside the result ultimately passed to the register file. +When element-width is set to 64-bit the relevant predicate mask bit +may be repeated eight times and pull all eight write-port byte-level +lines HIGH. Clearly when element-width is set to 8-bit the relevant +predicate mask bit corresponds directly with one single byte-level +write-enable line. It is up to the Hardware Architect to then amortise +(merge) elements together into both PredicatedSIMD Pipelines as well +as simultaneous non-overlapping Register File writes, to achieve High +Performance designs. ## SVP64 encoding features -A number of features need to be compacted into a very small space of only 24 bits: +A number of features need to be compacted into a very small space of +only 24 bits: -* Independent per-register Scalar/Vector tagging and range extension on every register +* Independent per-register Scalar/Vector tagging and range extension on + every register * Element width overrides on both source and destination * Predication on both source and destination * Two different sources of predication: INT and CR Fields -* SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and - predicate-result mode. +* SV Modes including saturation (for Audio, Video and DSP), mapreduce, + fail-first and predicate-result mode. 
-Different classes of operations require different formats. The earlier sections cover -the common formats and the four separate modes follow: CR operations (crops), -Arithmetic/Logical (termed "normal"), Load/Store and Branch-Conditional. +Different classes of operations require different formats. The earlier +sections cover the common formats and the four separate modes follow: +CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store +and Branch-Conditional. ## Definition of Reserved in this spec. For the new fields added in SVP64, instructions that have any of their fields set to a reserved value must cause an illegal instruction trap, -to allow emulation of future instruction sets, or for subsets of SVP64 -to be implemented in hardware and the rest emulated. -This includes SVP64 SPRs: reading or writing values which are not -supported in hardware must also raise illegal instruction traps -in order to allow emulation. +to allow emulation of future instruction sets, or for subsets of SVP64 to +be implemented in hardware and the rest emulated. This includes SVP64 +SPRs: reading or writing values which are not supported in hardware +must also raise illegal instruction traps in order to allow emulation. Unless otherwise stated, reserved values are always all zeros. -This is unlike OpenPower ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard Power ISA definition -is intended the red keyword `RESERVED` is used. +This is unlike OpenPower ISA v3.1, which in many instances does not +require a trap if reserved fields are nonzero. Where the standard Power +ISA definition is intended the red keyword `RESERVED` is used. ## Definition of "UnVectoriseable" -Any operation that inherently makes no sense if repeated is termed "UnVectoriseable" -or "UnVectorised". Examples include `sc` or `sync` which have no registers. `mtmsr` is -also classed as UnVectoriseable because there is only one `MSR`. 
+Any operation that inherently makes no sense if repeated is termed +"UnVectoriseable" or "UnVectorised". Examples include `sc` or `sync` +which have no registers. `mtmsr` is also classed as UnVectoriseable +because there is only one `MSR`. ## Scalar Identity Behaviour SVP64 is designed so that when the prefix is all zeros, and VL=1, no effect or influence occurs (no augmentation) such that all standard Power ISA -v3.0/v3.1 instructions covered by the prefix are "unaltered". This is termed `scalar identity behaviour` (based on the mathematical definition for "identity", as in, "identity matrix" or better "identity transformation"). +v3.0/v3.1 instructions covered by the prefix are "unaltered". This +is termed `scalar identity behaviour` (based on the mathematical +definition for "identity", as in, "identity matrix" or better "identity +transformation"). -Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix) - whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation"). +Note that this is completely different from when VL=0. VL=0 turns all +operations under its influence into `nops` (regardless of the prefix) + whereas when VL=1 and the SV prefix is all zeros, the operation simply + acts as if SV had not been applied at all to the instruction (an + "identity transformation"). ## Register Naming and size -As previously mentioned SV Registers are simply the INT, FP and CR register files extended -linearly to larger sizes; SV Vectorisation iterates sequentially through these registers -(LSB0 sequential ordering from 0 to VL-1). +As previously mentioned SV Registers are simply the INT, FP and CR +register files extended linearly to larger sizes; SV Vectorisation +iterates sequentially through these registers (LSB0 sequential ordering +from 0 to VL-1). 
-Where the integer regfile in standard scalar -Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127. -Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields -are -extended to 128 entries, CR0 thru CR127. +Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is +r0 to r31, SV extends this as r0 to r127. Likewise FP registers are +extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries, +CR0 thru CR127. The names of the registers therefore reflects a simple linear extension of the Power ISA v3.0B / v3.1B register naming, and in hardware this would be reflected by a linear increase in the size of the underlying SRAM used for the regfiles. -Note: when an EXTRA field (defined below) is zero, SV is deliberately designed -so that the register fields are identical to as if SV was not in effect -i.e. under these circumstances (EXTRA=0) the register field names RA, -RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of -`scalar identity behaviour` described above. +Note: when an EXTRA field (defined below) is zero, SV is deliberately +designed so that the register fields are identical to as if SV was not in +effect i.e. under these circumstances (EXTRA=0) the register field names +RA, RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. +This is part of `scalar identity behaviour` described above. ## Future expansion. With the way that EXTRA fields are defined and applied to register fields, -future versions of SV may involve 256 or greater registers. Backwards binary compatibility may be achieved with a PCR bit (Program Compatibility Register). Further discussion is out of scope for this version of SVP64. +future versions of SV may involve 256 or greater registers. Backwards +binary compatibility may be achieved with a PCR bit (Program Compatibility +Register). Further discussion is out of scope for this version of SVP64. 
-------- @@ -228,7 +248,8 @@ is defined in the Prefix Fields section. TODO incorporate EXT09 -To "activate" svp64 (in a way that does not conflict with v3.1B 64 bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set +To "activate" svp64 (in a way that does not conflict with v3.1B 64 +bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set (see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV. This is achieved by setting bits 7 and 9 to 1: @@ -249,19 +270,19 @@ are constructed: | EXT01 | RM | 1 | RM | 1 | RM | | 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] | -Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1 -instruction. That instruction becomes "prefixed" with the SVP context: the -Remapped Encoding field (RM). +Following the prefix will be the suffix: this is simply a 32-bit v3.0B +/ v3.1 instruction. That instruction becomes "prefixed" with the SVP +context: the Remapped Encoding field (RM). It is important to note that unlike v3.1 64-bit prefixed instructions -there is insufficient space in `RM` to provide identification of -any SVP64 Fields without first partially decoding the -32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the -`RM` format is individually associated with every instruction. +there is insufficient space in `RM` to provide identification of any SVP64 +Fields without first partially decoding the 32-bit suffix. Similar to +the "Forms" (X-Form, D-Form) the `RM` format is individually associated +with every instruction. -Extreme caution and care must therefore be taken -when extending SVP64 in future, to not create unnecessary relationships -between prefix and suffix that could complicate decoding, adding latency. +Extreme caution and care must therefore be taken when extending SVP64 +in future, to not create unnecessary relationships between prefix and +suffix that could complicate decoding, adding latency. 
# Common RM fields @@ -283,24 +304,33 @@ on context after decoding of the Scalar suffix: | EXTRA | `10:18` | Register Extra encoding | | MODE | `19:23` | changes Vector behaviour | -* MODE changes the behaviour of the SV operation (result saturation, mapreduce) -* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work -* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and source operand width -* MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR). -* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix. - -Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag. - -Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. +* MODE changes the behaviour of the SV operation (result saturation, + mapreduce) +* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D + and Audio/Video DSP work +* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and + source operand width +* MASK (and MASK_SRC) and MASKMODE provide predication (two types of + sources: scalar INT and Vector CR). +* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category + for the instruction, which is determined only by decoding the Scalar 32 + bit suffix. + +Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, +such as `RM-1P-3S1D` which indicates for this example that the operation +is to be single-predicated and that there are 3 source operand EXTRA +tags and one destination operand tag. 
+ +Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance +or increased latency in some implementations due to lane-crossing. # Mode -Mode is an augmentation of SV behaviour. Different types of -instructions have different needs, similar to Power ISA -v3.1 64 bit prefix 8LS and MTRR formats apply to different -instruction types. Modes include Reduction, Iteration, arithmetic -saturation, and Fail-First. More specific details in each -section and in the [[svp64/appendix]] +Mode is an augmentation of SV behaviour. Different types of instructions +have different needs, similar to Power ISA v3.1 64 bit prefix 8LS and MTRR +formats apply to different instruction types. Modes include Reduction, +Iteration, arithmetic saturation, and Fail-First. More specific details +in each section and in the SVP64 appendix * For condition register operations see [[sv/cr_ops]] * For LD/ST Modes, see [[sv/ldst]]. @@ -309,12 +339,12 @@ section and in the [[svp64/appendix]] # ELWIDTH Encoding -Default behaviour is set to 0b00 so that zeros follow the convention of -`scalar identity behaviour`. In this case it means that elwidth overrides -are not applicable. Thus if a 32 bit instruction operates on 32 bit, -`elwidth=0b00` specifies that this behaviour is unmodified. Likewise -when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00` -states that, again, the behaviour is not to be modified. +Default behaviour is set to 0b00 so that zeros follow the convention +of `scalar identity behaviour`. In this case it means that elwidth +overrides are not applicable. Thus if a 32 bit instruction operates +on 32 bit, `elwidth=0b00` specifies that this behaviour is unmodified. +Likewise when a processor is switched from 64 bit to 32 bit mode, +`elwidth=0b00` states that, again, the behaviour is not to be modified. Only when elwidth is nonzero is the element width overridden to the explicitly required value. @@ -362,7 +392,7 @@ Vector of CR Fields. 
# SUBVL Encoding -the default for SUBVL is 1 and its encoding is 0b00 to indicate that +The default for SUBVL is 1 and its encoding is 0b00 to indicate that SUBVL is effectively disabled (a SUBVL for-loop of only one element). this lines up in combination with all other "default is all zeros" behaviour. @@ -379,17 +409,14 @@ this may be considered to be elements 0b00 to 0b01 inclusive. # MASK/MASK_SRC & MASKMODE Encoding -TODO: rename MASK_KIND to MASKMODE - One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two types may not be mixed. -Special note: to disable predication this field must -be set to zero in combination with Integer Predication also being set -to 0b000. this has the effect of enabling "all 1s" in the predicate -mask, which is equivalent to "not having any predication at all" -and consequently, in combination with all other default zeros, fully -disables SV (`scalar identity behaviour`). +Special note: to disable predication this field must be set to zero in +combination with Integer Predication also being set to 0b000. this has the +effect of enabling "all 1s" in the predicate mask, which is equivalent to +"not having any predication at all" and consequently, in combination with +all other default zeros, fully disables SV (`scalar identity behaviour`). `MASKMODE` may be set to one of 2 values: @@ -399,32 +426,33 @@ disables SV (`scalar identity behaviour`). | 1 | MASK/MASK_SRC are encoded using CR-based Predication | Integer Twin predication has a second set of 3 bits that uses the same -encoding thus allowing either the same register (r3, r10 or r31) to be used -for both src and dest, or different regs (one for src, one for dest). +encoding thus allowing either the same register (r3, r10 or r31) to be +used for both src and dest, or different regs (one for src, one for dest). Likewise CR based twin predication has a second set of 3 bits, allowing a different test to be applied. 
-Note that it is assumed that Predicate Masks (whether INT or CR) -are read *before* the operations proceed. In practice (for CR Fields) -this creates an unnecessary block on parallelism. Therefore, -it is up to the programmer to ensure that the CR fields used as -Predicate Masks are not being written to by any parallel Vector Loop. -Doing so results in **UNDEFINED** behaviour, according to the definition -outlined in the Power ISA v3.0B Specification. +Note that it is assumed that Predicate Masks (whether INT or CR) are +read *before* the operations proceed. In practice (for CR Fields) +this creates an unnecessary block on parallelism. Therefore, it is up +to the programmer to ensure that the CR fields used as Predicate Masks +are not being written to by any parallel Vector Loop. Doing so results +in **UNDEFINED** behaviour, according to the definition outlined in the +Power ISA v3.0B Specification. Hardware Implementations are therefore free and clear to delay reading of individual CR fields until the actual predicated element operation -needs to take place, safe in the knowledge that no programmer will -have issued a Vector Instruction where previous elements could have -overwritten (destroyed) not-yet-executed CR-Predicated element operations. +needs to take place, safe in the knowledge that no programmer will have +issued a Vector Instruction where previous elements could have overwritten +(destroyed) not-yet-executed CR-Predicated element operations. ## Integer Predication (MASKMODE=0) When the predicate mode bit is zero the 3 bits are interpreted as below. Twin predication has an identical 3 bit field similarly encoded. 
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning: +`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the +following meaning: | Value | Mnemonic | Element `i` enabled if: | |-------|----------|------------------------------| @@ -437,14 +465,16 @@ Twin predication has an identical 3 bit field similarly encoded. | 110 | R30 | `R30 & (1 << i)` is non-zero | | 111 | ~R30 | `R30 & (1 << i)` is zero | -r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs. +r10 and r30 are at the high end of temporary and unused registers, +so as not to interfere with register allocation from ABIs. ## CR-based Predication (MASKMODE=1) When the predicate mode bit is one the 3 bits are interpreted as below. Twin predication has an identical 3 bit field similarly encoded. -`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning: +`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the +following meaning: | Value | Mnemonic | Element `i` is enabled if | |-------|----------|--------------------------| @@ -464,30 +494,40 @@ high, or accept that for twin predication VL must not exceed the range where overlap will occur, *or* that they use the same starting point but select different *bits* of the same CRs -`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD). +`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised +Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD). -The CR Predicates chosen must start on a boundary that Vectorised -CR operations can access cleanly, in full. -With EXTRA2 restricting starting points -to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate -Masks have to be adapted to fit on these boundaries as well. 
+The CR Predicates chosen must start on a boundary that Vectorised CR +operations can access cleanly, in full. With EXTRA2 restricting starting +points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and +CR Predicate Masks have to be adapted to fit on these boundaries as well. # Extra Remapped Encoding -Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*). +Shows all instruction-specific fields in the Remapped Encoding +`RM[10:18]` for all instruction variants. Note that due to the very +tight space, the encoding mode is *not* included in the prefix itself. +The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) +on a per-instruction basis, and, like "Forms" are given a designation +(below) of the form `RM-nP-nSnD`. The full list of which instructions +use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV +files have been provided which will make the task of creating SV-aware +ISA decoders easier*). These mappings are part of the SVP64 Specification in exactly the same way as X-Form, D-Form. New Scalar instructions added to the Power ISA will need a corresponding SVP64 Mapping, which can be derived by-rote from examining the Register "Profile" of the instruction. -There are two categories: Single and Twin Predication. -Due to space considerations further subdivision of Single Predication -is based on whether the number of src operands is 2 or 3. With only -9 bits available some compromises have to be made. 
+There are two categories: Single and Twin Predication. Due to space +considerations further subdivision of Single Predication is based on +whether the number of src operands is 2 or 3. With only 9 bits available +some compromises have to be made. -* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd). -* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest) +* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand + instructions (fmadd, isel, madd). +* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand + instructions (src1 src2 dest) * `RM-2P-1S1D` Twin Predication (src=1, dest=1) * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed) * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update @@ -516,13 +556,14 @@ selection of which is determined by bit 18. | Rsrc2\_EXTRA3 | `16:18` | extends Rsrc3 | These are for 2 operand 1 dest instructions, such as `add RT, RA, -RB`. However also included are unusual instructions with an implicit dest -that is identical to its src reg, such as `rlwinmi`. +RB`. However also included are unusual instructions with an implicit +dest that is identical to its src reg, such as `rlwinmi`. -Normally, with instructions such as `rlwinmi`, the scalar v3.0B ISA would not have sufficient bit fields to allow -an alternative destination. With SV however this becomes possible. -Therefore, the fact that the dest is implicitly also a src should not -mislead: due to the *prefix* they are different SV regs. +Normally, with instructions such as `rlwinmi`, the scalar v3.0B ISA would +not have sufficient bit fields to allow an alternative destination. +With SV however this becomes possible. Therefore, the fact that the +dest is implicitly also a src should not mislead: due to the *prefix* +they are different SV regs. 
* `rlwimi RA, RS, ...` * Rsrc1_EXTRA3 applies to RS as the first src @@ -570,14 +611,16 @@ RM-2P-2S1D: Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2 is in bits 10:11, Rdest1_EXTRA2 in 12:13) -Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2. +Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: +Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2. Note also that LD with update indexed, which takes 2 src and 2 dest (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also Twin Predication. Therefore these are treated as RM-2P-2S1D and the src spec for RA is also used for the same RA as a dest. -Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. +Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance +or increased latency in some implementations due to lane-crossing. # R\*\_EXTRA2/3 @@ -598,14 +641,14 @@ are accessible by all instructions, particularly those with a large number of operands (`madd`, `isel`). In the following tables register numbers are constructed from the -standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 -or EXTRA3 field from the SV Prefix, determined by the specific -RM-xx-yyyy designation for a given instruction. -The prefixing is arranged so that +standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 or +EXTRA3 field from the SV Prefix, determined by the specific RM-xx-yyyy +designation for a given instruction. The prefixing is arranged so that interoperability between prefixing and nonprefixing of scalar registers is direct and convenient (when the EXTRA field is all zeros). 
-A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs) +A pseudocode algorithm explains the relationship, for INT/FP (see +SVP64 appendix for CRs) ``` if extra3_mode: @@ -626,8 +669,8 @@ is limited. ## INT/FP EXTRA3 -If EXTRA3 is zero, maps to -"scalar identity" (scalar Power ISA field naming). +If EXTRA3 is zero, maps to "scalar identity" (scalar Power ISA field +naming). Fields are as follows: @@ -673,7 +716,8 @@ as there is insufficient bits to cover the full range. ## CR Field EXTRA3 -CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode. +CR Field encoding is essentially the same but made more complex due to CRs +being bit-based. See [[svp64/appendix]] for explanation and pseudocode. Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`... and Scalars may only go from `CR0, CR1, ... CR31` @@ -707,7 +751,8 @@ For a 3-bit operand (e.g. BFA): ## CR EXTRA2 -CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode. +CR encoding is essentially the same but made more complex due to CRs +being bit-based. See separate section for explanation and pseudocode. Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32... @@ -745,29 +790,40 @@ field is bits 19-23 of the [[svp64]] RM Field. ## Mode Mode is an augmentation of SV behaviour, providing additional -functionality. Some of these alterations are element-based (saturation), others involve post-analysis (predicate result) and others are Vector-based (mapreduce, fail-on-first). +functionality. Some of these alterations are element-based (saturation), +others involve post-analysis (predicate result) and others are +Vector-based (mapreduce, fail-on-first). 
-[[sv/ldst]], -[[sv/cr_ops]] and [[sv/branches]] are covered separately: the following -Modes apply to Arithmetic and Logical SVP64 operations: +[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately: +the following Modes apply to Arithmetic and Logical SVP64 operations: -* **simple** mode is straight vectorisation. no augmentations: the vector comprises an array of independently created results. -* **ffirst** or data-dependent fail-on-first: see separate section. the vector may be truncated depending on certain criteria. +* **simple** mode is straight vectorisation. no augmentations: the + vector comprises an array of independently created results. +* **ffirst** or data-dependent fail-on-first: see separate section. + the vector may be truncated depending on certain criteria. *VL is altered as a result*. -* **sat mode** or saturation: clamps each element result to a min/max rather than overflows / wraps. allows signed and unsigned clamping for both INT -and FP. +* **sat mode** or saturation: clamps each element result to a min/max + rather than overflows / wraps. allows signed and unsigned clamping + for both INT and FP. * **reduce mode**. if used correctly, a mapreduce (or a prefix sum) is performed. see [[svp64/appendix]]. note that there are comprehensive caveats when using this mode. -* **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch conditional testing) and if the test fails it -is as if the -*destination* predicate bit was zero even before starting the operation. -When Rc=1 the CR element however is still stored in the CR regfile, even if the test failed. See appendix for details. - -Note that ffirst and reduce modes are not anticipated to be high-performance in some implementations. ffirst due to interactions with VL, and reduce due to it requiring additional operations to produce a result. 
simple, saturate and pred-result are however inter-element independent and may easily be parallelised to give high performance, regardless of the value of VL. - -The Mode table for Arithmetic and Logical operations - is laid out as follows: +* **pred-result** will test the result (CR testing selects a bit of CR + and inverts it, just like branch conditional testing) and if the + test fails it is as if the *destination* predicate bit was zero even + before starting the operation. When Rc=1 the CR element however is + still stored in the CR regfile, even if the test failed. See appendix + for details. + +Note that ffirst and reduce modes are not anticipated to be +high-performance in some implementations. ffirst due to interactions +with VL, and reduce due to it requiring additional operations to produce +a result. simple, saturate and pred-result are however inter-element +independent and may easily be parallelised to give high performance, +regardless of the value of VL. + +The Mode table for Arithmetic and Logical operations is laid out as +follows: | 0-1 | 2 | 3 4 | description | | --- | --- |---------|-------------------------- | @@ -782,28 +838,30 @@ The Mode table for Arithmetic and Logical operations Fields: -* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context. +* **sz / dz** if predication is enabled will put zeros into the dest + (or as src in the case of twin pred) when the predicate bit is zero. + otherwise the element is ignored or skipped, depending on context. 
* **zz**: both sz and dz are set equal to this flag -* **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1) +* **inv CR bit** just as in branches (BO) these bits allow testing of + a CR bit and whether it is set (inv=0) or unset (inv=1) * **RG** inverts the Vector Loop order (VL-1 downto 0) rather -than the normal 0..VL-1 + than the normal 0..VL-1 * **N** sets signed/unsigned saturation. * **RC1** as if Rc=1, enables access to `VLi`. * **VLi** VL inclusive: in fail-first mode, the truncation of VL *includes* the current element at the failure point rather than excludes it from the count. -For LD/ST Modes, see [[sv/ldst]]. For Condition Registers -see [[sv/cr_ops]]. -For Branch modes, see [[sv/branches]]. +For LD/ST Modes, see [[sv/ldst]]. For Condition Registers see +[[sv/cr_ops]]. For Branch modes, see [[sv/branches]]. ## Rounding, clamp and saturate -To help ensure for example that audio quality is not compromised by overflow, -"saturation" is provided, as well as a way to detect when saturation -occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs, -one CR per element in the result (Note: this is different from VSX which -has a single CR per block). +To help ensure for example that audio quality is not compromised by +overflow, "saturation" is provided, as well as a way to detect when +saturation occurred if desired (Rc=1). When Rc=1 there will be a *vector* +of CRs, one CR per element in the result (Note: this is different from +VSX which has a single CR per block). When N=0 the result is saturated to within the maximum range of an unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar @@ -821,99 +879,94 @@ the hugely detrimental effect it has on parallel processing, XER.SO is overflow bit is therefore simply set to zero if saturation did not occur, and to one if it did. 
-Note also that saturate on operations that set OE=1 must raise an -Illegal Instruction due to the conflicting use of the CR.so bit for -storing if -saturation occurred. Integer Operations that produce a Carry-Out (CA, CA32): -these two bits will be `UNDEFINED` if saturation is also requested. +Note also that saturate on operations that set OE=1 must raise an Illegal +Instruction due to the conflicting use of the CR.so bit for storing if +saturation occurred. Integer Operations that produce a Carry-Out (CA, +CA32): these two bits will be `UNDEFINED` if saturation is also requested. Note that the operation takes place at the maximum bitwidth (max of src and dest elwidth) and that truncation occurs to the range of the dest elwidth. -*Programmer's Note: Post-analysis of the Vector of CRs to find out if any given element hit -saturation may be done using a mapreduced CR op (cror), or by using the -new crrweird instruction with Rc=1, which will transfer the required -CR bits to a scalar integer and update CR0, which will allow testing -the scalar integer for nonzero. see [[sv/cr_int_predication]]* +*Programmer's Note: Post-analysis of the Vector of CRs to find out if any +given element hit saturation may be done using a mapreduced CR op (cror), +or by using the new crrweird instruction with Rc=1, which will transfer +the required CR bits to a scalar integer and update CR0, which will allow +testing the scalar integer for nonzero. see [[sv/cr_int_predication]]* ## Reduce mode -Reduction in SVP64 is similar in essence to other Vector Processing -ISAs, but leverages the underlying scalar Base v3.0B operations. -Thus it is more a convention that the programmer may utilise to give -the appearance and effect of a Horizontal Vector Reduction. Due -to the unusual decoupling it is also possible to perform -prefix-sum (Fibonacci Series) in certain circumstances. 
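The integer clamping described in this section can be sketched in Python. This is only an illustration under stated assumptions: the signed range used for N=1 is assumed to be the usual two's-complement range (the text shown here spells out only the unsigned case), and the function name `saturate` is hypothetical, not part of the specification.

```python
def saturate(result, elwidth, n):
    """Clamp an unbounded integer result to the destination elwidth.

    n=0: unsigned range 0 .. 2**elwidth - 1 (as stated in the text)
    n=1: signed range, ASSUMED here to be two's complement
    Returns (clamped value, overflow bit); the bit is 1 only when
    saturation actually occurred, otherwise 0.
    """
    if n == 0:
        lo, hi = 0, (1 << elwidth) - 1
    else:
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    clamped = min(max(result, lo), hi)
    return clamped, int(clamped != result)


# One overflow bit per element: a *vector* of flags, not a single one.
results = [200 + 100, 17, -5]
print([saturate(r, 8, 0) for r in results])
```

Each element carries its own overflow indication, which mirrors the per-element CR behaviour described for Rc=1.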
Details are in the [[svp64/appendix]] +Reduction in SVP64 is similar in essence to other Vector Processing ISAs, +but leverages the underlying scalar Base v3.0B operations. Thus it is +more a convention that the programmer may utilise to give the appearance +and effect of a Horizontal Vector Reduction. Due to the unusual decoupling +it is also possible to perform prefix-sum (Fibonacci Series) in certain +circumstances. Details are in the SVP64 appendix Reduce Mode should not be confused with Parallel Reduction [[sv/remap]]. As explained in the [[sv/appendix]] Reduce Mode switches off the check which would normally stop looping if the result register is scalar. Thus, the result scalar register, if also used as a source scalar, may be used to perform sequential accumulation. This *deliberately* -sets up a chain -of Register Hazard Dependencies, whereas Parallel Reduce [[sv/remap]] -deliberately issues a Tree-Schedule of operations that may be parallelised. +sets up a chain of Register Hazard Dependencies, whereas Parallel Reduce +[[sv/remap]] deliberately issues a Tree-Schedule of operations that may +be parallelised. ## Fail-on-first Data-dependent fail-on-first has two distinct variants: one for LD/ST, -the other for arithmetic operations (actually, CR-driven). Note in each -case the assumption is that vector elements are required to appear to be -executed in sequential Program Order. When REMAP is not active, +the other for arithmetic operations (actually, CR-driven). Note in +each case the assumption is that vector elements are required to appear +to be executed in sequential Program Order. When REMAP is not active, element 0 would be the first. Data-driven (CR-driven) fail-on-first activates when Rc=1 or other CR-creating operation produces a result (including cmp). 
Similar to branch, an analysis of the CR is performed and if the test fails, the vector operation terminates and discards all element operations **at and -above the current one**, and VL is truncated to either -the *previous* element or the current one, depending on whether -VLi (VL "inclusive") is clear or set, respectively. +above the current one**, and VL is truncated to either the *previous* +element or the current one, depending on whether VLi (VL "inclusive") +is clear or set, respectively. -Thus the new VL comprises a contiguous vector of results, -all of which pass the testing criteria (equal to zero, less than zero etc -as defined by the CR-bit test). +Thus the new VL comprises a contiguous vector of results, all of which +pass the testing criteria (equal to zero, less than zero etc as defined +by the CR-bit test). *Note: when VLi is clear, the behaviour at first seems counter-intuitive. A result is calculated but if the test fails it is prohibited from being actually written. This becomes intuitive again when it is remembered -that the length that VL is set to is the number of *written* elements, -and only when VLI is set will the current element be included in that -count.* - -The CR-based data-driven fail-on-first is "new" and not found in ARM -SVE or RVV. At the same time it is "old" because it is almost -identical to a generalised form of Z80's `CPIR` instruction. -It is extremely useful for reducing instruction count, -however requires speculative execution involving modifications of VL -to get high performance implementations. An additional mode (RC1=1) -effectively turns what would otherwise be an arithmetic operation -into a type of `cmp`. The CR is stored (and the CR.eq bit tested -against the `inv` field). -If the CR.eq bit is equal to `inv` then the Vector is truncated and -the loop ends. 
+that the length that VL is set to is the number of *written* elements, and +only when VLI is set will the current element be included in that count.* + +The CR-based data-driven fail-on-first is "new" and not found in ARM SVE +or RVV. At the same time it is "old" because it is almost identical to +a generalised form of Z80's `CPIR` instruction. It is extremely useful +for reducing instruction count, however requires speculative execution +involving modifications of VL to get high performance implementations. +An additional mode (RC1=1) effectively turns what would otherwise be an +arithmetic operation into a type of `cmp`. The CR is stored (and the +CR.eq bit tested against the `inv` field). If the CR.eq bit is equal to +`inv` then the Vector is truncated and the loop ends. VLi is only available as an option when `Rc=0` (or for instructions -which do not have Rc). When set, the current element is always -also included in the count (the new length that VL will be set to). -This may be useful in combination with "inv" to truncate the Vector -to *exclude* elements that fail a test, or, in the case of implementations -of strncpy, to include the terminating zero. +which do not have Rc). When set, the current element is always also +included in the count (the new length that VL will be set to). This may +be useful in combination with "inv" to truncate the Vector to *exclude* +elements that fail a test, or, in the case of implementations of strncpy, +to include the terminating zero. In CR-based data-driven fail-on-first there is only the option to select and test one bit of each CR (just as with branch BO). For more complex tests this may be insufficient. If that is the case, a vectorised crop such as crand, cror or [[sv/cr_int_predication]] crweirder may be used, -and ffirst applied to the crop instead of to -the arithmetic vector. Note that crops are covered by -the [[sv/cr_ops]] Mode format. +and ffirst applied to the crop instead of to the arithmetic vector. 
Note +that crops are covered by the [[sv/cr_ops]] Mode format. -*Programmer's note: `VLi` is only accessible in normal operations -which in turn limits the CR field bit-testing to only `EQ/NE`. -[[sv/cr_ops]] are not so limited. Thus it is possible to use for -example `sv.cror/ff=gt/vli *0,*0,*0`, which is not a `nop` because -it allows Fail-First Mode to perform a test and truncate VL.* +*Programmer's note: `VLi` is only accessible in normal operations which in +turn limits the CR field bit-testing to only `EQ/NE`. [[sv/cr_ops]] are +not so limited. Thus it is possible to use for example `sv.cror/ff=gt/vli +*0,*0,*0`, which is not a `nop` because it allows Fail-First Mode to +perform a test and truncate VL.* Two extremely important aspects of ffirst are: @@ -929,41 +982,39 @@ Two extremely important aspects of ffirst are: The second crucial aspect, compared to LDST Ffirst: * LD/ST Failfirst may (beyond the initial first element - conditions) truncate VL for any architecturally - suitable reason. Beyond the first element LD/ST Failfirst is - arbitrarily speculative and 100% non-deterministic. + conditions) truncate VL for any architecturally suitable reason. Beyond + the first element LD/ST Failfirst is arbitrarily speculative and 100% + non-deterministic. * CR-based data-dependent first on the other hand MUST NOT truncate VL arbitrarily to a length decided by the hardware: VL MUST only be - truncated based explicitly on whether a test fails. - This because it is a precise Deterministic test on which algorithms - can and will rely. + truncated based explicitly on whether a test fails. This because it is + a precise Deterministic test on which algorithms can and will rely. **Floating-point Exceptions** When Floating-point exceptions are enabled VL must be truncated at the point where the Exception appears not to have occurred. If `VLi` -is set then VL must include the faulting element, and thus the -faulting element will always raise its exception.
If however `VLi` -is clear then VL **excludes** the faulting element and thus the -exception will **never** be raised. +is set then VL must include the faulting element, and thus the faulting +element will always raise its exception. If however `VLi` is clear then +VL **excludes** the faulting element and thus the exception will **never** +be raised. -Although very strongly -discouraged the Exception Mode that permits Floating Point Exception -notification to arrive too late to unwind is permitted -(under protest, due to it violating -the otherwise 100% Deterministic nature of Data-dependent Fail-first). +Although very strongly discouraged the Exception Mode that permits +Floating Point Exception notification to arrive too late to unwind +is permitted (under protest, due to it violating the otherwise 100% +Deterministic nature of Data-dependent Fail-first) and is `UNDEFINED` +behaviour. **Use of lax FP Exception Notification Mode could result in parallel computations proceeding with invalid results that have to be explicitly detected, whereas with the strict FP Exception Mode enabled, FFirst -truncates VL, allows subsequent parallel computation to avoid -the exceptions entirely** +truncates VL, allows subsequent parallel computation to avoid the +exceptions entirely** ## Data-dependent fail-first on CR operations (crand etc) -Operations that actually produce or alter CR Field as a result -have their own SVP64 Mode, described -in [[sv/cr_ops]]. +Operations that actually produce or alter CR Field as a result have +their own SVP64 Mode, described in [[sv/cr_ops]]. ## pred-result mode @@ -1013,9 +1064,10 @@ Load and Store operations that go far beyond the capabilities of Scalar RISC and most CISC processors, yet at their heart on an individual element basis may be found to be no different from RISC Scalar equivalents.
-The resource savings from Vector LD/ST are significant and stem from -the fact that one single instruction can trigger a dozen (or in some -microarchitectures such as Cray or NEC SX Aurora) hundreds of element-level Memory accesses. +The resource savings from Vector LD/ST are significant and stem +from the fact that one single instruction can trigger a dozen (or in +some microarchitectures such as Cray or NEC SX Aurora) hundreds of +element-level Memory accesses. Additionally, and simply: if the Arithmetic side of an ISA supports Vector Operations, then in order to keep the ALUs 100% occupied the @@ -1024,10 +1076,9 @@ Memory Operations as well. Vectorised Load and Store also presents an extra dimension (literally) which creates scenarios unique to Vector applications, that a Scalar -(and even a SIMD) ISA simply never encounters. SVP64 endeavours to -add the modes typically found in *all* Scalable Vector ISAs, -without changing the behaviour of the underlying Base -(Scalar) v3.0B operations in any way. +(and even a SIMD) ISA simply never encounters. SVP64 endeavours to add +the modes typically found in *all* Scalable Vector ISAs, without changing +the behaviour of the underlying Base (Scalar) v3.0B operations in any way. ## Modes overview @@ -1040,41 +1091,39 @@ a number of different modes: * **Speculative fail-first** - where it makes sense to do so * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode. -*Despite being constructed from Scalar LD/ST none of these Modes -exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs* +*Despite being constructed from Scalar LD/ST none of these Modes exist +or make sense in any Scalar ISA. They **only** exist in Vector ISAs* Also included in SVP64 LD/ST is both signed and unsigned Saturation, as well as Element-width overrides and Twin-Predication. 
-Note also that Indexed [[sv/remap]] mode may be applied to both -v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions. -LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification -is provided below. +Note also that Indexed [[sv/remap]] mode may be applied to both v3.0 +LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions. +LD/ST-Indexed should not be conflated with Indexed REMAP mode: +clarification is provided below. **Determining the LD/ST Modes** A minor complication (caused by the retro-fitting of modern Vector features to a Scalar ISA) is that certain features do not exactly make sense or are considered a security risk. Fail-first on Vector Indexed -would allow attackers to probe large numbers of pages from userspace, where -strided fail-first (by creating contiguous sequential LDs) does not. +would allow attackers to probe large numbers of pages from userspace, +where strided fail-first (by creating contiguous sequential LDs) does not. -In addition, reduce mode makes no sense. -Realistically we need -an alternative table definition for [[sv/svp64]] `RM.MODE`. -The following modes make sense: +In addition, reduce mode makes no sense. Realistically we need an +alternative table definition for [[sv/svp64]] `RM.MODE`. The following +modes make sense: * saturation * predicate-result (mostly for cache-inhibited LD/ST) * simple (no augmentation) * fail-first (where Vector Indexed is banned) * Signed Effective Address computation (Vector Indexed only) -* Pack/Unpack (on LD/ST immediate operations only) More than that however it is necessary to fit the usual Vector ISA -capabilities onto both Power ISA LD/ST with immediate and to -LD/ST Indexed. They present subtly different Mode tables, which, due -to lack of space, have the following quirks: +capabilities onto both Power ISA LD/ST with immediate and to LD/ST +Indexed. 
They present subtly different Mode tables, which, due to lack +of space, have the following quirks: * LD/ST Immediate has no individual control over src/dest zeroing, whereas LD/ST Indexed does. @@ -1085,13 +1134,19 @@ to lack of space, have the following quirks: Fields used in tables below: -* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context. +* **sz / dz** if predication is enabled will put zeros into the dest + (or as src in the case of twin pred) when the predicate bit is zero. + otherwise the element is ignored or skipped, depending on context. * **zz**: both sz and dz are set equal to this flag. -* **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1) +* **inv CR bit** just as in branches (BO) these bits allow testing of + a CR bit and whether it is set (inv=0) or unset (inv=1) * **N** sets signed/unsigned saturation. * **RC1** as if Rc=1, stores CRs *but not the result* * **SEA** - Signed Effective Address, if enabled performs sign-extension on registers that have been reduced due to elwidth overrides +* **PI** - post-increment mode (applies to LD/ST with update only). + the Effective Address utilised is always just RA, i.e. the computation of + EA is stored in RA **after** it is actually used. **LD/ST immediate** @@ -1120,41 +1175,37 @@ whether stride is unit or element: svctx.ldstmode = elementstride ``` -An immediate of zero is a safety-valve to allow `LD-VSPLAT`: -in effect the multiplication of the immediate-offset by zero results -in reading from the exact same memory location, *even with a Vector -register*. 
(Normally this type of behaviour is reserved for the -mapreduce modes) - -For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur -just the once and be copied, rather than hitting the Data Cache -multiple times with the same memory read at the same location. -The benefit of Cache-inhibited LD-splats is that it allows -for memory-mapped peripherals to have multiple -data values read in quick succession and stored in sequentially -numbered registers (but, see Note below). - -For non-cache-inhibited ST from a vector source onto a scalar -destination: with the Vector -loop effectively creating multiple memory writes to the same location, -we can deduce that the last of these will be the "successful" one. Thus, -implementations are free and clear to optimise out the overwriting STs, -leaving just the last one as the "winner". Bear in mind that predicate -masks will skip some elements (in source non-zeroing mode). -Cache-inhibited ST operations on the other hand **MUST** write out -a Vector source multiple successive times to the exact same Scalar -destination. Just like Cache-inhibited LDs, multiple values may be -written out in quick succession to a memory-mapped peripheral from -sequentially-numbered registers. +An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect +the multiplication of the immediate-offset by zero results in reading from +the exact same memory location, *even with a Vector register*. (Normally +this type of behaviour is reserved for the mapreduce modes) + +For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just +the once and be copied, rather than hitting the Data Cache multiple +times with the same memory read at the same location. The benefit of +Cache-inhibited LD-splats is that it allows for memory-mapped peripherals +to have multiple data values read in quick succession and stored in +sequentially numbered registers (but, see Note below). 
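The zero-immediate case can be illustrated with a short Python sketch. It assumes that element-strided immediate mode forms each Effective Address as RA plus the immediate multiplied by the element index (the text here says only that the immediate-offset is multiplied). The helper `element_strided_eas` and the toy `mem` dictionary are hypothetical names used purely for illustration.

```python
def element_strided_eas(ra, imm, vl):
    """Effective Address per element: base plus imm times element index.

    ASSUMED form of element-strided immediate addressing.
    """
    return [ra + imm * i for i in range(vl)]


mem = {0x1000: 0xAB, 0x1008: 0xCD, 0x1010: 0xEF}  # toy memory contents

# Nonzero immediate: successive elements read successive locations.
print([hex(ea) for ea in element_strided_eas(0x1000, 8, 3)])

# Zero immediate: every element reads the same location (LD-VSPLAT).
eas = element_strided_eas(0x1000, 0, 3)
print([mem[ea] for ea in eas])
```

With a zero immediate all computed addresses coincide, which is why a non-cache-inhibited implementation may perform the read once and copy the value.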
+ +For non-cache-inhibited ST from a vector source onto a scalar destination: +with the Vector loop effectively creating multiple memory writes to +the same location, we can deduce that the last of these will be the +"successful" one. Thus, implementations are free and clear to optimise +out the overwriting STs, leaving just the last one as the "winner". +Bear in mind that predicate masks will skip some elements (in source +non-zeroing mode). Cache-inhibited ST operations on the other hand +**MUST** write out a Vector source multiple successive times to the exact +same Scalar destination. Just like Cache-inhibited LDs, multiple values +may be written out in quick succession to a memory-mapped peripheral +from sequentially-numbered registers. Note that any memory location may be Cache-inhibited (Power ISA v3.1, Book III, 1.6.1, p1033) -*Programmer's Note: an immediate also with a Scalar source as -a "VSPLAT" mode is simply not possible: there are not enough -Mode bits. One single Scalar Load operation may be used instead, followed -by any arithmetic operation (including a simple mv) in "Splat" -mode.* +*Programmer's Note: an immediate also with a Scalar source as a "VSPLAT" +mode is simply not possible: there are not enough Mode bits. One single +Scalar Load operation may be used instead, followed by any arithmetic +operation (including a simple mv) in "Splat" mode.* **LD/ST Indexed** @@ -1176,49 +1227,58 @@ Vector Indexed Strided Mode is qualified as follows: A summary of the effect of Vectorisation of src or dest: - imm(RA) RT.v RA.v no stride allowed - imm(RA) RT.s RA.v no stride allowed - imm(RA) RT.v RA.s stride-select allowed - imm(RA) RT.s RA.s not vectorised - RA,RB RT.v {RA|RB}.v Standard Indexed - RA,RB RT.s {RA|RB}.v Indexed but single LD (no VSPLAT) - RA,RB RT.v {RA&RB}.s VSPLAT possible. 
stride selectable - RA,RB RT.s {RA&RB}.s not vectorised (scalar identity) - -Signed Effective Address computation is only relevant for -Vector Indexed Mode, when elwidth overrides are applied. -The source override applies to RB, and before adding to -RA in order to calculate the Effective Address, if SEA is -set RB is sign-extended from elwidth bits to the full 64 -bits. For other Modes (ffirst, saturate), -all EA computation with elwidth overrides is unsigned. - -Note that cache-inhibited LD/ST when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially. Even with scalar src a -Cache-inhibited LD will read the same memory location *multiple times*, storing the result in successive Vector destination registers. This because the cache-inhibit instructions are typically used to read and write memory-mapped peripherals. -If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar* -cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv, -copying the one *scalar* value into multiple register destinations. +``` + imm(RA) RT.v RA.v no stride allowed + imm(RA) RT.s RA.v no stride allowed + imm(RA) RT.v RA.s stride-select allowed + imm(RA) RT.s RA.s not vectorised + RA,RB RT.v {RA|RB}.v Standard Indexed + RA,RB RT.s {RA|RB}.v Indexed but single LD (no VSPLAT) + RA,RB RT.v {RA&RB}.s VSPLAT possible. stride selectable + RA,RB RT.s {RA&RB}.s not vectorised (scalar identity) +``` + +Signed Effective Address computation is only relevant for Vector Indexed +Mode, when elwidth overrides are applied. The source override applies to +RB, and before adding to RA in order to calculate the Effective Address, +if SEA is set RB is sign-extended from elwidth bits to the full 64 bits. +For other Modes (ffirst, saturate), all EA computation with elwidth +overrides is unsigned. + +Note that cache-inhibited LD/ST when VSPLAT is activated will perform +**multiple** LD/ST operations, sequentially. 
Even with scalar src +a Cache-inhibited LD will read the same memory location *multiple +times*, storing the result in successive Vector destination registers. +This because the cache-inhibit instructions are typically used to read +and write memory-mapped peripherals. If a genuine cache-inhibited +LD-VSPLAT is required then a single *scalar* cache-inhibited LD should +be performed, followed by a VSPLAT-augmented mv, copying the one *scalar* +value into multiple register destinations. Note also that cache-inhibited VSPLAT with Predicate-result is possible. This allows for example to issue a massive batch of memory-mapped peripheral reads, stopping at the first NULL-terminated character and -truncating VL to that point. No branch is needed to issue that large burst -of LDs, which may be valuable in Embedded scenarios. +truncating VL to that point. No branch is needed to issue that large +burst of LDs, which may be valuable in Embedded scenarios. ## Vectorisation of Scalar Power ISA v3.0B Scalar Power ISA Load/Store operations may be seen from their pseudocode to be of the form: +``` lbux RT, RA, RB EA <- (RA) + (RB) RT <- MEM(EA) +``` and for immediate variants: +``` lb RT,D(RA) EA <- RA + EXTS(D) RT <- MEM(EA) +``` Thus in the first example, the source registers may each be independently marked as scalar or vector, and likewise the destination; in the second @@ -1310,15 +1370,19 @@ Note that Element-Strided uses the Destination Step because with both sources being Scalar as a prerequisite condition of activation of Element-Stride Mode, the source step (being Scalar) would never advance. -Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source. This because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range. 
+Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" +mode (`ldux`) to be effectively a *completely different* register from +RA-as-a-source. This because there is room in svp64 to extend RA-as-src +as well as RA-as-dest, both independently as scalar or vector *and* +independently extending their range. -*Programmer's note: being able to set RA-as-a-source - as separate from RA-as-a-destination as Scalar is **extremely valuable** - once it is remembered that Simple-V element operations must - be in Program Order, especially in loops, for saving on - multiple address computations. Care does have - to be taken however that RA-as-src is not overwritten by - RA-as-dest unless intentionally desired, especially in element-strided Mode.* +*Programmer's note: being able to set RA-as-a-source as separate from +RA-as-a-destination as Scalar is **extremely valuable** once it is +remembered that Simple-V element operations must be in Program Order, +especially in loops, for saving on multiple address computations. Care +does have to be taken however that RA-as-src is not overwritten by +RA-as-dest unless intentionally desired, especially in element-strided +Mode.* ## LD/ST Indexed vs Indexed REMAP @@ -1332,22 +1396,21 @@ contexts, potentially causing confusion. Mode that can be applied to *any* instruction **including those named LD/ST Indexed**. -Whilst it may be costly in terms of register reads to allow REMAP -Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as -`sv.ld *RT,RA,*RB`, or even misleadingly -labelled as redundant, firstly the strict -application of the RISC Paradigm that Simple-V follows makes it awkward -to consider *preventing* the application of Indexed REMAP to such -operations, and secondly they are not actually the same at all. 
+Whilst it may be costly in terms of register reads to allow REMAP Indexed +Mode to be applied to any Vectorised LD/ST Indexed operation such as +`sv.ld *RT,RA,*RB`, or even misleadingly labelled as redundant, firstly +the strict application of the RISC Paradigm that Simple-V follows makes +it awkward to consider *preventing* the application of Indexed REMAP to +such operations, and secondly they are not actually the same at all. Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB` effectively performs an *in-place* re-ordering of the offsets, RB. To achieve the same effect without Indexed REMAP would require taking a *copy* of the Vector of offsets starting at RB, manually explicitly -reordering them, and finally using the copy of re-ordered offsets in -a non-REMAP'ed `sv.ld`. Using non-strided LD as an example, -pseudocode showing what actually occurs, -where the pseudocode for `indexed_remap` may be found in [[sv/remap]]: +reordering them, and finally using the copy of re-ordered offsets in a +non-REMAP'ed `sv.ld`. Using non-strided LD as an example, pseudocode +showing what actually occurs, where the pseudocode for `indexed_remap` +may be found in [[sv/remap]]: ``` # sv.ld *RT,RA,*RB with Index REMAP applied to RB @@ -1380,39 +1443,42 @@ considered a security risk due to the abuse of probing multiple pages in rapid succession and getting speculative feedback on which pages would fail. Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead used for element-strided LD/ST. -See ``` for(i = 0; i < VL; i++) reg[rt + i] = mem[reg[ra] + i * reg[rb]]; ``` -High security implementations where any kind of speculative probing -of memory pages is considered a risk should take advantage of the fact that -implementations may truncate VL at any point, without requiring software -to be rewritten and made non-portable. 
Such implementations may choose -to *always* set VL=1 which will have the effect of terminating any -speculative probing (and also adversely affect performance), but will -at least not require applications to be rewritten. - -Low-performance simpler hardware implementations may also -choose (always) to also set VL=1 as the bare minimum compliant implementation of -LD/ST Fail-First. It is however critically important to remember that -the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. -**MUST** raise exceptions exactly like an ordinary LD/ST. - -For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins the following ffirst LD/ST operations on an aligned boundary -such as the beginning of a cache line, or beginning of a Virtual Memory -page. Likewise, to reduce workloads or balance resources. - -Vertical-First Mode is slightly strange in that only one element -at a time is ever executed anyway. Given that programmers may -legitimately choose to alter srcstep and dststep in non-sequential -order as part of explicit loops, it is neither possible nor -safe to make speculative assumptions about future LD/STs. -Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`. -This is very different from Arithmetic (Data-dependent) FFirst -where Vertical-First Mode is fully deterministic, not speculative. +High security implementations where any kind of speculative probing of +memory pages is considered a risk should take advantage of the fact +that implementations may truncate VL at any point, without requiring +software to be rewritten and made non-portable. 
Such implementations may +choose to *always* set VL=1 which will have the effect of terminating +any speculative probing (and also adversely affect performance), but +will at least not require applications to be rewritten. + +Low-performance simpler hardware implementations may also choose (always) +to also set VL=1 as the bare minimum compliant implementation of LD/ST +Fail-First. It is however critically important to remember that the first +element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST** +raise exceptions exactly like an ordinary LD/ST. + +For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value +for any implementation-specific reason. For example: it is perfectly +reasonable for implementations to alter VL when ffirst LD or ST operations +are initiated on a nonaligned boundary, such that within a loop the +subsequent iteration of that loop begins the following ffirst LD/ST +operations on an aligned boundary such as the beginning of a cache line, +or beginning of a Virtual Memory page. Likewise, to reduce workloads or +balance resources. + +Vertical-First Mode is slightly strange in that only one element at a time +is ever executed anyway. Given that programmers may legitimately choose +to alter srcstep and dststep in non-sequential order as part of explicit +loops, it is neither possible nor safe to make speculative assumptions +about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is +`UNDEFINED`. This is very different from Arithmetic (Data-dependent) +FFirst where Vertical-First Mode is fully deterministic, not speculative. ## LOAD/STORE Elwidths @@ -1444,17 +1510,15 @@ is treated effectively as completely separate and distinct from SV augmentation. This is primarily down to quirks surrounding LE/BE and byte-reversal. -It is rather unfortunately possible to request an elwidth override -on the memory side which -does not mesh with the overridden operation width: these result in -`UNDEFINED` -behaviour. 
The reason is that the effect of attempting a 64-bit `sv.ld` -operation with a source elwidth override of 8/16/32 would result in -overlapping memory requests, particularly on unit and element strided -operations. Thus it is `UNDEFINED` when the elwidth is smaller than -the memory operation width. Examples include `sv.lw/sw=16/els` which -requests (overlapping) 4-byte memory reads offset from -each other at 2-byte intervals. Store likewise is also `UNDEFINED` +It is rather unfortunately possible to request an elwidth override on +the memory side which does not mesh with the overridden operation width: +these result in `UNDEFINED` behaviour. The reason is that the effect +of attempting a 64-bit `sv.ld` operation with a source elwidth override +of 8/16/32 would result in overlapping memory requests, particularly +on unit and element strided operations. Thus it is `UNDEFINED` when +the elwidth is smaller than the memory operation width. Examples include +`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset +from each other at 2-byte intervals. Store likewise is also `UNDEFINED` where the dest elwidth override is less than the operation width. Note the following regarding the pseudocode to follow: @@ -1470,13 +1534,13 @@ Note the following regarding the pseudocode to follow: * `svctx` specifies the SV Context and includes VL as well as source and destination elwidth overrides. -Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in -both Immediate and Indexed LD/ST, -does not have element-width overriding applied to it. +Below is the pseudocode for Unit-Strided LD (which includes Vector +capability). Observe in particular that RA, as the base address in both +Immediate and Indexed LD/ST, does not have element-width overriding +applied to it. 
-Note that predication, predication-zeroing, -and other modes except saturation have all been removed, -for clarity and simplicity: +Note that predication, predication-zeroing, and other modes except +saturation have all been removed, for clarity and simplicity: ``` # LD not VLD! @@ -1566,17 +1630,16 @@ for clarity: predication and all modes except saturation are removed: ## Remapped LD/ST -In the [[sv/remap]] page the concept of "Remapping" is described. -Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides -a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 -elements worth of LDs or STs. The usual interest in such re-mapping -is for example in separating out 24-bit RGB channel data into separate -contiguous registers. +In the [[sv/remap]] page the concept of "Remapping" is described. Whilst +it is expensive to set up (2 64-bit opcodes minimum) it provides a way to +arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth +of LDs or STs. The usual interest in such re-mapping is for example in +separating out 24-bit RGB channel data into separate contiguous registers. -REMAP easily covers this capability, and with dest -elwidth overrides and saturation may do so with built-in conversion that -would normally require additional width-extension, sign-extension and -min/max Vectorised instructions as post-processing stages. +REMAP easily covers this capability, and with dest elwidth overrides +and saturation may do so with built-in conversion that would normally +require additional width-extension, sign-extension and min/max Vectorised +instructions as post-processing stages. Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes because the generic abstracted concept of "Remapping", when applied to @@ -1584,8 +1647,8 @@ LD/ST, will give that same capability, with far more flexibility. 
It is worth noting that Pack/Unpack Modes of SVSTATE, which may be established through `svstep`, are also an easy way to perform regular -Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond -that, REMAP will need to be used. +Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that, +REMAP will need to be used. -------- @@ -1617,13 +1680,12 @@ in SVP64 `RM` may be used for other purposes. This alternative mapping **only** applies to instructions that **only** reference a CR Field or CR bit as the sole exclusive result. This section **does not** apply to instructions which primarily produce arithmetic -results that also, as an aside, produce a corresponding -CR Field (such as when Rc=1). -Instructions that involve Rc=1 are definitively arithmetic in nature, -where the corresponding Condition Register Field can be considered to -be a "co-result". Such CR Field "co-result" arithmeric operations -are firmly out of scope for -this section, being covered fully by [[sv/normal]]. +results that also, as an aside, produce a corresponding CR Field (such as +when Rc=1). Instructions that involve Rc=1 are definitively arithmetic +in nature, where the corresponding Condition Register Field can be +considered to be a "co-result". Such CR Field "co-result" arithmeric +operations are firmly out of scope for this section, being covered fully +by [[sv/normal]]. * Examples of v3.0B instructions to which this section does apply is @@ -1640,19 +1702,15 @@ instruction is purely to a Condition Register Field. Other modes are still applicable and include: * **Data-dependent fail-first**. - useful to truncate VL based on - analysis of a Condition Register result bit. + useful to truncate VL based on analysis of a Condition Register result bit. * **Reduction**. - Reduction is useful -for analysing a Vector of Condition Register Fields -and reducing it to one -single Condition Register Field. 
+ Reduction is useful for analysing a Vector of Condition Register Fields + and reducing it to one single Condition Register Field. -Predicate-result does not make any sense because -when Rc=1 a co-result is created (a CR Field). Testing the co-result -allows the decision to be made to store or not store the main -result, and for CR Ops the CR Field result *is* -the main result. +Predicate-result does not make any sense because when Rc=1 a co-result +is created (a CR Field). Testing the co-result allows the decision to +be made to store or not store the main result, and for CR Ops the CR +Field result *is* the main result. ## Format @@ -1667,12 +1725,16 @@ SVP64 RM `MODE` (includes `ELWIDTH_SRC` bits) for CR-based operations: Fields: -* **sz / dz** if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context. +* **sz / dz** if predication is enabled will put zeros into the dest + (or as src in the case of twin pred) when the predicate bit is zero. + otherwise the element is ignored or skipped, depending on context. * **zz** set both sz and dz equal to this flag -* **SNZ** In fail-first mode, on the bit being tested, when sz=1 and SNZ=1 a value "1" is put in place of "0". -* **inv CR-bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1) +* **SNZ** In fail-first mode, on the bit being tested, when sz=1 and + SNZ=1 a value "1" is put in place of "0". 
+* **inv CR-bit** just as in branches (BO) these bits allow testing of + a CR bit and whether it is set (inv=0) or unset (inv=1) * **RG** inverts the Vector Loop order (VL-1 downto 0) rather -than the normal 0..VL-1 + than the normal 0..VL-1 * **SVM** sets "subvector" reduce mode * **VLi** VL inclusive: in fail-first mode, the truncation of VL *includes* the current element at the failure point rather @@ -1680,25 +1742,23 @@ than the normal 0..VL-1 ## Data-dependent fail-first on CR operations -The principle of data-dependent fail-first is that if, during -the course of sequentially evaluating an element's Condition Test, -one such test is encountered which fails, -then VL (Vector Length) is truncated (set) at that point. In the case -of Arithmetic SVP64 Operations the Condition Register Field generated from -Rc=1 is used as the basis for the truncation decision. -However with CR-based operations that CR Field result to be -tested is provided -*by the operation itself*. - -Data-dependent SVP64 Vectorised Operations involving the creation or -modification of a CR can require an extra two bits, which are not available -in the compact space of the SVP64 RM `MODE` Field. With the concept of element -width overrides being meaningless for CR Fields it is possible to use the -`ELWIDTH` field for alternative purposes. - -Condition Register based operations such as `sv.mfcr` and `sv.crand` can thus -be made more flexible. However the rules that apply in this section -also apply to future CR-based instructions. +The principle of data-dependent fail-first is that if, during the course +of sequentially evaluating an element's Condition Test, one such test +is encountered which fails, then VL (Vector Length) is truncated (set) +at that point. In the case of Arithmetic SVP64 Operations the Condition +Register Field generated from Rc=1 is used as the basis for the truncation +decision. 
However with CR-based operations that CR Field result to be +tested is provided *by the operation itself*. + +Data-dependent SVP64 Vectorised Operations involving the creation +or modification of a CR can require an extra two bits, which are not +available in the compact space of the SVP64 RM `MODE` Field. With the +concept of element width overrides being meaningless for CR Fields it +is possible to use the `ELWIDTH` field for alternative purposes. + +Condition Register based operations such as `sv.mfcr` and `sv.crand` +can thus be made more flexible. However the rules that apply in this +section also apply to future CR-based instructions. There are two primary different types of CR operations: @@ -1706,12 +1766,11 @@ There are two primary different types of CR operations: * Those which have a 5-bit operand (referring to a bit within the whole 32-bit CR) -Examining these two types it is observed that the -difference may be considered to be that the 5-bit variant -*already* provides the -prerequisite information about which CR Field bit (EQ, GE, LT, SO) is to -be operated on by the instruction. -Thus, logically, we may set the following rule: +Examining these two types it is observed that the difference may +be considered to be that the 5-bit variant *already* provides the +prerequisite information about which CR Field bit (EQ, GE, LT, SO) is +to be operated on by the instruction. Thus, logically, we may set the +following rule: * When a 5-bit CR Result field is used in an instruction, the 5-bit variant of Data-Dependent Fail-First @@ -1722,12 +1781,11 @@ Thus, logically, we may set the following rule: in order to select which CR Field bit of the result shall be tested (EQ, LE, GE, SO) -The reason why the 3-bit CR variant needs the additional CR-bit -field should be obvious from the fact that the 3-bit CR Field -from the base Power ISA v3.0B operation clearly does not contain -and is missing the two CR Field Selector bits. 
Thus, these two -bits (to select EQ, LE, GE or SO) must be provided in another -way. +The reason why the 3-bit CR variant needs the additional CR-bit field +should be obvious from the fact that the 3-bit CR Field from the base +Power ISA v3.0B operation clearly does not contain and is missing the +two CR Field Selector bits. Thus, these two bits (to select EQ, LE, +GE or SO) must be provided in another way. Examples of the former type: @@ -1749,29 +1807,27 @@ is *required*. ## Reduction and Iteration Bearing in mind as described in the svp64 Appendix, SVP64 Horizontal -Reduction is a deterministic schedule on top of base Scalar v3.0 operations, -the same rules apply to CR Operations, i.e. that programmers must -follow certain conventions in order for an *end result* of a -reduction to be achieved. Unlike -other Vector ISAs *there are no explicit reduction opcodes* -in SVP64: Schedules however achieve the same effect. +Reduction is a deterministic schedule on top of base Scalar v3.0 +operations, the same rules apply to CR Operations, i.e. that programmers +must follow certain conventions in order for an *end result* of a +reduction to be achieved. Unlike other Vector ISAs *there are no explicit +reduction opcodes* in SVP64: Schedules however achieve the same effect. Due to these conventions only reduction on operations such as `crand` and `cror` are meaningful because these have Condition Register Fields -as both input and output. -Meaningless operations are not prohibited because the cost in hardware -of doing so is prohibitive, but neither are they `UNDEFINED`. Implementations -are still required to execute them but are at liberty to optimise out -any operations that would ultimately be overwritten, as long as Strict -Program Order is still obvservable by the programmer. +as both input and output. Meaningless operations are not prohibited +because the cost in hardware of doing so is prohibitive, but neither +are they `UNDEFINED`. 
Implementations are still required to execute them +but are at liberty to optimise out any operations that would ultimately +be overwritten, as long as Strict Program Order is still obvservable by +the programmer. Also bear in mind that 'Reverse Gear' may be enabled, which can be -used in combination with overlapping CR operations to iteratively accumulate -results. Issuing a `sv.crand` operation for example with `BA` -differing from `BB` by one Condition Register Field would -result in a cascade effect, where the first-encountered CR Field -would set the result to zero, and also all subsequent CR Field -elements thereafter: +used in combination with overlapping CR operations to iteratively +accumulate results. Issuing a `sv.crand` operation for example with +`BA` differing from `BB` by one Condition Register Field would result +in a cascade effect, where the first-encountered CR Field would set the +result to zero, and also all subsequent CR Field elements thereafter: ``` # sv.crand/mr/rg CR4.ge.v, CR5.ge.v, CR4.ge.v @@ -1779,13 +1835,13 @@ elements thereafter: CR.field[4+i].ge &= CR.field[5+i].ge ``` -`sv.crxor` with reduction would be particularly useful for parity calculation -for example, although there are many ways in which the same calculation -could be carried out after transferring a vector of CR Fields to a GPR -using crweird operations. +`sv.crxor` with reduction would be particularly useful for parity +calculation for example, although there are many ways in which the same +calculation could be carried out after transferring a vector of CR Fields +to a GPR using crweird operations. -Implementations are free and clear to optimise these reductions in any -way they see fit, as long as the end-result is compatible with Strict Program +Implementations are free and clear to optimise these reductions in any way +they see fit, as long as the end-result is compatible with Strict Program Order being observed, and Interrupt latency is not adversely impacted. 
## Unusual and quirky CR operations @@ -1822,67 +1878,59 @@ expected to be Micro-coded by most Hardware implementations. # SVP64 Branch Conditional behaviour Please note: although similar, SVP64 Branch instructions should be -considered completely separate and distinct from -standard scalar OpenPOWER-approved v3.0B branches. -**v3.0B branches are in no way impacted, altered, -changed or modified in any way, shape or form by -the SVP64 Vectorised Variants**. - -It is also -extremely important to note that Branches are the -sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`. -SVP64 Branches contain additional modes that are useful -for scalar operations (i.e. even when VL=1 or when -using single-bit predication). +considered completely separate and distinct from standard scalar +OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way +impacted, altered, changed or modified in any way, shape or form by the +SVP64 Vectorised Variants**. + +It is also extremely important to note that Branches are the sole +pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches +contain additional modes that are useful for scalar operations (i.e. even +when VL=1 or when using single-bit predication). **Rationale** -Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a -Condition Register. However for parallel processing it is simply impossible -to perform multiple independent branches: the Program Counter simply -cannot branch to multiple destinations based on multiple conditions. -The best that can be done is -to test multiple Conditions and make a decision of a *single* branch, -based on analysis of a *Vector* of CR Fields -which have just been calculated from a *Vector* of results. - -In 3D Shader -binaries, which are inherently parallelised and predicated, testing all or -some results and branching based on multiple tests is extremely common, -and a fundamental part of Shader Compilers. 
Example: -without such multi-condition -test-and-branch, if a predicate mask is all zeros a large batch of -instructions may be masked out to `nop`, and it would waste -CPU cycles to run them. 3D GPU ISAs can test for this scenario -and, with the appropriate predicate-analysis instruction, -jump over fully-masked-out operations, by spotting that -*all* Conditions are false. +Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test +a Condition Register. However for parallel processing it is simply +impossible to perform multiple independent branches: the Program +Counter simply cannot branch to multiple destinations based on multiple +conditions. The best that can be done is to test multiple Conditions +and make a decision of a *single* branch, based on analysis of a *Vector* +of CR Fields which have just been calculated from a *Vector* of results. + +In 3D Shader binaries, which are inherently parallelised and predicated, +testing all or some results and branching based on multiple tests is +extremely common, and a fundamental part of Shader Compilers. Example: +without such multi-condition test-and-branch, if a predicate mask is +all zeros a large batch of instructions may be masked out to `nop`, +and it would waste CPU cycles to run them. 3D GPU ISAs can test for +this scenario and, with the appropriate predicate-analysis instruction, +jump over fully-masked-out operations, by spotting that *all* Conditions +are false. Unless Branches are aware and capable of such analysis, additional instructions would be required which perform Horizontal Cumulative -analysis of Vectorised Condition Register Fields, in order to -reduce the Vector of CR Fields down to one single yes or no -decision that a Scalar-only v3.0B Branch-Conditional could cope with. -Such instructions would be unavoidable, required, and costly -by comparison to a single Vector-aware Branch. 
-Therefore, in order to be commercially competitive, `sv.bc` and -other Vector-aware Branch Conditional instructions are a high priority -for 3D GPU (and OpenCL-style) workloads. +analysis of Vectorised Condition Register Fields, in order to reduce +the Vector of CR Fields down to one single yes or no decision that a +Scalar-only v3.0B Branch-Conditional could cope with. Such instructions +would be unavoidable, required, and costly by comparison to a single +Vector-aware Branch. Therefore, in order to be commercially competitive, +`sv.bc` and other Vector-aware Branch Conditional instructions are a +high priority for 3D GPU (and OpenCL-style) workloads. Given that Power ISA v3.0B is already quite powerful, particularly -the Condition Registers and their interaction with Branches, there -are opportunities to create extremely flexible and compact -Vectorised Branch behaviour. In addition, the side-effects (updating -of CTR, truncation of VL, described below) make it a useful instruction -even if the branch points to the next instruction (no actual branch). +the Condition Registers and their interaction with Branches, there are +opportunities to create extremely flexible and compact Vectorised Branch +behaviour. In addition, the side-effects (updating of CTR, truncation +of VL, described below) make it a useful instruction even if the branch +points to the next instruction (no actual branch). ## Overview When considering an "array" of branch-tests, there are four -primarily-useful modes: -AND, OR, NAND and NOR of all Conditions. -NAND and NOR may be synthesised from AND and OR by -inverting `BO[1]` which just leaves two modes: +primarily-useful modes: AND, OR, NAND and NOR of all Conditions. +NAND and NOR may be synthesised from AND and OR by inverting `BO[1]` +which just leaves two modes: * Branch takes place on the **first** CR Field test to succeed (a Great Big OR of all condition tests). 
Exit occurs @@ -1895,31 +1943,27 @@ Early-exit is enacted such that the Vectorised Branch does not perform needless extra tests, which will help reduce reads on the Condition Register file. -*Note: Early-exit is **MANDATORY** (required) behaviour. -Branches **MUST** exit at the first sequentially-encountered -failure point, for -exactly the same reasons for which it is mandatory in -programming languages doing early-exit: to avoid -damaging side-effects and to provide deterministic -behaviour. Speculative testing of Condition -Register Fields is permitted, as is speculative calculation -of CTR, as long as, as usual in any Out-of-Order microarchitecture, -that speculative testing is cancelled should an early-exit occur. -i.e. the speculation must be "precise": Program Order must be preserved* - -Also note that when early-exit occurs in Horizontal-first Mode, -srcstep, dststep etc. are all reset, ready to begin looping from the -beginning for the next instruction. However for Vertical-first -Mode srcstep etc. are incremented "as usual" i.e. an early-exit -has no special impact, regardless of whether the branch -occurred or not. This can leave srcstep etc. in what may be -considered an unusual -state on exit from a loop and it is up to the programmer to -reset srcstep, dststep etc. to known-good values -*(easily achieved with `setvl`)*. - -Additional useful behaviour involves two primary Modes (both of -which may be enabled and combined): +*Note: Early-exit is **MANDATORY** (required) behaviour. Branches +**MUST** exit at the first sequentially-encountered failure point, +for exactly the same reasons for which it is mandatory in programming +languages doing early-exit: to avoid damaging side-effects and to provide +deterministic behaviour. Speculative testing of Condition Register +Fields is permitted, as is speculative calculation of CTR, as long as, +as usual in any Out-of-Order microarchitecture, that speculative testing +is cancelled should an early-exit occur. 
i.e. the speculation must be +"precise": Program Order must be preserved* + +Also note that when early-exit occurs in Horizontal-first Mode, srcstep, +dststep etc. are all reset, ready to begin looping from the beginning +for the next instruction. However for Vertical-first Mode srcstep +etc. are incremented "as usual" i.e. an early-exit has no special impact, +regardless of whether the branch occurred or not. This can leave srcstep +etc. in what may be considered an unusual state on exit from a loop and +it is up to the programmer to reset srcstep, dststep etc. to known-good +values *(easily achieved with `setvl`)*. + +Additional useful behaviour involves two primary Modes (both of which +may be enabled and combined): * **VLSET Mode**: identical to Data-Dependent Fail-First Mode for Arithmetic SVP64 operations, with more @@ -1930,50 +1974,43 @@ which may be enabled and combined): CTR is decremented, including options to decrement if a Condition test succeeds *or if it fails*. -With these side-effects, basic Boolean Logic Analysis advises that -it is important to provide a means -to enact them each based on whether testing succeeds *or fails*. This -results in a not-insignificant number of additional Mode Augmentation bits, -accompanying VLSET and CTR-test Modes respectively. - -Predicate skipping or zeroing may, as usual with SVP64, be controlled -by `sz`. -Where the predicate is masked out and -zeroing is enabled, then in such circumstances -the same Boolean Logic Analysis dictates that -rather than testing only against zero, the option to test -against one is also prudent. This introduces a new -immediate field, `SNZ`, which works in conjunction with -`sz`. - - -Vectorised Branches can be used -in either SVP64 Horizontal-First or Vertical-First Mode. Essentially, -at an element level, the behaviour is identical in both Modes, -although the `ALL` bit is meaningless in Vertical-First Mode. 
- -It is also important -to bear in mind that, fundamentally, Vectorised Branch-Conditional -is still extremely close to the Scalar v3.0B Branch-Conditional -instructions, and that the same v3.0B Scalar Branch-Conditional -instructions are still -*completely separate and independent*, being unaltered and -unaffected by their SVP64 variants in every conceivable way. - -*Programming note: One important point is that SVP64 instructions are 64 bit. -(8 bytes not 4). This needs to be taken into consideration when computing -branch offsets: the offset is relative to the start of the instruction, -which **includes** the SVP64 Prefix* +With these side-effects, basic Boolean Logic Analysis advises that it +is important to provide a means to enact them each based on whether +testing succeeds *or fails*. This results in a not-insignificant number +of additional Mode Augmentation bits, accompanying VLSET and CTR-test +Modes respectively. + +Predicate skipping or zeroing may, as usual with SVP64, be controlled by +`sz`. Where the predicate is masked out and zeroing is enabled, then in +such circumstances the same Boolean Logic Analysis dictates that rather +than testing only against zero, the option to test against one is also +prudent. This introduces a new immediate field, `SNZ`, which works in +conjunction with `sz`. + +Vectorised Branches can be used in either SVP64 Horizontal-First or +Vertical-First Mode. Essentially, at an element level, the behaviour +is identical in both Modes, although the `ALL` bit is meaningless in +Vertical-First Mode. + +It is also important to bear in mind that, fundamentally, Vectorised +Branch-Conditional is still extremely close to the Scalar v3.0B +Branch-Conditional instructions, and that the same v3.0B Scalar +Branch-Conditional instructions are still *completely separate and +independent*, being unaltered and unaffected by their SVP64 variants in +every conceivable way. 
+ +*Programming note: One important point is that SVP64 instructions are +64 bit. (8 bytes not 4). This needs to be taken into consideration +when computing branch offsets: the offset is relative to the start of +the instruction, which **includes** the SVP64 Prefix* ## Format and fields -With element-width overrides being meaningless for Condition -Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional -Mode bits. +With element-width overrides being meaningless for Condition Register +Fields, bits 4 thru 7 of SVP64 RM may be used for additional Mode bits. -SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5, -and `ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch -Conditional: +SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5, and +`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional: | 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description | | - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- | @@ -2023,53 +2060,48 @@ Brief description of fields: in CTR-test Mode. LRu and CTR-test modes are where SVP64 Branches subtly differ from -Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas -`sv.bcl/lru` will only update LR if the branch succeeds. - -Of special interest is that when using ALL Mode (Great Big AND -of all Condition Tests), if `VL=0`, -which is rare but can occur in Data-Dependent Modes, the Branch -will always take place because there will be no failing Condition -Tests to prevent it. Likewise when not using ALL Mode (Great Big OR -of all Condition Tests) and `VL=0` the Branch is guaranteed not -to occur because there will be no *successful* Condition Tests -to make it happen. +Scalar v3.0B Branches. `sv.bcl` for example will always update LR, +whereas `sv.bcl/lru` will only update LR if the branch succeeds. 
+ +Of special interest is that when using ALL Mode (Great Big AND of all +Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent +Modes, the Branch will always take place because there will be no failing +Condition Tests to prevent it. Likewise when not using ALL Mode (Great +Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not +to occur because there will be no *successful* Condition Tests to make +it happen. ## Vectorised CR Field numbering, and Scalar behaviour It is important to keep in mind that just like all SVP64 instructions, -the `BI` field of the base v3.0B Branch Conditional instruction -may be extended by SVP64 EXTRA augmentation, as well as be marked -as either Scalar or Vector. It is also crucially important to keep in mind -that for CRs, SVP64 sequentially increments the CR *Field* numbers. -CR *Fields* are treated as elements, not bit-numbers of the CR *register*. +the `BI` field of the base v3.0B Branch Conditional instruction may be +extended by SVP64 EXTRA augmentation, as well as be marked as either +Scalar or Vector. It is also crucially important to keep in mind that for +CRs, SVP64 sequentially increments the CR *Field* numbers. CR *Fields* +are treated as elements, not bit-numbers of the CR *register*. The `BI` operand of Branch Conditional operations is five bits, in scalar -v3.0B this would select one bit of the 32 bit CR, -comprising eight CR Fields of 4 bits each. In SVP64 there are -16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of -`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits -are extended to either scalar or vector and to select CR Fields 0..127 -as specified in SVP64 [[sv/svp64/appendix]]. +v3.0B this would select one bit of the 32 bit CR, comprising eight CR +Fields of 4 bits each. In SVP64 there are 16 32 bit CRs, containing +128 4-bit CR Fields. 
Therefore, the 2 LSBs of `BI` select the bit from +the CR Field (EQ LT GT SO), and the top 3 bits are extended to either +scalar or vector and to select CR Fields 0..127 as specified in SVP64 +[[sv/svp64/appendix]]. When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar, -then as the usual SVP64 rules apply: -the Vector loop ends at the first element tested -(the first CR *Field*), after taking -predication into consideration. Thus, also as usual, when a predicate mask is -given, and `BI` marked as scalar, and `sz` is zero, srcstep -skips forward to the first non-zero predicated element, and only that -one element is tested. - -In other words, the fact that this is a Branch -Operation (instead of an arithmetic one) does not result, ultimately, -in significant changes as to -how SVP64 is fundamentally applied, except with respect to: +then as the usual SVP64 rules apply: the Vector loop ends at the first +element tested (the first CR *Field*), after taking predication into +consideration. Thus, also as usual, when a predicate mask is given, and +`BI` marked as scalar, and `sz` is zero, srcstep skips forward to the +first non-zero predicated element, and only that one element is tested. + +In other words, the fact that this is a Branch Operation (instead of an +arithmetic one) does not result, ultimately, in significant changes as +to how SVP64 is fundamentally applied, except with respect to: * the unique properties associated with conditionally - changing the Program -Counter (aka "a Branch"), resulting in early-out -opportunities + changing the Program Counter (aka "a Branch"), resulting in early-out + opportunities * CTR-testing Both are outlined below, in later sections. @@ -2083,16 +2115,15 @@ non-ALL mode (Great Big Or) on first success early exit also occurs, however this time with the Branch proceeding. 
In both cases the testing of the Vector of CRs should be done in linear sequential order (or in REMAP re-sequenced order): such that tests that are sequentially beyond -the exit point are *not* carried out. (*Note: it is standard practice in -Programming languages to exit early from conditional tests, however -a little unusual to consider in an ISA that is designed for Parallel -Vector Processing. The reason is to have strictly-defined guaranteed -behaviour*) +the exit point are *not* carried out. (*Note: it is standard practice +in Programming languages to exit early from conditional tests, however a +little unusual to consider in an ISA that is designed for Parallel Vector +Processing. The reason is to have strictly-defined guaranteed behaviour*) In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED` -behaviour. Given that only one element is being tested at a time -in Vertical-First Mode, a test designed to be done on multiple -bits is meaningless. +behaviour. Given that only one element is being tested at a time in +Vertical-First Mode, a test designed to be done on multiple bits is +meaningless. ## Description and Modes @@ -2102,13 +2133,12 @@ other SVP64 operations. When `sz` is zero, any masked-out Branch-element operations are not included in condition testing, exactly like all other SVP64 operations, *including* side-effects such as potentially updating LR or CTR, which will also be skipped. There is *one* exception here, -which is when -`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element -predicate mask bit is also zero: -under these special circumstances CTR will also decrement. +which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element +predicate mask bit is also zero: under these special circumstances CTR +will also decrement. -When `sz` is non-zero, this normally requests insertion of a zero -in place of the input data, when the relevant predicate mask bit is zero. 
+When `sz` is non-zero, this normally requests insertion of a zero in +place of the input data, when the relevant predicate mask bit is zero. This would mean that a zero is inserted in place of `CR[BI+32]` for testing against `BO`, which may not be desirable in all circumstances. Therefore, an extra field is provided `SNZ`, which, if set, will insert @@ -2120,52 +2150,49 @@ controlled by the Predicate mask. This is particularly useful in `VLSET` mode, which will truncate SVSTATE.VL at the point of the first failed test.*) -Normally, CTR mode will decrement once per Condition Test, resulting -under normal circumstances that CTR reduces by up to VL in Horizontal-First -Mode. Just as when v3.0B Branch-Conditional saves at -least one instruction on tight inner loops through auto-decrementation -of CTR, likewise it is also possible to save instruction count for -SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly -in circumstances where there is conditional interaction between the -element computation and testing, and the continuation (or otherwise) -of a given loop. The potential combinations of interactions is why CTR -testing options have been added. +Normally, CTR mode will decrement once per Condition Test, resulting under +normal circumstances that CTR reduces by up to VL in Horizontal-First +Mode. Just as when v3.0B Branch-Conditional saves at least one instruction +on tight inner loops through auto-decrementation of CTR, likewise it +is also possible to save instruction count for SVP64 loops in both +Vertical-First and Horizontal-First Mode, particularly in circumstances +where there is conditional interaction between the element computation +and testing, and the continuation (or otherwise) of a given loop. The +potential combinations of interactions is why CTR testing options have +been added. 
Also, the unconditional bit `BO[0]` is still relevant when Predication is applied to the Branch because in `ALL` mode all nonmasked bits have -to be tested, and when `sz=0` skipping occurs. -Even when VLSET mode is not used, CTR -may still be decremented by the total number of nonmasked elements, -acting in effect as either a popcount or cntlz depending on which -mode bits are set. -In short, Vectorised Branch becomes an extremely powerful tool. - -**Micro-Architectural Implementation Note**: *when implemented on -top of a Multi-Issue Out-of-Order Engine it is possible to pass -a copy of the predicate and the prerequisite CR Fields to all -Branch Units, as well as the current value of CTR at the time of -multi-issue, and for each Branch Unit to compute how many times -CTR would be subtracted, in a fully-deterministic and parallel -fashion. A SIMD-based Branch Unit, receiving and processing -multiple CR Fields covered by multiple predicate bits, would -do the exact same thing. Obviously, however, if CTR is modified -within any given loop (mtctr) the behaviour of CTR is no longer -deterministic.* +to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is +not used, CTR may still be decremented by the total number of nonmasked +elements, acting in effect as either a popcount or cntlz depending +on which mode bits are set. In short, Vectorised Branch becomes an +extremely powerful tool. + +**Micro-Architectural Implementation Note**: *when implemented on top +of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of +the predicate and the prerequisite CR Fields to all Branch Units, as +well as the current value of CTR at the time of multi-issue, and for +each Branch Unit to compute how many times CTR would be subtracted, +in a fully-deterministic and parallel fashion. A SIMD-based Branch +Unit, receiving and processing multiple CR Fields covered by multiple +predicate bits, would do the exact same thing. 
Obviously, however, if +CTR is modified within any given loop (mtctr) the behaviour of CTR is +no longer deterministic.* ### Link Register Update -For a Scalar Branch, unconditional updating of the Link Register -LR is useful and practical. However, if a loop of CR Fields is -tested, unconditional updating of LR becomes problematic. +For a Scalar Branch, unconditional updating of the Link Register LR +is useful and practical. However, if a loop of CR Fields is tested, +unconditional updating of LR becomes problematic. For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode, LR's value will be unconditionally overwritten after the first element, -such that for execution (testing) of the second element, LR -has the value `CIA+8`. This is covered in the `bclrl` example, in -a later section. +such that for execution (testing) of the second element, LR has the value +`CIA+8`. This is covered in the `bclrl` example, in a later section. -The addition of a LRu bit modifies behaviour in conjunction -with LK, as follows: +The addition of a LRu bit modifies behaviour in conjunction with LK, +as follows: * `sv.bc` When LRu=0,LK=0, Link Register is not updated * `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally @@ -2174,27 +2201,25 @@ with LK, as follows: * `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if the Branch Condition succeeds. -This avoids -destruction of LR during loops (particularly Vertical-First +This avoids destruction of LR during loops (particularly Vertical-First ones). **SVLR and SVSTATE** For precisely the reasons why `LK=1` was added originally to the Power -ISA, with SVSTATE being a peer of the Program Counter it becomes -necessary to also add an SVLR (SVSTATE Link Register) -and corresponding control bits `SL` and `SLu`. +ISA, with SVSTATE being a peer of the Program Counter it becomes necessary +to also add an SVLR (SVSTATE Link Register) and corresponding control bits +`SL` and `SLu`. 
### CTR-test -Where a standard Scalar v3.0B branch unconditionally decrements -CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility -which allows CTR to be used for many more types of Vector loops -constructs. +Where a standard Scalar v3.0B branch unconditionally decrements CTR when +`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows +CTR to be used for many more types of Vector loops constructs. -CTR-test mode and CTi interaction is as follows: note that -`BO[2]` is still required to be clear for CTR decrements to be -considered, exactly as is the case in Scalar Power ISA v3.0B +CTR-test mode and CTi interaction is as follows: note that `BO[2]` +is still required to be clear for CTR decrements to be considered, +exactly as is the case in Scalar Power ISA v3.0B * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis if `BO[2]` is zero. Masked-out elements when `sz=0` are @@ -2225,47 +2250,45 @@ rest of the Branch operation is skipped. ### VLSET Mode VLSET Mode truncates the Vector Length so that subsequent instructions -operate on a reduced Vector Length. This is similar to -Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the -truncation occurs at the Branch decision-point. +operate on a reduced Vector Length. This is similar to Data-dependent +Fail-First and LD/ST Fail-First, where for VLSET the truncation occurs +at the Branch decision-point. -Interestingly, due to the side-effects of `VLSET` mode -it is actually useful to use Branch Conditional even -to perform no actual branch operation, i.e to point to the instruction -after the branch. Truncation of VL would thus conditionally occur yet control -flow alteration would not. +Interestingly, due to the side-effects of `VLSET` mode it is actually +useful to use Branch Conditional even to perform no actual branch +operation, i.e to point to the instruction after the branch. 
Truncation of +VL would thus conditionally occur yet control flow alteration would not. `VLSET` mode with Vertical-First is particularly unusual. Vertical-First is designed to be used for explicit looping, where an explicit call to -`svstep` is required to move both srcstep and dststep on to -the next element, until VL (or other condition) is reached. -Vertical-First Looping is expected (required) to terminate if the end -of the Vector, VL, is reached. If however that loop is terminated early -because VL is truncated, VLSET with Vertical-First becomes meaningless. -Resolving this would require two branches: one Conditional, the other -branching unconditionally to create the loop, where the Conditional -one jumps over it. - -Therefore, with `VSb`, the option to decide whether truncation should occur if the -branch succeeds *or* if the branch condition fails allows for the flexibility -required. This allows a Vertical-First Branch to *either* be used as -a branch-back (loop) *or* as part of a conditional exit or function -call from *inside* a loop, and for VLSET to be integrated into both -types of decision-making. - -In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes -place if success conditions are met, but on exit from that loop -(branch condition fails), VL will be truncated. This is extremely +`svstep` is required to move both srcstep and dststep on to the next +element, until VL (or other condition) is reached. Vertical-First Looping +is expected (required) to terminate if the end of the Vector, VL, is +reached. If however that loop is terminated early because VL is truncated, +VLSET with Vertical-First becomes meaningless. Resolving this would +require two branches: one Conditional, the other branching unconditionally +to create the loop, where the Conditional one jumps over it. 
+ +Therefore, with `VSb`, the option to decide whether truncation should +occur if the branch succeeds *or* if the branch condition fails allows +for the flexibility required. This allows a Vertical-First Branch to +*either* be used as a branch-back (loop) *or* as part of a conditional +exit or function call from *inside* a loop, and for VLSET to be integrated +into both types of decision-making. + +In the case of a Vertical-First branch-back (loop), with `VSb=0` the +branch takes place if success conditions are met, but on exit from that +loop (branch condition fails), VL will be truncated. This is extremely useful. -`VLSET` mode with Horizontal-First when `VSb=0` is still -useful, because it can be used to truncate VL to the first predicated -(non-masked-out) element. +`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because +it can be used to truncate VL to the first predicated (non-masked-out) +element. The truncation point for VL, when VLi is clear, must not include skipped -elements that preceded the current element being tested. -Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition -Register failure point is at CR Field element 4. +elements that preceded the current element being tested. Example: +`sz=0, VLi=0, predicate mask = 0b110010` and the Condition Register +failure point is at CR Field element 4. * Testing at element 0 is skipped because its predicate bit is zero * Testing at element 1 passed @@ -2283,9 +2306,9 @@ of the element actually being tested. ### VLSET and CTR-test combined -If both CTR-test and VLSET Modes are requested, it's important to -observe the correct order. What occurs depends on whether VLi -is enabled, because VLi affects the length, VL. +If both CTR-test and VLSET Modes are requested, it is important to +observe the correct order. What occurs depends on whether VLi is enabled, +because VLi affects the length, VL. 
If VLi (VL truncate inclusive) is set: @@ -2308,17 +2331,15 @@ should **not** be considered part of the Vector. Consequently: ## Boolean Logic combinations -In a Scalar ISA, Branch-Conditional testing even of vector -results may be performed through inversion of tests. NOR of -all tests may be performed by inversion of the scalar condition -and branching *out* from the scalar loop around elements, -using scalar operations. +In a Scalar ISA, Branch-Conditional testing even of vector results may be +performed through inversion of tests. NOR of all tests may be performed +by inversion of the scalar condition and branching *out* from the scalar +loop around elements, using scalar operations. In a parallel (Vector) ISA it is the ISA itself which must perform -the prerequisite logic manipulation. -Thus for SVP64 there are an extraordinary number of nesessary combinations -which provide completely different and useful behaviour. -Available options to combine: +the prerequisite logic manipulation. Thus for SVP64 there are an +extraordinary number of nesessary combinations which provide completely +different and useful behaviour. Available options to combine: * `BO[0]` to make an unconditional branch would seem irrelevant if it were not for predication and for side-effects (CTR Mode @@ -2343,13 +2364,13 @@ Available options to combine: need explicit instructions. The most obviously useful combinations here are to set `BO[1]` to zero -in order to turn `ALL` into Great-Big-NAND and `ANY` into -Great-Big-NOR. Other Mode bits which perform behavioural inversion then -have to work round the fact that the Condition Testing is NOR or NAND. -The alternative to not having additional behavioural inversion -(`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional) -branch directly after the first, which the first branch jumps over. -This contrivance is avoided by the behavioural inversion bits. +in order to turn `ALL` into Great-Big-NAND and `ANY` into Great-Big-NOR. 
+Other Mode bits which perform behavioural inversion then have to work +round the fact that the Condition Testing is NOR or NAND. The alternative +to not having additional behavioural inversion (`SNZ`, `VSb`, `CTi`) +would be to have a second (unconditional) branch directly after the first, +which the first branch jumps over. This contrivance is avoided by the +behavioural inversion bits. ## Pseudocode and examples @@ -2372,15 +2393,14 @@ For comparative purposes this is a copy of the v3.0B `bc` pseudocode Simplified pseudocode including LRu and CTR skipping, which illustrates clearly that SVP64 Scalar Branches (VL=1) are **not** identical to -v3.0B Scalar Branches. The key areas where differences occur are -the inclusion of predication (which can still be used when VL=1), in -when and why CTR is decremented (CTRtest Mode) and whether LR is -updated (which is unconditional in v3.0B when LK=1, and conditional -in SVP64 when LRu=1). +v3.0B Scalar Branches. The key areas where differences occur are the +inclusion of predication (which can still be used when VL=1), in when and +why CTR is decremented (CTRtest Mode) and whether LR is updated (which +is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1). -Inline comments highlight the fact that the Scalar Branch behaviour -and pseudocode is still clearly visible and embedded within the -Vectorised variant: +Inline comments highlight the fact that the Scalar Branch behaviour and +pseudocode is still clearly visible and embedded within the Vectorised +variant: ``` if (mode_is_64bit) then M <- 0 @@ -2419,8 +2439,8 @@ Vectorised variant: ``` Below is the pseudocode for SVP64 Branches, which is a little less -obvious but identical to the above. The lack of obviousness is down -to the early-exit opportunities. +obvious but identical to the above. The lack of obviousness is down to +the early-exit opportunities. 
Effective pseudocode for Horizontal-First Mode: @@ -2586,11 +2606,9 @@ c code: } ``` -Under these circumstances exiting from the loop is not only -based on CTR it has become conditional on a CR result. -Thus it is desirable that NIA *and* LR only be modified -if the conditions are met - +Under these circumstances exiting from the loop is not only based on +CTR it has become conditional on a CR result. Thus it is desirable that +NIA *and* LR only be modified if the conditions are met v3.0 pseudocode for `bclrl`: @@ -2620,18 +2638,16 @@ the latter part for SVP64 `bclrl` becomes: ``` The reason why should be clear from this being a Vector loop: -unconditional destruction of LR when LK=1 makes `sv.bclrl` -ineffective, because the intention going into the loop is -that the branch should be to the copy of LR set at the *start* -of the loop, not half way through it. -However if the change to LR only occurs if -the branch is taken then it becomes a useful instruction. - -The following pseudocode should **not** be implemented because -it violates the fundamental principle of SVP64 which is that -SVP64 looping is a thin wrapper around Scalar Instructions. -The pseducode below is more an actual Vector ISA Branch and -as such is not at all appropriate: +unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective, +because the intention going into the loop is that the branch should be to +the copy of LR set at the *start* of the loop, not half way through it. +However if the change to LR only occurs if the branch is taken then it +becomes a useful instruction. + +The following pseudocode should **not** be implemented because it +violates the fundamental principle of SVP64 which is that SVP64 looping +is a thin wrapper around Scalar Instructions. The pseducode below is +more an actual Vector ISA Branch and as such is not at all appropriate: ``` for i in 0 to VL-1: -- 2.30.2
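Illustrative note on the `ALL` mode and `VL=0` text touched by this patch: the following is a minimal Python sketch, not the specification's pseudocode, of how the "Great Big AND" (`ALL`) and "Great Big OR" (non-`ALL`) reductions behave in Horizontal-First Mode, including predicate skipping (`sz=0`), zeroing (`sz=1`) and the `SNZ` substitution. The function name `sv_branch_taken`, its parameters, and the reduction of the `BO` test to a single expected bit (`want`, standing in for `BO[1]`) are assumptions made purely for illustration. CTR, LR and VLSET side-effects are deliberately left out.

```python
def sv_branch_taken(cr_bits, mask, vl, *, all_mode, sz=False, snz=False, want=1):
    """Simplified model: return True if the Vectorised Branch would be taken.

    cr_bits  : list of the selected CR Field bit (0 or 1), one per element
    mask     : predicate mask, LSB0 element numbering (bit i covers element i)
    vl       : Vector Length
    all_mode : True for ALL (Great Big AND), False for ANY (Great Big OR)
    sz, snz  : zeroing enable, and "substitute one instead of zero"
    want     : hypothetical stand-in for BO[1], the bit value that counts as success
    """
    for i in range(vl):
        if not (mask >> i) & 1:
            if not sz:
                continue               # sz=0: masked-out element is skipped entirely
            bit = 1 if snz else 0      # sz=1: substitute 0, or 1 when SNZ is set
        else:
            bit = cr_bits[i]
        ok = (bit == want)
        if all_mode and not ok:
            return False               # ALL: first failure exits early, no branch
        if not all_mode and ok:
            return True                # ANY: first success exits early, branch taken
    # No early exit occurred: an AND over the tested elements succeeded,
    # or an OR over them found no success. With VL=0 nothing is tested at all.
    return all_mode


if __name__ == "__main__":
    cr = [0, 1, 1, 1, 0, 1]
    print(sv_branch_taken(cr, 0b110010, 6, all_mode=True))   # element 4 fails
    print(sv_branch_taken(cr, 0b110010, 6, all_mode=False))  # element 1 succeeds
    print(sv_branch_taken([], 0, 0, all_mode=True))          # VL=0, ALL
    print(sv_branch_taken([], 0, 0, all_mode=False))         # VL=0, ANY
```

Under these assumptions, with `VL=0` the `ALL` case reports the branch as taken and the non-`ALL` case reports it as not taken, matching the prose in the patch. The predicate mask `0b110010` reuses the value from the VLSET truncation example, where element 0 is skipped and element 4 is the failure point.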