From 8d7e609f208445022a5a7fc3d144e287a15195a5 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Mon, 3 Apr 2023 10:57:00 +0100
Subject: [PATCH] remove svp64.mdwn duplicated information from ls010.mdwn

---
 openpower/sv/rfc/Makefile   |   12 +-
 openpower/sv/rfc/ls010.mdwn | 1027 -----------------------------------
 2 files changed, 8 insertions(+), 1031 deletions(-)

diff --git a/openpower/sv/rfc/Makefile b/openpower/sv/rfc/Makefile
index 27df9704f..bb3815458 100644
--- a/openpower/sv/rfc/Makefile
+++ b/openpower/sv/rfc/Makefile
@@ -1,7 +1,8 @@
 all: ls001.pdf ls002.pdf ls003.pdf ls004.pdf ls005.pdf ls006.pdf ls007.pdf
 
-ls010.pdf: ls010.mdwn ../ldst.mdwn ../branches.mdwn
-	pandoc \
+ls010.pdf: ../svp64.mdwn ls010.mdwn ../ldst.mdwn ../branches.mdwn
+	cd ../.. && pandoc \
+		--filter pandoc_img.py \
 		-V margin-top=0.9in \
 		-V margin-bottom=0.9in \
 		-V margin-left=0.4in \
@@ -9,10 +10,13 @@ ls010.pdf: ls010.mdwn ../ldst.mdwn ../branches.mdwn
 		-V fontsize=9pt \
 		-V papersize=legal \
 		-V linkcolor=blue \
-		-f markdown ls010.mdwn ../ldst.mdwn ../branches.mdwn \
+		-f markdown sv/svp64.mdwn \
+		sv/rfc/ls010.mdwn \
+		sv/ldst.mdwn \
+		sv/branches.mdwn \
 		-s --self-contained \
 		--mathjax \
-		-o ls010.pdf
+		-o sv/rfc/ls010.pdf
 
 %.pdf: %.mdwn
 	pandoc \
diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn
index 706b83c64..a0f70141f 100644
--- a/openpower/sv/rfc/ls010.mdwn
+++ b/openpower/sv/rfc/ls010.mdwn
@@ -1,1030 +1,3 @@
-# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
-
-Credits and acknowledgements:
-
-* Luke Leighton
-* Jacob Lifshay
-* Hendrik Boom
-* Richard Wilbur
-* Alexandre Oliva
-* Cesar Strauss
-* NLnet Foundation, for funding
-* OpenPOWER Foundation
-* Paul Mackerras
-* Toshaan Bharvani
-* IBM for the Power ISA itself
-
-Links:
-
-*
-* [[sv/branches]] chapter
-* [[sv/ldst]] chapter
-
-# Introduction
-
-Simple-V is a type of Vectorisation best described as a "Prefix Loop
-Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and
-to the 8086 `REP` Prefix instruction. More advanced features are similar -to the Z80 `CPIR` instruction. If viewed one-dimensionally as an actual -Vector ISA it introduces over 1.5 million 64-bit Vector instructions. -SVP64, the instruction format used by Simple-V, is therefore best viewed -as an orthogonal RISC-paradigm "Prefixing" subsystem instead. - -Except where explicitly stated all bit numbers remain as in the rest of -the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on -the left and counting up as you move rightwards to the LSB end). All bit -ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order). -**All register numbering and element numbering however is LSB0 ordering** -which is a different convention from that used elsewhere in the Power ISA. - -The SVP64 prefix always comes before the suffix in PC order and must be -considered an independent "Defined word" that augments the behaviour of -the following instruction, but does **not** change the actual Decoding -of that following instruction. **All prefixed instructions retain their -non-prefixed encoding and definition**. - -Two apparent exceptions to the above hard rule exist: SV Branch-Conditional -operations and LD/ST-update "Post-Increment" Mode. Post-Increment -was considered sufficiently high priority (significantly reducing hot-loop -instruction count) that one bit in the Prefix is reserved for it. -Vectorised Branch-Conditional operations "embed" the original Scalar -Branch-Conditional behaviour into a much more advanced variant that -is highly suited to High-Performance Computation (HPC), Supercomputing, -and parallel GPU Workloads. - -*Architectural Resource Allocation note: it is prohibited to accept RFCs -which fundamentally violate this hard requirement. Under no circumstances -must the Suffix space have an alternate instruction encoding allocated -within SVP64 that is entirely different from the non-prefixed Defined -Word. 
Hardware Implementors critically rely on this inviolate guarantee -to implement High-Performance Multi-Issue micro-architectures that can -sustain 100% throughput* - -Subset implementations in hardware are permitted, as long as certain -rules are followed, allowing for full soft-emulation including future -revisions. Compliancy Subsets exist to ensure minimum levels of binary -interoperability expectations within certain environments. - -## SVP64 encoding features - -A number of features need to be compacted into a very small space of -only 24 bits: - -* Independent per-register Scalar/Vector tagging and range extension on - every register -* Element width overrides on both source and destination -* Predication on both source and destination -* Two different sources of predication: INT and CR Fields -* SV Modes including saturation (for Audio, Video and DSP), mapreduce, - fail-first and predicate-result mode. - -Different classes of operations require different formats. The earlier -sections cover the common formats and the four separate modes follow: -CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store -and Branch-Conditional. - -## Definition of Reserved in this spec. - -For the new fields added in SVP64, instructions that have any of their -fields set to a reserved value must cause an illegal instruction trap, -to allow emulation of future instruction sets, or for subsets of SVP64 to -be implemented in hardware and the rest emulated. This includes SVP64 -SPRs: reading or writing values which are not supported in hardware -must also raise illegal instruction traps in order to allow emulation. -Unless otherwise stated, reserved values are always all zeros. - -This is unlike OpenPower ISA v3.1, which in many instances does not -require a trap if reserved fields are nonzero. Where the standard Power -ISA definition is intended the red keyword `RESERVED` is used. 
- -## Definition of "UnVectoriseable" - -Any operation that inherently makes no sense if repeated is termed -"UnVectoriseable" or "UnVectorised". Examples include `sc` or `sync` -which have no registers. `mtmsr` is also classed as UnVectoriseable -because there is only one `MSR`. - -## Register files, elements, and Element-width Overrides - -In the Upper Compliancy Levels of SVP64 the size of the GPR and FPR -Register files are expanded from 32 to 128 entries, and the number of -CR Fields expanded from CR0-CR7 to CR0-CR127. (Note: A future version -of SVP64 is anticipated to extend the VSR register file). - -Memory access remains exactly the same: the effects of `MSR.LE` remain -exactly the same, affecting as they already do and remain **only** -on the Load and Store memory-register operation byte-order, and having -nothing to do with the ordering of the contents of register files or -register-register operations. - -To be absolutely clear: - -``` - There are no conceptual arithmetic ordering or other changes over the - Scalar Power ISA definitions to registers or register files or to - arithmetic or Logical Operations beyond element-width subdivision -``` - -Element offset -numbering is naturally **LSB0-sequentially-incrementing from zero, not -MSB0-incrementing** including when element-width overrides are used, -at which point the elements progress through each register -sequentially from the LSB end -(confusingly numbered the highest in MSB0 ordering) and progress -incrementally to the MSB end (confusingly numbered the lowest in -MSB0 ordering). - -When exclusively using MSB0-numbering, SVP64 -becomes unnecessarily complex to both express and subsequently understand: -the required conditional subtractions from 63, -31, 15 and 7 needed to express the fact that elements are LSB0-sequential -unfortunately become a hostile minefield, obscuring both -intent and meaning. 
Therefore for the
-purposes of this section the more natural **LSB0 numbering is assumed**
-and it is left to the reader to translate to MSB0 numbering.
-
-The Canonical specification for how element-sequential numbering and
-element-width overrides are defined is expressed in the following c
-structure, assuming a Little-Endian system, and naturally using LSB0
-numbering everywhere because the ANSI c specification is inherently LSB0.
-Note the deliberate similarity to how VSX register elements are defined:
-
-```
-    #pragma pack
-    typedef union {
-        uint8_t  bytes[];  // elwidth 8
-        uint16_t hwords[]; // elwidth 16
-        uint32_t words[];  // elwidth 32
-        uint64_t dwords[]; // elwidth 64
-        uint8_t  actual_bytes[8];
-    } el_reg_t;
-
-    el_reg_t int_regfile[128];
-
-    // copy one element of the register file into element 0 of el
-    void get_register_element(el_reg_t* el, int gpr, int element, int width) {
-        switch (width) {
-            case 64: el->dwords[0] = int_regfile[gpr].dwords[element]; break;
-            case 32: el->words[0] = int_regfile[gpr].words[element]; break;
-            case 16: el->hwords[0] = int_regfile[gpr].hwords[element]; break;
-            case 8 : el->bytes[0] = int_regfile[gpr].bytes[element]; break;
-        }
-    }
-    // copy element 0 of el into one element of the register file
-    void set_register_element(el_reg_t* el, int gpr, int element, int width) {
-        switch (width) {
-            case 64: int_regfile[gpr].dwords[element] = el->dwords[0]; break;
-            case 32: int_regfile[gpr].words[element] = el->words[0]; break;
-            case 16: int_regfile[gpr].hwords[element] = el->hwords[0]; break;
-            case 8 : int_regfile[gpr].bytes[element] = el->bytes[0]; break;
-        }
-    }
-```
-
-Example Vector-looped add operation implementation when elwidths are 64-bit:
-
-```
-    # vector-add RT, RA,RB using the "uint64_t" union member, "dwords"
-    for i in range(VL):
-        int_regfile[RT].dwords[i] = int_regfile[RA].dwords[i] + int_regfile[RB].dwords[i]
-```
-
-However if elwidth overrides are set to 16 for both source and destination:
-
-```
-    # vector-add RT, RA, RB using the "uint16_t" union member "hwords"
-    for i in range(VL):
-        int_regfile[RT].hwords[i] = int_regfile[RA].hwords[i] + int_regfile[RB].hwords[i]
-```
-
-Hardware
Architectural note: to avoid a Read-Modify-Write at the register -file it is strongly recommended to implement byte-level write-enable lines -exactly as has been implemented in DRAM ICs for many decades. Additionally -the predicate mask bit is advised to be associated with the element -operation and alongside the result ultimately passed to the register file. -When element-width is set to 64-bit the relevant predicate mask bit -may be repeated eight times and pull all eight write-port byte-level -lines HIGH. Clearly when element-width is set to 8-bit the relevant -predicate mask bit corresponds directly with one single byte-level -write-enable line. It is up to the Hardware Architect to then amortise -(merge) elements together into both PredicatedSIMD Pipelines as well -as simultaneous non-overlapping Register File writes, to achieve High -Performance designs. - -**Comparative equivalent using VSR registers** - -For a comparative data point the VSR Registers may be expressed in the -same fashion. The c code below is directly an expression of Figure 97 in -Power ISA Public v3.1 Book I Section 6.3 page 258, *after compensating for -MSB0 numbering in both bits and elements, adapting in full to LSB0 numbering, -and obeying LE ordering*. - -**Crucial to understanding why the subtraction from 1,3,7,15 is present -is because the Power ISA numbers VSX Registers elements also in MSB0 order**. -SVP64 very specifically numbers elements in **LSB0** order with the first -element (numbered zero) being at the bitwise-numbered **LSB** end of the register, where VSX -does the reverse: places the numerically-*highest* (last-numbered) element at -the LSB end of the register. - -``` - #pragma pack - typedef union { - // these do NOT match their Power ISA VSX numbering directly, they are all reversed - // bytes[15] is actually VSR.byte[0] for example. 
if this convention is not
-        // followed then everything ends up in the wrong place
-        uint8_t  bytes[16];  // elwidth 8, QTY 16 FIXED total
-        uint16_t hwords[8];  // elwidth 16, QTY 8 FIXED total
-        uint32_t words[4];   // elwidth 32, QTY 4 FIXED total
-        uint64_t dwords[2];  // elwidth 64, QTY 2 FIXED total
-        uint8_t  actual_bytes[16]; // totals 128-bit
-    } el_reg_t;
-
-    el_reg_t VSR_regfile[64];
-
-    // VSR elements are statically bounded: reject out-of-range element numbers
-    static void check_num_elements(int elt, int width) {
-        switch (width) {
-            case 64: assert(elt < 2); break;
-            case 32: assert(elt < 4); break;
-            case 16: assert(elt < 8); break;
-            case 8 : assert(elt < 16); break;
-        }
-    }
-    void get_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
-        check_num_elements(elt, width);
-        switch (width) {
-            case 64: el->dwords[0] = VSR_regfile[gpr].dwords[1-elt]; break;
-            case 32: el->words[0] = VSR_regfile[gpr].words[3-elt]; break;
-            case 16: el->hwords[0] = VSR_regfile[gpr].hwords[7-elt]; break;
-            case 8 : el->bytes[0] = VSR_regfile[gpr].bytes[15-elt]; break;
-        }
-    }
-    void set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
-        check_num_elements(elt, width);
-        switch (width) {
-            case 64: VSR_regfile[gpr].dwords[1-elt] = el->dwords[0]; break;
-            case 32: VSR_regfile[gpr].words[3-elt] = el->words[0]; break;
-            case 16: VSR_regfile[gpr].hwords[7-elt] = el->hwords[0]; break;
-            case 8 : VSR_regfile[gpr].bytes[15-elt] = el->bytes[0]; break;
-        }
-    }
-```
-
-For VSR Registers one key difference is that the overlay of different element
-widths is clearly a *bounded static quantity*, whereas for Simple-V the
-elements are
-unrestrained and permitted to flow into *successive underlying Scalar registers*.
-This difference is absolutely critical to a full understanding of the entire
-Simple-V paradigm and why element-ordering, bit-numbering *and register numbering*
-are all so strictly defined.
-
-Implementations are not permitted to violate the Canonical definition. Software
-will be critically relying on the wrapped (overflow) behaviour inherently
-implied by the unbounded variable-length c arrays.
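The flow of elements into successive Scalar registers can be sketched in Python. This is an illustrative model only, not part of the specification: it assumes the integer register file is one flat Little-Endian byte array of 128 x 8 bytes, and the names `RegFile`, `get_elem`, `set_elem` and `vector_add` are hypothetical.

```python
class RegFile:
    """Toy model: 128 64-bit GPRs held in one flat little-endian byte array."""

    def __init__(self, nregs=128):
        self.mem = bytearray(nregs * 8)

    def _span(self, gpr, element, width):
        # width is the element width in bits: 8, 16, 32 or 64.
        # Elements are numbered LSB0 from the start of register `gpr`
        # and simply continue into gpr+1, gpr+2, ... when they overflow.
        nbytes = width // 8
        start = gpr * 8 + element * nbytes
        return start, start + nbytes

    def get_elem(self, gpr, element, width):
        lo, hi = self._span(gpr, element, width)
        return int.from_bytes(self.mem[lo:hi], "little")

    def set_elem(self, gpr, element, width, value):
        lo, hi = self._span(gpr, element, width)
        mask = (1 << width) - 1  # truncate the value to the element width
        self.mem[lo:hi] = (value & mask).to_bytes(hi - lo, "little")


def vector_add(rf, rt, ra, rb, vl, width):
    """Element-wise add of VL elements at the given element width."""
    for i in range(vl):
        total = rf.get_elem(ra, i, width) + rf.get_elem(rb, i, width)
        rf.set_elem(rt, i, width, total)
```

In this model, 16-bit element 5 of `r3` is the same storage as 16-bit element 1 of `r4`, which is the overflow behaviour that the statically bounded VSR union cannot express without wrapper functions.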
- -Illustrating the exact same loop with the exact same effect as achieved by Simple-V -we are first forced to create wrapper functions, to cater for the fact -that VSR register elements are static bounded: - -``` - int calc_VSR_reg_offs(int elt, int width) { - switch (width) { - case 64: return floor(elt / 2); - case 32: return floor(elt / 4); - case 16: return floor(elt / 8); - case 8 : return floor(elt / 16); - } - } - int calc_VSR_elt_offs(int elt, int width) { - switch (width) { - case 64: return (elt % 2); - case 32: return (elt % 4); - case 16: return (elt % 8); - case 8 : return (elt % 16); - } - } - void _set_VSR_element(el_reg_t* el, int gpr, int elt, int width) { - int new_elt = calc_VSR_elt_offs(elt, width); - int new_reg = calc_VSR_reg_offs(elt, width); - set_VSR_element(el, gpr+new_reg, new_elt, width); - } -``` - -And finally use these functions: - -``` - # VSX-add RT, RA, RB using the "uint64_t" union member "halfs" - for i in range(VL): - el_reg_t result, ra, rb; - _get_VSR_element(&ra, RA, i, 16); - _get_VSR_element(&rb, RB, i, 16); - result.halfs[0] = ra.halfs[0] + rb.halfs[0]; // use array 0 elements - _set_VSR_element(&result, RT, i, 16); - -``` - -## Scalar Identity Behaviour - -SVP64 is designed so that when the prefix is all zeros, and VL=1, no -effect or influence occurs (no augmentation) such that all standard Power -ISA v3.0/v3.1 instructions covered by the prefix are "unaltered". This -is termed `scalar identity behaviour` (based on the mathematical -definition for "identity", as in, "identity matrix" or better "identity -transformation"). - -Note that this is completely different from when VL=0. VL=0 turns all -operations under its influence into `nops` (regardless of the prefix) -whereas when VL=1 and the SV prefix is all zeros, the operation simply -acts as if SV had not been applied at all to the instruction (an -"identity transformation"). 
- -The fact that `VL` is dynamic and can be set to any value at runtime based -on program conditions and behaviour means very specifically that -`scalar identity behaviour` is **not** a redundant encoding. If the -only means by which VL could be set was by way of static-compiled -immediates then this assertion would be false. VL should not -be confused with MAXVL when understanding this key aspect of SimpleV. - -## Register Naming and size - -As indicated above SV Registers are simply the GPR, FPR and CR -register files extended linearly to larger sizes; SV Vectorisation -iterates sequentially through these registers (LSB0 sequential ordering -from 0 to VL-1). - -Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is -r0 to r31, SV extends this as r0 to r127. Likewise FP registers are -extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries, -CR0 thru CR127. - -The names of the registers therefore reflects a simple linear extension -of the Power ISA v3.0B / v3.1B register naming, and in hardware this -would be reflected by a linear increase in the size of the underlying -SRAM used for the regfiles. - -Note: when an EXTRA field (defined below) is zero, SV is deliberately -designed so that the register fields are identical to as if SV was not in -effect i.e. under these circumstances (EXTRA=0) the register field names -RA, RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. -This is part of `scalar identity behaviour` described above. - -**Condition Register(s)** - -The Scalar Power ISA Condition Register is a 64 bit register where the top -32 MSBs (numbered 0:31 in MSB0 numbering) are not used. This convention is -*preserved* -in SVP64 and an additional 15 Condition Registers provided in -order to store the new CR Fields, CR8-CR15, CR16-CR23 etc. sequentially. -The top 32 MSBs in each new SVP64 Condition Register are *also* not used: -only the bottom 32 bits (numbered 32:63 in MSB0 numbering). 
- -*Programmer's note: using `sv.mfcr` without element-width overrides -to take into account the fact that the top 32 MSBs are zero and thus -effectively doubling the number of GPR registers required to hold all 128 -CR Fields would seem the only option because normally elwidth overrides -would halve the capacity of the instruction. However in this case it -is possible to use destination element-width overrides (for `sv.mfcr`. -source overrides would be used on the GPR of `sv.mtocrf`), whereupon -truncation of the 64-bit Condition Register(s) occurs, throwing away -the zeros and storing the remaining (valid, desired) 32-bit values -sequentially into (LSB0-convention) lower-numbered and upper-numbered -halves of GPRs respectively. The programmer is expected to be aware -however that the full width of the entire 64-bit Condition Register -is considered to be "an element". This is **not** like any other -Condition-Register instructions because all other CR instructions, -on closer investigation, will be observed to all be CR-bit or CR-Field -related. Thus a `VL` of 16 must be used* - -## Future expansion. - -With the way that EXTRA fields are defined and applied to register fields, -future versions of SV may involve 256 or greater registers. Backwards -binary compatibility may be achieved with a PCR bit (Program Compatibility -Register) or an MSR bit analogous to SF. -Further discussion is out of scope for this version of SVP64. - -Additionally, a future variant of SVP64 will be applied to the Scalar -(Quad-precision and 128-bit) VSX instructions. Element-width overrides -are an opportunity to expand a future version of the Power ISA -to 256-bit, 512-bit and -1024-bit operations, as well as doubling or quadrupling the number -of VSX registers to 128 or 256. Again further discussion is out of -scope for this version of SVP64. 
- --------- - -\newpage{} - -# New 64-bit Instruction Encoding spaces - -The following seven new areas are defined within Primary Opcode 9 (EXT009) -as a new 64-bit encoding space, alongside Primary Opcode 1 -(EXT1xx). - -| 0-5 | 6 | 7 | 8-31 | 32| Description | -|-----|---|---|-------|---|------------------------------------| -| PO | 0 | x | xxxx | 0 | `RESERVED2` (57-bit) | -| PO | 0 | 0 | !zero | 1 | SVP64Single:EXT232-263, or `RESERVED3` | -| PO | 0 | 0 | 0000 | 1 | Scalar EXT232-263 | -| PO | 0 | 1 | nnnn | 1 | SVP64:EXT232-263 | -| PO | 1 | 0 | 0000 | x | `RESERVED1` (32-bit) | -| PO | 1 | 0 | !zero | n | SVP64Single:EXT000-063 or `RESERVED4` | -| PO | 1 | 1 | nnnn | n | SVP64:EXT000-063 | - -Note that for the future SVP64Single Encoding (currently RESERVED3 and 4) -it is prohibited to have bits 8-31 be zero, unlike for SVP64 Vector space, -for which bits 8-31 can be zero (termed `scalar identity behaviour`). This -prohibition allows SVP64Single to share its Encoding space with Scalar -Ext232-263 and Scalar EXT300-363. - -Also that RESERVED1 and 2 are candidates for future Major opcode -areas EXT200-231 and EXT300-363 respectively, however as RESERVED areas -they may equally be allocated entirely differently. - -*Architectural Resource Allocation Note: **under no circumstances** must -different Defined Words be allocated within any `EXT{z}` prefixed -or unprefixed space for a given value of `z`. Even if UnVectoriseable -an instruction Defined Word space must have the exact same Instruction -and exact same Instruction Encoding in all spaces (including -being RESERVED if UnVectoriseable) or not be allocated at all. -This is required as an inviolate hard rule governing Primary Opcode 9 -that may not be revoked under any circumstances. A useful way to think -of this is that the Prefix Encoding is, like the 8086 REP instruction, -an independent 32-bit Defined Word. 
The only semi-exceptions are
-the Post-Increment Mode of LD/ST-Update and Vectorised Branch-Conditional.*
-
-Encoding spaces and their potential are illustrated:
-
-| Encoding | Available bits | Scalar | Vectoriseable | SVP64Single  |
-|----------|----------------|--------|---------------|--------------|
-|EXT000-063| 32             | yes    | yes           |yes           |
-|EXT100-163| 64             | yes    | no            |no            |
-|RESERVED2 | 57             | N/A    |not applicable |not applicable|
-|EXT232-263| 32             | yes    | yes           |yes           |
-|RESERVED1 | 32             | N/A    | no            |no            |
-
-Notes:
-
-* Prefixed-Prefixed (96-bit) instructions are prohibited. EXT1xx is
-  thus inherently UnVectoriseable as the EXT1xx prefix is 32-bit
-  on top of an SVP64 prefix which is 32-bit on top of a Defined Word
-  and the complexity at the Decoder becomes too great for High
-  Performance Multi-Issue systems.
-* RESERVED2 remains unallocated and therefore its
-  potential is not yet defined (Not Applicable).
-* RESERVED1 is also unallocated at present, but it is known in advance
-  that the area is UnVectoriseable and also cannot be Prefixed with
-  SVP64Single.
-* Considerable care is needed both on Architectural Resource Allocation
-  as well as instruction design itself. Once an instruction is allocated
-  in an UnVectoriseable area it can never be Vectorised without providing
-  an entirely new Encoding.
-
-# Remapped Encoding (`RM[0:23]`)
-
-In the SVP64 Vector Prefix spaces, the 24 bits 8-31 are termed `RM`. Bits
-32-37 are the Primary Opcode of the Suffix "Defined Word". Bits 38-63 are the
-remainder of the Defined Word. Note that for the new EXT232-263 SVP64 area
-bit 32 must be set to 1.
-
-| 0-5 | 6 | 7 | 8-31     | 32-37  | 38-63    |Description            |
-|-----|---|---|----------|--------|----------|-----------------------|
-| PO  | 0 | 1 | RM[0:23] | 1nnnnn | xxxxxxxx | SVP64:EXT232-263      |
-| PO  | 1 | 1 | RM[0:23] | nnnnnn | xxxxxxxx | SVP64:EXT000-063      |
-
-It is important to note that unlike EXT1xx 64-bit prefixed instructions
-there is insufficient space in `RM` to provide identification of
-any SVP64 Fields without first partially decoding the 32-bit suffix.
-Similar to the "Forms" (X-Form, D-Form) the `RM` format is individually
-associated with every instruction. However this still does not adversely
-affect Multi-Issue Decoding because the identification of the *length*
-of anything in the 64-bit space has been kept brutally simple (EXT009),
-and further decoding of any number of 64-bit Encodings in parallel at
-that point is fully independent.
-
-Extreme caution and care must be taken when extending SVP64
-in future, to not create unnecessary relationships between prefix and
-suffix that could complicate decoding, adding latency.
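The encoding-space table for Primary Opcode 9 given earlier in this chapter can be read as a small decision tree over bits 6, 7, 8-31 and 32. The following Python sketch shows one way to express it. It is a hedged illustration only: the helper names `msb0_field` and `classify_ext009` are hypothetical, and the 64-bit instruction is assumed to be held as an integer with MSB0 bit 0 as the most significant bit.

```python
def msb0_field(value, start, end, total_bits=64):
    """Extract MSB0-numbered bits start..end (inclusive) from value."""
    width = end - start + 1
    shift = total_bits - 1 - end
    return (value >> shift) & ((1 << width) - 1)


def classify_ext009(insn64):
    """Name the encoding space of a 64-bit word, following the EXT009 table."""
    if msb0_field(insn64, 0, 5) != 9:       # Primary Opcode 9 (EXT009)
        return "not EXT009"
    bit6 = msb0_field(insn64, 6, 6)
    bit7 = msb0_field(insn64, 7, 7)
    rm = msb0_field(insn64, 8, 31)          # 24 bits, termed RM for SVP64
    bit32 = msb0_field(insn64, 32, 32)
    if bit6 == 0:
        if bit32 == 0:
            return "RESERVED2"
        if bit7 == 1:
            return "SVP64:EXT232-263"
        if rm == 0:
            return "Scalar EXT232-263"
        return "SVP64Single:EXT232-263 or RESERVED3"
    if bit7 == 1:
        return "SVP64:EXT000-063"
    if rm == 0:
        return "RESERVED1"
    return "SVP64Single:EXT000-063 or RESERVED4"
```

Note that the classification needs only the Primary Opcode and three further bits plus a zero-test of bits 8-31, which is consistent with the statement that length identification has been kept simple.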
- -## Common RM fields - -The following fields are common to all Remapped Encodings: - -| Field Name | Field bits | Description | -|------------|------------|----------------------------------------| -| MASKMODE | `0` | Execution (predication) Mask Kind | -| MASK | `1:3` | Execution Mask | -| SUBVL | `8:9` | Sub-vector length | - -The following fields are optional or encoded differently depending -on context after decoding of the Scalar suffix: - -| Field Name | Field bits | Description | -|------------|------------|----------------------------------------| -| ELWIDTH | `4:5` | Element Width | -| ELWIDTH_SRC | `6:7` | Element Width for Source | -| EXTRA | `10:18` | Register Extra encoding | -| MODE | `19:23` | changes Vector behaviour | - -* MODE changes the behaviour of the SV operation (result saturation, - mapreduce) -* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D - and Audio/Video DSP work -* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and - source operand width -* MASK (and MASK_SRC) and MASKMODE provide predication (two types of - sources: scalar INT and Vector CR). -* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category - for the instruction, which is determined only by decoding the Scalar 32 - bit suffix. - -Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, -such as `RM-1P-3S1D` which indicates for this example that the operation -is to be single-predicated and that there are 3 source operand EXTRA -tags and one destination operand tag. - -Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance -or increased latency in some implementations due to lane-crossing. - -## Mode - -Mode is an augmentation of SV behaviour. Different types of instructions -have different needs, similar to Power ISA v3.1 64 bit prefix 8LS and MTRR -formats apply to different instruction types. Modes include Reduction, -Iteration, arithmetic saturation, and Fail-First. 
More specific details -in each section and in the SVP64 appendix - -* For condition register operations see [[sv/cr_ops]] -* For LD/ST Modes, see [[sv/ldst]]. -* For Branch modes, see [[sv/branches]] -* For arithmetic and logical, see [[sv/normal]] - -## ELWIDTH Encoding - -Default behaviour is set to 0b00 so that zeros follow the convention -of `scalar identity behaviour`. In this case it means that elwidth -overrides are not applicable. Thus if a 32 bit instruction operates -on 32 bit, `elwidth=0b00` specifies that this behaviour is unmodified. -Likewise when a processor is switched from 64 bit to 32 bit mode, -`elwidth=0b00` states that, again, the behaviour is not to be modified. - -Only when elwidth is nonzero is the element width overridden to the -explicitly required value. - -### Elwidth for Integers: - -| Value | Mnemonic | Description | -|-------|----------------|------------------------------------| -| 00 | DEFAULT | default behaviour for operation | -| 01 | `ELWIDTH=w` | Word: 32-bit integer | -| 10 | `ELWIDTH=h` | Halfword: 16-bit integer | -| 11 | `ELWIDTH=b` | Byte: 8-bit integer | - -This encoding is chosen such that the byte width may be computed as -`8<<(3-ew)` - -### Elwidth for FP Registers: - -| Value | Mnemonic | Description | -|-------|----------------|------------------------------------| -| 00 | DEFAULT | default behaviour for FP operation | -| 01 | `ELWIDTH=f32` | 32-bit IEEE 754 Single floating-point | -| 10 | `ELWIDTH=f16` | 16-bit IEEE 754 Half floating-point | -| 11 | `ELWIDTH=bf16` | Reserved for `bf16` | - -Note: -[`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) -is reserved for a future implementation of SV - -Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) -shall perform its operation at **half** the ELWIDTH then padded back out -to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation -that is then "padded" to fill out to an IEEE754 FP32. 
When ELWIDTH=DEFAULT -clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy -then padded back out to fit in IEEE754 FP64, exactly as for Scalar -v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16 or -ELWIDTH=bf16 is reserved and must raise an illegal instruction (IEEE754 -FP8 or BF8 are not defined). - -### Elwidth for CRs (no meaning) - -Element-width overrides for CR Fields has no meaning. The bits -are therefore used for other purposes, or when Rc=1, the Elwidth -applies to the result being tested (a GPR or FPR), but not to the -Vector of CR Fields. - -## SUBVL Encoding - -The default for SUBVL is 1 and its encoding is 0b00 to indicate that -SUBVL is effectively disabled (a SUBVL for-loop of only one element). this -lines up in combination with all other "default is all zeros" behaviour. - -| Value | Mnemonic | Subvec | Description | -|-------|-----------|---------|------------------------| -| 00 | `SUBVL=1` | single | Sub-vector length of 1 | -| 01 | `SUBVL=2` | vec2 | Sub-vector length of 2 | -| 10 | `SUBVL=3` | vec3 | Sub-vector length of 3 | -| 11 | `SUBVL=4` | vec4 | Sub-vector length of 4 | - -The SUBVL encoding value may be thought of as an inclusive range of a -sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore -this may be considered to be elements 0b00 to 0b01 inclusive. - -## MASK/MASK_SRC & MASKMODE Encoding - -One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two -types may not be mixed. - -Special note: to disable predication this field must be set to zero in -combination with Integer Predication also being set to 0b000. this has the -effect of enabling "all 1s" in the predicate mask, which is equivalent to -"not having any predication at all". 
-
-`MASKMODE` may be set to one of 2 values:
-
-| Value     | Description                                          |
-|-----------|------------------------------------------------------|
-| 0         | MASK/MASK_SRC are encoded using Integer Predication  |
-| 1         | MASK/MASK_SRC are encoded using CR-based Predication |
-
-Integer Twin predication has a second set of 3 bits that uses the same
-encoding thus allowing either the same register (r3, r10 or r30) to be
-used for both src and dest, or different regs (one for src, one for dest).
-
-Likewise CR based twin predication has a second set of 3 bits, allowing
-a different test to be applied.
-
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed. In practice (for CR Fields)
-this creates an unnecessary block on parallelism. Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop. Doing so results
-in **UNDEFINED** behaviour, according to the definition outlined in the
-Power ISA v3.0B Specification.
-
-Hardware Implementations are therefore free and clear to delay reading
-of individual CR fields until the actual predicated element operation
-needs to take place, safe in the knowledge that no programmer will have
-issued a Vector Instruction where previous elements could have overwritten
-(destroyed) not-yet-executed CR-Predicated element operations.
-
-### Integer Predication (MASKMODE=0)
-
-When the predicate mode bit is zero the 3 bits are interpreted as below.
-Twin predication has an identical 3 bit field similarly encoded.
- -`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the -following meaning: - -| Value | Mnemonic | Element `i` enabled if: | -|-------|----------|------------------------------| -| 000 | ALWAYS | predicate effectively all 1s | -| 001 | 1 << R3 | `i == R3` | -| 010 | R3 | `R3 & (1 << i)` is non-zero | -| 011 | ~R3 | `R3 & (1 << i)` is zero | -| 100 | R10 | `R10 & (1 << i)` is non-zero | -| 101 | ~R10 | `R10 & (1 << i)` is zero | -| 110 | R30 | `R30 & (1 << i)` is non-zero | -| 111 | ~R30 | `R30 & (1 << i)` is zero | - -r10 and r30 are at the high end of temporary and unused registers, -so as not to interfere with register allocation from ABIs. - -### CR-based Predication (MASKMODE=1) - -When the predicate mode bit is one the 3 bits are interpreted as below. -Twin predication has an identical 3 bit field similarly encoded. - -`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the -following meaning: - -| Value | Mnemonic | Element `i` is enabled if | -|-------|----------|--------------------------| -| 000 | lt | `CR[offs+i].LT` is set | -| 001 | nl/ge | `CR[offs+i].LT` is clear | -| 010 | gt | `CR[offs+i].GT` is set | -| 011 | ng/le | `CR[offs+i].GT` is clear | -| 100 | eq | `CR[offs+i].EQ` is set | -| 101 | ne | `CR[offs+i].EQ` is clear | -| 110 | so/un | `CR[offs+i].FU` is set | -| 111 | ns/nu | `CR[offs+i].FU` is clear | - -`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised -Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD). - -The CR Predicates chosen must start on a boundary that Vectorised CR -operations can access cleanly, in full. With EXTRA2 restricting starting -points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and -CR Predicate Masks have to be adapted to fit on these boundaries as well. - -## Extra Remapped Encoding - -Shows all instruction-specific fields in the Remapped Encoding -`RM[10:18]` for all instruction variants. 
-Note that due to the very
-tight space, the encoding mode is *not* included in the prefix itself.
-The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form)
-on a per-instruction basis, and, like "Forms", is given a designation
-(below) of the form `RM-nP-nSnD`. The full list of which instructions
-use which remaps is here [[opcode_regs_deduped]].
-
-**Please note the following**:
-
-```
-    Machine-readable CSV files have been autogenerated which will make the
-    task of creating SV-aware ISA decoders, documentation, assembler tools,
-    compiler tools, Simulators, and all other aspects of SVP64 easier
-    and less prone to mistakes. Please avoid manual re-creation of
-    information from the written specification wording in this chapter,
-    and use the CSV files or use the Canonical tool which creates the CSV
-    files, named sv_analysis.py. The information contained within
-    sv_analysis.py is considered to be part of this Specification, even
-    encoded as it is in python3.
-```
-
-The mappings are part of the SVP64 Specification in exactly the same
-way as X-Form, D-Form. New Scalar instructions added to the Power ISA
-will need a corresponding SVP64 Mapping, which can be derived by-rote
-from examining the Register "Profile" of the instruction.
-
-There are two categories: Single and Twin Predication. Due to space
-considerations further subdivision of Single Predication is based on
-whether the number of src operands is 2 or 3. With only 9 bits available
-some compromises have to be made.
-
-* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand
-  instructions (fmadd, isel, madd).
-* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand
-  instructions (src1 src2 dest)
-* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
-* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
-* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
-
-### RM-1P-3S1D
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------------------|
-| Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
-| Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
-| Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
-| Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
-| EXTRA2_MODE | `18` | used by `divmod2du` and `maddedu` for RS |
-
-These are for 3 operand in and either 1 or 2 out instructions.
-3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT) instructions
-such as `maddedu` have an implicit second destination, RS, the
-selection of which is determined by bit 18.
-
-### RM-1P-2S1D
-
-| Field Name | Field bits | Description |
-|------------|------------|-------------------------------------------|
-| Rdest\_EXTRA3 | `10:12` | extends Rdest |
-| Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1 |
-| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2 |
-
-These are for 2 operand 1 dest instructions, such as `add RT, RA,
-RB`. However also included are unusual instructions with an implicit
-dest that is identical to its src reg, such as `rlwimi`.
-
-Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would
-not have sufficient bit fields to allow an alternative destination.
-With SV however this becomes possible. Therefore, the fact that the
-dest is implicitly also a src should not mislead: due to the *prefix*
-they are different SV regs.
-
-* `rlwimi RA, RS, ...`
-* Rsrc1_EXTRA3 applies to RS as the first src
-* Rsrc2_EXTRA3 applies to RA as the second src
-* Rdest_EXTRA3 applies to RA to create an **independent** dest.
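The two Single Predication layouts above can be sketched as plain bit-field extraction. The following is a minimal illustration only, not part of the specification: the function name `split_extra`, the `LAYOUTS` dictionary, and the choice to pass `RM[10:18]` as a 9-bit integer (with RM bit 10 as its most significant bit, following MSB0 numbering) are all assumptions made for this example.

```python
# Hypothetical helper: field names and widths are copied from the
# RM-1P-3S1D and RM-1P-2S1D tables, listed from RM bit 10 towards bit 18.
LAYOUTS = {
    "RM-1P-3S1D": [("Rdest_EXTRA2", 2), ("Rsrc1_EXTRA2", 2),
                   ("Rsrc2_EXTRA2", 2), ("Rsrc3_EXTRA2", 2),
                   ("EXTRA2_MODE", 1)],
    "RM-1P-2S1D": [("Rdest_EXTRA3", 3), ("Rsrc1_EXTRA3", 3),
                   ("Rsrc2_EXTRA3", 3)],
}


def split_extra(extra9, layout):
    """Split the 9-bit value of RM[10:18] into its named fields.

    Assumption: extra9 holds RM[10:18] with RM bit 10 as the most
    significant bit of the integer (MSB0 numbering).
    """
    if not 0 <= extra9 < (1 << 9):
        raise ValueError("RM[10:18] is a 9-bit field")
    fields = {}
    remaining = 9  # bits not yet consumed, counted from the MSB end
    for name, width in LAYOUTS[layout]:
        remaining -= width
        fields[name] = (extra9 >> remaining) & ((1 << width) - 1)
    return fields
```

Called as `split_extra(0b100011000, "RM-1P-2S1D")`, the sketch separates the three 3-bit EXTRA3 fields; the same 9 bits read under `"RM-1P-3S1D"` yield four 2-bit EXTRA2 fields plus the single mode bit, which illustrates why the layout has to be known per instruction rather than read from the prefix.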
-
-With the addition of the EXTRA bits, the three registers
-each may be *independently* made vector or scalar, and be independently
-augmented to 7 bits in length.
-
-### RM-2P-1S1D/2S
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------|
-| Rdest_EXTRA3 | `10:12` | extends Rdest |
-| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
-| MASK_SRC | `16:18` | Execution Mask for Source |
-
-`RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
-
-### RM-1P-2S1D
-
-Single-predicate, three registers (2 read, 1 write)
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------|
-| Rdest_EXTRA3 | `10:12` | extends Rdest |
-| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
-| Rsrc2_EXTRA3 | `16:18` | extends Rsrc2 |
-
-### RM-2P-2S1D/1S2D/3S
-
-The primary purpose for this encoding is for Twin Predication on LOAD
-and STORE operations. See [[sv/ldst]] for detailed analysis.
-
-**RM-2P-2S1D:**
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------|
-| Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
-| Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
-| Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
-| MASK_SRC | `16:18` | Execution Mask for Source |
-
-**RM-2P-1S2D:**
-
-For RM-2P-1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
-is in bits 10:11, Rdest1_EXTRA2 in 12:13)
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------|
-| Rsrc2_EXTRA2 | `10:11` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
-| Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
-| Rdest_EXTRA2 | `14:15` | extends Rdest (R\*\_EXTRA2 Encoding) |
-| MASK_SRC | `16:18` | Execution Mask for Source |
-
-**RM-2P-3S:**
-
-Note also that for RM-2P-3S (to cover `stdx` etc.) the names are switched to 3 src:
-Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------|
-| Rsrc1_EXTRA2 | `10:11` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
-| Rsrc2_EXTRA2 | `12:13` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
-| Rsrc3_EXTRA2 | `14:15` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
-| MASK_SRC | `16:18` | Execution Mask for Source |
-
-Note also that LD with update indexed, which takes 2 src and
-creates 2 dest registers (e.g. `lhaux RT,RA,RB`), does not have room
-for 4 registers and also Twin Predication. Therefore these are treated as
-RM-2P-2S1D and the src spec for RA is also used for the same RA as a dest.
-
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
-or increased latency in some implementations due to lane-crossing.
-
-## R\*\_EXTRA2/3
-
-EXTRA is the means by which two things are achieved:
-
-1. Registers are marked as either Vector *or Scalar*
-2. Register field numbers (limited typically to 5 bits)
-   are extended in range, both for Scalar and Vector.
-
-The register files are therefore extended:
-
-* INT (GPR) is extended from r0-31 to r0-127
-* FP (FPR) is extended from fp0-fp31 to fp0-fp127
-* CR Fields are extended from CR0-7 to CR0-127
-
-However due to pressure in `RM.EXTRA` not all these registers
-are accessible by all instructions, particularly those with
-a large number of operands (`madd`, `isel`).
-
-In the following tables register numbers are constructed from the
-standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 or
-EXTRA3 field from the SV Prefix, determined by the specific RM-xx-yyyy
-designation for a given instruction. The prefixing is arranged so that
-interoperability between prefixing and nonprefixing of scalar registers
-is direct and convenient (when the EXTRA field is all zeros).
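As a hedged illustration of how a register number is constructed, the sketch below covers only the INT/FP EXTRA3 case, following the INT/FP EXTRA3 table of this section: the top bit of EXTRA3 selects Vector or Scalar, and the remaining two bits extend the 5-bit field either below it (Vector) or above it (Scalar). The function name `decode_int_fp_extra3` and its return shape are assumptions made for this example and are not taken from the specification or from `sv_analysis.py`.

```python
def decode_int_fp_extra3(field5, extra3):
    """Return (is_vector, register_number) for an INT/FP operand.

    field5 -- the 5-bit register field (e.g. RA or FRA) of the suffix
    extra3 -- the 3-bit EXTRA3 value taken from the prefix
    Both are passed as ordinary Python integers (an assumption of this
    sketch); EXTRA2 and CR Fields are deliberately not handled here.
    """
    if not 0 <= field5 < 32:
        raise ValueError("register field must fit in 5 bits")
    if not 0 <= extra3 < 8:
        raise ValueError("EXTRA3 must fit in 3 bits")
    is_vector = bool(extra3 & 0b100)   # top bit: Vector or Scalar tag
    low2 = extra3 & 0b011              # remaining two extension bits
    if is_vector:
        # Vector: field shifted up, extension bits become the LSBs
        return True, (field5 << 2) | low2
    # Scalar: extension bits become the MSBs above the 5-bit field
    return False, (low2 << 5) | field5
```

With `extra3 == 0` the function returns the unmodified scalar register number, which is the "scalar identity" property that the paragraph above relies on for interoperability with non-prefixed instructions.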
-
-A pseudocode algorithm explains the relationship, for INT/FP (see
-SVP64 appendix for CRs).
-
-```
-    if extra3_mode:
-        spec = EXTRA3
-    else:
-        spec = EXTRA2 << 1 # same as EXTRA3, shifted
-    if spec[0]: # vector
-        return (RA << 2) | spec[1:2]
-    else: # scalar
-        return (spec[1:2] << 5) | RA
-```
-
-Future versions may extend to 256 by shifting Vector numbering up.
-Scalar will not be altered.
-
-Note that in some cases the range of starting points for Vectors
-is limited.
-
-### INT/FP EXTRA3
-
-If EXTRA3 is zero, it maps to "scalar identity" (scalar Power ISA field
-naming).
-
-Fields are as follows:
-
-* Value: R_EXTRA3
-* Mode: register is tagged as scalar or vector
-* Range/Inc: the range of registers accessible from this EXTRA
-  encoding, and the "increment" (accessibility). "/4" means
-  that this EXTRA encoding may only give access (starting point)
-  every 4th register.
-* MSB..LSB: the bit field showing how the register opcode field
-  combines with EXTRA to give (extend) the register number (GPR)
-
-| Value | Mode | Range/Inc | 6..0 |
-|-----------|-------|---------------|---------------------|
-| 000 | Scalar | `r0-r31`/1 | `0b00 RA` |
-| 001 | Scalar | `r32-r63`/1 | `0b01 RA` |
-| 010 | Scalar | `r64-r95`/1 | `0b10 RA` |
-| 011 | Scalar | `r96-r127`/1 | `0b11 RA` |
-| 100 | Vector | `r0-r124`/4 | `RA 0b00` |
-| 101 | Vector | `r1-r125`/4 | `RA 0b01` |
-| 110 | Vector | `r2-r126`/4 | `RA 0b10` |
-| 111 | Vector | `r3-r127`/4 | `RA 0b11` |
-
-### INT/FP EXTRA2
-
-If EXTRA2 is zero, it maps to
-"scalar identity behaviour", i.e. Scalar Power ISA register naming:
-
-| Value | Mode | Range/inc | 6..0 |
-|----------|-------|---------------|-----------|
-| 00 | Scalar | `r0-r31`/1 | `0b00 RA` |
-| 01 | Scalar | `r32-r63`/1 | `0b01 RA` |
-| 10 | Vector | `r0-r124`/4 | `RA 0b00` |
-| 11 | Vector | `r2-r126`/4 | `RA 0b10` |
-
-**Note that unlike in EXTRA3, in EXTRA2**:
-
-* the GPR Vectors may only start from
-  `r0, r2, r4, r6, r8` and likewise FPR Vectors.
-* the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.
-
-as there are insufficient bits to cover the full range.
-
-### CR Field EXTRA3
-
-CR Field encoding is essentially the same but made more complex due to CRs
-being bit-based, because the application of SVP64 element-numbering applies
-to the CR *Field* numbering not the CR register *bit* numbering.
-Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
-and Scalars may only go from `CR0, CR1, ... CR31`
-
-Encoding shown MSB down to LSB
-
-For a 5-bit operand (BA, BB, BT):
-
-| Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
-|-------|------|---------------|-----------| --------|---------|
-| 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[0:2] | BA[3:4] |
-| 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[0:2] | BA[3:4] |
-| 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[0:2] | BA[3:4] |
-| 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[0:2] | BA[3:4] |
-| 100 | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000 | BA[3:4] |
-| 101 | Vector | `CR4-CR116`/16 | BA[0:2] 0 | 0b100 | BA[3:4] |
-| 110 | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000 | BA[3:4] |
-| 111 | Vector | `CR12-CR124`/16 | BA[0:2] 1 | 0b100 | BA[3:4] |
-
-For a 3-bit operand (e.g. BFA):
-
-| Value | Mode | Range/Inc | 6..3 | 2..0 |
-|-------|------|---------------|-----------| --------|
-| 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
-| 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
-| 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BFA |
-| 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BFA |
-| 100 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
-| 101 | Vector | `CR4-CR116`/16 | BFA 0 | 0b100 |
-| 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
-| 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 |
-
-### CR EXTRA2
-
-CR encoding is essentially the same but made more complex due to CRs
-being bit-based, because the application of SVP64 element-numbering applies
-to the CR *Field* numbering not the CR register *bit* numbering.
-See separate section for explanation and pseudocode.
-Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
-
-Encoding shown MSB down to LSB
-
-For a 5-bit operand (BA, BB, BC):
-
-| Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
-|-------|--------|----------------|---------|---------|---------|
-| 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[0:2] | BA[3:4] |
-| 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[0:2] | BA[3:4] |
-| 10 | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000 | BA[3:4] |
-| 11 | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000 | BA[3:4] |
-
-For a 3-bit operand (e.g. BFA):
-
-| Value | Mode | Range/Inc | 6..3 | 2..0 |
-|-------|------|---------------|-----------| --------|
-| 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BFA |
-| 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BFA |
-| 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
-| 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
-
---------
-
-\newpage{}
-
-
 # Normal SVP64 Modes, for Arithmetic and Logical Operations
 
 Normal SVP64 Mode covers Arithmetic and Logical operations
-- 
2.30.2