-[[!tag standards]]
-
-# DRAFT SVP64 for Power ISA v3.0B
+# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
* **DRAFT STATUS v0.1 18sep2021** Release notes <https://bugs.libre-soc.org/show_bug.cgi?id=699>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001650.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=550>
* <https://bugs.libre-soc.org/show_bug.cgi?id=573> TODO elwidth "infinite" discussion
-* <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturating description.
+* <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturating description.
* <https://bugs.libre-soc.org/show_bug.cgi?id=905> TODO [[sv/svp64-single]]
+* <https://bugs.libre-soc.org/show_bug.cgi?id=1045> External RFC ls010
+
Table of contents
# Introduction
-This document focuses on the encoding of [[SV|sv]], and assumes familiarity with the same. It does not cover how SV works (merely the instruction encoding), and is therefore best read in conjunction with the [[sv/overview]], as well as the [[sv/svp64_quirks]] section.
-It is also crucial to note that whilst this format augments instruction
-behaviour it works in conjunction with SVSTATE and other [[sv/sprs]].
-
-All bit numbers are in MSB0 form (the bits are numbered from 0 at the MSB
-on the left
-and counting up as you move rightwards to the LSB end). All bit ranges are inclusive
-(so `4:6` means bits 4, 5, and 6, in MSB0 order).
-
-64-bit instructions are split into two 32-bit words, the prefix and the
-suffix. The prefix always comes before the suffix in PC order.
-
-| 0:5 | 6:31 | 32:63 |
-|--------|--------------|--------------|
-| EXT01 | v3.1 Prefix | v3.0/1 Suffix |
-
-svp64 fits into the "reserved" portions of the v3.1 prefix, making it possible for svp64, v3.0B (or v3.1 including 64 bit prefixed) instructions to co-exist in the same binary without conflict.
+Simple-V is a type of Vectorisation best described as a "Prefix Loop
+Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and
+to the 8086 `REP` Prefix instruction. More advanced features are similar
+to the Z80 `CPIR` instruction. If viewed one-dimensionally as an actual
+Vector ISA it introduces over 1.5 million 64-bit Vector instructions.
+SVP64, the instruction format used by Simple-V, is therefore best viewed
+as an orthogonal RISC-paradigm "Prefixing" subsystem instead.
+
+Except where explicitly stated all bit numbers remain as in the rest of
+the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on
+the left and counting up as you move rightwards to the LSB end). All bit
+ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order).
+**All register numbering and element numbering however is LSB0 ordering**
+which is a different convention from that used elsewhere in the Power ISA.
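
The two numbering conventions can be related with a small helper. The following is only an illustrative sketch: the function names `msb0_to_lsb0` and `extract_msb0_field` are hypothetical and are not part of the specification.

```python
def msb0_to_lsb0(bit, width=64):
    """Convert an MSB0 bit number to the equivalent LSB0 bit number."""
    return width - 1 - bit


def extract_msb0_field(value, start, end, width=64):
    """Return the inclusive MSB0 bit range start:end of a width-bit value."""
    nbits = end - start + 1
    # the field's least-significant bit is the MSB0 bit numbered `end`
    shift = msb0_to_lsb0(end, width)
    return (value >> shift) & ((1 << nbits) - 1)


# bits 4:6 (MSB0, inclusive) of a 32-bit word are bits 4, 5 and 6,
# i.e. LSB0 bits 27, 26 and 25
word = 0b0000_1010_0000_0000_0000_0000_0000_0000
field = extract_msb0_field(word, 4, 6, width=32)
```

Here `field` holds the three bits `0b101`, showing that an inclusive MSB0 range `4:6` is a right-shift by `width - 1 - 6` followed by a 3-bit mask.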
+
+The SVP64 prefix always comes before the suffix in PC order and must be
+considered an independent "Defined word" that augments the behaviour of
+the following instruction, but does **not** change the actual Decoding
+of that following instruction. **All prefixed instructions retain their
+non-prefixed encoding and definition**.
+
+Two apparent exceptions to the above hard rule exist: SV Branch-Conditional
+operations and LD/ST-update "Post-Increment" Mode. Post-Increment
+was considered sufficiently high priority (significantly reducing hot-loop
+instruction count) that one bit in the Prefix is reserved for it.
+Vectorised Branch-Conditional operations "embed" the original Scalar
+Branch-Conditional behaviour into a much more advanced variant that
+is highly suited to High-Performance Computation (HPC), Supercomputing,
+and parallel GPU Workloads.
+
+*Architectural Resource Allocation note: it is prohibited to accept RFCs
+which fundamentally violate this hard requirement. Under no circumstances
+must the Suffix space have an alternate instruction encoding allocated
+within SVP64 that is entirely different from the non-prefixed Defined
+Word. Hardware Implementors critically rely on this inviolate guarantee
+to implement High-Performance Multi-Issue micro-architectures that can
+sustain 100% throughput.*
Subset implementations in hardware are permitted, as long as certain
rules are followed, allowing for full soft-emulation including future
-revisions. Details in the [[svp64/appendix]].
+revisions. Compliancy Subsets exist to ensure minimum levels of binary
+interoperability expectations within certain environments. Details in
+the [[svp64/appendix]].
## SVP64 encoding features
-A number of features need to be compacted into a very small space of only 24 bits:
+A number of features need to be compacted into a very small space of
+only 24 bits:
-* Independent per-register Scalar/Vector tagging and range extension on every register
+* Independent per-register Scalar/Vector tagging and range extension on
+ every register
* Element width overrides on both source and destination
* Predication on both source and destination
* Two different sources of predication: INT and CR Fields
-* SV Modes including saturation (for Audio, Video and DSP), mapreduce, fail-first and
- predicate-result mode.
+* SV Modes including saturation (for Audio, Video and DSP), mapreduce,
+ fail-first and predicate-result mode.
-This document focusses specifically on how that fits into available space. The [[svp64/appendix]] explains more of the details, whilst the [[sv/overview]] gives the basics.
+Different classes of operations require different formats. The earlier
+sections cover the common formats and the four separate modes follow:
+CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store
+and Branch-Conditional.
-# Definition of Reserved in this spec.
+## Definition of Reserved in this spec.
For the new fields added in SVP64, instructions that have any of their
fields set to a reserved value must cause an illegal instruction trap,
-to allow emulation of future instruction sets, or for subsets of SVP64
-to be implemented in hardware and the rest emulated.
-This includes SVP64 SPRs: reading or writing values which are not
-supported in hardware must also raise illegal instruction traps
-in order to allow emulation.
+to allow emulation of future instruction sets, or for subsets of SVP64 to
+be implemented in hardware and the rest emulated. This includes SVP64
+SPRs: reading or writing values which are not supported in hardware
+must also raise illegal instruction traps in order to allow emulation.
Unless otherwise stated, reserved values are always all zeros.
-This is unlike OpenPower ISA v3.1, which in many instances does not require a trap if reserved fields are nonzero. Where the standard Power ISA definition
-is intended the red keyword `RESERVED` is used.
-
-# Scalar Identity Behaviour
-
-SVP64 is designed so that when the prefix is all zeros, and
- VL=1, no effect or
-influence occurs (no augmentation) such that all standard Power ISA
-v3.0/v3 1 instructions covered by the prefix are "unaltered". This is termed `scalar identity behaviour` (based on the mathematical definition for "identity", as in, "identity matrix" or better "identity transformation").
-
-Note that this is completely different from when VL=0. VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
- whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction (an "identity transformation").
-
-# Register Naming and size
-
-SV Registers are simply the INT, FP and CR register files extended
-linearly to larger sizes; SV Vectorisation iterates sequentially through these registers.
-
-Where the integer regfile in standard scalar
-Power ISA v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
-Likewise FP registers are extended to 128 (fp0 to fp127), and CR Fields
-are
-extended to 128 entries, CR0 thru CR127.
+This is unlike OpenPower ISA v3.1, which in many instances does not
+require a trap if reserved fields are nonzero. Where the standard Power
+ISA definition is intended the red keyword `RESERVED` is used.
+
+## Definition of "UnVectoriseable"
+
+Any operation that inherently makes no sense if repeated is termed
+"UnVectoriseable" or "UnVectorised". Examples include `sc` or `sync`
+which have no registers. `mtmsr` is also classed as UnVectoriseable
+because there is only one `MSR`.
+
+## Register files, elements, and Element-width Overrides
+
+In the Upper Compliancy Levels of SVP64 the GPR and FPR Register files
+are each expanded from 32 to 128 entries, and the number of CR Fields
+is expanded from CR0-CR7 to CR0-CR127. (Note: A future version of SVP64
+is anticipated to extend the VSR register file).
+
+Memory access remains exactly the same: the effects of `MSR.LE` are
+unchanged, continuing to apply **only** to the byte-order of Load and
+Store memory-register operations, and having nothing to do with the
+ordering of the contents of register files or register-register
+operations.
+
+To be absolutely clear:
+
+```
+ There are no conceptual arithmetic ordering or other changes over the
+ Scalar Power ISA definitions to registers or register files or to
+ arithmetic or Logical Operations beyond element-width subdivision
+```
+
+Element offset
+numbering is naturally **LSB0-sequentially-incrementing from zero, not
+MSB0-incrementing** including when element-width overrides are used,
+at which point the elements progress through each register
+sequentially from the LSB end
+(confusingly numbered the highest in MSB0 ordering) and progress
+incrementally to the MSB end (confusingly numbered the lowest in
+MSB0 ordering).
+
+When exclusively using MSB0-numbering, SVP64
+becomes unnecessarily complex to both express and subsequently understand:
+the required conditional subtractions from 63,
+31, 15 and 7 needed to express the fact that elements are LSB0-sequential
+unfortunately become a hostile minefield, obscuring both
+intent and meaning. Therefore for the
+purposes of this section the more natural **LSB0 numbering is assumed**
+and it is left to the reader to translate to MSB0 numbering.
+
+The Canonical specification for how element-sequential numbering and
+element-width overrides are defined is expressed in the following C
+structure, assuming a Little-Endian system, and naturally using LSB0
+numbering everywhere because the ANSI C specification is inherently LSB0.
+Note the deliberate similarity to how VSX register elements are defined:
+
+```
+    #include <stdint.h>
+    #pragma pack
+    typedef union {
+        // arrays are deliberately unbounded: elements are permitted to
+        // flow into successive registers of the register file
+        uint8_t  bytes[];   // elwidth 8
+        uint16_t hwords[];  // elwidth 16
+        uint32_t words[];   // elwidth 32
+        uint64_t dwords[];  // elwidth 64
+        uint8_t  actual_bytes[8];
+    } el_reg_t;
+
+    el_reg_t int_regfile[128];
+
+    // copy one element of the given width out of the register file
+    void get_register_element(el_reg_t* el, int gpr, int element, int width) {
+        switch (width) {
+            case 64: el->dwords[0] = int_regfile[gpr].dwords[element]; break;
+            case 32: el->words[0]  = int_regfile[gpr].words[element];  break;
+            case 16: el->hwords[0] = int_regfile[gpr].hwords[element]; break;
+            case 8 : el->bytes[0]  = int_regfile[gpr].bytes[element];  break;
+        }
+    }
+    // copy one element of the given width into the register file
+    void set_register_element(el_reg_t* el, int gpr, int element, int width) {
+        switch (width) {
+            case 64: int_regfile[gpr].dwords[element] = el->dwords[0]; break;
+            case 32: int_regfile[gpr].words[element]  = el->words[0];  break;
+            case 16: int_regfile[gpr].hwords[element] = el->hwords[0]; break;
+            case 8 : int_regfile[gpr].bytes[element]  = el->bytes[0];  break;
+        }
+    }
+```
+
+Example Vector-looped add operation implementation when elwidths are 64-bit:
+
+```
+    # vector-add RT, RA, RB using the "uint64_t" union member, "dwords"
+    for i in range(VL):
+        int_regfile[RT].dwords[i] = int_regfile[RA].dwords[i] + int_regfile[RB].dwords[i]
+```
+
+However if elwidth overrides are set to 16 for both source and destination:
+
+```
+    # vector-add RT, RA, RB using the "uint16_t" union member "hwords"
+    for i in range(VL):
+        int_regfile[RT].hwords[i] = int_regfile[RA].hwords[i] + int_regfile[RB].hwords[i]
+```
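
The way elements flow into successive registers can be modelled in a few lines. This is a minimal sketch only, assuming a Little-Endian byte layout as in the C union above; the class name `ElementRegFile` and the function `vector_add` are hypothetical names chosen for illustration.

```python
class ElementRegFile:
    """Byte-addressable model of a 128-entry, 64-bit register file."""

    def __init__(self, nregs=128):
        self.mem = bytearray(nregs * 8)

    def get(self, reg, element, width):
        # elements are numbered from the LSB end of register `reg` and
        # continue, unbounded, into the following registers
        nbytes = width // 8
        offs = reg * 8 + element * nbytes
        return int.from_bytes(self.mem[offs:offs + nbytes], "little")

    def set(self, reg, element, width, value):
        nbytes = width // 8
        offs = reg * 8 + element * nbytes
        mask = (1 << width) - 1  # results wrap at the element width
        self.mem[offs:offs + nbytes] = (value & mask).to_bytes(nbytes, "little")


def vector_add(regs, RT, RA, RB, VL, width):
    """Element-wise add with the same element width on sources and destination."""
    for i in range(VL):
        regs.set(RT, i, width, regs.get(RA, i, width) + regs.get(RB, i, width))


regs = ElementRegFile()
for i in range(6):
    regs.set(8, i, 16, i + 1)     # RA=r8:  elements 1, 2, 3, 4, 5, 6
    regs.set(16, i, 16, 10 * i)   # RB=r16: elements 0, 10, 20, 30, 40, 50
vector_add(regs, RT=24, RA=8, RB=16, VL=6, width=16)
```

With a 16-bit element width only four elements fit into one 64-bit register, so elements 4 and 5 of the result land in the two lowest halfwords of r25 rather than in r24.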
+
+Hardware Architectural note: to avoid a Read-Modify-Write at the register
+file it is strongly recommended to implement byte-level write-enable lines
+exactly as has been implemented in DRAM ICs for many decades. Additionally
+the predicate mask bit is advised to be associated with the element
+operation and alongside the result ultimately passed to the register file.
+When element-width is set to 64-bit the relevant predicate mask bit
+may be repeated eight times and pull all eight write-port byte-level
+lines HIGH. Clearly when element-width is set to 8-bit the relevant
+predicate mask bit corresponds directly with one single byte-level
+write-enable line. It is up to the Hardware Architect to then amortise
+(merge) elements together into both PredicatedSIMD Pipelines as well
+as simultaneous non-overlapping Register File writes, to achieve High
+Performance designs.
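
As an illustration of the recommendation above, the following sketch derives the eight byte-level write-enable lines of one 64-bit register write from the predicate mask. It is a software model under stated assumptions, not a hardware description: the function name `byte_write_enables` and its parameters are hypothetical.

```python
def byte_write_enables(pred_mask, first_element, elwidth):
    """Return the 8-bit byte-enable mask for one 64-bit register write.

    pred_mask:     integer predicate, bit i (LSB0) enables element i
    first_element: index of the element held at the LSB end of the register
    elwidth:       element width in bits (8, 16, 32 or 64)
    """
    nbytes = elwidth // 8          # bytes per element
    per_reg = 8 // nbytes          # elements per 64-bit register
    enables = 0
    for slot in range(per_reg):
        if (pred_mask >> (first_element + slot)) & 1:
            # repeat the predicate bit across every byte of the element
            enables |= ((1 << nbytes) - 1) << (slot * nbytes)
    return enables
```

At an element width of 64 a single predicate bit drives all eight lines, whereas at an element width of 8 each predicate bit maps directly onto one line.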
+
+**Comparative equivalent using VSR registers**
+
+For a comparative data point the VSR Registers may be expressed in the
+same fashion. The c code below is directly an expression of Figure 97 in
+Power ISA Public v3.1 Book I Section 6.3 page 258, *after compensating for
+MSB0 numbering in both bits and elements, adapting in full to LSB0 numbering,
+and obeying LE ordering*.
+
+**The subtraction from 1, 3, 7 and 15 is present because VSX Registers
+also number elements in MSB0 order**. SVP64
+very specifically numbers elements in **LSB0** order with the first
+element being at the **LSB** end of the register, where VSX places
+the numerically-lowest element at the **MSB** end of the register.
+
+```
+    #include <assert.h>
+    #include <stdint.h>
+    #pragma pack
+    typedef union {
+        uint8_t  bytes[16];        // elwidth 8, QTY 16 FIXED total
+        uint16_t hwords[8];        // elwidth 16, QTY 8 FIXED total
+        uint32_t words[4];         // elwidth 32, QTY 4 FIXED total
+        uint64_t dwords[2];        // elwidth 64, QTY 2 FIXED total
+        uint8_t  actual_bytes[16]; // totals 128-bit
+    } el_reg_t;
+
+    el_reg_t VSR_regfile[64];
+
+    // VSR elements are a bounded static quantity: check the bound
+    static void check_num_elements(int elt, int width) {
+        switch (width) {
+            case 64: assert(elt < 2);  break;
+            case 32: assert(elt < 4);  break;
+            case 16: assert(elt < 8);  break;
+            case 8 : assert(elt < 16); break;
+        }
+    }
+    void get_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
+        check_num_elements(elt, width);
+        switch (width) {
+            case 64: el->dwords[0] = VSR_regfile[gpr].dwords[1-elt]; break;
+            case 32: el->words[0]  = VSR_regfile[gpr].words[3-elt];  break;
+            case 16: el->hwords[0] = VSR_regfile[gpr].hwords[7-elt]; break;
+            case 8 : el->bytes[0]  = VSR_regfile[gpr].bytes[15-elt]; break;
+        }
+    }
+    void set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
+        check_num_elements(elt, width);
+        switch (width) {
+            case 64: VSR_regfile[gpr].dwords[1-elt] = el->dwords[0]; break;
+            case 32: VSR_regfile[gpr].words[3-elt]  = el->words[0];  break;
+            case 16: VSR_regfile[gpr].hwords[7-elt] = el->hwords[0]; break;
+            case 8 : VSR_regfile[gpr].bytes[15-elt] = el->bytes[0];  break;
+        }
+    }
+```
+
+For VSR Registers one key difference is that the overlay of different element
+widths is clearly a *bounded static quantity*, whereas for Simple-V the
+elements are
+unrestrained and permitted to flow into *successive underlying Scalar registers*.
+This difference is absolutely critical to a full understanding of the entire
+Simple-V paradigm and why element-ordering, bit-numbering *and register numbering*
+are all so strictly defined.
+
+Implementations are not permitted to violate the Canonical definition. Software
+will be critically relying on the wrapped (overflow) behaviour inherently
+implied by the unbounded variable-length c arrays.
+
+Illustrating the exact same loop with the exact same effect as achieved by Simple-V
+we are first forced to create wrapper functions, to cater for the fact
+that VSR register elements are static bounded:
+
+```
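+    // which successive VSR the element falls into
+    // (integer division already rounds down for non-negative elt)
+    int calc_VSR_reg_offs(int elt, int width) {
+        switch (width) {
+            case 64: return elt / 2;
+            case 32: return elt / 4;
+            case 16: return elt / 8;
+            case 8 : return elt / 16;
+        }
+    }
+    // which element within that VSR
+    int calc_VSR_elt_offs(int elt, int width) {
+        switch (width) {
+            case 64: return (elt % 2);
+            case 32: return (elt % 4);
+            case 16: return (elt % 8);
+            case 8 : return (elt % 16);
+        }
+    }
+    void _set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
+        int new_elt = calc_VSR_elt_offs(elt, width);
+        int new_reg = calc_VSR_reg_offs(elt, width);
+        set_VSR_element(el, gpr+new_reg, new_elt, width);
+    }
+    // read counterpart of _set_VSR_element, needed by the loop that follows
+    void _get_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
+        int new_elt = calc_VSR_elt_offs(elt, width);
+        int new_reg = calc_VSR_reg_offs(elt, width);
+        get_VSR_element(el, gpr+new_reg, new_elt, width);
+    }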
+```
+
+And finally use these functions:
+
+```
+    # VSX-add RT, RA, RB using the "uint16_t" union member "hwords"
+    for i in range(VL):
+        el_reg_t result, ra, rb;
+        _get_VSR_element(&ra, RA, i, 16);
+        _get_VSR_element(&rb, RB, i, 16);
+        result.hwords[0] = ra.hwords[0] + rb.hwords[0]; // use array 0 elements
+        _set_VSR_element(&result, RT, i, 16);
+```
+
+## Scalar Identity Behaviour
+
+SVP64 is designed so that when the prefix is all zeros, and VL=1, no
+effect or influence occurs (no augmentation) such that all standard Power
+ISA v3.0/v3.1 instructions covered by the prefix are "unaltered". This
+is termed `scalar identity behaviour` (based on the mathematical
+definition for "identity", as in, "identity matrix" or better "identity
+transformation").
+
+Note that this is completely different from when VL=0. VL=0 turns all
+operations under its influence into `nops` (regardless of the prefix)
+whereas when VL=1 and the SV prefix is all zeros, the operation simply
+acts as if SV had not been applied at all to the instruction (an
+"identity transformation").
+
+The fact that `VL` is dynamic and can be set to any value at runtime based
+on program conditions and behaviour means very specifically that
+`scalar identity behaviour` is **not** a redundant encoding. If the
+only means by which VL could be set was by way of static-compiled
+immediates then this assertion would be false. VL should not
+be confused with MAXVL when understanding this key aspect of Simple-V.
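
The distinction between `VL=0` and `VL=1` with an all-zero prefix can be sketched as follows. This is a deliberately tiny model covering only those two cases; the function names `scalar_add` and `zero_prefix_add` are hypothetical, and larger values of `VL` are intentionally left unmodelled.

```python
MASK64 = (1 << 64) - 1


def scalar_add(gpr, RT, RA, RB):
    """Plain unprefixed scalar add."""
    gpr[RT] = (gpr[RA] + gpr[RB]) & MASK64


def zero_prefix_add(gpr, RT, RA, RB, VL):
    """Add with an all-zero SVP64 prefix, modelled for VL=0 and VL=1 only."""
    if VL == 0:
        return                      # the operation becomes a nop
    if VL == 1:
        scalar_add(gpr, RT, RA, RB)  # identical to the unprefixed operation
        return
    raise NotImplementedError("only VL=0 and VL=1 are modelled in this sketch")


gpr = [0] * 128
gpr[1], gpr[2] = 5, 7
zero_prefix_add(gpr, RT=3, RA=1, RB=2, VL=0)
after_vl0 = gpr[3]
zero_prefix_add(gpr, RT=3, RA=1, RB=2, VL=1)
after_vl1 = gpr[3]
```

Because `VL` is a runtime value, the same prefixed instruction word may behave as a nop on one pass and as the plain scalar operation on the next, which is why the all-zero prefix is not a redundant encoding.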
+
+## Register Naming and size
+
+As indicated above SV Registers are simply the GPR, FPR and CR
+register files extended linearly to larger sizes; SV Vectorisation
+iterates sequentially through these registers (LSB0 sequential ordering
+from 0 to VL-1).
+
+Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is
+r0 to r31, SV extends this as r0 to r127. Likewise FP registers are
+extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries,
+CR0 thru CR127.
The names of the registers therefore reflects a simple linear extension
of the Power ISA v3.0B / v3.1B register naming, and in hardware this
would be reflected by a linear increase in the size of the underlying
SRAM used for the regfiles.
-Note: when an EXTRA field (defined below) is zero, SV is deliberately designed
-so that the register fields are identical to as if SV was not in effect
-i.e. under these circumstances (EXTRA=0) the register field names RA,
-RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers. This is part of
-`scalar identity behaviour` described above.
+Note: when an EXTRA field (defined below) is zero, SV is deliberately
+designed so that the register fields are identical to as if SV was not in
+effect i.e. under these circumstances (EXTRA=0) the register field names
+RA, RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers.
+This is part of `scalar identity behaviour` described above.
+
+**Condition Register(s)**
+
+The Scalar Power ISA Condition Register is a 64 bit register where the top
+32 MSBs (numbered 0:31 in MSB0 numbering) are not used. This convention is
+*preserved* in SVP64, and an additional 15 Condition Registers are
+provided in order to store the new CR Fields, CR8-CR15, CR16-CR23 etc.
+sequentially. The top 32 MSBs in each new SVP64 Condition Register are
+*also* not used: only the bottom 32 bits (numbered 32:63 in MSB0
+numbering) are used.
+
+*Programmer's note: because the top 32 MSBs of each Condition Register
+are zero, using `sv.mfcr` without element-width overrides effectively
+doubles the number of GPR registers required to hold all 128 CR Fields.
+This would seem to be the only option, because normally elwidth overrides
+would halve the capacity of the instruction. However in this case it
+is possible to use destination element-width overrides for `sv.mfcr`
+(source overrides would be used on the GPR of `sv.mtocrf`), whereupon
+truncation of the 64-bit Condition Register(s) occurs, throwing away
+the zeros and storing the remaining (valid, desired) 32-bit values
+sequentially into (LSB0-convention) lower-numbered and upper-numbered
+halves of GPRs respectively. The programmer is expected to be aware
+however that the full width of the entire 64-bit Condition Register
+is considered to be "an element". This is **not** like any other
+Condition-Register instruction because all other CR instructions,
+on closer investigation, will be observed to be CR-bit or CR-Field
+related. Thus a `VL` of 16 must be used.*
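
The sequential placement of CR Fields can be sketched as an address calculation. The sketch assumes that each additional Condition Register lays out its eight fields exactly as the Scalar CR does (the first field at MSB0 bits 32:35); the function names `cr_field_location` and `read_cr_field` are hypothetical.

```python
def cr_field_location(n):
    """Locate CR Field n (0..127) under the assumptions stated above.

    Returns (register index, MSB0 start bit, MSB0 end bit, LSB0 shift).
    """
    reg = n // 8                   # eight 4-bit fields per Condition Register
    idx = n % 8
    msb0_start = 32 + 4 * idx      # only the bottom 32 bits (32:63) are used
    msb0_end = msb0_start + 3
    lsb0_shift = 63 - msb0_end
    return reg, msb0_start, msb0_end, lsb0_shift


def read_cr_field(cr_regs, n):
    """Read the 4-bit CR Field n from a list of 64-bit Condition Registers."""
    reg, _, _, shift = cr_field_location(n)
    return (cr_regs[reg] >> shift) & 0xF


cr_regs = [0] * 16                 # 128 CR Fields / 8 per register
cr_regs[1] = 0xA << 28             # place 0b1010 into CR8
cr8 = read_cr_field(cr_regs, 8)
```

The sixteen entries of `cr_regs` correspond to the sixteen 64-bit elements that a `VL` of 16 covers.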
## Future expansion.
With the way that EXTRA fields are defined and applied to register fields,
-future versions of SV may involve 256 or greater registers. Backwards binary compatibility may be achieved with a PCR bit (Program Compatibility Register). Further discussion is out of scope for this version of SVP64.
+future versions of SV may involve 256 or greater registers. Backwards
+binary compatibility may be achieved with a PCR bit (Program Compatibility
+Register) or an MSR bit analogous to SF.
+Further discussion is out of scope for this version of SVP64.
+
+Additionally, a future variant of SVP64 will be applied to the Scalar
+(Quad-precision and 128-bit) VSX instructions. Element-width overrides
+are an opportunity to expand a future version of the Power ISA
+to 256-bit, 512-bit and
+1024-bit operations, as well as doubling or quadrupling the number
+of VSX registers to 128 or 256. Again further discussion is out of
+scope for this version of SVP64.
+
+--------
+
+\newpage{}
+
+# New 64-bit Instruction Encoding spaces
+
+The following seven new areas are defined within Primary Opcode 9 (EXT009)
+as a new 64-bit encoding space, alongside Primary Opcode 1
+(EXT1xx).
+
+| 0-5 | 6 | 7 | 8-31 | 32| Description |
+|-----|---|---|-------|---|------------------------------------|
+| PO | 0 | x | xxxx | 0 | `RESERVED2` (57-bit) |
+| PO | 0 | 0 | !zero | 1 | SVP64Single:EXT232-263, or `RESERVED3` |
+| PO | 0 | 0 | 0000 | 1 | Scalar EXT232-263 |
+| PO | 0 | 1 | nnnn | 1 | SVP64:EXT232-263 |
+| PO | 1 | 0 | 0000 | x | `RESERVED1` (32-bit) |
+| PO | 1 | 0 | !zero | n | SVP64Single:EXT000-063 or `RESERVED4` |
+| PO | 1 | 1 | nnnn | n | SVP64:EXT000-063 |
+
+Note that for the future SVP64Single Encoding (currently RESERVED3 and 4)
+it is prohibited to have bits 8-31 be zero, unlike for SVP64 Vector space,
+for which bits 8-31 can be zero (termed `scalar identity behaviour`). This
+prohibition allows SVP64Single to share its Encoding space with Scalar
+EXT232-263 and Scalar EXT300-363.
+
+Also note that RESERVED1 and 2 are candidates for future Major opcode
+areas EXT200-231 and EXT300-363 respectively, however as RESERVED areas
+they may equally be allocated entirely differently.
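
The seven areas in the table above can be told apart with a handful of bit tests. The following is a sketch only, treating a 64-bit instruction as a Python integer whose most significant bit is MSB0 bit 0; the helper names `bits`, `make_insn` and `classify_po9` are hypothetical.

```python
def bits(insn, start, end):
    """Inclusive MSB0 bit range start:end of a 64-bit instruction."""
    return (insn >> (63 - end)) & ((1 << (end - start + 1)) - 1)


def make_insn(b6, b7, rm, suffix):
    """Assemble a Primary Opcode 9 prefix plus a 32-bit suffix word."""
    return (9 << 58) | (b6 << 57) | (b7 << 56) | (rm << 32) | suffix


def classify_po9(insn):
    """Return the name of the encoding area, following the table rows."""
    if bits(insn, 0, 5) != 9:
        return "not Primary Opcode 9"
    b6, b7 = bits(insn, 6, 6), bits(insn, 7, 7)
    rm, b32 = bits(insn, 8, 31), bits(insn, 32, 32)
    if b6 == 0:
        if b32 == 0:
            return "RESERVED2"
        if b7 == 1:
            return "SVP64:EXT232-263"
        if rm == 0:
            return "Scalar EXT232-263"
        return "SVP64Single:EXT232-263 or RESERVED3"
    if b7 == 1:
        return "SVP64:EXT000-063"
    if rm == 0:
        return "RESERVED1"
    return "SVP64Single:EXT000-063 or RESERVED4"
```

Note that for the SVP64 Vector areas bits 8-31 are not examined at all, which is what allows them to be all zeros.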
+
+*Architectural Resource Allocation Note: **under no circumstances** must
+different Defined Words be allocated within any `EXT{z}` prefixed
+or unprefixed space for a given value of `z`. Even if UnVectoriseable
+an instruction Defined Word space must have the exact same Instruction
+and exact same Instruction Encoding in all spaces (including
+being RESERVED if UnVectoriseable) or not be allocated at all.
+This is required as an inviolate hard rule governing Primary Opcode 9
+that may not be revoked under any circumstances. A useful way to think
+of this is that the Prefix Encoding is, like the 8086 REP instruction,
+an independent 32-bit Defined Word. The only semi-exceptions are
+the Post-Increment Mode of LD/ST-Update and Vectorised Branch-Conditional.*
+
+Encoding spaces and their potential are illustrated:
+
+| Encoding | Available bits | Scalar | Vectoriseable | SVP64Single |
+|----------|----------------|--------|---------------|--------------|
+|EXT000-063| 32 | yes | yes |yes |
+|EXT100-163| 64 | yes | no |no |
+|RESERVED2 | 57 | N/A |not applicable |not applicable|
+|EXT232-263| 32 | yes | yes |yes |
+|RESERVED1 | 32 | N/A | no |no |
+
+Notes:
+
+* Prefixed-Prefixed (96-bit) instructions are prohibited. EXT1xx is
+ thus inherently UnVectoriseable: stacking a 32-bit EXT1xx prefix and
+ a 32-bit SVP64 prefix on top of a 32-bit Defined Word makes the
+ complexity at the Decoder too great for High Performance Multi-Issue
+ systems.
+* RESERVED2 remains unallocated at present and therefore its
+ potential is not yet defined (Not Applicable).
+* RESERVED1 is also unallocated at present, but it is known in advance
+ that the area is UnVectoriseable and also cannot be Prefixed with
+ SVP64Single.
+* Considerable care is needed both on Architectural Resource Allocation
+ as well as instruction design itself. Once an instruction is allocated
+ in an UnVectoriseable area it can never be Vectorised without providing
+ an entirely new Encoding.
# Remapped Encoding (`RM[0:23]`)
-To allow relatively easy remapping of which portions of the Prefix Opcode
-Map are used for SVP64 without needing to rewrite a large portion of the
-SVP64 spec, a mapping is defined from the OpenPower v3.1 prefix bits to
-a new 24-bit Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]`
-at the LSB.
+In the SVP64 Vector Prefix spaces, the 24 bits 8-31 are termed `RM`. Bits
+32-37 are the Primary Opcode of the Suffix "Defined Word". 38-63 are the
+remainder of the Defined Word. Note that for the new EXT232-263 SVP64
+area it is mandatory that bit 32 is set to 1.
-The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
-is defined in the Prefix Fields section.
+| 0-5 | 6 | 7 | 8-31 | 32-37 | 38-63 |Description |
+|-----|---|---|----------|--------|----------|-----------------------|
+| PO | 0 | 1 | RM[0:23] | 1nnnnn | xxxxxxxx | SVP64:EXT232-263 |
+| PO | 1 | 1 | RM[0:23] | nnnnnn | xxxxxxxx | SVP64:EXT000-063 |
-## Prefix Opcode Map (64-bit instruction encoding)
-
-In the original table in the v3.1B Power ISA Spec on p1350, Table 12, prefix bits 6:11 are shown, with their allocations to different v3.1B pregix "modes".
-
-The table below hows both PowerISA v3.1 instructions as well as new SVP instructions fit;
-empty spaces are yet-to-be-allocated Illegal Instructions.
-
-| 6:11 | ---000 | ---001 | ---010 | ---011 | ---100 | ---101 | ---110 | ---111 |
-|------|--------|--------|--------|--------|--------|--------|--------|--------|
-|000---| 8LS | 8LS | 8LS | 8LS | 8LS | 8LS | 8LS | 8LS |
-|001---| | | | | | | | |
-|010---| 8RR | | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-|011---| | | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-|100---| MLS | MLS | MLS | MLS | MLS | MLS | MLS | MLS |
-|101---| | | | | | | | |
-|110---| MRR | | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-|111---| | MMIRR | | | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
-
-Note that by taking up a block of 16, where in every case bits 7 and 9 are set, this allows svp64 to utilise four bits of the v3.1B Prefix space and "allocate" them to svp64's Remapped Encoding field, instead.
-
-## Prefix Fields
-
-To "activate" svp64 (in a way that does not conflict with v3.1B 64 bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
-(see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
-This is achieved by setting bits 7 and 9 to 1:
-
-| Name | Bits | Value | Description |
-|------------|---------|-------|--------------------------------|
-| EXT01 | `0:5` | `1` | Indicates Prefixed 64-bit |
-| `RM[0]` | `6` | | Bit 0 of Remapped Encoding |
-| SVP64_7 | `7` | `1` | Indicates this is SVP64 |
-| `RM[1]` | `8` | | Bit 1 of Remapped Encoding |
-| SVP64_9 | `9` | `1` | Indicates this is SVP64 |
-| `RM[2:23]` | `10:31` | | Bits 2-23 of Remapped Encoding |
-
-Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
-are constructed:
-
-| 0:5 | 6 | 7 | 8 | 9 | 10:31 |
-|--------|-------|---|-------|---|----------|
-| EXT01 | RM | 1 | RM | 1 | RM |
-| 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
-
-Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1
-instruction. That instruction becomes "prefixed" with the SVP context: the
-Remapped Encoding field (RM).
-
-It is important to note that unlike v3.1 64-bit prefixed instructions
+It is important to note that unlike EXT1xx 64-bit prefixed instructions
there is insufficient space in `RM` to provide identification of
-any SVP64 Fields without first partially decoding the
-32-bit suffix. Similar to the "Forms" (X-Form, D-Form) the
-`RM` format is individually associated with every instruction.
+any SVP64 Fields without first partially decoding the 32-bit suffix.
+Similar to the "Forms" (X-Form, D-Form) the `RM` format is individually
+associated with every instruction. However this still does not adversely
+affect Multi-Issue Decoding because the identification of the *length*
+of anything in the 64-bit space has been kept brutally simple (EXT009),
+and further decoding of any number of 64-bit Encodings in parallel at
+that point is fully independent.
-Extreme caution and care must therefore be taken
-when extending SVP64 in future, to not create unnecessary relationships
-between prefix and suffix that could complicate decoding, adding latency.
+Extreme caution and care must be taken when extending SVP64
+in future, to not create unnecessary relationships between prefix and
+suffix that could complicate decoding, adding latency.
-# Common RM fields
+## Common RM fields
The following fields are common to all Remapped Encodings:

| Field Name | Field bits | Description                            |
|------------|------------|----------------------------------------|
| MASKMODE | `0` | Execution (predication) Mask Kind |
| MASK | `1:3` | Execution Mask |
-| SUBVL | `8:9` | Sub-vector length |
+| SUBVL | `8:9` | Sub-vector length |
The following fields are optional or encoded differently depending
on context after decoding of the Scalar suffix:

| Field Name | Field bits | Description                            |
|------------|------------|----------------------------------------|
| ELWIDTH | `4:5` | Element Width |
| ELWIDTH_SRC | `6:7` | Element Width for Source |
-| EXTRA | `10:18` | Register Extra encoding |
+| EXTRA | `10:18` | Register Extra encoding |
| MODE | `19:23` | changes Vector behaviour |
-* MODE changes the behaviour of the SV operation (result saturation, mapreduce)
-* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D and Audio/Video DSP work
-* ELWIDTH and ELWIDTH_SRC overrides the instruction's destination and source operand width
-* MASK (and MASK_SRC) and MASKMODE provide predication (two types of sources: scalar INT and Vector CR).
-* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category for the instruction, which is determined only by decoding the Scalar 32 bit suffix.
-
-Similar to Power ISA `X-Form` etc. EXTRA bits are given designations, such as `RM-1P-3S1D` which indicates for this example that the operation is to be single-predicated and that there are 3 source operand EXTRA tags and one destination operand tag.
-
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
-
-# Mode
-
-Mode is an augmentation of SV behaviour. Different types of
-instructions have different needs, similar to Power ISA
-v3.1 64 bit prefix 8LS and MTRR formats apply to different
-instruction types. Modes include Reduction, Iteration, arithmetic
-saturation, and Fail-First. More specific details in each
-section and in the [[svp64/appendix]]
+* MODE changes the behaviour of the SV operation (result saturation,
+ mapreduce)
+* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D
+ and Audio/Video DSP work
+* ELWIDTH and ELWIDTH_SRC override the instruction's destination and
+ source operand width
+* MASK (and MASK_SRC) and MASKMODE provide predication (two types of
+ sources: scalar INT and Vector CR).
+* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category
+ for the instruction, which is determined only by decoding the Scalar 32
+ bit suffix.
+
+Similar to Power ISA `X-Form` etc., EXTRA bits are given designations
+such as `RM-1P-3S1D`, which indicates in this example that the operation
+is to be single-predicated and that there are 3 source operand EXTRA
+tags and one destination operand tag.
+
+Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
+or increased latency in some implementations due to lane-crossing.
+
+## Mode
+
+Mode is an augmentation of SV behaviour. Different types of instructions
+have different needs, similar to the way that Power ISA v3.1 64 bit prefix
+8LS and MTRR formats apply to different instruction types. Modes include
+Reduction, Iteration, arithmetic saturation, and Fail-First. More specific
+details are given in each section and in the [[svp64/appendix]]:
* For condition register operations see [[sv/cr_ops]]
* For LD/ST Modes, see [[sv/ldst]].
* For Branch modes, see [[sv/branches]]
* For arithmetic and logical, see [[sv/normal]]
-# ELWIDTH Encoding
+## ELWIDTH Encoding
-Default behaviour is set to 0b00 so that zeros follow the convention of
-`scalar identity behaviour`. In this case it means that elwidth overrides
-are not applicable. Thus if a 32 bit instruction operates on 32 bit,
-`elwidth=0b00` specifies that this behaviour is unmodified. Likewise
-when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00`
-states that, again, the behaviour is not to be modified.
+Default behaviour is set to 0b00 so that zeros follow the convention
+of `scalar identity behaviour`. In this case it means that elwidth
+overrides are not applicable. Thus if a 32 bit instruction operates
+on 32 bit, `elwidth=0b00` specifies that this behaviour is unmodified.
+Likewise when a processor is switched from 64 bit to 32 bit mode,
+`elwidth=0b00` states that, again, the behaviour is not to be modified.
Only when elwidth is nonzero is the element width overridden to the
explicitly required value.
-## Elwidth for Integers:
+### Elwidth for Integers:
| Value | Mnemonic | Description |
|-------|----------------|------------------------------------|
This encoding is chosen such that the width in bits may be computed as
`8<<(3-ew)`
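
As a minimal sketch of the formula above: for `ew` in 0..3 the expression `8<<(3-ew)` evaluates to 64, 32, 16 and 8, which are widths in bits. The function name `elwidth_bits` is hypothetical, and treating `ew=0` (DEFAULT) as a nominal 64 bits is an assumption made purely for illustration, since DEFAULT actually means "no override".

```python
def elwidth_bits(ew: int) -> int:
    """Evaluate 8 << (3 - ew) for a 2-bit elwidth field value.

    Illustrative helper only (not part of the specification).
    ew=0 is DEFAULT (no override); it is shown here as 64 bits
    under the assumption of a 64-bit operation.
    """
    if not 0 <= ew <= 3:
        raise ValueError("elwidth is a 2-bit field (0..3)")
    return 8 << (3 - ew)


widths = [elwidth_bits(ew) for ew in range(4)]
```

Dividing each result by 8 gives the corresponding size in bytes (8, 4, 2, 1).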
-## Elwidth for FP Registers:
+### Elwidth for FP Registers:
| Value | Mnemonic | Description |
|-------|----------------|------------------------------------|
[`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
is reserved for a future implementation of SV
-Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`) shall
-perform its operation at **half** the ELWIDTH then padded back out
-to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
+Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`)
+shall perform its operation at **half** the ELWIDTH then padded back out
+to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation
+that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
then padded back out to fit in IEEE754 FP64, exactly as for Scalar
-v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16
-or ELWIDTH=bf16 is reserved and must raise an illegal instruction
-(IEEE754 FP8 or BF8 are not defined).
+v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16 or
+ELWIDTH=bf16 is reserved and must raise an illegal instruction (IEEE754
+FP8 or BF8 are not defined).
-## Elwidth for CRs:
+### Elwidth for CRs (no meaning)
Element-width overrides for CR Fields have no meaning. The bits
are therefore used for other purposes, or when Rc=1, the Elwidth
applies to the result being tested (a GPR or FPR), but not to the
Vector of CR Fields.
-# SUBVL Encoding
+## SUBVL Encoding
-the default for SUBVL is 1 and its encoding is 0b00 to indicate that
+The default for SUBVL is 1 and its encoding is 0b00 to indicate that
SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
lines up in combination with all other "default is all zeros" behaviour.
sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
this may be considered to be elements 0b00 to 0b01 inclusive.
-# MASK/MASK_SRC & MASKMODE Encoding
-
-TODO: rename MASK_KIND to MASKMODE
+## MASK/MASK_SRC & MASKMODE Encoding
One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
types may not be mixed.
-Special note: to disable predication this field must
-be set to zero in combination with Integer Predication also being set
-to 0b000. this has the effect of enabling "all 1s" in the predicate
-mask, which is equivalent to "not having any predication at all"
-and consequently, in combination with all other default zeros, fully
-disables SV (`scalar identity behaviour`).
+Special note: to disable predication this field must be set to zero in
+combination with Integer Predication also being set to 0b000. This has the
+effect of enabling "all 1s" in the predicate mask, which is equivalent to
+"not having any predication at all".
`MASKMODE` may be set to one of 2 values:
| 1 | MASK/MASK_SRC are encoded using CR-based Predication |
Integer Twin predication has a second set of 3 bits that uses the same
-encoding thus allowing either the same register (r3, r10 or r31) to be used
-for both src and dest, or different regs (one for src, one for dest).
+encoding thus allowing either the same register (r3, r10 or r31) to be
+used for both src and dest, or different regs (one for src, one for dest).
Likewise CR based twin predication has a second set of 3 bits, allowing
a different test to be applied.
-Note that it is assumed that Predicate Masks (whether INT or CR)
-are read *before* the operations proceed. In practice (for CR Fields)
-this creates an unnecessary block on parallelism. Therefore,
-it is up to the programmer to ensure that the CR fields used as
-Predicate Masks are not being written to by any parallel Vector Loop.
-Doing so results in **UNDEFINED** behaviour, according to the definition
-outlined in the Power ISA v3.0B Specification.
+Note that it is assumed that Predicate Masks (whether INT or CR) are
+read *before* the operations proceed. In practice (for CR Fields)
+this creates an unnecessary block on parallelism. Therefore, it is up
+to the programmer to ensure that the CR fields used as Predicate Masks
+are not being written to by any parallel Vector Loop. Doing so results
+in **UNDEFINED** behaviour, according to the definition outlined in the
+Power ISA v3.0B Specification.
Hardware Implementations are therefore free and clear to delay reading
of individual CR fields until the actual predicated element operation
-needs to take place, safe in the knowledge that no programmer will
-have issued a Vector Instruction where previous elements could have
-overwritten (destroyed) not-yet-executed CR-Predicated element operations.
+needs to take place, safe in the knowledge that no programmer will have
+issued a Vector Instruction where previous elements could have overwritten
+(destroyed) not-yet-executed CR-Predicated element operations.
-## Integer Predication (MASKMODE=0)
+### Integer Predication (MASKMODE=0)
When the predicate mode bit is zero the 3 bits are interpreted as below.
Twin predication has an identical 3 bit field similarly encoded.
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
+`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the
+following meaning:
| Value | Mnemonic | Element `i` enabled if: |
|-------|----------|------------------------------|
| 110 | R30 | `R30 & (1 << i)` is non-zero |
| 111 | ~R30 | `R30 & (1 << i)` is zero |
-r10 and r30 are at the high end of temporary and unused registers, so as not to interfere with register allocation from ABIs.
+r10 and r30 are at the high end of temporary and unused registers,
+so as not to interfere with register allocation from ABIs.
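
The per-element test in the Integer Predication table can be sketched as follows. This is an illustration only: the helper names are hypothetical, the register contents are passed in as a plain integer, and `vl` stands for the current Vector Length. The `inverted` flag models the `~` variants of the mnemonics (e.g. `~R30`).

```python
from typing import List


def int_predicate_enabled(reg_value: int, i: int, inverted: bool = False) -> bool:
    """Element i is enabled if bit i of the predicate register is set
    (or clear, for the inverted '~' variants)."""
    bit_set = (reg_value & (1 << i)) != 0
    return (not bit_set) if inverted else bit_set


def enabled_elements(reg_value: int, vl: int, inverted: bool = False) -> List[int]:
    """List the element indices (0..vl-1) that the predicate enables."""
    return [i for i in range(vl) if int_predicate_enabled(reg_value, i, inverted)]
```

For example, with a predicate register holding `0b1010` and `vl=4`, the non-inverted form enables elements 1 and 3, and the inverted form enables elements 0 and 2.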
-## CR-based Predication (MASKMODE=1)
+### CR-based Predication (MASKMODE=1)
When the predicate mode bit is one the 3 bits are interpreted as below.
Twin predication has an identical 3 bit field similarly encoded.
-`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the following meaning:
+`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the
+following meaning:
| Value | Mnemonic | Element `i` is enabled if |
|-------|----------|--------------------------|
| 110 | so/un | `CR[offs+i].FU` is set |
| 111 | ns/nu | `CR[offs+i].FU` is clear |
-CR based predication. TODO: select alternate CR for twin predication? see
-[[discussion]] Overlap of the two CR based predicates must be taken
-into account, so the starting point for one of them must be suitably
-high, or accept that for twin predication VL must not exceed the range
-where overlap will occur, *or* that they use the same starting point
-but select different *bits* of the same CRs
+`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised
+Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
-`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
+The CR Predicates chosen must start on a boundary that Vectorised CR
+operations can access cleanly, in full. With EXTRA2 restricting starting
+points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and
+CR Predicate Masks have to be adapted to fit on these boundaries as well.
-The CR Predicates chosen must start on a boundary that Vectorised
-CR operations can access cleanly, in full.
-With EXTRA2 restricting starting points
-to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and CR Predicate
-Masks have to be adapted to fit on these boundaries as well.
+## Extra Remapped Encoding <a name="extra_remap"> </a>
-# Extra Remapped Encoding <a name="extra_remap"> </a>
+Shows all instruction-specific fields in the Remapped Encoding
+`RM[10:18]` for all instruction variants. Note that due to the very
+tight space, the encoding mode is *not* included in the prefix itself.
+The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form)
+on a per-instruction basis, and, like "Forms" are given a designation
+(below) of the form `RM-nP-nSnD`. The full list of which instructions
+use which remaps is here [[opcode_regs_deduped]].
-Shows all instruction-specific fields in the Remapped Encoding `RM[10:18]` for all instruction variants. Note that due to the very tight space, the encoding mode is *not* included in the prefix itself. The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form) on a per-instruction basis, and, like "Forms" are given a designation (below) of the form `RM-nP-nSnD`. The full list of which instructions use which remaps is here [[opcode_regs_deduped]]. (*Machine-readable CSV files have been provided which will make the task of creating SV-aware ISA decoders easier*).
+**Please note the following**:
-These mappings are part of the SVP64 Specification in exactly the same
+```
+ Machine-readable CSV files have been autogenerated which will make the
+ task of creating SV-aware ISA decoders, documentation, assembler tools,
+ compiler tools, Simulators and all other aspects of SVP64 easier
+ and less prone to mistakes. Please avoid manual re-creation of
+ information from the written specification wording in this chapter,
+ and use the CSV files or use the Canonical tool which creates the CSV
+ files, named sv_analysis.py. The information contained within
+ sv_analysis.py is considered to be part of this Specification, even
+ encoded as it is in python3.
+```
+
+
+The mappings are part of the SVP64 Specification in exactly the same
way as X-Form, D-Form. New Scalar instructions added to the Power ISA
will need a corresponding SVP64 Mapping, which can be derived by-rote
from examining the Register "Profile" of the instruction.
-There are two categories: Single and Twin Predication.
-Due to space considerations further subdivision of Single Predication
-is based on whether the number of src operands is 2 or 3. With only
-9 bits available some compromises have to be made.
+There are two categories: Single and Twin Predication. Due to space
+considerations further subdivision of Single Predication is based on
+whether the number of src operands is 2 or 3. With only 9 bits available
+some compromises have to be made.
-* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
-* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
+* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand
+ instructions (fmadd, isel, madd).
+* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand
+ instructions (src1 src2 dest)
* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
-## RM-1P-3S1D
+### RM-1P-3S1D
| Field Name | Field bits | Description |
|------------|------------|----------------------------------------|
such as `maddedu` have an implicit second destination, RS, the
selection of which is determined by bit 18.
-## RM-1P-2S1D
+### RM-1P-2S1D
| Field Name | Field bits | Description |
|------------|------------|-------------------------------------------|
| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2 |
These are for 2 operand 1 dest instructions, such as `add RT, RA,
-RB`. However also included are unusual instructions with an implicit dest
-that is identical to its src reg, such as `rlwinmi`.
+RB`. However also included are unusual instructions with an implicit
+dest that is identical to its src reg, such as `rlwimi`.
-Normally, with instructions such as `rlwinmi`, the scalar v3.0B ISA would not have sufficient bit fields to allow
-an alternative destination. With SV however this becomes possible.
-Therefore, the fact that the dest is implicitly also a src should not
-mislead: due to the *prefix* they are different SV regs.
+Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would
+not have sufficient bit fields to allow an alternative destination.
+With SV however this becomes possible. Therefore, the fact that the
+dest is implicitly also a src should not mislead: due to the *prefix*
+they are different SV regs.
* `rlwimi RA, RS, ...`
* Rsrc1_EXTRA3 applies to RS as the first src
each may be *independently* made vector or scalar, and be independently
augmented to 7 bits in length.
-## RM-2P-1S1D/2S
+### RM-2P-1S1D/2S
| Field Name | Field bits | Description |
|------------|------------|----------------------------|
`RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
-## RM-1P-2S1D
+### RM-1P-2S1D
single-predicate, three registers (2 read, 1 write)
-
+
| Field Name | Field bits | Description |
|------------|------------|----------------------------|
| Rdest_EXTRA3 | `10:12` | extends Rdest |
| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 |
| Rsrc2_EXTRA3 | `16:18` | extends Rsrc2 |
-## RM-2P-2S1D/1S2D/3S
+### RM-2P-2S1D/1S2D/3S
The primary purpose for this encoding is for Twin Predication on LOAD
and STORE operations. See [[sv/ldst]] for detailed analysis.
-RM-2P-2S1D:
+**RM-2P-2S1D:**
| Field Name | Field bits | Description |
|------------|------------|----------------------------|
| Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
| MASK_SRC | `16:18` | Execution Mask for Source |
-Note that for 1S2P the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
+**RM-2P-1S2D:**
+
+For RM-2P-1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
is in bits 10:11, Rdest1_EXTRA2 in 12:13)
-Also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
+| Field Name | Field bits | Description |
+|------------|------------|----------------------------|
+| Rsrc2_EXTRA2 | `10:11` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
+| Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
+| Rdest_EXTRA2 | `14:15` | extends Rdest (R\*\_EXTRA2 Encoding) |
+| MASK_SRC | `16:18` | Execution Mask for Source |
-Note also that LD with update indexed, which takes 2 src and 2 dest
-(e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
-Twin Predication. therefore these are treated as RM-2P-2S1D and the
-src spec for RA is also used for the same RA as a dest.
+**RM-2P-3S:**
-Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing.
+For RM-2P-3S (to cover `stdx` etc.) the names are switched to 3 src:
+Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
-# R\*\_EXTRA2/3
+| Field Name | Field bits | Description |
+|------------|------------|----------------------------|
+| Rsrc1_EXTRA2 | `10:11` | extends Rsrc1 (R\*\_EXTRA2 Encoding) |
+| Rsrc2_EXTRA2 | `12:13` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
+| Rsrc3_EXTRA2 | `14:15` | extends Rsrc3 (R\*\_EXTRA2 Encoding) |
+| MASK_SRC | `16:18` | Execution Mask for Source |
+
+Note also that LD with update indexed, which takes 2 src and
+creates 2 dest registers (e.g. `lhaux RT,RA,RB`), does not have room
+for 4 registers and also Twin Predication. Therefore these are treated as
+RM-2P-2S1D and the src spec for RA is also used for the same RA as a dest.
+
+Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
+or increased latency in some implementations due to lane-crossing.
+
+## R\*\_EXTRA2/3
EXTRA is the means by which two things are achieved:
The register files are therefore extended:
-* INT is extended from r0-31 to r0-127
-* FP is extended from fp0-32 to fp0-fp127
+* INT (GPR) is extended from r0-31 to r0-127
+* FP (FPR) is extended from fp0-fp31 to fp0-fp127
* CR Fields are extended from CR0-7 to CR0-127
However due to pressure in `RM.EXTRA` not all these registers
a large number of operands (`madd`, `isel`).
In the following tables register numbers are constructed from the
-standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2
-or EXTRA3 field from the SV Prefix, determined by the specific
-RM-xx-yyyy designation for a given instruction.
-The prefixing is arranged so that
+standard v3.0B / v3.1B 32 bit register field (RA, FRA) and the EXTRA2 or
+EXTRA3 field from the SV Prefix, determined by the specific RM-xx-yyyy
+designation for a given instruction. The prefixing is arranged so that
interoperability between prefixing and nonprefixing of scalar registers
is direct and convenient (when the EXTRA field is all zeros).
-A pseudocode algorithm explains the relationship, for INT/FP (see [[svp64/appendix]] for CRs)
+A pseudocode algorithm explains the relationship, for INT/FP (see
+[[svp64/appendix]] for CRs)
+```
if extra3_mode:
spec = EXTRA3
else:
return (RA << 2) | spec[1:2]
else: # scalar
return (spec[1:2] << 5) | RA
+```
Future versions may extend to 256 by shifting Vector numbering up.
Scalar will not be altered.
Note that in some cases the range of starting points for Vectors
-is limited.
+is limited.
-## INT/FP EXTRA3
+### INT/FP EXTRA3
-If EXTRA3 is zero, maps to
-"scalar identity" (scalar Power ISA field naming).
+If EXTRA3 is zero, the register maps to "scalar identity" (scalar Power
+ISA field naming).
Fields are as follows:
* MSB..LSB: the bit field showing how the register opcode field
combines with EXTRA to give (extend) the register number (GPR)
-| Value | Mode | Range/Inc | 6..0 |
+Encoding shown in LSB0: MSB down to LSB (MSB 6..0 LSB)
+
+| Value | Mode | Range/Inc | 6..0 |
|-----------|-------|---------------|---------------------|
| 000 | Scalar | `r0-r31`/1 | `0b00 RA` |
| 001 | Scalar | `r32-r63`/1 | `0b01 RA` |
| 110 | Vector | `r2-r126`/4 | `RA 0b10` |
| 111 | Vector | `r3-r127`/4 | `RA 0b11` |
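
The EXTRA3 table above can be sketched in Python as follows. This is a hedged illustration, not the canonical decoder: the function name `decode_extra3` is hypothetical, `ra` is the 5-bit register field from the 32-bit suffix, and `extra3` is the 3-bit value from the "Value" column, with its most significant bit taken as the Vector flag and its remaining two bits as the extension.

```python
from typing import Tuple


def decode_extra3(ra: int, extra3: int) -> Tuple[bool, int]:
    """Return (is_vector, 7-bit register number) for INT/FP EXTRA3.

    Illustrative sketch: the top bit of extra3 selects Vector,
    the low two bits extend the 5-bit register field.
    """
    if not 0 <= ra < 32:
        raise ValueError("ra is a 5-bit field")
    if not 0 <= extra3 < 8:
        raise ValueError("extra3 is a 3-bit field")
    is_vector = bool(extra3 & 0b100)
    low2 = extra3 & 0b011
    if is_vector:
        # Vector: RA supplies the upper bits, EXTRA the lower two ("RA 0bNN")
        return True, (ra << 2) | low2
    # Scalar: EXTRA supplies the upper two bits ("0bNN RA")
    return False, (low2 << 5) | ra
```

With `extra3=0b000` the result equals `ra` unchanged, which is the "scalar identity" case. `decode_extra3(31, 0b110)` gives Vector register 126, matching the `r2-r126` row.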
-## INT/FP EXTRA2
+### INT/FP EXTRA2
-If EXTRA2 is zero will map to
-"scalar identity behaviour" i.e Scalar Power ISA register naming:
+If EXTRA2 is zero, the register maps to "scalar identity behaviour",
+i.e. Scalar Power ISA register naming:
-| Value | Mode | Range/inc | 6..0 |
-|-----------|-------|---------------|-----------|
+Encoding shown in LSB0: MSB down to LSB (MSB 6..0 LSB)
+
+| Value | Mode | Range/inc | 6..0 |
+|----------|-------|---------------|-----------|
| 00 | Scalar | `r0-r31`/1 | `0b00 RA` |
| 01 | Scalar | `r32-r63`/1 | `0b01 RA` |
| 10 | Vector | `r0-r124`/4 | `RA 0b00` |
as there are insufficient bits to cover the full range.
-## CR Field EXTRA3
+### CR Field EXTRA3
-CR Field encoding is essentially the same but made more complex due to CRs being bit-based. See [[svp64/appendix]] for explanation and pseudocode.
+CR Field encoding is essentially the same but made more complex due to CRs
+being bit-based, because the application of SVP64 element-numbering applies
+to the CR *Field* numbering not the CR register *bit* numbering.
Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
and Scalars may only go from `CR0, CR1, ... CR31`
-Encoding shown MSB down to LSB
+Encoding shown in LSB0: MSB down to LSB (MSB 8..5 4..2 1..0 LSB),
+BA ranges are in MSB0.
For a 5-bit operand (BA, BB, BT):
| Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
|-------|------|---------------|-----------| --------|---------|
-| 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
-| 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
-| 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[4:2] | BA[1:0] |
-| 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[4:2] | BA[1:0] |
-| 100 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
-| 101 | Vector | `CR4-CR116`/16 | BA[4:2] 0 | 0b100 | BA[1:0] |
-| 110 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
-| 111 | Vector | `CR12-CR124`/16 | BA[4:2] 1 | 0b100 | BA[1:0] |
+| 000 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[0:2] | BA[3:4] |
+| 001 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[0:2] | BA[3:4] |
+| 010 | Scalar | `CR16-CR23`/1 | 0b0010 | BA[0:2] | BA[3:4] |
+| 011 | Scalar | `CR24-CR31`/1 | 0b0011 | BA[0:2] | BA[3:4] |
+| 100 | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000 | BA[3:4] |
+| 101 | Vector | `CR4-CR116`/16 | BA[0:2] 0 | 0b100 | BA[3:4] |
+| 110 | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000 | BA[3:4] |
+| 111 | Vector | `CR12-CR124`/16 | BA[0:2] 1 | 0b100 | BA[3:4] |
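
A sketch of how the 5-bit operand table above could be evaluated, under stated assumptions: the function name `decode_cr_extra3` is hypothetical, `ba` is the 5-bit operand as an ordinary integer (so MSB0 `BA[0:2]` is its upper three bits and `BA[3:4]` its lower two), and the result is split into the CR Field number (columns 8..5 and 4..2) and the bit within that field (column 1..0).

```python
from typing import Tuple


def decode_cr_extra3(ba: int, extra3: int) -> Tuple[bool, int, int]:
    """Return (is_vector, cr_field_number, bit_in_field).

    Illustrative sketch of the CR Field EXTRA3 table for a
    5-bit operand (BA, BB, BT); not the canonical decoder.
    """
    if not 0 <= ba < 32 or not 0 <= extra3 < 8:
        raise ValueError("ba is 5 bits, extra3 is 3 bits")
    field = (ba >> 2) & 0b111   # BA[0:2] in MSB0: CR Field selector
    bit = ba & 0b11             # BA[3:4] in MSB0: bit within the field
    is_vector = bool(extra3 & 0b100)
    if is_vector:
        hi = (field << 1) | ((extra3 >> 1) & 1)  # column 8..5: "BA[0:2] x"
        mid = (extra3 & 1) << 2                  # column 4..2: 0b000 or 0b100
    else:
        hi = extra3 & 0b11                       # column 8..5: 0b00xx
        mid = field                              # column 4..2: BA[0:2]
    return is_vector, (hi << 3) | mid, bit
```

For instance `decode_cr_extra3(0b11100, 0b111)` yields Vector CR Field 124, the upper end of the `CR12-CR124` row, and `extra3=0b000` leaves the field number unchanged.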
For a 3-bit operand (e.g. BFA):
| 110 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
| 111 | Vector | `CR12-CR124`/16 | BFA 1 | 0b100 |
-## CR EXTRA2
+### CR EXTRA2
-CR encoding is essentially the same but made more complex due to CRs being bit-based. See separate section for explanation and pseudocode.
+CR encoding is essentially the same but made more complex due to CRs
+being bit-based, because the application of SVP64 element-numbering applies
+to the CR *Field* numbering not the CR register *bit* numbering.
Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...
-
-Encoding shown MSB down to LSB
+Encoding shown in LSB0: MSB down to LSB (MSB 8..5 4..2 1..0 LSB),
+BA ranges are in MSB0.
For a 5-bit operand (BA, BB, BC):
| Value | Mode | Range/Inc | 8..5 | 4..2 | 1..0 |
|-------|--------|----------------|---------|---------|---------|
-| 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[4:2] | BA[1:0] |
-| 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[4:2] | BA[1:0] |
-| 10 | Vector | `CR0-CR112`/16 | BA[4:2] 0 | 0b000 | BA[1:0] |
-| 11 | Vector | `CR8-CR120`/16 | BA[4:2] 1 | 0b000 | BA[1:0] |
+| 00 | Scalar | `CR0-CR7`/1 | 0b0000 | BA[0:2] | BA[3:4] |
+| 01 | Scalar | `CR8-CR15`/1 | 0b0001 | BA[0:2] | BA[3:4] |
+| 10 | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000 | BA[3:4] |
+| 11 | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000 | BA[3:4] |
For a 3-bit operand (e.g. BFA):
| 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
| 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
-# Appendix
+## Appendix
Now at its own page: [[svp64/appendix]]
+--------
+
+\newpage{}
+
+[[!tag standards]]
+