# SVP64 Zero-Overhead Loop Prefix Subsystem
+<!-- hide -->
* **DRAFT STATUS v0.1 18sep2021** Release notes <https://bugs.libre-soc.org/show_bug.cgi?id=699>
+<!-- show -->
-This document describes [[SV|sv]] augmentation of the [[Power|openpower]] v3.0B [[ISA|openpower/isa/]]. It is in Draft Status and
-will be submitted to the [[!wikipedia OpenPOWER_Foundation]] ISA WG
-via the External RFC Process.
+This document describes [[SV|sv]] augmentation of the [[Power|openpower]] v3.0B [[ISA|openpower/isa/]].
Credits and acknowledgements:
* NLnet Foundation, for funding
* OpenPOWER Foundation
* Paul Mackerras
+* Brad Frey
+* Cathy May
* Toshaan Bharvani
* IBM for the Power ISA itself
+<!-- hide -->
Links:
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001498.html>
* [[sv/branches]] chapter
* [[sv/ldst]] chapter
-
Table of contents
[[!toc]]
+<!-- show -->
## Introduction
-Simple-V is a type of Vectorisation best described as a "Prefix Loop
-Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and
-to the 8086 `REP` Prefix instruction. More advanced features are similar
-to the Z80 `CPIR` instruction. If naively viewed one-dimensionally as an
-actual Vector ISA it introduces over 1.5 million 64-bit True-Scalable
-Vector instructions on the SFFS Subset and closer to 10 million 64-bit
-True-Scalable Vector instructions if introduced on VSX. SVP64, the
-instruction format used by Simple-V, is therefore best viewed as an
-orthogonal RISC-paradigm "Prefixing" subsystem instead.
+Simple-V is a type of Vectorization best described as a "Prefix Loop
+Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR`[^bib_ldir] instruction and
+to the 8086 `REP`[^bib_rep] Prefix instruction. More advanced features are similar
+to the Z80 `CPIR`[^bib_cpir] instruction.
+
+[^bib_ldir]: [Zilog Z80 LDIR](http://z80-heaven.wikidot.com/instructions-set:ldir)
+[^bib_cpir]: [Zilog Z80 CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir)
+[^bib_rep]: [8086 REP](https://www.felixcloutier.com/x86/rep:repe:repz:repne:repnz)
Except where explicitly stated all bit numbers remain as in the rest of
the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on
which is a different convention from that used elsewhere in the Power ISA.
The SVP64 prefix always comes before the suffix in PC order and must be
-considered an independent "Defined word" that augments the behaviour of
-the following instruction, but does **not** change the actual Decoding
-of that following instruction. **All prefixed 32-bit instructions
-(Defined Words) retain their non-prefixed encoding and definition**.
+considered an independent "Defined Word-instruction"[^dwi] that augments the behaviour of
+the following instruction (also a Defined Word-instruction), but does **not** change the actual Decoding
+of that following instruction just because it is Prefixed. Unlike EXT100-163,
+where the Suffix is considered an entirely new Opcode Space,
+SVP64-Prefixed instructions must never be treated or regarded
+as a different Opcode Space.
+
+[^dwi]: Defined Word-instruction: Power ISA v3.1 Section 1.6
Two apparent exceptions to the above hard rule exist: SV
Branch-Conditional operations and LD/ST-update "Post-Increment"
Mode. Post-Increment was considered sufficiently high priority
(significantly reducing hot-loop instruction count) that one bit in
the Prefix is reserved for it (*Note the intention to release that bit
-and move Post-Increment instructions to EXT2xx, as part of [[ls011]]*).
-Vectorised Branch-Conditional operations "embed" the original Scalar
+and move Post-Increment instructions to EXT2xx, as part of [[sv/rfc/ls011]]*).
+Vectorized Branch-Conditional operations "embed" the original Scalar
Branch-Conditional behaviour into a much more advanced variant that is
highly suited to High-Performance Computation (HPC), Supercomputing,
and parallel GPU Workloads.
-*Architectural Resource Allocation note: it is prohibited to accept RFCs
-which fundamentally violate this hard requirement. Under no circumstances
-must the Suffix space have an alternate instruction encoding allocated
-within SVP64 that is entirely different from the non-prefixed Defined
-Word. Hardware Implementors critically rely on this inviolate guarantee
-to implement High-Performance Multi-Issue micro-architectures that can
-sustain 100% throughput*
+*Architectural Resource Allocation note: at present it is possible to perform
+partial parallel decode of the SVP64 24-bit Encoding Area at the same time
+as decoding of the Suffix. Multi-Issue Implementations may even
+Decode multiple 32-bit words in parallel and follow up with a second
+cycle of joining Prefix and Suffix "after-the-fact".
+Mixing and overlaying 64-bit Opcode Encodings into the
+{SVP64 24-bit Prefix}{Defined Word-instruction} space creates
+a hard dependency that catastrophically damages Multi-Issue Decoding by
+greatly complexifying Parallel Instruction-Length Detection.
+Therefore it has to be prohibited to accept RFCs
+which fundamentally violate the following hard requirement: **under no circumstances**
+must the use of SVP64 24-bit Prefixes **also** imply a different Opcode space
+from **any** non-prefixed Word. Even RESERVED or Illegal Words must be
+Orthogonal.*
Subset implementations in hardware are permitted, as long as certain
rules are followed, allowing for full soft-emulation including future
interoperability expectations within certain environments. Details in
the [[svp64/appendix]].
-## Strict Program Order
-
-Many Vector ISAs allow interrupts to occur in the middle of
-processing of large Vector operations, only under the condition
-that continuation on return will restart the entire operation.
-The reason is that saving of full Architectural State is
-not practical.
-
-Simple-V operates on an entirely different paradigm from traditional
-Vector ISAs: as a Sub-Program Counter where "Elements" are synonymous
-with Scalar instructions. With this in mind it is critical for
-implementations to observe Strict Element-Level Program Order
-at all times. *Any* element is Interruptible and Simple-V has
-been carefully designed to ensure that Architectural State may
-be fully preserved regardless of that same State.
-
-Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
-but the full SVP64 Architectural State may be saved and
-restored through manual copying of `SVSTATE` (and the four
-REMAP SPRs if in use at the time)
-Whilst this initially sounds unsafe in reality
-all that Trap Handlers (and function call stack save/restore)
-need do is avoid
-use of SVP64 Prefixed instructions to perform the necessary
-save/restore of Simple-V Architectural State.
-This capability also allows nested function calls to be made from
-inside Vector loops, which is very rare for Vector ISAs.
-
-Strict Program Order is also preserved by the Parallel Reduction
-REMAP Schedule, but only at the cost of requiring the destination
-Vector to be used (Deterministically) to store partial progress of the
-Parallel Reduction Schedule.
-
-The only major caveat for REMAP is that
-after an explicit change to
-Architectural State caused by writing to the
-Simple-V SPRs, some implementations may find
-it easier to take longer to calculate where in a given Schedule
-the re-mapping Indices were. Obvious examples include Interrupts occuring
-in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
-for example), which
-will force implementations to perform divide and modulo
-calculations.
-
-With that context in mind it is important to bear in mind throughout
-this document that when referring to "Strict Program Order", this
-includes Elements, because in Simple-V all Elements
-are conceptually synonymous with Scalar execution.
-
## SVP64 encoding features
A number of features need to be compacted into a very small space of
* Predication on both source and destination
* Two different sources of predication: INT and CR Fields
* SV Modes including saturation (for Audio, Video and DSP), mapreduce,
- fail-first and predicate-result mode.
+ and fail-first mode.
Different classes of operations require different formats. The earlier
-sections cover the common formats and the four separate modes follow:
-CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store
-and Branch-Conditional.
+sections cover the common formats and the five separate modes have their own
+section later:
+* CR operations (crops),
+* Arithmetic/Logical (termed "normal"),
+* Load/Store Immediate,
+* Load/Store Indexed,
+* Branch-Conditional.
## Definition of Reserved in this spec.
Unless otherwise stated, reserved values are always all zeros.
This is unlike OpenPower ISA v3.1, which in many instances does not
-require a trap if reserved fields are nonzero. Where the standard Power
+require a trap if reserved fields are nonzero, instead relying on software
+to avoid use of such fields. Where the standard Power
ISA definition is intended the red keyword `RESERVED` is used.
-## Definition of "UnVectoriseable"
+## Definition of "PO9-Prefixed"
+
+Used in the context of "A PO9-Prefixed Word" this is a new area similar to EXT100-163
+that is shared between SVP64-Single, SVP64, 32 Vectorizable new Opcode areas
+EXT200-231, one RESERVED 57-bit future Opcode space, and three new Unvectorizable
+RESERVED 32-bit future Opcode spaces. See [[sv/po9_encoding]].
-Any operation that inherently makes no sense if repeated is termed
-"UnVectoriseable" or "UnVectorised". Examples include `sc` or `sync`
-which have no registers. `mtmsr` is also classed as UnVectoriseable
+## Definition of "SVP64-Prefix"
+
+A 24-bit RISC-Paradigm Encoding area for Loop-Augmentation of the following
+"Defined Word-instruction".
+Used in the context of "An SVP64-Prefixed Defined Word-instruction", as separate and
+distinct from the 32-bit PO9-Prefix that holds a 24-bit SVP64 Prefix.
+
+## Definition of "Vectorizable" and "Unvectorizable"
+
+"Vectorizable" Defined Word-instructions are Scalar instructions that
+benefit from SVP64 Loop-Prefixing.
+Conversely, any operation that inherently makes no sense if repeated in a
+Vector Loop is termed
+"Unvectorizable" or "Unvectorized". Examples include `sc` or `sync`
+which have no registers. `mtmsr` is also classed as Unvectorizable
because there is only one `MSR`.
-UnVectorised instructions are required to be detected as such if
+Unvectorized instructions are required to be detected as such if
Prefixed (either SVP64 or SVP64Single) and an Illegal Instruction
Trap raised.
*Architectural Note: Given that a "pre-classification" Decode Phase is
-required (identifying whether the Suffix - Defined Word - is
+required (identifying whether the Suffix - Defined Word-instruction - is
Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
-adding "UnVectorised" to this phase is not unreasonable.*
+adding "Unvectorized" to this phase is not unreasonable.*
+
+Vectorizable Defined Word-instructions are **required** to be Vectorized,
+or they may not be permitted to be added at all to the Power ISA as Defined
+Word-instructions.
+
+*Engineering note: implementations may not choose to add Defined Word-instructions
+without also adding hardware support for SVP64-Prefixing of the same.*
+
+*ISA Working Group note: Vectorized PackedSIMD instructions if ever proposed
+should be considered Unvectorizable and except in extreme mitigating circumstances
+rejected outright.*
+
+## Definition of Strict Element-Level Execution Order<a name="svp64_eeo"> </a>
+
+Where Instruction Execution Order[^ieo] guarantees the appearance of sequential
+execution of instructions, Simple-V requires a corresponding guarantee for Elements
+because in Simple-V Execution of Elements is synonymous with Execution of
+instructions.
+
+[^ieo]: Strict Instruction Execution Order is defined in Public v3.1 Book I Section 2.2
+
+## Precise Interrupt Guarantees
+
+Strict Instruction Execution Order is defined as giving the appearance, as far
+as programs are concerned, that instructions were executed
+strictly in the sequence that they occurred. A "Precise"
+out-of-order
+Micro-architecture goes to considerable lengths to ensure that
+this is the case.
+
+Many Vector ISAs allow interrupts to occur in the middle of
+processing of large Vector operations, only under the condition
+that partial results are cleanly discarded, and continuation on return
+from the Trap Handler will restart the entire operation.
+The reason is that saving of full Architectural State is
+not practical. An example would be a Floating-Point Horizontal Sum instruction
+(very common in Vector ISAs) or a Dot Product instruction
+that specifies a higher degree of accuracy for the *internal*
+accumulator than the registers.
+
+Simple-V operates on an entirely different paradigm from traditional
+Vector ISAs: as a "Sub-Execution Context", where "Elements" are synonymous
+with Scalar instructions. With this in mind
+implementations must observe Strict **Element**-Level Execution Order[[#svp64_eeo]]
+at all times.
+*Any* element is Interruptible, and Architectural State may
+be fully preserved and restored regardless of that same State.
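
The element-level interruptibility described above can be pictured with a minimal Python sketch. It is illustrative only: the `state` dictionary and its `step` key are hypothetical stand-ins for the element progress that Simple-V keeps as Architectural State, and `interrupt_at` is a test hook rather than anything architectural.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def run_vector_add(regs, rt, ra, rb, vl, state, interrupt_at=None):
    """Element-by-element add that may stop before any element and resume.

    state["step"] (hypothetical name) holds the next element to execute.
    Returns True when all VL elements are done, False if "interrupted".
    """
    while state["step"] < vl:
        i = state["step"]
        if interrupt_at is not None and i == interrupt_at:
            return False            # progress so far remains in `state`
        regs[rt + i] = (regs[ra + i] + regs[rb + i]) & MASK64
        state["step"] = i + 1       # element i is now complete
    state["step"] = 0               # loop finished: ready for the next one
    return True
```

Calling the function a second time with the saved `state` continues from the interrupted element, so no partial results need to be discarded and no element is executed twice.
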
+
+*Engineering note: implementations are permitted to have higher latency to
+perform context-switching (particularly if REMAP
+is active).*
+
+Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
+but the full SVP64 Architectural State may be saved and
+restored through manual copying of `SVSTATE` (and the four
+REMAP SPRs if in use at the time, which may be determined by
+`SVSTATE[32:46]` being non-zero).
+
+*Programmer's note: Trap Handlers (and any stack-based context save/restore)
+must avoid the use of SVP64 Prefixed instructions to perform the necessary
+save/restore of Simple-V Architectural State (SPR SVSTATE),
+just as use of FPRs and VSRs is presently avoided.
+However once saved, and set to known-good, SVP64 Prefixed instructions
+may be used to save/restore GPRs, SPRs, FPRs and other state.*
+
+*Programmer's note: SVSHAPE0-3 alter Element Execution Order, but only
+if activated in SVSTATE. It is therefore technically possible in a Trap
+Handler to save SVSTATE (`mfspr t0, SVSTATE`), then clear bits 32-46.
+At this point it becomes safe to use SVP64 to save sequential batches
+of SPRs (`setvli MAXVL=VL=4; sv.mfspr *t0, *SVSHAPE0`)*
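
As a rough illustration of the two notes above, the following Python sketch tests and clears `SVSTATE[32:46]` using MSB0 bit numbering on a 64-bit value. The helper names (`msb0_field`, `remap_in_use`, `clear_remap_bits`) are hypothetical and chosen for this example only; it models the bit arithmetic, not the actual Trap Handler code.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def msb0_field(value, start, end, width=64):
    """Extract bits start..end (inclusive, MSB0 numbering) of a width-bit value."""
    nbits = end - start + 1
    shift = (width - 1) - end
    return (value >> shift) & ((1 << nbits) - 1)

def remap_in_use(svstate):
    """True if any of SVSTATE[32:46] is set, i.e. REMAP SPRs may need saving."""
    return msb0_field(svstate, 32, 46) != 0

def clear_remap_bits(svstate):
    """Return SVSTATE with bits 32..46 (MSB0) cleared, all other bits kept."""
    mask = ((1 << 15) - 1) << (63 - 46)
    return svstate & ~mask & MASK64
```
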
+
+The only major caveat for REMAP is that
+after an explicit change to
+Architectural State caused by writing to the
+Simple-V SPRs, some implementations may find
+it easier to take longer to calculate where in a given Schedule
+the re-mapping Indices were. Obvious examples include Interrupts occurring
+in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
+for example), which
+will force some implementations to perform divide and modulo
+calculations.
+
+An additional caveat involves Condition Register Fields
+when also used as Predicate Masks. An operation that
+overwrites the same CR Fields that are simultaneously
+being used as a Predicate Mask should exercise extreme care
+if the overwritten CR field element was needed by a
+subsequent Element for its Predicate Mask bit.
+
+Some implementations may deploy Cray's technique of
+"Vector Chaining" (including in this case reading the CR field
+containing the Predicate bit until the very last moment),
+and consequently avoiding the risk of
+overwrite is the responsibility of the Programmer.
+`hphint` may be used here to good effect.
+Extra special care is particularly needed here when using REMAP
+and also Vertical-First Mode.
+
+The simplest option is to use Integer Predicate Masks but the
+caveats are stricter:
+
+* In Vertical-First loops Programmers **must not** write to any
+ Integers (r3, r10, r30) used as Predicate Masks. Doing so
+ is `UNDEFINED` behaviour.
+* An **entire** Vector is held up in Horizontal-First Mode if the
+ Integer Predicate is still in in-flight Reservation Stations
+ or pipelines. Speculative Vector Chained Execution mitigates delays
+ but can be heavy on Reservation Station resources.
## Register files, elements, and Element-width Overrides
exactly the same, affecting as they already do and remain **only**
on the Load and Store memory-register operation byte-order, and having
nothing to do with the ordering of the contents of register files or
-register-register operations.
+register-register arithmetic or logical operations.
The only major impact on Arithmetic and Logical operations is that all
Scalar operations are defined, where practical and workable, to have
-three new widths: elwidth=32, elwidth=16, elwidth=8. The default of
+three new widths: elwidth=32, elwidth=16, elwidth=8.
+
+*Architectural note: a future revision of SVP64 for VSX may have entirely
+different definitions of possible elwidths.*
+
+The default of
elwidth=64 is the pre-existing (Scalar) behaviour which remains 100%
unchanged. Thus, `addi` is now joined by a 32-bit, 16-bit, and 8-bit
variant of `addi`, but the sole exclusive difference is the width.
-*In no way* is the actual `addi` instruction fundamentally altered.
+*In no way* is the actual `addi` instruction fundamentally altered
+to become an entirely different operation (such as a subtract or multiply).
FP Operations elwidth overrides are also defined, as explained in
the [[svp64/appendix]].
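
A minimal Python sketch of the "same operation, different width" idea follows. It assumes, purely for illustration, that an element-width override simply makes the result wrap modulo 2^elwidth; the precise rules (including treatment of immediates) are those of the appendix, not of this sketch, and `addi_elwidth` is a hypothetical name.

```python
def addi_elwidth(ra_value, simm, elwidth=64):
    """Model of `addi` at a given element width (assumed: result wraps
    modulo 2**elwidth). elwidth=64 is the unmodified Scalar behaviour."""
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    mask = (1 << elwidth) - 1
    return (ra_value + simm) & mask
```

In every case the operation is still an add: only the width of the result changes.
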
```
There are no conceptual arithmetic ordering or other changes over the
Scalar Power ISA definitions to registers or register files or to
- arithmetic or Logical Operations beyond element-width subdivision
+ arithmetic or Logical Operations, beyond element-width subdivision
```
Element offset
* The GPR-numbering is considered LSB0-ordered
* The Element-numbering (result0-result4) is LSB0-ordered
* Each of the results (result0-result4) is 16-bit
-* "same" indicates "no change as a result of the Vectorised add"
+* "same" indicates "no change as a result of the Vectorized add"
```
| MSB0: | 0:15 | 16:31 | 32:47 | 48:63 |
from GPR(1) into GPR(2) - the 5th result modifies **only** the bottom
16 LSBs of GPR(2).
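
The wrapping behaviour can be sketched in Python by modelling the register file as a flat byte array. This is a conceptual model only: the little-endian packing is an assumption used to represent LSB0 element ordering *within the register file* and says nothing about memory byte-order; `write_element` and `read_gpr` are hypothetical helpers.

```python
def write_element(regfile, gpr, index, value, elwidth):
    """Write element `index` of a vector starting at GPR `gpr`.

    Only the bytes belonging to that element are modified."""
    nbytes = elwidth // 8
    offset = gpr * 8 + index * nbytes
    masked = value & ((1 << elwidth) - 1)
    regfile[offset:offset + nbytes] = masked.to_bytes(nbytes, "little")

def read_gpr(regfile, gpr):
    """Read back a whole 64-bit register from the byte-array model."""
    return int.from_bytes(regfile[gpr * 8:gpr * 8 + 8], "little")
```

With VL=5 and elwidth=16 starting at GPR(1), four calls fill GPR(1) and the fifth touches only the low 16 bits of the following register, leaving its upper 48 bits unchanged.
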
-Hardware Architectural note: to avoid a Read-Modify-Write at the register
-file it is strongly recommended to implement byte-level write-enable lines
-exactly as has been implemented in DRAM ICs for many decades. Additionally
-the predicate mask bit is advised to be associated with the element
-operation and alongside the result ultimately passed to the register file.
-When element-width is set to 64-bit the relevant predicate mask bit
-may be repeated eight times and pull all eight write-port byte-level
-lines HIGH. Clearly when element-width is set to 8-bit the relevant
-predicate mask bit corresponds directly with one single byte-level
-write-enable line. It is up to the Hardware Architect to then amortise
-(merge) elements together into both PredicatedSIMD Pipelines as well
-as simultaneous non-overlapping Register File writes, to achieve High
-Performance designs. Overall it helps to think of the register files
-as being much more akin to a byte-level-addressable SRAM.
-
-If the 16-bit operation were to be followed up with a 32-bit Vectorised
+If the 16-bit operation were to be followed up with a 32-bit Vectorized
Operation, the exact same contents would be viewed as follows:
```
byte-ordering. This section is exclusively about how to correctly perceive
Simple-V-Augmented **Register** Files.
+*Engineering note: to avoid a Read-Modify-Write at the register
+file it is strongly recommended to implement byte-level write-enable lines
+exactly as has been implemented in DRAM ICs for many decades. Additionally
+the predicate mask bit is advised to be associated with the element
+operation and alongside the result ultimately passed to the register file.
+When element-width is set to 64-bit the relevant predicate mask bit
+may be repeated eight times and pull all eight write-port byte-level
+lines HIGH. Clearly when element-width is set to 8-bit the relevant
+predicate mask bit corresponds directly with one single byte-level
+write-enable line. It is up to the Hardware Architect to then amortise
+(merge) elements together into both PredicatedSIMD Pipelines as well
+as simultaneous non-overlapping Register File writes, to achieve High
+Performance designs. Overall it helps to think of the GPR and FPR
+register files as being much more akin to a 64-bit-wide byte-level-addressable SRAM.*
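
The relationship between a predicate bit and the byte-level write-enable lines can be sketched as follows. This is an illustrative Python model, not a hardware description: `byte_write_enables` is a hypothetical name, and the 8-bit return value assumes one enable line per byte of a 64-bit register, with bit 0 as the least-significant byte.

```python
def byte_write_enables(pred_bit, elwidth, element_in_reg):
    """Return the 8-bit byte write-enable mask for one element.

    element_in_reg is the element's position inside its 64-bit register."""
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    nbytes = elwidth // 8
    if not 0 <= element_in_reg < 8 // nbytes:
        raise ValueError("element does not fit in a 64-bit register")
    if not pred_bit:
        return 0                      # masked-out: no byte is written
    # replicate the predicate bit across every byte of the element
    return ((1 << nbytes) - 1) << (element_in_reg * nbytes)
```

At elwidth=64 the single predicate bit drives all eight lines; at elwidth=8 it drives exactly one.
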
+
**Comparative equivalent using VSR registers**
For a comparative data point the VSR Registers may be expressed in the
register, where VSX does the reverse: places the numerically-*highest*
(last-numbered) element at the LSB end of the register.
-
```
#pragma pack
typedef union {
## Register Naming and size
As indicated above SV Registers are simply the GPR, FPR and CR register
-files extended linearly to larger sizes; SV Vectorisation iterates
+files extended linearly to larger sizes; SV Vectorization iterates
sequentially through these registers (LSB0 sequential ordering from 0
to VL-1).
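
A minimal Python sketch of this sequential iteration is shown below, using a Vectorized add as the example. It is a simplification under stated assumptions: predication, element-width overrides and scalar destinations are deliberately left out, and the `ra_vec` / `rb_vec` flags are hypothetical stand-ins for the Scalar/Vector marking of each source operand.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def sv_add(gpr, rt, ra, rb, vl, ra_vec=True, rb_vec=True):
    """Loop an add over VL elements, stepping through consecutive registers.

    A Vector operand advances by one register per element; a Scalar
    source operand is re-read from the same register every time."""
    for i in range(vl):
        src_a = gpr[ra + (i if ra_vec else 0)]
        src_b = gpr[rb + (i if rb_vec else 0)]
        gpr[rt + i] = (src_a + src_b) & MASK64
```
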
Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is
-r0 to r31, SV extends this as r0 to r127. Likewise FP registers are
+r0 to r31, SV extends this range (in the Upper Compliancy Levels of SV)
+as r0 to r127. Likewise FP registers are
extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries,
-CR0 thru CR127.
+CR0 thru CR127. In the Lower SV Compliancy Levels the quantity of registers
+remains the same in order to reduce implementation cost for Embedded systems.
The names of the registers therefore reflect a simple linear extension
of the Power ISA v3.0B / v3.1B register naming, and in hardware this
sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127
sv.mfcr cr5, *cr40 # only one source (CR40) copied to CR5
sv.mfcr *cr16, cr40 # Vector-Splat CR40 onto CR16,17,18...
+ sv.mfcr *cr16, cr3 # Vector-Splat CR3 onto CR16,17,18...
```
Examples of prohibited instructions:
## SVP64 Remapped Encoding (`RM[0:23]`)
In the SVP64 Vector Prefix spaces, the 24 bits 8-31 are termed `RM`. Bits
-32-37 are the Primary Opcode of the Suffix "Defined Word". 38-63 are the
-remainder of the Defined Word. Note that the new EXT232-263 SVP64 area
+32-37 are the Primary Opcode of the Suffix "Defined Word-instruction". 38-63 are the
+remainder of the Defined Word-instruction. Note that the new EXT232-263 SVP64 area
it is obviously mandatory that bit 32 is required to be set to 1.
| 0-5 | 6 | 7 | 8-31 | 32-37 | 38-63 |Description |
| Field Name | Field bits | Description |
|------------|------------|----------------------------------------|
| ELWIDTH | `4:5` | Element Width |
-| ELWIDTH_SRC | `6:7` | Element Width for Source |
+| ELWIDTH_SRC | `6:7` | Element Width for Source (or MASK_SRC in 2PM) |
| EXTRA | `10:18` | Register Extra encoding |
| MODE | `19:23` | changes Vector behaviour |
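
Because `RM` fields are given in MSB0 numbering over 24 bits, extracting them is a matter of shifting by `23 - end`. The Python sketch below shows this for the four fields listed above only; the names `rm_field`, `decode_rm` and `RM_FIELDS` are hypothetical and the dictionary is not a complete field list.

```python
# (start, end) bit positions, MSB0, inclusive, within the 24-bit RM
RM_FIELDS = {
    "ELWIDTH": (4, 5),
    "ELWIDTH_SRC": (6, 7),
    "EXTRA": (10, 18),
    "MODE": (19, 23),
}

def rm_field(rm, start, end):
    """Extract bits start..end (MSB0, inclusive) from a 24-bit RM value."""
    nbits = end - start + 1
    return (rm >> (23 - end)) & ((1 << nbits) - 1)

def decode_rm(rm):
    """Decode the fields named in RM_FIELDS into a dictionary."""
    return {name: rm_field(rm, s, e) for name, (s, e) in RM_FIELDS.items()}
```
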
sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
this may be considered to be elements 0b00 to 0b01 inclusive.
+Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+element operation issued, SUBVL element operations are issued (as an inner loop).
+The key difference between VL looping and SUBVL looping
+is that predication bits are applied per
+**group**, rather than by individual element.
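
The per-group predication can be sketched as a pair of nested loops. This Python model is illustrative only: it assumes the default ordering with SUBVL as the inner loop, and the flat element index `i * subvl + j` is an assumption for this example (it ignores `pack` and `unpack`). `subvl_loop` is a hypothetical name.

```python
def subvl_loop(vl, subvl, pred_mask):
    """Return the (i, j, flat_index) element operations that are issued.

    One predicate bit (bit i of pred_mask) enables or disables the
    whole group of SUBVL sub-elements for outer step i."""
    ops = []
    for i in range(vl):                 # outer VL loop
        if not (pred_mask >> i) & 1:
            continue                    # entire group masked out
        for j in range(subvl):          # inner SUBVL loop
            ops.append((i, j, i * subvl + j))
    return ops
```
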
+
+Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
+
## MASK/MASK_SRC & MASKMODE Encoding
One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
Likewise CR based twin predication has a second set of 3 bits, allowing
a different test to be applied.
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed. In practice (for CR Fields)
-this creates an unnecessary block on parallelism. Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop. Doing so results
+Note that it cannot necessarily be assumed that Predicate Masks
+(whether INT or CR) are read in full *before* the operations proceed. In practice (for CR Fields)
+this creates an unnecessary block on parallelism, prohibiting
+"Vector Chaining". Therefore, it is up
+to the programmer to ensure that the CR field Elements used as Predicate Masks
+are not overwritten by any parallel Vector Loop. Doing so results
in **UNDEFINED** behaviour, according to the definition outlined in the
Power ISA v3.0B Specification.
needs to take place, safe in the knowledge that no programmer will have
issued a Vector Instruction where previous elements could have overwritten
(destroyed) not-yet-executed CR-Predicated element operations.
+This particularly is an issue when using REMAP, as the order in
+which CR-Field-based Predicate Mask bits could be read on a per-element
+execution basis could well conflict with the order in which prior
+elements wrote to the very same CR Field.
+
+Additionally Programmers should avoid using r3, r10 or r30
+as destination registers when these are also used as a Predicate
+Mask. Doing so is again UNDEFINED behaviour.
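
To see why overwriting a Predicate Mask register is hazardous, the following Python sketch compares two hypothetical models of the same predicated loop: one that reads the whole mask before starting, and one that re-reads the mask bit at each element (as a "Vector Chaining" design might). Neither model is claimed to match any real implementation; the point is only that they can disagree.

```python
def predicated_increment(gpr, rt, ra, vl, mask_reg, lazy):
    """gpr[rt+i] = gpr[ra+i] + 1 for each element whose mask bit is set.

    lazy=False: the mask register is read once, in full, up front.
    lazy=True:  the mask register is re-read before every element.
    Returns a new register list; the input list is left unmodified."""
    regs = list(gpr)
    mask = regs[mask_reg]
    for i in range(vl):
        if lazy:
            mask = regs[mask_reg]
        if (mask >> i) & 1:
            regs[rt + i] = regs[ra + i] + 1
    return regs
```

When the destination Vector starts at r3 and r3 is also the mask, element 0 overwrites the mask: the "eager" model completes all elements, whereas the "lazy" model skips the remainder.
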
+
+Usually in 2P `MASK_SRC` is exclusively in the EXTRA area. However for
+LD/ST-Indexed a different Encoding is required, designated `2PM`.
### Integer Predication (MASKMODE=0)
| 110 | so/un | `CR[offs+i].FU` is set |
| 111 | ns/nu | `CR[offs+i].FU` is clear |
-`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised
+`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorized
Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).
-The CR Predicates chosen must start on a boundary that Vectorised CR
+The CR Predicates chosen must start on a boundary that Vectorized CR
operations can access cleanly, in full. With EXTRA2 restricting starting
-points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and
+points to multiples of 8 (CR0, CR8, CR16...) both Vectorized Rc=1 and
CR Predicate Masks have to be adapted to fit on these boundaries as well.
## Extra Remapped Encoding <a name="extra_remap"> </a>
* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
+* `RM-2PM-2S1D` Twin Predication (src=2, dest=1) for LD/ST Update (Indexed)
+
+The `2PM` designation uses bits 6 and 7 as well as the 9 EXTRA bits
+in order to extend two registers to
+EXTRA3, sacrificing source elwidths (`ELWIDTH_SRC`, bits 6 and 7) in the process.
+`MASK_SRC` has a different encoding in `2PM`.
### RM-1P-3S1D
* `rlwimi RA, RS, ...`
* Rsrc1_EXTRA3 applies to RS as the first src
-* Rsrc2_EXTRA3 applies to RA as the secomd src
+* Rsrc2_EXTRA3 applies to RA as the second src
* Rdest_EXTRA3 applies to RA to create an **independent** dest.
With the addition of the EXTRA bits, the three registers
### RM-2P-2S1D/1S2D/3S
The primary purpose for this encoding is for Twin Predication on LOAD
-and STORE operations. see [[sv/ldst]] for detailed anslysis.
+and STORE operations. See [[sv/ldst]] for detailed analysis.
**RM-2P-2S1D:**
Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
or increased latency in some implementations due to lane-crossing.
+### RM-2PM-2S1D/1S2D/3S
+
+The primary purpose for this encoding is for Twin Predication on LOAD
+and STORE operations providing EXTRA3 for RT, RA and RS.
+See [[sv/ldst]] for detailed analysis.
+
+**RM-2PM-2S1D:**
+
+RT or RS requires EXTRA3, RA requires EXTRA3, but for RB EXTRA2 will
+suffice. `MASK_SRC` may be read from the bits normally used for source elwidth (`ELWIDTH_SRC`).
+
+| Field Name | Field bits | Description |
+|------------|------------|----------------------------|
+| Rdest_EXTRA3 | `10:12` | extends Rdest (R\*\_EXTRA3 Encoding) |
+| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1 (R\*\_EXTRA3 Encoding) |
+| Rsrc2_EXTRA2 | `16:17` | extends Rsrc2 (R\*\_EXTRA2 Encoding) |
+| MASK_SRC | `6:7,18` | Execution Mask for Source |
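
The field layout in the table above can be sketched in Python as below. The bit positions are taken from the table; the order in which bits `6:7` and bit `18` are joined to form the 3-bit `MASK_SRC` (here `6:7` as the upper two bits) is an assumption made for this example, and `decode_2pm_2s1d` is a hypothetical name.

```python
def rm_bits(rm, start, end):
    """Extract bits start..end (MSB0, inclusive) from a 24-bit RM value."""
    nbits = end - start + 1
    return (rm >> (23 - end)) & ((1 << nbits) - 1)

def decode_2pm_2s1d(rm):
    """Split a 24-bit RM into the RM-2PM-2S1D fields."""
    return {
        "Rdest_EXTRA3": rm_bits(rm, 10, 12),
        "Rsrc1_EXTRA3": rm_bits(rm, 13, 15),
        "Rsrc2_EXTRA2": rm_bits(rm, 16, 17),
        # assumed ordering: bits 6:7 are the upper bits, bit 18 the lowest
        "MASK_SRC": (rm_bits(rm, 6, 7) << 1) | rm_bits(rm, 18, 18),
    }
```
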
+
## R\*\_EXTRA2/3
EXTRA is the means by which two things are achieved:
| 10 | Vector | `CR0-CR112`/16 | BFA 0 | 0b000 |
| 11 | Vector | `CR8-CR120`/16 | BFA 1 | 0b000 |
+<!-- hide -->
## Appendix
Now at its own page: [[svp64/appendix]]
---------
[[!tag standards]]
+<!-- show -->
+
+--------
+
\newpage{}