(no commit message)

[libreriscv.git] / openpower / sv / svp64.mdwn
diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn

index 700d26b9ebd12b59c1ca5169e97ae360e7417b35..8d732f474f1c1aad0c79fddfaa746fc9c5ecf542 100644 (file)
--- a/openpower/sv/svp64.mdwn
+++ b/openpower/sv/svp64.mdwn
@@ -1,10 +1,10 @@
  # SVP64 Zero-Overhead Loop Prefix Subsystem
  
+<!-- hide -->
  * **DRAFT STATUS v0.1 18sep2021** Release notes <https://bugs.libre-soc.org/show_bug.cgi?id=699>
+<!-- show -->
  
-This document describes [[SV|sv]] augmentation of the [[Power|openpower]] v3.0B [[ISA|openpower/isa/]]. It is in Draft Status and
-will be submitted to the [[!wikipedia OpenPOWER_Foundation]] ISA WG
-via the External RFC Process.
+This document describes [[SV|sv]] augmentation of the [[Power|openpower]] v3.0B [[ISA|openpower/isa/]]. 
  
  Credits and acknowledgements:
  
@@ -17,9 +17,12 @@ Credits and acknowledgements:
  * NLnet Foundation, for funding
  * OpenPOWER Foundation
  * Paul Mackerras
+* Brad Frey
+* Cathy May
  * Toshaan Bharvani
  * IBM for the Power ISA itself
  
+<!-- hide -->
  Links:
  
  * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001498.html>
@@ -34,22 +37,21 @@ Links:
  * [[sv/branches]] chapter
  * [[sv/ldst]] chapter
  
-
  Table of contents
  
  [[!toc]]
+<!-- show -->
  
  ## Introduction
  
-Simple-V is a type of Vectorisation best described as a "Prefix Loop
-Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and
-to the 8086 `REP` Prefix instruction.  More advanced features are similar
-to the Z80 `CPIR` instruction. If naively viewed one-dimensionally as an
-actual Vector ISA it introduces over 1.5 million 64-bit True-Scalable
-Vector instructions on the SFFS Subset and closer to 10 million 64-bit
-True-Scalable Vector instructions if introduced on VSX.  SVP64, the
-instruction format used by Simple-V, is therefore best viewed as an
-orthogonal RISC-paradigm "Prefixing" subsystem instead.
+Simple-V is a type of Vectorization best described as a "Prefix Loop
+Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR`[^bib_ldir] instruction and
+to the 8086 `REP`[^bib_rep] Prefix instruction.  More advanced features are similar
+to the Z80 `CPIR`[^bib_cpir] instruction.
+
+[^bib_ldir]:  [Zilog Z80 LDIR](http://z80-heaven.wikidot.com/instructions-set:ldir)
+[^bib_cpir]:  [Zilog Z80 CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir)
+[^bib_rep]: [8086 REP](https://www.felixcloutier.com/x86/rep:repe:repz:repne:repnz)
  
  Except where explicitly stated all bit numbers remain as in the rest of
  the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on
@@ -59,29 +61,40 @@ ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order).
  which is a different convention from that used elsewhere in the Power ISA.
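
As an illustration of the MSB0 convention described above, the following minimal Python sketch extracts an inclusive bit range from a value. The helper name `msb0_field` and its `width` parameter are hypothetical, introduced only for this example; they are not part of the specification.

```python
def msb0_field(value, start, end, width=64):
    """Return bits start..end (inclusive, MSB0-numbered) of a width-bit value.

    In MSB0 numbering, bit 0 is the most significant bit, so bit `end`
    sits (width - 1 - end) places above the least significant bit.
    """
    if not 0 <= start <= end < width:
        raise ValueError("bit range outside the value")
    nbits = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << nbits) - 1)
```

With `width=8`, the range `4:6` of `0b00001110` selects the three set bits, giving 7.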
  
  The SVP64 prefix always comes before the suffix in PC order and must be
-considered an independent "Defined word" that augments the behaviour of
-the following instruction, but does **not** change the actual Decoding
-of that following instruction.  **All prefixed 32-bit instructions
-(Defined Words) retain their non-prefixed encoding and definition**.
+considered an independent "Defined Word-instruction"[^dwi] that augments the behaviour of
+the following instruction (also a Defined Word-instruction), but does **not** change the actual Decoding
+of that following instruction just because it is Prefixed.  Unlike EXT100-163,
+where the Suffix is considered an entirely new Opcode Space,
+SVP64-Prefixed instructions must never be treated or regarded
+as a different Opcode Space.
+
+[^dwi]: Defined Word-instruction: Power ISA v3.1 Section 1.6
  
  Two apparent exceptions to the above hard rule exist: SV
  Branch-Conditional operations and LD/ST-update "Post-Increment"
  Mode.  Post-Increment was considered sufficiently high priority
  (significantly reducing hot-loop instruction count) that one bit in
  the Prefix is reserved for it (*Note the intention to release that bit
-and move Post-Increment instructions to EXT2xx, as part of [[ls011]]*).
-Vectorised Branch-Conditional operations "embed" the original Scalar
+and move Post-Increment instructions to EXT2xx, as part of [[sv/rfc/ls011]]*).
+Vectorized Branch-Conditional operations "embed" the original Scalar
  Branch-Conditional behaviour into a much more advanced variant that is
  highly suited to High-Performance Computation (HPC), Supercomputing,
  and parallel GPU Workloads.
  
-*Architectural Resource Allocation note: it is prohibited to accept RFCs
-which fundamentally violate this hard requirement.  Under no circumstances
-must the Suffix space have an alternate instruction encoding allocated
-within SVP64 that is entirely different from the non-prefixed Defined
-Word. Hardware Implementors critically rely on this inviolate guarantee
-to implement High-Performance Multi-Issue micro-architectures that can
-sustain 100% throughput*
+*Architectural Resource Allocation note: at present it is possible to perform
+partial parallel decode of the SVP64 24-bit Encoding Area at the same time
+as decoding of the Suffix. Multi-Issue Implementations may even
+Decode multiple 32-bit words in parallel and follow up with a second
+cycle of joining Prefix and Suffix "after-the-fact". 
+Mixing and overlaying 64-bit Opcode Encodings into the
+{SVP64 24-bit Prefix}{Defined Word-instruction} space creates
+a hard dependency that catastrophically damages Multi-Issue Decoding by
+greatly complexifying Parallel Instruction-Length Detection.
+Therefore it has to be prohibited to accept RFCs
+which fundamentally violate the following hard requirement: **under no circumstances**
+must the use of SVP64 24-bit Prefixes **also** imply a different Opcode space
+from **any** non-prefixed Word. Even RESERVED or Illegal Words must be
+Orthogonal.*
  
  Subset implementations in hardware are permitted, as long as certain
  rules are followed, allowing for full soft-emulation including future
@@ -89,55 +102,6 @@ revisions.  Compliancy Subsets exist to ensure minimum levels of binary
  interoperability expectations within certain environments.  Details in
  the [[svp64/appendix]].
  
-## Strict Program Order
-
-Many Vector ISAs allow interrupts to occur in the middle of
-processing of large Vector operations, only under the condition
-that continuation on return will restart the entire operation.
-The reason is that saving of full Architectural State is
-not practical.
-
-Simple-V operates on an entirely different paradigm from traditional
-Vector ISAs: as a Sub-Program Counter where "Elements" are synonymous
-with Scalar instructions. With this in mind it is critical for
-implementations to observe Strict Element-Level Program Order
-at all times.  *Any* element is Interruptible and Simple-V has
-been carefully designed to ensure that Architectural State may
-be fully preserved regardless of that same State.
-
-Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
-but the full SVP64 Architectural State may be saved and
-restored through manual copying of `SVSTATE` (and the four
-REMAP SPRs if in use at the time)
-Whilst this initially sounds unsafe in reality
-all that Trap Handlers (and function call stack save/restore)
-need do is avoid
-use of SVP64 Prefixed instructions to perform the necessary
-save/restore of Simple-V Architectural State.
-This capability also allows nested function calls to be made from
-inside Vector loops, which is very rare for Vector ISAs.
-
-Strict Program Order is also preserved by the Parallel Reduction
-REMAP Schedule, but only at the cost of requiring the destination
-Vector to be used (Deterministically) to store partial progress of the
-Parallel Reduction Schedule.
-
-The only major caveat for REMAP is that
-after an explicit change to
-Architectural State caused by writing to the
-Simple-V SPRs, some implementations may find
-it easier to take longer to calculate where in a given Schedule
-the re-mapping Indices were.  Obvious examples include Interrupts occuring
-in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
-for example), which
-will force implementations to perform divide and modulo
-calculations.
-
-With that context in mind it is important to bear in mind throughout
-this document that when referring to "Strict Program Order", this
-includes Elements, because in Simple-V all Elements
-are conceptually synonymous with Scalar execution.
-
  ## SVP64 encoding features
  
  A number of features need to be compacted into a very small space of
@@ -149,12 +113,16 @@ only 24 bits:
  * Predication on both source and destination
  * Two different sources of predication: INT and CR Fields
  * SV Modes including saturation (for Audio, Video and DSP), mapreduce,
-  fail-first and predicate-result mode.
+  and fail-first mode.
  
  Different classes of operations require different formats. The earlier
-sections cover the common formats and the four separate modes follow:
-CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store
-and Branch-Conditional.
+sections cover the common formats and the five separate modes have their own
+section later:
+* CR operations (crops),
+* Arithmetic/Logical (termed "normal"),
+* Load/Store Immediate,
+* Load/Store Indexed,
+* Branch-Conditional.
  
  ## Definition of Reserved in this spec.
  
@@ -167,24 +135,150 @@ must also raise illegal instruction traps in order to allow emulation.
  Unless otherwise stated, reserved values are always all zeros.
  
  This is unlike OpenPower ISA v3.1, which in many instances does not
-require a trap if reserved fields are nonzero.  Where the standard Power
+require a trap if reserved fields are nonzero, instead relying on software
+to avoid use of such fields.  Where the standard Power
  ISA definition is intended the red keyword `RESERVED` is used.
  
-##  Definition of "UnVectoriseable"
+## Definition of "PO9-Prefixed"
+
+Used in the context of "A PO9-Prefixed Word" this is a new area similar to EXT100-163
+that is shared between SVP64-Single, SVP64, 32 Vectorizable new Opcode areas
+EXT200-231, one RESERVED 57-bit future Opcode space, and three new Unvectorizable
+RESERVED 32-bit future Opcode spaces. See [[sv/po9_encoding]].
  
-Any operation that inherently makes no sense if repeated is termed
-"UnVectoriseable" or "UnVectorised".  Examples include `sc` or `sync`
-which have no registers. `mtmsr` is also classed as UnVectoriseable
+## Definition of "SVP64-Prefix"
+
+A 24-bit RISC-Paradigm Encoding area for Loop-Augmentation of the following
+"Defined Word-instruction".
+Used in the context of "An SVP64-Prefixed Defined Word-instruction", as separate and
+distinct from the 32-bit PO9-Prefix that holds a 24-bit SVP64 Prefix.
+
+##  Definition of "Vectorizable" and "Unvectorizable"
+
+"Vectorizable" Defined Word-instructions are Scalar instructions that
+benefit from SVP64 Loop-Prefixing.
+Conversely, any operation that inherently makes no sense if repeated in a
+Vector Loop is termed
+"Unvectorizable" or "Unvectorized".  Examples include `sc` or `sync`
+which have no registers. `mtmsr` is also classed as Unvectorizable
  because there is only one `MSR`.
  
-UnVectorised instructions are required to be detected as such if
+Unvectorized instructions are required to be detected as such if
  Prefixed (either SVP64 or SVP64Single) and an Illegal Instruction
  Trap raised.
  
  *Architectural Note: Given that a "pre-classification" Decode Phase is
-required (identifying whether the Suffix - Defined Word - is
+required (identifying whether the Suffix - Defined Word-instruction - is
  Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
-adding "UnVectorised" to this phase is not unreasonable.*
+adding "Unvectorized" to this phase is not unreasonable.*
+
+Vectorizable Defined Word-instructions are **required** to be Vectorized,
+or they may not be permitted to be added at all to the Power ISA as Defined
+Word-instructions.
+
+*Engineering note: implementations may not choose to add Defined Word-instructions
+without also adding hardware support for SVP64-Prefixing of the same.*
+
+*ISA Working Group note: Vectorized PackedSIMD instructions if ever proposed
+should be considered Unvectorizable and except in extreme mitigating circumstances
+rejected outright.*
+
+## Definition of Strict Element-Level Execution Order<a name="svp64_eeo"> </a>
+
+Where Instruction Execution Order[^ieo] guarantees the appearance of sequential
+execution of instructions, Simple-V requires a corresponding guarantee for Elements
+because in Simple-V Execution of Elements is synonymous with Execution of
+instructions.
+
+[^ieo]: Strict Instruction Execution Order is defined in Public v3.1 Book I Section 2.2
+
+## Precise Interrupt Guarantees
+
+Strict Instruction Execution Order is defined as giving the appearance, as far
+as programs are concerned, that instructions were executed
+strictly in the sequence that they occurred.  A "Precise"
+out-of-order
+Micro-architecture goes to considerable lengths to ensure that
+this is the case.
+
+Many Vector ISAs allow interrupts to occur in the middle of
+processing of large Vector operations, only under the condition
+that partial results are cleanly discarded, and continuation on return
+from the Trap Handler will restart the entire operation.
+The reason is that saving of full Architectural State is
+not practical. An example would be a Floating-Point Horizontal Sum instruction
+(very common in Vector ISAs) or a Dot Product instruction
+that specifies a higher degree of accuracy for the *internal*
+accumulator than the registers.
+
+Simple-V operates on an entirely different paradigm from traditional
+Vector ISAs: as a "Sub-Execution Context", where "Elements" are synonymous
+with Scalar instructions. With this in mind
+implementations must observe Strict **Element**-Level Execution Order[[#svp64_eeo]]
+at all times.
+*Any* element is Interruptible, and Architectural State may
+be fully preserved and restored regardless of that same State.
+
+*Engineering note: implementations are permitted to have higher latency when
+performing context-switching (particularly if REMAP
+is active).*
+
+Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
+but the full SVP64 Architectural State may be saved and
+restored through manual copying of `SVSTATE` (and the four
+REMAP SPRs if in use at the time, which may be determined by
+`SVSTATE[32:46]` being non-zero).
+
+*Programmer's note: Trap Handlers (and any stack-based context save/restore)
+must avoid the use of SVP64 Prefixed instructions to perform the necessary
+save/restore of Simple-V Architectural State (SPR SVSTATE),
+just as use of FPRs and VSRs is presently avoided.
+However once saved, and set to known-good, SVP64 Prefixed instructions
+may be used to save/restore GPRs, SPRs, FPRs and other state.*
+
+*Programmer's note: SVSHAPE0-3 alter Element Execution Order, but only
+if activated in SVSTATE. It is therefore technically possible in a Trap
+Handler to save SVSTATE (`mfspr t0, SVSTATE`), then clear bits 32-46.
+At this point it becomes safe to use SVP64 to save sequential batches
+of SPRs (`setvli MAXVL=VL=4; sv.mfspr *t0, *SVSHAPE0`)*
+
+The only major caveat for REMAP is that
+after an explicit change to
+Architectural State caused by writing to the
+Simple-V SPRs, some implementations may find
+it easier to take longer to calculate where in a given Schedule
+the re-mapping Indices were.  Obvious examples include Interrupts occurring
+in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
+for example), which
+will force some implementations to perform divide and modulo
+calculations.
+
+An additional caveat involves Condition Register Fields
+when also used as Predicate Masks. An operation that
+overwrites the same CR Fields that are simultaneously
+being used as a Predicate Mask should exercise extreme care
+if the overwritten CR field element was needed by a
+subsequent Element for its Predicate Mask bit.
+
+Some implementations may deploy Cray's technique of
+"Vector Chaining" (including in this case reading the CR field
+containing the Predicate bit until the very last moment),
+and consequently avoiding the risk of
+overwrite is the responsibility of the Programmer.
+`hphint` may be used here to good effect.
+Particular care is needed here when using REMAP
+and also Vertical-First Mode.
+
+The simplest option is to use Integer Predicate Masks but the
+caveats are stricter:
+
+* In Vertical-First loops Programmers **must not** write to any
+  Integers (r3, r10, r30) used as Predicate Masks. Doing so
+  is `UNDEFINED` behaviour.
+* An **entire** Vector is held up in Horizontal-First Mode if the
+  Integer Predicate is still in in-flight Reservation Stations
+  or pipelines.  Speculative Vector Chained Execution mitigates delays
+  but can be heavy on Reservation Station resources.
  
  ## Register files, elements, and Element-width Overrides
  
@@ -216,15 +310,21 @@ Memory access remains exactly the same: the effects of `MSR.LE` remain
  exactly the same, affecting as they already do and remain **only**
  on the Load and Store memory-register operation byte-order, and having
  nothing to do with the ordering of the contents of register files or
-register-register operations.
+register-register arithmetic or logical operations.
  
  The only major impact on Arithmetic and Logical operations is that all
  Scalar operations are defined, where practical and workable, to have
-three new widths: elwidth=32, elwidth=16, elwidth=8.  The default of
+three new widths: elwidth=32, elwidth=16, elwidth=8.
+
+*Architectural note: a future revision of SVP64 for VSX may have entirely
+different definitions of possible elwidths.*
+
+The default of
  elwidth=64 is the pre-existing (Scalar) behaviour which remains 100%
  unchanged. Thus, `addi` is now joined by a 32-bit, 16-bit, and 8-bit
  variant of `addi`, but the sole exclusive difference is the width.
-*In no way* is the actual `addi` instruction fundamentally altered.
+*In no way* is the actual `addi` instruction fundamentally altered
+to become an entirely different operation (such as a subtract or multiply).
  FP Operations elwidth overrides are also defined, as explained in
  the [[svp64/appendix]].
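
A minimal sketch of the point above, assuming (as a simplification) that an element-width override does nothing more than truncate the operands and the result to `elwidth` bits. The function name `add_elwidth` is hypothetical, and details such as immediate sign-extension are deliberately ignored; the [[svp64/appendix]] remains the authority.

```python
def add_elwidth(a, b, elwidth=64):
    """Add two element values, keeping only the low `elwidth` bits.

    Simplifying assumption: the operation is still an add; only the
    width of the operands and of the result changes.
    """
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    mask = (1 << elwidth) - 1
    return ((a & mask) + (b & mask)) & mask
```

For example, `add_elwidth(0xFF, 1, 8)` wraps to 0, whereas the same operands at `elwidth=16` give `0x100`: the operation is unchanged, only the width differs.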
  
@@ -233,7 +333,7 @@ To be absolutely clear:
  ```
      There are no conceptual arithmetic ordering or other changes over the
      Scalar Power ISA definitions to registers or register files or to
-    arithmetic or Logical Operations beyond element-width subdivision
+    arithmetic or Logical Operations, beyond element-width subdivision
  ```
  
  Element offset
@@ -327,7 +427,7 @@ For clarity in the table below:
  * The GPR-numbering is considered LSB0-ordered
  * The Element-numbering (result0-result4) is LSB0-ordered
  * Each of the results (result0-result4) are 16-bit
-* "same" indicates "no change as a result of the Vectorised add"
+* "same" indicates "no change as a result of the Vectorized add"
  
  ```
      | MSB0:  | 0:15    | 16:31   | 32:47   | 48:63   |
@@ -346,22 +446,7 @@ the example having VL=5.  Thus on "wrapping" - sequential progression
  from GPR(1) into GPR(2) - the 5th result modifies **only** the bottom
  16 LSBs of GPR(1).
  
-Hardware Architectural note: to avoid a Read-Modify-Write at the register
-file it is strongly recommended to implement byte-level write-enable lines
-exactly as has been implemented in DRAM ICs for many decades. Additionally
-the predicate mask bit is advised to be associated with the element
-operation and alongside the result ultimately passed to the register file.
-When element-width is set to 64-bit the relevant predicate mask bit
-may be repeated eight times and pull all eight write-port byte-level
-lines HIGH. Clearly when element-width is set to 8-bit the relevant
-predicate mask bit corresponds directly with one single byte-level
-write-enable line.  It is up to the Hardware Architect to then amortise
-(merge) elements together into both PredicatedSIMD Pipelines as well
-as simultaneous non-overlapping Register File writes, to achieve High
-Performance designs.  Overall it helps to think of the register files
-as being much more akin to a byte-level-addressable SRAM.
-
-If the 16-bit operation were to be followed up with a 32-bit Vectorised
+If the 16-bit operation were to be followed up with a 32-bit Vectorized
  Operation, the exact same contents would be viewed as follows:
  
  ```
@@ -384,6 +469,21 @@ form because `MSR.LE` is directly in control of the Memory-to-Register
  byte-ordering. This section is exclusively about how to correctly perceive
  Simple-V-Augmented **Register** Files.
  
+*Engineering note: to avoid a Read-Modify-Write at the register
+file it is strongly recommended to implement byte-level write-enable lines
+exactly as has been implemented in DRAM ICs for many decades. Additionally
+the predicate mask bit is advised to be associated with the element
+operation and alongside the result ultimately passed to the register file.
+When element-width is set to 64-bit the relevant predicate mask bit
+may be repeated eight times and pull all eight write-port byte-level
+lines HIGH. Clearly when element-width is set to 8-bit the relevant
+predicate mask bit corresponds directly with one single byte-level
+write-enable line.  It is up to the Hardware Architect to then amortise
+(merge) elements together into both PredicatedSIMD Pipelines as well
+as simultaneous non-overlapping Register File writes, to achieve High
+Performance designs.  Overall it helps to think of the GPR and FPR
+register files as being much more akin to a 64-bit-wide byte-level-addressable SRAM.*
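
The byte-level write-enable idea in the note above can be sketched in Python as follows. This is an illustrative software model only, not a hardware description: the names `byte_write_enables` and `masked_write` are hypothetical, and it assumes a 64-bit register with element 0 at the LSB end.

```python
def byte_write_enables(elwidth_bits, elem_in_reg, pred_bit):
    """Return an 8-bit mask of byte write-enable lines for one element.

    The single predicate bit is repeated once per byte of the element,
    then shifted to the element's position within the 64-bit register.
    """
    nbytes = elwidth_bits // 8
    if nbytes not in (1, 2, 4, 8) or not 0 <= elem_in_reg < 8 // nbytes:
        raise ValueError("element does not fit in a 64-bit register")
    lanes = (1 << nbytes) - 1 if pred_bit else 0
    return lanes << (elem_in_reg * nbytes)


def masked_write(old, new, enables):
    """Write only the enabled bytes of `new` into `old` (no read-modify-write
    is needed in hardware; here it is modelled byte by byte)."""
    result = old
    for byte in range(8):
        if (enables >> byte) & 1:
            mask = 0xFF << (8 * byte)
            result = (result & ~mask) | (new & mask)
    return result & (2**64 - 1)
```

With `elwidth=64` all eight lines are raised (`0xFF`); with `elwidth=8` exactly one line corresponds to the predicate bit; a masked-out element raises no lines at all and leaves the register untouched.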
+
  **Comparative equivalent using VSR registers**
  
  For a comparative data point the VSR Registers may be expressed in the
@@ -399,7 +499,6 @@ element (numbered zero) being at the bitwise-numbered **LSB** end of the
  register, where VSX does the reverse: places the numerically-*highest*
  (last-numbered) element at the LSB end of the register.
  
-
  ```
      #pragma pack
      typedef union {
@@ -521,14 +620,16 @@ MAXVL when understanding this key aspect of SimpleV.
  ## Register Naming and size
  
  As indicated above SV Registers are simply the GPR, FPR and CR register
-files extended linearly to larger sizes; SV Vectorisation iterates
+files extended linearly to larger sizes; SV Vectorization iterates
  sequentially through these registers (LSB0 sequential ordering from 0
  to VL-1).
  
  Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is
-r0 to r31, SV extends this as r0 to r127.  Likewise FP registers are
+r0 to r31, SV extends this range (in the Upper Compliancy Levels of SV)
+as r0 to r127.  Likewise FP registers are
  extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries,
-CR0 thru CR127.
+CR0 thru CR127.  In the Lower SV Compliancy Levels the quantity of registers
+remains the same in order to reduce implementation cost for Embedded systems.
  
  The names of the registers therefore reflects a simple linear extension
  of the Power ISA v3.0B / v3.1B register naming, and in hardware this
@@ -608,6 +709,7 @@ Examples of permitted instructions:
      sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127
      sv.mfcr cr5, *cr40                   # only one source (CR40) copied to CR5
      sv.mfcr *cr16, cr40                  # Vector-Splat CR40 onto CR16,17,18...
+    sv.mfcr *cr16, cr3                   # Vector-Splat CR3 onto CR16,17,18...
  ```
  
  Examples of prohibited instructions:
@@ -641,8 +743,8 @@ of scope for this version of SVP64.
  ## SVP64 Remapped Encoding (`RM[0:23]`)
  
  In the SVP64 Vector Prefix spaces, the 24 bits 8-31 are termed `RM`. Bits
-32-37 are the Primary Opcode of the Suffix "Defined Word". 38-63 are the
-remainder of the Defined Word.  Note that the new EXT232-263 SVP64 area
+32-37 are the Primary Opcode of the Suffix "Defined Word-instruction". 38-63 are the
+remainder of the Defined Word-instruction.  Note that for the new EXT232-263 SVP64 area
  it is mandatory that bit 32 is set to 1.
  
  | 0-5 | 6 | 7 | 8-31     | 32-37  | 38-63    |Description                        |
@@ -680,7 +782,7 @@ on context after decoding of the Scalar suffix:
  | Field Name | Field bits | Description                            |
  |------------|------------|----------------------------------------|
  | ELWIDTH       | `4:5`      | Element Width                       |
-| ELWIDTH_SRC   | `6:7`      | Element Width for Source      |
+| ELWIDTH_SRC   | `6:7`      | Element Width for Source (or MASK_SRC in 2PM)    |
  | EXTRA         | `10:18`    | Register Extra encoding                |
  | MODE          | `19:23`    | changes Vector behaviour               |
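
As a rough illustration of how the fields in the table above sit inside `RM`, the following Python sketch pulls `RM` out of a 32-bit prefix word (MSB0 bits 8-31) and then splits out only the four fields listed here. The names `RM_FIELDS`, `msb0_field` and `decode_rm` are hypothetical, and the sketch ignores the fact that the interpretation of some fields depends on the decoded suffix.

```python
# Field positions within RM[0:23], MSB0-numbered, taken from the table above.
RM_FIELDS = {
    "ELWIDTH": (4, 5),
    "ELWIDTH_SRC": (6, 7),
    "EXTRA": (10, 18),
    "MODE": (19, 23),
}


def msb0_field(value, start, end, width):
    """Return bits start..end (inclusive, MSB0) of a width-bit value."""
    shift = width - 1 - end
    return (value >> shift) & ((1 << (end - start + 1)) - 1)


def decode_rm(prefix_word):
    """Split the 24-bit RM area of a 32-bit prefix word into named fields."""
    rm = msb0_field(prefix_word, 8, 31, width=32)
    return {
        name: msb0_field(rm, start, end, width=24)
        for name, (start, end) in RM_FIELDS.items()
    }
```

Because every field is looked up by its MSB0 range, extending the sketch to further fields only requires adding entries to `RM_FIELDS`.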
  
@@ -788,6 +890,14 @@ The SUBVL encoding value may be thought of as an inclusive range of a
  sub-vector.  SUBVL=2 represents a vec2, its encoding is 0b01, therefore
  this may be considered to be elements 0b00 to 0b01 inclusive.
  
+Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+element operation issued, SUBVL element operations are issued (as an inner loop).
+The key difference between VL looping and SUBVL looping
+is that predication bits are applied per
+**group**, rather than by individual element.  
+
+Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
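
The difference between VL looping and SUBVL looping can be sketched as a pair of nested loops. This is a simplified model under stated assumptions: the function name `subvl_loop` is hypothetical, predicate bit `i` (LSB0) is assumed to govern group `i`, elements are assumed to be numbered `i * subvl + j`, and the `pack`/`unpack` modes are ignored.

```python
def subvl_loop(vl, subvl, predicate, op):
    """Issue `op` once per element, with predication applied per group.

    Outer loop: VL groups, each gated by one predicate bit.
    Inner loop: SUBVL elements of the group, all sharing that bit.
    """
    results = []
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # the whole sub-vector is skipped
        for j in range(subvl):
            results.append(op(i * subvl + j))
    return results
```

With `vl=3`, `subvl=2` and a predicate of `0b101`, groups 0 and 2 are executed in full (elements 0, 1, 4 and 5) and group 1 is skipped as a unit.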
+
  ## MASK/MASK_SRC & MASKMODE Encoding
  
  One bit (`MASKMODE`) indicates the mode: CR or Int predication.   The two
@@ -812,11 +922,12 @@ used for both src and dest, or different regs (one for src, one for dest).
  Likewise CR based twin predication has a second set of 3 bits, allowing
  a different test to be applied.
  
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed.  In practice (for CR Fields)
-this creates an unnecessary block on parallelism.  Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop.  Doing so results
+Note that it cannot necessarily be assumed that Predicate Masks
+(whether INT or CR) are read in full *before* the operations proceed.  In practice (for CR Fields)
+this creates an unnecessary block on parallelism, prohibiting
+"Vector Chaining".  Therefore, it is up
+to the programmer to ensure that the CR field Elements used as Predicate Masks
+are not overwritten by any parallel Vector Loop.  Doing so results
  in **UNDEFINED** behaviour, according to the definition outlined in the
  Power ISA v3.0B Specification.
  
@@ -825,6 +936,17 @@ of individual CR fields until the actual predicated element operation
  needs to take place, safe in the knowledge that no programmer will have
  issued a Vector Instruction where previous elements could have overwritten
  (destroyed) not-yet-executed CR-Predicated element operations.
+This is particularly an issue when using REMAP, as the order in
+which CR-Field-based Predicate Mask bits could be read on a per-element
+execution basis could well conflict with the order in which prior
+elements wrote to the very same CR Field.
+
+Additionally Programmers should avoid using r3, r10 or r30
+as destination registers when these are also used as a Predicate
+Mask. Doing so is again UNDEFINED behaviour.
+
+Usually in 2P `MASK_SRC` is exclusively in the EXTRA area. However for
+LD/ST-Indexed a different Encoding is required, designated `2PM`.
  
  ### Integer Predication (MASKMODE=0)
  
@@ -867,12 +989,12 @@ following meaning:
  | 110   | so/un    | `CR[offs+i].FU` is set   |
  | 111   | ns/nu    | `CR[offs+i].FU` is clear |
  
-`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised
+`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorized
  Rc=1 operations (see below).  Rc=1 operations start from CR8 (TBD).
  
-The CR Predicates chosen must start on a boundary that Vectorised CR
+The CR Predicates chosen must start on a boundary that Vectorized CR
  operations can access cleanly, in full.  With EXTRA2 restricting starting
-points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and
+points to multiples of 8 (CR0, CR8, CR16...) both Vectorized Rc=1 and
  CR Predicate Masks have to be adapted to fit on these boundaries as well.
  
  ## Extra Remapped Encoding <a name="extra_remap"> </a>
@@ -917,6 +1039,12 @@ some compromises have to be made.
  * `RM-2P-1S1D` Twin Predication (src=1, dest=1)
  * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
  * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
+* `RM-2PM-2S1D` Twin Predication (src=2, dest=1) for LD/ST Update (Indexed)
+
+The `2PM` designation uses bits 6 and 7 as well as the 9 EXTRA bits
+in order to extend two registers to
+EXTRA3, sacrificing destination elwidths in the process.
+`MASK_SRC` has a different encoding in `2PM`.
  
  ### RM-1P-3S1D
  
@@ -953,7 +1081,7 @@ they are different SV regs.
  
  * `rlwimi RA, RS, ...`
  * Rsrc1_EXTRA3 applies to RS as the first src
-* Rsrc2_EXTRA3 applies to RA as the secomd src
+* Rsrc2_EXTRA3 applies to RA as the second src
  * Rdest_EXTRA3 applies to RA to create an **independent** dest.
  
  With the addition of the EXTRA bits, the three registers
@@ -989,7 +1117,7 @@ single-predicate, three registers (2 read, 1 write)
  ### RM-2P-2S1D/1S2D/3S
  
  The primary purpose for this encoding is for Twin Predication on LOAD
-and STORE operations.  see [[sv/ldst]] for detailed anslysis.
+and STORE operations.  See [[sv/ldst]] for detailed analysis.
  
  **RM-2P-2S1D:**
  
@@ -1031,6 +1159,24 @@ RM-2P-2S1D and the src spec for RA is also used for the same RA as a dest.
  Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
  or increased latency in some implementations due to lane-crossing.
  
+### RM-2PM-2S1D/1S2D/3S
+
+The primary purpose for this encoding is for Twin Predication on LOAD
+and STORE operations providing EXTRA3 for RT, RA and RS.
+See [[sv/ldst]] for detailed analysis.
+
+**RM-2PM-2S1D:**
+
+RT or RS requires EXTRA3, RA requires EXTRA3, but for RB EXTRA2 will
+suffice.  `MASK_SRC` may be read from the bits normally used for dest-elwidth.
+
+| Field Name | Field bits | Description                     |
+|------------|------------|----------------------------|
+| Rdest_EXTRA3 | `10:12`  | extends Rdest (R\*\_EXTRA3 Encoding)   |
+| Rsrc1_EXTRA3 | `13:15`  | extends Rsrc1 (R\*\_EXTRA3 Encoding)   |
+| Rsrc2_EXTRA2 | `16:17`  | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
+| MASK_SRC     | `6:7,18` | Execution Mask for Source     |
+
  ## R\*\_EXTRA2/3
  
  EXTRA is the means by which two things are achieved:
@@ -1192,12 +1338,16 @@ For a 3-bit operand (e.g. BFA):
  | 10    | Vector | `CR0-CR112`/16 | BFA 0 | 0b000   |
  | 11    | Vector | `CR8-CR120`/16 | BFA 1 | 0b000   |
  
+<!-- hide -->
  ## Appendix
  
  Now at its own page: [[svp64/appendix]]
  
---------
  
  [[!tag standards]]
  
+<!-- show -->
+
+--------
+
  \newpage{}