(no commit message)

[libreriscv.git] / openpower / sv / svp_rewrite / svp64.mdwn
diff --git a/openpower/sv/svp_rewrite/svp64.mdwn b/openpower/sv/svp_rewrite/svp64.mdwn

index 4b9022883cd4a0d28d7e0ffa97b01dc2feedc252..537979c6c361b85aa24b4137b1ee5fe68c8555b5 100644 (file)
--- a/openpower/sv/svp_rewrite/svp64.mdwn
+++ b/openpower/sv/svp_rewrite/svp64.mdwn
@@ -1,7 +1,13 @@
-# Rewrite of SVP64 for OpenPower ISA v3.1
+# Links
  
+* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001498.html>
  * [[svp64/discussion]]
  * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001650.html>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=550>
+
+# Rewrite of SVP64 for OpenPower ISA v3.1
+
+This document focuses on the encoding of SV.  It is best read in conjunction with the [[sv/overview]] which explains the background.
  
  The plan is to create an encoding for SVP64, then to create an encoding
  for SVP48, then to reorganize them both to improve field overlap,
@@ -14,6 +20,12 @@ and counting up as you move to the LSB end). All bit ranges are inclusive
  64-bit instructions are split into two 32-bit words, the prefix and the
  suffix. The prefix always comes before the suffix in PC order.
  
+| 0:5    | 6:31         | 0:31         |
+|--------|--------------|--------------|
+| EXT01  | v3.1B Prefix | v3.1B Suffix |
+
+svp64 fits into the "reserved" portions of the v3.1B prefix, making it possible for svp64, v3.0B and v3.1B (including 64-bit prefixed) instructions to co-exist in the same binary without conflict.
+
  # Definition of Reserved in this spec.
  
  For the new fields added in SVP64, instructions that have any of their
@@ -32,24 +44,13 @@ v3.0/1B instructions covered by the prefix are "unaltered". This is termed `scal
  Note that this is completely different from when VL=0.  VL=0 turns all operations under its influence into `nops` (regardless of the prefix)
   whereas when VL=1 and the SV prefix is all zeros, the operation simply acts as if SV had not been applied at all to the instruction  (an "identity operation").
  
-# XER, SO and other global flags
-
-Vector systems are expected to be high performance.  This is achieved
-through parallelism, which requires that elements in the vector be
-independent.  XER SO and other global "accumulation" flags (CR.OV) cause
-Read-Write Hazards on single-bit global resources, having a significant
-detrimental adverse effect.
-
-Consequently in SV, XER.SO and CR.OV behaviour is disregarded.  XER is
-simply neither read nor written.  This includes when `scalar identity behaviour` occurs.  If OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1 instructions should be used, not SV Prefixed ones.
-
  # Register Naming and size
  
  SV Registers are simply the INT, FP and CR register files extended
  linearly to larger sizes; SV Vectorisation iterates sequentially through these registers.
  
  Where the integer regfile in standard scalar
-OpenPOWER v3.0B and v3.1B is r0 to r31, SV extends this as r0 to r127.
+OpenPOWER v3.0B/v3.1B is r0 to r31, SV extends this as r0 to r127.
  Likewise FP registers are extended to 128 (fp0 to fp127), and CRs are
  extended to 64 entries, CR0 thru CR63.
  
@@ -81,9 +82,12 @@ at the LSB.
  The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding
  is defined in the Prefix Fields section.
  
-## Prefix Opcode Map (64-bit instruction encoding) (prefix bits 6:11)
+## Prefix Opcode Map (64-bit instruction encoding)
+
+In the original table in the v3.1B OpenPOWER ISA Spec on p1350, Table 12, prefix bits 6:11 are shown, with their allocations to different v3.1B prefix "modes".
  
-(shows both PowerISA v3.1 instructions as well as new SVP instructions; empty spaces are yet-to-be-allocated Illegal Instructions)
+The table below shows how both PowerISA v3.1 instructions and the new SVP instructions fit;
+empty spaces are yet-to-be-allocated Illegal Instructions.
  
  | 6:11 | ---000 | ---001 | ---010 | ---011 | ---100 | ---101 | ---110 | ---111 |
  |------|--------|--------|--------|--------|--------|--------|--------|--------|
@@ -96,8 +100,14 @@ is defined in the Prefix Fields section.
  |110---| MRR    |        |        |        | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
  |111---|        | MMIRR  |        |        | `SVP64`| `SVP64`| `SVP64`| `SVP64`|
  
+Note that by taking up a block of 16, where in every case bits 7 and 9 are set, this allows svp64 to utilise four bits of the v3.1B Prefix space and "allocate" them to svp64's Remapped Encoding field, instead.
+
  ## Prefix Fields
  
+To "activate" svp64 (in a way that does not conflict with v3.1B 64-bit Prefix mode), fields within the v3.1B Prefix Opcode Map are set
+(see Prefix Opcode Map, above), leaving 24 bits "free" for use by SV.
+This is achieved by setting bits 7 and 9 to 1:  
+
  | Name       | Bits    | Value | Description                    |
  |------------|---------|-------|--------------------------------|
  | EXT01      | `0:5`   | `1`   | Indicates Prefixed 64-bit      |
@@ -107,6 +117,17 @@ is defined in the Prefix Fields section.
  | SVP64_9    | `9`     | `1`   | Indicates this is SVP64        |
  | `RM[2:23]` | `10:31` |       | Bits 2-23 of Remapped Encoding |
  
+Laid out bitwise, this is as follows, showing how the 32-bits of the prefix
+are constructed:
+
+| 0:5    | 6     | 7 | 8     | 9 | 10:31    |
+|--------|-------|---|-------|---|----------|
+| EXT01  | RM    | 1 | RM    | 1 | RM       |
+| 000001 | RM[0] | 1 | RM[1] | 1 | RM[2:23] |
+
+Following the prefix will be the suffix: this is simply a 32-bit v3.0B / v3.1B
+instruction.  That instruction becomes "prefixed" with the SVP context: the
+Remapped Encoding field (RM).
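
The bitwise layout above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the helper names `encode_svp64_prefix`, `is_svp64_prefix` and `decode_svp64_rm` are hypothetical, and the sketch assumes the MSB0 numbering used in this document (bit 0 is the most significant bit of both the 32-bit prefix and the 24-bit RM field).

```python
EXT01 = 0b000001  # primary opcode value held in prefix bits 0:5


def encode_svp64_prefix(rm):
    """Pack a 24-bit RM value into a 32-bit SVP64 prefix word."""
    if not 0 <= rm < (1 << 24):
        raise ValueError("RM must fit in 24 bits")
    rm0 = (rm >> 23) & 1        # RM[0]
    rm1 = (rm >> 22) & 1        # RM[1]
    rm_low = rm & 0x3FFFFF      # RM[2:23]
    return ((EXT01 << 26)       # prefix bits 0:5
            | (rm0 << 25)       # prefix bit 6
            | (1 << 24)         # prefix bit 7, always 1 for SVP64
            | (rm1 << 23)       # prefix bit 8
            | (1 << 22)         # prefix bit 9, always 1 for SVP64
            | rm_low)           # prefix bits 10:31


def is_svp64_prefix(word):
    """True if bits 0:5 are EXT01 and bits 7 and 9 are both set."""
    fixed = (0x3F << 26) | (1 << 24) | (1 << 22)
    expected = (EXT01 << 26) | (1 << 24) | (1 << 22)
    return (word & fixed) == expected


def decode_svp64_rm(word):
    """Recover the 24-bit RM value from an SVP64 prefix word."""
    if not is_svp64_prefix(word):
        raise ValueError("not an SVP64 prefix")
    rm0 = (word >> 25) & 1
    rm1 = (word >> 23) & 1
    return (rm0 << 23) | (rm1 << 22) | (word & 0x3FFFFF)
```

With RM set to all zeros only the three fixed fields remain set, which makes the positions of EXT01 and of bits 7 and 9 easy to check by eye.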
  
  # Remapped Encoding Fields
  
@@ -115,41 +136,45 @@ variants.  There are two categories:  Single and Twin Predication.
  Due to space considerations further subdivision of Single Predication
  is based on whether the number of src operands is 2 or 3.
  
-
  * `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
  * `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand instructions (src1 src2 dest)
  * `RM-2P-1S1D` Twin Predication (src=1, dest=1)
  * `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
  * `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update
  
-## RM-1P-3S1D
+## Common RM fields
+
+The following fields are common to all Remapped Encodings:
+
  
  | Field Name | Field bits | Description                            |
  |------------|------------|----------------------------------------|
-| MASK\_KIND    | `0`        | Execution Mask Kind                 |
+| MASK\_KIND    | `0`        | Execution (predication) Mask Kind                 |
  | MASK          | `1:3`      | Execution Mask                      |
  | ELWIDTH       | `4:5`      | Element Width                       |
-| SUBVL         | `6:7`      | Sub-vector length                   |
+| SUBVL         | `6:7`      | Sub-vector length                   |
+| MODE          | `19:23` | changes Vector behaviour               |
+
+Bits 8 to 18 are further decoded depending on RM category for the instruction.
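
As a rough sketch of how the common fields might be pulled out of a 24-bit RM value, the Python below uses MSB0 bit numbering within RM. The names `rm_field` and `decode_common_rm` are hypothetical, and the bits between SUBVL and MODE are deliberately left undecoded because they depend on the RM category.

```python
RM_BITS = 24  # width of the Remapped Encoding field


def rm_field(rm, start, end):
    """Extract RM bits start:end (inclusive, MSB0 numbering) as an integer."""
    width = end - start + 1
    shift = RM_BITS - 1 - end
    return (rm >> shift) & ((1 << width) - 1)


def decode_common_rm(rm):
    """Decode only the fields shared by every RM category."""
    return {
        "MASK_KIND": rm_field(rm, 0, 0),
        "MASK": rm_field(rm, 1, 3),
        "ELWIDTH": rm_field(rm, 4, 5),
        "SUBVL": rm_field(rm, 6, 7),
        "MODE": rm_field(rm, 19, 23),
    }
```

A category-specific decoder could reuse `rm_field` for its own ranges, for example `rm_field(rm, 17, 18)` for ELWIDTH_SRC in the categories that define it.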
+
+## RM-1P-3S1D
+
+| Field Name | Field bits | Description                            |
+|------------|------------|----------------------------------------|
  | Rdest\_EXTRA2 | `8:9`   | extends Rdest (R\*\_EXTRA2 Encoding)   |
  | Rsrc1\_EXTRA2 | `10:11` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
  | Rsrc2\_EXTRA2 | `12:13` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
  | Rsrc3\_EXTRA2 | `14:15` | extends Rsrc3 (R\*\_EXTRA2 Encoding)   |
  | reserved      | `16`    | reserved                               |
-| MODE          | `19:23` | changes Vector behaviour               |
  
  ## RM-1P-2S1D
  
  | Field Name | Field bits | Description                               |
  |------------|------------|-------------------------------------------|
-| MASK\_KIND    | `0`     | Execution Mask Kind                       |
-| MASK          | `1:3`   | Execution Mask                            |
-| ELWIDTH       | `4:5`   | Element Width                             |
-| SUBVL         | `6:7`   | Sub-vector length                         |
  | Rdest\_EXTRA3 | `8:10`  | extends Rdest  |
  | Rsrc1\_EXTRA3 | `11:13` | extends Rsrc1  |
  | Rsrc2\_EXTRA3 | `14:16` | extends Rsrc2    |
  | ELWIDTH_SRC   | `17:18` | Element Width for Source      |
-| MODE          | `19:23` | changes Vector behaviour                  |
  
  These are for 2 operand 1 dest instructions, such as `add RT, RA,
  RB`. However also included are unusual instructions with an implicit dest
@@ -171,45 +196,39 @@ augmented to 7 bits in length.
  
  Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. 
  
-## RM-2P-1S1D
+## RM-2P-1S1D/2S
  
-| Field Name | Field bits | Description                                 |
+| Field Name | Field bits | Description                 |
  |------------|------------|----------------------------|
-| MASK_KIND  | `0`        | Execution Mask Kind                          |
-| MASK       | `1:3`      | Execution Mask                               |
-| ELWIDTH    | `4:5`      | Element Width                                |
-| SUBVL      | `6:7`      | Sub-vector length                           |
-| Rdest_EXTRA3 | `8:10`     | extends Rdest                     |
-| Rsrc1_EXTRA3 | `11:13`    | extends Rsrc1                      |
-| MASK_SRC     | `14:16`    | Execution Mask for Source     |
-| ELWIDTH_SRC  | `17:18`    | Element Width for Source      |
-| MODE         | `19:23`    | changes Vector behaviour                       |
+| Rdest_EXTRA3 | `8:10`     | extends Rdest             |
+| Rsrc1_EXTRA3 | `11:13`    | extends Rsrc1             |
+| MASK_SRC     | `14:16`    | Execution Mask for Source |
+| ELWIDTH_SRC  | `17:18`    | Element Width for Source  |
  
  Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance or increased latency in some implementations due to lane-crossing. 
  
-## RM-2P-2S1D/1S2D
+`RM-2P-2S` is for `stw` etc. and is Rsrc1 Rsrc2.
+
+## RM-2P-2S1D/1S2D/3S
  
  The primary purpose for this encoding is for Twin Predication on LOAD
  and STORE operations.  see [[sv/ldst]] for detailed analysis.
  
  RM-2P-2S1D:
  
-| Field Name | Field bits | Description                                 |
+| Field Name | Field bits | Description                     |
  |------------|------------|----------------------------|
-| MASK_KIND  | `0`        | Execution Mask Kind                          |
-| MASK       | `1:3`      | Execution Mask                               |
-| ELWIDTH    | `4:5`      | Element Width                                |
-| SUBVL      | `6:7`      | Sub-vector length                           |
  | Rdest_EXTRA2 | `8:9`   | extends Rdest (R\*\_EXTRA2 Encoding)   |
  | Rsrc1_EXTRA2 | `10:11` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
  | Rsrc2_EXTRA2 | `12:13` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
  | MASK_SRC     | `14:16`    | Execution Mask for Source     |
  | ELWIDTH_SRC  | `17:18`    | Element Width for Source      |
-| MODE         | `19:23`    | changes Vector behaviour                       |
  
  Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
  is in bits 8:9, Rdest1_EXTRA2 in 10:11)
  
+Note also that for 3S (to cover `stdx` etc.) the names are switched to 3 src: Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.
+
  Note also that LD with update indexed, which takes 2 src and 2 dest
  (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers and also
  Twin Predication.  therefore these are treated as RM-2P-2S1D and the
@@ -229,7 +248,7 @@ These are the modes:
  * **sat mode** or saturation: clamps each element result to a min/max rather than overflows / wraps.  allows signed and unsigned clamping. 
  * **reduce mode**. a mapreduce is performed.  the result is a scalar.  a result vector however is required, as the upper elements may be used to store intermediary computations.  the result of the mapreduce is in the first element with a nonzero predicate bit.  see separate section below.
    note that there are comprehensive caveats when using this mode.
-* **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch testing) and if the test fails it is as if the predicate bit was zero.  When Rc=1 the CR element (CR0) however is still stored in the CR regfile.  This scheme does not apply to crops (crand, cror).
+* **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch testing) and if the test fails it is as if the predicate bit was zero.  When Rc=1 the CR element however is still stored in the CR regfile, even if the test failed.  This scheme does not apply to crops (crand, cror).  See appendix for details.
  
  Note that ffirst and reduce modes are not anticipated to be high-performance in some implementations.  ffirst due to interactions with VL, and reduce due to it requiring additional operations to produce a result.  normal, saturate and pred-result are however independent and may easily be parallelised to give high performance, regardless of the value of VL.
  
@@ -470,10 +489,74 @@ high, or accept that for twin predication VL must not exceed the range
  where overlap will occur, *or* that they use the same starting point
  but select different *bits* of the same CRs
  
-`offs` is defined as CR48 (6x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below).  Arithmetic Rc=1 operations start from CR16 (TBD); FP Rc=1 from CR32 (TBD).
+`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised Rc=1 operations (see below).  Rc=1 operations start from CR8 (TBD).
  
  # Appendix
  
+## XER, SO and other global flags
+
+Vector systems are expected to be high performance.  This is achieved
+through parallelism, which requires that elements in the vector be
+independent.  XER SO and other global "accumulation" flags (CR.OV) cause
+Read-Write Hazards on single-bit global resources, having a significant
+detrimental effect.
+
+Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including in cmp instructions).  XER is
+simply neither read nor written.  This includes when `scalar identity behaviour` occurs.  If precise OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1 instructions should be used without an SV Prefix.
+
+An interesting side-effect of this decision is that the OE flag is now free for other uses when SV Prefixing is used.
+
+Regarding XER.CA: this does not fit either: it was designed for a scalar ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given Vector element.  This provides a means to perform large parallel batches of Vectorised carry-capable additions.  crweird instructions can be used to transfer the CRs in and out of an integer, where bitmanipulation may be performed to analyse the carry bits (including carry lookahead propagation) before continuing with further parallel additions.
+
+## v3.0B/v3.1B relevant instructions
+
+SV is primarily designed for use as an efficient hybrid 3D GPU / VPU / CPU ISA.
+
+As mentioned above, OE=1 is not applicable in SV, freeing this bit for alternative uses.  Additionally, Vectorisation of the VSX SIMD system likewise makes no sense whatsoever. SV *replaces* VSX and provides, at the very minimum, predication (which VSX was designed without).  Thus all VSX Major Opcodes - all of them - are "unused" and must raise illegal instruction exceptions in SV Prefix Mode.
+
+Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to have: not only is their functionality provided by SV, but the SV alternatives may be predicated as well, making them far better suited to use in function calls and context-switching.
+
+Additionally, some v3.0/1 instructions simply make no sense at all in a Vector context: `twi` and `tdi` fall into this category, as do branch operations as well as `sc` and `scv`.  Here there is simply no point trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions should be called instead.
+
+Fortuitously this leaves several Major Opcodes free for use by SV to fit alternative future instructions.  In a 3D context this means Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST operations, and others critical to an efficient, effective 3D GPU and VPU ISA. With such instructions being included as standard in other commercially-successful GPU ISAs it is likewise critical that a 3D GPU/VPU based on svp64 also have such instructions.
+
+Note however that svp64 is stand-alone and is in no way critically dependent on the existence or provision of 3D GPU or VPU instructions. These should be considered extensions, and their discussion and specification is out of scope for this document.
+
+Note, again: this is *only* under svp64 prefixing.  Standard v3.0B / v3.1B is *not* altered by svp64 in any way.
+
+### Major opcode map (v3.0B)
+
+This table is taken from v3.0B.
+Table 9: Primary Opcode Map (opcode bits 0:5)
+
+        |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
+    000 |        |       |  tdi  | twi   | EXT04 |        |       | mulli | 000
+    001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
+    010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
+    011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
+    100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
+    101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
+    110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
+    111 |  lq    | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
+        |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
+
+### Suitable for svp64
+
+This is the same table containing v3.0B Primary Opcodes except those that make no sense in a Vectorisation Context have been removed.  These removed POs can, *in the SV Vector Context only*, be assigned to alternative (Vectorised-only) instructions, including future extensions.
+
+Note, again, to emphasise: outside of svp64 these opcodes **do not** change.  When not prefixed with svp64 these opcodes **specifically** retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
+
+        |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
+    000 |        |       |       |       |       |        |       | mulli | 000
+    001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
+    010 |        |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
+    011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
+    100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
+    101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
+    110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
+    111 |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
+        |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
+
  ## Twin Predication
  
  This is a novel concept that allows predication to be applied to a single
@@ -505,28 +588,28 @@ Additional unusual capabilities of Twin Predication include a back-to-back
  version of VCOMPRESS-VEXPAND which is effectively the ability to do 
  sequentially ordered multiple VINSERTs.  The source predicate selects a 
  sequentially ordered subset of elements to be inserted; the destination predicate specifies the sequentially ordered recipient locations.
+This is equivalent to
+`llvm.masked.compressstore.*`
+followed by
+`llvm.masked.expandload.*`
  
  ## Rounding, clamp and saturate
  
-One of the issues with vector ops is that in integer DSP ops for example
-in Audio the operation must clamp or saturate rather than overflow or
-ignore the upper bits and become a modulo operation.  This for Audio
-is extremely important, also to provide an indicator as to whether
-saturation occurred.  see  [[av_opcodes]].
+see  [[av_opcodes]].
  
  To help ensure that audio quality is not compromised by overflow,
  "saturation" is provided, as well as a way to detect when saturation
-occurred (Rc=1). When Rc=1 there will be a *vector* of CRs, one CR per
+occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs, one CR per
  element in the result (Note: this is different from VSX which has a
  single CR per block).
  
  When N=0 the result is saturated to within the maximum range of an
  unsigned value.  For integer ops this will be 0 to 2^elwidth-1. Similar
  logic applies to FP operations, with the result being saturated to
-maximum rather than returning INF.
+maximum rather than returning INF, and the minimum to +0.0
  
  When N=1 the same occurs except that the result is saturated to the min
-or max of a signed result.
+or max of a signed result, and for FP to the min and max value rather than returning +/- INF.
  
  When Rc=1, the CR "overflow" bit is set on the CR associated with the
  element, to indicate whether saturation occurred.  Note that due to
@@ -535,6 +618,8 @@ the hugely detrimental effect it has on parallel processing, XER.SO is
  overflow bit is therefore simply set to zero if saturation did not occur,
  and to one if it did.
  
+Note also that saturation on operations that produce a carry output is prohibited, due to the conflicting use of the CR.so bit for storing whether saturation occurred.
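
The integer clamping rules above can be illustrated with a small Python sketch. It is not taken from any implementation: `saturate` and `saturating_vector_add` are hypothetical names, the signed range assumes two's complement, FP saturation is not modelled, and the per-element flag stands in for the CR "overflow" bit that Rc=1 would write.

```python
def saturate(value, elwidth, signed):
    """Clamp an integer result to elwidth bits.

    signed=False models N=0 (range 0 .. 2**elwidth - 1).
    signed=True models N=1 (assumed two's complement range).
    Returns (clamped_value, saturated).
    """
    if signed:
        lo = -(1 << (elwidth - 1))
        hi = (1 << (elwidth - 1)) - 1
    else:
        lo = 0
        hi = (1 << elwidth) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value


def saturating_vector_add(a, b, elwidth, signed):
    """Element-wise add returning results plus one saturation flag per element."""
    results = []
    flags = []
    for x, y in zip(a, b):
        clamped, saturated = saturate(x + y, elwidth, signed)
        results.append(clamped)
        flags.append(int(saturated))  # 1 if saturation occurred, else 0
    return results, flags
```

Keeping one flag per element mirrors the Vector of CRs described above: no element needs to read or update a shared global bit.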
+
  Post-analysis of the Vector of CRs to find out if any given element hit
  saturation may be done using a mapreduced CR op (cror), or by using the
  new crweird instruction, transferring the relevant CR bits to a scalar
@@ -544,15 +629,15 @@ Note that the operation takes place at the maximum bitwidth (max of src and dest
  
  ## Reduce mode
  
-1. limited to single predicated dual src operations (add RT, RA, RB) and
-   to triple source operations where one of the inputs is set to a scalar
-   (these are rare)
+1. limited to single predicated dual src operations (add RT, RA, RB).
+   triple source operations are prohibited (fma).
  2. limited to operations that make sense.  divide is excluded, as is
     subtract (X - Y - Z produces different answers depending on the order)
     and asymmetric CRops (crandc, crorc). sane  operations:
     multiply, min/max, add, logical bitwise OR, most other CR ops.
     operations that do not have the same source and dest register type are
-   also excluded (isel, cmp)
+   also excluded (isel, cmp). operations involving carry or overflow
+   (XER.CA / OV) are also prohibited.
  3. the destination is a vector but the result is stored, ultimately,
     in the first nonzero predicated element.  all other nonzero predicated
     elements are undefined. *this includes the CR vector* when Rc=1
@@ -574,9 +659,6 @@ Note that the operation takes place at the maximum bitwidth (max of src and dest
     unaltered (not used for the purposes of intermediary storage); the
     scalar result is placed in the first available unmasked element.
  
-TODO: Rc=1 on Scalar Logical Operations? is this possible? was space
-reserved in Logical Ops?
-
  Pseudocode for the case where RA==RB:
  
      result = op(iregs[RA], iregs[RA+1])
@@ -593,7 +675,7 @@ Pseudocode for the case where RA==RB:
  TODO: case where RA!=RB which involves first a vector of 2-operand
  results followed by a mapreduce on the intermediates.
  
-Note that when SUBVL!=1 the sub-elements are *independent*, i.e. they
+Note that when SVM is clear and SUBVL!=1 the sub-elements are *independent*, i.e. they
  are mapreduced per *sub-element* as a result.  illustration with a vec2:
  
      result.x = op(iregs[RA].x, iregs[RA+1].x)
@@ -602,16 +684,19 @@ are mapreduced per *sub-element* as a result.  illustration with a vec2:
          result.x = op(result.x, iregs[RA+i].x)
          result.y = op(result.y, iregs[RA+i].y)
  
-When SVM is set and SUBVL!=1, another variant is enabled, which switches
-to `RM-2P-2S1D` such that different elwidths may be applied to src
-and dest.
+Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
+
+
+When SVM is set and SUBVL!=1, another variant is enabled: horizontal subvector mode.  Example for a vec3:
  
      for i in range(VL):
          result = op(iregs[RA+i].x, iregs[RA+i].x)
-        result = op(result, iregs[RA+i].z)
+        result = op(result, iregs[RA+i].y)
          result = op(result, iregs[RA+i].z)
          iregs[RT+i] = result
  
+In this mode, when Rc=1 the Vector of CRs is as normal: each result element creates a corresponding CR element.
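
The two SUBVL!=1 reduction variants can be contrasted with a short Python sketch. It is illustrative only: `reduce_per_subelement` and `reduce_horizontal` are hypothetical names, subvectors are modelled as tuples rather than register slices, predication is omitted, and the horizontal variant assumes each sub-element is folded into the result exactly once.

```python
from functools import reduce
from operator import add


def reduce_per_subelement(op, vec):
    """SVM clear: each sub-element lane (.x, .y, ...) is reduced across VL."""
    lanes = zip(*vec)  # regroup as all .x values, all .y values, ...
    return tuple(reduce(op, lane) for lane in lanes)


def reduce_horizontal(op, vec):
    """SVM set: each subvector is folded into a single scalar per element."""
    return [reduce(op, subvector) for subvector in vec]


# VL=3 elements, each a vec3 (SUBVL=3)
example = [(1, 2, 3), (10, 20, 30), (100, 200, 300)]
per_lane = reduce_per_subelement(add, example)   # one vec3 result
per_element = reduce_horizontal(add, example)    # VL scalar results
```

The first form yields one subvector, the second yields VL scalars, which is why a normal Vector of CRs is only meaningful for the second.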
+
  ## Fail-on-first
  
  Data-dependent fail-on-first has two distinct variants: one for LD/ST,
@@ -653,6 +738,33 @@ One extremely important aspect of ffirst is:
    vectorised operations are effectively `nops` which is
    *precisely the desired and intended behaviour*.
  
+Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason.  For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins subsequent ffirst LD/ST operations on an aligned boundary.  Likewise, to reduce workloads or balance resources.
+
+CR-based data-dependent first on the other hand MUST NOT truncate VL arbitrarily.  This is because it is a precise test on which algorithms will rely.
+
+## pred-result mode
+
+This mode merges common CR testing with predication, saving on instruction count. Below is the pseudocode excluding predicate zeroing and elwidth overrides.
+
+    for i in range(VL):
+        # predication test, skip all masked out elements.
+        if predicate_masked_out(i):
+             continue
+        result = op(iregs[RA+i], iregs[RB+i])
+        CRnew = analyse(result) # calculates eq/lt/gt
+        # Rc=1 always stores the CR
+        if Rc=1:
+            crregs[offs+i] = CRnew
+        # now test CR, similar to branch
+        if CRnew[BO[0:1]] != BO[2]:
+            continue # test failed: cancel store
+        # result optionally stored but CR always is
+        iregs[RT+i] = result
+
+The reason for allowing the CR element to be stored is so that post-analysis
+of the CR Vector may be carried out.  For example: Saturation may have occurred (and been prevented from updating, by the test) but it is desirable to know *which* elements fail saturation.
+
+Note that predication is still respected.  Predicate zeroing is slightly different: elements that fail the CR test *or* are masked out are zeroed.
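
A runnable approximation of the pred-result pseudocode is given below as a sketch. The names are hypothetical, register files are plain Python lists, and the BO field is simplified into two parameters: `cr_bit` (which of lt/gt/eq/so to test) and `expect` (the value that bit must have). Predicate zeroing and elwidth overrides are omitted, as in the pseudocode.

```python
from operator import add


def analyse(result):
    """Build a CR field as (lt, gt, eq, so); so is left at 0 in this sketch."""
    return (int(result < 0), int(result > 0), int(result == 0), 0)


def pred_result(op, src_a, src_b, dest, predicate, cr_bit, expect, rc=False):
    """Apply op element-wise, storing only results whose CR test passes.

    Returns (dest, crs) where crs holds the CR element written when rc is
    True, or None for elements that never produced one.
    """
    vl = len(src_a)
    crs = [None] * vl
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue                  # masked out: skip entirely
        result = op(src_a[i], src_b[i])
        cr_new = analyse(result)
        if rc:
            crs[i] = cr_new           # Rc=1 always stores the CR element
        if cr_new[cr_bit] != expect:
            continue                  # test failed: cancel the result store
        dest[i] = result
    return dest, crs
```

For example, testing the gt bit with `expect=1` keeps only strictly positive results, while `rc=True` still records the CR element of every unmasked element for later analysis.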
  
  ## CR Operations
  
@@ -727,8 +839,7 @@ may be marked as Vectorised or Scalar.  When Rc=1 in arithmetic operations that
  
  When vectorized, the CR inputs/outputs are sequentially read/written
  to 4-bit CR fields.  Vectorised Integer results, when Rc=1, will begin
-writing to CR16 (TBD evaluate) and increase sequentially from there.
-Vectorised FP results, when Rc=1, start from CR32 (TBD evaluate).
+writing to CR8 (TBD evaluate) and increase sequentially from there.
  This is so that:
  
  * implementations may rely on the Vector CRs being aligned to 8. This
@@ -736,9 +847,7 @@ This is so that:
    (8 CRs per batch), for high performance implementations.
  * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
    overwritten by vector Rc=1 operations except for very large VL
-* Vector FP and Integer Rc=1 operations do not overwrite each other
-  except for large VL.
-* CR-based predication, from CR48, is also not interfered with
+* CR-based predication, from CR32, is also not interfered with
    (except by large VL).
  
  However when the SV result (destination) is marked as a scalar by the
@@ -776,6 +885,13 @@ hindrance, regardless of the length of VL.
  
  (see [[discussion]].  some alternative schemes are described there)
  
+### Rc=1 when SUBVL!=1
+
+sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of predicate is allocated per subvector; likewise only one CR is allocated
+per subvector.
+
+This leaves a conundrum as to how to apply CR computation per subvector, when normally Rc=1 is exclusively applied to scalar elements.  A solution is to perform a bitwise OR or AND of the subvector tests.  Given that OE is ignored, this field may (when available) be used to select OR or AND behavior.
+
  ### Table of CR fields
  
  CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
@@ -790,8 +906,6 @@ are arranged.  TODO a python program that auto-generates a CSV file
  which can be included in a table, which is in a new page (so as not to
  overwhelm this one). [[svp64/cr_names]]
  
-
-
  ## Register Profiles
  
  **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
@@ -804,3 +918,60 @@ Vectorised (mtspr, bc, dcbz, twi)
  
  TODO generate table which will be here [[svp64/reg_profiles]]
  
+## SV pseudocode illustration
+
+### Single-predicated Instruction
+
+illustration of normal mode add operation: zeroing not included, elwidth overrides not included.  if there is no predicate, it is set to all 1s
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      for (i = 0; i < VL; i++)
+        STATE.srcoffs = i # save context
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+           if (!int_vec[rd ].isvec) break;
+        if (rd.isvec)  { id += 1; }
+        if (rs1.isvec)  { irs1 += 1; }
+        if (rs2.isvec)  { irs2 += 1; }
+        if (id == VL or irs1 == VL or irs2 == VL) {
+          # end VL hardware loop
+          STATE.srcoffs = 0; # reset
+          return;
+        }
+
+This has several modes:
+
+* RT.v = RA.v RB.v
+* RT.v = RA.v RB.s (and RA.s RB.v)
+* RT.v = RA.s RB.s
+* RT.s = RA.v RB.v
+* RT.s = RA.v RB.s (and RA.s RB.v)
+* RT.s = RA.s RB.s
+
+All of these may be predicated.  Vector-Vector is straightforward.  When one source is a Vector and the other a Scalar, it is clear that each element of the Vector source should be added to the Scalar source, each result placed into the Vector (or, if the destination is a scalar, only the first nonpredicated result).
+
+The one that is not obvious is RT=vector but both RA/RB=scalar.  Here this acts as a "splat scalar result", copying the same result into all nonpredicated result elements.  If a fixed destination scalar was intended, then an all-Scalar operation should be used.
+
+See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
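
The modes listed above can be exercised with a Python rendering of the single-predicated add loop. This is a sketch under stated assumptions: `sv_add` and its parameters are hypothetical, the register file is a plain list, the vector/scalar marking is passed as booleans, and zeroing, elwidth overrides and the `STATE.srcoffs` context save are left out.

```python
def sv_add(regs, rd, rs1, rs2, vl, rd_vec, rs1_vec, rs2_vec, predicate=None):
    """Single-predicated add over a list-based register file (modified in place)."""
    if predicate is None:
        predicate = (1 << vl) - 1     # no predicate: all elements enabled
    i_d = i_s1 = i_s2 = 0
    for i in range(vl):
        if predicate & (1 << i):
            regs[rd + i_d] = regs[rs1 + i_s1] + regs[rs2 + i_s2]
            if not rd_vec:
                break                 # scalar dest: first enabled result only
        # offsets advance only for operands marked as vectors
        if rd_vec:
            i_d += 1
        if rs1_vec:
            i_s1 += 1
        if rs2_vec:
            i_s2 += 1
    return regs
```

Marking both sources as scalar with a vector destination gives the "splat" behaviour described above, since the source offsets never advance while the destination offset does.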
+
+## Assembly Annotation
+
+Assembly code annotation is required for SV to be able to successfully
+mark instructions as "prefixed".
+
+A reasonable (prototype) starting point:
+
+    svp64 [field=value]*
+
+Fields:
+
+* ew=8/16/32 - element width
+* sew=8/16/32 - source element width
+* vec=2/3/4 - SUBVL
+* mode=reduce/satu/sats/crpred
+* pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
+* spred={reg spec}
+
+similar to x86 "rex" prefix.
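
To make the prototype syntax concrete, the sketch below splits an `svp64 [field=value]*` annotation into a dictionary. It is only an illustration of the proposed text format: `parse_svp64_annotation` is a hypothetical helper, values are kept as uninterpreted strings, and no checking against the list of known fields is attempted.

```python
def parse_svp64_annotation(line):
    """Split 'svp64 field=value ...' into a {field: value} dictionary."""
    tokens = line.split()
    if not tokens or tokens[0] != "svp64":
        raise ValueError("not an svp64 annotation")
    fields = {}
    for token in tokens[1:]:
        name, sep, value = token.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"malformed field: {token!r}")
        fields[name] = value
    return fields
```

A later stage could map entries such as `ew`, `vec` and `mode` onto the ELWIDTH, SUBVL and MODE fields of the Remapped Encoding.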