(no commit message)

diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn

index 1785e8afa61748ff6cc589372409dd16be34fc94..0d164f72cad2eeea934c1e067810ebc6925fc203 100644
--- a/simple_v_extension/specification.mdwn
+++ b/simple_v_extension/specification.mdwn
@@ -79,7 +79,8 @@ The principle of SV is as follows:
    bit format (single instruction option) or a variable
   length VLIW-like prefix (multi or "grouped" option).
  * The prefix(es) indicate which registers are "tagged" as
-  "vectorised". Predicates can also be added, and element widths overridden on any src or dest register.
+  "vectorised". Predicates can also be added, and element widths
+  overridden on any src or dest register.
  * A "Vector Length" CSR is set, indicating the span of any future
    "parallel" operations.
  * If any operation (a **scalar** standard RV opcode) uses a register
@@ -127,13 +128,22 @@ And likewise for M-Mode:
  * mePCVLIW
  * meSTATE
  
-The u/m/s CSRs are treated and handled exactly like their (x)epc equivalents. On entry to a privilege level, the contents of its (x)eSTATE and (x)ePCVLIW CSRs are copied into STATE and PCVLIW respectively, and on exit from a priv level the STATE and PCVLIW CSRs are copied to the exited priv level's corresponding CSRs.
+The u/m/s CSRs are treated and handled exactly like their (x)epc
+equivalents. On entry to a privilege level, the contents of its (x)eSTATE
+and (x)ePCVLIW CSRs are copied into STATE and PCVLIW respectively, and
+on exit from a priv level the STATE and PCVLIW CSRs are copied to the
+exited priv level's corresponding CSRs.
  
-Thus for example, a User Mode trap will end up swapping STATE and ueSTATE (on both entry and exit), allowing User Mode traps to have their own Vectorisation Context set up, separated from and unaffected by normal user applications.
+Thus for example, a User Mode trap will end up swapping STATE and ueSTATE
+(on both entry and exit), allowing User Mode traps to have their own
+Vectorisation Context set up, separated from and unaffected by normal
+user applications.
  
-Likewise, Supervisor Mode may perform context-switches, safe in the knowledge that its Vectorisation State is unaffected by User Mode.
+Likewise, Supervisor Mode may perform context-switches, safe in the
+knowledge that its Vectorisation State is unaffected by User Mode.
  
-For this to work, the (x)eSTATE CSR must be saved onto the stack by the trap, just like (x)epc, before modifying the trap atomicity flag (x)ie.
+For this to work, the (x)eSTATE CSR must be saved onto the stack by the
+trap, just like (x)epc, before modifying the trap atomicity flag (x)ie.
  
  The access pattern for these groups of CSRs in each mode follows the
  same pattern for other CSRs that have M-Mode and S-Mode "mirrors":
@@ -158,10 +168,13 @@ and assisting low-latency fast context-switching *once and only once*
  (for example at boot time), without the need for re-initialising the
  CSRs needed to do so.
  
-Another interesting side effect of separate S Mode CSRs is that Vectorised
-saving of the entire register file to the stack is a single instruction
-(accidental provision of LOAD-MULTI semantics).  If the SVPrefix P64-LD-type format is used, LOAD-MULTI may even be done with a single standalone 64 bit opcode (P64 may set up both VL and MVL from an immediate field). It can even be predicated,
-which opens up some very interesting possibilities.
+Another interesting side effect of separate S Mode CSRs is that
+Vectorised saving of the entire register file to the stack is a single
+instruction (accidental provision of LOAD-MULTI semantics).  If the
+SVPrefix P64-LD-type format is used, LOAD-MULTI may even be done with a
+single standalone 64 bit opcode (P64 may set up both VL and MVL from an
+immediate field). It can even be predicated, which opens up some very
+interesting possibilities.
  
  The (x)EPCVLIW CSRs must be treated exactly like their corresponding (x)epc
  equivalents. See VLIW section for details.
@@ -174,8 +187,9 @@ however limited to the regfile bitwidth XLEN (1-32 for RV32,
  1-64 for RV64 and so on).
  
  The reason for setting this limit is so that predication registers, when
-marked as such, may fit into a single register as opposed to fanning out
-over several registers.  This keeps the hardware implementation a little simpler.
+marked as such, may fit into a single register as opposed to fanning
+out over several registers.  This keeps the hardware implementation a
+little simpler.
  
  The other important factor to note is that the actual MVL is internally
  stored **offset by one**, so that it can fit into only 6 bits (for RV64)
@@ -237,38 +251,39 @@ may be done with a single instruction, useful particularly for a
  context-load/save.  There are however limitations: CSRWI's immediate
  is limited to 0-31 (representing VL=1-32).
  
-Note that when VL is set to 1, vector operations cease (but not subvector operations: that requires setting SUBVL=1) the
-hardware loop is reduced to a single element: scalar operations.
-This is in effect the default, normal
-operating mode. However it is important
-to appreciate that this does **not**
-result in the Register table or SUBVL 
-being disabled. Only when the Register
-table is empty (P48/64 prefix fields notwithstanding)
+Note that when VL is set to 1, vector operations cease (but not subvector
+operations: that requires setting SUBVL=1): the hardware loop is reduced
+to a single element, i.e. scalar operations.  This is in effect the default,
+normal operating mode. However it is important to appreciate that this
+does **not** result in the Register table or SUBVL being disabled. Only
+when the Register table is empty (P48/64 prefix fields notwithstanding)
  would SV have no effect.
  
  ## SUBVL - Sub Vector Length
  
-This is a "group by quantity" that effectivrly asks each iteration of the hardware loop to load SUBVL elements of width elwidth at a time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1 operation issued, SUBVL operations are issued.
+This is a "group by quantity" that effectively asks each iteration
+of the hardware loop to load SUBVL elements of width elwidth at a
+time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+operation issued, SUBVL operations are issued.
  
-Another way to view SUBVL is that each element in the VL length vector is now SUBVL times elwidth bits in length and
-now comprises SUBVL discrete sub
-operations.  An inner SUBVL for-loop within
-a VL for-loop in effect, with the
-sub-element increased every time in the
-innermost loop. This is best illustrated
-in the (simplified) pseudocode example,
-later.
+Another way to view SUBVL is that each element in the VL length vector is
+now SUBVL times elwidth bits in length and now comprises SUBVL discrete
+sub operations.  An inner SUBVL for-loop within a VL for-loop in effect,
+with the sub-element increased every time in the innermost loop. This
+is best illustrated in the (simplified) pseudocode example, later.
  
-The primary use case for SUBVL is for 3D FP Vectors. A Vector of 3D coordinates X,Y,Z for example may be loaded and multiplied the stored, per VL element iteration, rather than having to set VL to three times larger.
+The primary use case for SUBVL is for 3D FP Vectors. A Vector of 3D
+coordinates X,Y,Z for example may be loaded and multiplied then stored, per
+VL element iteration, rather than having to set VL to three times larger.
  
-Legal values are 1, 2, 3 and 4 (and the STATE CSR must hold the 2 bit values 0b00 thru 0b11 to represent them).
+Legal values are 1, 2, 3 and 4 (and the STATE CSR must hold the 2 bit
+values 0b00 thru 0b11 to represent them).
  
  Setting this CSR to 0 must raise an exception.  Setting it to a value
  greater than 4 likewise.
  
-The main effect of SUBVL is that predication bits are applied per **group**, 
-rather than by individual element.
+The main effect of SUBVL is that predication bits are applied per
+**group**, rather than by individual element.
  
  This saves a not insignificant number of instructions when handling 3D
  vectors, as otherwise a much longer predicate mask would have to be set
@@ -283,30 +298,31 @@ full context save/restore.  It contains (and permits setting of):
  
  * MVL
  * VL
-* the destination element offset of the current parallel instruction
-  being executed
-* and, for twin-predication, the source element offset as well.
+* destoffs - the destination element offset of the current parallel
+  instruction being executed
+* srcoffs - for twin-predication, the source element offset as well.
  * SUBVL
-* the subvector destination element offset of the current parallel instruction
-  being executed
-* and, for twin-predication, the subvector source element offset as well.
+* svdestoffs - the subvector destination element offset of the current
+  parallel instruction being executed
+* svsrcoffs - for twin-predication, the subvector source element offset
+  as well.
  
  Interestingly STATE may hypothetically also be modified to make the
  immediately-following instruction to skip a certain number of elements,
-by playing with destoffs and srcoffs
-(and the subvector offsets as well)
+by playing with destoffs and srcoffs (and the subvector offsets as well).
  
  Setting destoffs and srcoffs is realistically intended for saving state
  so that exceptions (page faults in particular) may be serviced and the
  hardware-loop that was being executed at the time of the trap, from
-user-mode (or Supervisor-mode), may be returned to and continued from exactly
-where it left off.  The reason why this works is because setting
-User-Mode STATE will not change (not be used) in M-Mode or S-Mode
-(and is entirely why M-Mode and S-Mode have their own STATE CSRs, meSTATE and seSTATE).
+user-mode (or Supervisor-mode), may be returned to and continued from
+exactly where it left off.  This works because User-Mode STATE is neither
+changed nor used in M-Mode or S-Mode (which is entirely why M-Mode and
+S-Mode have their own STATE CSRs, meSTATE and seSTATE).
  
  The format of the STATE CSR is as follows:
  
-| (30..29 | (28..27) | (26..24) | (23..18) | (17..12) | (11..6) | (5...0) |
+| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
  | ------- | -------- | -------- | -------- | -------- | ------- | ------- |
  | dsvoffs | ssvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
  
@@ -314,25 +330,41 @@ When setting this CSR, the following characteristics will be enforced:
  
  * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
  * **VL** will be truncated (after offset) to be within the range 1 to MAXVL
-* **SUBVL** which sets a SIMD-like quantity, has only 4 values there are no changes needed
+* **SUBVL**, which sets a SIMD-like quantity, has only 4 values, so
+  no changes are needed
  * **srcoffs** will be truncated to be within the range 0 to VL-1
  * **destoffs** will be truncated to be within the range 0 to VL-1
  * **ssvoffs** will be truncated to be within the range 0 to SUBVL-1
  * **dsvoffs** will be truncated to be within the range 0 to SUBVL-1
  
-NOTE: if the following instruction is not a twin predicated instruction, and destoffs or dsvoffs has been set to non-zero, subsequent execution behaviour is undefined. **USE WITH CARE**.
+NOTE: if the following instruction is not a twin predicated instruction,
+and destoffs or dsvoffs has been set to non-zero, subsequent execution
+behaviour is undefined. **USE WITH CARE**.
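The field layout and the enforcement rules above can be sketched in Python. This is only an illustrative model, not part of the specification: the names `STATE_FIELDS`, `pack_state` and `decode_state` are hypothetical, "truncated" is interpreted here as clamping, and the sketch assumes that VL is stored offset by one in the same way that MVL is, with the 2-bit SUBVL field 0b00 to 0b11 standing for 1 to 4.

```python
# Field layout copied from the STATE table: name -> (lowest bit, width).
STATE_FIELDS = {
    "maxvl": (0, 6), "vl": (6, 6), "srcoffs": (12, 6), "destoffs": (18, 6),
    "subvl": (24, 2), "ssvoffs": (26, 2), "dsvoffs": (28, 2),
}

def pack_state(**fields):
    """Build a raw STATE value from raw (still offset-by-one) field values."""
    csr = 0
    for name, value in fields.items():
        lsb, width = STATE_FIELDS[name]
        csr |= (value & ((1 << width) - 1)) << lsb
    return csr

def decode_state(csr, xlen=64):
    """Extract the fields and apply the range rules listed above.

    Assumption: MAXVL, VL and SUBVL are stored as (value - 1), and
    "truncated" means clamped to the stated range.
    """
    raw = {name: (csr >> lsb) & ((1 << width) - 1)
           for name, (lsb, width) in STATE_FIELDS.items()}
    maxvl = min(raw["maxvl"] + 1, xlen)      # 1 .. XLEN
    vl = min(raw["vl"] + 1, maxvl)           # 1 .. MAXVL
    subvl = raw["subvl"] + 1                 # 1 .. 4, nothing to clamp
    return {
        "maxvl": maxvl,
        "vl": vl,
        "subvl": subvl,
        "srcoffs": min(raw["srcoffs"], vl - 1),      # 0 .. VL-1
        "destoffs": min(raw["destoffs"], vl - 1),    # 0 .. VL-1
        "ssvoffs": min(raw["ssvoffs"], subvl - 1),   # 0 .. SUBVL-1
        "dsvoffs": min(raw["dsvoffs"], subvl - 1),   # 0 .. SUBVL-1
    }
```

For example, on a hypothetical RV32 implementation a raw MAXVL field of 63 would decode to 32 under these assumptions, and any offsets beyond the resulting VL would be pulled back into range.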
  
  ### Hardware rules for when to increment STATE offsets
  
-The offsets inside STATE are like the indices in a loop, except in hardware. They are also partially (conceptually) similar to a "sub-execution Program Counter". As such, and to allow proper context switching and to define correct exception behaviour, the following rules must be observed:
+The offsets inside STATE are like the indices in a loop, except
+in hardware. They are also partially (conceptually) similar to a
+"sub-execution Program Counter". As such, and to allow proper context
+switching and to define correct exception behaviour, the following rules
+must be observed:
  
  * When the VL CSR is set, srcoffs and destoffs are reset to zero.
-* Each instruction that contains a "tagged" register shall start execution at the *current* value of srcoffs (and destoffs in the case of twin predication)
-* Unpredicated bits (in nonzeroing mode) shall cause the element operation to skip, incrementing the srcoffs (or destoffs)
-* On execution of an element operation, Exceptions shall **NOT** cause srcoffs or destoffs to increment.
-* On completion of the full Vector Loop (srcoffs = VL-1 or destoffs = VL-1 after the last element is executed), both srcoffs and destoffs shall be reset to zero.
-
-This latter is why srcoffs and destoffs may be stored as values from 0 to XLEN-1 in the STATE CSR, because as loop indices they refer to elements. srcoffs and destoffs never need to be set to VL: their maximum operating values are limited to 0 to VL-1.
+* Each instruction that contains a "tagged" register shall start
+  execution at the *current* value of srcoffs (and destoffs in the case
+  of twin predication)
+* Unpredicated bits (in nonzeroing mode) shall cause the element operation
+  to skip, incrementing the srcoffs (or destoffs)
+* On execution of an element operation, Exceptions shall **NOT** cause
+  srcoffs or destoffs to increment.
+* On completion of the full Vector Loop (srcoffs = VL-1 or destoffs =
+  VL-1 after the last element is executed), both srcoffs and destoffs
+  shall be reset to zero.
+
+This latter is why srcoffs and destoffs may be stored as values from
+0 to XLEN-1 in the STATE CSR, because as loop indices they refer to
+elements. srcoffs and destoffs never need to be set to VL: their maximum
+operating values are limited to 0 to VL-1.
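As a rough illustration of these rules, the following Python sketch models a single-offset (non-twin-predicated) hardware loop that can be interrupted and resumed. `ElementFault`, `run_hardware_loop` and the `state` dictionary are hypothetical names used only for this example; real hardware would hold srcoffs in the STATE CSR.

```python
class ElementFault(Exception):
    """Hypothetical stand-in for a trap raised by one element operation."""

def run_hardware_loop(state, vl, predicate, element_op):
    """Run element_op(i) for each active element, resuming from srcoffs."""
    i = state["srcoffs"]             # start at the *current* offset
    while i < vl:
        if predicate & (1 << i):
            element_op(i)            # may raise: srcoffs is then left at i
        i += 1
        state["srcoffs"] = i         # executed or masked-out: move on
    state["srcoffs"] = 0             # full loop completed: reset to zero
```

If `element_op` raises, `state["srcoffs"]` still names the faulting element, so calling `run_hardware_loop` again with the same `state` (after the fault has been serviced) continues from exactly that element rather than restarting the vector.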
  
  The same corresponding rules apply to SUBVL, svsrcoffs and svdestoffs.
  
@@ -433,7 +465,9 @@ if the STATE CSR is to be used for fast context-switching.
  
  ## VL, MVL and SUBVL instruction aliases
  
-This table contains pseudo-assembly instruction aliases. Note the subtraction of 1 from the CSRRWI pseudo variants, to compensate for the reduced range of the 5 bit immediate.
+This table contains pseudo-assembly instruction aliases. Note the
+subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
+reduced range of the 5 bit immediate.
  
  | alias           | CSR                  |
  | -               | -                    |
@@ -449,8 +483,9 @@ Note: CSRRC and other bitsetting may still be used, they are however not particu
  ## Register key-value (CAM) table <a name="regcsrtable" />
  
  *NOTE: in prior versions of SV, this table used to be writable and
-accessible via CSRs. It is now stored in the VLIW instruction format. Note that 
-this table does *not* get applied to the SVPrefix P48/64 format, only to scalar opcodes*
+accessible via CSRs. It is now stored in the VLIW instruction format. Note
+that this table does *not* get applied to the SVPrefix P48/64 format,
+only to scalar opcodes*
  
  The purpose of the Register table is three-fold:
  
@@ -464,9 +499,16 @@ The purpose of the Register table is three-fold:
  * To over-ride the implicit or explicit bitwidth that the operation would
    normally give the register.
  
-Note: clearly, if an RVC operation uses a 3 bit spec'd register (x8-x15) and the Register table contains entried that only refer to registerd x1-x14 or x16-x31, such operations will *never* activate the VL hardware loop!
+Note: clearly, if an RVC operation uses a 3 bit spec'd register (x8-x15)
+and the Register table contains entries that only refer to registers
+x1-x14 or x16-x31, such operations will *never* activate the VL hardware
+loop!
  
-If however the (16 bit) Register table does contain such an entry (x8-x15 or x2 in the case of LWSP), that src or dest reg may be redirected anywhere to the *full* 128 register range. Thus, RVC becomes far more powerful and has many more opportunities to reduce code size that in Standard RV32/RV64 executables.
+If however the (16 bit) Register table does contain such an entry (x8-x15
+or x2 in the case of LWSP), that src or dest reg may be redirected
+anywhere to the *full* 128 register range. Thus, RVC becomes far more
+powerful and has many more opportunities to reduce code size than in
+Standard RV32/RV64 executables.
  
  16 bit format:
  
@@ -483,8 +525,9 @@ If however the (16 bit) Register table does contain such an entry (x8-x15 or x2
  | ------ | | -   | ------ | ------- |
  | 0      | | i/f | vew0   | regnum  |
  
-i/f is set to "1" to indicate that the redirection/tag entry is to be applied
-to integer registers; 0 indicates that it is relevant to floating-point
+i/f is set to "1" to indicate that the redirection/tag entry is to
+be applied to integer registers; 0 indicates that it is relevant to
+floating-point
  registers.
  
  The 8 bit format is used for a much more compact expression. "isvec"
@@ -536,12 +579,11 @@ that the *actual* register used can be *different* from the one that is
  in the instruction, due to the redirection through the lookup table.
  
  * regidx is the register that in combination with the
-  i/f flag, if that integer or floating-point register is referred to
- in a (standard RV) instruction
-  results in the lookup table being referenced to find the predication
-  mask to use for this operation.
-* predidx is the
-  *actual* (full, 7 bit) register to be used for the predication mask. 
+  i/f flag, if that integer or floating-point register is referred to in a
+  (standard RV) instruction results in the lookup table being referenced
+  to find the predication mask to use for this operation.
+* predidx is the *actual* (full, 7 bit) register to be used for the
+  predication mask.
  * inv indicates that the predication mask bits are to be inverted
    prior to use *without* actually modifying the contents of the
    registerfrom which those bits originated.
@@ -554,16 +596,22 @@ in the instruction, due to the redirection through the lookup table.
    interpret unpredicated elements as an internal "copy element"
    operation (which would be necessary in SIMD microarchitectures
    that perform register-renaming)
+* ffirst is a special mode that stops sequential element processing when
+  a data-dependent condition occurs, whether a trap or a conditional test.
+  The handling of each (trap or conditional test) is slightly different:
+  see Instruction sections for further details
  
  16 bit format:
  
  | PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
  | ----- | -        | -      | -     | -   | ------- | ------- |
-| 0     | predkey  | zero0  | inv0  | i/f | regidx  | rsrvd |
-| 1     | predkey  | zero1  | inv1  | i/f | regidx  | rsvd |
-| ...   | predkey  | .....  | ....  | i/f | ....... | ....... |
-| 15    | predkey  | zero15 | inv15 | i/f | regidx  | rsvd |
+| 0     | predidx  | zero0  | inv0  | i/f | regidx  | ffirst0 |
+| 1     | predidx  | zero1  | inv1  | i/f | regidx  | ffirst1 |
+| 2     | predidx  | zero2  | inv2  | i/f | regidx  | ffirst2 |
+| 3     | predidx  | zero3  | inv3  | i/f | regidx  | ffirst3 |
  
+Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding.  Its use must
+generate an illegal instruction trap.
  
  8 bit format:
  
@@ -577,25 +625,36 @@ register to use is implicit, and numbering begins inplicitly from x9. The
  regnum is still used to "activate" predication, in the same fashion as
  described above.
  
-The 16 bit Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
+Thus if we map from 8 to 16 bit format, the table becomes:
+
+| PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
+| ----- | -        | -      | -     | -   | ------- | ------- |
+| 0     | x9       | zero0  | inv0  | i/f | regnum  | ff=0    |
+| 1     | x10      | zero1  | inv1  | i/f | regnum  | ff=0    |
+| 2     | x11      | zero2  | inv2  | i/f | regnum  | ff=0    |
+| 3     | x12      | zero3  | inv3  | i/f | regnum  | ff=0    |
+
+The 16 bit Predication CSR Table is a key-value store, so
+implementation-wise it will be faster to turn the table around (maintain
+topologically equivalent state):
  
      struct pred {
-        bool zero;
-        bool inv;
-        bool enabled;
-        int predidx; // redirection: actual int register to use
+        bool zero;    // zeroing
+        bool inv;     // register at predidx is inverted
+        bool ffirst;  // fail-on-first
+        bool enabled; // use this to tell if the table-entry is active
+        int predidx;  // redirection: actual int register to use
      }
  
      struct pred fp_pred_reg[32];   // 64 in future (bank=1)
      struct pred int_pred_reg[32];  // 64 in future (bank=1)
  
-    for (i = 0; i < 16; i++)
-      tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
-      idx = CSRpred[i].regidx
-      tb[idx].zero = CSRpred[i].zero
-      tb[idx].inv  = CSRpred[i].inv
+    for (i = 0; i < len; i++) // number of Predication entries in VBLOCK
+      tb = int_pred_reg if PredicateTable[i].type == 0 else fp_pred_reg;
+      idx = PredicateTable[i].regidx
+      tb[idx].zero     = CSRpred[i].zero
+      tb[idx].inv      = CSRpred[i].inv
+      tb[idx].ffirst   = CSRpred[i].ffirst
        tb[idx].predidx  = CSRpred[i].predidx
        tb[idx].enabled  = true
  
@@ -622,9 +681,12 @@ as follows.
      for (int i=0; i<vl; ++i)
          predicate, zeroing = get_pred_val(type(iop) == INT, rd):
          if (predicate && (1<<i))
-           (d ? regfile[rd+i] : regfile[rd]) =
-            iop(s1 ? regfile[rs1+i] : regfile[rs1],
-                s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
+           result = iop(s1 ? regfile[rs1+i] : regfile[rs1],
+                        s2 ? regfile[rs2+i] : regfile[rs2]);
+           (d ? regfile[rd+i] : regfile[rd]) = result
+           if preg.ffirst and result == 0:
+              VL = i # result was zero, end loop early, return VL
+              return
          else if (zeroing)
             (d ? regfile[rd+i] : regfile[rd]) = 0
  
@@ -636,6 +698,9 @@ Note:
    above, for clarity.  rd, rs1 and rs2 all also must ALSO go through
    register-level redirection (from the Register table) if they are
    vectors.
+* fail-on-first mode stops execution early whenever an operation
+  returns a zero value.  floating-point results count both
+  positive-zero as well as negative-zero as "fail".
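A runnable Python rendering of the simplified loop may help when experimenting with these rules. It is a sketch under stated assumptions: `predicated_op` is a hypothetical name, the register file is a plain list, the predicate mask and flags are passed in directly rather than looked up in the Predication table, and register-level redirection and element widths are left out.

```python
def predicated_op(regfile, rd, rs1, rs2, vl, predicate, zeroing, ffirst, op,
                  d=True, s1=True, s2=True):
    """Apply op element by element and return the (possibly reduced) VL.

    d, s1 and s2 say whether rd, rs1 and rs2 are vectors (True) or scalars.
    """
    for i in range(vl):
        dest = rd + i if d else rd
        if predicate & (1 << i):                 # bit i of the mask is set
            a = regfile[rs1 + i] if s1 else regfile[rs1]
            b = regfile[rs2 + i] if s2 else regfile[rs2]
            result = op(a, b)
            regfile[dest] = result
            if ffirst and result == 0:
                return i                         # zero result: VL becomes i
        elif zeroing:
            regfile[dest] = 0                    # masked-out element is zeroed
    return vl
```

With fail-on-first enabled, a subtraction over the vectors `[5, 3, 7, 9]` and `[1, 3, 2, 4]` writes 4 and then 0, and the function returns 1: the zero result is still stored, but processing stops there and later elements are untouched.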
  
  If written as a function, obtaining the predication mask (and whether
  zeroing takes place) may be done as follows:
@@ -663,6 +728,72 @@ for the storage of comparisions: in these specific circumstances
  the requirement for there to be an active *register* entry
  is removed.
  
+## Fail-on-First Mode <a name="ffirst-mode"></a>
+
+ffirst is a special data-dependent predicate mode.  There are two
+variants: one is for faults: typically for LOAD/STORE operations,
+which may encounter end of page faults during a series of operations.
+The other variant is comparisons such as FEQ (or the augmented behaviour
+of Branch), and any operation that returns a result of zero (whether
+integer or floating-point).  In the FP case, this includes negative-zero.
+
+Note that the execution order must "appear" to be sequential for ffirst
+mode to work correctly.  An in-order architecture must execute the element
+operations in sequence, whilst an out-of-order architecture must *commit*
+the element operations in sequence (giving the appearance of in-order
+execution).
+
+Note also, that if ffirst mode is needed without predication, a special
+"always-on" Predicate Table Entry may be constructed by setting
+inverse-on and using x0 as the predicate register.  This
+will have the effect of creating a mask of all ones, allowing ffirst
+to be set.
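The reason this works is simply that x0 always reads as zero, so inverting it yields all ones. A minimal Python sketch (the helper name `predicate_mask` is hypothetical, and the register file is modelled as a list):

```python
def predicate_mask(regfile, predidx, inv, xlen=64):
    """Return the predicate mask held in regfile[predidx], optionally inverted."""
    mask = regfile[predidx]
    if inv:
        mask = ~mask                   # invert without modifying the register
    return mask & ((1 << xlen) - 1)    # keep only XLEN bits

regfile = [0] * 32                     # regfile[0] models x0, hardwired to zero
always_on = predicate_mask(regfile, predidx=0, inv=True)
```

Note that, per the reserved-encoding rule for the 16 bit Predication format, such an entry would need zero=0: predidx=x0 with both zero=1 and inv=1 must trap.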
+
+### Fail-on-first traps
+
+Except for the first element, ffirst stops sequential element processing
+when a trap occurs.  The first element is treated normally (as if ffirst
+is clear).  Should any subsequent element instruction require a trap,
+instead it and subsequent indexed elements are ignored (or cancelled in
+out-of-order designs), and VL is set to the *last* instruction that did
+not take the trap.
+
+Note that predicated-out elements (where the predicate mask bit is zero)
+are clearly excluded (i.e. the trap will not occur).  However, note that
+the loop still had to test the predicate bit: thus on return,
+VL is set to include elements that did not take the trap *and* includes
+the elements that were predicated (masked) out (not tested up to the
+point where the trap occurred).
+
+If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
+will cause a trap as normal (as if ffirst is not set); subsequently,
+the trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
+be modified.
+
+Given that predication bits apply to SUBVL groups, the same rules apply
+to predicated-out (masked-out) sub-groups in calculating the value that VL
+is set to.
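The trap variant can be modelled in a few lines of Python for the SUBVL=1 case. This is a sketch, not the normative behaviour: `PageFault` and `ffirst_load` are hypothetical names, memory is modelled as a dictionary in which a missing address stands for a faulting access, and "the first element" is assumed to mean element index 0.

```python
class PageFault(Exception):
    """Hypothetical stand-in for the trap taken by a faulting access."""

def ffirst_load(memory, addresses, predicate, vl):
    """Fail-on-first load: return (loaded values by index, new VL)."""
    loaded = {}
    for i in range(vl):
        if not predicate & (1 << i):
            continue                   # masked out: no access, no trap
        addr = addresses[i]
        if addr not in memory:         # this access would fault
            if i == 0:
                raise PageFault(addr)  # first element traps as normal
            return loaded, i           # otherwise truncate VL, no trap
        loaded[i] = memory[addr]
    return loaded, vl
```

Because the returned VL is the index of the faulting element, it counts every earlier element, whether it was loaded or masked out, which is the behaviour described above.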
+
+### Fail-on-first conditional tests
+
+ffirst stops sequential element conditional testing on the first element result
+being zero.  VL is set to the number of elements that were processed before
+the fail-condition was encountered.
+
+Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
+will cause the processing to end, and, even if there were elements within
+the *sub-group* that passed the test, that sub-group is still (entirely)
+excluded from the count (from setting VL).  i.e. VL is set to the total
+number of *sub-groups* that had no fail-condition up until execution was
+stopped.
+
+Note again that, just as with traps, predicated-out (masked-out) elements
+are included in the count leading up to the fail-condition, even though they
+were not tested.
+
+The pseudo-code for Predication makes this clearer and simpler than it is
+in words (the loop ends, VL is set to the current element index, "i").
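The sub-group counting rule can also be sketched directly. The Python below is illustrative only: `ffirst_test_groups` is a hypothetical name, the operand values are given as a flat list, and `test` stands for whatever operation produces the result that is checked against zero.

```python
def ffirst_test_groups(values, predicate, vl, subvl, test):
    """Return the VL left after fail-on-first conditional testing.

    values holds vl * subvl elements; predicate bit i covers sub-group i.
    """
    for i in range(vl):
        if not predicate & (1 << i):
            continue                        # masked-out group: counted, untested
        for s in range(subvl):
            if test(values[i * subvl + s]) == 0:
                return i                    # whole sub-group i is excluded
    return vl
```

With four groups of three elements where only the middle element of group 2 gives a zero result, the function returns 2: groups 0 and 1 count, and group 2 is excluded entirely even though its first element passed the test.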
+
  ## REMAP CSR <a name="remap" />
  
  (Note: both the REMAP and SHAPE sections are best read after the
@@ -890,8 +1021,8 @@ All other operations using registers are automatically parallelised.
  This includes AMOMAX, AMOSWAP and so on, where particular care and
  attention must be paid.
  
-Example pseudo-code for an integer ADD operation (including scalar operations).
-Floating-point uses fp csrs.
+Example pseudo-code for an integer ADD operation (including scalar
+operations).  Floating-point uses the FP Register Table.
  
      function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
@@ -914,11 +1045,14 @@ reshaping and offsets and so on.  However it demonstrates the basic
  principle.  Augmentations that produce the full pseudo-code are covered in
  other sections.
  
-## SUBVL Pseudocode
+## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
  
-Adding in support for SUBVL is a matter of adding in an extra inner for-loop, where register src and dest are still incremented inside the inner part. Not that the predication is still taken from the VL index.
+Adding in support for SUBVL is a matter of adding in an extra inner
+for-loop, where register src and dest are still incremented inside the
+inner part. Note that the predication is still taken from the VL index.
  
-So whilst elements are indexed by (i * SUBVL + s), predicate bits are indexed by i
+So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
+indexed by "(i)".
  
      function op_add(rd, rs1, rs2) # add not VADD!
        int i, id=0, irs1=0, irs2=0;
@@ -945,7 +1079,8 @@ So whilst elements are indexed by (i * SUBVL + s), predicate bits are indexed by
          }
  
  
-NOTE: pseudocode simplified greatly: zeroing, proper predicate handling, elwidth handling etc. all left out.
+NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
+elwidth handling etc. all left out.
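For readers who want to run the indexing scheme, here is a minimal Python sketch of a SUBVL add. It assumes rd, rs1 and rs2 are all vectors and that masked-out groups are simply skipped; `subvl_add` is a hypothetical name and the register file is a plain list.

```python
def subvl_add(regfile, rd, rs1, rs2, vl, subvl, predicate):
    """Add two vectors of sub-vectors, one predicate bit per sub-vector."""
    for i in range(vl):                    # outer loop: VL groups
        if not predicate & (1 << i):       # predicate is indexed by i only
            continue
        for s in range(subvl):             # inner loop: SUBVL elements
            e = i * subvl + s              # element index within the vector
            regfile[rd + e] = regfile[rs1 + e] + regfile[rs2 + e]
```

With VL=2, SUBVL=3 and a predicate of 0b10, only elements 3, 4 and 5 are written: one mask bit enables or disables a whole X,Y,Z triple.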
  
  ## Instruction Format
  
@@ -968,6 +1103,12 @@ comprehensive in its effect on instructions.
  
  ## Branch Instructions
  
+Branch operations are augmented slightly to be a little more like FP
+Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
+of multiple comparisons into a register (taken indirectly from the predicate
+table).  As such, "ffirst" - fail-on-first - condition mode can be enabled.
+See ffirst mode in the Predication Table section.
+
  ### Standard Branch <a name="standard_branch"></a>
  
  Branch operations use standard RV opcodes that are reinterpreted to
@@ -1120,6 +1261,8 @@ integer register (rd) to determine the predicate.  FP Compare is **not**
  a twin-predication operation, as, again, just as with SV Branches,
  there are three registers involved: FP src1, FP src2 and INT rd.
  
+Also: note that ffirst (fail first mode) applies directly to this operation.
+
  ### Compressed Branch Instruction
  
  Compressed Branch instructions are, just like standard Branch instructions,
@@ -2131,21 +2274,25 @@ is modified to as follows:
               if (int_vec[rs1].isvector)  { irs1 += 1; }
               if (int_vec[rs2].isvector)  { irs2 += 1; }
             if i == VL:
-             break
+             return
          if (predval & 1<<i)
             src1 = ....
             src2 = ...
             else:
                 result = src1 + src2 # actual add (or other op) here
             set_polymorphed_reg(rd, destwid, ird, result)
-           if (!int_vec[rd].isvector) break
+           if int_vec[rd].ffirst and result == 0:
+              VL = i # result was zero, end loop early, return VL
+              return
+           if (!int_vec[rd].isvector) return
          else if zeroing:
             result = 0
             set_polymorphed_reg(rd, destwid, ird, result)
          if (int_vec[rd ].isvector)  { id += 1; }
-        else if (predval & 1<<i) break;
+        else if (predval & 1<<i) return
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
+        if (rd == VL or rs1 == VL or rs2 == VL): return
  
  The optimisation to skip elements entirely is only possible for certain
  micro-architectures when zeroing is not set.  However for lane-based
@@ -2247,32 +2394,40 @@ in as a *parameter* to the HINT operation.
  
  No specific hints are yet defined in Simple-V
  
-# VLIW Format <a name="vliw-format"></a>
-
-One issue with SV is the setup and teardown time of the CSRs.  The cost
-of the use of a full CSRRW (requiring LI) is quite high.  A VLIW format
-therefore makes sense.
+# Vector Block Format <a name="vliw-format"></a>
  
-A suitable prefix, which fits the Expanded Instruction-Length encoding
-for "(80 + 16 times instruction_length)", as defined in Section 1.5
-of the RISC-V ISA, is as follows:
+One issue with a former revision of SV was the setup and teardown
+time of the CSRs.  The cost of the use of a full CSRRW (requiring LI)
+to set up registers and predicates was quite high.  A VLIW-like format
+therefore makes sense, and is conceptually reminiscent of the ARM Thumb2
+"IT" instruction.
  
-| 15    | 14:12 | 11:10 | 9:8   | 7    | 6:0     |
-| -     | ----- | ----- | ----- | ---  | ------- |
-| vlset | 16xil | pplen | rplen | mode | 1111111 |
+The format is:
  
-An optional VL Block, optional predicate entries, optional register
-entries and finally some 16/32/48 bit standard RV or SVPrefix opcodes
-follow.
+* the standard RISC-V 80 to 192 bit encoding sequence, with bits
+  defining the options to follow within the block
+* An optional VL Block (16-bit)
+* Optional predicate entries (8/16-bit blocks: see Predicate Table, above)
+* Optional register entries (8/16-bit blocks: see Register Table, above)
+* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow.
  
-The variable-length format from Section 1.5 of the RISC-V ISA:
+Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
+as follows:
  
  | base+4 ... base+2          | base             | number of bits             |
  | ------ -----------------   | ---------------- | -------------------------- |
  | ..xxxx  xxxxxxxxxxxxxxxx   | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 |
  | {ops}{Pred}{Reg}{VL Block} | SV Prefix        |                            |
  
-VL/MAXVL/SubVL Block:
+A suitable prefix, which fits the Expanded Instruction-Length encoding
+for "(80 + 16 times instruction-length)", as defined in Section 1.5
+of the RISC-V ISA, is as follows:
+
+| 15    | 14:12 | 11:10 | 9:8   | 7    | 6:0     |
+| -     | ----- | ----- | ----- | ---  | ------- |
+| vlset | 16xil | pplen | rplen | mode | 1111111 |
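
The field layout in the prefix table can be sketched as a small decoder. This is only an illustration in Python, not part of the specification: the function name `decode_vblock_prefix` and the returned dictionary keys are hypothetical, and the sketch extracts bit positions without interpreting what `pplen`, `rplen` or `mode` mean.

```python
def decode_vblock_prefix(halfword):
    """Split the first 16 bits of a block into the fields of the prefix table.

    Hypothetical helper: only the bit positions come from the table above.
    """
    if not 0 <= halfword <= 0xFFFF:
        raise ValueError("prefix must fit in 16 bits")
    # Bits 6:0 must be 1111111 for the (80 + 16*nnn)-bit encoding space.
    if halfword & 0x7F != 0x7F:
        raise ValueError("low 7 bits are not 1111111")
    il = (halfword >> 12) & 0x7          # bits 14:12, the "16xil" field
    if il == 0b111:
        raise ValueError("nnn == 111 is excluded by the encoding table")
    return {
        "vlset": (halfword >> 15) & 0x1,  # bit 15
        "il": il,
        "pplen": (halfword >> 10) & 0x3,  # bits 11:10
        "rplen": (halfword >> 8) & 0x3,   # bits 9:8
        "mode": (halfword >> 7) & 0x1,    # bit 7
        "total_bits": 80 + 16 * il,       # overall length of the block
    }


fields = decode_vblock_prefix(0b1010011011111111)
print(fields["il"], fields["total_bits"])
```

With `il` equal to 2 the block is 80 + 32 = 112 bits long, of which the first 16 are the prefix itself.
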
+
+The VL/MAXVL/SubVL Block format:
  
  | 31-30 | 29:28 | 27:22  | 21:17  - 16  |
  | -     | ----- | ------ | ------ - -   |
@@ -2331,10 +2486,12 @@ Notes:
    number of bits is 80 + 16 times IL.  Standard RV32, RVC and also
    SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
    (optional) VL / RegCam / PredCam entries
-* In any RVC or 32 Bit opcode, any registers within the VLIW-prefixed format *MUST* have the
-  RegCam and PredCam entries applied to the operation
- (and the Vectorisation loop activated)
-* P48 and P64 opcodes do **not** take their Register or predication context from the VLIW Block tables: they do however have VL or SUBVL applied (unless VLtyp or svlen are set).
+* In any RVC or 32 Bit opcode, any registers within the VLIW-prefixed
+  format *MUST* have the RegCam and PredCam entries applied to the
+  operation (and the Vectorisation loop activated)
+* P48 and P64 opcodes do **not** take their Register or predication
+  context from the VLIW Block tables: they do however have VL or SUBVL
+  applied (unless VLtyp or svlen are set).
  * At the end of the VLIW Group, the RegCam and PredCam entries
    *no longer apply*.  VL, MAXVL and SUBVL on the other hand remain at
    the values set by the last instruction (whether a CSRRW or the VL
@@ -2343,19 +2500,19 @@ Notes:
    VL and SUBVL CSRs with standard CSRRW instructions, within a VLIW block.
  
  All this would greatly reduce the amount of space utilised by Vectorised
-instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes: the
-CSR itself, a LI, and the setting up of the value into the RS register
-of the CSR, which, again, requires a LI / LUI to get the 32 bit
-data into the CSR.  To get 64-bit data into the register in order to put
-it into the CSR(s), LOAD operations from memory are needed!
+instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes:
+the CSR itself, a LI, and the setting up of the value into the RS
+register of the CSR, which, again, requires a LI / LUI to get the 32
+bit data into the CSR.  To get 64-bit data into the register in order
+to put it into the CSR(s), LOAD operations from memory are needed!
  
  Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM
  entries), that's potentially six to eight 32-bit instructions, just to
  establish the Vector State!
  
-Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more bits if
-VL needs to be set to greater than 32).  Bear in mind that in SV, both MAXVL
-and VL need to be set.
+Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more
+bits if VL needs to be set to greater than 32).  Bear in mind that in SV,
+both MAXVL and VL need to be set.
  
  By contrast, the VLIW prefix is only 16 bits, the VL/MAX/SubVL block is
  only 16 bits, and as long as not too many predicates and register vector
@@ -2363,10 +2520,14 @@ qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
  the format. If the full flexibility of the 16 bit block formats is not
  needed, more space is saved by using the 8 bit formats.
  
-In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries into
-a VLIW format makes a lot of sense.
+In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
+into a VLIW format makes a lot of sense.
  
-Bear in mind the warning in an earlier section that use of VLtyp or svlen in a P48 or P64 opcode within a VLIW Group will result in corruption (use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To avoid this situation, the STATE CSR may be copied into a temp register and restored afterwards.
+Bear in mind the warning in an earlier section that use of VLtyp or svlen
+in a P48 or P64 opcode within a VLIW Group will result in corruption
+(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
+avoid this situation, the STATE CSR may be copied into a temp register
+and restored afterwards.
  
  Open Questions:
  
@@ -2395,20 +2556,22 @@ This has implications, namely that a new set of CSRs identical to xepc
  as being a sub extension of the xepc set of CSRs.  Thus, xepcvliw CSRs
  must be context switched and saved / restored in traps.
  
-The srcoffs and destoffs indices in the STATE CSR may be similarly regarded as another
-sub-execution context, giving in effect two sets of nested sub-levels
-of the RISCV Program Counter (actually, three including SUBVL and ssvoffs).
+The srcoffs and destoffs indices in the STATE CSR may be similarly
+regarded as another sub-execution context, giving in effect two sets of
+nested sub-levels of the RISCV Program Counter (actually, three including
+SUBVL and ssvoffs).
  
  In addition, as xepcvliw CSRs are relative to the beginning of the VLIW
-block, branches MUST be restricted to within (relative to) the block, i.e. addressing
-is now restricted to the start (and very short) length of the block.
+block, branches MUST be restricted to within (relative to) the block,
+i.e. addressing is now restricted to the start (and very short) length
+of the block.
  
  Also: calling subroutines is inadvisable, unless they can be entirely
  accomplished within a block.
  
-A normal jump, normal branch and a normal function call may only be taken by letting
-the VLIW group end, returning to "normal" standard RV mode, and then using standard RVC, 32 bit
-or P48/64-\*-type opcodes.
+A normal jump, normal branch and a normal function call may only be taken
+by letting the VLIW group end, returning to "normal" standard RV mode,
+and then using standard RVC, 32 bit or P48/64-\*-type opcodes.
  
  ## Links
  
@@ -2421,10 +2584,15 @@ different subsets of RV.
  
  ## Common options
  
-It is permitted to only implement SVprefix and not the VLIW instruction format option.
-UNIX Platforms **MUST** raise illegal instruction on seeing a VLIW opcode so that traps may emulate the format.
+It is permitted to only implement SVprefix and not the VLIW instruction
+format option, and vice-versa.  UNIX Platforms **MUST** raise illegal
+instruction on seeing an unsupported VLIW or SVprefix opcode, so that
+traps may emulate the format.
  
-It is permitted in SVprefix to either not implement VL or not implement SUBVL (see [[sv_prefix_proposal]] for full details. Again, UNIX Platforms *MUST* raise illegal instruction on implementations that do not support VL or SUBVL.
+It is permitted in SVprefix to either not implement VL or not implement
+SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
+*MUST* raise illegal instruction on implementations that do not support
+VL or SUBVL.
  
  It is permitted to limit the size of either (or both) the register files
  down to the original size of the standard RV architecture.  However, below
@@ -2530,8 +2698,15 @@ Could the 8 bit Register VLIW format use regnum<<1 instead, only accessing regs
  
  --
  
+Expand the range of SUBVL and its associated svsrcoffs and svdestoffs by
+adding a 2nd STATE CSR (or extending STATE to 64 bits).  Future version?
+
+--
+
  TODO evaluate strncpy and strlen
-https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ
+<https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
+
+RVV version: <a name="strncpy"></a>
  
      strncpy: 
          mv a3, a0               # Copy dst 
@@ -2552,7 +2727,50 @@ https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ
      exit: 
          ret 
  
+SV version (WIP):
+
+    strncpy:
+        mv a3, a0
+        SETMVLI 8 # set max vector to 8
+        RegCSR[a3] = 8bit, a3, vector
+        RegCSR[a1] = 8bit, a3, vector
+        PredTb[t0] = ffirst, x0, inv
+        add t2, x0, x0 #t2 = 0
+    loop:
+        SETVLI a2, t4 # t4 and VL now 1..8
+        ldb t0, (a1) # t0 fail first mode
+        bne t0, x0, allnonzero # still ff
+        # VL points to last nonzero
+        GETVL t4       # from bne tests
+        addi t4, t4, 1 # include zero
+        SETVL t4       # set exactly to t4
+        stb t0, (a3)   # store incl zero
+        ret            # end subroutine
+    allnonzero:
+        stb t0, (a3)    # VL legal range
+        GETVL t4        # from bne tests
+        add a1, a1, t4  # Bump src pointer 
+        sub a2, a2, t4  # Decrement count. 
+        add a3, a3, t4  # Bump dst pointer 
+        bnez a2, loop   # Anymore? 
+    exit:
+        ret
+
+Notes:
  
+* ldb and bne are both using t0, both in ffirst mode
+* ldb will stop at illegal mem and reduce VL, but will already have copied all sorts of stuff into t0
+* bne behaviour modified to do multiple tests (more like FNE).
+* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
+* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
+* the branch only goes to allnonzero if all tests succeed
+* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
+* SETVL sets *exactly* the requested amount into VL.
+* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
+* this would cause the stb to copy up to the end of the legal memory
+* of course, on the next loop the ldb would throw a trap, as a1 points to the first illegal mem location.
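
The fail-first behaviour described in the notes above can be modelled in a few lines of Python. This is a rough sketch under stated assumptions, not a definition of the semantics: `ffirst_load`, `ffirst_all_nonzero` and `strncpy_model` are hypothetical names, "illegal memory" is modelled simply as running off the end of a `bytes` object, and (like the WIP listing) the model does not zero-pad the destination.

```python
MAXVL = 8  # matches "SETMVLI 8" in the SV listing


def ffirst_load(mem, addr, vl):
    """Model of ldb in fail-first mode: VL shrinks at the first illegal byte."""
    legal = max(0, min(vl, len(mem) - addr))
    return list(mem[addr:addr + legal]), legal


def ffirst_all_nonzero(elems, vl):
    """Model of bne in fail-first mode: the first zero stops the compares
    and reduces VL to the number of nonzero elements seen so far."""
    for i in range(vl):
        if elems[i] == 0:
            return False, i
    return True, vl


def strncpy_model(mem, src, n):
    out = bytearray()
    while n > 0:
        vl = min(MAXVL, n)                 # SETVLI a2, t4
        elems, vl = ffirst_load(mem, src, vl)
        if vl == 0:
            # Assumption: a fault on element 0 traps instead of reducing VL.
            raise MemoryError("load of first element is illegal")
        ok, vl = ffirst_all_nonzero(elems, vl)
        if not ok:
            vl += 1                        # addi t4, t4, 1: include the zero
            out += bytes(elems[:vl])
            return bytes(out)
        out += bytes(elems[:vl])           # stb over the legal VL range
        src += vl
        n -= vl
    return bytes(out)


print(strncpy_model(b"hello world\0junk", 0, 32))
```

In the example call the first pass copies eight nonzero bytes, and the second pass stops at the terminator and copies it as well, so nothing after the zero byte is transferred.
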
+
+RVV version:
  
          mv a3, a0             # Save start 
      loop: