(no commit message)

diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn

index d704266f60f98df579472623d09df435dc1cb535..55e537018c6b71eed4347e60514cd025d5a5c530 100644
--- a/openpower/sv/svp64.mdwn
+++ b/openpower/sv/svp64.mdwn
@@ -44,12 +44,12 @@ Table of contents
  Simple-V is a type of Vectorisation best described as a "Prefix Loop
  Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and
  to the 8086 `REP` Prefix instruction.  More advanced features are similar
-to the Z80 `CPIR` instruction. If naively viewed one-dimensionally as an actual
-Vector ISA it introduces over 1.5 million 64-bit True-Scalable Vector instructions
-on the SFFS Subset and closer to 10 million 64-bit True-Scalable Vector
-instructions if introduced on VSX.
-SVP64, the instruction format used by Simple-V, is therefore best viewed
-as an orthogonal RISC-paradigm "Prefixing" subsystem instead.
+to the Z80 `CPIR` instruction. If naively viewed one-dimensionally as an
+actual Vector ISA it introduces over 1.5 million 64-bit True-Scalable
+Vector instructions on the SFFS Subset and closer to 10 million 64-bit
+True-Scalable Vector instructions if introduced on VSX.  SVP64, the
+instruction format used by Simple-V, is therefore best viewed as an
+orthogonal RISC-paradigm "Prefixing" subsystem instead.
  
  Except where explicitly stated all bit numbers remain as in the rest of
  the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on
@@ -64,15 +64,15 @@ the following instruction, but does **not** change the actual Decoding
  of that following instruction.  **All prefixed 32-bit instructions
  (Defined Words) retain their non-prefixed encoding and definition**.
  
-Two apparent exceptions to the above hard rule exist: SV Branch-Conditional
-operations and LD/ST-update "Post-Increment" Mode.  Post-Increment
-was considered sufficiently high priority (significantly reducing hot-loop
-instruction count) that one bit in the Prefix is reserved for it
-(Note the intention to release that bit and move Post-Increment instructions
-to EXT2xx).
+Two apparent exceptions to the above hard rule exist: SV
+Branch-Conditional operations and LD/ST-update "Post-Increment"
+Mode.  Post-Increment was considered sufficiently high priority
+(significantly reducing hot-loop instruction count) that one bit in
+the Prefix is reserved for it (*Note the intention to release that bit
+and move Post-Increment instructions to EXT2xx, as part of [[ls011]]*).
  Vectorised Branch-Conditional operations "embed" the original Scalar
-Branch-Conditional behaviour into a much more advanced variant that
-is highly suited to High-Performance Computation (HPC), Supercomputing,
+Branch-Conditional behaviour into a much more advanced variant that is
+highly suited to High-Performance Computation (HPC), Supercomputing,
  and parallel GPU Workloads.
  
  *Architectural Resource Allocation note: it is prohibited to accept RFCs
@@ -100,7 +100,7 @@ only 24 bits:
  * Predication on both source and destination
  * Two different sources of predication: INT and CR Fields
  * SV Modes including saturation (for Audio, Video and DSP), mapreduce,
-  fail-first and predicate-result mode.
+  and fail-first mode.
  
  Different classes of operations require different formats. The earlier
  sections cover the common formats and the four separate modes follow:
@@ -137,8 +137,101 @@ required (identifying whether the Suffix - Defined Word - is
  Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
  adding "UnVectorised" to this phase is not unreasonable.*
  
+## Definition of Strict Program Order
+
+Strict Program Order is defined as giving the appearance, as far
+as programs are concerned, that instructions were executed
+strictly in the sequence that they occurred.  A "Precise"
+out-of-order
+Micro-architecture goes to considerable lengths to ensure that
+this is the case.
+
+Many Vector ISAs allow interrupts to occur in the middle of
+processing of large Vector operations, only under the condition
+that partial results are cleanly discarded, and continuation on return
+from the Trap Handler will restart the entire operation.
+The reason is that saving of full Architectural State is
+not practical.
+
+Simple-V operates on an entirely different paradigm from traditional
+Vector ISAs: as a Sub-Program Counter where "Elements" are synonymous
+with Scalar instructions. With this in mind it is critical for
+implementations to observe Strict Element-Level Program Order
+at all times
+(often simply referred to as just "Strict Program Order"
+throughout
+this Chapter).
+*Any* element is Interruptible and Simple-V has
+been carefully designed to guarantee that Architectural State may
+be fully preserved and restored regardless of that same State.
+Recovery is however not guaranteed to be low-latency
+(particularly if REMAP is active).
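
The element-level interruptibility described above can be illustrated with a minimal C sketch. It is not taken from the specification: `sv_state_t` is a hypothetical, heavily simplified stand-in for `SVSTATE` (holding only a Vector Length and an element step), and `sv_add_resumable` and its `budget` parameter are invented here purely to model an interrupt arriving part-way through a Vector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for SVSTATE:
   only the Vector Length and the current element step. */
typedef struct {
    int vl;    /* number of elements to process */
    int step;  /* next element to execute (the "Sub-PC") */
} sv_state_t;

/* Execute a Vectorised add one element at a time, stopping after
   `budget` elements to model an interrupt arriving mid-Vector.
   Returns 1 if the whole Vector completed, 0 if interrupted. */
int sv_add_resumable(sv_state_t *st, const int64_t *ra, const int64_t *rb,
                     int64_t *rt, int budget) {
    assert(st->step >= 0 && st->step <= st->vl);
    while (st->step < st->vl) {
        if (budget-- <= 0)
            return 0;          /* interrupted: st->step records progress */
        rt[st->step] = ra[st->step] + rb[st->step];
        st->step++;            /* element committed, strictly in order */
    }
    st->step = 0;              /* Vector instruction complete */
    return 1;
}
```

Under these assumptions, copying the `sv_state_t` value out and back in (the analogue of manually saving and restoring `SVSTATE`) is sufficient for a second call to resume at exactly the element where the first call stopped.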
+
+Interrupts still only save `PC` and `MSR` (in `SRR0` and `SRR1`
+respectively) but the full SVP64 Architectural State may be saved and
+restored through manual copying of `SVSTATE` (and the four
+REMAP SPRs if in use at the time).
+Whilst this initially sounds unsafe, in reality
+all that Trap Handlers (and function call stack save/restore)
+need do is avoid
+use of SVP64 Prefixed instructions to perform the necessary
+save/restore of Simple-V Architectural State.
+This capability also allows nested function calls to be made from
+inside Vertical-First Vector loops, which is very rare for Vector ISAs.
+
+Strict Program Order is also preserved by the Parallel Reduction
+REMAP Schedule, but only at the cost of requiring the destination
+Vector to be used (Deterministically) to store partial progress of the
+Parallel Reduction.
+
+The only major caveat for REMAP is that
+after an explicit change to
+Architectural State caused by writing to the
+Simple-V SPRs, some implementations may find
+it simpler to spend longer recalculating where in a given Schedule
+the re-mapping Indices were.  Obvious examples include Interrupts occurring
+in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
+for example), which
+will force implementations to perform divide and modulo
+calculations.
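
To show why non-power-of-two dimensions force divide and modulo, the following C sketch recovers three loop indices from a single linear step counter. It is an illustration only: the loop-nest order (row outermost, inner dimension innermost), the `mm_index_t` type and the `recover_indices` function are assumptions made here and are not the actual REMAP Schedule.

```c
#include <assert.h>

/* Hypothetical index triple for a matrix multiply loop nest. */
typedef struct {
    int row;    /* row of the result matrix */
    int col;    /* column of the result matrix */
    int inner;  /* position along the shared (inner) dimension */
} mm_index_t;

/* Recover the three loop indices from a linear step count, assuming
   the nest is: for row, for col, for inner (innermost). */
mm_index_t recover_indices(int step, int rows, int cols, int inner) {
    assert(rows > 0 && cols > 0 && inner > 0);
    assert(step >= 0 && step < rows * cols * inner);
    mm_index_t idx;
    idx.inner = step % inner;
    idx.col = (step / inner) % cols;
    idx.row = step / (inner * cols);
    return idx;
}
```

With power-of-two dimensions the same recovery reduces to shifts and masks, which is why the 5x3 by 3x3 case is singled out as the costly one.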
+
+An additional caveat involves Condition Register Fields
+when also used as Predicate Masks. An operation that
+overwrites the same CR Fields that are simultaneously
+being used as a Predicate Mask is `UNDEFINED` behaviour
+if the overwritten CR field element was needed by a
+subsequent Element for its Predicate Mask bit.
+This allows implementations to relax some of the
+draconian Register Hazards that would otherwise
+occur, and to consider internal caching of the CR-based
+Predicate
+bits, but some implementations *may not necessarily
+perform pre-reading* and consequently the risk of
+overwrite is the responsibility of the Programmer.
+Special care is particularly needed here when using REMAP.
+
  ## Register files, elements, and Element-width Overrides
  
+The relationship between register files, elements, and element-width
+overrides is expressed as follows:
+
+* register files are considered to be *byte-level* contiguous SRAMs,
+  accessed exclusively in Little-Endian Byte-Order at all times
+* elements are sequential contiguous unbounded arrays starting at the "address"
+  of any given 64-bit GPR or FPR, numbered from 0 as the first,
+  "spilling" into numerically-sequentially-increasing GPRs
+* element-width overrides set the width of the *elements* in the
+  sequentially-numbered contiguous array.
+
+The relationship is best defined in Canonical form, below, in ANSI c as a
+union data structure.  A key difference is that VSR elements are bounded,
+fixed at 128-bit, whereas SVP64 elements are conceptually unbounded and
+only limited by the Maximum Vector Length.
+
+*Future specification note: SVP64 may be defined on top of VSRs in future,
+at which point VSX also gains conceptually unbounded VSR register elements.*
+
  In the Upper Compliancy Levels of SVP64 the size of the GPR and FPR
  Register files are expanded from 32 to 128 entries, and the number of
  CR Fields expanded from CR0-CR7 to CR0-CR127. (Note: A future version
@@ -150,6 +243,16 @@ on the Load and Store memory-register operation byte-order, and having
  nothing to do with the ordering of the contents of register files or
  register-register operations.
  
+The only major impact on Arithmetic and Logical operations is that all
+Scalar operations are defined, where practical and workable, to have
+three new widths: elwidth=32, elwidth=16, elwidth=8.  The default of
+elwidth=64 is the pre-existing (Scalar) behaviour which remains 100%
+unchanged. Thus, `addi` is now joined by a 32-bit, 16-bit, and 8-bit
+variant of `addi`, but the sole difference is the width.
+*In no way* is the actual `addi` instruction fundamentally altered.
+Element-width overrides for FP Operations are also defined, as explained in
+the [[svp64/appendix]].
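
As a rough illustration of "only the width changes", the C sketch below models `addi` at each element width. It is written under stated assumptions rather than copied from the specification: the helper name `addi_elwidth` is hypothetical, the immediate is sign-extended as in Scalar `addi`, and the result is assumed simply to be truncated to the element width.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of addi at a given element width.
   Assumption: the result is the 64-bit sum truncated to elwidth bits. */
uint64_t addi_elwidth(uint64_t ra, int16_t simm, int elwidth) {
    assert(elwidth == 8 || elwidth == 16 || elwidth == 32 || elwidth == 64);
    uint64_t mask = (elwidth == 64) ? UINT64_MAX
                                    : ((UINT64_C(1) << elwidth) - 1);
    /* sign-extend the immediate, add modulo 2^64, then truncate */
    uint64_t sum = ra + (uint64_t)(int64_t)simm;
    return sum & mask;
}
```

In this model `elwidth=64` gives the unchanged Scalar result, while the narrower widths differ only in where the result wraps.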
+
  To be absolutely clear:
  
  ```
@@ -167,14 +270,13 @@ sequentially from the LSB end
  incrementally to the MSB end (confusingly numbered the lowest in
  MSB0 ordering).
  
-When exclusively using MSB0-numbering, SVP64
-becomes unnecessarily complex to both express and subsequently understand:
-the required conditional subtractions from 63,
-31, 15 and 7 needed to express the fact that elements are LSB0-sequential
-unfortunately become a hostile minefield, obscuring both
-intent and meaning. Therefore for the
-purposes of this section the more natural **LSB0 numbering is assumed**
-and it is left to the reader to translate to MSB0 numbering.
+When exclusively using MSB0-numbering, SVP64 becomes unnecessarily complex
+to both express and subsequently understand: the required conditional
+subtractions from 63, 31, 15 and 7 needed to express the fact that
+elements are LSB0-sequential unfortunately become a hostile minefield,
+obscuring both intent and meaning. Therefore for the purposes of this
+section the more natural **LSB0 numbering is assumed** and it is left
+to the reader to translate to MSB0 numbering.
  
  The Canonical specification for how element-sequential numbering and
  element-width overrides is defined is expressed in the following c
@@ -186,15 +288,20 @@ from Figure 97, Book I, Section 6.3, Page 258:
  ```
      #pragma pack
      typedef union {
+        uint8_t actual_bytes[8];
+        // all of these are very deliberately unbounded arrays
+        // that intentionally "wrap" into subsequent actual_bytes...
          uint8_t  bytes[]; // elwidth 8
          uint16_t hwords[]; // elwidth 16
          uint32_t words[]; // elwidth 32
          uint64_t dwords[]; // elwidth 64
-        uint8_t actual_bytes[8];
+
      } el_reg_t;
  
+    // ... here, as packed statically-defined GPRs.
      el_reg_t int_regfile[128];
  
+    // use element 0 as the destination
      void get_register_element(el_reg_t* el, int gpr, int element, int width) {
          switch (width) {
              case 64: el->dwords[0] = int_regfile[gpr].dwords[element];
@@ -203,6 +310,8 @@ from Figure 97, Book I, Section 6.3, Page 258:
              case 8 : el->bytes[0] = int_regfile[gpr].bytes[element];
          }
      }
+
+    // use element 0 as the source
      void set_register_element(el_reg_t* el, int gpr, int element, int width) {
          switch (width) {
              case 64: int_regfile[gpr].dwords[element] = el->dwords[0];
@@ -229,14 +338,15 @@ However if elwidth overrides are set to 16 for both source and destination:
          int_regfile[RT].hwords[i] = int_regfile[RA].hwords[i] + int_regfile[RB].hwords[i]
  ```
  
-The most fundamental aspect here to understand is that the wrapping into
-subsequent Scalar GPRs that occurs on larger-numbered elements
-including and especially on smaller element widths is **deliberate and intentional**.
-From this Canonical definition it should be clear that sequential elements begin
-at the LSB end of any given underlying Scalar GPR, progress to the MSB end, and
-then to the LSB end of the *next numerically-larger Scalar GPR*.  In the
-example above if VL=5 and RT=1 then the contents of GPR(1) and GPR(2) will
-be as follows.  For clarity in the table below:
+The most fundamental aspect here to understand is that the wrapping
+into subsequent Scalar GPRs that occurs on larger-numbered elements
+including and especially on smaller element widths is **deliberate
+and intentional**.  From this Canonical definition it should be clear
+that sequential elements begin at the LSB end of any given underlying
+Scalar GPR, progress to the MSB end, and then to the LSB end of the
+*next numerically-larger Scalar GPR*.  In the example above if VL=5
+and RT=1 then the contents of GPR(1) and GPR(2) will be as follows.
+For clarity in the table below:
  
  * Both MSB0-ordered bitnumbering *and* LSB-ordered bitnumbering are shown
  * The GPR-numbering is considered LSB0-ordered
@@ -257,9 +367,9 @@ be as follows.  For clarity in the table below:
  ```
  
  Note that the upper 48 bits of GPR(2) would **not** be modified due to
-the example having VL=5.  Thus on "wrapping" - sequential progression from
-GPR(1) into GPR(2) - the 5th result modifies
-**only** the bottom 16 LSBs of GPR(1).
+the example having VL=5.  Thus on "wrapping" - sequential progression
+from GPR(1) into GPR(2) - the 5th result modifies **only** the bottom
+16 LSBs of GPR(2).
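
The wrapping behaviour for VL=5, RT=1 and 16-bit elements can be checked with a small host-independent C model. This is a sketch under assumptions, not the Canonical definition: the register file is modelled as a flat little-endian byte array, and the names `regfile_bytes`, `read_element`, `write_element`, `read_gpr` and `sv_add16` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_GPRS 128

/* Assumption: one flat little-endian byte array, 8 bytes per GPR. */
static uint8_t regfile_bytes[NUM_GPRS * 8];

/* Read element `elt` of `width_bits`, counting from the start of `gpr`. */
uint64_t read_element(int gpr, int elt, int width_bits) {
    assert(width_bits == 8 || width_bits == 16 ||
           width_bits == 32 || width_bits == 64);
    size_t nbytes = (size_t)width_bits / 8;
    size_t offs = (size_t)gpr * 8 + (size_t)elt * nbytes;
    assert(offs + nbytes <= sizeof(regfile_bytes));
    uint64_t value = 0;
    for (size_t b = 0; b < nbytes; b++)
        value |= (uint64_t)regfile_bytes[offs + b] << (8 * b);
    return value;
}

/* Write the low `width_bits` of `value`; only those bytes are touched. */
void write_element(int gpr, int elt, int width_bits, uint64_t value) {
    assert(width_bits == 8 || width_bits == 16 ||
           width_bits == 32 || width_bits == 64);
    size_t nbytes = (size_t)width_bits / 8;
    size_t offs = (size_t)gpr * 8 + (size_t)elt * nbytes;
    assert(offs + nbytes <= sizeof(regfile_bytes));
    for (size_t b = 0; b < nbytes; b++)
        regfile_bytes[offs + b] = (uint8_t)(value >> (8 * b));
}

/* View a whole GPR as a single 64-bit value. */
uint64_t read_gpr(int gpr) {
    return read_element(gpr, 0, 64);
}

/* Element-wise add with elwidth=16 on both source and destination. */
void sv_add16(int RT, int RA, int RB, int VL) {
    for (int i = 0; i < VL; i++) {
        uint64_t sum = read_element(RA, i, 16) + read_element(RB, i, 16);
        write_element(RT, i, 16, sum);
    }
}
```

In this model the fifth element (index 4) lands in the lowest two bytes of GPR(2), and the remaining six bytes of GPR(2) keep whatever they held before, mirroring the effect of byte-level write-enable lines.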
  
  Hardware Architectural note: to avoid a Read-Modify-Write at the register
  file it is strongly recommended to implement byte-level write-enable lines
@@ -292,27 +402,27 @@ Operation, the exact same contents would be viewed as follows:
  ```
  
  In other words, this perspective really is no different from the situation
-where the actual Register File is treated as an Industry-standard byte-level-addressable
-Little-Endian-addressed SRAM.  Note that this perspective does **not**
-involve `MSR.LE` in any way shape or form because `MSR.LE` is directly
-in control of the Memory-to-Register byte-ordering. This section is
-exclusively about how to correctly perceive Simple-V-Augmented **Register**
-Files.
+where the actual Register File is treated as an Industry-standard
+byte-level-addressable Little-Endian-addressed SRAM.  Note that
+this perspective does **not** involve `MSR.LE` in any way shape or
+form because `MSR.LE` is directly in control of the Memory-to-Register
+byte-ordering. This section is exclusively about how to correctly perceive
+Simple-V-Augmented **Register** Files.
  
  **Comparative equivalent using VSR registers**
  
  For a comparative data point the VSR Registers may be expressed in the
  same fashion. The c code below is directly an expression of Figure 97 in
-Power ISA Public v3.1 Book I Section 6.3 page 258, *after compensating for
-MSB0 numbering in both bits and elements, adapting in full to LSB0 numbering,
-and obeying LE ordering*.
+Power ISA Public v3.1 Book I Section 6.3 page 258, *after compensating
+for MSB0 numbering in both bits and elements, adapting in full to LSB0
+numbering, and obeying LE ordering*.
  
-**Crucial to understanding why the subtraction from 1,3,7,15 is present
-is because the Power ISA numbers VSX Registers elements also in MSB0 order**.
+**Crucial to understanding why the subtraction from 1,3,7,15 is present is
+because the Power ISA numbers VSX Registers elements also in MSB0 order**.
  SVP64 very specifically numbers elements in **LSB0** order with the first
-element (numbered zero) being at the bitwise-numbered **LSB** end of the register, where VSX
-does the reverse: places the numerically-*highest* (last-numbered) element at
-the LSB end of the register.
+element (numbered zero) being at the bitwise-numbered **LSB** end of the
+register, where VSX does the reverse: places the numerically-*highest*
+(last-numbered) element at the LSB end of the register.
  
  
  ```
@@ -358,21 +468,21 @@ the LSB end of the register.
      }
  ```
  
-For VSR Registers one key difference is that the overlay of different element
-widths is clearly a *bounded static quantity*, whereas for Simple-V the
-elements are
-unrestrained and permitted to flow into *successive underlying Scalar registers*.
-This difference is absolutely critical to a full understanding of the entire
-Simple-V paradigm and why element-ordering, bit-numbering *and register numbering*
-are all so strictly defined.
+For VSR Registers one key difference is that the overlay of different
+element widths is clearly a *bounded static quantity*, whereas for
+Simple-V the elements are unrestrained and permitted to flow into
+*successive underlying Scalar registers*.  This difference is absolutely
+critical to a full understanding of the entire Simple-V paradigm and
+why element-ordering, bit-numbering *and register numbering* are all so
+strictly defined.
  
-Implementations are not permitted to violate the Canonical definition. Software
-will be critically relying on the wrapped (overflow) behaviour inherently
-implied by the unbounded variable-length c arrays.
+Implementations are not permitted to violate the Canonical
+definition. Software will be critically relying on the wrapped (overflow)
+behaviour inherently implied by the unbounded variable-length c arrays.
  
-Illustrating the exact same loop with the exact same effect as achieved by Simple-V
-we are first forced to create wrapper functions, to cater for the fact
-that VSR register elements are static bounded:
+Illustrating the exact same loop with the exact same effect as achieved
+by Simple-V we are first forced to create wrapper functions, to cater
+for the fact that VSR register elements are static bounded:
  
  ```
      int calc_VSR_reg_offs(int elt, int width) {
@@ -426,19 +536,19 @@ whereas when VL=1 and the SV prefix is all zeros, the operation simply
  acts as if SV had not been applied at all to the instruction  (an
  "identity transformation").
  
-The fact that `VL` is dynamic and can be set to any value at runtime based
-on program conditions and behaviour means very specifically that
-`scalar identity behaviour` is **not** a redundant encoding. If the
-only means by which VL could be set was by way of static-compiled
-immediates then this assertion would be false.  VL should not
-be confused with MAXVL when understanding this key aspect of SimpleV.
+The fact that `VL` is dynamic and can be set to any value at runtime
+based on program conditions and behaviour means very specifically that
+`scalar identity behaviour` is **not** a redundant encoding. If the only
+means by which VL could be set was by way of static-compiled immediates
+then this assertion would be false.  VL should not be confused with
+MAXVL when understanding this key aspect of SimpleV.
  
  ## Register Naming and size
  
-As indicated above SV Registers are simply the GPR, FPR and CR
-register files extended linearly to larger sizes; SV Vectorisation
-iterates sequentially through these registers (LSB0 sequential ordering
-from 0 to VL-1).
+As indicated above SV Registers are simply the GPR, FPR and CR register
+files extended linearly to larger sizes; SV Vectorisation iterates
+sequentially through these registers (LSB0 sequential ordering from 0
+to VL-1).
  
  Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is
  r0 to r31, SV extends this as r0 to r127.  Likewise FP registers are
@@ -458,13 +568,13 @@ This is part of `scalar identity behaviour` described above.
  
  **Condition Register(s)**
  
-The Scalar Power ISA Condition Register is a 64 bit register where the top
-32 MSBs (numbered 0:31 in MSB0 numbering) are not used.  This convention is
-*preserved*
-in SVP64 and an additional 15 Condition Registers provided in
-order to store the new CR Fields, CR8-CR15, CR16-CR23 etc. sequentially.
-The top 32 MSBs in each new SVP64 Condition Register are *also* not used:
-only the bottom 32 bits (numbered 32:63 in MSB0 numbering).
+The Scalar Power ISA Condition Register is a 64 bit register where
+the top 32 MSBs (numbered 0:31 in MSB0 numbering) are not used.
+This convention is *preserved* in SVP64 and an additional 15 Condition
+Registers provided in order to store the new CR Fields, CR8-CR15,
+CR16-CR23 etc. sequentially.  The top 32 MSBs in each new SVP64 Condition
+Register are *also* not used: only the bottom 32 bits (numbered 32:63
+in MSB0 numbering).
  
  *Programmer's note: using `sv.mfcr` without element-width overrides
  to take into account the fact that the top 32 MSBs are zero and thus
@@ -484,6 +594,55 @@ Condition-Register instructions because all other CR instructions,
  on closer investigation, will be observed to all be CR-bit or CR-Field
  related. Thus a `VL` of 16 must be used*
  
+**Condition Register Fields as Predicate Masks**
+
+Condition Register Fields perform an additional duty in Simple-V: they are
+used for Predicate Masks.  ARM's Scalar Instruction Set calls single-bit
+predication "Conditional Execution", and utilises Condition Codes for
+exactly this purpose to solve the problem caused by Branch Speculation.
+In a Vector ISA context the concept of Predication is naturally extended
+from single-bit to multi-bit, and the (well-known) benefits become all the
+more critical given that parallel branches in Vector ISAs are impossible
+(even a Vector ISA can only have Scalar branches).
+
+However the Scalar Power ISA does not have Conditional Execution (for
+which, if it had ever been considered, Condition Register bits would be
+a perfect natural fit).  Thus, when adding Predication using CR Fields
+via Simple-V it becomes a somewhat disruptive addition to the Power ISA.
+
+To ameliorate this situation, particularly for pre-existing Hardware
+designs implementing up to Scalar Power ISA v3.1, some rules are set that
+allow those pre-existing designs not to require heavy modification to
+their existing Scalar pipelines.  These rules effectively allow Hardware
+Architects to add the additional CR Fields CR8 to CR127 as if they were
+an **entirely separate register file**.
+
+* any instruction involving more than one source and one destination,
+  where one of the operands is a Condition Register, is prohibited from
+  using registers from both the CR0-7 group and the CR8-127 group at
+  the same time.
+* any instruction involving one source and one destination, where either
+  the source or the destination is a Condition Register, is prohibited
+  from setting CR0-7 as a Vector.
+* prohibitions are required to be enforced by raising Illegal Instruction
+  Traps.
+
+Examples of permitted instructions:
+
+```
+    sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127
+    sv.mfcr cr5, *cr40                   # only one source (CR40) copied to CR5
+    sv.mfcr *cr16, cr40                  # Vector-Splat CR40 onto CR16,17,18...
+    sv.mfcr *cr16, cr3                   # Vector-Splat CR3 onto CR16,17,18...
+```
+
+Examples of prohibited instructions:
+
+```
+    sv.mfcr *cr0, cr40        # Vector-Splat onto CR0,1,2
+    sv.crand cr7, cr9, cr10   # crosses over between CR0-7 and CR8-127
+```
+
  ## Future expansion.
  
  With the way that EXTRA fields are defined and applied to register
@@ -655,6 +814,14 @@ The SUBVL encoding value may be thought of as an inclusive range of a
  sub-vector.  SUBVL=2 represents a vec2, its encoding is 0b01, therefore
  this may be considered to be elements 0b00 to 0b01 inclusive.
  
+Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+element operation issued, SUBVL element operations are issued (as an inner loop).
+The key difference between VL looping and SUBVL looping
+is that predication bits are applied per
+**group**, rather than by individual element.  
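
The difference between VL looping and SUBVL looping may be sketched in C as below. This is illustrative only: the function name `sv_subvl_loop`, the "add one" element operation, and the element layout (`i * SUBVL + s`, i.e. sub-vector elements adjacent) are assumptions made for the example, and no `pack` or `unpack` behaviour is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Outer loop over VL groups, inner loop over SUBVL elements.
   One predicate bit gates an entire group of SUBVL elements.
   Returns the number of element operations actually issued. */
int sv_subvl_loop(uint64_t predicate, int VL, int SUBVL,
                  const int32_t *src, int32_t *dst) {
    assert(VL >= 0 && VL <= 64);
    assert(SUBVL >= 1 && SUBVL <= 4);
    int issued = 0;
    for (int i = 0; i < VL; i++) {
        if (((predicate >> i) & 1) == 0)
            continue;                  /* whole group masked out */
        for (int s = 0; s < SUBVL; s++) {
            int idx = i * SUBVL + s;   /* assumed element layout */
            dst[idx] = src[idx] + 1;   /* placeholder element operation */
            issued++;
        }
    }
    return issued;
}
```

With VL=3, SUBVL=2 and a predicate of `0b101`, groups 0 and 2 are executed (four element operations in total) and both elements of group 1 are skipped together.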
+
+Directly related to `subvl` are the `pack` and `unpack` Mode bits of `SVSTATE`.
+
  ## MASK/MASK_SRC & MASKMODE Encoding
  
  One bit (`MASKMODE`) indicates the mode: CR or Int predication.   The two
@@ -679,11 +846,12 @@ used for both src and dest, or different regs (one for src, one for dest).
  Likewise CR based twin predication has a second set of 3 bits, allowing
  a different test to be applied.
  
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed.  In practice (for CR Fields)
-this creates an unnecessary block on parallelism.  Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop.  Doing so results
+Note that it cannot necessarily be assumed that Predicate Masks
+(whether INT or CR) are read in full *before* the operations proceed.
+In practice (for CR Fields) reading them in full beforehand creates an
+unnecessary block on parallelism, prohibiting "Vector Chaining".
+Therefore, it is up to the programmer to ensure that the CR field
+Elements used as Predicate Masks are not overwritten by any parallel
+Vector Loop.  Doing so results
  in **UNDEFINED** behaviour, according to the definition outlined in the
  Power ISA v3.0B Specification.
  
@@ -692,6 +860,14 @@ of individual CR fields until the actual predicated element operation
  needs to take place, safe in the knowledge that no programmer will have
  issued a Vector Instruction where previous elements could have overwritten
  (destroyed) not-yet-executed CR-Predicated element operations.
+This particularly is an issue when using REMAP, as the order in
+which CR-Field-based Predicate Mask bits could be read on a per-element
+execution basis could well conflict with the order in which prior
+elements wrote to the very same CR Field.
+
+Additionally Programmers should avoid using r3, r10 or r30
+as destination registers when these are also used as a Predicate
+Mask. Doing so is again UNDEFINED behaviour.
  
  ### Integer Predication (MASKMODE=0)
  
@@ -715,6 +891,7 @@ following meaning:
  r10 and r30 are at the high end of temporary and unused registers,
  so as not to interfere with register allocation from ABIs.
  
+
  ### CR-based Predication (MASKMODE=1)
  
  When the predicate mode bit is one the 3 bits are interpreted as below.