(no commit message)

[libreriscv.git] / openpower / sv / svp64.mdwn
diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn

index f662ba2a4d374ae8d342f48b4a4f6bef3bbe3731..55e537018c6b71eed4347e60514cd025d5a5c530 100644 (file)
--- a/openpower/sv/svp64.mdwn
+++ b/openpower/sv/svp64.mdwn
@@ -65,14 +65,15 @@ of that following instruction.  **All prefixed 32-bit instructions
  (Defined Words) retain their non-prefixed encoding and definition**.
  
  Two apparent exceptions to the above hard rule exist: SV
-Branch-Conditional operations and LD/ST-update "Post-Increment" Mode.
-Post-Increment was considered sufficiently high priority (significantly
-reducing hot-loop instruction count) that one bit in the Prefix
-is reserved for it (Note the intention to release that bit and move
-Post-Increment instructions to EXT2xx).  Vectorised Branch-Conditional
-operations "embed" the original Scalar Branch-Conditional behaviour into
-a much more advanced variant that is highly suited to High-Performance
-Computation (HPC), Supercomputing, and parallel GPU Workloads.
+Branch-Conditional operations and LD/ST-update "Post-Increment"
+Mode.  Post-Increment was considered sufficiently high priority
+(significantly reducing hot-loop instruction count) that one bit in
+the Prefix is reserved for it (*Note the intention to release that bit
+and move Post-Increment instructions to EXT2xx, as part of [[ls011]]*).
+Vectorised Branch-Conditional operations "embed" the original Scalar
+Branch-Conditional behaviour into a much more advanced variant that is
+highly suited to High-Performance Computation (HPC), Supercomputing,
+and parallel GPU Workloads.
  
  *Architectural Resource Allocation note: it is prohibited to accept RFCs
  which fundamentally violate this hard requirement.  Under no circumstances
@@ -99,7 +100,7 @@ only 24 bits:
  * Predication on both source and destination
  * Two different sources of predication: INT and CR Fields
  * SV Modes including saturation (for Audio, Video and DSP), mapreduce,
-  fail-first and predicate-result mode.
+  and fail-first mode.
  
  Different classes of operations require different formats. The earlier
  sections cover the common formats and the four separate modes follow:
@@ -136,6 +137,80 @@ required (identifying whether the Suffix - Defined Word - is
  Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
  adding "UnVectorised" to this phase is not unreasonable.*
  
+## Definition of Strict Program Order
+
+Strict Program Order is defined as giving the appearance, as far
+as programs are concerned, that instructions were executed
+strictly in the sequence that they occurred.  A "Precise"
+out-of-order
+Micro-architecture goes to considerable lengths to ensure that
+this is the case.
+
+Many Vector ISAs allow interrupts to occur in the middle of
+processing of large Vector operations, only under the condition
+that partial results are cleanly discarded, and continuation on return
+from the Trap Handler will restart the entire operation.
+The reason is that saving of full Architectural State is
+not practical.
+
+Simple-V operates on an entirely different paradigm from traditional
+Vector ISAs: as a Sub-Program Counter where "Elements" are synonymous
+with Scalar instructions. With this in mind it is critical for
+implementations to observe Strict Element-Level Program Order
+at all times
+(often simply referred to as just "Strict Program Order"
+throughout
+this Chapter).
+*Any* element is Interruptible and Simple-V has
+been carefully designed to guarantee that Architectural State may
+be fully preserved and restored regardless of that same State, but
+it is not necessarily guaranteed that the amount of time needed to recover
+will be low latency (particularly if REMAP
+is active).
+
+Interrupts still only save `MSR` and `PC` in `SRR0` and `SRR1`
+but the full SVP64 Architectural State may be saved and
+restored through manual copying of `SVSTATE` (and the four
+REMAP SPRs if in use at the time)
+Whilst this initially sounds unsafe in reality
+all that Trap Handlers (and function call stack save/restore)
+need do is avoid
+use of SVP64 Prefixed instructions to perform the necessary
+save/restore of Simple-V Architectural State.
+This capability also allows nested function calls to be made from
+inside Vertical-First Vector loops, which is very rare for Vector ISAs.
+
+Strict Program Order is also preserved by the Parallel Reduction
+REMAP Schedule, but only at the cost of requiring the destination
+Vector to be used (Deterministically) to store partial progress of the
+Parallel Reduction.
+
+The only major caveat for REMAP is that
+after an explicit change to
+Architectural State caused by writing to the
+Simple-V SPRs, some implementations may find
+it easier to take longer to calculate where in a given Schedule
+the re-mapping Indices were.  Obvious examples include Interrupts occuring
+in the middle of a non-RADIX2 Matrix Multiply Schedule (5x3 by 3x3
+for example), which
+will force implementations to perform divide and modulo
+calculations.
+
+An additional caveat involves Condition Register Fields
+when also used as Predicate Masks. An operation that
+overwrites the same CR Fields that are simultaneously
+being used as a Predicate Mask is `UNDEFINED` behaviour
+if the overwritten CR field element was needed by a
+subsequent Element for its Predicate Mask bit.
+This allows implementations to relax some of the
+otherwise-draconian Register Hazards that would otherwise
+occur, and to consider internal cacheing of the CR-based
+Predicate
+bits, but some implementations *may not necessarily
+perform pre-reading* and consequently the risk of
+overwrite is the responsibility of the Programmer.
+Special care is particularly needed here when using REMAP.
+
  ## Register files, elements, and Element-width Overrides
  
  The relationship between register files, elements, and element-width
@@ -493,13 +568,13 @@ This is part of `scalar identity behaviour` described above.
  
  **Condition Register(s)**
  
-The Scalar Power ISA Condition Register is a 64 bit register where the top
-32 MSBs (numbered 0:31 in MSB0 numbering) are not used.  This convention is
-*preserved*
-in SVP64 and an additional 15 Condition Registers provided in
-order to store the new CR Fields, CR8-CR15, CR16-CR23 etc. sequentially.
-The top 32 MSBs in each new SVP64 Condition Register are *also* not used:
-only the bottom 32 bits (numbered 32:63 in MSB0 numbering).
+The Scalar Power ISA Condition Register is a 64 bit register where
+the top 32 MSBs (numbered 0:31 in MSB0 numbering) are not used.
+This convention is *preserved* in SVP64 and an additional 15 Condition
+Registers provided in order to store the new CR Fields, CR8-CR15,
+CR16-CR23 etc. sequentially.  The top 32 MSBs in each new SVP64 Condition
+Register are *also* not used: only the bottom 32 bits (numbered 32:63
+in MSB0 numbering).
  
  *Programmer's note: using `sv.mfcr` without element-width overrides
  to take into account the fact that the top 32 MSBs are zero and thus
@@ -519,6 +594,55 @@ Condition-Register instructions because all other CR instructions,
  on closer investigation, will be observed to all be CR-bit or CR-Field
  related. Thus a `VL` of 16 must be used*
  
+**Condition Register Fields as Predicate Masks**
+
+Condition Register Fields perform an additional duty in Simple-V: they are
+used for Predicate Masks.  ARM's Scalar Instruction Set calls single-bit
+predication "Conditional Execution", and utilises Condition Codes for
+exactly this purpose to solve the problem caused by Branch Speculation.
+In a Vector ISA context the concept of Predication is naturally extended
+from single-bit to multi-bit, and the (well-known) benefits become all the
+more critical given that parallel branches in Vector ISAs are impossible
+(even a Vector ISA can only have Scalar branches).
+
+However the Scalar Power ISA does not have Conditional Execution (for
+which, if it had ever been considered, Condition Register bits would be
+a perfect natural fit).  Thus, when adding Predication using CR Fields
+via Simple-V it becomes a somewhat disruptive addition to the Power ISA.
+
+To ameliorate this situation, particularly for pre-existing Hardware
+designs implementing up to Scalar Power ISA v3.1, some rules are set that
+allow those pre-existing designs not to require heavy modification to
+their existing Scalar pipelines.  These rules effectively allow Hardware
+Architects to add the additional CR Fields CR8 to CR127 as if they were
+an **entirely separate register file**.
+
+* any instruction involving more than 1 source 1 destination
+  where one of the operands is a Condition Register is prohibited from
+  using registers from both the CR0-7 group and the CR8-127 group at
+  the same time.
+* any instruction involving 1 source 1 destination where either the
+  source or the destination is a Condition Register is prohibited
+  from setting CR0-7 as a Vector.
+* prohibitions are required to be enforced by raising Illegal Instruction
+  Traps
+
+Examples of permitted instructions:
+
+```
+    sv.crand *cr8.eq, *cr16.le, *cr40.so # all CR8-CR127
+    sv.mfcr cr5, *cr40                   # only one source (CR40) copied to CR5
+    sv.mfcr *cr16, cr40                  # Vector-Splat CR40 onto CR16,17,18...
+    sv.mfcr *cr16, cr3                  # Vector-Splat CR3 onto CR16,17,18...
+```
+
+Examples of prohibited instructions:
+
+```
+    sv.mfcr *cr0, cr40        # Vector-Splat onto CR0,1,2
+    sv.crand cr7, cr9, cr10   # crosses over between CR0-7 and CR8-127
+```
+
  ## Future expansion.
  
  With the way that EXTRA fields are defined and applied to register
@@ -690,6 +814,14 @@ The SUBVL encoding value may be thought of as an inclusive range of a
  sub-vector.  SUBVL=2 represents a vec2, its encoding is 0b01, therefore
  this may be considered to be elements 0b00 to 0b01 inclusive.
  
+Effectively, SUBVL is like a SIMD multiplier: instead of just 1
+element operation issued, SUBVL element operations are issued (as an inner loop).
+The key difference between VL looping and SUBVL looping
+is that predication bits are applied per
+**group**, rather than by individual element.  
+
+Directly related to `subvl` is the `pack` and `unpack` Mode bits of `SVSTATE`.
+
  ## MASK/MASK_SRC & MASKMODE Encoding
  
  One bit (`MASKMODE`) indicates the mode: CR or Int predication.   The two
@@ -714,11 +846,12 @@ used for both src and dest, or different regs (one for src, one for dest).
  Likewise CR based twin predication has a second set of 3 bits, allowing
  a different test to be applied.
  
-Note that it is assumed that Predicate Masks (whether INT or CR) are
-read *before* the operations proceed.  In practice (for CR Fields)
-this creates an unnecessary block on parallelism.  Therefore, it is up
-to the programmer to ensure that the CR fields used as Predicate Masks
-are not being written to by any parallel Vector Loop.  Doing so results
+Note that it cannot necessarily be assumed that Predicate Masks
+(whether INT or CR) are read in full *before* the operations proceed.  In practice (for CR Fields)
+this creates an unnecessary block on parallelism, prohibiting
+"Vector Chaining".  Therefore, it is up
+to the programmer to ensure that the CR field Elements used as Predicate Masks
+are not overwritten by any parallel Vector Loop.  Doing so results
  in **UNDEFINED** behaviour, according to the definition outlined in the
  Power ISA v3.0B Specification.
  
@@ -727,6 +860,14 @@ of individual CR fields until the actual predicated element operation
  needs to take place, safe in the knowledge that no programmer will have
  issued a Vector Instruction where previous elements could have overwritten
  (destroyed) not-yet-executed CR-Predicated element operations.
+This particularly is an issue when using REMAP, as the order in
+which CR-Field-based Predicate Mask bits could be read on a per-element
+execution basis could well conflict with the order in which prior
+elements wrote to the very same CR Field.
+
+Additionally Programmers should avoid using r3 r10 or r30
+as destination registers when these are also used as a Predicate
+Mask. Doing so is again UNDEFINED behaviour.
  
  ### Integer Predication (MASKMODE=0)
  
@@ -750,6 +891,7 @@ following meaning:
  r10 and r30 are at the high end of temporary and unused registers,
  so as not to interfere with register allocation from ABIs.
  
+
  ### CR-based Predication (MASKMODE=1)
  
  When the predicate mode bit is one the 3 bits are interpreted as below.