-[[!tag standards]]
-
-# SPRs <a name="sprs"></a>
-
-Note OpenPOWER v3.1 p12:
-
- The designated SPR sandbox consists of non-privileged SPRs
- 704-719 and privileged SPRs 720-735.
-
-There are eight SPRs, available in any privilege level:
-
-* SVSTATE (containing copies of MVL, VL and SUBVL as well as context information)
-* SVSRR0 which is used for exceptions and traps to store SVSTATE.
-* SVLR, a mirror of LR, used by Vectorised Branch
-* SVSHAPE0-3 for REMAP purposes, re-shaping Vector loops
-* SVREMAP for applying specific shapes to specific registers
-
-If SVSTATE is all zeros then SV is disabled and the contents of the
-other SPRs SVSHAPE/SVREMAP are ignored.
-
-For Privilege Levels (trap handling) there are the following SPRs,
-where x may be u, s or h for User, Supervisor or Hypervisor
-Modes respectively:
-
-* (x)eSTATE (useful for saving and restoring during context switch,
- and for providing fast transitions)
-
-The u/s SPRs are treated and handled exactly like their (x)epc
-equivalents. On entry to or exit from a privilege level, the contents
-of its (x)eSTATE are swapped with SVSTATE.
-
-# SVSTATE
-
-This is a standard SPR that (REMAP aside) contains sufficient information for a
-full context save/restore (see SVSRR0). It contains (and permits setting of):
-
-* MVL (the Maximum Vector Length) - declares (statically) how
- much of a regfile is to be reserved for Vector elements
-* VL - Vector Length
-* dststep - the destination element offset of the current parallel
- instruction being executed
-* srcstep - for twin-predication, the source element offset as well.
-* SUBVL
-* substep - the subvector element offset of the current
- parallel instruction being executed
-* vfirst - Vertical First mode. srcstep, dststep and substep
- **do not advance** unless explicitly requested to do so with
- pseudo-op svstep (a mode of setvl)
-* RMpst - REMAP persistence. REMAP will apply only to the following
- instruction unless this bit is set, in which case REMAP "persists".
- Reset (cleared) on use of the `setvl` instruction if used to
- alter VL or MVL.
-* hphint - Horizontal Parallelism Hint. In Vertical First Mode
- hardware **MAY** perform up to this many elements in parallel
- per instruction. Set to zero to indicate "no hint".
-* SVme - REMAP enable bits, indicating which register is to be
- REMAPed. RA, RB, RC, RT or EA.
-* mi0-mi4 - when the corresponding SVme bit is enabled, mi0-mi4
- indicate the SVSHAPE (0-3) that the corresponding register (RA etc)
- should use.
-
-**MAXVECTORLENGTH (MVL)** <a name="mvl" />
-
-MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
-is variable length and may be dynamically set. MVL is
-however limited to the regfile bitwidth, 64.
+# SPRs <a name="sprs"></a>
-**Vector Length (VL)** <a name="vl" />
+The full list of SPRs for Simple-V is:
-VSETVL is slightly different from RVV. Similar to RVV, VL is set to be within
-the range 0 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN)
+| SPR | Width | Description |
+|---------------|---------|---------------------------------|
+| **SVSTATE** | 64-bit | Zero-Overhead Loop Architectural State |
+| **SVLR** | 64-bit | SVSTATE equivalent of LR-to-PC |
+| **SVSHAPE0** | 32-bit | REMAP Shape 0 |
+| **SVSHAPE1** | 32-bit | REMAP Shape 1 |
+| **SVSHAPE2** | 32-bit | REMAP Shape 2 |
+| **SVSHAPE3** | 32-bit | REMAP Shape 3 |
- VL = rd = MIN(vlen, MVL)
+Future versions of Simple-V will have at least 7 more SVSTATE SPRs, in a small
+"stack", as part of a full Zero-Overhead Loop Control subsystem.
-where 1 <= MVL <= XLEN
-
-**SUBVL - Sub Vector Length**
-
-This is a "group by quantity" that effectively asks each iteration
-of the hardware loop to load SUBVL elements of width elwidth at a
-time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
-operation issued, SUBVL operations are issued.
-
-The main effect of SUBVL is that predication bits are applied per
-**group**, rather than by individual element. Legal values are 1 to 4.
-Illegal values raise an exception.
-
-For hphint, the number chosen must be consistently
-executed **every time**. Hardware is not permitted to execute five
-computations for one instruction then three on the next.
-hphint is a hint from the compiler to hardware that up to this
-many elements may be safely executed in parallel.
-Interestingly, when hphint is set equal to VL, it is in effect
-as if Vertical First mode were not set, because the hardware is
-given the option to run through all elements in an instruction.
-This is exactly what Horizontal-First is: a for-loop from 0 to VL-1
-except that the hardware may *choose* the number of elements.
-
-*Note to programmers: changing VL during the middle of such modes
-should be done only with due care and respect for the fact that SVSTATE
-has exactly the same peer-level status as a Program Counter.*
+## SVSTATE SPR
The format of the SVSTATE SPR is as follows:
| 7:13 | vl | Vector Length |
| 14:20 | srcstep | for srcstep = 0..VL-1 |
| 21:27 | dststep | for dststep = 0..VL-1 |
-| 28:29 | subvl | Sub-vector length |
-| 30:31 | substep | for substep = 0..SUBVL-1 |
-| 32:33 | mi0 | REMAP RA SVSHAPE0-3 |
-| 34:35 | mi1 | REMAP RB SVSHAPE0-3 |
-| 36:37 | mi2 | REMAP RC SVSHAPE0-3 |
-| 38:39 | mo0 | REMAP RT SVSHAPE0-3 |
-| 40:41 | mo1 | REMAP EA SVSHAPE0-3 |
+| 28:29 | dsubstep | for substep = 0..SUBVL-1 |
+| 30:31 | ssubstep | for substep = 0..SUBVL-1 |
+| 32:33 | mi0 | REMAP RA/FRA/BFA SVSHAPE0-3 |
+| 34:35 | mi1 | REMAP RB/FRB/BFB SVSHAPE0-3 |
+| 36:37 | mi2 | REMAP RC/FRC SVSHAPE0-3 |
+| 38:39 | mo0 | REMAP RT/FRT/BF SVSHAPE0-3 |
+| 40:41 | mo1 | REMAP EA/RS/FRS SVSHAPE0-3 |
| 42:46 | SVme | REMAP enable (RA-RT) |
-| 47:61 | rsvd | reserved |
+| 47:52 | rsvd | reserved |
+| 53 | pack | PACK (srcstep reorder) |
+| 54 | unpack | UNPACK (dststep order) |
+| 55:61 | hphint | Horizontal Hint |
| 62 | RMpst | REMAP persistence |
| 63 | vfirst | Vertical First mode |
-The relationship between SUBVL and the subvl field is:
-
-| SUBVL | (29..28) |
-| ----- | -------- |
-| 1 | 0b00 |
-| 2 | 0b01 |
-| 3 | 0b10 |
-| 4 | 0b11 |
-
Notes:
* The entries are truncated to be within range. Attempts to set VL to
* Setting srcstep, dststep to 64 or greater, or VL or MVL to greater
than 64 is reserved and will cause an illegal instruction trap.
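
As an illustration of the field layout in the table above, the following is a minimal sketch of extracting and updating SVSTATE fields from a 64-bit integer. It assumes Power ISA MSB0 bit numbering (bit 0 is the most significant bit). The helper names `get_field` and `set_field` are hypothetical, and the `maxvl` position at bits 0:6 is an assumption, since that row is not shown in the table excerpt here.

```python
# Sketch only: SVSTATE field positions in MSB0 numbering (bit 0 = MSB of 64 bits).
SVSTATE_FIELDS = {
    "maxvl":    (0, 6),    # ASSUMPTION: row not visible in the table excerpt
    "vl":       (7, 13),
    "srcstep":  (14, 20),
    "dststep":  (21, 27),
    "dsubstep": (28, 29),
    "ssubstep": (30, 31),
    "mi0":      (32, 33),
    "mi1":      (34, 35),
    "mi2":      (36, 37),
    "mo0":      (38, 39),
    "mo1":      (40, 41),
    "SVme":     (42, 46),
    "pack":     (53, 53),
    "unpack":   (54, 54),
    "hphint":   (55, 61),
    "RMpst":    (62, 62),
    "vfirst":   (63, 63),
}


def _shift_and_mask(name):
    """Return (right-shift amount, unshifted mask) for a named field."""
    start, end = SVSTATE_FIELDS[name]
    width = end - start + 1
    return 63 - end, (1 << width) - 1


def get_field(svstate, name):
    """Extract a named field from a 64-bit SVSTATE value."""
    shift, mask = _shift_and_mask(name)
    return (svstate >> shift) & mask


def set_field(svstate, name, value):
    """Return a copy of svstate with the named field replaced by value."""
    shift, mask = _shift_and_mask(name)
    if value & ~mask:
        raise ValueError("value does not fit in field %s" % name)
    return (svstate & ~(mask << shift)) | (value << shift)
```

For example, `set_field(0, "vfirst", 1)` sets only the least significant bit of the 64-bit value, because bit 63 is the LSB under MSB0 numbering.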
-# SVSRR0
-
-In scalar v3.0B traps, exceptions and interrupts, two SRRs are saved/restored:
+**SVSTATE Fields**
-* SRR0 to store the PC (CIA/NIA)
-* SRR1 to store a copy of the MSR
+SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient
+self-contained information for a full context save/restore.
+SVSTATE contains (and permits setting of):
-Given that SVSTATE is effectively a Sub-PC it is critically important to add saving/restoring of SVSTATE as a full peer equal in status to PC, in every way. At any time PC is saved or restored, so is SVSTATE in **exactly** the same way for **exactly** the same reasons. Thus, at an exception point,
-hardware **must** save/restore SVSTATE in SVSRR0 at exactly the same
-time that SRR0 is saved/restored in PC and SRR1 in MSR.
-
-The SPR name given for the purposes of saving/restoring
-SVSTATE is SVSRR0.
-
-# SVLR
+* MVL (the Maximum Vector Length) - declares (statically) how
+ much of a regfile is to be reserved for Vector elements
+* VL - Vector Length
+* dststep - the destination element offset of the current parallel
+ instruction being executed
+* srcstep - for twin-predication, the source element offset as well.
+* ssubstep - the source subvector element offset of the current
+ parallel instruction being executed
+* dsubstep - the destination subvector element offset of the current
+ parallel instruction being executed
+* vfirst - Vertical First mode. srcstep, dststep, ssubstep and dsubstep
+ **do not advance** unless explicitly requested to do so with svstep
+* RMpst - REMAP persistence. REMAP will apply only to the following
+ instruction unless this bit is set, in which case REMAP "persists".
+ Reset (cleared) on use of the `setvl` instruction if used to
+ alter VL or MVL.
+* Pack - if set then srcstep/ssubstep VL/SUBVL loop-ordering is inverted.
+* UnPack - if set then dststep/dsubstep VL/SUBVL loop-ordering is inverted.
+* hphint - Horizontal Parallelism Hint. Indicates that
+ no Hazards exist between groups of elements in sequential multiples of this number
+ (before REMAP). By definition: elements for which `FLOOR(step/hphint)` is
+ equal *before REMAP* are in the same parallelism "group", for both
+ `srcstep` and `dststep`. In Vertical First Mode
+ hardware **MUST** respect Strict Program Order but is permitted to
+ merge multiple scalar loops into parallel batches, if Reservation Station resources
+ are sufficient. Set to zero to indicate "no hint".
+* SVme - REMAP enable bits, indicating which register is to be
+ REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names
+ associated with each bit, with RA being the LSB and EA being the MSB.
+ See table below for ordering. When `SVme` is zero (0b00000) REMAP
+ is **fully disabled and inactive** regardless of the contents of
+ `SVSTATE`, `mi0-mi2/mo0-mo1`, or the four `SVSHAPEn` SPRs
+* mi0-mi2/mo0-mo1 - these
+ indicate the SVSHAPE (0-3) that the corresponding register (RA etc)
+ should use, as long as the register's corresponding SVme bit is set
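
The effect of the Pack and UnPack bits described in the list above can be sketched as a change in loop nesting. This is an illustration only: it assumes that the default ordering has the VL loop outermost and the SUBVL loop innermost, with "inverted" meaning the two loops swap places. The function name `step_order` is hypothetical.

```python
def step_order(vl, subvl, inverted=False):
    """List (step, substep) pairs in the order the loop would visit them.

    ASSUMPTION: default order is VL outer, SUBVL inner; setting Pack
    (source side) or UnPack (destination side) swaps the nesting.
    """
    if not inverted:
        return [(step, sub) for step in range(vl) for sub in range(subvl)]
    return [(step, sub) for sub in range(subvl) for step in range(vl)]
```

With `vl=2, subvl=2`, the default order visits `(0,0), (0,1), (1,0), (1,1)`, whereas the inverted order visits `(0,0), (1,0), (0,1), (1,1)`.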
+
+Programmer's Note: the fact that REMAP is entirely dormant when `SVme` is zero
+allows establishment of REMAP context well in advance, followed by utilising `svremap`
+at a precise (or the very last) moment. Some implementations may exploit this
+to cache (or take some time to prepare caches) in the background whilst other
+(unrelated) instructions are being executed. This is particularly important to
+bear in mind when using `svindex` which will require hardware to perform (and
+cache) additional GPR reads.
+
+Programmer's Note: when REMAP is activated it becomes necessary on any
+context-switch (Interrupt or Function call) to detect (or know in advance)
+that REMAP is enabled and to additionally explicitly save/restore the four SVSHAPE
+SPRs, SVSHAPE0-3. Given that this is expected to be a rare occurrence it was
+deemed unreasonable to burden every context-switch or function call with
+mandatory save/restore of SVSHAPEs, and consequently it is a *callee*
+(and Trap Handler) responsibility. Callees (and Trap Handlers) **MUST**
+avoid using all and any SVP64 instructions during the period where state
+could be adversely affected. SVP64 purely relies on Scalar instructions,
+so Scalar instructions (except the SVP64 Management ones and mtspr and
+mfspr) are 100% guaranteed to have zero impact on SVP64 state.
+
+**SVme REMAP area**
+
+Each bit of `SVSTATE.SVme` indicates whether the SVSHAPE (0-3) is active and to which register
+the REMAP applies. The application goes by *assembler operand names* on a per-mnemonic
+basis. Some instructions may have `RT` as a source and as a destination: REMAP applies
+**separately** to each use in this case. For Load/Store with Update, the Effective
+Address (stored in EA) may also be REMAPed separately from RA as a source operand.
+
+| bit|applies|register applied|
+|----|-------|----------------|
+| 46 | mi0 | source RA / FRA / BA / BFA / RT / FRT |
+| 45 | mi1 | source RB / FRB / BB|
+| 44 | mi2 | source RC / FRC / BC|
+| 43 | mo0 | result RT / FRT / BT / BF|
+| 42 | mo1 | result Effective Address (RA) / FRS / RS|
+
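
A minimal sketch of how the `SVme` bits and the `mi0-mi2/mo0-mo1` selectors in the table above combine. It takes `SVme` as an already-extracted 5-bit integer whose LSB corresponds to bit 46 (mi0, RA) and whose MSB corresponds to bit 42 (mo1, EA). The function name `active_remaps` and the use of only the canonical register names are assumptions made for illustration.

```python
# (selector field, canonical register name), ordered from SVme LSB to MSB.
SVME_SLOTS = [
    ("mi0", "RA"),  # SVSTATE bit 46
    ("mi1", "RB"),  # SVSTATE bit 45
    ("mi2", "RC"),  # SVSTATE bit 44
    ("mo0", "RT"),  # SVSTATE bit 43
    ("mo1", "EA"),  # SVSTATE bit 42
]


def active_remaps(svme, selectors):
    """Map each REMAP-enabled register to the SVSHAPE SPR it uses.

    svme:      5-bit integer, LSB = RA enable, MSB = EA enable.
    selectors: dict of 2-bit values keyed by "mi0".."mi2", "mo0", "mo1".
    Returns an empty dict when svme is zero (REMAP fully inactive).
    """
    result = {}
    for bit, (field, reg) in enumerate(SVME_SLOTS):
        if (svme >> bit) & 1:
            result[reg] = "SVSHAPE%d" % (selectors[field] & 0b11)
    return result
```

Selector values for registers whose `SVme` bit is clear are simply ignored, which matches the note that REMAP context may be established in advance while `SVme` is zero.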
+**Max Vector Length (maxvl)** <a name="mvl" />
+
+MAXVECTORLENGTH is a static (immediate-operand only) compile-time declaration
+of the maximum number of elements in a Vector. MVL is limited to 7 bits
+(in the first version of SVP64) and consequently the maximum number of
+elements is limited to between 0 and 127.
+
+MAXVL is normally (in other True-Scalable Vector ISAs) an Architecturally-defined
+quantity related indirectly to the total available number of bits in the Vector
+Register File. Cray Vectors had a Hardware-Architectural set limit of MAXVL=64.
+RISC-V RVV has MAXVL defined in terms of a Silicon-Partner-selectable fixed number
+of bits. MAXVL in Simple-V is set in terms of the number of *elements* and
+may change at runtime.
+
+Programmer's Note: Except by directly using `mtspr` on SVSTATE, which may
+result in performance penalties on some hardware implementations, SVSTATE's `maxvl`
+field may only be set **statically** as an immediate, by the `setvl` instruction.
+It may **NOT** be set dynamically from a register. Compiler writers and assembly
+programmers are expected to perform static register file analysis, subdivision,
+and allocation and only utilise `setvl`. Direct writing to SVSTATE in order to
+"bypass" this Note could, in less-advanced implementations, potentially cause stalling,
+particularly if SVP64 instructions are issued directly after the `mtspr` to SVSTATE.
+
+**Vector Length (vl)** <a name="vl" />
+
+`SVSTATE.vl`, the actual Vector Length (the number of elements in a "Vector"), may be set
+entirely dynamically at runtime from a number of sources. `setvl` is the primary
+instruction for setting Vector Length.
+`setvl` is conceptually similar to, but different from, the Cray, SX Aurora, and RISC-V RVV
+equivalents. Similar to RVV, VL is set to be within
+the range 0 <= VL <= MVL. Unlike RVV, VL is set **exactly** according to the following:
+
+```
+ VL = (RT|0) = MIN(vlen, MVL)
+```
+
+where `0 <= MVL <= 127`, and vlen may come from an immediate, `RA`, or from the `CTR` SPR,
+depending on options selected with the `setvl` instruction.
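
The `MIN(vlen, MVL)` rule above can be illustrated with a conventional strip-mining loop, in which a long array is processed in chunks of at most MVL elements. This is a sketch under assumptions: `compute_vl` and `strip_mine` are hypothetical names, and the model ignores every other effect of `setvl`.

```python
MVL_LIMIT = 127  # 7-bit MVL field in the first version of SVP64


def compute_vl(vlen, mvl):
    """Model VL = MIN(vlen, MVL) for a requested length vlen."""
    if not 0 <= mvl <= MVL_LIMIT:
        raise ValueError("MVL out of range")
    if vlen < 0:
        raise ValueError("vlen must be non-negative")
    return min(vlen, mvl)


def strip_mine(n, mvl):
    """Return the VL used on each pass when processing n elements."""
    if n > 0 and mvl == 0:
        raise ValueError("MVL of zero can never make progress")
    chunks = []
    remaining = n
    while remaining > 0:
        vl = compute_vl(remaining, mvl)  # what setvl would place in VL
        chunks.append(vl)
        remaining -= vl
    return chunks
```

For instance, processing 10 elements with MVL=4 takes three passes with VL equal to 4, 4 and 2.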
+
+Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope
+of the Power ISA Technical Reference. Guidance on the 50-year-old Cray Vector paradigm is
+best sought elsewhere: good studies include Academic Courses given on the 1970s
+Cray Supercomputers over at least the past three decades.
+
+**Horizontal Parallelism**
+
+A problem exists for hardware where it may not be able to detect
+that a programmer (or compiler) knows of opportunities for parallelism
+and lack of overlap between loops, despite these being easy for a compiler
+to statically detect and potentially express.
+`hphint` is such an expression, declaring that elements within a batch are
+independent of each other (no Register *or Memory* Hazards).
+
+Elements are considered to be in the same source batch if they have
+the same value of `FLOOR(srcstep/hphint)`. Likewise in the same destination batch
+for the same value `FLOOR(dststep/hphint)`.
+Four key observations here:
+
+1. predication is **not** involved here. The number of actual elements
+involved is considered *before* predicate masks are applied.
+2. twin predication can result in srcstep and dststep being in different
+batches
+3. batch evaluation is done *before* REMAP, making Hazard elimination easier
+ for Multi-Issue systems.
+4. `hphint` is *not* limited to power-of-two. Hardware implementors may choose
+ a lower parallelism hint up to `hphint` and may find power-of-two more
+ convenient.
+
+Regarding (4): if a smaller hint is chosen by hardware, actual parallelism
+(Dependency Hazard relaxation) must **never**
+exceed `hphint` and must still respect the batch boundaries, even if this results
+in just one element being considered Hazard-independent. Even under these
+circumstances Multi-Issue Register-renaming is possible, to introduce parallelism
+by a different route.
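
The batch rule above, `FLOOR(step/hphint)`, can be sketched as follows. The sketch works on pre-REMAP, pre-predication step values, treats `hphint=0` as "no independence declared", and uses hypothetical function names.

```python
def batch_index(step, hphint):
    """Return FLOOR(step/hphint), or None when hphint is 0 (no hint)."""
    if hphint == 0:
        return None
    return step // hphint


def declared_independent(step_a, step_b, hphint):
    """True if two distinct elements fall in the same hphint batch."""
    if hphint == 0 or step_a == step_b:
        return False
    return batch_index(step_a, hphint) == batch_index(step_b, hphint)


def batches(vl, hphint):
    """Group element steps 0..vl-1 into their hphint batches."""
    if hphint == 0:
        return [[step] for step in range(vl)]
    return [list(range(start, min(start + hphint, vl)))
            for start in range(0, vl, hphint)]
```

With `vl=7` and `hphint=3` the batches are `[0,1,2]`, `[3,4,5]` and `[6]`: elements 2 and 3 are adjacent but lie in different batches, so the Hazard between them must still be respected.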
+
+*Hardware Architect note: each element within the same group may be treated as
+100% independent from any other element within that group, and therefore
+no inter-element Register Hazards or Memory Hazards exist within a group,
+but crucially inter-group Hazards definitely remain. This makes
+implementation far easier on resources because the Hazard Dependencies are
+effectively at a much coarser granularity than a single register.
+With element-width overrides extending down to the byte level, reducing Dependency
+Hazard hardware complexity becomes even more important.*
+
+`hphint` may legitimately be set greater than `MAXVL`. This indicates to Multi-Issue
+hardware that even though MAXVL is relatively small the batches are *still independent*
+and therefore if Multi-Issue hardware chooses to allocate several batches up to
+`MAXVL` in size they are still independent, even if Register-renaming is deployed.
+This helps greatly simplify Multi-Issue systems by significantly reducing Hazards.
+
+**Considerable care** must be taken when setting `hphint`. Matrix Outer Product
+could produce corrupted results if `hphint` is set to greater than the innermost
+loop depth. Parallel Reduction, DCT and FFT REMAP are all similarly critically affected
+by `hphint`: used correctly it greatly increases ease of parallelism, but
+used incorrectly it will result in data corruption. Reduction/Iteration
+also requires care to correctly declare in `hphint` how many elements are
+independent. In the case of most Reduction use-cases the answer is almost certainly
+"none".
+
+`hphint` must never be set on Atomic Memory operations, Cache-Inhibited
+Memory operations, or Load-Reservation Store-Conditional. Also if Load-with-Update
+Data-Dependent Fail-First is ever used for linked-list pointer-chasing, `hphint`
+should again definitely be disabled. Failure to do so results in `UNDEFINED`
+behaviour.
+
+`hphint` may only be ignored by Hardware Implementors as long as full element-level
+Register and Memory Hazards are implemented *in full* (including right down to individual
+bytes of each register for when elwidth=8/16/32). In other words if `hphint` is to
+be ignored then implementations must consider the situation as if `hphint=0`.
+
+**Horizontal Parallelism in Vertical-First Mode**
+
+Setting `hphint` with Vertical-First is perfectly legitimate. Under these circumstances
+single-element strict Program Execution Order must be preserved at all times, but
+should there be a small enough program loop, then Out-of-Order Hardware may
+take the opportunity to *merge*
+consecutive element-based instructions into the *same Reservation Stations*, for
+multiple operations to be passed to massive-wide back-end SIMD ALUs or Vector-Chaining ALUs.
+**Only** elements within the same `hphint` group (across multiple such looped instructions)
+may be treated as mergeable in this fashion.
+
+Note that if the loop of Vertical-First instructions cannot fit entirely into Reservation
+Stations then Hardware clearly cannot exploit the above optimisation opportunity, but at
+least there is no harm done: the loop is still correctly executed as Scalar instructions.
+Programmers do need to be aware though that short loops on some Hardware Implementations
+can be made considerably faster than on other Implementations.
+
+## SVLR
SV Link Register, exactly analogous to LR (Link Register) may
be used for temporary storage of SVSTATE, and, in particular,
SVLR and SVSTATE whenever LR and NIA are.
Note that there is no equivalent Link variant of SVREMAP or
-SVSHAPE0-3, so SVLR has limited applicability
+SVSHAPE0-3 (it would be too costly), so SVLR has limited applicability:
+REMAP SPRs must be saved and restored explicitly.
+
+-----------
+
+[[!tag standards]]
+