openpower/sv/sprs.mdwn

   1 # SPRs  <a name="sprs"></a>
   2
   3 The full list of SPRs for Simple-V is:
   4
   5 | SPR           | Width   | Description   |
   6 |---------------|---------|---------------------------------|
   7 | **SVSTATE**   | 64-bit  | Zero-Overhead Loop Architectural State   |
   8 | **SVLR**      | 64-bit  | SVSTATE equivalent of LR-to-PC  |
   9 | **SVSHAPE0**  | 32-bit  |  REMAP Shape 0   |
  10 | **SVSHAPE1**  | 32-bit  |  REMAP Shape 1   |
  11 | **SVSHAPE2**  | 32-bit  |  REMAP Shape 2   |
  12 | **SVSHAPE3**  | 32-bit  |  REMAP Shape 3   |
  13
  14 Future versions of Simple-V will have at least 7 more SVSTATE SPRs, in a small
  15 "stack", as part of a full Zero-Overhead Loop Control subsystem.
  16
  17 ## SVSTATE SPR
  18
  19 The format of the SVSTATE SPR is as follows:
  20
  21 | Field | Name     | Description           |
  22 | ----- | -------- | --------------------- |
  23 | 0:6   | maxvl    | Max Vector Length     |
  24 | 7:13  |    vl    | Vector Length         |
  25 | 14:20 | srcstep  | for srcstep = 0..VL-1 |
  26 | 21:27 | dststep  | for dststep = 0..VL-1 |
  27 | 28:29 | dsubstep | for substep = 0..SUBVL-1  |
  28 | 30:31 | ssubstep | for substep = 0..SUBVL-1  |
  29 | 32:33 | mi0      | REMAP RA/FRA/BFA SVSHAPE0-3    |
  30 | 34:35 | mi1      | REMAP RB/FRB/BFB SVSHAPE0-3    |
  31 | 36:37 | mi2      | REMAP RC/FRT SVSHAPE0-3    |
  32 | 38:39 | mo0      | REMAP RT/FRT/BF SVSHAPE0-3    |
  33 | 40:41 | mo1      | REMAP EA/RS/FRS SVSHAPE0-3    |
  34 | 42:46 | SVme     | REMAP enable (RA-RT)  |
  35 | 47:52 | rsvd     | reserved              |
  36 | 53    | pack     | PACK (srcstep reorder)  |
  37 | 54    | unpack   | UNPACK (dststep order)  |
  38 | 55:61 | hphint   | Horizontal Hint       |
  39 | 62    | RMpst    | REMAP persistence     |
  40 | 63    | vfirst   | Vertical First mode   |
  41
  42 Notes:
  43
  44 * The entries are truncated to be within range.  Attempts to set VL to
  45   greater than MAXVL will truncate VL.
  46 * Setting srcstep, dststep to 64 or greater, or VL or MVL to greater
  47   than 64 is reserved and will cause an illegal instruction trap.
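
As an illustration of the field layout in the table above, the following is a minimal Python sketch of packing and unpacking SVSTATE fields. It assumes MSB0 bit numbering (bit 0 is the most significant bit of the 64-bit SPR), as is conventional in the Power ISA. The helper names `get_field` and `set_field` are hypothetical and are not part of any Simple-V tooling.

```python
# Field layout copied from the SVSTATE table: name -> (first bit, last bit).
# ASSUMPTION: MSB0 numbering, i.e. bit 0 is the most significant bit of the
# 64-bit SPR and bit 63 is the least significant bit.
SVSTATE_FIELDS = {
    "maxvl": (0, 6),
    "vl": (7, 13),
    "srcstep": (14, 20),
    "dststep": (21, 27),
    "dsubstep": (28, 29),
    "ssubstep": (30, 31),
    "mi0": (32, 33),
    "mi1": (34, 35),
    "mi2": (36, 37),
    "mo0": (38, 39),
    "mo1": (40, 41),
    "SVme": (42, 46),
    "rsvd": (47, 52),
    "pack": (53, 53),
    "unpack": (54, 54),
    "hphint": (55, 61),
    "RMpst": (62, 62),
    "vfirst": (63, 63),
}

SPR_WIDTH = 64


def _shift_and_mask(name):
    """Return (shift, mask) for a field, converting MSB0 bit numbers to a shift."""
    start, end = SVSTATE_FIELDS[name]
    width = end - start + 1
    shift = SPR_WIDTH - 1 - end
    return shift, (1 << width) - 1


def get_field(svstate, name):
    """Extract one named field from a 64-bit SVSTATE value (hypothetical helper)."""
    shift, mask = _shift_and_mask(name)
    return (svstate >> shift) & mask


def set_field(svstate, name, value):
    """Return a copy of svstate with one field replaced (value is masked to fit)."""
    shift, mask = _shift_and_mask(name)
    return (svstate & ~(mask << shift)) | ((value & mask) << shift)
```

For example, `set_field(0, "vfirst", 1)` sets only the least significant bit of the SPR, and `set_field(0, "maxvl", 8)` places the value in the top seven bits.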
  48
  49 **SVSTATE Fields**
  50
  51 SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient
  52 self-contained information for a full context save/restore.
  53 SVSTATE contains (and permits setting of):
  54
  55 * MVL (the Maximum Vector Length) - declares (statically) how
  56   much of a regfile is to be reserved for Vector elements
  57 * VL - Vector Length
  58 * dststep - the destination element offset of the current parallel
  59   instruction being executed
  60 * srcstep - for twin-predication, the source element offset as well.
  61 * ssubstep - the source subvector element offset of the current
  62   parallel instruction being executed
  63 * dsubstep - the destination subvector element offset of the current
  64   parallel instruction being executed
  65 * vfirst - Vertical First mode.  srcstep, dststep and substep
  66     **do not advance** unless explicitly requested to do so with svstep
  67 * RMpst - REMAP persistence.  REMAP will apply only to the following
  68   instruction unless this bit is set, in which case REMAP "persists".
  69   Reset (cleared) on use of the `setvl` instruction if used to
  70   alter VL or MVL.
  71 * Pack - if set then srcstep/ssubstep VL/SUBVL loop-ordering is inverted.
  72 * UnPack - if set then dststep/dsubstep VL/SUBVL loop-ordering is inverted.
  73 * hphint - Horizontal Parallelism Hint. Indicates that
  74   no Hazards exist between groups of elements in sequential multiples of this number
  75    (before REMAP).  By definition: elements for which `FLOOR(step/hphint)` is
  76    equal *before REMAP* are in the same parallelism "group", for both
  77    `srcstep` and `dststep`. In Vertical First Mode
  78    hardware **MUST** respect Strict Program Order but is permitted to
  79    merge multiple scalar loops into parallel batches, if Reservation Station resources
  80    are sufficient.  Set to zero to indicate "no hint".
  81 * SVme - REMAP enable bits, indicating which register is to be
  82    REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names
  83    associated with each bit, with RA being the LSB and EA being the MSB.
  84    See table below for ordering. When `SVme` is zero (0b00000) REMAP
  85    is **fully disabled and inactive** regardless of the contents of
  86   `SVSTATE`, `mi0-mi2/mo0-mo1`, or the four `SVSHAPEn` SPRs.
  87 * mi0-mi2/mo0-mo1 - these
  88   indicate the SVSHAPE (0-3) that the corresponding register (RA etc)
  89   should use, as long as the register's corresponding SVme bit is set
  90
  91 Programmer's Note: the fact that REMAP is entirely dormant when `SVme` is zero
  92 allows establishment of REMAP context well in advance, followed by utilising `svremap`
  93 at a precise (or the very last) moment.  Some implementations may exploit this
  94 to cache (or take some time to prepare caches) in the background whilst other
  95 (unrelated) instructions are being executed. This is particularly important to
  96 bear in mind when using `svindex` which will require hardware to perform (and
  97 cache) additional GPR reads.
  98
  99 Programmer's Note: when REMAP is activated it becomes necessary on any
 100 context-switch (Interrupt or Function call) to detect (or know in advance)
 101 that REMAP is enabled and to additionally explicitly save/restore the four SVSHAPE
 102 SPRs, SVSHAPE0-3.  Given that this is expected to be a rare occurrence it was
 103 deemed unreasonable to burden every context-switch or function call with
 104 mandatory save/restore of SVSHAPEs, and consequently it is a *callee*
 105 (and Trap Handler) responsibility.  Callees (and Trap Handlers) **MUST**
 106 avoid using any and all SVP64 instructions during the period where state
 107 could be adversely affected.  SVP64 purely relies on Scalar instructions,
 108 so Scalar instructions (except the SVP64 Management ones and mtspr and
 109 mfspr) are 100% guaranteed to have zero impact on SVP64 state.
 110
 111 **SVme REMAP area**
 112
 113 Each bit of `SVSTATE.SVme` indicates whether the SVSHAPE (0-3) is active and to which register
 114 the REMAP applies.  The application goes by *assembler operand names* on a per-mnemonic
 115 basis.  Some instructions may have `RT` as a source and as a destination: REMAP applies
 116 **separately** to each use in this case.  For Load/Store with Update the Effective
 117 Address (stored in EA) may also be REMAPed separately from RA as a source operand.
 118
 119 | bit|applies|register applied|
 120 |----|-------|----------------|
 121 | 46 | mi0 | source RA / FRA / BA / BFA / RT / FRT |
 122 | 45 | mi1 | source RB / FRB / BB|
 123 | 44 | mi2 | source RC / FRC / BC|
 124 | 43 | mo0 | result RT / FRT / BT / BF|
 125 | 42 | mo1 | result Effective Address (RA) / FRS / RS|
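
The relationship between the `SVme` enable bits and the `mi0-mi2/mo0-mo1` selectors can be sketched as follows. This is an illustrative Python model only: the function name `active_remaps` and the dictionary representation of the selector fields are assumptions made for the example, while the bit-to-selector mapping is taken from the table above (SVSTATE bit 46, the LSB of `SVme`, corresponds to `mi0`).

```python
# (bit index within the 5-bit SVme value, selector field, operands covered).
# Index 0 is the LSB of SVme (SVSTATE bit 46), index 4 the MSB (SVSTATE bit 42).
SVME_BITS = [
    (0, "mi0", "source RA / FRA / BA / BFA / RT / FRT"),
    (1, "mi1", "source RB / FRB / BB"),
    (2, "mi2", "source RC / FRC / BC"),
    (3, "mo0", "result RT / FRT / BT / BF"),
    (4, "mo1", "result Effective Address (RA) / FRS / RS"),
]


def active_remaps(svme, selectors):
    """Map each enabled selector name to the SVSHAPE number (0-3) it picks.

    svme:      the 5-bit SVme value
    selectors: dict giving the 2-bit value of mi0, mi1, mi2, mo0 and mo1
    An empty result means REMAP is entirely inactive (SVme == 0).
    """
    result = {}
    for bit_index, name, _operands in SVME_BITS:
        if (svme >> bit_index) & 1:
            result[name] = selectors[name] & 0b11
    return result
```

With `svme = 0b00101`, only `mi0` and `mi2` are consulted; the remaining selector values are ignored no matter what they contain.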
 126
 127 **Max Vector Length (maxvl)** <a name="mvl" />
 128
 129 MAXVECTORLENGTH is a static (immediate-operand only) compile-time declaration
 130 of the maximum number of elements in a Vector. MVL is limited to 7 bits
 131 (in the first version of SVP64) and consequently the maximum number of
 132 elements is limited to between 0 and 127.
 133
 134 MAXVL is normally (in other True-Scalable Vector ISAs) an Architecturally-defined
 135 quantity related indirectly to the total available number of bits in the Vector
 136 Register File.  Cray Vectors had a Hardware-Architectural set limit of MAXVL=64.
 137 RISC-V RVV has MAXVL defined in terms of a Silicon-Partner-selectable fixed number
 138 of bits.  MAXVL in Simple-V is set in terms of the number of *elements* and
 139 may change at runtime.
 140
 141 Programmer's Note: Except by directly using `mtspr` on SVSTATE, which may
 142 result in performance penalties on some hardware implementations, SVSTATE's `maxvl`
 143 field may only be set **statically** as an immediate, by the `setvl` instruction.
 144 It may **NOT** be set dynamically from a register.  Compiler writers and assembly
 145 programmers are expected to perform static register file analysis, subdivision,
 146 and allocation and only utilise `setvl`. Direct writing to SVSTATE in order to
 147 "bypass" this Note could, in less-advanced implementations, potentially cause stalling,
 148 particularly if SVP64 instructions are issued directly after the `mtspr` to SVSTATE.
 149
 150 **Vector Length (vl)** <a name="vl" />
 151
 152 The actual Vector length, the number of elements in a "Vector", `SVSTATE.vl` may be set
 153 entirely dynamically at runtime from a number of sources. `setvl` is the primary
 154 instruction for setting Vector Length.
 155 `setvl` is conceptually similar to, but different from, the Cray, SX Aurora, and RISC-V RVV
 156 equivalents. Similar to RVV, VL is set to be within
 157 the range 0 <= VL <= MVL. Unlike RVV, VL is set **exactly** according to the following:
 158
 159 ```
 160     VL = (RT|0) = MIN(vlen, MVL)
 161 ```
 162
 163 where `0 <= MVL <= 127`, and vlen may come from an immediate, `RA`, or from the `CTR` SPR,
 164 depending on options selected with the `setvl` instruction.
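
A minimal Python sketch of this rule, together with the strip-mining loop it enables, is given below. The names `compute_vl` and `strip_mine` are hypothetical, `vlen` is assumed to be a non-negative integer, and the sketch models only the `MIN(vlen, MVL)` calculation, not the other `setvl` options.

```python
def compute_vl(vlen, mvl):
    """Model VL = MIN(vlen, MVL), with 0 <= MVL <= 127 as stated above."""
    if not 0 <= mvl <= 127:
        raise ValueError("MVL must be in the range 0..127")
    if vlen < 0:
        raise ValueError("vlen is assumed to be non-negative")
    return min(vlen, mvl)


def strip_mine(total_elements, mvl):
    """Return the VL used on each pass of a loop over total_elements items.

    Each pass requests all remaining elements and receives exactly
    MIN(remaining, MVL), so the final pass handles the leftover tail.
    """
    if mvl == 0 and total_elements > 0:
        raise ValueError("MVL of zero can never make progress")
    vls = []
    remaining = total_elements
    while remaining > 0:
        vl = compute_vl(remaining, mvl)
        vls.append(vl)
        remaining -= vl
    return vls
```

Because VL is set exactly, `strip_mine(10, 4)` yields passes of 4, 4 and 2 elements, with no separate clean-up loop needed for the tail.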
 165
 166 Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope
 167 of the Power ISA Technical Reference.  Guidance on the 50-year-old Cray Vector paradigm is
 168 best sought elsewhere: good studies include Academic Courses given on the 1970s
 169 Cray Supercomputers over at least the past three decades.
 170
 171 **Horizontal Parallelism**
 172
 173 A problem exists for hardware where it may not be able to detect
 174 that a programmer (or compiler) knows of opportunities for parallelism
 175 and lack of overlap between loops, despite these being easy for a compiler
 176 to statically detect and potentially express.
 177 `hphint` is such an expression, declaring that elements within a batch are
 178 independent of each other (no Register *or Memory* Hazards).
 179
 180 Elements are considered to be in the same source batch if they have
 181 the same value of `FLOOR(srcstep/hphint)`. Likewise in the same destination batch
 182 for the same value `FLOOR(dststep/hphint)`.
 183 Four key observations here:
 184
 185 1. predication is **not** involved here.  The number of actual elements
 186 involved is considered *before* predicate masks are applied.
 187 2. twin predication can result in srcstep and dststep being in different
 188 batches
 189 3. batch evaluation is done *before* REMAP, making Hazard elimination easier
 190    for Multi-Issue systems.
 191 4. `hphint` is *not* limited to power-of-two. Hardware implementors may choose
 192    a lower parallelism hint up to `hphint` and may find power-of-two more
 193    convenient.
 194
 195 Regarding (4): if a smaller hint is chosen by hardware, actual parallelism
 196 (Dependency Hazard relaxation) must **never**
 197 exceed `hphint` and must still respect the batch boundaries, even if this results
 198 in just one element being considered Hazard-independent.  Even under these
 199 circumstances Multi-Issue Register-renaming is possible, to introduce parallelism
 200 by a different route.
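
The batch rule and observation (4) can be illustrated with a short Python sketch. The function names and the `hw_width` parameter (the smaller parallelism width a hardware implementation might choose) are hypothetical, and `hphint=0` is modelled, as an assumption, by placing every element in its own batch, since "no hint" declares no independence at all.

```python
def batches(vl, hphint):
    """Group element steps 0..vl-1 by FLOOR(step / hphint), before REMAP.

    ASSUMPTION: hphint == 0 ("no hint") is modelled as one element per batch.
    """
    if hphint == 0:
        return [[step] for step in range(vl)]
    groups = {}
    for step in range(vl):
        groups.setdefault(step // hphint, []).append(step)
    return [groups[key] for key in sorted(groups)]


def hardware_groups(vl, hphint, hw_width):
    """Split each batch into chunks of at most hw_width elements.

    Chunks never cross a batch boundary, so the parallelism actually
    exploited never exceeds hphint even when hw_width is larger.
    """
    if hw_width < 1:
        raise ValueError("hw_width must be at least 1")
    result = []
    for batch in batches(vl, hphint):
        for i in range(0, len(batch), hw_width):
            result.append(batch[i:i + hw_width])
    return result
```

With `vl=10`, `hphint=5` and a hardware width of 4, the groups are `[0,1,2,3]`, `[4]`, `[5,6,7,8]` and `[9]`: respecting the batch boundary leaves elements 4 and 9 on their own, which is the single-element case described above.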
 201
 202 *Hardware Architect note: each element within the same group may be treated as
 203 100% independent from any other element within that group, and therefore
 204 neither Register Hazards nor Memory Hazards exist between those elements,
 205 but crucially inter-group Hazards definitely remain.  This makes
 206 implementation far easier on resources because the Hazard Dependencies are
 207 effectively at a much coarser granularity than a single register.
 208 With element-width overrides extending down to the byte level reducing Dependency
 209 Hazard hardware complexity becomes even more important.*
 210
 211 `hphint` may legitimately be set greater than `MAXVL`. This indicates to Multi-Issue
 212 hardware that even though MAXVL is relatively small the batches are *still independent*
 213 and therefore if Multi-Issue hardware chooses to allocate several batches up to
 214 `MAXVL` in size they are still independent, even if Register-renaming is deployed.
 215 This helps greatly simplify Multi-Issue systems by significantly reducing Hazards.
 216
 217 **Considerable care** must be taken when setting `hphint`. Matrix Outer Product
 218 could produce corrupted results if `hphint` is set to greater than the innermost
 219 loop depth. Parallel Reduction, DCT and FFT REMAP are all similarly critically affected
 220 by `hphint`: used correctly it greatly increases ease of parallelism, but
 221 used incorrectly it will also result in data corruption.  Reduction/Iteration
 222 also requires care to correctly declare in `hphint` how many elements are
 223 independent. In the case of most Reduction use-cases the answer is almost certainly
 224 "none".
 225
 226 `hphint` must never be set on Atomic Memory operations, Cache-Inhibited
 227 Memory operations, or Load-Reservation Store-Conditional. Also if Load-with-Update
 228 Data-Dependent Fail-First is ever used for linked-list pointer-chasing, `hphint`
 229 should again definitely be disabled. Failure to do so results in `UNDEFINED`
 230 behaviour.
 231
 232 `hphint` may only be ignored by Hardware Implementors as long as full element-level
 233 Register and Memory Hazards are implemented *in full* (including right down to individual
 234 bytes of each register for when elwidth=8/16/32). In other words if `hphint` is to
 235 be ignored then implementations must consider the situation as if `hphint=0`.
 236
 237 **Horizontal Parallelism in Vertical-First Mode**
 238
 239 Setting `hphint` with Vertical-First is perfectly legitimate.  Under these circumstances
 240 single-element strict Program Execution Order must be preserved at all times, but
 241 should there be a small enough program loop, then Out-of-Order Hardware may
 242 take the opportunity to *merge*
 243 consecutive element-based instructions into the *same Reservation Stations*, for
 244 multiple operations to be passed to massive-wide back-end SIMD ALUs or Vector-Chaining ALUs.
 245 **Only** elements within the same `hphint` group (across multiple such looped instructions)
 246 may be treated as mergeable in this fashion.
 247
 248 Note that if the loop of Vertical-First instructions cannot fit entirely into Reservation
 249 Stations then Hardware clearly cannot exploit the above optimisation opportunity, but at
 250 least there is no harm done: the loop is still correctly executed as Scalar instructions.
 251 Programmers do need to be aware though that short loops on some Hardware Implementations
 252 can be made considerably faster than on other Implementations.
 253
 254 ## SVLR
 255
 256 SV Link Register, exactly analogous to LR (Link Register), may
 257 be used for temporary storage of SVSTATE, and, in particular,
 258 Vectorized Branch-Conditional instructions may interchange
 259 SVLR and SVSTATE whenever LR and NIA are.
 260
 261 Note that there is no equivalent Link variant of SVREMAP or
 262 SVSHAPE0-3 (it would be too costly), so SVLR has limited applicability:
 263 REMAP SPRs must be saved and restored explicitly.
 264
 265 -----------
 266
 267 [[!tag standards]]
 268