# SPRs <a name="sprs"></a>

The full list of SPRs for Simple-V is:

* **SVSTATE** 64-bit
* **SVLR** 64-bit
* **SVSHAPE0** 32-bit
* **SVSHAPE1** 32-bit
* **SVSHAPE2** 32-bit
* **SVSHAPE3** 32-bit

Future versions of Simple-V will have at least 7 more SVSTATE SPRs, in a small
"stack", as part of a full Zero-Overhead Loop Control subsystem.

## SVSTATE SPR

The format of the SVSTATE SPR is as follows:

| Field | Name     | Description                 |
| ----- | -------- | --------------------------- |
| 0:6   | maxvl    | Max Vector Length           |
| 7:13  | vl       | Vector Length               |
| 14:20 | srcstep  | for srcstep = 0..VL-1       |
| 21:27 | dststep  | for dststep = 0..VL-1       |
| 28:29 | dsubstep | for substep = 0..SUBVL-1    |
| 30:31 | ssubstep | for substep = 0..SUBVL-1    |
| 32:33 | mi0      | REMAP RA/FRA/BFA SVSHAPE0-3 |
| 34:35 | mi1      | REMAP RB/FRB/BFB SVSHAPE0-3 |
| 36:37 | mi2      | REMAP RC/FRT SVSHAPE0-3     |
| 38:39 | mo0      | REMAP RT/FRT/BF SVSHAPE0-3  |
| 40:41 | mo1      | REMAP EA/RS/FRS SVSHAPE0-3  |
| 42:46 | SVme     | REMAP enable (RA-RT)        |
| 47:52 | rsvd     | reserved                    |
| 53    | pack     | PACK (srcstep reorder)      |
| 54    | unpack   | UNPACK (dststep order)      |
| 55:61 | hphint   | Horizontal Hint             |
| 62    | RMpst    | REMAP persistence           |
| 63    | vfirst   | Vertical First mode         |

Notes:

* The entries are truncated to be within range. Attempts to set VL to
  greater than MAXVL will truncate VL.
* Setting srcstep or dststep to 64 or greater, or VL or MVL to greater
  than 64, is reserved and will cause an illegal instruction trap.

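As an illustration of the field layout in the table, the following Python sketch extracts and updates individual SVSTATE fields. It assumes the table uses Power ISA MSB0 bit numbering (bit 0 is the most significant bit of the 64-bit SPR). The names `SVSTATE_FIELDS`, `get_field` and `set_field` are hypothetical and are not taken from any specification or simulator. The truncation and trap behaviour described in the Notes is deliberately not modelled.

```python
# Field layout copied from the table: name -> (first bit, last bit).
# ASSUMPTION: MSB0 numbering, i.e. bit 0 is the most significant of 64 bits.
SVSTATE_FIELDS = {
    "maxvl": (0, 6), "vl": (7, 13),
    "srcstep": (14, 20), "dststep": (21, 27),
    "dsubstep": (28, 29), "ssubstep": (30, 31),
    "mi0": (32, 33), "mi1": (34, 35), "mi2": (36, 37),
    "mo0": (38, 39), "mo1": (40, 41),
    "SVme": (42, 46),
    "pack": (53, 53), "unpack": (54, 54),
    "hphint": (55, 61), "RMpst": (62, 62), "vfirst": (63, 63),
}


def _shift_and_width(name):
    """Convert an MSB0 (first, last) bit range into an LSB0 shift and a width."""
    first, last = SVSTATE_FIELDS[name]
    return 63 - last, last - first + 1


def get_field(svstate, name):
    """Extract one named field from a 64-bit SVSTATE value."""
    shift, width = _shift_and_width(name)
    return (svstate >> shift) & ((1 << width) - 1)


def set_field(svstate, name, value):
    """Return a copy of svstate with one named field replaced by value."""
    shift, width = _shift_and_width(name)
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name} does not fit in {width} bits")
    mask = ((1 << width) - 1) << shift
    return (svstate & ~mask) | (value << shift)


# Hypothetical usage: MAXVL=8, VL=5, Vertical-First mode enabled.
state = set_field(0, "maxvl", 8)
state = set_field(state, "vl", 5)
state = set_field(state, "vfirst", 1)
```

Keeping the layout in a single dictionary means the bit arithmetic is written once, so a change to the table only needs a change to the dictionary.
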
**SVSTATE Fields**

SVSTATE is a standard SPR that (if REMAP is not activated) contains sufficient
self-contained information for a full context save/restore.
SVSTATE contains (and permits setting of):

* MVL (the Maximum Vector Length) - declares (statically) how
  much of a regfile is to be reserved for Vector elements
* VL - Vector Length
* dststep - the destination element offset of the current parallel
  instruction being executed
* srcstep - for twin-predication, the source element offset as well.
* ssubstep - the source subvector element offset of the current
  parallel instruction being executed
* dsubstep - the destination subvector element offset of the current
  parallel instruction being executed
* vfirst - Vertical First mode. srcstep, dststep and substep
  **do not advance** unless explicitly requested to do so with svstep
* RMpst - REMAP persistence. REMAP will apply only to the following
  instruction unless this bit is set, in which case REMAP "persists".
  Reset (cleared) on use of the `setvl` instruction if used to
  alter VL or MVL.
* Pack - if set then srcstep/ssubstep VL/SUBVL loop-ordering is inverted.
* UnPack - if set then dststep/dsubstep VL/SUBVL loop-ordering is inverted.
* hphint - Horizontal Parallelism Hint. Indicates that
  no Hazards exist between groups of elements in sequential multiples of this number
  (before REMAP). By definition: elements for which `FLOOR(step/hphint)` is
  equal *before REMAP* are in the same parallelism "group", for both
  `srcstep` and `dststep`. In Vertical First Mode
  hardware **MUST** respect Strict Program Order but is permitted to
  merge multiple scalar loops into parallel batches, if Reservation Station resources
  are sufficient. Set to zero to indicate "no hint".
* SVme - REMAP enable bits, indicating which register is to be
  REMAPed: RA, RB, RC, RT and EA are the canonical (typical) register names
  associated with each bit, with RA being the LSB and EA being the MSB.
  See table below for ordering. When `SVme` is zero (0b00000) REMAP
  is **fully disabled and inactive** regardless of the contents of
  `SVSTATE`, `mi0-mi2/mo0-mo1`, or the four `SVSHAPEn` SPRs
* mi0-mi2/mo0-mo1 - these
  indicate the SVSHAPE (0-3) that the corresponding register (RA etc)
  should use, as long as the register's corresponding SVme bit is set

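The interaction between `SVme` and the `mi0-mi2/mo0-mo1` selectors can be sketched in Python as follows. This is only an illustration under stated assumptions: "LSB" is taken to mean bit 0 of the 5-bit `SVme` value, the pairing of selectors to registers follows the SVSTATE format table (mi0 with RA, mi1 with RB, mi2 with RC, mo0 with RT, mo1 with EA), and `REMAP_SLOTS` and `active_remaps` are hypothetical names.

```python
# Canonical register names in SVme bit order, LSB first, each paired with
# the SVSTATE selector field that chooses its SVSHAPE (0-3).
# ASSUMPTION: "LSB" means bit 0 of the 5-bit SVme value.
REMAP_SLOTS = [
    ("RA", "mi0"),
    ("RB", "mi1"),
    ("RC", "mi2"),
    ("RT", "mo0"),
    ("EA", "mo1"),
]


def active_remaps(svme, selectors):
    """Return {register: SVSHAPE index} for registers whose SVme bit is set.

    selectors maps "mi0".."mo1" to 2-bit values. An SVme of zero yields an
    empty result whatever the selectors contain: REMAP is fully inactive.
    """
    active = {}
    for bit, (register, field) in enumerate(REMAP_SLOTS):
        if (svme >> bit) & 1:
            active[register] = selectors[field] & 0b11
    return active


# Hypothetical selector values, prepared well before REMAP is enabled.
selectors = {"mi0": 2, "mi1": 0, "mi2": 1, "mo0": 3, "mo1": 0}
dormant = active_remaps(0b00000, selectors)
enabled = active_remaps(0b01001, selectors)  # RA and RT bits set
```

With these values `dormant` is empty, while `enabled` maps RA to SVSHAPE2 and RT to SVSHAPE3; the selectors for RB, RC and EA are ignored because their enable bits are clear.
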
Programmer's Note: the fact that REMAP is entirely dormant when `SVme` is zero
allows establishment of REMAP context well in advance, followed by utilising `svremap`
at a precise (or the very last) moment. Some implementations may exploit this
to cache (or take some time to prepare caches) in the background whilst other
(unrelated) instructions are being executed. This is particularly important to
bear in mind when using `svindex` which will require hardware to perform (and
cache) additional GPR reads.

Programmer's Note: when REMAP is activated it becomes necessary on any
context-switch (Interrupt or Function call) to detect (or know in advance)
that REMAP is enabled and to additionally explicitly save/restore the four SVSHAPE
SPRs, SVSHAPE0-3. Given that this is expected to be a rare occurrence it was
deemed unreasonable to burden every context-switch or function call with
mandatory save/restore of SVSHAPEs, and consequently it is a *callee*
(and Trap Handler) responsibility. Callees (and Trap Handlers) **MUST**
avoid using any and all SVP64 instructions during the period where state
could be adversely affected. SVP64 purely relies on Scalar instructions,
so Scalar instructions (except the SVP64 Management ones and mtspr and
mfspr) are 100% guaranteed to have zero impact on SVP64 state.

**Max Vector Length (maxvl)** <a name="mvl" />

MAXVECTORLENGTH is a static (immediate-operand only) compile-time declaration
of the maximum number of elements in a Vector. MVL is limited to 7 bits
(in the first version of SVP64) and consequently the maximum number of
elements is limited to between 0 and 127.

MAXVL is normally (in other True-Scalable Vector ISAs) an Architecturally-defined
quantity related indirectly to the total available number of bits in the Vector
Register File. Cray Vectors had a Hardware-Architectural set limit of MAXVL=64.
RISC-V RVV has MAXVL defined in terms of a Silicon-Partner-selectable fixed number
of bits. MAXVL in Simple-V is set in terms of the number of *elements* and
may change at runtime.

Programmer's Note: Except by directly using `mtspr` on SVSTATE, which may
result in performance penalties on some hardware implementations, SVSTATE's `maxvl`
field may only be set **statically** as an immediate, by the `setvl` instruction.
It may **NOT** be set dynamically from a register. Compiler writers and assembly
programmers are expected to perform static register file analysis, subdivision,
and allocation and only utilise `setvl`. Direct writing to SVSTATE in order to
"bypass" this Note could, in less-advanced implementations, potentially cause stalling,
particularly if SVP64 instructions are issued directly after the `mtspr` to SVSTATE.

**Vector Length (vl)** <a name="vl" />

The actual Vector length, the number of elements in a "Vector", `SVSTATE.vl`, may be set
entirely dynamically at runtime from a number of sources. `setvl` is the primary
instruction for setting Vector Length.
`setvl` is conceptually similar to, but different from, the Cray, SX Aurora, and RISC-V RVV
equivalents. Similar to RVV, VL is set to be within
the range 0 <= VL <= MVL. Unlike RVV, VL is set **exactly** according to the following:

```
VL = (RT|0) = MIN(vlen, MVL)
```

where `0 <= MVL <= 127`, and vlen may come from an immediate, `RA`, or from the `CTR` SPR,
depending on options selected with the `setvl` instruction.

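The `MIN(vlen, MVL)` relationship can be restated as a minimal Python sketch. The function name `compute_vl` is hypothetical. Only the clamping is modelled: the choice of source for `vlen`, the optional write to `RT`, and the reserved-value trap behaviour noted earlier are all left out.

```python
def compute_vl(vlen, mvl):
    """Return VL = MIN(vlen, MVL) for a requested length and a static MVL.

    ASSUMPTION: vlen is treated as an unsigned quantity, whichever source
    (immediate, RA or CTR) it was taken from.
    """
    if not 0 <= mvl <= 127:
        raise ValueError("MVL must fit in 7 bits (0..127)")
    if vlen < 0:
        raise ValueError("vlen is treated as unsigned in this sketch")
    return min(vlen, mvl)


# Hypothetical strip-mining of a 20-element array with MVL=8: each pass
# asks for everything that remains and is given at most MVL elements.
remaining = 20
chunks = []
while remaining > 0:
    vl = compute_vl(remaining, 8)
    chunks.append(vl)
    remaining -= vl
```

Because VL is set exactly to the minimum, the loop processes full `MVL`-sized passes followed by a single shorter tail: here 8, 8 and then 4 elements.
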
Programmer's Note: conceptual understanding of Cray-style Vectors is far beyond the scope
of the Power ISA Technical Reference. Guidance on the 50-year-old Cray Vector paradigm is
best sought elsewhere: good studies include Academic Courses given on the 1970s
Cray Supercomputers over at least the past three decades.

**Horizontal Parallelism**

A problem exists for hardware where it may not be able to detect
that a programmer (or compiler) knows of opportunities for parallelism
and lack of overlap between loops, despite these being easy for a compiler
to statically detect and potentially express.
`hphint` is such an expression, declaring that elements within a batch are
independent of each other (no Register *or Memory* Hazards).

Elements are considered to be in the same source batch if they have
the same value of `FLOOR(srcstep/hphint)`. Likewise in the same destination batch
for the same value `FLOOR(dststep/hphint)`.
Four key observations here:

1. Predication is **not** involved here. The number of actual elements
   involved is considered *before* predicate masks are applied.
2. Twin predication can result in srcstep and dststep being in different
   batches.
3. Batch evaluation is done *before* REMAP, making Hazard elimination easier
   for Multi-Issue systems.
4. `hphint` is *not* limited to power-of-two. Hardware implementors may choose
   a lower parallelism hint up to `hphint` and may find power-of-two more
   convenient.

Regarding (4): if a smaller hint is chosen by hardware, actual parallelism
(Dependency Hazard relaxation) must **never**
exceed `hphint` and must still respect the batch boundaries, even if this results
in just one element being considered Hazard-independent. Even under these
circumstances Multi-Issue Register-renaming is possible, to introduce parallelism
by a different route.

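The `FLOOR(step/hphint)` batching rule can be made concrete with a short Python sketch. The function names are hypothetical. One modelling choice is an assumption rather than something the text mandates: `hphint=0` ("no hint") is represented by giving every element its own single-element batch, so that no independence is declared.

```python
def batch_index(step, hphint):
    """Return FLOOR(step / hphint), or None when hphint is zero (no hint)."""
    if hphint == 0:
        return None
    return step // hphint


def declared_independent(step_a, step_b, hphint):
    """True if the hint places both pre-REMAP steps in the same batch."""
    index_a = batch_index(step_a, hphint)
    return index_a is not None and index_a == batch_index(step_b, hphint)


def batches(vl, hphint):
    """Group the pre-REMAP element steps 0..VL-1 into hphint-sized batches.

    ASSUMPTION: with hphint == 0 each element forms its own batch.
    Predicate masks play no part: every step is counted.
    """
    if hphint == 0:
        return [[step] for step in range(vl)]
    return [list(range(start, min(start + hphint, vl)))
            for start in range(0, vl, hphint)]


# A non-power-of-two hint: VL=8 with hphint=3 gives batches of 3, 3 and 2.
example = batches(8, 3)
```

In this example steps 0 and 2 share a batch and are declared Hazard-free with respect to each other, whereas steps 2 and 3 straddle a batch boundary, so normal Hazard tracking still applies between them.
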
*Hardware Architect note: each element within the same group may be treated as
100% independent from any other element within that group, and therefore
neither Register Hazards nor Memory Hazards exist inter-element,
but crucially inter-group Hazards definitely remain. This makes
implementation far easier on resources because the Hazard Dependencies are
effectively at a much coarser granularity than a single register.
With element-width overrides extending down to the byte level, reducing Dependency
Hazard hardware complexity becomes even more important.*

`hphint` may legitimately be set greater than `MAXVL`. This indicates to Multi-Issue
hardware that even though MAXVL is relatively small the batches are *still independent*
and therefore if Multi-Issue hardware chooses to allocate several batches up to
`MAXVL` in size they are still independent, even if Register-renaming is deployed.
This helps greatly simplify Multi-Issue systems by significantly reducing Hazards.

**Considerable care** must be taken when setting `hphint`. Matrix Outer Product
could produce corrupted results if `hphint` is set to greater than the innermost
loop depth. Parallel Reduction, DCT and FFT REMAP are all similarly critically affected
by `hphint`: used correctly it greatly increases ease of parallelism, but
used incorrectly it will result in data corruption. Reduction/Iteration
also requires care to correctly declare in `hphint` how many elements are
independent. In the case of most Reduction use-cases the answer is almost certainly
"none".

`hphint` must never be set on Atomic Memory operations, Cache-Inhibited
Memory operations, or Load-Reservation Store-Conditional. Also if Load-with-Update
Data-Dependent Fail-First is ever used for linked-list pointer-chasing, `hphint`
should again definitely be disabled. Failure to do so results in `UNDEFINED`
behaviour.

`hphint` may only be ignored by Hardware Implementors as long as full element-level
Register and Memory Hazards are implemented *in full* (including right down to individual
bytes of each register for when elwidth=8/16/32). In other words if `hphint` is to
be ignored then implementations must consider the situation as if `hphint=0`.

**Horizontal Parallelism in Vertical-First Mode**

Setting `hphint` with Vertical-First is perfectly legitimate. Under these circumstances
single-element strict Program Execution Order must be preserved at all times, but
should there be a small enough program loop, then Out-of-Order Hardware may
take the opportunity to *merge*
consecutive element-based instructions into the *same Reservation Stations*, for
multiple operations to be passed to massive-wide back-end SIMD ALUs or Vector-Chaining ALUs.
**Only** elements within the same `hphint` group (across multiple such looped instructions)
may be treated as mergeable in this fashion.

Note that if the loop of Vertical-First instructions cannot fit entirely into Reservation
Stations then Hardware clearly cannot exploit the above optimisation opportunity, but at
least there is no harm done: the loop is still correctly executed as Scalar instructions.
Programmers do need to be aware though that short loops on some Hardware Implementations
can be made considerably faster than on other Implementations.

## SVLR

SV Link Register, exactly analogous to LR (Link Register), may
be used for temporary storage of SVSTATE, and, in particular,
Vectorised Branch-Conditional instructions may interchange
SVLR and SVSTATE whenever LR and NIA are.

Note that there is no equivalent Link variant of SVREMAP or
SVSHAPE0-3 (it would be too costly), so SVLR has limited applicability:
REMAP SPRs must be saved and restored explicitly.

-----------

[[!tag standards]]