# Rewrite of SVP64 for OpenPower ISA v3.1

* [[svp64/discussion]]

The plan is to create an encoding for SVP64, then to create an encoding for
SVP48, then to reorganize them both to improve field overlap, reducing the
amount of decoder hardware necessary.

All bit numbers are in MSB0 form (the bits are numbered from 0 at the MSB and
counting up as you move to the LSB end). All bit ranges are inclusive (so
`4:6` means bits 4, 5, and 6).

64-bit instructions are split into two 32-bit words, the prefix and the suffix. The prefix always comes before the suffix in PC order.

SVP64 is designed so that when the prefix is all zeros, no augmentation occurs: all standard OpenPOWER v3.0B instructions behave in full as normal, and SV is quiescent. The corollary is that when the SV prefix is nonzero, alternative meanings may be given to any and all instructions.

# Definition of Reserved in this spec.

For the new fields added in SVP64, instructions that have any of their fields set to a reserved value must cause an illegal instruction trap, to allow emulation of future instruction sets.

This is unlike OpenPower ISA v3.1, which doesn't require a CPU to trap.

# Remapped Encoding (`RM[0:23]`)

To allow relatively easy remapping of which portions of the Prefix Opcode Map
are used for SVP64 without needing to rewrite a large portion of the SVP64
spec, a mapping is defined from the OpenPower v3.1 prefix bits to a new 24-bit
Remapped Encoding denoted `RM[0]` at the MSB to `RM[23]` at the LSB.

The mapping from the OpenPower v3.1 prefix bits to the Remapped Encoding is
defined in the Prefix Fields section.

## Prefix Opcode Map (64-bit instruction encoding) (prefix bits 6:11)

(shows both PowerISA v3.1 instructions as well as new SVP instructions; empty spaces are yet-to-be-allocated Illegal Instructions)

| bits 6:11 | ---000   | ---001     | ---010   | ---011   | ---100   | ---101   | ---110   | ---111   |
|-----------|----------|------------|----------|----------|----------|----------|----------|----------|
| 000---    | 8LS-form | 8LS-form   | 8LS-form | 8LS-form | 8LS-form | 8LS-form | 8LS-form | 8LS-form |
| 001---    |          |            |          |          |          |          |          |          |
| 010---    | 8RR-form |            |          |          | SVP64    | SVP64    | SVP64    | SVP64    |
| 011---    |          |            |          |          | SVP64    | SVP64    | SVP64    | SVP64    |
| 100---    | MLS-form | MLS-form   | MLS-form | MLS-form | MLS-form | MLS-form | MLS-form | MLS-form |
| 101---    |          |            |          |          |          |          |          |          |
| 110---    | MRR-form |            |          |          | SVP64    | SVP64    | SVP64    | SVP64    |
| 111---    |          | MMIRR-form |          |          | SVP64    | SVP64    | SVP64    | SVP64    |

## Prefix Fields

| Prefix Field Name   | Field   | Value | Description                                |
|---------------------|---------|-------|--------------------------------------------|
| PO (Primary Opcode) | `0:5`   | `1`   | Indicates this is Prefixed 64-bit          |
| `RM[0]`             | `6`     |       | Bit 0 of the Remapped Encoding             |
| SVP64_7             | `7`     | `1`   | Indicates this is SVP64                    |
| `RM[1]`             | `8`     |       | Bit 1 of the Remapped Encoding             |
| SVP64_9             | `9`     | `1`   | Indicates this is SVP64                    |
| `RM[2:23]`          | `10:31` |       | Bits 2-23 of the Remapped Encoding         |
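
The table above can be sketched in Python as follows. This is an illustration only, assuming the 32-bit prefix word is held as a plain integer; the helper names `msb0_bits` and `svp64_rm` are hypothetical and not part of the spec.

```python
def msb0_bits(word, start, end, width=32):
    """Return bits start..end (inclusive, MSB0 numbering) of a width-bit word."""
    nbits = end - start + 1
    shift = width - 1 - end               # distance of the field's last bit from the LSB
    return (word >> shift) & ((1 << nbits) - 1)


def svp64_rm(prefix):
    """Return the 24-bit Remapped Encoding, or None if this is not an SVP64 prefix."""
    if msb0_bits(prefix, 0, 5) != 1:      # PO must be 1 (prefixed 64-bit)
        return None
    if msb0_bits(prefix, 7, 7) != 1 or msb0_bits(prefix, 9, 9) != 1:
        return None                       # SVP64_7 and SVP64_9 must both be 1
    rm0 = msb0_bits(prefix, 6, 6)         # RM[0]
    rm1 = msb0_bits(prefix, 8, 8)         # RM[1]
    rm2_23 = msb0_bits(prefix, 10, 31)    # RM[2:23], 22 bits
    return (rm0 << 23) | (rm1 << 22) | rm2_23
```

Because `RM[0]` is the MSB of the returned value, a prefix with only prefix bit 6 set (besides the fixed bits) yields `1 << 23`.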

# Remapped Encoding Fields

Shows all fields in the Remapped Encoding `RM[0:23]` for all instruction variants. There are two categories: Single and Twin Predication. Due to space considerations, Single Predication is further subdivided according to whether the number of src operands is 2 or 3.

* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand instructions (fmadd, isel, madd).
* `RM-1P-2S1D` Single Predication dest/src1/2, applies to 3-operand instructions (src1 src2 dest)
* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update

## RM-1P-3S1D

| Field Name   | Field bits | Description                                |
|--------------|------------|--------------------------------------------|
| MASK_KIND    | `0`        | Execution Mask Kind                        |
| MASK         | `1:3`      | Execution Mask                             |
| ELWIDTH      | `4:5`      | Element Width                              |
| SUBVL        | `6:7`      | Sub-vector length                          |
| Rdest_EXTRA2 | `8:9`      | extra bits for Rdest (R\*_EXTRA2 Encoding) |
| Rsrc1_EXTRA2 | `10:11`    | extra bits for Rsrc1 (R\*_EXTRA2 Encoding) |
| Rsrc2_EXTRA2 | `12:13`    | extra bits for Rsrc2 (R\*_EXTRA2 Encoding) |
| Rsrc3_EXTRA2 | `14:15`    | extra bits for Rsrc3 (R\*_EXTRA2 Encoding) |
| reserved     | `16`       | reserved                                   |
| MODE         | `19:23`    | see [[discussion]]                         |

## RM-1P-2S1D

| Field Name   | Field bits | Description                                     |
|--------------|------------|-------------------------------------------------|
| MASK_KIND    | `0`        | Execution Mask Kind                             |
| MASK         | `1:3`      | Execution Mask                                  |
| ELWIDTH      | `4:5`      | Element Width                                   |
| SUBVL        | `6:7`      | Sub-vector length                               |
| Rdest_EXTRA3 | `8:10`     | extra bits for Rdest (Uses R\*_EXTRA3 Encoding) |
| Rsrc1_EXTRA3 | `11:13`    | extra bits for Rsrc1 (Uses R\*_EXTRA3 Encoding) |
| Rsrc2_EXTRA3 | `14:16`    | extra bits for Rsrc2 (Uses R\*_EXTRA3 Encoding) |
| MODE         | `19:23`    | see [[discussion]]                              |

These are for 2 operand 1 dest instructions, such as `add RT, RA, RB`. However, also included are unusual instructions with the same src and dest, such as `rlwimi`.

Normally, the scalar v3.0B ISA would not have sufficient bits to allow an alternative destination. With SV, however, this becomes possible. Therefore, the fact that the dest is implicitly also a src should not mislead: they are different SV regs.

* `rlwimi RA, RS, ...`
    * Rsrc1_EXTRA3 applies to RS as the first src
    * Rsrc2_EXTRA3 applies to RA as the second src
    * Rdest_EXTRA3 applies to RA to create an **independent** dest.

Otherwise the normal SV hardware for-loop applies. Each of the three registers may independently be made vector or scalar, and may independently be augmented to 7 bits in length.

## RM-2P-1S1D

| Field Name   | Field bits | Description                |
|--------------|------------|----------------------------|
| MASK_KIND    | `0`        | Execution Mask Kind        |
| MASK         | `1:3`      | Execution Mask             |
| ELWIDTH      | `4:5`      | Element Width              |
| SUBVL        | `6:7`      | Sub-vector length          |
| Rdest_EXTRA3 | `8:10`     | extra bits for Rdest       |
| Rsrc1_EXTRA3 | `11:13`    | extra bits for Rsrc1       |
| MASK_SRC     | `14:16`    | Execution Mask for Source  |
| ELWIDTH_SRC  | `17:18`    | Element Width for Source   |
| MODE         | `19:23`    | see [[discussion]]         |
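
As an illustration of how the MSB0 field ranges in this table translate into shifts and masks, here is a small Python sketch. The layout dictionary is copied from the table above; the names `RM_2P_1S1D` and `split_rm` are hypothetical helpers, not part of the spec.

```python
# Field layout copied from the RM-2P-1S1D table: name -> (first bit, last bit), MSB0.
RM_2P_1S1D = {
    "MASK_KIND":    (0, 0),
    "MASK":         (1, 3),
    "ELWIDTH":      (4, 5),
    "SUBVL":        (6, 7),
    "Rdest_EXTRA3": (8, 10),
    "Rsrc1_EXTRA3": (11, 13),
    "MASK_SRC":     (14, 16),
    "ELWIDTH_SRC":  (17, 18),
    "MODE":         (19, 23),
}


def split_rm(rm, layout, width=24):
    """Split a 24-bit RM value into named fields (MSB0, inclusive ranges)."""
    fields = {}
    for name, (start, end) in layout.items():
        nbits = end - start + 1
        shift = width - 1 - end           # RM[23] is the LSB of the integer
        fields[name] = (rm >> shift) & ((1 << nbits) - 1)
    return fields
```

The same function would work for the other RM variants, given a layout dictionary built from the corresponding table.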

Note in [[discussion]]: TODO, evaluate whether a 2nd SUBVL should be added. Conclusion: no. A 2nd SUBVL makes no sense except for mv, and that is covered by [[mv.vec]].

## RM-2P-2S1D/1S2D

The primary purpose of this encoding is Twin Predication on LOAD and STORE operations. See [[sv/ldst]] for detailed analysis.

RM-2P-2S1D:

| Field Name   | Field bits | Description                                |
|--------------|------------|--------------------------------------------|
| MASK_KIND    | `0`        | Execution Mask Kind                        |
| MASK         | `1:3`      | Execution Mask                             |
| ELWIDTH      | `4:5`      | Element Width                              |
| SUBVL        | `6:7`      | Sub-vector length                          |
| Rdest_EXTRA2 | `8:9`      | extra bits for Rdest (R\*_EXTRA2 Encoding) |
| Rsrc1_EXTRA2 | `10:11`    | extra bits for Rsrc1 (R\*_EXTRA2 Encoding) |
| Rsrc2_EXTRA2 | `12:13`    | extra bits for Rsrc2 (R\*_EXTRA2 Encoding) |
| MASK_SRC     | `14:16`    | Execution Mask for Source                  |
| ELWIDTH_SRC  | `17:18`    | Element Width for Source                   |
| MODE         | `19:23`    | see [[discussion]]                         |

Note that for 1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2 is in bits 8:9, Rdest1_EXTRA2 in 10:11).

Note also that LD with update indexed, which takes 2 src and 2 dest (e.g. `lhaux RT,RA,RB`), does not have room for 4 registers as well as Twin Predication. Therefore these are treated as RM-2P-2S1D, and the src spec for RA is also used for the same RA as a dest.

## R\*_EXTRA2 and R\*_EXTRA3 Encoding

(**TODO: 2-bit version of the table, just like in the original SVPrefix. This is important, to save bits on 4-operand instructions such as fmadd**)

In the following tables, `RA` denotes the value of the corresponding register field in the SVP64 suffix word.

3 bit version

An alternative which is understandable and which, if EXTRA3 is zero, maps to "no effect" (scalar OpenPOWER ISA field naming). These are also the encodings used in the original SV Prefix scheme. They were chosen so that scalar registers in v3.0B and prefixed scalar registers have access to the same 32 registers.

| R\*_EXTRA3 | Mode   | Range      | Encoded as |
|------------|--------|------------|------------|
| 000        | Scalar | `r0-r31`   | `0b00 RA`  |
| 001        | Scalar | `r32-r63`  | `0b01 RA`  |
| 010        | Scalar | `r64-r95`  | `0b10 RA`  |
| 011        | Scalar | `r96-r127` | `0b11 RA`  |
| 100        | Vector | `r0-r124`  | `RA 0b00`  |
| 101        | Vector | `r1-r125`  | `RA 0b01`  |
| 110        | Vector | `r2-r126`  | `RA 0b10`  |
| 111        | Vector | `r3-r127`  | `RA 0b11`  |
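
algorithm for original version:

    def decode_extra3(RA, EXTRA3):
        """Return the 7-bit register number for a 5-bit RA and 3-bit EXTRA3."""
        ext = EXTRA3 & 0b011            # low two bits of the spec
        if EXTRA3 & 0b100:              # top bit set: vector
            return (RA << 2) | ext      # RA 0bNN
        else:                           # scalar
            return (ext << 5) | RA      # 0bNN RA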

2 bit version

An alternative which is understandable and which, if EXTRA2 is zero, maps to "no effect", i.e. Scalar OpenPOWER register naming:

| R\*_EXTRA2 | Mode   | Range     | Encoded as |
|------------|--------|-----------|------------|
| 00         | Scalar | `r0-r31`  | `0b00 RA`  |
| 01         | Scalar | `r32-r63` | `0b01 RA`  |
| 10         | Vector | `r0-r124` | `RA 0b00`  |
| 11         | Vector | `r2-r126` | `RA 0b10`  |
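
algorithm for original version:

    def decode_extra2(RA, EXTRA2):
        """Return the 7-bit register number for a 5-bit RA and 2-bit EXTRA2."""
        ext = EXTRA2 & 0b01             # low bit of the spec
        if EXTRA2 & 0b10:               # top bit set: vector
            return (RA << 2) | (ext << 1)   # RA 0b00 or RA 0b10
        else:                           # scalar
            return (ext << 5) | RA      # 0b00 RA or 0b01 RA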

## ELWIDTH Encoding

Default behaviour is set to 0b00 so that zeros follow the convention of "not doing anything". In this case it means that elwidth overrides are not applicable. Thus if a 32 bit instruction operates on 32 bit data, `elwidth=0b00` specifies that this behaviour is unmodified. Likewise when a processor is switched from 64 bit to 32 bit mode, `elwidth=0b00` states that, again, the behaviour is not to be modified.

Only when elwidth is nonzero is the element width overridden to the explicitly required value.

### Elwidth for Integers:

| Value | Mnemonic       | Description                        |
|-------|----------------|------------------------------------|
| 00    | DEFAULT        | default behaviour for operation    |
| 01    | `ELWIDTH=b`    | Byte: 8-bit integer                |
| 10    | `ELWIDTH=h`    | Halfword: 16-bit integer           |
| 11    | `ELWIDTH=w`    | Word: 32-bit integer               |

### Elwidth for FP Registers:

| Value | Mnemonic       | Description                           |
|-------|----------------|---------------------------------------|
| 00    | DEFAULT        | default behaviour for FP operation    |
| 01    | `ELWIDTH=bf16` | Reserved for `bf16`                   |
| 10    | `ELWIDTH=f16`  | 16-bit IEEE 754 Half floating-point   |
| 11    | `ELWIDTH=f32`  | 32-bit IEEE 754 Single floating-point |

Note: [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
is reserved for a future implementation of SV.

### Elwidth for CRs:

TODO (important, particularly for crops, mfcr and mtcr): define what elwidth even means for CRs. Instead, it may be possible to use the bits as extra indices (EXTRA6) to access the full 64 CRs. TBD, several ideas.

The actual width of the CRs cannot be altered: they are 4 bit. Thus, for Rc=1 operations that produce a result and corresponding CR, it is the result to which the elwidth override applies, not the CR.

As mentioned above (TBD), this leaves crops etc. still needing a meaning defined for elwidth, because these ops are purely explicit-CR based.

## SUBVL Encoding

The default for SUBVL is 1 and its encoding is 0b00, to indicate that SUBVL is effectively disabled (a SUBVL for-loop of only one element). This lines up with all other "default is all zeros" behaviour.

| Value | Mnemonic            | Description            |
|-------|---------------------|------------------------|
| 00    | `SUBVL=1` (default) | Sub-vector length of 1 |
| 01    | `SUBVL=2`           | Sub-vector length of 2 |
| 10    | `SUBVL=3`           | Sub-vector length of 3 |
| 11    | `SUBVL=4`           | Sub-vector length of 4 |

The SUBVL encoding value may be thought of as an inclusive range of a sub-vector. SUBVL=2 represents a vec2 and its encoding is 0b01, therefore this may be considered to be elements 0b00 to 0b01 inclusive.

## MASK/MASK_SRC & MASK_KIND Encoding

One bit (`MASK_KIND`) indicates the mode: CR or Int predication. The two types may not be mixed.

Special note: to get default behaviour (SV disabled) this field must be set to zero, in combination with Integer Predication also being set to 0b000. This has the effect of enabling "all 1s" in the predicate mask, which is equivalent to "not having any predication at all" and consequently, in combination with all other default zeros, fully disables SV.

| MASK_KIND Value | Description                                          |
|-----------------|------------------------------------------------------|
| 0               | MASK/MASK_SRC are encoded using Integer Predication  |
| 1               | MASK/MASK_SRC are encoded using CR-based Predication |

Integer Twin predication has a second set of 3 bits that uses the same encoding, thus allowing either the same register (r3 or r10) to be used for both src and dest, or different regs (one for src, one for dest).

Likewise CR based twin predication has a second set of 3 bits, allowing a different test to be applied.

### Integer Predication (MASK_KIND=0)

When the predicate mode bit is zero the 3 bits are interpreted as below.
Twin predication has an identical 3 bit field similarly encoded.

| Value | Mnemonic | Element `i` enabled if:      |
|-------|----------|------------------------------|
| 000   | ALWAYS   | (Operation is not masked)    |
| 001   | 1 << R3  | `i == R3`                    |
| 010   | R3       | `R3 & (1 << i)` is non-zero  |
| 011   | ~R3      | `R3 & (1 << i)` is zero      |
| 100   | R10      | `R10 & (1 << i)` is non-zero |
| 101   | ~R10     | `R10 & (1 << i)` is zero     |
| 110   | R30      | `R30 & (1 << i)` is non-zero |
| 111   | ~R30     | `R30 & (1 << i)` is zero     |
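
The table above can be read as a small decision function. The following Python sketch is illustrative only: the function name `int_pred_enabled` and the `regs` dictionary (standing in for the integer register file) are assumptions made for the example.

```python
def int_pred_enabled(mask, i, regs):
    """True if element i is enabled under Integer Predication (MASK_KIND=0).

    mask: 3-bit MASK (or MASK_SRC) value.
    regs: dict holding the current values of r3, r10 and r30
          (hypothetical stand-in for the integer register file).
    """
    if mask == 0b000:                     # ALWAYS: operation is not masked
        return True
    if mask == 0b001:                     # 1 << R3: only element number R3
        return i == regs["r3"]
    reg = {0b01: "r3", 0b10: "r10", 0b11: "r30"}[mask >> 1]
    bit_set = ((regs[reg] >> i) & 1) == 1
    inverted = (mask & 1) == 1            # odd encodings are the ~ forms
    return bit_set != inverted
```

For example, with `r3 = 0b0101`, mask `010` enables elements 0 and 2, while mask `011` enables every other element.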

### CR-based Predication (MASK_KIND=1)

When the predicate mode bit is one the 3 bits are interpreted as below. Twin predication has an identical 3 bit field similarly encoded.

| Value | Mnemonic | Description                                     |
|-------|----------|-------------------------------------------------|
| 000   | lt       | Element `i` is enabled if `CR[6+i].LT` is set   |
| 001   | nl/ge    | Element `i` is enabled if `CR[6+i].LT` is clear |
| 010   | gt       | Element `i` is enabled if `CR[6+i].GT` is set   |
| 011   | ng/le    | Element `i` is enabled if `CR[6+i].GT` is clear |
| 100   | eq       | Element `i` is enabled if `CR[6+i].EQ` is set   |
| 101   | ne       | Element `i` is enabled if `CR[6+i].EQ` is clear |
| 110   | so/un    | Element `i` is enabled if `CR[6+i].FU` is set   |
| 111   | ns/nu    | Element `i` is enabled if `CR[6+i].FU` is clear |
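
A minimal Python sketch of the table above: the upper two bits of the 3-bit value pick which CR field bit is tested, and the lowest bit picks "set" or "clear". The name `cr_pred_enabled` and the list-of-dicts model of the CR file are assumptions made purely for illustration.

```python
# CR field bits in table order: (value >> 1) selects the bit to test.
CR_BITS = ("LT", "GT", "EQ", "FU")


def cr_pred_enabled(mask, i, cr):
    """True if element i is enabled under CR-based Predication (MASK_KIND=1).

    cr: sequence of CR fields, each a dict with boolean LT, GT, EQ and FU
        entries (a hypothetical model of the CR register file).
    """
    bit = CR_BITS[mask >> 1]
    want_set = (mask & 1) == 0            # even values test "set", odd test "clear"
    return cr[6 + i][bit] == want_set
```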

CR based predication TODO: select an alternate CR for twin predication? See [[discussion]]. Overlap of the two CR based predicates must be taken into account. Either the starting point for one of them must be suitably high, or it must be accepted that for twin predication VL must not exceed the range where overlap will occur, *or* they use the same starting point but select different *bits* of the same CRs.


# Twin Predication

This is a novel concept that allows predication to be applied to a single source and a single dest register. The following types of traditional Vector operations may be encoded with it, *without requiring explicit opcodes to do so*:

* VSPLAT (a single scalar distributed across a vector)
* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))

Those patterns (and more) may be applied to:

* mv (the usual way that V\* ISA operations are created)
* exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

This is a huge list that creates extremely powerful combinations, particularly given that one of the predicate options is `(1<<r3)`.

Additional unusual capabilities of Twin Predication include a back-to-back version of VCOMPRESS-VEXPAND which is effectively the ability to do an ordered multiple VINSERT.
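
To make the VCOMPRESS / VEXPAND / VSPLAT patterns concrete, here is a rough Python model of a twin-predicated `mv`. It is a sketch under stated assumptions, not a definition: it assumes that source and destination each step through their own predicate mask independently, skipping disabled elements, and that a scalar operand does not advance. This page does not itself spell out that loop. The function name, the flat `regs` list and the bitmask arguments are all hypothetical.

```python
def twin_pred_mv(regs, rd, rs, VL, src_mask, dst_mask,
                 src_is_vec=True, dst_is_vec=True):
    """Copy regs[rs + i] to regs[rd + j] under separate src/dest predicates."""
    i = j = 0                             # source and dest element indices
    while i < VL and j < VL:
        # Skip elements whose predicate bit is clear (assumed semantics).
        while i < VL and not (src_mask >> i) & 1:
            i += 1
        while j < VL and not (dst_mask >> j) & 1:
            j += 1
        if i == VL or j == VL:
            break
        regs[rd + j] = regs[rs + i]
        if src_is_vec:
            i += 1
        if dst_is_vec:
            j += 1
        if not src_is_vec and not dst_is_vec:
            break                         # scalar-to-scalar: a single copy
    return regs
```

With a sparse source mask and an all-ones dest mask this behaves like VCOMPRESS; swapping the masks gives VEXPAND; a scalar source with a vector dest gives VSPLAT.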

# Register Naming

SV Registers are simply the INT, FP and CR register files extended linearly to larger sizes. Thus, the integer regfile in standard scalar OpenPOWER v3.0B and v3.1B is r0 to r31: SV extends this as r0 to r127. Likewise FP registers are extended to 128 (fp0 to fp127), and CRs are extended to 64 entries, CR0 thru CR63.

The names of the registers therefore reflect a simple linear extension of the OpenPOWER v3.0B / v3.1B register naming.

# Operation

## CR fields as inputs/outputs of vector operations

When vectorized, the CR inputs/outputs are read/written to 4-bit CR fields
starting from SVCR6_000 and incrementing from there. If SVCR7_111 is reached, the next CR
field used wraps around to SVCR0_000, then incrementing from there.
(see [[discussion]]. Some alternative schemes are described there.)
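
The wrap-around can be sketched as simple modular arithmetic. This sketch assumes a linear numbering of the 64 CR fields in which SVCR6_000 is field 6 (matching the `CR[6+i]` notation used for CR-based predication) and SVCR7_111 is the last field, 63. The exact mapping of SVCR names to field numbers is not defined here, so treat both constants as assumptions.

```python
NUM_CR_FIELDS = 64    # SV extends the CRs to 64 fields
FIRST_VECTOR_CR = 6   # assumed linear index of SVCR6_000


def vector_cr_index(i):
    """Linear CR field number used for element i, wrapping after the last field."""
    return (FIRST_VECTOR_CR + i) % NUM_CR_FIELDS
```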

SVCR6_000 was chosen as a balance between avoiding the need to save CR2-CR4 (which are
callee-saved) just to use SV vectors with VL <= 61, and having the first
vector CR field readily accessible to standard CR instructions and branches.
Additionally, SVCR6_000 is used as the implicit result of an OpenPower ISA v3.1
standard vector (SIMD) instruction with Rc=1.

## Table of CR fields

CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.

There are 3 new SPRs for holding CRs: CR_EXT1, CR_EXT2, and CR_EXT3.

The 64 SV CRs are arranged similarly to the way the 128 integer registers are arranged. TODO: a python program that auto-generates a CSV file which can be included in a table, on a new page (so as not to overwhelm this one): [[svp64/cr_names]]

# Register Profiles

**NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.

Instructions are broken down by Register Profiles as listed in the following auto-generated page:
[[opcode_regs_deduped]]. "Non-SV" indicates that the operations with this Register Profile cannot be Vectorised (mtspr, bc, dcbz, twi).

## LDST-1R-1W-imm

`RM-2P-1S1D`

LD immediate

## LDST-1R-2W-imm

LD immediate with update

## LDST-2R-imm

ST immediate

## LDST-2R-1W

`RM-2P-2S1D`

LD Indexed

## LDST-2R-1W-imm

ST immediate with update

## LDST-2R-2W

LD Indexed with update

## LDST-3R

ST Indexed

## LDST-3R-CRo

ST Indexed cache

## LDST-3R-1W

ST Indexed with update

## CRio

TBD

## CR=2R1W

Remapped Encoding Fields: `RM-1P-2S1D`

## 1W-CRi

Remapped Encoding Fields: `RM-2P-1S1D`

## 1R-CRo

Remapped Encoding Fields: `RM-2P-1S1D`

## 1R-CRio

Remapped Encoding Fields: `RM-2P-1S1D`

## 1R-1W

Remapped Encoding Fields: `RM-2P-1S1D`

## 1R-1W-imm

Remapped Encoding Fields: `RM-2P-1S1D`

## 1R-1W-CRo

Remapped Encoding Fields: `RM-2P-1S1D`

## 1R-1W-CRio

Remapped Encoding Fields: `RM-2P-1S1D`

## 2R-CRo

Remapped Encoding Fields: `RM-1P-2S1D`

## 2R-CRio

Remapped Encoding Fields: `RM-1P-2S1D`

## 2R-1W

Remapped Encoding Fields: `RM-1P-2S1D`

## 2R-1W-CRo

Remapped Encoding Fields: `RM-1P-2S1D`

*Note that analysis of `rl(w|d)imi` shows that these are correctly identified as 2S1D. Although the pseudocode in [[isa/fixedshift]] uses RA as both a src and a dest, the EXTRA3 extension of each of these gives different meanings to the src RA and dest RA.*

## 2R-1W-CRi

TBD

## 2R-1W-CRio

Remapped Encoding Fields: `RM-1P-2S1D`

## 3R-1W-CRio

Remapped Encoding Fields: `RM-1P-3S1D`