1 # SVP64 Zero-Overhead Loop Prefix Subsystem
2
3 * **DRAFT STATUS v0.1 18sep2021** Release notes <https://bugs.libre-soc.org/show_bug.cgi?id=699>
4
5 This document describes [[SV|sv]] augmentation of the [[Power|openpower]] v3.0B [[ISA|openpower/isa/]]. It is in Draft Status and
6 will be submitted to the [[!wikipedia OpenPOWER_Foundation]] ISA WG
7 via the External RFC Process.
8
9 Credits and acknowledgements:
10
11 * Luke Leighton
12 * Jacob Lifshay
13 * Hendrik Boom
14 * Richard Wilbur
15 * Alexandre Oliva
16 * Cesar Strauss
17 * NLnet Foundation, for funding
18 * OpenPOWER Foundation
19 * Paul Mackerras
20 * Toshaan Bharvani
21 * IBM for the Power ISA itself
22
23 Links:
24
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001498.html>
26 * [[svp64/discussion]]
27 * [[svp64/appendix]]
28 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001650.html>
29 * <https://bugs.libre-soc.org/show_bug.cgi?id=550>
30 * <https://bugs.libre-soc.org/show_bug.cgi?id=573> TODO elwidth "infinite" discussion
31 * <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturating description.
32 * <https://bugs.libre-soc.org/show_bug.cgi?id=905> TODO [[sv/svp64-single]]
33 * <https://bugs.libre-soc.org/show_bug.cgi?id=1045> External RFC ls010
34 * [[sv/branches]] chapter
35 * [[sv/ldst]] chapter
36
37
38 Table of contents
39
40 [[!toc]]
41
42 ## Introduction
43
44 Simple-V is a type of Vectorisation best described as a "Prefix Loop
45 Subsystem" similar to the 5 decades-old Zilog Z80 `LDIR` instruction and
46 to the 8086 `REP` Prefix instruction. More advanced features are similar
47 to the Z80 `CPIR` instruction. If naively viewed one-dimensionally as an actual
48 Vector ISA it introduces over 1.5 million 64-bit True-Scalable Vector instructions
49 on the SFFS Subset and closer to 10 million 64-bit True-Scalable Vector
50 instructions if introduced on VSX.
51 SVP64, the instruction format used by Simple-V, is therefore best viewed
52 as an orthogonal RISC-paradigm "Prefixing" subsystem instead.
53
54 Except where explicitly stated all bit numbers remain as in the rest of
55 the Power ISA: in MSB0 form (the bits are numbered from 0 at the MSB on
56 the left and counting up as you move rightwards to the LSB end). All bit
57 ranges are inclusive (so `4:6` means bits 4, 5, and 6, in MSB0 order).
58 **All register numbering and element numbering however is LSB0 ordering**
59 which is a different convention from that used elsewhere in the Power ISA.
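
As an illustration of the two numbering conventions, the following is a minimal sketch (not part of the specification) of converting between MSB0 and LSB0 bit numbers and of extracting an inclusive MSB0 bit range such as `4:6` from a 64-bit value. The function names `flip_bit_numbering` and `extract_msb0` are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a bit index between MSB0 and LSB0 numbering for a field that
 * is `width` bits wide. The conversion is its own inverse. */
static unsigned flip_bit_numbering(unsigned bit, unsigned width) {
    assert(width > 0 && bit < width);
    return (width - 1u) - bit;
}

/* Extract the inclusive MSB0 bit range start:end from a 64-bit value,
 * returning the field right-justified. */
static uint64_t extract_msb0(uint64_t value, unsigned start, unsigned end) {
    assert(start <= end && end < 64);
    unsigned nbits = end - start + 1u;
    unsigned shift = flip_bit_numbering(end, 64); /* LSB0 position of `end` */
    uint64_t mask = (nbits == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << nbits) - 1u);
    return (value >> shift) & mask;
}
```

For example MSB0 bits `4:6` of a 64-bit value are LSB0 bits 59 down to 57, so `extract_msb0(v, 4, 6)` shifts right by 57 and keeps three bits.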
60
61 The SVP64 prefix always comes before the suffix in PC order and must be
62 considered an independent "Defined word" that augments the behaviour of
63 the following instruction, but does **not** change the actual Decoding
64 of that following instruction. **All prefixed 32-bit instructions
65 (Defined Words) retain their non-prefixed encoding and definition**.
66
67 Two apparent exceptions to the above hard rule exist: SV Branch-Conditional
68 operations and LD/ST-update "Post-Increment" Mode. Post-Increment
69 was considered sufficiently high priority (significantly reducing hot-loop
70 instruction count) that one bit in the Prefix is reserved for it
71 (Note the intention to release that bit and move Post-Increment instructions
72 to EXT2xx).
73 Vectorised Branch-Conditional operations "embed" the original Scalar
74 Branch-Conditional behaviour into a much more advanced variant that
75 is highly suited to High-Performance Computation (HPC), Supercomputing,
76 and parallel GPU Workloads.
77
*Architectural Resource Allocation note: it is prohibited to accept RFCs
which fundamentally violate this hard requirement. Under no circumstances
may the Suffix space have an alternate instruction encoding allocated
within SVP64 that is entirely different from the non-prefixed Defined
Word. Hardware Implementors critically rely on this inviolate guarantee
to implement High-Performance Multi-Issue micro-architectures that can
sustain 100% throughput.*
85
86 Subset implementations in hardware are permitted, as long as certain
87 rules are followed, allowing for full soft-emulation including future
88 revisions. Compliancy Subsets exist to ensure minimum levels of binary
89 interoperability expectations within certain environments. Details in
90 the [[svp64/appendix]].
91
92 ## SVP64 encoding features
93
94 A number of features need to be compacted into a very small space of
95 only 24 bits:
96
97 * Independent per-register Scalar/Vector tagging and range extension on
98 every register
99 * Element width overrides on both source and destination
100 * Predication on both source and destination
101 * Two different sources of predication: INT and CR Fields
102 * SV Modes including saturation (for Audio, Video and DSP), mapreduce,
103 fail-first and predicate-result mode.
104
105 Different classes of operations require different formats. The earlier
106 sections cover the common formats and the four separate modes follow:
107 CR operations (crops), Arithmetic/Logical (termed "normal"), Load/Store
108 and Branch-Conditional.
109
110 ## Definition of Reserved in this spec.
111
112 For the new fields added in SVP64, instructions that have any of their
113 fields set to a reserved value must cause an illegal instruction trap,
114 to allow emulation of future instruction sets, or for subsets of SVP64 to
115 be implemented in hardware and the rest emulated. This includes SVP64
116 SPRs: reading or writing values which are not supported in hardware
117 must also raise illegal instruction traps in order to allow emulation.
118 Unless otherwise stated, reserved values are always all zeros.
119
120 This is unlike OpenPower ISA v3.1, which in many instances does not
121 require a trap if reserved fields are nonzero. Where the standard Power
122 ISA definition is intended the red keyword `RESERVED` is used.
123
124 ## Definition of "UnVectoriseable"
125
126 Any operation that inherently makes no sense if repeated is termed
127 "UnVectoriseable" or "UnVectorised". Examples include `sc` or `sync`
128 which have no registers. `mtmsr` is also classed as UnVectoriseable
129 because there is only one `MSR`.
130
131 UnVectorised instructions are required to be detected as such if
132 Prefixed (either SVP64 or SVP64Single) and an Illegal Instruction
133 Trap raised.
134
135 *Architectural Note: Given that a "pre-classification" Decode Phase is
136 required (identifying whether the Suffix - Defined Word - is
137 Arithmetic/Logical, CR-op, Load/Store or Branch-Conditional),
138 adding "UnVectorised" to this phase is not unreasonable.*
139
140 ## Register files, elements, and Element-width Overrides
141
In the Upper Compliancy Levels of SVP64 the GPR and FPR Register files
are each expanded from 32 to 128 entries, and the number of CR Fields
is expanded from 8 (CR0-CR7) to 128 (CR0-CR127). (Note: A future version
of SVP64 is anticipated to extend the VSR register file).

Memory access remains exactly the same: the effects of `MSR.LE` are
unchanged, applying as they already do **only** to the byte-order of
Load and Store memory-register operations, and having nothing to do
with the ordering of the contents of register files or
register-register operations.
152
153 To be absolutely clear:
154
155 ```
156 There are no conceptual arithmetic ordering or other changes over the
157 Scalar Power ISA definitions to registers or register files or to
158 arithmetic or Logical Operations beyond element-width subdivision
159 ```
160
161 Element offset
162 numbering is naturally **LSB0-sequentially-incrementing from zero, not
163 MSB0-incrementing** including when element-width overrides are used,
164 at which point the elements progress through each register
165 sequentially from the LSB end
166 (confusingly numbered the highest in MSB0 ordering) and progress
167 incrementally to the MSB end (confusingly numbered the lowest in
168 MSB0 ordering).
169
170 When exclusively using MSB0-numbering, SVP64
171 becomes unnecessarily complex to both express and subsequently understand:
172 the required conditional subtractions from 63,
173 31, 15 and 7 needed to express the fact that elements are LSB0-sequential
174 unfortunately become a hostile minefield, obscuring both
175 intent and meaning. Therefore for the
176 purposes of this section the more natural **LSB0 numbering is assumed**
177 and it is left to the reader to translate to MSB0 numbering.
178
The Canonical specification for how element-sequential numbering and
element-width overrides are defined is expressed in the following C
union, assuming a Little-Endian system, and naturally using LSB0
numbering everywhere because the ANSI C specification is inherently LSB0.
Note the deliberate similarity to how VSX register elements are defined:
184
```
#include <stdint.h>

#pragma pack
typedef union {
    // the arrays are deliberately unbounded: an index beyond the end of
    // one 64-bit register continues into the next-numbered register
    uint8_t  bytes[];         // elwidth 8
    uint16_t hwords[];        // elwidth 16
    uint32_t words[];         // elwidth 32
    uint64_t dwords[];        // elwidth 64
    uint8_t  actual_bytes[8];
} el_reg_t;

el_reg_t int_regfile[128];

// copy one element of the given width out of the register file
void get_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: el->dwords[0] = int_regfile[gpr].dwords[element]; break;
        case 32: el->words[0] = int_regfile[gpr].words[element]; break;
        case 16: el->hwords[0] = int_regfile[gpr].hwords[element]; break;
        case 8 : el->bytes[0] = int_regfile[gpr].bytes[element]; break;
    }
}
// copy one element of the given width into the register file
void set_register_element(el_reg_t* el, int gpr, int element, int width) {
    switch (width) {
        case 64: int_regfile[gpr].dwords[element] = el->dwords[0]; break;
        case 32: int_regfile[gpr].words[element] = el->words[0]; break;
        case 16: int_regfile[gpr].hwords[element] = el->hwords[0]; break;
        case 8 : int_regfile[gpr].bytes[element] = el->bytes[0]; break;
    }
}
```
214
Example Vector-looped add operation implementation when elwidths are 64-bit:

```
# vector-add RT, RA, RB using the "uint64_t" union member, "dwords"
for i in range(VL):
    int_regfile[RT].dwords[i] = int_regfile[RA].dwords[i] + int_regfile[RB].dwords[i]
```

However if elwidth overrides are set to 16 for both source and destination:

```
# vector-add RT, RA, RB using the "uint16_t" union member, "hwords"
for i in range(VL):
    int_regfile[RT].hwords[i] = int_regfile[RA].hwords[i] + int_regfile[RB].hwords[i]
```
230
231 The most fundamental aspect here to understand is that the wrapping into
232 subsequent Scalar GPRs that occurs on larger-numbered elements
233 including and especially on smaller element widths is **deliberate and intentional**.
234 From this Canonical definition it should be clear that sequential elements begin
235 at the LSB end of any given underlying Scalar GPR, progress to the MSB end, and
236 then to the LSB end of the *next numerically-larger Scalar GPR*. In the
237 example above if VL=5 and RT=1 then the contents of GPR(1) and GPR(2) will
238 be as follows. For clarity in the table below:
239
* Both MSB0-ordered bitnumbering *and* LSB0-ordered bitnumbering are shown
* The GPR-numbering is considered LSB0-ordered
* The Element-numbering (result0-result4) is LSB0-ordered
* Each of the results (result0-result4) is 16-bit
* "same" indicates "no change as a result of the Vectorised add"
245
246 ```
247 | MSB0: | 0:15 | 16:31 | 32:47 | 48:63 |
248 | LSB0: | 63:48 | 47:32 | 31:16 | 15:0 |
249 |--------|---------|---------|---------|---------|
250 | GPR(0) | same | same | same | same |
251 | GPR(1) | result3 | result2 | result1 | result0 |
252 | GPR(2) | same | same | same | result4 |
253 | GPR(3) | same | same | same | same |
254 | ... | ... | ... | ... | ... |
255 | ... | ... | ... | ... | ... |
256 ```
257
Note that the upper 48 bits of GPR(2) would **not** be modified due to
the example having VL=5. Thus on "wrapping" - sequential progression from
GPR(1) into GPR(2) - the 5th result modifies
**only** the bottom 16 LSBs of GPR(2).
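
The wrapping behaviour in the VL=5, RT=1 example can be checked with a small model. This is a minimal sketch, not the Canonical definition: it assumes the whole GPR file can be modelled as one Little-Endian byte array, and the names `regfile`, `get_hword`, `set_hword`, `vector_add16` and `get_gpr` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_GPRS = 128, GPR_BYTES = 8 };

/* Hypothetical model: the entire GPR file as one little-endian byte array. */
static uint8_t regfile[NUM_GPRS * GPR_BYTES];

/* Read 16-bit element `element`, counted from the LSB end of GPR `gpr`. */
static uint16_t get_hword(int gpr, int element) {
    size_t offs = (size_t)gpr * GPR_BYTES + (size_t)element * 2;
    assert(offs + 2 <= sizeof regfile);
    return (uint16_t)(regfile[offs] | (regfile[offs + 1] << 8));
}

/* Write 16-bit element `element`; only those two bytes are touched. */
static void set_hword(int gpr, int element, uint16_t value) {
    size_t offs = (size_t)gpr * GPR_BYTES + (size_t)element * 2;
    assert(offs + 2 <= sizeof regfile);
    regfile[offs] = (uint8_t)(value & 0xff);
    regfile[offs + 1] = (uint8_t)(value >> 8);
}

/* 16-bit element-width vector add: RT[i] = RA[i] + RB[i] for i in 0..VL-1. */
static void vector_add16(int rt, int ra, int rb, int vl) {
    for (int i = 0; i < vl; i++)
        set_hword(rt, i, (uint16_t)(get_hword(ra, i) + get_hword(rb, i)));
}

/* Read a whole 64-bit GPR back out of the byte array. */
static uint64_t get_gpr(int gpr) {
    uint64_t v = 0;
    for (int b = GPR_BYTES - 1; b >= 0; b--)
        v = (v << 8) | regfile[(size_t)gpr * GPR_BYTES + (size_t)b];
    return v;
}
```

With VL=5 and RT=1, elements 0 to 3 fill GPR(1) from the LSB end upwards and element 4 lands in the bottom 16 bits of GPR(2), leaving the remaining 48 bits of GPR(2) untouched.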
262
263 Hardware Architectural note: to avoid a Read-Modify-Write at the register
264 file it is strongly recommended to implement byte-level write-enable lines
265 exactly as has been implemented in DRAM ICs for many decades. Additionally
266 the predicate mask bit is advised to be associated with the element
267 operation and alongside the result ultimately passed to the register file.
268 When element-width is set to 64-bit the relevant predicate mask bit
269 may be repeated eight times and pull all eight write-port byte-level
270 lines HIGH. Clearly when element-width is set to 8-bit the relevant
271 predicate mask bit corresponds directly with one single byte-level
272 write-enable line. It is up to the Hardware Architect to then amortise
273 (merge) elements together into both PredicatedSIMD Pipelines as well
274 as simultaneous non-overlapping Register File writes, to achieve High
275 Performance designs. Overall it helps to think of the register files
276 as being much more akin to a byte-level-addressable SRAM.
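
The relationship between element width, predicate bit and byte-level write-enable lines may be sketched as below. This is an illustrative model only: the function name `byte_write_enable` and the idea of a "slot" (the element's position within one 64-bit register, slot 0 being at the LSB end) are assumptions made for this example, not terms from the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-level write-enable lines for one 64-bit register write.
 *   elwidth_bits  : 8, 16, 32 or 64
 *   slot          : element position within the register (0 = LSB end)
 *   predicate_bit : the predicate mask bit for this element
 * Returns an 8-bit mask in which bit n set means "write byte n". */
static uint8_t byte_write_enable(unsigned elwidth_bits, unsigned slot,
                                 unsigned predicate_bit) {
    unsigned nbytes = elwidth_bits / 8u;
    assert(nbytes == 1 || nbytes == 2 || nbytes == 4 || nbytes == 8);
    assert(slot < 8u / nbytes);
    if (!predicate_bit)
        return 0;                          /* masked out: nothing written */
    unsigned lanes = (1u << nbytes) - 1u;  /* predicate bit repeated nbytes times */
    return (uint8_t)(lanes << (slot * nbytes));
}
```

At 64-bit element width an enabled element raises all eight lines (`0xFF`), while at 8-bit element width it raises exactly one.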
277
278 If the 16-bit operation were to be followed up with a 32-bit Vectorised
279 Operation, the exact same contents would be viewed as follows:
280
281 ```
282 | MSB0: | 0:31 | 32:63 |
283 | LSB0: | 63:32 | 31:0 |
284 |--------|----------------------|----------------------|
285 | GPR(0) | same | same |
286 | GPR(1) | (result3 || result2) | (result1 || result0) |
287 | GPR(2) | same | (same || result4) |
288 | GPR(3) | same | same |
289 | ... | ... | ... |
290 | ... | ... | ... |
291 ```
292
293 In other words, this perspective really is no different from the situation
294 where the actual Register File is treated as an Industry-standard byte-level-addressable
295 Little-Endian-addressed SRAM. Note that this perspective does **not**
296 involve `MSR.LE` in any way shape or form because `MSR.LE` is directly
297 in control of the Memory-to-Register byte-ordering. This section is
298 exclusively about how to correctly perceive Simple-V-Augmented **Register**
299 Files.
300
301 **Comparative equivalent using VSR registers**
302
For a comparative data point the VSR Registers may be expressed in the
same fashion. The C code below is directly an expression of Figure 97 in
Power ISA Public v3.1 Book I Section 6.3 page 258, *after compensating for
MSB0 numbering in both bits and elements, adapting in full to LSB0 numbering,
and obeying LE ordering*.

**The subtraction from 1, 3, 7 and 15 is present because the Power ISA
also numbers VSX Register elements in MSB0 order**.
SVP64 very specifically numbers elements in **LSB0** order with the first
element (numbered zero) being at the bitwise-numbered **LSB** end of the
register, whereas VSX does the reverse: it places the numerically-*highest*
(last-numbered) element at the LSB end of the register.
315
316
```
#include <assert.h>
#include <stdint.h>

#pragma pack
typedef union {
    // these do NOT match their Power ISA VSX numbering directly, they are all reversed
    // bytes[15] is actually VSR.byte[0] for example. if this convention is not
    // followed then everything ends up in the wrong place
    uint8_t  bytes[16];        // elwidth 8, QTY 16 FIXED total
    uint16_t hwords[8];        // elwidth 16, QTY 8 FIXED total
    uint32_t words[4];         // elwidth 32, QTY 4 FIXED total
    uint64_t dwords[2];        // elwidth 64, QTY 2 FIXED total
    uint8_t  actual_bytes[16]; // totals 128-bit
} el_reg_t;

el_reg_t VSR_regfile[64];

// VSR elements are bounded: reject any index beyond a single register
static void check_num_elements(int elt, int width) {
    switch (width) {
        case 64: assert(elt < 2); break;
        case 32: assert(elt < 4); break;
        case 16: assert(elt < 8); break;
        case 8 : assert(elt < 16); break;
    }
}
void get_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
    check_num_elements(elt, width);
    switch (width) {
        case 64: el->dwords[0] = VSR_regfile[gpr].dwords[1-elt]; break;
        case 32: el->words[0] = VSR_regfile[gpr].words[3-elt]; break;
        case 16: el->hwords[0] = VSR_regfile[gpr].hwords[7-elt]; break;
        case 8 : el->bytes[0] = VSR_regfile[gpr].bytes[15-elt]; break;
    }
}
void set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
    check_num_elements(elt, width);
    switch (width) {
        case 64: VSR_regfile[gpr].dwords[1-elt] = el->dwords[0]; break;
        case 32: VSR_regfile[gpr].words[3-elt] = el->words[0]; break;
        case 16: VSR_regfile[gpr].hwords[7-elt] = el->hwords[0]; break;
        case 8 : VSR_regfile[gpr].bytes[15-elt] = el->bytes[0]; break;
    }
}
```
359
360 For VSR Registers one key difference is that the overlay of different element
361 widths is clearly a *bounded static quantity*, whereas for Simple-V the
362 elements are
363 unrestrained and permitted to flow into *successive underlying Scalar registers*.
364 This difference is absolutely critical to a full understanding of the entire
365 Simple-V paradigm and why element-ordering, bit-numbering *and register numbering*
366 are all so strictly defined.
367
Implementations are not permitted to violate the Canonical definition. Software
will be critically relying on the wrapped (overflow) behaviour inherently
implied by the unbounded variable-length C arrays.
371
372 Illustrating the exact same loop with the exact same effect as achieved by Simple-V
373 we are first forced to create wrapper functions, to cater for the fact
374 that VSR register elements are static bounded:
375
```
// how many whole VSRs the element index has moved past
int calc_VSR_reg_offs(int elt, int width) {
    switch (width) {
        case 64: return elt / 2;
        case 32: return elt / 4;
        case 16: return elt / 8;
        case 8 : return elt / 16;
    }
    return 0; // unsupported width
}
// element index within the selected VSR
int calc_VSR_elt_offs(int elt, int width) {
    switch (width) {
        case 64: return (elt % 2);
        case 32: return (elt % 4);
        case 16: return (elt % 8);
        case 8 : return (elt % 16);
    }
    return 0; // unsupported width
}
void _get_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
    int new_elt = calc_VSR_elt_offs(elt, width);
    int new_reg = calc_VSR_reg_offs(elt, width);
    get_VSR_element(el, gpr+new_reg, new_elt, width);
}
void _set_VSR_element(el_reg_t* el, int gpr, int elt, int width) {
    int new_elt = calc_VSR_elt_offs(elt, width);
    int new_reg = calc_VSR_reg_offs(elt, width);
    set_VSR_element(el, gpr+new_reg, new_elt, width);
}
```
399
400 And finally use these functions:
401
```
# VSX-add RT, RA, RB using the "uint16_t" union member "hwords"
for i in range(VL):
    el_reg_t result, ra, rb;
    _get_VSR_element(&ra, RA, i, 16);
    _get_VSR_element(&rb, RB, i, 16);
    result.hwords[0] = ra.hwords[0] + rb.hwords[0]; // use array 0 elements
    _set_VSR_element(&result, RT, i, 16);
```
412
413 ## Scalar Identity Behaviour
414
415 SVP64 is designed so that when the prefix is all zeros, and VL=1, no
416 effect or influence occurs (no augmentation) such that all standard Power
417 ISA v3.0/v3.1 instructions covered by the prefix are "unaltered". This
418 is termed `scalar identity behaviour` (based on the mathematical
419 definition for "identity", as in, "identity matrix" or better "identity
420 transformation").
421
422 Note that this is completely different from when VL=0. VL=0 turns all
423 operations under its influence into `nops` (regardless of the prefix)
424 whereas when VL=1 and the SV prefix is all zeros, the operation simply
425 acts as if SV had not been applied at all to the instruction (an
426 "identity transformation").
427
428 The fact that `VL` is dynamic and can be set to any value at runtime based
429 on program conditions and behaviour means very specifically that
430 `scalar identity behaviour` is **not** a redundant encoding. If the
431 only means by which VL could be set was by way of static-compiled
432 immediates then this assertion would be false. VL should not
433 be confused with MAXVL when understanding this key aspect of SimpleV.
434
435 ## Register Naming and size
436
437 As indicated above SV Registers are simply the GPR, FPR and CR
438 register files extended linearly to larger sizes; SV Vectorisation
439 iterates sequentially through these registers (LSB0 sequential ordering
440 from 0 to VL-1).
441
442 Where the integer regfile in standard scalar Power ISA v3.0B/v3.1B is
443 r0 to r31, SV extends this as r0 to r127. Likewise FP registers are
444 extended to 128 (fp0 to fp127), and CR Fields are extended to 128 entries,
445 CR0 thru CR127.
446
The names of the registers therefore reflect a simple linear extension
of the Power ISA v3.0B / v3.1B register naming, and in hardware this
would be reflected by a linear increase in the size of the underlying
SRAM used for the regfiles.
451
452 Note: when an EXTRA field (defined below) is zero, SV is deliberately
453 designed so that the register fields are identical to as if SV was not in
454 effect i.e. under these circumstances (EXTRA=0) the register field names
455 RA, RB etc. are interpreted and treated as v3.0B / v3.1B scalar registers.
456 This is part of `scalar identity behaviour` described above.
457
458 **Condition Register(s)**
459
The Scalar Power ISA Condition Register is a 64 bit register where the top
32 MSBs (numbered 0:31 in MSB0 numbering) are not used. This convention is
*preserved* in SVP64 and an additional 15 Condition Registers are provided in
order to store the new CR Fields, CR8-CR15, CR16-CR23 etc. sequentially.
The top 32 MSBs in each new SVP64 Condition Register are *also* not used:
only the bottom 32 bits (numbered 32:63 in MSB0 numbering) are used.
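
The sequential placement of CR Fields can be sketched as a simple index calculation. This is a hedged illustration: it assumes that each additional Condition Register lays out its eight fields exactly as the Scalar CR does (first field at MSB0 bits 32:35, last at 60:63), and the names `cr_field_loc_t` and `locate_cr_field` are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned cr_reg;     /* which 64-bit Condition Register (0..15) */
    unsigned msb0_start; /* MSB0 number of the field's first bit */
    unsigned msb0_end;   /* MSB0 number of the field's last bit */
} cr_field_loc_t;

/* Locate CR Field `field` (0..127), assuming the scalar CR layout is
 * repeated in every additional Condition Register. */
static cr_field_loc_t locate_cr_field(unsigned field) {
    assert(field < 128);
    cr_field_loc_t loc;
    loc.cr_reg = field / 8u;                    /* eight fields per register */
    loc.msb0_start = 32u + 4u * (field % 8u);   /* upper 32 bits are unused */
    loc.msb0_end = loc.msb0_start + 3u;
    return loc;
}
```

Under this assumption CR8 is the first field of the second Condition Register and CR127 is the last field of the sixteenth.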
467
*Programmer's note: using `sv.mfcr` without element-width overrides
would seem the only option, because normally elwidth overrides would
halve the capacity of the instruction. Doing so must take into account
the fact that the top 32 MSBs are zero, which effectively doubles the
number of GPR registers required to hold all 128 CR Fields. However in
this case it is possible to use destination element-width overrides (for
`sv.mfcr`; source overrides would be used on the GPR of `sv.mtocrf`),
whereupon truncation of the 64-bit Condition Register(s) occurs, throwing
away the zeros and storing the remaining (valid, desired) 32-bit values
sequentially into (LSB0-convention) lower-numbered and upper-numbered
halves of GPRs respectively. The programmer is expected to be aware
however that the full width of the entire 64-bit Condition Register
is considered to be "an element". This is **not** like any other
Condition-Register instruction, because all other CR instructions,
on closer investigation, will be observed to be CR-bit or CR-Field
related. Thus a `VL` of 16 must be used.*
484
485 ## Future expansion.
486
487 With the way that EXTRA fields are defined and applied to register
488 fields, future versions of SV may involve 256 or greater registers
489 in some way as long as the reputation of Power ISA for full backwards
490 binary interoperability is preserved. Backwards binary compatibility
491 may be achieved with a PCR bit (Program Compatibility Register) or an
492 MSR bit analogous to SF. Further discussion is out of scope for this
493 version of SVP64.
494
495 Additionally, a future variant of SVP64 will be applied to the Scalar
496 (Quad-precision and 128-bit) VSX instructions. Element-width overrides are
497 an opportunity to expand a future version of the Power ISA to 256-bit,
498 512-bit and 1024-bit operations, as well as doubling or quadrupling the
499 number of VSX registers to 128 or 256. Again further discussion is out
500 of scope for this version of SVP64.
501
502 --------
503
504 \newpage{}
505
506 ## SVP64 Remapped Encoding (`RM[0:23]`)
507
In the SVP64 Vector Prefix spaces, the 24 bits 8-31 are termed `RM`. Bits
32-37 are the Primary Opcode of the Suffix "Defined Word". 38-63 are the
remainder of the Defined Word. Note that for the new EXT232-263 SVP64 area
it is mandatory that bit 32 is set to 1.

| 0-5 | 6 | 7 | 8-31     | 32-37  | 38-63    | Description      |
|-----|---|---|----------|--------|----------|------------------|
| PO  | 0 | 1 | RM[0:23] | 1nnnnn | xxxxxxxx | SVP64:EXT232-263 |
| PO  | 1 | 1 | RM[0:23] | nnnnnn | xxxxxxxx | SVP64:EXT000-063 |
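
The field boundaries in the table can be illustrated with a short splitting routine. This is a sketch under stated assumptions: the 64-bit encoding is treated as two 32-bit words (prefix holding bits 0-31, suffix holding bits 32-63), the value of the prefix `PO` is extracted but deliberately not checked, and the names `svp64_fields_t`, `split_svp64`, `is_ext232_263` and `is_ext000_063` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned po;        /* bits 0-5: Primary Opcode of the prefix */
    unsigned bit6;      /* bit 6 */
    unsigned bit7;      /* bit 7 */
    uint32_t rm;        /* bits 8-31: RM[0:23] */
    unsigned suffix_po; /* bits 32-37: Primary Opcode of the suffix */
} svp64_fields_t;

/* Split a prefix/suffix pair into the fields of the table (MSB0 numbering:
 * bit 0 is the most significant bit of the prefix word). */
static svp64_fields_t split_svp64(uint32_t prefix, uint32_t suffix) {
    svp64_fields_t f;
    f.po = prefix >> 26;
    f.bit6 = (prefix >> 25) & 1u;
    f.bit7 = (prefix >> 24) & 1u;
    f.rm = prefix & 0xFFFFFFu;
    f.suffix_po = suffix >> 26;
    return f;
}

/* First table row: bit 6 = 0, bit 7 = 1, and bit 32 (top bit of the
 * suffix Primary Opcode) set to 1. */
static int is_ext232_263(const svp64_fields_t *f) {
    return f->bit6 == 0 && f->bit7 == 1 && (f->suffix_po & 0x20u) != 0;
}

/* Second table row: bit 6 = 1, bit 7 = 1. */
static int is_ext000_063(const svp64_fields_t *f) {
    return f->bit6 == 1 && f->bit7 == 1;
}
```

Note that nothing here inspects the suffix beyond its Primary Opcode, which is in keeping with the requirement that the prefix does not change the decoding of the Defined Word that follows it.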
517
518 It is important to note that unlike EXT1xx 64-bit prefixed instructions
519 there is insufficient space in `RM` to provide identification of
520 any SVP64 Fields without first partially decoding the 32-bit suffix.
521 Similar to the "Forms" (X-Form, D-Form) the `RM` format is individually
522 associated with every instruction. However this still does not adversely
523 affect Multi-Issue Decoding because the identification of the *length*
524 of anything in the 64-bit space has been kept brutally simple (EXT009),
525 and further decoding of any number of 64-bit Encodings in parallel at
526 that point is fully independent.
527
528 Extreme caution and care must be taken when extending SVP64
529 in future, to not create unnecessary relationships between prefix and
530 suffix that could complicate decoding, adding latency.
531
532 ## Common RM fields
533
534 The following fields are common to all Remapped Encodings:
535
536 | Field Name | Field bits | Description |
537 |------------|------------|----------------------------------------|
538 | MASKMODE | `0` | Execution (predication) Mask Kind |
539 | MASK | `1:3` | Execution Mask |
540 | SUBVL | `8:9` | Sub-vector length |
541
542 The following fields are optional or encoded differently depending
543 on context after decoding of the Scalar suffix:
544
545 | Field Name | Field bits | Description |
546 |------------|------------|----------------------------------------|
547 | ELWIDTH | `4:5` | Element Width |
548 | ELWIDTH_SRC | `6:7` | Element Width for Source |
549 | EXTRA | `10:18` | Register Extra encoding |
550 | MODE | `19:23` | changes Vector behaviour |
551
* MODE changes the behaviour of the SV operation (result saturation,
  mapreduce)
* SUBVL groups elements together into vec2, vec3, vec4 for use in 3D
  and Audio/Video DSP work
* ELWIDTH and ELWIDTH_SRC override the instruction's destination and
  source operand widths
* MASK (and MASK_SRC) and MASKMODE provide predication (two types of
  sources: scalar INT and Vector CR).
* Bits 10 to 18 (EXTRA) are further decoded depending on the RM category
  for the instruction, which is determined only by decoding the Scalar 32
  bit suffix.
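
The bit positions in the two tables above may be sketched as a raw field extraction from the 24-bit `RM` value. This is illustrative only: it performs no context-dependent interpretation (the optional fields are extracted at their listed positions regardless of the RM category), and the names `rm_field`, `rm_raw_t` and `decode_rm_raw` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the inclusive MSB0 bit range start:end from the 24-bit RM field
 * (RM bit 0 is the most significant of the 24 bits). */
static unsigned rm_field(uint32_t rm, unsigned start, unsigned end) {
    assert(start <= end && end <= 23);
    unsigned nbits = end - start + 1u;
    return (rm >> (23u - end)) & ((1u << nbits) - 1u);
}

typedef struct {
    unsigned maskmode, mask, elwidth, elwidth_src, subvl, extra, mode;
} rm_raw_t;

/* Raw split of RM at the positions given in the tables. */
static rm_raw_t decode_rm_raw(uint32_t rm) {
    rm_raw_t f;
    f.maskmode    = rm_field(rm, 0, 0);
    f.mask        = rm_field(rm, 1, 3);
    f.elwidth     = rm_field(rm, 4, 5);
    f.elwidth_src = rm_field(rm, 6, 7);
    f.subvl       = rm_field(rm, 8, 9);
    f.extra       = rm_field(rm, 10, 18);
    f.mode        = rm_field(rm, 19, 23);
    return f;
}
```

A real decoder would first classify the suffix and only then decide how bits 4 to 7 and 10 to 23 are to be interpreted.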
563
564 Similar to Power ISA `X-Form` etc. EXTRA bits are given designations,
565 such as `RM-1P-3S1D` which indicates for this example that the operation
566 is to be single-predicated and that there are 3 source operand EXTRA
567 tags and one destination operand tag.
568
569 Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
570 or increased latency in some implementations due to lane-crossing.
571
572 ## Mode
573
Mode is an augmentation of SV behaviour. Different types of instructions
have different needs, similar to how the Power ISA v3.1 64 bit prefix 8LS
and MTRR formats apply to different instruction types. Modes include
Reduction, Iteration, arithmetic saturation, and Fail-First. More specific
details are given in each section and in the [[svp64/appendix]].
579
580 * For condition register operations see [[sv/cr_ops]]
581 * For LD/ST Modes, see [[sv/ldst]].
582 * For Branch modes, see [[sv/branches]]
583 * For arithmetic and logical, see [[sv/normal]]
584
585 ## ELWIDTH Encoding
586
587 Default behaviour is set to 0b00 so that zeros follow the convention
588 of `scalar identity behaviour`. In this case it means that elwidth
589 overrides are not applicable. Thus if a 32 bit instruction operates
590 on 32 bit, `elwidth=0b00` specifies that this behaviour is unmodified.
591 Likewise when a processor is switched from 64 bit to 32 bit mode,
592 `elwidth=0b00` states that, again, the behaviour is not to be modified.
593
594 Only when elwidth is nonzero is the element width overridden to the
595 explicitly required value.
596
597 ### Elwidth for Integers:
598
599 | Value | Mnemonic | Description |
600 |-------|----------------|------------------------------------|
601 | 00 | DEFAULT | default behaviour for operation |
602 | 01 | `ELWIDTH=w` | Word: 32-bit integer |
603 | 10 | `ELWIDTH=h` | Halfword: 16-bit integer |
604 | 11 | `ELWIDTH=b` | Byte: 8-bit integer |
605
This encoding is chosen such that the bit width may be computed as
`8<<(3-ew)`
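
The expression `8<<(3-ew)` yields the element width in bits for each row of the integer table. A minimal sketch follows, where the function name `elwidth_bits` is hypothetical and the result for `DEFAULT` (64) assumes an operation whose default width is 64-bit:

```c
#include <assert.h>

/* Element width in bits for a 2-bit integer ELWIDTH encoding:
 * 0b00 -> 64 (assuming a 64-bit default), 0b01 -> 32, 0b10 -> 16, 0b11 -> 8. */
static unsigned elwidth_bits(unsigned ew) {
    assert(ew <= 3);
    return 8u << (3u - ew);
}
```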
608
609 ### Elwidth for FP Registers:
610
611 | Value | Mnemonic | Description |
612 |-------|----------------|------------------------------------|
613 | 00 | DEFAULT | default behaviour for FP operation |
614 | 01 | `ELWIDTH=f32` | 32-bit IEEE 754 Single floating-point |
615 | 10 | `ELWIDTH=f16` | 16-bit IEEE 754 Half floating-point |
616 | 11 | `ELWIDTH=bf16` | Reserved for `bf16` |
617
618 Note:
619 [`bf16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
620 is reserved for a future implementation of SV
621
622 Note that any IEEE754 FP operation in Power ISA ending in "s" (`fadds`)
623 shall perform its operation at **half** the ELWIDTH then padded back out
624 to ELWIDTH. `sv.fadds/ew=f32` shall perform an IEEE754 FP16 operation
625 that is then "padded" to fill out to an IEEE754 FP32. When ELWIDTH=DEFAULT
626 clearly the behaviour of `sv.fadds` is performed at 32-bit accuracy
627 then padded back out to fit in IEEE754 FP64, exactly as for Scalar
628 v3.0B "single" FP. Any FP operation ending in "s" where ELWIDTH=f16 or
629 ELWIDTH=bf16 is reserved and must raise an illegal instruction (IEEE754
630 FP8 or BF8 are not defined).
631
632 ### Elwidth for CRs (no meaning)
633
Element-width overrides for CR Fields have no meaning. The bits
are therefore used for other purposes, or when Rc=1, the Elwidth
applies to the result being tested (a GPR or FPR), but not to the
Vector of CR Fields.
638
639 ## SUBVL Encoding
640
The default for SUBVL is 1 and its encoding is 0b00 to indicate that
SUBVL is effectively disabled (a SUBVL for-loop of only one element). This
lines up in combination with all other "default is all zeros" behaviour.
644
645 | Value | Mnemonic | Subvec | Description |
646 |-------|-----------|---------|------------------------|
647 | 00 | `SUBVL=1` | single | Sub-vector length of 1 |
648 | 01 | `SUBVL=2` | vec2 | Sub-vector length of 2 |
649 | 10 | `SUBVL=3` | vec3 | Sub-vector length of 3 |
650 | 11 | `SUBVL=4` | vec4 | Sub-vector length of 4 |
651
652 The SUBVL encoding value may be thought of as an inclusive range of a
653 sub-vector. SUBVL=2 represents a vec2, its encoding is 0b01, therefore
654 this may be considered to be elements 0b00 to 0b01 inclusive.
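
Put another way, the sub-vector length is the encoded value plus one. A minimal sketch, with the hypothetical function name `subvl_length`:

```c
#include <assert.h>

/* Sub-vector length for a 2-bit SUBVL encoding: the encoding is the index
 * of the last sub-element, so the length is one greater. */
static unsigned subvl_length(unsigned encoding) {
    assert(encoding <= 3);
    return encoding + 1u;
}
```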
655
656 ## MASK/MASK_SRC & MASKMODE Encoding
657
658 One bit (`MASKMODE`) indicates the mode: CR or Int predication. The two
659 types may not be mixed.
660
Special note: to disable predication `MASKMODE` must be set to zero
(selecting Integer Predication) in combination with the Integer
Predication `MASK` field also being set to 0b000. This has the
effect of enabling "all 1s" in the predicate mask, which is equivalent to
"not having any predication at all".
665
666 `MASKMODE` may be set to one of 2 values:
667
668 | Value | Description |
669 |-----------|------------------------------------------------------|
670 | 0 | MASK/MASK_SRC are encoded using Integer Predication |
671 | 1 | MASK/MASK_SRC are encoded using CR-based Predication |
672
Integer Twin predication has a second set of 3 bits that uses the same
encoding thus allowing either the same register (r3, r10 or r30) to be
used for both src and dest, or different regs (one for src, one for dest).
676
677 Likewise CR based twin predication has a second set of 3 bits, allowing
678 a different test to be applied.
679
680 Note that it is assumed that Predicate Masks (whether INT or CR) are
681 read *before* the operations proceed. In practice (for CR Fields)
682 this creates an unnecessary block on parallelism. Therefore, it is up
683 to the programmer to ensure that the CR fields used as Predicate Masks
684 are not being written to by any parallel Vector Loop. Doing so results
685 in **UNDEFINED** behaviour, according to the definition outlined in the
686 Power ISA v3.0B Specification.
687
688 Hardware Implementations are therefore free and clear to delay reading
689 of individual CR fields until the actual predicated element operation
690 needs to take place, safe in the knowledge that no programmer will have
691 issued a Vector Instruction where previous elements could have overwritten
692 (destroyed) not-yet-executed CR-Predicated element operations.
693
### Integer Predication (MASKMODE=0)

When the predicate mode bit is zero the 3 bits are interpreted as below.
Twin predication has an identical 3 bit field similarly encoded.

`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the
following meaning:

| Value | Mnemonic | Element `i` enabled if:      |
|-------|----------|------------------------------|
| 000   | ALWAYS   | predicate effectively all 1s |
| 001   | 1 << R3  | `i == R3`                    |
| 010   | R3       | `R3 & (1 << i)` is non-zero  |
| 011   | ~R3      | `R3 & (1 << i)` is zero      |
| 100   | R10      | `R10 & (1 << i)` is non-zero |
| 101   | ~R10     | `R10 & (1 << i)` is zero     |
| 110   | R30      | `R30 & (1 << i)` is non-zero |
| 111   | ~R30     | `R30 & (1 << i)` is zero     |

r10 and r30 are at the high end of temporary and unused registers,
so as not to interfere with register allocation from ABIs.

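The Integer Predication table can be read as a small decision function.
The following Python sketch is illustrative only: the name
`int_pred_enabled` and the `gpr` dictionary (a stand-in for the register
file, mapping GPR number to value) are assumptions made for this example
and are not part of the specification.

```python
def int_pred_enabled(mask: int, i: int, gpr: dict) -> bool:
    """Return True if element i is enabled under Integer Predication.

    mask: the 3-bit MASK (or MASK_SRC) value, 0b000..0b111
    gpr:  hypothetical register-file model, GPR number -> integer value
    """
    if mask == 0b000:                 # ALWAYS: predicate effectively all 1s
        return True
    if mask == 0b001:                 # 1 << R3: only the element numbered R3
        return i == gpr[3]
    # top two bits select the register, bottom bit selects inversion
    reg = {0b01: 3, 0b10: 10, 0b11: 30}[mask >> 1]
    bit_is_set = ((gpr[reg] >> i) & 1) == 1
    inverted = (mask & 1) == 1        # odd encodings test for a zero bit
    return bit_is_set != inverted


gpr = {3: 0b0101, 10: 0b0010, 30: 0}
print([int_pred_enabled(0b010, i, gpr) for i in range(4)])
```

With `MASK=0b010` and R3 holding `0b0101`, elements 0 and 2 are enabled
and elements 1 and 3 are skipped.
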
### CR-based Predication (MASKMODE=1)

When the predicate mode bit is one the 3 bits are interpreted as below.
Twin predication has an identical 3 bit field similarly encoded.

`MASK` and `MASK_SRC` may be set to one of 8 values, to provide the
following meaning:

| Value | Mnemonic | Element `i` is enabled if |
|-------|----------|---------------------------|
| 000   | lt       | `CR[offs+i].LT` is set    |
| 001   | nl/ge    | `CR[offs+i].LT` is clear  |
| 010   | gt       | `CR[offs+i].GT` is set    |
| 011   | ng/le    | `CR[offs+i].GT` is clear  |
| 100   | eq       | `CR[offs+i].EQ` is set    |
| 101   | ne       | `CR[offs+i].EQ` is clear  |
| 110   | so/un    | `CR[offs+i].FU` is set    |
| 111   | ns/nu    | `CR[offs+i].FU` is clear  |

`offs` is defined as CR32 (4x8) so as to mesh cleanly with Vectorised
Rc=1 operations (see below). Rc=1 operations start from CR8 (TBD).

The CR Predicates chosen must start on a boundary that Vectorised CR
operations can access cleanly, in full. With EXTRA2 restricting starting
points to multiples of 8 (CR0, CR8, CR16...) both Vectorised Rc=1 and
CR Predicate Masks have to be adapted to fit on these boundaries as well.

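The CR-based Predication table follows the same pattern: the top two bits
of the 3-bit value pick which CR Field bit is tested and the bottom bit
inverts the test. The Python sketch below is a hedged illustration. It
assumes each CR Field is modelled as a 4-bit integer with LT as the most
significant bit, followed by GT, EQ and FU (SO). The names
`cr_pred_enabled` and `cr_fields` are hypothetical, and `offs` is passed
in as a parameter rather than fixed.

```python
# Bit weights inside one 4-bit CR Field (assumed ordering: LT, GT, EQ, FU)
CR_BITS = {"LT": 0b1000, "GT": 0b0100, "EQ": 0b0010, "FU": 0b0001}
TESTED_BIT = ["LT", "GT", "EQ", "FU"]   # indexed by the top two mask bits


def cr_pred_enabled(mask: int, i: int, cr_fields: dict, offs: int) -> bool:
    """Return True if element i is enabled under CR-based Predication.

    mask:      the 3-bit MASK (or MASK_SRC) value, 0b000..0b111
    cr_fields: hypothetical model, CR Field number -> 4-bit integer
    offs:      CR Field number at which the predicate Vector starts
    """
    bit = CR_BITS[TESTED_BIT[mask >> 1]]
    is_set = (cr_fields[offs + i] & bit) != 0
    want_set = (mask & 1) == 0          # even encodings test "is set"
    return is_set == want_set


cr_fields = {32: 0b1000, 33: 0b0010}    # CR32 has LT set, CR33 has EQ set
print(cr_pred_enabled(0b000, 0, cr_fields, offs=32))   # lt test on CR32
print(cr_pred_enabled(0b101, 1, cr_fields, offs=32))   # ne test on CR33
```
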
## Extra Remapped Encoding <a name="extra_remap"> </a>

This section shows all instruction-specific fields in the Remapped Encoding
`RM[10:18]` for all instruction variants. Note that due to the very
tight space, the encoding mode is *not* included in the prefix itself.
The mode is "applied", similar to Power ISA "Forms" (X-Form, D-Form)
on a per-instruction basis, and, like "Forms", the modes are given a
designation (below) of the form `RM-nP-nSnD`. The full list of which
instructions use which remaps is in [[opcode_regs_deduped]].

**Please note the following**:

```
Machine-readable CSV files have been autogenerated which will make the
task of creating SV-aware ISA decoders, documentation, assembler tools,
compiler tools, Simulators and all other aspects of SVP64 easier
and less prone to mistakes. Please avoid manual re-creation of
information from the written specification wording in this chapter,
and use the CSV files or use the Canonical tool which creates the CSV
files, named sv_analysis.py. The information contained within
sv_analysis.py is considered to be part of this Specification, even
encoded as it is in python3.
```

The mappings are part of the SVP64 Specification in exactly the same
way as X-Form, D-Form. New Scalar instructions added to the Power ISA
will need a corresponding SVP64 Mapping, which can be derived by-rote
from examining the Register "Profile" of the instruction.

There are two categories: Single and Twin Predication. Due to space
considerations further subdivision of Single Predication is based on
whether the number of src operands is 2 or 3. With only 9 bits available
some compromises have to be made.

* `RM-1P-3S1D` Single Predication dest/src1/2/3, applies to 4-operand
  instructions (fmadd, isel, madd).
* `RM-1P-2S1D` Single Predication dest/src1/2 applies to 3-operand
  instructions (src1 src2 dest)
* `RM-2P-1S1D` Twin Predication (src=1, dest=1)
* `RM-2P-2S1D` Twin Predication (src=2, dest=1) primarily for LDST (Indexed)
* `RM-2P-1S2D` Twin Predication (src=1, dest=2) primarily for LDST Update

### RM-1P-3S1D

| Field Name | Field bits | Description                            |
|------------|------------|----------------------------------------|
| Rdest\_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding)   |
| Rsrc1\_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
| Rsrc2\_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
| Rsrc3\_EXTRA2 | `16:17` | extends Rsrc3 (R\*\_EXTRA2 Encoding)   |
| EXTRA2_MODE   | `18`    | used by `divmod2du` and `maddedu` for RS |

These are for instructions with 3 input operands and either 1 or 2
output operands. 3-in 1-out includes `madd RT,RA,RB,RC`. (DRAFT)
instructions such as `maddedu` have an implicit second destination, RS,
the selection of which is determined by bit 18.

### RM-1P-2S1D

| Field Name | Field bits | Description                               |
|------------|------------|-------------------------------------------|
| Rdest\_EXTRA3 | `10:12` | extends Rdest  |
| Rsrc1\_EXTRA3 | `13:15` | extends Rsrc1  |
| Rsrc2\_EXTRA3 | `16:18` | extends Rsrc2  |

These are for 2 operand 1 dest instructions, such as `add RT, RA,
RB`. However also included are unusual instructions with an implicit
dest that is identical to its src reg, such as `rlwimi`.

Normally, with instructions such as `rlwimi`, the scalar v3.0B ISA would
not have sufficient bit fields to allow an alternative destination.
With SV however this becomes possible. Therefore, the fact that the
dest is implicitly also a src should not mislead: due to the *prefix*
they are different SV regs.

* `rlwimi RA, RS, ...`
* Rsrc1_EXTRA3 applies to RS as the first src
* Rsrc2_EXTRA3 applies to RA as the second src
* Rdest_EXTRA3 applies to RA to create an **independent** dest.

With the addition of the EXTRA bits, the three registers
each may be *independently* made vector or scalar, and be independently
augmented to 7 bits in length.

### RM-2P-1S1D/2S

| Field Name | Field bits | Description                 |
|------------|------------|----------------------------|
| Rdest_EXTRA3 | `10:12` | extends Rdest             |
| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1             |
| MASK_SRC     | `16:18` | Execution Mask for Source |

`RM-2P-2S` is for `stw` etc. and uses Rsrc1 and Rsrc2 (two sources, no dest).

### RM-1P-2S1D

Single-predicate, three registers (2 read, 1 write).

| Field Name | Field bits | Description                 |
|------------|------------|----------------------------|
| Rdest_EXTRA3 | `10:12` | extends Rdest             |
| Rsrc1_EXTRA3 | `13:15` | extends Rsrc1             |
| Rsrc2_EXTRA3 | `16:18` | extends Rsrc2             |

### RM-2P-2S1D/1S2D/3S

The primary purpose for this encoding is for Twin Predication on LOAD
and STORE operations. See [[sv/ldst]] for detailed analysis.

**RM-2P-2S1D:**

| Field Name | Field bits | Description                     |
|------------|------------|----------------------------|
| Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding)   |
| Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
| Rsrc2_EXTRA2 | `14:15` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
| MASK_SRC     | `16:18` | Execution Mask for Source     |

**RM-2P-1S2D:**

For RM-2P-1S2D the EXTRA2 dest and src names are switched (Rsrc_EXTRA2
is in bits 10:11, Rdest1_EXTRA2 in 12:13)

| Field Name | Field bits | Description                     |
|------------|------------|----------------------------|
| Rsrc2_EXTRA2 | `10:11` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
| Rsrc1_EXTRA2 | `12:13` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
| Rdest_EXTRA2 | `14:15` | extends Rdest (R\*\_EXTRA2 Encoding)   |
| MASK_SRC     | `16:18` | Execution Mask for Source     |

**RM-2P-3S:**

For RM-2P-3S (to cover `stdx` etc.) the names are switched to 3 src:
Rsrc1_EXTRA2, Rsrc2_EXTRA2, Rsrc3_EXTRA2.

| Field Name | Field bits | Description                     |
|------------|------------|----------------------------|
| Rsrc1_EXTRA2 | `10:11` | extends Rsrc1 (R\*\_EXTRA2 Encoding)   |
| Rsrc2_EXTRA2 | `12:13` | extends Rsrc2 (R\*\_EXTRA2 Encoding)   |
| Rsrc3_EXTRA2 | `14:15` | extends Rsrc3 (R\*\_EXTRA2 Encoding)   |
| MASK_SRC     | `16:18` | Execution Mask for Source     |

Note also that LD with update indexed, which takes 2 src and
creates 2 dest registers (e.g. `lhaux RT,RA,RB`), does not have room
for 4 registers and also Twin Predication. Therefore these are treated as
RM-2P-2S1D and the src spec for RA is also used for the same RA as a dest.

Note that if ELWIDTH != ELWIDTH_SRC this may result in reduced performance
or increased latency in some implementations due to lane-crossing.

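As an illustration of how the tables in this section partition the nine
bits of `RM[10:18]`, here is a minimal Python sketch. It assumes the nine
bits are supplied as a single integer whose most significant bit is RM
bit 10 (MSB0 numbering, as used by the Field bits columns). The names
`LAYOUTS` and `split_extra` are hypothetical, and only four of the
designations are transcribed.

```python
# name -> (first bit, last bit), inclusive, in RM (MSB0) bit numbering
LAYOUTS = {
    "RM-1P-3S1D": {"Rdest_EXTRA2": (10, 11), "Rsrc1_EXTRA2": (12, 13),
                   "Rsrc2_EXTRA2": (14, 15), "Rsrc3_EXTRA2": (16, 17),
                   "EXTRA2_MODE": (18, 18)},
    "RM-1P-2S1D": {"Rdest_EXTRA3": (10, 12), "Rsrc1_EXTRA3": (13, 15),
                   "Rsrc2_EXTRA3": (16, 18)},
    "RM-2P-1S1D": {"Rdest_EXTRA3": (10, 12), "Rsrc1_EXTRA3": (13, 15),
                   "MASK_SRC": (16, 18)},
    "RM-2P-2S1D": {"Rdest_EXTRA2": (10, 11), "Rsrc1_EXTRA2": (12, 13),
                   "Rsrc2_EXTRA2": (14, 15), "MASK_SRC": (16, 18)},
}


def split_extra(designation: str, extra9: int) -> dict:
    """Split the 9-bit value of RM[10:18] into its named fields."""
    fields = {}
    for name, (first, last) in LAYOUTS[designation].items():
        width = last - first + 1
        shift = 18 - last             # distance of the field from RM bit 18
        fields[name] = (extra9 >> shift) & ((1 << width) - 1)
    return fields


print(split_extra("RM-1P-3S1D", 0b100111001))
print(split_extra("RM-2P-1S1D", 0b101010111))
```

A table-driven layout keeps the sketch close to the per-designation
tables. For real tooling the autogenerated CSV files remain the
authoritative source.
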
## R\*\_EXTRA2/3

EXTRA is the means by which two things are achieved:

1. Registers are marked as either Vector *or Scalar*
2. Register field numbers (limited typically to 5 bit)
   are extended in range, both for Scalar and Vector.

The register files are therefore extended:

* INT (GPR) is extended from r0-31 to r0-127
* FP (FPR) is extended from fp0-fp31 to fp0-fp127
* CR Fields are extended from CR0-7 to CR0-127

However due to pressure in `RM.EXTRA` not all these registers
are accessible by all instructions, particularly those with
a large number of operands (`madd`, `isel`).

In the following tables register numbers are constructed from the
register field (RA, FRA) of the standard v3.0B / v3.1B 32-bit
instruction and the EXTRA2 or EXTRA3 field from the SV Prefix,
determined by the specific RM-xx-yyyy designation for a given
instruction. The prefixing is arranged so that
interoperability between prefixing and nonprefixing of scalar registers
is direct and convenient (when the EXTRA field is all zeros).

A pseudocode algorithm explains the relationship, for INT/FP (see
[[svp64/appendix]] for CRs)

```
# bit numbering is MSB0: spec[0] is the Vector/Scalar tag
if extra3_mode:
    spec = EXTRA3
elif EXTRA2[0]:         # vector
    spec = EXTRA2 << 1  # same as EXTRA3, shifted
else:                   # scalar
    spec = EXTRA2       # zero-extended: only r0-r63 reachable
if spec[0]: # vector
    return (RA << 2) | spec[1:2]
else:       # scalar
    return (spec[1:2] << 5) | RA
```

Future versions may extend to 256 by shifting Vector numbering up.
Scalar will not be altered.

Note that in some cases the range of starting points for Vectors
is limited.

### INT/FP EXTRA3

If EXTRA3 is zero, it maps to "scalar identity" (scalar Power ISA field
naming).

Fields are as follows:

* Value: R_EXTRA3
* Mode: register is tagged as scalar or vector
* Range/Inc: the range of registers accessible from this EXTRA
  encoding, and the "increment" (accessibility). "/4" means
  that this EXTRA encoding may only give access (starting point)
  every 4th register.
* MSB..LSB: the bit field showing how the register opcode field
  combines with EXTRA to give (extend) the register number (GPR)

Encoding shown in LSB0: MSB down to LSB (MSB 6..0 LSB)

| Value     | Mode   | Range/Inc     | 6..0                |
|-----------|--------|---------------|---------------------|
| 000       | Scalar | `r0-r31`/1    | `0b00 RA`           |
| 001       | Scalar | `r32-r63`/1   | `0b01 RA`           |
| 010       | Scalar | `r64-r95`/1   | `0b10 RA`           |
| 011       | Scalar | `r96-r127`/1  | `0b11 RA`           |
| 100       | Vector | `r0-r124`/4   | `RA 0b00`           |
| 101       | Vector | `r1-r125`/4   | `RA 0b01`           |
| 110       | Vector | `r2-r126`/4   | `RA 0b10`           |
| 111       | Vector | `r3-r127`/4   | `RA 0b11`           |

### INT/FP EXTRA2

If EXTRA2 is zero, it maps to "scalar identity behaviour", i.e. Scalar
Power ISA register naming:

Encoding shown in LSB0: MSB down to LSB (MSB 6..0 LSB)

| Value    | Mode   | Range/inc     | 6..0      |
|----------|--------|---------------|-----------|
| 00       | Scalar | `r0-r31`/1    | `0b00 RA` |
| 01       | Scalar | `r32-r63`/1   | `0b01 RA` |
| 10       | Vector | `r0-r124`/4   | `RA 0b00` |
| 11       | Vector | `r2-r126`/4   | `RA 0b10` |

**Note that unlike in EXTRA3, in EXTRA2**:

* the GPR Vectors may only start from
  `r0, r2, r4, r6, r8` etc. and likewise FPR Vectors.
* the GPR Scalars may only go from `r0, r1, r2.. r63` and likewise FPR Scalars.

as there are insufficient bits to cover the full range.

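The INT/FP EXTRA3 and EXTRA2 tables can be cross-checked with a short
Python sketch that follows the table rows directly. The function name
`decode_reg` and its argument names are hypothetical, and the sketch
models only the 7-bit register number and the Vector/Scalar tag.

```python
def decode_reg(ra: int, extra: int, extra3_mode: bool):
    """Return (register number, is_vector) for an INT/FP operand.

    ra:    the 5-bit register field of the 32-bit instruction
    extra: the EXTRA3 (3-bit) or EXTRA2 (2-bit) value from the prefix
    """
    if extra3_mode:
        is_vector = (extra >> 2) == 1     # top bit tags Vector
        low = extra & 0b11
    else:
        is_vector = (extra >> 1) == 1
        # EXTRA2 has a single remaining bit. Per the EXTRA2 table it is
        # bit 1 of the number for Vectors (r0/r2 offsets) and bit 5 for
        # Scalars (r0-r31 or r32-r63).
        low = (extra & 1) << 1 if is_vector else (extra & 1)
    if is_vector:
        return (ra << 2) | low, True      # `RA 0bxx`
    return (low << 5) | ra, False         # `0bxx RA`


print(decode_reg(5, 0b011, extra3_mode=True))    # Scalar in r96-r127
print(decode_reg(1, 0b11, extra3_mode=False))    # Vector from the r2 set
```

With an all-zeros EXTRA field the function returns `ra` unchanged as a
Scalar, which is the "scalar identity" behaviour noted above.
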
### CR Field EXTRA3

CR Field encoding is essentially the same but made more complex due to CRs
being bit-based, because the application of SVP64 element-numbering applies
to the CR *Field* numbering not the CR register *bit* numbering.
Note that Vectors may only start from `CR0, CR4, CR8, CR12, CR16, CR20`...
and Scalars may only go from `CR0, CR1, ... CR31`

Encoding shown in LSB0: MSB down to LSB (MSB 8..5 4..2 1..0 LSB),
BA ranges are in MSB0.

For a 5-bit operand (BA, BB, BT):

| Value | Mode   | Range/Inc       | 8..5      | 4..2    | 1..0    |
|-------|--------|-----------------|-----------|---------|---------|
| 000   | Scalar | `CR0-CR7`/1     | 0b0000    | BA[0:2] | BA[3:4] |
| 001   | Scalar | `CR8-CR15`/1    | 0b0001    | BA[0:2] | BA[3:4] |
| 010   | Scalar | `CR16-CR23`/1   | 0b0010    | BA[0:2] | BA[3:4] |
| 011   | Scalar | `CR24-CR31`/1   | 0b0011    | BA[0:2] | BA[3:4] |
| 100   | Vector | `CR0-CR112`/16  | BA[0:2] 0 | 0b000   | BA[3:4] |
| 101   | Vector | `CR4-CR116`/16  | BA[0:2] 0 | 0b100   | BA[3:4] |
| 110   | Vector | `CR8-CR120`/16  | BA[0:2] 1 | 0b000   | BA[3:4] |
| 111   | Vector | `CR12-CR124`/16 | BA[0:2] 1 | 0b100   | BA[3:4] |

For a 3-bit operand (e.g. BFA):

| Value | Mode   | Range/Inc       | 6..3      | 2..0    |
|-------|--------|-----------------|-----------|---------|
| 000   | Scalar | `CR0-CR7`/1     | 0b0000    | BFA     |
| 001   | Scalar | `CR8-CR15`/1    | 0b0001    | BFA     |
| 010   | Scalar | `CR16-CR23`/1   | 0b0010    | BFA     |
| 011   | Scalar | `CR24-CR31`/1   | 0b0011    | BFA     |
| 100   | Vector | `CR0-CR112`/16  | BFA 0     | 0b000   |
| 101   | Vector | `CR4-CR116`/16  | BFA 0     | 0b100   |
| 110   | Vector | `CR8-CR120`/16  | BFA 1     | 0b000   |
| 111   | Vector | `CR12-CR124`/16 | BFA 1     | 0b100   |

### CR EXTRA2

CR encoding is essentially the same but made more complex due to CRs
being bit-based, because the application of SVP64 element-numbering applies
to the CR *Field* numbering not the CR register *bit* numbering.
Note that Vectors may only start from CR0, CR8, CR16, CR24, CR32...

Encoding shown in LSB0: MSB down to LSB (MSB 8..5 4..2 1..0 LSB),
BA ranges are in MSB0.

For a 5-bit operand (BA, BB, BC):

| Value | Mode   | Range/Inc      | 8..5      | 4..2    | 1..0    |
|-------|--------|----------------|-----------|---------|---------|
| 00    | Scalar | `CR0-CR7`/1    | 0b0000    | BA[0:2] | BA[3:4] |
| 01    | Scalar | `CR8-CR15`/1   | 0b0001    | BA[0:2] | BA[3:4] |
| 10    | Vector | `CR0-CR112`/16 | BA[0:2] 0 | 0b000   | BA[3:4] |
| 11    | Vector | `CR8-CR120`/16 | BA[0:2] 1 | 0b000   | BA[3:4] |

For a 3-bit operand (e.g. BFA):

| Value | Mode   | Range/Inc      | 6..3      | 2..0    |
|-------|--------|----------------|-----------|---------|
| 00    | Scalar | `CR0-CR7`/1    | 0b0000    | BFA     |
| 01    | Scalar | `CR8-CR15`/1   | 0b0001    | BFA     |
| 10    | Vector | `CR0-CR112`/16 | BFA 0     | 0b000   |
| 11    | Vector | `CR8-CR120`/16 | BFA 1     | 0b000   |

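The CR Field tables (EXTRA3 and EXTRA2) can likewise be expressed as a
short Python sketch. This is illustrative only: `decode_cr_field` and
`decode_cr_bit` are hypothetical names, and the bit placement is
transcribed from the table columns rather than taken from any reference
implementation.

```python
def decode_cr_field(bf: int, extra: int, extra3_mode: bool):
    """Return (CR Field number, is_vector) for a 3-bit operand such as BFA."""
    if extra3_mode:
        is_vector = (extra >> 2) == 1
        low = extra & 0b11
    else:
        is_vector = (extra >> 1) == 1
        # single remaining EXTRA2 bit, placed as in the EXTRA2 tables
        low = (extra & 1) << 1 if is_vector else (extra & 1)
    if is_vector:
        # BFA forms the top 3 bits, EXTRA the next 2, the bottom 2 are zero
        return (bf << 4) | (low << 2), True
    return (low << 3) | bf, False


def decode_cr_bit(ba: int, extra: int, extra3_mode: bool):
    """Return (CR bit number, is_vector) for a 5-bit operand such as BA."""
    # BA[0:2] (MSB0) selects the CR Field, BA[3:4] the bit within it
    field, is_vector = decode_cr_field(ba >> 2, extra, extra3_mode)
    return (field << 2) | (ba & 0b11), is_vector


print(decode_cr_field(7, 0b111, extra3_mode=True))   # Vector from the CR12 set
print(decode_cr_field(3, 0b01, extra3_mode=False))   # Scalar in CR8-CR15
```

The 5-bit case reuses the 3-bit case because only the top three bits of
BA take part in the Field numbering; the low two bits pass through
unchanged.
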
## Appendix

Now at its own page: [[svp64/appendix]]

--------

[[!tag standards]]

\newpage{}