# SV Overview

This document provides a crash-course overview of why SV exists and how it works.

[SIMD is known to be harmful](https://www.sigarch.org/simd-instructions-considered-harmful/):
a seductive simplicity that is easy to implement in hardware. Even with predication added, SIMD only becomes more and more problematic with each power-of-two width increase introduced through an ISA revision.

Cray-style variable-length Vectors, on the other hand, result in stunningly elegant and small loops, with none of the alarmingly high setup and cleanup code, where at the hardware level the microarchitecture may execute from one element right the way through to tens of thousands at a time, yet the executable remains exactly the same. Unlike SIMD, power-of-two limitations are involved in neither the hardware nor the assembly code.

SimpleV takes the Cray-style Vector principle and applies it to a Scalar ISA, in the process allowing register file size increases using "tagging" (similar to how x86 originally extended registers from 32 to 64 bit).
The fundamentals are:

* The Program Counter gains a "Sub Counter" context.
* Vectorisation pauses the PC and runs a loop from 0 to VL-1
  (where VL is Vector Length).
* During the loop the instruction at the PC is executed *multiple*
  times.
* Some registers may be "tagged" as Vectors.
* During the loop, "Vector"-tagged registers are incremented by
  one with each iteration.
* Once the loop is completed, *only then* is the Program Counter
  allowed to move to the next instruction.

In OpenPOWER ISA v3.0B pseudo-code form, an ADD operation, assuming both sources and the destination have been "tagged" as Vectors, is simply:

    for i = 0 to VL-1:
        GPR(RT+i) = GPR(RA+i) + GPR(RB+i)

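For example (register numbers and VL purely illustrative): with VL set to 4, RT=8, RA=16 and RB=24, that single ADD behaves as four element-level adds:

    GPR(8)  = GPR(16) + GPR(24)
    GPR(9)  = GPR(17) + GPR(25)
    GPR(10) = GPR(18) + GPR(26)
    GPR(11) = GPR(19) + GPR(27)
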
At its heart, SimpleV really is this simple. On top of this fundamental basis, further refinements can be added which build up towards an extremely powerful Vector augmentation system, with very little in the way of additional opcodes required: simply external "context".

RISC-V RVV as of version 0.9 is over 180 instructions (more than the rest of RV64G combined). Over 95% of that functionality is added to OpenPOWER v3.0B, by SimpleV augmentation, with around 5 to 8 instructions.

Even in OpenPOWER v3.0B, the Scalar Integer ISA is around 150 instructions, with IEEE754 FP adding approximately 80 more. VSX, being based on SIMD design principles, adds somewhere in the region of 600 more. SimpleV again provides over 95% of VSX functionality, simply by augmenting the *Scalar* OpenPOWER ISA, and in the process providing features such as predication, which VSX is entirely missing.

The rest of this document builds on the above simple loop to add:

* Vector-Scalar, Scalar-Vector and Scalar-Scalar operation
* Traditional Vector operations (VSPLAT, VINSERT, VCOMPRESS etc.)
* Predication masks (essential for parallel if/else constructs)
* 8, 16 and 32 bit integer operations, and both FP16 and BF16
* Fail-on-first (introduced in ARM SVE2)
* A new concept: Data-dependent fail-first
* Condition-Register based *post-result* predication (also new)
* A completely new concept: "Twin Predication"

All of this is *without modifying the OpenPOWER v3.0B ISA*, except to add "wrapping context", similar to how v3.1B 64-bit Prefixes work.

In fairness to both VSX and RVV, there are things that are not provided by SimpleV:

* 128 bit or above arithmetic and other operations
  (VSX Rijndael and SHA primitives; VSX shuffle operations)
* register files above 128 entries
* Vector lengths over 64
* Unit-strided LD/ST and other comprehensive memory operations
  (struct-based LD/ST from RVV, for example)
* 32-bit instruction lengths. [[svp64]] had to be added as 64 bit.

These are not insurmountable limitations: over time, they may well be added in future revisions of SV.

# Adding Scalar / Vector

The first augmentation to the simple loop is to add the option for each source and destination to be either scalar or vector. As an FSM this is where our "simple" loop gains its first complexity.

    function op_add(rd, rs1, rs2) # add not VADD!
        int id=0, irs1=0, irs2=0;
        for i = 0 to VL-1:
            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
            if (!rd.isvec) break;
            if (rd.isvec) { id += 1; }
            if (rs1.isvec) { irs1 += 1; }
            if (rs2.isvec) { irs2 += 1; }

With some walkthroughs it becomes clear that the loop exits immediately after the first scalar destination result is written, and that when the destination is a Vector the loop proceeds to fill up the register file, sequentially, starting at `rd` and ending at `rd+VL-1`. The two source registers will, independently, either remain pointing at `rs1` or `rs2` respectively, or, if marked as Vectors, will march incrementally in lockstep, producing element results along the way, as the destination also progresses through elements.

In this way all eight permutations of Scalar and Vector behaviour are covered, although without predication the scalar-destination ones are reduced in usefulness. It does however clearly illustrate the principle.

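As one concrete permutation (register numbers purely illustrative): leaving `rs1` and `rs2` as Scalars while tagging `rd` as a Vector repeatedly writes the same scalar sum into successive destination registers, i.e. a VSPLAT-style broadcast:

    # rd=8 (Vector), rs1=16 (Scalar), rs2=24 (Scalar), VL=4
    ireg[8]  <= ireg[16] + ireg[24]
    ireg[9]  <= ireg[16] + ireg[24]
    ireg[10] <= ireg[16] + ireg[24]
    ireg[11] <= ireg[16] + ireg[24]
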
Note in particular: there is no separate Scalar add instruction, no separate Vector instruction, and no separate Scalar-Vector instruction, *and there is no separate Vector register file*: it is all the same instruction, on the standard register file, just with a loop. Scalar operation simply happens to set that loop size to one.

# Adding single predication

The next step is to add a single predicate mask. This is where it gets interesting. A predicate mask is a bitvector, each bit specifying, in order, whether the element operation is to be skipped ("masked out") or allowed. If no predicate is given, the mask is implicitly set to all 1s, so every element operation proceeds.

    function op_add(rd, rs1, rs2) # add not VADD!
        int id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        for i = 0 to VL-1:
            if (predval & 1<<i) # predication bit test
                ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                if (!rd.isvec) break;
            if (rd.isvec) { id += 1; }
            if (rs1.isvec) { irs1 += 1; }
            if (rs2.isvec) { irs2 += 1; }

The key modification is to skip the creation and storage of the result if the relevant predicate mask bit is clear, but *not the progression through the registers*.

A particularly interesting case is if the destination is scalar and the first few bits of the predicate are zero. The loop proceeds to increment the Vector *source* registers until the first nonzero predicate bit is found, whereupon a single result is computed, and *then* the loop exits. This therefore uses the predicate to perform Vector source indexing. This case was not possible without the predicate mask.

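A minimal sketch of that case (predicate value and register numbers purely illustrative): with a Scalar destination, Vector sources, VL=4 and a predicate of 0b0100, elements 0 and 1 are skipped while the source indices still advance, so the one result that is written comes from element 2:

    # rd=8 (Scalar), rs1=16 (Vector), rs2=24 (Vector), predval=0b0100
    ireg[8] <= ireg[18] + ireg[26]  # element 2 selected; loop then exits
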
If all three registers are marked as Vector then the "traditional" predicated Vector behaviour is provided. Yet, just as before, all other options are still provided, right the way back to the pure-scalar case, as if this were a straight OpenPOWER v3.0B non-augmented instruction.

# Predicate "zeroing" mode

Sometimes with predication it is OK to leave the masked-out element alone (not modify the result); sometimes it is better to zero the masked-out elements. Zeroing can be combined with bit-wise ORing to build up vectors from multiple predicate patterns. Our pseudocode therefore ends up as follows, to take that into account:

    function op_add(rd, rs1, rs2) # add not VADD!
        int id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        for i = 0 to VL-1:
            if (predval & 1<<i) # predication bit test
                ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                if (!rd.isvec) break;
            else if zeroing:
                ireg[rd+id] = 0
            if (rd.isvec) { id += 1; }
            if (rs1.isvec) { irs1 += 1; }
            if (rs2.isvec) { irs2 += 1; }

Many Vector systems provide either zeroing or non-zeroing, not both, because they usually have separate Vector register files. However SV sits on top of the standard register files and consequently there are advantages to both, so both are provided.

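A minimal sketch of the ORing idea mentioned above (the predicate values, and the names result_a, result_b and merged, are purely illustrative): two operations run with complementary predicates and zeroing enabled leave zeros in each other's masked-out element positions, so a plain bitwise OR merges the two partial result vectors:

    # result_a produced with predicate 0b0011, zeroing: elements 2,3 are zero
    # result_b produced with predicate 0b1100, zeroing: elements 0,1 are zero
    for i = 0 to VL-1:
        merged[i] = result_a[i] | result_b[i]
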
# Element Width overrides

All good Vector ISAs have the usual bitwidths for operations: 8/16/32/64 bit integer operations, and IEEE754 FP32 and FP64. Often also included are FP16 and, more recently, BF16. The *really* good Vector ISAs have variable-width vectors right down to bit level, and as high as 1024 bit arithmetic, as well as IEEE754 FP128.

SV has an "override" system that *changes* the bitwidth of operations that were intended by the original scalar ISA designers to be (for example) 64 bit operations. The override widths are 8, 16 and 32 for integer, and FP16 and FP32 for IEEE754 (with BF16 to be added in the future).

This presents a particularly intriguing conundrum given that the OpenPOWER Scalar ISA was never designed with, for example, 8 bit operations in mind, let alone Vectors of 8 bit elements.

The solution comes in terms of rethinking the definition of a Register File. The typical regfile may be considered to be a multi-ported SRAM block, 64 bits wide and usually 32 entries deep, to give 32 64-bit registers. Conceptually, to get our variable element width vectors, we may think of the regfile as being the following C-based data structure:

    typedef union {
        uint8_t  actual_bytes[8];
        uint8_t  b[0]; // array of type uint8_t
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0]; // default OpenPOWER ISA uses this
    } reg_t;

    reg_t int_regfile[128]; // SV extends to 128 regs

Then, our simple loop, instead of accessing the array of 64 bit registers with a computed index, would access the appropriate element of the appropriate type. Thus we have a series of overlapping conceptual arrays that each start at what is traditionally thought of as "a register". It then helps if we have a couple of routines:

    get_polymorphed_reg(reg, bitwidth, offset):
        reg_t res = 0;
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == default: # 64
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!reg.isvec): # scalar
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == default: # 64
            int_regfile[reg].l[offset] = val

These basically provide a convenient parameterised way to access the register file, at an arbitrary vector element offset and an arbitrary element width. Our first simple loop thus becomes:

    for i = 0 to VL-1:
        src1 = get_polymorphed_reg(rs1, srcwid, i)
        src2 = get_polymorphed_reg(rs2, srcwid, i)
        result = src1 + src2 # actual add here
        set_polymorphed_reg(rd, destwid, i, result)

Note that things such as zero/sign-extension (and predication) have been left out to illustrate the elwidth concept. Also note that it turns out to be important to perform the operation at the maximum bitwidth - `max(srcwid, destwid)` - such that any truncation, rounding errors or other artefacts may all be ironed out. This is particularly relevant when applying Saturation for Audio DSP workloads.

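A minimal sketch of why the maximum width matters, assuming a hypothetical unsigned saturating add with a 16 bit source width and an 8 bit destination override: performing the arithmetic at 16 bits and clamping on the way down to 8 bits preserves the saturation that an 8-bit-only add would have lost to wrap-around:

    # hypothetical unsigned saturating add: srcwid=16, destwid=8
    src1 = get_polymorphed_reg(rs1, 16, i)  # e.g. 200
    src2 = get_polymorphed_reg(rs2, 16, i)  # e.g. 100
    result = src1 + src2                    # 300, computed at 16 bits
    result = min(result, 255)               # clamp to the 8 bit maximum
    set_polymorphed_reg(rd, 8, i, result)   # stores 255, not 300 mod 256 = 44
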
Other than that, element width overrides, which can be applied to *either* source or destination or both, are pretty straightforward, conceptually. The details, for hardware engineers, involve byte-level write-enable lines, which is exactly what is used on SRAMs anyway. Compiler writers have to alter Register Allocation Tables to byte-level granularity.

One critical thing to note: upper parts of the underlying 64 bit register are *not zeroed out* by a write involving a non-aligned Vector Length. An 8 bit operation with VL=7 will *not* overwrite the 8th byte of the destination. It is extremely important to consider the register file as a byte-level store, not a 64-bit-level store.
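
A minimal sketch of that effect, using the conceptual routines above (the `results` array is purely illustrative): an 8 bit operation with VL=7 writes only bytes 0 to 6 of the underlying destination storage, leaving byte 7 holding whatever value it had before:

    # 8 bit elements, VL=7: bytes 0..6 of the destination are written
    for i = 0 to 6:
        set_polymorphed_reg(rd, 8, i, results[i])
    # int_regfile[rd].b[7] is left untouched (not zeroed)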