[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever achieved stable
general-purpose auto-vectorisation compiler support in the history
of computing, not even with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. Rather, GPUs
have ultra-specialist compilers designed from the ground up to
support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone take on such a
task, and what, in Computer Science, actually needs solving?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases: 3200 MHz DDR4 and even
faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z,
and OpenCAPI all make an effort (all simply increasing the parallel
deployment of the underlying 150 MHz bitcells), but these efforts are
dwarfed by the two to three orders of magnitude increase in CPU
horsepower. Seymour Cray, from his amazing in-depth knowledge,
predicted over two decades ago that this mismatch would become a
serious limitation. Some systems at the time of writing are now
approaching a *Gigabyte* of L4 Cache by way of compensation, and as
we know from experience even that will be considered inadequate in
future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone
well. Aspex Microelectronics and Elixent are parallel processing
companies that very few have heard of, because their software stacks
were so specialist that they required heavy investment by customers
to utilise. D-Matrix and Graphcore are a modern incarnation of the
exact same "specialist parallel processing" mistake, betting heavily
on AI with Matrix and Convolution Engines that can do no other task.
Aspex only survived by being bought by Ericsson, where its
specialised suitability for massive wide Baseband FFTs saved it from
going under. Any "better AI mousetrap" that comes along will quickly
render both D-Matrix and Graphcore obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism with more Turing-complete ISAs in each core, and
dedicated slower parallel memory paths (GDDR5) suited to the specific
tasks of 3D, Parallel Compute and AI. The complexity of this approach
is dwarfed only by the amount of money poured into the software
ecosystem in order to make it accessible, and even then, GPU
Programmers are a specialist and rare (expensive) breed.

Second hints as to the answer emerge from the article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by the Industry
Giants ARM, Intel and AMD since the 90s (over three decades): SIMD,
an O(N^6) opcode proliferation nightmare whose mantra, "make it easy
for hardware engineers, let software sort out the mess", has
literally overwhelmed programmers. Worse than that, specialists who
charge clients for Optimisation Services are finding that AVX-512,
to take one example, is anything but optimal: overall performance
actually *decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over
thirty years ago. Only the NEC SX Aurora, however, truly kept the
Cray Vector flame alive, until RISC-V RVV and now SVP64, and recently
MRISC32, joined it. ARM's SVE/SVE2 is critically flawed (lacking the
Cray `setvl` instruction that makes a truly ubiquitous Vector ISA) in
ways that will become apparent over time as adoption increases. In
the meantime programmers are, in direct violation of ARM's advice on
how to use SVE2, trying desperately to use it as if it were Packed
SIMD NEON. The advice not to create SVE2 assembler that is hardcoded
to fixed widths is being disregarded, in favour of writing *multiple
identical implementations* of a function, each with a different
hardware width, and compelling software to choose one at runtime
after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally
experienced assembly-level developers with a wide, diverse depth in
multiple ISAs: one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance
use. That is not to say that its use in low-performance embedded
scenarios is not ideal: in private, custom, secretive commercial
usage it is perfect. Trinamic, an early adopter, is a classic case
study: they created their TMC2660 Stepper IC by replacing ARM with
RISC-V, saving themselves USD 1 in licensing royalties per product.
Ubiquitous and common everyday usage in scenarios currently occupied
by ARM, Intel, AMD and IBM? Not so much. Even though RISC-V has
Cray-style Vectors, the whole ISA is, unfortunately, fundamentally
flawed as far as power-efficient high performance is concerned.

Slowly, at this point, a realisation should be sinking in that there
are actually not that many truly viable Vector ISAs out there, as the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset
of a Scalar ISA, and even a Scalar ISA takes over a decade to develop
compiler support, and even longer to get the software ecosystem up
and running.

Which ISAs, therefore, have, or have had at one point in time, a
decent Software Ecosystem? Debian supports most of these, including
s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores, but with
  Oracle's reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network
  switches. Exceptions: Ingenic with embedded CPUs, and China ICT
  with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by two of
  the top three of the top500.org supercomputers using 160,000 IBM
  POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in the
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development and with
  rapid adoption in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the opcode conflict
  nightmare that plagued the Power ISA two decades ago.
* Tensilica, Andes STAR and Western Digital, for successful
  commercial proprietary ISAs: Tensilica in Baseband Modems, Andes in
  Audio DSPs, WD in HDDs and SSDs. These are all astoundingly
  commercially successful multi-billion-unit mass volume markets that
  almost nobody knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent
  lawsuits.
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for
  Members of the OpenPOWER Foundation, covered by Trademarks,
  permitting and encouraging contributions, and having software
  support for over 20 years.
* ARM, not permitting Open Licensing, survived in the early 90s only
  by doing a deal with Samsung for an in-perpetuity Royalty-free
  License, in exchange for GBP 3 million and legal protection through
  Samsung Research. Several large Corporations (Apple most notably)
  have licensed the ISA but not ARM designs: the barrier to entry is
  high and the ISA itself is protected from interference as a result.
* x86, famous for an unprecedented Court Ruling in 2004 where a Judge
  "banged heads together" and ordered AMD and Intel to stop wasting
  his time, make peace, and cross-license each other's patents.
  Anyone wishing to use the x86 ISA need only look at Transmeta, SiS,
  the Vortex x86, and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as the x86
  lawsuits, but the 800lb-Gorilla Syndrome seems not to have deterred
  one particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable
to base a Vector Supercomputing-class Extension on?", where patent
protection, software ecosystem, openness and pedigree all combine to
reduce risk and increase the chances of success, there is really only
one candidate.

**Of all of these, the one with the most going for it is the Power
ISA.**

The summary of advantages of the Power ISA, then, is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian and
  more.
* IBM's extensive 20+ years of patents is available, royalty-free, to
  protect implementors as long as they are also members of the
  OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing-class
  ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and
performance/watt, *without* losing all the advantages of a ubiquitous
software ecosystem?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why would you want to? As
explained in the "SIMD Considered Harmful" article, Cray-style Vector
ISAs break the link between data element batches and the underlying
architectural back-end parallel processing capability. Packed SIMD
explicitly smashes that width right in the face of the programmer and
expects them to like it. As the article immediately demonstrates, an
arbitrary-sized data set has to contend with an insane power-of-two
Packed SIMD cascade at both setup and teardown that can add literally
an order of magnitude to the number of hand-written lines of
assembler compared to a well-designed Cray-style Vector ISA with a
`setvl` instruction.
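
As an illustration, here is a minimal Python model (the `setvl`
behaviour is modelled, not real hardware semantics; `MAXVL` and the
function names are invented for this sketch) of why a `setvl`-based
loop needs no power-of-two tail cascade:

```python
MAXVL = 8  # assumed maximum hardware vector length for this sketch

def setvl(n_remaining):
    """Model of a Cray-style setvl: clamp the requested element
    count to the hardware maximum and return the Vector Length."""
    return min(n_remaining, MAXVL)

def daxpy(n, a, x, y):
    """y[i] += a * x[i]: one loop body handles *any* n, because
    setvl picks the batch size each time around the loop."""
    i = 0
    while i < n:
        vl = setvl(n - i)        # hardware chooses how many to do
        for j in range(vl):      # one Vector instruction in real HW
            y[i + j] += a * x[i + j]
        i += vl
    return y

# an awkward size such as 13 requires no special setup or teardown
print(daxpy(13, 2.0, list(range(13)), [1.0] * 13))
```

The equivalent Packed SIMD version would need separate code paths
for the largest SIMD width, each smaller power of two, and a scalar
tail, which is precisely the cascade the article describes.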

Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and the NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding 200%
more instructions than the base ISA seems unwise: at the very least,
it feels like there should be a better way. On close inspection of
RVV as an example, the basic arithmetic operations are massively
duplicated: scalar-scalar from the base is joined by both
scalar-vector and vector-vector variants, *plus* predicate mask
management and transfer instructions between all of these, which goes
a long way towards explaining why there are twice as many Vector
instructions in RISC-V as there are in the RV64GC base.

The question then becomes: with all the duplication of arithmetic
operations existing just to make the registers scalar or vector, why
not leverage the *existing* Scalar ISA with some sort of "context" or
prefix that augments its behaviour?

Remarkably, this is not a new idea. Intel's x86 `REP` instruction
gives the basic concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector
prefixing. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason that the team did not take this forward into
a commercial product was because they could not work out how to
cleanly do OoO multi-issue at the time.
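
A rough Python model of the concept (every name here is invented for
illustration; this is not the R8000 design or SVP64 encoding, just
the core idea): the prefix marks registers Scalar or Vector, and the
hardware `REP`-loops the one scalar opcode, bumping only the Vector
register numbers on each iteration.

```python
def prefixed_op(op, dest, srcs, vec_regs, VL, regfile):
    """Expand one prefixed scalar op into VL scalar operations.
    Registers in vec_regs step through the regfile; Scalar
    registers re-read the same location every iteration."""
    for i in range(VL):
        args = [regfile[r + i] if r in vec_regs else regfile[r]
                for r in srcs]
        d = dest + i if dest in vec_regs else dest
        regfile[d] = op(*args)

regs = [0] * 32
regs[1] = 10                      # scalar operand in r1
regs[8:12] = [1, 2, 3, 4]         # vector of 4 elements at r8..r11
# one prefixed scalar "add" becomes four scalar adds
prefixed_op(lambda a, b: a + b, dest=16, srcs=[1, 8],
            vec_regs={8, 16}, VL=4, regfile=regs)
print(regs[16:20])                # -> [11, 12, 13, 14]
```

Note how the scalar operand in r1 is re-used every iteration while
r8 and r16 advance: the single scalar opcode, plus context, does the
work of a dedicated scalar-vector Vector instruction.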

In its simplest form, then, this "prefixing" idea is a matter of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
  context-switches with them
* Writing an awful lot of Specification Documentation (4 years and
  counting)

Once the basics of this concept have sunk in, early advancements
quickly follow naturally from analysis of the problem-space:

* Predication is an absolutely critical component for a Vector ISA;
  the next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination,
  independently.
* Element-width overrides: most Scalar ISAs today are 64-bit only,
  with primarily Load and Store being able to handle 8/16/32/64 and
  sometimes 128-bit (quad-word), where Vector ISAs need to go as low
  as 8-bit arithmetic, even 8-bit Floating-Point for high-performance
  AI.
* "Reordering" of the assumption of linear sequential element access,
  for Matrices, rotations, transposition, Convolutions, DCT, FFT,
  Parallel Prefix-Sum and other common transformations that require
  significant programming effort in other ISAs.
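
To make the first point concrete, here is a sketch (invented
semantics, simplified from the general idea) of independent source
and destination predication on a move: active source elements are
read in order and written to active destination slots in order,
which makes compress and expand operations fall out essentially for
free.

```python
def twin_pred_copy(src, dst, src_mask, dst_mask):
    """Copy the i-th active src element to the i-th active dst
    slot, with independent masks on each side."""
    active_src = [i for i, m in enumerate(src_mask) if m]
    active_dst = [i for i, m in enumerate(dst_mask) if m]
    for s, d in zip(active_src, active_dst):
        dst[d] = src[s]
    return dst

# compress: gather the masked-in source elements to the front
print(twin_pred_copy([10, 20, 30, 40], [0, 0, 0, 0],
                     src_mask=[1, 0, 1, 1], dst_mask=[1, 1, 1, 1]))
# -> [10, 30, 40, 0]
```

With a single mask (source only, or destination only) each of
compress and expand would need its own dedicated instruction; with
two independent masks one operation covers both.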

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little.

* The traditional `iota` instruction may be synthesised with an
  overlapping add that stacks up incrementally and sequentially.
  Although it requires two instructions (one to start the sum-chain),
  the technique has the advantage of allowing increments by arbitrary
  amounts, and is not limited to addition, either.
* Big-integer addition (arbitrary-precision arithmetic) is an
  emergent characteristic of the carry-in, carry-out capability of
  the Power ISA `adde` instruction: `sv.adde` as a BigNum add
  naturally emerges from the sequential chaining of these scalar
  instructions.
* The Condition Register Fields of the Power ISA make a great
  candidate for use as Predicate Masks, particularly when combined
  with Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
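
The first two points above can be sketched in Python (modelled
element-by-element semantics, not real SVP64 mnemonics): `iota`
emerges from an add whose source overlaps its destination by one
element, and BigNum addition from `adde`'s carry-out of each 64-bit
limb feeding the carry-in of the next.

```python
def iota(VL, start=0, step=1):
    """vec[i] = vec[i-1] + step, executed sequentially: an
    overlapping Vectorised add, seeded by one prior instruction
    that places `start` in element 0."""
    vec = [start] * VL
    for i in range(1, VL):
        vec[i] = vec[i - 1] + step
    return vec

def sv_adde(a, b, width=64):
    """Chained adde over fixed-width limbs: the carry-out of
    element i becomes the carry-in of element i+1, so Vectorising
    the scalar adde yields arbitrary-precision addition."""
    mask, carry, out = (1 << width) - 1, 0, []
    for x, y in zip(a, b):          # one scalar adde per element
        s = x + y + carry
        out.append(s & mask)
        carry = s >> width
    return out, carry

print(iota(6))                          # -> [0, 1, 2, 3, 4, 5]
print(iota(4, start=5, step=10))        # -> [5, 15, 25, 35]
print(sv_adde([2**64 - 1, 0], [1, 0]))  # -> ([0, 1], 0), carry ripples
```

Note that `iota` here steps by an arbitrary amount from an arbitrary
start, exactly the flexibility the two-instruction synthesis buys
over a dedicated `iota` opcode.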