[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever reached stable
general-purpose auto-vectorisation compiler support in the history of
computing, not with the combined resources of ARM, Intel, AMD, MIPS,
Sun Microsystems, SGI, Cray, and many more. Rather: GPUs have
ultra-specialist compilers that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

Therefore the question has to be asked: why on earth would anyone
consider this task, and what, in Computer Science, actually needs
solving?

The first hints are that whilst memory bitcells have not increased in
speed since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases: 3200 MHz DDR4 and even
faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z,
and OpenCAPI, all make an effort (all simply increasing the parallel
deployment of the underlying 150 MHz bitcells), but these efforts are
dwarfed by the two, nearly three, orders of magnitude increase in CPU
horsepower over the same timeframe. Seymour Cray, from his amazing
in-depth knowledge, predicted over two decades ago that the mismatch
would become a serious limitation. Some systems at the time of writing
are now approaching a *Gigabyte* of L4 Cache by way of compensation,
and as we know from experience even that will be considered inadequate
in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone
well: Aspex Microelectronics and Elixent are parallel processing
companies that very few have heard of, because their software stacks
were so specialist that they required heavy investment by customers to
utilise. D-Matrix and Graphcore are a modern incarnation of the exact
same "specialist parallel processing" mistake, betting heavily on AI
with Matrix and Convolution Engines that can do no other task. Aspex
only survived by being bought by Ericsson, where its specialised
suitability for massive wide Baseband FFTs saved it from going under.
Any "better AI mousetrap" that comes along will quickly render both
D-Matrix and Graphcore obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism, with a more Turing-complete ISA in each core, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is only
dwarfed by the amount of money poured into the software ecosystem in
order to make it accessible, and even then, GPU Programmers are a
specialist and rare (expensive) breed.

Second hints as to the answer emerge from an article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over three decades), whereby SIMD, an
O(N^6) opcode proliferation nightmare with the mantra "make it easy
for hardware engineers, let software sort out the mess", has literally
overwhelmed programmers. Worse than that, specialists charging
clients for Optimisation Services are finding that AVX-512, to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over
thirty years ago. Only the NEC SX Aurora, however, truly kept the Cray
Vector flame alive, until RISC-V RVV, then SVP64, and more recently
MRISC32 joined it. ARM's SVE/SVE2 is critically flawed (lacking the
Cray `setvl` instruction that makes a truly ubiquitous Vector ISA)
in ways that will become apparent over time as adoption increases.
In the meantime programmers are, in direct violation of ARM's advice
on how to use SVE2, trying desperately to use it as if it were Packed
SIMD NEON. The advice not to create SVE2 assembler that is hardcoded
to fixed widths is being disregarded, in favour of writing *multiple
identical implementations* of a function, each with a different
hardware width, and compelling software to choose one at runtime
after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with a wide, diverse depth in multiple ISAs:
one of the best and clearest analyses is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private, custom, secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product. Ubiquitous and common everyday usage
in scenarios currently occupied by ARM, Intel, AMD and IBM? Not so much.
Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in: there are
not actually that many truly viable Vector ISAs out there, because the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of
a Scalar ISA, and even a Scalar ISA takes over a decade to develop
compiler support for, and even longer to get the software ecosystem up
and running.

Which ISAs, therefore, have, or have had at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  160,000 IBM POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and with rapid adoption
  in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode conflict nightmare that plagued the Power ISA,
  two decades ago.
* Tensilica, Andes STAR and Western Digital for successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  astoundingly commercially successful
  multi-billion-unit mass volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, not permitting Open Licensing, survived in the early 90s
  only by doing a deal with Samsung for an in-perpetuity
  Royalty-free License, in exchange
  for GBP 3 million and legal protection through Samsung Research.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  is protected from interference as a result.
* x86, famous for an unprecedented
  Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors to see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as the x86
  lawsuits, but the 800lb Gorilla Syndrome seems not to have deterred
  one particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent
protection, software ecosystem, openness and pedigree all combine to
reduce risk and increase the chances of success, there is really only
one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**

The summary of advantages, then, of the Power ISA is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
  and more.
* IBM's extensive portfolio of 20+ years of patents is available,
  royalty-free, to protect implementors as long as they are also
  members of the OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing-class
  ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why would you want to?
As explained in the "SIMD Considered Harmful" article, Cray-style
Vector ISAs break the link between data element batches and the
underlying architectural back-end parallel processing capability.
Packed SIMD explicitly smashes that width right in the face of the
programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
that can add literally an order
of magnitude increase in the number of hand-written lines of assembler
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.
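
To make the contrast concrete, below is a minimal Python sketch (purely
illustrative, and not Power ISA, SVP64 or RVV syntax) of how a
`setvl`-style loop strip-mines an arbitrary-length array with a single
loop body, where fixed-width Packed SIMD needs a hand-written cascade
of tail cases for every width.

```python
# Illustrative model of Cray-style setvl strip-mining.  MAXVL stands in
# for whatever maximum vector length the hardware happens to have.
MAXVL = 8

def setvl(remaining, maxvl=MAXVL):
    """Model of a setvl instruction: report how many elements the
    hardware will process on this pass through the loop."""
    return min(remaining, maxvl)

def daxpy(a, x, y):
    """y[:] += a*x, strip-mined the Cray way: one loop body covers every
    possible length, including the awkward "tail"."""
    i, n = 0, len(x)
    while n > 0:
        vl = setvl(n)               # ask the hardware how many it will do
        for j in range(vl):         # conceptually one vector instruction
            y[i + j] += a * x[i + j]
        i += vl
        n -= vl

# Fixed-width Packed SIMD has no setvl: the same job needs an 8-wide body
# plus separate 4-, 2- and 1-wide tail cases, repeated per element width,
# which is the "order of magnitude" assembler blow-up described above.
x = [float(i) for i in range(19)]   # deliberately not a power of two
y = [1.0] * 19
daxpy(2.0, x, y)
print(y[:4], y[-1])
```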

Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding
200% more instructions than the base ISA seems unwise: at least,
it feels like there should be a better way. On close inspection of
RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector, *and* by predicate
mask management and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour?

Remarkably this is not a new idea. Intel's x86 `REP` instruction
gives the base concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue
Out-of-Order Execution Engine, the prefix would mark which of the
registers were to be treated as Scalar and which as Vector, then
perform a `REP`-like loop that jammed multiple scalar operations into
the Multi-Issue Execution Engine. The only reason that the team did
not take this forward into a commercial product was that they could
not work out how to cleanly do OoO multi-issue at the time.
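
A rough Python model of that concept is given below (hypothetical
register numbering and flags, purely for illustration: this is not the
actual SVP64 prefix format). The prefix carries a per-operand
Scalar/Vector marking plus a Vector Length, and the hardware expands a
single scalar opcode into VL element operations.

```python
# Toy model of a "vector prefix" applied to a scalar operation.
regs = [0] * 128                    # an enlarged scalar register file

def prefixed_op(op, rd, ra, rb, vl, vec):
    """Apply scalar operation 'op' VL times.  'vec' records, per operand,
    whether its register number steps each element (Vector) or stays
    put (Scalar)."""
    for i in range(vl):
        d = rd + (i if vec.get("rd") else 0)
        a = ra + (i if vec.get("ra") else 0)
        b = rb + (i if vec.get("rb") else 0)
        regs[d] = op(regs[a], regs[b])

regs[8:12]  = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]

# vector-vector add: all three register numbers step
prefixed_op(lambda a, b: a + b, 0, 8, 16, vl=4,
            vec={"rd": True, "ra": True, "rb": True})
print(regs[0:4])                    # [11, 22, 33, 44]

# scalar-vector add reuses the *same* scalar opcode: only the prefix differs
prefixed_op(lambda a, b: a + b, 4, 8, 16, vl=4,
            vec={"rd": True, "ra": True, "rb": False})
print(regs[4:8])                    # [11, 12, 13, 14]
```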

In its simplest form, then, this "prefixing" idea is a matter
of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
  context-switches with them
* Writing an awful lot of Specification Documentation
  (4 years and counting)

Once the basics of this concept have sunk in, early
advancements quickly follow naturally from analysis
of the problem-space:

* Expanding the size of the GPR, FPR and CR register files to
  provide 128 entries in each. This is a bare minimum for GPUs
  in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
  batching as possible.
* Predication (an absolutely critical component of a Vector ISA);
  the next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination,
  independently (sketched below, after this list).
* Element-width overrides: most Scalar ISAs today are 64-bit only,
  with primarily Load and Store being able to handle 8/16/32/64
  and sometimes 128-bit (quad-word), where Vector ISAs need to
  go as low as 8-bit arithmetic, even 8-bit Floating-Point for
  high-performance AI. Rather than waste opcode space adding all
  such operations at different bitwidths, let the prefix
  *redefine* the element width.
* "Reordering" of the assumption of linear sequential element
  access, for Matrices, rotations, transposition, Convolutions,
  DCT, FFT, Parallel Prefix-Sum and other common transformations
  that require significant programming effort in other ISAs.
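
As a simplified sketch of two of those augmentations (illustrative
only, not the normative SVP64 semantics), the Python below models
element-width overrides as a re-division of the same underlying
64-bit register file into smaller elements, and models twin
(source and destination) predication on a Vectorised move, which gives
compress/expand behaviour essentially for free.

```python
import struct

regfile = bytearray(128 * 8)        # 128 x 64-bit registers, as raw bytes

def write_elt(reg, idx, value, width):
    """Write element 'idx' of 'width' bytes, starting at register 'reg'.
    An element-width override simply re-divides the register file."""
    fmt = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}[width]
    struct.pack_into(fmt, regfile, reg * 8 + idx * width, value)

def read_elt(reg, idx, width):
    fmt = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}[width]
    return struct.unpack_from(fmt, regfile, reg * 8 + idx * width)[0]

# 8-bit element width: eight elements fit into one 64-bit register
for i in range(8):
    write_elt(8, i, i + 1, 1)       # "register 8" now holds bytes 1..8

def sv_mv_twin(rd, rs, vl, width, srcmask, dstmask):
    """Vectorised move with separate source and destination predicates:
    the source index advances over set bits of srcmask, the destination
    index over set bits of dstmask."""
    s = d = 0
    for _ in range(vl):
        while s < vl and not (srcmask >> s) & 1:
            s += 1                  # skip unselected source elements
        while d < vl and not (dstmask >> d) & 1:
            d += 1                  # skip unselected destination elements
        if s >= vl or d >= vl:
            break
        write_elt(rd, d, read_elt(rs, s, width), width)
        s += 1
        d += 1

# gather the odd-numbered source elements into the bottom of "register 0"
sv_mv_twin(0, 8, 8, 1, srcmask=0b10101010, dstmask=0b00001111)
print([read_elt(0, i, 1) for i in range(8)])    # [2, 4, 6, 8, 0, 0, 0, 0]
```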

All of these things come entirely from "Augmentation" of the Scalar
operation being prefixed: at no time is the Scalar operation itself
significantly altered.
From there, several more "Modes" can be added, including saturation,
which is needed for Audio and Video applications, "Reverse Gear",
which runs the Element Loop in reverse order (needed for Prefix
Sum), and more.
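
Both Modes are simple to model (again purely illustrative, not the
exact SVP64 definitions): saturation clamps a result to the
representable range instead of wrapping, and running the Element Loop
in reverse matters whenever a later element would otherwise be
overwritten before it is read.

```python
def saturate(value, bits=8):
    """Clamp an unsigned result to the representable range instead of
    letting it wrap -- the behaviour Audio/Video code wants."""
    return max(0, min((1 << bits) - 1, value))

print(saturate(200 + 100))          # 255, not (300 % 256) == 44

def element_loop(vl, reverse=False):
    """The element loop of a prefixed operation, optionally in Reverse Gear."""
    return range(vl - 1, -1, -1) if reverse else range(vl)

# In-place "shift by one element" (x[i] = x[i-1]) only works in Reverse
# Gear: forward order would smear x[0] across the whole vector.
x = [1, 2, 3, 4, 0]
for i in element_loop(len(x), reverse=True):
    if i > 0:
        x[i] = x[i - 1]
print(x)                            # [1, 1, 2, 3, 4]
```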

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details though.

* The traditional `iota` instruction may be
  synthesised with an overlapping add that stacks up incrementally
  and sequentially. Although it requires two instructions (one to
  start the sum-chain), the technique has the advantage of allowing
  increments by arbitrary amounts, and is not limited to addition,
  either.
* Big-integer addition (arbitrary-precision arithmetic) is an
  emergent characteristic of the carry-in, carry-out capability of
  the Power ISA `adde` instruction. `sv.adde` as a BigNum add
  naturally emerges from the
  sequential chaining of these scalar instructions (see the sketch
  after this list).
* The Condition Register Fields of the Power ISA make a great candidate
  for use as Predicate Masks, particularly when combined with
  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
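
The chaining is easy to see in a Python model of `adde` (illustrative
only: the limb width is reduced to 8 bits so that the carries are
visible, where the real instruction operates on 64-bit registers).
The Vectorised `sv.adde` simply issues this chain as one element loop.

```python
# Big-integer addition emerging from a chained carry-in/carry-out add.
LIMB_BITS = 8
LIMB_MASK = (1 << LIMB_BITS) - 1

def adde(a, b, carry_in):
    """One adde: add two limbs plus the incoming carry, and return the
    truncated result together with the carry out."""
    total = a + b + carry_in
    return total & LIMB_MASK, total >> LIMB_BITS

def sv_adde(a_limbs, b_limbs):
    """Element loop over adde: each element's carry-out feeds the next
    element's carry-in, which is exactly the sequential chaining that
    turns a Scalar add-with-carry into a BigNum add."""
    result, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        r, carry = adde(a, b, carry)
        result.append(r)
    return result, carry

# 0x01FF + 0x0001 as little-endian 8-bit limbs: the carry ripples upward
limbs, carry = sv_adde([0xFF, 0x01], [0x01, 0x00])
print(limbs, carry)                 # [0, 2] 0  ->  0x0200
```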

It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. One example is that transfer operations between the
Integer and Floating-point Scalar register files were dropped
approximately a decade ago after the Packed SIMD variants were considered
to be duplicates. With it being completely inappropriate to attempt to
Vectorise a Packed SIMD ISA designed 20 years ago with no Predication of
any kind, the Scalar ISA, a much better all-round candidate for
Vectorisation, is left anaemic. Fortunately, with the ISA Working Group
being willing to consider RFCs (Requests For Change), these omissions
have the potential to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and if CR Fields were the only predicates in SVP64
there would be pressure to start adding the exact same arithmetic and
logical operations that already exist in the Integer opcodes.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between CR Fields
and Integer Register files.

Beyond that, further extensions to the Power ISA become much more
domain-specific, such as adding bitmanipulation for Audio, Video
and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
`ATAN2` etc.) for 3D and other GPU workloads. The huge advantage here
of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
is *automatically* and inherently added to the Vector one as well, and
because these GPU and Video opcodes have been added to the CPU ISA,
Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and to making more effective use of resources?