[[!tag whitepapers]]

**Revision History**

* v0.00 05may2021 first created

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is a task lasting over a decade,
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. Stable *general-purpose* auto-vectorisation compiler support
for a Vector or Packed SIMD ISA has never been achieved in the
history of computing, not even with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
to achieve high-performance optimisation where it matters*).
Instead, GPUs
have ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone consider
this task, and what, in Computer Science, actually needs solving?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has allowed
significant apparent speed increases: 3200 MHz DDR4 and even faster DDR5,
and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI,
all make an effort (all simply increasing the parallel deployment of
the underlying 150 MHz bitcells), but these efforts are dwarfed by the
two, nearly three, orders of magnitude increase in CPU horsepower
over the same timeframe. Seymour
Cray, from his amazing in-depth knowledge, predicted over two decades ago
that the mismatch would become a serious limitation. Some systems
at the time of writing are now approaching a *Gigabyte* of L4 Cache,
by way of compensation, and as we know from experience even that will
be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to the
memory, or integrating it directly into the memory, have traditionally
not gone well:
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stack was so
specialist that it required heavy investment by customers to utilise.
D-Matrix and Graphcore are a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under.
The huge risk is that any "better
AI mousetrap" created by an innovative competitor
will quickly render both D-Matrix's and
Graphcore's approaches obsolete.

NVIDIA and other GPU vendors have taken a different approach again: massive
parallelism with more Turing-complete ISAs in each, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is dwarfed only
by the amount of money poured into the software ecosystem in order
to make it accessible, and even then, GPU Programmers are a specialist
and rare (expensive) breed.

Second hints as to the answer emerge from an article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over 3 decades), whereby SIMD, an
Order(N^6) opcode proliferation nightmare with its mantra "make it
easy for hardware engineers, let software sort out the mess", is literally
overwhelming programmers. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved, over thirty years ago, the opcode proliferation
nightmare. Only the NEC SX Aurora however truly kept the Cray Vector
flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON. The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience of multiple ISAs:
one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product. Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in that, actually,
there aren't many truly viable Vector ISAs out there: the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
support, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, once owned by SGI and only really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  and litigiously protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  around 2 million IBM POWER9 Cores each.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and with rapid expansion
  in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode conflict nightmare that plagued the Power ISA
  two decades ago.
* Tensilica, Andes STAR and Western Digital, as examples of successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  astoundingly commercially successful
  multi-billion-unit mass volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, not permitting Open Licensing: they survived in the early 90s
  only by doing a deal with Samsung for an in-perpetuity
  Royalty-free License, in exchange
  for GBP 3 million and legal protection through Samsung Research.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  protected from interference as a result.
* x86, famous for an unprecedented
  Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
  but the 800lb Gorilla Syndrome seems not to have deterred one
  particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?" where patent protection,
software ecosystem, open-ness and pedigree all combine to reduce risk
and increase the chances of success, there is really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**

The summary of advantages, then, of the Power ISA is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
  and more.
* IBM's extensive 20+ years of patents are available, royalty-free,
  to protect implementors as long as they are also members of the
  OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing
  class ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why on earth would you want to?
As explained in the "SIMD Considered Harmful" article, Cray-style
Vector ISAs break the link between data element batches and the
underlying architectural back-end parallel processing capability.
Packed SIMD explicitly smashes that width right in the face of the
programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
that routinely increases, by literally an order
of magnitude, the number of hand-written lines of assembler
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.

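The way a `setvl`-style loop avoids the setup and teardown cascade can be sketched in Python. This is an illustrative model only: `MAXVL`, `setvl`, `vector_add` and `batch_lengths` are hypothetical names chosen for this sketch, and registers and memory are not modelled at all.

```python
MAXVL = 8  # assumed hardware maximum vector length (hypothetical value)

def setvl(remaining, maxvl=MAXVL):
    """Model of a Cray-style setvl: VL = min(remaining, MAXVL)."""
    return min(remaining, maxvl)

def vector_add(a, b):
    """Strip-mined element-wise add of two equal-length lists."""
    assert len(a) == len(b)
    result = []
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i)  # the hardware picks this batch's length
        # one "vector instruction" performs vl element operations
        result.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return result

def batch_lengths(n, maxvl=MAXVL):
    """Vector Lengths chosen by setvl when processing n elements."""
    lengths = []
    while n > 0:
        vl = setvl(n, maxvl)
        lengths.append(vl)
        n -= vl
    return lengths
```

With 19 elements and `MAXVL = 8`, `batch_lengths(19)` gives batches of 8, 8 and 3. The final partial batch goes through exactly the same loop body, where a Packed SIMD version would need a separate tail path for each narrower width.
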
Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding
roughly twice as many instructions as the base ISA already has seems
unwise: at least, it feels like there should be a better way. On
close inspection of RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector *and* predicate
mask management, and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour? Then, the Instruction Decode
phase is greatly simplified, reducing design complexity and leaving
plenty of headroom for further expansion.

Remarkably this is not a new idea. Intel's x86 `REP` instruction
gives the base concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason that the team did not take this forward
into a commercial product
was that they could not work out how to cleanly do OoO
multi-issue at the time.

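As a rough illustration of the prefixing idea, the `REP`-like loop can be modelled in a few lines of Python. The function name `sv_prefixed_add` and its `*_vec` flags are hypothetical and not taken from any specification. The register file is a plain list, and a Scalar destination, which a real design must treat specially, is deliberately left out.

```python
def sv_prefixed_add(regs, rt, ra, rb, vl, rt_vec=True, ra_vec=True, rb_vec=True):
    """Model of a prefixed scalar add: regs[RT] = regs[RA] + regs[RB].

    Each *_vec flag says whether that register number is the start of a
    Vector (advances with the element index) or a Scalar (stays fixed).
    """
    for i in range(vl):
        d = rt + (i if rt_vec else 0)
        a = ra + (i if ra_vec else 0)
        b = rb + (i if rb_vec else 0)
        regs[d] = regs[a] + regs[b]  # the unmodified scalar operation
    return regs
```

The scalar `add` itself is untouched: the prefix only decides how the register numbers step on each pass of the loop, so marking `rb` as Scalar gives a vector-plus-scalar add without a new opcode.
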
In its simplest form, then, this "prefixing" idea is a matter
of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
  context-switches with them
* Writing an awful lot of Specification Documentation
  (4 years and counting)

Once the basics of this concept have sunk in, early
advancements quickly follow naturally from analysis
of the problem-space:

* Expanding the size of GPR, FPR and CR register files to
  provide 128 entries in each. This is a bare minimum for GPUs
  in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
  batching as possible.
* Predication is an absolutely critical component for a Vector ISA.
  The next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination, independently.
* Element-width overrides: most Scalar ISAs today are 64-bit only,
  with primarily Load and Store being able to handle 8/16/32/64
  and sometimes 128-bit (quad-word), where Vector ISAs need to
  go as low as 8-bit arithmetic, even 8-bit Floating-Point for
  high-performance AI. Rather than waste opcode space adding all
  such operations at different bitwidths, let the prefix
  *redefine* the element width.
* "Reordering" of the assumption of linear sequential element
  access, for Matrices, rotations, transposition, Convolutions,
  DCT, FFT, Parallel Prefix-Sum and other common transformations
  that require significant programming effort in other ISAs.

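Of the advancements listed above, independent source and destination predication is perhaps the least intuitive. The Python sketch below shows one plausible reading of it for a simple register move; it is an assumption made for illustration, not a transcription of the SVP64 specification, and `twin_predicated_move` is a hypothetical name. The source and destination element indices advance separately, each skipping positions that its own mask disables.

```python
def twin_predicated_move(regs, rt, ra, vl, src_mask, dst_mask):
    """Copy elements from the vector at ra to the vector at rt.

    Bit i of src_mask enables source element i; bit j of dst_mask
    enables destination element j. The two indices advance
    independently, each skipping masked-out positions.
    """
    src, dst = 0, 0
    while True:
        while src < vl and not (src_mask >> src) & 1:
            src += 1  # skip masked-out source elements
        while dst < vl and not (dst_mask >> dst) & 1:
            dst += 1  # skip masked-out destination elements
        if src >= vl or dst >= vl:
            break
        regs[rt + dst] = regs[ra + src]
        src += 1
        dst += 1
    return regs
```

Under this reading, a sparse source mask with a full destination mask packs the selected elements together, and the opposite combination spreads them out again, so a single move covers both compress-style and expand-style behaviour.
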
All of these things come entirely from "Augmentation" of the Scalar operation
being prefixed: at no time is the Scalar operation significantly
altered.
From there, several more "Modes" can be added, including saturation,
which is needed for Audio and Video applications, "Reverse Gear"
which runs the Element Loop in reverse order (needed for Prefix
Sum), and more.

**What is missing from Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details though.

* The traditional `iota` instruction may be
  synthesised with an overlapping add, that stacks up incrementally
  and sequentially. Although it requires two instructions (one to
  start the sum-chain) the technique has the advantage of allowing
  increments by arbitrary amounts, and is not limited to addition,
  either.
* Big-integer addition (arbitrary-precision arithmetic) is an
  emergent characteristic from the carry-in, carry-out capability of
  the Power ISA `adde` instruction. `sv.adde` as a BigNum add
  naturally emerges from the
  sequential chaining of these scalar instructions.
* The Condition Register Fields of the Power ISA make a great candidate
  for use as Predicate Masks, particularly when combined with
  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.

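The first two points above can be modelled in Python. This is a sketch under stated assumptions: the function names, the 64-bit limb size and the least-significant-limb-first ordering are choices made for this illustration, and the register file is again a plain list.

```python
MASK64 = (1 << 64) - 1

def sv_addi_overlapping(regs, start, vl, step):
    """Vectorised add-immediate whose destination overlaps its source by
    one register: regs[start+1+i] = regs[start+i] + step."""
    for i in range(vl):
        regs[start + 1 + i] = regs[start + i] + step
    return regs

def sv_adde(a_limbs, b_limbs, carry_in=0):
    """Chained 64-bit add-with-carry, least-significant limb first."""
    result, carry = [], carry_in
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry
        result.append(total & MASK64)  # 64-bit result of this element
        carry = total >> 64            # carry-out feeds the next element
    return result, carry

def to_limbs(value, count):
    """Split a non-negative integer into count 64-bit limbs."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

In `sv_addi_overlapping` each element reads the result written by the previous one, which is what turns a plain add into an `iota`-like sequence once `regs[start]` has been set by an earlier instruction. In `sv_adde` the only state carried between elements is the carry bit, exactly as in a chain of scalar add-with-carry instructions.
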
It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. One example is that transfer operations between the
Integer and Floating-point Scalar register files were dropped approximately
a decade ago after the Packed SIMD variants were considered to be
duplicates. With it being completely inappropriate to attempt to Vectorise
a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
the Scalar ISA, a much better all-round candidate for Vectorisation, is
left anaemic.

A particular key instruction that is missing is `MV.X` which is
illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
expensive instruction, causing a huge swathe of Register Hazards
in one single hit, is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
Vectorised it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
usually so enormous, and so far away from Memory, that it is easier and
more efficient, architecturally, to provide these Indexing instructions.

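The `GPR(dest) = GPR(GPR(src))` formula and its Vectorised form can be sketched in Python as follows. The names `mv_x` and `sv_mv_x` are hypothetical, and the Vectorised form applies the formula element by element, which is an assumption about how it would be extended rather than a quotation of any specification.

```python
def mv_x(regs, rt, ra):
    """Scalar MV.X: GPR(rt) = GPR(GPR(ra))."""
    regs[rt] = regs[regs[ra]]
    return regs

def sv_mv_x(regs, rt, ra, vl):
    """Vectorised MV.X: each index register selects which register to copy."""
    # Read every source first so that overlapping vectors behave predictably
    # in this model (an assumption, not a statement about real hardware).
    gathered = [regs[regs[ra + i]] for i in range(vl)]
    for i, value in enumerate(gathered):
        regs[rt + i] = value
    return regs
```

A vector of indices holding `[11, 10, 9, 8]` therefore reverses the four registers starting at register 8. Any permutation, duplication or selection of elements is expressed the same way, purely by changing the index vector.
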
Fortunately, with the ISA Working Group being willing
to consider RFCs (Requests For Change) these omissions have the potential
to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and if CR Fields were the only predicates in SVP64
there would be pressure to start adding the exact same arithmetic and logical
operations that already exist in the Integer opcodes.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between CR Fields
and Integer Register files.

Beyond that, further extensions to the Power ISA become much more
domain-specific, such as adding bit-manipulation for Audio, Video
and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
`ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
is *automatically* and inherently added to the Vector one as well, and
because these GPU and Video opcodes have been added to the CPU ISA,
Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and making more effective use of resources?

# Simpler more compact programs

The first and most obvious saving is that, just as with any Vector
ISA, the amount of data processing requested
and controlled by each instruction is enormous, and leaves the
Decode and Issue Engines idle, as well as the L1 I-Cache. With
programs being smaller, chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller.

Even a Packed SIMD ISA could take limited advantage of a higher
bang-per-buck for limited specific workloads, as long as the
stripmining setup and teardown is not required. However a
2-wide Packed SIMD instruction has nowhere near as high a bang-per-buck
ratio as a 64-wide Vector Length.

Realistically, for general use cases however it is extremely common
to have the Packed SIMD setup and teardown. `strncpy` for VSX is an
astounding 240 hand-coded assembler instructions where it is around
12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
the case of the IBM POWER9 a little-known design flaw results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
sizes: SVP64 does not, as hinted at below.

Additional savings come in the form of `SVREMAP`. This is a hardware
index transformation system where the normally sequentially-linear
element access may be "Re-Mapped" to limited but algorithmic-tailored
commonly-used deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply or a 3x4 or
2x6 may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions. Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, or Logical
AND-and-OR. The flexibility is not only enormous, but the compactness
unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
around 11 instructions. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.

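The separation between the "Schedule" and the mathematical operation can be illustrated with a short Python sketch. It models only the concept: `matmul_schedule` and `apply_schedule` are hypothetical names, the row-major layout and loop order are assumptions, and nothing here reflects how `SVREMAP` is actually encoded or how the hardware orders its steps.

```python
from operator import add, mul, and_, or_

def matmul_schedule(rows, inner, cols):
    """Yield (result, left, right) element indices for a row-major
    rows x inner matrix multiplied by an inner x cols matrix."""
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield (i * cols + j, i * inner + k, k * cols + j)

def apply_schedule(schedule, acc, left, right, combine=mul, accumulate=add):
    """Run one scalar combine-and-accumulate per schedule entry."""
    for r, a, b in schedule:
        acc[r] = accumulate(acc[r], combine(left[a], right[b]))
    return acc
```

Passing `combine=and_, accumulate=or_` with the same schedule performs a Boolean matrix product instead of an integer one: the index generation does not change at all, only the scalar operation it is applied to.
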
There is no reason at all why future algorithmic schedules should not
be proposed as extensions to SVP64 (sorting algorithms,
compression algorithms, Sparse Data Sets, Graph Node walking
for example). Bear in mind that
the submission process will be
entirely at the discretion of the OpenPOWER Foundation ISA WG;
however, such proposals are encouraged and welcomed.