[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is a task lasting over a decade
once simulators and compilers are included: OpenRISC 1200 took 12 years
to mature. No Vector or Packed SIMD ISA has ever reached stable
general-purpose auto-vectorisation compiler support in the history of
computing, not even with the combined resources of ARM, Intel, AMD,
MIPS, Sun Microsystems, SGI, Cray, and many more. Rather, GPUs have
ultra-specialist compilers that are designed from the ground up to
support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone consider this task,
and what, in Computer Science, actually needs solving?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases. DDR4 at 3200 MT/s, even
faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z,
and OpenCAPI all make an effort (each simply increasing the parallel
deployment of the underlying 150 MHz bitcells), but these efforts are
dwarfed by the two, nearly three, orders of magnitude increase in CPU
horsepower over the same period. Over two decades ago Seymour Cray,
drawing on his in-depth knowledge, predicted that the mismatch would
become a serious limitation. Some systems at the time of writing are
now approaching a *Gigabyte* of L4 Cache by way of compensation, and
as we know from experience even that will be considered inadequate in
future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone
well. Aspex Microelectronics and Elixent are parallel processing
companies that very few have heard of, because their software stack
was so specialist that it required heavy investment by customers to
utilise. D-Matrix and Graphcore are a modern incarnation of the exact
same "specialist parallel processing" mistake, betting heavily on AI
with Matrix and Convolution Engines that can do no other task. Aspex
only survived by being bought by Ericsson, where its specialised
suitability for massively wide Baseband FFTs saved it from going under.
Any "better AI mousetrap" that comes along will quickly render both
D-Matrix and Graphcore obsolete.

NVIDIA and other GPU vendors have taken a different approach again:
massive parallelism, with a more Turing-complete ISA in each processing
element, and dedicated slower parallel memory paths (GDDR5) suited to
the specific tasks of 3D, Parallel Compute and AI. The complexity of
this approach is dwarfed only by the amount of money poured into the
software ecosystem in order to make it accessible, and even then, GPU
Programmers are a specialist and rare (expensive) breed.

Second hints as to the answer emerge from an article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over 3 decades), whereby SIMD, an
Order(N^6) opcode proliferation nightmare with its mantra "make it
easy for hardware engineers, let software sort out the mess", is
literally overwhelming programmers. Worse than that, specialists who
charge clients for Optimisation Services are finding that AVX-512, to
take an example, is anything but optimal: overall performance of
AVX-512 actually *decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over
thirty years ago. Only the NEC SX Aurora however truly kept the Cray
Vector flame alive, until RISC-V RVV, then SVP64, and recently MRISC32
joined it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON. The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.
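
The role that a Cray-style `setvl` plays can be illustrated with a small
behavioural model. The sketch below is not SVP64, RVV or SVE2 syntax:
the Python function `setvl`, the parameter `maxvl` (standing in for the
hardware's maximum vector length) and `vector_add` are all hypothetical
names chosen for illustration. It shows only the general strip-mining
idea: the loop asks for as many elements as remain, is granted at most
`maxvl`, and therefore needs no per-width variant.

```python
def setvl(remaining: int, maxvl: int) -> int:
    """Hypothetical model of a Cray-style setvl.

    Returns the vector length for the next strip: the number of
    elements still to process, capped at the hardware maximum.
    """
    return min(remaining, maxvl)


def vector_add(a, b, maxvl):
    """Element-wise add of two equal-length lists, one strip at a time.

    `maxvl` is an assumption standing in for the hardware vector
    length; the loop body is identical whatever its value.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    i = 0
    n = len(a)
    while i < n:
        vl = setvl(n - i, maxvl)  # ask for the rest, get at most maxvl
        # Models a single vector instruction operating on vl elements.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out


# The same code gives the same answer on "hardware" of any width,
# including when the length is not a multiple of maxvl.
narrow = vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], maxvl=2)
wide = vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], maxvl=64)
```

Both calls return `[11, 22, 33, 44, 55]`. By contrast, a fixed-width
approach would need one copy of the loop per supported width plus a
runtime dispatch, which is the pattern the paragraph above describes.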

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience across multiple
ISAs. One of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product. Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in that there
are actually very few truly viable Vector ISAs out there, because the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
support, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, developed by MIPS Computer Systems (later owned by SGI) and only
  really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  and litigiously protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  160,000 IBM POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and with rapid adoption
  in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode conflict nightmare that plagued the Power ISA
  two decades ago.
* Tensilica, Andes STAR and Western Digital, as examples of successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  multi-billion-unit mass volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, which simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, which does not permit Open Licensing. They survived in the early
  90s only by doing a deal with Samsung for a Royalty-free License in
  exchange for GBP 3 million and legal protection under Samsung Research
  Division. Several large Corporations (Apple most notably) have licensed
  the ISA but not ARM designs: the barrier to entry is high and the ISA
  itself is protected from interference as a result.
* x86, famous for a Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Its lawsuits are nowhere near as well known
  as those over x86, but the 800lb Gorilla Syndrome seems not to have
  deterred one particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?" where patent protection,
software ecosystem, openness and pedigree all combine to reduce risk
and increase the chances of success, there is really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**