[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is a task of over a decade once
simulators and compilers are included: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever reached stable
general-purpose auto-vectorisation compiler support in the history of
computing, not even with the combined resources of ARM, Intel, AMD, MIPS,
Sun Microsystems, SGI, Cray, and many more. Rather, GPUs have
ultra-specialist compilers designed from the ground up to support
Vector/SIMD parallelism, with associated standards managed by the
Khronos Group, and it takes a multi-man-century development commitment
from multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone consider
this task?

The first hints are that, whilst memory bitcells have not increased in
speed since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases. 3200 MHz DDR4, even faster
DDR5, and other advanced Memory interfaces such as HBM, Gen-Z, and
OpenCAPI all make an effort, but these efforts are dwarfed by the two to
nearly three orders of magnitude increase in CPU horsepower. Seymour Cray,
from his amazing in-depth knowledge, predicted that the mismatch would
become a serious limitation. Some systems at the time of writing are
approaching a *Gigabyte* of L4 Cache by way of compensation, and, as we
know from experience, even that will be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone well.
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stacks were so
specialist that they required heavy investment by customers to utilise.
D-Matrix and Graphcore are modern incarnations of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massively wide Baseband FFTs saved it from going under. Any "better
AI mousetrap" that comes along will quickly render both D-Matrix and
Graphcore obsolete.

The second hints as to the answer emerge from an article,
"[SIMD Instructions Considered Harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over 3 decades), whereby SIMD, an
Order(N^6) opcode proliferation nightmare with its mantra "make it
easy for hardware engineers, let software sort out the mess", is literally
overwhelming programmers. Worse than that, specialists who charge
clients for Optimisation Services are finding that AVX-512, to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

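The multiplicative character of that proliferation can be sketched with a
simple count. The axes and their sizes below are illustrative assumptions,
not figures from the article or from any real ISA manual, and only three
axes are modelled where the Order(N^6) description implies more. The point
is only that a value added to any one axis is repeated across every
combination of the others.

```python
from itertools import product

# Hypothetical axes, for illustration only (not taken from any ISA manual).
OPERATIONS = ["add", "sub", "mul", "min", "max"]
ELEMENT_BITS = [8, 16, 32, 64]

def packed_simd_opcodes(register_widths):
    """Return one opcode name per (operation, element width, register width)."""
    return [
        f"{op}.e{ew}.r{rw}"
        for op, ew, rw in product(OPERATIONS, ELEMENT_BITS, register_widths)
    ]

# Each new register width repeats the whole operation x element-width grid.
generations = [[64], [64, 128], [64, 128, 256], [64, 128, 256, 512]]
opcode_counts = [len(packed_simd_opcodes(widths)) for widths in generations]

for widths, count in zip(generations, opcode_counts):
    print(f"register widths {widths}: {count} opcodes")
```
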
Cray-style Vectors solved the opcode proliferation nightmare over thirty
years ago. Only the NEC SX Aurora, however, truly kept the Cray Vector
flame alive, until RISC-V RVV, and now SVP64 and, recently, MRISC32,
joined it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes for a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON. The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.

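The strip-mining pattern that a Cray-style `setvl` enables can be sketched
in a few lines. This is a plain-Python model, not SVP64, RVV or SVE2
semantics: the `setvl` helper and the `maxvl` parameter are hypothetical
stand-ins for the instruction and for whatever vector length a given piece
of hardware provides. The same loop runs unchanged for every `maxvl`, and
the final, shorter iteration needs no separate tail code, which is what
the per-width duplicated functions described above are working around.

```python
def setvl(remaining, maxvl):
    """Hypothetical model of setvl: grant at most maxvl of the remaining elements."""
    return min(remaining, maxvl)

def vector_add(a, b, maxvl):
    """Strip-mined element-wise add; nothing here is tied to one hardware width."""
    assert len(a) == len(b)
    out = []
    i = 0
    while i < len(a):
        # The "hardware" decides how many elements this iteration processes.
        vl = setvl(len(a) - i, maxvl)
        # Stands in for a single vector add operating on vl elements.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out

a = list(range(10))
b = [10 * x for x in a]
# Identical code, four different assumed hardware vector lengths.
results = [vector_add(a, b, maxvl) for maxvl in (1, 4, 8, 64)]
```
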
Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience across multiple
ISAs. One of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that it is not ideal for low-performance embedded
scenarios: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, are a classic case study: they created their
TMC2660 Stepper IC, replacing ARM with RISC-V and saving themselves USD 1
in licensing royalties per product. Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in that, actually,
there aren't that many truly viable Vector ISAs out there, because the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and it takes over a decade to develop compiler support for
even a Scalar ISA, and longer still to get the software ecosystem up and
running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores, but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, created by MIPS Computer Systems (later owned by SGI) and only
  really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  protected by litigation.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  160,000 IBM POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and with rapid adoption in an uncontrolled fashion,
  is set on an unstoppable and inevitable trainwreck path
  to replicate the opcode conflict nightmare that plagued
  the Power ISA two decades ago.
* Tensilica, Andes STAR and Western Digital, as examples of successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  multi-billion-unit mass volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, which simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, which does not permit Open Licensing. They survived in the early
  90s only by doing a deal with Samsung for a Royalty-free License in
  exchange for GBP 3 million and legal protection under Samsung Research
  Division. Several large Corporations (Apple most notably) have licensed
  the ISA but not ARM designs: the barrier to entry is high and the ISA
  itself is protected from interference as a result.
* x86, famous for a Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Its lawsuits are nowhere near as well known
  as those over x86, but the 800lb Gorilla Syndrome seems not to have
  deterred one particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent protection,
software ecosystem, openness and pedigree all combine to reduce risk
and increase the chances of success, there is really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**