[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever achieved stable
general-purpose auto-vectorisation compiler support in the history
of computing, not even with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. Rather, GPUs
have ultra-specialist compilers designed from the ground up to
support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone take on such a
task, and what, in Computer Science, actually needs solving?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases: 3200 MHz DDR4 and even
faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z,
and OpenCAPI all make an effort (all simply increasing the parallel
deployment of the underlying 150 MHz bitcells), but these efforts are
dwarfed by the two to three orders of magnitude increase in CPU
horsepower. Seymour Cray, from his amazing in-depth knowledge,
predicted over two decades ago that this mismatch would become a
serious limitation. Some systems at the time of writing are now
approaching a *Gigabyte* of L4 Cache by way of compensation, and as
we know from experience even that will be considered inadequate in
future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone
well. Aspex Microelectronics and Elixent are parallel processing
companies that very few have heard of, because their software stacks
were so specialist that they required heavy investment by customers
to utilise. D-Matrix and Graphcore are a modern incarnation of the
exact same "specialist parallel processing" mistake, betting heavily
on AI with Matrix and Convolution Engines that can do no other task.
Aspex only survived by being bought by Ericsson, where its
specialised suitability for massive wide Baseband FFTs saved it from
going under. Any "better AI mousetrap" that comes along will quickly
render both D-Matrix and Graphcore obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism with more Turing-complete ISAs in each core, and
dedicated slower parallel memory paths (GDDR5) suited to the specific
tasks of 3D, Parallel Compute and AI. The complexity of this approach
is dwarfed only by the amount of money poured into the software
ecosystem in order to make it accessible, and even then, GPU
Programmers are a specialist and rare (expensive) breed.

Second hints as to the answer emerge from the article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by the Industry
Giants ARM, Intel and AMD since the 90s (over three decades): SIMD,
an O(N^6) opcode proliferation nightmare whose mantra, "make it easy
for hardware engineers, let software sort out the mess", has
literally overwhelmed programmers. Worse than that, specialists who
charge clients for Optimisation Services are finding that AVX-512,
to take one example, is anything but optimal: overall performance
actually *decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over
thirty years ago. Only the NEC SX Aurora, however, truly kept the
Cray Vector flame alive, until RISC-V RVV and now SVP64, and recently
MRISC32, joined it. ARM's SVE/SVE2 is critically flawed (lacking the
Cray `setvl` instruction that makes a truly ubiquitous Vector ISA) in
ways that will become apparent over time as adoption increases. In
the meantime programmers are, in direct violation of ARM's advice on
how to use SVE2, trying desperately to use it as if it were Packed
SIMD NEON. The advice not to create SVE2 assembler that is hardcoded
to fixed widths is being disregarded, in favour of writing *multiple
identical implementations* of a function, each with a different
hardware width, and compelling software to choose one at runtime
after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally
experienced assembly-level developers with a wide, diverse depth in
multiple ISAs: one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance
use. That is not to say that its use in low-performance embedded
scenarios is not ideal: in private, custom, secretive commercial
usage it is perfect. Trinamic, an early adopter, is a classic case
study: they created their TMC2660 Stepper IC by replacing ARM with
RISC-V, saving themselves USD 1 in licensing royalties per product.
Ubiquitous and common everyday usage in scenarios currently occupied
by ARM, Intel, AMD and IBM? Not so much. Even though RISC-V has
Cray-style Vectors, the whole ISA is, unfortunately, fundamentally
flawed as far as power-efficient high performance is concerned.

Slowly, at this point, a realisation should be sinking in that there
are actually not that many truly viable Vector ISAs out there, as the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset
of a Scalar ISA, and even a Scalar ISA takes over a decade to develop
compiler support, and even longer to get the software ecosystem up
and running.

Which ISAs, therefore, have, or have had at one point in time, a
decent Software Ecosystem? Debian supports most of these, including
s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores, but with
  Oracle's reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network
  switches. Exceptions: Ingenic with embedded CPUs, and China ICT
  with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by two of
  the top three of the top500.org supercomputers using 160,000 IBM
  POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in the
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development and with
  rapid adoption in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the opcode conflict
  nightmare that plagued the Power ISA two decades ago.
* Tensilica, Andes STAR and Western Digital, for successful
  commercial proprietary ISAs: Tensilica in Baseband Modems, Andes in
  Audio DSPs, WD in HDDs and SSDs. These are all astoundingly
  commercially successful multi-billion-unit mass volume markets that
  almost nobody knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent
  lawsuits.
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for
  Members of the OpenPOWER Foundation, covered by Trademarks,
  permitting and encouraging contributions, and having software
  support for over 20 years.
* ARM, not permitting Open Licensing, survived in the early 90s only
  by doing a deal with Samsung for an in-perpetuity Royalty-free
  License, in exchange for GBP 3 million and legal protection through
  Samsung Research. Several large Corporations (Apple most notably)
  have licensed the ISA but not ARM designs: the barrier to entry is
  high and the ISA itself is protected from interference as a result.
* x86, famous for an unprecedented Court Ruling in 2004 where a Judge
  "banged heads together" and ordered AMD and Intel to stop wasting
  his time, make peace, and cross-license each other's patents.
  Anyone wishing to use the x86 ISA need only look at Transmeta, SiS,
  the Vortex x86, and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as the x86
  lawsuits, but the 800lb-Gorilla Syndrome seems not to have deterred
  one particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable
to base a Vector Supercomputing-class Extension on?", where patent
protection, software ecosystem, openness and pedigree all combine to
reduce risk and increase the chances of success, there is really only
one candidate.

**Of all of these, the one with the most going for it is the Power
ISA.**

The summary of advantages of the Power ISA, then, is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian and
  more.
* IBM's extensive 20+ years of patents is available, royalty-free, to
  protect implementors as long as they are also members of the
  OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing-class
  ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and
performance/watt, *without* losing all the advantages of a ubiquitous
software ecosystem?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why would you want to? As
explained in the "SIMD Considered Harmful" article, Cray-style Vector
ISAs break the link between data element batches and the underlying
architectural back-end parallel processing capability. Packed SIMD
explicitly smashes that width right in the face of the programmer and
expects them to like it. As the article immediately demonstrates, an
arbitrary-sized data set has to contend with an insane power-of-two
Packed SIMD cascade at both setup and teardown that can add literally
an order of magnitude to the number of hand-written lines of
assembler compared to a well-designed Cray-style Vector ISA with a
`setvl` instruction.
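
As an illustration, here is a minimal Python model (the `setvl`
behaviour is modelled, not real hardware semantics; `MAXVL` and the
function names are invented for this sketch) of why a `setvl`-based
loop needs no power-of-two tail cascade:

```python
MAXVL = 8  # assumed maximum hardware vector length for this sketch

def setvl(n_remaining):
    """Model of a Cray-style setvl: clamp the requested element
    count to the hardware maximum and return the Vector Length."""
    return min(n_remaining, MAXVL)

def daxpy(n, a, x, y):
    """y[i] += a * x[i]: one loop body handles *any* n, because
    setvl picks the batch size each time around the loop."""
    i = 0
    while i < n:
        vl = setvl(n - i)        # hardware chooses how many to do
        for j in range(vl):      # one Vector instruction in real HW
            y[i + j] += a * x[i + j]
        i += vl
    return y

# an awkward size such as 13 requires no special setup or teardown
print(daxpy(13, 2.0, list(range(13)), [1.0] * 13))
```

The equivalent Packed SIMD version would need separate code paths
for the largest SIMD width, each smaller power of two, and a scalar
tail, which is precisely the cascade the article describes.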

Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and the NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding 200%
more instructions than the base ISA seems unwise: at the very least,
it feels like there should be a better way. On close inspection of
RVV as an example, the basic arithmetic operations are massively
duplicated: scalar-scalar from the base is joined by both
scalar-vector and vector-vector variants, *plus* predicate mask
management and transfer instructions between all of these, which goes
a long way towards explaining why there are twice as many Vector
instructions in RISC-V as there are in the RV64GC base.

The question then becomes: with all the duplication of arithmetic
operations existing just to make the registers scalar or vector, why
not leverage the *existing* Scalar ISA with some sort of "context" or
prefix that augments its behaviour?

Remarkably, this is not a new idea. Intel's x86 `REP` instruction
gives the basic concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector
prefixing. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason that the team did not take this forward into
a commercial product was because they could not work out how to
cleanly do OoO multi-issue at the time.
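
A rough Python model of the concept (every name here is invented for
illustration; this is not the R8000 design or SVP64 encoding, just
the core idea): the prefix marks registers Scalar or Vector, and the
hardware `REP`-loops the one scalar opcode, bumping only the Vector
register numbers on each iteration.

```python
def prefixed_op(op, dest, srcs, vec_regs, VL, regfile):
    """Expand one prefixed scalar op into VL scalar operations.
    Registers in vec_regs step through the regfile; Scalar
    registers re-read the same location every iteration."""
    for i in range(VL):
        args = [regfile[r + i] if r in vec_regs else regfile[r]
                for r in srcs]
        d = dest + i if dest in vec_regs else dest
        regfile[d] = op(*args)

regs = [0] * 32
regs[1] = 10                      # scalar operand in r1
regs[8:12] = [1, 2, 3, 4]         # vector of 4 elements at r8..r11
# one prefixed scalar "add" becomes four scalar adds
prefixed_op(lambda a, b: a + b, dest=16, srcs=[1, 8],
            vec_regs={8, 16}, VL=4, regfile=regs)
print(regs[16:20])                # -> [11, 12, 13, 14]
```

Note how the scalar operand in r1 is re-used every iteration while
r8 and r16 advance: the single scalar opcode, plus context, does the
work of a dedicated scalar-vector Vector instruction.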

In its simplest form, then, this "prefixing" idea is a matter of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
  context-switches with them
* Writing an awful lot of Specification Documentation (4 years and
  counting)

Once the basics of this concept have sunk in, early advancements
quickly follow naturally from analysis of the problem-space:

* Predication is an absolutely critical component for a Vector ISA;
  the next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination,
  independently.
* Element-width overrides: most Scalar ISAs today are 64-bit only,
  with primarily Load and Store being able to handle 8/16/32/64 and
  sometimes 128-bit (quad-word), where Vector ISAs need to go as low
  as 8-bit arithmetic, even 8-bit Floating-Point for high-performance
  AI.
* "Reordering" of the assumption of linear sequential element access,
  for Matrices, rotations, transposition, Convolutions, DCT, FFT,
  Parallel Prefix-Sum and other common transformations that require
  significant programming effort in other ISAs.
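
To make the first point concrete, here is a sketch (invented
semantics, simplified from the general idea) of independent source
and destination predication on a move: active source elements are
read in order and written to active destination slots in order,
which makes compress and expand operations fall out essentially for
free.

```python
def twin_pred_copy(src, dst, src_mask, dst_mask):
    """Copy the i-th active src element to the i-th active dst
    slot, with independent masks on each side."""
    active_src = [i for i, m in enumerate(src_mask) if m]
    active_dst = [i for i, m in enumerate(dst_mask) if m]
    for s, d in zip(active_src, active_dst):
        dst[d] = src[s]
    return dst

# compress: gather the masked-in source elements to the front
print(twin_pred_copy([10, 20, 30, 40], [0, 0, 0, 0],
                     src_mask=[1, 0, 1, 1], dst_mask=[1, 1, 1, 1]))
# -> [10, 30, 40, 0]
```

With a single mask (source only, or destination only) each of
compress and expand would need its own dedicated instruction; with
two independent masks one operation covers both.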

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little.

* The traditional `iota` instruction may be synthesised with an
  overlapping add that stacks up incrementally and sequentially.
  Although it requires two instructions (one to start the sum-chain),
  the technique has the advantage of allowing increments by arbitrary
  amounts, and is not limited to addition, either.
* Big-integer addition (arbitrary-precision arithmetic) is an
  emergent characteristic of the carry-in, carry-out capability of
  the Power ISA `adde` instruction: `sv.adde` as a BigNum add
  naturally emerges from the sequential chaining of these scalar
  instructions.
* The Condition Register Fields of the Power ISA make a great
  candidate for use as Predicate Masks, particularly when combined
  with Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
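
The first two points above can be sketched in Python (modelled
element-by-element semantics, not real SVP64 mnemonics): `iota`
emerges from an add whose source overlaps its destination by one
element, and BigNum addition from `adde`'s carry-out of each 64-bit
limb feeding the carry-in of the next.

```python
def iota(VL, start=0, step=1):
    """vec[i] = vec[i-1] + step, executed sequentially: an
    overlapping Vectorised add, seeded by one prior instruction
    that places `start` in element 0."""
    vec = [start] * VL
    for i in range(1, VL):
        vec[i] = vec[i - 1] + step
    return vec

def sv_adde(a, b, width=64):
    """Chained adde over fixed-width limbs: the carry-out of
    element i becomes the carry-in of element i+1, so Vectorising
    the scalar adde yields arbitrary-precision addition."""
    mask, carry, out = (1 << width) - 1, 0, []
    for x, y in zip(a, b):          # one scalar adde per element
        s = x + y + carry
        out.append(s & mask)
        carry = s >> width
    return out, carry

print(iota(6))                          # -> [0, 1, 2, 3, 4, 5]
print(iota(4, start=5, step=10))        # -> [5, 15, 25, 35]
print(sv_adde([2**64 - 1, 0], [1, 0]))  # -> ([0, 1], 0), carry ripples
```

Note that `iota` here steps by an arbitrary amount from an arbitrary
start, exactly the flexibility the two-instruction synthesis buys
over a dedicated `iota` opcode.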