openpower/sv/SimpleV_rationale.mdwn

[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature.  No Vector or Packed SIMD ISA has ever reached stable
general-purpose auto-vectorisation compiler support in the history of
computing, not even with the combined resources of ARM, Intel, AMD, MIPS,
Sun Microsystems, SGI, Cray, and many more.  Rather, GPUs have
ultra-specialist compilers that are designed from the ground up to
support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone consider this task?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has allowed
significant apparent speed increases.  3200 MHz DDR4, even faster DDR5,
and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI
all make an effort, but these efforts are dwarfed by the two to nearly
three orders of magnitude increase in CPU horsepower.  Seymour Cray,
from his amazing in-depth knowledge, predicted that the mismatch would
become a serious limitation.  Some systems at the time of writing are
approaching a *Gigabyte* of L4 Cache by way of compensation, and as we
know from experience even that will be considered inadequate in future.

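As a rough back-of-envelope check of the figures quoted above, the
apparent memory speed-up can be compared against the quoted CPU gain.
This is only a sketch: it assumes the 150 MHz and 3200 MHz figures are
directly comparable rates, which is a simplification, and the variable
names are purely illustrative.

```python
import math

# Figures taken from the paragraph above. Treating them as directly
# comparable rates is a simplifying assumption made for illustration.
BITCELL_MHZ = 150     # approximate bitcell speed since the 90s
DDR4_MHZ = 3200       # quoted DDR4 interface speed

memory_gain = DDR4_MHZ / BITCELL_MHZ       # apparent speed-up factor
memory_orders = math.log10(memory_gain)    # same, in orders of magnitude

# CPU gain quoted above: two to nearly three orders of magnitude.
cpu_orders_low, cpu_orders_high = 2, 3

# Remaining CPU-to-memory gap, as a multiplicative factor.
gap_low = 10 ** (cpu_orders_low - memory_orders)
gap_high = 10 ** (cpu_orders_high - memory_orders)

print(f"memory gain: {memory_gain:.1f}x ({memory_orders:.2f} orders)")
print(f"CPU/memory gap: {gap_low:.1f}x to {gap_high:.1f}x")
```

Under these assumptions the memory interface gains a little over one
order of magnitude, leaving a gap of roughly one to two further orders
that caches are being asked to hide.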
Efforts to solve this problem by moving the processing closer to or
directly integrated into the memory have traditionally not gone well.
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stacks were so
specialist that they required heavy investment by customers to utilise.
D-Matrix and Graphcore are a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task.  Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massively wide Baseband FFTs saved it from going under.  Any "better
AI mousetrap" that comes along will quickly render both D-Matrix and
Graphcore obsolete.

Second hints as to the answer emerge from an article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (around 3 decades), whereby SIMD, an
Order(N^6) opcode proliferation nightmare with its mantra "make it
easy for hardware engineers, let software sort out the mess", is literally
overwhelming programmers.  Worse than that, specialists in charging
clients for Optimisation Services are finding that AVX-512, to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved, over thirty years ago, the opcode proliferation
nightmare.  Only the NEC SX Aurora however truly kept the Cray Vector
flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
it.  ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes for a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases.  In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON.  The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.

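The difference between the two approaches can be sketched in a few lines
of Python.  This is a toy model, not real assembler: `setvl`,
`vector_add` and `simd_add_fixed` are hypothetical names, and `maxvl`
stands in for whatever vector length a given piece of hardware happens
to implement.  The Cray-style loop asks on every pass how many elements
it may process, so one loop body serves every hardware width, whereas
the fixed-width loop bakes the width in and needs a separate tail loop.

```python
def setvl(remaining, maxvl):
    """Toy model of a Cray-style setvl: grant min(remaining, maxvl)."""
    return min(remaining, maxvl)


def vector_add(a, b, maxvl):
    """Strip-mined add: the same loop is correct for any maxvl."""
    n = len(a)
    out = []
    i = 0
    while i < n:
        vl = setvl(n - i, maxvl)      # the "hardware" picks the chunk size
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out


def simd_add_fixed(a, b, width):
    """Packed SIMD style: width is hardcoded, plus a scalar tail loop."""
    n = len(a)
    out = []
    i = 0
    while i + width <= n:             # full-width chunks only
        out.extend(a[i + k] + b[i + k] for k in range(width))
        i += width
    while i < n:                      # separate cleanup for the remainder
        out.append(a[i] + b[i])
        i += 1
    return out


a = list(range(10))
b = list(range(10, 20))
results = [vector_add(a, b, maxvl) for maxvl in (1, 4, 8, 64)]
```

In this model, every entry of `results` is identical even though each
call used a different `maxvl`, which is the property that is lost when a
routine is hardcoded to one width and duplicated per hardware size.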
Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience across multiple
ISAs: one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product.  Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much.  Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in that, actually,
there aren't that many truly viable Vector ISAs out there, as the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
support, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, created by MIPS Computer Systems (later acquired by SGI) and
  only really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  and litigiously protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  160,000 IBM POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and with rapid adoption
  in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode conflict nightmare that plagued the Power ISA
  two decades ago.
* Tensilica, Andes STAR and Western Digital, as examples of successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  multi-billion-unit mass volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, not permitting Open Licensing. They survived in the early 90s
  only by doing a deal with Samsung for a Royalty-free License in exchange
  for GBP 3 million and legal protection under Samsung Research Division.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  is protected from interference as a result.
* x86, famous for a Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Its lawsuits are nowhere near as well-known
  as those of x86, but the 800lb Gorilla Syndrome seems not to have
  deterred one particularly disingenuous group from performing illegal
  Reverse-Engineering.

Asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent protection,
software ecosystem, open-ness and pedigree all combine to reduce risk
and increase the chances of success, leaves really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**