[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever reached stable
general-purpose auto-vectorisation compiler support in the history of
computing, not with the combined resources of ARM, Intel, AMD, MIPS,
Sun Microsystems, SGI, Cray, and many more. Rather: GPUs have
ultra-specialist compilers that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

Therefore the question has to be asked: why on earth would anyone
consider this task, and what, in Computer Science, actually needs
solving?

The first hints are that whilst memory bitcells have not increased in
speed since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases: 3200 MHz DDR4 and even
faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z,
and OpenCAPI, all make an effort (all simply increasing the parallel
deployment of the underlying 150 MHz bitcells), but these efforts are
dwarfed by the two, nearly three, orders of magnitude increase in CPU
horsepower over the same timeframe. Seymour Cray, from his amazing
in-depth knowledge, predicted over two decades ago that the mismatch
would become a serious limitation. Some systems at the time of writing
are now approaching a *Gigabyte* of L4 Cache by way of compensation,
and as we know from experience even that will be considered inadequate
in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone
well: Aspex Microelectronics and Elixent are parallel processing
companies that very few have heard of, because their software stacks
were so specialist that they required heavy investment by customers to
utilise. D-Matrix and Graphcore are a modern incarnation of the exact
same "specialist parallel processing" mistake, betting heavily on AI
with Matrix and Convolution Engines that can do no other task. Aspex
only survived by being bought by Ericsson, where its specialised
suitability for massive wide Baseband FFTs saved it from going under.
Any "better AI mousetrap" that comes along will quickly render both
D-Matrix and Graphcore obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism, with a more Turing-complete ISA in each core, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is only
dwarfed by the amount of money poured into the software ecosystem in
order to make it accessible, and even then, GPU Programmers are a
specialist and rare (expensive) breed.

Second hints as to the answer emerge from an article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over three decades), whereby SIMD, an
O(N^6) opcode proliferation nightmare with the mantra "make it easy
for hardware engineers, let software sort out the mess", has literally
overwhelmed programmers. Worse than that, specialists charging
clients for Optimisation Services are finding that AVX-512, to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over
thirty years ago. Only the NEC SX Aurora, however, truly kept the Cray
Vector flame alive, until RISC-V RVV, then SVP64, and more recently
MRISC32 joined it. ARM's SVE/SVE2 is critically flawed (lacking the
Cray `setvl` instruction that makes a truly ubiquitous Vector ISA)
in ways that will become apparent over time as adoption increases.
In the meantime programmers are, in direct violation of ARM's advice
on how to use SVE2, trying desperately to use it as if it were Packed
SIMD NEON. The advice not to create SVE2 assembler that is hardcoded
to fixed widths is being disregarded, in favour of writing *multiple
identical implementations* of a function, each with a different
hardware width, and compelling software to choose one at runtime
after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with a wide, diverse depth in multiple ISAs:
one of the best and clearest analyses is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private, custom, secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product. Ubiquitous and common everyday usage
in scenarios currently occupied by ARM, Intel, AMD and IBM? Not so much.
Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in: there are
not actually that many truly viable Vector ISAs out there, because the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of
a Scalar ISA, and even a Scalar ISA takes over a decade to develop
compiler support for, and even longer to get the software ecosystem up
and running.

Which ISAs, therefore, have, or have had at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  160,000 IBM POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and with rapid adoption
  in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode conflict nightmare that plagued the Power ISA,
  two decades ago.
* Tensilica, Andes STAR and Western Digital for successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  astoundingly commercially successful
  multi-billion-unit mass volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, not permitting Open Licensing, survived in the early 90s
  only by doing a deal with Samsung for an in-perpetuity
  Royalty-free License, in exchange
  for GBP 3 million and legal protection through Samsung Research.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  is protected from interference as a result.
* x86, famous for an unprecedented
  Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors to see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as the x86
  lawsuits, but the 800lb Gorilla Syndrome seems not to have deterred
  one particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent
protection, software ecosystem, openness and pedigree all combine to
reduce risk and increase the chances of success, there is really only
one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**

The summary of advantages, then, of the Power ISA is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
  and more.
* IBM's extensive portfolio of 20+ years of patents is available,
  royalty-free, to protect implementors as long as they are also
  members of the OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing-class
  ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why would you want to?
As explained in the "SIMD Considered Harmful" article, Cray-style
Vector ISAs break the link between data element batches and the
underlying architectural back-end parallel processing capability.
Packed SIMD explicitly smashes that width right in the face of the
programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
that can add literally an order
of magnitude increase in the number of hand-written lines of assembler
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.
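
To make the contrast concrete, below is a minimal Python sketch (purely
illustrative, and not Power ISA, SVP64 or RVV syntax) of how a
`setvl`-style loop strip-mines an arbitrary-length array with a single
loop body, where fixed-width Packed SIMD needs a hand-written cascade
of tail cases for every width.

```python
# Illustrative model of Cray-style setvl strip-mining.  MAXVL stands in
# for whatever maximum vector length the hardware happens to have.
MAXVL = 8

def setvl(remaining, maxvl=MAXVL):
    """Model of a setvl instruction: report how many elements the
    hardware will process on this pass through the loop."""
    return min(remaining, maxvl)

def daxpy(a, x, y):
    """y[:] += a*x, strip-mined the Cray way: one loop body covers every
    possible length, including the awkward "tail"."""
    i, n = 0, len(x)
    while n > 0:
        vl = setvl(n)               # ask the hardware how many it will do
        for j in range(vl):         # conceptually one vector instruction
            y[i + j] += a * x[i + j]
        i += vl
        n -= vl

# Fixed-width Packed SIMD has no setvl: the same job needs an 8-wide body
# plus separate 4-, 2- and 1-wide tail cases, repeated per element width,
# which is the "order of magnitude" assembler blow-up described above.
x = [float(i) for i in range(19)]   # deliberately not a power of two
y = [1.0] * 19
daxpy(2.0, x, y)
print(y[:4], y[-1])
```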

Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding
200% more instructions than the base ISA seems unwise: at least,
it feels like there should be a better way. On close inspection of
RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector, *and* by predicate
mask management and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour?

Remarkably this is not a new idea. Intel's x86 `REP` instruction
gives the base concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue
Out-of-Order Execution Engine, the prefix would mark which of the
registers were to be treated as Scalar and which as Vector, then
perform a `REP`-like loop that jammed multiple scalar operations into
the Multi-Issue Execution Engine. The only reason that the team did
not take this forward into a commercial product was that they could
not work out how to cleanly do OoO multi-issue at the time.
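
A rough Python model of that concept is given below (hypothetical
register numbering and flags, purely for illustration: this is not the
actual SVP64 prefix format). The prefix carries a per-operand
Scalar/Vector marking plus a Vector Length, and the hardware expands a
single scalar opcode into VL element operations.

```python
# Toy model of a "vector prefix" applied to a scalar operation.
regs = [0] * 128                    # an enlarged scalar register file

def prefixed_op(op, rd, ra, rb, vl, vec):
    """Apply scalar operation 'op' VL times.  'vec' records, per operand,
    whether its register number steps each element (Vector) or stays
    put (Scalar)."""
    for i in range(vl):
        d = rd + (i if vec.get("rd") else 0)
        a = ra + (i if vec.get("ra") else 0)
        b = rb + (i if vec.get("rb") else 0)
        regs[d] = op(regs[a], regs[b])

regs[8:12]  = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]

# vector-vector add: all three register numbers step
prefixed_op(lambda a, b: a + b, 0, 8, 16, vl=4,
            vec={"rd": True, "ra": True, "rb": True})
print(regs[0:4])                    # [11, 22, 33, 44]

# scalar-vector add reuses the *same* scalar opcode: only the prefix differs
prefixed_op(lambda a, b: a + b, 4, 8, 16, vl=4,
            vec={"rd": True, "ra": True, "rb": False})
print(regs[4:8])                    # [11, 12, 13, 14]
```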

In its simplest form, then, this "prefixing" idea is a matter
of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
  context-switches with them
* Writing an awful lot of Specification Documentation
  (4 years and counting)

Once the basics of this concept have sunk in, early
advancements quickly follow naturally from analysis
of the problem-space:

* Expanding the size of the GPR, FPR and CR register files to
  provide 128 entries in each. This is a bare minimum for GPUs
  in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
  batching as possible.
* Predication (an absolutely critical component of a Vector ISA);
  the next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination,
  independently (sketched below, after this list).
* Element-width overrides: most Scalar ISAs today are 64-bit only,
  with primarily Load and Store being able to handle 8/16/32/64
  and sometimes 128-bit (quad-word), where Vector ISAs need to
  go as low as 8-bit arithmetic, even 8-bit Floating-Point for
  high-performance AI. Rather than waste opcode space adding all
  such operations at different bitwidths, let the prefix
  *redefine* the element width.
* "Reordering" of the assumption of linear sequential element
  access, for Matrices, rotations, transposition, Convolutions,
  DCT, FFT, Parallel Prefix-Sum and other common transformations
  that require significant programming effort in other ISAs.
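
As a simplified sketch of two of those augmentations (illustrative
only, not the normative SVP64 semantics), the Python below models
element-width overrides as a re-division of the same underlying
64-bit register file into smaller elements, and models twin
(source and destination) predication on a Vectorised move, which gives
compress/expand behaviour essentially for free.

```python
import struct

regfile = bytearray(128 * 8)        # 128 x 64-bit registers, as raw bytes

def write_elt(reg, idx, value, width):
    """Write element 'idx' of 'width' bytes, starting at register 'reg'.
    An element-width override simply re-divides the register file."""
    fmt = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}[width]
    struct.pack_into(fmt, regfile, reg * 8 + idx * width, value)

def read_elt(reg, idx, width):
    fmt = {1: "<B", 2: "<H", 4: "<I", 8: "<Q"}[width]
    return struct.unpack_from(fmt, regfile, reg * 8 + idx * width)[0]

# 8-bit element width: eight elements fit into one 64-bit register
for i in range(8):
    write_elt(8, i, i + 1, 1)       # "register 8" now holds bytes 1..8

def sv_mv_twin(rd, rs, vl, width, srcmask, dstmask):
    """Vectorised move with separate source and destination predicates:
    the source index advances over set bits of srcmask, the destination
    index over set bits of dstmask."""
    s = d = 0
    for _ in range(vl):
        while s < vl and not (srcmask >> s) & 1:
            s += 1                  # skip unselected source elements
        while d < vl and not (dstmask >> d) & 1:
            d += 1                  # skip unselected destination elements
        if s >= vl or d >= vl:
            break
        write_elt(rd, d, read_elt(rs, s, width), width)
        s += 1
        d += 1

# gather the odd-numbered source elements into the bottom of "register 0"
sv_mv_twin(0, 8, 8, 1, srcmask=0b10101010, dstmask=0b00001111)
print([read_elt(0, i, 1) for i in range(8)])    # [2, 4, 6, 8, 0, 0, 0, 0]
```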

All of these things come entirely from "Augmentation" of the Scalar
operation being prefixed: at no time is the Scalar operation itself
significantly altered.
From there, several more "Modes" can be added, including saturation,
which is needed for Audio and Video applications, "Reverse Gear",
which runs the Element Loop in reverse order (needed for Prefix
Sum), and more.
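
Both Modes are simple to model (again purely illustrative, not the
exact SVP64 definitions): saturation clamps a result to the
representable range instead of wrapping, and running the Element Loop
in reverse matters whenever a later element would otherwise be
overwritten before it is read.

```python
def saturate(value, bits=8):
    """Clamp an unsigned result to the representable range instead of
    letting it wrap -- the behaviour Audio/Video code wants."""
    return max(0, min((1 << bits) - 1, value))

print(saturate(200 + 100))          # 255, not (300 % 256) == 44

def element_loop(vl, reverse=False):
    """The element loop of a prefixed operation, optionally in Reverse Gear."""
    return range(vl - 1, -1, -1) if reverse else range(vl)

# In-place "shift by one element" (x[i] = x[i-1]) only works in Reverse
# Gear: forward order would smear x[0] across the whole vector.
x = [1, 2, 3, 4, 0]
for i in element_loop(len(x), reverse=True):
    if i > 0:
        x[i] = x[i - 1]
print(x)                            # [1, 1, 2, 3, 4]
```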

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details though.

* The traditional `iota` instruction may be
  synthesised with an overlapping add that stacks up incrementally
  and sequentially. Although it requires two instructions (one to
  start the sum-chain), the technique has the advantage of allowing
  increments by arbitrary amounts, and is not limited to addition,
  either.
* Big-integer addition (arbitrary-precision arithmetic) is an
  emergent characteristic of the carry-in, carry-out capability of
  the Power ISA `adde` instruction. `sv.adde` as a BigNum add
  naturally emerges from the
  sequential chaining of these scalar instructions (see the sketch
  after this list).
* The Condition Register Fields of the Power ISA make a great candidate
  for use as Predicate Masks, particularly when combined with
  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
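
The chaining is easy to see in a Python model of `adde` (illustrative
only: the limb width is reduced to 8 bits so that the carries are
visible, where the real instruction operates on 64-bit registers).
The Vectorised `sv.adde` simply issues this chain as one element loop.

```python
# Big-integer addition emerging from a chained carry-in/carry-out add.
LIMB_BITS = 8
LIMB_MASK = (1 << LIMB_BITS) - 1

def adde(a, b, carry_in):
    """One adde: add two limbs plus the incoming carry, and return the
    truncated result together with the carry out."""
    total = a + b + carry_in
    return total & LIMB_MASK, total >> LIMB_BITS

def sv_adde(a_limbs, b_limbs):
    """Element loop over adde: each element's carry-out feeds the next
    element's carry-in, which is exactly the sequential chaining that
    turns a Scalar add-with-carry into a BigNum add."""
    result, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        r, carry = adde(a, b, carry)
        result.append(r)
    return result, carry

# 0x01FF + 0x0001 as little-endian 8-bit limbs: the carry ripples upward
limbs, carry = sv_adde([0xFF, 0x01], [0x01, 0x00])
print(limbs, carry)                 # [0, 2] 0  ->  0x0200
```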

It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. One example is that transfer operations between the
Integer and Floating-point Scalar register files were dropped
approximately a decade ago after the Packed SIMD variants were considered
to be duplicates. With it being completely inappropriate to attempt to
Vectorise a Packed SIMD ISA designed 20 years ago with no Predication of
any kind, the Scalar ISA, a much better all-round candidate for
Vectorisation, is left anaemic. Fortunately, with the ISA Working Group
being willing to consider RFCs (Requests For Change), these omissions
have the potential to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and if CR Fields were the only predicates in SVP64
there would be pressure to start adding the exact same arithmetic and
logical operations that already exist in the Integer opcodes.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between CR Fields
and Integer Register files.

Beyond that, further extensions to the Power ISA become much more
domain-specific, such as adding bitmanipulation for Audio, Video
and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
`ATAN2` etc.) for 3D and other GPU workloads. The huge advantage here
of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
is *automatically* and inherently added to the Vector one as well, and
because these GPU and Video opcodes have been added to the CPU ISA,
Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and to making more effective use of resources?