openpower/sv/SimpleV_rationale.mdwn

   1 [[!tag whitepapers]]
   2
   3 **Revision History**
   4
   5 * v0.00 05may2021 first created
   6
   7 **Table of Contents**
   8
   9 [[!toc]]
  10
# Why in the 2020s would you invent a new Vector ISA?
  12
Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature.  No Vector or Packed SIMD ISA has ever reached stable
*general-purpose* auto-vectorisation compiler support in the
history of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
to achieve high-performance optimisation where it matters*).
Instead, GPUs
have ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies, to sustain them.
  27
This raises the question: why on earth would anyone consider
this task, and what, in Computer Science, actually needs solving?
  30
The first hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has allowed
significant apparent speed increases: 3200 MHz DDR4 and even faster DDR5.
Other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI
all make an effort (all simply increasing the parallel deployment of
the underlying 150 MHz bitcells), but these efforts are dwarfed by the
two, nearly three, orders of magnitude increase in CPU horsepower
over the same timeframe. Seymour
Cray, from his amazing in-depth knowledge, predicted over two decades ago
that the mismatch would become a serious limitation.  Some systems
at the time of writing are now approaching a *Gigabyte* of L4 Cache,
by way of compensation, and as we know from experience even that will
be considered inadequate in future.
  44
Efforts to solve this problem by moving the processing closer to or
directly integrated into the memory have traditionally not gone well:
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stack was so
specialist that it required heavy investment by customers to utilise.
D-Matrix and Graphcore are a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task.  Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under.
The huge risk is that any "better
AI mousetrap" created by an innovative competitor
will quickly render both D-Matrix's and
Graphcore's approaches obsolete.
  59
NVIDIA and other GPU vendors have taken a different approach again: massive
parallelism with more Turing-complete ISAs in each, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is dwarfed only
by the amount of money poured into the software ecosystem in order
to make it accessible, and even then, GPU Programmers are a specialist
and rare (expensive) breed.
  67
Second hints as to the answer emerge from an article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over 3 decades), whereby SIMD, an
Order(N^6) opcode proliferation nightmare with its mantra "make it
easy for hardware engineers, let software sort out the mess", is literally
overwhelming programmers.  Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.
  79
Cray-style Vectors solved, over thirty years ago, the opcode proliferation
nightmare.  Only the NEC SX Aurora however truly kept the Cray Vector
flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
it.  ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON.  The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.
  92
Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience of multiple ISAs:
one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.
 100
Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product.  Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power efficient high
performance is concerned.
 113
 114 Slowly, at this point, a realisation should be sinking in that, actually,
 115 there aren't as many really truly viable Vector ISAs out there, as the
 116 ones that are evolving in the general direction of Vectorisation are,
 117 in various completely different ways, flawed.
 118
 119 **Successfully identifying a limitation marks the beginning of an
 120 opportunity**
 121
 122 We are nowhere near done, however, because a Vector ISA is a superset of a
 123 Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
 124 support, and even longer to get the software ecosystem up and running.
 125
 126 Which ISAs, therefore, have or have had, at one point in time, a decent
 127 Software Ecosystem? Debian supports most of these including s390:
 128
 129 * SPARC, created by Sun Microsystems and all but abandoned by Oracle.
 130   Gaisler Research maintains the LEON Open Source Cores but with Oracle's
 131   reputation nobody wants to go near SPARC.
* MIPS, created by MIPS Computer Systems (later acquired by SGI) and
  only really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  protected by litigation.
 137 * ARM, well known in embedded and smartphone scenarios, very slowly
 138   making its way into data centres.
 139 * OpenRISC, an entirely Open ISA suitable for embedded systems.
 140 * s390, a Mainframe ISA very similar to Power.
 141 * Power ISA, a Supercomputing-class ISA, as demonstrated by
 142   two out of three of the top500.org supercomputers using
 143   around 2 million IBM POWER9 Cores each.
 144 * ARC, a competitor at the time to ARM, best known for use in
 145   Broadcom VideoCore IV.
 146 * RISC-V, with a software ecosystem heavily in development
 147   and with rapid expansion
 148   in an uncontrolled fashion, is set on an unstoppable
 149   and inevitable trainwreck path to replicate the
 150   opcode conflict nightmare that plagued the Power ISA,
 151   two decades ago.
 152 * Tensilica, Andes STAR and Western Digital for successful
 153   commercial proprietary ISAs: Tensilica in Baseband Modems,
 154   Andes in Audio DSPs, WD in HDDs and SSDs. These are all
 155   astoundingly commercially successful
 156   multi-billion-unit mass volume markets that almost nobody
 157   knows anything about. Included for completeness.
 158
 159 In order of least controlled to most controlled, the viable
 160 candidates for further advancement are:
 161
* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
 164 * RISC-V, touted as "Open" but actually strictly controlled under
 165   Trademark License: too new to have adequate patent pool protection,
 166   as evidenced by multiple adopters having been hit by patent lawsuits.
 167 * MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
 168 * Power ISA: protected by IBM's extensive patent portfolio for Members
 169   of the OpenPOWER Foundation, covered by Trademarks, permitting
 170   and encouraging contributions, and having software support for over
 171   20 years.
* ARM, which does not permit Open Licensing. It survived in the early 90s
  only by doing a deal with Samsung for an in-perpetuity
  Royalty-free License, in exchange
  for GBP 3 million and legal protection through Samsung Research.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  protected from interference as a result.
* x86, famous for an unprecedented
  Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors, and see how they fared.
 185 * s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
 186   but the 800lb Gorilla Syndrome seems not to have deterred one
 187   particularly disingenuous group from performing illegal
 188   Reverse-Engineering.
 189
 190 By asking the question, "which ISA would be the best and most stable to
 191 base a Vector Supercomputing-class Extension on?" where patent protection,
 192 software ecosystem, open-ness and pedigree all combine to reduce risk
 193 and increase the chances of success, there is really only one candidate.
 194
**Of all of these, the one with the most going for it is the Power ISA.**
 196
 197 The summary of advantages, then, of the Power ISA is that:
 198
* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
  and more.
* IBM's extensive patent portfolio, built up over 20+ years, is
  available, royalty-free, to protect implementors as long as they are
  also members of the OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing
  class ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of OPF.
 209
From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?
 215
 216 # How do you turn a Scalar ISA into a Vector one?
 217
 218 The most obvious question before that is: why on earth would you want to?
 219 As explained in the "SIMD Considered Harmful" article, Cray-style
 220 Vector ISAs break the link between data element batches and the
 221 underlying architectural back-end parallel processing capability.
 222 Packed SIMD explicitly smashes that width right in the face of the
 223 programmer and expects them to like it.  As the article immediately
 224 demonstrates, an arbitrary-sized data set has to contend with
 225 an insane power-of-two Packed SIMD cascade at both setup and teardown
 226 that routinely adds literally an order
 227 of magnitude increase in the number of hand-written lines of assembler
 228 compared to a well-designed Cray-style Vector ISA with a `setvl`
 229 instruction.
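
To make the contrast concrete, here is a minimal Python sketch, not taken from any ISA specification. `setvl` is modelled as a hypothetical function that clamps the remaining element count to a hardware maximum (`maxvl`, an assumed parameter), and `packed_simd_chunks` simply lists the fixed widths that a power-of-two Packed SIMD cascade would need separate code paths for.

```python
def setvl(remaining, maxvl):
    # Hypothetical model of `setvl`: clamp the requested element count
    # to the hardware maximum and return the Vector Length to use.
    return min(remaining, maxvl)


def vector_add(a, b, maxvl=8):
    # One loop body handles every data size: no setup or teardown code.
    assert len(a) == len(b)
    out = []
    done = 0
    while done < len(a):
        vl = setvl(len(a) - done, maxvl)
        out.extend(x + y for x, y in zip(a[done:done + vl], b[done:done + vl]))
        done += vl
    return out


def packed_simd_chunks(n, widths=(8, 4, 2, 1)):
    # Fixed-width Packed SIMD: each width used below needs its own
    # hand-written code path to cover a data set of arbitrary size n.
    chunks = []
    for w in widths:
        while n >= w:
            chunks.append(w)
            n -= w
    return chunks
```

With 11 elements the `setvl` loop runs the same body three times at `maxvl=4`, whereas the cascade model needs an 8-wide, a 2-wide and a 1-wide path.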
 230
Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if.  Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC.  Adding
200% more instructions than the base ISA seems unwise: at least,
it feels like there should be a better way.  On
close inspection of RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector *and* predicate
mask management, and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.
 244
 245 The question then becomes: with all the duplication of arithmetic
 246 operations just to make the registers scalar or vector, why not
 247 leverage the *existing* Scalar ISA with some sort of "context"
 248 or prefix that augments its behaviour?  Then, the Instruction Decode
 249 phase is greatly simplified, reducing design complexity and leaving
 250 plenty of headroom for further expansion.
 251
Remarkably this is not a new idea.  Intel's x86 `REP` prefix
gives the base concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA.  Relying on a multi-issue
Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine.  The only reason that the team did not take this forward
into a commercial product
was that they could not work out how to cleanly do OoO
multi-issue at the time.
 263
 264 In its simplest form, then, this "prefixing" idea is a matter
 265 of:
 266
 267 * Defining the format of the prefix
 268 * Adding a `setvl` instruction
 269 * Adding Vector-context SPRs and working out how to do
 270   context-switches with them
 271 * Writing an awful lot of Specification Documentation
 272   (4 years and counting)
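
The core of the prefixing idea can be sketched in a few lines of Python. This is an illustrative model only: the function name `sv_prefixed`, the flat `regs` list and the `*_vec` flags are assumptions made for the example, and the real SVP64 rules (predication, scalar-destination handling and so on) are deliberately left out.

```python
import operator


def sv_prefixed(op, regs, rt, ra, rb, vl, rt_vec=True, ra_vec=True, rb_vec=True):
    # Apply the unmodified scalar operation `op` VL times.
    # Registers marked as Vector step through consecutive entries;
    # registers marked as Scalar stay at the same entry every time.
    for i in range(vl):
        d = rt + (i if rt_vec else 0)
        a = ra + (i if ra_vec else 0)
        b = rb + (i if rb_vec else 0)
        regs[d] = op(regs[a], regs[b])
    return regs


# Vector-vector add: regs[16..19] = regs[0..3] + regs[8..11]
regs = list(range(32))
sv_prefixed(operator.add, regs, rt=16, ra=0, rb=8, vl=4)
print(regs[16:20])
```

The scalar operation itself (`operator.add` here) is passed in untouched, which is the point: the prefix only controls how the register numbers advance.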
 273
 274 Once the basics of this concept have sunk in, early
 275 advancements quickly follow naturally from analysis
 276 of the problem-space:
 277
 278 * Expanding the size of GPR, FPR and CR register files to
 279   provide 128 entries in each. This is a bare minimum for GPUs
 280   in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
 281   batching as possible.
* Predication (an absolutely critical component for a Vector ISA).
  The next logical advancement is then to allow separate predication masks
  to be applied to *both* the source *and* the destination, independently.
 285 * Element-width overrides: most Scalar ISAs today are 64-bit only,
 286   with primarily Load and Store being able to handle 8/16/32/64
 287   and sometimes 128-bit (quad-word), where Vector ISAs need to
 288   go as low as 8-bit arithmetic, even 8-bit Floating-Point for
 289   high-performance AI. Rather than waste opcode space adding all
 290   such operations at different bitwidths, let the prefix
 291   *redefine* the element width.
 292 * "Reordering" of the assumption of linear sequential element
 293   access, for Matrices, rotations, transposition, Convolutions,
 294   DCT, FFT, Parallel Prefix-Sum and other common transformations
 295   that require significant programming effort in other ISAs.
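
As an illustration of the independent source and destination masks mentioned in the list above, the following Python sketch models one plausible reading: masked-out source elements are skipped, masked-out destination elements are skipped, and the two indices advance separately. The function name and the bit-per-element mask encoding are assumptions for the example, not a statement of the SVP64 specification.

```python
def twin_predicated_move(src, src_mask, dst, dst_mask, vl):
    # Copy elements from src to dst, skipping masked-out positions on
    # each side independently. Bit i of a mask enables element i.
    out = list(dst)
    i = j = 0
    while True:
        while i < vl and not (src_mask >> i) & 1:
            i += 1  # skip disabled source elements
        while j < vl and not (dst_mask >> j) & 1:
            j += 1  # skip disabled destination elements
        if i >= vl or j >= vl:
            break
        out[j] = src[i]
        i += 1
        j += 1
    return out
```

With a sparse source mask and an all-ones destination mask this behaves like a compress; swapping the masks gives an expand.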
 296
 297 All of these things come entirely from "Augmentation" of the Scalar operation
 298 being prefixed: at no time is the Scalar operation significantly
 299 altered.
 300 From there, several more "Modes" can be added, including saturation,
 301 which is needed for Audio and Video applications, "Reverse Gear"
 302 which runs the Element Loop in reverse order (needed for Prefix
 303 Sum), and more.
 304
 305 **What is missing from Power Scalar ISA that a Vector ISA needs?**
 306
 307 Remarkably, very little: the devil is in the details though.
 308
 309 * The traditional `iota` instruction may be
 310   synthesised with an overlapping add, that stacks up incrementally
 311   and sequentially.  Although it requires two instructions (one to
 312   start the sum-chain) the technique has the advantage of allowing
 313   increments by arbitrary amounts, and is not limited to addition,
 314   either.
 315 * Big-integer addition (arbitrary-precision arithmetic) is an
 316   emergent characteristic from the carry-in, carry-out capability of
 317   Power ISA `adde` instruction. `sv.adde` as a BigNum add
 318   naturally emerges from the
 319   sequential chaining of these scalar instructions.
 320 * The Condition Register Fields of the Power ISA make a great candidate
 321   for use as Predicate Masks, particularly when combined with
 322   Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
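
The first two points can be modelled in Python as follows. This is a sketch under stated assumptions: the element loop is taken to execute strictly in order, registers are a plain list, limbs are 64-bit and stored least-significant first, and the helper names (`overlapping_add`, `sv_adde`, `to_limbs`, `from_limbs`) are invented for the example.

```python
MASK64 = (1 << 64) - 1


def overlapping_add(regs, base, start, step, vl):
    # Instruction 1: seed the start of the sum-chain.
    regs[base] = start
    # Instruction 2: a Vectorised add whose destination overlaps its
    # source, offset by one, so each element builds on the previous.
    for i in range(vl - 1):
        regs[base + i + 1] = regs[base + i] + step
    return regs[base:base + vl]


def adde(a, b, ca):
    # Scalar add-extended: 64-bit sum plus carry-in, returns (sum, carry-out).
    total = a + b + ca
    return total & MASK64, total >> 64


def sv_adde(a_limbs, b_limbs, ca=0):
    # Chaining the scalar adde across elements yields a big-integer add.
    out = []
    for a, b in zip(a_limbs, b_limbs):
        s, ca = adde(a, b, ca)
        out.append(s)
    return out, ca


def to_limbs(n, count):
    return [(n >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

Replacing the `+ step` in `overlapping_add` with another operation gives the "not limited to addition" behaviour described above.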
 323
It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones.  For example, transfer operations between the
Integer and Floating-point Scalar register files were dropped approximately
a decade ago after the Packed SIMD variants were considered to be
duplicates.  With it being completely inappropriate to attempt to Vectorise
a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
the Scalar ISA, a much better all-round candidate for Vectorisation, is
left anaemic.
 334
A particular key instruction that is missing is `MV.X` which is
illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
expensive instruction, which causes a huge swathe of Register Hazards
in one single hit, is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
Vectorised it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
usually so enormous, and so far away from Memory, that it is easier and
more efficient, architecturally, to provide these Indexing instructions.
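
A small Python model of `GPR(dest) = GPR(GPR(src))` and its Vectorised form may help. It assumes, purely for illustration, that the index values are absolute register numbers and that the destination does not overlap the source or index Vectors; the function names are invented for the example.

```python
def mv_x(regs, dest, src):
    # Scalar form: the *contents* of regs[src] select which register to read.
    regs[dest] = regs[regs[src]]


def sv_mv_x(regs, dest, src, vl):
    # Vectorised form: each element of the index Vector at `src`
    # picks the register copied into the matching element at `dest`.
    for i in range(vl):
        regs[dest + i] = regs[regs[src + i]]


regs = [0] * 32
regs[8:12] = [100, 101, 102, 103]   # data Vector
regs[16:20] = [11, 10, 9, 8]        # index Vector: reverse the data
sv_mv_x(regs, dest=24, src=16, vl=4)
print(regs[24:28])
```

Every element read depends on a run-time value, which is why the hardware cannot know in advance which registers will be touched.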
 346
 347 Fortunately, with the ISA Working Group being willing
 348 to consider RFCs (Requests For Change) these omissions have the potential
 349 to be corrected.
 350
One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and if CR Fields were the only predicates in SVP64
there would be pressure to start adding the exact same arithmetic and logical
operations that already exist in the Integer opcodes.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between CR Fields
and Integer Register files.
 360
 361 Beyond that, further extensions to the Power ISA become much more
 362 domain-specific, such as adding bitmanipulation for Audio, Video
 363 and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
 364 `ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
 365 of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
 366 *automatically* is inherently added to the Vector one as well, and
 367 because these GPU and Video opcodes have been added to the CPU ISA,
 368 Software Driver development and debugging is dramatically simplified.
 369
Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and making more effective use of resources?
 373
 374 # Simpler more compact programs
 375
 376 The first and most obvious saving is that, just as with any Vector
 377 ISA, the amount of data processing requested
 378 and controlled by each instruction is enormous, and leaves the
 379 Decode and Issue Engines idle, as well as the L1 I-Cache. With
 380 programs being smaller, chances are higher that they fit into
 381 L1 Cache, or that the L1 Cache may be made smaller.
 382
 383 Even a Packed SIMD ISA could take limited advantage of a higher
 384 bang-per-buck for limited specific workloads, as long as the
 385 stripmining setup and teardown is not required.  However a
 386 2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck
 387 ratio as a 64-wide Vector Length.
 388
Realistically, however, for general use cases it is extremely common
to have the Packed SIMD setup and teardown. `strncpy` for VSX is an
astounding 240 hand-coded assembler instructions where it is around
12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
the case of the IBM POWER9, due to a little-known design flaw, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further.  Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
sizes: SVP64 does not, as hinted at below.
 399
Additional savings come in the form of `SVREMAP`. This is a hardware
index transformation system where the normally sequentially-linear
element access may be "Re-Mapped" to limited but algorithmically-tailored
commonly-used deterministic schedules, for example Matrix Multiply,
DCT, or FFT.  A full in-register-file 5x7 Matrix Multiply or a 3x4 or
2x6 may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions. Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, or Logical
AND-and-OR.  The flexibility is not only enormous, but the compactness
unprecedented.  RADIX2 in-place DCT Triple-loop Schedules may be created in
around 11 instructions. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.
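
The separation between "Schedule" and operation can be sketched in Python. The loop order and row-major flat indexing below are assumptions chosen for clarity, not the actual `SVREMAP` hardware schedule, and the function names are invented for the example.

```python
import operator


def matmul_schedule(rows, inner, cols):
    # Yield (result, a, b) flat element offsets for a rows x inner
    # matrix times an inner x cols matrix, all stored row-major.
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield i * cols + j, i * inner + k, k * cols + j


def apply_schedule(schedule, a, b, acc, mul, add):
    # The schedule only supplies indices; the arithmetic is pluggable.
    for r, x, y in schedule:
        acc[r] = add(acc[r], mul(a[x], b[y]))
    return acc


# Integer Mul-and-Accumulate: 2x3 times 3x2, zero-initialised accumulator
a = [1, 2, 3, 4, 5, 6]
b = [7, 8, 9, 10, 11, 12]
print(apply_schedule(matmul_schedule(2, 3, 2), a, b, [0] * 4,
                     operator.mul, operator.add))
```

Passing `operator.and_` and `operator.or_` instead reuses the identical schedule for Logical AND-and-OR, and starting from a non-zero accumulator models adding the product to another Matrix.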
 417
There is no reason at all why future algorithmic schedules should not
be proposed as extensions to SVP64 (sorting algorithms,
compression algorithms, Sparse Data Sets, Graph Node walking
for example). Bear in mind that
the submission process will be
entirely at the discretion of the OpenPOWER Foundation ISA WG;
however, such submissions are encouraged and welcomed.