[[!tag whitepapers]]

**Revision History**

* v0.00 05may2021 first created

**Table of Contents**

[[!toc]]
# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever achieved stable
*general-purpose* auto-vectorising compiler support in the history
of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
to achieve high-performance optimisation where it matters*).
Rather: GPUs
have ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies, to sustain them.

This raises the question: why on earth would anyone consider
this task, and what, in Computer Science, actually needs solving?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has allowed
significant apparent speed increases: 3200 MHz DDR4 and even faster DDR5,
and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI,
all make an effort (all simply increasing the parallel deployment of
the underlying 150 MHz bitcells), but these efforts are dwarfed by the
two (nearly three) orders of magnitude increase in CPU horsepower
over the same timeframe. Seymour
Cray, from his amazing in-depth knowledge, predicted over two decades
ago that this mismatch would become a serious limitation. Some systems
at the time of writing are now approaching a *Gigabyte* of L4 Cache
by way of compensation, and as we know from experience even that will
be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone well:
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stacks were so
specialist that they required heavy investment by customers to utilise.
D-Matrix and Graphcore are a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under.
The huge risk is that any "better
AI mousetrap" created by an innovative competitor
that comes along will quickly render both D-Matrix's and
Graphcore's approaches obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism with a more Turing-complete ISA in each core, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is only dwarfed
by the amount of money poured into the software ecosystem in order
to make it accessible, and even then, GPU Programmers are a specialist
and rare (expensive) breed.

Second hints as to the answer emerge from an article,
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel and AMD since the 90s (over 3 decades), whereby SIMD, an
Order(N^6) opcode-proliferation nightmare with the mantra "make it
easy for hardware engineers, let software sort out the mess", has
literally overwhelmed programmers. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved, over thirty years ago, the opcode proliferation
nightmare. However only the NEC SX Aurora truly kept the Cray Vector
flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON. The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with a wide, diverse depth in multiple ISAs:
one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: their TMC2660
Stepper IC replaced ARM with RISC-V, saving them USD 1 in licensing
royalties per product. Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.

At this point a realisation should slowly be sinking in that there
are not actually that many truly viable Vector ISAs out there, because
the ones that are evolving in the general direction of Vectorisation
are, in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
support for, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's
  reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches.
  Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
  making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers using
  around 2 million IBM POWER9 Cores each.
* ARC, a competitor at the time to ARM, best known for use in
  Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and expanding rapidly
  in an uncontrolled fashion, is set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode-conflict nightmare that plagued the Power ISA
  two decades ago.
* Tensilica, Andes STAR and Western Digital, successful
  commercial proprietary ISAs: Tensilica in Baseband Modems,
  Andes in Audio DSPs, WD in HDDs and SSDs. These are all
  astoundingly commercially successful
  multi-billion-unit mass-volume markets that almost nobody
  knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others, which simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
  of the OpenPOWER Foundation, covered by Trademarks, permitting
  and encouraging contributions, and having software support for over
  20 years.
* ARM, not permitting Open Licensing: they survived in the early 90s
  only by doing a deal with Samsung for an in-perpetuity
  Royalty-free License, in exchange
  for GBP 3 million and legal protection through Samsung Research.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  protected from interference as a result.
* x86, famous for an unprecedented
  Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors to see how they fared.
* s390, IBM's mainframe ISA. Its lawsuits are nowhere near as well-known
  as x86's, but the 800lb-Gorilla Syndrome seems not to have deterred one
  particularly disingenuous group from performing illegal
  Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent protection,
software ecosystem, openness and pedigree all combine to reduce risk
and increase the chances of success, there is really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**

The summary of advantages, then, of the Power ISA is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
  and more.
* IBM's extensive 20+ year patent portfolio is available, royalty-free,
  to protect implementors as long as they are also members of the
  OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing-class
  ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why on earth would you want to?
As explained in the "SIMD Considered Harmful" article, Cray-style
Vector ISAs break the link between data element batches and the
underlying architectural back-end parallel processing capability.
Packed SIMD explicitly smashes that width right in the face of the
programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
that routinely adds literally an order
of magnitude increase in the number of hand-written lines of assembler
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.
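
To make the contrast concrete, here is a minimal sketch in Python-like
pseudocode (in the spirit of the Power ISA specification pseudocode) of
how a `setvl`-based loop handles an arbitrary-length data set in one
simple loop. `MAXVL` and the function names here are illustrative
assumptions, not actual SVP64 definitions:

    # Illustrative: Cray-style setvl loop over n elements.
    # setvl returns min(n, MAXVL): the hardware picks the batch size,
    # so the program never needs to know the underlying SIMD width.
    MAXVL = 64   # implementation-defined hardware maximum (assumption)

    def vector_add(a, b, out, n):
        i = 0
        while n > 0:
            vl = min(n, MAXVL)       # the essence of "setvl"
            for e in range(vl):      # one Vector instruction, vl elements
                out[i + e] = a[i + e] + b[i + e]
            i += vl
            n -= vl                  # no power-of-two teardown cascade

The same arbitrary-length loop in Packed SIMD assembler requires a cascade
of progressively smaller power-of-two blocks plus a scalar tail, which is
exactly where the order-of-magnitude line-count blowup comes from.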

Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding
200% more instructions than the base ISA seems unwise: at the least,
it feels like there should be a better way. On close inspection of
RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector, *and* by predicate
mask management and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour? Then the Instruction Decode
phase is greatly simplified, reducing design complexity and leaving
plenty of headroom for further expansion.

Remarkably this is not a new idea. Intel's x86 `REP` instruction
gives the base concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue
Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then, treating the Scalar "suffix" instruction
as a guide and making "scalar instruction" synonymous with "Vector element",
perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason that the team did not take this forward
into a commercial product
was because they could not work out how to cleanly do OoO
multi-issue at the time.
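
In Python-like pseudocode the core of that hardware loop is strikingly
simple. The register numbering scheme and the source of `VL` below are
illustrative assumptions, not the actual encoding:

    # Illustrative: Vector-prefixed Scalar execution. The prefix marks
    # each operand as Scalar (register number stays put) or Vector
    # (register number increments per element); the Scalar "suffix"
    # instruction is then issued VL times, like a hardware REP loop.
    def prefixed_execute(scalar_op, VL, rd, ra, rb,
                         rd_vec, ra_vec, rb_vec, GPR):
        for i in range(VL):
            d = rd + i if rd_vec else rd
            a = ra + i if ra_vec else ra
            b = rb + i if rb_vec else rb
            GPR[d] = scalar_op(GPR[a], GPR[b])  # unmodified Scalar op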

In its simplest form, then, this "prefixing" idea is a matter
of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
  context-switches with them
* Writing an awful lot of Specification Documentation
  (4 years and counting)
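
As a hedged sketch of the first bullet only (this is emphatically *not*
the actual SVP64 prefix layout, merely an illustration of the kind of
context that any Vector prefix must carry):

    # Illustrative only: the kind of context a Vector prefix encodes.
    from dataclasses import dataclass

    @dataclass
    class VectorPrefix:
        dest_is_vec: bool   # treat the destination register as Vector
        src_is_vec: bool    # treat the source registers as Vector
        pred_mask: int      # predicate mask selector (0 = unpredicated)
        elwidth: int        # element-width override: 0=default,8,16,32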

Once the basics of this concept have sunk in, early
advancements quickly follow naturally from analysis
of the problem-space:

* Expanding the size of the GPR, FPR and CR register files to
  provide 128 entries in each. This is a bare minimum for GPUs
  in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
  batching as possible.
* Predication (an absolutely critical component of a Vector ISA);
  the next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination, independently.
* Element-width overrides: most Scalar ISAs today are 64-bit only,
  with primarily Load and Store being able to handle 8/16/32/64
  and sometimes 128-bit (quad-word), where Vector ISAs need to
  go as low as 8-bit arithmetic, even 8-bit Floating-Point for
  high-performance AI. Rather than waste opcode space adding all
  such operations at different bitwidths, let the prefix
  *redefine* the element width.
* "Reordering" of the assumption of linear sequential element
  access, for Matrices, rotations, transposition, Convolutions,
  DCT, FFT, Parallel Prefix-Sum and other common transformations
  that require significant programming effort in other ISAs.

All of these things come entirely from "Augmentation" of the Scalar operation
being prefixed: at no time is the Scalar operation itself significantly
altered.
From there, several more "Modes" can be added, including saturation,
which is needed for Audio and Video applications; "Reverse Gear",
which runs the Element Loop in reverse order (needed for Prefix
Sum); and more.
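
Extending the earlier pseudocode sketch, predication and "Reverse Gear"
really are nothing more than augmentation of the same element loop (the
mask representation below is an illustrative assumption):

    # Illustrative: predicated, optionally-reversed element loop.
    def augmented_execute(scalar_op, VL, rd, ra, rb, GPR,
                          src_mask=~0, dst_mask=~0, reverse=False):
        order = range(VL - 1, -1, -1) if reverse else range(VL)
        for i in order:
            if not (src_mask >> i) & 1:
                continue              # source element masked out
            if not (dst_mask >> i) & 1:
                continue              # destination element masked out
            GPR[rd + i] = scalar_op(GPR[ra + i], GPR[rb + i])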

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details though.

* The traditional `iota` instruction may be
  synthesised with an overlapping add that stacks up incrementally
  and sequentially. Although it requires two instructions (one to
  start the sum-chain), the technique has the advantage of allowing
  increments by arbitrary amounts, and is not limited to addition,
  either.
* Big-integer addition (arbitrary-precision arithmetic) is an
  emergent characteristic of the carry-in, carry-out capability of
  the Power ISA `adde` instruction. `sv.adde` as a BigNum add
  naturally emerges from the
  sequential chaining of these scalar instructions.
* The Condition Register Fields of the Power ISA make a great candidate
  for use as Predicate Masks, particularly when combined with
  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
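
As a hedged illustration of the second bullet (the limb representation
is an assumption for clarity), sequentially chaining `adde`
element-by-element *is* schoolbook arbitrary-precision addition:

    # Illustrative: sv.adde as BigNum add. Each 64-bit limb addition
    # takes carry-in from the previous element and produces carry-out,
    # exactly as the scalar adde instruction does, chained VL times.
    def sv_adde(a_limbs, b_limbs):
        out, ca = [], 0                      # CA: the XER carry bit
        for a, b in zip(a_limbs, b_limbs):   # the implicit element loop
            s = a + b + ca
            out.append(s & (2**64 - 1))
            ca = s >> 64                     # carry chains to next limb
        return out, ca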

It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. One example is that transfer operations between the
Integer and Floating-point Scalar register files were dropped approximately
a decade ago after the Packed SIMD variants were considered to be
duplicates. With it being completely inappropriate to attempt to Vectorise
a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
the Scalar ISA, a much better all-round candidate for Vectorisation, is
left anaemic.

A particular key instruction that is missing is `MV.X`, which is
illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
expensive instruction, causing a huge swathe of Register Hazards
in one single hit, is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
Vectorised it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
usually so enormous, and so far away from Memory, that it is easier and
more efficient, architecturally, to provide these Indexing instructions.
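
In Python-like pseudocode a Vectorised `MV.X` (register numbering
illustrative) is the classic in-register gather/permute:

    # Illustrative: Vectorised MV.X, GPR(dest+i) = GPR(GPR(src+i)).
    # The index Vector starting at src selects arbitrary source
    # elements, permuting entirely within the register file rather
    # than taking the round-trip through Memory.
    def sv_mv_x(GPR, dest, src, VL):
        for i in range(VL):
            GPR[dest + i] = GPR[GPR[src + i]]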

Fortunately, with the ISA Working Group being willing
to consider RFCs (Requests For Change), these omissions have the potential
to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and if CR Fields were the only predicates in SVP64
there would be pressure to start adding the exact same arithmetic and
logical
operations that already exist in the Integer opcodes.
Instead of taking that route, the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between the CR Fields
and the Integer Register files.

Beyond that, further extensions to the Power ISA become much more
domain-specific, such as adding bitmanipulation for Audio, Video
and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
`ATAN2` etc.) for 3D and other GPU workloads. The huge advantage here
of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
is inherently *automatically* added to the Vector one as well, and
because these GPU and Video opcodes have been added to the CPU ISA,
Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and to making more effective use of resources?

# Simpler, more compact programs save power

The first and most obvious saving is that, just as with any Vector
ISA, the amount of data processing requested
and controlled by each instruction is enormous, which leaves the
Decode and Issue Engines, as well as the L1 I-Cache, largely idle. With
programs being smaller, chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller.

Even a Packed SIMD ISA could take limited advantage of a higher
bang-per-buck for limited specific workloads, as long as the
stripmining setup and teardown is not required. However a
2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck
ratio as a 64-wide Vector Length.

Realistically, however, for general use-cases the Packed SIMD setup
and teardown is extremely common. `strncpy` for VSX is an
astounding 240 hand-coded assembler instructions, where it is around
12 to 14 for both RVV and SVP64. In the worst case (full algorithm
unrolling
for massive FFTs) the L1 I-Cache becomes completely ineffective, and in
the case of the IBM POWER9, with a little-known design flaw not
normally otherwise encountered, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
sizes: SVP64 does not, as hinted at below.

Additional savings come in the form of `SVREMAP`. This is a hardware
index-transformation system where the normally sequentially-linear
element access may be "Re-Mapped" to limited but algorithmically-tailored
commonly-used deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply, or a 3x4 or
2x6 one, may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions. Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
AND-and-OR, or any other future instruction such as Complex-Number
Multiply-and-Accumulate that a future version of the Power ISA might
support. The flexibility is not only enormous, but the compactness
unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
around 11 instructions. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.
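
A hedged sketch of the Matrix-Multiply Schedule concept: the triple loop
below *is* the Schedule, a pure generator of element indices, entirely
divorced from whichever mul-and-accumulate operation consumes them (the
index layout is an illustrative assumption):

    # Illustrative: REMAP Schedule for an (X x Y) by (Y x Z) matrix
    # multiply. The hardware walks this deterministic triple loop,
    # feeding (dest, src1, src2) element indices to a single
    # mul-and-accumulate instruction: no loop-unrolling required.
    def matrix_remap(X, Y, Z):
        for i in range(X):
            for k in range(Z):
                for j in range(Y):
                    yield (i * Z + k,   # accumulator element
                           i * Y + j,   # element of matrix A
                           j * Z + k)   # element of matrix B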

There is no reason at all why future algorithmic schedules should not
be proposed as extensions to SVP64 (sorting algorithms,
compression algorithms, Sparse Data Sets, Graph Node walking
for example). Bear in mind that
the submission process will be
entirely at the discretion of the OpenPOWER Foundation ISA WG,
something that is both encouraged and welcomed by the OPF.

One of SVP64's current limitations is that it was initially designed
for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
a heavy focus on adding hardware-for-loops onto the *Registers*.
After more than three years of development the realisation hit that
the SVP64 concept could be expanded to Coherent Distributed Memory.
This astoundingly powerful concept is explored in the next section.

# Coherent Deterministic Hybrid Distributed Memory-Processing

It is not often that a heading in an article can legitimately
contain quite so many comically-chained buzzwords, but in this section
they are justified. As hinted at in the first section, the last time
that memory was the same speed as processors was the Pentium III
and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and
CPUs were clocked at about the same rate. DRAM bitcells *simply cannot
exceed
these rates*, yet the pressure from Software Engineers is to
make *sequential* algorithm processing faster and faster because
parallelising of algorithms is simply too difficult to master, and always
has been. Thus whilst DRAM has to go parallel (like RAID Striping) to
keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
are at an astonishing four levels of cache (L1 to L4).

It should therefore come as no surprise that attempts are being made
to move (distribute) processing closer to the DRAM Memory, firmly
on the *opposite* side of the main CPU's L1/2/3/4 Caches. However
the alarm bells ring here at the keyword "distributed", because by
moving the processing down next to the Memory, not only has the speed
of the parallel Processing Elements (PEs) dropped
by almost two orders of magnitude (5 GHz down to 100 MHz),
but the complexity of each PE has, for pure pragmatic reasons,
had to drop by several
orders of magnitude as well.
Things that the average "sequential algorithm"
programmer
takes for granted, such as SMP, Cache Coherency, Virtual Memory,
spinlocks (atomic locking): all of these are either outright gone
or something the programmer is expected to explicitly contend with
(even if that programmer is the Compiler Developer).

To give an extreme example: Aspex's Array-String Processor, which
comprised 4096 2-bit SIMD PEs each with 256 bytes of Content-Addressable
Memory, was capable of literally a hundred-fold improvement in
performance over Scalar CPUs such as the Pentium III of its era,
all on a 3 watt budget at only 250 MHz in 130 nm. Yet to take
proper advantage of its capability required an astounding 5-10
*days* per line of assembly code, because multiple versions of
an algorithm had to be hand-crafted then compared, and
the best one selected, all others discarded. 20 lines of optimised
Assembler taking six months to write can in no way be termed
"productive", yet this extreme level of unproductivity is an inherent
side-effect of going down the parallel-processing rabbithole.

**In short, we are in "Programmer's nightmare" territory**

Having dug a proverbial hole that rivals the Grand Canyon, and
jumped in it feet-first, the next
task is to piece together a strategy to climb back out and show
how falling back in can be avoided. This takes some explaining,
and first requires some background on various research efforts and
commercial designs. Once the context is clear, their synthesis
can be proposed. These are:

* [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
* [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
* [Snitch](https://arxiv.org/abs/2002.10143)

**ZOLC: Zero-Overhead Loop Control**

The simplest and longest commercially-successful deployment of
Zero-overhead looping
has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
within the VLIW word may be repeatedly deployed on successive clock
cycles until a countdown reaches zero. This extraordinarily simple
concept needs no branches, and has no complex Register Hazard
Management in the hardware,
because it is down to the programmer (or, the compiler)
to ensure data overlaps do not occur.

The key aspect of these
very simplistic countdown loops is: *they are deterministic*.

Zero-Overhead Loop Control takes this basic "single loop" concept
much further: both nested loops and conditional exit are included,
but also arbitrary control-jumping from the current inner loop
out to an entirely different loop, all based on conditions determined
dynamically at runtime.
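
A minimal sketch of the basic single-loop form (the nested, conditional
ZOLC generalisation adds state-transition tables on top of this; the
function names are illustrative):

    # Illustrative: zero-overhead countdown loop. Hardware re-issues
    # the loop body on successive clock cycles until the countdown
    # register reaches zero: no branch instruction, no branch
    # prediction, no pipeline bubble.
    def zero_overhead_loop(count, body_instructions, execute):
        while count:                 # hardware countdown register
            for insn in body_instructions:
                execute(insn)        # re-issued without any branch
            count -= 1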

Even when deployed on as basic a CPU as a single-issue in-order RISC
core, the performance and power-savings were astonishing: reductions of
between 20
and **80%** in algorithm completion times were achieved compared
to a more traditional branch-speculative in-order RISC CPU. MPEG
Decode, the target algorithm specifically picked by the researcher
due to its high complexity, with 6-deep nested loops and conditional
execution that frequently jumped in and out of at least 2 loops,
came out with an astonishing 43% improvement in completion time. 43%
fewer instructions executed is an almost unheard-of level of optimisation:
most ISA designers are elated if they can achieve 5 to 10%. The reduction
was so compelling that ST Microelectronics put it into commercial
production in one of their embedded CPUs.

The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
design of its triple-nested for-loop system
turned out to be remarkably similar to the
core nested for-loop engine of ZOLC. In hindsight this should not
have come as a surprise, because both are basically nested for-loops
that do not need branches to issue instructions.

The important insight, however, is that if ZOLC can be general-purpose
and apply deterministic nested-loop instruction
schedules to more than just registers
(unlike SVP64 in its current incarnation) then so can SVP64.

**OpenCAPI and Extra-V**

OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
cache-coherent Memory-access Protocol that is integrated into IBM's
Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
has OpenCAPI Memory interfaces, and requires an OpenCAPI-to-DDR4/5 Bridge
PHY
to connect to standard DIMMs.

Extra-V appears to be a remarkable research project that, by leveraging
OpenCAPI and assuming that the map of edges in any given arbitrary data
graph
could be kept by the main CPU in-memory, could distribute and delegate
a limited-capability, deterministic, but most importantly *data-dependent*
node-walking schedule right down into the memory itself (on the other
side of that L1-4 cache barrier). A miniature processor analyses
the data it has read (at the Memory), and determines whether it should
notify the main processor that this "Node" is worth investigating,
or whether the Graph node-walk should split off in a different direction.
Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
abstraction, locking, and cache-coherency, many of the nightmare problems
of other more explicit parallel processing paradigms disappear.
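
A hedged sketch of the delegated node-walk (the PE capability and the
notification mechanism are illustrative abstractions of the paper's
approach, not its actual interface):

    # Illustrative: data-dependent graph node-walk delegated to a
    # tiny PE sitting next to Memory. The PE reads each node locally
    # and only notifies the main CPU when a node is worth a look.
    def pe_node_walk(graph, start, interesting, notify_cpu):
        worklist, visited = [start], set()
        while worklist:
            node = worklist.pop()
            if node in visited:
                continue
            visited.add(node)
            if interesting(graph[node].data):  # decided at the Memory
                notify_cpu(node)               # CPU involved only now
            worklist.extend(graph[node].edges) # walk splits locally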

The similarity to ZOLC should not have gone unnoticed: where ZOLC
has nested conditional for-loops, Extra-V appears to have just the
one conditional for-loop. But the key, strategically-crucial
part of this multi-faceted puzzle is that, due to the deterministic and
coherent nature of Extra-V, the processing of the loops, which
requires only a tiny non-Turing-Complete processor, is not
done close to or by the main CPU at all: it is
*embedded right next to the memory*.

The similarity to D-Matrix Systolic Array Processing, Aspex Microelectronics
Array-String Processing, and Elixent 2D Array Processing should
also not have gone unnoticed. All of these solutions utilised
or utilise
a more comprehensive Turing-complete von-Neumann "Management Core"
to coordinate data passed in and out of PEs: none of them have or
had something
as powerful as OpenCAPI as part of that picture.

**Snitch**

Snitch is an elegant Memory-Coherent Barrel-Processor where registers
become "tagged" with a Memory-access Mode that went out of fashion
over forty years ago: Load-then-Auto-Increment. Expressed in C as
`src = *x++`, and requiring special Address Registers (PDP-11, 68000),
thanks to the RISC paradigm having gone too far,
the efficiency and effectiveness
of these Load-Store-with-Increment instructions had been
forgotten, until Snitch.

What the designers did, however, was not to add new Load-Store
or Arithmetic instructions to RISC-V, but instead to "mark"
registers with a tag. These tags tell the CPU: when you are asked to
carry out
an add instruction on r6 and r7, do not take r6 or r7 from the register
file; instead, perform a Cache-coherent Load-with-Increment
on each, using a special Address Register for each. Each new use
of r6 therefore brings in an entirely new value *directly from
memory*. Likewise on the second operand, r7, and likewise on
the destination result, which can be an automatic Coherent
Store-and-Increment
directly into Memory.
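
A hedged sketch of the tagged-register concept (the tag interface and
the 8-byte stride are illustrative assumptions, not Snitch's actual
mechanism):

    # Illustrative: Snitch-style register tagging. Reading a tagged
    # register performs a coherent load-with-increment from its own
    # address register; writing one performs a store-with-increment.
    class TaggedRegs:
        def __init__(self, mem, regfile):
            self.mem, self.regfile = mem, regfile
            self.addr = {}                  # per-register address regs

        def tag(self, reg, base_addr):
            self.addr[reg] = base_addr      # mark reg as memory-backed

        def read(self, reg):
            if reg not in self.addr:
                return self.regfile[reg]
            val = self.mem[self.addr[reg]]  # coherent load...
            self.addr[reg] += 8             # ...with auto-increment
            return val

        def write(self, reg, val):
            if reg not in self.addr:
                self.regfile[reg] = val
            else:
                self.mem[self.addr[reg]] = val  # coherent store...
                self.addr[reg] += 8             # ...with auto-increment

With r6, r7 and the destination all tagged, a single repeated `add`
becomes, in effect, `*dst++ = *a++ + *b++` streamed directly through
Memory.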

On top of a barrel architecture, the slowness of Memory access
is not a problem, because the Deterministic nature of classic
Load-Store-Increment can be compensated for by having 8 Memory
accesses scheduled underway and interleaved in a time-sliced
fashion with an FPU that is correspondingly 8 times faster than
the Coherent Memory accesses.

This design is almost identical to the early Vector Processors
of the late 1950s and early 1960s, which also critically relied
on implicit auto-increment addressing. The barrel architecture neatly
solves one of the inherent problems of those early designs (a mismatch
in memory
speed), and the presence of a full register file caters for a
second limitation of pure Memory-based Vector Processors: temporary
variables needed in the computation of intermediate results, which,
also being put in memory, placed
an awfully high artificial load on Memory bandwidth.

The similarity to SVP64 should be clear: SVP64 Prefixing and the
associated REMAP system are just another form of register "tagging"
that augments what was formerly designated by its original authors
as "just a Scalar ISA". Tagging allows for dramatic implicit alteration,
with advanced behaviour not previously envisaged.

What Snitch brings to the table therefore is a further illustration of
the concept introduced by Extra-V: where Extra-V brought information
about Sparse-Distributed Data to the attention of the main CPU in
a coherent fashion *without the CPU having to ask for it*, Snitch
demonstrates a classic LOAD-COMPUTE-STORE cycle in the same
distributed coherent manner, and does so with dramatically-reduced
power consumption.

**Bringing it all together**

At this point we are well into a future revision of SVP64, one that
clearly has some startlingly powerful potential: Supercomputing-class
Multi-Issue Vector Engines kept 100% occupied in a 100% long-term
sustained fashion with reduced complexity, reduced power consumption
and reduced completion time, thanks to Deterministic Coherent Scheduling
of the data fed in and out, or even moved down next to Memory.

This last part is where it normally gets hair-raising, but as ZOLC shows
there is no reason at all why even complex algorithms such as MPEG cannot
be run in a partially-deterministic manner, and anything that is
deterministic can be Scheduled, coherently. Combine that with OpenCAPI,
which solves the many issues associated with SMP, Virtual Memory and so on
yet still allows Cache-Coherent Distributed Memory Access, and what was
for decades an intractable Computer Science problem begins to
look like it has a potential solution.

It should even be possible to identify whether the Deterministic
Schedules created by ZOLC are
suitable for full off-CPU distributed processing, as long as OpenCAPI
is integrated into the mix. What a compiler - or even the hardware -
will be looking out for is a Basic Block of instructions that:

* begins with a LOAD (to be handled by OpenCAPI)
* contains some instructions that a given PE is capable of executing
* ends with a STORE (again: OpenCAPI)

For best results that would be wrapped with a Zero-Overhead Loop, where
the Compiler (or hardware at runtime) could easily identify, in advance,
the full range of Memory Addresses that the Loop is to encounter. Copies
of loop-invariant data would need to be passed down to the remote PE:
again, for simple-enough Basic Blocks, with assistance from the Compiler,
loop-invariant inputs are easily identified.
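
Putting those requirements together, a minimal sketch of a qualifying
Basic Block, in Python-like pseudocode standing in for the compiled loop
(the array layout and the `scale` parameter are illustrative):

    # Illustrative: a Basic Block that qualifies for off-CPU
    # delegation. Every iteration is LOAD -> simple compute -> STORE,
    # the full address range is knowable in advance, and `scale` is
    # the loop-invariant input to be copied down to the remote PE.
    def qualifying_block(mem, src, dst, n, scale):
        for i in range(n):          # Zero-Overhead Loop candidate
            x = mem[src + i]        # begins with a LOAD  (OpenCAPI)
            y = x * scale           # compute a PE can execute
            mem[dst + i] = y        # ends with a STORE   (OpenCAPI)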

The importance of OpenCAPI in this mix cannot be overstated, because
it will be the means by which the main CPU coordinates its activities
with the remote PEs, ensuring that LOAD/STORE Memory Hazards are not
violated.