1 [[!tag whitepapers]]
2
3 **Revision History**
4
5 * v0.00 05may2021 first created
6 * v0.01 06may2021 initial first draft
7 * v0.02 08may2021 add scenarios / use-cases
8 * v0.03 09may2021 add draft image for scenario
9 * v0.04 14may2021 add appendix with other research
10
11 **Table of Contents**
12
13 [[!toc]]
14
15 # Why in the 2020s would you invent a new Vector ISA
16
17 *(The short answer: you don't. Extend existing technology: on the shoulders of giants)*
18
19 Inventing a new Scalar ISA from scratch is over a decade-long task
20 including simulators and compilers: OpenRISC 1200 took 12 years to
21 mature. Stable Open ISAs require Standards and Compliance Suites that
22 take more. A Vector or Packed SIMD ISA to reach stable *general-purpose*
23 auto-vectorisation compiler support has never been achieved in the
24 history of computing, not with the combined resources of ARM, Intel,
25 AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
26 assembler and direct use of intrinsics is the Industry-standard norm
27 to achieve high-performance optimisation where it matters*).
28 GPUs fill this void both in hardware and software terms by having
29 ultra-specialist compilers (CUDA) that are designed from the ground up
30 to support Vector/SIMD parallelism, and associated standards
31 (SPIR-V, Vulkan, OpenCL) managed by
32 the Khronos Group, with multi-man-century development commitment from
33 multiple billion-dollar-revenue companies, to sustain them.
34
35 This therefore raises the question: why on earth would anyone consider
36 this task, and what, in Computer Science, actually needs solving?
37
38 The first hint is that whilst memory bitcells have not increased in speed
39 since the 90s (around 150 MHz), increasing the bank widths, striping, and
40 datapath widths and speeds has, at significant relative
41 latency penalties, allowed
42 apparent speed increases: 3200 MHz DDR4 and even faster DDR5,
43 and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI's OMI,
44 all make an effort (all simply increasing the parallel deployment of
45 the underlying 150 MHz bitcells), but these efforts are dwarfed by the
46 two to three orders of magnitude increase in CPU horsepower
47 over the same timeframe. Over two decades ago Seymour
48 Cray, from his in-depth knowledge, predicted that this mismatch
49 would become a serious limitation.
50
51 The latency gap between that bitcell speed and the CPU speed can do
52 nothing to help Random Access (unpredictable reads/writes). Caching
53 helps only so much, but not with some types of workloads (FFTs are
54 among the worst) even though they are fully deterministic.
55 Some systems
56 at the time of writing are now approaching a *Gigabyte* of L4 Cache
57 by way of compensation, and as we know from experience even that will
58 be considered inadequate in future.
59
60 Efforts to solve this problem by moving the processing closer to, or
61 directly integrating it into, the memory have traditionally not gone well:
62 Aspex Microelectronics and Elixent are parallel processing companies
63 that very few have heard of, because their software stacks were so
64 specialist that they required heavy investment by customers to utilise.
65 D-Matrix, a Systolic Array Processor, is a modern incarnation of the exact same
66 "specialist parallel processing" mistake, betting heavily on AI with
67 Matrix and Convolution Engines that can do no other task. Aspex only
68 survived by being bought by Ericsson, where its specialised suitability
69 for massive wide Baseband FFTs saved it from going under.
70 The huge risk is that a "better
71 AI mousetrap" created by an innovative competitor
72 can quickly render a too-specialist design obsolete.
73
74 NVIDIA and other GPUs have taken a different approach again: massive
75 parallelism with more Turing-complete ISAs in each, and dedicated
76 slower parallel memory paths (GDDR5) suited to the specific tasks of
77 3D, Parallel Compute and AI. The complexity of this approach is only dwarfed
78 by the amount of money poured into the software ecosystem in order
79 to make it accessible, and even then, GPU Programmers are a specialist
80 and rare (expensive) breed.
81
82 A second hint as to the answer emerges from the article
83 "[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
84 which illustrates a catastrophic rabbit-hole taken by Industry Giants
85 ARM, Intel and AMD since the 90s (over 3 decades): SIMD, an
86 Order(N^6) opcode proliferation nightmare with the mantra "make it
87 easy for hardware engineers, let software sort out the mess", has literally
88 overwhelmed programmers with thousands of instructions. Specialists charging
89 clients for assembly-code Optimisation Services are finding that AVX-512,
90 to take an
91 example, is anything but optimal: overall performance of AVX-512 actually
92 *decreases* even as power consumption goes up.
93
94 Cray-style Vectors solved, over thirty years ago, the opcode proliferation
95 nightmare. Only the NEC SX Aurora however truly kept the Cray Vector
96 flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
97 it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
98 instruction that makes a truly ubiquitous Vector ISA) in ways that
99 will become apparent over time as adoption increases. In the meantime
100 programmers are, in direct violation of ARM's advice on how to use SVE2,
101 trying desperately to understand it by applying their experience
102 of Packed SIMD NEON. The advice from ARM
103 not to create SVE2 assembler that is hardcoded to fixed widths is being
104 disregarded, in favour of writing *multiple identical implementations*
105 of a function, each with a different hardware width, and compelling
106 software to choose one at runtime after probing the hardware.
107
108 Even RISC-V, for all that we can be grateful to the RISC-V Founders
109 for reviving Cray Vectors, has severe performance and implementation
110 limitations that are only really apparent to exceptionally experienced
111 assembly-level developers with wide and diverse experience across multiple ISAs:
112 one of the best and clearest is a
113 [ycombinator post](https://news.ycombinator.com/item?id=24459041)
114 by adrian_b.
115
116 Adrian logically and concisely points out that the fundamental design
117 assumptions and simplifications that went into the RISC-V ISA have an
118 irrevocably damaging effect on its viability for high performance use.
119 That is not to say that its use in low-performance embedded scenarios is
120 not ideal: in private custom secretive commercial usage it is perfect.
121 Trinamic, an early adopter, is a classic case study: they created their
122 TMC2660 Stepper IC, replacing ARM with RISC-V and saving themselves
123 USD 1 in licensing royalties per product. Ubiquitous and common everyday
124 usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
125 so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
126 unfortunately, fundamentally flawed as far as power efficient high
127 performance is concerned.
128
129 Slowly, at this point, a realisation should be sinking in: there are not
130 actually that many truly viable Vector ISAs out there, because the
131 ones that are evolving in the general direction of Vectorisation are,
132 in various completely different ways, flawed.
133
134 **Successfully identifying a limitation marks the beginning of an
135 opportunity**
136
137 We are nowhere near done, however, because a Vector ISA is a superset of a
138 Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
139 support, and even longer to get the software ecosystem up and running.
140
141 Which ISAs, therefore, have or have had, at one point in time, a decent
142 Software Ecosystem? Debian supports most of these including s390:
143
144 * SPARC, created by Sun Microsystems and all but abandoned by Oracle.
145 Gaisler Research maintains the LEON Open Source Cores but with Oracle's
146 reputation nobody wants to go near SPARC.
147 * MIPS, created by SGI and only really commonly used in Network switches.
148 Exceptions: Ingenic with embedded CPUs,
149 and China ICT with the Loongson supercomputers.
150 * x86, the most well-known ISA and also one of the most heavily
151 litigiously-protected.
152 * ARM, well known in embedded and smartphone scenarios, very slowly
153 making its way into data centres.
154 * OpenRISC, an entirely Open ISA suitable for embedded systems.
155 * s390, a Mainframe ISA very similar to Power.
156 * Power ISA, a Supercomputing-class ISA, as demonstrated by
157 two of the top three top500.org supercomputers, which use
158 around 2 million cores each, based on the IBM POWER9.
159 * ARC, a competitor at the time to ARM, best known for use in
160 Broadcom VideoCore IV.
161 * RISC-V, with a software ecosystem heavily in development
162 and with rapid expansion
163 in an uncontrolled fashion, is set on an unstoppable
164 and inevitable trainwreck path to replicate the
165 opcode conflict nightmare that plagued the Power ISA,
166 two decades ago.
167 * Tensilica, Andes STAR and Western Digital, all successful
168 commercial proprietary ISAs: Tensilica in Baseband Modems,
169 Andes in Audio DSPs, WD in HDDs and SSDs. These are all
170 astoundingly commercially successful
171 multi-billion-unit mass volume markets that almost nobody
172 knows anything about, outside their specialised proprietary
173 niche. Included for completeness.
174
175 In order of least controlled to most controlled, the viable
176 candidates for further advancement are:
177
178 * OpenRISC 1200, not controlled or restricted by anyone. No patent
179 protection.
180 * RISC-V, touted as "Open" but actually strictly controlled under
181 Trademark License: too new to have adequate patent pool protection,
182 as evidenced by multiple adopters having been hit by patent lawsuits.
183 (Agreements between RISC-V *Members* to not engage in patent litigation
184 do nothing to stop third party patents that *legitimately pre-date*
185 the newly-created RISC-V ISA)
186 * MIPS, SPARC, ARC, and others, simply have no viable publicly
187 managed ecosystem. They work well within their niche markets.
188 * Power ISA: protected by IBM's extensive patent portfolio for Members
189 of the OpenPOWER Foundation, covered by Trademarks, permitting
190 and encouraging contributions, and having software support for over
191 20 years.
192 * ARM, not permitting Open Licensing, they survived in the early 90s
193 only by doing a deal with Samsung for an in-perpetuity
194 Royalty-free License, in exchange
195 for GBP 3 million and legal protection through Samsung Research.
196 Several large Corporations (Apple most notably) have licensed the ISA
197 but not ARM designs: the barrier to entry is high and the ISA itself
198 protected from interference as a result.
199 * x86, famous for an unprecedented
200 Court Ruling in 2004 where a Judge "banged heads
201 together" and ordered AMD and Intel to stop wasting his time,
202 make peace, and cross-license each other's patents. Anyone wishing
203 to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
204 and VIA EDEN processors, and see how they fared.
205 * s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
206 but the 800lb "Corporate Gorilla Syndrome" seems not to have deterred one
207 particularly disingenuous group from performing illegal
208 Reverse-Engineering.
209
210 By asking the question, "which ISA would be the best and most stable to
211 base a Vector Supercomputing-class Extension on?" where patent protection,
212 software ecosystem, open-ness and pedigree all combine to reduce risk
213 and increase the chances of success, there is really only one candidate.
214
215 **Of all of these, the one with the most going for it is the Power ISA.**
216
217 The summary of advantages, then, of the Power ISA is that:
218
219 * It has a 25-year software ecosystem, with RHEL, Fedora, Debian
220 and more.
221 * Amongst many other features
222 it has Condition Registers which can be used by Branches, greatly
223 reducing pressure on the main register files.
224 * IBM's extensive 20+ year patent portfolio is available, royalty-free,
225 to protect implementors as long as they are also members of the
226 OpenPOWER Foundation
227 * IBM designed and maintained the Power ISA as a Supercomputing
228 class ISA from its inception over 25 years ago.
229 * Coherent distributed memory access is possible through OpenCAPI
230 * Extensions to the Power ISA may be submitted through an External
231 RFC Process that does not require membership of OPF.
232
233 From this strong base, the next step is: how to leverage this
234 foundation to take a leap forward in performance and performance/watt,
235 *without* losing all the advantages of a ubiquitous software ecosystem,
236 the lack of which has historically plagued other systems and relegated
237 them to a risky niche market?
238
239 # How do you turn a Scalar ISA into a Vector one?
240
241 The most obvious question before that is: why on earth would you want to?
242 As explained in the "SIMD Considered Harmful" article, Cray-style
243 Vector ISAs break the link between data element batches and the
244 underlying architectural back-end parallel processing capability.
245 Packed SIMD explicitly smashes that width right in the face of the
246 programmer and expects them to like it. As the article immediately
247 demonstrates, an arbitrary-sized data set has to contend with
248 an insane power-of-two Packed SIMD cascade at both setup and teardown
249 that routinely adds literally an order
250 of magnitude increase in the number of hand-written lines of assembler
251 compared to a well-designed Cray-style Vector ISA with a `setvl`
252 instruction.
253
254 *<blockquote>
255 Packed SIMD looped algorithms actually have to
256 contain multiple implementations processing fragments of data at
257 different SIMD widths: Cray-style Vectors have just the one, covering not
258 just current architectural implementations but future ones with
259 wider back-end ALUs as well.
260 </blockquote>*
261
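To make the difference concrete, the following minimal C sketch (illustrative only, not taken from the article) models the two approaches for adding two arrays of arbitrary length: the hypothetical `setvl()` stands in for the Cray/SVP64 `setvl` instruction, asking the hardware how many elements it will process this time around, whereas the Packed SIMD version must walk an explicit power-of-two cascade of fragments at setup and teardown.

    #include <stddef.h>

    /* Hypothetical stand-in for the Cray/SVP64 setvl instruction:
     * returns min(n, MAX_VL), the number of elements the hardware
     * will process this iteration. MAX_VL is implementation-defined. */
    static size_t setvl(size_t n) {
        const size_t MAX_VL = 64;
        return n < MAX_VL ? n : MAX_VL;
    }

    /* Cray-style: one loop, any length, any future hardware width. */
    void vec_add(int *dst, const int *a, const int *b, size_t n) {
        while (n > 0) {
            size_t vl = setvl(n);            /* hardware chooses the width */
            for (size_t i = 0; i < vl; i++)  /* conceptually ONE Vector op */
                dst[i] = a[i] + b[i];
            dst += vl; a += vl; b += vl; n -= vl;
        }
    }

    /* Packed SIMD style: an explicit power-of-two cascade. Each inner
     * loop models a separate fixed-width SIMD instruction sequence that
     * must be hand-written (and maintained) in real assembler. */
    void simd_add(int *dst, const int *a, const int *b, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16)         /* 16-wide main body */
            for (size_t j = 0; j < 16; j++) dst[i + j] = a[i + j] + b[i + j];
        for (; i + 8 <= n; i += 8)           /* 8-wide teardown   */
            for (size_t j = 0; j < 8; j++)  dst[i + j] = a[i + j] + b[i + j];
        for (; i + 4 <= n; i += 4)           /* 4-wide teardown   */
            for (size_t j = 0; j < 4; j++)  dst[i + j] = a[i + j] + b[i + j];
        for (; i < n; i++)                   /* scalar tail       */
            dst[i] = a[i] + b[i];
    }

In hand-written assembler each of those fragments is a separate block of instructions, which is where the order-of-magnitude difference in line count comes from.
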
262 Assuming then that variable-length Vectors are obviously desirable,
263 it becomes a matter of how, not if. Both Cray and NEC SX Aurora
264 went the way of adding explicit Vector opcodes, a style which RVV
265 copied and modernised. In the case of RVV this introduced 192 new
266 instructions on top of an existing 95+ for base RV64GC. Adding
267 200% more instructions than the base ISA seems unwise: at least,
268 it feels like there should be a better way. On
269 close inspection of RVV as an example, the basic arithmetic
270 operations are massively duplicated: scalar-scalar from the base
271 is joined by both scalar-vector and vector-vector *and* predicate
272 mask management, and transfer instructions between all the same,
273 which goes a long way towards explaining why there are twice as many
274 Vector instructions in RISC-V as there are in the RV64GC Scalar base.
275
276 The question then becomes: with all the duplication of arithmetic
277 operations just to make the registers scalar or vector, why not
278 leverage the *existing* Scalar ISA with some sort of "context"
279 or prefix that augments its behaviour? Make "Scalar instruction"
280 synonymous with "Vector Element instruction" and through nothing
281 more than contextual
282 augmentation the Scalar ISA *becomes* the Vector ISA.
283 Then, by not having to have any Vector instructions at all,
284 the Instruction Decode
285 phase is greatly simplified, reducing design complexity and leaving
286 plenty of headroom for further expansion.
287
288 Remarkably this is not a new idea. Intel's x86 `REP` instruction
289 gives the base concept, but in 1994 it was Peter Hsu, the designer
290 of the MIPS R8000, who first came up with the idea of Vector-augmented
291 prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
292 the prefix would mark which of the registers were to be treated as
293 Scalar and which as Vector, then, treating the Scalar "suffix" instruction
294 as a guide and making "scalar instruction" synonymous with "Vector element",
295 perform a `REP`-like loop that
296 jammed multiple scalar operations into the Multi-Issue Execution
297 Engine. The only reason that the team did not take this forward
298 into a commercial product
299 was because they could not work out how to cleanly do OoO
300 multi-issue at the time.
301
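A rough C model of the concept (an illustrative sketch only: neither the R8000 proposal nor the actual SVP64 prefix format): the prefix carries nothing more than a per-register scalar/vector marking and a Vector Length, and the hardware re-issues the completely unchanged scalar operation VL times, stepping only the registers marked as Vectors.

    #include <stdint.h>

    #define NREGS 128
    uint64_t gpr[NREGS];

    /* Prefix "context": which operands step per element, plus the Vector
     * Length. Field names are purely illustrative. */
    struct prefix { int rd_is_vec, ra_is_vec, rb_is_vec, vl; };

    /* The scalar "suffix" operation: completely unchanged... */
    static void scalar_add(int rd, int ra, int rb) {
        gpr[rd] = gpr[ra] + gpr[rb];
    }

    /* ...and the prefix merely loops it, making "scalar instruction"
     * synonymous with "Vector element instruction" (bounds checks
     * omitted for brevity). */
    void prefixed_add(struct prefix p, int rd, int ra, int rb) {
        for (int i = 0; i < p.vl; i++)
            scalar_add(rd + (p.rd_is_vec ? i : 0),
                       ra + (p.ra_is_vec ? i : 0),
                       rb + (p.rb_is_vec ? i : 0));
    }
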
302 In its simplest form, then, this "prefixing" idea is a matter
303 of:
304
305 * Defining the format of the prefix
306 * Adding a `setvl` instruction
307 * Adding Vector-context SPRs and working out how to do
308 context-switches with them
309 * Writing an awful lot of Specification Documentation
310 (4 years and counting)
311
312 Once the basics of this concept have sunk in, early
313 advancements quickly follow naturally from analysis
314 of the problem-space:
315
316 * Expanding the size of GPR, FPR and CR register files to
317 provide 128 entries in each. This is a bare minimum for GPUs
318 in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
319 batching as possible.
320 * Predication (an absolutely critical component of any Vector ISA):
321 the next logical advancement is to allow separate predicate masks
322 to be applied to *both* the source *and* the destination, independently,
323 as sketched in C after this list. (*Readers familiar with Vector ISAs will
324 recognise this as a back-to-back `VGATHER-VSCATTER`*)
325 * Element-width overrides: most Scalar ISAs today are 64-bit only,
326 with primarily Load and Store being able to handle 8/16/32/64
327 and sometimes 128-bit (quad-word), where Vector ISAs need to
328 go as low as 8-bit arithmetic, even 8-bit Floating-Point for
329 high-performance AI. Rather than waste opcode space adding all
330 such operations at different bitwidths, let the prefix
331 *redefine* (override) the element width, without actually altering
332 the Scalar ISA at all.
333 * "Reordering" of the assumption of linear sequential element
334 access, for Matrices, rotations, transposition, Convolutions,
335 DCT, FFT, Parallel Prefix-Sum and other common transformations
336 that require significant programming effort in other ISAs.
337
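As an example of the second point, twin predication can be modelled in C roughly as follows (a conceptual sketch, not the SVP64 specification): a separate mask on the source and on the destination, with elements skipping independently over masked-out positions on each side.

    #include <stdint.h>

    /* Conceptual model of "twin predication" on a simple element copy:
     * src_mask and dst_mask are bitmasks (vl assumed <= 64). Source and
     * destination element indices advance independently, giving the
     * back-to-back VGATHER-VSCATTER effect described above. */
    void twin_pred_copy(uint64_t *dst, const uint64_t *src,
                        uint64_t src_mask, uint64_t dst_mask, int vl) {
        int s = 0, d = 0;
        while (s < vl && d < vl) {
            while (s < vl && !((src_mask >> s) & 1)) s++; /* next enabled source      */
            while (d < vl && !((dst_mask >> d) & 1)) d++; /* next enabled destination */
            if (s < vl && d < vl)
                dst[d++] = src[s++];
        }
    }
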
338 All of these things come entirely from "Augmentation" of the Scalar operation
339 being prefixed: at no time is the Scalar operation's binary pattern decoded
340 differently compared to when it is used as a Scalar operation.
341 From there, several more "Modes" can be added, including
342
343 * saturation,
344 which is needed for Audio and Video applications
345 * "Reverse Gear"
346 which runs the Element Loop in reverse order (needed for Prefix
347 Sum)
348 * Data-dependent Fail-First, which emerged from asking the simple
349 question, "If modern Vector ISAs have Load/Store Fail-First,
350 and the Power ISA has Condition Codes, why not make Conditional
351 early-exit from Arithmetic operation looping?"
352 * over 500 Branch-Conditional Modes emerge from application of
353 Boolean Logic in a Vector context, on top of an already-powerful
354 Scalar Branch-Conditional/Counter instruction
355
356 All of these features are added as "Augmentations", to create of
357 the order of 1.5 *million* instructions, none of which decode the
358 32-bit scalar suffix any differently.
359
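Data-dependent Fail-First, for example, can be modelled conceptually in C as follows (a sketch of the principle only; the precise rules for which elements are written are defined in the SVP64 specification): the element loop tests each result and truncates the Vector Length at the first element that fails the condition.

    #include <stdint.h>

    /* Conceptual model: add element pairs until a result fails the
     * condition (here, arbitrarily, "result must be non-zero"), then
     * truncate VL to the number of elements that passed. Subsequent
     * Vector instructions then operate only on the passing elements. */
    int failfirst_add(int64_t *dst, const int64_t *a, const int64_t *b, int vl) {
        for (int i = 0; i < vl; i++) {
            int64_t result = a[i] + b[i];
            if (result == 0)        /* condition failed: stop the loop here */
                return i;           /* new, truncated Vector Length         */
            dst[i] = result;
        }
        return vl;                  /* all elements passed                  */
    }
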
360 **What is missing from Power Scalar ISA that a Vector ISA needs?**
361
362 Remarkably, very little: the devil is in the details though.
363
364 * The traditional `iota` instruction may be
365 synthesised with an overlapping add, that stacks up incrementally
366 and sequentially. Although it requires two instructions (one to
367 start the sum-chain) the technique has the advantage of allowing
368 increments by arbitrary amounts, and is not limited to addition,
369 either.
370 * Big-integer addition (arbitrary-precision arithmetic) is an
371 emergent characteristic from the carry-in, carry-out capability of
372 the Power ISA `adde` instruction. `sv.adde` as a BigNum add
373 naturally emerges from the sequential carry-flag chaining of these
374 scalar instructions (see the C sketch after this list).
375 * The Condition Register Fields of the Power ISA make a great candidate
376 for use as Predicate Masks, particularly when combined with
377 Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
378
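The first two points can be sketched in plain C (illustrative only, not SVP64 assembler): the `iota` substitute is an element loop in which each result feeds the next (a running, chained add), and big-integer addition is the same chaining idea applied to the carry flag.

    #include <stdint.h>

    /* iota substitute: a chained, overlapping add. Starting from a seed,
     * each element is the previous element plus an arbitrary step:
     * iota_chain(v, 0, 1, 8) produces 1,2,3,4,5,6,7,8. */
    void iota_chain(uint64_t *v, uint64_t seed, uint64_t step, int vl) {
        uint64_t prev = seed;
        for (int i = 0; i < vl; i++) {
            v[i] = prev + step;
            prev = v[i];
        }
    }

    /* Big-integer add: the behaviour that emerges naturally from looping
     * a carry-in/carry-out scalar add (the Power ISA adde) over the limbs
     * of two arbitrary-precision numbers. */
    void bignum_add(uint64_t *r, const uint64_t *a, const uint64_t *b, int limbs) {
        unsigned carry = 0;
        for (int i = 0; i < limbs; i++) {
            uint64_t sum = a[i] + carry;     /* carry-in                  */
            carry  = (sum < a[i]);           /* detect wrap on first add  */
            r[i]   = sum + b[i];
            carry |= (r[i] < sum);           /* carry-out to next limb    */
        }
    }
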
379 It is only when looking slightly deeper into the Power ISA that
380 certain things turn out to be missing, and this is down in part to IBM's
381 primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
382 so Scalar ones. Examples include that transfer operations between the
383 Integer and Floating-point Scalar register files were dropped approximately
384 a decade ago after the Packed SIMD variants were considered to be
385 duplicates. With it being completely inappropriate to attempt to Vectorise
386 a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
387 the Scalar parts of the Power ISA, a much better all-round candidate
388 for Vectorisation, are left anaemic.
389
390 A particular key instruction that is missing is `MV.X` which is
391 illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
392 expensive instruction, causing a huge swathe of Register Hazards
393 in one single hit, is almost never added to a Scalar ISA but
394 is almost always added to a Vector one. When `MV.X` is
395 Vectorised it allows for arbitrary
396 remapping of elements within a Vector to positions specified
397 by another Vector. A typical Scalar ISA will use Memory to
398 achieve this task, but with Vector ISAs the Vector Register Files are
399 usually so enormous, and so far away from Memory, that it is easier and
400 more efficient, architecturally, to provide these Indexing instructions.
401
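In C terms (an illustrative model, with a wrap purely to keep the illustration in-bounds) the scalar form of `MV.X` is a doubly-indirect register read, and the Vectorised form becomes an arbitrary in-register-file gather:

    #include <stdint.h>

    #define NREGS 128
    uint64_t gpr[NREGS];

    /* Scalar MV.X: GPR(dest) = GPR(GPR(src)). The double indirection is
     * what creates the huge swathe of Register Hazards in one hit.
     * (The % NREGS wrap is only to keep this illustration in-bounds.) */
    void mv_x(int dest, int src) {
        gpr[dest] = gpr[gpr[src] % NREGS];
    }

    /* Vectorised MV.X: remap the elements of one Vector to positions
     * named by another Vector, entirely within the register file,
     * with no Memory involvement at all. */
    void sv_mv_x(int dest, int src, int vl) {
        for (int i = 0; i < vl; i++)
            gpr[dest + i] = gpr[gpr[src + i] % NREGS];
    }
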
402 Fortunately, with the ISA Working Group being willing
403 to consider RFCs (Requests For Change) these omissions have the potential
404 to be corrected.
405
406 One deliberate decision in SVP64 involves Predication. Typical Vector
407 ISAs have quite comprehensive arithmetic and logical operations on
408 Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
409 side of Power ISA already has most of them.
410 If CR Fields were the only predicates in SVP64
411 it would create pressure to start adding the exact same arithmetic and logical
412 operations that already exist in the Integer opcodes, which is less
413 than desirable.
414 Instead of taking that route the decision was made to allow *both*
415 Integer *and* CR Fields to be Predicate Masks, and to create Draft
416 instructions that provide better transfer capability between CR Fields
417 and Integer Register files.
418
419 Beyond that, further extensions to the Power ISA become much more
420 domain-specific, such as adding bitmanipulation for Audio, Video
421 and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
422 `ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
423 of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
424 *automatically* is inherently added to the Vector one as well, and
425 because these GPU and Video opcodes have been added to the CPU ISA,
426 Software Driver development and debugging is dramatically simplified.
427
428 Which brings us to the next important question: how are any of these
429 CPU-centric Vector-centric improvements relevant to power efficiency
430 and making more effective use of resources?
431
432 # Simpler, more compact programs save power
433
434 The first and most obvious saving is that, just as with any Vector
435 ISA, the amount of data processing requested
436 and controlled by each instruction is enormous, and leaves the
437 Decode and Issue Engines idle, as well as the L1 I-Cache. With
438 programs being smaller, chances are higher that they fit into
439 L1 Cache, or that the L1 Cache may be made smaller: either way
440 is a considerable O(N^2) power-saving.
441
442 Even a Packed SIMD ISA could take limited advantage of a higher
443 bang-per-buck for limited specific workloads, as long as the
444 stripmining setup and teardown is not required. However a
445 2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck
446 ratio as a 64-wide Vector Length.
447
448 Realistically, however, for general use cases it is extremely common
449 to have the Packed SIMD setup and teardown. `strncpy` for VSX is an
450 astounding 240 hand-coded assembler instructions, where it is around
451 12 to 14 for both RVV and SVP64. In the worst case (full algorithm unrolling
452 for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
453 the case of the IBM POWER9 with a little-known design flaw not
454 normally otherwise encountered this results in
455 contention between the L1 D and I Caches at the L2 Bus, slowing down
456 execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
457 requires loop-unrolling to contend with non-power-of-two Matrix
458 sizes: SVP64 does not (as hinted at below).
459 [Figures 8 and 9](https://arxiv.org/abs/2104.03142)
460 illustrate the process of concatenating copies of data in order
461 to match RADIX2 limitations of MMA.
462
463 Additional savings come in the form of `SVREMAP`. Like the
464 hardware-assist of Google's TPU mentioned on p9 of the above MMA paper,
465 `SVREMAP` is a hardware
466 index transformation system where the normally sequentially-linear
467 Vector element access may be "Re-Mapped" to limited but algorithmic-tailored
468 commonly-used deterministic schedules, for example Matrix Multiply,
469 DCT, or FFT. A full in-register-file 5x7 Matrix Multiply or a 3x4 or
470 2x6 with optional *in-place* transpose, mirroring or rotation
471 on any source or destination Matrix
472 may be performed in as little as 4 instructions, one of which
473 is to zero-initialise the accumulator Vector used to store the result.
474 If addition to another Matrix is also required then it is only three
475 instructions.
476
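The essence of `SVREMAP` can be modelled in C as an index-transformation function sitting between the linear element counter and the register file (a conceptual sketch only: the real REMAP Schedules and their parameters are defined in the SVP64 specification). A single Vectorised multiply-and-accumulate, prefixed with a Matrix Schedule, then covers the entire triple loop; the separate zero-initialisation of the accumulator corresponds to the extra instruction mentioned above.

    /* Conceptual model: a Matrix REMAP Schedule turns the single linear
     * element counter of one Vectorised mul-and-accumulate into three
     * nested indices, so that one instruction performs the whole Matrix
     * Multiply. c is assumed pre-initialised (zeroed, or holding a
     * Matrix to accumulate into). */
    void remapped_matmul(double *c, const double *a, const double *b,
                         int rows, int inner, int cols) {
        int total = rows * inner * cols;       /* total element steps      */
        for (int step = 0; step < total; step++) {
            int i = step / (inner * cols);     /* REMAP: linear step ->    */
            int k = (step / cols) % inner;     /*        (i, k, j) indices */
            int j = step % cols;
            /* the "instruction" itself is just one scalar fused mul-add */
            c[i * cols + j] += a[i * inner + k] * b[k * cols + j];
        }
    }
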
477 Not only that, but because the "Schedule" is an abstract
478 concept separated from the mathematical operation, there is no reason
479 why Matrix Multiplication Schedules may not be applied to Integer
480 Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
481 AND-and-OR, or any other future instruction such as Complex-Number
482 Multiply-and-Accumulate or Abs-Diff-and-Accumulate
483 that a future version of the Power ISA might
484 support. The flexibility is not only enormous, but the compactness
485 unprecedented. RADIX2 in-place DCT may be created in
486 around 11 instructions using the Triple-loop DCT Schedule. The only other processors well-known to have
487 this type of compact capability are both VLIW DSPs: TI's TMS320 Series
488 and Qualcomm's Hexagon, and both are targeted at FFTs only.
489
490 There is no reason at all why future algorithmic schedules should not
491 be proposed as extensions to SVP64 (sorting algorithms,
492 compression algorithms, Sparse Data Sets, Graph Node walking
493 for example). (*Bear in mind that
494 the submission process will be
495 entirely at the discretion of the OpenPOWER Foundation ISA WG,
496 something that is both encouraged and welcomed by the OPF.*)
497
498 One of SVP64's current limitations is that it was initially designed
499 for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
500 a heavy focus on adding hardware-for-loops onto the *Registers*.
501 After more than three years of development the realisation hit that
502 the SVP64 concept could be expanded to Coherent Distributed Memory.
503 This astoundingly powerful concept is explored in the next section.
504
505 # Coherent Deterministic Hybrid Distributed In-Memory Processing
506
507 It is not often that a heading in an article can legitimately
508 contain quite so many comically-chained buzzwords, but in this section
509 they are justified. As hinted at in the first section, the last time
510 that memory was the same speed as processors was the Pentium III
511 and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and
512 CPUs were about the same rate. DRAM bitcells *simply cannot exceed
513 these rates*, yet the pressure from Software Engineers is to
514 make *sequential* algorithm processing faster and faster because
515 parallelising of algorithms is simply too difficult to master, and always
516 has been. Thus whilst DRAM has to go parallel (like RAID Striping) to
517 keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
518 are at an astonishing four levels of cache (L1 to L4).
519
520 It should therefore come as no surprise that attempts are being made
521 to move (distribute) processing closer to the DRAM Memory, firmly
522 on the *opposite* side of the main CPU's L1/2/3/4 Caches,
523 where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
524 why this approach is compelling. However
525 the alarm bells ring here at the keyword "distributed", because by
526 moving the processing down next to the Memory, even onto
527 the same die as the DRAM, the speed of any
528 of the parallel Processing Elements (PEs) would likely drop
529 by almost two orders of magnitude (5 GHz down to 150 MHz), and
530 the simplicity of each PE has, for pure pragmatic reasons,
531 to drop by several
532 orders of magnitude as well.
533 Things that the average "sequential algorithm"
534 programmer
535 takes for granted such as SMP, Cache Coherency, Virtual Memory,
536 spinlocks (atomic locking, mutexes), all of these are either outright gone
537 or expected that the programmer shall explicitly contend with
538 (even if that programmer is the Compiler Developer). There's definitely
539 not going to be a standard OS: the PEs will be too basic, too
540 resource-constrained, and definitely too busy.
541
542 To give an extreme example: Aspex's Array-String Processor, which
543 was 4096 2-bit SIMD PEs each with 256 bytes of Content Addressable
544 Memory, was capable of literally a hundred-fold improvement in
545 performance over Scalar CPUs such as the Pentium III of its era,
546 all on a 3 watt budget at only 250 mhz in 130 nm. Yet to take
547 proper advantage of its capability required an astounding 5-10
548 *days* per line of assembly code because multiple versions of
549 an algorithm had to be hand-crafted then compared, and only
550 the best one selected: all others discarded. 20 lines of optimised
551 Assembler taking three to six months to write can in no way be termed
552 "productive", yet this extreme level of unproductivity is an inherent
553 side-effect of going down the parallel-processing rabbithole where
554 the cost of providing "Traditional" programmability (Virtual Memory,
555 SMP) is worse than counter-productive, it's often outright impossible.
556
557 *<blockquote>
558 Similar to how GPUs achieve astounding task-dedicated
559 performance by giving
560 ALUs 30% of total silicon area and sacrificing the ability to run
561 General-Purpose programs, Aspex, Google's Tensor Processor and D-Matrix
562 likewise took this route and made the same compromise.
563 </blockquote>*
564
565 **In short, we are in "Programmer's nightmare" territory**
566
567 Having dug a proverbial hole that rivals the Grand Canyon, and
568 jumped in it feet-first, the next
569 task is to piece together a strategy to climb back out and show
570 how falling back in can be avoided. This takes some explaining,
571 and first requires some background on various research efforts and
572 commercial designs. Once the context is clear, their synthesis
573 can be proposed. These are:
574
575 * [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
576 available [no paywall](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf)
577 * [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
578 * [Snitch](https://arxiv.org/abs/2002.10143)
579
580 **ZOLC: Zero-Overhead Loop Control**
581
582 Zero-Overhead Looping is the concept of automatically running a set sequence
583 of instructions a predetermined number of times, without requiring
584 a branch. This is conceptually similar but
585 slightly different from using Power ISA `bc` in `CTR`
586 (Counter) Mode to create loops, because in ZOLC the branch-back is automatic.
587
588 The simplest and longest commercially successful deployment of Zero-overhead looping
589 has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
590 within the VLIW word may be repeatedly deployed on successive clock
591 cycles until a countdown reaches zero. This extraordinarily simple
592 concept needs no branches, and has no complex Register Hazard
593 Management in the hardware
594 because it is down to the programmer (or, the compiler),
595 to ensure data overlaps do not occur. Careful crafting of those
596 14 instructions can keep the ALUs 100% occupied for sustained periods,
597 and the iconic example for which the TI DSPs are renowned
598 is that an entire inner loop for large FFTs
599 can be done with that one VLIW word: no stalls, no stopping, no fuss,
600 an entire 1024 or 4096 wide FFT Layer in one instruction.
601
602 <blockquote>
603 The key aspect of these
604 very simplistic countdown loops, as far as we are concerned,
605 is: *they are deterministic*.
606 </blockquote>
607
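As a C analogy only (the real thing is TI VLIW assembler with a hardware countdown register), the difference is that all loop-control bookkeeping vanishes from the instruction stream: the hardware re-issues the same small packet of instructions until the countdown reaches zero, with no compare, no branch and no misprediction.

    /* Conventional loop: the compare and branch are executed on every
     * iteration, consuming issue slots and risking misprediction. */
    void loop_with_branch(float *y, const float *x, float k, int n) {
        for (int i = 0; i < n; i++)       /* i < n test + branch, every time */
            y[i] = x[i] * k;
    }

    /* Zero-overhead loop, modelled: the loop body contains no loop
     * control at all; in hardware the re-issue below is performed by a
     * countdown register, not by this C for-loop. */
    static void loop_body(float *y, const float *x, float k, int i) {
        y[i] = x[i] * k;                  /* the entire instruction packet */
    }

    void zero_overhead_loop(float *y, const float *x, float k, int n) {
        for (int i = 0; i < n; i++)       /* done by hardware, not software */
            loop_body(y, x, k, i);
    }
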
608 Zero-Overhead Loop Control takes this basic "single loop" concept
609 way further: both nested loops and conditional exit are included,
610 but also arbitrary control-jumping from the current inner loop
611 out to an entirely different loop, all based on conditions determined
612 dynamically at runtime.
613
614 Even when deployed on as basic a CPU as a single-issue in-order RISC
615 core, the performance and power-savings were astonishing: between 27
616 and **75%** reduction in algorithm completion times were achieved compared
617 to a more traditional branch-speculative in-order RISC CPU. MPEG
618 Encode's timing, the target algorithm specifically picked by the researcher
619 due to its high complexity with 6-deep nested loops and conditional
620 execution that frequently jumped in and out of at least 2 loops,
621 came out with an astonishing 43% improvement in completion time. 43%
622 fewer instructions executed is an almost unheard-of level of optimisation:
623 most ISA designers are elated if they can achieve 5 to 10%. The reduction
624 was so compelling that ST Microelectronics put it into commercial
625 production in one of their embedded CPUs, the ST120 DSP-MCU.
626
627 The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
628 design of its triple-nested for-loop system
629 turned out to be remarkably similar to the
630 core nested for-loop engine of ZOLC. In hindsight this should not
631 have come as a surprise, because both are basically nested for-loops
632 that do not need branches to issue instructions.
633
634 The important insight is, however, that if ZOLC can be general-purpose
635 and apply deterministic nested looped instruction
636 schedules to more than just registers
637 (unlike SVP64 in its current incarnation) then so can SVP64.
638
639 **OpenCAPI and Extra-V**
640
641 OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
642 cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors.
643
644 <blockquote>(Side note:
645 POWER10 *only*
646 has OpenCAPI Memory interfaces: an astounding number of them,
647 with overall bandwidth so high it's actually difficult to conceptualise.
648 An OMI-to-DDR4/5 Bridge PHY is therefore required
649 to connect to standard Memory DIMMs.)
650 </blockquote>
651
652 Extra-V appears to be a remarkable research project based on OpenCAPI that,
653 by assuming that the map of edges (excluding the actual data)
654 in any given arbitrary data graph
655 could be kept by the main CPU in-memory, could distribute and delegate
656 a limited-capability deterministic but most importantly *data-dependent*
657 node-walking schedule actually right down into the memory itself (on the other side of that L1-4 cache barrier). A miniature processor
658 (non-Turing-complete) analysed
659 the data it had read (at the Memory), and determined if it should
660 notify the main processor that this "Node" is worth investigating,
661 or if the Graph node-walk should split in a different direction.
662 Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
663 abstraction, locking, and cache-coherency, many of the nightmare problems
664 of other more explicit parallel processing paradigms disappear.
665
666 The similarity to ZOLC should not have gone unnoticed: where ZOLC
667 has nested conditional for-loops Extra-V appears to have just the
668 one conditional for-loop, but the key strategically-crucial
669 part of this multi-faceted puzzle is that due to the deterministic and
670 coherent nature of Extra-V, the processing of the loops, which
671 requires a tiny non-Turing-Complete processor, is not
672 done close to or by the main CPU at all: it is
673 *embedded right next to the memory*.
674
675 The similarity to the D-Matrix Systolic Array Processing, Aspex Microelectronics
676 Array-String Processing, and Elixent 2D Array Processing, should
677 also not have gone unnoticed. All of these solutions utilised
678 or utilise
679 a more comprehensive Turing-complete von-Neumann "Management Core"
680 to coordinate data passed in and out of PEs: none of them have or
681 had something
682 as powerful as OpenCAPI as part of that picture.
683
684 The fact that Neural Networks may be expressed as arbitrary Graphs,
685 and comprise Sparse Matrices, should also have been noted by the reader
686 interested in AI.
687
688 **Snitch**
689
690 Snitch is an elegant Memory-Coherent Barrel-Processor where registers
691 become "tagged" with a Memory-access Mode that went out of fashion
692 over forty years ago: Load-then-Auto-Increment. Expressed in C as
693 `src = *x++`, and requiring special Address Registers (PDP-11, 68000),
694 thanks to the RISC paradigm having gone too far,
695 the efficiency and effectiveness
696 of these Load-Store-with-Increment instructions had been
697 forgotten, until Snitch.
698
699 What the designers did however was not to add any new Load-Store
700 or Arithmetic instructions to the underlying RISC-V at all, but instead to "mark"
701 registers with a tag which *augmented* (altered) the behaviour
702 of *existing* instructions. These tags tell the CPU: when you are asked to
703 carry out
704 an add instruction on r6 and r7, do not take r6 or r7 from the register
705 file, instead please perform a Cache-coherent Load-with-Increment
706 on each, using special (hidden, implicit)
707 Address Registers for each. Each new use
708 of r6 therefore brings in an entirely new value *directly from
709 memory*. Likewise on the second operand, r7, and likewise on
710 the destination result which can be an automatic Coherent
711 Store-and-increment
712 directly into Memory.
713
714 <blockquote>
715 *The act of "reading" or "writing" a register has been decoupled
716 and intercepted, then connected transparently to a completely
717 separate Coherent Memory Subsystem*
718 </blockquote>
719
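In C terms the effect of the tagging is roughly the following (an illustrative sketch, not Snitch's actual mechanism): a read of a tagged register becomes `*src++`, a write becomes `*dst++`, and the instruction in between remains a perfectly ordinary add.

    #include <stdint.h>

    /* Per-register tag: when set, the register is transparently connected
     * to a hidden, auto-incrementing memory pointer. Names illustrative. */
    struct tagged_reg {
        int     streamed;  /* 0: normal scalar register, 1: tagged        */
        double  value;     /* the scalar value when not streamed          */
        double *mem;       /* the hidden Address Register when streamed   */
    };

    static double read_reg(struct tagged_reg *r) {
        return r->streamed ? *r->mem++ : r->value;   /* load-with-increment  */
    }

    static void write_reg(struct tagged_reg *r, double v) {
        if (r->streamed) *r->mem++ = v;              /* store-with-increment */
        else             r->value  = v;
    }

    /* The *existing* instruction is untouched: an ordinary add of r6 and
     * r7 into r5, which becomes a memory-to-memory streaming add purely
     * by virtue of the registers having been tagged. */
    void fadd_r5_r6_r7(struct tagged_reg *r5, struct tagged_reg *r6,
                       struct tagged_reg *r7) {
        write_reg(r5, read_reg(r6) + read_reg(r7));
    }
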
720 On top of a barrel-architecture the slowness of Memory access
721 was not a problem because the Deterministic nature of classic
722 Load-Store-Increment can be compensated for by having 8 Memory
723 accesses scheduled underway and interleaved in a time-sliced
724 fashion with an FPU that is correspondingly 8 times faster than
725 the Coherent Memory accesses.
726
727 This design is reminiscent of the early Vector Processors
728 of the late 1960s and early 1970s, which also critically relied
729 on implicit auto-increment addressing.
730 The [CDC STAR-100](https://en.m.wikipedia.org/wiki/CDC_STAR-100)
731 for example was specifically designed as a Memory-to-Memory Vector
732 Processor. The barrel-architecture of Snitch neatly
733 solves one of the inherent problems of those early designs (a mismatch
734 with memory
735 speed) and the presence of a full register file (non-tagged,
736 normal, standard scalar registers) caters for a
737 second limitation of pure Memory-based Vector Processors: temporary
738 variables needed in the computation of intermediate results, which
739 also had to go through memory, put
740 an awfully high artificial load on Memory bandwidth.
741
742 The similarity to SVP64 should be clear: SVP64 Prefixing and the
743 associated REMAP system is just another form of register "tagging"
744 that augments what was formerly designated by its original authors
745 as "just a Scalar ISA", tagging allows for dramatic implicit alteration
746 with advanced behaviour not previously envisaged.
747
748 What Snitch brings to the table therefore is a further illustration of
749 the concept introduced by Extra-V: where Extra-V brought information
750 about Sparse-Distributed Data to the attention of the main CPU in
751 a coherent fashion *without the CPU having to ask for it*, Snitch
752 demonstrates a classic LOAD-COMPUTE-STORE cycle in the same
753 distributed coherent manner, and does so with dramatically-reduced
754 power consumption.
755
756 **Bringing it all together**
757
758 At this point we are well into a future revision of SVP64, one that
759 clearly has some startlingly powerful potential: Supercomputing-class
760 Multi-Issue Vector Engines kept 100% occupied in a 100% long-term
761 sustained fashion with reduced complexity, reduced power consumption
762 and reduced completion time, thanks to Deterministic Coherent Scheduling
763 of the data fed in and out, or even moved down next to Memory.
764
765 This last part is where it normally gets hair-raising, but as ZOLC shows
766 there is no reason at all why even complex algorithms such as MPEG cannot
767 be run in a partially-deterministic manner, and anything that is
768 deterministic can be Scheduled, coherently. Combine that with OpenCAPI
769 which solves the many issues associated with SMP Virtual Memory and so on
770 yet still allows Cache-Coherent Distributed Memory Access, and what was
771 for decades an intractable Computer Science problem begins to
772 look like it has a potential solution.
773
774 It should even be possible to identify whether the Deterministic Schedules
775 created by ZOLC are suitable for full off-CPU distributed processing, as long as OpenCAPI
776 is integrated into the mix. What a compiler - or even the hardware -
777 will be looking out for is a Basic Block of instructions that:
778
779 * begins with a LOAD (to be handled by OpenCAPI)
780 * contains some instructions that a given PE is capable of executing
781 * ends with a STORE (again: OpenCAPI)
782
783 For best results that would be wrapped with a Zero-Overhead Loop
784 (which is offloaded - in full - down to the PE), where
785 the Compiler (or hardware at runtime) could easily identify, in advance,
786 the full range of Memory Addresses that the Loop is to encounter. Copies
787 of loop-invariant data would need to be passed down to the remote PE:
788 again, for simple-enough Basic Blocks, with assistance from the Compiler,
789 loop-invariant inputs are easily identified. Parallel Processing
790 opportunities should also be easy enough to create, simply by farming out
791 different parts of a given Deterministic Zero-Overhead Loop to
792 different PEs based on their proximity, bandwidth or ease of access to
793 given Memory.
794
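A trivial C example of the kind of Basic Block being described (illustrative; in practice the identification would be done on the compiled binary or via compiler markup): the loop body begins with Loads, performs arithmetic that a simple PE could handle, and ends with a Store, with `scale` being the loop-invariant input that would be copied down to the remote PE.

    #include <stddef.h>

    /* Offload candidate: LOAD (a[i], b[i]) -> COMPUTE -> STORE (out[i]),
     * wrapped in a countable loop whose full Memory address range is
     * known in advance. */
    void scale_and_add(float *out, const float *a, const float *b,
                       float scale, size_t n) {
        for (size_t i = 0; i < n; i++) {   /* the Zero-Overhead Loop          */
            float x = a[i];                /* LOAD  - handled via OpenCAPI    */
            float y = b[i];                /* LOAD                            */
            float r = x * scale + y;       /* COMPUTE - within a PE's ability */
            out[i] = r;                    /* STORE - again via OpenCAPI      */
        }
    }
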
795 The importance of OpenCAPI in this mix cannot be overstated, because
796 it will be the means by which the main CPU coordinates its activities
797 with the remote PEs, ensuring that LOAD/STORE Memory Hazards are not
798 violated. It should also be straightforward to ensure that the offloading
799 is entirely transparent to the developer: in fact this is a hard requirement
800 because at any given moment there is the possibility that the PEs may be
801 busy and it is the main CPU that has to complete the Processing Task itself.
802
803 It is also important to note that we are not necessarily talking about
804 the Remote PEs executing the Power ISA, but if they do so it becomes
805 much easier for the main CPU to take over in the event that PEs are
806 currently occupied. Plus, the twin lessons - that inventing an ISA,
807 even a small one, is hard (mostly in compiler writing), and that GPU
808 Task Scheduling is complex - are being heard loud and clear.
809
810 Put another way: if the PEs run a foreign ISA, then the Basic Blocks embedded inside the ZOLC Loops must be in that ISA and therefore:
811
812 * In order that the main CPU can execute the same sequence if necessary,
813 the CPU must support dual ISAs: Power and PE **OR**
814 * There must be a JIT binary-translator which either turns PE code
815 into Power ISA code or vice-versa **OR**
816 * The compiler dual-compiles the original source code, and embeds
817 both a Power binary and a PE binary into the ZOLC Basic Block **OR**
818 * All binaries are stored in an Intermediate Representation
819 (LLVM-IR, SPIR-V) and JIT-compiled on-demand.
820
821 All of these would work, but it is simpler and a lot less work
822 just to have the PEs
823 execute the exact same ISA (or a subset of it). If however the
824 concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
825 which would increase adoption and reduce cost, a bit more thought
826 is required here because ARM or Intel or MIPS might not necessarily
827 be happy that a Processing Element (PE) has to execute Power ISA binaries.
828 At least the Power ISA is much richer, more powerful, still RISC,
829 and is an Open Standard, as discussed in earlier sections.
830
831 A reasonable compromise as a JEDEC Standard is illustrated with
832 the following diagram: a 3-way Bridge PHY that allows for full
833 direct interaction between DRAM ICs, PEs, and one or more main CPUs
834 (*a variant of the Northbridge and/or IBM POWER10 OMI-to-DDR5 PHY concept*).
835 It is also the ideal location for a "Management Core".
836 If the 3-way Bridge (4-way if connectivity to other Bridge PHYs
837 is also included) does not itself have PEs built-in then the ISA
838 utilised on any PE or CPU is non-critical. The only concern regarding
839 mixed ISAs is that the PHY should be capable of transferring all and
840 any types of "Management" packets, particularly PE Virtual Memory Management
841 and Register File Control (Context-switch Management given that the PEs
842 are expected to be ALU-heavy and not capable of running a full SMP Operating
843 System).
844
845 There is also no reason why this type of arrangement should not be deployed
846 in Multi-Chip-Module (aka "Chiplet") form, giving all the advantages of
847 the performance boost that goes with smaller line-drivers.
848
849
850 Draft Image (placeholder):
851
852 <img src="/openpower/sv/bridge_phy.jpg" width=800 />
853
854 # Transparently-Distributed Vector Processing
855
856 It is very strange to the author to be describing what amounts to a
857 "Holy Grail" solution to a decades-long intractable problem that
858 mitigates the anticipated end of Moore's Law: how to make it easy for
859 well-defined workloads, expressed as a perfectly normal
860 sequential program, compiled to a standard well-known ISA, to have
861 the potential of being offloaded transparently to Parallel Compute Engines,
862 all without the Software Developer being excessively burdened with
863 a Parallel-Processing Paradigm that is alien to all their experience
864 and training, as well as Industry-wide common knowledge.
865
866 Will it be that easy? ZOLC is, honestly, in its current incarnation,
867 not that straightforward: programs
868 have to be "massaged" by tools that insert intrinsics into the
869 source code, in order to identify the Basic Blocks that the Zero-Overhead
870 Loops can run. Can this be merged into standard gcc and llvm
871 compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
872 if an infinite supply of money and engineering time is thrown at it.
873 Is a half-way-house solution of compiler intrinsics good enough?
874 Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
875 for several decades, and advanced programmers are comfortable with the
876 practice.
877
878 Additional questions remain, such as whether OpenCAPI, or its use for this
879 particular scenario, requires that the PEs, even quite basic ones,
880 implement a full RADIX MMU and associated TLB lookup. In order to ensure
881 that programs may be cleanly and seamlessly transferred between PEs
882 and CPU the answer is quite likely to be "yes", which is interesting
883 in and of itself. Fortunately, the associated L1 Cache with TLB
884 Translation does not have to be large, and the actual RADIX Tree Walk
885 need not explicitly be done by the PEs, it can be handled by the main
886 CPU as a software-extension: PEs generate a TLB Miss notification
887 to the main CPU over OpenCAPI, and the main CPU feeds back the new
888 TLB entries to the PE in response.
889
890 Also, in practical terms, with the PEs anticipated to be so small as to
891 make running a full SMP-aware OS impractical, it will not just be their TLB
892 pages that need remote management: their entire register file, including
893 the Program Counter, will need to be set up, and the ZOLC Context as
894 well. With OpenCAPI packet formats being quite large a concern is that
895 the context management increases latency to the point where the premise
896 of this paper is invalidated. Research is needed here as to whether a
897 bare-bones microkernel
898 would be viable, or a Management Core closer to the PEs (on the same
899 die or Multi-Chip-Module as the PEs) would allow better bandwidth and
900 reduce Management Overhead on the main CPUs. However, if
901 the same level of power saving as Snitch (1/6th) and
902 the same sort of reduction in algorithm runtime as ZOLC (20 to 80%) are not
903 unreasonable to expect, this is
904 definitely compelling enough to warrant in-depth investigation.
905
906 **Use-case: Matrix and Convolutions**
907
908 First, some important definitions, because there are two different
909 Vectorisation Modes in SVP64:
910
911 * **Horizontal-First**: (aka standard Cray Vectors) walk
912 through **elements** first before moving to next **instruction**
913 * **Vertical-First**: walk through **instructions** before
914 moving to next **element**. Currently managed by `svstep`,
915 ZOLC may be deployed to manage the stepping, in a Deterministic manner.
916
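Expressed as plain C loops (conceptual only; in SVP64 the element stepping of Vertical-First is carried out by `svstep`, or in future by ZOLC), the two orderings look like this:

    /* Horizontal-First (classic Cray style): each instruction completes
     * across all elements before the next instruction begins.
     * (vl assumed <= 64 purely for the temporary Vector.) */
    void horizontal_first(float *d, const float *a, const float *b, int vl) {
        float t[64];                                      /* a whole Vector of temporaries */
        for (int i = 0; i < vl; i++) t[i] = a[i] * b[i];  /* instruction 1, all elements   */
        for (int i = 0; i < vl; i++) d[i] = d[i] + t[i];  /* instruction 2, all elements   */
    }

    /* Vertical-First: the whole block of instructions runs on one element,
     * then the element index is stepped and the block repeats. */
    void vertical_first(float *d, const float *a, const float *b, int vl) {
        for (int i = 0; i < vl; i++) {     /* svstep / ZOLC advances i   */
            float t = a[i] * b[i];         /* instruction 1, one element */
            d[i] = d[i] + t;               /* instruction 2, one element */
        }
    }
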
917 Imagine a large Matrix scenario, with several values close to zero that
918 could be skipped: no need to include zero-multiplications, but a
919 traditional CPU can in no way help: only by loading the data through
920 the L1-L4 Cache and Virtual Memory Barriers is it possible to
921 ascertain, retrospectively, that time and power had just been wasted.
922
923 SVP64 is able to do what is termed "Vertical-First" Vectorisation,
924 combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
925 extended, Snitch-style, to perform a deterministic memory-array walk of
926 a large Matrix.
927
928 Let us also imagine that the Matrices are stored in Memory with PEs
929 attached, and that the PEs are fully functioning Power ISA with Draft
930 SVP64, but their Multiply capability is not as good as the main CPU.
931 Therefore:
932 we want the PEs to conditionally
933 feed sparse data to the main CPU, a la "Extra-V".
934
935 * The ZOLC SVREMAP System running on the main CPU generates a Matrix
936 Memory-Load Schedule.
937 * The Schedule is sent to the PEs, next to the Memory, via OpenCAPI
938 * The PEs are also sent the Basic Block to be executed on each
939 Memory Load (each element of the Matrices to be multiplied)
940 * The PEs execute the Basic Block and **exclude**, in a deterministic
941 fashion, any elements containing Zero values
942 * Non-zero elements are sent, via OpenCAPI, to the main CPU, which
943 queues sequences of Multiply-and-Accumulate, and feeds the results
944 back to Memory, again via OpenCAPI, to the PEs.
945 * The PEs, which are tracking the Sparse Conditions, know where
946 to store the results received
947
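The Basic Block run by each PE in this scenario might look, conceptually, like the following C (entirely illustrative: the message format, the OpenCAPI interaction and the `send_to_cpu` helper are all hypothetical): load the element the Schedule points at, test it, and forward only non-zero values to the main CPU.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical stand-in for "forward this element, and its position,
     * to the main CPU over OpenCAPI"; here it merely prints. */
    static void send_to_cpu(double value, uint32_t row, uint32_t col) {
        printf("non-zero element %g at (%u,%u)\n", value, row, col);
    }

    /* PE-side Basic Block, run once per element of the Memory-Load
     * Schedule: zero elements are excluded right next to the DRAM and
     * never cross the L1/L2/L3 Caches of the main CPU. */
    void pe_filter_element(const double *matrix, uint32_t cols,
                           uint32_t row, uint32_t col) {
        double v = matrix[row * cols + col];   /* LOAD, at the Memory         */
        if (v != 0.0)                          /* conditional test, at the PE */
            send_to_cpu(v, row, col);          /* only useful data moves up   */
    }
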
948 In essence this is near-identical to the original Snitch concept
949 except that there are, like Extra-V, PEs able to perform
950 conditional testing of the data as it goes both to and from the
951 main CPU. In this way a large Sparse Matrix Multiply or Convolution
952 may be achieved without having to pass unnecessary data through
953 L1/L2/L3 Caches only to find, at the CPU, that it is zero.
954
955 The reason in this case for the use of Vertical-First Mode is the
956 conditional execution of the Multiply-and-Accumulate.
957 Horizontal-First Mode is the standard Cray-Style Vectorisation:
958 loop on all *elements* with the same instruction before moving
959 on to the next instruction. Horizontal-First
960 Predication needs to be pre-calculated
961 for the entire Vector in order to exclude certain elements from
962 the computation. In this case, that's an expensive inconvenience
963 (remarkably similar to the problems associated with Memory-to-Memory
964 Vector Machines such as the CDC Star-100).
965
966 Vertical-First allows *scalar* instructions and
967 *scalar* temporary registers to be utilised
968 in the assessment as to whether a particular Vector element should
969 be skipped, utilising a straight Branch instruction *(or ZOLC
970 Conditions)*. The Vertical Vector technique
971 was pioneered by Mitch Alsup and is a key feature of his VVM Extension
972 to the My 66000 ISA. Careful analysis of the registers within the
973 Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
974 *amortise in-flight scalar looped operations into SIMD batches*
975 as long as the loop is kept small enough to entirely fit into
976 in-flight Reservation Stations in the first place.
977
978 *<blockquote>
979 (With thanks and gratitude to Mitch Alsup on comp.arch for
980 spending considerable time explaining VVM, how its Loop
981 Construct explicitly identifies loop-invariant registers,
982 and how that helps Register Hazards and SIMD amortisation
983 on a GB-OoO Micro-architecture)
984 </blockquote>*
985
986 Draft Image (placeholder):
987
988 <img src="/openpower/sv/zolc_svp64_extrav.jpg" width=800 />
989
990 The program being executed is a simple loop with a conditional
991 test that ignores the multiply if the input is zero.
992
993 * In the CPU-only case (top) the data goes through L1/L2
994 Cache before reaching the CPU.
995 * However the PE version does not send zero-data to the CPU,
996 and even when it does it goes into a Coherent FIFO: no real
997 compelling need to enter L1/L2 Cache or even the CPU Register
998 File (one of the key reasons why Snitch saves so much power).
999 * In the PE-only version (see next use-case) the CPU is mostly
1000 idle, serving RADIX MMU TLB requests for PEs, and OpenCAPI
1001 requests.
1002
1003 **Use-case variant: More powerful in-memory PEs**
1004
1005 An obvious variant of the above is that, if there is inherently
1006 more parallelism in the data set, then the PEs get their own
1007 Multiply-and-Accumulate instruction, and rather than send the
1008 data to the CPU over OpenCAPI, perform the Matrix-Multiply
1009 directly themselves.
1010
1011 However the source code and binary would be near-identical if
1012 not identical in every respect, with the PEs implementing the full
1013 ZOLC capability in order to compact binary size to the bare minimum.
1014 The main CPU's role would be to coordinate and manage the PEs
1015 over OpenCAPI.
1016
1017 One key strategic question does remain: do the PEs need to have
1018 a RADIX MMU and associated TLB-aware minimal L1 Cache, in order
1019 to support OpenCAPI properly? The answer is very likely to be yes.
1020 The saving grace here is that with
1021 the expectation of running only hot-loops with ZOLC-driven
1022 binaries, the size of each PE's TLB-aware
1023 L1 Cache needed would be miniscule compared
1024 to the average high-end CPU.
1025
1026 **Comparison of PE-CPU to GPU-CPU interaction**
1027
1028 The informed reader will have noted the remarkable similarity between how
1029 a CPU communicates with a GPU to schedule tasks, and the proposed
1030 architecture. CPUs schedule tasks with GPUs as follows:
1031
1032 * User-space program encounters an OpenGL function, in the
1033 CPU's ISA.
1034 * Proprietary GPU Driver, still in the CPU's ISA, prepares a
1035 Shader Binary written in the GPU's ISA.
1036 * GPU Driver wishes to transfer both the data and the Shader Binary
1037 to the GPU. Both may only be transferred via Shared Memory, usually
1038 DMA over PCIe (assuming a PCIe Graphics Card).
1039 * GPU Driver, which has been running in CPU userspace, notifies CPU
1040 kernelspace of the desire to transfer data and GPU Shader Binary
1041 to the GPU. A context-switch occurs...
1042
1043 It is almost unfair to burden the reader with further details.
1044 The extraordinarily convoluted procedure is as bad as it sounds. Hundreds
1045 of thousands of tasks per second are scheduled this way, with hundreds
1046 of megabytes of data per second being exchanged as well.
1047
1048 Yet, the process is not that different from how things would work
1049 with the proposed microarchitecture: the differences, however, are key.
1050
1051 * Both PEs and CPU run the exact same ISA. A major complexity of 3D GPU
1052 and CUDA workloads (JIT compilation etc) is eliminated, and, crucially,
1053 the CPU may directly execute the PE's tasks, if needed. This simply
1054 is not even remotely possible on GPU Architectures.
1055 * Where GPU Drivers use PCIe Shared Memory, the proposed architecture
1056 deploys OpenCAPI.
1057 * Where GPUs are a foreign architecture and a foreign ISA, the proposed
1058 architecture only narrowly misses being defined as big/LITTLE Symmetric
1059 Multi-Processing (SMP) by virtue of the massively-parallel PEs
1060 being a bit light on L1 Cache, in favour of large ALUs and proximity
1061 to Memory, and requiring a modest amount of "helper" assistance with
1062 their Virtual Memory Management.
1063 * The proposed architecture has the markup points embedded into the
1064 binary programs
1065 where PEs may take over from the CPU, and there is accompanying
1066 (planned) hardware-level assistance at the ISA level. GPUs, which have to
1067 work with a wide range of commodity CPUs, cannot in any way expect
1068 ARM or Intel to add support for GPU Task Scheduling directly into
1069 the ARM or x86 ISAs!
1070
1071 On this last point it is crucial to note that SVP64 drew its inspiration
1072 from a Hybrid CPU-GPU-VPU paradigm (like ICubeCorp's IC3128) and
1073 consequently has a versatility that the separate specialisations of
1074 GPU and CPU architectures lack.
1075
1076 **Roadmap summary of Advanced SVP64**
1077
1078 The future direction for SVP64, then, is:
1079
1080 * To overcome its current limitation of REMAP Schedules being
1081 restricted to Register Files, leveraging the Snitch-style
1082 register interception "tagging" technique.
1083 * To adopt ZOLC and merge REMAP Schedules into ZOLC
1084 * To bring OpenCAPI Memory Access into ZOLC as a first-level
1085 concept that mirrors Snitch's Coherent Memory interception
1086 * To add the Graph-Node Walking Capability of Extra-V
1087 to ZOLC / SVREMAP
1088 * To make it possible, in a combination of hardware and software,
1089 to easily identify ZOLC / SVREMAP Blocks
1090 that may be transparently pushed down closer to Memory, for
1091 localised distributed parallel execution, by OpenCAPI-aware PEs,
1092 exploiting both the Deterministic nature of ZOLC / SVREMAP
1093 and the Cache-Coherent nature of OpenCAPI,
1094 to the maximum extent possible.
1095 * To explore "Remote Management" of PE RADIX MMU, TLB, and
1096 Context-Switching (register file transference) by proxy,
1097 over OpenCAPI, to ensure that the distributed PEs are as
1098 close to a Standard SMP model as possible, for programmers.
1099 * To make this powerful solution as simple
1100 and straightforward as possible for Software Engineers to exploit,
1101 through the standard, commonly-used compilers gcc and llvm.
1102 * To propose extensions to Public Standards that allow all of
1103 the above to become part of everyday ubiquitous mass-volume
1104 computing.
1105
1106 Even the first of these - merging Snitch-style register tagging
1107 into SVP64 - would
1108 expand SVP64's capability for Matrices, currently limited to
1109 around 5x7 to 6x6 Matrices and constrained by the size of
1110 the register files (128 64-bit entries), to arbitrary (massive) sizes.
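
As a rough register-budget illustration (an assumption made purely for this
example, not a statement of the actual REMAP allocation rules): holding all
three operands of a 6x6 FP64 Matrix-Multiply entirely in the register file
needs 3 x 36 = 108 of the 128 entries, leaving almost no headroom, whereas a
7x7 would already need 3 x 49 = 147 and no longer fits. Snitch-style tagging,
by transparently redirecting register accesses to Memory, removes that
ceiling altogether.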
1111
1112 **Summary**
1113
1114 There are historical and current efforts that step away both from a
1115 general-purpose architecture and from the practice of using compiler
1116 intrinsics in general-purpose compute to make programmers' lives easier.
1117 A classic example is the Cell Processor (Sony PS3), which required
1118 programmers to use DMA to schedule processing tasks. These specialist
1119 high-performance architectures are only tolerated for
1120 as long as there is no equivalently performant alternative that is
1121 easier to program.
1122
1123 Combining SVP64 with ZOLC and OpenCAPI can produce an extremely powerful
1124 architectural base that fits well with intrinsics embedded into standard
1125 general-purpose compilers (gcc, llvm) as a pragmatic compromise which makes
1126 it useful right out of the gate. Further R&D may target compiler technology
1127 that brings it on par with NVIDIA, Graphcore and AMDGPU, but with intrinsics
1128 there is no critical product-launch dependence on having such
1129 advanced compilers.
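
Purely as an illustration of what that pragmatic compromise looks like to a
programmer, here is a sketch using gcc/llvm's *existing* generic vector
extensions (these are not SVP64 intrinsics, which are yet to be defined; the
type and function names are invented for the example):

    /* gcc/clang generic vector extension: the programmer states the
     * parallelism explicitly, and the compiler maps the element-wise
     * arithmetic onto whatever the target provides. */
    typedef double f64x4 __attribute__((vector_size(32)));  /* 4 x FP64 */

    f64x4 fma4(f64x4 a, f64x4 b, f64x4 c) {
        return a * b + c;   /* element-wise multiply-add */
    }

The point is that such explicit forms do not depend on advanced
auto-vectorising compilers: the parallelism is stated by the programmer,
and the compiler's remaining job is straightforward instruction selection.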
1130
1131 The bottom line is that there is a clear roadmap towards solving a
1132 long-standing problem facing Computer Science, and doing so in a way that
1133 reduces power consumption, reduces algorithm completion time, and reduces
1134 the need for complex hardware microarchitectures in favour of much
1135 smaller distributed coherent Processing Elements.
1136
1137 # Appendix
1138
1139 **Samsung PIM**
1140
1141 Samsung's
1142 [Processing-in-Memory](https://semiconductor.samsung.com/emea/newsroom/news/samsung-brings-in-memory-processing-power-to-wider-range-of-applications/)
1143 seems to be ready to launch as a
1144 [commercial product](https://semiconductor.samsung.com/insights/technology/pim/)
1145 that uses HBM as its Memory Standard,
1146 has "some logic suitable for AI", has parallel processing elements,
1147 and offers 70% reduction
1148 in power consumption and a 2x performance increase in speech
1149 recognition. Details beyond that as to its internal workings
1150 or programmability are minimal; however, given the similarity
1151 to D-Matrix and Google TPU it is reasonable to place it in the
1152 same category.
1153
1154 * [Samsung PIM IEEE Article](https://spectrum.ieee.org/samsung-ai-memory-chips)
1155 explains that there are 9 instructions, mostly FP16 arithmetic,
1156 and that it is designed to "complement" AI rather than compete.
1157 With only 9 instructions, 2 of which will be LOAD and STORE,
1158 conditional code execution seems unlikely.
1159 Silicon area in DRAM is increased by 5% for a much greater reduction
1160 in power. The article notes, pointedly, that programmability will
1161 be a key deciding factor. The article also notes that Samsung has
1162 proposed its architecture as a JEDEC Standard.
1163
1164 **PIM-HBM Research**
1165
1166 [Presentation](https://ieeexplore.ieee.org/document/9073325/) by Seongguk Kim
1167 and associated [video](https://www.youtube.com/watch?v=e4zU6u0YIRU)
1168 showing 3D-stacked DRAM connected to GPUs, noting that even HBM, due to
1169 large GPU size, is less advantageous than it should be. Processing-in-Memory
1170 is therefore logically proposed. The PE (named a Streaming Multiprocessor)
1171 is much more sophisticated, comprising a Register File, L1 Cache, FP32 and
1172 FP64 units, and a Tensor Unit.
1173
1174 <img src="/openpower/sv/2022-05-14_11-55.jpg" width=500 />
1175
1176 **etp4hpc.eu**
1177
1178 [ETP 4 HPC](https://etp4hpc.eu) is a European Joint Initiative for HPC,
1179 with an eye towards
1180 [Processing in Memory](https://www.etp4hpc.eu/pujades/files/ETP4HPC_WP_Processing-In-Memory_FINAL.pdf).
1181
1182 **Salient Labs**
1183
1184 [Research paper](https://arxiv.org/abs/2002.00281) explaining
1185 that they can exceed a 14 GHz clock rate for Multiply-and-Accumulate
1186 using Photonics.