1 [[!tag whitepapers]]
2
3 **Revision History**
4
5 * v0.00 05may2021 first created
6 * v0.01 06may2021 initial first draft
7 * v0.02 08may2021 add scenarios / use-cases
8 * v0.03 09may2021 add draft image for scenario
9
10 **Table of Contents**
11
12 [[!toc]]
13
# Why in the 2020s would you invent a new Vector ISA?
15
16 *(The short answer: you don't. Extend existing technology: on the shoulders of giants)*
17
Inventing a new Scalar ISA from scratch is over a decade-long task,
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. Stable ISAs require Standards and Compliance Suites that
take longer still. No Vector or Packed SIMD ISA has ever reached stable
*general-purpose* auto-vectorisation compiler support in the
history of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
for achieving high-performance optimisation where it matters*).
GPUs fill this void, both in hardware and software terms, with
ultra-specialist compilers (CUDA) designed from the ground up
to support Vector/SIMD parallelism, and with associated standards
(SPIR-V, Vulkan, OpenCL) managed by
the Khronos Group, sustained by multi-man-century development commitment
from multiple billion-dollar-revenue companies.
33
This raises the question: why on earth would anyone take on
this task, and what, in Computer Science, actually needs solving?
36
The first hint is that whilst memory bitcells have not increased in speed
since the 90s (they remain at around 150 MHz), increasing the bank width,
striping, and the width and speed of the datapaths to those banks has
allowed apparent speed increases, at the cost of significant relative
latency penalties: 3200 MHz DDR4 and even faster DDR5,
and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI's OMI,
all make the effort (all simply increasing the parallel deployment of
the underlying 150 MHz bitcells), but these efforts are dwarfed by the
nearly three orders of magnitude increase in CPU horsepower
over the same timeframe. Seymour
Cray, from his amazing in-depth knowledge, predicted over two decades
ago that this mismatch would become a serious limitation.
49
The latency gap between that bitcell speed and the CPU speed is
particularly punishing for Random Access (unpredictable reads/writes).
Caching helps only so much, and not at all with some types of workloads
(FFTs are one of the worst) even though they are fully deterministic.
Some systems at the time of writing are approaching a *Gigabyte* of
L4 Cache by way of compensation, and as we know from experience even
that will be considered inadequate in future.
58
Efforts to solve this problem by moving the processing closer to, or
integrating it directly into, the memory have traditionally not gone well.
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stacks were so
specialist that they required heavy investment by customers to utilise.
D-Matrix, a Systolic Array Processor, is a modern incarnation of the exact
same "specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under.
The huge risk is that any "better AI mousetrap" created by an
innovative competitor quickly renders a too-specialist design obsolete.
72
NVIDIA and other GPUs have taken a different approach again: massive
parallelism, with a reasonably Turing-complete ISA in each core, and
dedicated slower parallel memory paths (GDDR5) suited to the specific
tasks of 3D, Parallel Compute and AI. The complexity of this approach
is dwarfed only by the amount of money poured into the software ecosystem
in order to make it accessible, and even then, GPU Programmers are a
specialist and rare (expensive) breed.
80
A second hint as to the answer emerges from the article
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by the Industry Giants
ARM, Intel and AMD since the 90s (over three decades): SIMD, an
O(N^6) opcode proliferation nightmare whose mantra "make it
easy for hardware engineers, let software sort out the mess" has literally
overwhelmed programmers with thousands of instructions. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an example, is anything but optimal: overall performance of AVX-512
actually *decreases* even as power consumption goes up.
92
Cray-style Vectors solved the opcode proliferation nightmare over thirty
years ago. Only the NEC SX Aurora, however, truly kept the Cray Vector
flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to understand it by applying their experience
of Packed SIMD NEON. ARM's advice not to create SVE2 assembler that is
hardcoded to fixed widths is being disregarded, in favour of writing
*multiple (functionally identical) implementations* of a function, each
hardcoded to a different hardware width, and compelling software to
choose one at runtime after probing the hardware.
106
Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience across multiple
ISAs: one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.
114
115 Adrian logically and concisely points out that the fundamental design
116 assumptions and simplifications that went into the RISC-V ISA have an
117 irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private, custom, secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: they created their
TMC2660 Stepper IC by replacing ARM with RISC-V, saving themselves USD 1
in licensing royalties per product. Ubiquitous, common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.
127
Slowly, at this point, a realisation should be sinking in: there are
actually not that many truly viable Vector ISAs out there, because the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.
132
133 **Successfully identifying a limitation marks the beginning of an
134 opportunity**
135
136 We are nowhere near done, however, because a Vector ISA is a superset of a
137 Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
138 support, and even longer to get the software ecosystem up and running.
139
140 Which ISAs, therefore, have or have had, at one point in time, a decent
141 Software Ecosystem? Debian supports most of these including s390:
142
143 * SPARC, created by Sun Microsystems and all but abandoned by Oracle.
144 Gaisler Research maintains the LEON Open Source Cores but with Oracle's
145 reputation nobody wants to go near SPARC.
* MIPS, made famous by SGI and only really commonly used in Network
  switches. Exceptions: Ingenic with embedded CPUs,
  and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
  litigiously-protected.
151 * ARM, well known in embedded and smartphone scenarios, very slowly
152 making its way into data centres.
153 * OpenRISC, an entirely Open ISA suitable for embedded systems.
154 * s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
  two of the top three top500.org supercomputers being built on
  IBM POWER9, at around two million cores each.
158 * ARC, a competitor at the time to ARM, best known for use in
159 Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
  and expanding rapidly in an uncontrolled fashion,
  set on an unstoppable
  and inevitable trainwreck path to replicate the
  opcode conflict nightmare that plagued the Power ISA
  two decades ago.
* Tensilica, Andes STAR and Western Digital: successful
  commercial proprietary ISAs, Tensilica in Baseband Modems,
  Andes in Audio DSPs, and WD in HDDs and SSDs. These are all
  astoundingly commercially successful
  multi-billion-unit mass-volume markets that almost nobody
  knows anything about. Included for completeness.
172
173 In order of least controlled to most controlled, the viable
174 candidates for further advancement are:
175
* OpenRISC 1200, not controlled or restricted by anyone. No patent
  protection.
* RISC-V, touted as "Open" but actually strictly controlled under a
  Trademark License: too new to have adequate patent pool protection,
  as evidenced by multiple adopters having been hit by patent lawsuits.
  (Agreements between RISC-V *Members* not to engage in patent litigation
  do nothing to stop third-party patents that *legitimately pre-date*
  the newly-created RISC-V ISA.)
184 * MIPS, SPARC, ARC, and others, simply have no viable publicly
185 managed ecosystem. They work well within their niche markets.
186 * Power ISA: protected by IBM's extensive patent portfolio for Members
187 of the OpenPOWER Foundation, covered by Trademarks, permitting
188 and encouraging contributions, and having software support for over
189 20 years.
* ARM, not permitting Open Licensing. They survived the early 90s
  only by doing a deal with Samsung for an in-perpetuity
  Royalty-free License, in exchange
  for GBP 3 million and legal protection through Samsung Research.
  Several large Corporations (Apple most notably) have licensed the ISA
  but not ARM designs: the barrier to entry is high and the ISA itself
  protected from interference as a result.
* x86, famous for an unprecedented
  Court Ruling in 2004 where a Judge "banged heads
  together" and ordered AMD and Intel to stop wasting his time,
  make peace, and cross-license each other's patents. Anyone wishing
  to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
  and VIA EDEN processors, and see how they fared.
203 * s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
204 but the 800lb "Corporate Gorilla Syndrome" seems not to have deterred one
205 particularly disingenuous group from performing illegal
206 Reverse-Engineering.
207
By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent protection,
software ecosystem, openness and pedigree all combine to reduce risk
and increase the chances of success, there is really only one candidate.
212
**Of all of these, the one with the most going for it is the Power ISA.**
214
215 The summary of advantages, then, of the Power ISA is that:
216
* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
  and more.
* Amongst many other features
  it has Condition Registers which can be used by Branches, greatly
  reducing pressure on the main register files.
* IBM's extensive 20+ years of patents are available, royalty-free,
  to protect implementors as long as they are also members of the
  OpenPOWER Foundation.
* IBM designed and has maintained the Power ISA as a Supercomputing-class
  ISA from its inception over 25 years ago.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
  RFC Process that does not require membership of the OPF.
230
From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?
236
237 # How do you turn a Scalar ISA into a Vector one?
238
239 The most obvious question before that is: why on earth would you want to?
240 As explained in the "SIMD Considered Harmful" article, Cray-style
241 Vector ISAs break the link between data element batches and the
242 underlying architectural back-end parallel processing capability.
243 Packed SIMD explicitly smashes that width right in the face of the
244 programmer and expects them to like it. As the article immediately
245 demonstrates, an arbitrary-sized data set has to contend with
246 an insane power-of-two Packed SIMD cascade at both setup and teardown
247 that routinely adds literally an order
248 of magnitude increase in the number of hand-written lines of assembler
249 compared to a well-designed Cray-style Vector ISA with a `setvl`
250 instruction.
251
252 *<blockquote>
253 Packed SIMD looped algorithms actually have to
254 contain multiple implementations processing fragments of data at
255 different SIMD widths: Cray-style Vectors have just the one, covering not
256 just current architectural implementations but future ones with
257 wider back-end ALUs as well.
258 </blockquote>*
259
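To make the contrast concrete, below is a minimal Python sketch, purely
illustrative and tied to no actual ISA, of the two approaches to
processing an arbitrary-length array: the power-of-two Packed SIMD
cascade versus a single `setvl`-style loop.

```python
# Illustrative model only: the setup/teardown cascade that Packed SIMD
# forces, versus the single loop that a Cray-style `setvl` permits.

def simd_add(a, b):
    """Packed SIMD: each power-of-two width is a *separate* opcode."""
    out, i, n = [], 0, len(a)
    for width in (8, 4, 2, 1):              # the cascade
        while n - i >= width:
            out += [x + y for x, y in zip(a[i:i+width], b[i:i+width])]
            i += width
    return out

def vector_add(a, b, maxvl=8):
    """Cray-style: one loop; hardware chooses VL up to MAXVL each pass."""
    out, i, n = [], 0, len(a)
    while i < n:
        vl = min(maxvl, n - i)              # the essence of `setvl`
        out += [x + y for x, y in zip(a[i:i+vl], b[i:i+vl])]
        i += vl
    return out
```

In real assembler the cascade is far worse than it looks here, because
each width is a distinct instruction with distinct registers, not a
two-line Python loop.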
Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of the existing 95+ of base RV64GC. Adding
200% more instructions than the base ISA seems unwise: at the very
least, it feels like there should be a better way. On close inspection
of RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector, *and* predicate
mask management, *and* transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.
273
The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour? Make "Scalar instruction"
synonymous with "Vector Element instruction", and through nothing
more than contextual
augmentation the Scalar ISA *becomes* the Vector ISA.
Then, by not having to have any Vector instructions at all,
the Instruction Decode
phase is greatly simplified, reducing design complexity and leaving
plenty of headroom for further expansion.
285
Remarkably this is not a new idea. Intel's x86 `REP` prefix
gives the basic concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue
Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then, treating the Scalar "suffix" instruction
as a guide and making "scalar instruction" synonymous with "Vector element",
perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason the team did not take this forward
into a commercial product
was that they could not work out how to cleanly do OoO
multi-issue at the time.
299
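As a conceptual sketch (hypothetical and greatly simplified: this is
not SVP64's actual encoding or register-file layout), the hardware-level
effect of such a prefix can be modelled as a `REP`-like element loop
wrapped around the unmodified scalar operation:

```python
# Hypothetical model: a "vector prefix" wrapped around a scalar op.
# Registers marked as Vector step through consecutive register-file
# entries; registers marked Scalar stay fixed. The scalar operation
# itself is completely unmodified.

regs = [0] * 128                     # flat model register file

def scalar_add(rd, ra, rb):
    regs[rd] = regs[ra] + regs[rb]   # the ordinary scalar instruction

def sv_prefix(op, vl, rd, ra, rb, vectors):
    """Issue `op` VL times, stepping any register marked as a vector."""
    for i in range(vl):
        op(rd + i if 'd' in vectors else rd,
           ra + i if 'a' in vectors else ra,
           rb + i if 'b' in vectors else rb)

# vector-vector add of 4 elements: regs[0..3] = regs[8..11] + regs[16..19]
regs[8:12], regs[16:20] = [1, 2, 3, 4], [10, 20, 30, 40]
sv_prefix(scalar_add, 4, 0, 8, 16, vectors={'d', 'a', 'b'})
```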
300 In its simplest form, then, this "prefixing" idea is a matter
301 of:
302
303 * Defining the format of the prefix
304 * Adding a `setvl` instruction
305 * Adding Vector-context SPRs and working out how to do
306 context-switches with them
307 * Writing an awful lot of Specification Documentation
308 (4 years and counting)
309
310 Once the basics of this concept have sunk in, early
311 advancements quickly follow naturally from analysis
312 of the problem-space:
313
314 * Expanding the size of GPR, FPR and CR register files to
315 provide 128 entries in each. This is a bare minimum for GPUs
316 in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
317 batching as possible.
* Predication, an absolutely critical component of any Vector ISA.
  The next logical advancement is to allow separate predication masks
  to be applied to *both* the source *and* the destination, independently.
  (*Readers familiar with Vector ISAs will recognise this as a back-to-back
  `VGATHER-VSCATTER`*; see the sketch after this list.)
323 * Element-width overrides: most Scalar ISAs today are 64-bit only,
324 with primarily Load and Store being able to handle 8/16/32/64
325 and sometimes 128-bit (quad-word), where Vector ISAs need to
326 go as low as 8-bit arithmetic, even 8-bit Floating-Point for
327 high-performance AI. Rather than waste opcode space adding all
328 such operations at different bitwidths, let the prefix
329 *redefine* (override) the element width, without actually altering
330 the Scalar ISA at all.
331 * "Reordering" of the assumption of linear sequential element
332 access, for Matrices, rotations, transposition, Convolutions,
333 DCT, FFT, Parallel Prefix-Sum and other common transformations
334 that require significant programming effort in other ISAs.
335
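The following sketch (again hypothetical, illustrative only) shows how
source/destination predicate masks and element-width overrides are pure
augmentation: the scalar operation at the centre remains untouched.

```python
# Hypothetical model (illustrative only): predication and element-width
# overrides as pure augmentation around unmodified scalar operations.
regs = [0] * 128                          # flat model register file

def sv_mv(rd, rs, vl, src_mask, dst_mask):
    """Twin predication: source and destination element indices step
    *independently*, each skipping its own masked-out positions,
    in effect a back-to-back VGATHER-VSCATTER."""
    s = d = 0
    while s < vl and d < vl:
        while s < vl and not (src_mask >> s) & 1:
            s += 1                        # skip masked-out source
        while d < vl and not (dst_mask >> d) & 1:
            d += 1                        # skip masked-out destination
        if s < vl and d < vl:
            regs[rd + d] = regs[rs + s]   # the scalar op: a plain `mv`
            s += 1
            d += 1

def sv_add_elwidth(rd, ra, rb, vl, elwidth=8):
    """Element-width override: the scalar 64-bit add is reinterpreted as
    VL narrower adds, without adding any new opcode to the Scalar ISA
    (real SVP64 packs sub-elements into registers; simplified here)."""
    mask = (1 << elwidth) - 1
    for i in range(vl):
        regs[rd + i] = (regs[ra + i] + regs[rb + i]) & mask
```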
All of these things come entirely from "Augmentation" of the Scalar
operation being prefixed: at no time is the Scalar operation itself
significantly altered.
From there, several more "Modes" can be added, including:

* saturation, which is needed for Audio and Video applications
* "Reverse Gear", which runs the Element Loop in reverse order
  (needed for Prefix Sum)
* Data-dependent Fail-First, which emerged from asking the simple
  question, "If modern Vector ISAs have Load/Store Fail-First,
  and the Power ISA has Condition Codes, why not allow Conditional
  early-exit from Arithmetic operation looping?"
* over 500 Branch-Conditional Modes, which emerge from the application
  of Boolean Logic in a Vector context, on top of an already-powerful
  Scalar Branch-Conditional/Counter instruction
353
354 **What is missing from Power Scalar ISA that a Vector ISA needs?**
355
356 Remarkably, very little: the devil is in the details though.
357
* The traditional `iota` instruction may be synthesised with an
  overlapping add that stacks up incrementally and sequentially.
  Although it requires two instructions (one to start the sum-chain),
  the technique has the advantage of allowing increments by arbitrary
  amounts, and is not limited to addition either (see the sketch
  after this list).
* Big-integer addition (arbitrary-precision arithmetic) is an emergent
  characteristic of the carry-in, carry-out capability of the Power ISA
  `adde` instruction: `sv.adde` as a BigNum add naturally emerges from
  the sequential carry-flag chaining of these scalar instructions.
* The Condition Register Fields of the Power ISA make a great candidate
  for use as Predicate Masks, particularly when combined with
  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
372
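A minimal sketch of both of these emergent behaviours (hypothetical
element-loop semantics, not actual SVP64 mnemonics):

```python
# Minimal sketch (hypothetical element-loop semantics, not actual
# SVP64 mnemonics) of two emergent behaviours of Vectorised scalar ops.
regs = [0] * 128                          # flat model register file

def sv_iota(rd, step, vl):
    """iota synthesised from an overlapping vector add: element i reads
    element i-1, so 1, 2, 3, ... (or any arbitrary stride) appears."""
    regs[rd] = step                       # one instruction starts the chain
    for i in range(1, vl):                # one overlapping add does the rest
        regs[rd + i] = regs[rd + i - 1] + step

def sv_adde(rd, ra, rb, vl):
    """BigNum add emerging from chained `adde`: each element-level add
    consumes the carry (CA flag) produced by the previous element."""
    carry = 0
    for i in range(vl):
        total = regs[ra + i] + regs[rb + i] + carry
        regs[rd + i] = total & (2**64 - 1)
        carry = total >> 64
```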
It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. One example: transfer operations between the
Integer and Floating-point Scalar register files were dropped approximately
a decade ago, after the Packed SIMD variants were considered to be
duplicates. With it being completely inappropriate to attempt to Vectorise
a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
the Scalar ISA, a much better all-round candidate for Vectorisation,
is left anaemic.
383
A particular key instruction that is missing is `MV.X`, which may be
illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
expensive instruction, causing a huge swathe of Register Hazards
in one single hit, is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
Vectorised it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
usually so enormous, and so far away from Memory, that it is easier and
more efficient, architecturally, to provide these Indexing instructions
(sketched below).
395
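A sketch of the Vectorised `MV.X` semantics (a hypothetical element-loop
model; `MV.X` is not a defined Power ISA instruction):

```python
# Vectorised MV.X as an in-register gather: each destination element is
# fetched from the position named by an index Vector. Every element of
# rd potentially depends on every element of rs, which is exactly the
# Register Hazard explosion that makes the Scalar version so expensive.
regs = [0] * 128                          # flat model register file

def sv_mv_x(rd, rs, ridx, vl):
    for i in range(vl):
        regs[rd + i] = regs[rs + regs[ridx + i]]
```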
396 Fortunately, with the ISA Working Group being willing
397 to consider RFCs (Requests For Change) these omissions have the potential
398 to be corrected.
399
One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
side of the Power ISA already has most of them.
If CR Fields were the only predicates in SVP64,
there would be pressure to start adding the exact same arithmetic and
logical operations that already exist in the Integer opcodes, which is
less than desirable.
Instead of taking that route, the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between the CR Fields
and the Integer Register files.
412
413 Beyond that, further extensions to the Power ISA become much more
414 domain-specific, such as adding bitmanipulation for Audio, Video
415 and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
416 `ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
417 of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
418 *automatically* is inherently added to the Vector one as well, and
419 because these GPU and Video opcodes have been added to the CPU ISA,
420 Software Driver development and debugging is dramatically simplified.
421
Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and to making more effective use of resources?
425
# Simpler, more compact programs save power
427
The first and most obvious saving is that, just as with any Vector
ISA, the amount of data processing requested
and controlled by each instruction is enormous, leaving the
Decode and Issue Engines, as well as the L1 I-Cache, frequently idle.
With programs being smaller, the chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller: either way
is a considerable O(N^2) power saving.
435
Even a Packed SIMD ISA could take limited advantage of the higher
bang-per-buck for specific limited workloads, as long as the
strip-mining setup and teardown is not required. However a
2-wide Packed SIMD instruction offers nowhere near as high a bang-per-buck
ratio as a 64-wide Vector Length.
441
Realistically, however, for general use cases the Packed SIMD setup
and teardown is extremely common. `strncpy` for VSX is an
astounding 240 hand-coded assembler instructions, where it is around
12 to 14 for both RVV and SVP64. In the worst case (full algorithm
unrolling for massive FFTs) the L1 I-Cache becomes completely ineffective,
and in the case of the IBM POWER9, due to a little-known design flaw not
normally otherwise encountered, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
sizes: SVP64 does not (as hinted at below).
[Figures 8 and 9](https://arxiv.org/abs/2104.03142)
illustrate the process of concatenating copies of data in order
to fit the RADIX2 limitations of MMA.
456
Additional savings come in the form of `SVREMAP`. Like the
hardware-assist of Google's TPU mentioned on p9 of the above MMA paper,
`SVREMAP` is a hardware
index-transformation system where the normally sequentially-linear
Vector element access may be "Re-Mapped" to limited but
algorithmically-tailored,
commonly-used, deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply, or a 3x4 or
2x6 with optional *in-place* transpose, mirroring or rotation
on any source or destination Matrix,
may be performed in as few as four instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions.
470
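To illustrate the principle (a simplified sketch, not the exact SVREMAP
hardware specification), a Matrix-Multiply Schedule is nothing more than
a deterministic generator of element indices, which the hardware applies
to a single Multiply-and-Accumulate element loop:

```python
# Simplified sketch (not the exact SVREMAP specification): a
# Matrix-Multiply "Schedule" is a deterministic triple-loop index
# generator, kept entirely separate from the operation it drives.
regs = [0.0] * 128                           # flat model register file

def matrix_remap(rows, cols, inner):
    for r in range(rows):
        for c in range(cols):
            for k in range(inner):
                # (dest index, source-A index, source-B index)
                yield r * cols + c, r * inner + k, k * cols + c

def sv_fma_remapped(rd, ra, rb, schedule):
    """One Multiply-and-Accumulate element loop; the Schedule supplies
    the indices, so no loop-unrolling is ever needed."""
    for d, a, b in schedule:
        regs[rd + d] += regs[ra + a] * regs[rb + b]

# 3x4 result = 3x2 matrix times 2x4 matrix, entirely in-register:
# sv_fma_remapped(0, 16, 32, matrix_remap(3, 4, 2))
```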
Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
AND-and-OR, or any other future instruction such as Complex-Number
Multiply-and-Accumulate that a future version of the Power ISA might
support. The flexibility is not only enormous, but the compactness
unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
around 11 instructions. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.
482
483 There is no reason at all why future algorithmic schedules should not
484 be proposed as extensions to SVP64 (sorting algorithms,
485 compression algorithms, Sparse Data Sets, Graph Node walking
486 for example). (*Bear in mind that
487 the submission process will be
488 entirely at the discretion of the OpenPOWER Foundation ISA WG,
489 something that is both encouraged and welcomed by the OPF.*)
490
One of SVP64's current limitations is that it was initially designed
for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
a heavy focus on adding hardware for-loops onto the *Registers*.
After more than three years of development the realisation hit that
the SVP64 concept could be expanded to Coherent Distributed Memory.
This astoundingly powerful concept is explored in the next section.
497
498 # Coherent Deterministic Hybrid Distributed In-Memory Processing
499
It is not often that a heading in an article can legitimately
contain quite so many comically-chained buzzwords, but in this section
they are justified. As hinted at in the first section, the last time
that memory was the same speed as processors was the Pentium III
and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and
CPUs ran at about the same rate. DRAM bitcells *simply cannot exceed
these rates*, yet the pressure from Software Engineers is to
make *sequential* algorithm processing faster and faster, because
parallelising algorithms is simply too difficult to master, and always
has been. Thus whilst DRAM has had to go parallel (like RAID striping) to
keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
an astonishing four levels of cache (L1 to L4).
512
It should therefore come as no surprise that attempts are being made
to move (distribute) processing closer to the DRAM Memory, firmly
on the *opposite* side of the main CPU's L1/2/3/4 Caches,
where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
why this approach is compelling. However
the alarm bells ring here at the keyword "distributed": by
moving the processing down next to the Memory, even onto
the same die as the DRAM, the speed of any
of the parallel Processing Elements (PEs) would likely drop
by almost two orders of magnitude (5 GHz down to 150 MHz),
and the complexity of each PE has, for pure pragmatic reasons,
to drop by several
orders of magnitude as well.
Things that the average "sequential algorithm"
programmer
takes for granted, such as SMP, Cache Coherency, Virtual Memory,
and spinlocks (atomic locking, mutexes), are either outright gone
or are something the programmer must explicitly contend with
(even if that programmer is the Compiler Developer). There is definitely
not going to be a standard OS: the PEs will be too basic, too
resource-constrained, and definitely too busy.
534
To give an extreme example: Aspex's Array-String Processor, which
comprised 4096 2-bit SIMD PEs, each with 256 bytes of Content-Addressable
Memory, was capable of literally a hundred-fold improvement in
performance over Scalar CPUs such as the Pentium III of its era,
all on a 3 watt budget at only 250 MHz in 130 nm. Yet to take
proper advantage of its capability required an astounding 5 to 10
*days* per line of assembly code, because multiple versions of
an algorithm had to be hand-crafted then compared, and only
the best one selected: all others discarded. 20 lines of optimised
Assembler taking three to six months to write can in no way be termed
"productive", yet this extreme level of unproductivity is an inherent
side-effect of going down the parallel-processing rabbithole, where
the cost of providing "Traditional" programmability (Virtual Memory,
SMP) is worse than counter-productive: it is often outright impossible.
549
550 **In short, we are in "Programmer's nightmare" territory**
551
552 Having dug a proverbial hole that rivals the Grand Canyon, and
553 jumped in it feet-first, the next
554 task is to piece together a strategy to climb back out and show
555 how falling back in can be avoided. This takes some explaining,
556 and first requires some background on various research efforts and
557 commercial designs. Once the context is clear, their synthesis
558 can be proposed. These are:
559
560 * [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
561 * [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
562 * [Snitch](https://arxiv.org/abs/2002.10143)
563
564 **ZOLC: Zero-Overhead Loop Control**
565
Zero-Overhead Looping is the concept of automatically running a set sequence
of instructions a predetermined number of times, without requiring
a branch. This is conceptually similar to, but
slightly different from, using Power ISA `bc` in `CTR`
(Counter) Mode to create loops, because in ZOLC the branch-back is automatic.
571
The simplest and longest-running commercially successful deployment of
Zero-Overhead Looping
has been in Texas Instruments' TMS320 DSPs. Up to fourteen sub-instructions
within the VLIW word may be repeatedly deployed on successive clock
cycles until a countdown reaches zero. This extraordinarily simple
concept needs no branches, and has no complex Register Hazard
Management in the hardware,
because it is down to the programmer (or the compiler)
to ensure data overlaps do not occur. Careful crafting of those
14 instructions can keep the ALUs 100% occupied for sustained periods,
and the iconic example for which the TI DSPs are renowned
is that an entire inner loop for large FFTs
can be done with that one VLIW word: no stalls, no stopping, no fuss,
an entire 1024 or 4096 wide FFT Layer in one instruction.
585
<blockquote>
The key aspect of these very simplistic countdown loops, as far as
we are concerned, is: *they are deterministic*.
</blockquote>
591
592 Zero-Overhead Loop Control takes this basic "single loop" concept
593 way further: both nested loops and conditional exit are included,
594 but also arbitrary control-jumping from the current inner loop
595 out to an entirely different loop, all based on conditions determined
596 dynamically at runtime.
597
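A loose conceptual model, based on the paper's description rather than
the ST120's actual programming interface, of what distinguishes ZOLC
from a plain countdown loop:

```python
# Loose conceptual model of Zero-Overhead Loop Control: hardware walks
# a table of loop descriptors, issuing instruction blocks back-to-back
# with no branch instructions; runtime conditions may redirect
# execution to an entirely different loop.
def zolc_run(loops, start):
    """loops maps a name to (count, body, next_loop_fn): body is a
    callable block of instructions, next_loop_fn inspects the block's
    runtime result and names the next loop (or None to finish)."""
    current = start
    while current is not None:
        count, body, next_fn = loops[current]
        result = None
        for _ in range(count):            # the zero-overhead countdown
            result = body()               # no branch instruction issued
        current = next_fn(result)         # data-dependent loop switch
```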
Even when deployed on as basic a CPU as a single-issue in-order RISC
core, the performance and power-savings were astonishing: between 20
and **80%** reductions in algorithm completion time were achieved, compared
to a more traditional branch-speculative in-order RISC CPU. MPEG
Decode, the target algorithm specifically picked by the researcher
due to its high complexity, with 6-deep nested loops and conditional
execution that frequently jumped in and out of at least 2 loops,
came out with an astonishing 43% improvement in completion time. 43%
fewer instructions executed is an almost unheard-of level of optimisation:
most ISA designers are elated if they can achieve 5 to 10%. The reduction
was so compelling that ST Microelectronics put it into commercial
production in one of their embedded CPUs, the ST120 DSP-MCU.
610
611 The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
612 design of its triple-nested for-loop system
613 turned out to be remarkably similar to the
614 core nested for-loop engine of ZOLC. In hindsight this should not
615 have come as a surprise, because both are basically nested for-loops
616 that do not need branches to issue instructions.
617
The important insight, however, is that if ZOLC can be general-purpose
and can apply deterministic nested-loop instruction
schedules to more than just registers
(which SVP64 in its current incarnation cannot), then so can SVP64.
622
623 **OpenCAPI and Extra-V**
624
625 OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
626 cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors.
627
628 <blockquote>(Side note:
629 POWER10 *only*
630 has OpenCAPI Memory interfaces: an astounding number of them,
631 with overall bandwidth so high it's actually difficult to conceptualise.
632 An OMI-to-DDR4/5 Bridge PHY is therefore required
633 to connect to standard Memory DIMMs.)
634 </blockquote>
635
Extra-V appears to be a remarkable research project, based on OpenCAPI,
which assumes that the map of edges in any given arbitrary data graph
(excluding the actual data)
can be kept by the main CPU in-memory, and which can then distribute and
delegate a limited-capability, deterministic, but most importantly
*data-dependent* node-walking schedule right down into the memory itself
(on the other side of that L1-4 cache barrier). A miniature
(non-Turing-complete) processor analyses
the data it has read (at the Memory) and determines whether it should
notify the main processor that this "Node" is worth investigating,
or whether the Graph node-walk should split in a different direction.
Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
abstraction, locking, and cache-coherency, many of the nightmare problems
of other, more explicit, parallel processing paradigms disappear.
649
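A hypothetical sketch of the PE-side behaviour (names and structure
invented purely for illustration; Extra-V's actual programming model
differs):

```python
# Hypothetical PE-side node walk: a tiny, bounded (non-Turing-complete)
# filter that runs next to the memory. Only "interesting" nodes ever
# cross the L1-4 cache barrier back to the main CPU.
def pe_walk(edges, data, start, is_interesting, notify_cpu,
            max_steps=10_000):
    frontier, seen, steps = [start], {start}, 0
    while frontier and steps < max_steps:    # bounded, not Turing-complete
        node = frontier.pop()
        steps += 1
        if is_interesting(data[node]):
            notify_cpu(node)                 # the only OpenCAPI traffic
        for nxt in edges[node]:              # the walk "splits" here
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)

# pe_walk({0: [1, 2], 1: [], 2: [1]}, {0: 5, 1: 0, 2: 7},
#         start=0, is_interesting=lambda v: v != 0, notify_cpu=print)
```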
650 The similarity to ZOLC should not have gone unnoticed: where ZOLC
651 has nested conditional for-loops Extra-V appears to have just the
652 one conditional for-loop, but the key strategically-crucial
653 part of this multi-faceted puzzle is that due to the deterministic and
654 coherent nature of Extra-V, the processing of the loops, which
655 requires a tiny non-Turing-Complete processor, is not
656 done close to or by the main CPU at all: it is
657 *embedded right next to the memory*.
658
The similarity to D-Matrix's Systolic Array Processing, Aspex
Microelectronics' Array-String Processing, and Elixent's 2D Array
Processing should also not go unnoticed. All of these solutions
utilised, or utilise,
a more comprehensive Turing-complete von-Neumann "Management Core"
to coordinate data passed in and out of the PEs: none of them have,
or had, something
as powerful as OpenCAPI as part of that picture.
667
668 The fact that Neural Networks may be expressed as arbitrary Graphs,
669 and comprise Sparse Matrices, should also have been noted by the reader
670 interested in AI.
671
672 **Snitch**
673
Snitch is an elegant Memory-Coherent Barrel-Processor in which registers
become "tagged" with a Memory-access Mode that went out of fashion
over forty years ago: Load-then-Auto-Increment. Expressed in C as
`src = *x++` and requiring special Address Registers (PDP-11, 68000),
the efficiency and effectiveness
of these Load-Store-with-Increment instructions had, thanks to the
RISC paradigm having gone too far, been forgotten, until Snitch.
682
What the designers did, however, was not to add any new Load-Store
or Arithmetic instructions to the underlying RISC-V at all, but instead
to "mark" registers with a tag which *augmented* (altered) the behaviour
of *existing* instructions. These tags tell the CPU: when you are asked to
carry out
an add on r6 and r7, do not take r6 or r7 from the register
file; instead perform a Cache-coherent Load-with-Increment
on each, using special (hidden, implicit)
Address Registers. Each new use
of r6 therefore brings in an entirely new value *directly from
memory*. Likewise the second operand, r7, and likewise
the destination result, which can be an automatic Coherent
Store-and-Increment
directly into Memory.
697
698 <blockquote>
699 *The act of "reading" or "writing" a register has been decoupled
700 and intercepted, then connected transparently to a completely
701 separate Coherent Memory Subsystem*
702 </blockquote>
703
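This decoupling can be modelled compactly (a conceptual sketch only;
Snitch's real stream-configuration mechanism differs):

```python
# Conceptual model only: reads/writes of tagged registers are
# transparently redirected to auto-incrementing memory streams;
# untagged registers behave as ordinary scalars.
class TaggedRegFile:
    def __init__(self, memory):
        self.regs = [0] * 32
        self.mem = memory                  # backing (coherent) memory
        self.tags = {}                     # regnum -> current stream address

    def read(self, r):
        if r in self.tags:                 # tagged: Load-with-Increment
            value = self.mem[self.tags[r]]
            self.tags[r] += 1
            return value
        return self.regs[r]                # untagged: plain scalar read

    def write(self, r, value):
        if r in self.tags:                 # tagged: Store-with-Increment
            self.mem[self.tags[r]] = value
            self.tags[r] += 1
        else:
            self.regs[r] = value

# An unmodified "add" now streams directly through memory:
mem = {0x100: 1, 0x101: 2, 0x200: 10, 0x201: 20, 0x300: 0, 0x301: 0}
rf = TaggedRegFile(mem)
rf.tags = {6: 0x100, 7: 0x200, 5: 0x300}
for _ in range(2):
    rf.write(5, rf.read(6) + rf.read(7))   # mem[0x300]=11, mem[0x301]=22
```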
704 On top of a barrel-architecture the slowness of Memory access
705 was not a problem because the Deterministic nature of classic
706 Load-Store-Increment can be compensated for by having 8 Memory
707 accesses scheduled underway and interleaved in a time-sliced
708 fashion with an FPU that is correspondingly 8 times faster than
709 the Coherent Memory accesses.
710
This design is reminiscent of the early Vector Processors
of the late 1960s and early 1970s, which also critically relied
on implicit auto-increment addressing.
The [CDC STAR-100](https://en.m.wikipedia.org/wiki/CDC_STAR-100),
for example, was specifically designed as a Memory-to-Memory Vector
Processor. The barrel-architecture of Snitch neatly
solves one of the inherent problems of those early designs (a mismatch
with memory
speed), and the presence of a full register file (non-tagged,
normal, standard scalar registers) caters for a
second limitation of pure Memory-based Vector Processors: temporary
variables needed in the computation of intermediate results
also had to go through memory, putting
an awfully high artificial load on Memory bandwidth.
725
The similarity to SVP64 should be clear: SVP64 Prefixing and the
associated REMAP system are just another form of register "tagging"
that augments what was formerly designated by its original authors
as "just a Scalar ISA". Tagging allows for dramatic implicit alteration,
with advanced behaviour not previously envisaged.
731
732 What Snitch brings to the table therefore is a further illustration of
733 the concept introduced by Extra-V: where Extra-V brought information
734 about Sparse-Distributed Data to the attention of the main CPU in
735 a coherent fashion *without the CPU having to ask for it*, Snitch
736 demonstrates a classic LOAD-COMPUTE-STORE cycle in the same
737 distributed coherent manner, and does so with dramatically-reduced
738 power consumption.
739
740 **Bringing it all together**
741
742 At this point we are well into a future revision of SVP64, one that
743 clearly has some startlingly powerful potential: Supercomputing-class
744 Multi-Issue Vector Engines kept 100% occupied in a 100% long-term
745 sustained fashion with reduced complexity, reduced power consumption
746 and reduced completion time, thanks to Deterministic Coherent Scheduling
747 of the data fed in and out, or even moved down next to Memory.
748
This last part is where it normally gets hair-raising, but as ZOLC shows
there is no reason at all why even complex algorithms such as MPEG cannot
be run in a partially-deterministic manner, and anything that is
deterministic can be Scheduled, coherently. Combine that with OpenCAPI,
which solves the many issues associated with SMP, Virtual Memory and so on
yet still allows Cache-Coherent Distributed Memory Access, and what was
previously an intractable Computer Science problem for decades begins to
look as though it has a potential solution.
757
It should even be possible to identify which Deterministic Schedules
created by ZOLC are suitable for full off-CPU distributed processing,
as long as OpenCAPI
is integrated into the mix. What a compiler - or even the hardware -
will be looking out for is a Basic Block of instructions that:
762
763 * begins with a LOAD (to be handled by OpenCAPI)
764 * contains some instructions that a given PE is capable of executing
765 * ends with a STORE (again: OpenCAPI)
766
767 For best results that would be wrapped with a Zero-Overhead Loop
768 (which is offloaded - in full - down to the PE), where
769 the Compiler (or hardware at runtime) could easily identify, in advance,
770 the full range of Memory Addresses that the Loop is to encounter. Copies
771 of loop-invariant data would need to be passed down to the remote PE:
772 again, for simple-enough Basic Blocks, with assistance from the Compiler,
773 loop-invariant inputs are easily identified. Parallel Processing
774 opportunities should also be easy enough to create, simply by farming out
775 different parts of a given Deterministic Zero-Overhead Loop to
776 different PEs based on their proximity, bandwidth or ease of access to
777 given Memory.
778
The importance of OpenCAPI in this mix cannot be overstated: it
will be the means by which the main CPU coordinates its activities
with the remote PEs, ensuring that LOAD/STORE Memory Hazards are not
violated. It should also be straightforward to ensure that the offloading
is entirely transparent to the developer; in fact this is a hard
requirement, because at any given moment there is the possibility that
the PEs may be busy and it is the main CPU that has to complete the
Processing Task itself.
786
It is also important to note that we are not necessarily talking about
the Remote PEs executing the Power ISA, although if they do so it becomes
much easier for the main CPU to take over in the event that PEs are
currently occupied. Plus, the twin lessons (that inventing an ISA, even
a small one, is hard, mostly in compiler writing; and that GPU Task
Scheduling is complex) are being heard loud and clear.
793
Put another way: if the PEs run a foreign ISA, then the Basic Blocks
embedded inside the ZOLC Loops must be in that ISA, and therefore:
795
796 * In order that the main CPU can execute the same sequence if necessary,
797 the CPU must support dual ISAs: Power and PE **OR**
798 * There must be a JIT binary-translator which either turns PE code
799 into Power ISA code or vice-versa **OR**
800 * The compiler dual-compiles the original source code, and embeds
801 both a Power binary and a PE binary into the ZOLC Basic Block **OR**
802 * All binaries are stored in an Intermediate Representation
803 (LLVM-IR, SPIR-V) and JIT-compiled on-demand.
804
All of these would work, but it is simpler and a lot less work
just to have the PEs
execute the exact same ISA (or a subset of it). If however the
concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
which would increase adoption and reduce cost, a bit more thought
is required here, because ARM or Intel or MIPS might not necessarily
be happy that a Processing Element has to execute Power ISA binaries.
At least the Power ISA is much richer, more powerful, still RISC,
and is an Open Standard, as discussed in earlier sections.
814
815 # Transparently-Distributed Vector Processing
816
It is very strange for the author to be describing what amounts to a
"Holy Grail" solution to a decades-long intractable problem, one that
mitigates the anticipated end of Moore's Law: how to make it easy for
well-defined workloads, expressed as a perfectly normal
sequential program and compiled to a standard well-known ISA, to have
the potential of being offloaded transparently to Parallel Compute Engines,
all without the Software Developer being excessively burdened with
a Parallel-Processing Paradigm that is alien to all their experience
and training, as well as to Industry-wide common knowledge.
826
827 Will it be that easy? ZOLC is, honestly, in its current incarnation,
828 not that straightforward: programs
829 have to be "massaged" by tools that insert intrinsics into the
830 source code, in order to identify the Basic Blocks that the Zero-Overhead
831 Loops can run. Can this be merged into standard gcc and llvm
832 compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
833 if an infinite supply of money and engineering time is thrown at it.
834 Is a half-way-house solution of compiler intrinsics good enough?
835 Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
836 for several decades, and advanced programmers are comfortable with the
837 practice.
838
Additional questions remain as to whether OpenCAPI, or its use for this
particular scenario, requires that the PEs, even quite basic ones,
implement a full RADIX MMU and associated TLB lookup. In order to ensure
that programs may be cleanly and seamlessly transferred between PEs
and CPU, the answer is quite likely to be "yes", which is interesting
in and of itself. Fortunately, the associated L1 Cache with TLB
Translation does not have to be large, and the actual RADIX Tree Walk
need not explicitly be done by the PEs: it can be handled by the main
CPU as a software extension. PEs generate a TLB Miss notification
to the main CPU over OpenCAPI, and the main CPU feeds the new
TLB entries back to the PE in response.
850
Also in practical terms, with the PEs anticipated to be so small as to
make running a full SMP-aware OS impractical, it will not just be their TLB
pages that need remote management: their entire register file, including
the Program Counter, will need to be set up, and the ZOLC Context as
well. With OpenCAPI packet formats being quite large, a concern is that
the context management increases latency to the point where the premise
of this paper is invalidated. Research is needed here as to whether a
bare-bones microkernel
would be viable, or whether a Management Core closer to the PEs (on the
same die or Multi-Chip-Module as the PEs) would allow better bandwidth and
reduce Management Overhead on the main CPUs.
862
863 **Use-case: Matrix and Convolutions**
864
* **Horizontal-First**: (aka standard Cray Vectors) walk
  through **elements** first before moving to the next **instruction**
* **Vertical-First**: walk through **instructions** before
  moving to the next **element**. Currently managed by `svstep`;
  ZOLC may be deployed to manage the stepping, in a Deterministic
  manner (see the sketch below).
870
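The distinction in a nutshell, as illustrative Python with a
two-instruction loop body (`svstep`'s real behaviour carries
considerably more state):

```python
program = [lambda i: print("mul", i),   # stand-ins for two Vector
           lambda i: print("add", i)]   # element instructions

def horizontal_first(vl):
    """Standard Cray style: all elements of one instruction, then the
    next instruction."""
    for insn in program:
        for i in range(vl):
            insn(i)

def vertical_first(vl):
    """One element at a time through *all* instructions, then svstep
    advances to the next element; scalar branches may skip elements."""
    for i in range(vl):                 # the svstep/ZOLC-managed loop
        for insn in program:
            insn(i)
```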
Imagine a large Matrix scenario, with several values close to zero that
could be skipped: there is no need to include zero-multiplications, but a
traditional CPU can in no way help: only by loading the data through
the L1-L4 Cache and Virtual Memory Barriers is it possible to
ascertain, retrospectively, that time and power has just been wasted.
876
877 SVP64 is able to do what is termed "Vertical-First" Vectorisation,
878 combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
879 extended, Snitch-style, to perform a deterministic memory-array walk of
880 a large Matrix.
881
Let us also imagine that the Matrices are stored in Memory with PEs
attached, and that the PEs are fully-functioning Power ISA with Draft
SVP64, but their Multiply capability is not as good as the main CPU's.
Therefore we want the PEs to conditionally
feed sparse data to the main CPU, a la "Extra-V".
888
889 * The ZOLC SVREMAP System running on the main CPU generates a Matrix
890 Memory-Load Schedule.
891 * The Schedule is sent to the PEs, next to the Memory, via OpenCAPI
892 * The PEs are also sent the Basic Block to be executed on each
893 Memory Load (each element of the Matrices to be multiplied)
894 * The PEs execute the Basic Block and **exclude**, in a deterministic
895 fashion, any elements containing Zero values
896 * Non-zero elements are sent, via OpenCAPI, to the main CPU, which
897 queues sequences of Multiply-and-Accumulate, and feeds the results
898 back to Memory, again via OpenCAPI, to the PEs.
899 * The PEs, which are tracking the Sparse Conditions, know where
900 to store the results received
901
902 In essence this is near-identical to the original Snitch concept
903 except that there are, like Extra-V, PEs able to perform
904 conditional testing of the data as it goes both to and from the
905 main CPU. In this way a large Sparse Matrix Multiply or Convolution
906 may be achieved without having to pass unnecessary data through
907 L1/L2/L3 Caches only to find, at the CPU, that it is zero.
908
The reason in this case for the use of Vertical-First Mode is the
conditional execution of the Multiply-and-Accumulate.
Horizontal-First Mode is the standard Cray-style Vectorisation:
loop over all *elements* with the same instruction before moving
on to the next instruction. Predication needs to be pre-calculated
for the entire Vector in order to exclude certain elements from
the computation. In this case, that is an expensive inconvenience
(similar to the problems associated with Memory-to-Memory
Vector Machines such as the CDC STAR-100).
918
Vertical-First allows *scalar* instructions and
*scalar* temporary registers to be utilised
in the assessment as to whether a particular Vector element should
be skipped, utilising a straight Branch instruction *(or ZOLC
Conditions)*. The Vertical Vector technique
was pioneered by Mitch Alsup and is a key feature of the VVM Extension
to his My 66000 ISA. Careful analysis of the registers within the
Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
*amortise in-flight scalar looped operations into SIMD batches*,
as long as the loop is kept small enough to fit entirely into
in-flight Reservation Stations in the first place.
930
931 *<blockquote>
932 (With thanks and gratitude to Mitch Alsup on comp.arch for
933 spending considerable time explaining VVM, how its Loop
934 Construct explicitly identifies loop-invariant registers,
935 and how that helps Register Hazards and SIMD amortisation
936 on a GB-OoO Micro-architecture)
937 </blockquote>*
938
939 Draft Image (placeholder):
940
941 <img src="/openpower/sv/zolc_svp64_extrav.jpg" width=800 />
942
The program being executed is a simple loop with a conditional
test that skips the multiply if the input is zero.

* In the CPU-only case (top) the data goes through the L1/L2
  Cache before reaching the CPU.
* The PE version, however, does not send zero-data to the CPU, and
  even when it does send data it goes into a Coherent FIFO: there is
  no real compelling need to enter the L1/L2 Cache or even the CPU
  Register File (one of the key reasons why Snitch saves so much power).
* In the PE-only version (see next use-case) the CPU is mostly
  idle, serving RADIX MMU TLB requests for the PEs, and OpenCAPI
  requests.
955
956 **Use-case variant: More powerful in-memory PEs**
957
958 An obvious variant of the above is that, if there is inherently
959 more parallelism in the data set, then the PEs get their own
960 Multiply-and-Accumulate instruction, and rather than send the
961 data to the CPU over OpenCAPI, perform the Matrix-Multiply
962 directly themselves.
963
However the source code and binary would be near-identical, if
not identical in every respect, with the PEs implementing the full
ZOLC capability in order to keep binary size to the bare minimum.
The main CPU's role would be to coordinate and manage the PEs
over OpenCAPI.
One key strategic question does remain: do the PEs need to have
a RADIX MMU and associated TLB-aware minimal L1 Cache in order
to support OpenCAPI properly? The answer is very likely to be yes.
The saving grace here is that, with
the expectation of running only hot-loops with ZOLC-driven
binaries, the size of each PE's TLB-aware
L1 Cache needed would be minuscule compared
to the average high-end CPU's.
978
979 **Roadmap summary of Advanced SVP64**
980
981 The future direction for SVP64, then, is:
982
983 * To overcome its current limitation of REMAP Schedules being
984 restricted to Register Files, leveraging the Snitch-style
985 register interception "tagging" technique.
986 * To adopt ZOLC and merge REMAP Schedules into ZOLC
987 * To bring OpenCAPI Memory Access into ZOLC as a first-level
988 concept that mirrors Snitch's Coherent Memory interception
989 * To add the Graph-Node Walking Capability of Extra-V
990 to ZOLC / SVREMAP
991 * To make it possible, in a combination of hardware and software,
992 to easily identify ZOLC / SVREMAP Blocks
993 that may be transparently pushed down closer to Memory, for
994 localised distributed parallel execution, by OpenCAPI-aware PEs,
995 exploiting both the Deterministic nature of ZOLC / SVREMAP
996 combined with the Cache-Coherent nature of OpenCAPI,
997 to the maximum extent possible.
998 * To make the exploitation of this powerful solution as simple
999 and straightforward as possible for Software Engineers to use,
1000 in standard common-usage compilers, gcc and llvm.
1001 * To propose extensions to Public Standards that allow all of
1002 the above to become part of everyday ubiquitous mass-volume
1003 computing.
1004
1005 Even the first of these - merging Snitch-style register tagging
1006 into SVP64 - would
1007 expand SVP64's capability for Matrices, currently limited to
1008 around 5x7 to 6x6 Matrices and constrained by the size of
1009 the register files (128 64-bit entries), to arbitrary (massive) sizes.
1010
1011 **Summary**
1012
The bottom line is that there is a clear roadmap towards solving a
long-standing problem facing Computer Science, and doing so in a way that
reduces power consumption, reduces algorithm completion time, and reduces
the need for complex hardware micro-architectures in favour of much
smaller distributed coherent Processing Elements.