[[!tag whitepapers]]

**Revision History**

* v0.00 05may2021 first created
* v0.01 06may2021 initial first draft

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

*(The short answer: you don't. Extend existing technology: on the shoulders of giants)*

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. No Vector or Packed SIMD ISA has ever, in the history of
computing, reached stable *general-purpose* auto-vectorisation
compiler support, not even with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
for achieving high-performance optimisation where it matters*).
GPUs fill this void both in hardware and software terms by having
ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone consider
this task, and what, in Computer Science, actually needs solving?

First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the bank width, striping, and
datapath widths and speeds to match has allowed
significant apparent speed increases: 3200 MHz DDR4 and even faster DDR5,
and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI,
all make an effort (all simply increasing the parallel deployment of
the underlying 150 MHz bitcells), but these efforts are dwarfed by the
two to nearly three orders of magnitude increase in CPU horsepower
over the same timeframe. Over two decades ago Seymour
Cray, from his amazing in-depth knowledge, predicted that the mismatch
would become a serious limitation. Some systems
at the time of writing are now approaching a *Gigabyte* of L4 Cache,
by way of compensation, and as we know from experience even that will
be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone well:
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stacks were so
specialist that they required heavy investment by customers to utilise.
D-Matrix, a Systolic Array Processor, is a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under.
The huge risk is that any "better
AI mousetrap" created by an innovative competitor
quickly renders a too-specialist design obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism with more Turing-complete ISAs in each, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is only dwarfed
by the amount of money poured into the software ecosystem in order
to make it accessible, and even then, GPU Programmers are a specialist
and rare (expensive) breed.

A second hint as to the answer emerges from an article,
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel, AMD, since the 90s (over 3 decades) whereby SIMD, an
O(N^6) opcode proliferation nightmare with the mantra "make it
easy for hardware engineers, let software sort out the mess", has literally
overwhelmed programmers with thousands of instructions. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved, over thirty years ago, the opcode proliferation
nightmare. Only the NEC SX Aurora, however, truly kept the Cray Vector
flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to use it as if it were Packed SIMD NEON. The advice
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple identical implementations*
of a function, each with a different hardware width, and compelling
software to choose one at runtime after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with a wide, diverse depth in multiple ISAs:
one of the best and clearest is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private custom secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: their TMC2660
Stepper IC replaced ARM with RISC-V, saving them USD 1 in licensing
royalties per product. Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power efficient high
performance is concerned.

By this point a realisation should be slowly sinking in that, actually,
there aren't as many really truly viable Vector ISAs out there as one
might think, because the ones that are evolving in the general direction
of Vectorisation are, in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
support, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
Gaisler Research maintains the LEON Open Source Cores but with Oracle's
reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches.
Exceptions: Ingenic with embedded CPUs,
and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
two of the top three top500.org supercomputers using
around 2 million IBM POWER9 Cores each.
* ARC, a competitor at the time to ARM, best known for use in
the Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
and with rapid expansion
in an uncontrolled fashion, is set on an unstoppable
and inevitable trainwreck path to replicate the
opcode conflict nightmare that plagued the Power ISA
two decades ago.
* Tensilica, Andes STAR and Western Digital, for successful
commercial proprietary ISAs: Tensilica in Baseband Modems,
Andes in Audio DSPs, WD in HDDs and SSDs. These are all
astoundingly commercially successful
multi-billion-unit mass volume markets that almost nobody
knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
protection.
* RISC-V, touted as "Open" but actually strictly controlled under
Trademark License: too new to have adequate patent pool protection,
as evidenced by multiple adopters having been hit by patent lawsuits.
(Agreements between RISC-V *Members* not to engage in patent litigation
do nothing to stop third party patents that *legitimately pre-date*
the newly-created RISC-V ISA)
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
of the OpenPOWER Foundation, covered by Trademarks, permitting
and encouraging contributions, and having software support for over
20 years.
* ARM, not permitting Open Licensing, survived in the early 90s
only by doing a deal with Samsung for an in-perpetuity
Royalty-free License, in exchange
for GBP 3 million and legal protection through Samsung Research.
Several large Corporations (Apple most notably) have licensed the ISA
but not ARM designs: the barrier to entry is high and the ISA itself
protected from interference as a result.
* x86, famous for an unprecedented
Court Ruling in 2004 where a Judge "banged heads
together" and ordered AMD and Intel to stop wasting his time,
make peace, and cross-license each other's patents. Anyone wishing
to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
and VIA EDEN processors to see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as the x86 lawsuits,
but the 800lb Gorilla Syndrome seems not to have deterred one
particularly disingenuous group from performing illegal
Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent protection,
software ecosystem, open-ness and pedigree all combine to reduce risk
and increase the chances of success, there is really only one candidate.

**Of all of these, the only one with the most going for it is the Power ISA.**

The summary of advantages, then, of the Power ISA is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
and more.
* Amongst many other features
it has Condition Registers which can be used by Branches, greatly
reducing pressure on the main register files.
* IBM's extensive 20+ year patent portfolio is available, royalty-free,
to protect implementors as long as they are also members of the
OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing
class ISA from its inception over 25 years ago.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why on earth would you want to?
As explained in the "SIMD Considered Harmful" article, Cray-style
Vector ISAs break the link between data element batches and the
underlying architectural back-end parallel processing capability.
Packed SIMD explicitly smashes that width right in the face of the
programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
that routinely adds literally an order
of magnitude increase in the number of hand-written lines of assembler
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.

*Packed SIMD looped algorithms actually have to
contain multiple implementations processing fragments of data at
different SIMD widths: Cray-style Vectors have one, covering not
just current architectural implementations but future ones with
wider back-end ALUs as well.*

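To make that contrast concrete, here is a minimal Python sketch (all
names hypothetical, for illustration only) of the two styles: the
Packed SIMD loop bakes `SIMD_WIDTH` into the binary and needs explicit
tail-handling, whereas the `setvl` style asks the hardware, on every
iteration, how many elements it may process.

```python
# Hypothetical models: Packed SIMD vs Cray-style setvl, both summing a+b.

SIMD_WIDTH = 4          # baked into the Packed SIMD binary at compile-time
MAXVL = 64              # hardware vector length, *not* visible in the binary

def simd_add(a, b):
    """Packed SIMD: main loop plus a scalar tail (in real ISAs, a cascade
    of power-of-two fragments: 8-wide, then 4, 2, 1)."""
    out = [0] * len(a)
    i = 0
    while i + SIMD_WIDTH <= len(a):            # full vectors
        out[i:i+SIMD_WIDTH] = [x + y for x, y in
                               zip(a[i:i+SIMD_WIDTH], b[i:i+SIMD_WIDTH])]
        i += SIMD_WIDTH
    while i < len(a):                          # teardown: leftover elements
        out[i] = a[i] + b[i]
        i += 1
    return out

def setvl_add(a, b):
    """Cray-style: one loop. setvl returns min(remaining, MAXVL), so the
    same binary exploits any future hardware width, tail included."""
    out = [0] * len(a)
    i = 0
    while i < len(a):
        vl = min(len(a) - i, MAXVL)            # the essence of setvl
        out[i:i+vl] = [x + y for x, y in zip(a[i:i+vl], b[i:i+vl])]
        i += vl
    return out
```
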
Assuming then that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding
200% more instructions than the base ISA seems unwise: at the very
least, it feels like there should be a better way. On close
inspection of RVV, the basic arithmetic
operations turn out to be massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector, *plus* predicate
mask management and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour? Make "Scalar instruction"
synonymous with "Vector element instruction", and through contextual
augmentation the Scalar ISA *becomes* the Vector ISA.
Then, by not having to have any Vector instructions at all,
the Instruction Decode
phase is greatly simplified, reducing design complexity and leaving
plenty of headroom for further expansion.

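In its simplest form the behaviour of such a prefix can be modelled as
a hardware for-loop wrapped around the untouched scalar operation. A
minimal sketch, with hypothetical names and encoding, purely for
illustration:

```python
# Hypothetical model of a "prefixed" scalar-add becoming a vector-add.
# regs is the (enlarged) scalar register file; the scalar semantics of
# "add" are untouched: the prefix merely loops them over VL elements.

def scalar_add(regs, rt, ra, rb):
    regs[rt] = (regs[ra] + regs[rb]) & 0xFFFFFFFFFFFFFFFF  # 64-bit wrap

def prefixed_add(regs, vl, rt, ra, rb, ra_is_vector=True, rb_is_vector=True):
    """The prefix marks each operand as Scalar or Vector.  A Vector
    operand steps through consecutive registers; a Scalar one stays put."""
    for i in range(vl):
        scalar_add(regs,
                   rt + i,                        # destination: Vector
                   ra + (i if ra_is_vector else 0),
                   rb + (i if rb_is_vector else 0))

regs = list(range(128))                      # 128-entry register file
prefixed_add(regs, vl=8, rt=64, ra=0, rb=32) # 8-element vector add
```
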
Remarkably this is not a new idea. Intel's x86 `REP` instruction
provides the basic concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then, treating the Scalar "suffix" instruction
as a guide and making "scalar instruction" synonymous with "Vector element",
perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason that the team did not take this forward
into a commercial product
was that they could not work out how to cleanly do OoO
multi-issue at the time.

In its simplest form, then, this "prefixing" idea is a matter
of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
context-switches with them
* Writing an awful lot of Specification Documentation
(4 years and counting)

Once the basics of this concept have sunk in, early
advancements quickly follow naturally from analysis
of the problem-space:

* Expanding the size of the GPR, FPR and CR register files to
provide 128 entries in each. This is a bare minimum for GPUs
in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
batching as possible.
* Predication (an absolutely critical component of a Vector ISA);
the next logical advancement is to allow separate predication masks
to be applied to *both* the source *and* the destination, independently
(a minimal sketch follows this list).
* Element-width overrides: most Scalar ISAs today are 64-bit only,
with primarily Load and Store being able to handle 8/16/32/64
and sometimes 128-bit (quad-word), where Vector ISAs need to
go as low as 8-bit arithmetic, even 8-bit Floating-Point for
high-performance AI. Rather than waste opcode space adding all
such operations at different bitwidths, let the prefix
*redefine* (override) the element width, without actually altering
the Scalar ISA at all.
* "Reordering" of the assumption of linear sequential element
access, for Matrices, rotations, transposition, Convolutions,
DCT, FFT, Parallel Prefix-Sum and other common transformations
that require significant programming effort in other ISAs.

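As a minimal sketch of the predication concept (hypothetical names,
not the exact SVP64 encoding), twin predication behaves like a
back-to-back gather-scatter: the source and destination masks are
honoured independently:

```python
# Hypothetical model of twin predication (separate source and
# destination masks) applied to a Vectorised move/copy operation.

def twin_predicated_copy(regs, vl, rt, ra, src_mask, dst_mask):
    """Skip over masked-out *source* elements and masked-out
    *destination* elements independently: in effect a combined
    gather-scatter in a single operation."""
    s, d = 0, 0
    for _ in range(vl):
        while s < vl and not (src_mask >> s) & 1:
            s += 1                     # skip disabled source elements
        while d < vl and not (dst_mask >> d) & 1:
            d += 1                     # skip disabled destination elements
        if s >= vl or d >= vl:
            break
        regs[rt + d] = regs[ra + s]
        s += 1
        d += 1

regs = [10, 11, 12, 13] + [0] * 4
twin_predicated_copy(regs, 4, rt=4, ra=0, src_mask=0b1010, dst_mask=0b0011)
print(regs[4:8])                       # [11, 13, 0, 0]
```
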
All of these things come entirely from "Augmentation" of the Scalar operation
being prefixed: at no time is the Scalar operation itself significantly
altered.
From there, several more "Modes" can be added, including:

* saturation,
which is needed for Audio and Video applications
* "Reverse Gear",
which runs the Element Loop in reverse order (needed for Prefix
Sum)
* Data-dependent Fail-First, which emerged from asking the simple
question, "If modern Vector ISAs have Load/Store Fail-First,
and the Power ISA has Condition Codes, why not make Conditional
early-exit from Arithmetic operation looping?" (sketched below)
* over 500 Branch-Conditional Modes, which emerge from application of
Boolean Logic in a Vector context, on top of an already-powerful
Scalar Branch-Conditional instruction.

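A minimal sketch of Data-dependent Fail-First, as promised above
(hypothetical names, illustrative only): the element loop truncates
the Vector Length at the first result that fails a condition test,
which is what makes a Vectorised `strncpy`-style loop so compact:

```python
# Hypothetical model of Data-dependent Fail-First on a Vector operation:
# VL is *truncated* at the first element whose result fails the test.

def failfirst(regs, vl, rt, ra, op, test):
    """Apply the scalar op element-by-element; stop (and report the new
    VL) at the first result for which `test` fails."""
    for i in range(vl):
        result = op(regs[ra + i])
        if not test(result):
            return i          # new VL: only i elements were processed
        regs[rt + i] = result
    return vl

# Example: copy bytes until the string-terminating zero is hit,
# the inner step of a Vectorised strncpy.
regs = [104, 105, 33, 0, 99, 99, 99, 99] + [0] * 8
new_vl = failfirst(regs, 8, rt=8, ra=0,
                   op=lambda x: x, test=lambda r: r != 0)
print(new_vl)   # 3: processing stopped at the zero terminator
```
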
**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details though.

* The traditional `iota` instruction may be
synthesised with an overlapping add that stacks up incrementally
and sequentially. Although it requires two instructions (one to
start the sum-chain) the technique has the advantage of allowing
increments by arbitrary amounts, and is not limited to addition,
either. (A sketch follows this list.)
* Big-integer addition (arbitrary-precision arithmetic) is an
emergent characteristic of the carry-in, carry-out capability of the
Power ISA `adde` instruction. `sv.adde` as a BigNum add
naturally emerges from the
sequential carry-flag chaining of these scalar instructions.
* The Condition Register Fields of the Power ISA make a great candidate
for use as Predicate Masks, particularly when combined with
Vectorised `cmp` and Vectorised `crand`, `crxor` etc.

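Here is the sketch promised in the list above (hypothetical names,
illustrative only): `iota` synthesised from an element-overlapping
Vectorised add, and BigNum addition emerging from sequential
carry-flag chaining in the style of `sv.adde`:

```python
# Hypothetical element-loop models of two "emergent" Vector idioms.

def iota_by_overlapping_add(regs, vl, rt, step_reg):
    """iota via overlapping add: element i reads the result just written
    by element i-1 (the destination overlaps the source by one),
    accumulating the step on each element."""
    for i in range(vl):
        regs[rt + i + 1] = regs[rt + i] + regs[step_reg]

def sv_adde(regs, vl, rt, ra, rb, carry_in=0):
    """BigNum add: the scalar adde's carry-out chains sequentially
    into the next element's carry-in."""
    ca = carry_in
    for i in range(vl):
        total = regs[ra + i] + regs[rb + i] + ca
        regs[rt + i] = total & 0xFFFFFFFFFFFFFFFF
        ca = total >> 64                 # carry-out feeds element i+1
    return ca

regs = [0] * 32
regs[16] = 3                             # the increment amount
iota_by_overlapping_add(regs, 7, rt=0, step_reg=16)
print(regs[0:8])                         # [0, 3, 6, 9, 12, 15, 18, 21]
```
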
It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. Examples include the transfer operations between the
Integer and Floating-point Scalar register files, which were dropped
approximately a decade ago after the Packed SIMD variants were considered
to be duplicates. With it being completely inappropriate to attempt to
Vectorise a Packed SIMD ISA designed 20 years ago with no Predication of
any kind, the Scalar ISA, a much better all-round candidate for
Vectorisation, is left anaemic.

A particular key instruction that is missing is `MV.X`, which is
illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
expensive instruction, causing a huge swathe of Register Hazards
in one single hit, is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
Vectorised it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
usually so enormous, and so far away from Memory, that it is easier and
more efficient, architecturally, to provide these Indexing instructions.
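
Vectorised `MV.X` amounts to an in-register-file gather. A minimal
sketch of the element loop (hypothetical names, illustrative only):

```python
# Hypothetical model of Vectorised MV.X: GPR(dest+i) = GPR(GPR(src+i)).
# Every element both reads an index register *and* dereferences it,
# which is why the scalar form creates such a swathe of Register Hazards.

def sv_mv_x(regs, vl, rd, rs):
    indices = [regs[rs + i] for i in range(vl)]   # snapshot index Vector
    for i in range(vl):
        regs[rd + i] = regs[indices[i]]

regs = list(range(100, 132)) + [0] * 32           # 32 data regs + spare
regs[16:20] = [3, 0, 2, 1]                        # index Vector at r16
sv_mv_x(regs, 4, rd=24, rs=16)
print(regs[24:28])                                # [103, 100, 102, 101]
```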

Fortunately, with the ISA Working Group being willing
to consider RFCs (Requests For Change), these omissions have the potential
to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
side of the Power ISA already has most of them.
If CR Fields were the only predicates in SVP64
it would create pressure to start adding the exact same arithmetic and logical
operations that already exist in the Integer opcodes, which is less
than desirable.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between the CR Field
and Integer Register files.

Beyond that, further extensions to the Power ISA become much more
domain-specific, such as adding bitmanipulation for Audio, Video
and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
`ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
is *automatically* inherently added to the Vector one as well, and
because these GPU and Video opcodes have been added to the CPU ISA,
Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these
CPU-centric, Vector-centric improvements relevant to power efficiency
and making more effective use of resources?

# Simpler, more compact programs save power

The first and most obvious saving is that, just as with any Vector
ISA, the amount of data processing requested
and controlled by each instruction is enormous, and leaves the
Decode and Issue Engines idle, as well as the L1 I-Cache. With
programs being smaller, chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller: either way
is a considerable O(N^2) power-saving.

Even a Packed SIMD ISA could take limited advantage of a higher
bang-per-buck for limited specific workloads, as long as the
stripmining setup and teardown is not required. However a
2-wide Packed SIMD instruction has nowhere near as high a bang-per-buck
ratio as a 64-wide Vector Length.

Realistically, however, for general use cases the Packed SIMD setup
and teardown is extremely common. `strncpy` for VSX is an
astounding 240 hand-coded assembler instructions, where it is around
12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
the case of the IBM POWER9, with a little-known design flaw not
normally otherwise encountered, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
sizes: SVP64 does not, as hinted at below.

Additional savings come in the form of `SVREMAP`. This is a hardware
index transformation system where the normally sequentially-linear
Vector element access may be "Re-Mapped" to limited but algorithmically-tailored
commonly-used deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply or a 3x4 or
2x6 with optional *in-place* transpose, mirroring or rotation
on any source or destination Matrix
may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions. Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
AND-and-OR, or any other future instruction such as Complex-Number
Multiply-and-Accumulate that a future version of the Power ISA might
support. The flexibility is not only enormous, but the compactness
unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
around 11 instructions. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.
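
A minimal sketch of the `SVREMAP` idea (hypothetical names,
illustrative only): a Schedule is simply a deterministic generator of
element indices, so one single prefixed mul-add, driven by a
triple-nested-loop index generator, performs an entire Matrix Multiply:

```python
# Hypothetical model of a Matrix-Multiply REMAP Schedule: the Schedule
# generates (dest, srcA, srcB) element indices; one single prefixed
# mul-add instruction, looped over the Schedule, is the whole matmul.

def matmul_schedule(x, y, z):
    """Triple nested loop, flattened into a linear index Schedule."""
    for i in range(x):
        for j in range(y):
            for k in range(z):
                yield (i * y + j,      # C[i][j]  (accumulator)
                       i * z + k,      # A[i][k]
                       k * y + j)      # B[k][j]

def sv_remap_madd(C, A, B, schedule):
    for d, a, b in schedule:
        C[d] += A[a] * B[b]            # one scalar mul-add, re-mapped

A = [1, 2, 3, 4, 5, 6]                 # 2x3, row-major
B = [7, 8, 9, 10, 11, 12]              # 3x2, row-major
C = [0] * 4                            # 2x2 accumulator, zero-initialised
sv_remap_madd(C, A, B, matmul_schedule(2, 2, 3))
print(C)                               # [58, 64, 139, 154]
```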

There is no reason at all why future algorithmic schedules should not
be proposed as extensions to SVP64 (sorting algorithms,
compression algorithms, Sparse Data Sets, Graph Node walking
for example). Bear in mind that
the submission process will be
entirely at the discretion of the OpenPOWER Foundation ISA WG,
something that is both encouraged and welcomed by the OPF.

One of SVP64's current limitations is that it was initially designed
for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
a heavy focus on adding hardware-for-loops onto the *Registers*.
After more than three years of development the realisation hit that
the SVP64 concept could be expanded to Coherent Distributed Memory.
This astoundingly powerful concept is explored in the next section.

# Coherent Deterministic Hybrid Distributed In-Memory Processing

It is not often that a heading in an article can legitimately
contain quite so many comically-chained buzzwords, but in this section
they are justified. As hinted at in the first section, the last time
that memory was the same speed as processors was the Pentium III
and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and
CPUs ran at about the same rate. DRAM bitcells *simply cannot exceed
these rates*, yet the pressure from Software Engineers is to
make *sequential* algorithm processing faster and faster because
parallelising of algorithms is simply too difficult to master, and always
has been. Thus whilst DRAM has to go parallel (like RAID Striping) to
keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
have an astonishing four levels of cache (L1 to L4).

It should therefore come as no surprise that attempts are being made
to move (distribute) processing closer to the DRAM Memory, firmly
on the *opposite* side of the main CPU's L1/2/3/4 Caches,
where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
why this approach is compelling. However
the alarm bells ring here at the keyword "distributed", because by
moving the processing down next to the Memory, even onto
the same die as the DRAM, the speed of any
of the parallel Processing Elements (PEs) would likely drop
by almost two orders of magnitude (5 GHz down to 150 MHz), and
the complexity of each PE has, for purely pragmatic reasons,
to drop by several
orders of magnitude as well.
Things that the average "sequential algorithm"
programmer
takes for granted, such as SMP, Cache Coherency, Virtual Memory,
and spinlocks (atomic locking, mutexes), are either outright gone
or become things the programmer is expected to explicitly contend with
(even if that programmer is the Compiler Developer). There's definitely
not going to be a standard OS: the PEs will be too basic, too
resource-constrained, and definitely too busy.

To give an extreme example: Aspex's Array-String Processor, which
was 4096 2-bit SIMD PEs each with 256 bytes of Content Addressable
Memory, was capable of literally a hundred-fold improvement in
performance over Scalar CPUs such as the Pentium III of its era,
all on a 3 watt budget at only 250 MHz in 130 nm. Yet to take
proper advantage of its capability required an astounding 5-10
*days* per line of assembly code, because multiple versions of
an algorithm had to be hand-crafted then compared, and only
the best one selected: all others discarded. 20 lines of optimised
Assembler taking three to six months to write can in no way be termed
"productive", yet this extreme level of unproductivity is an inherent
side-effect of going down the parallel-processing rabbithole.

**In short, we are in "Programmer's nightmare" territory**

Having dug a proverbial hole that rivals the Grand Canyon, and
jumped in it feet-first, the next
task is to piece together a strategy to climb back out and show
how falling back in can be avoided. This takes some explaining,
and first requires some background on various research efforts and
commercial designs. Once the context is clear, their synthesis
can be proposed. These are:

* [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
* [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
* [Snitch](https://arxiv.org/abs/2002.10143)

**ZOLC: Zero-Overhead Loop Control**

The simplest and longest commercially-successful deployment of Zero-Overhead
Looping has been in the Texas Instruments TMS320 DSPs. Up to fourteen
sub-instructions
within the VLIW word may be repeatedly deployed on successive clock
cycles until a countdown reaches zero. This extraordinarily simple
concept needs no branches, and has no complex Register Hazard
Management in the hardware
because it is down to the programmer (or, the compiler)
to ensure data overlaps do not occur. Careful crafting of those
14 instructions can keep the ALUs 100% occupied for sustained periods,
and the iconic example for which the TI DSPs are renowned
is that an entire inner loop for large FFTs
can be done with that one VLIW word: no stalls, no stopping, no fuss.

The key aspect of these
very simplistic countdown loops, as far as we are concerned,
is this: *they are deterministic*.

Zero-Overhead Loop Control takes this basic "single loop" concept
way further: both nested loops and conditional exit are included,
but also arbitrary control-jumping from the current inner loop
out to an entirely different loop, all based on conditions determined
dynamically at runtime.

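A minimal sketch of the concept (hypothetical names, illustrative
only): the loop bounds are programmed once into dedicated state, after
which the nested indices advance deterministically like an odometer
and the loop body is issued with no branch instructions at all:

```python
# Hypothetical model of a Zero-Overhead nested-loop engine: loop counts
# are programmed up-front; the engine then issues the loop body every
# cycle, stepping the index "odometer" itself - no branch instructions.

def zolc(loop_counts, issue_body):
    indices = [0] * len(loop_counts)
    while True:
        issue_body(tuple(indices))          # issue the inner-loop body
        for level in reversed(range(len(loop_counts))):
            indices[level] += 1             # deterministic odometer step
            if indices[level] < loop_counts[level]:
                break                       # no carry: keep looping
            indices[level] = 0              # carry into the outer loop
        else:
            return                          # outermost loop completed

zolc((2, 3), lambda idx: print("issue body at", idx))
# issues (0,0) (0,1) (0,2) (1,0) (1,1) (1,2), then stops
```
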
Even when deployed on as basic a CPU as a single-issue in-order RISC
core, the performance and power-savings were astonishing: between 20
and **80%** reduction in algorithm completion times was achieved compared
to a more traditional branch-speculative in-order RISC CPU. MPEG
Decode, the target algorithm specifically picked by the researcher
due to its high complexity, with 6-deep nested loops and conditional
execution that frequently jumped in and out of at least 2 loops,
came out with an astonishing 43% improvement in completion time. 43%
fewer instructions executed is an almost unheard-of level of optimisation:
most ISA designers are elated if they can achieve 5 to 10%. The reduction
was so compelling that ST Microelectronics put it into commercial
production in one of their embedded CPUs, the ST120 DSP-MCU.

The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
design of its triple-nested for-loop system
turned out to be remarkably similar to the
core nested for-loop engine of ZOLC. In hindsight this should not
have come as a surprise, because both are basically nested for-loops
that do not need branches to issue instructions.

The important insight is, however, that if ZOLC can be general-purpose
and apply deterministic nested looped instruction
schedules to more than just registers
(unlike SVP64 in its current incarnation) then so can SVP64.

**OpenCAPI and Extra-V**

OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
cache-coherent Memory-access Protocol that is integrated into IBM's
Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
has OpenCAPI Memory interfaces, and requires an OMI-to-DDR4/5 Bridge PHY
to connect to standard DIMMs.

Extra-V appears to be a remarkable research project based on OpenCAPI that,
by assuming that the map of edges (excluding the actual data)
in any given arbitrary data graph
could be kept by the main CPU in-memory, could distribute and delegate
a limited-capability deterministic but most importantly *data-dependent*
node-walking schedule right down into the memory itself (on the
other side of that L1-4 cache barrier). A miniature processor
(non-Turing-complete) analysed
the data it had read (at the Memory), and determined if it should
notify the main processor that this "Node" is worth investigating,
or if the Graph node-walk should split in a different direction.
Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
abstraction, locking, and cache-coherency, many of the nightmare problems
of other more explicit parallel processing paradigms disappear.

The similarity to ZOLC should not have gone unnoticed: where ZOLC
has nested conditional for-loops, Extra-V appears to have just the
one conditional for-loop, but the key strategically-crucial
part of this multi-faceted puzzle is that, due to the deterministic and
coherent nature of Extra-V, the processing of the loops, which
requires a tiny non-Turing-Complete processor, is not
done close to or by the main CPU at all: it is
*embedded right next to the memory*.

The similarity to D-Matrix Systolic Array Processing, Aspex Microelectronics
Array-String Processing, and Elixent 2D Array Processing should
also not have gone unnoticed. All of these solutions utilised
or utilise
a more comprehensive Turing-complete von-Neumann "Management Core"
to coordinate data passed in and out of PEs: none of them have or
had something
as powerful as OpenCAPI as part of that picture.

The fact that Neural Networks may be expressed as arbitrary Graphs,
and comprise Sparse Matrices, should also have been noted by the reader
interested in AI.

**Snitch**

Snitch is an elegant Memory-Coherent Barrel-Processor where registers
become "tagged" with a Memory-access Mode that went out of fashion
over forty years ago: Load-then-Auto-Increment. Expressed in C as
`src = *x++`, and requiring special Address Registers (PDP-11, 68000),
the efficiency and effectiveness
of these Load-Store-with-Increment instructions had,
thanks to the RISC paradigm having gone too far,
been forgotten until Snitch.

What the designers did, however, was not to add new Load-Store
or Arithmetic instructions to RISC-V, but instead to "mark"
registers with a tag. These tags tell the CPU: when you are asked to
carry out
an add instruction on r6 and r7, do not take r6 or r7 from the register
file; instead, please perform a Cache-coherent Load-with-Increment
on each, using a special Address Register for each. Each new use
of r6 therefore brings in an entirely new value *directly from
memory*. Likewise on the second operand, r7, and likewise on
the destination result, which can be an automatic Coherent
Store-and-Increment
directly into Memory. In essence:

*The act of "reading" or "writing" a register has been decoupled
and intercepted, then connected transparently to a completely
separate Coherent Memory Subsystem*

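A minimal sketch of the interception concept (hypothetical names, not
the actual Snitch micro-architecture): reads of a tagged register
become load-with-increment, writes become store-with-increment, and
the unmodified scalar add turns into a streaming LOAD-COMPUTE-STORE:

```python
# Hypothetical model of Snitch-style register tagging: a tagged register
# redirects reads to load-with-increment and writes to
# store-with-increment, leaving the ALU operation itself untouched.

class TaggedRegFile:
    def __init__(self, memory):
        self.regs = [0] * 32
        self.mem = memory
        self.tags = {}                       # regnum -> address register

    def tag(self, regnum, addr):
        self.tags[regnum] = addr             # arm the memory stream

    def read(self, r):
        if r in self.tags:
            val = self.mem[self.tags[r]]     # load-with-increment
            self.tags[r] += 1
            return val
        return self.regs[r]

    def write(self, r, val):
        if r in self.tags:
            self.mem[self.tags[r]] = val     # store-with-increment
            self.tags[r] += 1
        else:
            self.regs[r] = val

mem = [1, 2, 3, 4] + [10, 20, 30, 40] + [0] * 4
rf = TaggedRegFile(mem)
rf.tag(6, 0); rf.tag(7, 4); rf.tag(8, 8)     # r6, r7 read; r8 writes
for _ in range(4):                           # the *same* scalar add, 4 times
    rf.write(8, rf.read(6) + rf.read(7))
print(mem[8:12])                             # [11, 22, 33, 44]
```
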
On top of a barrel-architecture the slowness of Memory access
was not a problem, because the Deterministic nature of classic
Load-Store-Increment can be compensated for by having 8 Memory
accesses scheduled underway and interleaved in a time-sliced
fashion with an FPU that is correspondingly 8 times faster than
the Coherent Memory accesses.

This design is reminiscent of the early Vector Processors
of the late 1960s and early 1970s, which also critically relied
on implicit auto-increment addressing.
The [CDC STAR-100](https://en.m.wikipedia.org/wiki/CDC_STAR-100)
for example was specifically designed as a Memory-to-Memory Vector
Processor. The barrel-architecture of Snitch neatly
solves one of the inherent problems of those early designs (a mismatch
with memory
speed), and the presence of a full register file (non-tagged,
normal, standard scalar registers) caters for a
second limitation of pure Memory-based Vector Processors: temporary
variables needed in the computation of intermediate results, which
also had to go through memory, put
an awfully high artificial load on Memory bandwidth.

The similarity to SVP64 should be clear: SVP64 Prefixing and the
associated REMAP system are just another form of register "tagging"
that augments what was formerly designated, by its original authors,
as "just a Scalar ISA". Tagging allows for dramatic implicit alteration,
with advanced behaviour not previously envisaged.

What Snitch brings to the table therefore is a further illustration of
the concept introduced by Extra-V: where Extra-V brought information
about Sparse-Distributed Data to the attention of the main CPU in
a coherent fashion *without the CPU having to ask for it*, Snitch
demonstrates a classic LOAD-COMPUTE-STORE cycle in the same
distributed coherent manner, and does so with dramatically-reduced
power consumption.

**Bringing it all together**

At this point we are well into a future revision of SVP64, one that
clearly has some startlingly powerful potential: Supercomputing-class
Multi-Issue Vector Engines kept 100% occupied in a 100% long-term
sustained fashion with reduced complexity, reduced power consumption
and reduced completion time, thanks to Deterministic Coherent Scheduling
of the data fed in and out, or even moved down next to Memory.

This last part is where it normally gets hair-raising, but as ZOLC shows
there is no reason at all why even complex algorithms such as MPEG cannot
be run in a partially-deterministic manner, and anything that is
deterministic can be Scheduled, coherently. Combine that with OpenCAPI,
which solves the many issues associated with SMP Virtual Memory and so on
yet still allows Cache-Coherent Distributed Memory Access, and what was
for decades an intractable Computer Science problem begins to
look like it has a potential solution.

It should even be possible to identify whether the Deterministic Schedules
created by ZOLC are suitable for full off-CPU distributed processing,
as long as OpenCAPI
is integrated into the mix. What a compiler - or even the hardware -
will be looking out for is a Basic Block of instructions that:

* begins with a LOAD (to be handled by OpenCAPI)
* contains some instructions that a given PE is capable of executing
* ends with a STORE (again: OpenCAPI)

For best results that would be wrapped with a Zero-Overhead Loop
(which is offloaded - in full - down to the PE), where
the Compiler (or hardware at runtime) could easily identify, in advance,
the full range of Memory Addresses that the Loop is to encounter. Copies
of loop-invariant data would need to be passed down to the remote PE:
again, for simple-enough Basic Blocks, with assistance from the Compiler,
loop-invariant inputs are easily identified. Parallel Processing
opportunities should also be easy enough to create, simply by farming out
different parts of a given Deterministic Zero-Overhead Loop to
different PEs based on their proximity, bandwidth or ease of access to
given Memory.
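
A minimal sketch of that detection heuristic (hypothetical names and
capability set, illustrative only):

```python
# Hypothetical compiler pass: decide whether a ZOLC-wrapped Basic Block
# is a candidate for offloading to an OpenCAPI-attached PE.

PE_CAPABILITIES = {"add", "mul", "and", "or", "shift"}   # assumed PE subset

def offloadable(basic_block):
    """basic_block: list of opcode mnemonics, in program order."""
    if not basic_block:
        return False
    if basic_block[0] != "load" or basic_block[-1] != "store":
        return False                       # must be LOAD ... STORE
    middle = basic_block[1:-1]
    return all(op in PE_CAPABILITIES for op in middle)

print(offloadable(["load", "mul", "add", "store"]))   # True: farm out to PE
print(offloadable(["load", "fdiv", "store"]))         # False: CPU keeps it
```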

The importance of OpenCAPI in this mix cannot be overstated, because
it will be the means by which the main CPU coordinates its activities
with the remote PEs, ensuring that LOAD/STORE Memory Hazards are not
violated. It should also be straightforward to ensure that the offloading
is entirely transparent to the developer; in fact this is a hard requirement,
because at any given moment there is the possibility that the PEs may be
busy and it is the main CPU that has to complete the Processing Task itself.

It is also important to note that we are not necessarily talking about
the Remote PEs executing the Power ISA, but if they do so it becomes
much easier for the main CPU to take over in the event that the PEs are
currently occupied. Plus, the twin lessons, that inventing an ISA, even
a small one, is hard (mostly in the compiler writing), and that
GPU Task Scheduling is complex, are being heard loud and clear.

Put another way: if the PEs run a foreign ISA, then the Basic Blocks
embedded inside the ZOLC Loops must be in that ISA, and therefore:

* In order that the main CPU can execute the same sequence if necessary,
the CPU must support dual ISAs: Power and PE **OR**
* There must be a JIT binary-translator which either turns PE code
into Power ISA code or vice-versa **OR**
* The compiler dual-compiles the original source code, and embeds
both a Power binary and a PE binary into the ZOLC Basic Block **OR**
* All binaries are stored in an Intermediate Representation
(LLVM-IR, SPIR-V) and JIT-compiled on-demand.

All of these would work, but it is simpler and a lot less work
just to have the PEs
execute the exact same ISA (or a subset of it). If however the
concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
which would increase adoption and reduce cost, a bit more thought
is required here, because ARM or Intel or MIPS might not necessarily
be happy that a Processing Element has to execute Power ISA binaries.
At least the Power ISA is much richer, more powerful, still RISC,
and is an Open Standard, as discussed in earlier sections.

# Transparently-Distributed Vector Processing

It is very strange to the author to be describing what amounts to a
"Holy Grail" solution to a decades-long intractable problem that
mitigates the anticipated end of Moore's Law: how to make it easy for
well-defined workloads, expressed as a perfectly normal
sequential program, compiled to a standard well-known ISA, to have
the potential of being offloaded transparently to Parallel Compute Engines,
all without the Software Developer being excessively burdened with
a Parallel-Processing Paradigm that is alien to all their experience
and training, as well as Industry-wide common knowledge.

Will it be that easy? ZOLC is, honestly, in its current incarnation,
not that straightforward: programs
have to be "massaged" by tools that insert intrinsics into the
source code in order to identify the Basic Blocks that the Zero-Overhead
Loops can run. Can this be merged into standard gcc and llvm
compilers? As intrinsics: of course. Can it become part of
auto-vectorisation? Probably,
if an infinite supply of money and engineering time is thrown at it.
Is a half-way-house solution of compiler intrinsics good enough?
Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
for several decades, and advanced programmers are comfortable with the
practice.

**Roadmap summary of Advanced SVP64**

The future direction for SVP64, then, is:

* To overcome its current limitation of REMAP Schedules being
restricted to Register Files, leveraging the Snitch-style
register interception "tagging" technique.
* To adopt ZOLC and merge REMAP Schedules into ZOLC.
* To bring OpenCAPI Memory Access into ZOLC as a first-level
concept that mirrors Snitch's Coherent Memory interception.
* To add the Graph-Node Walking Capability of Extra-V
to ZOLC / SVREMAP.
* To make it possible, in a combination of hardware and software,
to easily identify ZOLC / SVREMAP Blocks
that may be transparently pushed down closer to Memory, for
localised distributed parallel execution, by OpenCAPI-aware PEs,
exploiting both the Deterministic nature of ZOLC / SVREMAP
combined with the Cache-Coherent nature of OpenCAPI,
to the maximum extent possible.
* To make the exploitation of this powerful solution as simple
and straightforward as possible for Software Engineers to use,
in standard common-usage compilers, gcc and llvm.
* To propose extensions to Public Standards that allow all of
the above to become part of everyday ubiquitous mass-volume
computing.

Even the first of these - merging Snitch-style register tagging
into SVP64 - would
expand SVP64's capability for Matrices, currently limited to
around 5x7 to 6x6 Matrices and constrained by the size of
the register files (128 64-bit entries), to arbitrary (massive) sizes.

The bottom line is that there is a clear roadmap towards solving a
long-standing problem facing Computer Science, and doing so in a way that
reduces power consumption, reduces algorithm completion time, and reduces
the need for complex hardware microarchitectures in favour of much
smaller distributed coherent Processing Elements.