[[!tag whitepapers]]

**Revision History**

* v0.00 05may2021 first created
* v0.01 06may2021 initial first draft

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

*(The short answer: you don't. Extend existing technology: on the shoulders of giants)*

Inventing a new Scalar ISA from scratch is more than a decade-long task,
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. Stable ISAs require Standards and Compliance Suites that
take longer still. No Vector or Packed SIMD ISA has ever achieved stable
*general-purpose* auto-vectorising compiler support in the
history of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics is the Industry-standard norm
for achieving high-performance optimisation where it matters.*)
GPUs fill this void, in both hardware and software terms, by having
ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies to sustain them.

This therefore raises the question: why on earth would anyone consider
this task, and what, in Computer Science, actually needs solving?

The first hints are that, whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the bank width, striping, and
datapath widths and speeds to match has allowed apparent speed increases,
at significant relative latency penalties: 3200 MHz DDR4 and even faster DDR5,
and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI's OMI,
all make an effort (all simply increasing the parallel deployment of
the underlying 150 MHz bitcells), but these efforts are dwarfed by the
two (nearly three) orders of magnitude increase in CPU horsepower
over the same timeframe. Seymour
Cray, from his amazing in-depth knowledge, predicted over two decades ago
that the mismatch would become a serious limitation. Some systems
at the time of writing are now approaching a *Gigabyte* of L4 Cache
by way of compensation, and as we know from experience even that will
be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone well:
Aspex Microelectronics and Elixent are parallel processing companies
that very few have heard of, because their software stacks were so
specialist that they required heavy investment by customers to utilise.
D-Matrix, a Systolic Array Processor, is a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under.
The huge risk is that any "better
AI mousetrap" created by an innovative competitor
quickly renders a too-specialist design obsolete.

NVIDIA and other GPUs have taken a different approach again: massive
parallelism, with more Turing-complete ISAs in each core, and dedicated
slower parallel memory paths (GDDR5) suited to the specific tasks of
3D, Parallel Compute and AI. The complexity of this approach is only dwarfed
by the amount of money poured into the software ecosystem in order
to make it accessible, and even then, GPU Programmers are a specialist
and rare (expensive) breed.

Second hints as to the answer emerge from an article,
"[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)",
which illustrates a catastrophic rabbit-hole taken by the Industry Giants
ARM, Intel and AMD since the 90s (over 3 decades), whereby SIMD, an
O(N^6) opcode proliferation nightmare with its mantra "make it
easy for hardware engineers, let software sort out the mess", has literally
overwhelmed programmers with thousands of instructions. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
*decreases* even as power consumption goes up.

Cray-style Vectors solved, over thirty years ago, the opcode proliferation
nightmare. Only the NEC SX Aurora, however, truly kept the Cray Vector
flame alive, until RISC-V RVV, now SVP64, and recently MRISC32, joined
it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes for a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use SVE2,
trying desperately to understand it by applying their experience
of Packed SIMD NEON. The advice from ARM
not to create SVE2 assembler that is hardcoded to fixed widths is being
disregarded, in favour of writing *multiple near-identical implementations*
of a function, each hardcoded to a different hardware width, and compelling
software to choose one at runtime after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with wide and deep experience of multiple ISAs:
one of the best and clearest analyses is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.

Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios is
not ideal: in private, custom, secretive commercial usage it is perfect.
Trinamic, an early adopter, is a classic case study: their TMC2660 Stepper
IC replaced ARM with RISC-V, saving USD 1 in licensing royalties
per product. Ubiquitous and common everyday
usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
unfortunately, fundamentally flawed as far as power-efficient high
performance is concerned.

Slowly, at this point, a realisation should be sinking in: there
aren't actually that many truly viable Vector ISAs out there, because the
ones that are evolving in the general direction of Vectorisation are,
in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a
Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
support for, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent
Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
Gaisler Research maintains the LEON Open Source Cores, but with Oracle's
reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches.
Exceptions: Ingenic with embedded CPUs,
and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily
litigiously-protected.
* ARM, well known in embedded and smartphone scenarios, very slowly
making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by
two of the top three top500.org supercomputers using
around 2 million cores each.
* ARC, a competitor at the time to ARM, best known for use in
Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development
and with rapid expansion
in an uncontrolled fashion, set on an unstoppable
and inevitable trainwreck path to replicate the
opcode conflict nightmare that plagued the Power ISA
two decades ago.
* Tensilica, Andes STAR and Western Digital, for successful
commercial proprietary ISAs: Tensilica in Baseband Modems,
Andes in Audio DSPs, WD in HDDs and SSDs. These are all
astoundingly commercially successful
multi-billion-unit mass volume markets that almost nobody
knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable
candidates for further advancement are:

* OpenRISC 1200, not controlled or restricted by anyone. No patent
protection.
* RISC-V, touted as "Open" but actually strictly controlled under
Trademark License: too new to have adequate patent pool protection,
as evidenced by multiple adopters having been hit by patent lawsuits.
(Agreements between RISC-V *Members* not to engage in patent litigation
do nothing to stop third-party patents that *legitimately pre-date*
the newly-created RISC-V ISA.)
* MIPS, SPARC, ARC, and others, which simply have no viable publicly
managed ecosystem. They work well within their niche markets.
* Power ISA: protected by IBM's extensive patent portfolio for Members
of the OpenPOWER Foundation, covered by Trademarks, permitting
and encouraging contributions, and having software support for over
20 years.
* ARM, not permitting Open Licensing: they survived in the early 90s
only by doing a deal with Samsung for an in-perpetuity
Royalty-free License, in exchange
for GBP 3 million and legal protection through Samsung Research.
Several large Corporations (Apple most notably) have licensed the ISA
but not ARM designs: the barrier to entry is high and the ISA itself
protected from interference as a result.
* x86, famous for an unprecedented
Court Ruling in 2004, where a Judge "banged heads
together" and ordered AMD and Intel to stop wasting his time,
make peace, and cross-license each other's patents. Anyone wishing
to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as the x86 lawsuits,
but the 800lb "Corporate Gorilla Syndrome" seems not to have deterred one
particularly disingenuous group from performing illegal
Reverse-Engineering.

Asking the question, "which ISA would be the best and most stable to
base a Vector Supercomputing-class Extension on?", where patent protection,
software ecosystem, openness and pedigree all combine to reduce risk
and increase the chances of success, leaves really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**

The summary of advantages of the Power ISA, then, is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian
and more.
* Amongst many other features
it has Condition Registers which can be used by Branches, greatly
reducing pressure on the main register files.
* IBM's extensive patent portfolio, built over more than 20 years, is
available, royalty-free, to protect implementors as long as they are
also members of the OpenPOWER Foundation.
* IBM designed and maintained the Power ISA as a Supercomputing-class
ISA from its inception over 25 years ago.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External
RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this
foundation to take a leap forward in performance and performance/watt,
*without* losing all the advantages of a ubiquitous software ecosystem,
the lack of which has historically plagued other systems and relegated
them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why on earth would you want to?
As explained in the "SIMD Considered Harmful" article, Cray-style
Vector ISAs break the link between data element batches and the
underlying architectural back-end parallel processing capability.
Packed SIMD explicitly smashes that width right in the face of the
programmer and expects them to like it. As the article immediately
demonstrates, an arbitrary-sized data set has to contend with
an insane power-of-two Packed SIMD cascade at both setup and teardown
that routinely adds literally an order
of magnitude increase in the number of hand-written lines of assembler,
compared to a well-designed Cray-style Vector ISA with a `setvl`
instruction.

*<blockquote>
Packed SIMD looped algorithms actually have to
contain multiple implementations processing fragments of data at
different SIMD widths: Cray-style Vectors have just the one, covering not
just current architectural implementations but future ones with
wider back-end ALUs as well.
</blockquote>*

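To make the contrast concrete, here is a minimal C sketch of the two
styles. `setvl()` is a stand-in function for the `setvl` *instruction*,
a hypothetical helper invented purely for illustration: given the number
of elements remaining, it returns how many the hardware will process
this pass.

    #include <stddef.h>

    size_t setvl(size_t remaining); /* hypothetical: min(remaining, MAXVL) */

    /* Cray-style: one loop, any hardware width, no tail-handling cascade. */
    void vec_add(double *a, const double *b, const double *c, size_t n)
    {
        while (n > 0) {
            size_t vl = setvl(n);            /* hardware picks the batch size */
            for (size_t i = 0; i < vl; i++)  /* in hardware, ONE instruction  */
                a[i] = b[i] + c[i];
            a += vl; b += vl; c += vl; n -= vl;
        }
    }

    /* Packed SIMD: the power-of-two cascade, one near-identical body per
     * width, all of which must be rewritten when the hardware changes.   */
    void simd_add(double *a, const double *b, const double *c, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {      /* 256-bit body: 4 doubles */
            a[i]   = b[i]   + c[i];   a[i+1] = b[i+1] + c[i+1];
            a[i+2] = b[i+2] + c[i+2]; a[i+3] = b[i+3] + c[i+3];
        }
        for (; i + 2 <= n; i += 2) {      /* 128-bit teardown body   */
            a[i]   = b[i]   + c[i];   a[i+1] = b[i+1] + c[i+1];
        }
        for (; i < n; i++)                /* scalar tail             */
            a[i] = b[i] + c[i];
    }
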
Assuming, then, that variable-length Vectors are obviously desirable,
it becomes a matter of how, not if. Both Cray and NEC SX Aurora
went the way of adding explicit Vector opcodes, a style which RVV
copied and modernised. In the case of RVV this introduced 192 new
instructions on top of an existing 95+ for base RV64GC. Adding
200% more instructions than the base ISA seems unwise: at the least,
it feels like there should be a better way. On close inspection of
RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector, *plus* predicate
mask management and transfer instructions between all of the same,
which goes a long way towards explaining why there are twice as many
Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
or prefix that augments its behaviour? Make "Scalar instruction"
synonymous with "Vector Element instruction", and through nothing
more than contextual
augmentation the Scalar ISA *becomes* the Vector ISA.
Then, by not having to add any Vector instructions at all,
the Instruction Decode
phase is greatly simplified, reducing design complexity and leaving
plenty of headroom for further expansion.

Remarkably, this is not a new idea. Intel's x86 `REP` instruction
hints at the basic concept, but in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then, treating the Scalar "suffix" instruction
as a guide and making "scalar instruction" synonymous with "Vector element",
perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
Engine. The only reason the team did not take this forward
into a commercial product
was that they could not work out how to cleanly do OoO
multi-issue at the time.

In its simplest form, then, this "prefixing" idea is a matter
of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do
context-switches with them
* Writing an awful lot of Specification Documentation
(4 years and counting)

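As a purely illustrative sketch of the first bullet (the field names,
widths and layout below are invented for explanation and are *not* the
actual SVP64 encoding), a prefix of this kind can be thought of as a
small struct bolted onto the front of an unmodified 32-bit scalar opcode:

    #include <stdint.h>

    /* Invented field layout showing the *kind* of information a
     * Vector-context prefix carries; the real 24-bit SVP64 prefix is
     * defined in the (draft) specification, not here. */
    struct vector_prefix {
        uint32_t elwidth   : 2;  /* element-width override: 8/16/32/64-bit */
        uint32_t src_mask  : 3;  /* predicate mask applied to sources      */
        uint32_t dst_mask  : 3;  /* predicate mask applied to destination  */
        uint32_t mode      : 4;  /* saturate, fail-first, reverse-gear ... */
        uint32_t reg_extra : 12; /* extends 5-bit register numbers to 7
                                    bits, giving 128-entry register files  */
    };

    struct prefixed_insn {
        struct vector_prefix prefix; /* the Vector "context"              */
        uint32_t scalar_suffix;      /* a completely unmodified scalar op */
    };
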
Once the basics of this concept have sunk in, early
advancements quickly follow naturally from analysis
of the problem-space:

* Expanding the size of the GPR, FPR and CR register files to
provide 128 entries in each. This is a bare minimum for GPUs
in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
batching as possible.
* Predication, an absolutely critical component for a Vector ISA.
The next logical advancement is to allow separate predication masks
to be applied to *both* the source *and* the destination, independently
(a sketch of the resulting element loop follows this list).
(*Readers familiar with Vector ISAs will recognise this as a back-to-back
`VGATHER-VSCATTER`.*)
* Element-width overrides: most Scalar ISAs today are 64-bit only,
with primarily Load and Store being able to handle 8/16/32/64
and sometimes 128-bit (quad-word), where Vector ISAs need to
go as low as 8-bit arithmetic, even 8-bit Floating-Point, for
high-performance AI. Rather than waste opcode space adding all
such operations at different bitwidths, let the prefix
*redefine* (override) the element width, without actually altering
the Scalar ISA at all.
* "Reordering" of the assumption of linear sequential element
access, for Matrices, rotations, transposition, Convolutions,
DCT, FFT, Parallel Prefix-Sum and other common transformations
that require significant programming effort in other ISAs.

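A minimal C sketch of twin predication, assuming a Vectorised `mv`
(illustrative semantics only, not the specification: the real SVP64
element loop also handles element-width overrides, sub-vectors and the
other Modes described below):

    #include <stdint.h>

    /* Twin-predicated Vectorised mv: one mask (srcmask) compacts the
     * source (VGATHER), an independent mask (dstmask) expands into the
     * destination (VSCATTER). The operation performed per element is
     * still nothing more than an unmodified scalar mv. */
    void sv_mv(uint64_t *regs, int rt, int ra,
               uint64_t srcmask, uint64_t dstmask, int vl)
    {
        int src = 0, dst = 0;
        for (int i = 0; i < vl; i++) {
            while (!((srcmask >> src) & 1)) src++; /* skip masked-out sources */
            while (!((dstmask >> dst) & 1)) dst++; /* skip masked-out dests   */
            regs[rt + dst] = regs[ra + src];       /* the scalar "suffix" op  */
            src++; dst++;
        }
    }
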
All of these things come entirely from "Augmentation" of the Scalar operation
being prefixed: at no time is the Scalar operation itself significantly
altered.
From there, several more "Modes" can be added, including:

* saturation,
which is needed for Audio and Video applications
* "Reverse Gear",
which runs the Element Loop in reverse order (needed for Prefix
Sum)
* Data-dependent Fail-First, which emerged from asking the simple
question, "If modern Vector ISAs have Load/Store Fail-First,
and the Power ISA has Condition Codes, why not allow Conditional
early-exit from Arithmetic operation looping?"
* over 500 Branch-Conditional Modes, which emerge from application of
Boolean Logic in a Vector context, on top of an already-powerful
Scalar Branch-Conditional/Counter instruction

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details though.

* The traditional `iota` instruction may be
synthesised with an overlapping add, that stacks up incrementally
and sequentially. Although it requires two instructions (one to
start the sum-chain), the technique has the advantage of allowing
increments by arbitrary amounts, and is not limited to addition,
either.
* Big-integer addition (arbitrary-precision arithmetic) is an
emergent characteristic of the carry-in, carry-out capability of
the Power ISA `adde` instruction. `sv.adde` as a BigNum add
naturally emerges from the
sequential carry-flag chaining of these scalar instructions
(a sketch follows this list).
* The Condition Register Fields of the Power ISA make a great candidate
for use as Predicate Masks, particularly when combined with
Vectorised `cmp` and Vectorised `crand`, `crxor` etc.

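A plain-C model of why the carry chaining gives BigNum add for free
(illustrative semantics of a Vectorised add-with-carry; `sv.adde` itself
is a Draft SVP64 instruction):

    #include <stdint.h>

    /* Each element-level adde consumes the carry (XER.CA) produced by
     * the previous element, so looping the *scalar* instruction over
     * the elements is, by itself, an arbitrary-precision add. */
    void bignum_add(uint64_t *r, const uint64_t *a, const uint64_t *b, int vl)
    {
        unsigned carry = 0;            /* models XER.CA, chained across elements */
        for (int i = 0; i < vl; i++) { /* the hardware-for-loop from the prefix  */
            uint64_t sum = a[i] + b[i];
            unsigned c1 = (sum < a[i]);
            r[i] = sum + carry;
            carry = c1 | (r[i] < sum);
        }
    }
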
It is only when looking slightly deeper into the Power ISA that
certain things turn out to be missing, and this is down in part to IBM's
primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
so Scalar ones. For example, the transfer operations between the
Integer and Floating-point Scalar register files were dropped approximately
a decade ago, after the Packed SIMD variants were considered to be
duplicates. With it being completely inappropriate to attempt to Vectorise
a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
the Scalar ISA, a much better all-round candidate for Vectorisation, is
left anaemic.

A particular key instruction that is missing is `MV.X`, which is
illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
expensive instruction, causing a huge swathe of Register Hazards
in one single hit, is almost never added to a Scalar ISA but
is almost always added to a Vector one. When `MV.X` is
Vectorised it allows for arbitrary
remapping of elements within a Vector to positions specified
by another Vector. A typical Scalar ISA will use Memory to
achieve this task, but with Vector ISAs the Vector Register Files are
usually so enormous, and so far away from Memory, that it is easier and
more efficient, architecturally, to provide these Indexing instructions.

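A sketch of the Vectorised semantics in C (illustrative only, derived
directly from the `GPR(dest) = GPR(GPR(src))` description above):

    #include <stdint.h>

    /* Vectorised MV.X: an arbitrary in-register-file gather. Each element
     * of the index Vector at ra selects which register supplies the
     * corresponding result element: a data-dependent register index,
     * which is precisely the Register Hazard nightmare described above. */
    void sv_mv_x(uint64_t *regs, int rt, int ra, int vl)
    {
        for (int i = 0; i < vl; i++)
            regs[rt + i] = regs[regs[ra + i]];
    }
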
Fortunately, with the ISA Working Group being willing
to consider RFCs (Requests For Change), these omissions have the potential
to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
side of the Power ISA already has most of them.
If CR Fields were the only predicates in SVP64,
it would create pressure to start adding the exact same arithmetic and logical
operations that already exist in the Integer opcodes, which is less
than desirable.
Instead of taking that route, the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between the CR Field
and Integer Register files.

Beyond that, further extensions to the Power ISA become much more
domain-specific, such as adding bitmanipulation for Audio, Video
and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
`ATAN2` etc.) for 3D and other GPU workloads. The huge advantage here
of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
is *automatically* and inherently added to the Vector one as well, and
because these GPU and Video opcodes have been added to the CPU ISA,
Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these
CPU-centric Vector-centric improvements relevant to power efficiency
and making more effective use of resources?

# Simpler, more compact programs save power

The first and most obvious saving is that, just as with any Vector
ISA, the amount of data processing requested
and controlled by each instruction is enormous, which leaves the
Decode and Issue Engines idle, as well as the L1 I-Cache. With
programs being smaller, the chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller: either way
is a considerable O(N^2) power saving.

Even a Packed SIMD ISA could take limited advantage of a higher
bang-per-buck for limited specific workloads, as long as the
stripmining setup and teardown is not required. However a
2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck
ratio as a 64-wide Vector Length.

Realistically, however, for general use cases it is extremely common
to need the Packed SIMD setup and teardown. `strncpy` for VSX is an
astounding 240 hand-coded assembler instructions, where it is around
12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
for Massive FFTs), the L1 I-Cache becomes completely ineffective, and in
the case of the IBM POWER9, with a little-known design flaw not
normally otherwise encountered, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
sizes: SVP64 does not (as hinted at below).
[Figures 8 and 9](https://arxiv.org/abs/2104.03142)
illustrate the process of concatenating copies of data in order
to match the RADIX2 limitations of MMA.

Additional savings come in the form of `SVREMAP`. Like the
hardware-assist of Google's TPU mentioned on p9 of the above MMA paper,
`SVREMAP` is a hardware
index transformation system where the normally sequentially-linear
Vector element access may be "Re-Mapped" to limited but algorithmically-tailored,
commonly-used, deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply, or a 3x4 or
2x6 one with optional *in-place* transpose, mirroring or rotation
on any source or destination Matrix,
may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions.

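To illustrate the idea (this shows the concept, not the actual `SVREMAP`
encoding or programming model), the "Schedule" is simply a deterministic
index transform applied in front of an otherwise-linear element loop. A
C sketch for the Matrix Multiply case:

    /* Illustrative Matrix REMAP: the element loop stays linear
     * (i = 0..VL-1) but each operand's element offset is computed by a
     * deterministic schedule derived from the matrix dimensions. A is
     * X*Y, B is Y*Z, and the result accumulates into an X*Z region:
     * one remapped fused multiply-add performs the whole multiply. */
    void remap_matmul(double *regs, int rt, int ra, int rb,
                      int X, int Y, int Z)
    {
        int VL = X * Y * Z;            /* the triple loop, flattened */
        for (int i = 0; i < VL; i++) {
            int x = i / (Y * Z);       /* row of A and of the result */
            int y = (i / Z) % Y;       /* the summation index        */
            int z = i % Z;             /* column of B and the result */
            regs[rt + x * Z + z] +=
                regs[ra + x * Y + y] * regs[rb + y * Z + z];
        }
    }
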
Not only that, but because the "Schedule" is an abstract
concept separated from the mathematical operation, there is no reason
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
AND-and-OR, or any other future instruction such as Complex-Number
Multiply-and-Accumulate that a future version of the Power ISA might
support. The flexibility is not only enormous, but the compactness
unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
around 11 instructions. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.

There is no reason at all why future algorithmic schedules should not
be proposed as extensions to SVP64 (sorting algorithms,
compression algorithms, Sparse Data Sets, Graph Node walking
for example). (*Bear in mind that
the submission process will be
entirely at the discretion of the OpenPOWER Foundation ISA WG,
something that is both encouraged and welcomed by the OPF.*)

One of SVP64's current limitations is that it was initially designed
for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
a heavy focus on adding hardware-for-loops onto the *Registers*.
After more than three years of development the realisation hit that
the SVP64 concept could be expanded to Coherent Distributed Memory.
This astoundingly powerful concept is explored in the next section.

# Coherent Deterministic Hybrid Distributed In-Memory Processing

It is not often that a heading in an article can legitimately
contain quite so many comically-chained buzzwords, but in this section
they are justified. As hinted at in the first section, the last time
that memory was the same speed as processors was the Pentium III
and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and
CPUs ran at about the same rate. DRAM bitcells *simply cannot exceed
these rates*, yet the pressure from Software Engineers is to
make *sequential* algorithm processing faster and faster, because
parallelising algorithms is simply too difficult to master, and always
has been. Thus, whilst DRAM has had to go parallel (like RAID Striping) to
keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
have an astonishing four levels of cache (L1 to L4).

It should therefore come as no surprise that attempts are being made
to move (distribute) processing closer to the DRAM Memory, firmly
on the *opposite* side of the main CPU's L1/2/3/4 Caches,
where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
why this approach is compelling. However
the alarm bells ring here at the keyword "distributed", because by
moving the processing down next to the Memory, even onto
the same die as the DRAM, the speed of any
of the parallel Processing Elements (PEs) would likely drop
by almost two orders of magnitude (5 GHz down to 150 MHz),
and the complexity of each PE has, for purely pragmatic reasons,
to drop by several
orders of magnitude as well.
Things that the average "sequential algorithm"
programmer
takes for granted, such as SMP, Cache Coherency, Virtual Memory,
spinlocks (atomic locking, mutexes), are all either outright gone
or become things the programmer is expected to explicitly contend with
(even if that programmer is the Compiler Developer). There is definitely
not going to be a standard OS: the PEs will be too basic, too
resource-constrained, and definitely too busy.

To give an extreme example: Aspex's Array-String Processor, which
was 4096 2-bit SIMD PEs, each with 256 bytes of Content Addressable
Memory, was capable of literally a hundred-fold improvement in
performance over Scalar CPUs such as the Pentium III of its era,
all on a 3 watt budget at only 250 MHz in 130 nm. Yet to take
proper advantage of its capability required an astounding 5-10
*days* per line of assembly code, because multiple versions of
an algorithm had to be hand-crafted then compared, and only
the best one selected: all others discarded. 20 lines of optimised
Assembler taking three to six months to write can in no way be termed
"productive", yet this extreme level of unproductivity is an inherent
side-effect of going down the parallel-processing rabbithole, where
the cost of providing "Traditional" programmability (Virtual Memory,
SMP) is worse than counter-productive: it is often outright impossible.

**In short, we are in "Programmer's nightmare" territory**

Having dug a proverbial hole that rivals the Grand Canyon, and
jumped in it feet-first, the next
task is to piece together a strategy to climb back out and show
how falling back in can be avoided. This takes some explaining,
and first requires some background on various research efforts and
commercial designs. Once the context is clear, their synthesis
can be proposed. These are:

* [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
* [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
* [Snitch](https://arxiv.org/abs/2002.10143)

**ZOLC: Zero-Overhead Loop Control**

Zero-Overhead Looping is the concept of automatically running a set sequence
of instructions a predetermined number of times, without requiring
a branch. This is conceptually similar to, but
slightly different from, using Power ISA `bc` in `CTR`
(Counter) Mode to create loops, because in ZOLC the branch-back is automatic.

The simplest, longest commercially successful deployment of Zero-Overhead Looping
has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
within the VLIW word may be repeatedly deployed on successive clock
cycles until a countdown reaches zero. This extraordinarily simple
concept needs no branches, and has no complex Register Hazard
Management in the hardware,
because it is down to the programmer (or the compiler)
to ensure data overlaps do not occur. Careful crafting of those
14 instructions can keep the ALUs 100% occupied for sustained periods,
and the iconic example for which the TI DSPs are renowned
is that an entire inner loop for large FFTs
can be done with that one VLIW word: no stalls, no stopping, no fuss,
an entire 1024 or 4096 wide FFT Layer in one instruction.

<blockquote>
The key aspect of these
very simplistic countdown loops, as far as we are concerned,
is this: *they are deterministic*.
</blockquote>

Zero-Overhead Loop Control takes this basic "single loop" concept
much further: both nested loops and conditional exit are included,
but also arbitrary control-jumping from the current inner loop
out to an entirely different loop, all based on conditions determined
dynamically at runtime.

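A minimal C model of the basic (single, non-nested) countdown loop, to
show why no branch is ever fetched (illustrative semantics only; full
ZOLC adds the nesting and dynamic conditional jumps described above):

    /* The loop registers are set up once; thereafter instruction fetch
     * wraps automatically from loop-end back to loop-start until the
     * countdown reaches zero. No branch instruction is fetched, decoded
     * or predicted on any iteration. */
    struct zol {
        unsigned start;  /* address of first instruction in the loop body */
        unsigned end;    /* address of last instruction in the loop body  */
        unsigned count;  /* remaining iterations                          */
    };

    unsigned next_pc(struct zol *z, unsigned pc)
    {
        if (pc == z->end && z->count > 1) {
            z->count--;
            return z->start;   /* hardware auto-wrap: zero overhead */
        }
        return pc + 4;         /* normal sequential fetch */
    }
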
Even when deployed on as basic a CPU as a single-issue in-order RISC
core, the performance and power-savings were astonishing: between 20
and **80%** reduction in algorithm completion times was achieved compared
to a more traditional branch-speculative in-order RISC CPU. MPEG
Decode, the target algorithm specifically picked by the researcher
due to its high complexity, with 6-deep nested loops and conditional
execution that frequently jumped in and out of at least 2 loops,
came out with an astonishing 43% improvement in completion time. 43%
fewer instructions executed is an almost unheard-of level of optimisation:
most ISA designers are elated if they can achieve 5 to 10%. The reduction
was so compelling that ST Microelectronics put it into commercial
production in one of their embedded CPUs, the ST120 DSP-MCU.

The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
design of its triple-nested for-loop system
turned out to be remarkably similar to the
core nested for-loop engine of ZOLC. In hindsight this should not
have come as a surprise, because both are basically nested for-loops
that do not need branches to issue instructions.

The important insight is, however, that if ZOLC can be general-purpose
and apply deterministic nested looped instruction
schedules to more than just registers
(unlike SVP64 in its current incarnation) then so can SVP64.

**OpenCAPI and Extra-V**

OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors.

<blockquote>(Side note:
POWER10 *only*
has OpenCAPI Memory interfaces: an astounding number of them,
with overall bandwidth so high it's actually difficult to conceptualise.
An OMI-to-DDR4/5 Bridge PHY is therefore required
to connect to standard Memory DIMMs.)
</blockquote>

Extra-V appears to be a remarkable research project, based on OpenCAPI, that,
by assuming that the map of edges in any given arbitrary data graph
(excluding the actual data)
could be kept by the main CPU in-memory, could distribute and delegate
a limited-capability, deterministic, but most importantly *data-dependent*
node-walking schedule right down into the memory itself (on the
other side of that L1-4 cache barrier). A miniature
(non-Turing-complete) processor analysed
the data it had read (at the Memory), and determined whether it should
notify the main processor that this "Node" is worth investigating,
or whether the Graph node-walk should split in a different direction.
Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
abstraction, locking, and cache-coherency, many of the nightmare problems
of other more explicit parallel processing paradigms disappear.

The similarity to ZOLC should not have gone unnoticed: where ZOLC
has nested conditional for-loops, Extra-V appears to have just the
one conditional for-loop, but the key, strategically-crucial
part of this multi-faceted puzzle is that, due to the deterministic and
coherent nature of Extra-V, the processing of the loops, which
requires a tiny non-Turing-complete processor, is not
done close to or by the main CPU at all: it is
*embedded right next to the memory*.

The similarity to the D-Matrix Systolic Array Processing, Aspex Microelectronics
Array-String Processing, and Elixent 2D Array Processing should
also not have gone unnoticed. All of these solutions utilised
or utilise
a more comprehensive Turing-complete von-Neumann "Management Core"
to coordinate data passed in and out of PEs: none of them have or
had anything
as powerful as OpenCAPI as part of that picture.

The fact that Neural Networks may be expressed as arbitrary Graphs,
and comprise Sparse Matrices, should also have been noted by the reader
interested in AI.

**Snitch**

Snitch is an elegant Memory-Coherent Barrel-Processor in which registers
become "tagged" with a Memory-access Mode that went out of fashion
over forty years ago: Load-then-Auto-Increment. Expressed in C as
`src = *x++`, and requiring special Address Registers (PDP-11, 68000),
the efficiency and effectiveness
of these Load-Store-with-Increment instructions had,
thanks to the RISC paradigm having gone too far, been
forgotten, until Snitch.

What the designers did, however, was not to add any new Load-Store
or Arithmetic instructions to the underlying RISC-V at all, but instead to "mark"
registers with a tag which *augmented* (altered) the behaviour
of *existing* instructions. These tags tell the CPU: when you are asked to
carry out
an add instruction on r6 and r7, do not take r6 or r7 from the register
file; instead, perform a Cache-coherent Load-with-Increment
on each, using special (hidden, implicit)
Address Registers for each. Each new use
of r6 therefore brings in an entirely new value *directly from
memory*. Likewise for the second operand, r7, and likewise for
the destination result, which can be an automatic Coherent
Store-and-Increment
directly into Memory.

<blockquote>
*The act of "reading" or "writing" a register has been decoupled
and intercepted, then connected transparently to a completely
separate Coherent Memory Subsystem*
</blockquote>

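A C sketch of the decoupling (illustrative semantics, not Snitch's
actual implementation): reading a tagged register becomes a
load-with-increment, writing one becomes a store-with-increment, and
the ALU instruction itself is completely untouched.

    #include <stdbool.h>
    #include <stdint.h>

    /* A Snitch-style tagged register: when tagged, every read pulls the
     * next value directly from (coherent) memory via a hidden
     * auto-incrementing Address Register; every write streams out. */
    struct reg {
        bool      tagged;
        uint64_t  value;   /* untagged: a perfectly normal register    */
        uint64_t *addr;    /* tagged: hidden implicit Address Register */
    };

    static uint64_t reg_read(struct reg *r)
    {
        return r->tagged ? *r->addr++  /* src = *x++ : load-with-increment */
                         : r->value;
    }

    static void reg_write(struct reg *r, uint64_t v)
    {
        if (r->tagged) *r->addr++ = v; /* store-with-increment, to memory  */
        else           r->value   = v;
    }

    /* An unmodified "add rd, ra, rb": the tags, not the opcode, decide
     * whether this is a register add or a streaming memory operation.   */
    static void add(struct reg *rd, struct reg *ra, struct reg *rb)
    {
        reg_write(rd, reg_read(ra) + reg_read(rb));
    }
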
On top of a barrel architecture, the slowness of Memory access
was not a problem, because the Deterministic nature of classic
Load-Store-Increment can be compensated for by having 8 Memory
accesses scheduled underway and interleaved in a time-sliced
fashion with an FPU that is correspondingly 8 times faster than
the Coherent Memory accesses.

This design is reminiscent of the early Vector Processors
of the 1960s and early 1970s, which also critically relied
on implicit auto-increment addressing.
The [CDC STAR-100](https://en.m.wikipedia.org/wiki/CDC_STAR-100),
for example, was specifically designed as a Memory-to-Memory Vector
Processor. The barrel architecture of Snitch neatly
solves one of the inherent problems of those early designs (a mismatch
with memory
speed), and the presence of a full register file (non-tagged,
normal, standard scalar registers) caters for a
second limitation of pure Memory-based Vector Processors: temporary
variables needed in the computation of intermediate results, which
also had to go through memory, put
an awfully high artificial load on Memory bandwidth.

The similarity to SVP64 should be clear: SVP64 Prefixing and the
associated REMAP system are just another form of register "tagging"
that augments what was formerly designated, by its original authors,
as "just a Scalar ISA"; tagging allows for dramatic implicit alteration,
with advanced behaviour not previously envisaged.

What Snitch brings to the table, therefore, is a further illustration of
the concept introduced by Extra-V: where Extra-V brought information
about Sparse-Distributed Data to the attention of the main CPU in
a coherent fashion *without the CPU having to ask for it*, Snitch
demonstrates a classic LOAD-COMPUTE-STORE cycle in the same
distributed coherent manner, and does so with dramatically-reduced
power consumption.

**Bringing it all together**

At this point we are well into a future revision of SVP64, one that
clearly has some startlingly powerful potential: Supercomputing-class
Multi-Issue Vector Engines kept 100% occupied in a 100% long-term
sustained fashion, with reduced complexity, reduced power consumption
and reduced completion time, thanks to Deterministic Coherent Scheduling
of the data fed in and out, or even moved down next to Memory.

This last part is where it normally gets hair-raising, but as ZOLC shows
there is no reason at all why even complex algorithms such as MPEG cannot
be run in a partially-deterministic manner, and anything that is
deterministic can be Scheduled, coherently. Combine that with OpenCAPI,
which solves the many issues associated with SMP, Virtual Memory and so on
yet still allows Cache-Coherent Distributed Memory Access, and what was
an intractable Computer Science problem for decades begins to
look like it has a potential solution.

It should even be possible to identify, from the Deterministic Schedules
created by ZOLC, their
suitability for full off-CPU distributed processing, as long as OpenCAPI
is integrated into the mix. What a compiler - or even the hardware -
will be looking out for is a Basic Block of instructions that:

* begins with a LOAD (to be handled by OpenCAPI)
* contains some instructions that a given PE is capable of executing
* ends with a STORE (again: OpenCAPI)

For best results that would be wrapped with a Zero-Overhead Loop
(which is offloaded - in full - down to the PE), where
the Compiler (or hardware at runtime) could easily identify, in advance,
the full range of Memory Addresses that the Loop is to encounter. Copies
of loop-invariant data would need to be passed down to the remote PE:
again, for simple-enough Basic Blocks, with assistance from the Compiler,
loop-invariant inputs are easily identified (a sketch of such a loop
follows). Parallel Processing
opportunities should also be easy enough to create, simply by farming out
different parts of a given Deterministic Zero-Overhead Loop to
different PEs based on their proximity, bandwidth or ease of access to
given Memory.

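A sketch of the kind of loop that qualifies (ordinary C; the names are
purely illustrative): every address it touches is computable in advance,
the inputs `scale` and `n` are loop-invariant, and the body is a pure
LOAD-COMPUTE-STORE Basic Block.

    #include <stddef.h>

    /* An offload-candidate Basic Block wrapped in a countable loop:
     *  - begins with a LOAD   (handled by OpenCAPI at the PE)
     *  - simple compute       (within the PE's limited capability)
     *  - ends with a STORE    (again: OpenCAPI)
     * The ranges [src, src+n) and [dst, dst+n), plus the loop-invariant
     * `scale`, can all be identified and shipped to the PE in advance. */
    void scale_array(double *dst, const double *src, double scale, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            double x = src[i];  /* LOAD    */
            x = x * scale;      /* COMPUTE */
            dst[i] = x;         /* STORE   */
        }
    }
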
The importance of OpenCAPI in this mix cannot be overstated, because
it will be the means by which the main CPU coordinates its activities
with the remote PEs, ensuring that LOAD/STORE Memory Hazards are not
violated. It should also be straightforward to ensure that the offloading
is entirely transparent to the developer; in fact this is a hard requirement,
because at any given moment there is the possibility that the PEs may be
busy and it is the main CPU that has to complete the Processing Task itself.

It is also important to note that we are not necessarily talking about
the Remote PEs executing the Power ISA, but if they do so it becomes
much easier for the main CPU to take over in the event that PEs are
currently occupied. Plus, the twin lessons, that inventing an ISA, even
a small one, is hard (mostly in the compiler writing), and that
GPU Task Scheduling is complex, are being heard loud and clear.

Put another way: if the PEs run a foreign ISA, then the Basic Blocks embedded inside the ZOLC Loops must be in that ISA, and therefore:

* In order that the main CPU can execute the same sequence if necessary,
the CPU must support dual ISAs: Power and PE **OR**
* There must be a JIT binary-translator which either turns PE code
into Power ISA code or vice-versa **OR**
* The compiler dual-compiles the original source code, and embeds
both a Power binary and a PE binary into the ZOLC Basic Block **OR**
* All binaries are stored in an Intermediate Representation
(LLVM-IR, SPIR-V) and JIT-compiled on-demand.

All of these would work, but it is simpler and a lot less work
just to have the PEs
execute the exact same ISA (or a subset of it). If however the
concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
which would increase adoption and reduce cost, a bit more thought
is required here, because ARM or Intel or MIPS might not necessarily
be happy that a Processing Element has to execute Power ISA binaries.
At least the Power ISA is much richer, more powerful, still RISC,
and is an Open Standard, as discussed in earlier sections.

# Transparently-Distributed Vector Processing

It is very strange to the author to be describing what amounts to a
"Holy Grail" solution to a decades-long intractable problem that
mitigates the anticipated end of Moore's Law: how to make it easy for
well-defined workloads, expressed as a perfectly normal
sequential program, compiled to a standard well-known ISA, to have
the potential of being offloaded transparently to Parallel Compute Engines,
all without the Software Developer being excessively burdened with
a Parallel-Processing Paradigm that is alien to all their experience
and training, as well as to Industry-wide common knowledge.

Will it be that easy? ZOLC is, honestly, in its current incarnation,
not that straightforward: programs
have to be "massaged" by tools that insert intrinsics into the
source code, in order to identify the Basic Blocks that the Zero-Overhead
Loops can run. Can this be merged into standard gcc and llvm
compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
if an infinite supply of money and engineering time is thrown at it.
Is a half-way-house solution of compiler intrinsics good enough?
Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" to that,
for several decades, and advanced programmers are comfortable with the
practice.

Additional questions remain as to whether OpenCAPI, or its use in this
particular scenario, requires that the PEs, even quite basic ones,
implement a full RADIX MMU and associated TLB lookup. In order to ensure
that programs may be cleanly and seamlessly transferred between PEs
and CPU, the answer is quite likely to be "yes", which is interesting
in and of itself. Fortunately, the associated L1 Cache with TLB
Translation does not have to be large, and the actual RADIX Tree Walk
need not explicitly be done by the PEs: it can be handled by the main
CPU as a software extension.

**Use-case: Matrix and Convolutions**

Imagine a large Matrix scenario, with several values close to zero that
could be skipped: there is no need to include zero-multiplications, but a
traditional CPU can in no way help: only by loading the data through
the L1-L4 Cache and Virtual Memory Barriers is it possible to
ascertain, retrospectively, that time and power had just been wasted.

SVP64 is able to do what is termed "Vertical-First" Vectorisation
*(walk first through a batch of instructions before explicitly
moving to the next element, and repeating the batch)*,
combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
extended, Snitch-style, to perform a deterministic memory-array walk of
a large Matrix.

Let us also imagine that the Matrices are stored in Memory with PEs
attached, and that the PEs are fully functioning Power ISA cores with Draft
SVP64, but their Multiply capability is not as good as the main CPU's.
Therefore:
we want the PEs to feed the sparse data to the main CPU, a la "Extra-V".

* The ZOLC SVREMAP System running on the main CPU generates a Matrix
Memory-Load Schedule.
* The Schedule is sent to the PEs, next to the Memory, via OpenCAPI.
* The PEs are also sent the Basic Block to be executed on each
Memory Load (each element of the Matrices to be multiplied).
* The PEs execute the Basic Block and **exclude**, in a deterministic
fashion, any elements containing Zero values.
* Non-zero elements are sent, via OpenCAPI, to the main CPU, which
queues sequences of Multiply-and-Accumulate, and feeds the results
back to Memory, again via OpenCAPI, to the PEs.
* The PEs, which are tracking the Sparse Conditions, know where
to store the results received.

In essence this is near-identical to the original Snitch concept,
except that there are, as in Extra-V, PEs able to perform
conditional testing of the data as it goes both to and from the
main CPU (a sketch of the PE-side Basic Block follows). In this way a
large Sparse Matrix Multiply or Convolution
may be achieved without having to pass unnecessary data through
the L1/L2/L3 Caches only to find, at the CPU, that it is zero.

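A sketch of the PE-side per-element Basic Block in C (illustrative;
`capi_send()` is a hypothetical helper invented here to stand in for
the OpenCAPI transfer, not a real API):

    #include <stddef.h>

    /* Hypothetical: ships one non-zero element, plus its position in
     * the Schedule, upstream to the main CPU over OpenCAPI. */
    void capi_send(size_t index, double value);

    /* Executed at the PE for each Memory Load in the Schedule: zero
     * elements are excluded, deterministically, before they ever cross
     * the L1/L2/L3 barrier to the main CPU. */
    void pe_filter(const double *matrix, const size_t *schedule, size_t vl)
    {
        for (size_t i = 0; i < vl; i++) {
            double x = matrix[schedule[i]]; /* deterministic memory walk   */
            if (x != 0.0)
                capi_send(i, x);            /* only non-zero data upstream */
        }
    }
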
**Use-case: More powerful in-memory PEs**

An obvious variant of the above is that, if there is inherently
more parallelism in the data set, then the PEs get their own
Multiply-and-Accumulate instruction, and rather than send the
data to the CPU over OpenCAPI, perform the Matrix-Multiply
directly themselves.

However the source code and binary would be near-identical, if
not identical in every respect, with the PEs implementing the full
ZOLC capability in order to keep binary size to the bare minimum.
The main CPU's role would be to coordinate and manage the PEs
over OpenCAPI.

One key strategic question does remain: do the PEs need to have
a RADIX MMU and associated TLB-aware minimal L1 Cache, in order
to support OpenCAPI properly? The answer is very likely to be yes.
The saving grace here is that, with
the expectation of running only hot-loops with ZOLC-driven
binaries, the size of L1 Cache needed would be minuscule compared
to that of the average high-end CPU.

**Roadmap summary of Advanced SVP64**

The future direction for SVP64, then, is:

* To overcome its current limitation of REMAP Schedules being
restricted to Register Files, leveraging the Snitch-style
register interception "tagging" technique.
* To adopt ZOLC and merge REMAP Schedules into ZOLC.
* To bring OpenCAPI Memory Access into ZOLC as a first-level
concept that mirrors Snitch's Coherent Memory interception.
* To add the Graph-Node Walking Capability of Extra-V
to ZOLC / SVREMAP.
* To make it possible, in a combination of hardware and software,
to easily identify ZOLC / SVREMAP Blocks
that may be transparently pushed down closer to Memory, for
localised distributed parallel execution, by OpenCAPI-aware PEs,
exploiting both the Deterministic nature of ZOLC / SVREMAP
combined with the Cache-Coherent nature of OpenCAPI,
to the maximum extent possible.
* To make the exploitation of this powerful solution as simple
and straightforward as possible for Software Engineers to use,
in standard common-usage compilers, gcc and llvm.
* To propose extensions to Public Standards that allow all of
the above to become part of everyday ubiquitous mass-volume
computing.

Even the first of these - merging Snitch-style register tagging
into SVP64 - would
expand SVP64's capability for Matrices, currently limited to
around 5x7 to 6x6 Matrices and constrained by the size of
the register files (128 64-bit entries), to arbitrary (massive) sizes.

**Summary**

The bottom line is that there is a clear roadmap towards solving a
long-standing problem facing Computer Science, and doing so in a way that
reduces power consumption, reduces algorithm completion time, and reduces
the need for complex hardware microarchitectures in favour of much
smaller distributed coherent Processing Elements.