1 [[!tag whitepapers]]
2
3 **Revision History**
4
5 * v0.00 05may2022 first created
6 * v0.01 06may2022 initial first draft
7 * v0.02 08may2022 add scenarios / use-cases
8 * v0.03 09may2022 add draft image for scenario
9 * v0.04 14may2022 add appendix with other research
10 * v0.05 14jun2022 update images (thanks to Veera)
11
12 **Table of Contents**
13
14 [[!toc]]
15
16 # Why in the 2020s would you invent a new Vector ISA
17
18 *(The short answer: you don't. Extend existing technology: on the shoulders of giants)*
19
20 Inventing a new Scalar ISA from scratch is over a decade-long task
21 including simulators and compilers: OpenRISC 1200 took 12 years to
22 mature. Stable Open ISAs require Standards and Compliance Suites that
23 take even longer. No Vector or Packed SIMD ISA has ever reached stable
24 *general-purpose* auto-vectorisation compiler support in the
25 history of computing, not even with the combined resources of ARM, Intel,
26 AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
27 assembler and direct use of intrinsics is the Industry-standard norm
28 to achieve high-performance optimisation where it matters*).
29 GPUs fill this void both in hardware and software terms by having
30 ultra-specialist compilers (CUDA) that are designed from the ground up
31 to support Vector/SIMD parallelism, and associated standards
32 (SPIR-V, Vulkan, OpenCL) managed by
33 the Khronos Group, with multi-man-century development commitment from
34 multiple billion-dollar-revenue companies, to sustain them.
35
36 It therefore raises the question: why on earth would anyone consider
37 this task, and what, in Computer Science, actually needs solving?
38
39 The first hint is that whilst memory bitcells have not increased in speed
40 since the 90s (around 150 MHz), increasing the bank width, striping, and
41 datapath widths and speeds has, at the cost of significant relative
42 latency penalties, allowed
43 apparent speed increases: 3200 MHz DDR4 and even faster DDR5,
44 and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI's OMI,
45 all make an effort (all simply increasing the parallel deployment of
46 the underlying 150 MHz bitcells), but these efforts are dwarfed by the
47 nearly three orders of magnitude increase in CPU horsepower
48 over the same timeframe. Seymour
49 Cray, with his amazing in-depth knowledge, predicted over two decades ago
50 that this mismatch would become a serious limitation.
51
52 Nothing can be done about the latency gap between that bitcell speed and the CPU speed when it comes to Random Access (unpredictable reads/writes). Caching helps only so
53 much, but not with some types of workloads (FFTs are among the worst)
54 even though
55 they are fully deterministic.
56 Some systems
57 at the time of writing are now approaching a *Gigabyte* of L4 Cache,
58 by way of compensation, and as we know from experience even that will
59 be considered inadequate in future.
60
61 Efforts to solve this problem by moving the processing closer to or
62 directly integrated into the memory have traditionally not gone well:
63 Aspex Microelectronics and Elixent are parallel processing companies
64 that very few have heard of, because their software stack was so
65 specialist that it required heavy investment by customers to utilise.
66 D-Matrix, a Systolic Array Processor, is a modern incarnation of the exact same
67 "specialist parallel processing" mistake, betting heavily on AI with
68 Matrix and Convolution Engines that can do no other task. Aspex only
69 survived by being bought by Ericsson, where its specialised suitability
70 for massive wide Baseband FFTs saved it from going under.
71 The huge risk is that any "better
72 AI mousetrap" from an innovative competitor
73 quickly renders a too-specialist design obsolete.
74
75 NVIDIA and other GPUs have taken a different approach again: massive
76 parallelism with more Turing-complete ISAs in each core, and dedicated
77 slower parallel memory paths (GDDR5) suited to the specific tasks of
78 3D, Parallel Compute and AI. The complexity of this approach is only dwarfed
79 by the amount of money poured into the software ecosystem in order
80 to make it accessible, and even then, GPU Programmers are a specialist
81 and rare (expensive) breed.
82
83 A second hint as to the answer emerges from the article
84 "[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
85 which illustrates a catastrophic rabbit-hole taken by Industry Giants
86 ARM, Intel and AMD since the 90s (over 3 decades) whereby SIMD, an
87 Order(N^6) opcode proliferation nightmare with the mantra "make it
88 easy for hardware engineers, let software sort out the mess", has literally
89 overwhelmed programmers with thousands of instructions. Specialists charging
90 clients for assembly-code Optimisation Services are finding that AVX-512,
91 to take an
92 example, is anything but optimal: overall performance of AVX-512 actually
93 *decreases* even as power consumption goes up.
94
95 Cray-style Vectors solved, over thirty years ago, the opcode proliferation
96 nightmare. Only the NEC SX Aurora however truly kept the Cray Vector
97 flame alive, until RISC-V RVV and now SVP64 and recently MRISC32 joined
98 it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
99 instruction that makes a truly ubiquitous Vector ISA) in ways that
100 will become apparent over time as adoption increases. In the meantime
101 programmers are, in direct violation of ARM's advice on how to use SVE2,
102 trying desperately to understand it by applying their experience
103 of Packed SIMD NEON. The advice from ARM
104 not to create SVE2 assembler that is hardcoded to fixed widths is being
105 disregarded, in favour of writing *multiple near-identical implementations*
106 of a function, one per hardware width, and compelling
107 software to choose between them at runtime after probing the hardware.
108
109 Even RISC-V, for all that we can be grateful to the RISC-V Founders
110 for reviving Cray Vectors, has severe performance and implementation
111 limitations that are only really apparent to exceptionally experienced
112 assembly-level developers with wide and deep experience of multiple ISAs:
113 one of the best and clearest is a
114 [ycombinator post](https://news.ycombinator.com/item?id=24459041)
115 by adrian_b.
116
117 Adrian logically and concisely points out that the fundamental design
118 assumptions and simplifications that went into the RISC-V ISA have an
119 irrevocably damaging effect on its viability for high performance use.
120 That is not to say that its use in low-performance embedded scenarios is
121 not ideal: in private custom secretive commercial usage it is perfect.
122 Trinamic, an early adopter, is a classic case study: they created their
123 TMC2660 Stepper IC replacing ARM with RISC-V, saving themselves USD 1
124 in licensing royalties per product. Ubiquitous and common everyday
125 usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not
126 so much. Even though RISC-V has Cray-style Vectors, the whole ISA is,
127 unfortunately, fundamentally flawed as far as power efficient high
128 performance is concerned.
129
130 Slowly, at this point, a realisation should be sinking in: there are
131 not actually that many truly viable Vector ISAs out there, because the
132 ones that are evolving in the general direction of Vectorisation are,
133 in various completely different ways, flawed.
134
135 **Successfully identifying a limitation marks the beginning of an
136 opportunity**
137
138 We are nowhere near done, however, because a Vector ISA is a superset of a
139 Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler
140 support, and even longer to get the software ecosystem up and running.
141
142 Which ISAs, therefore, have or have had, at one point in time, a decent
143 Software Ecosystem? Debian supports most of these including s390:
144
145 * SPARC, created by Sun Microsystems and all but abandoned by Oracle.
146 Gaisler Research maintains the LEON Open Source Cores but with Oracle's
147 reputation nobody wants to go near SPARC.
148 * MIPS, popularised by SGI and only really commonly used in Network switches.
149 Exceptions: Ingenic with embedded CPUs,
150 and China ICT with the Loongson supercomputers.
151 * x86, the most well-known ISA and also one of the most heavily
152 litigously-protected.
153 * ARM, well known in embedded and smartphone scenarios, very slowly
154 making its way into data centres.
155 * OpenRISC, an entirely Open ISA suitable for embedded systems.
156 * s390, a Mainframe ISA very similar to Power.
157 * Power ISA, a Supercomputing-class ISA, as demonstrated by
158 two of the top three top500.org supercomputers using
159 around 2 million IBM POWER9 Cores each.
160 * ARC, a competitor at the time to ARM, best known for use in
161 Broadcom VideoCore IV.
162 * RISC-V, with a software ecosystem heavily in development
163 and with rapid expansion
164 in an uncontrolled fashion, is set on an unstoppable
165 and inevitable trainwreck path to replicate the
166 opcode conflict nightmare that plagued the Power ISA,
167 two decades ago.
168 * Tensilica, Andes STAR and Western Digital represent successful
169 commercial proprietary ISAs: Tensilica in Baseband Modems,
170 Andes in Audio DSPs, WD in HDDs and SSDs. These are all
171 astoundingly commercially successful
172 multi-billion-unit mass volume markets that almost nobody
173 knows anything about, outside their specialised proprietary
174 niche. Included for completeness.
175
176 In order of least controlled to most controlled, the viable
177 candidates for further advancement are:
178
179 * OpenRISC 1200, not controlled or restricted by anyone, but with no patent
180 protection.
181 * RISC-V, touted as "Open" but actually strictly controlled under
182 Trademark License: too new to have adequate patent pool protection,
183 as evidenced by multiple adopters having been hit by patent lawsuits.
184 (Agreements between RISC-V *Members* not to engage in patent litigation
185 do nothing to stop third-party patents that *legitimately pre-date*
186 the newly-created RISC-V ISA.)
187 * MIPS, SPARC, ARC, and others, simply have no viable publicly
188 managed ecosystem. They work well within their niche markets.
189 * Power ISA: protected by IBM's extensive patent portfolio for Members
190 of the OpenPOWER Foundation, covered by Trademarks, permitting
191 and encouraging contributions, and having software support for over
192 20 years.
193 * ARM, which does not permit Open Licensing, survived in the early 90s
194 only by doing a deal with Samsung for an in-perpetuity
195 Royalty-free License, in exchange
196 for GBP 3 million and legal protection through Samsung Research.
197 Several large Corporations (Apple most notably) have licensed the ISA
198 but not ARM designs: the barrier to entry is high and the ISA itself
199 protected from interference as a result.
200 * x86, famous for an unprecedented
201 Court Ruling in 2004 where a Judge "banged heads
202 together" and ordered AMD and Intel to stop wasting his time,
203 make peace, and cross-license each other's patents. Anyone wishing
204 to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
205 and VIA EDEN processors, and see how they fared.
206 * s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
207 but the 800lb "Corporate Gorilla Syndrome" seems not to have deterred one
208 particularly disingenuous group from performing illegal
209 Reverse-Engineering.
210
211 By asking the question, "which ISA would be the best and most stable to
212 base a Vector Supercomputing-class Extension on?" where patent protection,
213 software ecosystem, open-ness and pedigree all combine to reduce risk
214 and increase the chances of success, there is really only one candidate.
215
216 **Of all of these, the one with the most going for it is the Power ISA.**
217
218 The summary of advantages, then, of the Power ISA is that:
219
220 * It has a 25-year software ecosystem, with RHEL, Fedora, Debian
221 and more.
222 * Amongst many other features
223 it has Condition Registers which can be used by Branches, greatly
224 reducing pressure on the main register files.
225 * IBM's extensive 20+ year patent portfolio is available, royalty-free,
226 to protect implementors as long as they are also members of the
227 OpenPOWER Foundation.
228 * IBM designed and maintained the Power ISA as a Supercomputing
229 class ISA from its inception over 25 years ago.
230 * Coherent distributed memory access is possible through OpenCAPI
231 * Extensions to the Power ISA may be submitted through an External
232 RFC Process that does not require membership of OPF.
233
234 From this strong base, the next step is: how to leverage this
235 foundation to take a leap forward in performance and performance/watt,
236 *without* losing all the advantages of a ubiquitous software ecosystem,
237 the lack of which has historically plagued other systems and relegated
238 them to a risky niche market?
239
240 # How do you turn a Scalar ISA into a Vector one?
241
242 The most obvious question before that is: why on earth would you want to?
243 As explained in the "SIMD Considered Harmful" article, Cray-style
244 Vector ISAs break the link between data element batches and the
245 underlying architectural back-end parallel processing capability.
246 Packed SIMD explicitly smashes that width right in the face of the
247 programmer and expects them to like it. As the article immediately
248 demonstrates, an arbitrary-sized data set has to contend with
249 an insane power-of-two Packed SIMD cascade at both setup and teardown
250 that routinely adds literally an order
251 of magnitude increase in the number of hand-written lines of assembler
252 compared to a well-designed Cray-style Vector ISA with a `setvl`
253 instruction.
254
255 *<blockquote>
256 Packed SIMD looped algorithms actually have to
257 contain multiple implementations processing fragments of data at
258 different SIMD widths: Cray-style Vectors have just the one, covering not
259 just current architectural implementations but future ones with
260 wider back-end ALUs as well.
261 </blockquote>*
262
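To make the contrast concrete, here is a minimal Python sketch (not real assembler) of the two styles; the `MAXVL` hardware width and the particular SIMD cascade widths are illustrative assumptions only.

```python
def cray_style_sum(a, b, MAXVL=8):
    """One loop covers any N: setvl clamps VL to min(remaining, MAXVL)."""
    result = []
    i = 0
    while i < len(a):
        VL = min(len(a) - i, MAXVL)        # the essence of `setvl`
        result += [a[i+j] + b[i+j] for j in range(VL)]
        i += VL
    return result

def packed_simd_sum(a, b):
    """Power-of-two cascade: separate code paths for 8-, 4-, 2- and 1-wide tails."""
    result = []
    i = 0
    for width in (8, 4, 2, 1):             # the setup/teardown cascade
        while len(a) - i >= width:
            result += [a[i+j] + b[i+j] for j in range(width)]
            i += width
    return result
```

The Cray-style version also covers future hardware with a wider `MAXVL` unchanged, whereas the Packed SIMD version grows another code path for every new width.
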
263 Assuming then that variable-length Vectors are obviously desirable,
264 it becomes a matter of how, not if. Both Cray and NEC SX Aurora
265 went the way of adding explicit Vector opcodes, a style which RVV
266 copied and modernised. In the case of RVV this introduced 192 new
267 instructions on top of an existing 95+ for base RV64GC. Adding
268 200% more instructions than the base ISA seems unwise: at least,
269 it feels like there should be a better way. On
270 close inspection of RVV as an example, the basic arithmetic
271 operations are massively duplicated: scalar-scalar from the base
272 is joined by both scalar-vector and vector-vector *and* predicate
273 mask management, and transfer instructions between all the same,
274 which goes a long way towards explaining why there are twice as many
275 Vector instructions in RISC-V as there are in the RV64GC Scalar base.
276
277 The question then becomes: with all the duplication of arithmetic
278 operations just to make the registers scalar or vector, why not
279 leverage the *existing* Scalar ISA with some sort of "context"
280 or prefix that augments its behaviour? Separate out the
281 "looping" from "thing being looped on" (the elements),
282 make "Scalar instruction"
283 synonymous with "Vector Element instruction" and through nothing
284 more than contextual
285 augmentation the Scalar ISA *becomes* the Vector ISA.
286 Then, by not having to have any Vector instructions at all,
287 the Instruction Decode
288 phase is greatly simplified, reducing design complexity and leaving
289 plenty of headroom for further expansion.
290
291 [[!img "svp64-primer/img/power_pipelines.svg" ]]
292
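A hedged sketch of the concept, in Python rather than SVP64 syntax: the register file size, the `VL` loop and the names below are illustrative assumptions, but they show how an unmodified scalar operation becomes a Vector one purely through contextual repetition.

```python
regs = [0] * 128                      # enlarged register file (128 entries)

def scalar_add(rt, ra, rb):           # the unmodified Scalar instruction
    regs[rt] = regs[ra] + regs[rb]

def sv_prefix(scalar_op, rt, ra, rb, VL):
    """Vector-context prefix: repeat the *same* scalar op VL times,
    stepping each register number by one element per iteration."""
    for i in range(VL):
        scalar_op(rt + i, ra + i, rb + i)

# sv_prefix(scalar_add, 0, 32, 64, VL=8) behaves as an 8-element vector add,
# yet only the scalar add was ever defined or decoded.
```
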
293 Remarkably this is not a new idea. Intel's x86 `REP` instruction
294 gives the base concept, and the Z80 had something similar.
295 But in 1994 it was Peter Hsu, the designer
296 of the MIPS R8000, who first came up with the idea of Vector-augmented
297 prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
298 the prefix would mark which of the registers were to be treated as
299 Scalar and which as Vector, then, treating the Scalar "suffix" instruction
300 as a guide and making "scalar instruction" synonymous with "Vector element",
301 perform a `REP`-like loop that
302 jammed multiple scalar operations into the Multi-Issue Execution
303 Engine. The only reason that the team did not take this forward
304 into a commercial product
305 was because they could not work out how to cleanly do OoO
306 multi-issue at the time (leveraging Multi-Issue is the most logical
307 way to exploit the Vector-Prefix concept).
308
309 In its simplest form, then, this "prefixing" idea is a matter
310 of:
311
312 * Defining the format of the prefix
313 * Adding a `setvl` instruction
314 * Adding Vector-context SPRs and working out how to do
315 context-switches with them
316 * Writing an awful lot of Specification Documentation
317 (4 years and counting)
318
319 Once the basics of this concept have sunk in, early
320 advancements quickly follow naturally from analysis
321 of the problem-space:
322
323 * Expanding the size of GPR, FPR and CR register files to
324 provide 128 entries in each. This is a bare minimum for GPUs
325 in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
326 batching as possible.
327 * Predication (an absolutely critical component for a Vector ISA):
328 the next logical advancement is to allow separate predication masks
329 to be applied to *both* the source *and* the destination, independently
330 (see the twin-predication sketch after this list). (*Readers familiar with
331 Vector ISAs will recognise this as a back-to-back `VGATHER-VSCATTER`*)
332 * Element-width overrides: most Scalar ISAs today are 64-bit only,
333 with primarily Load and Store being able to handle 8/16/32/64
334 and sometimes 128-bit (quad-word), where Vector ISAs need to
335 go as low as 8-bit arithmetic, even 8-bit Floating-Point for
336 high-performance AI. Rather than waste opcode space adding all
337 such operations at different bitwidths, let the prefix
338 *redefine* (override) the element width, without actually altering
339 the Scalar ISA at all.
340 * "Reordering" of the assumption of linear sequential element
341 access, for Matrices, rotations, transposition, Convolutions,
342 DCT, FFT, Parallel Prefix-Sum and other common transformations
343 that require significant programming effort in other ISAs.
344
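The twin-predication idea mentioned in the list above can be sketched as follows: a Python model under assumed mask semantics, not the SVP64 specification (mask encoding, element layout and skipping-vs-zeroing behaviour are illustrative assumptions).

```python
def twin_predicated_copy(src, dst, src_mask, dst_mask):
    """Copy enabled source elements to enabled destination slots:
    effectively a back-to-back VGATHER (src_mask) and VSCATTER (dst_mask)."""
    gathered = [src[i] for i in range(len(src)) if src_mask[i]]
    j = 0
    for i in range(len(dst)):
        if dst_mask[i] and j < len(gathered):
            dst[i] = gathered[j]
            j += 1
    return dst

# Example: compact the enabled elements of src into the enabled slots of dst.
# twin_predicated_copy([1,2,3,4], [0,0,0,0], [1,0,1,1], [0,1,1,0]) -> [0,1,3,0]
```
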
345 All of these things come entirely from "Augmentation" of the Scalar operation
346 being prefixed: at no time is the Scalar operation's binary pattern decoded
347 differently compared to when it is used as a Scalar operation.
348 From there, several more "Modes" can be added, including
349
350 * saturation,
351 which is needed for Audio and Video applications
352 * "Reverse Gear"
353 which runs the Element Loop in reverse order (needed for Prefix
354 Sum)
355 * Data-dependent Fail-First, which emerged from asking the simple
356 question, "If modern Vector ISAs have Load/Store Fail-First,
357 and the Power ISA has Condition Codes, why not make Conditional
358 early-exit from Arithmetic operation looping?"
359 * over 500 Branch-Conditional Modes emerge from application of
360 Boolean Logic in a Vector context, on top of an already-powerful
361 Scalar Branch-Conditional/Counter instruction
362
363 All of these features are added as "Augmentations", to create of
364 the order of 1.5 *million* instructions, none of which decode the
365 32-bit scalar suffix any differently.
366
367 **What is missing from Power Scalar ISA that a Vector ISA needs?**
368
369 Remarkably, very little: the devil is in the details though.
370
371 * The traditional `iota` instruction may be
372 synthesised with an overlapping add, that stacks up incrementally
373 and sequentially. Although it requires two instructions (one to
374 start the sum-chain) the technique has the advantage of allowing
375 increments by arbitrary amounts, and is not limited to addition,
376 either.
377 * Big-integer addition (arbitrary-precision arithmetic) is an
378 emergent characteristic of the carry-in, carry-out capability of
379 the Power ISA `adde` instruction. `sv.adde` as a BigNum add
380 naturally emerges from the
381 sequential carry-flag chaining of these scalar instructions (sketched below).
382 * The Condition Register Fields of the Power ISA make a great candidate
383 for use as Predicate Masks, particularly when combined with
384 Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
385
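As a hedged illustration of the BigNum point above, the following Python model shows how chaining the carry between element-level `adde` operations yields arbitrary-precision addition (64-bit limbs and least-significant-limb-first ordering are assumptions for the sketch):

```python
MASK = (1 << 64) - 1

def sv_adde(a_limbs, b_limbs, ca=0):
    """Model of an element loop over scalar `adde`: result[i], CA = a[i] + b[i] + CA."""
    result = []
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + ca
        result.append(total & MASK)   # low 64 bits written back
        ca = total >> 64              # carry chained into the next element
    return result, ca

# Two 128-bit numbers as two 64-bit limbs each (least-significant limb first):
# sv_adde([MASK, 0], [1, 0]) -> ([0, 1], 0), i.e. the carry ripples correctly.
```
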
386 It is only when looking slightly deeper into the Power ISA that
387 certain things turn out to be missing, and this is down in part to IBM's
388 primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
389 so Scalar ones. Examples include that transfer operations between the
390 Integer and Floating-point Scalar register files were dropped approximately
391 a decade ago after the Packed SIMD variants were considered to be
392 duplicates. With it being completely inappropriate to attempt to Vectorise
393 a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
394 the Scalar parts of the Power ISA, a much better all-round candidate
395 for Vectorisation, are left anaemic.
396
397 A particular key instruction that is missing is `MV.X` which is
398 illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
399 expensive instruction, causing a huge swathe of Register Hazards
400 in one single hit, is almost never added to a Scalar ISA but
401 is almost always added to a Vector one. When `MV.X` is
402 Vectorised it allows for arbitrary
403 remapping of elements within a Vector to positions specified
404 by another Vector. A typical Scalar ISA will use Memory to
405 achieve this task, but with Vector ISAs the Vector Register Files are
406 usually so enormous, and so far away from Memory, that it is easier and
407 more efficient, architecturally, to provide these Indexing instructions.
408
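A minimal sketch of what Vectorised `MV.X` amounts to, as a Python model (register numbering and `VL` handling are illustrative assumptions):

```python
def sv_mv_x(regs, dest, src, VL):
    """for i in 0..VL-1: GPR(dest+i) = GPR( GPR(src+i) )"""
    for i in range(VL):
        index = regs[src + i]          # the index vector lives in registers
        regs[dest + i] = regs[index]   # arbitrary in-register-file gather
    return regs

# With the index vector holding [7, 3, 5, 1], elements are permuted without
# ever round-tripping through Memory, which is the whole point of the instruction.
```
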
409 Fortunately, with the ISA Working Group being willing
410 to consider RFCs (Requests For Change) these omissions have the potential
411 to be corrected.
412
413 One deliberate decision in SVP64 involves Predication. Typical Vector
414 ISAs have quite comprehensive arithmetic and logical operations on
415 Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
416 side of Power ISA already has most of them.
417 If CR Fields were the only predicates in SVP64
418 it would create pressure to start adding the exact same arithmetic and logical
419 operations that already exist in the Integer opcodes, which is less
420 than desirable.
421 Instead of taking that route the decision was made to allow *both*
422 Integer *and* CR Fields to be Predicate Masks, and to create Draft
423 instructions that provide better transfer capability between CR Fields
424 and Integer Register files.
425
426 Beyond that, further extensions to the Power ISA become much more
427 domain-specific, such as adding bitmanipulation for Audio, Video
428 and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
429 `ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
430 of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
431 *automatically* is inherently added to the Vector one as well, and
432 because these GPU and Video opcodes have been added to the CPU ISA,
433 Software Driver development and debugging is dramatically simplified.
434
435 Which brings us to the next important question: how are any of these
436 CPU-centric Vector-centric improvements relevant to power efficiency
437 and making more effective use of resources?
438
439 # Simpler, more compact programs save power
440
441 The first and most obvious saving is that, just as with any Vector
442 ISA, the amount of data processing requested
443 and controlled by each instruction is enormous, and leaves the
444 Decode and Issue Engines idle, as well as the L1 I-Cache. With
445 programs being smaller, chances are higher that they fit into
446 L1 Cache, or that the L1 Cache may be made smaller: either way
447 is a considerable O(N^2) power-saving.
448
449 Even a Packed SIMD ISA could take limited advantage of a higher
450 bang-per-buck for limited specific workloads, as long as the
451 stripmining setup and teardown is not required. However a
452 2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck
453 ratio as a 64-wide Vector Length.
454
455 Realistically, for general use cases however it is extremely common
456 to have the Packed SIMD setup and teardown. `strncpy` for VSX is an
457 astounding 240 hand-coded assembler instructions where it is around
458 12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
459 for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
460 the case of the IBM POWER9 with a little-known design flaw not
461 normally otherwise encountered this results in
462 contention between the L1 D and I Caches at the L2 Bus, slowing down
463 execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
464 requires loop-unrolling to contend with non-power-of-two Matrix
465 sizes: SVP64 does not (as hinted at below).
466 [Figures 8 and 9](https://arxiv.org/abs/2104.03142)
467 illustrate the process of concatenating copies of data in order
468 to match RADIX2 limitations of MMA.
469
470 Additional savings come in the form of `SVREMAP`. Like the
471 hardware-assist of Google's TPU mentioned on p9 of the above MMA paper,
472 `SVREMAP` is a hardware
473 index transformation system where the normally sequentially-linear
474 Vector element access may be "Re-Mapped" to limited but algorithmic-tailored
475 commonly-used deterministic schedules, for example Matrix Multiply,
476 DCT, or FFT. A full in-register-file 5x7 Matrix Multiply or a 3x4 or
477 2x6 with optional *in-place* transpose, mirroring or rotation
478 on any source or destination Matrix
479 may be performed in as little as 4 instructions, one of which
480 is to zero-initialise the accumulator Vector used to store the result.
481 If addition to another Matrix is also required then it is only three
482 instructions.
483
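A hedged sketch of the underlying idea: a deterministic index Schedule, generated by nested loops, drives one repeated scalar multiply-accumulate. The exact index order of the real SVREMAP Schedule may differ; this is a conceptual model only.

```python
def matrix_remap_indices(rows, inner, cols):
    """Yield (c_index, a_index, b_index) for C[r][c] += A[r][k] * B[k][c]."""
    for r in range(rows):
        for c in range(cols):
            for k in range(inner):
                yield (r * cols + c, r * inner + k, k * cols + c)

def remapped_madd(A, B, rows, inner, cols):
    C = [0] * (rows * cols)                      # zero-initialised accumulator
    for ci, ai, bi in matrix_remap_indices(rows, inner, cols):
        C[ci] += A[ai] * B[bi]                   # one scalar mul-add, re-mapped
    return C

# A 2x3 times 3x2 multiply on flattened row-major inputs gives the same result
# as a conventional triple loop, with no per-iteration branch or unrolling.
```
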
484 Not only that, but because the "Schedule" is an abstract
485 concept separated from the mathematical operation, there is no reason
486 why Matrix Multiplication Schedules may not be applied to Integer
487 Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
488 AND-and-OR, or any other future instruction such as Complex-Number
489 Multiply-and-Accumulate or Abs-Diff-and-Accumulate
490 that a future version of the Power ISA might
491 support. The flexibility is not only enormous, but the compactness
492 unprecedented. RADIX2 in-place DCT may be created in
493 around 11 instructions using the Triple-loop DCT Schedule. The only other processors well-known to have
494 this type of compact capability are both VLIW DSPs: TI's TMS320 Series
495 and Qualcomm's Hexagon, and both are targeted at FFTs only.
496
497 There is no reason at all why future algorithmic schedules should not
498 be proposed as extensions to SVP64 (sorting algorithms,
499 compression algorithms, Sparse Data Sets, Graph Node walking
500 for example). (*Bear in mind that
501 the submission process will be
502 entirely at the discretion of the OpenPOWER Foundation ISA WG,
503 something that is both encouraged and welcomed by the OPF.*)
504
505 One of SVP64's current limitations is that it was initially designed
506 for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
507 a heavy focus on adding hardware-for-loops onto the *Registers*.
508 After more than three years of development the realisation hit that
509 the SVP64 concept could be expanded to Coherent Distributed Memory.
510 This astoundingly powerful concept is explored in the next section.
511
512 # Coherent Deterministic Hybrid Distributed In-Memory Processing
513
514 It is not often that a heading in an article can legitimately
515 contain quite so many comically-chained buzzwords, but in this section
516 they are justified. As hinted at in the first section, the last time
517 that memory was the same speed as processors was the Pentium III
518 and Motorola 88100 era: 133 and 166 mhz SDRAM was available, and
519 CPUs were about the same rate. DRAM bitcells *simply cannot exceed
520 these rates*, yet the pressure from Software Engineers is to
521 make *sequential* algorithm processing faster and faster because
522 parallelising of algorithms is simply too difficult to master, and always
523 has been. Thus whilst DRAM has to go parallel (like RAID Striping) to
524 keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
525 are at an astonishing four levels of cache (L1 to L4).
526
527 It should therefore come as no surprise that attempts are being made
528 to move (distribute) processing closer to the DRAM Memory, firmly
529 on the *opposite* side of the main CPU's L1/2/3/4 Caches,
530 where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
531 why this approach is compelling. However
532 the alarm bells ring here at the keyword "distributed", because by
533 moving the processing down next to the Memory, even onto
534 the same die as the DRAM, the speed of any
535 of the parallel Processing Elements (PEs) would likely drop
536 by almost two orders of magnitude (5 GHz down to 150 MHz), and
537 the complexity of each PE has, for pure pragmatic reasons,
538 to drop by several
539 orders of magnitude as well.
540 Things that the average "sequential algorithm"
541 programmer
542 takes for granted such as SMP, Cache Coherency, Virtual Memory,
543 spinlocks (atomic locking, mutexes): all of these are either outright gone
544 or something the programmer is expected to contend with explicitly
545 (even if that programmer is the Compiler Developer). There's definitely
546 not going to be a standard OS: the PEs will be too basic, too
547 resource-constrained, and definitely too busy.
548
549 To give an extreme example: Aspex's Array-String Processor, which
550 was 4096 2-bit SIMD PEs each with 256 bytes of Content Addressable
551 Memory, was capable of literally a hundred-fold improvement in
552 performance over Scalar CPUs such as the Pentium III of its era,
553 all on a 3 watt budget at only 250 MHz in 130 nm. Yet to take
554 proper advantage of its capability required an astounding 5-10
555 *days* per line of assembly code because multiple versions of
556 an algorithm had to be hand-crafted then compared, and only
557 the best one selected: all others discarded. 20 lines of optimised
558 Assembler taking three to six months to write can in no way be termed
559 "productive", yet this extreme level of unproductivity is an inherent
560 side-effect of going down the parallel-processing rabbithole where
561 the cost of providing "Traditional" programmability (Virtual Memory,
562 SMP) is worse than counter-productive: it's often outright impossible.
563
564 *<blockquote>
565 Similar to how GPUs achieve astounding task-dedicated
566 performance by giving
567 ALUs 30% of total silicon area and sacrificing the ability to run
568 General-Purpose programs, Aspex, Google's Tensor Processor and D-Matrix
569 likewise took this route and made the same compromise.
570 </blockquote>*
571
572 **In short, we are in "Programmer's nightmare" territory**
573
574 Having dug a proverbial hole that rivals the Grand Canyon, and
575 jumped in it feet-first, the next
576 task is to piece together a strategy to climb back out and show
577 how falling back in can be avoided. This takes some explaining,
578 and first requires some background on various research efforts and
579 commercial designs. Once the context is clear, their synthesis
580 can be proposed. These are:
581
582 * [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
583 available [no paywall](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf)
584 * [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
585 * [Snitch](https://arxiv.org/abs/2002.10143)
586
587 **ZOLC: Zero-Overhead Loop Control**
588
589 Zero-Overhead Looping is the concept of automatically running a set sequence
590 of instructions a predetermined number of times, without requiring
591 a branch. This is conceptually similar to, but
592 slightly different from, using Power ISA `bc` in `CTR`
593 (Counter) Mode to create loops, because in ZOLC the branch-back is automatic.
594
595 The simplest and longest-running commercially successful deployment of Zero-overhead looping
596 has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
597 within the VLIW word may be repeatedly deployed on successive clock
598 cycles until a countdown reaches zero. This extraordinarily simple
599 concept needs no branches, and has no complex Register Hazard
600 Management in the hardware
601 because it is down to the programmer (or, the compiler),
602 to ensure data overlaps do not occur. Careful crafting of those
603 14 instructions can keep the ALUs 100% occupied for sustained periods,
604 and the iconic example for which the TI DSPs are renowned
605 is that an entire inner loop for large FFTs
606 can be done with that one VLIW word: no stalls, no stopping, no fuss,
607 an entire 1024 or 4096 wide FFT Layer in one instruction.
608
609 <blockquote>
610 The key aspect of these
611 very simplistic countdown loops, as far as we are concerned,
612 is: *they are deterministic*.
613 </blockquote>
614
615 Zero-Overhead Loop Control takes this basic "single loop" concept
616 way further: both nested loops and conditional exit are included,
617 but also arbitrary control-jumping from the current inner loop
618 out to an entirely different loop, all based on conditions determined
619 dynamically at runtime.
620
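Conceptually, ZOLC behaves like the following Python model; the loop bounds, body and exit test are illustrative assumptions, not the ZOLC hardware interface:

```python
def zolc_nested(outer_count, inner_count, body, exit_condition):
    """Hardware-style nested countdown: no branch instructions are fetched;
    the loop engine re-issues the body until the counters expire or the
    runtime condition requests an exit out of the current loop level."""
    for outer in range(outer_count):
        for inner in range(inner_count):
            body(outer, inner)               # the instructions being looped
            if exit_condition(outer, inner): # data-dependent control transfer
                return (outer, inner)        # e.g. jump out to a different loop
    return None

# Usage: zolc_nested(6, 1024, lambda o, i: ..., lambda o, i: False) models a
# nested inner-loop pair with no per-iteration branch overhead.
```
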
621 Even when deployed on as basic a CPU as a single-issue in-order RISC
622 core, the performance and power-savings were astonishing: between 27
623 and **75%** reduction in algorithm completion times were achieved compared
624 to a more traditional branch-speculative in-order RISC CPU. MPEG
625 Encode's timing, the target algorithm specifically picked by the researcher
626 due to its high complexity with 6-deep nested loops and conditional
627 execution that frequently jumped in and out of at least 2 loops,
628 came out with an astonishing 43% improvement in completion time. 43%
629 fewer instructions executed is an almost unheard-of level of optimisation:
630 most ISA designers are elated if they can achieve 5 to 10%. The reduction
631 was so compelling that ST Microelectronics put it into commercial
632 production in one of their embedded CPUs, the ST120 DSP-MCU.
633
634 The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
635 design of its triple-nested for-loop system
636 turned out to be remarkably similar to the
637 core nested for-loop engine of ZOLC. In hindsight this should not
638 have come as a surprise, because both are basically nested for-loops
639 that do not need branches to issue instructions.
640
641 The important insight is, however, that if ZOLC can be general-purpose
642 and apply deterministic nested looped instruction
643 schedules to more than just registers
644 (unlike SVP64 in its current incarnation) then so can SVP64.
645
646 **OpenCAPI and Extra-V**
647
648 OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
649 cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors.
650
651 <blockquote>(Side note:
652 POWER10 *only*
653 has OpenCAPI Memory interfaces: an astounding number of them,
654 with overall bandwidth so high it's actually difficult to conceptualise.
655 An OMI-to-DDR4/5 Bridge PHY is therefore required
656 to connect to standard Memory DIMMs.)
657 </blockquote>
658
659 Extra-V appears to be a remarkable research project based on OpenCAPI that,
660 by assuming that the map of edges (excluding the actual data)
661 in any given arbitrary data graph
662 could be kept by the main CPU in-memory, could distribute and delegate
663 a limited-capability deterministic but most importantly *data-dependent*
664 node-walking schedule actually right down into the memory itself (on the other side of that L1-4 cache barrier). A miniature processor
665 (non-Turing-complete) analysed
666 the data it had read (at the Memory), and determined if it should
667 notify the main processor that this "Node" is worth investigating,
668 or if the Graph node-walk should split in a different direction.
669 Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
670 abstraction, locking, and cache-coherency, many of the nightmare problems
671 of other more explicit parallel processing paradigms disappear.
672
673 The similarity to ZOLC should not have gone unnoticed: where ZOLC
674 has nested conditional for-loops Extra-V appears to have just the
675 one conditional for-loop, but the key strategically-crucial
676 part of this multi-faceted puzzle is that due to the deterministic and
677 coherent nature of Extra-V, the processing of the loops, which
678 requires a tiny non-Turing-Complete processor, is not
679 done close to or by the main CPU at all: it is
680 *embedded right next to the memory*.
681
682 The similarity to the D-Matrix Systolic Array Processing, Aspex Microelectronics
683 Array-String Processing, and Elixent 2D Array Processing, should
684 also not have gone unnoticed. All of these solutions utilised
685 or utilise
686 a more comprehensive Turing-complete von-Neumann "Management Core"
687 to coordinate data passed in and out of PEs: none of them have or
688 had something
689 as powerful as OpenCAPI as part of that picture.
690
691 The fact that Neural Networks may be expressed as arbitrary Graphs,
692 and comprise Sparse Matrices, should also have been noted by the reader
693 interested in AI.
694
695 **Snitch**
696
697 Snitch is an elegant Memory-Coherent Barrel-Processor where registers
698 become "tagged" with a Memory-access Mode that went out of fashion
699 over forty years ago: Load-then-Auto-Increment. Expressed in C as
700 `src = *x++`, and requiring special Address Registers (PDP-11, 68000),
701 thanks to the RISC paradigm having gone too far,
702 the efficiency and effectiveness
703 of these Load-Store-with-Increment instructions had been
704 forgotten, until Snitch.
705
706 What the designers did however was not to add any new Load-Store
707 or Arithmetic instructions to the underlying RISC-V at all, but instead to "mark"
708 registers with a tag which *augmented* (altered) the behaviour
709 of *existing* instructions. These tags tell the CPU: when you are asked to
710 carry out
711 an add instruction on r6 and r7, do not take r6 or r7 from the register
712 file, instead please perform a Cache-coherent Load-with-Increment
713 on each, using special (hidden, implicit)
714 Address Registers for each. Each new use
715 of r6 therefore brings in an entirely new value *directly from
716 memory*. Likewise on the second operand, r7, and likewise on
717 the destination result which can be an automatic Coherent
718 Store-and-increment
719 directly into Memory.
720
721 <blockquote>
722 *The act of "reading" or "writing" a register has been decoupled
723 and intercepted, then connected transparently to a completely
724 separate Coherent Memory Subsystem*
725 </blockquote>
726
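The register-tagging mechanism can be sketched in Python as follows; the class, the word-addressed memory model and the tagging API are illustrative assumptions, not the actual Snitch extension:

```python
class TaggedRegFile:
    def __init__(self, memory):
        self.mem = memory
        self.regs = [0] * 32
        self.stream = {}                      # regnum -> hidden address register

    def tag(self, regnum, base_addr):
        self.stream[regnum] = base_addr       # mark r<n> as a memory stream

    def read(self, regnum):
        if regnum in self.stream:             # intercepted: load-with-increment
            addr = self.stream[regnum]
            self.stream[regnum] = addr + 1    # word-addressed model for simplicity
            return self.mem[addr]
        return self.regs[regnum]

    def write(self, regnum, value):
        if regnum in self.stream:             # intercepted: store-with-increment
            addr = self.stream[regnum]
            self.stream[regnum] = addr + 1
            self.mem[addr] = value
        else:
            self.regs[regnum] = value

def add(rf, rd, ra, rb):                      # the *existing*, unaltered instruction
    rf.write(rd, rf.read(ra) + rf.read(rb))

# Tag r6, r7 and r5 to three arrays in memory, then issue plain adds in a loop:
# every `add` now performs a coherent LOAD-COMPUTE-STORE with no new opcodes.
```
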
727 On top of a barrel-architecture the slowness of Memory access
728 was not a problem because the Deterministic nature of classic
729 Load-Store-Increment can be compensated for by having 8 Memory
730 accesses scheduled underway and interleaved in a time-sliced
731 fashion with an FPU that is correspondingly 8 times faster than
732 the Coherent Memory accesses.
733
734 This design is reminiscent of the early Vector Processors
735 of the late 1960s and early 1970s, which also critically relied
736 on implicit auto-increment addressing.
737 The [CDC STAR-100](https://en.m.wikipedia.org/wiki/CDC_STAR-100)
738 for example was specifically designed as a Memory-to-Memory Vector
739 Processor. The barrel-architecture of Snitch neatly
740 solves one of the inherent problems of those early designs (a mismatch
741 with memory
742 speed) and the presence of a full register file (non-tagged,
743 normal, standard scalar registers) caters for a
744 second limitation of pure Memory-based Vector Processors: temporary
745 variables needed in the computation of intermediate results, which
746 also had to go through memory, put
747 an awfully high artificial load on Memory bandwidth.
748
749 The similarity to SVP64 should be clear: SVP64 Prefixing and the
750 associated REMAP system is just another form of register "tagging"
751 that augments what was formerly designated by its original authors
752 as "just a Scalar ISA", tagging allows for dramatic implicit alteration
753 with advanced behaviour not previously envisaged.
754
755 What Snitch brings to the table therefore is a further illustration of
756 the concept introduced by Extra-V: where Extra-V brought information
757 about Sparse-Distributed Data to the attention of the main CPU in
758 a coherent fashion *without the CPU having to ask for it*, Snitch
759 demonstrates a classic LOAD-COMPUTE-STORE cycle in the same
760 distributed coherent manner, and does so with dramatically-reduced
761 power consumption.
762
763 **Bringing it all together**
764
765 At this point we are well into a future revision of SVP64, one that
766 clearly has some startlingly powerful potential: Supercomputing-class
767 Multi-Issue Vector Engines kept 100% occupied in a 100% long-term
768 sustained fashion with reduced complexity, reduced power consumption
769 and reduced completion time, thanks to Deterministic Coherent Scheduling
770 of the data fed in and out, or even moved down next to Memory.
771
772 This last part is where it normally gets hair-raising, but as ZOLC shows
773 there is no reason at all why even complex algorithms such as MPEG cannot
774 be run in a partially-deterministic manner, and anything that is
775 deterministic can be Scheduled, coherently. Combine that with OpenCAPI
776 which solves the many issues associated with SMP Virtual Memory and so on
777 yet still allows Cache-Coherent Distributed Memory Access, and what was
778 for decades an intractable Computer Science problem begins to
779 look like it has a potential solution.
780
781 It should even be possible to identify which Deterministic Schedules created by ZOLC are
782 suitable for full off-CPU distributed processing, as long as OpenCAPI
783 is integrated into the mix. What a compiler - or even the hardware -
784 will be looking out for is a Basic Block of instructions (sketched below) that:
785
786 * begins with a LOAD (to be handled by OpenCAPI)
787 * contains some instructions that a given PE is capable of executing
788 * ends with a STORE (again: OpenCAPI)
789
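A hedged sketch of such a test, in Python (the block representation and the PE capability set are invented purely for illustration):

```python
PE_CAPABLE = {"add", "mul", "cmp", "and", "or"}   # ops a hypothetical PE supports

def offloadable(basic_block):
    """A candidate block begins with a LOAD, ends with a STORE, and contains
    only operations that a given PE is capable of executing in between."""
    if not basic_block or basic_block[0][0] != "load" or basic_block[-1][0] != "store":
        return False
    return all(op in PE_CAPABLE for op, *_ in basic_block[1:-1])

# offloadable([("load","r6"), ("mul","r8","r6","r7"), ("store","r8")]) -> True
```
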
790 For best results that would be wrapped with a Zero-Overhead Loop
791 (which is offloaded - in full - down to the PE), where
792 the Compiler (or hardware at runtime) could easily identify, in advance,
793 the full range of Memory Addresses that the Loop is to encounter. Copies
794 of loop-invariant data would need to be passed down to the remote PE:
795 again, for simple-enough Basic Blocks, with assistance from the Compiler,
796 loop-invariant inputs are easily identified. Parallel Processing
797 opportunities should also be easy enough to create, simply by farming out
798 different parts of a given Deterministic Zero-Overhead Loop to
799 different PEs based on their proximity, bandwidth or ease of access to
800 given Memory.
801
802 The importance of OpenCAPI in this mix cannot be overstated, because
803 it will be the means by which the main CPU coordinates its activities
804 with the remote PEs, ensuring that LOAD/STORE Memory Hazards are not
805 violated. It should also be straightforward to ensure that the offloading
806 is entirely transparent to the developer, in fact this is a hard requirement
807 because at any given moment there is the possibility that the PEs may be
808 busy and it is the main CPU that has to complete the Processing Task itself.
809
810 It is also important to note that we are not necessarily talking about
811 the Remote PEs executing the Power ISA, but if they do so it becomes
812 much easier for the main CPU to take over in the event that PEs are
813 currently occupied. Plus, the twin lessons - that inventing an ISA, even
814 a small one, is hard (mostly in compiler writing), and that
815 GPU Task Scheduling is complex - are being heard loud and clear.
816
817 Put another way: if the PEs run a foreign ISA, then the Basic Blocks embedded inside the ZOLC Loops must be in that ISA and therefore:
818
819 * In order that the main CPU can execute the same sequence if necessary,
820 the CPU must support dual ISAs: Power and PE **OR**
821 * There must be a JIT binary-translator which either turns PE code
822 into Power ISA code or vice-versa **OR**
823 * The compiler dual-compiles the original source code, and embeds
824 both a Power binary and a PE binary into the ZOLC Basic Block **OR**
825 * All binaries are stored in an Intermediate Representation
826 (LLVM-IR, SPIR-V) and JIT-compiled on-demand.
827
828 All of these would work, but it is simpler and a lot less work
829 just to have the PEs
830 execute the exact same ISA (or a subset of it). If however the
831 concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
832 which would increase adoption and reduce cost, a bit more thought
833 is required here because ARM or Intel or MIPS might not necessarily
834 be happy that a Processing Element (PE) has to execute Power ISA binaries.
835 At least the Power ISA is much richer, more powerful, still RISC,
836 and is an Open Standard, as discussed in earlier sections.
837
838 A reasonable compromise as a JEDEC Standard is illustrated with
839 the following diagram: a 3-way Bridge PHY that allows for full
840 direct interaction between DRAM ICs, PEs, and one or more main CPUs
841 (*a variant of the Northbridge and/or IBM POWER10 OMI-to-DDR5 PHY concept*).
842 It is also the ideal location for a "Management Core".
843 If the 3-way Bridge (4-way if connectivity to other Bridge PHYs
844 is also included) does not itself have PEs built-in then the ISA
845 utilised on any PE or CPU is non-critical. The only concern regarding
846 mixed ISAs is that the PHY should be capable of transferring all and
847 any types of "Management" packets, particularly PE Virtual Memory Management
848 and Register File Control (Context-switch Management given that the PEs
849 are expected to be ALU-heavy and not capable of running a full SMP Operating
850 System).
851
852 There is also no reason why this type of arrangement should not be deployed
853 in Multi-Chip-Module (aka "Chiplet") form, giving all the advantages of
854 the performance boost that goes with smaller line-drivers.
855
856 <img src="/openpower/sv/bridge_phy.svg" width=600 />
857
858 # Transparently-Distributed Vector Processing
859
860 It is very strange to the author to be describing what amounts to a
861 "Holy Grail" solution to a decades-long intractable problem that
862 mitigates the anticipated end of Moore's Law: how to make it easy for
863 well-defined workloads, expressed as a perfectly normal
864 sequential program, compiled to a standard well-known ISA, to have
865 the potential of being offloaded transparently to Parallel Compute Engines,
866 all without the Software Developer being excessively burdened with
867 a Parallel-Processing Paradigm that is alien to all their experience
868 and training, as well as Industry-wide common knowledge.
869
870 Will it be that easy? ZOLC is, honestly, in its current incarnation,
871 not that straightforward: programs
872 have to be "massaged" by tools that insert intrinsics into the
873 source code, in order to identify the Basic Blocks that the Zero-Overhead
874 Loops can run. Can this be merged into standard gcc and llvm
875 compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
876 if an infinite supply of money and engineering time is thrown at it.
877 Is a half-way-house solution of compiler intrinsics good enough?
878 Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
879 for several decades, and advanced programmers are comfortable with the
880 practice.
881
882 Additional questions remain as to whether OpenCAPI, or its use for this
883 particular scenario, requires that the PEs, even quite basic ones,
884 implement a full RADIX MMU and associated TLB lookup. In order to ensure
885 that programs may be cleanly and seamlessly transferred between PEs
886 and CPU the answer is quite likely to be "yes", which is interesting
887 in and of itself. Fortunately, the associated L1 Cache with TLB
888 Translation does not have to be large, and the actual RADIX Tree Walk
889 need not explicitly be done by the PEs, it can be handled by the main
890 CPU as a software-extension: PEs generate a TLB Miss notification
891 to the main CPU over OpenCAPI, and the main CPU feeds back the new
892 TLB entries to the PE in response.
893
894 Also in practical terms, with the PEs anticipated to be so small as to
895 make running a full SMP-aware OS impractical it will not just be their TLB
896 pages that need remote management but their entire register file including
897 the Program Counter will need to be set up, and the ZOLC Context as
898 well. With OpenCAPI packet formats being quite large a concern is that
899 the context management increases latency to the point where the premise
900 of this paper is invalidated. Research is needed here as to whether a
901 bare-bones microkernel
902 would be viable, or a Management Core closer to the PEs (on the same
903 die or Multi-Chip-Module as the PEs) would allow better bandwidth and
904 reduce Management Overhead on the main CPUs. However if
905 the same level of power saving as Snitch (1/6th) and
906 the same sort of reduction in algorithm runtime as ZOLC (20 to 80%) are not
907 unreasonable to expect, this is
908 definitely compelling enough to warrant in-depth investigation.
909
910 **Use-case: Matrix and Convolutions**
911
912 <img src="/openpower/sv/sv_horizontal_vs_vertical.svg" />
913
914 First, some important definitions, because there are two different
915 Vectorisation Modes in SVP64:
916
917 * **Horizontal-First**: (aka standard Cray Vectors) walk
918 through **elements** first before moving to the next **instruction**
919 * **Vertical-First**: walk through **instructions** before
920 moving to the next **element**. Currently managed by `svstep`;
921 ZOLC may be deployed to manage the stepping, in a Deterministic manner (both loop orders are sketched below).
922
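A small Python sketch of the two loop orders referred to above (a conceptual model, not the `svstep` mechanics):

```python
def horizontal_first(program, VL):
    """Standard Cray style: finish all elements of one instruction,
    then move to the next instruction."""
    for instruction in program:
        for element in range(VL):
            instruction(element)

def vertical_first(program, VL):
    """SVP64 Vertical-First: run every instruction on one element,
    then step (`svstep`) to the next element."""
    for element in range(VL):
        for instruction in program:
            instruction(element)
```
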
923 Second:
924 SVP64 Draft Matrix Multiply is currently set up to arrange a Schedule
925 of Multiply-and-Accumulates, suitable for pipelining, that will,
926 ultimately, result in a Matrix Multiply. Normal processors are forced
927 to perform "loop-unrolling" in order to achieve this same Schedule.
928 SIMD processors are further forced into a situation of pre-arranging rotated
929 copies of data if the Matrices are not exactly on a power-of-two boundary.
930
931 The current limitation of SVP64 however is (when Horizontal-First
932 is deployed, at least, which uses the fewest instructions)
933 that both source and destination Matrices have to be in registers,
934 in full. Vertical-First may be used to perform a LD/ST within
935 the loop, covered by `svstep`, but it is still not ideal. This
936 is where the Snitch and EXTRA-V concepts kick in.
937
938 <img src="/openpower/sv/matrix_svremap.svg" />
939
940 Imagine a large Matrix scenario, with several values close to zero that
941 could be skipped: no need to include zero-multiplications, but a
942 traditional CPU can in no way help: only by loading the data through
943 the L1-L4 Cache and Virtual Memory Barriers is it possible to
944 ascertain, retrospectively, that time and power had just been wasted.
945
946 SVP64 is able to do what is termed "Vertical-First" Vectorisation,
947 combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
948 extended, Snitch-style, to perform a deterministic memory-array walk of
949 a large Matrix.
950
951 Let us also imagine that the Matrices are stored in Memory with PEs
952 attached, and that the PEs are fully functioning Power ISA cores with Draft
953 SVP64, but their Multiply capability is not as good as the main CPU's.
954 Therefore:
955 we want the PEs to conditionally
956 feed sparse data to the main CPU, a la "Extra-V".
957
958 * The ZOLC SVREMAP System running on the main CPU generates a Matrix
959 Memory-Load Schedule.
960 * The Schedule is sent to the PEs, next to the Memory, via OpenCAPI
961 * The PEs are also sent the Basic Block to be executed on each
962 Memory Load (each element of the Matrices to be multiplied)
963 * The PEs execute the Basic Block and **exclude**, in a deterministic
964 fashion, any elements containing Zero values
965 * Non-zero elements are sent, via OpenCAPI, to the main CPU, which
966 queues sequences of Multiply-and-Accumulate, and feeds the results
967 back to Memory, again via OpenCAPI, to the PEs.
968 * The PEs, which are tracking the Sparse Conditions, know where
969 to store the results received
970
971 In essence this is near-identical to the original Snitch concept
972 except that there are, like Extra-V, PEs able to perform
973 conditional testing of the data as it goes both to and from the
974 main CPU. In this way a large Sparse Matrix Multiply or Convolution
975 may be achieved without having to pass unnecessary data through
976 L1/L2/L3 Caches only to find, at the CPU, that it is zero.
977
978 The reason in this case for the use of Vertical-First Mode is the
979 conditional execution of the Multiply-and-Accumulate.
980 Horizontal-First Mode is the standard Cray-Style Vectorisation:
981 loop on all *elements* with the same instruction before moving
982 on to the next instruction. Horizontal-First
983 Predication needs to be pre-calculated
984 for the entire Vector in order to exclude certain elements from
985 the computation. In this case, that's an expensive inconvenience
986 (remarkably similar to the problems associated with Memory-to-Memory
987 Vector Machines such as the CDC Star-100).
988
989 Vertical-First allows *scalar* instructions and
990 *scalar* temporary registers to be utilised
991 in the assessment as to whether a particular Vector element should
992 be skipped, utilising a straight Branch instruction *(or ZOLC
993 Conditions)*. The Vertical Vector technique
994 was pioneered by Mitch Alsup and is a key feature of his VVM Extension
995 to the My 66000 ISA. Careful analysis of the registers within the
996 Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
997 *amortise in-flight scalar looped operations into SIMD batches*
998 as long as the loop is kept small enough to entirely fit into
999 in-flight Reservation Stations in the first place.
1000
1001 *<blockquote>
1002 (With thanks and gratitude to Mitch Alsup on comp.arch for
1003 spending considerable time explaining VVM, how its Loop
1004 Construct explicitly identifies loop-invariant registers,
1005 and how that helps Register Hazards and SIMD amortisation
1006 on a GB-OoO Micro-architecture)
1007 </blockquote>*
1008
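A hedged sketch of that Vertical-First skip pattern, as a Python model (the data layout and the zero test are illustrative assumptions):

```python
def vertical_first_sparse_madd(a, b, VL):
    """For each element step: test with scalar code, conditionally issue the
    multiply-accumulate, then move to the next element (the `svstep` role)."""
    acc = 0
    for i in range(VL):                # the element-stepping loop
        x = a[i]                       # scalar load of this element
        if x == 0:                     # plain scalar branch: skip zero work
            continue
        acc += x * b[i]                # the multiply-accumulate actually issued
    return acc
```
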
1009 Draft Image (placeholder):
1010
1011 <img src="/openpower/sv/zolc_svp64_extrav.svg" width=800 />
1012
1013 The program being executed is a simple loop with a conditional
1014 test that ignores the multiply if the input is zero.
1015
1016 * In the CPU-only case (top) the data goes through L1/L2
1017 Cache before reaching the CPU.
1018 * However the PE version does not send zero-data to the CPU,
1019 and even when it does it goes into a Coherent FIFO: no real
1020 compelling need to enter L1/L2 Cache or even the CPU Register
1021 File (one of the key reasons why Snitch saves so much power).
1022 * The PE-only version (see next use-case) the CPU is mostly
1023 idle, serving RADIX MMU TLB requests for PEs, and OpenCAPI
1024 requests.
1025
1026 **Use-case variant: More powerful in-memory PEs**
1027
1028 An obvious variant of the above is that, if there is inherently
1029 more parallelism in the data set, then the PEs get their own
1030 Multiply-and-Accumulate instruction, and rather than send the
1031 data to the CPU over OpenCAPI, perform the Matrix-Multiply
1032 directly themselves.
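
A sketch of this variant, again in plain C and again purely
illustrative: the per-element test is unchanged, but the accumulation
now happens inside the PE itself rather than being forwarded over
OpenCAPI.

    #include <stddef.h>

    /* PE-local variant: the PE has its own Multiply-and-Accumulate, so
       non-zero elements are accumulated in place instead of being sent on */
    static void pe_local_mac(const double *a, const double *b,
                             double *acc, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (a[i] != 0.0 && b[i] != 0.0)
                *acc += a[i] * b[i];         /* executed inside the PE */
    }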
1033
However the source code and binary would be near-identical if
not identical in every respect, with the PEs implementing the full
ZOLC capability in order to compact the binary size to the bare minimum.
1037 The main CPU's role would be to coordinate and manage the PEs
1038 over OpenCAPI.
1039
1040 One key strategic question does remain: do the PEs need to have
1041 a RADIX MMU and associated TLB-aware minimal L1 Cache, in order
1042 to support OpenCAPI properly? The answer is very likely to be yes.
1043 The saving grace here is that with
1044 the expectation of running only hot-loops with ZOLC-driven
1045 binaries, the size of each PE's TLB-aware
L1 Cache needed would be minuscule compared
1047 to the average high-end CPU.
1048
1049 **Comparison of PE-CPU to GPU-CPU interaction**
1050
1051 The informed reader will have noted the remarkable similarity between how
1052 a CPU communicates with a GPU to schedule tasks, and the proposed
1053 architecture. CPUs schedule tasks with GPUs as follows:
1054
1055 * User-space program encounters an OpenGL function, in the
1056 CPU's ISA.
1057 * Proprietary GPU Driver, still in the CPU's ISA, prepares a
1058 Shader Binary written in the GPU's ISA.
* GPU Driver wishes to transfer both the data and the Shader Binary
to the GPU. Both may only be transferred via Shared Memory, usually
DMA over PCIe (assuming a PCIe Graphics Card).
* GPU Driver, which has been running in CPU userspace, notifies CPU
kernelspace of the desire to transfer the data and GPU Shader Binary
to the GPU. A context-switch occurs...
1065
1066 It is almost unfair to burden the reader with further details.
1067 The extraordinarily convoluted procedure is as bad as it sounds. Hundreds
of thousands of tasks per second are scheduled this way, with hundreds
of megabytes of data per second being exchanged as well.
1070
1071 Yet, the process is not that different from how things would work
with the proposed microarchitecture: the differences, however, are key.
1073
1074 * Both PEs and CPU run the exact same ISA. A major complexity of 3D GPU
1075 and CUDA workloads (JIT compilation etc) is eliminated, and, crucially,
1076 the CPU may directly execute the PE's tasks, if needed. This simply
1077 is not even remotely possible on GPU Architectures.
1078 * Where GPU Drivers use PCIe Shared Memory, the proposed architecture
1079 deploys OpenCAPI.
1080 * Where GPUs are a foreign architecture and a foreign ISA, the proposed
1081 architecture only narrowly misses being defined as big/LITTLE Symmetric
1082 Multi-Processing (SMP) by virtue of the massively-parallel PEs
1083 being a bit light on L1 Cache, in favour of large ALUs and proximity
to Memory, and requiring a modest amount of "helper" assistance with
1085 their Virtual Memory Management.
* The proposed architecture has the markup points embedded into the
binary programs
1088 where PEs may take over from the CPU, and there is accompanying
1089 (planned) hardware-level assistance at the ISA level. GPUs, which have to
1090 work with a wide range of commodity CPUs, cannot in any way expect
1091 ARM or Intel to add support for GPU Task Scheduling directly into
1092 the ARM or x86 ISAs!
1093
On this last point it is crucial to note that SVP64 drew its inspiration
from a Hybrid CPU-GPU-VPU paradigm (like ICubeCorp's IC3128) and
1096 consequently has versatility that the separate specialisation of both
1097 GPU and CPU architectures lack.
1098
1099 **Roadmap summary of Advanced SVP64**
1100
1101 The future direction for SVP64, then, is:
1102
1103 * To overcome its current limitation of REMAP Schedules being
1104 restricted to Register Files, leveraging the Snitch-style
1105 register interception "tagging" technique.
1106 * To adopt ZOLC and merge REMAP Schedules into ZOLC
1107 * To bring OpenCAPI Memory Access into ZOLC as a first-level
1108 concept that mirrors Snitch's Coherent Memory interception
1109 * To add the Graph-Node Walking Capability of Extra-V
1110 to ZOLC / SVREMAP
1111 * To make it possible, in a combination of hardware and software,
1112 to easily identify ZOLC / SVREMAP Blocks
1113 that may be transparently pushed down closer to Memory, for
1114 localised distributed parallel execution, by OpenCAPI-aware PEs,
1115 exploiting both the Deterministic nature of ZOLC / SVREMAP
1116 combined with the Cache-Coherent nature of OpenCAPI,
1117 to the maximum extent possible.
1118 * To explore "Remote Management" of PE RADIX MMU, TLB, and
Context-Switching (register file transference) by proxy,
1120 over OpenCAPI, to ensure that the distributed PEs are as
1121 close to a Standard SMP model as possible, for programmers.
1122 * To make the exploitation of this powerful solution as simple
1123 and straightforward as possible for Software Engineers to use,
1124 in standard common-usage compilers, gcc and llvm.
1125 * To propose extensions to Public Standards that allow all of
1126 the above to become part of everyday ubiquitous mass-volume
1127 computing.
1128
1129 Even the first of these - merging Snitch-style register tagging
1130 into SVP64 - would
1131 expand SVP64's capability for Matrices, currently limited to
1132 around 5x7 to 6x6 Matrices and constrained by the size of
1133 the register files (128 64-bit entries), to arbitrary (massive) sizes.
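
As a rough worked example (assuming both source Matrices and the
destination must be held entirely in registers): a 6x6 multiply
occupies 3 x 36 = 108 of the 128 64-bit entries, whereas 7x7 would
already require 3 x 49 = 147 and therefore cannot fit.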
1134
1135 **Summary**
1136
1137 There are historical and current efforts that step away from both a
1138 general-purpose architecture and from the practice of using compiler
intrinsics in general-purpose compute to make programmers' lives easier.
A classic example is the Cell Processor (Sony PS3), which required
programmers to use DMA to schedule processing tasks. These specialist
1142 high-performance architectures are only tolerated for
1143 as long as there is no equivalent performant alternative that is
1144 easier to program.
1145
1146 Combining SVP64 with ZOLC and OpenCAPI can produce an extremely powerful
1147 architectural base that fits well with intrinsics embedded into standard
1148 general-purpose compilers (gcc, llvm) as a pragmatic compromise which makes
it useful right out of the gate. Further R&D may target compiler technology
that brings it on par with NVIDIA, Graphcore, and AMDGPU, but with intrinsics
1151 there is no critical product launch dependence on having such
1152 advanced compilers.
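
Purely as an illustration of that programming model (the function
below is a hypothetical stand-in for a compiler intrinsic, not an
existing gcc or llvm builtin):

    /* hypothetical intrinsic-style declaration: NOT an existing gcc/llvm
       builtin.  On SVP64 hardware a call like this would lower to a short
       REMAP'd instruction sequence, with no auto-vectoriser required. */
    extern void svp64_matmul(double *c, const double *a, const double *b,
                             int m, int k, int n);

    void matmul_6x6(double c[6][6], const double a[6][6], const double b[6][6])
    {
        svp64_matmul(&c[0][0], &a[0][0], &b[0][0], 6, 6, 6);
    }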
1153
The bottom line is that there is a clear roadmap towards solving a
long-standing problem facing Computer Science, and doing so in a way
that reduces power consumption, reduces algorithm completion time, and
reduces the need for complex hardware microarchitectures, in favour of
much smaller distributed coherent Processing Elements that share the
same ISA across the board.
1160
1161 # Appendix
1162
1163 **Samsung PIM**
1164
1165 Samsung's
1166 [Processing-in-Memory](https://semiconductor.samsung.com/emea/newsroom/news/samsung-brings-in-memory-processing-power-to-wider-range-of-applications/)
1167 seems to be ready to launch as a
1168 [commercial product](https://semiconductor.samsung.com/insights/technology/pim/)
1169 that uses HBM as its Memory Standard,
1170 has "some logic suitable for AI", has parallel processing elements,
and offers a 70% reduction
in power consumption and a 2x performance increase in speech
1173 recognition. Details beyond that as to its internal workings
or programmability are minimal; however, given the similarity
to D-Matrix and Google TPU it is reasonable to place it in the
same category.
1177
1178 * [Samsung PIM IEEE Article](https://spectrum.ieee.org/samsung-ai-memory-chips)
1179 explains that there are 9 instructions, mostly FP16 arithmetic,
1180 and that it is designed to "complement" AI rather than compete.
1181 With only 9 instructions, 2 of which will be LOAD and STORE,
1182 conditional code execution seems unlikely.
1183 Silicon area in DRAM is increased by 5% for a much greater reduction
1184 in power. The article notes, pointedly, that programmability will
1185 be a key deciding factor. The article also notes that Samsung has
1186 proposed its architecture as a JEDEC Standard.
1187
1188 **PIM-HBM Research**
1189
1190 [Presentation](https://ieeexplore.ieee.org/document/9073325/) by Seongguk Kim
1191 and associated [video](https://www.youtube.com/watch?v=e4zU6u0YIRU)
showing 3D-stacked DRAM connected to GPUs; it notes that even HBM, due to
the large GPU size, is less advantageous than it should be. Processing-in-Memory
is therefore logically proposed. The PE (named a Streaming Multiprocessor)
is much more sophisticated, comprising a Register File, L1 Cache, FP32 and
FP64 units, and a Tensor Unit.
1197
1198 <img src="/openpower/sv/2022-05-14_11-55.jpg" width=500 />
1199
1200 **etp4hpc.eu**
1201
1202 [ETP 4 HPC](https://etp4hpc.eu) is a European Joint Initiative for HPC,
1203 with an eye towards
1204 [Processing in Memory](https://www.etp4hpc.eu/pujades/files/ETP4HPC_WP_Processing-In-Memory_FINAL.pdf)
1205
1206 **Salient Labs**
1207
1208 [Research paper](https://arxiv.org/abs/2002.00281) explaining
that, using Photonics, they can perform Multiply-and-Accumulate at an
effective clock rate exceeding 14 ghz.
1211
1212 **SparseLNR**
1213
[SparseLNR](https://arxiv.org/abs/2205.11622) accelerates sparse
tensor computations using loop-nest restructuring.
1216
1217 **Additional ZOLC Resources**
1218
1219 * <https://www.researchgate.net/publication/3351728_Zero-overhead_loop_controller_that_implements_multimedia_algorithms>