# Why in the 2020s would you invent a new Vector ISA
*(The short answer: you don't. Extend existing technology: on the shoulders of giants)*
Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics are the Industry-standard norm
to achieve high-performance optimisation where it matters*).
GPUs fill this void both in hardware and software terms by having
ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
ARM, Intel, AMD, since the 90s (over 3 decades) whereby SIMD, an
Order(N^6) opcode proliferation nightmare, with its mantra "make it
easy for hardware engineers, let software sort out the mess" literally
overwhelming programmers with thousands of instructions. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
* RISC-V, touted as "Open" but actually strictly controlled under
Trademark License: too new to have adequate patent pool protection,
as evidenced by multiple adopters having been hit by patent lawsuits.
  (Agreements between RISC-V *Members* to not engage in patent litigation
  do nothing to stop third party patents that *legitimately pre-date*
  the newly-created RISC-V ISA)
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
of the OpenPOWER Foundation, covered by Trademarks, permitting
* x86: anyone wishing to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
but the 800lb "Corporate Gorilla Syndrome" seems not to have deterred one
particularly disingenuous group from performing illegal
Reverse-Engineering.
*Packed SIMD looped algorithms actually have to
contain multiple implementations processing fragments of data at
different SIMD widths: Cray-style Vectors have just the one, covering not
just current architectural implementations but future ones with
wider back-end ALUs as well.*
All of these things come entirely from "Augmentation" of the Scalar operation
being prefixed: at no time is the Scalar operation significantly
altered.
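To make the "prefix means loop" principle concrete, here is a minimal Python model (all names are invented for illustration; this is a conceptual sketch, not the actual SVP64 encoding or semantics): a hypothetical `sv.add` is nothing more than the unmodified scalar add, repeated VL times over sequential registers.

```python
# Conceptual sketch only: an SVP64 prefix acts as a hardware for-loop
# wrapped around an unmodified scalar operation.

def scalar_add(a, b):
    # the scalar operation itself is never altered by the prefix
    return (a + b) & 0xFFFFFFFFFFFFFFFF  # 64-bit wrap-around

def sv_add(regs, rt, ra, rb, vl):
    # the "Augmentation": repeat the scalar op VL times over
    # sequential registers, exactly as a Cray-style Vector would
    for i in range(vl):
        regs[rt + i] = scalar_add(regs[ra + i], regs[rb + i])

regfile = list(range(32))          # toy register file
sv_add(regfile, 16, 0, 8, vl=4)    # regs 16..19 = regs 0..3 + regs 8..11
```

Note that the scalar helper is untouched by the Vector loop: this is the entire point of prefixing.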
From there, several more "Modes" can be added, including:

* saturation, which is needed for Audio and Video applications
* "Reverse Gear", which runs the Element Loop in reverse order
  (needed for Prefix Sum)
* Data-dependent Fail-First, which emerged from asking the simple
  question, "If modern Vector ISAs have Load/Store Fail-First,
  and the Power ISA has Condition Codes, why not make Conditional
  early-exit from Arithmetic operation looping?"
* over 500 Branch-Conditional Modes, which emerge from application of
  Boolean Logic in a Vector context, on top of an already-powerful
  Scalar Branch-Conditional/Counter instruction
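A toy Python model shows how three of these Modes layer onto the very same element loop (illustrative only, assuming 16-bit unsigned saturation and invented condition functions; not the actual SVP64 semantics):

```python
# Illustrative model of SVP64-style Modes on one element loop.

def clamp16(x):
    # saturation Mode: clamp to the 16-bit range instead of wrapping
    return max(0, min(x, 0xFFFF))

def sv_loop(op, dst, a, b, vl, saturate=False, reverse=False,
            failfirst=None):
    # "Reverse Gear" Mode runs the Element Loop backwards
    order = range(vl - 1, -1, -1) if reverse else range(vl)
    for i in order:
        result = op(a[i], b[i])
        if saturate:
            result = clamp16(result)
        # Data-dependent Fail-First: stop at the first element whose
        # result fails the condition, truncating VL
        if failfirst is not None and not failfirst(result):
            return i
        dst[i] = result
    return vl

# saturating add: 90000 clamps to 0xFFFF rather than wrapping
sat = [0] * 4
sv_loop(lambda x, y: x + y, sat, [1, 2, 90000, 4], [0] * 4, 4,
        saturate=True)

# fail-first subtract: stops as soon as a result goes non-positive
ff = [0] * 4
new_vl = sv_loop(lambda x, y: x - y, ff, [5, 3, 1, 7], [2] * 4, 4,
                 failfirst=lambda r: r > 0)
```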
**What is missing from Power Scalar ISA that a Vector ISA needs?**
emergent characteristic from the carry-in, carry-out capability of
Power ISA `adde` instruction. `sv.adde` as a BigNum add
naturally emerges from the
sequential carry-flag chaining of these scalar instructions.
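The carry-chaining can be sketched in a few lines of Python (a conceptual model, not the Power ISA pseudocode): each 64-bit limb addition consumes the carry-out of the previous one, which is exactly what sequential element execution of `adde` provides.

```python
# Sketch: big-integer add emerging from carry-chained scalar adde.

MASK = (1 << 64) - 1

def adde(ra, rb, carry_in):
    # scalar adde: RT = RA + RB + CA, producing a carry-out
    total = ra + rb + carry_in
    return total & MASK, total >> 64

def sv_adde(a_limbs, b_limbs):
    # sequential chaining: one Vectorised instruction, VL limb-adds
    out, ca = [], 0
    for ra, rb in zip(a_limbs, b_limbs):
        limb, ca = adde(ra, rb, ca)
        out.append(limb)
    return out, ca

# 128-bit add, little-endian limbs: (2^64 - 1) + 1 carries into limb 1
limbs, carry = sv_adde([MASK, 0], [1, 0])
```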
* The Condition Register Fields of the Power ISA make a great candidate
for use as Predicate Masks, particularly when combined with
Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
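The effect of a per-element Predicate Mask can be sketched as follows (a simplified model with invented names: one mask bit per element, with the merging/zeroing distinction hedged as an assumption rather than the exact SVP64 behaviour):

```python
# Sketch of a predicated element loop: one mask bit per element.

def sv_add_masked(dst, a, b, vl, mask, zeroing=False):
    for i in range(vl):
        if (mask >> i) & 1:
            dst[i] = a[i] + b[i]
        elif zeroing:
            dst[i] = 0   # zeroing: masked-out elements are cleared
        # otherwise masked-out elements are left untouched (merging)

d = [99, 99, 99, 99]
sv_add_masked(d, [1, 2, 3, 4], [10, 20, 30, 40], 4, mask=0b0101)
```

Whether the mask bits come from an Integer register or from CR Fields, the loop itself is identical, which is why allowing both costs so little.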
One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
side of the Power ISA already has most of them.
If CR Fields were the only predicates in SVP64,
it would create pressure to add the exact same arithmetic and logical
operations that already exist in the Integer opcodes, which is less
than desirable.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between CR Fields
and controlled by each instruction is enormous, and leaves the
Decode and Issue Engines idle, as well as the L1 I-Cache. With
programs being smaller, chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller: either way
is a considerable O(N^2) power-saving.
Even a Packed SIMD ISA could take limited advantage of a higher
bang-per-buck for limited specific workloads, as long as the
Additional savings come in the form of `SVREMAP`. This is a hardware
index transformation system where the normally sequentially-linear
Vector element access may be "Re-Mapped" to limited but algorithmically-tailored
commonly-used deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply or a 3x4 or
2x6 with optional *in-place* transpose, mirroring or rotation
on any source or destination Matrix
may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions. Not only that, but because the "Schedule" is an abstract
the SVP64 concept could be expanded to Coherent Distributed Memory,
This astoundingly powerful concept is explored in the next section.
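The idea behind REMAP can be modelled in Python (an illustrative sketch: the index-generation details here are invented for clarity, not the actual SVSHAPE encoding): the element indices fed to a single Vectorised multiply-add are produced by a deterministic triple-nested-loop Schedule, so one instruction walks an entire Matrix Multiply.

```python
# Conceptual model of a REMAP Matrix-Multiply Schedule.

def matmul_schedule(rows, size, cols):
    # the deterministic triple nested loop that REMAP hardware walks
    for i in range(rows):
        for j in range(cols):
            for k in range(size):
                # (dest index, source A index, source B index)
                yield i * cols + j, i * size + k, k * cols + j

def sv_madd_remap(dst, a, b, rows, size, cols):
    # one "instruction": multiply-accumulate over the whole Schedule
    for d, x, y in matmul_schedule(rows, size, cols):
        dst[d] += a[x] * b[y]

A = [1, 2, 3, 4]        # 2x2 Matrix, flattened row-major
B = [5, 6, 7, 8]
C = [0] * 4             # accumulator, pre-zeroed (the extra instruction)
sv_madd_remap(C, A, B, 2, 2, 2)
```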
# Coherent Deterministic Hybrid Distributed In-Memory Processing
It is not often that a heading in an article can legitimately
contain quite so many comically-chained buzzwords, but in this section
It should therefore come as no surprise that attempts are being made
to move (distribute) processing closer to the DRAM Memory, firmly
on the *opposite* side of the main CPU's L1/2/3/4 Caches,
where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
why this approach is compelling. However
the alarm bells ring here at the keyword "distributed", because by
moving the processing down next to the Memory, even onto
the same die as the DRAM, the speed of any
**ZOLC: Zero-Overhead Loop Control**
Zero-Overhead Looping is the concept of automatically running a set sequence
of instructions a predetermined number of times, without requiring
a branch. This is slightly different from using Power ISA `bc` in `CTR`
(Counter) Mode, because in ZOLC the branch-back is automatic.

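A toy comparison makes the distinction clear (illustrative only: instruction counts are modelled, not measured, and the loop bodies are stand-ins):

```python
# Counter-based branch loop vs Zero-Overhead Looping.

def ctr_loop(body, n):
    # Power-ISA-style `bc` in CTR Mode: an explicit branch instruction
    # must be fetched, decoded and executed on every iteration
    executed, ctr = 0, n
    while True:
        body()
        executed += 1       # the loop body instruction
        executed += 1       # the branch-back (bdnz) itself
        ctr -= 1
        if ctr == 0:
            break
    return executed

def zolc_loop(body, n):
    # ZOLC: hardware repeats the marked sequence n times; no branch
    # instruction occupies fetch/decode slots
    executed = 0
    for _ in range(n):
        body()
        executed += 1       # only the loop body is "executed"
    return executed

counts = (ctr_loop(lambda: None, 100), zolc_loop(lambda: None, 100))
```

For a one-instruction loop body the branch is half of all instructions executed, which is why eliminating it matters so much.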
The simplest and longest-lived commercially successful deployment of Zero-Overhead Looping
has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
within the VLIW word may be repeatedly deployed on successive clock
less instructions executed is an almost unheard-of level of optimisation:
most ISA designers are elated if they can achieve 5 to 10%. The reduction
was so compelling that ST Microelectronics put it into commercial
production in one of their embedded CPUs, the ST120 DSP-MCU.
The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
design of its triple-nested for-loop system
OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
has OpenCAPI Memory interfaces, and requires an OMI-to-DDR4/5 Bridge PHY
to connect to standard DIMMs.
Extra-V appears to be a remarkable research project based on OpenCAPI that,
by assuming that the map of edges (excluding the actual data)
in any given arbitrary data graph
could be kept by the main CPU in-memory, could distribute and delegate
a limited-capability deterministic but most importantly *data-dependent*
node-walking schedule actually right down into the memory itself (on the other side of that L1-4 cache barrier). A miniature processor
currently occupied. Plus, the twin lessons that inventing ISAs, even
a small one, is hard (mostly in compiler writing) and how complex
GPU Task Scheduling is, are being heard loud and clear.
Put another way: if the PEs run a foreign ISA, then the Basic Blocks
embedded inside the ZOLC Loops must be in that ISA and therefore:

* In order that the main CPU can execute the same sequence if necessary,
the CPU must support dual ISAs: Power and PE **OR**
* There must be a JIT binary-translator which either turns PE code
It is very strange to the author to be describing what amounts to a
"Holy Grail" solution to a decades-long intractable problem that
mitigates the anticipated end of Moore's Law: how to make it easy for
well-defined workloads, expressed as a perfectly normal
sequential program, compiled to a standard well-known ISA, to have
the potential of being offloaded transparently to Parallel Compute Engines,
all without the Software Developer being excessively burdened with
a Parallel-Processing Paradigm that is alien to all their experience
and training, as well as Industry-wide common knowledge.
Will it be that easy? ZOLC is, honestly, in its current incarnation,
not that straightforward: programs