[[!tag whitepapers]]

**Revision History**

* v0.00 05may2021 first created

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task including simulators and compilers: OpenRISC 1200 took 12 years to mature. Stable *general-purpose* auto-vectorisation compiler support for a Vector or Packed SIMD ISA has never been achieved in the history of computing, not even with the combined resources of ARM, Intel, AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted assembler and direct use of intrinsics is the Industry-standard norm for achieving high-performance optimisation where it matters.*) Rather, GPUs have ultra-specialist compilers (CUDA) that are designed from the ground up to support Vector/SIMD parallelism, and associated standards (SPIR-V, Vulkan, OpenCL) managed by the Khronos Group, sustained by multi-man-century development commitment from multiple billion-dollar-revenue companies.

This raises the question: why on earth would anyone consider this task, and what, in Computer Science, actually needs solving?

First hints are that, whilst memory bitcells have not increased in speed since the 90s (around 150 MHz), increasing the datapath widths has allowed significant apparent speed increases: 3200 MHz DDR4, even faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI all make an effort (all simply increasing the parallel deployment of the underlying 150 MHz bitcells), but these efforts are dwarfed by the two to three orders of magnitude increase in CPU horsepower over the same timeframe. Over two decades ago Seymour Cray, from his amazing in-depth knowledge, predicted that this mismatch would become a serious limitation. Some systems at the time of writing are now approaching a *Gigabyte* of L4 Cache by way of compensation, and, as we know from experience, even that will be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or directly integrated into, the memory have traditionally not gone well. Aspex Microelectronics and Elixent are parallel-processing companies that very few have heard of, because their software stacks were so specialist that they required heavy investment by customers to utilise. D-Matrix and Graphcore are a modern incarnation of the exact same "specialist parallel processing" mistake, betting heavily on AI with Matrix and Convolution Engines that can do no other task. Aspex only survived by being bought by Ericsson, where its specialised suitability for massive wide Baseband FFTs saved it from going under. The huge risk is that any "better AI mousetrap" created by an innovative competitor will quickly render both D-Matrix's and Graphcore's approaches obsolete.

NVIDIA and other GPUs have taken a different approach again: massive parallelism, with a more Turing-complete ISA in each core, and dedicated slower parallel memory paths (GDDR5) suited to the specific tasks of 3D, Parallel Compute and AI. The complexity of this approach is only dwarfed by the amount of money poured into the software ecosystem in order to make it accessible, and even then, GPU Programmers are a specialist, rare (and expensive) breed.
A second hint as to the answer emerges from the article "[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)", which illustrates a catastrophic rabbit-hole taken by the Industry Giants ARM, Intel and AMD since the 90s (over 3 decades): SIMD, an O(N^6) opcode-proliferation nightmare whose mantra, "make it easy for hardware engineers, let software sort out the mess", has literally overwhelmed programmers. Specialists charging clients for assembly-code Optimisation Services are finding that AVX-512, to take an example, is anything but optimal: overall performance of AVX-512 actually *decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over thirty years ago. However, only the NEC SX Aurora truly kept the Cray Vector flame alive, until RISC-V RVV, then SVP64, and recently MRISC32 joined it. ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl` instruction that makes for a truly ubiquitous Vector ISA) in ways that will become apparent over time as adoption increases. In the meantime programmers are, in direct violation of ARM's advice on how to use SVE2, trying desperately to use it as if it were Packed SIMD NEON. The advice not to create SVE2 assembler that is hardcoded to fixed widths is being disregarded, in favour of writing *multiple identical implementations* of a function, each with a different hardware width, and compelling software to choose one at runtime after probing the hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders for reviving Cray Vectors, has severe performance and implementation limitations that are only really apparent to exceptionally experienced assembly-level developers with wide and diverse experience across multiple ISAs: one of the best and clearest critiques is a [ycombinator post](https://news.ycombinator.com/item?id=24459041) by adrian_b, who logically and concisely points out that the fundamental design assumptions and simplifications that went into the RISC-V ISA have an irrevocably damaging effect on its viability for high-performance use.

That is not to say that its use in low-performance embedded scenarios is not ideal: in private, custom, secretive commercial usage it is perfect. Trinamic, an early adopter, is a classic case study: their TMC2660 Stepper IC replaced ARM with RISC-V, saving them USD 1 in licensing royalties per product. Ubiquitous, common everyday usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not so much. Even though RISC-V has Cray-style Vectors, the whole ISA is, unfortunately, fundamentally flawed as far as power-efficient high performance is concerned.

Slowly, at this point, a realisation should be sinking in: there are actually very few truly viable Vector ISAs out there, because the ones that are evolving in the general direction of Vectorisation are, in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler support for, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have or have had, at one point in time, a decent Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores, but with Oracle's reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches. Exceptions: Ingenic with embedded CPUs, and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily litigiously protected.
* ARM, well known in embedded and smartphone scenarios, very slowly making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by two of the top three top500.org supercomputers, each using around 2 million IBM POWER9 Cores.
* ARC, a competitor at the time to ARM, best known for its use in the Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development and expanding rapidly in an uncontrolled fashion, set on an unstoppable and inevitable trainwreck path to replicate the opcode-conflict nightmare that plagued the Power ISA two decades ago.
* Tensilica, Andes STAR and Western Digital, examples of successful commercial proprietary ISAs: Tensilica in Baseband Modems, Andes in Audio DSPs, WD in HDDs and SSDs. These are all astoundingly commercially successful multi-billion-unit mass-volume markets that almost nobody knows anything about; included for completeness.

In order of least controlled to most controlled, the viable candidates for further advancement are:

* OpenRISC 1200: not controlled or restricted by anyone, but with no patent protection.
* RISC-V: touted as "Open" but actually strictly controlled under Trademark License, and too new to have adequate patent-pool protection, as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others: simply no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members of the OpenPOWER Foundation, covered by Trademarks, permitting and encouraging contributions, and having software support stretching back over 20 years.
* ARM: not permitting Open Licensing. ARM survived the early 90s only by doing a deal with Samsung for an in-perpetuity Royalty-free License, in exchange for GBP 3 million and legal protection through Samsung Research. Several large Corporations (Apple most notably) have licensed the ISA but not ARM's designs: the barrier to entry is high, and the ISA itself is protected from interference as a result.
* x86: famous for an unprecedented Court Ruling in 2004 where a Judge "banged heads together" and ordered AMD and Intel to stop wasting his time, make peace, and cross-license each other's patents. Anyone wishing to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86, and the VIA EDEN processors, and see how they fared.
* s390: IBM's mainframe ISA. Nowhere near as well-known as the x86 lawsuits, but the 800lb-Gorilla Syndrome seems not to have deterred one particularly disingenuous group from performing illegal Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to base a Vector Supercomputing-class Extension on?", where patent protection, software ecosystem, openness and pedigree all combine to reduce risk and increase the chances of success, there is really only one candidate.

**Of all of these, the one with the most going for it is the Power ISA.**

The summary of advantages of the Power ISA, then, is that:

* It has a 25-year software ecosystem, with RHEL, Fedora, Debian and more.
* IBM's extensive 20+ year patent portfolio is available, royalty-free, to protect implementors, as long as they are also members of the OpenPOWER Foundation.
* IBM designed and has maintained the Power ISA as a Supercomputing-class ISA from its inception.
* Coherent distributed memory access is possible through OpenCAPI.
* Extensions to the Power ISA may be submitted through an External RFC Process that does not require membership of the OPF.

From this strong base, the next step is: how to leverage this foundation to take a leap forward in performance and performance/watt, *without* losing all the advantages of a ubiquitous software ecosystem, the lack of which has historically plagued other systems and relegated them to a risky niche market?

# How do you turn a Scalar ISA into a Vector one?

The most obvious question before that is: why on earth would you want to? As explained in the "SIMD Considered Harmful" article, Cray-style Vector ISAs break the link between data-element batches and the underlying architectural back-end parallel processing capability. Packed SIMD explicitly smashes that width right in the face of the programmer and expects them to like it. As the article immediately demonstrates, an arbitrary-sized data set has to contend with an insane power-of-two Packed SIMD cascade at both setup and teardown that routinely adds literally an order of magnitude to the number of hand-written lines of assembler, compared to a well-designed Cray-style Vector ISA with a `setvl` instruction.

Assuming then that variable-length Vectors are obviously desirable, it becomes a matter of how, not if. Both Cray and the NEC SX Aurora went the way of adding explicit Vector opcodes, a style which RVV copied and modernised. In the case of RVV this introduced 192 new instructions on top of an existing 95+ for base RV64GC. Adding 200% more instructions than the base ISA seems unwise; at the very least, it feels like there should be a better way. On close inspection of RVV as an example, the basic arithmetic operations are massively duplicated: scalar-scalar from the base is joined by both scalar-vector and vector-vector variants, *and* predicate-mask management, *and* transfer instructions between all of the same, which goes a long way towards explaining why there are twice as many Vector instructions in RISC-V as there are in the RV64GC Scalar base.

The question then becomes: with all the duplication of arithmetic operations existing just to make the registers scalar or vector, why not leverage the *existing* Scalar ISA with some sort of "context" or prefix that augments its behaviour? The Instruction Decode phase is then greatly simplified, reducing design complexity and leaving plenty of headroom for further expansion.

Remarkably, this is not a new idea. Intel's x86 `REP` instruction gives the base concept, but it was Peter Hsu, the designer of the MIPS R8000, who in 1994 first came up with the idea of Vector-augmented prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine, the prefix would mark which of the registers were to be treated as Scalar and which as Vector. Then, treating the Scalar "suffix" instruction as a guide, and making "scalar instruction" synonymous with "Vector element", the hardware would perform a `REP`-like loop that jammed multiple scalar operations into the Multi-Issue Execution Engine. The only reason that the team did not take this forward into a commercial product was because they could not work out how to cleanly do OoO multi-issue at the time.
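To make the concept concrete, below is a minimal behavioural sketch in C of such a prefixed element loop. It is a model of the idea only, not actual SVP64 semantics or mnemonics: the function names and the per-operand scalar/vector flags are illustrative assumptions.

```c
#include <stdint.h>

uint64_t gpr[128];   /* GPR register file (SVP64 extends it to 128 entries) */
int      vl;         /* Vector Length, set by a setvl-style instruction     */

/* The ordinary scalar "suffix" instruction: it already exists in the
 * Scalar ISA and is not altered in any way. */
static void scalar_add(int rt, int ra, int rb) {
    gpr[rt] = gpr[ra] + gpr[rb];
}

/* The prefix marks each register operand as Scalar or Vector.  The
 * hardware then re-issues the *same* scalar operation VL times,
 * stepping only the operands marked Vector: "scalar instruction"
 * becomes synonymous with "Vector element". */
static void prefixed_add(int rt, int ra, int rb,
                         int rt_vec, int ra_vec, int rb_vec) {
    for (int i = 0; i < vl; i++)
        scalar_add(rt + (rt_vec ? i : 0),
                   ra + (ra_vec ? i : 0),
                   rb + (rb_vec ? i : 0));
}
```

A vector-scalar add, for example, is then simply `prefixed_add(rt, ra, rb, 1, 1, 0)`: no new arithmetic opcode is needed, only the prefix.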
In its simplest form, then, this "prefixing" idea is a matter of:

* Defining the format of the prefix
* Adding a `setvl` instruction
* Adding Vector-context SPRs and working out how to do context-switches with them
* Writing an awful lot of Specification Documentation (4 years and counting)

Once the basics of this concept have sunk in, early advancements quickly follow naturally from analysis of the problem-space:

* Expanding the size of the GPR, FPR and CR register files to provide 128 entries in each. This is a bare minimum for GPUs in order to keep processing workloads as close to a LOAD-COMPUTE-STORE batching as possible.
* Predication (an absolutely critical component of a Vector ISA), where the next logical advancement is to allow separate predicate masks to be applied to *both* the source *and* the destination, independently.
* Element-width overrides: most Scalar ISAs today are 64-bit only, with primarily Load and Store being able to handle 8/16/32/64 and sometimes 128-bit (quad-word), whereas Vector ISAs need to go as low as 8-bit arithmetic, even 8-bit Floating-Point, for high-performance AI. Rather than waste opcode space adding all such operations at different bitwidths, let the prefix *redefine* the element width.
* "Reordering" of the assumption of linear sequential element access, for Matrices, rotations, transposition, Convolutions, DCT, FFT, Parallel Prefix-Sum and other common transformations that require significant programming effort in other ISAs.

All of these things come entirely from "Augmentation" of the Scalar operation being prefixed: at no time is the Scalar operation itself significantly altered. From there, several more "Modes" can be added, including saturation, which is needed for Audio and Video applications, and "Reverse Gear", which runs the Element Loop in reverse order (needed for Prefix Sum), and more.

**What is missing from the Power Scalar ISA that a Vector ISA needs?**

Remarkably, very little: the devil is in the details, though.

* The traditional `iota` instruction may be synthesised with an overlapping add that stacks up incrementally and sequentially. Although it requires two instructions (one to start the sum-chain), the technique has the advantage of allowing increments by arbitrary amounts, and is not limited to addition, either.
* Big-integer addition (arbitrary-precision arithmetic) is an emergent characteristic of the carry-in, carry-out capability of the Power ISA `adde` instruction: `sv.adde` as a BigNum add naturally emerges from the sequential chaining of these scalar instructions.
* The Condition Register Fields of the Power ISA make a great candidate for use as Predicate Masks, particularly when combined with Vectorised `cmp` and Vectorised `crand`, `crxor` etc.

It is only when looking slightly deeper into the Power ISA that certain things turn out to be missing, and this is down in part to IBM's primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or so Scalar ones. One example: transfer operations between the Integer and Floating-Point Scalar register files were dropped approximately a decade ago, after the Packed SIMD variants were considered to make them duplicates. With it being completely inappropriate to attempt to Vectorise a Packed SIMD ISA designed 20 years ago with no Predication of any kind, the Scalar ISA, a much better all-round candidate for Vectorisation, is left anaemic.

A particularly key instruction that is missing is `MV.X`, which may be illustrated as `GPR(dest) = GPR(GPR(src))`.
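A rough C model of its semantics, in both scalar and Vectorised forms, follows. This is a behavioural sketch only: the register-file size and the index-wrapping behaviour are illustrative assumptions, not defined Power ISA semantics.

```c
#include <stdint.h>

uint64_t gpr[128];
int      vl;

/* Scalar MV.X: one level of register-file indirection,
 * GPR(dest) = GPR(GPR(src)). */
static void mv_x(int dest, int src) {
    gpr[dest] = gpr[gpr[src] & 127];    /* index assumed to wrap */
}

/* Vectorised MV.X: an arbitrary in-register gather.  Element i of
 * the src Vector names the register whose value lands in dest[i]. */
static void sv_mv_x(int dest, int src) {
    for (int i = 0; i < vl; i++)
        gpr[dest + i] = gpr[gpr[src + i] & 127];
}
```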
Causing a huge swathe of Register Hazards in one single hit, this horrendously expensive instruction is almost never added to a Scalar ISA, but is almost always added to a Vector one. When `MV.X` is Vectorised it allows for arbitrary remapping of elements within a Vector to positions specified by another Vector. A typical Scalar ISA will use Memory to achieve this task, but with Vector ISAs the Vector Register Files are usually so enormous, and so far away from Memory, that it is easier and more efficient, architecturally, to provide these Indexing instructions directly. Fortunately, with the ISA Working Group being willing to consider RFCs (Requests For Change), these omissions have the potential to be corrected.

One deliberate decision in SVP64 involves Predication. Typical Vector ISAs have quite comprehensive arithmetic and logical operations on Predicate Masks, and if CR Fields were the only predicates in SVP64 there would be pressure to start adding the exact same arithmetic and logical operations that already exist in the Integer opcodes. Instead of taking that route, the decision was made to allow *both* Integer *and* CR Fields to be used as Predicate Masks, and to create Draft instructions that provide better transfer capability between the CR Fields and the Integer Register files (a minimal model of Integer-mask predication is sketched below).

Beyond that, further extensions to the Power ISA become much more domain-specific, such as adding bitmanipulation for Audio, Video and Cryptographic use-cases, or adding Transcendentals (`LOG1P`, `ATAN2` etc.) for 3D and other GPU workloads. The huge advantage of the SVP64 "Prefix" approach here is that anything added to the Scalar ISA is *automatically* and inherently added to the Vector one as well, and because these GPU and Video opcodes have been added to the CPU ISA, Software Driver development and debugging is dramatically simplified.

Which brings us to the next important question: how are any of these CPU-centric Vector-centric improvements relevant to power efficiency and making more effective use of resources?

# Simpler, more compact programs save power

The first and most obvious saving is that, just as with any Vector ISA, the amount of data processing requested and controlled by each instruction is enormous, leaving the Decode and Issue Engines, as well as the L1 I-Cache, largely idle. With programs being smaller, chances are higher that they fit into L1 Cache, or that the L1 Cache may be made smaller.

Even a Packed SIMD ISA can take limited advantage of a higher bang-per-buck for specific workloads, as long as the stripmining setup and teardown is not required. However, a 2-wide Packed SIMD instruction is nowhere near as high a bang-per-buck ratio as a 64-wide Vector Length. Realistically, for general use-cases, the Packed SIMD setup and teardown is extremely common: `strncpy` for VSX is an astounding 240 hand-coded assembler instructions, where it is around 12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling for massive FFTs), the L1 I-Cache becomes completely ineffective, and in the case of the IBM POWER9, due to a little-known design flaw not normally otherwise encountered, this results in contention between the L1 D and I Caches at the L2 Bus, slowing down execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist) requires loop-unrolling to contend with non-power-of-two Matrix sizes: SVP64 does not, as hinted at below.
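As promised above, here is a minimal C model of Integer-mask predication applied to the element loop. It is a behavioural sketch only: the skip-on-zero policy shown (leaving masked-out destination elements untouched) is one of several possible masking behaviours, and the names are illustrative assumptions.

```c
#include <stdint.h>

uint64_t gpr[128];
int      vl;

/* Element loop with an Integer Predicate Mask: bit i of the mask
 * register decides whether element i is computed at all.  In this
 * simple model, masked-out elements consume no ALU work and the
 * corresponding destination elements are left untouched. */
static void sv_add_masked(int rt, int ra, int rb, int rmask) {
    uint64_t mask = gpr[rmask];
    for (int i = 0; i < vl; i++) {
        if (!((mask >> i) & 1))
            continue;                         /* predicated out */
        gpr[rt + i] = gpr[ra + i] + gpr[rb + i];
    }
}
```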
Additional savings come in the form of `SVREMAP`, a hardware index-transformation system where the normally sequentially-linear element access may be "Re-Mapped" to a limited but algorithmically-tailored set of commonly-used deterministic schedules, for example Matrix Multiply, DCT, or FFT. A full in-register-file 5x7 Matrix Multiply, or a 3x4 or 2x6, may be performed in as few as 4 instructions, one of which is to zero-initialise the accumulator Vector used to store the result. If addition to another Matrix is also required then it is only three instructions.

Not only that, but because the "Schedule" is an abstract concept separated from the mathematical operation, there is no reason why Matrix Multiplication Schedules may not be applied to Integer Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical AND-and-OR, or any other future instruction, such as Complex-Number Multiply-and-Accumulate, that a future version of the Power ISA might support. The flexibility is not only enormous, but the compactness unprecedented. RADIX2 in-place DCT triple-loop Schedules may be created in around 11 instructions. The only other processors well-known to have this type of compact capability are both VLIW DSPs, TI's TMS320 Series and Qualcomm's Hexagon, and both are targeted at FFTs only.

There is no reason at all why future algorithmic schedules should not be proposed as extensions to SVP64 (sorting algorithms, compression algorithms, Sparse Data Sets, Graph Node walking, for example). Bear in mind that the submission process will be entirely at the discretion of the OpenPOWER Foundation ISA WG, and that such submissions are both encouraged and welcomed by the OPF.

One of SVP64's current limitations is that it was initially designed for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in a heavy focus on adding hardware-for-loops onto the *Registers*. After more than three years of development the realisation hit that the SVP64 concept could be expanded to Coherent Distributed Memory. This astoundingly powerful concept is explored in the next section.

# Coherent Deterministic Hybrid Distributed Memory-Processing

It is not often that a heading in an article can legitimately contain quite so many comically-chained buzzwords, but in this section it is justified. As hinted at in the first section, the last time that memory was the same speed as processors was the Pentium III and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and CPUs ran at about the same rate. DRAM bitcells *simply cannot exceed these rates*, yet the pressure from Software Engineers is to make *sequential* algorithm processing faster and faster, because parallelising algorithms is simply too difficult to master, and always has been. Thus, whilst DRAM has had to go parallel (like RAID striping) to keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates, with an astonishing four levels of cache (L1 to L4).

It should therefore come as no surprise that attempts are being made to move (distribute) processing closer to the DRAM Memory, firmly on the *opposite* side of the main CPU's L1/2/3/4 Caches. However the alarm bells ring here at the keyword "distributed", because by moving the processing down next to the Memory, the speed of each of the parallel Processing Elements drops by almost two orders of magnitude, and, for purely pragmatic reasons, their complexity has to drop by several orders of magnitude as well.
Things that the average "sequential algorithm" programmer takes for granted, such as SMP, Cache Coherency, Virtual Memory, and spinlocks (atomic locking), are either outright gone or are things which the programmer is expected to explicitly contend with (even if that programmer is the Compiler Developer).
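To make concrete just how much is taken for granted, below is a minimal C11 spinlock. Every line of it silently assumes hardware Cache Coherency between cores: on a distributed Processing Element with no such coherency there is no shared view of `lock`, and this primitive simply cannot be built this way. (The sketch is illustrative only, not part of any SVP64 specification.)

```c
#include <stdatomic.h>

/* A minimal spinlock: correct only because SMP hardware keeps every
 * core's view of `lock` coherent.  Remove Cache Coherency and the
 * atomic read-modify-write below has no meaning across cores. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```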