[[!tag whitepapers]]

**Revision History**

* First revision 05may2021

**Table of Contents**

[[!toc]]

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task, including simulators and compilers: OpenRISC 1200 took 12 years to mature. A Vector or Packed SIMD ISA reaching stable general-purpose auto-vectorisation compiler support has never been achieved in the history of computing, not with the combined resources of ARM, Intel, AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. Rather: GPUs have ultra-specialist compilers that are designed from the ground up to support Vector/SIMD parallelism, and associated standards managed by the Khronos Group, with multiple man-centuries of development commitment from multiple billion-dollar-revenue companies to sustain them.

This raises the question: why on earth would anyone consider this task, and what, in Computer Science, actually needs solving?

The first hints are that, whilst memory bitcells have not increased in speed since the 90s (around 150 MHz), increasing the datapath widths has allowed significant apparent speed increases: 3200 MHz DDR4 and even faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI, all make an effort (all simply increasing the parallel deployment of the underlying 150 MHz bitcells), but these efforts are dwarfed by the two (nearly three) orders of magnitude increase in CPU horsepower over the same period. Seymour Cray, from his amazing in-depth knowledge, predicted over two decades ago that this mismatch would become a serious limitation. Some systems, at the time of writing, are now approaching a *Gigabyte* of L4 Cache by way of compensation, and, as we know from experience, even that will be considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or directly integrating it into, the memory have traditionally not gone well. Aspex Microelectronics and Elixent are parallel processing companies that very few have heard of, because their software stacks were so specialist that they required heavy investment by customers to utilise. D-Matrix and Graphcore are a modern incarnation of the exact same "specialist parallel processing" mistake, betting heavily on AI with Matrix and Convolution Engines that can do no other task. Aspex only survived by being bought by Ericsson, where its specialised suitability for massive wide Baseband FFTs saved it from going under. Any "better AI mousetrap" that comes along will quickly render both D-Matrix and Graphcore obsolete.

NVIDIA and other GPUs have taken a different approach again: massive parallelism, with a more Turing-complete ISA in each core, and dedicated slower parallel memory paths (GDDR5) suited to the specific tasks of 3D, Parallel Compute and AI. The complexity of this approach is only dwarfed by the amount of money poured into the software ecosystem in order to make it accessible, and even then, GPU Programmers are a specialist, rare and expensive breed.

A second hint as to the answer emerges from the article "[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)", which illustrates a catastrophic rabbit-hole taken by the Industry Giants ARM, Intel and AMD since the 90s (over three decades): SIMD, an O(N^6) opcode proliferation nightmare whose mantra, "make it easy for hardware engineers, let software sort out the mess", has literally overwhelmed programmers.
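To make the contrast concrete, below is a minimal C sketch in the spirit of that article's DAXPY example. The first function uses real x86 SSE intrinsics, where the vector width is baked into the binary; in the second, the `setvl` helper is a *hypothetical stand-in* for the Cray-style "set vector length" instruction discussed throughout this paper, not an actual API.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* x86 SSE intrinsics: 128 bits, 4 floats per op */

/* Fixed-width SIMD: the 4-element width is hardcoded, a scalar "tail"
   loop is needed for leftover elements, and widening to AVX (8 floats)
   or AVX-512 (16) means rewriting the function with new opcodes. */
void saxpy_simd(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        _mm_storeu_ps(&y[i], _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a), vx), vy));
    }
    for (; i < n; i++)           /* tail cleanup: pure overhead */
        y[i] = a * x[i] + y[i];
}

/* Hypothetical model of the Cray-style setvl instruction: hardware
   reports how many elements it will process on this iteration. */
enum { MAXVL = 8 };              /* an implementation choice, invisible to software */
static size_t setvl(size_t n) { return n < MAXVL ? n : MAXVL; }

/* Cray-style vector loop: one portable binary, no tail loop, and the
   hardware may choose any vector length without a recompile. */
void saxpy_vec(float a, const float *x, float *y, size_t n)
{
    while (n > 0) {
        size_t vl = setvl(n);    /* "set vector length" */
        for (size_t i = 0; i < vl; i++)  /* one vector op in real hardware */
            y[i] = a * x[i] + y[i];
        x += vl; y += vl; n -= vl;
    }
}
```

Note how the opcode count of the SIMD version multiplies with every new width, data type and predication variant, whereas the `setvl` loop stays the same: that multiplication is the source of the opcode proliferation.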
Worse than that, specialists who charge clients for Optimisation Services are finding that AVX-512, to take an example, is anything but optimal: overall performance with AVX-512 actually *decreases* even as power consumption goes up.

Cray-style Vectors solved the opcode proliferation nightmare over thirty years ago. Only the NEC SX Aurora, however, truly kept the Cray Vector flame alive, until RISC-V RVV, and now SVP64 and recently MRISC32, joined it.

ARM's SVE/SVE2 is critically flawed (it lacks the Cray `setvl` instruction that makes a truly ubiquitous Vector ISA possible) in ways that will become apparent over time as adoption increases. In the meantime, programmers are, in direct violation of ARM's advice on how to use SVE2, trying desperately to use it as if it were Packed SIMD NEON. The advice not to write SVE2 assembler hardcoded to fixed widths is being disregarded, in favour of writing *multiple identical implementations* of a function, each for a different hardware width, and compelling software to choose one at runtime after probing the hardware, as the sketch below illustrates.
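A minimal C sketch of that runtime-dispatch anti-pattern follows. The kernel names and the `probe_vector_bits` helper are hypothetical illustrations (real code would read a CPU feature register or query the OS), and each stub stands in for a full hand-written copy of the same algorithm:

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-ins for three hand-written copies of one and the same kernel,
   one per hardware vector width; in real deployments each would be a
   separate NEON or fixed-width SVE2 build of identical source. */
static void kernel_128(float *y, const float *x, size_t n) { for (size_t i = 0; i < n; i++) y[i] += x[i]; }
static void kernel_256(float *y, const float *x, size_t n) { for (size_t i = 0; i < n; i++) y[i] += x[i]; }
static void kernel_512(float *y, const float *x, size_t n) { for (size_t i = 0; i < n; i++) y[i] += x[i]; }

/* Hypothetical hardware probe: returns the vector width in bits. */
static size_t probe_vector_bits(void) { return 256; }

static void (*kernel)(float *, const float *, size_t);

int main(void)
{
    switch (probe_vector_bits()) {   /* chosen once, at startup */
    case 512: kernel = kernel_512; break;
    case 256: kernel = kernel_256; break;
    default:  kernel = kernel_128; break;
    }
    float x[4] = {1, 2, 3, 4}, y[4] = {0};
    kernel(y, x, 4);
    printf("%f\n", y[0]);
    return 0;
}
```

With a `setvl`-style ISA all three copies collapse into a single function that runs unmodified on any hardware width, which is precisely the point of the Cray approach.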
Even RISC-V, for all that we can be grateful to the RISC-V Founders for reviving Cray Vectors, has severe performance and implementation limitations that are only really apparent to exceptionally experienced assembly-level developers with wide, diverse experience of multiple ISAs. One of the best and clearest analyses is a [ycombinator post](https://news.ycombinator.com/item?id=24459041) by adrian_b, who logically and concisely points out that the fundamental design assumptions and simplifications that went into the RISC-V ISA have an irrevocably damaging effect on its viability for high performance use.

That is not to say that its use in low-performance embedded scenarios is not ideal: for private, custom, secretive commercial usage it is perfect. Trinamic, an early adopter, is a classic case study: they created their TMC2660 Stepper IC with RISC-V in place of ARM, saving themselves USD 1 in licensing royalties per product. Ubiquitous and common everyday usage in scenarios currently occupied by ARM, Intel, AMD and IBM? Not so much. Even though RISC-V has Cray-style Vectors, the whole ISA is, unfortunately, fundamentally flawed as far as power-efficient high performance is concerned.

Slowly, at this point, a realisation should be sinking in: there are not actually that many truly viable Vector ISAs out there, because the ones that are evolving in the general direction of Vectorisation are, in various completely different ways, flawed.

**Successfully identifying a limitation marks the beginning of an opportunity**

We are nowhere near done, however, because a Vector ISA is a superset of a Scalar ISA, and even a Scalar ISA takes over a decade to develop compiler support for, and even longer to get the software ecosystem up and running.

Which ISAs, therefore, have (or have had, at one point in time) a decent Software Ecosystem? Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle. Gaisler Research maintains the LEON Open Source Cores, but with Oracle's reputation nobody wants to go near SPARC.
* MIPS, created by SGI and only really commonly used in Network switches. Exceptions: Ingenic with embedded CPUs, and China ICT with the Loongson supercomputers.
* x86, the most well-known ISA and also one of the most heavily litigiously protected.
* ARM, well known in embedded and smartphone scenarios, and very slowly making its way into data centres.
* OpenRISC, an entirely Open ISA suitable for embedded systems.
* s390, a Mainframe ISA very similar to Power.
* Power ISA, a Supercomputing-class ISA, as demonstrated by two of the top three top500.org supercomputers using 160,000 IBM POWER9 Cores.
* ARC, a competitor to ARM at the time, best known for use in the Broadcom VideoCore IV.
* RISC-V, with a software ecosystem heavily in development, and with rapid adoption in an uncontrolled fashion, set on an unstoppable and inevitable trainwreck path to replicate the opcode conflict nightmare that plagued the Power ISA two decades ago.
* Tensilica, Andes STAR and Western Digital, for successful commercial proprietary ISAs: Tensilica in Baseband Modems, Andes in Audio DSPs, WD in HDDs and SSDs. These are all multi-billion-unit mass volume markets that almost nobody knows anything about. Included for completeness.

In order of least controlled to most controlled, the viable candidates for further advancement are:

* OpenRISC 1200: not controlled or restricted by anyone. No patent protection.
* RISC-V: touted as "Open" but actually strictly controlled under Trademark License, and too new to have adequate patent pool protection, as evidenced by multiple adopters having been hit by patent lawsuits.
* MIPS, SPARC, ARC, and others: simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members of the OpenPOWER Foundation, covered by Trademarks, permitting and encouraging contributions, and having software support for over 20 years.
* ARM: does not permit Open Licensing. They survived in the early 90s only by doing a deal with Samsung for a Royalty-free License, in exchange for GBP 3 million and legal protection under the Samsung Research Division. Several large Corporations (Apple most notably) have licensed the ISA but not ARM designs; the barrier to entry is high, and the ISA itself is protected from interference as a result.
* x86: famous for a Court Ruling in 2004 where a Judge "banged heads together" and ordered AMD and Intel to stop wasting his time, make peace, and cross-license each other's patents. Anyone wishing to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86, and VIA EDEN processors, and see how they fared.
* s390: IBM's mainframe ISA. Its lawsuits are nowhere near as well-known as x86's, but the 800lb Gorilla Syndrome seems not to have deterred one particularly disingenuous group from performing illegal Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable to base a Vector Supercomputing-class Extension on?", where patent protection, software ecosystem, openness and pedigree all combine to reduce risk and increase the chances of success, there is really only one candidate.

**Of all of these, the only one with the most going for it is the Power ISA.**