**Revision History**
-* v0.00 05may2021 first created
-* v0.01 06may2021 initial first draft
-* v0.02 08may2021 add scenarios / use-cases
-* v0.03 09may2021 add draft image for scenario
+* v0.00 05may2022 first created
+* v0.01 06may2022 initial first draft
+* v0.02 08may2022 add scenarios / use-cases
+* v0.03 09may2022 add draft image for scenario
+* v0.04 14may2022 add appendix with other research
+* v0.05 14jun2022 update images (thanks to Veera)
**Table of Contents**
Andes in Audio DSPs, WD in HDDs and SSDs. These are all
astoundingly commercially successful
multi-billion-unit mass volume markets that almost nobody
- knows anything about. Included for completeness.
+ knows anything about, outside their specialised proprietary
+ niche. Included for completeness.
In order of least controlled to most controlled, the viable
candidates for further advancement are:
The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
leverage the *existing* Scalar ISA with some sort of "context"
-or prefix that augments its behaviour? Make "Scalar instruction"
+or prefix that augments its behaviour? Separate out the
+"looping" from "thing being looped on" (the elements),
+make "Scalar instruction"
synonymous with "Vector Element instruction" and through nothing
more than contextual
augmentation the Scalar ISA *becomes* the Vector ISA.
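The prefix concept can be pictured with a small Python model (purely illustrative: the register-file size, operand layout and function names below are invented for the sketch, and are not actual SVP64 semantics):

```python
regs = [0] * 128  # flat register file (size chosen arbitrarily for the sketch)

def scalar_add(rd, ra, rb):
    # an ordinary scalar instruction: one element in, one result out
    regs[rd] = regs[ra] + regs[rb]

def vector_prefix(scalar_op, VL, rd, ra, rb):
    # the "context" does nothing but repeat the *same* scalar operation,
    # incrementing register numbers once per element
    for i in range(VL):
        scalar_op(rd + i, ra + i, rb + i)

regs[8:12] = [1, 2, 3, 4]       # source Vector at r8
regs[16:20] = [10, 20, 30, 40]  # source Vector at r16
vector_prefix(scalar_add, 4, rd=0, ra=8, rb=16)
print(regs[0:4])  # [11, 22, 33, 44]
```

The key point of the sketch is that `scalar_add` is the one and only arithmetic routine: "Vectorisation" is nothing more than the prefix looping over it.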
phase is greatly simplified, reducing design complexity and leaving
plenty of headroom for further expansion.
+[[!img "svp64-primer/img/power_pipelines.svg" ]]
+
Remarkably this is not a new idea. Intel's x86 `REP` instruction
-gives the base concept, but in 1994 it was Peter Hsu, the designer
+gives the base concept, and the Z80 had something similar.
+However in 1994 it was Peter Hsu, the designer
of the MIPS R8000, who first came up with the idea of Vector-augmented
prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Engine. The only reason that the team did not take this forward
into a commercial product
was because they could not work out how to cleanly do OoO
-multi-issue at the time.
+multi-issue at the time (leveraging Multi-Issue is the most logical
+way to exploit the Vector-Prefix concept).
In its simplest form, then, this "prefixing" idea is a matter
of:
that require significant programming effort in other ISAs.
All of these things come entirely from "Augmentation" of the Scalar operation
-being prefixed: at no time is the Scalar operation significantly
-altered.
+being prefixed: at no time is the Scalar operation's binary pattern decoded
+differently compared to when it is used as a Scalar operation.
From there, several more "Modes" can be added, including
* saturation,
Boolean Logic in a Vector context, on top of an already-powerful
Scalar Branch-Conditional/Counter instruction
+All of these features are added as "Augmentations", creating on
+the order of 1.5 *million* instructions, none of which decode the
+32-bit scalar suffix any differently.
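A Mode such as saturation may be pictured, in an illustrative Python sketch (names and structure invented for the example), as a wrapper applied around the result of the unchanged scalar operation:

```python
def scalar_add(a, b):
    # the scalar operation: decoded and executed identically
    # whether or not any Mode is active
    return a + b

def saturate(x, bits=8):
    # clamp to the signed range, e.g. [-128, 127] for 8 bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def augmented(op, mode, *args):
    # the Mode post-processes the result; the scalar op is untouched
    result = op(*args)
    return saturate(result) if mode == "sat" else result

print(augmented(scalar_add, "sat", 100, 100))  # 127 (clamped)
print(augmented(scalar_add, None, 100, 100))   # 200
```

The scalar operation itself is never re-decoded or altered: the Mode only augments what happens to its result.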
+
**What is missing from Power Scalar ISA that a Vector ISA needs?**
Remarkably, very little: the devil is in the details though.
why Matrix Multiplication Schedules may not be applied to Integer
Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
AND-and-OR, or any other future instruction such as Complex-Number
-Multiply-and-Accumulate that a future version of the Power ISA might
+Multiply-and-Accumulate or Abs-Diff-and-Accumulate
+that a future version of the Power ISA might
support. The flexibility is not only enormous, but the compactness
-unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
-around 11 instructions. The only other processors well-known to have
+unprecedented. RADIX2 in-place DCT may be created in
+around 11 instructions using the Triple-loop DCT Schedule. The only other processors well-known to have
this type of compact capability are both VLIW DSPs: TI's TMS320 Series
and Qualcomm's Hexagon, and both are targeted at FFTs only.
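The Schedule concept can be loosely pictured in Python (illustrative only: `matrix_schedule` and `mac` are invented names, not SVP64 syntax). A Deterministic generator walks the triple-loop index space, and a single scalar Multiply-and-Accumulate is applied at every step:

```python
def matrix_schedule(n, m, p):
    # deterministic triple-loop index generator: the "Schedule"
    for i in range(n):
        for j in range(p):
            for k in range(m):
                yield i, j, k

def mac(acc, a, b):
    # one scalar Multiply-and-Accumulate: the operation being scheduled
    return acc + a * b

# applying the Schedule to a 2x3 by 3x2 multiply
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = [[0, 0], [0, 0]]
for i, j, k in matrix_schedule(2, 3, 2):
    C[i][j] = mac(C[i][j], A[i][k], B[k][j])
print(C)  # [[58, 64], [139, 154]]
```

Substituting a different scalar operation for `mac` (a Galois-Field multiply, say, or logical AND-and-OR) reuses the identical Schedule unchanged, which is exactly the flexibility described above.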
the cost of providing "Traditional" programmability (Virtual Memory,
SMP) is worse than counter-productive, it's often outright impossible.
+<blockquote>
+*Similar to how GPUs achieve astounding task-dedicated
+performance by giving
+ALUs 30% of total silicon area and sacrificing the ability to run
+General-Purpose programs, Aspex, Google's Tensor Processor and D-Matrix
+likewise took this route and made the same compromise.*
+</blockquote>
+
**In short, we are in "Programmer's nightmare" territory**
Having dug a proverbial hole that rivals the Grand Canyon, and
can be proposed. These are:
* [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
+ available [no paywall](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf)
* [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
* [Snitch](https://arxiv.org/abs/2002.10143)
dynamically at runtime.
Even when deployed on as basic a CPU as a single-issue in-order RISC
-core, the performance and power-savings were astonishing: between 20
-and **80%** reduction in algorithm completion times were achieved compared
+core, the performance and power-savings were astonishing: between 27
+and **75%** reduction in algorithm completion times were achieved compared
to a more traditional branch-speculative in-order RISC CPU. MPEG
-Decode, the target algorithm specifically picked by the researcher
+Encode's timing, the target algorithm specifically picked by the researcher
due to its high complexity with 6-deep nested loops and conditional
execution that frequently jumped in and out of at least 2 loops,
came out with an astonishing 43% improvement in completion time. 43%
in Multi-Chip-Module (aka "Chiplet") form, giving all the advantages of
the performance boost that goes with smaller line-drivers.
-
-Draft Image (placeholder):
-
-<img src="/openpower/sv/bridge_phy.jpg" width=800 />
+<img src="/openpower/sv/bridge_phy.svg" width=600 />
# Transparently-Distributed Vector Processing
**Use-case: Matrix and Convolutions**
+<img src="/openpower/sv/sv_horizontal_vs_vertical.svg" />
+
First, some important definitions, because there are two different
Vectorisation Modes in SVP64:
moving to next **element**. Currently managed by `svstep`,
ZOLC may be deployed to manage the stepping, in a Deterministic manner.
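The distinction between the two Modes can be pictured as a simple loop interchange (hypothetical Python sketch; the instruction list and trace are invented for illustration, with `svstep`-style element stepping modelled by the outer loop of `vertical_first`):

```python
def horizontal_first(instructions, VL):
    # Cray-style: one instruction across all elements,
    # then move on to the next instruction
    for instr in instructions:
        for i in range(VL):
            instr(i)

def vertical_first(instructions, VL):
    # all instructions on one element, then step to the next element
    for i in range(VL):
        for instr in instructions:
            instr(i)

trace = []
prog = [lambda i: trace.append(("mul", i)),
        lambda i: trace.append(("add", i))]

horizontal_first(prog, 2)
print(trace)  # [('mul', 0), ('mul', 1), ('add', 0), ('add', 1)]
trace.clear()
vertical_first(prog, 2)
print(trace)  # [('mul', 0), ('add', 0), ('mul', 1), ('add', 1)]
```

The operations performed are identical in both cases: only the order of element-stepping versus instruction-stepping differs.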
+Second:
+SVP64 Draft Matrix Multiply is currently set up to arrange a Schedule
+of Multiply-and-Accumulates, suitable for pipelining, that will,
+ultimately, result in a Matrix Multiply. Normal processors are forced
+to perform "loop-unrolling" in order to achieve this same Schedule.
+SIMD processors are further forced into a situation of pre-arranging rotated
+copies of data if the Matrices are not exactly on a power-of-two boundary.
+
+The current limitation of SVP64, however, is that (when Horizontal-First
+is deployed, at least, which uses the fewest instructions)
+both source and destination Matrices have to be in registers,
+in full. Vertical-First may be used to perform a LD/ST within
+the loop, covered by `svstep`, but it is still not ideal. This
+is where the Snitch and EXTRA-V concepts kick in.
+
+<img src="/openpower/sv/matrix_svremap.svg" />
+
Imagine a large Matrix scenario, with several values close to zero that
could be skipped: no need to include zero-multiplications, but a
traditional CPU in no way can help: only by loading the data through
conditional execution of the Multiply-and-Accumulate.
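The zero-skipping loop may be sketched as follows (illustrative Python, not an actual SVP64 sequence): a scalar conditional test per element decides whether the Multiply-and-Accumulate is issued at all:

```python
def sparse_mac(A_row, B_col):
    # Vertical-First style: per-element scalar conditional test,
    # skipping the multiply entirely when the input is zero
    acc = 0
    for a, b in zip(A_row, B_col):
        if a == 0:      # conditional test: ignore zero inputs
            continue    # no multiply issued at all
        acc += a * b
    return acc

print(sparse_mac([3, 0, 0, 5], [2, 9, 9, 4]))  # 26 (two multiplies skipped)
```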
Horizontal-First Mode is the standard Cray-Style Vectorisation:
loop on all *elements* with the same instruction before moving
-on to the next instruction. Predication needs to be pre-calculated
+on to the next instruction. Horizontal-First
+Predication needs to be pre-calculated
for the entire Vector in order to exclude certain elements from
the computation. In this case, that's an expensive inconvenience
-(similar to the problems associated with Memory-to-Memory
+(remarkably similar to the problems associated with Memory-to-Memory
Vector Machines such as the CDC Star-100).
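The inconvenience is visible in a short sketch (illustrative Python): Horizontal-First requires a full pass over the Vector just to construct the predicate mask, before the operation proper can even begin:

```python
def predicated_vector_mul(a, b):
    # step 1: a complete pass over the Vector solely to build the mask
    mask = [x != 0 for x in a]
    # step 2: the vector operation, consulting the precomputed mask
    return [x * y if m else 0 for x, y, m in zip(a, b, mask)]

print(predicated_vector_mul([3, 0, 5], [2, 9, 4]))  # [6, 0, 20]
```

In Vertical-First the test and the operation interleave per element instead, so no whole-Vector mask need exist up front.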
Vertical-First allows *scalar* instructions and
Draft Image (placeholder):
-<img src="/openpower/sv/zolc_svp64_extrav.jpg" width=800 />
+<img src="/openpower/sv/zolc_svp64_extrav.svg" width=800 />
The program being executed is a simple loop with a conditional
test that ignores the multiply if the input is zero.
exploiting both the Deterministic nature of ZOLC / SVREMAP
combined with the Cache-Coherent nature of OpenCAPI,
to the maximum extent possible.
+* To explore "Remote Management" of PE RADIX MMU, TLB, and
+  Context-Switching (register file transference) by proxy,
+ over OpenCAPI, to ensure that the distributed PEs are as
+ close to a Standard SMP model as possible, for programmers.
* To make the exploitation of this powerful solution as simple
and straightforward as possible for Software Engineers to use,
in standard common-usage compilers, gcc and llvm.
A classic example being the Cell Processor (Sony PS3) which required
programmers to use DMA to schedule processing tasks. These specialist
high-performance architectures are only tolerated for
-as long as they are useful.
+as long as there is no equivalently performant alternative that is
+easier to program.
Combining SVP64 with ZOLC and OpenCAPI can produce an extremely powerful
architectural base that fits well with intrinsics embedded into standard
standing problem facing Computer Science and doing so in a way that
reduces power consumption, reduces algorithm completion time, and reduces
the need for complex hardware microarchitectures in favour of much
-smaller distributed coherent Processing Elements.
+smaller distributed coherent Processing Elements with a Heterogeneous ISA
+across the board.
+
+# Appendix
+
+**Samsung PIM**
+
+Samsung's
+[Processing-in-Memory](https://semiconductor.samsung.com/emea/newsroom/news/samsung-brings-in-memory-processing-power-to-wider-range-of-applications/)
+seems to be ready to launch as a
+[commercial product](https://semiconductor.samsung.com/insights/technology/pim/)
+that uses HBM as its Memory Standard,
+has "some logic suitable for AI", has parallel processing elements,
+and offers 70% reduction
+in power consumption and a 2x performance increase in speech
+recognition. Details beyond that as to its internal workings
+or programmability are minimal; however, given the similarity
+to D-Matrix and the Google TPU it is reasonable to place it in the
+same category.
+
+* [Samsung PIM IEEE Article](https://spectrum.ieee.org/samsung-ai-memory-chips)
+ explains that there are 9 instructions, mostly FP16 arithmetic,
+ and that it is designed to "complement" AI rather than compete.
+ With only 9 instructions, 2 of which will be LOAD and STORE,
+ conditional code execution seems unlikely.
+ Silicon area in DRAM is increased by 5% for a much greater reduction
+ in power. The article notes, pointedly, that programmability will
+ be a key deciding factor. The article also notes that Samsung has
+ proposed its architecture as a JEDEC Standard.
+
+**PIM-HBM Research**
+
+[Presentation](https://ieeexplore.ieee.org/document/9073325/) by Seongguk Kim
+and associated [video](https://www.youtube.com/watch?v=e4zU6u0YIRU)
+showing 3D-stacked DRAM connected to GPUs, but notes that even HBM, due to
+large GPU size, is less advantageous than it should be. Processing-in-Memory
+is therefore logically proposed. The PE (named a Streaming Multiprocessor)
+is much more sophisticated, comprising Register File, L1 Cache, FP32, FP64
+and a Tensor Unit.
+
+<img src="/openpower/sv/2022-05-14_11-55.jpg" width=500 />
+
+**etp4hpc.eu**
+
+[ETP 4 HPC](https://etp4hpc.eu) is a European Joint Initiative for HPC,
+with an eye towards
+[Processing in Memory](https://www.etp4hpc.eu/pujades/files/ETP4HPC_WP_Processing-In-Memory_FINAL.pdf).
+
+**Salient Labs**
+
+[Research paper](https://arxiv.org/abs/2002.00281) explaining
+that they can exceed a 14 GHz clock rate for Multiply-and-Accumulate
+using Photonics.
+
+**SparseLNR**
+
+[SparseLNR](https://arxiv.org/abs/2205.11622) accelerates sparse
+tensor computations through loop-nest restructuring.
+
+**Additional ZOLC Resources**
+
+* <https://www.researchgate.net/publication/3351728_Zero-overhead_loop_controller_that_implements_multimedia_algorithms>