bug 1244: add pospopcount cookbook link

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn

index 09fa4629339481a958e163fa4e0c2b7c69bebe04..ea1bf2b9e6f0b9652eee9f54e00dec0ecc21525f 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -2,10 +2,12 @@
  
  **Revision History**
  
-* v0.00 05may2021 first created
-* v0.01 06may2021 initial first draft
-* v0.02 08may2021 add scenarios / use-cases
-* v0.03 09may2021 add draft image for scenario
+* v0.00 05may2022 first created
+* v0.01 06may2022 initial first draft
+* v0.02 08may2022 add scenarios / use-cases
+* v0.03 09may2022 add draft image for scenario
+* v0.04 14may2022 add appendix with other research
+* v0.05 14jun2022 update images (thanks to Veera)
  
  **Table of Contents**
  
@@ -17,9 +19,9 @@
  
  Inventing a new Scalar ISA from scratch is over a decade-long task
  including simulators and compilers: OpenRISC 1200 took 12 years to
-mature.  Stable ISAs require Standards and Compliance Suites that
+mature.  Stable Open ISAs require Standards and Compliance Suites that
  take more. A Vector or Packed SIMD ISA to reach stable *general-purpose*
-auto-vectorisation compiler support has never been achieved in the
+auto-vectorization compiler support has never been achieved in the
  history of computing, not with the combined resources of ARM, Intel,
  AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
  assembler and direct use of intrinsics is the Industry-standard norm
@@ -127,7 +129,7 @@ performance is concerned.
  
  Slowly, at this point, a realisation should be sinking in that, actually,
  there aren't as many really truly viable Vector ISAs out there, as the
-ones that are evolving in the general direction of Vectorisation are,
+ones that are evolving in the general direction of Vectorization are,
  in various completely different ways, flawed.
  
  **Successfully identifying a limitation marks the beginning of an
@@ -168,7 +170,8 @@ Software Ecosystem? Debian supports most of these including s390:
    Andes in Audio DSPs, WD in HDDs and SSDs. These are all
    astoundingly commercially successful
    multi-billion-unit mass volume markets that almost nobody
-  knows anything about. Included for completeness.
+  knows anything about, outside their specialised proprietary
+  niche. Included for completeness.
  
  In order of least controlled to most controlled, the viable
  candidates for further advancement are:
@@ -274,7 +277,9 @@ Vector instructions in RISC-V as there are in the RV64GC Scalar base.
  The question then becomes: with all the duplication of arithmetic
  operations just to make the registers scalar or vector, why not
  leverage the *existing* Scalar ISA with some sort of "context"
-or prefix that augments its behaviour? Make "Scalar instruction"
+or prefix that augments its behaviour? Separate out the
+"looping" from "thing being looped on" (the elements),
+make "Scalar instruction"
  synonymous with "Vector Element instruction" and through nothing
  more than contextual
  augmentation the Scalar ISA *becomes* the Vector ISA.
@@ -283,8 +288,11 @@ the Instruction Decode
  phase is greatly simplified, reducing design complexity and leaving
  plenty of headroom for further expansion.
  
+[[!img "svp64-primer/img/power_pipelines.svg" ]]
+
  Remarkably this is not a new idea.  Intel's x86 `REP` instruction
-gives the base concept, but in 1994 it was Peter Hsu, the designer
+gives the base concept, and the Z80 had something similar.
+But in 1994 it was Peter Hsu, the designer
  of the MIPS R8000, who first came up with the idea of Vector-augmented
  prefixing of an existing Scalar ISA.  Relying on a multi-issue Out-of-Order Execution Engine,
  the prefix would mark which of the registers were to be treated as
@@ -295,7 +303,8 @@ jammed multiple scalar operations into the Multi-Issue Execution
  Engine.  The only reason that the team did not take this forward
  into a commercial product
  was because they could not work out how to cleanly do OoO
-multi-issue at the time.
+multi-issue at the time (leveraging Multi-Issue is the most logical
+way to exploit the Vector-Prefix concept).
  
  In its simplest form, then, this "prefixing" idea is a matter
  of:
@@ -334,8 +343,8 @@ of the problem-space:
    that require significant programming effort in other ISAs.
  
  All of these things come entirely from "Augmentation" of the Scalar operation
-being prefixed: at no time is the Scalar operation significantly
-altered.
+being prefixed: at no time is the Scalar operation's binary pattern decoded
+differently compared to when it is used as a Scalar operation.
  From there, several more "Modes" can be added, including
  
  * saturation,
@@ -351,6 +360,10 @@ Sum)
    Boolean Logic in a Vector context, on top of an already-powerful
    Scalar Branch-Conditional/Counter instruction
  
+All of these features are added as "Augmentations", creating on
+the order of 1.5 *million* instructions, none of which decode the
+32-bit scalar suffix any differently.
+
  **What is missing from Power Scalar ISA that a Vector ISA needs?**
  
  Remarkably, very little: the devil is in the details though.
@@ -368,7 +381,7 @@ Remarkably, very little: the devil is in the details though.
    sequential carry-flag chaining of these scalar instructions.
  * The Condition Register Fields of the Power ISA make a great candidate
    for use as Predicate Masks, particularly when combined with
-  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
+  Vectorized `cmp` and Vectorized `crand`, `crxor` etc.
  
  It is only when looking slightly deeper into the Power ISA that
  certain things turn out to be missing, and this is down in part to IBM's
@@ -376,17 +389,17 @@ primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
  so Scalar ones.  Examples include that transfer operations between the
  Integer and Floating-point Scalar register files were dropped approximately
  a decade ago after the Packed SIMD variants were considered to be
-duplicates.  With it being completely inappropriate to attempt to Vectorise
+duplicates.  With it being completely inappropriate to attempt to Vectorize
  a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
-the Scalar ISA, a much better all-round candidate for Vectorisation is
-left anaemic.
+the Scalar part of the Power ISA, a much better all-round candidate for
+Vectorization, is left anaemic.
  
  A particular key instruction that is missing is `MV.X` which is
  illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
  expensive instruction causing a huge swathe of Register Hazards
  in one single hit is almost never added to a Scalar ISA but
  is almost always added to a Vector one. When `MV.X` is
-Vectorised it allows for arbitrary
+Vectorized it allows for arbitrary
  remapping of elements within a Vector to positions specified
  by another Vector. A typical Scalar ISA will use Memory to
  achieve this task, but with Vector ISAs the Vector Register Files are
@@ -473,10 +486,11 @@ concept separated from the mathematical operation, there is no reason
  why Matrix Multiplication Schedules may not be applied to Integer
  Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
  AND-and-OR, or any other future instruction such as Complex-Number
-Multiply-and-Accumulate that a future version of the Power ISA might
+Multiply-and-Accumulate or Abs-Diff-and-Accumulate
+that a future version of the Power ISA might
  support.  The flexibility is not only enormous, but the compactness
-unprecedented.  RADIX2 in-place DCT Triple-loop Schedules may be created in
-around 11 instructions. The only other processors well-known to have
+unprecedented.  RADIX2 in-place DCT may be created in
+around 11 instructions using the Triple-loop DCT Schedule. The only other processors well-known to have
  this type of compact capability are both VLIW DSPs: TI's TMS320 Series
  and Qualcom's Hexagon, and both are targetted at FFTs only.
  
@@ -547,6 +561,14 @@ side-effect of going down the parallel-processing rabbithole where
  the cost of providing "Traditional" programmabilility (Virtual Memory,
  SMP) is worse than counter-productive, it's often outright impossible.
  
+*<blockquote>
+Similar to how GPUs achieve astounding task-dedicated
+performance by giving
+ALUs 30% of total silicon area and sacrificing the ability to run
+General-Purpose programs, Aspex, Google's Tensor Processor and D-Matrix
+likewise took this route and made the same compromise.
+</blockquote>*
+
  **In short, we are in "Programmer's nightmare" territory**
  
  Having dug a proverbial hole that rivals the Grand Canyon, and
@@ -558,6 +580,7 @@ commercial designs.  Once the context is clear, their synthesis
  can be proposed.  These are:
  
  * [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
+ available [no paywall](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf)
  * [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
  * [Snitch](https://arxiv.org/abs/2002.10143)
  
@@ -596,10 +619,10 @@ out to an entirely different loop, all based on conditions determined
  dynamically at runtime.
  
  Even when deployed on as basic a CPU as a single-issue in-order RISC
-core, the performance and power-savings were astonishing: between 20
-and **80%** reduction in algorithm completion times were achieved compared
+core, the performance and power-savings were astonishing: between 27
+and **75%** reduction in algorithm completion times were achieved compared
  to a more traditional branch-speculative in-order RISC CPU.  MPEG
-Decode, the target algorithm specifically picked by the researcher
+Encode's timing, the target algorithm specifically picked by the researcher
  due to its high complexity with 6-deep nested loops and conditional
  execution that frequently jumped in and out of at least 2 loops,
  came out with an astonishing 43% improvement in completion time. 43%
@@ -808,10 +831,30 @@ execute the exact same ISA (or a subset of it). If however the
  concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
  which would increase adoption and reduce cost, a bit more thought
  is required here because ARM or Intel or MIPS might not necessarily
-be happy that a Processing Element has to execute Power ISA binaries.
+be happy that a Processing Element (PE) has to execute Power ISA binaries.
  At least the Power ISA is much richer, more powerful, still RISC,
  and is an Open Standard, as discussed in a earlier sections.
  
+A reasonable compromise as a JEDEC Standard is illustrated with
+the following diagram: a 3-way Bridge PHY that allows for full
+direct interaction between DRAM ICs, PEs, and one or more main CPUs
+(*a variant of the Northbridge and/or IBM POWER10 OMI-to-DDR5 PHY concept*).
+It is also the ideal location for a "Management Core".
+If the 3-way Bridge (4-way if connectivity to other Bridge PHYs
+is also included) does not itself have PEs built-in then the ISA
+utilised on any PE or CPU is non-critical.  The only concern regarding
+mixed ISAs is that the PHY should be capable of transferring any and
+all types of "Management" packets, particularly PE Virtual Memory Management
+and Register File Control (Context-switch Management given that the PEs
+are expected to be ALU-heavy and not capable of running a full SMP Operating
+System).
+
+There is also no reason why this type of arrangement should not be deployed
+in Multi-Chip-Module (aka "Chiplet") form, giving all the advantages of
+the performance boost that goes with smaller line-drivers.
+
+<img src="/openpower/sv/bridge_phy.svg" width=600 />
+
  # Transparently-Distributed Vector Processing
  
  It is very strange to the author to be describing what amounts to a
@@ -829,7 +872,7 @@ not that straightforward: programs
  have to be "massaged" by tools that insert intrinsics into the
  source code, in order to identify the Basic Blocks that the Zero-Overhead
  Loops can run. Can this be merged into standard gcc and llvm
-compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
+compilers? As intrinsics: of course. Can it become part of auto-vectorization? Probably,
  if an infinite supply of money and engineering time is thrown at it.
  Is a half-way-house solution of compiler intrinsics good enough?
  Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
@@ -848,21 +891,59 @@ CPU as a software-extension: PEs generate a TLB Miss notification
  to the main CPU over OpenCAPI, and the main CPU feeds back the new
  TLB entries to the PE in response.
  
+Also, in practical terms, with the PEs anticipated to be so small as to
+make running a full SMP-aware OS impractical, it will not just be their TLB
+pages that need remote management: their entire register file, including
+the Program Counter, will need to be set up remotely, and the ZOLC Context
+as well. With OpenCAPI packet formats being quite large, a concern is that
+the context management increases latency to the point where the premise
+of this paper is invalidated. Research is needed here as to whether a
+bare-bones microkernel
+would be viable, or a Management Core closer to the PEs (on the same
+die or Multi-Chip-Module as the PEs) would allow better bandwidth and
+reduce Management Overhead on the main CPUs.  However if
+the same level of power saving as Snitch (1/6th) and
+the same sort of reduction in algorithm runtime as ZOLC (27 to 75%) is not
+unreasonable to expect, this is
+definitely compelling enough to warrant in-depth investigation.
+
  **Use-case: Matrix and Convolutions**
  
+<img src="/openpower/sv/sv_horizontal_vs_vertical.svg" />
+
+First, some important definitions, because there are two different
+Vectorization Modes in SVP64:
+
  * **Horizontal-First**: (aka standard Cray Vectors) walk
    through **elements** first before moving to next **instruction**
  * **Vertical-First**: walk through **instructions** before
    moving to next **element**.  Currently managed by `svstep`,
    ZOLC may be deployed to manage the stepping, in a Deterministic manner.
  
+Second:
+SVP64 Draft Matrix Multiply is currently set up to arrange a Schedule
+of Multiply-and-Accumulates, suitable for pipelining, that will,
+ultimately, result in a Matrix Multiply. Normal processors are forced
+to perform "loop-unrolling" in order to achieve this same Schedule.
+SIMD processors are further forced into a situation of pre-arranging rotated
+copies of data if the Matrices are not exactly on a power-of-two boundary.
+
+The current limitation of SVP64, however, is that (at least when
+Horizontal-First is deployed, which needs the fewest instructions)
+both source and destination Matrices have to be in registers,
+in full.  Vertical-First may be used to perform a LD/ST within
+the loop, covered by `svstep`, but it is still not ideal.  This
+is where the Snitch and Extra-V concepts kick in.
+
+<img src="/openpower/sv/matrix_svremap.svg" />
+
  Imagine a large Matrix scenario, with several values close to zero that
  could be skipped: no need to include zero-multiplications, but a
  traditional CPU in no way can help: only by loading the data through
  the L1-L4 Cache and Virtual Memory Barriers is it possible to
  ascertain, retrospectively, that time and power had just been wasted.
  
-SVP64 is able to do what is termed "Vertical-First" Vectorisation,
+SVP64 is able to do what is termed "Vertical-First" Vectorization,
  combined with SVREMAP Matrix Schedules.  Imagine that SVREMAP has been
  extended, Snitch-style, to perform a deterministic memory-array walk of
  a large Matrix.
@@ -896,12 +977,13 @@ L1/L2/L3 Caches only to find, at the CPU, that it is zero.
  
  The reason in this case for the use of Vertical-First Mode is the
  conditional execution of the Multiply-and-Accumulate.
-Horizontal-First Mode is the standard Cray-Style Vectorisation:
+Horizontal-First Mode is the standard Cray-Style Vectorization:
  loop on all *elements* with the same instruction before moving
-on to the next instruction. Predication needs to be pre-calculated
+on to the next instruction. Horizontal-First
+Predication needs to be pre-calculated
  for the entire Vector in order to exclude certain elements from
  the computation. In this case, that's an expensive inconvenience 
-(similar to the problems associated with Memory-to-Memory
+(remarkably similar to the problems associated with Memory-to-Memory
  Vector Machines such as the CDC Star-100).
  
  Vertical-First allows *scalar* instructions and
@@ -926,7 +1008,7 @@ on a GB-OoO Micro-architecture)
  
  Draft Image (placeholder):
  
-<img src="/openpower/sv/zolc_svp64_extrav.jpg" width=800 />
+<img src="/openpower/sv/zolc_svp64_extrav.svg" width=800 />
  
  The program being executed is a simple loop with a conditional
  test that ignores the multiply if the input is zero.
@@ -964,6 +1046,56 @@ binaries, the size of each PE's TLB-aware
  L1 Cache needed would be miniscule compared
  to the average high-end CPU.
  
+**Comparison of PE-CPU to GPU-CPU interaction**
+
+The informed reader will have noted the remarkable similarity between how
+a CPU communicates with a GPU to schedule tasks, and the proposed
+architecture.  CPUs schedule tasks with GPUs as follows:
+
+* User-space program encounters an OpenGL function, in the
+  CPU's ISA.
+* Proprietary GPU Driver, still in the CPU's ISA, prepares a
+  Shader Binary written in the GPU's ISA.
+* GPU Driver wishes to transfer both the data and the Shader Binary
+  to the GPU. Both may only be transferred via Shared Memory, usually
+  DMA over PCIe (assuming a PCIe Graphics Card).
+* GPU Driver, which has been running in CPU userspace, notifies CPU
+  kernelspace of the desire to transfer data and GPU Shader Binary
+  to the GPU. A context-switch occurs...
+
+It is almost unfair to burden the reader with further details.
+The extraordinarily convoluted procedure is as bad as it sounds. Hundreds
+of thousands of tasks per second are scheduled this way, with hundreds
+of megabytes of data per second being exchanged as well.
+
+Yet, the process is not that different from how things would work
+with the proposed microarchitecture: the differences however are key.
+
+* Both PEs and CPU run the exact same ISA.  A major complexity of 3D GPU
+  and CUDA workloads  (JIT compilation etc) is eliminated, and, crucially,
+  the CPU may directly execute the PE's tasks, if needed. This simply
+  is not even remotely possible on GPU Architectures.
+* Where GPU Drivers use PCIe Shared Memory, the proposed architecture
+  deploys OpenCAPI.
+* Where GPUs are a foreign architecture and a foreign ISA, the proposed
+  architecture only narrowly misses being defined as big/LITTLE Symmetric
+  Multi-Processing (SMP) by virtue of the massively-parallel PEs
+  being a bit light on L1 Cache, in favour of large ALUs and proximity
+  to Memory, and requiring a modest amount of "helper" assistance with
+  their Virtual Memory Management.
+* The proposed architecture has the markup points embedded into the
+  binary programs
+  where PEs may take over from the CPU, and there is accompanying
+  (planned) hardware-level assistance at the ISA level.  GPUs, which have to
+  work with a wide range of commodity CPUs, cannot in any way expect
+  ARM or Intel to add support for GPU Task Scheduling directly into
+  the ARM or x86 ISAs!
+
+On this last point it is crucial to note that SVP64 took its inspiration
+from a Hybrid CPU-GPU-VPU paradigm (like ICubeCorp's IC3128) and
+consequently has versatility that the separate specialisation of both
+GPU and CPU architectures lacks.
+
  **Roadmap summary of Advanced SVP64**
  
  The future direction for SVP64, then, is:
@@ -983,6 +1115,10 @@ The future direction for SVP64, then, is:
    exploiting both the Deterministic nature of ZOLC / SVREMAP
    combined with the Cache-Coherent nature of OpenCAPI,
    to the maximum extent possible.
+* To explore "Remote Management" of PE RADIX MMU, TLB, and
+  Context-Switching (register file transfer) by proxy,
+  over OpenCAPI, to ensure that the distributed PEs are as
+  close to a Standard SMP model as possible, for programmers.
  * To make the exploitation of this powerful solution as simple
    and straightforward as possible for Software Engineers to use,
    in standard common-usage compilers, gcc and llvm.
@@ -998,8 +1134,86 @@ the register files (128 64-bit entries), to arbitrary (massive) sizes.
  
  **Summary**
  
+There are historical and current efforts that step away from both a
+general-purpose architecture and from the practice of using compiler
+intrinsics in general-purpose compute to make programmers' lives easier.
+A classic example is the Cell Processor (Sony PS3), which required
+programmers to use DMA to schedule processing tasks. These specialist
+high-performance architectures are only tolerated for
+as long as there is no equivalent performant alternative that is
+easier to program.
+
+Combining SVP64 with ZOLC and OpenCAPI can produce an extremely powerful
+architectural base that fits well with intrinsics embedded into standard
+general-purpose compilers (gcc, llvm) as a pragmatic compromise which makes
+it useful right out of the gate. Further R&D may target compiler technology
+that brings it on-par with NVIDIA, Graphcore, AMDGPU, but with intrinsics
+there is no critical product launch dependence on having such
+advanced compilers.
+
  Bottom line is that there is a clear roadmap towards solving a long
  standing problem facing Computer Science and doing so in a way that
  reduces power consumption reduces algorithm completion time and reduces
  the need for complex hardware microarchitectures in favour of much
-smaller distributed coherent Processing Elements.
+smaller distributed coherent Processing Elements with a Homogeneous ISA
+across the board.
+
+# Appendix
+
+**Samsung PIM**
+
+Samsung's
+[Processing-in-Memory](https://semiconductor.samsung.com/emea/newsroom/news/samsung-brings-in-memory-processing-power-to-wider-range-of-applications/)
+seems to be ready to launch as a
+[commercial product](https://semiconductor.samsung.com/insights/technology/pim/)
+that uses HBM as its Memory Standard,
+has "some logic suitable for AI", has parallel processing elements,
+and offers 70% reduction
+in power consumption and a 2x performance increase in speech
+recognition. Details beyond that as to its internal workings
+or programmability are minimal; however, given the similarity
+to D-Matrix and Google TPU it is reasonable to place it in the
+same category.
+
+* [Samsung PIM IEEE Article](https://spectrum.ieee.org/samsung-ai-memory-chips)
+  explains that there are 9 instructions, mostly FP16 arithmetic,
+  and that it is designed to "complement" AI rather than compete.
+  With only 9 instructions, 2 of which will be LOAD and STORE,
+  conditional code execution seems unlikely.
+  Silicon area in DRAM is increased by 5% for a much greater reduction
+  in power. The article notes, pointedly, that programmability will
+  be a key deciding factor.  The article also notes that Samsung has
+  proposed its architecture as a JEDEC Standard.
+
+**PIM-HBM Research**
+
+[Presentation](https://ieeexplore.ieee.org/document/9073325/) by Seongguk Kim
+and associated [video](https://www.youtube.com/watch?v=e4zU6u0YIRU)
+showing 3D-stacked DRAM connected to GPUs, but notes that even HBM, due to
+large GPU size, is less advantageous than it should be.  Processing-in-Memory
+is therefore logically proposed. The PE (named a Streaming Multiprocessor)
+is much more sophisticated, comprising Register File, L1 Cache, FP32, FP64
+and a Tensor Unit.
+
+<img src="/openpower/sv/2022-05-14_11-55.jpg" width=500 />
+
+**etp4hpc.eu**
+
+[ETP 4 HPC](https://etp4hpc.eu) is a European Joint Initiative for HPC,
+with an eye towards
+[Processing in Memory](https://www.etp4hpc.eu/pujades/files/ETP4HPC_WP_Processing-In-Memory_FINAL.pdf)
+
+**Salient Labs**
+
+[Research paper](https://arxiv.org/abs/2002.00281) explaining
+that they can exceed a 14 GHz clock rate Multiply-and-Accumulate
+using Photonics.
+
+**SparseLNR**
+
+[SparseLNR](https://arxiv.org/abs/2205.11622) restructures sparse
+tensor computations using loop-nest restructuring.
+
+**Additional ZOLC Resources**
+
+* <https://www.researchgate.net/publication/3351728_Zero-overhead_loop_controller_that_implements_multimedia_algorithms>
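
The Horizontal-First and Vertical-First definitions added by this change can be illustrated with a small software model. This is only a sketch under stated assumptions, not SVP64 itself: the function names, the dictionary standing in for a register file, and the two-operation "program" are all hypothetical and exist purely to show the order in which (instruction, element) pairs are visited, together with the zero-skipping Multiply-and-Accumulate described in the Matrix use-case.

```python
# Hypothetical model: a "program" is a list of per-element operations and
# "regs" is a plain dict standing in for vector and scalar registers.

def horizontal_first(program, vl, regs):
    """Cray-style: finish all elements of one instruction, then move on."""
    trace = []
    for pc, op in enumerate(program):
        for el in range(vl):
            op(regs, el)
            trace.append((pc, el))
    return trace


def vertical_first(program, vl, regs):
    """Run every instruction on one element, then step to the next element
    (the stepping role that the text assigns to svstep)."""
    trace = []
    for el in range(vl):
        for pc, op in enumerate(program):
            op(regs, el)
            trace.append((pc, el))
    return trace


def double_a(regs, el):
    # Element-wise operation on vector "a".
    regs["a"][el] *= 2


def mac_skip_zero(regs, el):
    # Conditional Multiply-and-Accumulate: zero inputs are skipped.
    if regs["a"][el] != 0:
        regs["acc"] += regs["a"][el] * regs["b"][el]


def new_regs():
    return {"a": [1, 0, 3], "b": [4, 5, 6], "acc": 0}


PROGRAM = [double_a, mac_skip_zero]

h_regs, v_regs = new_regs(), new_regs()
h_trace = horizontal_first(PROGRAM, 3, h_regs)
v_trace = vertical_first(PROGRAM, 3, v_regs)

print(h_trace)  # (instruction, element) pairs, instruction-major
print(v_trace)  # (instruction, element) pairs, element-major
print(h_regs["acc"], v_regs["acc"])
```

Because each operation here touches only its own element, both orderings produce the same accumulator value and only the visiting order differs. In this model the zero test sits inside the per-element step, whereas the text notes that Horizontal-First needs Predication pre-calculated for the entire Vector.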