openpower/sv/vector_isa_comparison.mdwn

[[!tag standards]]

# Comparative analysis

The documents listed on this page are all, deep breath, basically...
required reading, *in addition* to a full and comprehensive deep technical
understanding of the Power ISA, in order to understand the depth and
background of SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like
the Power ISA itself).

This is just how it is.

Given the sheer overwhelming size and scope of SVP64 we have gone to
**considerable lengths** to provide justification and rationalisation for
adding the various sub-extensions to the Base Scalar Power ISA.

* Scalar bitmanipulation is justifiable for the exact same reasons the
  extensions are justifiable for other ISAs.  The additional justification
  for their inclusion where some instructions are already (sort-of) present
  in VSX is that VSX is not mandatory, and the complexity of implementation
  of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
* Scalar FP-to-INT conversions, likewise.  ARM has a JavaScript conversion
  instruction (`FJCVTZS`); Power ISA does not, and the equivalent costs a
  ridiculous 45 instructions to implement, including 6 branches!
* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
  for High-Performance Compute workloads.
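To give a feel for what the JavaScript conversion mentioned above has to do, here is a minimal Python sketch of the ECMAScript `ToInt32` rule (truncate, wrap modulo 2^32, reinterpret as signed). The function name `js_to_int32` is hypothetical, and the sketch models only the arithmetic result, not the flag or exception behaviour of any ARM or Power ISA instruction.

```python
import math


def js_to_int32(x: float) -> int:
    """Sketch of ECMAScript ToInt32 (hypothetical helper name)."""
    # NaN and infinities convert to zero.
    if math.isnan(x) or math.isinf(x):
        return 0
    # Truncate toward zero (int() on a float does exactly this).
    n = int(x)
    # Wrap modulo 2^32; Python's & on negative ints behaves as two's complement.
    n &= 0xFFFFFFFF
    # Reinterpret the top half of the range as negative 32-bit values.
    if n >= 1 << 31:
        n -= 1 << 32
    return n
```

Each of the special cases (NaN, infinity, out-of-range wrap, sign reinterpretation) is a separate test in a software implementation, which is where a long branchy instruction sequence comes from when no single instruction is available.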

It also has to be pointed out that normally this work would be covered by
multiple separate full-time Workgroups with multiple Members contributing
their time and resources. In RISC-V there are over sixty Technical Working
Groups: <https://riscv.org/community/directory-of-working-groups/>

Overall the contributions that we are developing take the Power ISA out of
the specialist highly-focussed market it is presently best known for, and
expand it into areas with much wider general adoption and broader uses.

---

OpenCL specifications are linked from [[openpower/transcendentals]].
These become relevant when we get to a 3D GPU / High Performance Compute
ISA WG RFC.

(Failure to add Transcendentals to a 3D GPU is directly equivalent to
*willfully* designing a product that is 100% destined for commercial
rejection, due to the extremely high competitive performance/watt achieved
by today's mass-volume GPUs.)

I mention these because they will be encountered in every single
commercial GPU ISA, but they're not part of the "Base" (core design)
of a Vector Processor. Transcendentals can be added as a sub-RFC.

# SIMD ISAs commonly mistaken for Vector

There is considerable confusion surrounding Vector ISAs
because of misuse of the word "Vector" in the marketing
material of most well-known Packed SIMD ISAs of the past 3
decades. These Packed SIMD ISAs used features "inspired" by
Scalable Vector ISAs.

* PackedSIMD VSX. VSX, which has the word "Vector" in its name,
  is "inspired" by Vector Processing
  but has no "Scaling" capability, and no Predicate masking.
  Both these factors put pressure on developers to use
  "inline assembler unrolling" and data repetition, which in turn
  are detrimental to both L1 Data and Instruction Caches.
  Adding Predicate Masks to the PackedSIMD VSX ISA
  would effectively double the number of PackedSIMD
  instructions (750 becomes 1,500) even if it were practical
  to do so (there is no available 32-bit encoding space).
* [AVX / AVX2 / AVX128 / AVX256 / AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
  again has the word "Vector" in its name but this in no
  way makes it a Vector ISA. None of the AVX-\* family
  are "Scalable", although there is at least Predicate Masking
  in AVX-512.
* ARM NEON - accurately described as a Packed SIMD ISA in
  all literature.
* ARM SVE / SVE2 - **not a Scalable Vector ISA**, it is actually
  a hybrid PackedSIMD/PredicatedSIMD ISA: where 4-operand instructions
  had to be made overwrite (destructive) to fit into 32 bits, there was
  no room left for a predicate mask.
  The "Scaling" is, rather unfortunately, a parameter
  that is chosen by the *Hardware Architect*, rather than
  the programmer. The actual "Scalable" part as far as the programmer
  is concerned is supposed to be the Predicate Masks.  However in
  practice, ARM NEON programmers have found it too hard to adapt and
  have instead attempted to fit the NEON SIMD paradigm on top of SVE.
  This has resulted in programmers writing
  **multiple variants** of near-identical hand-coded assembler in order
  to target different machines with different hardware widths,
  going directly against the advice given in ARM's developer
  documentation.
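The difference between a fixed-width Packed SIMD loop and a predicated, width-agnostic loop can be sketched in plain Python. This is an illustrative model only: the function names and the `width` / `hw_lanes` parameters are hypothetical, and nothing here is actual VSX, NEON or SVE code.

```python
def add_packed_simd(a, b, width=4):
    """Fixed-width Packed SIMD style: unmasked main loop plus scalar tail."""
    n = len(a)
    out = [0] * n
    i = 0
    while i + width <= n:
        # One "SIMD instruction": always exactly `width` lanes, no mask.
        out[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
        i += width
    while i < n:
        # Leftover elements need a separate scalar clean-up loop.
        out[i] = a[i] + b[i]
        i += 1
    return out


def add_predicated(a, b, hw_lanes):
    """Predicated style: one loop body, correct for any hardware width."""
    n = len(a)
    out = [0] * n
    for base in range(0, n, hw_lanes):
        # Predicate mask: a lane is active only while it is still in range.
        mask = [base + lane < n for lane in range(hw_lanes)]
        for lane in range(hw_lanes):
            if mask[lane]:
                out[base + lane] = a[base + lane] + b[base + lane]
    return out
```

In the first function the width is baked into the code, so a different hardware width means a different routine plus its own tail handling. In the second, the same routine gives the same answer whether `hw_lanes` is 2, 4 or 16, which is the property that hand-written per-width variants throw away.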

A good analogy explaining why "Silicon-Partner Scalability" is
catastrophic is to note that the situation is near-identical to when IBM
extended Power ISA from 32 to 64-bit.  Existing 32-bit systems
were **unable** to run or trap-and-emulate 64-bit instructions
**because they were the exact same opcodes**.  The "Silicon Scalability"
of both RVV and ARM SVE/2 is the exact same mistake, but much
worse.  At least IBM provided an `MSR.SF` bit.

The saving grace of PackedSIMD VSX is that it did not fall to the
seduction outlined in the "SIMD Considered Harmful" article
<https://www.sigarch.org/simd-instructions-considered-harmful/>.
It is clear that VSX implementations are expected to deploy Multi-Issue
to achieve high performance, which is a much cleaner approach that has not
resulted in ISA poisoning such as that suffered by x86 (AVX).

# Actual 3D GPU Architectures and ISAs (all SIMD)

None of these are Scalable Vector ISAs: they are all SIMD ISAs.

* Broadcom Videocore
  <https://github.com/hermanhermitage/videocoreiv>
* Etnaviv
  <https://github.com/etnaviv/etna_viv/tree/master/doc>
* Nyuzi
  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
* MALI
  <https://github.com/cwabbott0/mali-isa-docs>
* AMD
  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
* MIAOW which is *NOT* a 3D GPU, it is a processor which happens to
  implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
  <https://miaowgpu.org/>

# Actual Scalable Vector Processor Architectures and ISAs

* NEC SX Aurora
  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
* Cray ISA
  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
* RISC-V RVV
  <https://github.com/riscv/riscv-v-spec>
* MRISC32 ISA Manual (under active development)
  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
  Mitch under NDA on direct contact with him.  It takes a different
  approach from the others, which may all be termed "Cray-Style
  Horizontal-First" Vectorisation: 66000 is a *Vertical-First* Vector ISA
  with hardware-level auto-vectorisation.
* [ETA-10](http://50.204.185.175/collections/catalog/102641713)
  an extremely rare Scalable Vector Architecture from 1986,
  similar to the CDC Cyber 205.
  Only 25 machines were ever delivered. Page 3-220 of its ISA
  shows that it had Predicate Masks and Horizontal Reduction.
  Appendix H-1 shows it is likely a Memory-to-Memory Vector
  Architecture, which overcame the penalties normally associated
  with this by adding an explicit "Vector operand forwarding/chaining"
  instruction (Page 3-69).  It is however clearly Scalable, up to Vector
  lengths of 2^16 elements.
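For readers unfamiliar with the two features noted for the ETA-10 above, Predicate Masks and Horizontal Reduction, the following is a generic Python illustration of how they combine. It is not taken from the ETA-10 manual and does not model any specific instruction; `masked_sum` is a hypothetical name.

```python
def masked_sum(values, mask):
    """Horizontal reduction (sum) over only the elements the mask enables."""
    total = 0
    for value, enabled in zip(values, mask):
        # A predicate bit of False means the element takes no part at all.
        if enabled:
            total += value
    return total
```

A reduction collapses a whole Vector into one scalar, and the mask selects which elements contribute, so `masked_sum([1, 2, 3, 4], [True, False, True, True])` adds only 1, 3 and 4.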

The terms Horizontal and Vertical allude to the Matrix "Row-First" and
"Column-First" techniques, where:

* Horizontal-First processes all elements in a Vector before moving on
  to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires
  loop constructs to explicitly step to the next element.
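The two orderings just described can be sketched as a pair of nested loops with the nesting swapped. This is a toy Python model under stated assumptions: the register names, the two-instruction `PROGRAM` and the helper functions are all hypothetical and do not correspond to SVP64 or any other real ISA encoding.

```python
def run_horizontal(program, regs, vl):
    """Horizontal-First: finish every element of one instruction, then move on."""
    trace = []
    for name, op in program:
        for i in range(vl):
            op(regs, i)
            trace.append((name, i))
    return trace


def run_vertical(program, regs, vl):
    """Vertical-First: an explicit loop steps the element index."""
    trace = []
    for i in range(vl):
        for name, op in program:
            # Each instruction touches exactly one element per pass.
            op(regs, i)
            trace.append((name, i))
    return trace


def make_regs():
    """Fresh toy register file: two inputs, one temporary, one result."""
    return {"a": [1, 2, 3], "b": [10, 20, 30], "t": [0, 0, 0], "r": [0, 0, 0]}


def op_add(regs, i):
    regs["t"][i] = regs["a"][i] + regs["b"][i]


def op_mul(regs, i):
    regs["r"][i] = regs["t"][i] * 2


PROGRAM = [("add", op_add), ("mul", op_mul)]
```

With a Vector length of 3, `run_horizontal` executes add on elements 0, 1, 2 and then mul on 0, 1, 2, whereas `run_vertical` executes add then mul on element 0, then on element 1, and so on. The final results agree here only because this toy program has no dependencies between different elements.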

Vector-type Support by Architecture

| Architecture | Horizontal | Vertical |
| ------------ | ---------- | -------- |
| MyISA 66000  |            | X        |
| Cray         | X          |          |
| SX Aurora    | X          |          |
| RVV          | X          |          |
| SVP64        | X          | X        |

![Horizontal vs Vertical](/openpower/sv/sv_horizontal_vs_vertical.svg)