[[!tag standards]]

# Comparative analysis

The documents linked on this page are all, deep breath, basically...
required reading, *in addition* to a full and comprehensive technical
understanding of the Power ISA, in order to understand the depth and
background of SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like
the Power ISA itself).

This is just how it is.

Given the sheer overwhelming size and scope of SVP64, we have gone to
**considerable lengths** to provide justification and rationalisation for
adding the various sub-extensions to the Base Scalar Power ISA.

* Scalar bit manipulation is justifiable for the exact same reasons the
  extensions are justifiable for other ISAs.  The additional justification
  for their inclusion, where some instructions are already (sort-of) present
  in VSX, is that VSX is not mandatory, and the complexity of implementing
  VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
* Scalar FP-to-INT conversions, likewise.  ARM has a JavaScript conversion
  instruction; the Power ISA does not (and it costs a ridiculous 45
  instructions to implement, including 6 branches!)
* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
  for High-Performance Compute workloads.

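To give a feel for why the JavaScript conversion is costly without a
dedicated instruction, here is a minimal Python sketch of the semantics
involved. It assumes that the conversion meant above is ECMAScript's
`ToInt32` (truncate, wrap modulo 2^32, map NaN and Infinity to zero).
The function name `js_to_int32` is made up for illustration and is not
taken from any SVP64 or Power ISA document.

```python
import math


def js_to_int32(x: float) -> int:
    """Sketch of ECMAScript ToInt32 (assumed to be the conversion in question)."""
    # NaN and +/-Infinity convert to 0: one special case to test for.
    if math.isnan(x) or math.isinf(x):
        return 0
    # Truncate towards zero, then wrap modulo 2**32 instead of saturating.
    wrapped = math.trunc(x) % (1 << 32)
    # Reinterpret the upper half of the range as negative (two's complement).
    return wrapped - (1 << 32) if wrapped >= (1 << 31) else wrapped
```

Each of the special cases in the sketch (NaN, Infinity, out-of-range
wrapping, sign reinterpretation) has to be tested for separately when
only ordinary scalar instructions are available, which is where the
branches come from.
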
It also has to be pointed out that normally this work would be covered by
multiple separate full-time Workgroups with multiple Members contributing
their time and resources.

Overall, the contributions that we are developing take the Power ISA out of
the specialist, highly-focussed market it is presently best known for, and
expand it into areas with much wider general adoption and broader uses.

---

OpenCL specifications are linked here. They become relevant when we get
to a 3D GPU / High Performance Compute ISA WG RFC:
[[openpower/transcendentals]]

(Failure to add Transcendentals to a 3D GPU is directly equivalent to
*willfully* designing a product that is 100% destined for commercial
rejection, due to the extremely high competitive performance/watt achieved
by today's mass-volume GPUs.)

I mention Transcendentals because they will be encountered in every single
commercial GPU ISA, but they're not part of the "Base" (core design)
of a Vector Processor. Transcendentals can be added as a sub-RFC.

# SIMD ISAs commonly mistaken for Vector

There is considerable confusion surrounding Vector ISAs
because of a misuse of the word "Vector" in most
well-known Packed SIMD ISAs.

* PackedSIMD VSX. VSX, which has the word "Vector" in its name,
  is "inspired" by Vector Processing
  but has no "Scaling" capability, and no Predicate Masking.
  Adding Predicate Masks to the PackedSIMD VSX ISA
  would effectively double the number of PackedSIMD
  instructions (750 becomes 1,500).
* [AVX / AVX2 / AVX128 / AVX256 / AVX512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
  again has the word "Vector" in its name, but this in no
  way makes it a Vector ISA. None of the AVX-\* family
  is "Scalable"; however, there is at least Predicate Masking
  in AVX-512.
* ARM NEON - accurately described as a Packed SIMD ISA in
  all literature.
* ARM SVE / SVE2 - partially accurately described as a Scalable Vector
  ISA, but the "Scaling" is, rather unfortunately, a parameter
  that is chosen by the *Hardware Architect*, rather than
  the programmer. The actual "Scalable" part, as far as the programmer
  is concerned, is supposed to be the Predicate Masks.  However, in
  practice, ARM NEON programmers have found it too hard to adapt and
  have instead attempted to fit the NEON SIMD paradigm on top of SVE.
  This has resulted in programmers writing
  **multiple variants** of near-identical hand-coded assembler in order
  to target different machines with different hardware widths,
  going directly against the advice given in ARM's developer
  documentation.

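The practical difference between fixed-width Packed SIMD and Predicate
Masking can be sketched in a few lines of Python. This is a toy model
only, not real VSX, AVX-512 or SVE code: `WIDTH`, `add_packed_simd` and
`add_predicated` are hypothetical names, and each inner loop or slice
stands in for what would be a single hardware instruction.

```python
WIDTH = 4  # hypothetical fixed hardware SIMD width, in elements


def add_packed_simd(a, b):
    """Fixed-width style: full-width blocks, then a separate scalar tail loop."""
    n = len(a)
    out = [0] * n
    full = n - (n % WIDTH)
    for i in range(0, full, WIDTH):
        # One "SIMD instruction": always exactly WIDTH lanes, no masking.
        out[i:i + WIDTH] = [x + y for x, y in zip(a[i:i + WIDTH], b[i:i + WIDTH])]
    for i in range(full, n):
        # Tail cleanup, one element at a time, in separately written code.
        out[i] = a[i] + b[i]
    return out


def add_predicated(a, b):
    """Predicated style: a single loop, with a mask disabling the unused lanes."""
    n = len(a)
    out = [0] * n
    for i in range(0, n, WIDTH):
        mask = [i + lane < n for lane in range(WIDTH)]
        for lane in range(WIDTH):
            if mask[lane]:
                out[i + lane] = a[i + lane] + b[i + lane]
    return out
```

Both functions give the same result for any length, but only the
unpredicated one needs a second code path for the leftover elements.
Note also that `WIDTH` is baked into both: neither is "Scalable" in the
sense used on this page.
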

# Actual 3D GPU Architectures and ISAs (all SIMD)

None of these are Vector ISAs: they are SIMD ISAs.

* Broadcom Videocore
  <https://github.com/hermanhermitage/videocoreiv>
* Etnaviv
  <https://github.com/etnaviv/etna_viv/tree/master/doc>
* Nyuzi
  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
* MALI
  <https://github.com/cwabbott0/mali-isa-docs>
* AMD
  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
* MIAOW, which is *NOT* a 3D GPU: it is a processor which happens to
  implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
  <https://miaowgpu.org/>


# Actual Scalable Vector Processor Architectures and ISAs

* NEC SX Aurora
  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
* Cray ISA
  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
* RISC-V RVV
  <https://github.com/riscv/riscv-v-spec>
* MRISC32 ISA Manual (under active development)
  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available
  directly from Mitch on request.  It takes a different approach from the
  others, which may be termed "Cray-Style Horizontal-First" Vectorisation.
  66000 is a *Vertical-First* Vector ISA.

The terms Horizontal and Vertical allude to the Matrix "Row-First" and
"Column-First" techniques, where:

* Horizontal-First processes all elements in a Vector before moving on
  to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires
  loop constructs to explicitly step to the next element.

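The two orderings can be illustrated with a short Python sketch that
simply records which (instruction, element) pair gets executed when.
The function names and the three-instruction "program" are made up for
illustration and do not correspond to any real ISA's mnemonics.

```python
def horizontal_first(instructions, vl):
    """Each instruction covers all vl elements before the next one starts."""
    order = []
    for insn in instructions:
        for element in range(vl):
            order.append((insn, element))
    return order


def vertical_first(instructions, vl):
    """Each instruction touches ONE element; an explicit loop steps onwards."""
    order = []
    for element in range(vl):  # the explicit loop construct
        for insn in instructions:
            order.append((insn, element))
    return order


program = ["load", "mul", "store"]
print(horizontal_first(program, 2))
# [('load', 0), ('load', 1), ('mul', 0), ('mul', 1), ('store', 0), ('store', 1)]
print(vertical_first(program, 2))
# [('load', 0), ('mul', 0), ('store', 0), ('load', 1), ('mul', 1), ('store', 1)]
```

The same six operations are carried out in both cases. Only the nesting
of the two loops differs, exactly as with Row-First versus Column-First
traversal of a Matrix.
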
Vector-type Support by Architecture

| Architecture | Horizontal | Vertical |
| ------------ | ---------- | -------- |
| MyISA 66000  |            | X        |
| Cray         | X          |          |
| SX Aurora    | X          |          |
| RVV          | X          |          |
| SVP64        | X          | X        |
