[[!tag standards]]

# Simple-V Vectorisation for the OpenPOWER ISA

**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.

<https://bugs.libre-soc.org/show_bug.cgi?id=213>

SV is designed as a Vector ISA for Hybrid 3D CPU GPU VPU workloads.
As such it brings features normally only found in Cray-style Vector
Supercomputers (Cray-1, NEC SX-Aurora)
and in GPUs, but keeps strictly to a *Simple* principle of leveraging
a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
explicit Vector opcode exists in SV, at all**.

Fundamental design principles:

* Simplicity of introduction and implementation on the existing OpenPOWER ISA
* Effectively a hardware for-loop, pausing the PC and issuing multiple scalar operations
* Preserving the underlying scalar execution dependencies as if the for-loop had been expanded as actual scalar instructions
  (termed "preserving Program Order")
* Augments ("tags") existing instructions, providing Vectorisation "context" rather than adding new ones.
* Does not modify or deviate from the underlying scalar OpenPOWER ISA unless it provides a significant performance or other advantage to do so in the Vector space (dropping XER.SO and OE=1 for example)
* Designed for Supercomputing: avoids creating significant sequential
  dependency hazards, allowing high performance superscalar microarchitectures to be deployed.

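The "hardware for-loop" principle can be sketched in a few lines of Python. This is a minimal illustration only, not the Libre-SOC simulator: the list-based register file, its size, and the names `scalar_add` and `sv_loop` are all hypothetical, every operand is assumed to be a Vector, and predication and element-width overrides are left out.

```python
# Minimal sketch (hypothetical names): a prefixed instruction behaves like
# a for-loop wrapped around ONE unmodified scalar operation.

MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def scalar_add(regs, rt, ra, rb):
    """The plain scalar operation: 64-bit wrap-around add."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64

def sv_loop(scalar_op, regs, vl, rt, ra, rb):
    """Hold the PC and issue the scalar op VL times, in Program Order."""
    for i in range(vl):
        scalar_op(regs, rt + i, ra + i, rb + i)

regs = [0] * 64                  # register file size is an arbitrary assumption
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]
sv_loop(scalar_add, regs, vl=4, rt=24, ra=8, rb=16)
print(regs[24:28])               # [11, 22, 33, 44]
```

Because each element operation is issued strictly in order, overlapping source and destination registers behave exactly as the unrolled scalar sequence would, which is the sense in which Program Order is preserved.
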
Advantages of these design principles:

* It is therefore easy to create a first (and sometimes only) implementation as literally a for-loop in hardware, simulators, and compilers.
* Hardware Architects may understand and implement SV as being an
  extra pipeline stage, inserted between decode and issue, that is
  a simple for-loop issuing element-level sub-instructions.
* More complex HDL can be done by repeating existing scalar ALUs and
  pipelines as blocks and leveraging existing Multi-Issue Infrastructure.
* As (mostly) a high-level "context" that does not (significantly) deviate from the scalar OpenPOWER ISA and, in its purest form, being "a for-loop around scalar instructions", it is minimally disruptive and consequently stands a reasonable chance of broad community adoption and acceptance.
* Completely wipes opcode proliferation off the map, not just for
  SIMD (SIMD is O(N^6) opcode proliferation)
  but for Vectorisation ISAs as well.  No more separate Vector
  instructions.

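The "extra pipeline stage" view can be pictured as a generator sitting between decode and issue: one decoded instruction goes in, and VL ordinary scalar sub-instructions come out for the existing issue logic to handle. The sketch below is an assumption-laden illustration; `ScalarOp` and `expand` are hypothetical names, and all operands are assumed to be Vectors.

```python
from typing import Iterator, NamedTuple

class ScalarOp(NamedTuple):
    """A decoded three-operand scalar instruction (hypothetical model)."""
    mnemonic: str
    rt: int
    ra: int
    rb: int

def expand(op: ScalarOp, vl: int) -> Iterator[ScalarOp]:
    """Yield VL element-level scalar sub-instructions for one decoded op."""
    for i in range(vl):
        yield ScalarOp(op.mnemonic, op.rt + i, op.ra + i, op.rb + i)

# What the issue stage would see for a prefixed "add" with VL=3
for sub in expand(ScalarOp("add", 24, 8, 16), vl=3):
    print(f"{sub.mnemonic} r{sub.rt}, r{sub.ra}, r{sub.rb}")
```
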
# Major opcodes summary

Please be advised that even though everything below is entirely in DRAFT
status, there is considerable concern that, because there is not yet any
two-way day-to-day communication established with the OPF ISA WG, we have
no idea if any of these are conflicting with future plans by any OPF
Members.  **The External ISA WG RFC Process is yet to be ratified
and Libre-SOC may not join the OPF as an entity because it does
not exist except in name. Even if it existed it would be a conflict
of interest to join the OPF, due to our funding remit from NLnet**.
We therefore proceed on the basis of making public the intention to
submit RFCs once the External ISA WG RFC Process is in place and,
in a wholly unsatisfactory manner, have to *hope and trust* that
OPF ISA WG Members are reading this and taking it into consideration.

**None of these Draft opcodes are intended for private custom
secret proprietary usage. They are all intended for entirely
public, upstream, high-profile mass-volume day-to-day usage at the
same level as add, popcnt and fld.**

* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
* bitmanip requires two major opcodes (due to 16+ bit immediates);
  those are currently EXT022 and EXT05.
* brownfield encoding in one of those two major opcodes still
  requires multiple VA-Form operations (in greater numbers
  than EXT04 has spare), currently in EXT022 (which is under
  severe design pressure)
* space in EXT019 next to addpcis and crops is recommended
* many X-Form opcodes currently in EXT022 have no preference
  for a location at all, and may be moved to EXT059, EXT019,
  EXT031 or another much more suitable location.

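The "25% of EXT01" figure follows from fixing two bits out of the space left after the 6-bit primary opcode. The sketch below assumes the Power ISA convention of MSB0 numbering on a 32-bit word (bit 0 is the most significant bit, bits 0 to 5 hold the primary opcode) and applies the draft "bits 6 and 9" rule quoted above. The helper names are hypothetical, and the rule itself may change while SV is in draft.

```python
PO_SHIFT = 26  # bits 0-5 (MSB0) of a 32-bit word hold the primary opcode

def msb0_bit(word: int, n: int) -> int:
    """Return bit n of a 32-bit word using MSB0 numbering (bit 0 = MSB)."""
    return (word >> (31 - n)) & 1

def is_svp64_prefix(word: int) -> bool:
    """True if word is in EXT01 with bits 6 and 9 both set (draft rule)."""
    return ((word >> PO_SHIFT) == 1
            and msb0_bit(word, 6) == 1
            and msb0_bit(word, 9) == 1)

# Enumerate the four combinations of bits 6 and 9 within EXT01
ext01 = 1 << PO_SHIFT
combos = [ext01 | (b6 << (31 - 6)) | (b9 << (31 - 9))
          for b6 in (0, 1) for b9 in (0, 1)]
claimed = sum(is_svp64_prefix(w) for w in combos)
print(f"{claimed} of {len(combos)} combinations")  # 1 of 4 combinations
```
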
# Sub-pages

Pages being developed, and examples:

* [[sv/overview]] explaining the basics.
* [[sv/implementation]] implementation planning and coordination
* [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
  contains explanations and further details
* [[sv/setvl]] the Cray-style "Vector Length" instruction
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
* [[sv/cr_int_predication]] instructions needed for effective predication
* [[opcode_regs_deduped]]
* [[sv/vector_swizzle]]
* [[sv/mv.swizzle]]
* [[sv/mv.x]]
* SVP64 "Modes":
  - For condition register operations see [[sv/cr_ops]] - SVP64 Condition Register ops: Guidelines
    on Vectorisation of any v3.0B base operations which return
    or modify a Condition Register bit or field.
  - For LD/ST Modes, see [[sv/ldst]].
  - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch behaviour: All/Some Vector CRs
  - For arithmetic and logical, see [[sv/normal]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for non-VSX implementations
* [[sv/mv.vec]] move to and from vec2/3/4
* [[sv/sprs]] SPRs
* [[sv/bitmanip]]
* [[sv/biginteger]] Operations that help with big arithmetic
* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
* [[sv/svstep]] Key stepping instruction for Vertical-First Mode
* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
* Twin targeted instructions (two registers out, one implicit).
  The rules for twin register targets
  (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
  - [[isa/svfixedarith]]
  - [[isa/svfparith]]
* TODO: OpenPOWER [[openpower/transcendentals]]

Examples, experiments, ideas and discussion:

* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* [[sv/example_dep_matrices]]
* [[sv/major_opcode_allocation]]
* [[sv/byteswap]]
* [[sv/16_bit_compressed]] experimental
* [[sv/toc_data_pointer]] experimental
* [[sv/predication]] discussion on predication concepts
* [[sv/register_type_tags]]

Additional links:

* <https://www.sigarch.org/simd-instructions-considered-harmful/>
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]
* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]

===

Required Background Reading:
============================

These are all, deep breath, basically... required reading, *as well as and in addition* to a full and comprehensive deep technical understanding of the Power ISA, in order to understand the depth and background of SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like the Power ISA itself).

This is just how it is.

Given the sheer overwhelming size and scope of SVP64 we have gone to CONSIDERABLE LENGTHS to provide justification and rationalisation for adding the various sub-extensions to the Base Scalar Power ISA.

* Scalar bitmanipulation is justifiable for the exact same reasons the extensions are justifiable for other ISAs.  The additional justification for including instructions that are already (sort-of) present in VSX is that VSX is not mandatory, and the complexity of implementing VSX is too high a price to pay at the Embedded SFFS Compliancy Level.

* Scalar FP-to-INT conversions, likewise.  ARM has a JavaScript conversion instruction; the Power ISA does not (and it costs a ridiculous 45 instructions to implement, including 6 branches!)

* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable for High-Performance Compute workloads.

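For readers unfamiliar with the JavaScript conversion mentioned above, the sketch below shows the semantics involved, on the assumption that the conversion meant is ECMAScript's ToInt32 (the operation that ARM's FJCVTZS instruction targets). The function name is hypothetical, and this models the arithmetic only, not any draft Power ISA instruction.

```python
import math

def js_to_int32(x: float) -> int:
    """ECMAScript ToInt32: truncate toward zero, wrap modulo 2**32,
    then reinterpret the low 32 bits as a signed integer."""
    if math.isnan(x) or math.isinf(x):
        return 0                      # NaN and infinities convert to 0
    n = math.trunc(x) & 0xFFFF_FFFF   # wrap modulo 2**32
    return n - (1 << 32) if n >= (1 << 31) else n

print(js_to_int32(-1.9))              # -1
print(js_to_int32(2**32 + 5.7))       # 5
print(js_to_int32(2147483648.0))      # -2147483648
```

The special cases (NaN, infinities, wrap-around instead of saturation) are what make the sequence long and branch-heavy when it has to be composed from ordinary saturating conversion instructions.
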
It also has to be pointed out that normally this work would be covered by multiple separate full-time Workgroups with multiple Members contributing their time and resources!

Overall the contributions that we are developing take the Power ISA out of the specialist highly-focussed market it is presently best known for, and expand it into areas with much wider general adoption and broader uses.

---

OpenCL specifications are linked here; these are relevant when we get to a 3D GPU / High Performance Compute ISA WG RFC:
[[openpower/transcendentals]]

(Failure to add Transcendentals to a 3D GPU is directly equivalent to *willfully* designing a product that is 100% destined for commercial failure.)

I mention these because they will be encountered in every single commercial GPU ISA, but they're not part of the "Base" (core design) of a Vector Processor. Transcendentals can be added as a sub-RFC.

---

Actual 3D GPU Architectures and ISAs:
-------------------------------------

* Broadcom Videocore
  <https://github.com/hermanhermitage/videocoreiv>

* Etnaviv
  <https://github.com/etnaviv/etna_viv/tree/master/doc>

* Nyuzi
  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>

* MALI
  <https://github.com/cwabbott0/mali-isa-docs>

* AMD
  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>

* MIAOW which is *NOT* a 3D GPU, it is a processor which happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
  <https://miaowgpu.org/>

Actual Vector Processor Architectures and ISAs:
-----------------------------------------------

* NEC SX Aurora
  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>

* Cray ISA
  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>

* RISC-V RVV
  <https://github.com/riscv/riscv-v-spec>

* MRISC32 ISA Manual (under active development)
  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>

* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from Mitch on direct contact with him.  It takes a different approach from the others, all of which may be termed "Cray-Style Horizontal-First" Vectorisation.  66000 is a *Vertical-First* Vector ISA.

The terms Horizontal and Vertical allude to the Matrix "Row-First" or "Column-First" technique, where:

* Horizontal-First processes all elements in a Vector before moving on to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires loop constructs to explicitly step to the next element.

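The difference between the two orderings can be sketched by listing the (instruction, element) pairs in the order they would be executed. This is an illustration only: the function names and the three-instruction "program" are hypothetical, and the `el += 1` line merely stands in for whatever explicit stepping construct a Vertical-First ISA provides.

```python
def horizontal_first(instrs, vl):
    """All elements of one instruction, then move to the next instruction."""
    return [(ins, el) for ins in instrs for el in range(vl)]

def vertical_first(instrs, vl):
    """One element of every instruction, then explicitly step the element."""
    order = []
    el = 0
    while el < vl:                  # explicit loop construct in the program
        for ins in instrs:
            order.append((ins, el))
        el += 1                     # stands in for an explicit step instruction
    return order

prog = ["ld", "add", "st"]
print(horizontal_first(prog, 2))
# [('ld', 0), ('ld', 1), ('add', 0), ('add', 1), ('st', 0), ('st', 1)]
print(vertical_first(prog, 2))
# [('ld', 0), ('add', 0), ('st', 0), ('ld', 1), ('add', 1), ('st', 1)]
```
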
Vector-type Support by Architecture:

[[!table  data="""
Architecture | Horizontal | Vertical
MyISA 66000  |            | X
Cray         | X          |
SX Aurora    | X          |
RVV          | X          |
SVP64        | X          | X
"""]]

===

Obligatory Dilbert:

<img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />