[[!tag standards]]

# Simple-V Vectorisation for the OpenPOWER ISA

**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.

<https://bugs.libre-soc.org/show_bug.cgi?id=213>

SV is designed as a Vector ISA for Hybrid 3D CPU-GPU-VPU workloads.
As such it brings features normally only found in Cray-style Vector
Supercomputers (Cray-1, NEC SX-Aurora)
and in GPUs, but keeps strictly to a *Simple* principle of leveraging
a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
explicit Vector opcode exists in SV, at all**.

Fundamental design principles:

* Simplicity of introduction and implementation on the existing OpenPOWER ISA
* Effectively a hardware for-loop, pausing the PC and issuing multiple scalar
  operations
* Preserving the underlying scalar execution dependencies as if the
  for-loop had been expanded as actual scalar instructions
  (termed "preserving Program Order")
* Augments ("tags") existing instructions, providing Vectorisation
  "context" rather than adding new ones.
* Does not modify or deviate from the underlying scalar OpenPOWER ISA
  unless doing so provides a significant performance or other advantage
  in the Vector space (dropping XER.SO for example)
* Designed for Supercomputing: avoids creating significant sequential
  dependency hazards, allowing high performance superscalar
  microarchitectures to be deployed.
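
The "hardware for-loop" and "preserving Program Order" principles can be illustrated with a minimal Python sketch. It is not taken from the SV specification: the names `sv_loop` and `scalar_add` are hypothetical, all three operands are assumed to be vectors of consecutive registers, and everything else the specification adds on top of the basic loop is omitted.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def scalar_add(a, b):
    """Stand-in for an ordinary 64-bit scalar add."""
    return (a + b) & MASK64


def sv_loop(scalar_op, regs, vl, rt, ra, rb):
    """Issue one scalar operation per element, strictly in element order.

    Hypothetical model: the register numbers rt, ra and rb are each
    incremented by the element index, as if the loop had been written
    out as vl separate scalar instructions.
    """
    for i in range(vl):
        regs[rt + i] = scalar_op(regs[ra + i], regs[rb + i])
    return regs


# Independent elements: four adds issued from one prefixed instruction.
regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]
sv_loop(scalar_add, regs, vl=4, rt=24, ra=8, rb=16)
print(regs[24:28])  # [11, 22, 33, 44]

# Overlapping source and destination: each element sees the result of
# the previous one, exactly as the expanded scalar sequence would.
chain = [0] * 32
chain[8] = 1
chain[16:19] = [10, 20, 30]
sv_loop(scalar_add, chain, vl=3, rt=9, ra=8, rb=16)
print(chain[8:12])  # [1, 11, 31, 61]
```

The second call shows why Program Order matters: because the destination range overlaps the source range, a result computed for one element feeds the next, and an implementation that ran the elements out of order without tracking that dependency would produce a different answer.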

Advantages of these design principles:

* It is therefore easy to create a first (and sometimes only)
  implementation as literally a for-loop in hardware, simulators, and
  compilers.
* Hardware Architects may understand and implement SV as being an
  extra pipeline stage, inserted between decode and issue, that is
  a simple for-loop issuing element-level sub-instructions.
* More complex HDL can be done by repeating existing scalar ALUs and
  pipelines as blocks and leveraging existing Multi-Issue Infrastructure.
* As (mostly) a high-level "context" that does not (significantly) deviate
  from scalar OpenPOWER ISA and, in its purest form, being "a for loop around
  scalar instructions", it is minimally-disruptive and consequently stands
  a reasonable chance of broad community adoption and acceptance.
* Completely wipes opcode proliferation off the map: not just that of
  SIMD (which is O(N^6) opcode proliferation)
  but that of Vectorisation ISAs as well.  No more separate Vector
  instructions.

Comparative instruction count:

* ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
* ARM SVE: around 4,000 instructions, prerequisite: NEON.
* ARM SVE2: around 1,000 instructions, prerequisite: SVE.
* Intel AVX-512: around 4,000 instructions, prerequisite: AVX2 etc.
* RISC-V RVV: 192 instructions, prerequisite: 96 Scalar RV64GC instructions.
* SVP64: **four** instructions, 24-bit prefixing of
  prerequisite SFS (150) or
  SFFS (214) Compliancy Subsets.

# Major opcodes summary

Please be advised that even though everything below is entirely DRAFT
status, there is considerable concern: because no two-way
day-to-day communication has yet been established with the OPF ISA WG, we have
no idea whether any of these allocations conflict with future plans by any OPF
Members.  **The External ISA WG RFC Process is yet to be ratified,
and Libre-SOC may not join the OPF as an entity because it does
not exist except in name. Even if it existed it would be a conflict
of interest to join the OPF, due to our funding remit from NLnet**.
We therefore proceed on the basis of making public the intention to
submit RFCs once the External ISA WG RFC Process is in place and,
in a wholly unsatisfactory manner, have to *hope and trust* that
OPF ISA WG Members are reading this and taking it into consideration.

**None of these Draft opcodes are intended for private custom
secret proprietary usage. They are all intended for entirely
public, upstream, high-profile mass-volume day-to-day usage at the
same level as add, popcnt and fld.**

* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
* bitmanip requires two major opcodes (due to 16+ bit immediates);
  those are currently EXT022 and EXT05.
* brownfield encoding in one of those two major opcodes still
  requires multiple VA-Form operations (in greater numbers
  than EXT04 has spare)
* space in EXT019 next to addpcis and crops is recommended
* many X-Form opcodes currently in EXT022 have no preference
  for a location at all, and may be moved to EXT059, EXT019,
  EXT031 or another much more suitable location.
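
The "25% of EXT01" figure follows directly from fixing two bits of the prefix word: of the four combinations of bits 6 and 9, only one is claimed. The Python sketch below checks this arithmetic. It assumes the bit positions exactly as stated in the bullet above (this is DRAFT and may change), uses the Power ISA MSB0 convention in which bit 0 is the most significant bit of the 32-bit word, and the helper names are hypothetical.

```python
def msb0_bit(word, n, width=32):
    """Return bit n of word in MSB0 numbering (bit 0 = most significant)."""
    return (word >> (width - 1 - n)) & 1


def major_opcode(word):
    """Bits 0..5 (MSB0) of a 32-bit instruction word."""
    return (word >> 26) & 0x3F


def in_svp64_region(prefix):
    """Assumed test: EXT01 with bits 6 and 9 both set, per the draft bullet."""
    return (major_opcode(prefix) == 1
            and msb0_bit(prefix, 6) == 1
            and msb0_bit(prefix, 9) == 1)


# Enumerate every value of MSB0 bits 6..9 within major opcode 1.
# Those four bits sit at shifts 25..22 of the 32-bit word.
candidates = [(1 << 26) | (b << 22) for b in range(16)]
matching = [w for w in candidates if in_svp64_region(w)]
print(f"{len(matching)}/{len(candidates)} = {len(matching) / len(candidates):.0%}")
# 4/16 = 25%
```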

Note that there is no Sandbox allocation in the published ISA Spec for
v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
Prefixed-Prefixed instructions (a 32-bit SVP64 prefix in front of a
64-bit v3.1 Prefixed instruction)
would become a whopping 96-bit long instruction. Avoiding this
situation is a high priority, which in turn by necessity puts pressure
on the 32-bit Major Opcode space.

Note also that EXT022, the Official Architectural Sandbox area,
is under severe design pressure as it is insufficient to hold
the full extent of the instruction additions required to create
a Hybrid 3D CPU-VPU-GPU.

**Whilst SVP64 is only 4 instructions,
the heavy focus on VSX for the past 12 years has left the SFFS Level
anaemic and out-of-date compared to ARM and x86. Approximately
100 additional Scalar Instructions are up for proposal.**

# Sub-pages

Pages being developed and examples:

* [[sv/overview]] explaining the basics.
* [[sv/implementation]] implementation planning and coordination
* [[sv/svp64]] contains the packet-format *only*; the [[sv/svp64/appendix]]
  contains explanations and further details
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
* [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
* [[sv/sprs]] SPRs
* SVP64 "Modes":
  - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
    Register ops: Guidelines
    on Vectorisation of any v3.0B base operations which return
    or modify a Condition Register bit or field.
  - For LD/ST Modes, see [[sv/ldst]].
  - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
    behaviour: All/Some Vector CRs
  - For arithmetic and logical, see [[sv/normal]]

Core SVP64 instructions:

* [[sv/setvl]] the Cray-style "Vector Length" instruction
* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
* [[sv/svstep]] Key stepping instruction for Vertical-First Mode

Vector-related:

* [[sv/vector_swizzle]]
* [[sv/mv.vec]] move to and from vec2/3/4
* [[sv/mv.swizzle]]
* [[sv/vector_ops]] scalar operations needed for supporting vectors

Scalar Instructions:

* [[sv/cr_int_predication]] instructions needed for effective predication
* [[sv/bitmanip]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
* Twin-targeted instructions (two registers out, one implicit).
  The rules for twin register targets
  (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
  - [[isa/svfixedarith]]
  - [[isa/svfparith]]
  - [[sv/biginteger]] Operations that help with big arithmetic
* TODO: OpenPOWER adaptation [[openpower/transcendentals]]

Examples, experiments, future ideas and discussion:

* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* [[sv/example_dep_matrices]]
* [[sv/major_opcode_allocation]]
* [[sv/byteswap]]
* [[sv/16_bit_compressed]] experimental
* [[sv/toc_data_pointer]] experimental
* [[sv/predication]] discussion on predication concepts
* [[sv/register_type_tags]]
* [[sv/mv.x]] deprecated in favour of Indexed REMAP

Additional links:

* <https://www.sigarch.org/simd-instructions-considered-harmful/>
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]
* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]

===

Required Background Reading:
============================

These are all, deep breath, basically... required reading, *in addition*
to a full and comprehensive deep technical understanding
of the Power ISA, in order to understand the depth and background of
SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like
the Power ISA itself).

This is just how it is.

Given the sheer overwhelming size and scope of SVP64 we have gone to
**considerable lengths** to provide justification and rationalisation for
adding the various sub-extensions to the Base Scalar Power ISA.

* Scalar bitmanipulation is justifiable for the exact same reasons the
  extensions are justifiable for other ISAs.  The additional justification
  for their inclusion where some instructions are already (sort-of) present
  in VSX is that VSX is not mandatory, and the complexity of implementation
  of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
* Scalar FP-to-INT conversions, likewise.  ARM has a JavaScript conversion
  instruction; the Power ISA does not (and the conversion costs a ridiculous
  45 instructions to implement, including 6 branches!)
* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
  for High-Performance Compute workloads.
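
To give a feel for why the JavaScript conversion is expensive without a dedicated instruction, here is a minimal Python sketch of the conversion semantics. It assumes the bullet above refers to the ECMAScript `ToInt32` operation (the behaviour that ARM's `FJCVTZS` instruction provides in hardware); the function name `js_to_int32` is hypothetical and the sketch says nothing about how the 45-instruction Power ISA sequence is actually arranged.

```python
import math


def js_to_int32(x: float) -> int:
    """Model of ECMAScript ToInt32 for a double-precision input.

    NaN and infinities map to 0; everything else is truncated toward
    zero, reduced modulo 2**32, then reinterpreted as a signed
    32-bit value.
    """
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % (1 << 32)  # Python's % is non-negative here
    return n - (1 << 32) if n >= (1 << 31) else n


print(js_to_int32(3.7))            # 3
print(js_to_int32(-3.7))           # -3
print(js_to_int32(2.0 ** 31))      # -2147483648
print(js_to_int32(2.0 ** 32 + 5))  # 5
print(js_to_int32(float("nan")))   # 0
```

Unlike an ordinary saturating FP-to-INT conversion, out-of-range values wrap around rather than clamp, and each special case (NaN, infinity, negative values, values of 2^31 and above) needs its own handling when built from general-purpose instructions.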

It also has to be pointed out that normally this work would be covered by
multiple separate full-time Workgroups with multiple Members contributing
their time and resources.

Overall the contributions that we are developing take the Power ISA out of
the specialist highly-focussed market it is presently best known for, and
expand it into areas with much wider general adoption and broader uses.

---

OpenCL specifications are linked here; these are relevant when we get
to a 3D GPU / High Performance Compute ISA WG RFC:
[[openpower/transcendentals]]

(Failure to add Transcendentals to a 3D GPU is directly equivalent to
*willfully* designing a product that is 100% destined for commercial
rejection, due to the extremely high competitive performance/watt achieved
by today's mass-volume GPUs.)

I mention these because they will be encountered in every single
commercial GPU ISA, but they're not part of the "Base" (core design)
of a Vector Processor. Transcendentals can be added as a sub-RFC.

---

Actual 3D GPU Architectures and ISAs:
-------------------------------------

* Broadcom Videocore
  <https://github.com/hermanhermitage/videocoreiv>
* Etnaviv
  <https://github.com/etnaviv/etna_viv/tree/master/doc>
* Nyuzi
  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
* MALI
  <https://github.com/cwabbott0/mali-isa-docs>
* AMD
  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
* MIAOW, which is *NOT* a 3D GPU: it is a processor which happens to
  implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
  <https://miaowgpu.org/>

Actual Vector Processor Architectures and ISAs:
-----------------------------------------------

* NEC SX Aurora
  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
* Cray ISA
  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
* RISC-V RVV
  <https://github.com/riscv/riscv-v-spec>
* MRISC32 ISA Manual (under active development)
  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
  Mitch on direct contact with him.  It takes a different approach from the
  others listed here, which may all be termed "Cray-Style Horizontal-First"
  Vectorisation: 66000 is a *Vertical-First* Vector ISA.

The term Horizontal or Vertical alludes to the Matrix "Row-First" or
"Column-First" technique, where:

* Horizontal-First processes all elements in a Vector before moving on
  to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires
  loop constructs to explicitly step to the next element.
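
The difference between the two orderings is easiest to see as a trace of (instruction, element) pairs. The Python sketch below is illustrative only: the function names and the three-instruction "program" are hypothetical, and the explicit `el += 1` merely stands in for whatever stepping construct a Vertical-First ISA provides.

```python
def horizontal_first(program, vl):
    """Each instruction covers every element before the next one starts."""
    return [(insn, el) for insn in program for el in range(vl)]


def vertical_first(program, vl):
    """Each instruction handles one element; a loop steps the element index."""
    trace = []
    el = 0
    while el < vl:
        for insn in program:
            trace.append((insn, el))
        el += 1  # the explicit step to the next element
    return trace


program = ["ld", "add", "st"]
print(horizontal_first(program, vl=2))
# [('ld', 0), ('ld', 1), ('add', 0), ('add', 1), ('st', 0), ('st', 1)]
print(vertical_first(program, vl=2))
# [('ld', 0), ('add', 0), ('st', 0), ('ld', 1), ('add', 1), ('st', 1)]
```

Both orderings perform the same set of element operations; only the traversal order differs, which is exactly the Row-First versus Column-First analogy.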

Vector-type Support by Architecture:

[[!table  data="""
Architecture | Horizontal | Vertical
MyISA 66000  |            | X
Cray         | X          |
SX Aurora    | X          |
RVV          | X          |
SVP64        | X          | X
"""]]

===

Obligatory Dilbert:

<img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />