[[!tag standards]]

# Simple-V Vectorisation for the OpenPOWER ISA

**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.

<https://bugs.libre-soc.org/show_bug.cgi?id=213>

SV is designed as a Vector ISA for Hybrid 3D CPU GPU VPU workloads.
As such it brings features normally only found in Cray Supercomputers
(Cray-1, NEC SX-Aurora)
and in GPUs, but keeps strictly to a *Simple* principle of leveraging
a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
explicit Vector opcode exists in SV, at all**.

Fundamental design principles:

* Simplicity of introduction and implementation on the existing OpenPOWER ISA
* Effectively a hardware for-loop, pausing PC, issuing multiple scalar
  operations
* Preserving the underlying scalar execution dependencies as if the
  for-loop had been expanded as actual scalar instructions
  (termed "preserving Program Order")
* Augments ("tags") existing instructions, providing Vectorisation
  "context" rather than adding new ones.
* Does not modify or deviate from the underlying scalar OpenPOWER ISA
  unless it provides significant performance or other advantage to do so
  in the Vector space (dropping XER.SO for example)
* Designed for Supercomputing: avoids creating significant sequential
  dependency hazards, allowing high performance superscalar
  microarchitectures to be deployed.
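
The "hardware for-loop" and "preserving Program Order" principles above can be pictured with a short Python model. This is only a sketch under stated assumptions: the name `sv_add`, the flat 128-entry register list and the register numbers are illustrative and not taken from the specification, and predication, element-width overrides and scalar/vector operand marking are deliberately left out.

```python
MASK64 = (1 << 64) - 1  # assume 64-bit registers


def sv_add(gpr, rt, ra, rb, vl):
    """Model a prefixed scalar add as VL element-level scalar adds.

    Hypothetical helper for illustration only: the PC is imagined to be
    paused while the loop issues one ordinary scalar add per element.
    """
    for i in range(vl):
        # Strictly in element order, so a later element may read a result
        # written by an earlier one ("preserving Program Order").
        gpr[rt + i] = (gpr[ra + i] + gpr[rb + i]) & MASK64


# Independent elements: four adds, no overlap between sources and target.
gpr = [0] * 128  # register-file size is an assumption of this sketch
gpr[8:12] = [1, 2, 3, 4]
gpr[16:20] = [10, 20, 30, 40]
sv_add(gpr, rt=24, ra=8, rb=16, vl=4)
independent = gpr[24:28]

# Overlapping registers: element i+1 reads the value element i just wrote,
# exactly as if the loop had been unrolled into scalar instructions.
gpr2 = [1, 1, 1, 1] + [0] * 124
sv_add(gpr2, rt=1, ra=0, rb=1, vl=3)
chained = gpr2[0:4]

print(independent)  # [11, 22, 33, 44]
print(chained)      # [1, 2, 3, 4]
```

The second call shows why the ordering matters: because the target overlaps a source, the result is a running sum, which is what a sequence of unrolled scalar adds would also produce.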

Advantages of these design principles:

* It is therefore easy to create a first (and sometimes only)
  implementation as literally a for-loop in hardware, simulators, and
  compilers.
* Hardware Architects may understand and implement SV as being an
  extra pipeline stage, inserted between decode and issue, that is
  a simple for-loop issuing element-level sub-instructions.
* More complex HDL can be done by repeating existing scalar ALUs and
  pipelines as blocks and leveraging existing Multi-Issue Infrastructure.
* As (mostly) a high-level "context" that does not (significantly) deviate
  from scalar OpenPOWER ISA and, in its purest form, being "a for loop around
  scalar instructions", it is minimally-disruptive and consequently stands
  a reasonable chance of broad community adoption and acceptance.
* Completely wipes opcode proliferation off the map, not just for SIMD
  (SIMD is O(N^6) opcode proliferation) but for Vectorisation ISAs as
  well.  No more separate Vector instructions.

Comparative instruction count:

* ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
* ARM SVE: around 4,000 instructions, prerequisite: NEON.
* ARM SVE2: around 1,000 instructions, prerequisite: SVE.
* Intel AVX-512: around 4,000 instructions, prerequisite: AVX2 etc.
* RISC-V RVV: 192 instructions, prerequisite: 96 Scalar RV64GC instructions.
* SVP64: **four** instructions, 24-bit prefixing of
  prerequisite SFS (150) or
  SFFS (214) Compliancy Subsets.

SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
efficient High-Performance Compute, Distributed Computing and Advanced
Computational Supercomputing.  The Compliancy Levels are arranged such
that even at the bare minimum Level, full Soft-Emulation of all
optional and future features is possible.

# Major opcodes summary

Please be advised that even though everything below is entirely DRAFT
status, there is considerable concern that, because there is not yet any
two-way day-to-day communication established with the OPF ISA WG, we have
no idea if any of these are conflicting with future plans by any OPF
Members.  **The External ISA WG RFC Process is yet to be ratified
and Libre-SOC may not join the OPF as an entity because it does
not exist except in name. Even if it existed it would be a conflict
of interest to join the OPF, due to our funding remit from NLnet**.
We therefore proceed on the basis of making public the intention to
submit RFCs once the External ISA WG RFC Process is in place and,
in a wholly unsatisfactory manner, have to *hope and trust* that
OPF ISA WG Members are reading this and take it into consideration.

**None of these Draft opcodes are intended for private custom
secret proprietary usage. They are all intended for entirely
public, upstream, high-profile mass-volume day-to-day usage at the
same level as add, popcnt and fld.**

* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
* bitmanip requires two major opcodes (due to 16+ bit immediates);
  those are currently EXT022 and EXT05.
* brownfield encoding in one of those two major opcodes still
  requires multiple VA-Form operations (in greater numbers
  than EXT04 has spare)
* space in EXT019 next to addpcis and crops is recommended
* many X-Form opcodes currently in EXT022 have no preference
  for a location at all, and may be moved to EXT059, EXT019,
  EXT031 or another, much more suitable location.

Note that there is no Sandbox allocation in the published ISA Spec for
v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
a Prefixed-Prefixed instruction (SVP64 Prefixed v3.1 Prefixed)
would become a whopping 96-bit long instruction. Avoiding this
situation is a high priority which in turn by necessity puts pressure
on the 32-bit Major Opcode space.
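
The 25% figure, the 24-bit prefix quoted in the instruction-count comparison, and the 96-bit length all follow from simple bit-counting. The sketch below only reproduces that arithmetic; it assumes the standard 6-bit Power ISA primary opcode field, and the variable names are illustrative rather than taken from the specification.

```python
WORD_BITS = 32           # one Power ISA instruction word
PRIMARY_OPCODE_BITS = 6  # selects the major opcode, here EXT01 (assumption)
FIXED_BITS = 2           # the two bits stated above as set to 1

# Bits left in an EXT01 word once the primary opcode is taken out.
ext01_bits = WORD_BITS - PRIMARY_OPCODE_BITS
# Bits left for the SVP64 prefix payload once two more bits are fixed.
payload_bits = ext01_bits - FIXED_BITS
# Fraction of the EXT01 encoding space that has both fixed bits set.
ext01_share = 2 ** payload_bits / 2 ** ext01_bits

svp64_length = WORD_BITS + WORD_BITS              # prefix word + 32-bit suffix
v31_prefixed_length = 2 * WORD_BITS               # v3.1 prefix word + suffix
stacked_length = WORD_BITS + v31_prefixed_length  # SVP64 prefix on top of that

print(payload_bits, ext01_share, svp64_length, stacked_length)
# 24 0.25 64 96
```

Fixing two bits out of the 26 that remain after the primary opcode leaves one quarter of EXT01, and 24 bits of payload, which is consistent with the "24-bit prefixing" stated earlier.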

Note also that EXT022, the Official Architectural Sandbox area,
is under severe design pressure as it is insufficient to hold
the full extent of the instruction additions required to create
a Hybrid 3D CPU-VPU-GPU.

**Whilst SVP64 is only 4 instructions,
the heavy focus on VSX for the past 12 years has left the SFFS Level
anaemic and out-of-date compared to ARM and x86. Approximately
100 additional Scalar Instructions are up for proposal.**

# Sub-pages

Pages being developed, and examples:

* [[sv/overview]] explaining the basics.
* [[sv/compliancy_levels]] for minimum subsets through to Advanced
  Supercomputing.
* [[sv/implementation]] implementation planning and coordination
* [[sv/svp64]] contains the packet-format *only*; the [[sv/svp64/appendix]]
  contains explanations and further details
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
* [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
* [[sv/sprs]] SPRs
* SVP64 "Modes":
  - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
    Register ops: Guidelines
    on Vectorisation of any v3.0B base operations which return
    or modify a Condition Register bit or field.
  - For LD/ST Modes, see [[sv/ldst]].
  - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
    behaviour: All/Some Vector CRs
  - For arithmetic and logical, see [[sv/normal]]

Core SVP64 instructions:

* [[sv/setvl]] the Cray-style "Vector Length" instruction
* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
* [[sv/svstep]] Key stepping instruction for Vertical-First Mode

Vector-related:

* [[sv/vector_swizzle]]
* [[sv/mv.vec]] pack/unpack move to and from vec2/3/4
* [[sv/mv.swizzle]]
* [[sv/vector_ops]] scalar operations needed for supporting vectors

Scalar Instructions:

* [[sv/cr_int_predication]] instructions needed for effective predication
* [[sv/bitmanip]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
* Twin-targeted instructions (two registers out, one implicit).
  The rules for twin register targets
  (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
  - [[isa/svfixedarith]]
  - [[isa/svfparith]]
  - [[sv/biginteger]] Operations that help with big arithmetic
* TODO: OpenPOWER adaptation [[openpower/transcendentals]]

Examples, experiments, future ideas, discussion:

* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* [[sv/example_dep_matrices]]
* [[sv/major_opcode_allocation]]
* [[sv/byteswap]]
* [[sv/16_bit_compressed]] experimental
* [[sv/toc_data_pointer]] experimental
* [[sv/predication]] discussion on predication concepts
* [[sv/register_type_tags]]
* [[sv/mv.x]] deprecated in favour of Indexed REMAP

Additional links:

* <https://www.sigarch.org/simd-instructions-considered-harmful/>
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]
* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]

===

Required Background Reading:
============================

These are all, deep breath, basically... required reading, *as well as
and in addition* to a full and comprehensive deep technical understanding
of the Power ISA, in order to understand the depth and background on
SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like
the Power ISA itself).

This is just how it is.

Given the sheer overwhelming size and scope of SVP64 we have gone to
**considerable lengths** to provide justification and rationalisation for
adding the various sub-extensions to the Base Scalar Power ISA.

* Scalar bitmanipulation is justifiable for the exact same reasons the
  extensions are justifiable for other ISAs.  The additional justification
  for their inclusion where some instructions are already (sort-of) present
  in VSX is that VSX is not mandatory, and the complexity of implementation
  of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
* Scalar FP-to-INT conversions, likewise.  ARM has a JavaScript conversion
  instruction, Power ISA does not (and emulating it costs a ridiculous
  45 instructions, including 6 branches!)
* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
  for High-Performance Compute workloads.
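
To make the FP-to-INT point concrete, the following is a sketch of the ECMAScript `ToInt32` conversion, which is what a JavaScript conversion instruction performs in one step. It models the semantics only, in Python; the function name `js_to_int32` is illustrative, and it is neither the proposed Power ISA instruction nor the 45-instruction sequence mentioned above.

```python
import math


def js_to_int32(x: float) -> int:
    """ECMAScript ToInt32: truncate, wrap modulo 2**32, reinterpret as signed."""
    if math.isnan(x) or math.isinf(x):
        return 0                      # NaN and infinities convert to zero
    n = math.trunc(x)                 # round towards zero
    n &= 0xFFFFFFFF                   # wrap modulo 2**32 (also for negatives)
    if n >= 1 << 31:
        n -= 1 << 32                  # reinterpret the top half as negative
    return n


print(js_to_int32(3.7), js_to_int32(-3.7))             # 3 -3
print(js_to_int32(2.0 ** 31), js_to_int32(2.0 ** 32 + 5))  # -2147483648 5
```

Each special case here (NaN, infinity, out-of-range wrap, sign fix-up) is a separate test when written out in software, which gives a feel for why an emulation without a dedicated instruction ends up branch-heavy.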

It also has to be pointed out that normally this work would be covered by
multiple separate full-time Workgroups with multiple Members contributing
their time and resources.

Overall the contributions that we are developing take the Power ISA out of
the specialist highly-focussed market it is presently best known for, and
expand it into areas with much wider general adoption and broader uses.

---

OpenCL specifications are linked here; these are relevant when we get
to a 3D GPU / High Performance Compute ISA WG RFC:
[[openpower/transcendentals]]

(Failure to add Transcendentals to a 3D GPU is directly equivalent to
*willfully* designing a product that is 100% destined for commercial
rejection, due to the extremely high competitive performance/watt achieved
by today's mass-volume GPUs.)

I mention these because they will be encountered in every single
commercial GPU ISA, but they're not part of the "Base" (core design)
of a Vector Processor. Transcendentals can be added as a sub-RFC.

---

Actual 3D GPU Architectures and ISAs:
-------------------------------------

* Broadcom Videocore
  <https://github.com/hermanhermitage/videocoreiv>
* Etnaviv
  <https://github.com/etnaviv/etna_viv/tree/master/doc>
* Nyuzi
  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
* MALI
  <https://github.com/cwabbott0/mali-isa-docs>
* AMD
  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
* MIAOW, which is *NOT* a 3D GPU: it is a processor which happens to
  implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
  <https://miaowgpu.org/>

Actual Vector Processor Architectures and ISAs:
-----------------------------------------------

* NEC SX Aurora
  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
* Cray ISA
  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
* RISC-V RVV
  <https://github.com/riscv/riscv-v-spec>
* MRISC32 ISA Manual (under active development)
  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
  Mitch on direct contact with him.  It takes a different approach from
  the others, all of which may be termed "Cray-Style Horizontal-First"
  Vectorisation, whereas 66000 is a *Vertical-First* Vector ISA.

The term Horizontal or Vertical alludes to the Matrix "Row-First" or
"Column-First" technique, where:

* Horizontal-First processes all elements in a Vector before moving on
  to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires
  loop constructs to explicitly step to the next element.
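
The difference between the two orderings can be sketched as two nested loops with the nesting swapped. This is an illustrative Python model only: the function names and the three-instruction placeholder "program" are assumptions made for the example, and no real instruction semantics are modelled, just the order in which (instruction, element) pairs are issued.

```python
def horizontal_first(program, vl):
    """Run every element of one instruction before the next instruction."""
    trace = []
    for insn in program:
        for element in range(vl):
            trace.append((insn, element))
    return trace


def vertical_first(program, vl):
    """Run one element per instruction; an explicit step selects the next."""
    trace = []
    element = 0
    while element < vl:
        for insn in program:
            trace.append((insn, element))
        element += 1  # the explicit "step to the next element"
    return trace


program = ["load", "add", "store"]  # placeholder instruction names
print(horizontal_first(program, 2))
# [('load', 0), ('load', 1), ('add', 0), ('add', 1), ('store', 0), ('store', 1)]
print(vertical_first(program, 2))
# [('load', 0), ('add', 0), ('store', 0), ('load', 1), ('add', 1), ('store', 1)]
```

In the Vertical-First model the increment at the end of the loop body stands in for the explicit stepping that the loop construct has to perform.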

Vector-type Support by Architecture:

[[!table  data="""
Architecture | Horizontal | Vertical
MyISA 66000  |            | X
Cray         | X          |
SX Aurora    | X          |
RVV          | X          |
SVP64        | X          | X
"""]]

===

Obligatory Dilbert:

<img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />