[[!tag standards]]

Obligatory Dilbert:

<img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />

===

# SV (Simple Scalar Vectorisation) for the Power ISA

**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.

<https://bugs.libre-soc.org/show_bug.cgi?id=213>
  14
  15 SV is designed as a Scalable Vector ISA for Hybrid 3D CPU GPU VPU workloads.
  16 As such it brings features normally only found in Cray Supercomputers
  17 (Cray-1, NEC SX-Aurora)
  18 and in GPUs, but keeps strictly to a *Simple* RISC principle of leveraging
  19 a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
  20 explicit Vector opcode exists in SV, at all**.
  21
  22 Fundamental design principles:
  23
  24 * Simplicity of introduction and implementation on the existing Power ISA
  25 * Effectively a hardware for-loop, pausing PC, issuing multiple scalar
  26   operations
  27 * Preserving the underlying scalar execution dependencies as if the
  28   for-loop had been expanded as actual scalar instructions
  29   (termed "preserving Program Order")
  30 * Augments ("tags") existing instructions, providing Vectorisation
  31   "context" rather than adding new instructions.
  32 * Does not modify or deviate from the underlying scalar Power ISA
  33   unless it provides significant performance or other advantage to do so
  34   in the Vector space (dropping "sticky" of XER.SO for example)
  35 * Designed for Supercomputing: avoids creating significant sequential
  36   dependency hazards, allowing standard
  37   high performance superscalar multi-issue
  38   micro-architectures to be leveraged.
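
The "hardware for-loop" and "preserving Program Order" principles above
can be illustrated with a minimal Python sketch. The function name
`sv_add`, the flat `regs` list and the all-Vector operands are assumptions
made purely for illustration; they are not taken from the specification.
The overlapping registers are chosen deliberately so that each element
depends on the result of the previous one.

```python
def sv_add(regs, RT, RA, RB, VL):
    """Hypothetical model of a Vectorised 'add': a loop of scalar adds."""
    for i in range(VL):  # the "hardware for-loop": PC is paused meanwhile
        # one ordinary scalar element operation per iteration
        regs[RT + i] = regs[RA + i] + regs[RB + i]


# Overlapping source and destination registers (illustrative values).
regs = [0] * 16
regs[3] = 1
sv_add(regs, RT=4, RA=3, RB=3, VL=3)

# The same sequence written out as scalar instructions in Program Order:
#   add r4, r3, r3 ; add r5, r4, r4 ; add r6, r5, r5
expected = [0] * 16
expected[3] = 1
expected[4] = expected[3] + expected[3]
expected[5] = expected[4] + expected[4]
expected[6] = expected[5] + expected[5]

assert regs == expected
```

In this sketch the loop and the hand-expanded scalar sequence leave the
register file in the same state, which is all that "preserving Program
Order" is intended to mean here.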

Advantages of these design principles:

* It is therefore easy to create a first (and sometimes only)
  implementation as literally a for-loop in hardware, simulators, and
  compilers.
* Hardware Architects may understand and implement SV as being an
  extra pipeline stage, inserted between decode and issue, that is
  a simple for-loop issuing element-level sub-instructions.
* More complex HDL can be done by repeating existing scalar ALUs and
  pipelines as blocks and leveraging existing Multi-Issue Infrastructure.
* As (mostly) a high-level "context" that does not (significantly) deviate
  from scalar Power ISA and, in its purest form, being "a for loop around
  scalar instructions", it is minimally-disruptive and consequently stands
  a reasonable chance of broad community adoption and acceptance.
* Completely wipes opcode proliferation off the map: not just that of
  SIMD (SIMD is O(N^6) opcode proliferation)
  but that of Vectorisation ISAs as well.  No more separate Vector
  instructions.
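
The "extra pipeline stage between decode and issue" view can be sketched
as a Python generator that takes one decoded, prefixed instruction and
yields element-level scalar sub-instructions. The `Operand` type, its
`is_vector` tag and the `expand` function are hypothetical names used
only for this illustration; the real tagging rules are in [[sv/svp64]]
and its appendix.

```python
from typing import Iterator, NamedTuple, Tuple


class Operand(NamedTuple):
    reg: int         # base register number
    is_vector: bool  # hypothetical "tag": True = Vector, False = Scalar


def expand(op: str, dest: Operand, src1: Operand, src2: Operand,
           VL: int) -> Iterator[Tuple[str, int, int, int]]:
    """Yield one scalar sub-instruction per element, in Program Order."""
    for i in range(VL):
        # Vector-tagged operands step through consecutive registers;
        # Scalar-tagged operands stay on the same register.
        yield (op,
               dest.reg + (i if dest.is_vector else 0),
               src1.reg + (i if src1.is_vector else 0),
               src2.reg + (i if src2.is_vector else 0))


# Vector destination, one Vector source and one Scalar source.
subs = list(expand("add",
                   Operand(8, True), Operand(16, True), Operand(3, False),
                   VL=4))
for sub in subs:
    print(sub)
```

Each yielded tuple is an unmodified scalar operation, so in this model
the issue stage and the ALUs behind it need no Vector-specific opcodes.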

Comparative instruction count:

* ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
* ARM SVE: around 4,000 instructions, prerequisite: NEON.
* ARM SVE2: around 1,000 instructions, prerequisite: SVE.
* Intel AVX-512: around 4,000 instructions, prerequisite: AVX2 etc.
* RISC-V RVV: 192 instructions, prerequisite: 96 Scalar RV64GC instructions.
* SVP64: **four** instructions, 24-bit prefixing of
  prerequisite SFS (150) or
  SFFS (214) Compliancy Subsets.

SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
efficient High-Performance Compute, Distributed Computing and Advanced
Computational Supercomputing.  The Compliancy Levels are arranged such
that even at the bare minimum Level, full Soft-Emulation of all
optional and future features is possible.

# Sub-pages

Pages being developed and examples

* [[sv/overview]] explaining the basics.
* [[sv/compliancy_levels]] for minimum subsets through to Advanced
  Supercomputing.
* [[sv/implementation]] implementation planning and coordination
* [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
  contains explanations and further details
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
* [[opcode_regs_deduped]] autogenerated table of SVP64 decoder augmentation
* [[sv/vector_comparative_analysis]] - a list of Packed SIMD, GPU,
  and other Scalable Vector ISAs
* [[sv/sprs]] SPRs
* SVP64 "Modes":
  - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
    Register ops: Guidelines
    on Vectorisation of any v3.0B base operations which return
    or modify a Condition Register bit or field.
  - For LD/ST Modes, see [[sv/ldst]].
  - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
    behaviour: All/Some Vector CRs
  - For arithmetic and logical, see [[sv/normal]]
  - [[sv/mv.vec]] pack/unpack move to and from vec2/3/4,
    actually an RM.EXTRA Mode and a [[sv/remap]] mode

Core SVP64 instructions:

* [[sv/setvl]] the Cray-style "Vector Length" instruction
* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
* [[sv/svstep]] Key stepping instruction for Vertical-First Mode
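
A Cray-style "Vector Length" is typically used for strip-mining: an
array of arbitrary size is processed in chunks of at most MAXVL elements.
The sketch below models only that general idea. The `setvl` function
here is a hypothetical simplification (requested length capped at MAXVL),
and `vector_add` and the default `maxvl=8` are likewise assumptions; the
actual operands and behaviour are defined in [[sv/setvl]].

```python
def setvl(requested: int, maxvl: int) -> int:
    """Hypothetical model: VL is the requested length, capped at MAXVL."""
    return min(requested, maxvl)


def vector_add(a, b, maxvl=8):
    """Strip-mined element-wise add of two equal-length lists."""
    out = []
    done = 0
    while done < len(a):
        vl = setvl(len(a) - done, maxvl)  # the last chunk may be shorter
        for i in range(vl):               # the Vectorised scalar 'add'
            out.append(a[done + i] + b[done + i])
        done += vl
    return out


result = vector_add(list(range(20)), [1] * 20)
assert result == [x + 1 for x in range(20)]
```

With 20 elements and a MAXVL of 8, the loop in this sketch runs three
times with VL of 8, 8 and 4, and no separate "tail" code is needed.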

*Please note: there are only five instructions in the whole of SV.
Beyond this point are additional **Scalar** instructions related to
specific workloads that have nothing to do with the SV Specification.*

**Additional Instructions for specific purposes (not SVP64)**

All of the instructions below have nothing to do with SV.
They are all entirely designed as Scalar instructions that, as
Scalar instructions, stand on their own merit. Considerable
effort has been made to provide justifications for each of these
*Scalar* instructions.

Some of these Scalar instructions are specifically designed to make
Scalable Vector binaries more efficient (fewer instructions), such
as the crweird group.  Others are to bring the Scalar Power ISA
up-to-date within specific workloads,
such as a Javascript Rounding instruction. None of them are strictly
necessary but performance and power consumption may be compromised
in certain workloads and use-cases without them.

Vector-related:

* [[sv/mv.swizzle]] vec2/3/4 Swizzles (RGBA, XYZW) for 3D and CUDA.
  Designed as a Scalar instruction.
* [[sv/vector_ops]] scalar operations needed for supporting vectors
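
For readers unfamiliar with the term, a "swizzle" selects and reorders
the components of a small vector (vec2/3/4) by name. The sketch below
shows only that generic concept as found in 3D shader languages; the
`COMPONENTS` table and `swizzle` function are illustrative assumptions
and say nothing about the actual encoding, which is on [[sv/mv.swizzle]].

```python
# Component names for the two common conventions (XYZW and RGBA).
COMPONENTS = {"x": 0, "y": 1, "z": 2, "w": 3,
              "r": 0, "g": 1, "b": 2, "a": 3}


def swizzle(vec, pattern):
    """Return the components of vec selected (in order) by pattern."""
    return [vec[COMPONENTS[c]] for c in pattern]


rgba = [0.1, 0.2, 0.3, 1.0]
bgra = swizzle(rgba, "bgra")  # reorder: swap red and blue
xy = swizzle(rgba, "xy")      # narrow: vec4 to vec2
xxxx = swizzle(rgba, "xxxx")  # broadcast one component

assert bgra == [0.3, 0.2, 0.1, 1.0]
```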

Scalar Instructions:

* [[sv/cr_int_predication]] instructions needed for effective predication
* [[sv/bitmanip]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
* Twin-targeted instructions (two registers out, one implicit, just like
  Load-with-Update).
  The rules for twin register targets
  (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
  - [[isa/svfixedarith]]
  - [[isa/svfparith]]
  - [[sv/biginteger]] Operations that help with big arithmetic
* TODO: OpenPOWER adaptation [[openpower/transcendentals]]

Examples, experiments, future ideas and discussion:

* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* [[sv/example_dep_matrices]]
* [[sv/major_opcode_allocation]]
* [[sv/byteswap]]
* [[sv/16_bit_compressed]] experimental
* [[sv/toc_data_pointer]] experimental
* [[sv/predication]] discussion on predication concepts
* [[sv/register_type_tags]]
* [[sv/mv.x]] deprecated in favour of Indexed REMAP

Additional links:

* <https://www.sigarch.org/simd-instructions-considered-harmful/>
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]
* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]
# Major opcodes summary

Please be advised that even though everything below is entirely DRAFT
status, there is considerable concern that, because no two-way
day-to-day communication has yet been established with the OPF ISA WG,
we have no idea whether any of these conflict with future plans by any OPF
Members.  **The External ISA WG RFC Process is yet to be ratified
and Libre-SOC may not join the OPF as an entity because it does
not exist except in name. Even if it existed it would be a conflict
of interest to join the OPF, due to our funding remit from NLnet**.
We therefore proceed on the basis of making public the intention to
submit RFCs once the External ISA WG RFC Process is in place and,
in a wholly unsatisfactory manner, have to *hope and trust* that
OPF ISA WG Members are reading this and take it into consideration.

**None of these Draft opcodes are intended for private custom
secret proprietary usage. They are all intended for entirely
public, upstream, high-profile mass-volume day-to-day usage at the
same level as add, popcnt and fld.**

* SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
* bitmanip requires two major opcodes (due to 16+ bit immediates);
  those are currently EXT022 and EXT05.
* brownfield encoding in one of those two major opcodes still
  requires multiple VA-Form operations (in greater numbers
  than EXT04 has spare)
* space in EXT019 next to addpcis and crops is recommended
  (or any 5-6 bit Minor XO areas)
* many X-Form opcodes currently in EXT022 have no preference
  for a location at all, and may be moved to EXT059, EXT019,
  EXT031 or another much more suitable location.
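
The "25% of EXT01" figure in the first bullet follows from simple
arithmetic: a 32-bit word has 26 bits after the 6-bit major opcode, and
fixing two of them (bits 6 and 9) leaves 24 free bits, i.e. one quarter
of the EXT01 space. The sketch below works this through. It assumes the
Power ISA convention of MSB0 numbering (bit 0 is the most significant bit
of the 32-bit word) and takes the bit positions from the bullet as-is;
the function names are hypothetical.

```python
def msb0_bit(word: int, n: int, width: int = 32) -> int:
    """Return bit n of word using MSB0 numbering (bit 0 = most significant)."""
    return (word >> (width - 1 - n)) & 1


def is_svp64_prefix(word: int) -> bool:
    """Assumed test: major opcode 1 (EXT01) with bits 6 and 9 both set."""
    major = (word >> 26) & 0x3F  # bits 0-5 in MSB0 numbering
    return major == 1 and msb0_bit(word, 6) == 1 and msb0_bit(word, 9) == 1


OPCODE_BITS = 6
FIXED_BITS = 2                                 # bits 6 and 9
free_bits = 32 - OPCODE_BITS - FIXED_BITS      # bits left for the prefix
fraction = 2**free_bits / 2**(32 - OPCODE_BITS)

print(free_bits, fraction)

# EXT01 with bits 6 and 9 set (MSB0 bit n is LSB0 bit 31-n).
example = (1 << 26) | (1 << (31 - 6)) | (1 << (31 - 9))
assert is_svp64_prefix(example)
assert not is_svp64_prefix(1 << 26)  # EXT01, but bits 6 and 9 clear
```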

Note that there is no Sandbox allocation in the published ISA Spec for
v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
Prefixed-Prefixed instructions (an SVP64 Prefix applied to a v3.1
Prefixed instruction) would become a whopping 96 bits long. Avoiding this
situation is a high priority, which in turn by necessity puts pressure
on the 32-bit Major Opcode space.

Note also that EXT022, the Official Architectural Sandbox area,
is under severe design pressure as it is insufficient to hold
the full extent of the instruction additions required to create
a Hybrid 3D CPU-VPU-GPU.

**Whilst SVP64 is only 4 instructions,
the heavy focus on VSX for the past 12 years has left the SFFS Level
anaemic and out-of-date compared to ARM and x86. Approximately
100 additional Scalar Instructions are up for proposal.**
 222