1 [[!tag standards]]
2
3 # Simple-V Vectorisation for the OpenPOWER ISA
4
5 **SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
6
7 <https://bugs.libre-soc.org/show_bug.cgi?id=213>
8
9 SV is designed as a Vector ISA for Hybrid 3D CPU-GPU-VPU workloads.
10 As such it brings features normally only found in Cray-style Vector Supercomputers
11 (Cray-1, NEC SX-Aurora)
12 and in GPUs, but keeps strictly to a *Simple* principle of leveraging
13 a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
14 explicit Vector opcode exists in SV, at all**.
15
16 Fundamental design principles:
17
18 * Simplicity of introduction and implementation on the existing OpenPOWER ISA
19 * Effectively a hardware for-loop, pausing PC, issuing multiple scalar
20 operations
21 * Preserving the underlying scalar execution dependencies as if the
22 for-loop had been expanded as actual scalar instructions
23 (termed "preserving Program Order")
24 * Augments ("tags") existing instructions, providing Vectorisation
25 "context" rather than adding new ones.
26 * Does not modify or deviate from the underlying scalar OpenPOWER ISA
27 unless it provides significant performance or other advantage to do so
28 in the Vector space (dropping XER.SO for example)
29 * Designed for Supercomputing: avoids creating significant sequential
30 dependency hazards, allowing high performance superscalar
31 microarchitectures to be deployed.
32
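The "hardware for-loop" and "preserving Program Order" principles above can be illustrated with a minimal sketch. It assumes a flat list standing in for the register file; the function name `sv_add` and its parameters are hypothetical, and it ignores everything else that SVP64 layers on top (predication, for example).

```python
def sv_add(regs, rt, ra, rb, vl):
    """Hypothetical model of a prefixed scalar add: a plain for-loop."""
    for i in range(vl):
        # each iteration is one ordinary scalar add, registers offset by i
        regs[rt + i] = regs[ra + i] + regs[rb + i]
    return regs

regs = [1, 2, 3, 4, 0, 0, 0, 0]
# Overlapping source and destination: RT=1, RA=0, RB=1, VL=3.
# Program Order means element 1 reads the result written by element 0.
sv_add(regs, rt=1, ra=0, rb=1, vl=3)
print(regs)  # [1, 3, 6, 10, 0, 0, 0, 0]
```

Because each element behaves exactly as if the loop had been expanded into scalar instructions, the overlapping registers produce a running sum rather than three independent additions.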
33 Advantages of these design principles:
34
35 * It is therefore easy to create a first (and sometimes only)
36 implementation as literally a for-loop in hardware, simulators, and
37 compilers.
38 * Hardware Architects may understand and implement SV as being an
39 extra pipeline stage, inserted between decode and issue, that is
40 a simple for-loop issuing element-level sub-instructions.
41 * More complex HDL can be done by repeating existing scalar ALUs and
42 pipelines as blocks and leveraging existing Multi-Issue Infrastructure
43 * As (mostly) a high-level "context" that does not (significantly) deviate
44 from scalar OpenPOWER ISA and, in its purest form being "a for loop around
45 scalar instructions", it is minimally-disruptive and consequently stands
46 a reasonable chance of broad community adoption and acceptance
47 * Completely wipes opcode proliferation off the map, not just for
48 SIMD (where opcode proliferation is O(N^6)) but for Vectorisation
49 ISAs as well. No more separate Vector
50 instructions.
51
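The "extra pipeline stage between decode and issue" view described above might be sketched as a generator that takes one decoded, prefixed instruction and hands element-level scalar sub-instructions to an unmodified issue stage. The names `ScalarOp` and `expand` are assumptions made purely for illustration, and every operand is treated as a Vector for simplicity.

```python
from typing import Iterator, NamedTuple

class ScalarOp(NamedTuple):
    """Hypothetical decoded scalar instruction: opcode name plus registers."""
    name: str
    rt: int
    ra: int
    rb: int

def expand(op: ScalarOp, vl: int) -> Iterator[ScalarOp]:
    """One prefixed instruction in, VL element-level sub-instructions out."""
    for i in range(vl):
        yield ScalarOp(op.name, op.rt + i, op.ra + i, op.rb + i)

# The issue stage only ever sees ordinary scalar operations
for sub in expand(ScalarOp("add", rt=8, ra=16, rb=24), vl=4):
    print(sub)
```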
52 Comparative instruction count:
53
54 * ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
55 * ARM SVE: around 4,000 instructions, prerequisite: NEON.
56 * ARM SVE2: around 1,000 instructions, prerequisite: SVE.
57 * Intel AVX-512: around 4,000 instructions, prerequisite: AVX2 etc.
58 * RISC-V RVV: 192 instructions, prerequisite: 96 Scalar RV64GC instructions.
59 * SVP64: **four** instructions, 24-bit prefixing of
60 prerequisite SFS (150) or
61 SFFS (214) Compliancy Subsets
62
63 SV comprises several [[sv/compliancy_levels]] suited to Embedded, Energy
64 efficient High-Performance Compute, Distributed Computing and Advanced
65 Computational Supercomputing. The Compliancy Levels are arranged such
66 that even at the bare minimum Level, full Soft-Emulation of all
67 optional and future features is possible.
68
69 # Major opcodes summary
70
71 Please be advised that even though everything below is entirely DRAFT status, there
72 is considerable concern that because there is not yet any two-way
73 day-to-day communication established with the OPF ISA WG, we have
74 no idea if any of these are conflicting with future plans by any OPF
75 Members. **The External ISA WG RFC Process is yet to be ratified
76 and Libre-SOC may not join the OPF as an entity because it does
77 not exist except in name. Even if it existed it would be a conflict
78 of interest to join the OPF, due to our funding remit from NLnet**.
79 We therefore proceed on the basis of making public the intention to
80 submit RFCs once the External ISA WG RFC Process is in place and,
81 in a wholly unsatisfactory manner, have to *hope and trust* that
82 OPF ISA WG Members are reading this and will take it into consideration.
83
84 **None of these Draft opcodes are intended for private custom
85 secret proprietary usage. They are all intended for entirely
86 public, upstream, high-profile mass-volume day-to-day usage at the
87 same level as add, popcnt and fld**
88
89 * SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
90 * bitmanip requires two major opcodes (due to 16+ bit immediates);
91 those are currently EXT022 and EXT05.
92 * brownfield encoding in one of those two major opcodes still
93 requires multiple VA-Form operations (in greater numbers
94 than EXT04 has spare)
95 * space in EXT019 next to addpcis and crops is recommended
96 * many X-Form opcodes currently in EXT022 have no preference
97 for a location at all, and may be moved to EXT059, EXT019,
98 EXT031 or other much more suitable location.
99
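To see why "bits 6 and 9 set to 1" amounts to 25% of EXT01, here is a small sketch. It assumes the Power ISA's MSB0 bit numbering (bit 0 is the most significant bit of the 32-bit word, bits 0-5 hold the major opcode); the helper names are hypothetical, and the draft encoding itself may yet change.

```python
def msb0_bit(word: int, n: int, width: int = 32) -> int:
    """Bit n of word in MSB0 numbering (bit 0 is the most significant)."""
    return (word >> (width - 1 - n)) & 1

def is_svp64_prefix(word: int) -> bool:
    """Hypothetical check: major opcode EXT01 with bits 6 and 9 both set."""
    major = (word >> 26) & 0x3F  # bits 0-5 in MSB0 numbering
    return major == 1 and msb0_bit(word, 6) == 1 and msb0_bit(word, 9) == 1

# Enumerate every combination of bits 6..9 under EXT01 and count the share
candidates = [(1 << 26) | (v << 22) for v in range(16)]
share = sum(is_svp64_prefix(w) for w in candidates) / len(candidates)
print(share)  # 0.25
```

Two fixed bits out of the space leave one combination in four, hence the 25% figure.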
100 Note that there is no Sandbox allocation in the published ISA Spec for
101 v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
102 Prefixed-Prefixed-instructions (SVP64 Prefixed v3.1 Prefixed)
103 would become a whopping 96-bit long instruction. Avoiding this
104 situation is a high priority which in turn by necessity puts pressure
105 on the 32-bit Major Opcode space.
106
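The 96-bit figure above follows from simple addition, sketched here under the assumption that the SVP64 prefix occupies one 32-bit word in front of whatever it prefixes. The constant names are illustrative only.

```python
WORD_BITS = 32                     # one Power ISA instruction word
SVP64_PREFIX_BITS = WORD_BITS      # 64-bit SVP64 = 32-bit prefix + 32-bit suffix
V31_PREFIXED_BITS = 2 * WORD_BITS  # v3.1 prefix word + suffix word

svp64_on_scalar = SVP64_PREFIX_BITS + WORD_BITS
svp64_on_v31_prefixed = SVP64_PREFIX_BITS + V31_PREFIXED_BITS
print(svp64_on_scalar, svp64_on_v31_prefixed)  # 64 96
```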
107 Note also that EXT022, the Official Architectural Sandbox area
108 is under severe design pressure as it is insufficient to hold
109 the full extent of the instruction additions required to create
110 a Hybrid 3D CPU-VPU-GPU.
111
112 **Whilst SVP64 is only 4 instructions,
113 the heavy focus on VSX for the past 12 years has left the SFFS Level
114 anaemic and out-of-date compared to ARM and x86. Approximately
115 100 additional Scalar Instructions are up for proposal.**
116
117 # Sub-pages
118
119 Pages being developed and examples
120
121 * [[sv/overview]] explaining the basics.
122 * [[sv/compliancy_levels]] for minimum subsets through to Advanced
123 Supercomputing.
124 * [[sv/implementation]] implementation planning and coordination
125 * [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
126 contains explanations and further details
127 * [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
128 * [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
129 * [[sv/sprs]] SPRs
130 * SVP64 "Modes":
131 - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
132 Register ops: Guidelines
133 on Vectorisation of any v3.0B base operations which return
134 or modify a Condition Register bit or field.
135 - For LD/ST Modes, see [[sv/ldst]].
136 - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
137 behaviour: All/Some Vector CRs
138 - For arithmetic and logical, see [[sv/normal]]
139
140 Core SVP64 instructions:
141
142 * [[sv/setvl]] the Cray-style "Vector Length" instruction
143 * [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
144 * [[sv/svstep]] Key stepping instruction for Vertical-First Mode
145
146 Vector-related:
147
148 * [[sv/vector_swizzle]]
149 * [[sv/mv.vec]] pack/unpack move to and from vec2/3/4
150 * [[sv/mv.swizzle]]
151 * [[sv/vector_ops]] scalar operations needed for supporting vectors
152
153 Scalar Instructions:
154
155 * [[sv/cr_int_predication]] instructions needed for effective predication
156 * [[sv/bitmanip]]
157 * [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
158 * [[sv/fclass]] detect class of FP numbers
159 * [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
160 * [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
161 * [[sv/av_opcodes]] scalar opcodes for Audio/Video
162 * Twin-targeted instructions (two registers out, one implicit).
163 The rules for twin register targets
164 (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
165 - [[isa/svfixedarith]]
166 - [[isa/svfparith]]
167 - [[sv/biginteger]] Operations that help with big arithmetic
168 * TODO: OpenPOWER adaptation [[openpower/transcendentals]]
169
170 Examples, experiments, future ideas and discussion:
171
172 * [[sv/propagation]] Context propagation including svp64, swizzle and remap
173 * [[sv/masked_vector_chaining]]
174 * [[sv/discussion]]
175 * [[sv/example_dep_matrices]]
176 * [[sv/major_opcode_allocation]]
177 * [[sv/byteswap]]
178 * [[sv/16_bit_compressed]] experimental
179 * [[sv/toc_data_pointer]] experimental
180 * [[sv/predication]] discussion on predication concepts
181 * [[sv/register_type_tags]]
182 * [[sv/mv.x]] deprecated in favour of Indexed REMAP
183
184 Additional links:
185
186 * <https://www.sigarch.org/simd-instructions-considered-harmful/>
187 * [[simple_v_extension]] old (deprecated) version
188 * [[openpower/sv/llvm]]
189 * [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]
190
191 ===
192
193 Required Background Reading:
194 ============================
195
196 These are all, deep breath, basically... required reading, *as well as
197 and in addition* to a full and comprehensive deep technical understanding
198 of the Power ISA, in order to understand the depth and background on
199 SVP64 as a 3D GPU and VPU Extension.
200
201 I am keenly aware that each of them is 300 to 1,000 pages (just like
202 the Power ISA itself).
203
204 This is just how it is.
205
206 Given the sheer overwhelming size and scope of SVP64 we have gone to
207 **considerable lengths** to provide justification and rationalisation for
208 adding the various sub-extensions to the Base Scalar Power ISA.
209
210 * Scalar bitmanipulation is justifiable for the exact same reasons the
211 extensions are justifiable for other ISAs. The additional justification
212 for their inclusion where some instructions are already (sort-of) present
213 in VSX is that VSX is not mandatory, and the complexity of implementation
214 of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
215 * Scalar FP-to-INT conversions, likewise. ARM has a JavaScript conversion
216 instruction; the Power ISA does not (and it costs a ridiculous 45 instructions
217 to implement, including 6 branches!)
218 * Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
219 for High-Performance Compute workloads.
220
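For readers unfamiliar with what a "JavaScript conversion" has to do, the following is a sketch of the ECMAScript ToInt32 rule as I understand it: NaN and infinities become zero, the value is truncated towards zero, then wrapped modulo 2^32 into the signed 32-bit range. The function name is hypothetical, and this is not the 45-instruction Power ISA sequence referred to above.

```python
import math

def js_to_int32(x: float) -> int:
    """Sketch of ECMAScript ToInt32 semantics for a double-precision input."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x)   # round towards zero
    n &= 0xFFFFFFFF     # wrap modulo 2**32
    # reinterpret the low 32 bits as a signed two's complement value
    return n - (1 << 32) if n >= (1 << 31) else n

print(js_to_int32(-1.5))         # -1
print(js_to_int32(2.0**32 + 5))  # 5
print(js_to_int32(2.0**31))      # -2147483648
```

The special cases and the wrap-around (rather than saturation) are what make this expensive without a dedicated instruction.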
221 It also has to be pointed out that normally this work would be covered by
222 multiple separate full-time Workgroups with multiple Members contributing
223 their time and resources.
224
225 Overall the contributions that we are developing take the Power ISA out of
226 the specialist highly-focussed market it is presently best known for, and
227 expand it into areas with much wider general adoption and broader uses.
228
229
230 ---
231
232 OpenCL specifications are linked here; these are relevant when we get
233 to a 3D GPU / High Performance Compute ISA WG RFC:
234 [[openpower/transcendentals]]
235
236 (Failure to add Transcendentals to a 3D GPU is directly equivalent to
237 *willfully* designing a product that is 100% destined for commercial
238 rejection, due to the extremely high competitive performance/watt achieved
239 by today's mass-volume GPUs.)
240
241 I mention these because they will be encountered in every single
242 commercial GPU ISA, but they're not part of the "Base" (core design)
243 of a Vector Processor. Transcendentals can be added as a sub-RFC.
244
245 ---
246
247 Actual 3D GPU Architectures and ISAs:
248 -------------------------------------
249
250 * Broadcom Videocore
251 <https://github.com/hermanhermitage/videocoreiv>
252 * Etnaviv
253 <https://github.com/etnaviv/etna_viv/tree/master/doc>
254 * Nyuzi
255 <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
256 * MALI
257 <https://github.com/cwabbott0/mali-isa-docs>
258 * AMD
259 <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
260 <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
261 * MIAOW, which is *NOT* a 3D GPU: it is a processor that happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
262 <https://miaowgpu.org/>
263
264
265 Actual Vector Processor Architectures and ISAs:
266 -----------------------------------------------
267
268 * NEC SX Aurora
269 <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
270 * Cray ISA
271 <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
272 * RISC-V RVV
273 <https://github.com/riscv/riscv-v-spec>
274 * MRISC32 ISA Manual (under active development)
275 <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
276 * Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
277 Mitch on direct contact with him. It is a different approach from the
278 others, which may be termed "Cray-Style Horizontal-First" Vectorisation.
279 66000 is a *Vertical-First* Vector ISA.
280
281 The terms Horizontal and Vertical allude to the Matrix "Row-First" or
282 "Column-First" technique, where:
283
284 * Horizontal-First processes all elements in a Vector before moving on
285 to the next instruction
286 * Vertical-First processes *ONE* element per instruction, and requires
287 loop constructs to explicitly step to the next element.
288
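The difference between the two orderings above can be shown as a pair of nested loops. This is a minimal sketch that only records the order in which (instruction, element) pairs would execute; the function names and the two-instruction program are made up for illustration.

```python
def horizontal_first(program, vl):
    """All elements of one instruction, then move on to the next instruction."""
    return [(insn, i) for insn in program for i in range(vl)]

def vertical_first(program, vl):
    """One element of every instruction, then explicitly step to the next element."""
    return [(insn, i) for i in range(vl) for insn in program]

program = ["mul", "add"]
print(horizontal_first(program, 3))
# [('mul', 0), ('mul', 1), ('mul', 2), ('add', 0), ('add', 1), ('add', 2)]
print(vertical_first(program, 3))
# [('mul', 0), ('add', 0), ('mul', 1), ('add', 1), ('mul', 2), ('add', 2)]
```

The same work is done either way; only the loop nesting (and therefore who is responsible for stepping to the next element) changes.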
289 Vector-type Support by Architecture
290 [[!table data="""
291 Architecture | Horizontal | Vertical
292 MyISA 66000 | | X
293 Cray | X |
294 SX Aurora | X |
295 RVV | X |
296 SVP64 | X | X
297 """]]
298
299 ===
300
301 Obligatory Dilbert:
302
303 <img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />
304