1 [[!tag standards]]
2
3 # Simple-V Vectorisation for the OpenPOWER ISA
4
5 **SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
6
7 <https://bugs.libre-soc.org/show_bug.cgi?id=213>
8
9 SV is designed as a Vector ISA for Hybrid 3D CPU-GPU-VPU workloads.
10 As such it brings features normally only found in Cray-style Vector
11 Supercomputers (Cray-1, NEC SX-Aurora)
12 and in GPUs, but keeps strictly to a *Simple* principle of leveraging
13 a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
14 explicit Vector opcode exists in SV, at all**.
15
16 Fundamental design principles:
17
18 * Simplicity of introduction and implementation on the existing OpenPOWER ISA
19 * Effectively a hardware for-loop, pausing PC, issuing multiple scalar
20 operations
21 * Preserving the underlying scalar execution dependencies as if the
22 for-loop had been expanded as actual scalar instructions
23 (termed "preserving Program Order")
24 * Augments ("tags") existing instructions, providing Vectorisation
25 "context" rather than adding new ones.
26 * Does not modify or deviate from the underlying scalar OpenPOWER ISA
27 unless it provides significant performance or other advantage to do so
28 in the Vector space (dropping XER.SO for example)
29 * Designed for Supercomputing: avoids creating significant sequential
30 dependency hazards, allowing high performance superscalar
31 microarchitectures to be deployed.
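
The "hardware for-loop" principle above can be sketched in a few lines of Python. This is only an illustrative model, not the SVP64 specification: the names `scalar_add` and `sv_loop` are hypothetical, and the rule that every operand steps through consecutive registers is a simplifying assumption made for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def scalar_add(regs, rt, ra, rb):
    """Model of an ordinary 64-bit scalar add: regs[rt] = regs[ra] + regs[rb]."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64

def sv_loop(regs, scalar_op, vl, rt, ra, rb):
    """Issue the same scalar operation VL times, one element at a time.

    The PC is conceptually paused while this loop runs. Elements are
    issued strictly in order, so element i sees the results written by
    elements 0..i-1 ("preserving Program Order").
    Assumption for this sketch: all three operands are vectors and
    simply use consecutive register numbers.
    """
    for i in range(vl):
        scalar_op(regs, rt + i, ra + i, rb + i)

# Independent elements: a plain element-wise vector add.
regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]
sv_loop(regs, scalar_add, 4, 24, 8, 16)
print(regs[24:28])

# Overlapping registers: each element depends on the previous result,
# exactly as if the four scalar adds had been written out by hand.
chain = [1, 1, 0, 0, 0, 0]
sv_loop(chain, scalar_add, 4, 2, 0, 1)
print(chain)
```

The second call shows why Program Order matters: because the destination registers overlap the sources, the outcome is only well-defined if the element operations behave as though they were issued sequentially.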
32
33 Advantages of these design principles:
34
35 * It is therefore easy to create a first (and sometimes only)
36 implementation as literally a for-loop in hardware, simulators, and
37 compilers.
38 * Hardware Architects may understand and implement SV as being an
39 extra pipeline stage, inserted between decode and issue, that is
40 a simple for-loop issuing element-level sub-instructions.
41 * More complex HDL can be done by repeating existing scalar ALUs and
42 pipelines as blocks and leveraging existing Multi-Issue Infrastructure
43 * As (mostly) a high-level "context" that does not (significantly) deviate
44 from scalar OpenPOWER ISA and, in its purest form being "a for loop around
45 scalar instructions", it is minimally-disruptive and consequently stands
46 a reasonable chance of broad community adoption and acceptance
47 * Completely wipes opcode proliferation off the map, not just for SIMD
48 (SIMD is O(N^6) opcode proliferation)
49 but for Vectorisation ISAs as well. No more separate Vector
50 instructions.
51
52 Comparative instruction count:
53
54 * ARM NEON SIMD: around 2,000 instructions, prerequisite: ARM Scalar.
55 * ARM SVE: around 4,000 instructions, prerequisite: NEON.
56 * ARM SVE2: around 1,000 instructions, prerequisite: SVE.
57 * Intel AVX-512: around 4,000 instructions, prerequisite: AVX2 etc.
58 * RISC-V RVV: 192 instructions, prerequisite: 96 Scalar RV64GC instructions.
59 * SVP64: **four** instructions, 24-bit prefixing of
60 prerequisite SFS (150) or
61 SFFS (214) Compliancy Subsets
62
63 # Major opcodes summary
64
65 Please be advised that even though below is entirely DRAFT status, there
66 is considerable concern that because there is not yet any two-way
67 day-to-day communication established with the OPF ISA WG, we have
68 no idea if any of these are conflicting with future plans by any OPF
69 Members. **The External ISA WG RFC Process is yet to be ratified
70 and Libre-SOC may not join the OPF as an entity because it does
71 not exist except in name. Even if it existed it would be a conflict
72 of interest to join the OPF, due to our funding remit from NLnet**.
73 We therefore proceed on the basis of making public the intention to
74 submit RFCs once the External ISA WG RFC Process is in place and,
75 in a wholly unsatisfactory manner, have to *hope and trust* that
76 OPF ISA WG Members are reading this and take it into consideration.
77
78 **None of these Draft opcodes are intended for private custom
79 secret proprietary usage. They are all intended for entirely
80 public, upstream, high-profile mass-volume day-to-day usage at the
81 same level as add, popcnt and fld**
82
83 * SVP64 requires 25% of EXT01 (bits 6 and 9 set to 1)
84 * bitmanip requires two major opcodes (due to 16+ bit immediates);
85 those are currently EXT022 and EXT05.
86 * brownfield encoding in one of those two major opcodes still
87 requires multiple VA-Form operations (in greater numbers
88 than EXT04 has spare)
89 * space in EXT019 next to addpcis and crops is recommended
90 * many X-Form opcodes currently in EXT022 have no preference
91 for a location at all, and may be moved to EXT059, EXT019,
92 EXT031 or other much more suitable location.
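
The "25% of EXT01" figure can be checked with a small sketch. It assumes the usual Power ISA MSB0 bit numbering (bit 0 is the most significant bit of the 32-bit word, and bits 0 to 5 hold the major opcode), and it takes "bits 6 and 9 set to 1" literally from the list above. The function names are hypothetical, and since the encoding is DRAFT the identifying bits may change.

```python
def msb0_bit(word, n, width=32):
    """Return bit n of word using MSB0 numbering (bit 0 = most significant)."""
    return (word >> (width - 1 - n)) & 1

def is_svp64_prefix(word):
    """True if a 32-bit word is in the part of EXT01 that SVP64 asks for.

    Assumption for this sketch: major opcode 1 (EXT01) in bits 0-5,
    with bits 6 and 9 both set to 1.
    """
    major_opcode = (word >> 26) & 0x3F
    return major_opcode == 1 and msb0_bit(word, 6) == 1 and msb0_bit(word, 9) == 1

# EXT01 leaves 26 bits after the major opcode. Fixing two of them
# leaves 24 free bits, i.e. one quarter of the EXT01 space.
free_bits = 32 - 6 - 2
fraction_of_ext01 = 2 ** free_bits / 2 ** 26
print(free_bits, fraction_of_ext01)

example = (1 << 26) | (1 << 25) | (1 << 22)   # EXT01 with bits 6 and 9 set
print(is_svp64_prefix(example))
```

The 24 free bits that remain are consistent with the "24-bit prefixing" quoted in the instruction-count comparison.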
93
94 Note that there is no Sandbox allocation in the published ISA Spec for
95 v3.1 EXT01 usage, and because SVP64 is already 64-bit Prefixed,
96 Prefixed-Prefixed-instructions (SVP64 Prefixed v3.1 Prefixed)
97 would become a whopping 96-bit long instruction. Avoiding this
98 situation is a high priority which in turn by necessity puts pressure
99 on the 32-bit Major Opcode space.
100
101 Note also that EXT022, the Official Architectural Sandbox area
102 is under severe design pressure as it is insufficient to hold
103 the full extent of the instruction additions required to create
104 a Hybrid 3D CPU-VPU-GPU.
105
106 **Whilst SVP64 is only 4 instructions,
107 the heavy focus on VSX for the past 12 years has left the SFFS Level
108 anaemic and out-of-date compared to ARM and x86. Approximately
109 100 additional Scalar Instructions are up for proposal.**
110
111 # Sub-pages
112
113 Pages being developed and examples
114
115 * [[sv/overview]] explaining the basics.
116 * [[sv/implementation]] implementation planning and coordination
117 * [[sv/svp64]] contains the packet-format *only*, the [[sv/svp64/appendix]]
118 contains explanations and further details
119 * [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
120 * [[opcode_regs_deduped]] autogenerated table of SVP64 instructions
121 * [[sv/sprs]] SPRs
122 * SVP64 "Modes":
123 - For condition register operations see [[sv/cr_ops]] - SVP64 Condition
124 Register ops: Guidelines
125 on Vectorisation of any v3.0B base operations which return
126 or modify a Condition Register bit or field.
127 - For LD/ST Modes, see [[sv/ldst]].
128 - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch
129 behaviour: All/Some Vector CRs
130 - For arithmetic and logical, see [[sv/normal]]
131
132 Core SVP64 instructions:
133
134 * [[sv/setvl]] the Cray-style "Vector Length" instruction
135 * [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
136 * [[sv/svstep]] Key stepping instruction for Vertical-First Mode
137
138 Vector-related:
139
140 * [[sv/vector_swizzle]]
141 * [[sv/mv.vec]] move to and from vec2/3/4
142 * [[sv/mv.swizzle]]
143 * [[sv/vector_ops]] scalar operations needed for supporting vectors
144
145 Scalar Instructions:
146
147 * [[sv/cr_int_predication]] instructions needed for effective predication
148 * [[sv/bitmanip]]
149 * [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
150 * [[sv/fclass]] detect class of FP numbers
151 * [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
152 * [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
153 * [[sv/av_opcodes]] scalar opcodes for Audio/Video
154 * Twin-targeted instructions (two registers out, one implicit).
155 The rules for twin register targets
156 (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
157 - [[isa/svfixedarith]]
158 - [[isa/svfparith]]
159 - [[sv/biginteger]] Operations that help with big arithmetic
160 * TODO: OpenPOWER adaptation [[openpower/transcendentals]]
161
162 Examples, experiments, future ideas, discussion:
163
164 * [[sv/propagation]] Context propagation including svp64, swizzle and remap
165 * [[sv/masked_vector_chaining]]
166 * [[sv/discussion]]
167 * [[sv/example_dep_matrices]]
168 * [[sv/major_opcode_allocation]]
169 * [[sv/byteswap]]
170 * [[sv/16_bit_compressed]] experimental
171 * [[sv/toc_data_pointer]] experimental
172 * [[sv/predication]] discussion on predication concepts
173 * [[sv/register_type_tags]]
174 * [[sv/mv.x]] deprecated in favour of Indexed REMAP
175
176 Additional links:
177
178 * <https://www.sigarch.org/simd-instructions-considered-harmful/>
179 * [[simple_v_extension]] old (deprecated) version
180 * [[openpower/sv/llvm]]
181 * [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]
182
183 ===
184
185 Required Background Reading:
186 ============================
187
188 These are all, deep breath, basically... required reading, *as well as
189 and in addition* to a full and comprehensive deep technical understanding
190 of the Power ISA, in order to understand the depth and background on
191 SVP64 as a 3D GPU and VPU Extension.
192
193 I am keenly aware that each of them is 300 to 1,000 pages (just like
194 the Power ISA itself).
195
196 This is just how it is.
197
198 Given the sheer overwhelming size and scope of SVP64 we have gone to
199 **considerable lengths** to provide justification and rationalisation for
200 adding the various sub-extensions to the Base Scalar Power ISA.
201
202 * Scalar bitmanipulation is justifiable for the exact same reasons the
203 extensions are justifiable for other ISAs. The additional justification
204 for their inclusion where some instructions are already (sort-of) present
205 in VSX is that VSX is not mandatory, and the complexity of implementation
206 of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.
207 * Scalar FP-to-INT conversions, likewise. ARM has a JavaScript conversion
208 instruction, Power ISA does not (and it costs a ridiculous 45 instructions
209 to implement, including 6 branches!)
210 * Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable
211 for High-Performance Compute workloads.
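
To give a feel for what the JavaScript-style FP-to-INT conversion mentioned above has to do, here is a minimal Python sketch of the ECMAScript ToInt32 rule (NaN and infinities give 0, the value is truncated towards zero, then wrapped modulo 2^32 into the signed 32-bit range). The function name `js_to_int32` is hypothetical. This models the semantics only and says nothing about how any proposed Power ISA instruction would be encoded or implemented.

```python
import math

def js_to_int32(x: float) -> int:
    """Convert a float to a signed 32-bit integer using ToInt32 semantics."""
    if math.isnan(x) or math.isinf(x):
        return 0                      # NaN and +/-Infinity convert to 0
    n = math.trunc(x)                 # round towards zero
    n &= 0xFFFFFFFF                   # wrap modulo 2**32 (no saturation)
    if n >= 0x80000000:               # reinterpret as two's complement
        n -= 0x100000000
    return n

for value in (3.7, -3.7, 2.0 ** 32 + 5.0, 2.0 ** 31, float("nan")):
    print(value, js_to_int32(value))
```

The wrap-around (rather than saturating) behaviour on out-of-range inputs is what differs from ordinary FP-to-INT conversion and is the reason a software fallback needs extra range checks and branches.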
212
213 It also has to be pointed out that normally this work would be covered by
214 multiple separate full-time Workgroups with multiple Members contributing
215 their time and resources.
216
217 Overall the contributions that we are developing take the Power ISA out of
218 the specialist highly-focussed market it is presently best known for, and
219 expand it into areas with much wider general adoption and broader uses.
220
221
222 ---
223
224 OpenCL specifications are linked here; they are relevant when we get
225 to a 3D GPU / High Performance Compute ISA WG RFC:
226 [[openpower/transcendentals]]
227
228 (Failure to add Transcendentals to a 3D GPU is directly equivalent to
229 *willfully* designing a product that is 100% destined for commercial
230 rejection, due to the extremely high competitive performance/watt achieved
231 by today's mass-volume GPUs.)
232
233 I mention these because they will be encountered in every single
234 commercial GPU ISA, but they're not part of the "Base" (core design)
235 of a Vector Processor. Transcendentals can be added as a sub-RFC.
236
237 ---
238
239 Actual 3D GPU Architectures and ISAs:
240 -------------------------------------
241
242 * Broadcom Videocore
243 <https://github.com/hermanhermitage/videocoreiv>
244 * Etnaviv
245 <https://github.com/etnaviv/etna_viv/tree/master/doc>
246 * Nyuzi
247 <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>
248 * MALI
249 <https://github.com/cwabbott0/mali-isa-docs>
250 * AMD
251 <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
252 <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>
253 * MIAOW which is *NOT* a 3D GPU, it is a processor which happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
254 <https://miaowgpu.org/>
255
256
257 Actual Vector Processor Architectures and ISAs:
258 -----------------------------------------------
259
260 * NEC SX Aurora
261 <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>
262 * Cray ISA
263 <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>
264 * RISC-V RVV
265 <https://github.com/riscv/riscv-v-spec>
266 * MRISC32 ISA Manual (under active development)
267 <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>
268 * Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from
269 Mitch on direct contact with him. It takes a different approach from the
270 others, all of which may be termed "Cray-Style Horizontal-First" Vectorisation.
271 66000 is a *Vertical-First* Vector ISA.
272
273 The terms Horizontal and Vertical allude to the Matrix "Row-First" or
274 "Column-First" technique, where:
275
276 * Horizontal-First processes all elements in a Vector before moving on
277 to the next instruction
278 * Vertical-First processes *ONE* element per instruction, and requires
279 loop constructs to explicitly step to the next element.
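
The difference between the two orderings can be made concrete with a small sketch that records the order in which (instruction, element) pairs are issued. The function names and the two-instruction "program" are hypothetical and purely illustrative; real implementations are free to overlap the work as long as the visible result matches this order.

```python
def horizontal_first(program, vl):
    """All VL elements of one instruction are processed before the next instruction."""
    trace = []
    for insn in program:
        for element in range(vl):
            trace.append((insn, element))
    return trace

def vertical_first(program, vl):
    """One element per instruction; an explicit outer loop steps to the next element."""
    trace = []
    for element in range(vl):         # the explicit loop construct
        for insn in program:
            trace.append((insn, element))
    return trace

program = ["ld", "add"]
print(horizontal_first(program, 2))
print(vertical_first(program, 2))
```

Both functions visit exactly the same set of (instruction, element) pairs; only the nesting of the two loops, and therefore the issue order, differs.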
280
281 Vector-type Support by Architecture
282 [[!table data="""
283 Architecture | Horizontal | Vertical
284 MyISA 66000 | | X
285 Cray | X |
286 SX Aurora | X |
287 RVV | X |
288 SVP64 | X | X
289 """]]
290
291 ===
292
293 Obligatory Dilbert:
294
295 <img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />
296