[[!tag standards]]

# Simple-V Vectorisation for the OpenPOWER ISA

**SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.

<https://bugs.libre-soc.org/show_bug.cgi?id=213>

SV is designed as a Vector ISA for Hybrid 3D CPU GPU VPU workloads.
As such it brings features normally only found in Cray-style Vector
Supercomputers (Cray-1, NEC SX-Aurora)
and in GPUs, but keeps strictly to a *Simple* principle of leveraging
a *Scalar* ISA, exclusively using "Prefixing". **Not one single actual
explicit Vector opcode exists in SV, at all**.

Fundamental design principles:

* Simplicity of introduction and implementation on the existing OpenPOWER ISA
* Effectively a hardware for-loop, pausing the PC, issuing multiple scalar operations
* Preserving the underlying scalar execution dependencies as if the for-loop had been expanded as actual scalar instructions
  (termed "preserving Program Order")
* Augments ("tags") existing instructions, providing Vectorisation "context" rather than adding new ones.
* Does not modify or deviate from the underlying scalar OpenPOWER ISA unless it provides significant performance or other advantage to do so in the Vector space (dropping XER.SO and OE=1 for example)
* Designed for Supercomputing: avoids creating significant sequential
  dependency hazards, allowing high performance superscalar microarchitectures to be deployed.
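
The "hardware for-loop" principle listed above can be sketched in a few lines of Python. This is an illustration only, not the SVP64 specification: the register file is modelled as a plain list, `vl` stands in for the Vector Length, 64-bit wraparound is assumed, and the name `sv_add` is hypothetical. Elements execute strictly in order, so the result is the same as writing out `vl` scalar adds by hand, including when source and destination ranges overlap.

```python
def sv_add(regs, rt, ra, rb, vl):
    """Model a Vectorised scalar add as a for-loop over vl elements.

    regs is a list standing in for the integer register file; rt, ra and
    rb are starting register numbers. Hypothetical helper, for
    illustration only.
    """
    for i in range(vl):  # the PC is "paused" while this loop runs
        # each iteration is one ordinary scalar add, done in Program Order
        # (64-bit wraparound is an assumption of this sketch)
        regs[rt + i] = (regs[ra + i] + regs[rb + i]) & 0xFFFF_FFFF_FFFF_FFFF


regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]        # source vector A in r8..r11
regs[16:20] = [10, 20, 30, 40]   # source vector B in r16..r19
sv_add(regs, rt=24, ra=8, rb=16, vl=4)
print(regs[24:28])               # [11, 22, 33, 44]

# Overlapping ranges: the destination starts one register above source A,
# so each element reads the value written by the previous element.
regs2 = [0] * 32
regs2[8] = 1
regs2[16:20] = [1, 1, 1, 1]
sv_add(regs2, rt=9, ra=8, rb=16, vl=4)
print(regs2[8:13])               # [1, 2, 3, 4, 5]
```

The second call shows why preserving Program Order matters: reordering the element operations would change the values read from the overlapping registers.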

Advantages of these design principles:

* It is therefore easy to create a first (and sometimes only) implementation as literally a for-loop in hardware, simulators, and compilers.
* Hardware Architects may understand and implement SV as being an
  extra pipeline stage, inserted between decode and issue, that is
  a simple for-loop issuing element-level sub-instructions.
* More complex HDL can be done by repeating existing scalar ALUs and
  pipelines as blocks and leveraging existing Multi-Issue Infrastructure.
* As (mostly) a high-level "context" that does not (significantly) deviate from the scalar OpenPOWER ISA and, in its purest form, being "a for-loop around scalar instructions", it is minimally disruptive and consequently stands a reasonable chance of broad community adoption and acceptance.
* Wipes opcode proliferation off the map, not just for SIMD
  (SIMD is O(N^6) opcode proliferation)
  but for Vectorisation ISAs as well. No more separate Vector
  instructions.
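
The "extra pipeline stage between decode and issue" view can be sketched as a generator that turns one decoded scalar instruction into a stream of element-level sub-instructions. This is a sketch under stated assumptions: `ScalarOp` and `sv_issue` are hypothetical names, and all three operands are assumed to be Vectors, so every register number steps by one per element.

```python
from typing import Iterator, NamedTuple


class ScalarOp(NamedTuple):
    """One decoded scalar instruction: mnemonic plus register numbers."""
    name: str
    rt: int
    ra: int
    rb: int


def sv_issue(op: ScalarOp, vl: int) -> Iterator[ScalarOp]:
    """Hypothetical stage between decode and issue.

    Yields vl element-level sub-instructions, each an ordinary scalar
    operation. Assumes every operand is a Vector (illustration only).
    """
    for i in range(vl):
        yield ScalarOp(op.name, op.rt + i, op.ra + i, op.rb + i)


# One decoded "add" with a Vector Length of 3 becomes three scalar adds.
for sub_op in sv_issue(ScalarOp("add", rt=24, ra=8, rb=16), vl=3):
    print(sub_op)
```

Because each yielded item is just another scalar operation, the existing issue logic, ALUs and pipelines can consume the stream unchanged, which is the point the list above makes about reusing Multi-Issue infrastructure.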

Pages being developed and examples:

* [[sv/overview]] explaining the basics.
* [[sv/implementation]] implementation planning and coordination
* [[sv/svp64]] contains the packet-format *only*; the [[sv/svp64/appendix]]
  contains explanations and further details
* [[sv/setvl]] the Cray-style "Vector Length" instruction
* [[sv/svp64_quirks]] things in SVP64 that slightly break the rules
* [[sv/cr_int_predication]] instructions needed for effective predication
* [[opcode_regs_deduped]]
* [[sv/vector_swizzle]]
* [[sv/mv.swizzle]]
* [[sv/mv.x]]
* SVP64 "Modes":
    - For condition register operations see [[sv/cr_ops]] - SVP64 Condition Register ops: Guidelines
      on Vectorisation of any v3.0B base operations which return
      or modify a Condition Register bit or field.
    - For LD/ST Modes, see [[sv/ldst]].
    - For Branch modes, see [[sv/branches]] - SVP64 Conditional Branch behaviour: All/Some Vector CRs
    - For arithmetic and logical, see [[sv/normal]]
* [[sv/fcvt]] FP Conversion (due to OpenPOWER Scalar FP32)
* [[sv/fclass]] detect class of FP numbers
* [[sv/int_fp_mv]] Move and convert GPR <-> FPR, needed for !VSX
* [[sv/mv.vec]] move to and from vec2/3/4
* [[sv/sprs]] SPRs
* [[sv/bitmanip]]
* [[sv/biginteger]] Operations that help with big arithmetic
* [[sv/remap]] "Remapping" for Matrix Multiply and RGB "Structure Packing"
* [[sv/svstep]] Key stepping instruction for Vertical-First Mode
* [[sv/propagation]] Context propagation including svp64, swizzle and remap
* [[sv/vector_ops]] Vector ops needed to make a "complete" Vector ISA
* [[sv/av_opcodes]] scalar opcodes for Audio/Video
* Twin-targeted instructions (two registers out, one implicit).
  The rules for twin register targets
  (implicit RS, FRS) are explained in the SVP64 [[sv/svp64/appendix]]
    - [[isa/svfixedarith]]
    - [[isa/svfparith]]
* TODO: OpenPOWER [[openpower/transcendentals]]

Examples, experiments, ideas and discussion:

* [[sv/masked_vector_chaining]]
* [[sv/discussion]]
* [[sv/example_dep_matrices]]
* [[sv/major_opcode_allocation]]
* [[sv/byteswap]]
* [[sv/16_bit_compressed]] experimental
* [[sv/toc_data_pointer]] experimental
* [[sv/predication]] discussion on predication concepts
* [[sv/register_type_tags]]

Additional links:

* <https://www.sigarch.org/simd-instructions-considered-harmful/>
* [[simple_v_extension]] old (deprecated) version
* [[openpower/sv/llvm]]
* [[openpower/sv/effect-of-more-decode-stages-on-reg-renaming]]

===

Required Background Reading:
============================

These are all, deep breath, basically... required reading, *as well as and in addition* to a full and comprehensive deep technical understanding of the Power ISA, in order to understand the depth and background of SVP64 as a 3D GPU and VPU Extension.

I am keenly aware that each of them is 300 to 1,000 pages (just like the Power ISA itself).

This is just how it is.

Given the sheer overwhelming size and scope of SVP64 we have gone to CONSIDERABLE LENGTHS to provide justification and rationalisation for adding the various sub-extensions to the Base Scalar Power ISA.

* Scalar bitmanipulation is justifiable for the exact same reasons the extensions are justifiable for other ISAs. The additional justification for their inclusion where some instructions are already (sort-of) present in VSX is that VSX is not mandatory, and the complexity of implementation of VSX is too high a price to pay at the Embedded SFFS Compliancy Level.

* Scalar FP-to-INT conversions, likewise. ARM has a JavaScript conversion instruction; the Power ISA does not (and it costs a ridiculous 45 instructions to implement, including 6 branches!).

* Scalar Transcendentals (SIN, COS, ATAN2, LOG) are easily justifiable for High-Performance Compute workloads.
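
To make the JavaScript conversion mentioned in the list above concrete, here is a minimal Python model of the ECMAScript `ToInt32` operation, which is the behaviour such an instruction provides in one step. It is a sketch of the semantics only, not the instruction sequence counted above, and `js_to_int32` is a hypothetical name. The special cases (NaN, infinities, wrapping modulo 2^32) are what a plain saturating FP-to-INT convert does not cover.

```python
import math


def js_to_int32(x: float) -> int:
    """ECMAScript ToInt32: truncate toward zero, wrap modulo 2**32."""
    if math.isnan(x) or math.isinf(x):
        return 0                       # NaN and infinities convert to 0
    n = math.trunc(x) % (1 << 32)      # truncate, then wrap into 0..2**32-1
    # reinterpret the 32-bit pattern as a signed two's-complement value
    return n - (1 << 32) if n >= (1 << 31) else n


print(js_to_int32(3.7))                # 3
print(js_to_int32(-3.7))               # -3
print(js_to_int32(2.0 ** 31))          # -2147483648
print(js_to_int32(2.0 ** 32 + 5))      # 5
print(js_to_int32(float("nan")))       # 0
```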

It also has to be pointed out that normally this work would be covered by multiple separate full-time Workgroups with multiple Members contributing their time and resources!

Overall the contributions that we are developing take the Power ISA out of the specialist highly-focussed market it is presently best known for, and expand it into areas with much wider general adoption and broader uses.


---

OpenCL specifications are linked here; these are relevant when we get to a 3D GPU / High Performance Compute ISA WG RFC:
[[openpower/transcendentals]]

(Failure to add Transcendentals to a 3D GPU is directly equivalent to *willfully* designing a product that is 100% destined for commercial failure.)

I mention these because they will be encountered in every single commercial GPU ISA, but they're not part of the "Base" (core design) of a Vector Processor. Transcendentals can be added as a sub-RFC.

---

Actual 3D GPU Architectures and ISAs:
-------------------------------------

* Broadcom Videocore
  <https://github.com/hermanhermitage/videocoreiv>

* Etnaviv
  <https://github.com/etnaviv/etna_viv/tree/master/doc>

* Nyuzi
  <http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf>

* MALI
  <https://github.com/cwabbott0/mali-isa-docs>

* AMD
  <https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf>
  <https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf>

* MIAOW, which is *NOT* a 3D GPU: it is a processor that happens to implement a subset of the AMDGPU ISA (Southern Islands), aka a "GPGPU"
  <https://miaowgpu.org/>


Actual Vector Processor Architectures and ISAs:
-----------------------------------------------

* NEC SX Aurora
  <https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf>

* Cray ISA
  <http://www.bitsavers.org/pdf/cray/CRAY_Y-MP/HR-04001-0C_Cray_Y-MP_Computer_Systems_Functional_Description_Jun90.pdf>

* RISC-V RVV
  <https://github.com/riscv/riscv-v-spec>

* MRISC32 ISA Manual (under active development)
  <https://github.com/mrisc32/mrisc32/tree/master/isa-manual>

* Mitch Alsup's MyISA 66000 Vector Processor ISA Manual is available from Mitch on direct contact with him. It takes a different approach from the others, all of which may be termed "Cray-Style Horizontal-First" Vectorisation. 66000 is a *Vertical-First* Vector ISA.

The term Horizontal or Vertical alludes to the Matrix "Row-First" or "Column-First" technique, where:

* Horizontal-First processes all elements in a Vector before moving on to the next instruction
* Vertical-First processes *ONE* element per instruction, and requires loop constructs to explicitly step to the next element.
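
The difference between the two orderings can be sketched by listing the (instruction, element) pairs in the order each mode would process them. This is an illustration only: the function names and the three-instruction "program" are hypothetical, and the explicit `el += 1` stands in for whatever stepping construct the ISA provides.

```python
def horizontal_first(program, vl):
    """All elements of one instruction, then move to the next instruction."""
    return [(insn, el) for insn in program for el in range(vl)]


def vertical_first(program, vl):
    """One element of every instruction, then explicitly step the element."""
    order = []
    el = 0
    while el < vl:                  # explicit loop construct in the program
        for insn in program:
            order.append((insn, el))
        el += 1                     # explicit step to the next element
    return order


program = ["ld", "add", "st"]
print(horizontal_first(program, vl=2))
# [('ld', 0), ('ld', 1), ('add', 0), ('add', 1), ('st', 0), ('st', 1)]
print(vertical_first(program, vl=2))
# [('ld', 0), ('add', 0), ('st', 0), ('ld', 1), ('add', 1), ('st', 1)]
```

Both orderings cover the same set of element operations; only the traversal order (row-first versus column-first over the instruction-by-element matrix) differs.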

Vector-type Support by Architecture:
[[!table data="""
Architecture | Horizontal | Vertical
MyISA 66000 | | X
Cray | X |
SX Aurora | X |
RVV | X |
SVP64 | X | X
"""]]

===

Obligatory Dilbert:

<img src="https://assets.amuniversal.com/7fada35026ca01393d3d005056a9545d" width="600" />