\section{Summary}
Simple-V is a Scalable Vector Specification for a hardware for-loop that
ONLY uses scalar instructions.

\begin{itemize}
\item The Power ISA v3.1 Specification is not altered in any way;
v3.1 code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicates of instructions
(popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd).
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item Even at the highest Compliancy Level, requires only five instructions
(SVE2 requires approximately 9,000, AVX-512 around 10,000, RVV around
300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full DCT and FFT RADIX2 Triple-loops are achieved with dramatically
reduced instruction count, and power consumption is expected to be greatly
reduced. Such capability is normally found only in high-end VLIW DSPs
(TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows strncpy to be implemented in around 14
instructions (hand-optimised VSX assembler is 240).
\item Inner loop of MP3 implemented in under 100 instructions
(gcc produces 450 for the same function).
\end{itemize}

All areas investigated so far have consistently shown reductions in
executable size, which, as outlined in \cite{SIMD_HARM}, indirectly
reduces power consumption due to reduced I-Cache/TLB pressure and to
Issue remaining idle for longer periods.

Simple-V has been specifically and carefully crafted to respect
the Power ISA's Supercomputing pedigree.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{power_pipelines}
\caption{Showing how SV fits in between Decode and Issue}
\label{fig:power_pipelines}
\end{figure}

\pagebreak

\subsection{What is SIMD?}

\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and results stored as
entire 64-bit registers. The SIMD instruction opcode includes the data
width and the operation to perform.
\par

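As an illustrative sketch (in Python, not taken from the cited papers), the following models this partitioning: a single SIMD "opcode" adds all four 16-bit lanes of a 64-bit register simultaneously, with inputs and result held as whole 64-bit values. The function and constant names are invented for this example.

```python
# Model a 64-bit register as a Python int, partitioned into 16-bit lanes.
LANE_BITS = 16
LANES = 4
MASK = (1 << LANE_BITS) - 1

def simd_add16(ra, rb):
    """One SIMD opcode: add four 16-bit lanes (modulo 2^16 each).
    In hardware all four lanes operate in parallel."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        a = (ra >> shift) & MASK
        b = (rb >> shift) & MASK
        result |= ((a + b) & MASK) << shift
    return result

# Pack [1, 2, 3, 4] and [10, 20, 30, 40] into 64-bit "registers".
ra = sum((v & MASK) << (i * LANE_BITS) for i, v in enumerate([1, 2, 3, 4]))
rb = sum((v & MASK) << (i * LANE_BITS) for i, v in enumerate([10, 20, 30, 40]))
rt = simd_add16(ra, rb)
lanes = [(rt >> (i * LANE_BITS)) & MASK for i in range(LANES)]
print(lanes)  # [11, 22, 33, 44]
```

Note that the lane width (16 bits here) is fixed by the opcode itself: a different element width requires a different instruction.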
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.) \cite{SIMD_WASM}, and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\par

SIMD registers are of a fixed length, and thus to achieve greater
performance, CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit, etc.) to gain more partitions.\par
Additionally, binary compatibility is an important feature, and thus each
doubling of SIMD register width also expands the instruction set. The
number of instructions quickly balloons, as can be seen in popular
\ac{ISA}s: for example, IA-32 has expanded from 80 to about 1400
instructions since 1978 \cite{SIMD_HARM}.\par

\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory, and
store them in large, sequential vector registers.\par

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{cray_vector_regs}
\caption{Cray Vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
\end{figure}

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can also
deal with (in hardware) fringe cases where the vector length is not a
multiple of the number of elements. The element data width is variable
(just like in SIMD) but it is the \textit{number} of elements being
variable under control of a "setvl" instruction that makes Vector ISAs
"Scalable".
\par

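A minimal sketch of this "setvl" idea (simplified, with a hypothetical maximum vector length): each pass the hardware sets VL to the smaller of the remaining element count and the implementation's maximum, so a loop over any length needs no separate fringe code.

```python
MAXVL = 8  # implementation-defined maximum vector length (hypothetical)

def setvl(remaining):
    """The 'setvl' idea: hardware caps VL at its implemented maximum."""
    return min(remaining, MAXVL)

def vector_add(a, b):
    """Strip-mined vector add: works for any length, no fringe loop."""
    out = []
    n = len(a)
    i = 0
    while i < n:
        vl = setvl(n - i)        # VL = min(remaining, MAXVL)
        for e in range(vl):      # one vector instruction, VL elements
            out.append(a[i + e] + b[i + e])
        i += vl
    return out

# Length 10 is not a multiple of MAXVL=8: the hardware simply runs
# one pass with VL=8 and a final pass with VL=2.
print(vector_add(list(range(10)), list(range(10))))
```

The same binary runs unchanged on a machine with a different MAXVL; only the number of passes changes, which is the essence of "Scalable".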
The RISC-V Vector extension (RVV) supports a VL of up to $2^{16}$ ($65536$)
bits, which can fit 1024 64-bit words \cite{riscv-v-spec}. The Cray-1 had
8 Vector Registers with up to 64 elements each. An early draft of RVV
supported overlaying the Vector Registers onto the Floating Point
registers, similar to x86 "MMX".

Simple-V's "Vector" Registers are specifically designed to fit on top of
the Scalar (GPR, FPR) register files, which are extended from the default
of 32, to 128 entries in the Libre-SOC implementation. This is a primary
reason why Simple-V can be added on top of an existing Scalar ISA, and
\textit{in particular} why there is no need to add Vector Registers or
Vector instructions.

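The overlay can be sketched as follows (function names invented for illustration): a "vector" is nothing more than a contiguous run of entries in the 128-entry scalar register file, so element $i$ of a vector starting at GPR $n$ is simply GPR $n+i$.

```python
# The scalar register file, extended to 128 entries (the Libre-SOC
# implementation default).
gprs = [0] * 128

def vector_write(start, values):
    """A 'vector register' is just GPRs start .. start+len(values)-1."""
    for i, v in enumerate(values):
        gprs[start + i] = v

def vector_read(start, vl):
    """Reading a vector is reading VL consecutive scalar registers."""
    return gprs[start:start + vl]

vector_write(64, [5, 6, 7, 8])  # a vector at GPR 64, VL=4
print(gprs[66])                 # element 2 is just scalar GPR 66 -> 7
print(vector_read(64, 4))       # [5, 6, 7, 8]
```

Because every element remains individually addressable as an ordinary scalar register, scalar and vector code can freely interleave on the same register file.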
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{svp64_regs}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
\end{figure}

\subsection{Comparison Between SIMD and Vector}
\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}

\subsubsection{Code Example}
\begin{verbatim}
test test
\end{verbatim}
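In the meantime, an illustrative sketch (in Python; this is not the example from \cite{SIMD_HARM}, and all names are invented) of the key encoding difference: SIMD bakes the element width into the opcode, so each width is a separate instruction, whereas a vector ISA has one opcode and treats width as runtime "context".

```python
def make_simd_add(width):
    """Manufacture a distinct fixed-width SIMD opcode (add8, add16, ...)
    operating on the lanes of a 64-bit register."""
    lanes, mask = 64 // width, (1 << width) - 1
    def add(ra, rb):
        r = 0
        for i in range(lanes):
            s = i * width
            r |= ((((ra >> s) & mask) + ((rb >> s) & mask)) & mask) << s
        return r
    return add

# SIMD: the ISA needs one opcode per element width (and more again
# for every register-width doubling)...
simd_opcodes = {w: make_simd_add(w) for w in (8, 16, 32)}

# Vector: ONE add opcode; element width is context, not encoding.
def vec_add(ra, rb, elwidth):
    return make_simd_add(elwidth)(ra, rb)

print(simd_opcodes[16](0x0001_0002_0003_0004, 0x0010_0020_0030_0040)
      == vec_add(0x0001_0002_0003_0004, 0x0010_0020_0030_0040, 16))  # True
```

Multiplying a handful of operations by every supported width and register size is precisely how SIMD instruction counts balloon, while the vector approach keeps the opcode count constant.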

\subsection{Shortfalls of SIMD}
Five-digit opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing, etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}

\subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
Supercomputers (Cray-1, NEC SX-Aurora) and GPUs, yet keeps to a strict
uniform RISC paradigm, leveraging a scalar ISA by using "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.

\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduce by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop}: pauses the main PC,
issues multiple scalar operations
\item Preserves underlying scalar execution dependencies as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order")
\item Augments existing instructions by adding "tags": provides
Vectorisation "context" rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
leveraged
\end{itemize}
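
The "hardware for-loop over scalar instructions" can be sketched in a few lines of Python (a simplified model, not actual SVP64 semantics or mnemonics): a tagged add with VL=4 simply issues four ordinary scalar adds with incremented register numbers, in Program Order.

```python
regs = [0] * 128   # scalar GPR file (extended to 128 entries)

def scalar_add(rt, ra, rb):
    """An ordinary scalar instruction: regs[rt] = regs[ra] + regs[rb]."""
    regs[rt] = regs[ra] + regs[rb]

def sv_add(rt, ra, rb, vl):
    """An SV-tagged add: the hardware for-loop pauses the main PC and
    issues VL scalar adds, registers incrementing each element."""
    for i in range(vl):
        scalar_add(rt + i, ra + i, rb + i)

regs[8:12] = [1, 2, 3, 4]       # RA vector at r8..r11
regs[16:20] = [10, 20, 30, 40]  # RB vector at r16..r19
sv_add(0, 8, 16, vl=4)          # RT vector at r0..r3
print(regs[0:4])  # [11, 22, 33, 44]
```

Because each element operation is a genuine scalar instruction, all existing scalar dependency-tracking and hazard logic applies unchanged.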

Advantages include:
\begin{itemize}
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers.
\item Hardware Architects may understand and implement SV as
an extra pipeline stage, inserted between Decode and
Issue: essentially a simple for-loop issuing element-level
sub-instructions.
\item More complex HDL can be achieved by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
Infrastructure.
\item SV consists mostly of high-level "context" which does not
significantly deviate from the scalar Power ISA and, in its purest
form, is a "for-loop around scalar instructions". Thus SV is
minimally-disruptive and consequently has a reasonable chance
of broad community adoption and acceptance.
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as the need for dedicated Vectorisation
ISAs. No more separate vector instructions.
\end{itemize}
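
The "extra pipeline stage between Decode and Issue" can be sketched as follows (a simplified model with invented names, not the actual Libre-SOC microarchitecture): Decode produces a scalar operation plus an optional SV tag; the SV stage expands it into element-level sub-instructions which unmodified Issue logic then consumes.

```python
from dataclasses import dataclass

@dataclass
class DecodedOp:
    """A decoded instruction: a scalar op plus an optional SV 'tag'."""
    op: str
    rt: int
    ra: int
    rb: int
    vl: int = 1    # vl == 1 means a plain scalar instruction

def sv_expand(d):
    """The SV stage between Decode and Issue: a for-loop emitting
    element-level scalar sub-instructions; scalar ops pass through
    unchanged (a single iteration)."""
    for i in range(d.vl):
        yield DecodedOp(d.op, d.rt + i, d.ra + i, d.rb + i)

# A VL=3 tagged add expands into three ordinary scalar adds for Issue.
for sub in sv_expand(DecodedOp("add", rt=0, ra=8, rb=16, vl=3)):
    print(sub.op, sub.rt, sub.ra, sub.rb)
```

Everything downstream of this stage (Issue, register renaming, execution units) sees only scalar instructions, which is why no disruptive redesign is required.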

\subsubsection{Prefix 64 - SVP64}

SVP64 is a specification designed to solve the problems caused by
SIMD implementations by:
\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Reducing code size and power consumption
\item Making life easier for compilers, coders, and documentation
\item Reducing the time to support the platform to a fraction of that of
conventional SIMD (less money on R\&D, faster to deliver)
\end{itemize}

\textit{(Draft notes:)}
\begin{itemize}
\item Intel SIMD has been incrementally added to for decades, requires
backwards interoperability, and thus has greater complexity (?)
\item What are we going to gain?
\item a for-loop, incrementing registers RT, RA, RB
\item few instructions, easier to implement and maintain
\item example assembly code
\item ARM has already started to add SVE2 support to libc
\item 1970s x86 comparison
\end{itemize}