\section{Summary}
Simple-V is a Scalable Vector Specification for a hardware for-loop that
ONLY uses scalar instructions.

\begin{itemize}
\item The Power ISA v3.1 Specification is not altered in any way:
v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level requires only five instructions
(SVE2 requires approx.\ 9,000; AVX-512 around 10,000; RVV around
300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full DCT and FFT RADIX2 Triple-loops are achieved with a dramatically
reduced instruction count, and power consumption is expected to be greatly
reduced. Such capability is normally found only in high-end VLIW DSPs
(TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows strncpy to be implemented in around 14
instructions (hand-optimised VSX assembler is 240).
\item The inner loop of MP3 decoding is implemented in under 100 instructions
(gcc produces 450 for the same function).
\end{itemize}
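The Fail-First Load/Store behaviour referred to above can be sketched in
plain C. This is a hedged, illustrative model only (the function name and the
byte-element choice are inventions for illustration, not SVP64 syntax):
elements are processed one at a time, and the effective Vector Length is
truncated at the first element meeting the "fail" condition, here the string
terminator.

```c
#include <stddef.h>

/* Illustrative model of Fail-First: copy up to 'vl' byte-elements,
 * truncating the effective vector length at the first zero byte
 * (inclusive). Returns the number of elements actually processed. */
static size_t failfirst_copy(char *dst, const char *src, size_t vl)
{
    size_t i;
    for (i = 0; i < vl; i++) {
        dst[i] = src[i];          /* scalar element operation       */
        if (src[i] == '\0')       /* "fail" condition: terminator   */
            return i + 1;         /* truncate VL at this element    */
    }
    return vl;                    /* no failure: full VL processed  */
}
```

A strncpy-style routine then simply repeats this block, stopping once the
truncated VL indicates the terminator has been copied.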

All areas investigated so far have consistently shown reductions in executable
size, which, as outlined in \cite{SIMD_HARM}, indirectly reduces power
consumption due to reduced I-Cache/TLB pressure and less time spent with
Issue hardware sitting idle.

\pagebreak

\subsection{What is SIMD?}
\textit{(for clarity only 64-bit registers will be discussed here;
however 128-, 256-, and 512-bit implementations also exist)}

\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and results stored as
entire 64-bit registers. The SIMD instruction opcode includes the data
width and the operation to perform.\par

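The register partitioning just described can be emulated in software
("SWAR", SIMD-within-a-register). The following hedged C sketch treats one
64-bit value as four 16-bit lanes and adds them lane-by-lane; a real SIMD
add-16x4 instruction performs all four additions at once, with no carries
crossing lane boundaries.

```c
#include <stdint.h>

/* Treat a 64-bit value as four independent 16-bit lanes and add
 * lane-wise (wrap-around within each lane, no carry between lanes).
 * This emulates in a software loop what a hardware SIMD add-16x4
 * instruction performs in a single operation. */
static uint64_t simd_add16x4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t ea = (uint16_t)(a >> (16 * lane));  /* extract lane */
        uint16_t eb = (uint16_t)(b >> (16 * lane));
        uint16_t er = (uint16_t)(ea + eb);           /* per-lane wrap */
        r |= (uint64_t)er << (16 * lane);            /* re-pack lane  */
    }
    return r;
}
```

Note that an overflow in one lane wraps around within that lane rather than
carrying into its neighbour, which is exactly the partitioned behaviour the
SIMD opcode's data-width field selects.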
\begin{figure}[h]
\includegraphics[width=\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to
scalar-only instructions.\par

SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit etc.)\ to provide more partitions.\par
Additionally, binary compatibility is an important feature, and thus each
doubling of SIMD register width also expands the instruction set. The
number of instructions quickly balloons, as can be seen in popular
\acp{ISA}: for example IA-32, which has expanded from 80 to about 1400
instructions since 1978 \cite{SIMD_HARM}.\par

\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can also
deal in hardware with fringe cases where the vector length is not a
multiple of the number of elements. The element data width is variable
(just as in SIMD). Fig.\ \ref{fig:vl_reg_n} shows the relationship
between the number of elements, data width, and register vector length.

\begin{figure}[h]
\includegraphics[width=\linewidth]{vl_reg_n}
\caption{Vector length, data width, number of elements}
\label{fig:vl_reg_n}
\end{figure}

The RISC-V Vector extension supports a vector length of up to $2^{16}$
($65536$) bits, which can fit 1024 64-bit words \cite{riscv-v-spec}.
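The in-hardware fringe handling mentioned above is commonly achieved by
"strip-mining". The C sketch below is generic (the \texttt{setvl} helper and
\texttt{MAXVL} constant are illustrative stand-ins, not RVV intrinsics): each
pass requests up to MAXVL elements and is granted min(remaining, MAXVL), so a
vector length that is not a multiple of the hardware element count needs no
separate scalar tail loop.

```c
#include <stddef.h>

#define MAXVL 8  /* elements the (hypothetical) hardware processes at once */

/* "setvl": the hardware grants at most MAXVL elements per iteration. */
static size_t setvl(size_t remaining)
{
    return remaining < MAXVL ? remaining : MAXVL;
}

/* Strip-mined vector add: c[i] = a[i] + b[i] for n elements.
 * The final, shorter "fringe" iteration is handled simply by setvl()
 * returning fewer than MAXVL elements - no scalar tail loop needed. */
static void vec_add(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ) {
        size_t vl = setvl(n - i);         /* granted vector length   */
        for (size_t e = 0; e < vl; e++)   /* one "vector" operation  */
            c[i + e] = a[i + e] + b[i + e];
        i += vl;
    }
}
```

With n = 10 and MAXVL = 8, the loop runs once with VL = 8 and once with
VL = 2: the fringe is absorbed by the vector-length mechanism itself.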

\subsection{Comparison Between SIMD and Vector}
\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}

\subsubsection{Code Example}
The following simplified pseudo-assembler (illustrative only: the mnemonics
do not belong to any specific ISA) contrasts the two approaches for adding
two arrays of $n$ elements:
\begin{verbatim}
# Fixed-width SIMD (4 lanes per operation). When n is not
# a multiple of 4, a separate scalar tail loop is required.
simd_loop:
    load4   v0, [a]        # load 4 elements from a
    load4   v1, [b]
    add4    v2, v0, v1     # 4 additions in one instruction
    store4  v2, [c]
    # ...advance pointers by 4, branch; then scalar tail loop

# Vector. VL is set each iteration; the hardware handles
# the fringe case, so no tail loop is needed.
vec_loop:
    setvl   t0, n          # VL = min(n, MAXVL)
    loadv   v0, [a]        # load VL elements
    loadv   v1, [b]
    addv    v2, v0, v1     # VL additions
    storev  v2, [c]
    # ...advance pointers by VL, n -= VL, branch
\end{verbatim}

\subsection{Shortfalls of SIMD}
Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}

\subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
Supercomputers (Cray-1, NEC SX-Aurora) and GPUs, while keeping to a strict
uniform RISC paradigm, leveraging a scalar ISA by using "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.

\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduced by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop}: pauses the main PC,
issues multiple scalar operations
\item Preserves underlying scalar execution dependencies as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order")
\item Augments existing instructions by adding "tags": provides
Vectorisation "context" rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space (see \ref{subsubsec:add_to_pow_isa})
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
leveraged
\end{itemize}

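The "hardware for-loop" principle can be modelled in a few lines of C. In
this hedged sketch (the register-file size, register numbering, and the
choice of an add operation are purely illustrative, not SVP64 encoding), a
single prefixed scalar add expands into VL element-level scalar adds over
consecutive registers, exactly as if the loop had been written out as
individual scalar instructions.

```c
#include <stdint.h>

#define NREGS 128   /* SV extends the register file; size illustrative */

static uint64_t regs[NREGS];

/* Model of a vectorised scalar add: the prefix supplies VL (the
 * Vectorisation "context"), while the scalar instruction supplies the
 * operation and base register numbers. The PC pauses while the loop
 * issues VL element-level, plain scalar adds in Program Order. */
static void sv_add(unsigned rt, unsigned ra, unsigned rb, unsigned vl)
{
    for (unsigned i = 0; i < vl; i++)
        regs[rt + i] = regs[ra + i] + regs[rb + i];  /* scalar add */
}
```

Because each element operation is an ordinary scalar add on its own
registers, existing scalar dependency-tracking hardware can be reused
unchanged.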
Advantages include:
\begin{itemize}
\item Easy to create the first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers.
\item Hardware Architects may understand and implement SV as
an extra pipeline stage, inserted between decode and
issue: essentially a simple for-loop issuing element-level
sub-instructions.
\item More complex HDL can be done by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
Infrastructure.
\item Mostly high-level "context" which does not significantly
deviate from the scalar Power ISA, being in its purest form
a "for-loop around scalar instructions". Thus SV is
minimally-disruptive and consequently has a reasonable chance
of broad community adoption and acceptance.
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as the need for dedicated Vectorisation
ISAs. No more separate vector instructions.
\end{itemize}

\subsubsection{Prefix 64 - SVP64}

SVP64 is a specification designed to solve the problems caused by
SIMD implementations by:
\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Reducing code size and power consumption
\item Being easier for compilers, coders, and documentation
\item Reducing the time to support the platform to a fraction of that of
conventional SIMD (less money on R\&D, faster to deliver)
\end{itemize}

\begin{itemize}
\item Intel SIMD has been incrementally added to for decades, requires
backwards interoperability, and thus has a greater complexity (?)
\item What are we going to gain?
\item for-loop, incrementing registers RT, RA, RB
\item few instructions, easier to implement and maintain
\item example assembly code
\item ARM has already started to add SVE2 support to libC
\item 1970 x86 comparison
\end{itemize}