\section{Summary}
The proposed \acs{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
ONLY uses scalar instructions}.

\begin{itemize}
\item The Power \acs{ISA} v3.1 Specification is not altered in any way:
v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicates of instructions
(popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd).
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level requires only five instructions
(SVE2 requires approximately 9,000, \acs{AVX-512} around 10,000, and
\acs{RVV} around 300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full \acs{DCT} and \acs{FFT} RADIX2 Triple-loops are achieved with
a dramatically reduced instruction count, and power consumption is
expected to be greatly reduced. Such capability is normally found only
in high-end \acs{VLIW} \acs{DSP}s (TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows strncpy to be implemented in around 14
instructions (hand-optimised \acs{VSX} assembler is 240).
\item The inner loop of MP3 is implemented in under 100 instructions
(gcc produces 450 for the same function on POWER9).
\end{itemize}

All areas investigated so far consistently showed reductions in executable
size, which, as outlined in \cite{SIMD_HARM}, indirectly reduces
power consumption due to lower I-Cache/TLB pressure and to the Issue stage
remaining idle for longer periods.

Simple-V has been specifically and carefully crafted to respect
the Power ISA's Supercomputing pedigree.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{power_pipelines}
\caption{Showing how SV fits in between Decode and Issue}
\label{fig:power_pipelines}
\end{figure}

\pagebreak

\subsection{What is SIMD?}

\acs{SIMD} is a way of partitioning existing \acs{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces.
\cite{SIMD_HARM}\cite{SIMD_HPC}
These partitions can then be operated on simultaneously, with the initial
values and results stored as entire 64-bit registers. The SIMD instruction
opcode includes the data width and the operation to perform.
\par
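The partitioning principle can be made concrete with a minimal Python sketch (illustrative only, not any real SIMD instruction set): eight independent 8-bit lanes packed into one 64-bit register value, added simultaneously in concept, with no carry crossing lane boundaries.

```python
# Illustrative sketch: simulate an 8-lane, 8-bit SIMD add packed
# into a single 64-bit register value. Lane arithmetic wraps modulo
# 256, and no carry propagates between lanes.

LANES = 8
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1

def simd_add8(a: int, b: int) -> int:
    """Add two 64-bit register values as eight independent 8-bit lanes."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        la = (a >> shift) & LANE_MASK
        lb = (b >> shift) & LANE_MASK
        result |= ((la + lb) & LANE_MASK) << shift
    return result

# Two packed "registers": lanes [1..8] and [10, 10, ..., 10]
a = sum((i + 1) << (i * 8) for i in range(8))
b = sum(10 << (i * 8) for i in range(8))
packed = simd_add8(a, b)
lanes = [(packed >> (i * 8)) & 0xFF for i in range(8)]
print(lanes)  # [11, 12, 13, 14, 15, 16, 17, 18]
```

A real SIMD unit performs all eight lane additions in the same clock cycle; the Python loop merely models the lane isolation.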

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for the rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.) \cite{SIMD_WASM},
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\par

\subsection{Shortfalls of SIMD}
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit etc.) to provide more partitions.\par Additionally,
binary compatibility is an important feature, and thus each doubling
of SIMD registers also expands the instruction set. The number of
instructions quickly balloons: this can be seen, for example, in
IA-32, which has expanded from 80 to about 1400 instructions since
the 1970s \cite{SIMD_HARM}.\par

Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}

\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}.
Such processors can also deal in hardware with fringe cases where the vector
length is not a multiple of the number of elements. The element data width
is variable (just as in SIMD) but it is the \textit{number} of elements being
variable, under control of a "setvl" instruction, that makes Vector ISAs
"Scalable".
\par

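The "setvl" idea can be sketched in a few lines of Python (an illustrative model, not the exact semantics of any particular ISA): software asks for as many elements as remain, the hardware grants up to its maximum, and the final "fringe" pass is handled by the same loop with a shorter vector length, with no scalar tail code.

```python
# Illustrative sketch of "setvl"-style strip-mining: MAXVL is an
# assumed hardware maximum, not a real parameter of any shipping ISA.

MAXVL = 4  # assumed: hardware can process up to 4 elements per operation

def setvl(remaining: int) -> int:
    """Model of a setvl instruction: the hardware grants up to MAXVL
    elements, fewer on the final pass."""
    return min(remaining, MAXVL)

def vector_add(a, b):
    """Add two arrays of arbitrary length n using strip-mined
    'vector' operations; the fringe needs no special-case code."""
    n = len(a)
    out = []
    i = 0
    while i < n:
        vl = setvl(n - i)  # ask how many elements this pass covers
        # one conceptual "vector add" over vl elements
        out.extend(a[i + j] + b[i + j] for j in range(vl))
        i += vl
    return out

res = vector_add([1, 2, 3, 4, 5, 6, 7], [10] * 7)
print(res)  # [11, 12, 13, 14, 15, 16, 17]
```

Here a 7-element workload runs as one full pass of 4 elements followed by a fringe pass of 3, which is exactly the property that makes the vector length scalable rather than baked into the opcode.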
\acs{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
which can fit 1024 64-bit words \cite{riscv-v-spec}.
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early Draft of RVV supported overlaying the Vector Registers onto the
Floating Point registers, similar to \acs{MMX}.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{cray_vector_regs}
\caption{Cray Vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
\end{figure}

Simple-V's "Vector" Registers are specifically designed to fit on top of
the Scalar (GPR, FPR) register files, which are extended from the default
of 32 to 128 entries in the Libre-SOC implementation. This is a primary
reason why Simple-V can be added on top of an existing Scalar ISA, and
\textit{in particular} why there is no need to add Vector Registers or
Vector instructions.

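The overlay can be modelled in a short Python sketch (illustrative only; register numbers and the function name are invented for the example): a "vector" is simply a run of consecutive entries in the extended scalar register file, and a Vectorised add is VL ordinary scalar adds over those entries.

```python
# Illustrative model: Simple-V style "vectors" as consecutive slices
# of an extended scalar register file (128 GPRs, as in the Libre-SOC
# implementation described above). No separate vector register file.

GPRS = [0] * 128  # extended scalar register file

def sv_add(rt: int, ra: int, rb: int, vl: int) -> None:
    """Vectorised scalar add: a hardware for-loop issuing vl scalar
    adds over consecutive GPRs starting at rt, ra, rb."""
    for i in range(vl):
        GPRS[rt + i] = GPRS[ra + i] + GPRS[rb + i]

# Place two 4-element "vectors" at GPR 8 and GPR 16 (arbitrary choices)
GPRS[8:12] = [1, 2, 3, 4]
GPRS[16:20] = [100, 200, 300, 400]
sv_add(24, 8, 16, vl=4)
print(GPRS[24:28])  # [101, 202, 303, 404]
```

Because every element lands in an ordinary GPR, any scalar instruction can subsequently operate on any element directly, with no extract/insert moves between register files.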
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{svp64_regs}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
\end{figure}

\subsection{Simple Vectorisation}
\acs{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
Supercomputers (the Cray-1, NEC SX-Aurora) and on GPUs, yet keeps to a
strict uniform RISC paradigm, leveraging a scalar ISA by using "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.
SVP64 uses 25\% of the Power ISA v3.1 64-bit Prefix space (EXT001) to create
the SV Vectorisation Context for the 32-bit Scalar Suffix.

\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduced by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop}: pauses the main PC and
issues multiple scalar operations
\item Preserves underlying scalar execution dependencies as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order")
\item Augments existing instructions by adding "tags", providing
Vectorisation "context" rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
leveraged
\end{itemize}

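The "preserving Program Order" principle above has an observable consequence worth making concrete. In this Python sketch (illustrative only; the instruction name and register numbers are invented), the prefix replays the scalar suffix VL times in strict element order, so when the destination overlaps a source the result is exactly what an unrolled sequence of scalar adds would produce, here a running sum.

```python
# Illustrative model of the "hardware for-loop": an SVP64 prefix does
# not add a new opcode, it replays the 32-bit scalar suffix VL times,
# element by element, in Program Order.

def execute_scalar(regs, op, rt, ra, rb):
    """One ordinary scalar instruction (only 'add' modelled here)."""
    if op == "add":
        regs[rt] = regs[ra] + regs[rb]

def execute_svp64(regs, vl, op, rt, ra, rb):
    """Pause the main PC and issue vl element-level scalar
    sub-instructions, exactly as if the loop were unrolled."""
    for i in range(vl):
        execute_scalar(regs, op, rt + i, ra + i, rb + i)

regs = {i: 0 for i in range(32)}
regs[0] = 1
regs[4] = regs[5] = regs[6] = 1

# Destination r1..r3 overlaps source r0..r2: strict element order
# makes each add see the previous element's result (a running sum).
execute_svp64(regs, vl=3, op="add", rt=1, ra=0, rb=4)
print([regs[1], regs[2], regs[3]])  # [2, 3, 4]
```

A SIMD unit computing all lanes at once from the old register values would give a different (and wrong, by scalar semantics) answer; preserving Program Order is what lets existing scalar dependency tracking be reused unchanged.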
Advantages include:
\begin{itemize}
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers.
\item Hardware Architects may understand and implement SV as
an extra pipeline stage inserted between decode and
issue: essentially a simple for-loop issuing element-level
sub-instructions.
\item More complex HDL can be built by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
Infrastructure.
\item Mostly high-level "context" which does not significantly
deviate from the scalar Power ISA and, in its purest form,
is a "for-loop around scalar instructions". Thus SV is
minimally-disruptive and consequently has a reasonable chance
of broad community adoption and acceptance.
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as the need for dedicated Vectorisation
ISAs. No more separate vector instructions.
\item Reduces maintenance overhead (no separate Vector instructions):
adding a Scalar instruction automatically gains a Vectorised
version.
\item Easier for compilers, coders, and documentation.
\end{itemize}