Simple-V is a Scalable Vector Specification for a hardware for-loop that
ONLY uses scalar instructions.

\begin{itemize}
\item The Power ISA v3.1 Specification is not altered in any way;
v3.1 code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level requires only five instructions
(SVE2 requires approximately 9,000; AVX-512 around 10,000; RVV around \ldots).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full DCT and FFT RADIX2 triple-loops are achieved with dramatically
reduced instruction count, and power consumption is expected to be greatly
reduced. These are normally found only in high-end VLIW DSPs (TI MSP, Qualcomm
Hexagon).
\item Fail-First Load/Store allows strncpy to be implemented in around 14
instructions (hand-optimised VSX assembler is 240).
\item The inner loop of MP3 is implemented in under 100 instructions
(gcc produces 450 for the same function).
\end{itemize}
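As a rough sketch of why Fail-First Load/Store shortens strncpy, the following C model truncates the vector length at the first zero byte, so the terminator check needs no separate scalar loop. Note that \texttt{copy\_ff}, \texttt{ff\_strncpy}, and \texttt{MAXVL} are invented names, and the semantics are a deliberate simplification of (not a substitute for) the actual SVP64 fail-first definition:

```c
#include <stddef.h>

#define MAXVL 8   /* illustrative hardware vector length, not a real SV value */

/* Simplified model of a data-dependent fail-first byte copy:
 * copies up to 'vl' bytes but truncates at (and including) the
 * first zero byte, returning how many elements were processed. */
static size_t copy_ff(char *dest, const char *src, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dest[i] = src[i];
        if (src[i] == '\0')
            return i + 1;          /* VL truncated by the "fail" condition */
    }
    return vl;
}

/* strncpy built on the fail-first primitive: each outer
 * iteration is conceptually one vectorised load/store pair. */
char *ff_strncpy(char *dest, const char *src, size_t n)
{
    size_t done = 0;
    while (done < n) {
        size_t req = n - done;
        size_t vl  = copy_ff(dest + done, src + done,
                             req < MAXVL ? req : MAXVL);
        done += vl;
        if (dest[done - 1] == '\0') {             /* terminator copied   */
            while (done < n) dest[done++] = '\0'; /* strncpy zero-fill   */
            break;
        }
    }
    return dest;
}
```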
All areas investigated so far have consistently shown reductions in executable
size which, as outlined in \cite{SIMD_HARM}, indirectly reduce
power consumption due to reduced I-Cache/TLB pressure and to Issue logic
remaining idle for longer periods.
\subsection{What is SIMD?}

\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, and 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and results being stored as
entire 64-bit registers. The SIMD instruction opcode includes the data
width and the operation to perform.
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{simd_axb}
\caption{SIMD multiplication}
\end{figure}
This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.) \cite{SIMD_WASM}, and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
SIMD registers are of a fixed length and thus, to achieve greater
performance, CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit, etc.) for more partitions. \par Additionally,
binary compatibility is an important feature, and thus each doubling
of SIMD register width also expands the instruction set. The number of
instructions quickly balloons, and this can be seen in popular
\ac{ISA}s: for example, IA-32 has expanded from 80 to about 1,400
instructions since 1978 \cite{SIMD_HARM}. \par
\subsection{Vector Architectures}

An older alternative exists to utilise data parallelism: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers. \par
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{cray_vector_regs}
\caption{Cray vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
\end{figure}
A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines. \par
Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can also
deal (in hardware) with fringe cases where the vector length is not a
multiple of the number of elements. The element data width is variable
(just like in SIMD), but it is the \textit{number} of elements being
variable, under the control of a "setvl" instruction, that makes Vector ISAs
scalable.
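The effect of "setvl" can be sketched as a strip-mining loop in C. This is an illustrative software model only: \texttt{MAXVL}, \texttt{setvl}, and \texttt{vec\_add} are invented names for this sketch, not actual ISA mnemonics.

```c
#include <stddef.h>

#define MAXVL 8  /* hardware maximum vector length (illustrative value) */

/* Model of "setvl": the hardware grants min(requested, MAXVL). */
static size_t setvl(size_t requested)
{
    return requested < MAXVL ? requested : MAXVL;
}

/* Strip-mined vector add: c[i] = a[i] + b[i] for n elements.
 * Fringe cases (n not a multiple of MAXVL) need no special code:
 * the final iteration simply runs with a shorter vector length. */
void vec_add(long *c, const long *a, const long *b, size_t n)
{
    while (n > 0) {
        size_t vl = setvl(n);            /* elements this iteration */
        for (size_t i = 0; i < vl; i++)  /* one "vector" operation   */
            c[i] = a[i] + b[i];
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```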
The RISC-V Vector extension (RVV) supports a VL of up to $2^{16}$ ($65536$) bits,
which can fit 1024 64-bit words \cite{riscv-v-spec}. The Cray-1 had
8 vector registers with up to 64 elements each. An early draft of RVV supported
overlaying the vector registers onto the floating-point registers, similar
to the approach taken by Simple-V.
Simple-V's "Vector" registers are specifically designed to fit
on top of the Scalar (GPR, FPR) register files, which are extended from
32 to 128 entries. This is a primary reason why Simple-V can be added
on top of an existing Scalar ISA, and \textit{in particular} why there
is no need to add vector registers or vector instructions.
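The idea of running the same storage at different element widths can be sketched in C: elements are packed LSB-first into a flat file of 128 64-bit entries, and a width parameter selects how the bits are carved up. This is an illustrative model, not the normative SVP64 layout, and \texttt{regfile}/\texttt{get\_element} are invented names for the sketch.

```c
#include <stdint.h>

static uint64_t regfile[128];   /* GPRs extended from 32 to 128 entries */

/* Read element 'idx' of width 'ew' bytes (1, 2, 4, or 8) from the
 * storage starting at register 'reg'.  Elements are packed LSB-first
 * within each 64-bit entry, so the same flat storage can be walked
 * at any element width. */
uint64_t get_element(int reg, int ew, int idx)
{
    int per_reg = 8 / ew;                 /* elements per 64-bit entry */
    uint64_t word = regfile[reg + idx / per_reg];
    int shift = (idx % per_reg) * ew * 8;
    uint64_t mask = (ew == 8) ? ~0ULL : ((1ULL << (ew * 8)) - 1);
    return (word >> shift) & mask;
}
```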
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{svp64_regs}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
\end{figure}
\subsection{Comparison Between SIMD and Vector}

\textit{(Need to add more here; use example from \cite{SIMD_HARM}?)}

\subsubsection{Code Example}
\subsection{Shortfalls of SIMD}

Five-digit opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:

\begin{itemize}
\item Hardware design, ASIC routing, etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}
\subsection{Simple Vectorisation}

\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style supercomputers
(Cray-1, NEC SX-Aurora) and GPUs, while keeping to a strict uniform RISC paradigm,
leveraging a scalar ISA by using "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.
Main design principles:

\begin{itemize}
\item Introduce by implementing on top of the existing Power ISA.
\item Effectively a \textbf{hardware for-loop} that pauses the main PC and
issues multiple scalar operations.
\item Preserves underlying scalar execution dependencies as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order").
\item Augments existing instructions by adding "tags", providing
Vectorisation "context" rather than adding new opcodes.
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space (see \ref{subsubsec:add_to_pow_isa}).
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
fully exploited.
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers.
\item Hardware architects may understand and implement SV as
an extra pipeline stage, inserted between decode and
issue: essentially a simple for-loop issuing element-level
scalar operations.
\item More complex HDL can be done by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
infrastructure.
\item Mostly high-level "context" which does not significantly
deviate from the scalar Power ISA and, in its purest form,
is a "for-loop around scalar instructions". Thus SV is
minimally disruptive and consequently has a reasonable chance
of broad community adoption and acceptance.
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as dedicated Vectorisation
ISAs. No more separate vector instructions.
\end{itemize}
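The "hardware for-loop" principle above can be given a minimal software model: a vector-tagged scalar add expands into VL element-level scalar adds, with register numbers incrementing per element, exactly as if the loop had been written out as scalar instructions. This is illustrative only; \texttt{sv\_add} and the flat \texttt{gpr} array are simplifications for the sketch, not the SVP64 encoding.

```c
#include <stdint.h>

#define NREGS 128               /* extended GPR file */
static uint64_t gpr[NREGS];

/* Illustrative model of a vector-tagged "add RT,RA,RB" under a
 * Simple-V prefix: the PC pauses while the hardware for-loop
 * issues VL element-level scalar adds, incrementing each
 * vector-tagged register number per element. */
void sv_add(int rt, int ra, int rb, int vl)
{
    for (int i = 0; i < vl; i++)        /* the "hardware for-loop" */
        gpr[rt + i] = gpr[ra + i] + gpr[rb + i];
}
```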
\subsubsection{Prefix 64 - SVP64}

SVP64 is a specification designed to solve the problems caused by
SIMD implementations by:

\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Reducing code size and power consumption
\item Making life easier for compilers, coders, and documentation
\item Reducing time-to-support for the platform to a fraction of that of
conventional SIMD (less money on R\&D, faster to deliver)
\end{itemize}
- Intel SIMD has been incrementally added to for decades, requires backwards
interoperability, and thus has greater complexity (?)

- What are we going to gain?

- for-loop, incrementing registers RT, RA, RB
- few instructions, easier to implement and maintain
- example assembly code
- ARM has already started to add SVE2 support to libc