\section{Summary}
The proposed \ac{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
ONLY uses scalar instructions}.
\begin{itemize}
\item The Power \ac{ISA} v3.1 Specification is not altered in any way.
\item Does not require adding duplicates of instructions
(popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd)
\item Specifically designed to be easily implemented
      on top of an existing Micro-architecture (especially
      Superscalar Out-of-Order Multi-issue) without
      disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level only requires five instructions
      (SVE2 requires approx.\ 9,000; \ac{AVX-512} around 10,000; \ac{RVV} around
      300).
\item Predication, an often-requested feature, is added cleanly
      (without modifying the v3.1 Power ISA)
\item In-registers arbitrary-sized Matrix Multiply is achieved in three
      instructions (without adding any v3.1 Power ISA instructions)
\item Full \ac{DCT} and \ac{FFT} RADIX2 Triple-loops are achieved with
      dramatically reduced instruction count, and power consumption is
      expected to be greatly reduced. These are normally found only in
      high-end \ac{VLIW} \ac{DSP}s (TI MSP, Qualcomm Hexagon)
\item Fail-First Load/Store allows strncpy to be implemented in around 14
      instructions (hand-optimised \ac{VSX} assembler is 240).
\item Inner loop of MP3 implemented in under 100 instructions
      (gcc produces 450 for the same function on POWER9).
\end{itemize}
All areas investigated so far consistently showed reductions in executable
size.
\subsection{What is SIMD?}
\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, 32-bit pieces.
%\cite{SIMD_HARM}\cite{SIMD_HPC}
These partitions can then be operated on simultaneously, with the initial
values and results stored as entire 64-bit registers. The SIMD instruction
opcode includes the data width and the operation to perform.
\par
This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.),
%\cite{SIMD_WASM},
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\par
\subsection{Shortfalls of SIMD}
SIMD registers are of a fixed length and thus to achieve greater
performance, CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit etc.) for more partitions.\par
Additionally, each widening of SIMD registers also expands the instruction
set. The number of instructions quickly balloons and this can be seen in,
for example, IA-32 expanding from 80 to about 1400 instructions since
the 1970s\cite{SIMD_HARM}.\par

Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
  \item Hardware design, ASIC routing etc.
  \item Compiler design
  \item Documentation of the ISA
  \item Manual coding and optimisation
  \item Time to support the platform
  \item Compliance Suite development and testing
  \item Protracted Variable-Length encoding (x86) severely compromises
        Multi-issue decoding
\end{itemize}
\subsection{Vector Architectures}
An older alternative exists to utilise data parallelism: Vector
processing, where a single instruction operates on multiple vector
elements simultaneously, taking advantage of multiple pipelines.\par
Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle.
%\cite{SIMD_HARM}.
Such processors can also deal with (in hardware) fringe cases where the vector
length is not a multiple of the number of elements. The element data width
is variable (just like in SIMD) but it is the \textit{number} of elements being
variable under control of a "setvl" instruction that makes Vector ISAs
"Scalable".
\par
\ac{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
which can fit 1024 64-bit words.
%\cite{riscv-v-spec}.
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early Draft of RVV supported overlaying the Vector Registers onto the
Floating Point registers, similar to \ac{MMX}.
Simple-V's "Vector" Registers are specifically designed to fit on top of
the Scalar (GPR, FPR) register files, which are extended from the default
of 32 entries to 128.
\begin{figure}[hb]
% placeholder: diagram of the extended SVP64 register files (graphic absent from this source)
\caption{SVP64 register files}
\label{fig:svp64_regs}
\end{figure}
\subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
Supercomputers.
\begin{itemize}
\item Time to support platform is a fraction of conventional SIMD
      (Less money on R\&D, faster to deliver)
\end{itemize}