\section{Summary}
-The proposed \ac{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
+The proposed \acs{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
ONLY uses scalar instructions}.
\begin{itemize}
-\item The Power \ac{ISA} v3.1 Specification is not altered in any way.
+\item The Power \acs{ISA} v3.1 Specification is not altered in any way.
v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicates of instructions
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level only requires five instructions
- (SVE2 requires approx. 9,000. \ac{AVX-512} around 10,000. \ac{RVV} around
+ (SVE2 requires approx. 9,000. \acs{AVX-512} around 10,000. \acs{RVV} around
300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA)
\item In-registers arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions)
-\item Full \ac{DCT} and \ac{FFT} RADIX2 Triple-loops are achieved with
+\item Full \acs{DCT} and \acs{FFT} RADIX2 Triple-loops are achieved with
dramatically reduced instruction count and an expected large reduction
- in power consumption. Normally found only in high-end \ac{VLIW} \ac{DSP}
+ in power consumption. Normally found only in high-end \acs{VLIW} \acs{DSP}
(TI MSP, Qualcomm Hexagon)
\item Fail-First Load/Store allows strncpy to be implemented in around 14
- instructions (hand-optimised \ac{VSX} assembler is 240).
+ instructions (hand-optimised \acs{VSX} assembler is 240).
\item Inner loop of MP3 implemented in under 100 instructions
(gcc produces 450 for the same function on POWER9).
\end{itemize}
\subsection{What is SIMD?}
-\ac{SIMD} is a way of partitioning existing \ac{CPU}
+\acs{SIMD} is a way of partitioning existing \acs{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces.
-%\cite{SIMD_HARM}\cite{SIMD_HPC}
+\cite{SIMD_HARM}\cite{SIMD_HPC}
These partitions can then be operated on simultaneously, with the initial values
and results stored as entire 64-bit registers. The SIMD instruction opcode
includes the data width and the operation to perform.
This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.),
-%\cite{SIMD_WASM},
+\cite{SIMD_WASM},
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
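
The partitioning idea can be illustrated in plain C. The following is a minimal sketch, not taken from any real ISA: the helper name \texttt{simd\_add\_u8x8} is hypothetical, and the code merely emulates in software what a single 8-bit-lane SIMD add would do in hardware on one 64-bit register.

```c
#include <assert.h>  /* only needed if the checks in the usage note are used */
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration only): add eight
 * independent 8-bit lanes packed into one 64-bit word. Each lane wraps
 * modulo 256 and never carries into its neighbour. */
static uint64_t simd_add_u8x8(uint64_t a, uint64_t b)
{
    /* Mask selecting the top bit of every 8-bit lane. */
    const uint64_t H = 0x8080808080808080ULL;

    /* Add the low 7 bits of each lane. The largest per-lane sum is
     * 0x7F + 0x7F = 0xFE, so no carry can cross a lane boundary. */
    uint64_t low = (a & ~H) + (b & ~H);

    /* Bit 7 of each lane in 'low' is the carry into the top bit.
     * XOR it with the operands' top bits to finish the per-lane sum. */
    return low ^ ((a ^ b) & H);
}
```

For example, a lane holding \texttt{0xFF} added to a lane holding \texttt{0x01} wraps to \texttt{0x00} without disturbing the adjacent lanes. Supporting 16-bit or 32-bit lanes needs a different mask, which mirrors the way a SIMD opcode has to encode the data width.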
architectures. Vector CPUs collect operands from the main memory, and
store them in large, sequential vector registers.\par
-\begin{figure}[hb]
- \centering
- \includegraphics[width=0.6\linewidth]{cray_vector_regs}
- \caption{Cray Vector registers: 8 registers, 64 elements each}
- \label{fig:cray_vector_regs}
-\end{figure}
-
A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle.
-%\cite{SIMD_HARM}.
+\cite{SIMD_HARM}.
Such processors can also deal, in hardware, with fringe cases where the vector
length is not a multiple of the number of elements. The element data width
is variable (just like in SIMD) but it is the \textit{number} of elements being
``Scalable''
\par
-\ac{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
+\acs{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
which can fit 1024 64-bit words.
-%\cite{riscv-v-spec}.
+\cite{riscv-v-spec}.
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early Draft of RVV supported overlaying the Vector Registers onto the
-Floating Point registers, similar to \ac{MMX}.
+Floating Point registers, similar to \acs{MMX}.
+
+\begin{figure}[hb]
+ \centering
+ \includegraphics[width=0.6\linewidth]{cray_vector_regs}
+ \caption{Cray Vector registers: 8 registers, 64 elements each}
+ \label{fig:cray_vector_regs}
+\end{figure}
Simple-V's ``Vector'' Registers are specifically designed to fit on top of
the Scalar (GPR, FPR) register files, which are extended from the default
\end{figure}
\subsection{Simple Vectorisation}
-\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
+\acs{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style Supercomputers
(Cray-1, NEC SX-Aurora) and GPUs. It keeps to a strict uniform RISC paradigm,
leveraging a scalar ISA by using ``Prefixing''.