\subsection{What is SIMD?}
\textit{(for clarity only 64-bit registers will be discussed here,
however 128-, 256-, and 512-bit implementations also exist)}
\ac{SIMD} is a way of partitioning existing \ac{CPU} registers of
64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and results stored as
entire 64-bit registers. The SIMD instruction opcode encodes both the
data width and the operation to perform.\par
\begin{figure}[h]
\includegraphics[width=\linewidth]{simd_axb}
\label{fig:simd_axb}
\end{figure}
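The partitioned-register idea can be sketched in software using the
SWAR ("SIMD Within A Register") technique. The four-lane 16-bit layout
and the function name below are illustrative assumptions, not any
particular ISA's encoding:

```c
#include <stdint.h>

/* Illustrative SWAR sketch: a 64-bit word is treated as four
 * independent 16-bit lanes.  The masks keep lane-to-lane carries from
 * leaking: the low bits are added with each lane's top bit cleared,
 * then the top bits are restored with XOR. */
static uint64_t add16x4(uint64_t a, uint64_t b) {
    const uint64_t LOW  = 0x7FFF7FFF7FFF7FFFULL; /* all but each lane's MSB */
    const uint64_t HIGH = 0x8000800080008000ULL; /* each lane's MSB only   */
    uint64_t sum_low = (a & LOW) + (b & LOW);    /* carries stop at lane MSB */
    uint64_t msb     = (a ^ b) & HIGH;           /* carry-less MSB addition  */
    return sum_low ^ msb;
}
```

For example, `add16x4(0x0001000200030004ULL, 0x0010002000300040ULL)`
yields `0x0011002200330044`: four independent additions from a single
64-bit operation, which is exactly what a SIMD add opcode performs in
hardware.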
This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to
scalar-only instructions.\par
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit, etc.) for more partitions.\par
Additionally, binary compatibility is an important feature, and thus
each doubling of SIMD register width also expands the instruction
set. The number of instructions quickly balloons, as can be seen in
popular \acp{ISA}: for example, IA-32 has expanded from 80 to about
1400 instructions since 1978 \cite{SIMD_HARM}.\par
\subsection{Vector Architectures}
An older alternative exists to utilise data parallelism: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

Pipelined execution units then perform parallel computations on these
vector registers. The result vector is then broken up into individual
results, which are sent back to main memory.\par

A simple vector processor might operate on one element at a time;
however, as the operations are independent by definition \textbf{(where
is this from?)}, a processor could be made to compute all of the
vector's elements simultaneously.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can
also deal in hardware with fringe cases where the vector length is not
a multiple of the number of elements. The element data width is
variable (just like in SIMD). Fig.~\ref{fig:vl_reg_n} shows the
relationship between the number of elements, data width, and vector
register length.
\begin{figure}[h]
\includegraphics[width=\linewidth]{vl_reg_n}
\label{fig:vl_reg_n}
\end{figure}
The RISC-V Vector extension supports a VL of up to $2^{16} = 65536$
bits, which can fit 1024 64-bit words \cite{riscv-v-spec}.
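The fringe-case handling described above can be modelled with a
strip-mined loop. This is only a software sketch: \texttt{VL\_MAX},
the function name, and the clamping step are illustrative assumptions,
loosely modelled on the vector-length-setting behaviour of vector ISAs
such as RVV, not any real instruction sequence:

```c
#include <stddef.h>
#include <stdint.h>

enum { VL_MAX = 8 }; /* elements per "vector instruction" (assumption) */

/* Model of strip-mined vector addition: each outer-loop iteration
 * stands in for one vector instruction.  The active vector length is
 * clamped so the fringe (n % VL_MAX) needs no special-case code,
 * mirroring what a vector processor does in hardware. */
static void vadd(const int64_t *a, const int64_t *b, int64_t *c, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < VL_MAX) ? (n - i) : VL_MAX; /* clamp */
        for (size_t e = 0; e < vl; e++)  /* lanes run in parallel in HW */
            c[i + e] = a[i + e] + b[i + e];
        i += vl;
    }
}
```

With $n = 10$ and \texttt{VL\_MAX} $= 8$, the loop issues one
full-width operation on eight elements followed by one fringe
operation on the remaining two, with no scalar clean-up loop.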
\subsection{Comparison Between SIMD and Vector}
\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}
\subsection{Shortfalls of SIMD}
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
    \item Hardware design, ASIC routing, etc.
\item Compiler design
\end{itemize}
\subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU,
GPU, VPU, 3D?). It includes features normally found only on Cray-style
supercomputers (Cray-1, NEC SX-Aurora) and GPUs, yet it keeps to a
strictly simple RISC design by leveraging a scalar ISA through
``Prefixing'': no dedicated vector opcodes exist in SV at all.
The main design principles are:
\begin{itemize}
    \item Introduced by implementing on top of the existing Power ISA
    \item Effectively a \textbf{hardware for-loop}: it pauses the main
    PC and issues multiple scalar ops
    \item Preserves underlying scalar execution dependencies, as if
    the for-loop had been expanded into actual scalar instructions
    ("preserving Program Order")
    \item Augments existing instructions by adding "tags": it provides
    Vectorisation "context" rather than adding new opcodes
    \item Does not modify or deviate from the underlying scalar
    Power ISA unless there is a significant performance boost or other
    advantage in the vector space (see \ref{subsubsec:add_to_pow_isa})
    \item Aimed at Supercomputing: avoids creating significant
    \textit{sequential dependency hazards}, allowing \textbf{high
    performance superscalar microarchitectures} to be deployed
\end{itemize}
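The hardware for-loop described above can be sketched in a few lines.
The register-file size, the register numbering, and the unit element
stride below are simplifying assumptions for illustration, not the
actual SV prefix semantics:

```c
#include <stdint.h>

enum { NREGS = 128 }; /* register-file size: an assumption, not SV spec */

/* Sketch of SV's "hardware for-loop": a vector-prefixed scalar add is
 * expanded at issue time into vl element-level scalar operations on
 * consecutive registers.  Each iteration is an ordinary scalar add,
 * so scalar execution dependencies (Program Order) are preserved. */
static void sv_add(uint64_t regs[NREGS], int rd, int ra, int rb, int vl) {
    for (int i = 0; i < vl; i++)                     /* the hardware for-loop */
        regs[rd + i] = regs[ra + i] + regs[rb + i];  /* plain scalar add      */
}
```

From the programmer's point of view there is still only one (scalar)
add instruction; the loop is supplied entirely by the prefix
"context", which is the essence of why SV needs no new vector opcodes.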
Advantages include:
\begin{itemize}
    \item Easy to create a first (and sometimes only) implementation
    as a literal for-loop in hardware, simulators, and compilers.
    \item Hardware architects may understand and implement SV as
    an extra pipeline stage, inserted between decode and issue:
    essentially a simple for-loop issuing element-level
    sub-instructions.
    \item More complex HDL can be built by repeating existing scalar
    ALUs and pipelines as blocks, leveraging existing Multi-Issue
    Infrastructure.
    \item Mostly high-level "context" which does not significantly
    deviate from the scalar Power ISA and, in its purest form, is a
    "for-loop around scalar instructions". Thus SV is
    minimally-disruptive and consequently has a reasonable chance
    of broad community adoption and acceptance.
    \item Obliterates SIMD opcode proliferation
    ($O(N^6)$ \textbf{[source?]}) as well as dedicated Vectorisation
    ISAs. No more separate vector instructions.
\end{itemize}
\subsubsection{Deviations from Power ISA}
\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Easier for compilers, coders, documentation
    \item Time to support the platform is a fraction of that of
    conventional SIMD (less money on R\&D, faster to deliver)
\end{itemize}
\textit{(1970 x86 comparison)}