\section{Summary}
Specification for a hardware for-loop that ONLY uses scalar instructions.
3
\subsection{What is SIMD?}
\textit{(for clarity only 64-bit registers will be discussed here,
however 128-, 256-, and 512-bit implementations also exist)}

\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and results stored as
entire 64-bit registers. The SIMD instruction opcode includes the data
width and the operation to perform.\par
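The partitioning idea can be sketched in plain C. This is an illustrative
sketch only: the function name and lane-by-lane loop are assumptions made
for clarity, where real hardware operates on all lanes at once.

```c
#include <stdint.h>

/* Illustrative sketch: treat one 64-bit register as eight 8-bit lanes
   and add lane by lane, as a SIMD "add bytes" opcode would. */
uint64_t simd_add8(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (a >> (lane * 8)) & 0xFF;
        uint8_t y = (b >> (lane * 8)) & 0xFF;
        uint8_t sum = (uint8_t)(x + y);   /* wraps within the lane, never
                                             carrying into the next lane */
        result |= (uint64_t)sum << (lane * 8);
    }
    return result;
}
```

Note that a carry out of one 8-bit lane does not propagate into the next:
each partition behaves as an independent 8-bit register.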
14
\begin{figure}[h]
\includegraphics[width=\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}
20
This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to
scalar-only instructions.\par
25
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128, 256, 512 bits etc.) to allow more partitions.\par
Additionally, binary compatibility is an important feature, and thus each
doubling of SIMD register width also expands the instruction set. The
number of instructions quickly balloons, as can be seen in popular
\acp{ISA}: for example IA-32 has expanded from around 80 instructions to
about 1400 since 1978 \cite{SIMD_HARM}.\par
34
\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par
39
A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par
44
Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can
also deal in hardware with fringe cases where the vector length is not a
multiple of the number of elements. The element data width is variable
(just as in SIMD). Figure \ref{fig:vl_reg_n} shows the relationship
between the number of elements, data width, and register vector length.
51
\begin{figure}[h]
\includegraphics[width=\linewidth]{vl_reg_n}
\caption{Vector length, data width, number of elements}
\label{fig:vl_reg_n}
\end{figure}
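The fringe handling described above can be sketched in C, assuming a
hypothetical vector unit that processes up to \texttt{MAX\_ELEMENTS}
elements per operation (the macro and function names are assumptions for
illustration, not any real ISA):

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEMENTS 4  /* elements the hypothetical unit handles at once */

/* Sketch of hardware fringe handling: each iteration processes
   min(remaining, MAX_ELEMENTS) elements, so no separate scalar
   clean-up loop is needed when n is not a multiple of MAX_ELEMENTS. */
void vec_add(int64_t *dst, const int64_t *a, const int64_t *b, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < MAX_ELEMENTS) ? (n - i) : MAX_ELEMENTS;
        for (size_t e = 0; e < vl; e++)   /* lanes run in parallel in hardware */
            dst[i + e] = a[i + e] + b[i + e];
        i += vl;
    }
}
```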
57
The RISC-V Vector extension supports a vector length (VL) of up to
$2^{16} = 65536$ bits, which can hold $65536 / 64 = 1024$ 64-bit words
\cite{riscv-v-spec}.
60
\subsection{Comparison Between SIMD and Vector}
\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}
63
\subsubsection{Code Example}
\begin{verbatim}
test test
\end{verbatim}
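Pending the real example, one possible shape for the comparison is
sketched below in C (not taken from \cite{SIMD_HARM}; function names and
the width of 4 elements are assumptions). The fixed-width SIMD style
needs a separate scalar fringe loop, while the vector style folds the
fringe into its final iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* SIMD style: fixed 4-element chunks, plus a scalar fringe loop. */
void simd_style_add(int64_t *dst, const int64_t *a,
                    const int64_t *b, size_t n) {
    size_t i;
    for (i = 0; i + 4 <= n; i += 4)       /* one "SIMD add" per chunk */
        for (size_t e = 0; e < 4; e++)
            dst[i + e] = a[i + e] + b[i + e];
    for (; i < n; i++)                    /* fringe: n not a multiple of 4 */
        dst[i] = a[i] + b[i];
}

/* Vector style: the hardware picks the element count each iteration. */
void vector_style_add(int64_t *dst, const int64_t *a,
                      const int64_t *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < 4) ? (n - i) : 4;   /* set vector length */
        for (size_t e = 0; e < vl; e++)
            dst[i + e] = a[i + e] + b[i + e];
        i += vl;
    }
}
```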
68
\subsection{Shortfalls of SIMD}
Opcode proliferation into five digits (10,000 instructions) is
overwhelming. The following are just some of the reasons why SIMD is
unsustainable as the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\end{itemize}
80
\subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
supercomputers (Cray-1, NEC SX-Aurora) and on GPUs, yet keeps to a strict
uniform RISC paradigm, leveraging the scalar ISA by means of ``prefixing''.
\textbf{No dedicated vector opcodes exist in SV, at all}.
87
\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduce by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop} that pauses the main PC
and issues multiple scalar operations
\item Preserves underlying scalar execution dependencies, as if
the for-loop had been expanded into actual scalar instructions
(``preserving Program Order'')
\item Augments existing instructions by adding ``tags'', providing
Vectorisation ``context'' rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space (see \ref{subsubsec:add_to_pow_isa})
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance superscalar microarchitectures} to be deployed
\end{itemize}
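The hardware for-loop principle can be sketched as follows. The register
file size, function name, and direct-indexed register array are
simplifying assumptions for illustration, not the actual Power ISA
decode path:

```c
#include <stdint.h>

#define NREGS 128          /* assumed register file size */
int64_t regs[NREGS];

/* Sketch of SV's hardware for-loop: one prefixed scalar instruction
   (here, "add RT, RA, RB") is issued VL times, with the register
   numbers incremented on each step. Every element operation is an
   ordinary scalar add, so dependencies and Program Order are exactly
   as if the loop had been unrolled into scalar instructions. */
void sv_add(int rt, int ra, int rb, int vl) {
    for (int i = 0; i < vl; i++)
        regs[rt + i] = regs[ra + i] + regs[rb + i];
}
```

With \texttt{vl = 1} this degenerates to the plain scalar instruction,
which is why no dedicated vector opcodes are needed.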
106
Advantages include:
\begin{itemize}
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers
\item Hardware architects may understand and implement SV as
an extra pipeline stage, inserted between decode and
issue: essentially a simple for-loop issuing element-level
sub-instructions
\item More complex HDL can be built by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
infrastructure
\item Mostly high-level ``context'' which does not significantly
deviate from the scalar Power ISA and, in its purest form,
is a ``for-loop around scalar instructions''. SV is thus
minimally disruptive and consequently has a reasonable chance
of broad community adoption and acceptance
\item Obliterates SIMD opcode proliferation
($O(N^6)$\textbf{[source?]}) as well as dedicated Vectorisation
ISAs. No more separate vector instructions
\end{itemize}
127
\subsubsection{Prefix 64 - SVP64}

SVP64 is a specification designed to solve the problems caused by
SIMD implementations by:
\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Reducing code size and power consumption
\item Making work easier for compilers, coders, and documentation
\item Cutting the time to support the platform to a fraction of that of
conventional SIMD (less money on R\&D, faster to deliver)
\end{itemize}
140
\begin{itemize}
\item Intel SIMD has been incrementally added to for decades, requires
backwards interoperability, and thus has a greater complexity (?)
\item What are we going to gain?
\begin{itemize}
\item for-loop, increment registers RT, RA, RB
\item few instructions, easier to implement and maintain
\item example assembly code
\item ARM has already started to add SVE2 support to libc
\end{itemize}
\item 1970 x86 comparison
\end{itemize}