\section{Summary}
Simple-V is a Scalable Vector Specification for a hardware for-loop that
ONLY uses scalar instructions. Advantages:

\begin{itemize}
\item The Power ISA v3.1 Specification is not altered in any way.
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item Even at the highest Compliancy Level, only four instructions are
required (SVE2 requires approximately 9,000; AVX-512 around 10,000;
RVV around 300).
\item Predication, an often-requested feature, is added cleanly to the
Power ISA (without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full DCT and FFT RADIX2 Triple-loops are achieved with a dramatically
reduced instruction count, and power consumption is expected to be greatly
reduced. Such capability is normally found only in high-end VLIW DSPs
(TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows strncpy to be implemented in around 14
instructions (optimised VSX assembler is 240).
\item The inner loop of an MP3 decoder is implemented in under 100
instructions (gcc produces 450 for the same function).
\end{itemize}

All areas investigated so far have consistently shown reductions in
executable size which, as outlined in \cite{SIMD_HARM}, indirectly
reduce power consumption due both to lower I-Cache/TLB pressure and
to the Issue stage remaining idle for longer.

\subsection{What is SIMD?}
\textit{(For clarity, only 64-bit registers are discussed here;
however 128-, 256-, and 512-bit implementations also exist.)}

\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and results stored as
entire 64-bit registers. The SIMD instruction opcode includes the data
width and the operation to perform.\par
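
To make the partitioning concrete, the following C sketch (illustrative
only, and not tied to any particular SIMD ISA) models a 64-bit register
being treated as four 16-bit lanes by an ``add'' whose opcode specifies
a 16-bit element width. Note that carries do not propagate across lane
boundaries.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Add two 64-bit registers as four packed 16-bit lanes.
 * Each lane wraps independently, exactly as a SIMD
 * "add, element width 16" opcode would behave.          */
static uint64_t simd_add_16x4(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (lane * 16));
        uint16_t y = (uint16_t)(b >> (lane * 16));
        uint16_t sum = (uint16_t)(x + y);   /* carry stays in-lane */
        result |= (uint64_t)sum << (lane * 16);
    }
    return result;
}

int main(void)
{
    uint64_t a = 0x0004000300020001ULL;     /* lanes 1, 2, 3, 4    */
    uint64_t b = 0x000000000000FFFFULL;     /* lane 0 will wrap    */
    /* prints 0004000300020000 */
    printf("%016llx\n", (unsigned long long)simd_add_16x4(a, b));
    return 0;
}
\end{verbatim}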

\begin{figure}[h]
\includegraphics[width=\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for the rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to
scalar-only instructions.\par

SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128, 256, 512 bits and so on) to provide more partitions.\par

Additionally, binary compatibility is an important feature, and thus each
doubling of SIMD register width also expands the instruction set. The
number of instructions quickly balloons, as can be seen in popular
\ac{ISA}s: IA-32, for example, has expanded from 80 to about 1,400
instructions since 1978 \cite{SIMD_HARM}.\par

\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can also
deal in hardware with fringe cases where the vector length is not a
multiple of the number of elements processed per cycle. The element data
width is variable (just as in SIMD). Fig.~\ref{fig:vl_reg_n} shows the
relationship between the number of elements, the data width, and the
register vector length.

\begin{figure}[h]
\includegraphics[width=\linewidth]{vl_reg_n}
\caption{Vector length, data width, number of elements}
\label{fig:vl_reg_n}
\end{figure}

The RISC-V Vector extension supports a VL of up to $2^{16}$ ($65536$) bits,
which can fit 1024 64-bit words \cite{riscv-v-spec}.
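
(As a quick sanity check on that figure: the number of elements is simply
the register vector length in bits divided by the element width. A trivial
C helper, with names of our own choosing, makes the arithmetic explicit.)

\begin{verbatim}
#include <stdio.h>

/* Number of elements that fit in a vector register of
 * vl_bits total length at a given element width.       */
static unsigned elements_per_vector(unsigned vl_bits, unsigned elwidth_bits)
{
    return vl_bits / elwidth_bits;
}

int main(void)
{
    /* the RVV maximum of 2^16 bits holds 1024 64-bit words */
    printf("%u\n", elements_per_vector(65536, 64));  /* prints 1024 */
    return 0;
}
\end{verbatim}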

\subsection{Comparison Between SIMD and Vector}
\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}

\subsubsection{Code Example}
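Until the cited example is pulled in, the following hedged C sketch
illustrates the structural difference. The fixed-width SIMD version
processes four elements per iteration and needs a separate scalar tail
loop for the fringe, whereas the vector version simply asks the hardware
for a vector length on each pass (modelled here by a \verb|min()|
calculation) and needs no tail at all.

\begin{verbatim}
#include <stddef.h>

/* SIMD style: fixed 4-wide body plus a scalar tail for the fringe. */
void add_arrays_simd(int *dst, const int *a, const int *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {        /* one 4-lane SIMD add per pass */
        dst[i + 0] = a[i + 0] + b[i + 0];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)                  /* fringe: scalar tail loop */
        dst[i] = a[i] + b[i];
}

/* Vector style: the hardware grants a vector length VL on each pass
 * (modelled as min(MAXVL, remaining)), so no tail loop is needed.  */
#define MAXVL 8
void add_arrays_vector(int *dst, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < MAXVL) ? (n - i) : MAXVL;  /* "setvl"      */
        for (size_t e = 0; e < vl; e++)                 /* one vector op */
            dst[i + e] = a[i + e] + b[i + e];
        i += vl;
    }
}
\end{verbatim}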

\subsection{Shortfalls of SIMD}
Opcode proliferation into five digits (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing, etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time taken to support the platform
\end{itemize}

\subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D). It includes features normally found only on Cray-style
Supercomputers (Cray-1, NEC SX-Aurora) and GPUs, yet keeps to a strict,
uniform RISC paradigm, leveraging a scalar ISA by using ``Prefixing''.
\textbf{No dedicated vector opcodes exist in SV, at all}.

\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduced by implementing it on top of the existing Power ISA.
\item Effectively a \textbf{hardware for-loop}: the main PC is paused
while multiple scalar operations are issued (see the sketch after
this list).
\item Preserves underlying scalar execution dependencies, as if the
for-loop had been expanded into actual scalar instructions
(``preserving Program Order'').
\item Augments existing instructions by adding ``tags'', providing
Vectorisation ``context'' rather than adding new opcodes.
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space (see \ref{subsubsec:add_to_pow_isa}).
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance superscalar microarchitectures} to be deployed.
\end{itemize}
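
The ``hardware for-loop'' principle can be sketched as follows. This is a
simplified, illustrative C model of our own (not the specification's
pseudocode), in which an instruction carrying Vectorisation context holds
the Program Counter while the same scalar operation is issued once per
element, in order, before execution moves on.

\begin{verbatim}
#include <stdint.h>

#define NGPRS 128
static uint64_t gpr[NGPRS];   /* (extended) scalar register file */

/* Minimal model of one decoded scalar "add RT, RA, RB". */
struct scalar_add { int RT, RA, RB; };

/* Vectorisation context supplied by the prefix "tag".
 * VL == 0 means the instruction is an ordinary scalar add. */
struct sv_context { unsigned VL; };

/* The hardware for-loop: the PC is effectively paused while the one
 * scalar operation is issued VL times, with the register numbers
 * incremented per element.  Program Order is preserved because the
 * element operations are issued strictly in sequence.               */
static void issue(struct scalar_add op, struct sv_context ctx)
{
    unsigned n = (ctx.VL == 0) ? 1 : ctx.VL;
    for (unsigned i = 0; i < n; i++)
        gpr[op.RT + i] = gpr[op.RA + i] + gpr[op.RB + i];
    /* only now does the PC advance to the next instruction */
}
\end{verbatim}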

Advantages include:
\begin{itemize}
\item Easy to create a first (and sometimes only) implementation,
as a literal for-loop in hardware, simulators, and compilers.
\item Hardware Architects may understand and implement SV as
an extra pipeline stage, inserted between decode and
issue: essentially a simple for-loop issuing element-level
sub-instructions.
\item More complex HDL can be created by replicating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
infrastructure.
\item Mostly high-level ``context'' which does not significantly
deviate from the scalar Power ISA, being in its purest form
a ``for-loop around scalar instructions''. Thus SV is
minimally disruptive and consequently has a reasonable chance
of broad community adoption and acceptance.
\item Obliterates SIMD opcode proliferation
($O(N^6)$\textbf{[source?]}) as well as dedicated Vectorisation
ISAs. No more separate vector instructions.
\end{itemize}

\subsubsection{Prefix 64 - SVP64}

SVP64 is a specification designed to solve the problems caused by
SIMD implementations by:
\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Reducing code size and power consumption
\item Making life easier for compilers, coders, and documentation
\item Reducing the time needed to support the platform to a fraction of
that of conventional SIMD (less money on R\&D, faster to deliver)
\end{itemize}

Intel SIMD, by contrast, has been incrementally added to for decades,
must retain backwards interoperability, and consequently carries
ever-greater complexity.

What is gained by SVP64?
\begin{itemize}
\item A hardware for-loop which increments the register numbers RT, RA, RB
\item Few instructions, which are easier to implement and maintain
\item Example assembly code (still to be written; a conceptual C sketch
is given below)
\item ARM has already started adding SVE2 support to libc
\end{itemize}

\textit{(TODO: historical comparison with x86, dating back to the 1970s.)}
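
In place of the planned assembly example, the following C sketch (an
illustrative model of our own, with made-up structure and field names,
not taken from the specification) shows what the 64-bit prefix
conceptually supplies on top of an unmodified scalar add: a vector
length, element-by-element incrementing of the register numbers RT, RA,
RB, and an optional predicate mask.

\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define NGPRS 128
static uint64_t gpr[NGPRS];     /* extended scalar register file */

/* Illustrative model of the context a 64-bit prefix adds to a
 * scalar instruction (field names are ours, not the spec's).   */
struct svp64_context {
    unsigned VL;        /* vector length: number of elements      */
    uint64_t predicate; /* one bit per element; bit set == active */
};

/* A prefixed scalar add: the underlying operation is still just
 * "RT = RA + RB"; the prefix turns it into a predicated loop
 * over incrementing register numbers.                           */
static void sv_add(struct svp64_context ctx, int RT, int RA, int RB)
{
    for (unsigned i = 0; i < ctx.VL; i++) {
        bool active = (ctx.predicate >> i) & 1;
        if (active)
            gpr[RT + i] = gpr[RA + i] + gpr[RB + i];
        /* inactive (masked-out) elements are simply skipped */
    }
}
\end{verbatim}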