   1 \section{Summary}
   2 Simple-V is a Scalable Vector Specification for a hardware for-loop that
   3 ONLY uses scalar instructions.
   4
   5 \begin{itemize}
   6 \item The Power ISA v3.1 Specification is not altered in any way.
   7   v3.1 Code-compatibility is guaranteed.
   8 \item Does not require sacrificing 32-bit Major Opcodes.
   9 \item Does not require adding duplicates of instructions
  10       (popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd)
  11 \item Specifically designed to be easily implemented
  12   on top of an existing Micro-architecture (especially
  13   Superscalar Out-of-Order Multi-issue) without
  14   disruptive full architectural redesigns.
  15 \item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level only requires five instructions
  (SVE2 requires approximately 9,000, AVX-512 around 10,000, and RVV
  around 300).
  19 \item Predication, an often-requested feature, is added cleanly
  20   (without modifying the v3.1 Power ISA)
  21 \item In-registers arbitrary-sized Matrix Multiply is achieved in three
  22   instructions (without adding any v3.1 Power ISA instructions)
\item Full DCT and FFT RADIX2 Triple-loops are achieved with a dramatically
  reduced instruction count, and power consumption is expected to reduce
  greatly. Such capability is normally found only in high-end VLIW DSPs
  (TI MSP, Qualcomm Hexagon).
  27 \item Fail-First Load/Store allows strncpy to be implemented in around 14
  28   instructions (hand-optimised VSX assembler is 240).
  29 \item Inner loop of MP3 implemented in under 100 instructions
  30   (gcc produces 450 for the same function)
  31 \end{itemize}
  32
All areas investigated so far have consistently shown reductions in
executable size. As outlined in \cite{SIMD_HARM}, this indirectly reduces
power consumption, because there is less I-Cache/TLB pressure and Issue
remains idle for long periods.
  37
  38 Simple-V has been specifically and carefully crafted to respect
  39 the Power ISA's Supercomputing pedigree.
  40
  41 \pagebreak
  42
  43 \subsection{What is SIMD?}
  44
\ac{SIMD} is a way of partitioning existing \ac{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated
on simultaneously, with the initial values and the results each stored as
entire 64-bit registers. The SIMD instruction opcode encodes both the data
width and the operation to perform.
  51 \par
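
The partitioning described above can be sketched in software. The following
is a minimal illustrative model only: the helper names \texttt{pack},
\texttt{unpack} and \texttt{simd\_mul} are hypothetical, lane 0 is assumed to
be the least-significant lane, and each product is assumed to be truncated to
the element width (real SIMD ISAs offer several multiply variants).

\begin{verbatim}
```python
def pack(elements, elwidth):
    """Pack a list of integers into one register, lane 0 least significant."""
    mask = (1 << elwidth) - 1
    reg = 0
    for lane, value in enumerate(elements):
        reg |= (value & mask) << (lane * elwidth)
    return reg


def unpack(reg, elwidth, regwidth=64):
    """Split a register back into its elwidth-bit lanes."""
    mask = (1 << elwidth) - 1
    return [(reg >> (lane * elwidth)) & mask
            for lane in range(regwidth // elwidth)]


def simd_mul(a, b, elwidth, regwidth=64):
    """Multiply two packed registers lane by lane.

    Assumption: each product is truncated to elwidth bits, so that
    no lane can overflow into its neighbour.
    """
    assert regwidth % elwidth == 0
    mask = (1 << elwidth) - 1
    result = 0
    for lane in range(regwidth // elwidth):
        shift = lane * elwidth
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        result |= ((x * y) & mask) << shift
    return result
```
\end{verbatim}

Note that the element width is a parameter of the model here, whereas in a
real SIMD ISA it is baked into the opcode, which is one source of the
instruction-count growth discussed in this section.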
  52
  53 \begin{figure}[hb]
  54     \centering
  55         \includegraphics[width=0.6\linewidth]{simd_axb}
  56         \caption{SIMD multiplication}
  57         \label{fig:simd_axb}
  58 \end{figure}
  59
  60 This method can have a huge advantage for rapid processing of
  61 vector-type data (image/video, physics simulations, cryptography,
  62 etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to
  63 scalar-only instructions.
  64 \textit{As long as the data width fits the workload, everything is fine}.
  65 \par
  66
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128, 256, 512 bits etc.) to provide more partitions.\par
Additionally, binary compatibility is an important feature, and thus each
doubling of SIMD register width also expands the instruction set. The
number of instructions quickly balloons, as can be seen in popular
\ac{ISA}s: for example, IA-32 has expanded from 80 to about 1400
instructions since 1978\cite{SIMD_HARM}.\par
  75
  76 \subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par
  80
  81 \begin{figure}[hb]
  82     \centering
  83         \includegraphics[width=0.6\linewidth]{cray_vector_regs}
  84         \caption{Cray Vector registers: 8 registers, 64 elements each}
  85         \label{fig:cray_vector_regs}
  86 \end{figure}
  87
A simple vector processor might operate on one element at a time.
However, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par
  92
Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle\cite{SIMD_HARM}. Such processors can also
deal (in hardware) with fringe cases where the vector length is not a
multiple of the number of elements processed per clock cycle. The element
data width is variable (just like in SIMD), but it is the \textit{number}
of elements being variable, under control of a ``setvl'' instruction, that
makes Vector ISAs ``Scalable''.
\par
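
The role of ``setvl'' can be illustrated with a short software sketch of a
strip-mined loop. This is a hedged model, not any particular ISA's
definition: the hardware maximum \texttt{MAXVL} of 8 elements is an arbitrary
assumption, and \texttt{setvl} is modelled simply as the minimum of the
remaining element count and that maximum.

\begin{verbatim}
```python
MAXVL = 8  # hypothetical hardware maximum vector length, in elements


def setvl(remaining, maxvl=MAXVL):
    """Model of a setvl instruction: grant at most maxvl elements."""
    return min(remaining, maxvl)


def vector_add(a, b, maxvl=MAXVL):
    """Add two equal-length lists using a strip-mined vector loop.

    Returns the results plus the vector length granted on each pass,
    to show that the final partial pass needs no special-case code.
    """
    assert len(a) == len(b)
    out = []
    granted = []
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i, maxvl)
        # one "vector instruction": operates on exactly vl elements
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        granted.append(vl)
        i += vl
    return out, granted
```
\end{verbatim}

The same loop body serves any \texttt{maxvl}: an implementation with longer
vectors simply takes fewer passes, with no change to the program.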
 101
The RISC-V Vector extension (RVV) supports a vector register length (VLEN)
of up to $2^{16}$, i.e. $65536$, bits, which can fit 1024 64-bit words
\cite{riscv-v-spec}.  The Cray-1 had 8 Vector Registers with up to
64 elements each.  An early Draft of RVV supported overlaying the Vector
Registers onto the Floating Point registers, similar to x86 ``MMX''.
 107
Simple-V's ``Vector'' Registers are specifically designed to fit
on top of the Scalar (GPR, FPR) register files with
\textbf{(byte-addressable access required?)}, which are extended from
the default of 32 (see Power ISA 3.2.1 General Purpose Registers and
4.2.1 Floating-Point Registers \textbf{[WHICH SPEC VERSION?]}) to 128
entries in the Libre-SOC implementation
\textbf{[CAN WE REFER TO LIBRE-SOC?]}.  This is a primary reason why
Simple-V can be added on top of an existing Scalar ISA, and
\textit{in particular} why there is no need to add Vector Registers or
Vector instructions.
 113
 114 \begin{figure}[hb]
 115     \centering
 116         \includegraphics[width=0.6\linewidth]{svp64_regs}
 117         \caption{three instructions, same vector length, different element widths}
 118         \label{fig:svp64_regs}
 119 \end{figure}
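
One way to picture vectors overlaid on a scalar register file is as a flat,
byte-addressable array. The sketch below is illustrative only and rests on
several assumptions not settled by the text: elements are packed
contiguously starting at the named register, byte order within a register is
little-endian, and the file holds 128 registers of 64 bits. The names
\texttt{RegFile} and \texttt{regs\_spanned} are hypothetical.

\begin{verbatim}
```python
REG_BYTES = 8    # 64-bit scalar registers
NUM_REGS = 128   # extended register file size (assumption from the text)


class RegFile:
    """Scalar register file modelled as one flat byte array."""

    def __init__(self):
        self.mem = bytearray(NUM_REGS * REG_BYTES)

    def write_elem(self, reg, idx, elwidth, value):
        """Write element idx of a vector that starts at register reg."""
        nbytes = elwidth // 8
        off = reg * REG_BYTES + idx * nbytes
        masked = value & ((1 << elwidth) - 1)
        self.mem[off:off + nbytes] = masked.to_bytes(nbytes, "little")

    def read_elem(self, reg, idx, elwidth):
        """Read element idx of a vector that starts at register reg."""
        nbytes = elwidth // 8
        off = reg * REG_BYTES + idx * nbytes
        return int.from_bytes(self.mem[off:off + nbytes], "little")

    def read_reg(self, reg):
        """Read a whole 64-bit scalar register."""
        return self.read_elem(reg, 0, 64)


def regs_spanned(vl, elwidth):
    """Number of 64-bit registers covered by vl elements of elwidth bits."""
    return (vl * elwidth + REG_BYTES * 8 - 1) // (REG_BYTES * 8)
```
\end{verbatim}

Under these assumptions a vector of four elements covers one register at
8-bit width, two at 32-bit, and four at 64-bit: the vector length is the
same, only the span of scalar registers differs.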
 120
 121 \subsection{Comparison Between SIMD and Vector}
 122 \textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}
 123
 124 \subsubsection{Code Example}
 125 \begin{verbatim}
 126 test test
 127 \end{verbatim}
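
As a hedged illustration of the SIMD side of the comparison, the sketch
below models a fixed-width SIMD loop in software. The lane count of 4 and
the function name \texttt{simd\_add\_with\_cleanup} are assumptions chosen
for illustration. Because the SIMD width is fixed, any leftover elements
must be handled by a separate scalar cleanup loop, which is the fringe case
that a Vector ISA handles in hardware.

\begin{verbatim}
```python
def simd_add_with_cleanup(a, b, lanes=4):
    """Add two equal-length lists with a fixed-width SIMD main loop.

    Returns (results, simd_ops, scalar_ops) so that the cost of the
    scalar tail can be seen for lengths not divisible by lanes.
    """
    assert len(a) == len(b)
    n = len(a)
    out = []
    simd_ops = 0
    scalar_ops = 0
    i = 0
    # main loop: only whole groups of `lanes` elements can be processed
    while i + lanes <= n:
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        simd_ops += 1
        i += lanes
    # cleanup loop: leftover elements fall back to scalar instructions
    while i < n:
        out.append(a[i] + b[i])
        scalar_ops += 1
        i += 1
    return out, simd_ops, scalar_ops
```
\end{verbatim}

If the SIMD width later doubles, both the main loop and the cleanup code have
to be rewritten for the new width, whereas a loop written against a variable
vector length does not change.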
 128
 129 \subsection{Shortfalls of SIMD}
Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the areas in which SIMD becomes
unsustainable as the number of instructions increases:
\begin{itemize}
        \item Hardware design, ASIC routing etc.
        \item Compiler design
        \item Documentation of the ISA
        \item Manual coding and optimisation
        \item Time to support the platform
        \item Compliance Suite development and testing
        \item Protracted Variable-Length encoding (x86) severely compromises
              Multi-issue decoding
\end{itemize}
 143
 144 \subsection{Simple Vectorisation}
\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?).  It includes features normally found only on Cray-style
Supercomputers (Cray-1, NEC SX-Aurora) and GPUs.  It keeps to a strict
uniform RISC paradigm, leveraging a scalar ISA by using ``Prefixing''.
\textbf{No dedicated vector opcodes exist in SV, at all}.

\vspace{10pt}
Main design principles:
 153 \begin{itemize}
 154         \item Introduce by implementing on top of existing Power ISA
        \item Effectively a \textbf{hardware for-loop}: pauses the main PC
              and issues multiple scalar operations
        \item Preserves underlying scalar execution dependencies as if
              the for-loop had been expanded into actual scalar instructions
              (``preserving Program Order'')
        \item Augments existing instructions by adding ``tags'', providing
              Vectorisation ``context'' rather than adding new opcodes.
 162         \item Does not modify or deviate from the underlying scalar
 163           Power ISA unless there's a significant performance boost or other
 164           advantage in the vector space
 165         \item Aimed at Supercomputing: avoids creating significant
 166               \textit{sequential dependency hazards}, allowing \textbf{high
 167               performance multi-issue superscalar microarchitectures} to be
 168           leveraged.
 169 \end{itemize}
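
The hardware for-loop principle above can be sketched as a few lines of
software. This is a simplified model under stated assumptions: registers are
a plain list of 64-bit integers, every operand register number steps by one
per element, and the names \texttt{scalar\_add} and \texttt{sv\_loop} are
hypothetical. Predication, element-width overrides and other context are
deliberately left out.

\begin{verbatim}
```python
MASK64 = (1 << 64) - 1


def scalar_add(regs, rt, ra, rb):
    """An ordinary scalar operation: regs[rt] = regs[ra] + regs[rb]."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64


def sv_loop(scalar_op, regs, vl, rt, ra, rb):
    """Hardware for-loop model: the PC is 'paused' while vl scalar
    operations are issued strictly in element order, with each register
    number incremented per element."""
    for i in range(vl):
        scalar_op(regs, rt + i, ra + i, rb + i)
```
\end{verbatim}

Because the elements are issued in Program Order, overlapping source and
destination ranges behave exactly as the expanded scalar sequence would. For
example, with \texttt{rt} one greater than \texttt{ra}, each element's result
feeds the next element's input, giving a running sum rather than an
independent element-wise add.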
 170
 171 Advantages include:
 172 \begin{itemize}
 173         \item Easy to create first (and sometimes only) implementation
 174               as a literal for-loop in hardware, simulators, and compilers.
 175         \item Hardware Architects may understand and implement SV as
 176           being an extra pipeline stage, inserted between decode and
 177           issue. Essentially a simple for-loop issuing element-level
 178           sub-instructions.
 179         \item More complex HDL can be done by repeating existing scalar
 180               ALUs and pipelines as blocks, leveraging existing Multi-Issue
 181           Infrastructure.
        \item SV is mostly high-level ``context'' which does not
              significantly deviate from scalar Power ISA and which, in its
              purest form, is a ``for-loop around scalar instructions''.
              Thus SV is minimally-disruptive and consequently has a
              reasonable chance of broad community adoption and acceptance.
        \item Obliterates SIMD opcode proliferation
              ($O(N^6)$) as well as dedicated Vectorisation
              ISAs. No more separate vector instructions.
 190 \end{itemize}
 191
 192 \subsubsection{Prefix 64 - SVP64}
 193
SVP64 is a specification designed to solve the problems caused by
SIMD implementations by:
\begin{itemize}
        \item Simplifying the hardware design
        \item Reducing maintenance overhead
        \item Reducing code size and power consumption
        \item Making life easier for compilers, coders, and documentation
        \item Reducing the time to support the platform to a fraction of
              that of conventional SIMD (less money spent on R\&D, faster
              to deliver)
\end{itemize}
 204
\begin{itemize}
        \item Intel SIMD has been incrementally added to for decades,
              requires backwards interoperability, and thus has a greater
              complexity (?)
        \item What are we going to gain?
        \item for-loop, increment registers RT, RA, RB
        \item few instructions, easier to implement and maintain
        \item example assembly code
        \item ARM has already started to add SVE2 support to libc
        \item 1970 x86 comparison
\end{itemize}