\section{Summary}
The proposed \acs{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that
ONLY uses scalar instructions}.

\begin{itemize}
\item The Power \acs{ISA} v3.1 Specification is not altered in any way:
v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicates of instructions
(popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd).
\item Specifically designed to be easily implemented
on top of an existing Micro-architecture (especially
Superscalar Out-of-Order Multi-issue) without
disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level requires only five instructions
(SVE2 requires approximately 9,000, \acs{AVX-512} around 10,000, and
\acs{RVV} around 300).
\item Predication, an often-requested feature, is added cleanly
(without modifying the v3.1 Power ISA).
\item In-register arbitrary-sized Matrix Multiply is achieved in three
instructions (without adding any v3.1 Power ISA instructions).
\item Full \acs{DCT} and \acs{FFT} RADIX2 Triple-loops are achieved with
a dramatically reduced instruction count, and power consumption is
expected to be greatly reduced. Such capability is normally found only
in high-end \acs{VLIW} \acs{DSP}s (TI MSP, Qualcomm Hexagon).
\item Fail-First Load/Store allows strncpy to be implemented in around 14
instructions (hand-optimised \acs{VSX} assembler is 240).
\item The inner loop of MP3 is implemented in under 100 instructions
(gcc produces 450 for the same function on POWER9).
\end{itemize}

All areas investigated so far consistently showed reductions in executable
size, which, as outlined in \cite{SIMD_HARM}, indirectly reduces
power consumption due to lower I-Cache/TLB pressure and to the Issue stage
remaining idle for longer periods.

Simple-V has been specifically and carefully crafted to respect
the Power ISA's Supercomputing pedigree.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{power_pipelines}
\caption{Showing how SV fits in between Decode and Issue}
\label{fig:power_pipelines}
\end{figure}

\pagebreak

\subsection{What is SIMD?}

\acs{SIMD} is a way of partitioning existing \acs{CPU}
registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces.
\cite{SIMD_HARM}\cite{SIMD_HPC}
These partitions can then be operated on simultaneously, with the initial
values and results stored as entire 64-bit registers. The SIMD instruction
opcode includes the data width and the operation to perform.
\par
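The partitioning principle can be made concrete with a minimal Python sketch (illustrative only, not any real SIMD instruction set): eight independent 8-bit lanes packed into one 64-bit register value, added simultaneously in concept, with no carry crossing lane boundaries.

```python
# Illustrative sketch: simulate an 8-lane, 8-bit SIMD add packed
# into a single 64-bit register value. Lane arithmetic wraps modulo
# 256, and no carry propagates between lanes.

LANES = 8
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1

def simd_add8(a: int, b: int) -> int:
    """Add two 64-bit register values as eight independent 8-bit lanes."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        la = (a >> shift) & LANE_MASK
        lb = (b >> shift) & LANE_MASK
        result |= ((la + lb) & LANE_MASK) << shift
    return result

# Two packed "registers": lanes [1..8] and [10, 10, ..., 10]
a = sum((i + 1) << (i * 8) for i in range(8))
b = sum(10 << (i * 8) for i in range(8))
packed = simd_add8(a, b)
lanes = [(packed >> (i * 8)) & 0xFF for i in range(8)]
print(lanes)  # [11, 12, 13, 14, 15, 16, 17, 18]
```

A real SIMD unit performs all eight lane additions in the same clock cycle; the Python loop merely models the lane isolation.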

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{simd_axb}
\caption{SIMD multiplication}
\label{fig:simd_axb}
\end{figure}

This method can have a huge advantage for the rapid processing of
vector-type data (image/video, physics simulations, cryptography,
etc.) \cite{SIMD_WASM},
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\par

\subsection{Shortfalls of SIMD}
SIMD registers are of a fixed length, and thus to achieve greater
performance CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit etc.) to provide more partitions.\par Additionally,
binary compatibility is an important feature, and thus each doubling
of SIMD registers also expands the instruction set. The number of
instructions quickly balloons: this can be seen, for example, in
IA-32, which has expanded from 80 to about 1400 instructions since
the 1970s \cite{SIMD_HARM}.\par

Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing etc.
\item Compiler design
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
Multi-issue decoding
\end{itemize}

\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector
architectures. Vector CPUs collect operands from main memory and
store them in large, sequential vector registers.\par

A simple vector processor might operate on one element at a time;
however, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.\par

Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle \cite{SIMD_HARM}.
Such processors can also deal in hardware with fringe cases where the vector
length is not a multiple of the number of elements. The element data width
is variable (just as in SIMD) but it is the \textit{number} of elements being
variable, under control of a "setvl" instruction, that makes Vector ISAs
"Scalable".
\par

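The "setvl" idea can be sketched in a few lines of Python (an illustrative model, not the exact semantics of any particular ISA): software asks for as many elements as remain, the hardware grants up to its maximum, and the final "fringe" pass is handled by the same loop with a shorter vector length, with no scalar tail code.

```python
# Illustrative sketch of "setvl"-style strip-mining: MAXVL is an
# assumed hardware maximum, not a real parameter of any shipping ISA.

MAXVL = 4  # assumed: hardware can process up to 4 elements per operation

def setvl(remaining: int) -> int:
    """Model of a setvl instruction: the hardware grants up to MAXVL
    elements, fewer on the final pass."""
    return min(remaining, MAXVL)

def vector_add(a, b):
    """Add two arrays of arbitrary length n using strip-mined
    'vector' operations; the fringe needs no special-case code."""
    n = len(a)
    out = []
    i = 0
    while i < n:
        vl = setvl(n - i)  # ask how many elements this pass covers
        # one conceptual "vector add" over vl elements
        out.extend(a[i + j] + b[i + j] for j in range(vl))
        i += vl
    return out

res = vector_add([1, 2, 3, 4, 5, 6, 7], [10] * 7)
print(res)  # [11, 12, 13, 14, 15, 16, 17]
```

Here a 7-element workload runs as one full pass of 4 elements followed by a fringe pass of 3, which is exactly the property that makes the vector length scalable rather than baked into the opcode.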
\acs{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
which can fit 1024 64-bit words \cite{riscv-v-spec}.
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early Draft of RVV supported overlaying the Vector Registers onto the
Floating Point registers, similar to \acs{MMX}.

\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{cray_vector_regs}
\caption{Cray Vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
\end{figure}

Simple-V's "Vector" Registers are specifically designed to fit on top of
the Scalar (GPR, FPR) register files, which are extended from the default
of 32 to 128 entries in the Libre-SOC implementation. This is a primary
reason why Simple-V can be added on top of an existing Scalar ISA, and
\textit{in particular} why there is no need to add Vector Registers or
Vector instructions.

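The overlay can be modelled in a short Python sketch (illustrative only; register numbers and the function name are invented for the example): a "vector" is simply a run of consecutive entries in the extended scalar register file, and a Vectorised add is VL ordinary scalar adds over those entries.

```python
# Illustrative model: Simple-V style "vectors" as consecutive slices
# of an extended scalar register file (128 GPRs, as in the Libre-SOC
# implementation described above). No separate vector register file.

GPRS = [0] * 128  # extended scalar register file

def sv_add(rt: int, ra: int, rb: int, vl: int) -> None:
    """Vectorised scalar add: a hardware for-loop issuing vl scalar
    adds over consecutive GPRs starting at rt, ra, rb."""
    for i in range(vl):
        GPRS[rt + i] = GPRS[ra + i] + GPRS[rb + i]

# Place two 4-element "vectors" at GPR 8 and GPR 16 (arbitrary choices)
GPRS[8:12] = [1, 2, 3, 4]
GPRS[16:20] = [100, 200, 300, 400]
sv_add(24, 8, 16, vl=4)
print(GPRS[24:28])  # [101, 202, 303, 404]
```

Because every element lands in an ordinary GPR, any scalar instruction can subsequently operate on any element directly, with no extract/insert moves between register files.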
\begin{figure}[hb]
\centering
\includegraphics[width=0.6\linewidth]{svp64_regs}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
\end{figure}

\subsection{Simple Vectorisation}
\acs{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style
Supercomputers (the Cray-1, NEC SX-Aurora) and on GPUs, yet keeps to a
strict uniform RISC paradigm, leveraging a scalar ISA by using "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.
SVP64 uses 25\% of the Power ISA v3.1 64-bit Prefix space (EXT001) to create
the SV Vectorisation Context for the 32-bit Scalar Suffix.

\vspace{10pt}
Main design principles:
\begin{itemize}
\item Introduced by implementing on top of the existing Power ISA
\item Effectively a \textbf{hardware for-loop}: pauses the main PC and
issues multiple scalar operations
\item Preserves underlying scalar execution dependencies as if
the for-loop had been expanded into actual scalar instructions
("preserving Program Order")
\item Augments existing instructions by adding "tags", providing
Vectorisation "context" rather than adding new opcodes
\item Does not modify or deviate from the underlying scalar
Power ISA unless there is a significant performance boost or other
advantage in the vector space
\item Aimed at Supercomputing: avoids creating significant
\textit{sequential dependency hazards}, allowing \textbf{high
performance multi-issue superscalar microarchitectures} to be
leveraged
\end{itemize}

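The "preserving Program Order" principle above has an observable consequence worth making concrete. In this Python sketch (illustrative only; the instruction name and register numbers are invented), the prefix replays the scalar suffix VL times in strict element order, so when the destination overlaps a source the result is exactly what an unrolled sequence of scalar adds would produce, here a running sum.

```python
# Illustrative model of the "hardware for-loop": an SVP64 prefix does
# not add a new opcode, it replays the 32-bit scalar suffix VL times,
# element by element, in Program Order.

def execute_scalar(regs, op, rt, ra, rb):
    """One ordinary scalar instruction (only 'add' modelled here)."""
    if op == "add":
        regs[rt] = regs[ra] + regs[rb]

def execute_svp64(regs, vl, op, rt, ra, rb):
    """Pause the main PC and issue vl element-level scalar
    sub-instructions, exactly as if the loop were unrolled."""
    for i in range(vl):
        execute_scalar(regs, op, rt + i, ra + i, rb + i)

regs = {i: 0 for i in range(32)}
regs[0] = 1
regs[4] = regs[5] = regs[6] = 1

# Destination r1..r3 overlaps source r0..r2: strict element order
# makes each add see the previous element's result (a running sum).
execute_svp64(regs, vl=3, op="add", rt=1, ra=0, rb=4)
print([regs[1], regs[2], regs[3]])  # [2, 3, 4]
```

A SIMD unit computing all lanes at once from the old register values would give a different (and wrong, by scalar semantics) answer; preserving Program Order is what lets existing scalar dependency tracking be reused unchanged.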
Advantages include:
\begin{itemize}
\item Easy to create a first (and sometimes only) implementation
as a literal for-loop in hardware, simulators, and compilers.
\item Hardware Architects may understand and implement SV as
an extra pipeline stage inserted between decode and
issue: essentially a simple for-loop issuing element-level
sub-instructions.
\item More complex HDL can be built by repeating existing scalar
ALUs and pipelines as blocks, leveraging existing Multi-Issue
Infrastructure.
\item Mostly high-level "context" which does not significantly
deviate from the scalar Power ISA and, in its purest form,
is a "for-loop around scalar instructions". Thus SV is
minimally-disruptive and consequently has a reasonable chance
of broad community adoption and acceptance.
\item Obliterates SIMD opcode proliferation
($O(N^6)$) as well as the need for dedicated Vectorisation
ISAs. No more separate vector instructions.
\item Reduces maintenance overhead (no separate Vector instructions):
adding a Scalar instruction automatically gains a Vectorised
version.
\item Easier for compilers, coders, and documentation.
\end{itemize}