From 7f7198c2448a4a85caa1843af12e00f4ef060ce3 Mon Sep 17 00:00:00 2001 From: Andrey Miroshnikov Date: Mon, 20 Jun 2022 18:14:30 +0100 Subject: [PATCH] Added acronyms, moved some text around --- svp64-primer/acronyms.tex | 16 +++-- svp64-primer/summary.tex | 117 +++++++++++++++------------------- svp64-primer/svp64-primer.tex | 9 ++- 3 files changed, 69 insertions(+), 73 deletions(-) diff --git a/svp64-primer/acronyms.tex b/svp64-primer/acronyms.tex index cca3c8908..66b8ba494 100644 --- a/svp64-primer/acronyms.tex +++ b/svp64-primer/acronyms.tex @@ -1,14 +1,20 @@ \section{List of Acronyms} \begin{acronym} + \acro{ASIC}{Application Specific Integrated Circuit} + \acro{AVX-512}{Intel Advanced Vector Extensions 512-bit} \acro{CPU}{Central Processing Unit} - \acro{ISA}{Instruction Set Architecture} - \acro{DAXPY}{double-precision aX plus Y} \acro{DCT}{Discrete Cosine Transform} + \acro{DSP}{Digital Signal Processors} + \acro{DAXPY}{Double-Precision aX Plus Y ($aX+Y$)} \acro{FFT}{Fast Fourier Transform} + \acro{IA-32}{Intel Architecture 32-bit or i386} + \acro{ISA}{Instruction Set Architecture} + \acro{MMX}{Intel's first SIMD implementation} + \acro{RVV}{RISC-V Vector extension} \acro{SIMD}{Single Instruction Multiple Data} - \acro{SV}{(Scalable) Simple Vectorisation} + \acro{SV}{(Scalable) Simple Vectorisation or Simple-V} + \acro{SVE2}{ARM Scalable Vector Extension version two} \acro{SVP64}{Simple-V with Prefixing of Power ISA, 64-bits in length} - \acro{AVX}{Intel Advanced Vector Extension} - \acro{SVE}{ARM Scalable Vector Extension} + \acro{VLIW}{Very Long Instruction Word} \acro{VSX}{128-bit Packed SIMD Extension to the Power ISA} \end{acronym} diff --git a/svp64-primer/summary.tex b/svp64-primer/summary.tex index 68503ffab..4d33ad6a6 100644 --- a/svp64-primer/summary.tex +++ b/svp64-primer/summary.tex @@ -1,6 +1,6 @@ \section{Summary} -Simple-V is a Scalable Vector Specification for a hardware for-loop that -ONLY uses scalar instructions. 
+The proposed \ac{SV} is a Scalable Vector Specification for a hardware for-loop \textbf{that +ONLY uses scalar instructions}. \begin{itemize} \item The Power \ac{ISA} v3.1 Specification is not altered in any way. @@ -9,25 +9,25 @@ ONLY uses scalar instructions. \item Does not require adding duplicates of instructions (popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd) \item Specifically designed to be easily implemented - on top of an existing Micro-architecture (especially - Superscalar Out-of-Order Multi-issue) without - disruptive full architectural redesigns. + on top of an existing Micro-architecture (especially + Superscalar Out-of-Order Multi-issue) without + disruptive full architectural redesigns. \item Divided into Compliancy Levels to suit differing needs. \item At the highest Compliancy Level only requires five instructions - (SVE2 requires appx 9,000. AVX-512 around 10,000. RVV around - 300). + (SVE2 requires approx. 9,000. \ac{AVX-512} around 10,000. \ac{RVV} around + 300). \item Predication, an often-requested feature, is added cleanly - (without modifying the v3.1 Power ISA) + (without modifying the v3.1 Power ISA) \item In-registers arbitrary-sized Matrix Multiply is achieved in three - instructions (without adding any v3.1 Power ISA instructions) -\item Full DCT and FFT RADIX2 Triple-loops are achieved with dramatically - reduced instruction count, and power consumption expected to greatly - reduce. Normally found only in high-end VLIW DSPs (TI MSP, Qualcomm - Hexagon) + instructions (without adding any v3.1 Power ISA instructions) +\item Full \ac{DCT} and \ac{FFT} RADIX2 Triple-loops are achieved with + dramatically reduced instruction count, and power consumption is expected + to reduce greatly. Normally found only in high-end \ac{VLIW} \ac{DSP} + (TI MSP, Qualcomm Hexagon) \item Fail-First Load/Store allows strncpy to be implemented in around 14 - instructions (hand-optimised VSX assembler is 240). 
+ instructions (hand-optimised \ac{VSX} assembler is 240). \item Inner loop of MP3 implemented in under 100 instructions - (gcc produces 450 for the same function) + (gcc produces 450 for the same function on POWER9). \end{itemize} All areas investigated so far consistently showed reductions in executable @@ -50,11 +50,11 @@ the Power ISA's Supercomputing pedigree. \subsection{What is SIMD?} \ac{SIMD} is a way of partitioning existing \ac{CPU} -registers of 64-bit length into smaller 8-, 16-, 32-bit pieces -\cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated -on simultaneously, and the initial values and results being stored as -entire 64-bit registers. The SIMD instruction opcode includes the data -width and the operation to perform. +registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces. +%\cite{SIMD_HARM}\cite{SIMD_HPC} +These partitions can then be operated on simultaneously, with the initial values +and results stored as entire 64-bit registers. The SIMD instruction opcode + includes the data width and the operation to perform. \par \begin{figure}[hb] @@ -66,11 +66,14 @@ width and the operation to perform. This method can have a huge advantage for rapid processing of vector-type data (image/video, physics simulations, cryptography, -etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to +etc.), +%\cite{SIMD_WASM}, + and thus on paper is very attractive compared to scalar-only instructions. \textit{As long as the data width fits the workload, everything is fine}. \par +\subsection{Shortfalls of SIMD} SIMD registers are of a fixed length and thus to achieve greater performance, CPU architects typically increase the width of registers (to 128-, 256-, 512-bit etc) for more partitions.\par Additionally, @@ -78,7 +81,21 @@ binary compatibility is an important feature, and thus each doubling of SIMD registers also expands the instruction set. 
The number of instructions quickly balloons and this can be seen in for example IA-32 expanding from 80 to about 1400 instructions since -1978\cite{SIMD_HARM}.\par +the 1970s\cite{SIMD_HARM}.\par + +Five-digit Opcode proliferation (10,000 instructions) is overwhelming. +The following are just some of the reasons why SIMD is unsustainable as +the number of instructions increases: +\begin{itemize} + \item Hardware design, ASIC routing etc. + \item Compiler design + \item Documentation of the ISA + \item Manual coding and optimisation + \item Time to support the platform + \item Compliance Suite development and testing + \item Protracted Variable-Length encoding (x86) severely compromises + Multi-issue decoding +\end{itemize} \subsection{Vector Architectures} An older alternative exists to utilise data parallelism - vector @@ -98,19 +115,21 @@ a processor could be made to compute all of the vector's elements simultaneously, taking advantage of multiple pipelines.\par Typically, today's vector processors can execute two, four, or eight -64-bit elements per clock cycle\cite{SIMD_HARM}. Such processors can also -deal with (in hardware) fringe cases where the vector length is not a -multiple of the number of elements. The element data width is variable -(just like in SIMD) but it is the \textit{number} of elements being +64-bit elements per clock cycle. +%\cite{SIMD_HARM}. +Such processors can also deal with (in hardware) fringe cases where the vector +length is not a multiple of the number of elements. The element data width +is variable (just like in SIMD) but it is the \textit{number} of elements being variable under control of a "setvl" instruction that makes Vector ISAs "Scalable" \par -RISC-V Vector extension (RVV) supports a VL of up to $2^{16}$ or $65536$ bits, -which can fit 1024 64-bit words \cite{riscv-v-spec}. The Cray-1 had -8 Vector Registers with up to 64 elements. 
An early Draft of RVV supported -overlaying the Vector Registers onto the Floating Point registers, similar -to x86 "MMX". +\ac{RVV} supports a VL of up to $2^{16}$ or $65536$ bits, +which can fit 1024 64-bit words. +%\cite{riscv-v-spec}. +The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each). +An early Draft of RVV supported overlaying the Vector Registers onto the +Floating Point registers, similar to \ac{MMX}. Simple-V's "Vector" Registers are specifically designed to fit on top of the Scalar (GPR, FPR) register files, which are extended from the default @@ -126,29 +145,6 @@ Vector instructions. \label{fig:svp64_regs} \end{figure} -\subsection{Comparison Between SIMD and Vector} -\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)} - -\subsubsection{Code Example} -\begin{verbatim} -test test -\end{verbatim} - -\subsection{Shortfalls of SIMD} -Five digit Opcode proliferation (10,000 instructions) is overwhelming. -The following are just some of the reasons why SIMD is unsustainable as -the number of instructions increase: -\begin{itemize} - \item Hardware design, ASIC routing etc. - \item Compiler design - \item Documentation of the ISA - \item Manual coding and optimisation - \item Time to support the platform - \item Compilance Suite development and testing - \item Protracted Variable-Length encoding (x86) severely compromises - Multi-issue decoding -\end{itemize} - \subsection{Simple Vectorisation} \ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU, VPU, 3D?). Includes features normally found only on Cray-style Supercomputers @@ -209,16 +205,3 @@ SIMD implementations by: \item Time to support platform is a fraction of conventional SIMD (Less money on R\&D, faster to deliver) \end{itemize} - -- Intel SIMD has been incrementally added to for decades, requires backwards - interoperability, and thus has a greater complexity (?) - - -- What are we going to gain? 
- --for loop, increment registers RT, RA, RB --few instructions, easier to implement and maintain --example assembly code --ARM has already started to add to libC SVE2 support - -1970 x86 comparison diff --git a/svp64-primer/svp64-primer.tex b/svp64-primer/svp64-primer.tex index 14e99fb7c..78c3eb641 100644 --- a/svp64-primer/svp64-primer.tex +++ b/svp64-primer/svp64-primer.tex @@ -1,5 +1,6 @@ \documentclass[a4paper, 10pt]{article} \usepackage[utf8]{inputenc} +\usepackage[firstpage]{draftwatermark} \usepackage[printonlyused,withpage]{acronym} \usepackage{graphicx} \usepackage{float} @@ -7,10 +8,14 @@ \usepackage[margin=1.1in]{geometry} \graphicspath{ {./img/} } -\title{(DRAFT) SVP64 Primer - \textit{Not so short yet}} +\title{(DRAFT) SVP64 Primer} \author{Andrey Miroshnikov, ...} +\SetWatermarkLightness{0.5} +\SetWatermarkScale{4} +%\SetWatermarkText{DRAFT!} + \begin{document} \maketitle @@ -20,6 +25,8 @@ \input{summary} %\input{...} +%\section{References} +%\textit{(All references and sources are available on request)} \bibliography{references} \bibliographystyle{ieeetr} \end{document} -- 2.30.2