From c622226176550788c3cf447db36f4fde07bff16f Mon Sep 17 00:00:00 2001
From: Andrey Miroshnikov
Date: Fri, 17 Jun 2022 13:56:04 +0100
Subject: [PATCH] Added information from SV page

---
 svp64-primer/summary.tex      | 35 +++++++++++++++++++++++++++++++----
 svp64-primer/svp64-primer.tex |  2 +-
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/svp64-primer/summary.tex b/svp64-primer/summary.tex
index 8b8150156..b1b47f875 100644
--- a/svp64-primer/summary.tex
+++ b/svp64-primer/summary.tex
@@ -9,7 +9,7 @@ Specification for hardware for-loop that ONLY uses scalar instructions
 \begin{figure}[h]
     \includegraphics[width=\linewidth]{simd_axb}
     \caption{SIMD multiplication}
-    \label{simd_axb}
+    \label{fig:simd_axb}
 \end{figure}
 
 This method can have a huge advantage for rapid processing of vector-type data (image/video, physics simulations, cryptography, etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to scalar-only instructions.\par
@@ -24,12 +24,12 @@ Pipelined execution units then perform parallel computations on these vector reg
 
 A simple vector processor might operate on one element at a time, however as the operations are independent by definition \textbf{(where is this from?)}, a processor could be made to compute all of the vector's elements simultaneously.\par
 
-Typically, today's vector processors can execute two, four, or eight 64-bit elements per clock cycle\cite{SIMD_HARM}. Such processors can also deal with (in hardware) fringe cases where the vector length is not a multiple of the number of elements. The element data width is variable (just like in SIMD). Fig \ref{vl_reg_n} shows the relationship between number of elements, data width and register vector length.
+Typically, today's vector processors can execute two, four, or eight 64-bit elements per clock cycle\cite{SIMD_HARM}. Such processors can also deal with (in hardware) fringe cases where the vector length is not a multiple of the number of elements. The element data width is variable (just like in SIMD). Fig \ref{fig:vl_reg_n} shows the relationship between number of elements, data width and register vector length.
 
 \begin{figure}[h]
     \includegraphics[width=\linewidth]{vl_reg_n}
     \caption{Vector length, data width, number of elements}
-    \label{vl_reg_n}
+    \label{fig:vl_reg_n}
 \end{figure}
 
 RISCV Vector extension supports a VL of up to $2^{16}$ or $65536$ bits, which can fit 1024 64-bit words \cite{riscv-v-spec}.
@@ -53,7 +53,34 @@ The following are just some of the reasons why SIMD is unsustainable as the numb
 \end{itemize}
 
 \subsection{Simple Vectorisation}
-\ac{SV} is a an extension to a scalar ISA, designed to be as simple as possible, with no dedicated vector instructions. Effectively a hardware for-loop.
+\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU, VPU, 3D?).
+Includes features normally found only on Cray Supercomputers (Cray-1, NEC SX-Aurora) and GPUs.
+Keeps a strictly simple RISC leveraging a scalar ISA by using "Prefixing"
+No dedicated vector opcodes exist in SV!
+
+Main design principles
+\begin{itemize}
+    \item Introduce by implementing on top of existing Power ISA
+    \item Effectively a \textbf{hardware for-loop}, pauses main PC, issues multiple scalar op's
+    \item Preserves underlying scalar execution dependencies as the for-loop had been expanded as actual scalar instructions ("preserving Program Order")
+    \item Augments existing instructions by adding "tags" - provides Vectorisation "context" rather than adding new opcodes.
+    \item Does not modify or deviate from the underlying scalar Power ISA unless there's a significant performance boost or other advantage in the vector space (see \ref{subsubsec:add_to_pow_isa})
+    \item Aimed at Supercomputing: avoids creating significant \textit{sequential dependency hazards}, allowing \textbf{high performance superscalar microarchitectures} to be deployed.
+\end{itemize}
+
+Advantages include:
+\begin{itemize}
+    \item Easy to create first (and sometimes only) implementation as a literal for-loop in hardware, simulators, and compilers.
+    \item Hardware Architects may understand and implement SV as being an extra pipeline stage, inserted between decode and issue. Essentially a simple for-loop issuing element-level sub-instructions.
+    \item More complex HDL can be done by repeating existing scalar ALUs and pipelines as blocks, leveraging existing Multi-Issue Infrastructure.
+    \item Mostly high-level "context" which does not significantly deviate from scalar Power ISA and, in its purest form being a "for-loop around scalar instructions". Thus SV is minimally-disruptive and consequently has a reasonable chance of broad community adoption and acceptance.
+    \item Obliterates SIMD opcode proliferation ($O(N^6)$\textbf{[source?]}) as well as dedicated Vectorisation ISAs. No more separate vector instructions.
+\end{itemize}
+
+\subsubsection{Deviations from Power ISA}
+\label{subsubsec:add_to_pow_isa}
+\textit{(TODO: EXPAND)}
+dropping XER.SO for example
 
 \subsubsection{Prefix 64 - SVP64}
 
diff --git a/svp64-primer/svp64-primer.tex b/svp64-primer/svp64-primer.tex
index 260e6ae33..1f001e392 100644
--- a/svp64-primer/svp64-primer.tex
+++ b/svp64-primer/svp64-primer.tex
@@ -4,7 +4,7 @@
 \usepackage{graphicx}
 \graphicspath{ {./img/} }
 
-\title{(DRAFT) SVP64 Primer}
+\title{(DRAFT) SVP64 Primer - \textit{Not so short yet}}
 \author{Andrey Miroshnikov, ...}
 
 
-- 
2.30.2