From: Andrey Miroshnikov <andrey@technepisteme.xyz>
Date: Fri, 17 Jun 2022 12:56:04 +0000 (+0100)
Subject: Added information from SV page
X-Git-Tag: opf_rfc_ls005_v1~1735
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=c622226176550788c3cf447db36f4fde07bff16f;p=libreriscv.git

Added information from SV page
---

diff --git a/svp64-primer/summary.tex b/svp64-primer/summary.tex
index 8b8150156..b1b47f875 100644
--- a/svp64-primer/summary.tex
+++ b/svp64-primer/summary.tex
@@ -9,7 +9,7 @@ Specification for hardware for-loop that ONLY uses scalar instructions
 \begin{figure}[h]
 	\includegraphics[width=\linewidth]{simd_axb}
 	\caption{SIMD multiplication}
-	\label{simd_axb}
+	\label{fig:simd_axb}
 \end{figure}
 
 This method can have a huge advantage for rapid processing of vector-type data (image/video, physics simulations, cryptography, etc.)\cite{SIMD_WASM}, and thus on paper is very attractive compared to scalar-only instructions.\par
@@ -24,12 +24,12 @@ Pipelined execution units then perform parallel computations on these vector reg
 
 A simple vector processor might operate on one element at a time, however as the operations are independent by definition \textbf{(where is this from?)}, a processor could be made to compute all of the vector's elements simultaneously.\par
 
-Typically, today's vector processors can execute two, four, or eight 64-bit elements per clock cycle\cite{SIMD_HARM}. Such processors can also deal with (in hardware) fringe cases where the vector length is not a multiple of the number of elements. The element data width is variable (just like in SIMD). Fig \ref{vl_reg_n} shows the relationship between number of elements, data width and register vector length.
+Typically, today's vector processors can execute two, four, or eight 64-bit elements per clock cycle\cite{SIMD_HARM}. Such processors can also deal with (in hardware) fringe cases where the vector length is not a multiple of the number of elements. The element data width is variable (just like in SIMD). Fig.~\ref{fig:vl_reg_n} shows the relationship between the number of elements, the data width and the register vector length.
 
 \begin{figure}[h]
 	\includegraphics[width=\linewidth]{vl_reg_n}
 	\caption{Vector length, data width, number of elements}
-	\label{vl_reg_n}
+	\label{fig:vl_reg_n}
 \end{figure}
 
 RISCV Vector extension supports a VL of up to $2^{16}$ or $65536$ bits, which can fit 1024 64-bit words \cite{riscv-v-spec}.
@@ -53,7 +53,34 @@ The following are just some of the reasons why SIMD is unsustainable as the numb
 \end{itemize}
 
 \subsection{Simple Vectorisation}
-\ac{SV} is a an extension to a scalar ISA, designed to be as simple as possible, with no dedicated vector instructions. Effectively a hardware for-loop.
+\ac{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU, VPU, 3D?).
+It includes features normally found only on Cray-style vector supercomputers (Cray-1, NEC SX-Aurora) and GPUs.
+It keeps a strictly simple RISC design, leveraging a scalar ISA by using ``Prefixing''.
+No dedicated vector opcodes exist in SV!
+
+Main design principles:
+\begin{itemize}
+	\item Introduced by implementing it on top of the existing Power ISA.
+	\item Effectively a \textbf{hardware for-loop}: it pauses the main PC and issues multiple scalar operations.
+	\item Preserves underlying scalar execution dependencies as if the for-loop had been expanded into actual scalar instructions (``preserving Program Order'').
+	\item Augments existing instructions by adding ``tags'', which provide Vectorisation ``context'' rather than adding new opcodes.
+	\item Does not modify or deviate from the underlying scalar Power ISA unless there is a significant performance boost or other advantage in the vector space (see Section~\ref{subsubsec:add_to_pow_isa}).
+	\item Aimed at Supercomputing: avoids creating significant \textit{sequential dependency hazards}, allowing \textbf{high performance superscalar microarchitectures} to be deployed.
+\end{itemize}
+
+Advantages include:
+\begin{itemize}
+	\item Easy to create a first (and sometimes only) implementation as a literal for-loop in hardware, simulators, and compilers.
+	\item Hardware Architects may understand and implement SV as an extra pipeline stage, inserted between decode and issue: essentially a simple for-loop issuing element-level sub-instructions.
+	\item More complex HDL implementations can be built by repeating existing scalar ALUs and pipelines as blocks, leveraging existing Multi-Issue Infrastructure.
+	\item Mostly high-level ``context'' which does not significantly deviate from the scalar Power ISA and which, in its purest form, is a ``for-loop around scalar instructions''. Thus SV is minimally disruptive and consequently has a reasonable chance of broad community adoption and acceptance.
+	\item Obliterates SIMD opcode proliferation ($O(N^6)$\textbf{[source?]}) as well as dedicated Vectorisation ISAs. No more separate vector instructions.
+\end{itemize}
+
+\subsubsection{Deviations from Power ISA}
+\label{subsubsec:add_to_pow_isa}
+\textit{(TODO: EXPAND)}
+One example is dropping XER.SO.
 
 \subsubsection{Prefix 64 - SVP64}
 
diff --git a/svp64-primer/svp64-primer.tex b/svp64-primer/svp64-primer.tex
index 260e6ae33..1f001e392 100644
--- a/svp64-primer/svp64-primer.tex
+++ b/svp64-primer/svp64-primer.tex
@@ -4,7 +4,7 @@
 \usepackage{graphicx}
 \graphicspath{ {./img/} }
 
-\title{(DRAFT) SVP64 Primer}
+\title{(DRAFT) SVP64 Primer - \textit{Not so short yet}}
 
 \author{Andrey Miroshnikov, ...}