Specification for hardware for-loop that ONLY uses scalar instructions
\subsection{What is SIMD?}
\textit{(For clarity, only 64-bit registers are discussed here; however, 128-, 256-, and 512-bit implementations also exist.)}
\ac{SIMD} is a way of partitioning existing \ac{CPU} registers of 64-bit length into smaller 8-, 16-, or 32-bit pieces \cite{SIMD_HARM}\cite{SIMD_HPC}. These partitions can then be operated on simultaneously, with the initial values and results stored as entire 64-bit registers. The SIMD instruction opcode encodes both the data width and the operation to perform.
\par
\begin{figure}
\includegraphics[width=\linewidth]{simd_axb}
\caption{SIMD multiplication}
\end{figure}
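
As a minimal sketch of the partitioning described above, the following Python model treats a 64-bit register as a row of 8-, 16-, or 32-bit lanes and applies one operation to every lane. The names \texttt{simd\_op}, \texttt{simd\_add}, and \texttt{simd\_mul} are hypothetical and do not correspond to any real instruction; wrap-around on overflow within each lane is an assumption of this model.

```python
def simd_op(a, b, width, op):
    """Apply `op` lane by lane to two 64-bit values split into `width`-bit lanes."""
    assert width in (8, 16, 32)
    lane_mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Assumption: each lane wraps around on overflow and never
        # carries into its neighbour.
        lane_r = op(lane_a, lane_b) & lane_mask
        result |= lane_r << shift
    return result


def simd_add(a, b, width):
    """Packed addition: the data width is chosen per call, as in an opcode."""
    return simd_op(a, b, width, lambda x, y: x + y)


def simd_mul(a, b, width):
    """Packed multiplication, keeping the low `width` bits of each product."""
    return simd_op(a, b, width, lambda x, y: x * y)
```

With 16-bit lanes, \texttt{simd\_mul(0x0001000200030004, 0x0010002000300040, 16)} multiplies the four lane pairs independently and returns \texttt{0x0010004000900100}.
\par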
This method can offer a large advantage for rapid processing of vector-type data (image/video, physics simulations, cryptography, etc.) \cite{SIMD_WASM}, and thus on paper is very attractive compared to scalar-only instructions.
\par
SIMD registers are of a fixed length, and thus to achieve greater performance, CPU architects typically increase the width of the registers (to 128-, 256-, 512-bit, etc.) to provide more partitions.
\par
Additionally, binary compatibility is an important feature, and thus each doubling of the SIMD register width also expands the instruction set. The number of instructions quickly balloons, as can be seen in a popular \ac{ISA} such as IA-32, which has expanded from 80 to about 1400 instructions since 1978 \cite{SIMD_HARM}.
\par
\subsection{Vector Architectures}
An older alternative for exploiting data parallelism exists: vector architectures. Vector CPUs collect operands from main memory and store them in large, sequential vector registers.
\par
Pipelined execution units then perform parallel computations on these vector registers. The result vector is then broken up into individual results, which are sent back to main memory.
\par
A simple vector processor might operate on one element at a time; however, as the operations are independent by definition \textbf{(where is this from?)}, a processor could be made to compute all of the vector's elements simultaneously.
\par
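The load, compute, and store flow described above can be sketched as follows. This is an illustrative model only: \texttt{vector\_add}, the flat \texttt{memory} list, and the \texttt{lanes} parameter (how many independent elements are processed per cycle) are all hypothetical names introduced here.

```python
def vector_add(memory, src_a, src_b, dst, vl, lanes=1):
    """Add two vl-element vectors stored in `memory`; return the cycle count."""
    # Load: collect the operands from main memory into vector registers.
    vreg_a = memory[src_a:src_a + vl]
    vreg_b = memory[src_b:src_b + vl]
    vreg_r = [0] * vl

    # Execute: the elements are independent, so each modelled cycle may
    # process up to `lanes` of them without changing the result.
    cycles = 0
    for start in range(0, vl, lanes):
        for i in range(start, min(start + lanes, vl)):
            vreg_r[i] = vreg_a[i] + vreg_b[i]
        cycles += 1

    # Store: send the individual results back to main memory.
    memory[dst:dst + vl] = vreg_r
    return cycles
```

Running the model with \texttt{lanes=1} and with \texttt{lanes} equal to the vector length writes the same values to memory; only the number of modelled cycles differs.
\par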
Typically, today's vector processors can execute two, four, or eight 64-bit elements per clock cycle \cite{SIMD_HARM}. Such processors can also handle, in hardware, fringe cases where the vector length is not a multiple of the number of elements processed per clock cycle. The element data width is variable (just like in SIMD). Fig.~\ref{vl_reg_n} shows the relationship between the number of elements, the data width, and the register vector length.
\begin{figure}
\includegraphics[width=\linewidth]{vl_reg_n}
\caption{Vector length, data width, number of elements}
\label{vl_reg_n}
\end{figure}
The RISC-V Vector extension supports a vector register length (VL here, \texttt{VLEN} in the specification) of up to $2^{16} = 65536$ bits, which can hold 1024 64-bit words \cite{riscv-v-spec}.
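
As a worked check of this figure, the element count follows from dividing the register length by the element width. The symbols $N$ (number of elements), $VL$ (vector register length in bits), and $W$ (element data width in bits) are notation introduced here for illustration only:
\[
N = \frac{VL}{W}, \qquad
\frac{2^{16}}{64} = \frac{65536}{64} = 1024, \qquad
\frac{2^{16}}{8} = \frac{65536}{8} = 8192 .
\]
The same register therefore holds 1024 elements at 64-bit width, or 8192 elements at 8-bit width.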
\subsection{Comparison Between SIMD and Vector}
\textit{(Need to add more here, use example from \cite{SIMD_HARM}?)}
\subsubsection{Code Example}
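One possible illustration, offered as a sketch and not as the example from the cited work, is a loop computing $y = a x + y$ written in both styles. The function names and the \texttt{lanes} and \texttt{max\_vl} parameters are hypothetical. The fixed-width SIMD version needs separate code for the leftover elements, whereas the vector version asks for a vector length on each pass and so handles the fringe case in the same loop.

```python
def axpy_simd(a, x, y, lanes=4):
    """Fixed-width SIMD style: whole groups of `lanes`, then a scalar tail."""
    n = len(x)
    main = n - n % lanes
    for i in range(0, main, lanes):
        # Models one packed instruction over exactly `lanes` elements.
        y[i:i + lanes] = [a * xi + yi
                          for xi, yi in zip(x[i:i + lanes], y[i:i + lanes])]
    for i in range(main, n):
        # Leftover elements need a separate scalar loop.
        y[i] = a * x[i] + y[i]
    return y


def axpy_vector(a, x, y, max_vl=4):
    """Vector style: the vector length is set per pass, so no tail loop."""
    n = len(x)
    i = 0
    while i < n:
        # Models a set-vector-length step: take as many elements as remain,
        # up to the hardware maximum (assumed here to be `max_vl`).
        vl = min(max_vl, n - i)
        y[i:i + vl] = [a * xi + yi
                       for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        i += vl
    return y
```

Both functions produce the same result for any input length; the difference lies in how much code is needed to cover lengths that are not a multiple of the hardware width.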
\subsection{Shortfalls of SIMD}
The following are just some of the areas in which SIMD becomes unsustainable as the number of instructions increases:
\begin{itemize}
\item Hardware design, ASIC routing, etc.
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\end{itemize}
\subsection{Simple Vectorisation}
\ac{SV} is an extension to a scalar ISA, designed to be as simple as possible, with no dedicated vector instructions. It is effectively a hardware for-loop.
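
The hardware for-loop idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the specification itself: \texttt{scalar\_add} stands in for any ordinary scalar instruction, \texttt{sv\_loop} is a hypothetical name for the looping mechanism, and the register numbers RT, RA, and RB are assumed to increment by one per element.

```python
MASK64 = (1 << 64) - 1


def scalar_add(regs, rt, ra, rb):
    """An ordinary scalar instruction: regs[rt] = regs[ra] + regs[rb]."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64


def sv_loop(instr, regs, rt, ra, rb, vl):
    """Hypothetical model of the hardware for-loop.

    The same scalar instruction is issued `vl` times, with the register
    numbers RT, RA and RB incremented on each iteration.
    """
    for i in range(vl):
        instr(regs, rt + i, ra + i, rb + i)
```

For example, with the values 1 to 4 in registers 8 to 11 and 10 to 40 in registers 16 to 19, \texttt{sv\_loop(scalar\_add, regs, 24, 8, 16, 4)} leaves 11, 22, 33, and 44 in registers 24 to 27 without any dedicated vector instruction being defined.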
\subsubsection{Prefix 64 - SVP64}
SVP64 is a specification designed to rival existing SIMD implementations by:
\begin{itemize}
\item Simplifying the hardware design
\item Reducing maintenance overhead
\item Making things easier for compilers, coders, and documentation
\item Reducing the time to support the platform to a fraction of that of conventional SIMD (less money spent on R\&D, faster to deliver)
\end{itemize}
\begin{itemize}
\item Intel SIMD is designed to be more capable and has more features, and thus has a greater complexity (?)
\item What are we going to gain?
\item For-loop, increment registers RT, RA, RB
\item Few instructions, easier to implement and maintain
\item Example assembly code
\item ARM has already started adding SVE2 support to libc
\end{itemize}