The proposed \acs{SV} is a Scalable Vector Specification for a hardware for-loop
\textbf{that ONLY uses scalar instructions}.
\item The Power \acs{ISA} v3.1 Spec is not altered.
  v3.1 Code-compatibility is guaranteed.
\item Does not require sacrificing 32-bit Major Opcodes.
\item Does not require adding duplicates of instructions
  (popcnt, popcntw, popcntd, vpopcntb, vpopcnth, vpopcntw, vpopcntd)
\item Fully abstracted: does not create Micro-architectural dependencies
  (no fixed "Lane" size), one binary works across all existing
  \textit{and future} implementations.
\item Specifically designed to be easily implemented
  on top of an existing Micro-architecture (especially
  Superscalar Out-of-Order Multi-issue) without
  disruptive full architectural redesigns.
\item Divided into Compliancy Levels to suit differing needs.
\item At the highest Compliancy Level only requires five instructions
  (SVE2 requires approximately 9,000. \acs{AVX-512} around 10,000. \acs{RVV} around
\item Predication, often-requested, is added cleanly
  (without modifying the v3.1 Power ISA)
\item In-registers arbitrary-sized Matrix Multiply is achieved in three
  instructions (without adding any v3.1 Power ISA instructions)
\item Full \acs{DCT} and \acs{FFT} RADIX2 Triple-loops are achieved with
  dramatically reduced instruction count, and power consumption is expected
  to reduce greatly. This is normally found only in high-end \acs{VLIW} \acs{DSP}
  (TI MSP, Qualcomm Hexagon)
\item Fail-First Load/Store allows Vectorised high performance
  strncpy to be implemented in around 14
  instructions (hand-optimised \acs{VSX} assembler is 240).
\item Inner loop of MP3 implemented in under 100 instructions
  (gcc produces 450 for the same function on POWER9).
All areas investigated so far consistently showed reductions in executable
size, which, as outlined in \cite{SIMD_HARM}, indirectly reduces
power consumption through lower I-Cache/TLB pressure and through Issue remaining
idle for long periods.
Simple-V has been specifically and carefully crafted to respect
the Power ISA's Supercomputing pedigree.
\includegraphics[width=0.6\linewidth]{power_pipelines.png}
\caption{Showing how SV fits in between Decode and Issue}
\label{fig:power_pipelines}
\subsection{What is SIMD?}
\acs{SIMD} is a way of partitioning existing \acs{CPU}
registers of 64-bit length into smaller 8-, 16-, 32-bit pieces
\cite{SIMD_HARM}\cite{SIMD_HPC}.
These partitions can then be operated on simultaneously, with the initial values
and results stored as entire 64-bit registers (\acs{SWAR}).
The SIMD instruction opcode
includes the data width and the operation to perform.
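
The partitioning described above can be illustrated in software. The following
is a minimal sketch only, assuming a 64-bit register modelled as a Python
integer; the helper names (\texttt{pack}, \texttt{unpack}, \texttt{swar\_op})
are hypothetical and do not come from any ISA or library. It shows how one
"instruction" fixes both the lane width and the operation, and how results
wrap within each lane rather than carrying into a neighbour.

```python
import operator

REG_BITS = 64  # assumed register width for this illustration


def pack(lanes, width):
    """Pack lane values into one 64-bit integer, lane 0 in the low bits."""
    mask = (1 << width) - 1
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & mask) << (i * width)
    return value


def unpack(value, width):
    """Split a 64-bit integer back into its lanes, lane 0 first."""
    mask = (1 << width) - 1
    return [(value >> shift) & mask for shift in range(0, REG_BITS, width)]


def swar_op(a, b, width, op):
    """Apply `op` to every `width`-bit lane of registers a and b."""
    assert width in (8, 16, 32)
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, REG_BITS, width):
        lane_a = (a >> shift) & mask
        lane_b = (b >> shift) & mask
        # Truncate to the lane width: nothing spills into the next lane.
        result |= (op(lane_a, lane_b) & mask) << shift
    return result


a = pack([1, 2, 3, 0xFFFF], 16)
b = pack([10, 20, 30, 2], 16)
print(unpack(swar_op(a, b, 16, operator.mul), 16))
```

Note that the lane width is an argument here, whereas a real SIMD ISA encodes
it in the opcode, which is one source of the opcode growth discussed later.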
\includegraphics[width=0.6\linewidth]{simd_axb.png}
\caption{SIMD multiplication}
This method can have a huge advantage for rapid processing of
vector-type data (image/video, physics simulations, cryptography,
and thus on paper is very attractive compared to
scalar-only instructions.
\textit{As long as the data width fits the workload, everything is fine}.
\subsection{Shortfalls of SIMD}
SIMD registers are of a fixed length and thus, to achieve greater
performance, CPU architects typically increase the width of registers
(to 128-, 256-, 512-bit etc) for more partitions.\par Additionally,
binary compatibility is an important feature, and thus each doubling
of SIMD registers also expands the instruction set. The number of
instructions quickly balloons, and this can be seen in, for example,
IA-32 expanding from 80 to about 1400 instructions since
the 1970s \cite{SIMD_HARM}.
\par
Five-digit Opcode proliferation (10,000 instructions) is overwhelming.
The following are just some of the reasons why SIMD is unsustainable as
the number of instructions increases:
\item Hardware design, ASIC routing etc.
\item Documentation of the ISA
\item Manual coding and optimisation
\item Time to support the platform
\item Compliance Suite development and testing
\item Protracted Variable-Length encoding (x86) severely compromises
\subsection{Scalable Vector Architectures}
An older alternative exists to utilise data parallelism: vector
architectures. Vector CPUs collect operands from the main memory, and
store them in large, sequential vector registers.
\par
A simple vector processor might operate on one element at a time.
However, as the element operations are usually independent,
a processor could be made to compute all of the vector's
elements simultaneously, taking advantage of multiple pipelines.
\par
Typically, today's vector processors can execute two, four, or eight
64-bit elements per clock cycle.
Vector ISAs are specifically designed to deal with (in hardware) fringe
cases where an algorithm's element count is not a multiple of the
underlying hardware "Lane" width. The element data width
is variable (8 to 64-bit just like in SIMD)
but it is the \textit{number} of elements being
variable under control of a "setvl" instruction that specifically
makes Vector ISAs "Scalable"
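
The role of "setvl" in handling fringe cases can be sketched as a strip-mined
loop. This is a simplified model under stated assumptions, not the definition
of any real instruction: here \texttt{setvl} is assumed to return the smaller
of the remaining element count and a hardware maximum (\texttt{maxvl}), and
both names are hypothetical.

```python
def setvl(remaining, maxvl):
    """Simplified model of 'setvl': VL = min(remaining, MAXVL)."""
    return min(remaining, maxvl)


def vector_add(a, b, maxvl=8):
    """Add two equal-length lists in chunks of at most `maxvl` elements.

    Returns the result and the VL chosen on each iteration, so the
    handling of the final, shorter chunk is visible.
    """
    assert len(a) == len(b)
    out = []
    vl_trace = []
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i, maxvl)  # hardware picks the chunk size
        vl_trace.append(vl)
        # One "vector instruction": vl independent element operations.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out, vl_trace


result, trace = vector_add(list(range(19)), [1] * 19, maxvl=8)
print(trace)
```

With 19 elements and an assumed maximum of 8, the last pass simply runs with a
shorter VL; no separate scalar clean-up loop is needed, and the same code would
run unchanged on an implementation with a different maximum.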
\acs{RVV} supports a VL of up to $2^{16}$ or $65536$ bits,
which can fit 1024 64-bit words.
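
As a quick consistency check of the figures quoted above (simple arithmetic
only, not taken from the RVV specification):
\[
2^{16} = 65536 \mbox{ bits}, \qquad
\frac{65536 \mbox{ bits}}{64 \mbox{ bits per word}} = 1024 \mbox{ words}.
\]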
The Cray-1 had 8 Vector Registers with up to 64 elements (64-bit each).
An early Draft of RVV supported overlaying the Vector Registers onto the
Floating Point registers, similar to \acs{MMX}.
\includegraphics[width=0.6\linewidth]{cray_vector_regs.png}
\caption{Cray Vector registers: 8 registers, 64 elements each}
\label{fig:cray_vector_regs}
Simple-V's "Vector" Registers (a misnomer) are specifically designed to fit
the Scalar (GPR, FPR) register files, which are extended from the default
of 32, to 128 entries in the high-end Compliancy Levels. This is a primary
reason why Simple-V can be added on top of an existing Scalar ISA, and
\textit{in particular} why there is no need to add explicit Vector
Vector instructions. The diagram below shows \textit{conceptually}
how a Vector's elements are sequentially and linearly mapped onto the
\textit{Scalar} register file:
\includegraphics[width=0.6\linewidth]{svp64_regs.png}
\caption{Three instructions, same vector length, different element widths}
\label{fig:svp64_regs}
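
The sequential, linear mapping of elements onto the Scalar register file can
be sketched numerically. This is an illustrative model only: it assumes 64-bit
scalar registers and contiguous packing of elements starting at bit 0 of the
first register, and the function names are hypothetical rather than part of
the specification.

```python
REG_BITS = 64  # assumed width of one scalar register (GPR or FPR)


def element_location(start_reg, index, elwidth):
    """Return (register number, bit offset) holding element `index`.

    Elements are assumed to be packed contiguously, so narrower
    elements share a register and wider vectors spill into the
    following registers.
    """
    assert elwidth in (8, 16, 32, 64)
    bit_pos = index * elwidth
    return start_reg + bit_pos // REG_BITS, bit_pos % REG_BITS


def registers_used(vl, elwidth):
    """Number of scalar registers spanned by `vl` elements."""
    return -(-(vl * elwidth) // REG_BITS)  # ceiling division


# Same vector length, three different element widths.
for width in (64, 32, 16):
    print(width, registers_used(4, width))
```

Under these assumptions a vector of four elements spans four registers at
64-bit width, two at 32-bit and one at 16-bit, which matches the intent of the
caption above: the vector length stays the same while the footprint in the
register file changes.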
\subsection{Simple Vectorisation}
\acs{SV} is a Scalable Vector ISA designed for hybrid workloads (CPU, GPU,
VPU, 3D?). It includes features normally found only on Cray-style Supercomputers
(Cray-1, NEC SX-Aurora) and GPUs. It keeps to a strict uniform RISC paradigm,
leveraging a scalar ISA by using "Prefixing".
\textbf{No dedicated vector opcodes exist in SV, at all}.
SVP64 uses 25\% of the Power ISA v3.1 64-bit Prefix space (EXT001) to create
the SV Vectorisation Context for the 32-bit Scalar Suffix.
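
The idea of a prefix supplying Vectorisation Context to an unmodified scalar
instruction can be sketched as a literal for-loop. The code below is a
conceptual model under stated assumptions and not the SVP64 definition: the
register file is a plain list, \texttt{scalar\_add} stands in for any ordinary
scalar instruction, and \texttt{sv\_prefixed}, \texttt{vl} and
\texttt{predicate} are hypothetical names chosen for illustration.

```python
MASK64 = (1 << 64) - 1


def scalar_add(regs, rt, ra, rb):
    """Stand-in for an ordinary scalar instruction: RT = RA + RB."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64


def sv_prefixed(scalar_op, regs, rt, ra, rb, vl, predicate=None):
    """Issue `scalar_op` once per element, in Program Order.

    The scalar operation itself is untouched: the loop only steps the
    register numbers. A set bit i in `predicate` enables element i;
    None means all elements are enabled.
    """
    issued = []
    for i in range(vl):
        if predicate is not None and not (predicate >> i) & 1:
            continue  # masked-out element: no operation is issued
        scalar_op(regs, rt + i, ra + i, rb + i)
        issued.append((rt + i, ra + i, rb + i))
    return issued


regs = list(range(128))  # 128-entry register file, regs[n] == n
print(sv_prefixed(scalar_add, regs, rt=32, ra=0, rb=16, vl=4))
print(regs[32:36])
```

Because the loop takes the scalar operation as a parameter, any function with
the same shape is "Vectorised" without further work, which is the sense in
which no dedicated vector opcodes are needed in this model.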
Main design principles
\item Introduce by implementing on top of existing Power ISA
\item Effectively a \textbf{hardware for-loop}, pauses main PC,
  issues multiple scalar operations
\item Strictly preserves (leverages) underlying scalar execution
  the for-loop had been expanded into actual scalar instructions
  ("preserving Program Order")
\item Augments existing instructions by adding "tags" - provides
  Vectorisation "context" rather than adding new opcodes.
\item Does not modify or deviate from the underlying scalar
  Power ISA unless there's a significant performance boost or other
  advantage in the vector space
\item Aimed at Supercomputing: avoids creating significant
  \textit{sequential dependency hazards}, allowing \textbf{high
  performance multi-issue superscalar microarchitectures} to be
\item Easy to create first (and sometimes only) implementation
  as a literal for-loop in hardware, simulators, and compilers.
\item Obliterates SIMD opcode proliferation
  ($O(N^6)$) as well as dedicated Vectorisation
  ISAs. No more separate vector instructions.
\item Reduces maintenance overhead (no separate Vector instructions).
  Adding any new Scalar instruction
  \textit{automatically adds a Vectorised version of the same}.
\item Easier for compilers, coders, documentation