\documentclass[slidestop]{beamer}
\usepackage{beamerthemesplit}
\title{Simple-V RISC-V Extension for Vectorisation and SIMD}
\author{Luke Kenneth Casson Leighton}
\huge{Simple-V RISC-V Extension for Vectors and SIMD}\\
\Large{Flexible Vectorisation}\\
\Large{(aka not so Simple-V?)}\\
\Large{[proposed for] Chennai 9th RISC-V Workshop}\\
\frame{\frametitle{Credits and Acknowledgements}
 \begin{itemize}
   \item The Designers of RISC-V\vspace{15pt}
   \item The RVV Working Group and contributors\vspace{15pt}
   \item Jacob Bachmeyer, Xan Phung, Chuanhua Chang,\\
         Guy Lemurieux and others\vspace{15pt}
   \item ISA-Dev Group Members\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{The Simon Sinek lowdown (Why, How, What)}
 \begin{itemize}
   \item Vectorisation needs to fit (be useful within) an implementor's\\
         scope: RV32E, Embedded/Mobile, DSP, Servers and more.\vspace{15pt}
   \item By implicitly marking INT/FP regs as ``Vectorised'',\\
         everything else follows from there.\vspace{15pt}
   \item A Standard Vector ``API'' with flexibility for implementors:\\
         choice to optimise for area or performance as desired\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{Why another Vector Extension?}
 \begin{itemize}
   \item RVV is very heavy-duty (excellent for supercomputing)\vspace{10pt}
   \item Simple-V abstracts parallelism (based on best of RVV)\vspace{10pt}
   \item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{10pt}
   \item Even Compressed instructions become vectorised\vspace{10pt}
 \end{itemize}
 What Simple-V is not:\vspace{10pt}
 \begin{itemize}
   \item A full supercomputer-level Vector Proposal\vspace{10pt}
   \item A replacement for RVV (SV is designed to be augmented)\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{Quick refresher on SIMD}
 \begin{itemize}
   \item SIMD very easy to implement (and very seductive)\vspace{10pt}
   \item Parallelism is in the ALU\vspace{10pt}
   \item Zero-to-negligible impact on rest of core\vspace{10pt}
 \end{itemize}
 Where SIMD Goes Wrong:\vspace{10pt}
 \begin{itemize}
   \item See ``SIMD instructions considered harmful''
         https://www.sigarch.org/simd-instructions-considered-harmful
   \item Corner-cases alone are extremely complex.\\
         Hardware is easy, but software is hell.
   \item O($N^{6}$) ISA opcode proliferation!\\
         opcode, elwidth, veclen, src1-src2-dest hi/lo
 \end{itemize}
}
\frame{\frametitle{Quick refresher on RVV}
 \begin{itemize}
   \item Extremely powerful (extensible to 256 registers)\vspace{10pt}
   \item Supports polymorphism, several datatypes (inc. FP16)\vspace{10pt}
   \item Requires a separate Register File\vspace{10pt}
   \item Can be implemented as a separate pipeline\vspace{10pt}
 \end{itemize}
 However...\vspace{10pt}
 \begin{itemize}
   \item 98\% opcode duplication with rest of RV (all except CLIP)
   \item Extending RVV requires customisation not just of h/w:\\
         gcc and s/w also need customisation (and maintenance)
 \end{itemize}
}
\frame{\frametitle{How is Parallelism abstracted?}
 \begin{itemize}
   \item Register ``typing'' turns any op into an implicit Vector op\vspace{10pt}
   \item Primarily at the Instruction issue phase (except SIMD)\vspace{10pt}
   \item Standard (and future, and custom) opcodes now parallel\vspace{10pt}
   \item LOAD/STORE (inc. C.LD and C.ST, LD.X: everything)
   \item All ALU ops (soft / hybrid / full HW, on per-op basis)
   \item All branches become predication targets (C.FNE added)
   \item C.MV of particular interest (s/v, v/v, v/s)
 \end{itemize}
}
\frame{\frametitle{Implementation Options}
 \begin{itemize}
   \item Absolute minimum: Exceptions (if CSRs indicate ``V'', trap)\vspace{10pt}
   \item Hardware loop, single-instruction issue\vspace{10pt}
   \item Hardware loop, parallel (multi-instruction) issue\vspace{10pt}
   \item Hardware loop, full parallel ALU (not recommended)\vspace{10pt}
   \item 4 (or more?) options above may be deployed on a per-op basis
   \item Minimum MVL MUST be sufficient to cover regfile LD/ST
   \item Instr. FIFO may repeatedly split off N scalar ops at a time
 \end{itemize}
}
% Instr. FIFO may need its own slide.  Basically, the vectorised op
% gets pushed into the FIFO, where it is then "processed".  Processing
% removes the first set of ops from its vector numbering (taking
% predication into account) and shoves them **BACK** into the FIFO,
% MODIFYING the remaining "vectorised" op by subtracting the
% now-scalar ops from it.
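
\begin{frame}[fragile]
\frametitle{Sketch: Instr. FIFO splitting (illustrative only)}
\scriptsize
A minimal Python sketch of the FIFO idea in the note above, not a
hardware design. The tuple layout (name, start, vl, pred) and the
function name are assumptions made for illustration; here the peeled-off
scalar ops are returned to the caller and only the shortened vector op
goes back into the FIFO.
\begin{verbatim}
from collections import deque

def split_vector_op(fifo, n):
    """Pop one vectorised op, peel off up to n scalar ops and
    push the shortened op back. Returns the scalar ops."""
    name, start, vl, pred = fifo.popleft()
    scalars = []
    i = start
    while i < vl and len(scalars) < n:
        if (pred >> i) & 1:          # predicated-out elements are skipped
            scalars.append((name, i))
        i += 1
    if i < vl:                       # the remainder stays "vectorised"
        fifo.appendleft((name, i, vl, pred))
    return scalars
\end{verbatim}
Calling it repeatedly drains the op, N scalar ops at a time.
\end{frame}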
\frame{\frametitle{How are SIMD Instructions Vectorised?}
 \begin{itemize}
   \item SIMD ALU(s) primarily unchanged\vspace{10pt}
   \item Predication is added to each SIMD element (NO ZEROING!)\vspace{10pt}
   \item End of Vector enables predication (NO ZEROING!)\vspace{10pt}
 \end{itemize}
 Considerations:\vspace{10pt}
 \begin{itemize}
   \item Many SIMD ALUs possible (parallel execution)\vspace{10pt}
   \item Very long SIMD ALUs could waste die area (short vectors)\vspace{10pt}
   \item Implementor free to choose (API remains the same)\vspace{10pt}
 \end{itemize}
}
% With multiple SIMD ALUs at for example 32-bit wide they can be used
% to either issue 64-bit or 128-bit or 256-bit wide SIMD operations
% or they can be used to cover several operations on totally different
% vectors / registers.
\frame{\frametitle{What's the deal / juice / score?}
 \begin{itemize}
   \item Standard Register File(s) overloaded with ``vector span''\vspace{10pt}
   \item Element width and type concepts remain same as RVV\vspace{10pt}
   \item CSRs are key-value tables (overlaps allowed)\vspace{10pt}
 \end{itemize}
 Key differences from RVV:\vspace{10pt}
 \begin{itemize}
   \item Predication in INT regs as a BIT field (max VL=XLEN)
   \item Minimum VL must be Num Regs - 1 (all regs single LD/ST)
   \item SV may condense sparse Vecs: RVV lets ALU do predication
   \item NO ZEROING: non-predicated elements are skipped
 \end{itemize}
}
\frame{\frametitle{Why are overlaps allowed in Regfiles?}
 \begin{itemize}
   \item Same register(s) can have multiple ``interpretations''\vspace{10pt}
   \item xBitManip plus SIMD plus xBitManip = Hi/Lo bitops\vspace{10pt}
   \item (32-bit GREV plus 4x8-bit SIMD plus 32-bit GREV)\vspace{10pt}
   \item Same register(s) can be offset (no need for VSLIDE)\vspace{10pt}
   \item xBitManip reduces O($N^{6}$) SIMD down to O($N^{3}$)\vspace{10pt}
   \item Hi-Performance: Macro-op fusion (more pipeline stages?)\vspace{10pt}
 \end{itemize}
}
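
\begin{frame}[fragile]
\frametitle{Sketch: register offsets instead of VSLIDE (illustrative only)}
\footnotesize
A small Python sketch of the ``offset'' point above, assuming a vector
is simply VL consecutive entries of the scalar register file. The names
\texttt{regfile} and \texttt{vec} are hypothetical.
\begin{verbatim}
regfile = list(range(100, 132))   # 32 scalar registers, x0..x31

def vec(start, vl):
    """A 'vector' is vl consecutive registers starting at x<start>."""
    return regfile[start:start + vl]

v    = vec(8, 4)   # elements held in x8..x11
slid = vec(9, 4)   # same registers, viewed one element further along
\end{verbatim}
Declaring a second vector that starts one register later gives the
``slid'' view without moving any data.
\end{frame}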
\frame{\frametitle{Why no Zeroing (place zeros in non-predicated elements)?}
 \begin{itemize}
   \item Zeroing is an implementation optimisation favouring OoO\vspace{8pt}
   \item Simple implementations may skip non-predicated operations\vspace{8pt}
   \item With zeroing, simple implementations explicitly have to destroy data\vspace{8pt}
   \item Complex implementations may use reg-renames to save power\\
         Zeroing on predication chains makes optimisation harder
 \end{itemize}
 Considerations:\vspace{10pt}
 \begin{itemize}
   \item Complex not really impacted, Simple impacted a LOT
   \item Overlapping ``Vectors'' may issue overlapping ops
   \item Please don't use Vectors for ``security'' (use Sec-Ext)
 \end{itemize}
}
% With overlapping "vectors" - bearing in mind that "vectors" are
% just a remap onto the standard register file - if the top bits of
% predication are zero, and there happens to be a second vector
% that uses some of the same register file that happens to be
% predicated out, the second vector op may be issued *at the same time*
% if there are available parallel ALUs to do so.
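
\begin{frame}[fragile]
\frametitle{Sketch: skipping vs zeroing (illustrative only)}
\footnotesize
A minimal Python sketch contrasting the two behaviours discussed above.
The function name and the \texttt{zeroing} flag are assumptions for
illustration; SV as proposed corresponds to \texttt{zeroing=False}.
\begin{verbatim}
def masked_add(dest, a, b, pred, zeroing=False):
    """Element-wise add under a predicate bitmask.
    zeroing=False: masked-out dest elements are left untouched.
    zeroing=True:  masked-out dest elements are overwritten with 0."""
    out = list(dest)
    for i in range(len(out)):
        if (pred >> i) & 1:
            out[i] = a[i] + b[i]
        elif zeroing:
            out[i] = 0               # existing data is destroyed
    return out
\end{verbatim}
With skipping, a simple implementation does no work at all for a
masked-out element; with zeroing it must still write the destination.
\end{frame}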
\frame{\frametitle{Predication key-value CSR store}
 \begin{itemize}
   \item key is int regfile number or FP regfile number (1 bit)\vspace{10pt}
   \item register to be predicated if referred to (5 bits, key)\vspace{10pt}
   \item register to store actual predication in (5 bits, value)\vspace{10pt}
   \item predication is inverted (1 bit)\vspace{10pt}
   \item Table should be expanded out for high-speed implementations
   \item Multiple ``keys'' (and values) theoretically permitted
   \item RVV rules about deleting higher-indexed CSRs followed
 \end{itemize}
}
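
\begin{frame}[fragile]
\frametitle{Sketch: predication table lookup (illustrative only)}
\scriptsize
A minimal Python model of the key-value fields listed above. The names
(\texttt{PredEntry}, \texttt{lookup}, \texttt{active}) and the
first-match rule are assumptions, not part of the proposal.
\begin{verbatim}
from typing import NamedTuple, Optional

class PredEntry(NamedTuple):
    is_fp: bool    # 1 bit:  key is in the FP (True) or INT regfile
    key: int       # 5 bits: register that is predicated
    value: int     # 5 bits: INT register holding the predicate bits
    invert: bool   # 1 bit:  predicate is inverted

def lookup(table, is_fp, reg) -> Optional[PredEntry]:
    """Return the first entry whose key matches, or None."""
    for e in table:
        if e.is_fp == is_fp and e.key == reg:
            return e
    return None

def active(entry, ireg, i):
    """Is element i enabled? No entry means 'not predicated'."""
    if entry is None:
        return True
    bit = (ireg[entry.value] >> i) & 1
    return bool(bit) != entry.invert
\end{verbatim}
\end{frame}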
\frame{\frametitle{Register key-value CSR store}
 \begin{itemize}
   \item key is int regfile number or FP regfile number (1 bit)\vspace{10pt}
   \item register to be predicated if referred to (5 bits, key)\vspace{10pt}
   \item register to store actual predication in (5 bits, value)\vspace{10pt}
   \item TODO\vspace{10pt}
   \item Table should be expanded out for high-speed implementations
   \item Multiple ``keys'' (and values) theoretically permitted
   \item RVV rules about deleting higher-indexed CSRs followed
 \end{itemize}
}
\begin{frame}[fragile]
\frametitle{ADD pseudocode (or trap, or actual hardware loop)}

\begin{semiverbatim}
function op_add(rd, rs1, rs2, predr) # add not VADD!
  int i, id=0, irs1=0, irs2=0;
  for (i=0; i < MIN(VL, vectorlen[rd]); i++)
    if (ireg[predr] & 1<<i) # predication uses intregs
       ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
    if (reg_is_vectorised[rd])  \{ id += 1; \}
    if (reg_is_vectorised[rs1]) \{ irs1 += 1; \}
    if (reg_is_vectorised[rs2]) \{ irs2 += 1; \}
\end{semiverbatim}

 \begin{itemize}
   \item SIMD slightly more complex (case above is elwidth = default)
   \item Scalar-scalar and scalar-vector and vector-vector now all in one
   \item OoO may choose to push ADDs into instr. queue (v. busy!)
 \end{itemize}
\end{frame}
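
\begin{frame}[fragile]
\frametitle{Sketch: ADD as a runnable model (illustrative only)}
\scriptsize
A Python rendering of the ADD pseudocode, default element width only.
The argument layout (dicts for \texttt{vectorlen} and \texttt{is\_vec})
and treating \texttt{predr=None} as ``no predication'' are assumptions.
\begin{verbatim}
def op_add(ireg, vectorlen, is_vec, VL, rd, rs1, rs2, predr=None):
    idx_d = idx_1 = idx_2 = 0
    for i in range(min(VL, vectorlen[rd])):
        if predr is None or (ireg[predr] >> i) & 1:
            ireg[rd + idx_d] = ireg[rs1 + idx_1] + ireg[rs2 + idx_2]
        if is_vec[rd]:
            idx_d += 1
        if is_vec[rs1]:
            idx_1 += 1
        if is_vec[rs2]:
            idx_2 += 1

# vector (x12..x15) plus scalar (x5) into vector (x8..x11)
ireg = [0] * 32
ireg[5] = 100
ireg[12:16] = [1, 2, 3, 4]
vectorlen = {8: 4, 12: 4, 5: 1}
is_vec = {8: True, 12: True, 5: False}
op_add(ireg, vectorlen, is_vec, 4, rd=8, rs1=12, rs2=5)
# x8..x11 now hold 101, 102, 103, 104
\end{verbatim}
\end{frame}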
\begin{frame}[fragile]
\frametitle{Predication-Branch (or trap, or actual hardware loop)}

\begin{semiverbatim}
s1 = vectorlen[src1] > 1;
s2 = vectorlen[src2] > 1;
for (int i = 0; i < VL; ++i)
   // cmp returns 0 or 1: element i's result goes into bit i
   preg[rs3] |= cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]) << i;
\end{semiverbatim}

 \begin{itemize}
   \item SIMD slightly more complex (case above is elwidth = default)
   \item If s1 and s2 both scalars, Standard branch occurs
   \item Predication stored in integer regfile as a bitfield
   \item Scalar-vector and vector-vector supported
 \end{itemize}
\end{frame}
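
\begin{frame}[fragile]
\frametitle{Sketch: compare into a predicate (illustrative only)}
\scriptsize
A minimal Python model of the predication-branch loop, assuming that
element i's comparison result lands in bit i of the predicate. Unlike
the pseudocode it returns the mask rather than OR-ing it into a
register; the function name and arguments are hypothetical.
\begin{verbatim}
def vec_compare(reg, vectorlen, VL, src1, src2, cmp):
    """Build a predicate bitmask, one result bit per element."""
    s1 = vectorlen[src1] > 1
    s2 = vectorlen[src2] > 1
    mask = 0
    for i in range(VL):
        a = reg[src1 + i] if s1 else reg[src1]
        b = reg[src2 + i] if s2 else reg[src2]
        if cmp(a, b):
            mask |= 1 << i
    return mask

# vector x8..x11 compared against scalar x5 ("less than")
reg = [0] * 32
reg[8:12] = [5, 1, 7, 3]
reg[5] = 4
mask = vec_compare(reg, {8: 4, 5: 1}, 4, 8, 5, lambda a, b: a < b)
# mask == 0b1010: elements 1 and 3 are below 4
\end{verbatim}
\end{frame}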
\begin{frame}[fragile]
\frametitle{LD/LD.S/LD.X (or trap, or actual hardware loop)}

\begin{semiverbatim}
if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided
for (int i = 0; i < VL; ++i)
  // no predication enabled: every element is loaded
  if (!preg_enabled[rd] || ([!]preg[rd] & 1<<i))
    for (int j = 0; j < seglen+1; j++) \{
      if (vectorised[rs2]) offs = vreg[rs2][i];
      else offs = i*(seglen+1)*stride;
      vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
    \}
\end{semiverbatim}

 \begin{itemize}
   \item Again: SIMD slightly more complex
   \item rs2 vectorised taken to implicitly indicate LD.X
 \end{itemize}
\end{frame}
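
\begin{frame}[fragile]
\frametitle{Sketch: strided / indexed load (illustrative only)}
\scriptsize
A minimal Python model of the load loop, with memory as a plain list
and addresses counted in elements rather than bytes. The function name
and arguments are assumptions; passing \texttt{offsets} stands in for a
vectorised rs2 (LD.X).
\begin{verbatim}
def vec_load(mem, base, VL, stride, seglen=0, offsets=None, pred=None):
    """Return dest[j][i]: field j of element i (seglen+1 fields)."""
    dest = [[None] * VL for _ in range(seglen + 1)]
    for i in range(VL):
        if pred is not None and not (pred >> i) & 1:
            continue                  # masked-out: skipped, not zeroed
        for j in range(seglen + 1):
            if offsets is not None:
                offs = offsets[i]     # indexed
            else:
                offs = i * (seglen + 1) * stride
            dest[j][i] = mem[base + offs + j * stride]
    return dest

mem = list(range(100, 120))
unit    = vec_load(mem, 2, 4, 1)                   # unit stride
strided = vec_load(mem, 0, 3, 4)                   # every 4th element
indexed = vec_load(mem, 0, 3, 1, offsets=[7, 0, 3])
\end{verbatim}
\end{frame}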
\frame{\frametitle{C.MV extremely flexible!}
 \begin{itemize}
   \item scalar-to-vector (w/no pred): VSPLAT
   \item scalar-to-vector (w/dest-pred): Sparse VSPLAT
   \item scalar-to-vector (w/single dest-pred): VINSERT
   \item vector-to-scalar (w/src-pred): VEXTRACT
   \item vector-to-vector (w/no pred): Vector Copy
   \item vector-to-vector (w/src xor dest pred): Sparse Vector Copy
   \item vector-to-vector (w/src and dest pred): Vector Shuffle
   \item Really powerful!
   \item Any other options?
 \end{itemize}
}
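
\begin{frame}[fragile]
\frametitle{Sketch: twin-predicated C.MV (illustrative only)}
\scriptsize
One plausible reading of the table above as Python, not a normative
definition. The skip-ahead behaviour on masked-out elements and all
names (\texttt{c\_mv}, \texttt{rd\_vec}, \texttt{rs\_vec}, \texttt{pd},
\texttt{ps}) are assumptions.
\begin{verbatim}
def c_mv(reg, VL, rd, rs, rd_vec, rs_vec, pd=None, ps=None):
    """Twin-predicated move. pd/ps: bitmasks, None = all enabled."""
    full = (1 << VL) - 1
    pd = full if pd is None else pd
    ps = full if ps is None else ps
    i = j = 0                  # source / destination element index
    while i < VL and j < VL:
        if rs_vec and not (ps >> i) & 1:
            i += 1             # skip masked-out source element
            continue
        if rd_vec and not (pd >> j) & 1:
            j += 1             # skip masked-out destination element
            continue
        reg[rd + j] = reg[rs + i]
        if rs_vec:
            i += 1
        if not rd_vec:
            break              # scalar destination: a single move
        j += 1

reg = list(range(32))          # reg[k] == k, for easy checking
c_mv(reg, 4, rd=8, rs=5, rd_vec=True, rs_vec=False)   # VSPLAT
# reg[8:12] is now [5, 5, 5, 5]
\end{verbatim}
\end{frame}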
\frame{\frametitle{Opcodes, compared to RVV}
 \begin{itemize}
   \item All integer and FP opcodes removed (no CLIP!)\vspace{8pt}
   \item VMPOP, VFIRST etc. all removed (use xBitManip)\vspace{8pt}
   \item VSLIDE removed (use regfile overlaps)\vspace{8pt}
   \item C.MV covers VEXTRACT, VINSERT and VSPLAT (and more)\vspace{8pt}
   \item VSETVL, VGETVL, VSELECT stay\vspace{8pt}
   \item Issue: VCLIP is not in RV* (add with custom ext?)\vspace{8pt}
   \item Vector (or scalar-vector) moves use C.MV (MV is a pseudo-op)\vspace{8pt}
   \item VMERGE: twin predicated C.MVs (one inverted, macro-op'd)\vspace{8pt}
 \end{itemize}
}
\frame{\frametitle{Under consideration}
 \begin{itemize}
   \item Can VSELECT be removed? (it's really complex)\vspace{10pt}
   \item Can CLIP be done as a CSR (mode, like elwidth)?\vspace{10pt}
   \item SIMD saturation (etc.) also set as a mode?\vspace{10pt}
   \item C.MV src predication no different from dest predication\\
         What to do? Make one have a different meaning?\vspace{10pt}
   \item 8/16-bit ops: is it worthwhile adding a ``start offset''? \\
         (a bit like misaligned addressing... for registers)\\
         or just use predication to skip start?\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{Summary}
 \begin{itemize}
   \item Designed for simplicity (graded levels of complexity)\vspace{10pt}
   \item Fits RISC-V ethos: do more with less\vspace{10pt}
   \item Reduces SIMD ISA proliferation by 3-4 orders of magnitude \\
         (without SIMD downsides or sacrificing speed trade-off)\vspace{10pt}
   \item Covers 98\% of RVV, allows RVV to fit ``on top''\vspace{10pt}
   \item Huge range of implementor freedom and flexibility\vspace{10pt}
   \item Not designed for supercomputing (that's RVV), designed for
         in between: DSPs, RV32E, Embedded 3D GPUs etc.\vspace{10pt}
 \end{itemize}
}