\documentclass[slidestop]{beamer}
\usepackage{beamerthemesplit}
\title{Simple-V RISC-V Extension for Vectorisation and SIMD}
\author{Luke Kenneth Casson Leighton}
\huge{Simple-V RISC-V Extension for Vectors and SIMD}\\
\Large{Flexible Vectorisation}\\
\Large{(aka not so Simple-V?)}\\
\Large{Chennai 9th RISC-V Workshop}\\
\frame{\frametitle{Why another Vector Extension?}

 \begin{itemize}
   \item RVV very heavy-duty (excellent for supercomputing)\vspace{10pt}
   \item Simple-V abstracts parallelism (based on best of RVV)\vspace{10pt}
   \item Graded levels: hardware or software-emulation\vspace{10pt}
   \item Even Compressed instructions become vectorised\vspace{10pt}
 \end{itemize}
 What Simple-V is not:\vspace{10pt}
 \begin{itemize}
   \item A full supercomputer-level Vector Proposal\vspace{10pt}
   \item A replacement for RVV (designed to be augmented)\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{Quick refresher on SIMD}

 \begin{itemize}
   \item SIMD very easy to implement (and very seductive)\vspace{10pt}
   \item Parallelism is in the ALU\vspace{10pt}
   \item Zero-to-negligible impact on rest of core\vspace{10pt}
 \end{itemize}
 Where SIMD Goes Wrong:\vspace{10pt}
 \begin{itemize}
   \item See ``SIMD Instructions Considered Harmful''\vspace{10pt}
   \item (Corner-cases alone are extremely complex)\vspace{10pt}
   \item O($N^{6}$) ISA opcode proliferation!\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{Quick refresher on RVV}

 \begin{itemize}
   \item Extremely powerful (extensible to 256 registers)\vspace{10pt}
   \item Supports polymorphism, several datatypes (inc. FP16)\vspace{10pt}
   \item Requires a separate Register File\vspace{10pt}
   \item Can be implemented as a separate pipeline\vspace{10pt}
 \end{itemize}
 However...\vspace{10pt}
 \begin{itemize}
   \item 98 percent opcode duplication with rest of RV (CLIP)\vspace{10pt}
   \item Extending RVV requires customisation\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{How is Parallelism abstracted?}

 \begin{itemize}
   \item Almost all opcodes removed in favour of implicit ``typing''\vspace{10pt}
   \item Primarily at the Instruction issue phase (except SIMD)\vspace{10pt}
   \item Standard (and future, and custom) opcodes now parallel\vspace{10pt}
 \end{itemize}

 \begin{itemize}
   \item LOAD/STORE (inc. C.LD and C.ST, LDX: everything)\vspace{10pt}
   \item All ALU ops (soft / hybrid / full HW, on per-op basis)\vspace{10pt}
   \item All branches become predication targets (C.FNE added)\vspace{10pt}
 \end{itemize}
}
\begin{frame}[fragile]
\frametitle{ADD pseudocode (or trap, or actual hardware loop)}

\begin{semiverbatim}
function op_add(rd, rs1, rs2, predr) # add not VADD!
  int i, id=0, irs1=0, irs2=0;
  for (i=0; i < MIN(VL, vectorlen[rd]); i++)
    if (ireg[predr] & 1<<i) # predication uses intregs
       ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
    if (reg_is_vectorised[rd]) \{ id += 1; \}
    if (reg_is_vectorised[rs1]) \{ irs1 += 1; \}
    if (reg_is_vectorised[rs2]) \{ irs2 += 1; \}
\end{semiverbatim}

 \begin{itemize}
   \item SIMD slightly more complex (case above is elwidth = default)
   \item Scalar-scalar and scalar-vector and vector-vector now all in one
   \item OoO may choose to push ADDs into instr. queue (v. busy!)
 \end{itemize}
\end{frame}
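
\begin{frame}[fragile]
\frametitle{ADD pseudocode as a runnable sketch}

\footnotesize
A minimal Python model of the ADD loop, offered as a sketch and not as a
reference implementation. Assumptions not in the proposal: XLEN of 64,
wrap-around on overflow, and CSR state (VL, vectorlen,
reg\_is\_vectorised) passed as plain arguments.

\begin{verbatim}
XLEN = 64  # assumption: 64-bit integer registers

def op_add(ireg, rd, rs1, rs2, predr, VL, vectorlen, reg_is_vectorised):
    """Predicated ADD on the integer register file (modified in place)."""
    mask = (1 << XLEN) - 1
    id_ = irs1 = irs2 = 0
    for i in range(min(VL, vectorlen[rd])):
        if ireg[predr] & (1 << i):      # predicate bit i lives in an int reg
            ireg[rd + id_] = (ireg[rs1 + irs1] + ireg[rs2 + irs2]) & mask
        # only registers marked as vectors step on to the next element
        if reg_is_vectorised[rd]:
            id_ += 1
        if reg_is_vectorised[rs1]:
            irs1 += 1
        if reg_is_vectorised[rs2]:
            irs2 += 1
    return ireg

# Hypothetical set-up: x16 = x8 (vector) + x5 (scalar), predicate in x3
ireg = [0] * 32
ireg[8:12] = [1, 2, 3, 4]
ireg[5] = 100
ireg[16:20] = [7, 7, 7, 7]            # old destination contents
ireg[3] = 0b1011                      # element 2 is masked out
vectorlen = [1] * 32
vectorlen[16] = 4
is_vec = [False] * 32
is_vec[16] = is_vec[8] = True
op_add(ireg, 16, 8, 5, 3, 4, vectorlen, is_vec)
# ireg[16:20] is now [101, 102, 7, 104]: element 2 was skipped, not zeroed
\end{verbatim}
\end{frame}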
\frame{\frametitle{How are SIMD Instructions Vectorised?}

 \begin{itemize}
   \item SIMD ALU(s) primarily unchanged\vspace{10pt}
   \item Predication is added to each SIMD element (NO ZEROING!)\vspace{10pt}
   \item End of Vector enables predication (NO ZEROING!)\vspace{10pt}
 \end{itemize}
 Considerations:\vspace{10pt}
 \begin{itemize}
   \item Many SIMD ALUs possible (parallel execution)\vspace{10pt}
   \item Very long SIMD ALUs could waste die area (short vectors)\vspace{10pt}
   \item Implementor free to choose (API remains the same)\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{What's the deal / juice / score?}

 \begin{itemize}
   \item Standard Register File(s) overloaded with ``vector span''\vspace{10pt}
   \item Element width and type concepts remain same as RVV\vspace{10pt}
   \item CSRs are key-value tables (overlaps allowed)\vspace{10pt}
 \end{itemize}
 Key differences from RVV:\vspace{10pt}
 \begin{itemize}
   \item Predication in INT regs as a BIT field (max VL=XLEN)\vspace{10pt}
   \item Minimum VL must be Num Regs - 1 (all regs single LD/ST)\vspace{10pt}
   \item NO ZEROING: non-predicated elements are skipped\vspace{10pt}
 \end{itemize}
}
\frame{\frametitle{Why are overlaps allowed in Regfiles?}

 \begin{itemize}
   \item Same register(s) can have multiple ``interpretations''\vspace{10pt}
   \item xBitManip plus SIMD plus xBitManip = Hi/Lo bitops\vspace{10pt}
   \item (32-bit GREV plus 4-wide 32-bit SIMD plus 32-bit GREVI)\vspace{10pt}
   \item 32-bit op followed by 16-bit op w/ 2x VL, 1/2 predicated\vspace{10pt}
 \end{itemize}

 \begin{itemize}
   \item xBitManip reduces O($N^{6}$) SIMD down to O($N^{3}$)\vspace{10pt}
   \item Hi-Performance: Macro-op fusion (more pipeline stages?)\vspace{10pt}
 \end{itemize}
}
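
\begin{frame}[fragile]
\frametitle{Overlapping interpretations: a sketch}

\footnotesize
An illustrative Python sketch of one register span read at two element
widths: a 32-bit view with VL=2, then a 16-bit view with 2x VL and half
the elements predicated. The little-endian packing, the helper names and
the use of bitwise NOT as the stand-in operation are assumptions, not
part of the proposal.

\begin{verbatim}
import struct

FMT = {16: "H", 32: "I"}  # struct codes for unsigned 16/32-bit elements

def pack(elems, elwidth):
    """Pack elements into raw register bytes (little-endian assumed)."""
    return struct.pack("<%d%s" % (len(elems), FMT[elwidth]), *elems)

def unpack(raw, elwidth):
    """Reinterpret the same raw bytes at a different element width."""
    count = len(raw) * 8 // elwidth
    return list(struct.unpack("<%d%s" % (count, FMT[elwidth]), raw))

raw = pack([0x11112222, 0x33334444], 32)   # VL=2, 32-bit elements
halves = unpack(raw, 16)                   # same bits, VL=4, 16-bit
# halves is [0x2222, 0x1111, 0x4444, 0x3333]

pred = 0b0101                              # half predicated: low halves
halves = [(~h & 0xFFFF) if pred & (1 << i) else h
          for i, h in enumerate(halves)]
result = unpack(pack(halves, 16), 32)
# result is [0x1111DDDD, 0x3333BBBB]: only the low halves changed
\end{verbatim}
\end{frame}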
\frame{\frametitle{Why no Zeroing (place zeros in non-predicated elements)?}

 \begin{itemize}
   \item Zeroing is an implementation optimisation favouring OoO\vspace{10pt}
   \item Simple implementations may skip non-predicated operations\vspace{10pt}
   \item Simple implementations explicitly have to destroy data\vspace{10pt}
   \item Complex implementations may use reg-renames to save power\vspace{10pt}
 \end{itemize}
 Considerations:\vspace{10pt}
 \begin{itemize}
   \item Complex not really impacted, Simple impacted a LOT\vspace{10pt}
   \item Please don't use Vectors for ``security'' (use Sec-Ext)\vspace{10pt}
 \end{itemize}
}
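
\begin{frame}[fragile]
\frametitle{Skipping versus zeroing: a sketch}

\footnotesize
A small Python comparison of the two policies, as a sketch under stated
assumptions. The function name and its \texttt{zeroing} flag are
hypothetical and exist only to contrast the behaviours; Simple-V as
described here corresponds to \texttt{zeroing=False}.

\begin{verbatim}
def predicated_write(dest, results, pred, zeroing=False):
    """Return dest after writing results under a predicate bit mask."""
    out = list(dest)
    for i, value in enumerate(results):
        if pred & (1 << i):
            out[i] = value   # active element: result is written
        elif zeroing:
            out[i] = 0       # zeroing: old data explicitly destroyed
        # otherwise (no zeroing): masked-off element is simply skipped
    return out

old = [7, 7, 7, 7]
new = [101, 102, 103, 104]
skipped = predicated_write(old, new, 0b1011)
zeroed = predicated_write(old, new, 0b1011, zeroing=True)
# skipped is [101, 102, 7, 104]
# zeroed  is [101, 102, 0, 104]
\end{verbatim}

With zeroing, a simple in-order implementation performs an extra write
for every masked-off element; with skipping it does no work for them.
\end{frame}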