X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension%2Fsimple_v_chennai_2018.tex;h=3e6e47d7cf86681ca0c66b4f955fab37ec6bf574;hb=d5665bf2d20d08cbcb96177297e8f4a2f52d9c56;hp=873939d007ba7618103a42a9f622e438c8f1207a;hpb=a785610856f2858aadc3cf527612bdafcb7bccc8;p=libreriscv.git

diff --git a/simple_v_extension/simple_v_chennai_2018.tex b/simple_v_extension/simple_v_chennai_2018.tex
index 873939d00..3e6e47d7c 100644
--- a/simple_v_extension/simple_v_chennai_2018.tex
+++ b/simple_v_extension/simple_v_chennai_2018.tex
@@ -11,13 +11,14 @@
 \frame{
    \begin{center}
-    \huge{Simple-V RISC-V Extension for Vectors and SIMD}\\
+    \huge{Simple-V RISC-V Parallelism Abstraction Extension}\\
     \vspace{32pt}
     \Large{Flexible Vectorisation}\\
     \Large{(aka not so Simple-V?)}\\
+    \Large{(aka A Parallelism API for the RISC-V ISA)}\\
     \vspace{24pt}
     \Large{[proposed for] Chennai 9th RISC-V Workshop}\\
-    \vspace{24pt}
+    \vspace{16pt}
     \large{\today}
    \end{center}
 }
@@ -39,17 +40,17 @@
 \frame{\frametitle{Quick refresher on SIMD}

   \begin{itemize}
-   \item SIMD very easy to implement (and very seductive)\vspace{10pt}
-   \item Parallelism is in the ALU\vspace{10pt}
-   \item Zero-to-Negligeable impact for rest of core\vspace{10pt}
+   \item SIMD very easy to implement (and very seductive)\vspace{8pt}
+   \item Parallelism is in the ALU\vspace{8pt}
+   \item Zero-to-Negligible impact for rest of core\vspace{8pt}
   \end{itemize}
   Where SIMD Goes Wrong:\vspace{10pt}
   \begin{itemize}
    \item See "SIMD instructions considered harmful"
-         https://www.sigarch.org/simd-instructions-considered-harmful
-   \item Corner-cases alone are extremely complex.\\
+         https://sigarch.org/simd-instructions-considered-harmful
+   \item Setup and corner-cases alone are extremely complex.\\
          Hardware is easy, but software is hell.
-   \item O($N^{6}$) ISA opcode proliferation!\\
+   \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\
          opcode, elwidth, veclen, src1-src2-dest hi/lo
   \end{itemize}
 }
@@ -57,16 +58,19 @@
 \frame{\frametitle{Quick refresher on RVV}

   \begin{itemize}
-   \item Extremely powerful (extensible to 256 registers)\vspace{10pt}
-   \item Supports polymorphism, several datatypes (inc. FP16)\vspace{10pt}
-   \item Requires a separate Register File (32 w/ext to 256)\vspace{10pt}
-   \item Implemented as a separate pipeline (no impact on scalar)\vspace{10pt}
-  \end{itemize}
-  However...\vspace{10pt}
+   \item Effectively a variant of SIMD / SIMT (arbitrary length)\vspace{4pt}
+   \item Fascinatingly, despite being a SIMD-variant, RVV only has
+         O(N) opcode proliferation! (extremely well designed)
+   \item Extremely powerful (extensible to 256 registers)\vspace{4pt}
+   \item Supports polymorphism, several datatypes (inc. FP16)\vspace{4pt}
+   \item Requires a separate Register File (32 w/ext to 256)\vspace{4pt}
+   \item Implemented as a separate pipeline (no impact on scalar)
+  \end{itemize}
+  However...
   \begin{itemize}
-   \item 98 percent opcode duplication with rest of RV (CLIP)
+   \item 98 percent opcode duplication with rest of RV
    \item Extending RVV requires customisation not just of h/w:\\
-         gcc and s/w also need customisation (and maintenance)
+         gcc, binutils also need customisation (and maintenance)
   \end{itemize}
 }
@@ -77,17 +81,21 @@
    \item Why?
          Implementors need flexibility in vectorisation to optimise for
          area or performance depending on the scope:
-         embedded DSP, Mobile GPU's, Server CPU's and more.\vspace{4pt}\\
+         embedded DSP, Mobile GPUs, Server CPUs and more.\\
          Compilers also need flexibility in vectorisation to optimise for
          cost of pipeline setup, amount of state to context switch
-         and software portability\vspace{4pt}
+         and software portability
    \item How?
-         By implicitly marking INT/FP regs as "Vectorised",\\
+         By marking INT/FP regs as "Vectorised" and
+         adding a level of indirection,
          SV expresses how existing instructions should act
-         on [contiguous] blocks of registers, in parallel.\vspace{4pt}
+         on [contiguous] blocks of registers, in parallel, WITHOUT
+         needing any new extra arithmetic opcodes.
    \item What?
          Simple-V is an "API" that implicitly extends
-         existing (scalar) instructions with explicit parallelisation.
+         existing (scalar) instructions with explicit parallelisation\\
+         i.e. SV is actually about parallelism NOT vectors per se.\\
+         Has a lot in common with VLIW (without the actual VLIW).
   \end{itemize}
 }
@@ -95,12 +103,16 @@
 \frame{\frametitle{What's the value of SV? Why adopt it even in non-V?}

   \begin{itemize}
-   \item memcpy becomes much smaller (higher bang-per-buck)\vspace{10pt}
-   \item context-switch (LOAD/STORE multiple): 1-2 instructions\vspace{10pt}
-   \item Compressed instrs further reduces I-cache (etc.)\vspace{10pt}
-   \item greatly-reduced I-cache load (and less reads)\vspace{10pt}
-  \end{itemize}
-  Note:\vspace{10pt}
+   \item memcpy has a much higher bang-per-buck ratio
+   \item context-switch (LOAD/STORE multiple): 1-2 instructions
+   \item Compressed instrs further reduce I-cache (etc.)
+   \item Reduced I-cache load (and fewer I-reads)
+   \item Amazingly, SIMD becomes tolerable (no corner-cases)
+   \item Modularity/Abstraction in both the h/w and the toolchain.
+   \item "Reach" of registers accessible by Compressed is enhanced
+   \item Future: double the standard INT/FP register file sizes.
+  \end{itemize}
+  Note:
   \begin{itemize}
    \item It's not just about Vectors: it's about instruction effectiveness
    \item Anything implementor is not interested in HW-optimising,\\
@@ -109,19 +121,21 @@
 }

-\frame{\frametitle{How does Simple-V relate to RVV?}
+\frame{\frametitle{How does Simple-V relate to RVV?
+                    What's different?}
   \begin{itemize}
-   \item RVV very heavy-duty (excellent for supercomputing)\vspace{10pt}
-   \item Simple-V abstracts parallelism (based on best of RVV)\vspace{10pt}
-   \item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{10pt}
-   \item Even Compressed instructions become vectorised\vspace{10pt}
+   \item RVV very heavy-duty (excellent for supercomputing)\vspace{4pt}
+   \item Simple-V abstracts parallelism (based on best of RVV)\vspace{4pt}
+   \item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{4pt}
+   \item Even Compressed instrs become vectorised (RVV can't)\vspace{4pt}
+   \item No polymorphism in SV (too complex)\vspace{4pt}
   \end{itemize}
-  What Simple-V is not:\vspace{10pt}
+  What Simple-V is not:\vspace{4pt}
   \begin{itemize}
-   \item A full supercomputer-level Vector Proposal
+   \item A full supercomputer-level Vector Proposal\\
+         (it's not actually a Vector Proposal at all!)
    \item A replacement for RVV (SV is designed to be over-ridden\\
-         by - or augmented to become, or just be replaced by - RVV)
+         by - or augmented to become - RVV)
   \end{itemize}
 }
@@ -129,80 +143,42 @@
 \frame{\frametitle{How is Parallelism abstracted in Simple-V?}

   \begin{itemize}
-   \item Register "typing" turns any op into an implicit Vector op\vspace{10pt}
+   \item Register "typing" turns any op into an implicit Vector op:\\
+         registers are reinterpreted through a level of indirection
    \item Primarily at the Instruction issue phase (except SIMD)\\
          Note: it's ok to pass predication through to ALU (like SIMD)
-   \item Standard (and future, and custom) opcodes now parallel\vspace{10pt}
+   \item Standard and future and custom opcodes now parallel\\
+         (crucially: with NO extra instructions needing to be added)
   \end{itemize}
-  Notes:\vspace{6pt}
+  Note: EVERY scalar op now parallelisable
   \begin{itemize}
    \item All LOAD/STORE (inc.
 Compressed, Int/FP versions)
-   \item All ALU ops (soft / hybrid / full HW, on per-op basis)
-   \item All branches become predication targets (C.FNE added)
+   \item All ALU ops (Int, FP, SIMD, DSP, everything)
+   \item All branches become predication targets (note: no FNE)
    \item C.MV of particular interest (s/v, v/v, v/s)
+   \item FCVT, FMV, FSGNJ etc. very similar to C.MV
   \end{itemize}
 }

-\frame{\frametitle{Implementation Options}
-
-  \begin{itemize}
-   \item Absolute minimum: Exceptions (if CSRs indicate "V", trap)
-   \item Hardware loop, single-instruction issue\\
-         (Do / Don't send through predication to ALU)
-   \item Hardware loop, parallel (multi-instruction) issue\\
-         (Do / Don't send through predication to ALU)
-   \item Hardware loop, full parallel ALU (not recommended)
-  \end{itemize}
-  Notes:\vspace{6pt}
-  \begin{itemize}
-   \item 4 (or more?) options above may be deployed on per-op basis
-   \item SIMD always sends predication bits through to ALU
-   \item Minimum MVL MUST be sufficient to cover regfile LD/ST
-   \item Instr. FIFO may repeatedly split off N scalar ops at a time
-  \end{itemize}
-}
 % Instr. FIFO may need its own slide. Basically, the vectorised op
 % gets pushed into the FIFO, where it is then "processed". Processing
 % will remove the first set of ops from its vector numbering (taking
 % predication into account), shove them **BACK** into the FIFO, and
 % MODIFY the remaining "vectorised" op, subtracting the now-scalar
 % ops from it.
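The FIFO note in the comments above can be sketched in code. The following Python fragment is illustrative only: the op-tuple encoding, the `issue_width` limit and the `split_vector_op` name are assumptions made for this sketch, not anything defined by the slides. It peels predicated-in elements off the front of a vectorised op, pushes them back as scalar ops, and re-queues the shrunken vectorised op behind them, exactly in the spirit of the comment.

```python
from collections import deque

# Hypothetical encoding for this sketch: a vectorised op is the tuple
# (name, rd, rs1, rs2, start_element, vl); a scalar op is the same
# tuple with vl == 1 and element-adjusted register numbers.
def split_vector_op(fifo, issue_width, pred_bits):
    """Pop the head vectorised op, peel off up to issue_width scalar
    ops (skipping predicated-out elements), push the scalars to the
    front of the FIFO and re-queue the shrunken vector op behind them."""
    name, rd, rs1, rs2, start, vl = fifo.popleft()
    scalars = []
    elt = start
    while vl > 0 and len(scalars) < issue_width:
        if pred_bits & (1 << elt):  # predication bit set: issue element
            scalars.append((name, rd + elt, rs1 + elt, rs2 + elt, elt, 1))
        elt += 1
        vl -= 1                     # consumed from the vector numbering
    if vl > 0:                      # remainder stays vectorised
        fifo.appendleft((name, rd, rs1, rs2, elt, vl))
    for s in reversed(scalars):     # scalars go in front, in order
        fifo.appendleft(s)
    return scalars
```

Repeatedly calling this until the FIFO holds only scalar ops models the "Instr. FIFO may repeatedly split off N scalar ops at a time" option from the deleted slide.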
-
-\frame{\frametitle{How are SIMD Instructions Vectorised?}
-
-  \begin{itemize}
-   \item SIMD ALU(s) primarily unchanged\vspace{10pt}
-   \item Predication is added to each SIMD element (NO ZEROING!)\vspace{10pt}
-   \item End of Vector enables predication (NO ZEROING!)\vspace{10pt}
-  \end{itemize}
-  Considerations:\vspace{10pt}
-  \begin{itemize}
-   \item Many SIMD ALUs possible (parallel execution)\vspace{10pt}
-   \item Very long SIMD ALUs could waste die area (short vectors)\vspace{10pt}
-   \item Implementor free to choose (API remains the same)\vspace{10pt}
-  \end{itemize}
-}
 % With multiple SIMD ALUs at for example 32-bit wide they can be used
 % to either issue 64-bit or 128-bit or 256-bit wide SIMD operations
 % or they can be used to cover several operations on totally different
 % vectors / registers.
-
 \frame{\frametitle{What's the deal / juice / score?}

   \begin{itemize}
-   \item Standard Register File(s) overloaded with CSR "vector span"\\
+   \item Standard Register File(s) overloaded with CSR "reg is vector"\\
         (see pseudocode slides for examples)
-   \item Element width and type concepts remain same as RVV\\
-         (CSRs are used to "interpret" elements in registers)
-   \item CSRs are key-value tables (overlaps allowed)\vspace{10pt}
+   \item "2nd FP\&INT register bank" possibility, reserved for future\\
+         (would allow standard regfiles to remain unmodified)
+   \item Element width concept remains the same as RVV\\
+         (CSRs give new size: overrides opcode-defined meaning)
+   \item CSRs are key-value tables (overlaps allowed: v.
+         important)
   \end{itemize}
-  Key differences from RVV:\vspace{10pt}
+  Key differences from RVV:
   \begin{itemize}
-   \item Predication in INT regs as a BIT field (max VL=XLEN)
+   \item Predication in INT reg as a BIT field (max VL=XLEN)
    \item Minimum VL must be Num Regs - 1 (all regs single LD/ST)
-   \item SV may condense sparse Vecs: RVV lets ALU do predication
-   \item NO ZEROING: non-predicated elements are skipped
+   \item SV may condense sparse Vecs: RVV cannot (SIMD-like):\\
+         SV gives choice to Zero or skip non-predicated elements\\
+         (no such choice in RVV: zeroing-only)
   \end{itemize}
 }
@@ -211,17 +187,18 @@
  \frametitle{ADD pseudocode (or trap, or actual hardware loop)}

 \begin{semiverbatim}
-function op_add(rd, rs1, rs2, predr) # add not VADD!
+function op\_add(rd, rs1, rs2, predr) # add not VADD!
  int i, id=0, irs1=0, irs2=0;
  for (i = 0; i < VL; i++)
   if (ireg[predr] & 1<
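The op\_add pseudocode above is cut off mid-expression in this chunk. As a hedged companion, here is a runnable Python model of the same style of predicated hardware loop, under assumed semantics: the register file is a plain list, the `is_vec` flags stand in for the CSR "reg is vector" table, and the "skip" (rather than "zero") predication choice from the previous slide is shown. The names `ireg`, `predr` and the id/irs1/irs2 counters follow the pseudocode; everything else is invented for this sketch.

```python
def op_add(ireg, is_vec, rd, rs1, rs2, predr, vl):
    """Scalar ADD reinterpreted as a predicated hardware loop.
    ireg:   integer register file (list of ints)
    is_vec: per-register "Vectorised" flags (CSR-driven in real SV)
    predr:  register whose bits predicate each element
    vl:     vector length (VL)"""
    id_ = irs1 = irs2 = 0
    for i in range(vl):
        if ireg[predr] & (1 << i):   # predication is a bitfield in an INT reg
            ireg[rd + id_] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
        # skipped (non-predicated) destination elements are left untouched;
        # only registers marked "Vectorised" step to the next element,
        # which is what allows scalar/vector operand mixing
        if is_vec[rd]:
            id_ += 1
        if is_vec[rs1]:
            irs1 += 1
        if is_vec[rs2]:
            irs2 += 1
    return ireg
```

With a vectorised rs1 and a scalar rs2, the same scalar value is re-read for every element — the v/s mixing the C.MV slide alludes to. The zeroing alternative mentioned on the previous slide would instead write 0 to the destination element when the predicate bit is clear.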