X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension%2Fsimple_v_chennai_2018.tex;h=902f7cc22a4ca9583b362fa488bdd1fff3d17c54;hb=324e009dd199361d5c84e6a15dca1b1be24a6b94;hp=7392dfe9e4a1802ce1045af220004b2f9b98099d;hpb=b9a2952b137401f333400442756d14ffb8b4884b;p=libreriscv.git diff --git a/simple_v_extension/simple_v_chennai_2018.tex b/simple_v_extension/simple_v_chennai_2018.tex index 7392dfe9e..902f7cc22 100644 --- a/simple_v_extension/simple_v_chennai_2018.tex +++ b/simple_v_extension/simple_v_chennai_2018.tex @@ -11,11 +11,11 @@ \frame{ \begin{center} - \huge{Simple-V RISC-V Extension for Vectors and SIMD}\\ + \huge{Simple-V RISC-V Parallelism Abstraction Extension}\\ \vspace{32pt} \Large{Flexible Vectorisation}\\ \Large{(aka not so Simple-V?)}\\ - \Large{(aka How to Parallelise the RISC-V ISA)}\\ + \Large{(aka A Parallelism API for the RISC-V ISA)}\\ \vspace{24pt} \Large{[proposed for] Chennai 9th RISC-V Workshop}\\ \vspace{16pt} @@ -50,7 +50,7 @@ https://sigarch.org/simd-instructions-considered-harmful \item Setup and corner-cases alone are extremely complex.\\ Hardware is easy, but software is hell. - \item O($N^{6}$) ISA opcode proliferation!\\ + \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\ opcode, elwidth, veclen, src1-src2-dest hi/lo \end{itemize} } @@ -58,14 +58,17 @@ \frame{\frametitle{Quick refresher on RVV} \begin{itemize} - \item Extremely powerful (extensible to 256 registers)\vspace{10pt} - \item Supports polymorphism, several datatypes (inc. FP16)\vspace{10pt} - \item Requires a separate Register File (16 w/ext to 256)\vspace{10pt} - \item Implemented as a separate pipeline (no impact on scalar)\vspace{10pt} - \end{itemize} - However...\vspace{10pt} + \item Effectively a variant of SIMD / SIMT (arbitrary length)\vspace{4pt} + \item Fascinatingly, despite being a SIMD-variant, RVV only has + O(N) opcode proliferation! 
(extremely well designed)
+	\item Extremely powerful (extensible to 256 registers)\vspace{4pt}
+	\item Supports polymorphism, several datatypes (inc. FP16)\vspace{4pt}
+	\item Requires a separate Register File (32 w/ext to 256)\vspace{4pt}
+	\item Implemented as a separate pipeline (no impact on scalar)
+	\end{itemize}
+	However...
 	\begin{itemize}
-	\item 98 percent opcode duplication with rest of RV (CLIP)
+	\item 98 percent opcode duplication with rest of RV
 	\item Extending RVV requires customisation not just of h/w:\\
 	      gcc, binutils also need customisation (and maintenance)
 	\end{itemize}
}
@@ -78,19 +81,21 @@
 	\item Why? Implementors need flexibility in vectorisation to optimise for
 	      area or performance depending on the scope:
-	      embedded DSP, Mobile GPU's, Server CPU's and more.\vspace{4pt}\\
+	      embedded DSP, Mobile GPUs, Server CPUs and more.\\
 	      Compilers also need flexibility in vectorisation to optimise for
 	      cost of pipeline setup, amount of state to context switch
-	      and software portability\vspace{4pt}
+	      and software portability
 	\item How? By marking INT/FP regs as "Vectorised" and
 	      adding a level of indirection,
 	      SV expresses how existing instructions should act
-	      on [contiguous] blocks of registers, in parallel.\vspace{4pt}
+	      on [contiguous] blocks of registers, in parallel, WITHOUT
+	      needing any new arithmetic opcodes.
 	\item What? Simple-V is an "API" that implicitly extends
 	      existing (scalar) instructions with explicit parallelisation\\
-	      (i.e. SV is actually about parallelism NOT vectors per se)
+	      i.e. SV is actually about parallelism NOT vectors per se.\\
+	      It has a lot in common with VLIW (without the actual VLIW).
 	\end{itemize}
}

@@ -101,15 +106,15 @@
 	\item memcpy becomes much smaller (higher bang-per-buck)
 	\item context-switch (LOAD/STORE multiple): 1-2 instructions
 	\item Compressed instrs further reduces I-cache (etc.)
-	\item Greatly-reduced I-cache load (and less reads)
-	\item Amazingly, SIMD becomes (more) tolerable\\
-	      (corner-cases for setup and teardown are gone)
+	\item Reduced I-cache load (and fewer I-reads)
+	\item Amazingly, SIMD becomes tolerable (no corner-cases)
 	\item Modularity/Abstraction in both the h/w and the toolchain.
+	\item "Reach" of registers accessible by Compressed is enhanced
+	\item Future: double the standard INT/FP register file sizes.
 	\end{itemize}
 	Note:
 	\begin{itemize}
 	\item It's not just about Vectors: it's about instruction effectiveness
-	\item Anything that makes SIMD tolerable has to be a good thing
 	\item Anything implementor is not interested in HW-optimising,\\
 	      let it fall through to exceptions (implement as a trap).
 	\end{itemize}
}

@@ -119,14 +124,16 @@
 \frame{\frametitle{How does Simple-V relate to RVV? What's different?}

 	\begin{itemize}
-	\item RVV very heavy-duty (excellent for supercomputing)\vspace{10pt}
-	\item Simple-V abstracts parallelism (based on best of RVV)\vspace{10pt}
-	\item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{10pt}
-	\item Even Compressed become vectorised (RVV can't)\vspace{10pt}
+	\item RVV very heavy-duty (excellent for supercomputing)\vspace{4pt}
+	\item Simple-V abstracts parallelism (based on best of RVV)\vspace{4pt}
+	\item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{4pt}
+	\item Even Compressed become vectorised (RVV can't)\vspace{4pt}
+	\item No polymorphism in SV (too complex)\vspace{4pt}
 	\end{itemize}
-	What Simple-V is not:\vspace{10pt}
+	What Simple-V is not:\vspace{4pt}
 	\begin{itemize}
-	\item A full supercomputer-level Vector Proposal
+	\item A full supercomputer-level Vector Proposal\\
+	      (it's not actually a Vector Proposal at all!)
 	\item A replacement for RVV (SV is designed to be over-ridden\\
 	      by - or augmented to become - RVV)
 	\end{itemize}
}

@@ -140,112 +147,38 @@
 	      registers are reinterpreted through a level of indirection
 	\item Primarily at the Instruction issue phase (except SIMD)\\
 	      Note: it's ok to pass predication through to ALU (like SIMD)
-	\item Standard (and future, and custom) opcodes now parallel\vspace{10pt}
+	\item Standard, future and custom opcodes now parallel\\
+	      (crucially: with NO extra instructions needing to be added)
 	\end{itemize}
-	Note: EVERYTHING is parallelised:
+	Note: EVERY scalar op now parallelisable
 	\begin{itemize}
 	\item All LOAD/STORE (inc. Compressed, Int/FP versions)
-	\item All ALU ops (soft / hybrid / full HW, on per-op basis)
-	\item All branches become predication targets (C.FNE added?)
+	\item All ALU ops (Int, FP, SIMD, DSP, everything)
+	\item All branches become predication targets (note: no FNE)
 	\item C.MV of particular interest (s/v, v/v, v/s)
 	\item FCVT, FMV, FSGNJ etc. very similar to C.MV
 	\end{itemize}
}

-\frame{\frametitle{Implementation Options}
-
- 	\begin{itemize}
-	\item Absolute minimum: Exceptions (if CSRs indicate "V", trap)
-	\item Hardware loop, single-instruction issue\\
-	      (Do / Don't send through predication to ALU)
-	\item Hardware loop, parallel (multi-instruction) issue\\
-	      (Do / Don't send through predication to ALU)
-	\item Hardware loop, full parallel ALU (not recommended)
- 	\end{itemize}
- 	Notes:\vspace{6pt}
- 	\begin{itemize}
- 	\item 4 (or more?) options above may be deployed on per-op basis
- 	\item SIMD always sends predication bits through to ALU
- 	\item Minimum MVL MUST be sufficient to cover regfile LD/ST
- 	\item Instr. FIFO may repeatedly split off N scalar ops at a time
- 	\end{itemize}
-}
-% Instr. FIFO may need its own slide. Basically, the vectorised op
-% gets pushed into the FIFO, where it is then "processed". Processing
-% will remove the first set of ops from its vector numbering (taking
-% predication into account) and shoving them **BACK** into the FIFO,
-% but MODIFYING the remaining "vectorised" op, subtracting the now
-% scalar ops from it.
-
-\frame{\frametitle{Predicated 8-parallel ADD: 1-wide ALU}
- 	\begin{center}
-		\includegraphics[height=2.5in]{padd9_alu1.png}\\
-		{\bf \red Predicated adds are shuffled down: 6 cycles in total}
- 	\end{center}
-}
-
-
-\frame{\frametitle{Predicated 8-parallel ADD: 4-wide ALU}
- 	\begin{center}
-		\includegraphics[height=2.5in]{padd9_alu4.png}\\
-		{\bf \red Predicated adds are shuffled down: 4 in 1st cycle, 2 in 2nd}
- 	\end{center}
-}
-
-
-\frame{\frametitle{Predicated 8-parallel ADD: 3 phase FIFO expansion}
- 	\begin{center}
-		\includegraphics[height=2.5in]{padd9_fifo.png}\\
-		{\bf \red First cycle takes first four 1s; second takes the rest}
- 	\end{center}
-}
-
-
-\frame{\frametitle{How are SIMD Instructions Vectorised?}
-
- 	\begin{itemize}
-		\item SIMD ALU(s) primarily unchanged\vspace{6pt}
-		\item Predication is added to each SIMD element\vspace{6pt}
-		\item Predication bits sent in groups to the ALU\vspace{6pt}
-		\item End of Vector enables (additional) predication\vspace{10pt}
- 	\end{itemize}
- 	Considerations:\vspace{4pt}
- 	\begin{itemize}
- 	\item Many SIMD ALUs possible (parallel execution)
- 	\item Implementor free to choose (API remains the same)
- 	\item Unused ALU units wasted, but s/w DRASTICALLY simpler
- 	\item Very long SIMD ALUs could waste significant die area
- 	\end{itemize}
-}
-% With multiple SIMD ALUs at for example 32-bit wide they can be used
-% to either issue 64-bit or 128-bit or 256-bit wide SIMD operations
-% or they can be used to cover several operations on totally different
-% vectors / registers.
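The removed comment above describes ganging multiple 32-bit SIMD ALUs to form wider operations, with predication bits sent in groups to each ALU. A minimal Python sketch of that idea (the function names, the lane widths, and the "skip" predication mode are illustrative assumptions for this sketch, not text from the proposal):

```python
def simd_add8x4(a, b, pred4, old):
    """One 32-bit SIMD ALU: four 8-bit lanes, one predicate bit per lane.
    Lanes whose predicate bit is clear keep their previous value (skip mode)."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        av = (a >> shift) & 0xFF          # extract this ALU lane's operands
        bv = (b >> shift) & 0xFF
        ov = (old >> shift) & 0xFF
        res = (av + bv) & 0xFF if (pred4 >> lane) & 1 else ov
        out |= res << shift
    return out

def simd_add8_wide(a, b, pred, old, alus=2):
    """Gang `alus` 32-bit ALUs into one wider SIMD add (64-bit for alus=2):
    each ALU receives its own 4-bit group of the overall predicate."""
    out = 0
    for n in range(alus):
        shift = 32 * n
        out |= simd_add8x4((a >> shift) & 0xFFFFFFFF,
                           (b >> shift) & 0xFFFFFFFF,
                           (pred >> (4 * n)) & 0xF,
                           (old >> shift) & 0xFFFFFFFF) << shift
    return out
```

The same pair of 32-bit ALUs could equally be pointed at two unrelated vectors by passing different operands and predicate groups to each, which is the flexibility the comment is getting at.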
-
-\frame{\frametitle{Predicated 9-parallel SIMD ADD}
- 	\begin{center}
-		\includegraphics[height=2.5in]{padd9_simd.png}\\
-		{\bf \red 4-wide 8-bit SIMD, 4 bits of predicate passed to ALU}
- 	\end{center}
-}
-
-
 \frame{\frametitle{What's the deal / juice / score?}

 	\begin{itemize}
 	\item Standard Register File(s) overloaded with CSR "reg is vector"\\
 	      (see pseudocode slides for examples)
-	\item Element width (and type?) concepts remain same as RVV\\
-	      (CSRs give new size (and meaning?) to elements in registers)
-	\item CSRs are key-value tables (overlaps allowed)\vspace{10pt}
+	\item "2nd FP\&INT register bank" possibility, reserved for future\\
+	      (would allow standard regfiles to remain unmodified)
+	\item Element width concept remains the same as RVV\\
+	      (CSRs give new size: overrides opcode-defined meaning)
+	\item CSRs are key-value tables (overlaps allowed: v. important)
 	\end{itemize}
-	Key differences from RVV:\vspace{10pt}
+	Key differences from RVV:
 	\begin{itemize}
-	\item Predication in INT regs as a BIT field (max VL=XLEN)
+	\item Predication in INT reg as a BIT field (max VL=XLEN)
 	\item Minimum VL must be Num Regs - 1 (all regs single LD/ST)
-	\item SV may condense sparse Vecs: RVV lets ALU do predication
-	\item Choice to Zero or skip non-predicated elements
+	\item SV may condense sparse Vecs: RVV cannot (SIMD-like):\\
+	      SV gives choice to Zero or skip non-predicated elements\\
+	      (no such choice in RVV: zeroing-only)
 	\end{itemize}
}

@@ -259,9 +192,9 @@
 function op\_add(rd, rs1, rs2, predr) # add not VADD!
  for (i = 0; i < VL; i++)
   if (ireg[predr] & 1<
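The op\_add pseudocode above (truncated at the chunk boundary) describes a predicated hardware loop over the standard integer register file. A runnable Python sketch of the same loop, for the simplest case where rd/rs1/rs2 are all marked vectorised with unit stride; the `ireg` list model and the `zeroing` flag (modelling the Zero-vs-skip choice mentioned on the previous slide) are illustrative assumptions:

```python
def op_add(ireg, rd, rs1, rs2, predr, VL, zeroing=False):
    """Scalar ADD reinterpreted as a predicated loop: rd/rs1/rs2 are the
    bases of contiguous register blocks, and the predicate is a plain
    bitfield held in ordinary integer register `predr` (so max VL = XLEN)."""
    for i in range(VL):
        if (ireg[predr] >> i) & 1:
            ireg[rd + i] = ireg[rs1 + i] + ireg[rs2 + i]
        elif zeroing:
            ireg[rd + i] = 0   # "Zero" choice; the default here is "skip"
    return ireg
```

For example, with VL=3 and a predicate of 0b101, elements 0 and 2 of the destination block are computed while element 1 is skipped (or zeroed, depending on the chosen mode).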