X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension%2Fsimple_v_chennai_2018.tex;h=09019e99231641e898cb1a642f0f2277fbf2701a;hb=c66087216aba6428fea769f14ad7f994a846de62;hp=b72a5f61b996ea2c164f10019a76edc707a0d29f;hpb=b071dfd3614a8f5d91eee5a25c7bfcc1310b0ad9;p=libreriscv.git

diff --git a/simple_v_extension/simple_v_chennai_2018.tex b/simple_v_extension/simple_v_chennai_2018.tex
index b72a5f61b..09019e992 100644
--- a/simple_v_extension/simple_v_chennai_2018.tex
+++ b/simple_v_extension/simple_v_chennai_2018.tex
@@ -60,7 +60,7 @@
 \begin{itemize}
 \item Extremely powerful (extensible to 256 registers)\vspace{10pt}
 \item Supports polymorphism, several datatypes (inc. FP16)\vspace{10pt}
- \item Requires a separate Register File (16 w/ext to 256)\vspace{10pt}
+ \item Requires a separate Register File (32 w/ext to 256)\vspace{10pt}
 \item Implemented as a separate pipeline (no impact on scalar)\vspace{10pt}
 \end{itemize}
 However...\vspace{10pt}
@@ -78,19 +78,21 @@
 \item Why? Implementors need flexibility in vectorisation
 to optimise for area or performance depending on the scope:
- embedded DSP, Mobile GPU's, Server CPU's and more.\vspace{4pt}\\
+ embedded DSP, Mobile GPU's, Server CPU's and more.\\
 Compilers also need flexibility in vectorisation to optimise for cost
 of pipeline setup, amount of state to context switch
- and software portability\vspace{4pt}
+ and software portability
 \item How? By marking INT/FP regs as "Vectorised" and
 adding a level of indirection,
 SV expresses how existing instructions should act
- on [contiguous] blocks of registers, in parallel.\vspace{4pt}
+ on [contiguous] blocks of registers, in parallel, WITHOUT
+ needing any actual extra arithmetic opcodes.
 \item What? Simple-V is an "API" that implicitly extends
 existing (scalar) instructions with explicit parallelisation\\
- (i.e. SV is actually about parallelism NOT vectors per se)
+ i.e. SV is actually about parallelism NOT vectors per se.\\
+ Has a lot in common with VLIW (without the actual VLIW).
 \end{itemize}
}
@@ -102,14 +104,14 @@
 \item context-switch (LOAD/STORE multiple): 1-2 instructions
 \item Compressed instrs further reduces I-cache (etc.)
 \item Greatly-reduced I-cache load (and less reads)
- \item Amazingly, SIMD becomes (more) tolerable\\
- (corner-cases for setup and teardown are gone)
+ \item Amazingly, SIMD becomes (more) tolerable (no corner-cases)
 \item Modularity/Abstraction in both the h/w and the toolchain.
+ \item "Reach" of registers accessible by Compressed is enhanced
+ \item Future: double the standard INT/FP register file sizes.
 \end{itemize}
 Note:
 \begin{itemize}
 \item It's not just about Vectors: it's about instruction effectiveness
- \item Anything that makes SIMD tolerable has to be a good thing
 \item Anything implementor is not interested in HW-optimising,\\
 let it fall through to exceptions (implement as a trap).
 \end{itemize}
@@ -119,12 +121,13 @@
 \frame{\frametitle{How does Simple-V relate to RVV? What's different?}

 \begin{itemize}
- \item RVV very heavy-duty (excellent for supercomputing)\vspace{10pt}
- \item Simple-V abstracts parallelism (based on best of RVV)\vspace{10pt}
- \item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{10pt}
- \item Even Compressed become vectorised (RVV can't)\vspace{10pt}
+ \item RVV very heavy-duty (excellent for supercomputing)\vspace{8pt}
+ \item Simple-V abstracts parallelism (based on best of RVV)\vspace{8pt}
+ \item Graded levels: hardware, hybrid or traps (fit impl. need)\vspace{8pt}
+ \item Even Compressed become vectorised (RVV can't)\vspace{8pt}
+ \item No polymorphism in SV (too complex)\vspace{8pt}
 \end{itemize}
- What Simple-V is not:\vspace{10pt}
+ What Simple-V is not:\vspace{4pt}
 \begin{itemize}
 \item A full supercomputer-level Vector Proposal
 \item A replacement for RVV (SV is designed to be over-ridden\\
 by - or augmented to become - RVV)
 \end{itemize}
}
@@ -145,7 +148,7 @@
 Note: EVERYTHING is parallelised:
 \begin{itemize}
 \item All LOAD/STORE (inc. Compressed, Int/FP versions)
- \item All ALU ops (soft / hybrid / full HW, on per-op basis)
+ \item All ALU ops (Int, FP, SIMD, DSP, everything)
 \item All branches become predication targets (C.FNE added?)
 \item C.MV of particular interest (s/v, v/v, v/s)
 \item FCVT, FMV, FSGNJ etc. very similar to C.MV
 \end{itemize}
}
@@ -153,98 +156,20 @@
-\frame{\frametitle{Implementation Options}
-
- \begin{itemize}
- \item Absolute minimum: Exceptions: if CSRs indicate "V", trap.\\
- (Requires as absolute minimum that CSRs be in H/W)
- \item Hardware loop, single-instruction issue\\
- (Do / Don't send through predication to ALU)
- \item Hardware loop, parallel (multi-instruction) issue\\
- (Do / Don't send through predication to ALU)
- \item Hardware loop, full parallel ALU (not recommended)
- \end{itemize}
- Notes:\vspace{4pt}
- \begin{itemize}
- \item 4 (or more?) options above may be deployed on per-op basis
- \item SIMD always sends predication bits through to ALU
- \item Minimum MVL MUST be sufficient to cover regfile LD/ST
- \item Instr. FIFO may repeatedly split off N scalar ops at a time
- \end{itemize}
-}
% Instr. FIFO may need its own slide. Basically, the vectorised op
% gets pushed into the FIFO, where it is then "processed". Processing
% will remove the first set of ops from its vector numbering (taking
% predication into account) and shoving them **BACK** into the FIFO,
% but MODIFYING the remaining "vectorised" op, subtracting the now
% scalar ops from it.
-
-\frame{\frametitle{Predicated 8-parallel ADD: 1-wide ALU}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_alu1.png}\\
- {\bf \red Predicated adds are shuffled down: 6 cycles in total}
- \end{center}
-}
-
-
-\frame{\frametitle{Predicated 8-parallel ADD: 4-wide ALU}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_alu4.png}\\
- {\bf \red Predicated adds are shuffled down: 4 in 1st cycle, 2 in 2nd}
- \end{center}
-}
-
-
-\frame{\frametitle{Predicated 8-parallel ADD: 3 phase FIFO expansion}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_fifo.png}\\
- {\bf \red First cycle takes first four 1s; second takes the rest}
- \end{center}
-}
-
-
-\frame{\frametitle{How are SIMD Instructions Vectorised?}
-
- \begin{itemize}
- \item SIMD ALU(s) primarily unchanged\vspace{6pt}
- \item Predication is added to each SIMD element\vspace{6pt}
- \item Predication bits sent in groups to the ALU\vspace{6pt}
- \item End of Vector enables (additional) predication\\
- (completely nullifies need for end-case code)
- \end{itemize}
- Considerations:\vspace{4pt}
- \begin{itemize}
- \item Many SIMD ALUs possible (parallel execution)
- \item Implementor free to choose (API remains the same)
- \item Unused ALU units wasted, but s/w DRASTICALLY simpler
- \item Very long SIMD ALUs could waste significant die area
- \end{itemize}
-}
% With multiple SIMD ALUs at for example 32-bit wide they can be used
% to either issue 64-bit or 128-bit or 256-bit wide SIMD operations
% or they can be used to cover several operations on totally different
% vectors / registers.
-
-\frame{\frametitle{Predicated 9-parallel SIMD ADD}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_simd.png}\\
- {\bf \red 4-wide 8-bit SIMD, 4 bits of predicate passed to ALU}
- \end{center}
-}
-
-
 \frame{\frametitle{What's the deal / juice / score?}

 \begin{itemize}
 \item Standard Register File(s) overloaded with CSR "reg is vector"\\
 (see pseudocode slides for examples)
- \item Element width (and type?) concepts remain same as RVV\\
- (CSRs give new size (and meaning?) to elements in registers)
- \item CSRs are key-value tables (overlaps allowed)\vspace{10pt}
+ \item "2nd FP\&INT register bank" possibility, reserved for future\\
+ (would allow standard regfiles to remain unmodified)
+ \item Element width concept remains the same as RVV\\
+ (CSRs give new size to elements in registers)
+ \item CSRs are key-value tables (overlaps allowed: v. important)
 \end{itemize}
- Key differences from RVV:\vspace{10pt}
+ Key differences from RVV:
 \begin{itemize}
- \item Predication in INT regs as a BIT field (max VL=XLEN)
+ \item Predication in INT reg as a BIT field (max VL=XLEN)
 \item Minimum VL must be Num Regs - 1 (all regs single LD/ST)
 \item SV may condense sparse Vecs: RVV lets ALU do predication
 \item Choice to Zero or skip non-predicated elements
 \end{itemize}
}
@@ -297,6 +222,7 @@ for (int i = 0; i < VL; ++i)
 \item If s1 and s2 both scalars, Standard branch occurs
 \item Predication stored in integer regfile as a bitfield
 \item Scalar-vector and vector-vector supported
+ \item Overload Branch immediate to be predication target rs3
 \end{itemize}
\end{frame}
@@ -321,51 +247,72 @@ for (int i = 0; i < VL; ++i)
\end{frame}

-\frame{\frametitle{Why are overlaps allowed in Regfiles?}
+\frame{\frametitle{Register key-value CSR store}

 \begin{itemize}
- \item Same register(s) can have multiple "interpretations"
- \item Set "real" register (scalar) without needing to set/unset CSRs.
- \item xBitManip plus SIMD plus xBitManip = Hi/Lo bitops
- \item (32-bit GREV plus 4x8-bit SIMD plus 32-bit GREV:\\
- GREV @ VL=N,wid=32; SIMD @ VL=Nx4,wid=8)
- \item RGB 565 (video): BEXTW plus 4x8-bit SIMD plus BDEPW\\
- (BEXT/BDEP @ VL=N,wid=32; SIMD @ VL=Nx4,wid=8)
- \item Same register(s) can be offset (no need for VSLIDE)\vspace{6pt}
+ \item key is int regfile number or FP regfile number (1 bit)
+ \item treated as vector if referred to in op (5 bits, key)
+ \item starting register to actually be used (5 bits, value)
+ \item element bitwidth: default, dflt/2, 8, 16 (2 bits)
+ \item is vector: Y/N (1 bit)
+ \item is packed SIMD: Y/N (1 bit)
+ \item register bank: 0/reserved for future ext. (1 bit)
 \end{itemize}
- Note:
+ Notes:
 \begin{itemize}
- \item xBitManip reduces O($N^{6}$) SIMD down to O($N^{3}$)
- \item Hi-Performance: Macro-op fusion (more pipeline stages?)
+ \item References different (internal) mapping table for INT or FP
+ \item Level of indirection has implications for pipeline latency
+ \item (future) bank bit, no need to extend opcodes: set bank=1,
+ just use normal 5-bit regs, indirection takes care of the rest.
 \end{itemize}
}


-\frame{\frametitle{To Zero or not to place zeros in non-predicated elements?}
+\frame{\frametitle{Register element width and packed SIMD}
+ Packed SIMD = N:
 \begin{itemize}
- \item Zeroing is an implementation optimisation favouring OoO
- \item Simple implementations may skip non-predicated operations
- \item Simple implementations explicitly have to destroy data
- \item Complex implementations may use reg-renames to save power\\
- Zeroing on predication chains makes optimisation harder
- \item Compromise: REQUIRE both (specified in predication CSRs).
+ \item default: RV32/64/128 opcodes define elwidth = 32/64/128
+ \item default/2: RV32/64/128 opcodes, elwidth = 16/32/64 with
+ top half of register ignored (src), zero'd/s-ext (dest)
+ \item 8 or 16: elwidth = 8 (or 16), similar to default/2
 \end{itemize}
- Considerations:
- \begin{itemize}
- \item Complex not really impacted, simple impacted a LOT\\
- with Zeroing... however it's useful (memzero)
- \item Non-zero'd overlapping "Vectors" may issue overlapping ops\\
- (2nd op's predicated elements slot in 1st's non-predicated ops)
- \item Please don't use Vectors for "security" (use Sec-Ext)
+ Packed SIMD = Y (default is moot, packing is 1:1)
+ \begin{itemize}
+ \item default/2: 2 elements per register @ opcode-defined bitwidth
+ \item 8 or 16: standard 8 (or 16) packed SIMD
+ \end{itemize}
+ Notes:
+ \begin{itemize}
+ \item Different src/dest widths (and packs) PERMITTED
+ \item RV* already allows (and defines) how RV32 ops work in RV64\\
+ so just logically follow that lead/example.
 \end{itemize}
}
+
+
+\begin{frame}[fragile]
+\frametitle{Register key-value CSR table decoding pseudocode}
+
+\begin{semiverbatim}
+struct vectorised fp\_vec[32], int\_vec[32]; // 64 in future
+
+for (i = 0; i < 16; i++) // 16 CSRs?
+ tb = int\_vec if CSRvec[i].type == 0 else fp\_vec
+ idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
+ tb[idx].elwidth = CSRvec[i].elwidth
+ tb[idx].regidx = CSRvec[i].regidx // indirection
+ tb[idx].isvector = CSRvec[i].isvector
+ tb[idx].packed = CSRvec[i].packed // SIMD or not
+ tb[idx].bank = CSRvec[i].bank // 0 (1=rsvd)
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item All 32 int (and 32 FP) entries zero'd before setup
+ \item Might be a bit complex to set up in hardware (TBD)
+ \end{itemize}
+
+\end{frame}

 \frame{\frametitle{Predication key-value CSR store}

@@ -373,7 +320,7 @@ for (int i = 0; i < VL; ++i)
 \begin{itemize}
 \item key is int regfile number or FP regfile number (1 bit)\vspace{6pt}
 \item register to be predicated if referred to (5 bits, key)\vspace{6pt}
- \item register to store actual predication in (5 bits, value)\vspace{6pt}
+ \item INT reg with actual predication mask (5 bits, value)\vspace{6pt}
 \item predication is inverted Y/N (1 bit)\vspace{6pt}
 \item non-predicated elements are to be zero'd Y/N (1 bit)\vspace{6pt}
 \end{itemize}
@@ -390,12 +337,11 @@ for (int i = 0; i < VL; ++i)
 \frametitle{Predication key-value CSR table decoding pseudocode}

\begin{semiverbatim}
-struct pred fp\_pred[32];
-struct pred int\_pred[32];
+struct pred fp\_pred[32], int\_pred[32];

for (i = 0; i < 16; i++) // 16 CSRs?
 tb = int\_pred if CSRpred[i].type == 0 else fp\_pred
- idx = CSRpred[i].regidx
+ idx = CSRpred[i].regkey
 tb[idx].zero = CSRpred[i].zero
 tb[idx].inv = CSRpred[i].inv
 tb[idx].predidx = CSRpred[i].predidx
@@ -403,8 +349,8 @@ for (i = 0; i < 16; i++) // 16 CSRs?
\end{semiverbatim}

 \begin{itemize}
- \item All 64 (int and FP) Entries zero'd before setting
- \item Might be a bit complex to set up (TBD)
+ \item All 32 int and 32 FP entries zero'd before setting
+ \item Might be a bit complex to set up in hardware (TBD)
 \end{itemize}

\end{frame}
@@ -421,60 +367,98 @@ def get\_pred\_val(bool is\_fp\_op, int reg):
 predidx = tb[reg].predidx // redirection occurs HERE
 predicate = intreg[predidx] // actual predicate HERE
 if (tb[reg].inv):
- predicate = ~predicate
+ predicate = ~predicate // invert ALL bits
 return predicate
\end{semiverbatim}

 \begin{itemize}
 \item References different (internal) mapping table for INT or FP
 \item Actual predicate bitmask ALWAYS from the INT regfile
+ \item Hard-limit on MVL of XLEN (predication only 1 intreg)
 \end{itemize}

\end{frame}


-\frame{\frametitle{Register key-value CSR store}
+\frame{\frametitle{To Zero or not to place zeros in non-predicated elements?}

 \begin{itemize}
- \item key is int regfile number or FP regfile number (1 bit)\vspace{6pt}
- \item treated as vector if referred to in op (5 bits, key)\vspace{6pt}
- \item starting register to actually be used (5 bits, value)\vspace{6pt}
- \item element bitwidth: default/8/16/32/64/rsvd (3 bits)\vspace{6pt}
- \item element type: still under consideration\vspace{6pt}
+ \item Zeroing is an implementation optimisation favouring OoO
+ \item Simple implementations may skip non-predicated operations
+ \item Simple implementations explicitly have to destroy data
+ \item Complex implementations may use reg-renames to save power\\
+ Zeroing on predication chains makes optimisation harder
+ \item Compromise: REQUIRE both (specified in predication CSRs).
 \end{itemize}
- Notes:\vspace{10pt}
- \begin{itemize}
- \item Same notes apply (previous slide) as for predication CSR table
- \item Level of indirection has implications for pipeline latency
+ Considerations:
+ \begin{itemize}
+ \item Complex not really impacted, simple impacted a LOT\\
+ with Zeroing... however it's useful (memzero)
+ \item Non-zero'd overlapping "Vectors" may issue overlapping ops\\
+ (2nd op's predicated elements slot in 1st's non-predicated ops)
+ \item Please don't use Vectors for "security" (use Sec-Ext)
 \end{itemize}
}
% with overlapping "vectors" - bearing in mind that "vectors" are
% just a remap onto the standard register file, if the top bits of
% predication are zero, and there happens to be a second vector
% that uses some of the same register file that happens to be
% predicated out, the second vector op may be issued *at the same time*
% if there are available parallel ALUs to do so.

-\begin{frame}[fragile]
-\frametitle{Register key-value CSR table decoding pseudocode}
-
-\begin{semiverbatim}
-struct vectorised fp\_vec[32];
-struct vectorised int\_vec[32];
-
-for (i = 0; i < 16; i++) // 16 CSRs?
- tb = int\_vec if CSRvectortb[i].type == 0 else fp\_vec
- idx = CSRvectortb[i].regidx
- tb[idx].elwidth = CSRpred[i].elwidth
- tb[idx].regidx = CSRpred[i].regidx
- tb[idx].isvector = true
-\end{semiverbatim}
+\frame{\frametitle{Implementation Options}

 \begin{itemize}
- \item All 64 (int and FP) Entries zero'd before setting
- \item Might be a bit complex to set up (TBD)
+ \item Absolute minimum: Exceptions: if CSRs indicate "V", trap.\\
+ (Requires as absolute minimum that CSRs be in H/W)
+ \item Hardware loop, single-instruction issue\\
+ (Do / Don't send through predication to ALU)
+ \item Hardware loop, parallel (multi-instruction) issue\\
+ (Do / Don't send through predication to ALU)
+ \item Hardware loop, full parallel ALU (not recommended)
 \end{itemize}
+ Notes:\vspace{4pt}
+ \begin{itemize}
+ \item 4 (or more?) options above may be deployed on per-op basis
+ \item SIMD always sends predication bits through to ALU
+ \item Minimum MVL MUST be sufficient to cover regfile LD/ST
+ \item Instr. FIFO may repeatedly split off N scalar ops at a time
+ \end{itemize}
+}
% Instr. FIFO may need its own slide. Basically, the vectorised op
% gets pushed into the FIFO, where it is then "processed". Processing
% will remove the first set of ops from its vector numbering (taking
% predication into account) and shoving them **BACK** into the FIFO,
% but MODIFYING the remaining "vectorised" op, subtracting the now
% scalar ops from it.

-\end{frame}
+\frame{\frametitle{Predicated 8-parallel ADD: 1-wide ALU}
+ \begin{center}
+ \includegraphics[height=2.5in]{padd9_alu1.png}\\
+ {\bf \red Predicated adds are shuffled down: 6 cycles in total}
+ \end{center}
+}
+
+
+\frame{\frametitle{Predicated 8-parallel ADD: 4-wide ALU}
+ \begin{center}
+ \includegraphics[height=2.5in]{padd9_alu4.png}\\
+ {\bf \red Predicated adds are shuffled down: 4 in 1st cycle, 2 in 2nd}
+ \end{center}
+}
+
+
+\frame{\frametitle{Predicated 8-parallel ADD: 3 phase FIFO expansion}
+ \begin{center}
+ \includegraphics[height=2.5in]{padd9_fifo.png}\\
+ {\bf \red First cycle takes first four 1s; second takes the rest}
+ \end{center}
+}

\begin{frame}[fragile]
-\frametitle{ADD pseudocode with redirection, this time}
+\frametitle{ADD pseudocode with redirection (and proper predication)}

\begin{semiverbatim}
function op\_add(rd, rs1, rs2) # add not VADD!
@@ -497,6 +481,59 @@ function op\_add(rd, rs1, rs2) # add not VADD!
\end{frame}


+\frame{\frametitle{How are SIMD Instructions Vectorised?}
+
+ \begin{itemize}
+ \item SIMD ALU(s) primarily unchanged
+ \item Predication is added down each SIMD element (if requested,
+ otherwise entire block will be predicated as a group)
+ \item Predication bits sent in groups to the ALU (if requested,
+ otherwise just one bit for the entire packed block)
+ \item End of Vector enables (additional) predication:
+ completely nullifies end-case code (ONLY in multi-bit
+ predication mode)
+ \end{itemize}
+ Considerations:\vspace{4pt}
+ \begin{itemize}
+ \item Many SIMD ALUs possible (parallel execution)
+ \item Implementor free to choose (API remains the same)
+ \item Unused ALU units wasted, but s/w DRASTICALLY simpler
+ \item Very long SIMD ALUs could waste significant die area
+ \end{itemize}
+}
% With multiple SIMD ALUs at for example 32-bit wide they can be used
% to either issue 64-bit or 128-bit or 256-bit wide SIMD operations
% or they can be used to cover several operations on totally different
% vectors / registers.
+
+\frame{\frametitle{Predicated 9-parallel SIMD ADD}
+ \begin{center}
+ \includegraphics[height=2.5in]{padd9_simd.png}\\
+ {\bf \red 4-wide 8-bit SIMD, 4 bits of predicate passed to ALU}
+ \end{center}
+}
+
+
+\frame{\frametitle{Why are overlaps allowed in Regfiles?}
+
+ \begin{itemize}
+ \item Same register(s) can have multiple "interpretations"
+ \item Set "real" register (scalar) without needing to set/unset CSRs.
+ \item xBitManip plus SIMD plus xBitManip = Hi/Lo bitops
+ \item (32-bit GREV plus 4x8-bit SIMD plus 32-bit GREV:\\
+ GREV @ VL=N,wid=32; SIMD @ VL=Nx4,wid=8)
+ \item RGB 565 (video): BEXTW plus 4x8-bit SIMD plus BDEPW\\
+ (BEXT/BDEP @ VL=N,wid=32; SIMD @ VL=Nx4,wid=8)
+ \item Same register(s) can be offset (no need for VSLIDE)\vspace{6pt}
+ \end{itemize}
+ Note:
+ \begin{itemize}
+ \item xBitManip reduces O($N^{6}$) SIMD down to O($N^{3}$)
+ \item Hi-Performance: Macro-op fusion (more pipeline stages?)
+ \end{itemize}
+}
+
+
 \frame{\frametitle{C.MV extremely flexible!}
@@ -512,7 +549,7 @@ function op\_add(rd, rs1, rs2) # add not VADD!
 \vspace{4pt}
 Notes:
 \begin{itemize}
- \item Surprisingly powerful!
+ \item Surprisingly powerful! Zero-predication even more so
 \item Same arrangement for FVCT, FMV, FSGNJ etc.
 \end{itemize}
}
@@ -554,7 +591,7 @@ def op_mv_x(rd, rs): # (hypothetical) RV MX.X
 Vectorised version aka "VSELECT":

\begin{semiverbatim}
-def op_mv_x(rd, rs): # SV version of MX.X
+def op_mv_x(rd, rs): # SV version of MX.X
 for i in range(VL):
 rs1 = regfile[rs+i] # indirection
 regfile[rd+i] = regfile[rs] # straight regcopy
@@ -584,7 +621,7 @@ def op_mv_x(rd, rs): # SV version of MX.X
 \begin{itemize}
 \item VSELECT stays? no MV.X, so no (add with custom ext?)
 \item VSNE exists, but no FNE (use predication inversion?)
- \item VCLIP is not in RV* (add with custom ext?)
+ \item VCLIP is not in RV* (add with custom ext? or CSR?)
 \end{itemize}
}
@@ -658,11 +695,11 @@ loop:
 \frame{\frametitle{Under consideration}

 \begin{itemize}
- \item Is C.FNE actually needed? Should it be added if it is?
- \item Element type implies polymorphism. Should it be in SV?
+ \item Should future extra bank be included now?
+ \item How many Register and Predication CSRs should there be?\\
+ (and how many in RV32E)
+ \item How many in M-Mode (for doing context-switch)?
 \item Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
- \item Is detection of all-scalar ops ok (without slowing pipeline)?
- \item Can VSELECT be removed? (it's really complex)
 \item Can CLIP be done as a CSR (mode, like elwidth)
 \item SIMD saturation (etc.) also set as a mode?
 \item Include src1/src2 predication on Comparison Ops?\\
@@ -680,10 +717,10 @@ loop:
 (scalar ops are just vectors of length 1)\vspace{4pt}
 \item Tightly coupled with the core (instruction issue)\\
 could be disabled through MISA switch\vspace{4pt}
- \item An extra pipeline phase is pretty much essential\\
+ \item An extra pipeline phase almost certainly essential\\
 for fast low-latency implementations\vspace{4pt}
 \item With zeroing off, skipping non-predicated elements is hard:\\
- it is however an optimisation (and could be skipped).\vspace{4pt}
+ it is however an optimisation (and need not be done).\vspace{4pt}
 \item Setting up the Register/Predication tables (interpreting the\\
 CSR key-value stores) might be a bit complex to optimise
 (any change to a CSR key-value entry needs to redo the table)
 \end{itemize}
}
@@ -691,14 +728,6 @@ loop:
 }

-
-\frame{\frametitle{Is this OK (low latency)? Detect scalar-ops (only)}
- \begin{center}
- \includegraphics[height=2.5in]{scalardetect.png}\\
- {\bf \red Detect when all registers are scalar for a given op}
- \end{center}
-}
-
-
 \frame{\frametitle{Summary}

 \begin{itemize}