From: Luke Kenneth Casson Leighton
Date: Mon, 16 Apr 2018 02:04:15 +0000 (+0100)
Subject: add SIMD section
X-Git-Tag: convert-csv-opcode-to-binary~5662
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=a06477ee677c5eb2addc96ee020a9d2be85e0246;p=libreriscv.git

add SIMD section
---

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 3036e31f2..d5adea1c6 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1046,12 +1046,15 @@ translates effectively to:
 * Throw an exception. Whether that actually results in spawning threads
   as part of the trap-handling remains to be seen.
 
-# Comparison of SIMD (TODO) Alt-RVP, Simple-V and RVV Proposals
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals
 
-This section compares the various parallelism proposals as they stand.
-SIMD is yet to be explicitly incorporated into this section.
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD.
 
-[[alt_rvp]]
+## [[alt_rvp]]
+
+The primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
 
 * plus: the simplicity of the lanes (combined with the regularity of
   allocating identical opcodes multiple independent registers) meaning
@@ -1071,7 +1074,12 @@ SIMD is yet to be explicitly incorporated into this section.
 * minus: Access to registers across multiple lanes is challenging. "Solution"
   is to drop data into memory and immediately back in again (like MMX).
 
-Simple-V
+## Simple-V
+
+The primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual hardware. It is, in effect, an API designed to be slotted into
+an existing implementation (just after instruction decode) with minimum
+disruption and effort.
 
 * minus: the complexity of having to use register renames, OoO, VLIW,
   register file cacheing, all of which has been done before but is a
@@ -1105,10 +1113,13 @@ Simple-V
   be "no worse" than existing register renaming, OoO, VLIW and register
   file cacheing schemes.
 
-RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
 
-* plus: regular predictable workload means effects on L1/L2 Cache can
-  be streamlined.
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+  streamline effects on L1/L2 Cache.
 * plus: regular and clear parallel workload also means that lanes
   (similar to Alt-RVP) may be used as an implementation detail,
   using either SRAM or 2R1W registers.
@@ -1131,6 +1142,35 @@ RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
   to be in high-performance specialist supercomputing (where it will
   be absolutely superb).
 
+## Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+how good the performance can be. Unfortunately that makes it quite seductive...
+
+* plus: really straightforward: the ALU basically does several packed
+  operations at once. Parallelism is inherent at the ALU, making the rest
+  of the processor really straightforward (zero impact).
+* plus (continuation): SIMD in simple in-order single-issue designs
+  can therefore result in great throughput even with a very simple execution
+  engine.
+* minus: ridiculously complex setup and corner-cases.
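+  A minimal sketch of what this forces onto every loop (written in C with
+  GCC/Clang vector extensions, assuming a hypothetical 4-wide 32-bit integer
+  SIMD unit): a vector body plus a separate scalar tail for the leftover
+  elements.
+
+        #include <stddef.h>
+
+        /* 4 x 32-bit packed integers */
+        typedef int v4si __attribute__((vector_size(16)));
+
+        void add_arrays(int *dst, const int *a, const int *b, size_t n)
+        {
+            size_t i = 0;
+            /* main body: only handles exact multiples of the SIMD width */
+            for (; i + 4 <= n; i += 4) {
+                v4si va, vb, vc;
+                __builtin_memcpy(&va, &a[i], sizeof va);  /* unaligned-safe load */
+                __builtin_memcpy(&vb, &b[i], sizeof vb);
+                vc = va + vb;                             /* one packed 4x32 add */
+                __builtin_memcpy(&dst[i], &vc, sizeof vc);
+            }
+            /* corner case: a scalar tail for the last n % 4 elements */
+            for (; i < n; i++)
+                dst[i] = a[i] + b[i];
+        }
+
+  Every SIMD loop ends up repeating this pattern, and it gets worse once
+  mixed element widths and high/low reordering are involved (the remaining
+  minuses below).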
+* minus: getting data usefully out of registers (if separate regfiles
+  are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+  are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+  dimension and parallelism (width): an at least O(N^2) and quite probably
+  O(N^3) ISA proliferation that often results in several thousand
+  separate instructions, all with separate corner-case algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+  8, 16, 32 or 64-bit reordering is built into the SIMD instruction.
+  For example: add (high|low) 16-bits of r1 to (low|high) of r2.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+  between operand and result bit-widths. In combination with the high/low
+  proliferation, the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+  that allow control over individual elements within the SIMD block.
+
 # Implementing V on top of Simple-V
 
 * Number of Offset CSRs extends from 2