**ISA Comparison Table to DRAFT SVP64** - discussion and research at

|ISA<br>name |No.<br>opcodes|No.<br>intrinsics|Taxonomy /<br>Class|Binary<br>Compat|setvl<br>scalable|Pred.<br>Masks|Twin<br>Pred|Vector<br>regs|128-bit<br>ops|Big<br>int|LDST<br>F/First|Data-dep<br>F-first|Pred<br>Result|HW<br>Matrix|DCT<br>FFT|
|---------------|--------------|-----------------|-------------------|----------------|-----------------|--------------|------------|--------------|--------------|----------|---------------|-------------------|--------------|------------|----------|
|SVP64 |6 [^1] |see [^2] |Scalable [^3] |yes |yes |yes |yes [^4] |no [^5] |see [^6] |yes[^7] |yes [^8] |yes [^9] |yes [^10] |yes [^11] | yes[^12] |
|VSX |700+ |700?[^v1] |PackedSIMD |yes |no |no |no |yes [^v2] |yes |no |no |no |no |yes [^v3] | no |
|NEON |~250[^n1] |7088 [^n2] |PackedSIMD |yes |no |no |no |yes |see [^b1] |no |no |no |no |no | no |
|SVE2 |~1000[^e1] |6040 [^e2] |PredSIMD[^e3] |NO [^nc] |no [^e3] |yes |no |yes |see [^b1] |no |yes [^8] |no |no |yes [^e4] | no |
|AVX512[^x1] |~1000s[^x2] |7256[^x3] |PredSIMD |yes |no |yes |no |yes |see[^b1] |no |no |no |no |yes[^x4] | no |
|RVV [^r1] |~190[^r2] |~25000[^r3] |Scalable[^r4] |NO [^nc] |yes |yes |no |yes |yes [^r5] |no |yes |no |no |no | no |
|AuroraSX[^s1] |~200[^s2] |unknown[^s3] |Scalable[^s4] |yes |yes |yes |no |yes |no |no |no |no |no |? | no |
|66000[^m1] |~200 |unknown |AutoVec[^m1] |yes |see [^m1] |see[^m1] |no |see [^m1] |no |yes[^m2] |see [^m1] |no |no |no | no |

[^1]: plus EXT001 24-bit prefixing using 25% of EXT001 space. See [[sv/svp64]]
[^2]: If treated as a 1-Dimensional ISA, and designed badly, the 24-bit Prefix expands 200+ scalar instructions to well over a million intrinsics (N~=10^4 **times** M~=10^2). If treated as a 2-Dimensional ISA and designed well, there are far fewer: N prefix intrinsics **plus** M scalar instruction intrinsics, where N is likely to be of the order of 10^2 and M of the order of 10^2.
[^3]: A 2-Dimensional Scalable Vector ISA **specifically designed for the Power ISA** with both Horizontal-First and Vertical-First Modes. See [[sv/vector_isa_comparison]]
[^4]: on specific operations. See [[opcode_regs_deduped]] for the full list. Key: 2P - Twin Predication, 1P - Single-Predicate
[^5]: SVP64 provides a Vector concept on top of the **Scalar** GPR, FPR and CR Fields, extended to 128 entries.
[^6]: SVP64 Vectorizes Scalar ops. It is up to the **implementor** to choose (**optionally**) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
[^7]: big-integer add is just `sv.adde`. For optimal performance, big-integer multiply and divide first require the addition of two new scalar operations (which are in turn naturally Vectorized by SVP64). See [[sv/biginteger/analysis]]
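
To make the big-integer add note above concrete, here is a minimal Python sketch of the semantics of a Vectorized add-with-carry such as `sv.adde`: the element loop runs in order and the carry chains from one 64-bit limb to the next, so a single Vectorized scalar op performs an arbitrary-length big-integer add. This is a model of the concept only, not SVP64 assembler; the limb-list storage and the function name are assumptions of the sketch.

```python
# Minimal sketch (not SVP64 assembler) of what a Vectorized add-with-carry
# such as `sv.adde` computes: one in-order pass over 64-bit limbs, with the
# carry (CA) chained from element to element.

MASK64 = (1 << 64) - 1

def vector_adde(a, b, ca=0):
    """Add two big integers stored as lists of 64-bit limbs, least-significant first."""
    result = []
    for x, y in zip(a, b):          # one iteration per element, VL times
        s = x + y + ca              # adde: RT = RA + RB + CA
        result.append(s & MASK64)   # low 64 bits written to the destination limb
        ca = s >> 64                # carry-out becomes carry-in for the next limb
    return result, ca

# 128-bit example: each operand is two 64-bit limbs
a = [0xFFFFFFFFFFFFFFFF, 0x0000000000000001]   # 2^65 - 1
b = [0x0000000000000001, 0x0000000000000002]   # 2^65 + 1
print(vector_adde(a, b))                        # ([0, 4], 0), i.e. 2^66
```
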
[^8]: LD/ST Fault-First: see [[sv/svp64/appendix]] and [ARM SVE Fault-First](https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf)
[^9]: Data-dependent Fail-First: based on LD/ST Fail-First, extended to data. Truncates VL at the first failing Rc=1 test, similar in concept to the Z80 CPIR instruction. See [[sv/svp64/appendix]]
[^10]: Predicate-result effectively turns any standard op into a type of "cmp". See [[sv/svp64/appendix]]
[^11]: Any non-power-of-two Matrix up to 127 FMACs (or other FMA-style op, including Ternary Logical), with a full triple-loop Schedule. See [[sv/remap]]
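
As a rough illustration of the Matrix REMAP note above, the Python sketch below shows the idea of the triple-loop Schedule: a single Vectorized FMAC is fed remapped element indices that walk the full i,j,k loop of a non-power-of-two Matrix Multiply. It is not SVP64 assembler and not the real REMAP encoding; the index generator, flat row-major storage and function names are assumptions of the sketch.

```python
# Illustrative model of a triple-loop REMAP Schedule: the generator supplies
# remapped indices, and one conceptual Vectorized FMAC consumes them.

def matrix_remap_schedule(I, J, K):
    """Yield (dest, srcA, srcB) flat indices for C[i][j] += A[i][k] * B[k][j]."""
    for i in range(I):
        for k in range(K):
            for j in range(J):
                yield i * J + j, i * K + k, k * J + j

def remapped_fmac(C, A, B, I, J, K):
    # conceptually a single Vectorized FMAC; the schedule supplies the indices
    for d, a, b in matrix_remap_schedule(I, J, K):
        C[d] += A[a] * B[b]
    return C

# non-power-of-two example: (2x3) x (3x5) -> (2x5), 30 FMACs (well under 127)
I, J, K = 2, 5, 3
A = list(range(1, I * K + 1))    # 2x3 matrix, row-major
B = list(range(1, K * J + 1))    # 3x5 matrix, row-major
C = remapped_fmac([0] * (I * J), A, B, I, J, K)
print([C[i * J:(i + 1) * J] for i in range(I)])
```
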
[^12]: DCT (Lee) and FFT full triple-loops supported, RADIX2-only. Normally only found in VLIW DSPs (TI TMS320, Qualcomm Hexagon). See [[sv/remap]]
[^v2]: VSX's Vector Registers are mis-named: they are 100% PackedSIMD. AVX-512 is not a Vector ISA either. See [Flynn's Taxonomy](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy)
[^v3]: Power ISA v3.1 contains "Matrix Multiply Assist" (MMA) which, due to PackedSIMD, is restricted to RADIX2 and requires inline-assembler loop-unrolling for non-power-of-two Matrix dimensions.
[^n1]: difficult to ascertain, see [NEON/VFP](https://developer.arm.com/documentation/den0018/a/NEON-and-VFP-Instruction-Summary/List-of-all-NEON-and-VFP-instructions). Critically depends on ARM Scalar instructions.
[^e1]: difficult to exactly ascertain, see ARM Architecture Reference Manual Supplement, DDI 0584. Critically depends on ARM Scalar instructions.
[^e3]: ARM states that the Scalability is a [Silicon-partner choice](https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/102340_0001_00_en_introduction-to-sve2.pdf?revision=aae96dd2-5334-4ad3-9a47-393086a20fea). Scalability in the ISA is **not available to the programmer**: there is no `setvl` instruction in SVE2, which is already causing difficulties for assembler programmers. [quote](https://gist.github.com/zingaburga/805669eb891c820bd220418ee3f0d6bd#file-sve2-md): **"you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width"**
[^x1]: [AVX512 Wikipedia](https://en.wikipedia.org/wiki/AVX-512), [Lifecycle of an instruction set](https://media.handmade-seattle.com/tom-forsyth/) including full slides
[^x2]: difficult to exactly ascertain, contains subsets. Critically depends on ISA support from earlier x86 ISA subsets (several more thousand instructions). See [SIMD ISA listing](https://www.officedaytime.com/simd512e/)
[^r1]: [RVV Spec](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc)
[^r2]: RISC-V Vectors are not stand-alone: like SVE2 and AVX-512 they are critically dependent on the Scalar ISA (an additional ~96 instructions for the Scalar RV64GC set, needed for Linux).
[^r4]: Like the original Cray, RVV is a truly scalable Vector ISA (Cray setvl instruction). However, like SVE2, the Maximum Vector Length is a [Silicon-partner choice](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-extensions), which creates similar limitations, limitations that SVP64 does not have. The RISC-V Founders strongly discourage efforts by programmers to find out the Silicon's Maximum Vector Length, as an effort to steer programmers towards Silicon-independent assembler. **This requires all algorithms to contain a loop construct**. MAXVL in SVP64 is a Spec-hard-fixed quantity, therefore loop constructs are not necessary 100% of the time.
[^r5]: Like SVP64, it is up to the hardware implementor (Silicon partner) to choose whether to support 128-bit elements.
[^s1]: [NEC SX Aurora](https://ftp.libre-soc.org/NEC_SX_Aurora_TSUBASA_VectorEngine-as-manual-v1.2.pdf) is based on the original Cray Vectors
[^s2]: [Aurora ISA guide](https://sxauroratsubasa.sakura.ne.jp/documents/guide/pdfs/Aurora_ISA_guide.pdf) Appendix-3 11.1 p508
which are power-2 based on Silicon-partner SIMD width. Non-power-2 not supported but [zero-input masking](https://www.realworldtech.com/forum/?threadid=202688&curpostid=207774) is.
[^x4]: [Advanced Matrix Extensions](https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions) supports BF16 and INT8 only. Separate regfile, power-of-two "tiles". Not general-purpose at all.
[^b1]: Although registers may be 128-bit in NEON, SVE2, and AVX, unlike VSX there are very few (or no) actual arithmetic 128-bit operations. Only RVV and SVP64 have the possibility of 128-bit ops.
[^m1]: Mitch Alsup's MyISA 66000 is available on request. A powerful RISC ISA with a **Hardware-level auto-vectorization** LOOP built in as an extension named VVM. Classified as "Vertical-First".
[^m2]: MyISA 66000 has a CARRY register up to 64-bit. Repeated application of FMA (especially within Auto-Vectorized LOOPs) automatically and inherently creates big-integer operations with zero effort.
[^nc]: "Silicon-Partner" Scaling is achieved by allowing the same instruction to act on a different regfile size and bitwidth per implementation. This catastrophically results in binary non-interoperability.
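
To illustrate the Maximum-Vector-Length and binary-compatibility points in the RVV and "Silicon-Partner" notes above, the hedged Python model below shows why an unknown hardware vector length forces a strip-mining loop into every algorithm, whereas a Spec-fixed MAXVL allows short fixed-size operations to omit it. The `setvl` function here is a stand-in model, not real RVV or SVP64 syntax, and the MAXVL value of 127 is an assumption of the sketch, taken from the "127 FMACs" figure used elsewhere on this page.

```python
# Hedged model only: `setvl` and MAXVL below are illustrative stand-ins,
# not actual instruction semantics from any of the ISAs in the table.

def setvl(requested, hw_max):
    """Model of a setvl-style instruction: grant min(requested, hardware maximum)."""
    return min(requested, hw_max)

def vector_add(dst, a, b, n, hw_max):
    i = 0
    while i < n:                   # loop required: hw_max is unknown at compile time
        vl = setvl(n - i, hw_max)  # ask for the remaining elements, receive vl
        for e in range(vl):        # one vector instruction covering vl elements
            dst[i + e] = a[i + e] + b[i + e]
        i += vl
    return dst

# the same binary must work whether the silicon grants 4, 16 or 256 elements
a, b = list(range(10)), list(range(10, 20))
print(vector_add([0] * 10, a, b, 10, hw_max=4))

# with a Spec-fixed MAXVL, identical on every implementation, a short
# fixed-size operation can skip the strip-mining loop entirely
MAXVL = 127                        # assumed Spec-fixed upper bound for this sketch
n = 10
assert n <= MAXVL                  # guaranteed by the Spec, not by probing silicon
vl = setvl(n, MAXVL)
print([x + y for x, y in zip(a[:vl], b[:vl])])
```
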