From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Thu, 28 Jul 2022 22:21:54 +0000 (+0100)
Subject: clarify, add SVE2 Scalable Matrix Extension
X-Git-Tag: opf_rfc_ls005_v1~971
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=9cbf159e95ff2ea7e202f5a76cfdd732ea6855a7;p=libreriscv.git

clarify, add SVE2 Scalable Matrix Extension
---

diff --git a/openpower/sv/comparison_table.mdwn b/openpower/sv/comparison_table.mdwn
index e5748e475..28b89f9e9 100644
--- a/openpower/sv/comparison_table.mdwn
+++ b/openpower/sv/comparison_table.mdwn
@@ -5,33 +5,33 @@
 |Draft SVP64   |5 (1)          |see (25)          |Scalable (2)         |yes                |yes                 |yes (3)              |no (4)                   |see (5)         |yes (6) |yes (7)              |yes (8)                       |yes (9)              |yes (10)             |
 |VSX           |700+           |700+? (26)        |Packed SIMD          |no                 |no                  |no                   |yes (11)                 |yes             |no      |no                   |no                            |no                   |yes (12)             |
 |NEON          |~250 (13)      |7088 (27)         |Packed SIMD          |no                 |no                  |no                   |yes                      |yes             |no      |no                   |no                            |no                   |no                   |
-|SVE2          |~1000 (14)     |6040 (28)         |Predicated SIMD(15)  |no (15)            |yes                 |no                   |yes                      |yes             |no      |yes (7)              |no                            |no                   |no                   |
+|SVE2          |~1000 (14)     |6040 (28)         |Predicated SIMD(15)  |no (15)            |yes                 |no                   |yes                      |yes             |no      |yes (7)              |no                            |no                   |yes (32)             |
 |AVX512 (16)   |~1000s (17)    |7256 (29)         |Predicated SIMD      |no                 |yes                 |no                   |yes                      |yes             |no      |no                   |no                            |no                   |no                   |
 |RVV (18)      |~190 (19)      |~25000 (30)       |Scalable (20)        |yes                |yes                 |no                   |yes                      |yes (21)        |no      |yes                  |no                            |no                   |no                   |
-|Aurora SX(22) |~200 (23)      |unknown (31)      |Scalable (24)        |yes                |yes                 |no                   |yes                      |no              |no      |no                   |no                            |no                   |no                   |
+|Aurora SX(22) |~200 (23)      |unknown (31)      |Scalable (24)        |yes                |yes                 |no                   |yes                      |no              |no      |no                   |no                            |no                   |?                    |
 
 * (1): plus EXT001 24-bit prefixing using 25% of EXT001 space. See [[sv/svp64]]
 * (2): A 2-Dimensional Scalable Vector ISA **specifically designed for the Power ISA** with both Horizontal-First and Vertical-First Modes. See [[sv/vector_isa_comparison]]
 * (3): on specific operations.  See [[opcode_regs_deduped]] for full list. Key: 2P - Twin Predication, 1P - Single-Predicate
 * (4): SVP64 provides a Vector concept on top of the **Scalar** GPR, FPR and CR Fields, extended to 128 entries.
 * (5): SVP64 Vectorises Scalar ops. It is up to the **implementor** to choose (**optionally**) whether to apply SVP64 to e.g. VSX Quad-Precision (128-bit) instructions, to create 128-bit Vector ops.
-* (6): big-integer add is just `sv.adde`. For optimal performance Bigint Mul and divide first require addition of two scalar operations (which are then naturally Vectorised by SVP64). See [[sv/biginteger/analysis]]
+* (6): big-integer add is just `sv.adde`. For optimal performance Bigint Mul and divide first require addition of two scalar operations (which are in turn naturally Vectorised by SVP64). See [[sv/biginteger/analysis]]
 * (7): See [[sv/svp64/appendix]] and [ARM SVE Fault-First](https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf)
 * (8): Based on LD/ST Fail-first, extended to data. See [[sv/svp64/appendix]]
 * (9): Turns standard ops into a type of "cmp". See [[sv/svp64/appendix]]
-* (10): Any non-power-of-two Matrix up to 127 FMACs.  Also DCT (Lee) and FFT Full (RADIX2) Triple-loops supported. See [[sv/remap]]
+* (10): Any non-power-of-two Matrix up to 127 FMACs (or other FMA-style ops), with a full triple-loop Schedule.  Also DCT (Lee) and FFT Full (RADIX2) Triple-loops supported. See [[sv/remap]]
 * (11): VSX's Vector Registers are mis-named: they are 100% PackedSIMD. AVX-512 is not a Vector ISA either.  See [Flynn's Taxonomy](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy)
 * (12): Power ISA v3.1 contains "Matrix Multiply Assist" (MMA) which due to PackedSIMD is restricted to RADIX2 and requires inline assembler loop-unrolling for non-power-of-two Matrix dimensions
 * (13): difficult to ascertain, see [NEON/VFP](https://developer.arm.com/documentation/den0018/a/NEON-and-VFP-Instruction-Summary/List-of-all-NEON-and-VFP-instructions).
   Critically depends on ARM Scalar instructions
 * (14): difficult to exactly ascertain, see ARM Architecture Reference Manual Supplement, DDI 0584.  Critically depends on ARM Scalar instructions.
 * (15): ARM states that the Scalability is a [Silicon-partner choice](https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/102340_0001_00_en_introduction-to-sve2.pdf?revision=aae96dd2-5334-4ad3-9a47-393086a20fea).
-  Scalability in the ISA is **not available to the programmer**: there is no `setvl` instruction in SVE2, which is already causing assembler programmer difficulties. Effectively this makes SVE2 Predicated SIMD.
+  Scalability in the ISA is **not available to the programmer**: there is no `setvl` instruction in SVE2, which is already causing assembler programmer difficulties.
   [quote](https://gist.github.com/zingaburga/805669eb891c820bd220418ee3f0d6bd#file-sve2-md) **"you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width"**
 * (16): [AVX512 Wikipedia](https://en.wikipedia.org/wiki/AVX-512), [Lifecycle of an instruction set](https://media.handmade-seattle.com/tom-forsyth/) including full slides
 * (17): difficult to exactly ascertain, contains subsets. Critically depends on ISA support from earlier x86 ISA subsets (several more thousand instructions). See [SIMD ISA listing](https://www.officedaytime.com/simd512e/)
 * (18): [RVV Spec](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc)
-* (19): RISC-V Vectors are not stand-alone, i.e. like SVE2 and AVX-512 are critically dependent on the Scalar ISA (an additional ~96 instructions for the Scalar RV64GC set (RV64GC is equivalent to the Linux Compliancy Level)
+* (19): RISC-V Vectors are not stand-alone: like SVE2 and AVX-512, they are critically dependent on the Scalar ISA (an additional ~96 instructions for the Scalar RV64GC set, needed for Linux).
 * (20): Like the original Cray RVV is a truly scalable Vector ISA (Cray setvl instruction).  However, like SVE2, the Maximum Vector length is a Silicon-partner choice, which creates similar limitations that SVP64 does not have.
   The RISC-V Founders strongly discourage efforts by programmers to find out the Silicon's Maximum Vector Length, as an effort to steer programmers towards Silicon-independent assembler. This requires **all** algorithms to contain a loop construct.
   MAXVL in SVP64 is a Spec-hard-fixed quantity therefore loop constructs are not necessary 100% of the time.
@@ -47,3 +47,5 @@
 * (29): Count includes SSE, SSE2, AVX, AVX2 and all AVX512 variants
 * (30): [RVV intrinsics listing](https://raw.githubusercontent.com/riscv-non-isa/rvv-intrinsic-doc/master/intrinsic_funcs.md) page is 25,000 lines long.
 * (31): Unknown. estimated to be of the order of length of RVV due to also being a Cray-style Scalable ISA.
+* (32): [Scalable Matrix Optional Extension](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture)
+  the key is an outer-product instruction, [SMOPA](https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/SMOPA--Signed-integer-sum-of-outer-products-and-accumulate-?lang=en), for which it is very hard to tell at a glance whether it is power-of-two only or also handles non-power-of-two sizes