From 32f9d7ec32e24ccab89fc16dc5f00bfa992fe216 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Fri, 9 Sep 2022 01:06:01 +0100
Subject: [PATCH] add use-cases

---
 openpower/sv/rfc/Makefile   |  4 +-
 openpower/sv/rfc/ls001.mdwn | 86 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/openpower/sv/rfc/Makefile b/openpower/sv/rfc/Makefile
index 24b4e0960..b505dd68c 100644
--- a/openpower/sv/rfc/Makefile
+++ b/openpower/sv/rfc/Makefile
@@ -4,4 +4,6 @@ ls001.pdf: ls001.mdwn
 pandoc -V geometry:margin=0.25in \
 -V fontsize=9pt \
 -V papersize=a4 \
- -f markdown ls001.mdwn -s -o ls001.pdf
+ -f markdown ls001.mdwn \
+ -s --normalize --smart --self-contained \
+ -o ls001.pdf
diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index be389516a..460098ab4 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -107,6 +107,7 @@ such large numbers of registers, even for Multi-Issue microarchitectures.
 # Simple-V Architectural Resources

 * No new Interrupt types are required.
+ **No modifications to existing Power ISA are required either**.
 * GPR FPR and CR Field Register numbers are extended to 128.
 A future version may extend to 256 or beyond [^extend]
 * (A future version or other Stakeholder *may* wish to drop Simple-V
@@ -257,13 +258,94 @@ all of which are 64/64 (or 64/32).
 **EXT059 and EXT063**

 Additionally for High-Performance Compute and Competitive 3D GPU, IEEE754 FP
-Transcendentals are required:
+Transcendentals are required, as are some DCT/FFT "Twin-Butterfly" operations:

 * QTY 33of X-Form "1-argument" (fsin, fsins, fcos, fcoss)
 * QTY 15of X-Form "2-argument" (pow, atan2, fhypot)
+* QTY 5of A-Form "3-in 2-out" FP Butterfly operations for DCT/FFT
+* QTY 8of X-Form "2-in 2-out" FP Butterfly operations (again for DCT/FFT)
+
+\newpage{}
+
+# Use case: DCT
+
+DCT has dozens of uses in Audio-Visual processing and CODECs.
+A full 8-wide in-place triple-loop Inverse DCT may be achieved
+in 7 instructions. Expanding this to 16-wide is a matter of setting
+`svshape 16`. Lee Composition may be deployed to construct non-power-two
+DCTs. The cosine table may be computed (once) with 18 Vector instructions.
+
+
+
+
+```
+1014 def test_sv_remap_fpmadds_ldbrev_idct_8_mode_4(self):
+1015 """>>> lst = [# LOAD bit-reversed with half-swap
+1016 "svshape 8, 1, 1, 14, 0",
+1017 "svremap 1, 0, 0, 0, 0, 0, 0",
+1018 "sv.lfs/els *0, 4(1)",
+1019 # Outer butterfly, iterative sum
+1020 "svremap 31, 0, 1, 2, 1, 0, 1",
+1021 "svshape 8, 1, 1, 11, 0",
+1022 "sv.fadds *0, *0, *0",
+1023 # Inner butterfly, twin +/- MUL-ADD-SUB
+1024 "svshape 8, 1, 1, 10, 0",
+1025 "sv.ffmadds *0, *0, *0, *8"
+```
+
+# Use case: Matrix Multiply
+
+Matrix Multiply of any size (non-power-2) up to a total of 127 operations
+is achievable with only three instructions. Normally in any other SIMD
+ISA at least one source requires Transposition and often massive rolling
+repetition of data is required. These 3 instructions may be used as the
+"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
+
+
+```
+ 28 def test_sv_remap1(self):
+ 29 """>>> lst = ["svshape 2, 2, 3, 0, 0",
+ 30 "svremap 31, 1, 2, 3, 0, 0, 0",
+ 31 "sv.fmadds *0, *8, *16, *0"
+ 32 ]
+```
+
+# Use case: Parallel Reduction
+
+Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
+Vector ISAs. Parallel Reduction is Fully Deterministic in Simple-V and
+thus may even usefully be deployed on non-associative and non-commutative
+operations.
+
+
+
+```
+ 75 def test_sv_remap2(self):
+ 76 """>>> lst = ["svshape 7, 0, 0, 7, 0",
+ 77 "svremap 31, 1, 0, 0, 0, 0, 0", # different order
+ 78 "sv.subf *0, *8, *16"
+ 79 ]
+ 80 REMAP sv.subf RT,RA,RB - inverted application of RA/RB
+ 81 left/right due to subf
+```
+
+# Use case: LD/ST-Multi
+
+Context-switching saving and restoring of registers on the stack often
+requires explicit loop-unrolling to achieve effectively. In SVP64 it
+is possible to use a Predicate Mask to "compact" or "expand" a swathe
+of desired registers, dynamically. Known as "VCOMPRESS" and "VEXPAND",
+runtime-configurable LD/ST-Multi is achievable with 2 instructions.
+
+```
+ # load 64 registers off the stack, in-order, skipping unneeded ones
+ # by using CR0-CR63's "EQ" bits to select only those needed.
+ setvli 64
+ sv.ld/sm=EQ *rt,0(ra)
+```

 [[!tag opf_rfc]]

 [^extend]: Prefix opcode space **must** be reserved in advance to do so, in order to avoid the catastrophic binary-incompatibility mistake made by RISC-V RVV and ARM SVE/2
-[^likeext001]: SVP64-Single is remarkably similar to the "bit 1" of EXT001 being set to indicate that the 64-bits is to be allocated in full to a new encoding, but in fact it still embeds v3.0 Scalar operations.
+[^likeext001]: SVP64-Single is remarkably similar to the "bit 1" of EXT001 being set to indicate that the 64-bits is to be allocated in full to a new encoding, but in fact SVP64-single still embeds v3.0 Scalar operations.
 [^pseudorewrite]: elwidth overrides does however mean that all SFS / SFFS pseudocode will need rewriting to be in terms of XLEN. This has the indirect side-effect of automatically making a 32-bit Scalar Power ISA Specification possible, as well as a future 128-bit one (Cross-reference: RISC-V RV32 and RV128)
-- 
2.30.2