From: Luke Kenneth Casson Leighton
Date: Fri, 9 Sep 2022 01:00:21 +0000 (+0100)
Subject: shuffle pages
X-Git-Tag: opf_rfc_ls005_v1~573
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=dcb76e17280cec4121968d2b5c69b169e47f3f3b;p=libreriscv.git

shuffle pages
---

diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index 44f05e271..99aa77680 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -272,71 +272,6 @@ For each of EXT059 and EXT063:
 as of 08Sep2022

 \newpage{}
-
-# Use case: DCT
-
-DCT has dozens of uses in Audio-Visual processing and CODECs.
-A full 8-wide in-place triple-loop Inverse DCT may be achieved
-in 8 instructions. Expanding this to 16-wide is a matter of setting
-`svshape 16` **and the same instructions used**.
-Lee Composition may be deployed to construct non-power-two DCTs.
-The cosine table may be computed (once) with 18 Vector instructions
-(one of them `fcos`)
-
-```
-1014 def test_sv_remap_fpmadds_ldbrev_idct_8_mode_4(self):
-1015 """>>> lst = [# LOAD bit-reversed with half-swap
-1016 "svshape 8, 1, 1, 14, 0",
-1017 "svremap 1, 0, 0, 0, 0, 0, 0",
-1018 "sv.lfs/els *0, 4(1)",
-1019 # Outer butterfly, iterative sum
-1020 "svremap 31, 0, 1, 2, 1, 0, 1",
-1021 "svshape 8, 1, 1, 11, 0",
-1022 "sv.fadds *0, *0, *0",
-1023 # Inner butterfly, twin +/- MUL-ADD-SUB
-1024 "svshape 8, 1, 1, 10, 0",
-1025 "sv.ffmadds *0, *0, *0, *8"
-```
-
-
-
-# Use case: Matrix Multiply
-
-Matrix Multiply of any size (non-power-2) up to a total of 127 operations
-is achievable with only three instructions. Normally in any other SIMD
-ISA at least one source requires Transposition and often massive rolling
-repetition of data is required. These 3 instructions may be used as the
-"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
-
-```
- 28 def test_sv_remap1(self):
- 29 """>>> lst = ["svshape 2, 2, 3, 0, 0",
- 30 "svremap 31, 1, 2, 3, 0, 0, 0",
- 31 "sv.fmadds *0, *8, *16, *0"
- 32 ]
-```
-
-
-
-# Use case: Parallel Reduction
-
-Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
-Vector ISAs. Parallel Reduction is Fully Deterministic in Simple-V and
-thus may even usefully be deployed on non-associative and non-commutative
-operations.
-
-```
- 75 def test_sv_remap2(self):
- 76 """>>> lst = ["svshape 7, 0, 0, 7, 0",
- 77 "svremap 31, 1, 0, 0, 0, 0, 0", # different order
- 78 "sv.subf *0, *8, *16"
- 79 ]
- 80 REMAP sv.subf RT,RA,RB - inverted application of RA/RB
- 81 left/right due to subf
-```
-
-
-
 # Use case: LD/ST-Multi

 Context-switching saving and restoring of registers on the stack often
@@ -351,7 +286,6 @@ runtime-configurable LD/ST-Multi is achievable with 2 instructions.
 setvli 64
 sv.ld/sm=EQ *rt,0(ra)
 ```
-\newpage{}

 # Use case: Twin-Predication, re-entrant

@@ -361,7 +295,9 @@ that sufficient state is stored within the Vector Context SPR, SVSTATE,
 for full re-entrancy on a Context Switch or function call *even if
 in the middle of executing a loop*. Also demonstrates that it is
 permissible for a programmer to write **directly** to the SVSTATE
-SPR, and still expect Deterministic Behaviour.
+SPR, and still expect Deterministic Behaviour. It's not exactly recommended
+(performance may be impacted by direct SVSTATE access), but it is not
+prohibited either.

 ```
 292 # checks that we are able to resume in the middle of a VL loop,
@@ -414,6 +350,71 @@ could then be performed.

 Full Rationale at

+\newpage{}
+# Use case: DCT
+
+DCT has dozens of uses in Audio-Visual processing and CODECs.
+A full 8-wide in-place triple-loop Inverse DCT may be achieved
+in 8 instructions. Expanding this to 16-wide is a matter of setting
+`svshape 16` **and the same instructions used**.
+Lee Composition may be deployed to construct non-power-two DCTs.
+The cosine table may be computed (once) with 18 Vector instructions
+(one of them `fcos`)
+
+```
+1014 def test_sv_remap_fpmadds_ldbrev_idct_8_mode_4(self):
+1015 """>>> lst = [# LOAD bit-reversed with half-swap
+1016 "svshape 8, 1, 1, 14, 0",
+1017 "svremap 1, 0, 0, 0, 0, 0, 0",
+1018 "sv.lfs/els *0, 4(1)",
+1019 # Outer butterfly, iterative sum
+1020 "svremap 31, 0, 1, 2, 1, 0, 1",
+1021 "svshape 8, 1, 1, 11, 0",
+1022 "sv.fadds *0, *0, *0",
+1023 # Inner butterfly, twin +/- MUL-ADD-SUB
+1024 "svshape 8, 1, 1, 10, 0",
+1025 "sv.ffmadds *0, *0, *0, *8"
+```
+
+
+
+# Use case: Matrix Multiply
+
+Matrix Multiply of any size (non-power-2) up to a total of 127 operations
+is achievable with only three instructions. Normally in any other SIMD
+ISA at least one source requires Transposition and often massive rolling
+repetition of data is required. These 3 instructions may be used as the
+"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
+
+```
+ 28 def test_sv_remap1(self):
+ 29 """>>> lst = ["svshape 2, 2, 3, 0, 0",
+ 30 "svremap 31, 1, 2, 3, 0, 0, 0",
+ 31 "sv.fmadds *0, *8, *16, *0"
+ 32 ]
+```
+
+
+
+# Use case: Parallel Reduction
+
+Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
+Vector ISAs. Parallel Reduction is Fully Deterministic in Simple-V and
+thus may even usefully be deployed on non-associative and non-commutative
+operations.
+
+```
+ 75 def test_sv_remap2(self):
+ 76 """>>> lst = ["svshape 7, 0, 0, 7, 0",
+ 77 "svremap 31, 1, 0, 0, 0, 0, 0", # different order
+ 78 "sv.subf *0, *8, *16"
+ 79 ]
+ 80 REMAP sv.subf RT,RA,RB - inverted application of RA/RB
+ 81 left/right due to subf
+```
+
+
+
 [[!tag opf_rfc]]

 [^extend]: Prefix opcode space **must** be reserved in advance to do so, in order to avoid the catastrophic binary-incompatibility mistake made by RISC-V RVV and ARM SVE/2