From: Luke Kenneth Casson Leighton
Date: Thu, 29 Sep 2022 20:05:57 +0000 (+0100)
Subject: shuffle examples around to fit more words on pages
X-Git-Tag: opf_rfc_ls005_v1~263
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=363e276629c23b8e4e1d558d64c2c12f7073a67c;p=libreriscv.git

shuffle examples around to fit more words on pages
---
diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index 15c0fd29c..30d9085da 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -1158,32 +1158,38 @@ prohibited either.
-## 3D GPU style "Branch Conditional"
+## Matrix Multiply

-(*Note: Specification is ready, Simulator still under development of
-full specification capabilities*)
-This example demonstrates a 2-long Vector Branch-Conditional only
-succeeding if *all* elements in the Vector are successful. This
-avoids the need for additional instructions that would need to
-perform a Parallel Reduction of a Vector of Condition Register
-tests down to a single value, on which a Scalar Branch-Conditional
-could then be performed. Full Rationale at
-
+Matrix Multiply of any size (non-power-2) up to a total of 127 operations
+is achievable with only three instructions. Normally in any other SIMD
+ISA at least one source requires Transposition and often massive rolling
+repetition of data is required. These 3 instructions may be used as the
+"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
 ```
- 80 # test_sv_branch_cond_all
- 81 for i in [7, 8, 9]:
- 83 addi 1, 0, i+1 # set r1 to i
- 84 addi 2, 0, i # set r2 to i
- 85 cmpi cr0, 1, 1, 8 # compare r1 with 8 and store to cr0
- 86 cmpi cr1, 1, 2, 8 # compare r2 with 8 and store to cr1
- 87 sv.bc/all 12, *1, 0xc # bgt 0xc - branch if BOTH
- 88 # r1 AND r2 greater 8 to the nop below
- 89 addi 3, 0, 0x1234, # if tests fail this shouldn't execute
- 90 or 0, 0, 0 # branch target
+ 28 # test_sv_remap1 5x4 by 4x3 matrix multiply
+ 29 svshape 5, 4, 3, 0, 0
+ 30 svremap 31, 1, 2, 3, 0, 0, 0
+ 31 sv.fmadds *0, *8, *16, *0
 ```

-
+
+
+## Parallel Reduction
+
+Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
+Vector ISAs. Parallel Reduction is Fully Deterministic in Simple-V and
+thus may even usefully be deployed on non-associative and non-commutative
+operations.
+
+```
+ 75 # test_sv_remap2
+ 76 svshape 7, 0, 0, 7, 0
+ 77 svremap 31, 1, 0, 0, 0, 0, 0 # different order
+ 78 sv.subf *0, *8, *16
+```
+
+
 \newpage{}

 ## DCT
@@ -1213,38 +1219,32 @@ The cosine table may be computed (once) with 18 Vector instructions
-## Matrix Multiply
-
-Matrix Multiply of any size (non-power-2) up to a total of 127 operations
-is achievable with only three instructions. Normally in any other SIMD
-ISA at least one source requires Transposition and often massive rolling
-repetition of data is required. These 3 instructions may be used as the
-"inner triple-loop kernel" of the usual 6-loop Massive Matrix Multiply.
-
-```
- 28 # test_sv_remap1 5x4 by 4x3 matrix multiply
- 29 svshape 5, 4, 3, 0, 0
- 30 svremap 31, 1, 2, 3, 0, 0, 0
- 31 sv.fmadds *0, *8, *16, *0
-```
-
-
-
-## Parallel Reduction
+## 3D GPU style "Branch Conditional"

-Parallel (Horizontal) Reduction is often deeply problematic in SIMD and
-Vector ISAs. Parallel Reduction is Fully Deterministic in Simple-V and
-thus may even usefully be deployed on non-associative and non-commutative
-operations.
+(*Note: Specification is ready, Simulator still under development of
+full specification capabilities*)
+This example demonstrates a 2-long Vector Branch-Conditional only
+succeeding if *all* elements in the Vector are successful. This
+avoids the need for additional instructions that would need to
+perform a Parallel Reduction of a Vector of Condition Register
+tests down to a single value, on which a Scalar Branch-Conditional
+could then be performed. Full Rationale at
+

 ```
- 75 # test_sv_remap2
- 76 svshape 7, 0, 0, 7, 0
- 77 svremap 31, 1, 0, 0, 0, 0, 0 # different order
- 78 sv.subf *0, *8, *16
+ 80 # test_sv_branch_cond_all
+ 81 for i in [7, 8, 9]:
+ 83 addi 1, 0, i+1 # set r1 to i
+ 84 addi 2, 0, i # set r2 to i
+ 85 cmpi cr0, 1, 1, 8 # compare r1 with 8 and store to cr0
+ 86 cmpi cr1, 1, 2, 8 # compare r2 with 8 and store to cr1
+ 87 sv.bc/all 12, *1, 0xc # bgt 0xc - branch if BOTH
+ 88 # r1 AND r2 greater 8 to the nop below
+ 89 addi 3, 0, 0x1234, # if tests fail this shouldn't execute
+ 90 or 0, 0, 0 # branch target
 ```

-
+

 ## Big-Integer Math
@@ -1278,7 +1278,13 @@ two 64-bit consecutive registers in succession.
 ```

 Additional 128/64 Mul and Div/Mod instructions may similarly be exploited
-to perform roll-over in arbitrary-length arithmetic.
+to perform roll-over in arbitrary-length arithmetic: effectively they use
+one of the two 64-bit output registers as a form of "64-bit Carry In-Out".
+
+All of these big-integer instructions are Scalar instructions standing on
+their own merit and may be utilised even in a Scalar environment to improve
+performance. When used with Simple-V they may also be used to improve
+performance and also greatly simplify unlimited-length biginteger algorithms.
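The Matrix Multiply section in this patch claims that a whole 5x4 by 4x3 product fits in three instructions because REMAP turns one vector loop of multiply-adds into the usual triple loop. The sketch below models that idea in plain Python under stated assumptions: `matmul_schedules` and `remapped_fmadd_matmul` are hypothetical names, all matrices are assumed to be stored flat in row-major order, and the loop order used to build the three index schedules is an illustrative choice, not the actual SVP64 REMAP ordering set up by `svshape`/`svremap`. The accumulator is both destination and addend, mirroring `sv.fmadds *0, *8, *16, *0`, where `*0` appears as both the target and the last (addend) operand.

```python
def matmul_schedules(m, k, n):
    """Hypothetical REMAP-style index schedules for C[m x n] += A[m x k] * B[k x n].

    Returns three parallel lists (dest, src_a, src_b) of flat row-major
    offsets, one entry per multiply-add: m * k * n entries in total.
    """
    dest, src_a, src_b = [], [], []
    for i in range(m):          # row of A and of C
        for j in range(n):      # column of B and of C
            for p in range(k):  # inner (dot-product) dimension
                dest.append(i * n + j)
                src_a.append(i * k + p)
                src_b.append(p * n + j)
    return dest, src_a, src_b


def remapped_fmadd_matmul(a, b, m, k, n, max_ops=127):
    """Compute the whole product as ONE flat loop of multiply-adds.

    `a` and `b` are flat row-major lists. `max_ops` mirrors the
    127-operation limit quoted in the text (assumed here to apply to
    the total number of multiply-adds).
    """
    ops = m * k * n
    if ops > max_ops:
        raise ValueError(f"{ops} operations exceed the limit of {max_ops}")
    c = [0.0] * (m * n)
    dest, src_a, src_b = matmul_schedules(m, k, n)
    for d, s1, s2 in zip(dest, src_a, src_b):
        c[d] = c[d] + a[s1] * b[s2]   # one multiply-add element per iteration
    return c
```

For the 5x4 by 4x3 case the schedules hold 5 × 4 × 3 = 60 entries, inside the 127-operation limit, and no source ever needs transposing: the index lists alone decide which elements meet.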
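The Parallel Reduction section argues that determinism is what makes reducing a non-associative, non-commutative operation such as subtraction useful: if the pairing order is fixed by the architecture rather than left to the implementation, the result is reproducible. The sketch below applies one fixed binary-tree schedule in place. The function name `tree_reduce` and the specific pairing order are assumptions for illustration; the exact order produced by SVP64 REMAP is not reproduced here.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_reduce(values: List[T], op: Callable[[T, T], T]) -> T:
    """Reduce `values` with `op` using a fixed binary-tree schedule.

    At each stage, element i absorbs element i + step (for i a multiple
    of 2 * step). The pairing depends only on len(values), never on
    timing, so the same input always yields the same result, even when
    `op` is neither associative nor commutative.
    """
    if not values:
        raise ValueError("cannot reduce an empty vector")
    v = list(values)            # work on a copy, like a scratch register file
    n = len(v)
    step = 1
    while step < n:
        for i in range(0, n - step, 2 * step):
            v[i] = op(v[i], v[i + step])
        step *= 2
    return v[0]
```

With a 7-element vector [1, 2, ..., 7], this schedule reduces subtraction to 8 on every run, whereas a left-to-right fold gives -26. Both are deterministic but they are different schedules, which is why the order has to be architecturally defined before non-commutative operations can be reduced safely.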
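To make the "all elements" semantics of the Branch Conditional example concrete, the sketch below replays `test_sv_branch_cond_all` in plain Python. It models only the outcome described by the listing's comments: `sv.bc/all` is treated as a branch taken when every element of the 2-long CR vector has its greater-than bit set. The helpers `branch_all` and `run_example` are hypothetical, and r3 is assumed to start at zero. Note that r1 receives `i+1`, not `i` as its comment states.

```python
def branch_all(cr_gt_bits):
    """Hypothetical model of sv.bc/all: taken only if EVERY element's GT bit is set."""
    return all(cr_gt_bits)


def run_example(i):
    """Replay one iteration of test_sv_branch_cond_all and return the final r3."""
    r1 = i + 1                      # addi 1, 0, i+1
    r2 = i                          # addi 2, 0, i
    cr0_gt = r1 > 8                 # cmpi cr0, 1, 1, 8 (GT bit of cr0)
    cr1_gt = r2 > 8                 # cmpi cr1, 1, 2, 8 (GT bit of cr1)
    r3 = 0                          # assumed initial value of r3
    if not branch_all([cr0_gt, cr1_gt]):
        r3 = 0x1234                 # addi 3, 0, 0x1234 runs only when the branch is NOT taken
    return r3                       # branch target: or 0, 0, 0 (nop)
```

Only `i = 9` puts both r1 (10) and r2 (9) above 8, so only that iteration skips the `addi`. Without the `/all` mode, a Scalar ISA would first have to combine the two CR results (for example with a CR logical instruction) before a single `bc` could test them.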
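The phrase "64-bit Carry In-Out" in the Big-Integer Math section can be illustrated with a multiply of an arbitrary-length unsigned integer by one 64-bit scalar. Each step forms a 128-bit value (limb × scalar + incoming carry); the low 64 bits become the result limb and the high 64 bits become the carry fed into the next step. The sketch uses Python integers to stand in for 64-bit registers. `mul_add_step` and `bigint_mul_scalar` are hypothetical names, and the step is not claimed to match any specific instruction's operand order.

```python
MASK64 = (1 << 64) - 1


def mul_add_step(a, b, carry_in):
    """Hypothetical 128/64 step: returns (low 64 bits, high 64 bits) of a*b + carry_in.

    With all inputs below 2**64 the sum is at most 2**128 - 2**64,
    so the high half always fits back into 64 bits.
    """
    full = a * b + carry_in
    return full & MASK64, full >> 64


def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by one 64-bit scalar."""
    out = []
    carry = 0
    for limb in limbs:              # the high half is the "carry in-out" between steps
        lo, carry = mul_add_step(limb, scalar, carry)
        out.append(lo)
    if carry:
        out.append(carry)           # the final carry becomes a new top limb
    return out
```

Because the carry lives in an ordinary 64-bit register rather than a single carry flag, the same step chains to any length, which is what lets these Scalar instructions become the inner body of an unlimited-length loop under Simple-V.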