From 929007c627196449a1978124377bf440d15b9634 Mon Sep 17 00:00:00 2001 From: lkcl Date: Thu, 28 Oct 2021 19:43:50 +0100 Subject: [PATCH] --- conferences/openpower2021.mdwn | 107 +++++++++++++++++++++++++++++++++ 1 file changed, 107 insertions(+) diff --git a/conferences/openpower2021.mdwn b/conferences/openpower2021.mdwn index cb803b5fe..f388cf52d 100644 --- a/conferences/openpower2021.mdwn +++ b/conferences/openpower2021.mdwn @@ -17,6 +17,113 @@ Links * * +# Notes from the talk + +``` +those are all "in-place" (i.e. you use the register file to complete the entire operation, no LD/STs needed in the middle) +it's a ridiculously-long list! https://en.wikipedia.org/wiki/Discrete_cosine_transform#Applications +https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general) +the NTT wiki page is here https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)#Number-theoretic_transform +like, not having 4-wide SIMD and only using 3 of the SIMD lanes +https://arxiv.org/abs/2002.10143# +fascinating paper +that's down to not having to do branches +because the zero-overhead loop doesn't even need a branch instruction +no predication in VSX, either. +it's a rather unfortunate dichotomy, here +which according to the "strict" definition of "Custom Extension" would be in OPCODE 22 + +https://libre-soc.org/openpower/sv/overview/ +fascinatingly this was exactly what Peter Hsu (architect of the MIPS R8000) came up with back around 1994-5! 
+unfortunately, the only reason they didn't go ahead with it was because they hadn't worked out Multi-Issue Out-of-Order Execution at the time +so couldn't fully exploit the idea +each REMAP can actually be applied to more than one register if required +which is used (shown later) in the 5-operand (draft) instructions +you _could_ do this but you have to have a massive number of Reservation Stations +(an In-Order system would be hosed) +so with this trick you get multiple pipelined FMACs outstanding +the hope is that by the time the inner for-loop has completed, you can do another (partial) FMAC on the same register +i meant, you rotate (not transpose) :) +the matrix data is in order 0 1 2 3 +but REMAP can access it in 0 2 1 3 +or invert the X-dimension +1 2 0 3 + +and that is basically the values of the matrix "rotated" :) +https://libre-soc.org/openpower/sv/remap/ +Aspex was bought out by Ericsson, so the only information available on it now is papers by Argy Krikelis +and the other co-designers +https://www.researchgate.net/profile/Argy-Krikelis +here's the source for that matrix unit test +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;hb=HEAD +to experiment with Matrix "Schedules" this is a simple stand-alone program +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remapyield.py;hb=HEAD + +so you can have data appear to be re-ordered *in-place* (register numbers): +r0 r3 r6 +r1 r4 r7 +r2 r5 r8 +we cut down the MP3 FFMPEG main loop from 450 instructions down to *only 100*. +it was stunning, totally unexpected +ohh dear. FFT. 
this was hellishly complicated :) took about 2 months to do both DCT and FFT +that 5-operand draft instruction is crucial to do DCT and FFT in-place +if you don't want to do in-place, you can get away with the "normal" approach of using a temp scalar variable (and 3-4 instructions) +but, that kinda defeats the object of the exercise :) +https://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf +TMS320 FFT +standard library for the Hexagon +https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/audio-capiappi/33010 +definite "wow" on the number of VLIW uOps for Hexagon + +https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf +https://www.nayuki.io/res/fast-discrete-cosine-transform-algorithms/lee-new-algo-discrete-cosine-transform.pdf +the original paper by Byeong Gi Lee. 1984! +here's the stand-alone program which can generate the triple-loop schedules +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=422c2187867bba75c5a33d395e74d2d1081199d1;hb=0b7eb1cc2b6f1b820a54e668724f1e00967e85f3 +whoops i meant "add it to X[0]" :) +https://www.nayuki.io/page/free-small-fft-in-multiple-languages +https://developer.qualcomm.com/download/hexagon/hexagon-dsp-architecture.pdf +https://www.nayuki.io/res/fast-discrete-cosine-transform-algorithms/lee-new-algo-discrete-cosine-transform.pdf +the original paper by Byeong Gi Lee. 1984! 
+here's the stand-alone program which can generate the triple-loop schedules +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_fft_yield.py;h=422c2187867bba75c5a33d395e74d2d1081199d1;hb=0b7eb1cc2b6f1b820a54e668724f1e00967e85f3 +whoops i meant "add it to X[0]" :) +https://www.nayuki.io/page/free-small-fft-in-multiple-languages +really cool set of implementations of FFT +this was mind-bending :) +of course, if you are not doing in-place, it doesn't matter +but when you don't do in-place, you end up using *double the number of registers* which is how a lot of implementations of FFT work. sigh +that puts pressure on the regfile, which is a critical resource in 3D and Video applications +power consumption ends up going through the roof if you have to "spill" +the full unit test(s) for SVP64 FFT remap are here https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_fft.py;hb=HEAD +this is where the bit about mapping any one of the 3 REMAPs to *five* possible target registers is needed +which is why svremap takes so many operands :) +https://opencores.org/projects/hwlu +it's in VHDL. 
+the paper on ZOLC is fascinating https://www.researchgate.net/publication/224647569_A_portable_specification_of_zero-overhead_looping_control_hardware_applied_to_embedded_processors +and, like the Snitch core, has absolutely stunning reductions in instruction count (and power consumption) +reverse-order +0123 +7654 +for DCT +where FFT is +0123 +4567 +i was amazed by this elegant algorithm +from looking at the numbers +here's the source for a stand-alone program to create DCT schedules +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD +i use it to auto-generate the SVG DCT diagrams used in this talk :) +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/dct_butterfly_svg.py;hb=HEAD +https://www.youtube.com/watch?v=fn2KJvWyBKg +trying to explain it without a slide, sigh :) +it's in the video +here's the unit test for draft svp64 dct +https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD +i meant, 25% :) +they added a transpose-matrix instruction to turn 3x4 into 4x3 +it might have been NEON that ARM added that to, rather than MALI +``` # Abstract -- 2.30.2