From c693d6e423f55db2184e0c8400f6b8e500263547 Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Mon, 27 May 2024 16:08:26 +0100 Subject: [PATCH] add poly1305.mdwn - documentation https://bugs.libre-soc.org/show_bug.cgi?id=1158 ` --- openpower/sv/cookbook/poly1305.mdwn | 115 ++++++++++++++++++++++++++++ 1 file changed, 115 insertions(+) create mode 100644 openpower/sv/cookbook/poly1305.mdwn diff --git a/openpower/sv/cookbook/poly1305.mdwn b/openpower/sv/cookbook/poly1305.mdwn new file mode 100644 index 000000000..97eaa6f2b --- /dev/null +++ b/openpower/sv/cookbook/poly1305.mdwn @@ -0,0 +1,115 @@ +# Implementation strategy for poly1305 on Simple-V + +Links: + +* [Poly1305's design](https://loup-vaillant.fr/tutorials/poly1305-design) +* [Python reference](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/poly1305-donna.py;hb=HEAD) +* Bugreport + +Poly1305 is a fast and surprisingly simple provably-secure one-time +authenticator. A huge amount of detailed mathematical analysis has +been done on it, demonstrating and implementing tricks of modulo +arithmetic that can be exploited to good effect and need not be +repeated here: the primary purpose of this document is to show that +it is possible to use Simple-V REMAP when patterns are identified, +instead of standard implementations on Scalar / SIMD ISAs that are +loop-unrolled in order to ensure no branching (and definitely no +conditional-branching) is utilised. + +For example: this entire function, from Loup Vaillant's tutorial +may be carried out with some REMAPs, a mul and a madd instruction. + +``` +void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5]) +{ + uint64_t a0 = a[0]; uint64_t b0 = b[0]; + uint64_t a1 = a[1]; uint64_t b1 = b[1]; + uint64_t a2 = a[2]; uint64_t b2 = b[2]; + uint64_t a3 = a[3]; uint64_t b3 = b[3]; + uint64_t a4 = a[4]; uint64_t b4 = b[4]; + p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5; + p[1] = a0*b1 + a1*b0 + a2*b4*5 + a3*b3*5 + a4*b2*5; + p[2] = a0*b2 + a1*b1 + a2*b0 + a3*b4*5 + a4*b3*5; + p[3] = a0*b3 + a1*b2 + a2*b1 + a3*b0 + a4*b4*5; + p[4] = a0*b4 + a1*b3 + a2*b2 + a3*b1 + a4*b0 ; +} +``` + +Above, the patterns are quite clear: it can be expressed +as a pair of loops (which is normally avoided like the plague), + +``` +for ai = 0 to 4 + for bi = 0 to 4 + if ai < bi then const = 5 else 1 + p[ai] += a[ai] * b[(bi+ai)%5] * const +``` + +This can be covered in Simple-V by having + +* one REMAP for the constant (5 or 1), which + is simplest done by a Triangle REMAP that + selects from a Vector. (otherwise, Indexed + REMAP is necessary) +* a second REMAP for the loop `ai` above +* a third REMAP for the loop value `bi+ai` modulo 5 + +Then - using Vertical-First - the loop is four +instructions: mul, madd, svstep and bc. + +Fast-forwarding to the python implementation, let us look at this: + +``` + 186 h0 = (h0 & c) | g0; + 187 h1 = (h1 & c) | g1; + 188 h2 = (h2 & c) | g2; +``` + +This again is a simple parallel operation, so can be done +as two Vertical-First instructions: + +``` +sv.and *h, *h, c +sv.or *h, *h, *g +``` + +This section however gets more complex, and also requires +a new instruction (Double-Shift-and-add, where previous +instructions designed are Double-Shift-and-or): + +``` + 110 h0 += (( t0 ) & 0xfffffffffff); + 111 h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); + 112 h2 += (((t1 >> 24) ) & 0x3ffffffffff) | hibit; +``` + +Here, a "cheat" is required - slightly - to bring in some extra +variables that are set to zero: t2 and t-minus-one (which we name tm): + +``` + 110 h0 += (((t0 >> 0 ) | (tm << 64)) & 0xfffffffffff); + 111 h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff); + 112 h2 += (((t1 >> 24) | (t2 << 40 ) & 0x3ffffffffff); + 112 h2 |= hibit; +``` + +Both tm and t2 are set to zero: strictly speaking tm shifted by 64 +will always be zero. Now it becomes possible to identify the +pattern, and also split out the AND part: + +``` + h[N] += (((t[N] >> s[N] ) | (t[N-1] << (64-s[N])) + h[N] &= const[N] +``` + +Where it is clear that the shift amount `s[N]` and `const[N]` +can be set up as Vectors. Now it is possible to utilise a +proposed new instruction, "double-shift-and-add" for the first +part, and a straight `sv.add *h, *h, *const` for the second. + +In this way, again we have identified very simple regular patterns, +and applied Vectorisation to them to reduce instruction count. +Interestingly in this case, unlike many algorithms converted +there is not anticipated to be a huge reduction in instruction +count, but the key is that the core of the algorithm is preserved +and thus is easy to validate. -- 2.30.2