add poly1305.mdwn - documentation

author Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Mon, 27 May 2024 15:08:26 +0000 (16:08 +0100)

committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>

Mon, 27 May 2024 15:08:30 +0000 (16:08 +0100)
author Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Mon, 27 May 2024 15:08:26 +0000 (16:08 +0100)
committer Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Mon, 27 May 2024 15:08:30 +0000 (16:08 +0100)
diff --git a/openpower/sv/cookbook/poly1305.mdwn b/openpower/sv/cookbook/poly1305.mdwn

new file mode 100644 (file)

index 0000000..97eaa6f
--- /dev/null
+++ b/openpower/sv/cookbook/poly1305.mdwn
@@ -0,0 +1,115 @@
+# Implementation strategy for poly1305 on Simple-V
+
+Links:
+
+* [Poly1305's design](https://loup-vaillant.fr/tutorials/poly1305-design)
+* [Python reference](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/poly1305-donna.py;hb=HEAD)
+* Bugreport <https://bugs.libre-soc.org/show_bug.cgi?id=1158>
+
+Poly1305 is a fast and surprisingly simple provably-secure one-time
+authenticator. A huge amount of detailed mathematical analysis has
+been done on it, demonstrating and implementing tricks of modulo
+arithmetic that can be exploited to good effect and need not be
+repeated here: the primary purpose of this document is to show that
+it is possible to use Simple-V REMAP when patterns are identified,
+instead of standard implementations on Scalar / SIMD ISAs that are
+loop-unrolled in order to ensure no branching (and definitely no
+conditional-branching) is utilised.
+
+For example: this entire function, from Loup Vaillant's tutorial
+may be carried out with some REMAPs, a mul and a madd instruction.
+
+```
+void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
+{
+    uint64_t a0 = a[0];  uint64_t b0 = b[0];
+    uint64_t a1 = a[1];  uint64_t b1 = b[1];
+    uint64_t a2 = a[2];  uint64_t b2 = b[2];
+    uint64_t a3 = a[3];  uint64_t b3 = b[3];
+    uint64_t a4 = a[4];  uint64_t b4 = b[4];
+    p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
+    p[1] = a0*b1 + a1*b0   + a2*b4*5 + a3*b3*5 + a4*b2*5;
+    p[2] = a0*b2 + a1*b1   + a2*b0   + a3*b4*5 + a4*b3*5;
+    p[3] = a0*b3 + a1*b2   + a2*b1   + a3*b0   + a4*b4*5;
+    p[4] = a0*b4 + a1*b3   + a2*b2   + a3*b1   + a4*b0  ;
+}
+```
+
+Above, the patterns are quite clear: it can be expressed
+as a pair of loops (which is normally avoided like the plague),
+
+```
+for ai = 0 to 4
+    for bi = 0 to 4
+        if ai < bi then const = 5 else 1
+        p[ai] += a[ai] * b[(bi+ai)%5] * const
+```
+
+This can be covered in Simple-V by having
+
+* one REMAP for the constant (5 or 1), which
+  is simplest done by a Triangle REMAP that
+  selects from a Vector. (otherwise, Indexed
+  REMAP is necessary)
+* a second REMAP for the loop `ai` above
+* a third REMAP for the loop value `bi+ai` modulo 5
+
+Then - using Vertical-First - the loop is four
+instructions: mul, madd, svstep and bc.
+
+Fast-forwarding to the python implementation, let us look at this:
+
+```
+ 186    h0 = (h0 & c) | g0;
+ 187    h1 = (h1 & c) | g1;
+ 188    h2 = (h2 & c) | g2;
+```
+
+This again is a simple parallel operation, so can be done
+as two Vertical-First instructions:
+
+```
+sv.and *h, *h, c
+sv.or  *h, *h, *g
+```
+
+This section however gets more complex, and also requires
+a new instruction (Double-Shift-and-add, where previous
+instructions designed are Double-Shift-and-or):
+
+```
+ 110    h0 += (( t0                    ) & 0xfffffffffff);
+ 111    h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
+ 112    h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;
+```
+
+Here, a "cheat" is required - slightly - to bring in some extra
+variables that are set to zero: t2 and t-minus-one (which we name tm):
+
+```
+ 110    h0 += (((t0 >> 0 ) | (tm << 64)) & 0xfffffffffff);
+ 111    h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
+ 112    h2 += (((t1 >> 24) | (t2 << 40 ) & 0x3ffffffffff);
+ 112    h2 |= hibit;
+```
+
+Both tm and t2 are set to zero: strictly speaking tm shifted by 64
+will always be zero.  Now it becomes possible to identify the
+pattern, and also split out the AND part:
+
+```
+        h[N] += (((t[N] >> s[N] ) | (t[N-1] << (64-s[N]))
+        h[N] &= const[N]
+```
+
+Where it is clear that the shift amount `s[N]` and `const[N]`
+can be set up as Vectors. Now it is possible to utilise a
+proposed new instruction, "double-shift-and-add" for the first
+part, and a straight `sv.add *h, *h, *const` for the second.
+
+In this way, again we have identified very simple regular patterns,
+and applied Vectorisation to them to reduce instruction count.
+Interestingly in this case, unlike many algorithms converted
+there is not anticipated to be a huge reduction in instruction
+count, but the key is that the core of the algorithm is preserved
+and thus is easy to validate.
author	Luke Kenneth Casson Leighton <lkcl@lkcl.net>
	Mon, 27 May 2024 15:08:26 +0000 (16:08 +0100)
committer	Luke Kenneth Casson Leighton <lkcl@lkcl.net>
	Mon, 27 May 2024 15:08:30 +0000 (16:08 +0100)