--- /dev/null
+# Implementation strategy for poly1305 on Simple-V
+
+Links:
+
+* [Poly1305's design](https://loup-vaillant.fr/tutorials/poly1305-design)
+* [Python reference](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/poly1305-donna.py;hb=HEAD)
+* [Bug report](https://bugs.libre-soc.org/show_bug.cgi?id=1158)
+
+Poly1305 is a fast and surprisingly simple provably-secure one-time
+authenticator. A great deal of detailed mathematical analysis has
+been done on it, demonstrating tricks of modular arithmetic that
+can be exploited to good effect; that analysis need not be repeated
+here. The primary purpose of this document is to show that, once
+patterns are identified, Simple-V REMAP can be used in place of the
+standard implementations on Scalar / SIMD ISAs, which are
+loop-unrolled to ensure that no branching (and certainly no
+conditional branching) is used.
+
+For example, this entire function from Loup Vaillant's tutorial
+may be carried out with some REMAPs, a mul and a madd instruction.
+
+```
+void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
+{
+ uint64_t a0 = a[0]; uint64_t b0 = b[0];
+ uint64_t a1 = a[1]; uint64_t b1 = b[1];
+ uint64_t a2 = a[2]; uint64_t b2 = b[2];
+ uint64_t a3 = a[3]; uint64_t b3 = b[3];
+ uint64_t a4 = a[4]; uint64_t b4 = b[4];
+ p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
+ p[1] = a0*b1 + a1*b0 + a2*b4*5 + a3*b3*5 + a4*b2*5;
+ p[2] = a0*b2 + a1*b1 + a2*b0 + a3*b4*5 + a4*b3*5;
+ p[3] = a0*b3 + a1*b2 + a2*b1 + a3*b0 + a4*b4*5;
+ p[4] = a0*b4 + a1*b3 + a2*b2 + a3*b1 + a4*b0 ;
+}
+```
+
+Above, the patterns are quite clear: with `p` zeroed first, the
+function can be expressed as a pair of nested loops (which is
+normally avoided like the plague):
+
+```
+for ai = 0 to 4
+  for bi = 0 to 4
+    if ai + bi >= 5 then const = 5 else const = 1
+    p[(bi+ai)%5] += a[ai] * b[bi] * const
+```
+
+This can be covered in Simple-V by having:
+
+* one REMAP for the constant (5 or 1), which
+ is simplest done by a Triangle REMAP that
+ selects from a Vector (otherwise, Indexed
+ REMAP is necessary)
+* a second REMAP for the loop `ai` above
+* a third REMAP for the loop value `bi+ai` modulo 5
+
+Then - using Vertical-First - the loop is four
+instructions: mul, madd, svstep and bc.
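+
+As a cross-check on the pattern, here is a minimal Python sketch. It is
+not part of the reference implementation, and the names
+`mult_unrolled` and `mult_looped` are made up for illustration. It
+compares the unrolled function with a two-loop form in which
+`a[ai] * b[bi]` accumulates into `p[(ai+bi)%5]`, scaled by 5 whenever
+`ai+bi` wraps past 4:
+
+```python
+def mult_unrolled(a, b):
+    """Direct transcription of the unrolled mult() from the tutorial."""
+    a0, a1, a2, a3, a4 = a
+    b0, b1, b2, b3, b4 = b
+    return [
+        a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5,
+        a0*b1 + a1*b0   + a2*b4*5 + a3*b3*5 + a4*b2*5,
+        a0*b2 + a1*b1   + a2*b0   + a3*b4*5 + a4*b3*5,
+        a0*b3 + a1*b2   + a2*b1   + a3*b0   + a4*b4*5,
+        a0*b4 + a1*b3   + a2*b2   + a3*b1   + a4*b0,
+    ]
+
+
+def mult_looped(a, b):
+    """The same 25 products, generated by two nested loops."""
+    p = [0] * 5
+    for ai in range(5):
+        for bi in range(5):
+            # products whose index sum wraps past 4 are folded back
+            # into the low limbs with a factor of 5
+            const = 5 if ai + bi >= 5 else 1
+            p[(ai + bi) % 5] += a[ai] * b[bi] * const
+    return p
+
+
+# arbitrary 26-bit test limbs, chosen only for illustration
+a = [0x3ffffff, 0x1234567, 0x0abcdef, 0x2000001, 0x0000003]
+b = [0x3fffffb, 0x0765432, 0x1fedcba, 0x0000007, 0x3ffffff]
+assert mult_unrolled(a, b) == mult_looped(a, b)
+```
+
+The inner statement is one multiply by the constant and one
+multiply-accumulate, which is where the mul and madd pair comes from.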
+
+Fast-forwarding to the Python implementation, let us look at this:
+
+```
+h0 = (h0 & c) | g0;
+h1 = (h1 & c) | g1;
+h2 = (h2 & c) | g2;
+```
+
+This again is a simple parallel operation, so it can be done
+as two Vertical-First instructions:
+
+```
+sv.and *h, *h, c
+sv.or *h, *h, *g
+```
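+
+Element by element, the two instructions are expected to behave like
+the minimal Python model below. It assumes, as in the donna-style final
+reduction, that `c` is a scalar mask of either all ones or all zeros
+and that `g` has already been masked with its complement. The helper
+name `select_h` is hypothetical.
+
+```python
+MASK64 = (1 << 64) - 1
+
+
+def select_h(h, g, c):
+    """Model of `sv.and *h, *h, c` followed by `sv.or *h, *h, *g`."""
+    # sv.and: the scalar c is applied to every element of h
+    h = [(hn & c) & MASK64 for hn in h]
+    # sv.or: element-wise OR of the two vectors
+    return [hn | gn for hn, gn in zip(h, g)]
+
+
+h = [0x111, 0x222, 0x333]
+g = [0xaaa, 0xbbb, 0xccc]
+# c all ones (and g zeroed): h is kept
+assert select_h(h, [0, 0, 0], MASK64) == h
+# c all zeros: g replaces h
+assert select_h(h, g, 0) == g
+```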
+
+The following section, however, is more complex, and also requires
+a new instruction (Double-Shift-and-add, where the instructions
+designed previously are Double-Shift-and-or):
+
+```
+h0 += (( t0                    ) & 0xfffffffffff);
+h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
+h2 += (((t1 >> 24)             ) & 0x3ffffffffff) | hibit;
+```
+
+Here, a "cheat" is required - slightly - to bring in some extra
+variables that are set to zero: t2 and t-minus-one (which we name tm):
+
+```
+h0 += (((tm >> 64) | (t0 << 0 )) & 0xfffffffffff);
+h1 += (((t0 >> 44) | (t1 << 20)) & 0xfffffffffff);
+h2 += (((t1 >> 24) | (t2 << 40)) & 0x3ffffffffff);
+h2 += hibit;
+```
+
+Both tm and t2 are set to zero: strictly speaking tm shifted by 64
+will always be zero. Writing tm as `t[-1]` and taking the shift
+amounts as `s = [64, 44, 24]`, it now becomes possible to identify
+the pattern, and also split out the AND part:
+
+```
+h[N] += ((t[N-1] >> s[N]) | (t[N] << (64-s[N])))
+h[N] &= const[N]
+```
+
+Here it is clear that the shift amount `s[N]` and `const[N]`
+can be set up as Vectors. It is then possible to use a
+proposed new instruction, "double-shift-and-add", for the first
+part, and a straight `sv.and *h, *h, *const` for the second.
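+
+The regularity can be checked with a short Python sketch, given here as
+an illustration rather than a definitive implementation. The helper
+names are hypothetical, and `hibit` is assumed to be either `1 << 40`
+or zero, as in poly1305-donna. Element N is built from
+`t[N-1] >> s[N]` and `t[N] << (64-s[N])` with `s = [64, 44, 24]`. Note
+that the sketch masks the shifted value before accumulating, as the
+three reference lines do: masking `h` after the add would also clear
+any bits already held in `h` above the mask.
+
+```python
+MASK64 = (1 << 64) - 1
+
+
+def shr64(x, n):
+    """64-bit logical shift right; a shift of 64 gives zero."""
+    return (x & MASK64) >> n
+
+
+def shl64(x, n):
+    """64-bit logical shift left; bits moved past bit 63 are lost."""
+    return (x << n) & MASK64
+
+
+def add_explicit(h, t0, t1, hibit):
+    """The three lines as written in the reference."""
+    h0, h1, h2 = h
+    h0 += t0 & 0xfffffffffff
+    h1 += (shr64(t0, 44) | shl64(t1, 20)) & 0xfffffffffff
+    h2 += (shr64(t1, 24) & 0x3ffffffffff) | hibit
+    return [h0, h1, h2]
+
+
+def add_pattern(h, t0, t1, hibit):
+    """Same update, driven by Vectors of shift amounts and masks."""
+    t = [0, t0, t1, 0]   # tm, t0, t1, t2 (tm and t2 are zero)
+    s = [64, 44, 24]     # right-shift amount per element
+    const = [0xfffffffffff, 0xfffffffffff, 0x3ffffffffff]
+    h = list(h)
+    for n in range(3):
+        # because tm sits at the front of t, t[n] is the "previous"
+        # word and t[n + 1] the "current" one
+        x = shr64(t[n], s[n]) | shl64(t[n + 1], 64 - s[n])
+        h[n] += x & const[n]   # mask first, then accumulate
+    h[2] += hibit
+    return h
+
+
+h = [1, 2, 3]
+t0, t1 = 0x0123456789abcdef, 0xfedcba9876543210
+assert add_pattern(h, t0, t1, 1 << 40) == add_explicit(h, t0, t1, 1 << 40)
+assert add_pattern(h, t0, t1, 0) == add_explicit(h, t0, t1, 0)
+```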
+
+In this way we have again identified very simple regular patterns
+and applied Vectorisation to them to reduce instruction count.
+Interestingly, unlike many algorithms converted to Simple-V, a huge
+reduction in instruction count is not anticipated in this case. The
+key point is that the core of the algorithm is preserved and is
+thus easy to validate.