From 4abbe44a21e8fdf63fe2e76a5be9829dee4aac96 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Thu, 27 Apr 2023 11:09:10 +0100
Subject: [PATCH] add intro section for xchacha20 cookbook

---
 openpower/sv/cookbook/chacha20.mdwn | 38 +++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/openpower/sv/cookbook/chacha20.mdwn b/openpower/sv/cookbook/chacha20.mdwn
index fb473eac0..34dcae479 100644
--- a/openpower/sv/cookbook/chacha20.mdwn
+++ b/openpower/sv/cookbook/chacha20.mdwn
@@ -2,13 +2,30 @@
 
 # XChaCha20 SVP64 Implementation Analysis
 
+This document shows how xchacha20's core loop - all 20 rounds - was
+implemented in just 11 Vector instructions.  There are an additional
+9 instructions involved in establishing a REMAP Schedule (explained
+below); when multiple blocks are processed, these 9 instructions do
+not need to be executed again.
+
+Firstly, we analyse the xchacha20 algorithm, showing what operations
+are performed and in what order.  Secondly, two innovative features
+of SVP64 are described which are crucial to understanding Simple-V
+Vectorisation: Vertical-First Mode and Indexed REMAP.  Then we show
+how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
+but note that in this particular algorithm REMAP is only useful for
+us in Vertical-First Mode.
+
 ## Description of XChacha20 Algorithm
 
 We will first try to analyze the XChacha20 algorithm as found in:
 
 https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c
 
-The function under inspection is `xchacha_hchacha20`. If we notice we will that the main part of the computation, the main algorithm is just a for loop -which is also the same in the `xchacha_encrypt_bytes` function as well.
+The function under inspection is `xchacha_hchacha20`. If we look
+closely, we will notice that the main part of the computation, the
+main algorithm, is just a for loop, which is the same in the
+`xchacha_encrypt_bytes` function as well.
 
 Main loop for `xchacha_hchacha20`:
 
@@ -164,7 +181,13 @@ This is what happens when REMAP is enabled with Indexing:
 
 In this way we can literally jump about, pretty much anywhere in
 the register file, according to a Schedule that is determined by
-the programmer.
+the programmer.  Therefore, if we can put all of the chacha20
+round intermediary data into an array of registers, and can
+analyse the *order* in which add-operations, xor-operations
+and rotate-operations occur, it might just be possible to
+eliminate **all** loop-unrolled inline assembler, replacing it
+with three instructions and appropriate Indexed REMAP Schedules!
+It turns out that this is indeed possible.
 
 ## Introduction to Vertical-First Mode
 
@@ -187,6 +210,17 @@ will be moved to the next element/register in the vector. Branching to
 the beginning of the loop does not occur automatically though, a branch
 instruction will have to be added manually.
 
+Vertical-First is needed because, as should be clear from xchacha20,
+there are ordering dependencies between the three operations
+`add, xor, rotate`.  It is not okay to perform the entire
+suite of Vector-adds then move on to the Vector-xors then the Vector-rotates:
+they have to be interleaved so as to respect the element-level ordering.
+This is exactly what Vertical-First allows us to do:
+element 0 add, element 0 xor, element 0 rotate, then element **1**
+add, element **1** xor, element **1** rotate and so on.  With Vertical-First
+*combined* with Indexed REMAP we can literally jump those operations around,
+anywhere within the Vector.
+
 ## Application of VF mode in the Xchacha20 loop
 
 Let's assume the values `x` in the registers 24-36
-- 
2.30.2