# XChaCha20 SVP64 Implementation Analysis
+This document shows how xchacha20's core loop - all 20 rounds - was
+implemented in just 11 Vector instructions. An additional
+9 instructions are needed to establish a REMAP Schedule (explained
+below), but when multiple blocks are processed these 9 instructions
+do not need to be executed again.
+
+Firstly, we analyse the xchacha20 algorithm, showing what operations
+are performed and in what order. Secondly, two innovative features
+of SVP64 are described which are crucial to an understanding of Simple-V
+Vectorisation: Vertical-First Mode and Indexed REMAP. Then we show
+how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
+but note that in this particular algorithm REMAP is only useful to
+us in Vertical-First Mode.
+
## Description of XChacha20 Algorithm
We will first try to analyze the XChacha20 algorithm as found in:
https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c
-The function under inspection is `xchacha_hchacha20`. If we notice we will that the main part of the computation, the main algorithm is just a for loop -which is also the same in the `xchacha_encrypt_bytes` function as well.
+The function under inspection is `xchacha_hchacha20`. Looking closely,
+we will notice that the main part of the computation, the main algorithm,
+is just a for loop, which is the same in the `xchacha_encrypt_bytes`
+function.
Main loop for `xchacha_hchacha20`:
In this way we can literally jump about, pretty much anywhere in
the register file, according to a Schedule that is determined by
-the programmer.
+the programmer. Therefore, if we can put all of the chacha20
+round intermediary data into an array of registers, and can
+analyse the *order* in which add-operations, xor-operations
+and rotate-operations occur, it might just be possible to
+eliminate **all** loop-unrolled inline assembler, replacing it
+with three instructions and appropriate Indexed REMAP Schedules!
+It turns out that this is indeed possible.
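
The idea can be modelled in a few lines of Python. This is only a sketch, not SVP64: a plain list stands in for the array of registers, and the names `QUARTER_ROUNDS`, `build_schedule` and `double_round_indexed` are hypothetical, chosen for illustration. It assumes the standard ChaCha column and diagonal quarter-round indices, and flattens one double round into a table of (add destination, add source, xor destination, rotate amount) entries, so that the loop body is just one add, one xor and one rotate.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32


def rotl32(v, r):
    """Rotate a 32-bit value left by r bits (0 < r < 32)."""
    return ((v << r) | (v >> (32 - r))) & MASK


def quarter_round(x, a, b, c, d):
    """Reference: the standard ChaCha quarter-round, fully unrolled."""
    x[a] = (x[a] + x[b]) & MASK
    x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK
    x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK
    x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK
    x[b] = rotl32(x[b] ^ x[c], 7)


# Four column quarter-rounds followed by four diagonal ones.
QUARTER_ROUNDS = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),
]


def build_schedule():
    """Flatten one double round into (i, j, k, r) index entries.

    Each entry means: x[i] += x[j]; x[k] ^= x[i]; x[k] <<<= r.
    This table plays the role of the Indexed REMAP Schedule in the model.
    """
    schedule = []
    for a, b, c, d in QUARTER_ROUNDS:
        schedule += [(a, b, d, 16), (c, d, b, 12), (a, b, d, 8), (c, d, b, 7)]
    return schedule


def double_round_indexed(x, schedule):
    """Three operations per step; the schedule picks the elements."""
    for i, j, k, r in schedule:
        x[i] = (x[i] + x[j]) & MASK  # add
        x[k] ^= x[i]                 # xor
        x[k] = rotl32(x[k], r)       # rotate
```

Calling `double_round_indexed(x, build_schedule())` on a 16-element list should leave it in the same state as applying the eight unrolled `quarter_round` calls in order. In this model the schedule has 32 entries and is built once, then reused for every double round.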
## Introduction to Vertical-First Mode
the beginning of the loop does not occur automatically though, a branch
instruction will have to be added manually.
+Vertical-First is needed because, as should be clear from xchacha20,
+there are ordering dependencies between the three operations
+`add, xor, rotate`. It is not okay to perform the entire suite of
+Vector-adds, then move on to the Vector-xors, then the Vector-rotates:
+they have to be interleaved so as to respect the element-level ordering.
+This is exactly what Vertical-First allows us to do:
+element 0 add, element 0 xor, element 0 rotate, then element **1**
+add, element **1** xor, element **1** rotate and so on. With Vertical-First
+*combined* with Indexed REMAP we can literally jump those operations around,
+anywhere within the Vector.
+
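The ordering argument above can be checked with a small Python model. This is a sketch under stated assumptions, not SVP64 code: a four-element list stands in for the registers, `STEPS` is a hypothetical hand-written schedule for a single quarter-round on elements 0 to 3, and `vertical_first` and `horizontal_first` are illustrative names for the two orderings.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32


def rotl32(v, r):
    """Rotate a 32-bit value left by r bits (0 < r < 32)."""
    return ((v << r) | (v >> (32 - r))) & MASK


# One quarter-round on (a, b, c, d) = elements (0, 1, 2, 3), as four
# (add-dest, add-src, xor-dest, rotate-amount) steps.
STEPS = [(0, 1, 3, 16), (2, 3, 1, 12), (0, 1, 3, 8), (2, 3, 1, 7)]


def vertical_first(x):
    """Per step: add, then xor, then rotate, before moving to the next step."""
    x = list(x)
    for i, j, k, r in STEPS:
        x[i] = (x[i] + x[j]) & MASK
        x[k] ^= x[i]
        x[k] = rotl32(x[k], r)
    return x


def horizontal_first(x):
    """All adds, then all xors, then all rotates (the ordering that fails)."""
    x = list(x)
    for i, j, _, _ in STEPS:
        x[i] = (x[i] + x[j]) & MASK
    for i, _, k, _ in STEPS:
        x[k] ^= x[i]
    for _, _, k, r in STEPS:
        x[k] = rotl32(x[k], r)
    return x


def quarter_round_reference(x):
    """The standard ChaCha quarter-round, written out in full."""
    a, b, c, d = x
    a = (a + b) & MASK
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK
    b = rotl32(b ^ c, 7)
    return [a, b, c, d]
```

For an input such as `[1, 2, 3, 4]`, `vertical_first` should agree with `quarter_round_reference`, while `horizontal_first` does not: the second add needs the rotated result of the first xor, which has not been computed yet when all the adds are performed up front.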
## Application of VF mode in the Xchacha20 loop
Let's assume the values `x` in the registers 24-36