From 4abbe44a21e8fdf63fe2e76a5be9829dee4aac96 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Thu, 27 Apr 2023 11:09:10 +0100
Subject: [PATCH] add intro section for xchacha20 cookbook

---
 openpower/sv/cookbook/chacha20.mdwn | 38 +++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/openpower/sv/cookbook/chacha20.mdwn b/openpower/sv/cookbook/chacha20.mdwn
index fb473eac0..34dcae479 100644
--- a/openpower/sv/cookbook/chacha20.mdwn
+++ b/openpower/sv/cookbook/chacha20.mdwn
@@ -2,13 +2,30 @@
 
 # XChaCha20 SVP64 Implementation Analysis
 
+This document shows how xchacha20's core loop - all 20 rounds - was
+implemented in just 11 Vector instructions. There are an additional
+9 instructions involved in establishing a REMAP Schedule (explained
+below), which if there are multiple blocks these 9 instructions do not
+need to be called again.
+
+Firstly, we analyse the xchacha20 algorithm, showing what operations
+are performed and in what order. Secondly, two innovative features
+of SVP64 are described which are crucial to understanding of Simple-V
+Vectorisation: Vertical-First Mode and Indexed REMAP. Then we show
+how Index REMAP eliminates the need entirely for inline-loop-unrolling,
+but note that in this particular algorithm REMAP is only useful for
+us in Vertical-First Mode.
+
 ## Description of XChacha20 Algorithm
 
 We will first try to analyze the XChacha20 algorithm as found in:
 
 https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c
 
-The function under inspection is `xchacha_hchacha20`. If we notice we will that the main part of the computation, the main algorithm is just a for loop -which is also the same in the `xchacha_encrypt_bytes` function as well.
+The function under inspection is `xchacha_hchacha20`. If we notice
+we will that the main part of the computation, the main algorithm is
+just a for loop -which is also the same in the `xchacha_encrypt_bytes`
+function as well.
 
 Main loop for `xchacha_hchacha20`:
 
@@ -164,7 +181,13 @@ This is what happens when REMAP is enabled with Indexing:
 
 In this way we can literally jump about, pretty much anywhere in
 the register file, according to a Schedule that is determined by
-the programmer.
+the programmer. Therefore, if we can put all of the chacha20
+round intermediary data into an array of registers, and can
+analyse the *order* in which add-operations, xor-operations
+and rotate-operations occur, it might just be possible to
+eliminate **all** loop-unrolled inline assembler, replacing it
+with three instructions and appropriate Indexed REMAP Schedules!
+Turns out that this is indeed possible.
 
 ## Introduction to Vertical-First Mode
 
@@ -187,6 +210,17 @@ will be moved to the next element/register in the vector. Branching to
 the beginning of the loop does not occur automatically though, a branch
 instruction will have to be added manually.
 
+The reason why Vertical-First is needed is because it should be clear
+from xchacha20 that there are ordering dependencies between the three
+operations `add, xor, rotate`. It is not okay to perform the entire
+suite of Vector-adds then move on to the Vector-xors then the Vector-rotates:
+they have to be interleaved so as to respect the element-level ordering.
+This is exactly what Vertical-First allows us to do:
+element 0 add, element 0 xor, element 0 rotate, then element **1**
+add, element **1** xor, element **1** rotate and so on. Vertical-First
+*combined* with Index REMAP we can literally jump those operations around,
+anywhere within the Vector.
+
 ## Application of VF mode in the Xchacha20 loop
 
 Let's assume the values `x` in the registers 24-36
-- 
2.30.2