[[!tag svp64_cookbook]]
# XChaCha20 SVP64 Implementation Analysis
This document shows how xchacha20's core loop, all 20 rounds, was
implemented in just 11 Vector instructions. An additional 9 instructions
are involved in establishing a REMAP Schedule (explained below); if there
are multiple blocks, these 9 instructions do not need to be executed again.
Firstly, we analyse the xchacha20 algorithm, showing what operations
are performed and in what order. Secondly, two innovative features
of SVP64 are described which are crucial to an understanding of Simple-V
Vectorisation: Vertical-First Mode and Indexed REMAP. Then we show
how Indexed REMAP entirely eliminates the need for inline loop-unrolling,
noting that in this particular algorithm REMAP is only useful to us
in Vertical-First Mode.
## Description of XChacha20 Algorithm
We will first analyse the XChacha20 algorithm as found in:

https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c
The function under inspection is `xchacha_hchacha20`. If we look at it,
we notice that the main part of the computation, the core algorithm, is
just a for loop; the same is true of the `xchacha_encrypt_bytes` function.

Main loop for `xchacha_hchacha20`:
```c
for (i = 0; i < 10; i++){
    QUARTERROUND(x0, x4,  x8, x12);
    QUARTERROUND(x1, x5,  x9, x13);
    QUARTERROUND(x2, x6, x10, x14);
    QUARTERROUND(x3, x7, x11, x15);
    QUARTERROUND(x0, x5, x10, x15);
    QUARTERROUND(x1, x6, x11, x12);
    QUARTERROUND(x2, x7,  x8, x13);
    QUARTERROUND(x3, x4,  x9, x14);
}
```
```c
#define QUARTERROUND(a,b,c,d) \
    a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
    a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
```
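This is the standard ChaCha quarter-round (rotation amounts 16, 12, 8, 7). As a sanity check, here is a direct Python transcription of the macro (ours, purely illustrative), which can be verified against the quarter-round test vector in RFC 8439:

```python
MASK32 = 0xffffffff

def rotl32(v, n):
    # 32-bit left rotation
    return ((v << n) | (v >> (32 - n))) & MASK32

def quarterround(a, b, c, d):
    # direct transcription of the QUARTERROUND macro above
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d
```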
We see that the loop is split into two groups of `QUARTERROUND` calls,
one with `step=4`:
```c
QUARTERROUND(x0, x4,  x8, x12);
QUARTERROUND(x1, x5,  x9, x13);
QUARTERROUND(x2, x6, x10, x14);
QUARTERROUND(x3, x7, x11, x15);
```
and another with `step=5`:
```c
QUARTERROUND(x0, x5, x10, x15);
QUARTERROUND(x1, x6, x11, x12);
QUARTERROUND(x2, x7,  x8, x13);
QUARTERROUND(x3, x4,  x9, x14);
```
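The two groups follow a regular pattern: the first works down the columns of the 4x4 state, the second down the diagonals. A short Python sketch (ours, purely illustrative) that generates both sets of index tuples:

```python
# Column round: element i works on (i, i+4, i+8, i+12), i.e. step=4.
columns = [(i, i + 4, i + 8, i + 12) for i in range(4)]

# Diagonal round: each operand row is additionally rotated by the row
# number, giving the step=5 pattern along the diagonals.
diagonals = [(i, 4 + (i + 1) % 4, 8 + (i + 2) % 4, 12 + (i + 3) % 4)
             for i in range(4)]
```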
Let's start with the first group of `QUARTERROUND`s. Unrolled, it
results in the following operations:
```c
x0  = x0  + x4;   x12 = ROTATE(x12 ^ x0, 16);
x8  = x8  + x12;  x4  = ROTATE(x4 ^ x8, 12);
x0  = x0  + x4;   x12 = ROTATE(x12 ^ x0, 8);
x8  = x8  + x12;  x4  = ROTATE(x4 ^ x8, 7);
x1  = x1  + x5;   x13 = ROTATE(x13 ^ x1, 16);
x9  = x9  + x13;  x5  = ROTATE(x5 ^ x9, 12);
x1  = x1  + x5;   x13 = ROTATE(x13 ^ x1, 8);
x9  = x9  + x13;  x5  = ROTATE(x5 ^ x9, 7);
x2  = x2  + x6;   x14 = ROTATE(x14 ^ x2, 16);
x10 = x10 + x14;  x6  = ROTATE(x6 ^ x10, 12);
x2  = x2  + x6;   x14 = ROTATE(x14 ^ x2, 8);
x10 = x10 + x14;  x6  = ROTATE(x6 ^ x10, 7);
x3  = x3  + x7;   x15 = ROTATE(x15 ^ x3, 16);
x11 = x11 + x15;  x7  = ROTATE(x7 ^ x11, 12);
x3  = x3  + x7;   x15 = ROTATE(x15 ^ x3, 8);
x11 = x11 + x15;  x7  = ROTATE(x7 ^ x11, 7);
```
The second group of `QUARTERROUND`s, unrolled:
```c
x0  = x0  + x5;   x15 = ROTATE(x15 ^ x0, 16);
x10 = x10 + x15;  x5  = ROTATE(x5 ^ x10, 12);
x0  = x0  + x5;   x15 = ROTATE(x15 ^ x0, 8);
x10 = x10 + x15;  x5  = ROTATE(x5 ^ x10, 7);
x1  = x1  + x6;   x12 = ROTATE(x12 ^ x1, 16);
x11 = x11 + x12;  x6  = ROTATE(x6 ^ x11, 12);
x1  = x1  + x6;   x12 = ROTATE(x12 ^ x1, 8);
x11 = x11 + x12;  x6  = ROTATE(x6 ^ x11, 7);
x2  = x2  + x7;   x13 = ROTATE(x13 ^ x2, 16);
x8  = x8  + x13;  x7  = ROTATE(x7 ^ x8, 12);
x2  = x2  + x7;   x13 = ROTATE(x13 ^ x2, 8);
x8  = x8  + x13;  x7  = ROTATE(x7 ^ x8, 7);
x3  = x3  + x4;   x14 = ROTATE(x14 ^ x3, 16);
x9  = x9  + x14;  x4  = ROTATE(x4 ^ x9, 12);
x3  = x3  + x4;   x14 = ROTATE(x14 ^ x3, 8);
x9  = x9  + x14;  x4  = ROTATE(x4 ^ x9, 7);
```
Let's list the additions only. Each `QUARTERROUND(a,b,c,d)` contributes
four adds, `a = a + b`, `c = c + d`, `a = a + b`, `c = c + d`, so across
both groups the destinations and operands of the 32 additions follow a
regular, repeating index pattern; the same is true of the XORs and of
the rotates. It is this regularity that the Indexed REMAP Schedules
below exploit.
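The addition index lists can be extracted mechanically from the eight `QUARTERROUND` calls. A short Python sketch (ours, illustrative, not part of the implementation) that derives them, matching the index tables given later:

```python
# The eight QUARTERROUND argument tuples, in execution order
# (first the step=4 group, then the step=5 group).
qrounds = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
           (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

# Each QUARTERROUND(a, b, c, d) performs four adds: a+=b, c+=d, a+=b, c+=d.
add_dst = []   # destination indices (also one of the two sources)
add_src = []   # the other source indices
for a, b, c, d in qrounds:
    add_dst += [a, c, a, c]
    add_src += [b, d, b, d]
```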
## Introduction to REMAP Indexing
REMAP Indexing performs arbitrary re-positioning of elements.
Where any other Vector Processor would normally only be able to perform
a sequential element-level series of operations, requiring a special
re-ordering instruction whenever re-ordering of the elements is needed,
SVP64 can *in-place* reorder elements on *any* instruction, according
to a REMAP Schedule.
Most of the REMAP systems are simple fixed-hardware Deterministic
Schedules, but there is one that is general-purpose: Indexing. It
requires specifying a group of GPRs (or indices packed into GPRs)
that are to be used as the offsets.
This is a normal Simple-V operation:

```
for i in range(VL):
    GPR[RT+i] = OPERATION(GPR[RA+i])
```
This is what happens when REMAP is enabled with Indexing:

```
def REMAP(SVSHAPE, i):
    return GPR(SVSHAPE.GPR + i)
for i in range(VL):
    idx_rt = REMAP(SVSHAPE0, i)
    idx_ra = REMAP(SVSHAPE1, i)
    GPR[RT+idx_rt] = OPERATION(GPR[RA+idx_ra])
```
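To make the pseudocode concrete, here is a minimal runnable Python model of Indexed REMAP (names such as `gpr` and `remap_op` are ours, not SVP64 mnemonics):

```python
def remap_op(gpr, rt, ra, rt_idx, ra_idx, op):
    """Apply a scalar operation element-wise, with the element offsets
    for destination and source taken from index lists (the Schedules)."""
    for i in range(len(rt_idx)):
        gpr[rt + rt_idx[i]] = op(gpr[ra + ra_idx[i]])

# toy register file: registers 0-7 hold the values 10..17
gpr = {r: 10 + r for r in range(8)}

# reverse the order of 4 elements while doubling them:
# destination offsets run 3,2,1,0 while source offsets run 0,1,2,3
remap_op(gpr, 4, 0, [3, 2, 1, 0], [0, 1, 2, 3], lambda v: v * 2)
```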
In this way we can literally jump about, pretty much anywhere in
the register file, according to a Schedule that is determined by
the programmer. Therefore, if we can put all of the chacha20
round intermediary data into an array of registers, and can
analyse the *order* in which add-operations, xor-operations
and rotate-operations occur, it might just be possible to
eliminate **all** loop-unrolled inline assembler, replacing it
with three instructions and appropriate Indexed REMAP Schedules.
It turns out that this is indeed possible.
## Introduction to Vertical-First Mode
We're going to use Vertical-First mode (VF) to implement this, so we
will first give a short introduction to VF. In normal Horizontal Vector
mode, or even in traditional SIMD mode, the instruction is executed
across a (Horizontal) Vector. So, if we have, for example, `VL=8`,
then the instruction

```
sv.add *RT, *RA, *RB  # RT = RA + RB
```
will be executed on all elements of the vector, **before** moving to
the next assembly instruction. This behaviour changes in Vertical-First
mode. In VF mode, the instruction is executed on a **single** element of
the vector, and then execution moves to the next assembly instruction.
It continues to do so until it reaches the `svstep` instruction, at which
point processing moves to the next element/register in the vector.
Branching back to the beginning of the loop does not occur automatically
though; a branch instruction has to be added manually.
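The difference in execution order can be sketched in Python (a model of the semantics only, not of any real hardware):

```python
def horizontal(ops, vl):
    # Horizontal mode: each instruction completes across all vl
    # elements before the next instruction starts.
    return [(op, e) for op in ops for e in range(vl)]

def vertical_first(ops, vl):
    # Vertical-First mode: every instruction runs on one element,
    # then svstep advances to the next element and the loop repeats.
    return [(op, e) for e in range(vl) for op in ops]

trace = vertical_first(["add", "xor", "rotate"], 2)
```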
The reason why Vertical-First is needed is that, as should be clear
from xchacha20, there are ordering dependencies between the three
operations `add`, `xor`, `rotate`. It is not okay to perform the entire
suite of Vector-adds, then move on to the Vector-xors, then the
Vector-rotates: they have to be interleaved so as to respect the
element-level ordering. This is exactly what Vertical-First allows us
to do: element 0 add, element 0 xor, element 0 rotate, then element
**1** add, element **1** xor, element **1** rotate, and so on. With
Vertical-First *combined* with Indexed REMAP we can literally jump
those operations around, anywhere within the Vector.
## Application of VF mode in the Xchacha20 loop
Let's assume the values `x` are placed in a block of registers
starting at GPR 24, laid out as follows:

| GPR # |     |     |     |     |
|-------|-----|-----|-----|-----|
| 24    | x0  | x1  | x2  | x3  |
| 28    | x4  | x5  | x6  | x7  |
| 32    | x8  | x9  | x10 | x11 |
| 36    | x12 | x13 | x14 | x15 |
So, for the addition in Vertical-First mode, the `RT` (and `RB`, as they
are the same) indices are, in terms of `x`:

|    |    |    |    |    |    |    |    |
|----|----|----|----|----|----|----|----|
| 0  | 8  | 0  | 8  | 1  | 9  | 1  | 9  |
| 2  | 10 | 2  | 10 | 3  | 11 | 3  | 11 |
| 0  | 10 | 0  | 10 | 1  | 11 | 1  | 11 |
| 2  | 8  | 2  | 8  | 3  | 9  | 3  | 9  |
However, since the indices are small values, using a whole 64-bit
register for a single index value would be wasteful, so we compress
them: 8 indices per 64-bit register. The `RT` indices then fit inside
these 4 registers (in Little-Endian byte order):

|           | GPR #8            | GPR #9            | GPR #10           | GPR #11           |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |
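The packing of 8 small indices into one 64-bit register can be checked with a few lines of Python (an illustration of the byte layout, not SVP64 code):

```python
def pack_indices(idx):
    # pack 8 byte-sized indices into a 64-bit value, first index in
    # the least-significant byte (Little-Endian element order)
    assert len(idx) == 8
    v = 0
    for i, x in enumerate(idx):
        v |= x << (8 * i)
    return v

# the four rows of the RT index table above
svshape0 = [pack_indices(row) for row in
            [[0, 8, 0, 8, 1, 9, 1, 9],
             [2, 10, 2, 10, 3, 11, 3, 11],
             [0, 10, 0, 10, 1, 11, 1, 11],
             [2, 8, 2, 8, 3, 9, 3, 9]]]
```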
Similarly, we find the `RA` indices:

|    |    |    |    |    |    |    |    |
|----|----|----|----|----|----|----|----|
| 4  | 12 | 4  | 12 | 5  | 13 | 5  | 13 |
| 6  | 14 | 6  | 14 | 7  | 15 | 7  | 15 |
| 5  | 15 | 5  | 15 | 6  | 12 | 6  | 12 |
| 7  | 13 | 7  | 13 | 4  | 14 | 4  | 14 |
Using a similar method, we fill the final 4 registers with the `RA` indices:

|           | GPR #12           | GPR #13           | GPR #14           | GPR #15           |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |
Now we can construct the Vertical-First loop:
```
svindex 4, 0, 1, 3, 0, 1, 0   # SVSHAPE0, add RT/RB indices
svindex 6, 1, 1, 3, 0, 1, 0   # SVSHAPE1, add RA indices
setvl 0, 0, 32, 1, 1, 1       # MAXVL=VL=32, VF=1
svremap 31, 1, 0, 0, 0, 0, 0  # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x        # RT, RB: SHAPE0. RA: SHAPE1
svstep. 16, 1, 0              # step to next in-regs element
```
What this code snippet does is the following.

The first instruction

```
svindex 4, 0, 1, 3, 0, 1, 0
```
loads the add `RT` indices into `SVSHAPE0`, starting at GPR #8. You will
note that 4 is listed, but that is because `svindex` only works on
even-numbered registers, so in order to save a bit the number is halved
in the instruction; we have to double it to get the actual starting
register. So, `SVSHAPE0` will occupy GPRs 8-11. The number 3 indicates
that the elements will be 8-bit long (0=64-bit, 1=32-bit, 2=16-bit,
3=8-bit).
The next instruction

```
svindex 6, 1, 1, 3, 0, 1, 0
```

loads the add `RA` indices into `SVSHAPE1`. Again, even though 6 is
listed, the indices will actually be loaded starting at GPR #12 (`6*2`),
again as 8-bit elements.
Next, the `setvl` instruction:

```
setvl 0, 0, 32, 1, 1, 1
```

We have to call `setvl` to set `MAXVL` and `VL` to 32 and also to
configure Vertical-First mode. Afterwards, we have to specify how we
intend to use the indices, and we do this using `svremap`.
```
svremap 31, 1, 0, 0, 0, 0, 0
```

`svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT` and `RB`,
and `SVSHAPE1` for `RA`. The next instruction performs the **actual** addition:
```
sv.add/w=32 *x, *x, *x
```
Note the `/w=32` suffix. This instructs the adder to perform the operation
on elements of `w=32` bits. Since the Power ISA registers are 64-bit,
this means that two 32-bit elements are held in each register. Also note
that in all three operand positions we use `*x` as the argument. This
instructs the scheduler to act on the registers as a vector, that is, a
sequence of elements. But even though the three operands are the same,
their element indices will be taken from the `SVSHAPE0`/`SVSHAPE1`
Schedules as defined previously. Also note that the indices are relative
to the actual register used. So, if `*x` starts at GPR 24 for example,
in essence this instruction will issue the following sequence of
operations:
```
add/w=32 24 + 0, 24 +  4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 0, 24 +  4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 1, 24 +  5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
add/w=32 24 + 1, 24 +  5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
```
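The first eight of those (RT, RA, RB) offset triples can be reproduced from the Schedules in a few lines of Python (ours, illustrative):

```python
# first 8 entries of the two Schedules, as given above
shape0 = [0, 8, 0, 8, 1, 9, 1, 9]       # used for both RT and RB
shape1 = [4, 12, 4, 12, 5, 13, 5, 13]   # used for RA

base = 24  # *x starts at GPR 24 in this example
issued = [(base + shape0[i], base + shape1[i], base + shape0[i])
          for i in range(8)]
```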
Finally, the `svstep.` instruction steps to the next element, advancing
every active REMAP Schedule to its next set of indices.
We have shown how to do the additions in Vertical-First mode. Now let's
add the rest of the instructions in the `QUARTERROUND`s: the `XOR`s and
the `ROTATE`s.

We will need to create another set of indices for the `XOR` instructions.
We only need one new set, as the other set of indices is the same as the
`RT` set used for `sv.add` (`SHAPE0`). In terms of `x`, the `XOR` target
indices are:
|    |    |    |    |    |    |    |    |
|----|----|----|----|----|----|----|----|
| 12 | 4  | 12 | 4  | 13 | 5  | 13 | 5  |
| 14 | 6  | 14 | 6  | 15 | 7  | 15 | 7  |
| 15 | 5  | 15 | 5  | 12 | 6  | 12 | 6  |
| 13 | 7  | 13 | 7  | 14 | 4  | 14 | 4  |
Again compressed to 8 indices per 64-bit register, these fit in 4 registers:

|           | GPR #16           | GPR #17           | GPR #18           | GPR #19           |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |
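As with the add Schedules, these constants can be derived mechanically from the `QUARTERROUND` ordering (Python, ours, illustrative):

```python
qrounds = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
           (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

# Each QUARTERROUND(a, b, c, d) performs four XORs targeting d, b, d, b.
xor_dst = []
for a, b, c, d in qrounds:
    xor_dst += [d, b, d, b]

def pack8(idx):
    # 8 byte-sized indices, first index in the least-significant byte
    v = 0
    for i, x in enumerate(idx):
        v |= x << (8 * i)
    return v

svshape2 = [pack8(xor_dst[i:i + 8]) for i in range(0, 32, 8)]
```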
The next operation is the `ROTATE`, which takes as operands the result
of the `XOR` and a shift argument. You can easily see that the indices
used in this case are the same as those of the `XOR`. However, the shift
values cycle every 4: 16, 12, 8, 7. For the indices we can again use
`svindex`, like this:
```
svindex 8, 2, 1, 3, 0, 1, 0
```
which again means `SVSHAPE2`, operating on 8-bit elements, starting
from GPR #16 (`8*2`). For the shift values, which cycle every 4 elements,
the `svshape2` instruction will be used:
```
svshape2 0, 0, 3, 4, 0, 1
```
This will create `SVSHAPE3`, which will index its elements modulo 4.
Now we can list both the `XOR` and `ROTATE` instructions in assembly,
together with their respective `svremap` instructions:
```
svremap 31, 2, 0, 2, 2, 0, 0  # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0  # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
```
So, in a similar fashion, we instruct the `XOR` (`sv.xor`) to use
`SVSHAPE2` for `RA` and `RS` and `SVSHAPE0` for `RB`, again on 32-bit
elements, while the `ROTATE` (`sv.rldcl`) will also use `SVSHAPE2` for
`RA` and `RS`, but `SVSHAPE3` for `RB` (the shift values, which cycle
every 4 elements). Note that the shift values pointed to by `SVSHAPE3`
have to be stored as 32-bit elements:
|         | GPR #20            | GPR #21            |
|---------|--------------------|--------------------|
| SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |
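The layout of the shift constants, two 32-bit values per 64-bit register with the first value in the low half, can again be checked in Python (illustrative only):

```python
def pack2x32(lo, hi):
    # two 32-bit elements in one 64-bit register, first element low
    return lo | (hi << 32)

shifts = [16, 12, 8, 7]  # the rotate amounts, cycling every 4 steps
packed = [pack2x32(shifts[0], shifts[1]), pack2x32(shifts[2], shifts[3])]
```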
The complete algorithm, as a loop with 10 iterations, is as follows:
```
li 7, 10                        # Load value 10 into GPR #7
mtctr 7                         # Set up CTR from GPR #7

# set up VL=32 vertical-first, and SVSHAPEs 0-2
setvl 0, 0, 32, 1, 1, 1
# SHAPE0, used by sv.add, starts at GPR #8
svindex 8/2, 0, 1, 3, 0, 1, 0   # SVSHAPE0, a
# SHAPE1, used by sv.add, starts at GPR #12
svindex 12/2, 1, 1, 3, 0, 1, 0  # SVSHAPE1, b
# SHAPE2, used by sv.xor and sv.rldcl, starts at GPR #16
svindex 16/2, 2, 1, 3, 0, 1, 0  # SVSHAPE2, c
# SHAPE3, used by sv.rldcl to hold the shift values, starts at GPR #20
# The inner loop will do 32 iterations, but there are only
# 4 shift values, so we mod by 4, and can cycle through them
svshape2 0, 0, 3, 4, 0, 1       # SVSHAPE3, shift amount, mod4

.outer:
# outer loop begins here (standard CTR loop)
setvl 0, 0, 32, 1, 1, 1         # MAXVL=VL=32, VF=1
# inner loop begins here. add-xor-rotl32 with remap, step, branch
.inner:
svremap 31, 1, 0, 0, 0, 0, 0    # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x
svremap 31, 2, 0, 2, 2, 0, 0    # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0    # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
# GPR #7 (free after mtctr) receives the result of svstep.;
# using GPR #16 here would overlap with SHAPE2 (GPRs 16-19)
# and corrupt the first 8 indices
svstep. 7, 1, 0                 # step to next in-regs element
bc 6, 3, .inner                 # svstep. Rc=1 loop-end-condition?
# inner-loop done: outer loop standard CTR-decrement to setvl again
bdnz .outer                     # Loop until CTR is zero
```
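To tie everything together, here is a complete Python model (ours, purely for illustration; it models the Schedules, not the SVP64 engine itself) which executes the 10 outer iterations as 10x32 Vertical-First add/xor/rotate steps driven by the three index Schedules, and checks the result against a direct transcription of the `QUARTERROUND` loop:

```python
MASK32 = 0xffffffff

def rotl32(v, n):
    return ((v << n) | (v >> (32 - n))) & MASK32

QROUNDS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
           (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

# Build the three Schedules exactly as derived in the text:
# SHAPE0 = add RT/RB (also xor RB), SHAPE1 = add RA, SHAPE2 = xor/rotate target.
SHAPE0, SHAPE1, SHAPE2 = [], [], []
for a, b, c, d in QROUNDS:
    SHAPE0 += [a, c, a, c]
    SHAPE1 += [b, d, b, d]
    SHAPE2 += [d, b, d, b]
SHIFTS = [16, 12, 8, 7]  # SVSHAPE3: cycles modulo 4

def core_remap(x, rounds=10):
    # Vertical-First model: for each of 32 element steps, do one add,
    # one xor and one rotate, with operands chosen by the Schedules.
    x = list(x)
    for _ in range(rounds):
        for i in range(32):
            x[SHAPE0[i]] = (x[SHAPE0[i]] + x[SHAPE1[i]]) & MASK32
            x[SHAPE2[i]] ^= x[SHAPE0[i]]
            x[SHAPE2[i]] = rotl32(x[SHAPE2[i]], SHIFTS[i % 4])
    return x

def core_direct(x, rounds=10):
    # Straightforward transcription of the QUARTERROUND loop.
    x = list(x)
    for _ in range(rounds):
        for a, b, c, d in QROUNDS:
            x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 16)
            x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 12)
            x[a] = (x[a] + x[b]) & MASK32; x[d] = rotl32(x[d] ^ x[a], 8)
            x[c] = (x[c] + x[d]) & MASK32; x[b] = rotl32(x[b] ^ x[c], 7)
    return x
```

Because each group of four Schedule entries corresponds to one `QUARTERROUND`, the two functions produce identical results, which is precisely why three Vector instructions plus the Schedules can replace the whole unrolled loop.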