[[!tag svp64_cookbook]]

# XChaCha20 SVP64 Implementation Analysis

## Description of XChacha20 Algorithm

We will first try to analyze the XChacha20 algorithm as found in:

https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c

The function under inspection is `xchacha_hchacha20`. We notice that the
main part of the computation is just a for loop, which is also the same
in the `xchacha_encrypt_bytes` function.

Main loop for `xchacha_hchacha20`:

```
for (i = 0; i < 10; i++){
    QUARTERROUND(x0, x4, x8, x12);
    QUARTERROUND(x1, x5, x9, x13);
    QUARTERROUND(x2, x6, x10, x14);
    QUARTERROUND(x3, x7, x11, x15);
    QUARTERROUND(x0, x5, x10, x15);
    QUARTERROUND(x1, x6, x11, x12);
    QUARTERROUND(x2, x7, x8, x13);
    QUARTERROUND(x3, x4, x9, x14);
}

#define QUARTERROUND(a,b,c,d) \
    a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
    a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
```

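For reference, `QUARTERROUND` is built from the `PLUS`, `XOR` and `ROTATE`
helper macros. Below is a minimal C sketch of their usual definitions (the
exact definitions in the source above may differ slightly):

```
#include <stdint.h>

/* rotate a 32-bit value left by c bits */
#define ROTL32(v, c) (((v) << (c)) | ((v) >> (32 - (c))))

#define ROTATE(v, c) ROTL32((v), (c))
#define XOR(v, w)    ((v) ^ (w))
#define PLUS(v, w)   ((uint32_t)((v) + (w)))
```
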
We see that the loop is split into two groups of `QUARTERROUND` calls,
one with `step=4`:

```
QUARTERROUND(x0, x4, x8, x12);
QUARTERROUND(x1, x5, x9, x13);
QUARTERROUND(x2, x6, x10, x14);
QUARTERROUND(x3, x7, x11, x15);
```

and another with `step=5`:

```
QUARTERROUND(x0, x5, x10, x15);
QUARTERROUND(x1, x6, x11, x12);
QUARTERROUND(x2, x7, x8, x13);
QUARTERROUND(x3, x4, x9, x14);
```

Let's start with the first group of `QUARTERROUND`s by unrolling it;
essentially, it results in the following instructions:

```
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 16);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 12);
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 8);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 7);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 16);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 12);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 8);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 7);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 16);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 12);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 8);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 7);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 16);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 12);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 8);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 7);
```

Second group of `QUARTERROUND`s, unrolled:

```
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 16);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 12);
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 8);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 7);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 16);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 12);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 8);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 7);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 16);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 12);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 8);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 7);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 16);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 12);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 8);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 7);
```

Let's list the additions only:

```
x0 = x0 + x4
x8 = x8 + x12
x0 = x0 + x4
x8 = x8 + x12
x1 = x1 + x5
x9 = x9 + x13
x1 = x1 + x5
x9 = x9 + x13
x2 = x2 + x6
x10 = x10 + x14
x2 = x2 + x6
x10 = x10 + x14
x3 = x3 + x7
x11 = x11 + x15
x3 = x3 + x7
x11 = x11 + x15
x0 = x0 + x5
x10 = x10 + x15
x0 = x0 + x5
x10 = x10 + x15
x1 = x1 + x6
x11 = x11 + x12
x1 = x1 + x6
x11 = x11 + x12
x2 = x2 + x7
x8 = x8 + x13
x2 = x2 + x7
x8 = x8 + x13
x3 = x3 + x4
x9 = x9 + x14
x3 = x3 + x4
x9 = x9 + x14
```

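The additions clearly follow a repeating, table-driven pattern. A short C
sketch makes it explicit (the `rt`/`rb` arrays are simply the destination
and second-operand indices of the 32 additions above, in order; note that
this illustrates the index pattern only, since in the real algorithm the
additions interleave with the XORs and rotates):

```
#include <stdint.h>

/* Index pattern of the 32 additions listed above: x[rt[i]] += x[rb[i]].
   Illustration only: the real algorithm interleaves XORs and rotates. */
static void do_adds(uint32_t x[16])
{
    static const uint8_t rt[32] = {
        0, 8, 0, 8, 1, 9, 1, 9, 2, 10, 2, 10, 3, 11, 3, 11,
        0, 10, 0, 10, 1, 11, 1, 11, 2, 8, 2, 8, 3, 9, 3, 9,
    };
    static const uint8_t rb[32] = {
        4, 12, 4, 12, 5, 13, 5, 13, 6, 14, 6, 14, 7, 15, 7, 15,
        5, 15, 5, 15, 6, 12, 6, 12, 7, 13, 7, 13, 4, 14, 4, 14,
    };
    for (int i = 0; i < 32; i++)
        x[rt[i]] = x[rt[i]] + x[rb[i]];
}
```
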
## Introduction to Vertical-First Mode

We're going to use Vertical-First mode (VF) to implement this, so we
will first give a short introduction to VF. In normal Horizontal
Vector mode, or even in traditional SIMD mode, the instruction is
executed across a (Horizontal) Vector. So, if we have, for example,
`VL=8`, then the instruction

```
sv.add *RT, *RA, *RB # RT = RA + RB
```

will be executed on all elements of the vector, **before** moving to
the next assembly instruction. This behaviour changes in Vertical-First
mode. In VF mode, the instruction is executed on a **single** element of
the vector before moving to the next assembly instruction. It continues
to do so until execution reaches the `svstep` instruction, which moves
processing to the next element/register in the vector. Branching back to
the beginning of the loop does not occur automatically, though: a branch
instruction has to be added manually.

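As a mental model, here is the difference in plain C (illustrative only;
`VL` and the register arrays are stand-ins, not real hardware state):

```
#include <stdint.h>

enum { VL = 8 };

/* Horizontal mode: each instruction processes all VL elements
   before the next instruction executes. */
static void horizontal(uint32_t *rt, const uint32_t *ra, const uint32_t *rb)
{
    for (int i = 0; i < VL; i++)      /* sv.add *RT, *RA, *RB */
        rt[i] = ra[i] + rb[i];
}

/* Vertical-First mode: each pass of the loop runs the whole
   instruction sequence on a single element i; svstep. advances i,
   and the branch back to the top is explicit. */
static void vertical_first(uint32_t *rt, const uint32_t *ra, const uint32_t *rb)
{
    for (int i = 0; i < VL; i++) {    /* i advanced by svstep., bc loops */
        rt[i] = ra[i] + rb[i];        /* sv.add acts on element i only */
        /* ...any further instructions also act on element i... */
    }
}
```
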
## Application of VF mode in the Xchacha20 loop

Let's assume the values `x` reside in GPRs 24-31. Since we will be
operating on 32-bit elements, each 64-bit register holds two values:

| GPR # | | | | |
|--------|-----|-----|-----|-----|
| 24 | x0 | x1 | x2 | x3 |
| 26 | x4 | x5 | x6 | x7 |
| 28 | x8 | x9 | x10 | x11 |
| 30 | x12 | x13 | x14 | x15 |

So for the addition in Vertical-First mode, the `RT` indices (shared
with `RB`, since each addition's destination is also one of its sources)
are, in terms of x:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 0 | 8 | 0 | 8 | 1 | 9 | 1 | 9 |
| 2 | 10 | 2 | 10 | 3 | 11 | 3 | 11 |
| 0 | 10 | 0 | 10 | 1 | 11 | 1 | 11 |
| 2 | 8 | 2 | 8 | 3 | 9 | 3 | 9 |

However, since the indices are small values, using a whole 64-bit
register for a single index would be a waste, so we will compress them,
8 indices per 64-bit register. The `RT`/`RB` indices will then fit
inside these 4 registers (in Little-Endian format):

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |

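The packing is plain little-endian byte order: shape element i goes into
byte i. A quick illustrative C check reproduces the first word above:

```
#include <stdint.h>
#include <stdio.h>

/* Pack 8 single-byte indices into one 64-bit word, little-endian:
   idx[0] lands in the least significant byte. */
static uint64_t pack_indices(const uint8_t idx[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)idx[i] << (8 * i);
    return v;
}

int main(void)
{
    /* first row of the RT/RB indices: 0, 8, 0, 8, 1, 9, 1, 9 */
    const uint8_t row0[8] = {0, 8, 0, 8, 1, 9, 1, 9};
    printf("0x%llx\n", (unsigned long long)pack_indices(row0));
    /* prints 0x901090108000800, the first SVSHAPE0 word */
    return 0;
}
```

The same packing produces the `SVSHAPE1` and `SVSHAPE2` words below.
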
Similarly, we find the `RA` indices (the second source operand of each
addition):

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 4 | 12 | 4 | 12 | 5 | 13 | 5 | 13 |
| 6 | 14 | 6 | 14 | 7 | 15 | 7 | 15 |
| 5 | 15 | 5 | 15 | 6 | 12 | 6 | 12 |
| 7 | 13 | 7 | 13 | 4 | 14 | 4 | 14 |

Using a similar method, we find the final 4 registers with the `RA`
indices:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |

Now, we can construct the Vertical-First loop:

```
svindex 4, 0, 1, 3, 0, 1, 0 # SVSHAPE0, add RT/RB indices
svindex 6, 1, 1, 3, 0, 1, 0 # SVSHAPE1, add RA indices
setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x # RT, RB: SHAPE0. RA: SHAPE1
svstep. 16, 1, 0 # step to next in-regs element
```

What this code snippet does is the following:

The first instruction

```
svindex 4, 0, 1, 3, 0, 1, 0
```

loads the add `RT`/`RB` indices into `SVSHAPE0`, starting at GPR #8. You
will note that 4 is listed instead: `svindex` only works on even-numbered
registers, so to save a bit in the encoding the register number is
halved, and we have to double it to get the actual register. `SVSHAPE0`
will therefore occupy GPRs 8-11. The number 3 gives the element width:
0=64-bit, 1=32-bit, 2=16-bit, 3=8-bit, so the indices are stored as
8-bit elements.

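Since each 64-bit GPR holds 8 of these 8-bit indices, the layout of a
shape's 32 indices over its 4 GPRs is straightforward; a small
illustrative C helper (hypothetical, just to show the arithmetic):

```
/* For an 8-bit-element svindex shape starting at GPR 8 (SVSHAPE0 here):
   index element i lives in byte (i % 8) of GPR (8 + i / 8),
   counting bytes little-endian from the least significant end. */
static int shape_gpr(int i)  { return 8 + i / 8; }
static int shape_byte(int i) { return i % 8; }
```
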
The next instruction

```
svindex 6, 1, 1, 3, 0, 1, 0
```

loads the add `RA` indices into `SVSHAPE1`. Again, even though we list 6,
the indices will actually be loaded starting from GPR #12 (GPRs 12-15),
and again 8-bit elements are used.
Next, the `setvl` instruction:

```
setvl 0, 0, 32, 1, 1, 1
```

We have to call `setvl` to set `MAXVL` and `VL` to 32 and also to
configure Vertical-First mode. Afterwards, we have to specify how we
intend to use the indices, and we do this using `svremap`.

```
svremap 31, 1, 0, 0, 0, 0, 0
```

`svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT`
and `RB`, and `SVSHAPE1` for `RA`. The next instruction performs the
**actual** addition:

```
sv.add/w=32 *x, *x, *x
```

Note the `/w=32` suffix. This instructs the adder to perform the
operation on 32-bit elements. Since Power ISA registers are 64-bit, this
means that two 32-bit elements are packed in each register. Also, note
that all operands use `*x` as the argument. This instructs the scheduler
to act on the registers as a vector, or a sequence of elements. But even
though the operands are all the same, their indices will be taken from
the `SVSHAPE0`/`SVSHAPE1` indices as defined previously. Also note that
the indices are relative to the actual register used. So, if `*x` starts
at GPR 24 for example, in essence this instruction will issue the
following sequence of instructions:

```
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
...
```
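
Here, `24 + n` denotes 32-bit element `n` counted from GPR 24. In plain
C, the effect of the remapped `sv.add` over the full vector is sketched
below (`v` is the 32-bit element file overlaying GPRs 24-31, and
`shape0`/`shape1` are the index tables loaded by `svindex`; in
Vertical-First mode only one `i` executes per pass):

```
#include <stdint.h>

static void remapped_add(uint32_t v[16],
                         const uint8_t shape0[32],
                         const uint8_t shape1[32])
{
    for (int i = 0; i < 32; i++)                     /* element loop, VL=32 */
        v[shape0[i]] = v[shape1[i]] + v[shape0[i]];  /* RT = RA + RB */
}
```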

Finally, the `svstep.` instruction steps to the next set of indices.

We have shown how to do the additions in Vertical-First mode. Now
let's add the rest of the instructions in the `QUARTERROUND`s. The
`XOR` operations of both `QUARTERROUND` groups, of the form
`d = XOR(d, a)` and `b = XOR(b, c)`, are:

```
x12 = x12 ^ x0
x4 = x4 ^ x8
x12 = x12 ^ x0
x4 = x4 ^ x8
x13 = x13 ^ x1
x5 = x5 ^ x9
x13 = x13 ^ x1
x5 = x5 ^ x9
x14 = x14 ^ x2
x6 = x6 ^ x10
x14 = x14 ^ x2
x6 = x6 ^ x10
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x0
x5 = x5 ^ x10
x15 = x15 ^ x0
x5 = x5 ^ x10
x12 = x12 ^ x1
x6 = x6 ^ x11
x12 = x12 ^ x1
x6 = x6 ^ x11
x13 = x13 ^ x2
x7 = x7 ^ x8
x13 = x13 ^ x2
x7 = x7 ^ x8
x14 = x14 ^ x3
x4 = x4 ^ x9
x14 = x14 ^ x3
x4 = x4 ^ x9
```

We will need to create another set of indices for the `XOR`
instructions. We only need one new set, as the other operand's indices
are the same as the `RT` indices of `sv.add` (`SHAPE0`). So, listing the
destination (and first source) indices of the `XOR`s above:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 12 | 4 | 12 | 4 | 13 | 5 | 13 | 5 |
| 14 | 6 | 14 | 6 | 15 | 7 | 15 | 7 |
| 15 | 5 | 15 | 5 | 12 | 6 | 12 | 6 |
| 13 | 7 | 13 | 7 | 14 | 4 | 14 | 4 |

Again, we find:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |

The next operation is the `ROTATE`, which takes as operands the result
of the `XOR` and a shift argument. You can easily see that the indices
used in this case are the same as those of the `XOR`. However, the shift
values cycle every 4 elements: 16, 12, 8, 7. For the indices we can
again use `svindex`, like this:

```
svindex 8, 2, 1, 3, 0, 1, 0
```

This again means `SVSHAPE2`, operating on 8-bit elements, starting
from GPR #16 (`8*2`). For the shift values cycling every 4 elements,
the `svshape2` instruction will be used:

```
svshape2 0, 0, 3, 4, 0, 1
```

This creates `SVSHAPE3`, which applies a modulo-4 cycle to its
elements. Now we can list both the `XOR` and `ROTATE` instructions in
assembly, together with the respective `svremap` instructions:

```
svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
```

So, in a similar fashion, we instruct `XOR` (`sv.xor`) to use `SVSHAPE2`
for `RA` and `RS` and `SVSHAPE0` for `RB`, again on 32-bit elements,
while `ROTATE` (`sv.rldcl`) will also use `SVSHAPE2` for `RA` and `RS`,
but `SVSHAPE3` for `RB` (the shift values, which cycle every 4
elements). Note that the shift values for use with `SVSHAPE3` have to be
stored as 32-bit elements:

| | | |
|---------|--------------------|--------------------|
| SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |

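Again this is little-endian packing, now with 32-bit elements; an
illustrative C check:

```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* shift amounts 16, 12, 8, 7 packed two per 64-bit GPR, LE */
    const uint32_t shifts[4] = {16, 12, 8, 7};
    for (int i = 0; i < 4; i += 2) {
        uint64_t word = (uint64_t)shifts[i]
                      | ((uint64_t)shifts[i + 1] << 32);
        printf("0x%016llx\n", (unsigned long long)word);
    }
    /* prints 0x0000000c00000010 and 0x0000000700000008 */
    return 0;
}
```
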
The complete algorithm for a loop with 10 iterations is as follows:

```
li 7, 10 # Load value 10 into GPR #7
mtctr 7 # Set up counter on GPR #7

# set up VL=32 vertical-first, and SVSHAPEs 0-2
setvl 0, 0, 32, 1, 1, 1
# SHAPE0, used by sv.add and sv.xor, starts at GPR #8
svindex 8/2, 0, 1, 3, 0, 1, 0 # SVSHAPE0, a
# SHAPE1, used by sv.add, starts at GPR #12
svindex 12/2, 1, 1, 3, 0, 1, 0 # SVSHAPE1, b
# SHAPE2, used by sv.xor and sv.rldcl, starts at GPR #16
svindex 16/2, 2, 1, 3, 0, 1, 0 # SVSHAPE2, c
# SHAPE3, used also by sv.rldcl: the shift values start at GPR #20
# The inner loop will do 32 iterations, but there are only
# 4 shift values, so we mod by 4, and can cycle through them
svshape2 0, 0, 3, 4, 0, 1 # SVSHAPE3, shift amount, mod4
.outer:
# outer loop begins here (standard CTR loop)
setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
# inner loop begins here. add-xor-rotl32 with remap, step, branch
.inner:
svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x
svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
# Caution: GPR #16 must not be used as the svstep. destination here,
# as it overlaps SHAPE2 (which starts at GPR #16) and would corrupt
# its first 8 indices; GPR #7 is free after mtctr, so we use that.
svstep. 7, 1, 0 # step to next in-regs element
bc 6, 3, .inner # svstep. Rc=1 loop-end-condition?
# inner-loop done: outer loop standard CTR-decrement to setvl again
bdnz .outer # Loop until CTR is zero
```
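
As a cross-check, here is a plain C model of what one outer iteration
computes (one full pass of the inner loop, `VL=32`), using the shape
tables shown earlier; the array names are ours, purely for illustration.
Calling it ten times mirrors the `mtctr`/`bdnz` outer loop, and the
result should match a scalar ChaCha double-round implementation:

```
#include <stdint.h>

#define ROTL32(v, c) (((v) << (c)) | ((v) >> (32 - (c))))

/* One outer iteration: each pass i does the remapped add, xor and
   rotate on element i.  v[0..15] are x0..x15 (GPRs 24-31, 32-bit
   elements); shifts[i % 4] models SVSHAPE3's mod-4 cycling. */
static void double_round(uint32_t v[16],
                         const uint8_t shape0[32], /* RT/RB of add, RB of xor */
                         const uint8_t shape1[32], /* RA of add */
                         const uint8_t shape2[32]) /* RA/RS of xor and rldcl */
{
    static const uint32_t shifts[4] = {16, 12, 8, 7};
    for (int i = 0; i < 32; i++) {
        v[shape0[i]] = v[shape1[i]] + v[shape0[i]];          /* sv.add   */
        v[shape2[i]] = v[shape2[i]] ^ v[shape0[i]];          /* sv.xor   */
        v[shape2[i]] = ROTL32(v[shape2[i]], shifts[i % 4]);  /* sv.rldcl */
    }
}
```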