[[!tag svp64_cookbook]]

# XChaCha20 SVP64 Implementation Analysis

## Description of XChacha20 Algorithm

We will first try to analyze the XChacha20 algorithm as found in:

https://github.com/spcnvdr/xchacha20/blob/master/src/xchacha20.c

The function under inspection is `xchacha_hchacha20`. We notice that the
main part of the computation is just a for loop, which is also the same
in the `xchacha_encrypt_bytes` function.

Main loop for `xchacha_hchacha20`:

```
for (i = 0; i < 10; i++){
    QUARTERROUND(x0, x4, x8, x12);
    QUARTERROUND(x1, x5, x9, x13);
    QUARTERROUND(x2, x6, x10, x14);
    QUARTERROUND(x3, x7, x11, x15);
    QUARTERROUND(x0, x5, x10, x15);
    QUARTERROUND(x1, x6, x11, x12);
    QUARTERROUND(x2, x7, x8, x13);
    QUARTERROUND(x3, x4, x9, x14);
}

#define QUARTERROUND(a,b,c,d) \
    a = PLUS(a,b); d = ROTATE(XOR(d,a),16); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c),12); \
    a = PLUS(a,b); d = ROTATE(XOR(d,a), 8); \
    c = PLUS(c,d); b = ROTATE(XOR(b,c), 7);
```

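For reference, `QUARTERROUND` is built from the `PLUS`, `XOR` and `ROTATE`
helper macros. Below is a minimal C sketch of their usual definitions (the
exact definitions in the source above may differ slightly):

```
#include <stdint.h>

/* rotate a 32-bit value left by c bits */
#define ROTL32(v, c) (((v) << (c)) | ((v) >> (32 - (c))))

#define ROTATE(v, c) ROTL32((v), (c))
#define XOR(v, w)    ((v) ^ (w))
#define PLUS(v, w)   ((uint32_t)((v) + (w)))
```
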
We see that the loop is split into two groups of `QUARTERROUND` calls,
one with `step=4`:

```
QUARTERROUND(x0, x4, x8, x12);
QUARTERROUND(x1, x5, x9, x13);
QUARTERROUND(x2, x6, x10, x14);
QUARTERROUND(x3, x7, x11, x15);
```

and another with `step=5`:

```
QUARTERROUND(x0, x5, x10, x15);
QUARTERROUND(x1, x6, x11, x12);
QUARTERROUND(x2, x7, x8, x13);
QUARTERROUND(x3, x4, x9, x14);
```

Let's start with the first group of `QUARTERROUND`s by unrolling it;
essentially, it results in the following instructions:

```
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 16);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 12);
x0 = x0 + x4; x12 = ROTATE(x12 ^ x0, 8);
x8 = x8 + x12; x4 = ROTATE(x4 ^ x8, 7);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 16);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 12);
x1 = x1 + x5; x13 = ROTATE(x13 ^ x1, 8);
x9 = x9 + x13; x5 = ROTATE(x5 ^ x9, 7);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 16);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 12);
x2 = x2 + x6; x14 = ROTATE(x14 ^ x2, 8);
x10 = x10 + x14; x6 = ROTATE(x6 ^ x10, 7);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 16);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 12);
x3 = x3 + x7; x15 = ROTATE(x15 ^ x3, 8);
x11 = x11 + x15; x7 = ROTATE(x7 ^ x11, 7);
```

Second group of `QUARTERROUND`s, unrolled:

```
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 16);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 12);
x0 = x0 + x5; x15 = ROTATE(x15 ^ x0, 8);
x10 = x10 + x15; x5 = ROTATE(x5 ^ x10, 7);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 16);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 12);
x1 = x1 + x6; x12 = ROTATE(x12 ^ x1, 8);
x11 = x11 + x12; x6 = ROTATE(x6 ^ x11, 7);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 16);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 12);
x2 = x2 + x7; x13 = ROTATE(x13 ^ x2, 8);
x8 = x8 + x13; x7 = ROTATE(x7 ^ x8, 7);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 16);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 12);
x3 = x3 + x4; x14 = ROTATE(x14 ^ x3, 8);
x9 = x9 + x14; x4 = ROTATE(x4 ^ x9, 7);
```

Let's list the additions only:

```
x0 = x0 + x4
x8 = x8 + x12
x0 = x0 + x4
x8 = x8 + x12
x1 = x1 + x5
x9 = x9 + x13
x1 = x1 + x5
x9 = x9 + x13
x2 = x2 + x6
x10 = x10 + x14
x2 = x2 + x6
x10 = x10 + x14
x3 = x3 + x7
x11 = x11 + x15
x3 = x3 + x7
x11 = x11 + x15
x0 = x0 + x5
x10 = x10 + x15
x0 = x0 + x5
x10 = x10 + x15
x1 = x1 + x6
x11 = x11 + x12
x1 = x1 + x6
x11 = x11 + x12
x2 = x2 + x7
x8 = x8 + x13
x2 = x2 + x7
x8 = x8 + x13
x3 = x3 + x4
x9 = x9 + x14
x3 = x3 + x4
x9 = x9 + x14
```

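The additions clearly follow a repeating, table-driven pattern. A short C
sketch makes it explicit (the `rt`/`rb` arrays are simply the destination
and second-operand indices of the 32 additions above, in order; note that
this illustrates the index pattern only, since in the real algorithm the
additions interleave with the XORs and rotates):

```
#include <stdint.h>

/* Index pattern of the 32 additions listed above: x[rt[i]] += x[rb[i]].
   Illustration only: the real algorithm interleaves XORs and rotates. */
static void do_adds(uint32_t x[16])
{
    static const uint8_t rt[32] = {
        0, 8, 0, 8, 1, 9, 1, 9, 2, 10, 2, 10, 3, 11, 3, 11,
        0, 10, 0, 10, 1, 11, 1, 11, 2, 8, 2, 8, 3, 9, 3, 9,
    };
    static const uint8_t rb[32] = {
        4, 12, 4, 12, 5, 13, 5, 13, 6, 14, 6, 14, 7, 15, 7, 15,
        5, 15, 5, 15, 6, 12, 6, 12, 7, 13, 7, 13, 4, 14, 4, 14,
    };
    for (int i = 0; i < 32; i++)
        x[rt[i]] = x[rt[i]] + x[rb[i]];
}
```
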
## Introduction to Vertical-First Mode

We're going to use Vertical-First mode (VF) to implement this, so we
will first give a short introduction to VF. In normal Horizontal
Vector mode, or even in traditional SIMD mode, the instruction is
executed across a (Horizontal) Vector. So, if we have, for example,
`VL=8`, then the instruction

```
sv.add *RT, *RA, *RB # RT = RA + RB
```

will be executed on all elements of the vector, **before** moving to
the next assembly instruction. This behaviour changes in Vertical-First
mode. In VF mode, the instruction is executed on a **single** element of
the vector before moving to the next assembly instruction. It continues
to do so until execution reaches the `svstep` instruction, which moves
processing to the next element/register in the vector. Branching back to
the beginning of the loop does not occur automatically, though: a branch
instruction has to be added manually.

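As a mental model, here is the difference in plain C (illustrative only;
`VL` and the register arrays are stand-ins, not real hardware state):

```
#include <stdint.h>

enum { VL = 8 };

/* Horizontal mode: each instruction processes all VL elements
   before the next instruction executes. */
static void horizontal(uint32_t *rt, const uint32_t *ra, const uint32_t *rb)
{
    for (int i = 0; i < VL; i++)      /* sv.add *RT, *RA, *RB */
        rt[i] = ra[i] + rb[i];
}

/* Vertical-First mode: each pass of the loop runs the whole
   instruction sequence on a single element i; svstep. advances i,
   and the branch back to the top is explicit. */
static void vertical_first(uint32_t *rt, const uint32_t *ra, const uint32_t *rb)
{
    for (int i = 0; i < VL; i++) {    /* i advanced by svstep., bc loops */
        rt[i] = ra[i] + rb[i];        /* sv.add acts on element i only */
        /* ...any further instructions also act on element i... */
    }
}
```
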
## Application of VF mode in the Xchacha20 loop

Let's assume the values `x` reside in GPRs 24-31. Since we will be
operating on 32-bit elements, each 64-bit register holds two values:

| GPR # | | | | |
|--------|-----|-----|-----|-----|
| 24 | x0 | x1 | x2 | x3 |
| 26 | x4 | x5 | x6 | x7 |
| 28 | x8 | x9 | x10 | x11 |
| 30 | x12 | x13 | x14 | x15 |

So for the addition in Vertical-First mode, the `RT` indices (shared
with `RB`, since each addition's destination is also one of its sources)
are, in terms of x:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 0 | 8 | 0 | 8 | 1 | 9 | 1 | 9 |
| 2 | 10 | 2 | 10 | 3 | 11 | 3 | 11 |
| 0 | 10 | 0 | 10 | 1 | 11 | 1 | 11 |
| 2 | 8 | 2 | 8 | 3 | 9 | 3 | 9 |

However, since the indices are small values, using a whole 64-bit
register for a single index would be a waste, so we will compress them,
8 indices per 64-bit register. The `RT`/`RB` indices will then fit
inside these 4 registers (in Little-Endian format):

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |

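The packing is plain little-endian byte order: shape element i goes into
byte i. A quick illustrative C check reproduces the first word above:

```
#include <stdint.h>
#include <stdio.h>

/* Pack 8 single-byte indices into one 64-bit word, little-endian:
   idx[0] lands in the least significant byte. */
static uint64_t pack_indices(const uint8_t idx[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)idx[i] << (8 * i);
    return v;
}

int main(void)
{
    /* first row of the RT/RB indices: 0, 8, 0, 8, 1, 9, 1, 9 */
    const uint8_t row0[8] = {0, 8, 0, 8, 1, 9, 1, 9};
    printf("0x%llx\n", (unsigned long long)pack_indices(row0));
    /* prints 0x901090108000800, the first SVSHAPE0 word */
    return 0;
}
```

The same packing produces the `SVSHAPE1` and `SVSHAPE2` words below.
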
Similarly, we find the `RA` indices (the second source operand of each
addition):

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 4 | 12 | 4 | 12 | 5 | 13 | 5 | 13 |
| 6 | 14 | 6 | 14 | 7 | 15 | 7 | 15 |
| 5 | 15 | 5 | 15 | 6 | 12 | 6 | 12 |
| 7 | 13 | 7 | 13 | 4 | 14 | 4 | 14 |

Using a similar method, we find the final 4 registers with the `RA`
indices:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |

Now, we can construct the Vertical-First loop:

```
svindex 4, 0, 1, 3, 0, 1, 0 # SVSHAPE0, add RT/RB indices
svindex 6, 1, 1, 3, 0, 1, 0 # SVSHAPE1, add RA indices
setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x # RT, RB: SHAPE0. RA: SHAPE1
svstep. 16, 1, 0 # step to next in-regs element
```

What this code snippet does is the following:

The first instruction

```
svindex 4, 0, 1, 3, 0, 1, 0
```

loads the add `RT`/`RB` indices into `SVSHAPE0`, starting at GPR #8. You
will note that 4 is listed instead: `svindex` only works on even-numbered
registers, so to save a bit in the encoding the register number is
halved, and we have to double it to get the actual register. `SVSHAPE0`
will therefore occupy GPRs 8-11. The number 3 gives the element width:
0=64-bit, 1=32-bit, 2=16-bit, 3=8-bit, so the indices are stored as
8-bit elements.

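Since each 64-bit GPR holds 8 of these 8-bit indices, the layout of a
shape's 32 indices over its 4 GPRs is straightforward; a small
illustrative C helper (hypothetical, just to show the arithmetic):

```
/* For an 8-bit-element svindex shape starting at GPR 8 (SVSHAPE0 here):
   index element i lives in byte (i % 8) of GPR (8 + i / 8),
   counting bytes little-endian from the least significant end. */
static int shape_gpr(int i)  { return 8 + i / 8; }
static int shape_byte(int i) { return i % 8; }
```
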
The next instruction

```
svindex 6, 1, 1, 3, 0, 1, 0
```

loads the add `RA` indices into `SVSHAPE1`. Again, even though we list 6,
the indices will actually be loaded starting from GPR #12 (GPRs 12-15),
and again 8-bit elements are used.
Next, the `setvl` instruction:

```
setvl 0, 0, 32, 1, 1, 1
```

We have to call `setvl` to set `MAXVL` and `VL` to 32 and also to
configure Vertical-First mode. Afterwards, we have to specify how we
intend to use the indices, and we do this using `svremap`.

```
svremap 31, 1, 0, 0, 0, 0, 0
```

`svremap` basically instructs the scheduler to use `SVSHAPE0` for `RT`
and `RB`, and `SVSHAPE1` for `RA`. The next instruction performs the
**actual** addition:

```
sv.add/w=32 *x, *x, *x
```

Note the `/w=32` suffix. This instructs the adder to perform the
operation on 32-bit elements. Since Power ISA registers are 64-bit, this
means that two 32-bit elements are packed in each register. Also, note
that all operands use `*x` as the argument. This instructs the scheduler
to act on the registers as a vector, or a sequence of elements. But even
though the operands are all the same, their indices will be taken from
the `SVSHAPE0`/`SVSHAPE1` indices as defined previously. Also note that
the indices are relative to the actual register used. So, if `*x` starts
at GPR 24 for example, in essence this instruction will issue the
following sequence of instructions:

```
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 0, 24 + 4, 24 + 0
add/w=32 24 + 8, 24 + 12, 24 + 8
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
add/w=32 24 + 1, 24 + 5, 24 + 1
add/w=32 24 + 9, 24 + 13, 24 + 9
...
```
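
Here, `24 + n` denotes 32-bit element `n` counted from GPR 24. In plain
C, the effect of the remapped `sv.add` over the full vector is sketched
below (`v` is the 32-bit element file overlaying GPRs 24-31, and
`shape0`/`shape1` are the index tables loaded by `svindex`; in
Vertical-First mode only one `i` executes per pass):

```
#include <stdint.h>

static void remapped_add(uint32_t v[16],
                         const uint8_t shape0[32],
                         const uint8_t shape1[32])
{
    for (int i = 0; i < 32; i++)                     /* element loop, VL=32 */
        v[shape0[i]] = v[shape1[i]] + v[shape0[i]];  /* RT = RA + RB */
}
```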

Finally, the `svstep.` instruction steps to the next set of indices.

We have shown how to do the additions in Vertical-First mode. Now
let's add the rest of the instructions in the `QUARTERROUND`s. The
`XOR` operations of both `QUARTERROUND` groups, of the form
`d = XOR(d, a)` and `b = XOR(b, c)`, are:

```
x12 = x12 ^ x0
x4 = x4 ^ x8
x12 = x12 ^ x0
x4 = x4 ^ x8
x13 = x13 ^ x1
x5 = x5 ^ x9
x13 = x13 ^ x1
x5 = x5 ^ x9
x14 = x14 ^ x2
x6 = x6 ^ x10
x14 = x14 ^ x2
x6 = x6 ^ x10
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x3
x7 = x7 ^ x11
x15 = x15 ^ x0
x5 = x5 ^ x10
x15 = x15 ^ x0
x5 = x5 ^ x10
x12 = x12 ^ x1
x6 = x6 ^ x11
x12 = x12 ^ x1
x6 = x6 ^ x11
x13 = x13 ^ x2
x7 = x7 ^ x8
x13 = x13 ^ x2
x7 = x7 ^ x8
x14 = x14 ^ x3
x4 = x4 ^ x9
x14 = x14 ^ x3
x4 = x4 ^ x9
```

We will need to create another set of indices for the `XOR`
instructions. We only need one new set, as the other operand's indices
are the same as the `RT` indices of `sv.add` (`SHAPE0`). So, listing the
destination (and first source) indices of the `XOR`s above:

| | | | | | | | |
|----|----|----|----|----|----|----|----|
| 12 | 4 | 12 | 4 | 13 | 5 | 13 | 5 |
| 14 | 6 | 14 | 6 | 15 | 7 | 15 | 7 |
| 15 | 5 | 15 | 5 | 12 | 6 | 12 | 6 |
| 13 | 7 | 13 | 7 | 14 | 4 | 14 | 4 |

Again, we find:

| | | | | |
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |

The next operation is the `ROTATE`, which takes as operands the result
of the `XOR` and a shift argument. You can easily see that the indices
used in this case are the same as those of the `XOR`. However, the shift
values cycle every 4 elements: 16, 12, 8, 7. For the indices we can
again use `svindex`, like this:

```
svindex 8, 2, 1, 3, 0, 1, 0
```

This again means `SVSHAPE2`, operating on 8-bit elements, starting
from GPR #16 (`8*2`). For the shift values cycling every 4 elements,
the `svshape2` instruction will be used:

```
svshape2 0, 0, 3, 4, 0, 1
```

This creates `SVSHAPE3`, which applies a modulo-4 cycle to its
elements. Now we can list both the `XOR` and `ROTATE` instructions in
assembly, together with the respective `svremap` instructions:

```
svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
```

So, in a similar fashion, we instruct `XOR` (`sv.xor`) to use `SVSHAPE2`
for `RA` and `RS` and `SVSHAPE0` for `RB`, again on 32-bit elements,
while `ROTATE` (`sv.rldcl`) will also use `SVSHAPE2` for `RA` and `RS`,
but `SVSHAPE3` for `RB` (the shift values, which cycle every 4
elements). Note that the shift values for use with `SVSHAPE3` have to be
stored as 32-bit elements:

| | | |
|---------|--------------------|--------------------|
| SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |

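Again this is little-endian packing, now with 32-bit elements; an
illustrative C check:

```
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* shift amounts 16, 12, 8, 7 packed two per 64-bit GPR, LE */
    const uint32_t shifts[4] = {16, 12, 8, 7};
    for (int i = 0; i < 4; i += 2) {
        uint64_t word = (uint64_t)shifts[i]
                      | ((uint64_t)shifts[i + 1] << 32);
        printf("0x%016llx\n", (unsigned long long)word);
    }
    /* prints 0x0000000c00000010 and 0x0000000700000008 */
    return 0;
}
```
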
The complete algorithm for a loop with 10 iterations is as follows:

```
li 7, 10 # Load value 10 into GPR #7
mtctr 7 # Set up counter on GPR #7

# set up VL=32 vertical-first, and SVSHAPEs 0-2
setvl 0, 0, 32, 1, 1, 1
# SHAPE0, used by sv.add and sv.xor, starts at GPR #8
svindex 8/2, 0, 1, 3, 0, 1, 0 # SVSHAPE0, a
# SHAPE1, used by sv.add, starts at GPR #12
svindex 12/2, 1, 1, 3, 0, 1, 0 # SVSHAPE1, b
# SHAPE2, used by sv.xor and sv.rldcl, starts at GPR #16
svindex 16/2, 2, 1, 3, 0, 1, 0 # SVSHAPE2, c
# SHAPE3, used also by sv.rldcl: the shift values start at GPR #20
# The inner loop will do 32 iterations, but there are only
# 4 shift values, so we mod by 4, and can cycle through them
svshape2 0, 0, 3, 4, 0, 1 # SVSHAPE3, shift amount, mod4
.outer:
# outer loop begins here (standard CTR loop)
setvl 0, 0, 32, 1, 1, 1 # MAXVL=VL=32, VF=1
# inner loop begins here. add-xor-rotl32 with remap, step, branch
.inner:
svremap 31, 1, 0, 0, 0, 0, 0 # RA=1, RB=0, RT=0 (0b01011)
sv.add/w=32 *x, *x, *x
svremap 31, 2, 0, 2, 2, 0, 0 # RA=2, RB=0, RS=2 (0b00111)
sv.xor/w=32 *x, *x, *x
svremap 31, 0, 3, 2, 2, 0, 0 # RA=2, RB=3, RS=2 (0b01110)
sv.rldcl/w=32 *x, *x, *SHIFTS, 0
# Caution: GPR #16 must not be used as the svstep. destination here,
# as it overlaps SHAPE2 (which starts at GPR #16) and would corrupt
# its first 8 indices; GPR #7 is free after mtctr, so we use that.
svstep. 7, 1, 0 # step to next in-regs element
bc 6, 3, .inner # svstep. Rc=1 loop-end-condition?
# inner-loop done: outer loop standard CTR-decrement to setvl again
bdnz .outer # Loop until CTR is zero
```
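
As a cross-check, here is a plain C model of what one outer iteration
computes (one full pass of the inner loop, `VL=32`), using the shape
tables shown earlier; the array names are ours, purely for illustration.
Calling it ten times mirrors the `mtctr`/`bdnz` outer loop, and the
result should match a scalar ChaCha double-round implementation:

```
#include <stdint.h>

#define ROTL32(v, c) (((v) << (c)) | ((v) >> (32 - (c))))

/* One outer iteration: each pass i does the remapped add, xor and
   rotate on element i.  v[0..15] are x0..x15 (GPRs 24-31, 32-bit
   elements); shifts[i % 4] models SVSHAPE3's mod-4 cycling. */
static void double_round(uint32_t v[16],
                         const uint8_t shape0[32], /* RT/RB of add, RB of xor */
                         const uint8_t shape1[32], /* RA of add */
                         const uint8_t shape2[32]) /* RA/RS of xor and rldcl */
{
    static const uint32_t shifts[4] = {16, 12, 8, 7};
    for (int i = 0; i < 32; i++) {
        v[shape0[i]] = v[shape1[i]] + v[shape0[i]];          /* sv.add   */
        v[shape2[i]] = v[shape2[i]] ^ v[shape0[i]];          /* sv.xor   */
        v[shape2[i]] = ROTL32(v[shape2[i]], shifts[i % 4]);  /* sv.rldcl */
    }
}
```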