|--------|-----|-----|-----|-----|
| GPR 24 | x0  | x1  | x2  | x3  |
|--------|-----|-----|-----|-----|
| GPR 28 | x4  | x5  | x6  | x7  |
|--------|-----|-----|-----|-----|
| GPR 32 | x8  | x9  | x10 | x11 |
|--------|-----|-----|-----|-----|
| GPR 36 | x12 | x13 | x14 | x15 |
|--------|-----|-----|-----|-----|
So for the addition in Vertical-First mode, the `RT` indices (and the `RA`
indices, as they are the same) are, in terms of x:
|----|----|----|----|----|----|----|----|
|  0 |  8 |  0 |  8 |  1 |  9 |  1 |  9 |
|----|----|----|----|----|----|----|----|
|  2 | 10 |  2 | 10 |  3 | 11 |  3 | 11 |
|----|----|----|----|----|----|----|----|
|  0 | 10 |  0 | 10 |  1 | 11 |  1 | 11 |
|----|----|----|----|----|----|----|----|
|  2 |  8 |  2 |  8 |  3 |  9 |  3 |  9 |
|----|----|----|----|----|----|----|----|
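
The `RT` index table can be reproduced with a few lines of Python. This is only a sketch: it assumes the standard ChaCha quarter-round groupings (four column rounds followed by four diagonal rounds) and that each quarter round performs its additions in the order `a += b; c += d; a += b; c += d`. The names `QUARTER_ROUNDS`, `add_rt_indices` and `rows_of` are hypothetical helpers, not part of any SVP64 tooling.

```python
# Assumed (a, b, c, d) groupings for one ChaCha double round:
# four column quarter rounds, then four diagonal quarter rounds.
QUARTER_ROUNDS = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),
]


def add_rt_indices(quarter_rounds=QUARTER_ROUNDS):
    """Return the RT (= RA) index of every addition, in execution order."""
    indices = []
    for a, _b, c, _d in quarter_rounds:
        # Additions in one quarter round: a += b; c += d; a += b; c += d
        indices.extend([a, c, a, c])
    return indices


def rows_of(values, width=8):
    """Split a flat list into rows of `width` entries, as in the table."""
    return [values[i:i + width] for i in range(0, len(values), width)]


for row in rows_of(add_rt_indices()):
    print(row)
```

Each printed row corresponds to one row of the table above, that is, to the indices that end up in one packed 64-bit register.
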
However, since the indices are small values, using a whole 64-bit
register for a single index value is wasteful, so we compress them,
packing 8 indices (one byte each) into each 64-bit register.
The 32 `RT` indices therefore fit inside these 4 registers (in Little Endian format):
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE0: | 0x901090108000800 | 0xb030b030a020a02 | 0xb010b010a000a00 | 0x903090308020802 |
|-----------|-------------------|-------------------|-------------------|-------------------|
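
The packing step can be checked with a short Python sketch. It assumes one byte per index, with the first index of each group in the least significant byte (Little Endian); `pack_indices` and `RT_INDICES` are hypothetical names used only for illustration.

```python
def pack_indices(indices, per_reg=8, bits=8):
    """Pack small index values into 64-bit integers, lowest byte first."""
    mask = (1 << bits) - 1
    regs = []
    for start in range(0, len(indices), per_reg):
        value = 0
        for pos, idx in enumerate(indices[start:start + per_reg]):
            # Element `pos` occupies bits [bits*pos, bits*(pos+1)).
            value |= (idx & mask) << (bits * pos)
        regs.append(value)
    return regs


# The 32 RT indices, row by row, as listed in the index table.
RT_INDICES = [
    0, 8, 0, 8, 1, 9, 1, 9,
    2, 10, 2, 10, 3, 11, 3, 11,
    0, 10, 0, 10, 1, 11, 1, 11,
    2, 8, 2, 8, 3, 9, 3, 9,
]

print([hex(reg) for reg in pack_indices(RT_INDICES)])
# ['0x901090108000800', '0xb030b030a020a02',
#  '0xb010b010a000a00', '0x903090308020802']
```

Note that `hex()` drops the leading zero of the top byte, which is why the values are printed with 15 hex digits, just as in the `SVSHAPE0` table.
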
Similarly, we find the `RB` indices:
|----|----|----|----|----|----|----|----|
|  4 | 12 |  4 | 12 |  5 | 13 |  5 | 13 |
|----|----|----|----|----|----|----|----|
|  6 | 14 |  6 | 14 |  7 | 15 |  7 | 15 |
|----|----|----|----|----|----|----|----|
|  5 | 15 |  5 | 15 |  6 | 12 |  6 | 12 |
|----|----|----|----|----|----|----|----|
|  7 | 13 |  7 | 13 |  4 | 14 |  4 | 14 |
|----|----|----|----|----|----|----|----|
Using a similar method, we find the final 4 registers with the `RB` indices:
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE1: | 0xd050d050c040c04 | 0xf070f070e060e06 | 0xc060c060f050f05 | 0xe040e040d070d07 |
|-----------|-------------------|-------------------|-------------------|-------------------|
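
As a cross-check, the packed registers can be decoded back into their byte-sized indices. The sketch below assumes the same Little Endian, one-byte-per-index layout; `unpack_indices` and `SVSHAPE1_REGS` are hypothetical names.

```python
# The four packed RB index registers, copied from the SVSHAPE1 table.
SVSHAPE1_REGS = [
    0xd050d050c040c04,
    0xf070f070e060e06,
    0xc060c060f050f05,
    0xe040e040d070d07,
]


def unpack_indices(regs):
    """Split each 64-bit value into its 8 bytes, least significant first."""
    return [list(reg.to_bytes(8, "little")) for reg in regs]


for row in unpack_indices(SVSHAPE1_REGS):
    print(row)
```

The last register, for instance, decodes to 7, 13, 7, 13, 4, 14, 4, 14.
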
Now, we can construct the Vertical First loop:
|----|----|----|----|----|----|----|----|
| 12 |  4 | 12 |  4 | 13 |  5 | 13 |  5 |
|----|----|----|----|----|----|----|----|
| 14 |  6 | 14 |  6 | 15 |  7 | 15 |  7 |
|----|----|----|----|----|----|----|----|
| 15 |  5 | 15 |  5 | 12 |  6 | 12 |  6 |
|----|----|----|----|----|----|----|----|
| 13 |  7 | 13 |  7 | 14 |  4 | 14 |  4 |
|----|----|----|----|----|----|----|----|
Again, we find the packed registers:
|-----------|-------------------|-------------------|-------------------|-------------------|
| SVSHAPE2: | 0x50d050d040c040c | 0x70f070f060e060e | 0x60c060c050f050f | 0x40e040e070d070d |
|-----------|-------------------|-------------------|-------------------|-------------------|
The next operation is the `ROTATE`, which takes as operands the result of the
`XOR` and a shift argument. You can easily see that the indices used in this
case are those of the `XOR` result, since that is what gets rotated; what
differs is the shift values, which cycle every 4 elements (16, 12, 8, 7).
Note that the actual indices for `SVSHAPE3` will have to be in 32-bit elements:
|---------|--------------------|--------------------|
| SHIFTS: | 0x0000000c00000010 | 0x0000000700000008 |
|---------|--------------------|--------------------|
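
The `SHIFTS` values can be reproduced by packing the four shift amounts as 32-bit Little Endian elements, two per 64-bit register. The sketch below also shows, in plain Python, what "cycling every 4 elements" means for a 32-bit rotate. The shift amounts 16, 12, 8, 7 are read off the table above; `pack_shifts`, `shift_for_element` and `rotl32` are hypothetical helpers, not SVP64 instructions.

```python
import struct

# Shift amounts as decoded from the SHIFTS table (32-bit elements).
SHIFT_VALUES = [16, 12, 8, 7]


def pack_shifts(shifts):
    """Store each shift as a 32-bit LE element, then view as 64-bit words."""
    raw = struct.pack("<%dI" % len(shifts), *shifts)
    return list(struct.unpack("<%dQ" % (len(raw) // 8), raw))


def shift_for_element(i, shifts=SHIFT_VALUES):
    """Shift amount used by element i: the list repeats every 4 elements."""
    return shifts[i % len(shifts)]


def rotl32(value, shift):
    """Rotate a 32-bit value left by `shift` bits (0 < shift < 32)."""
    value &= 0xFFFFFFFF
    return ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF


print(["0x%016x" % reg for reg in pack_shifts(SHIFT_VALUES)])
# ['0x0000000c00000010', '0x0000000700000008']
print([shift_for_element(i) for i in range(8)])
# [16, 12, 8, 7, 16, 12, 8, 7]
```

Using `struct` here makes the element width explicit: changing `I` (32-bit) to another format code changes how many shift values fit in each register.
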
The complete algorithm for a loop with 10 iterations is as follows: