From 41d46fd472cf8f96a1e8726829909821fb505d23 Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Wed, 6 Dec 2023 15:15:54 +0000 Subject: [PATCH] whitespace cleanup --- openpower/sv/cookbook/pospopcnt.mdwn | 233 +++++++++++++-------------- 1 file changed, 110 insertions(+), 123 deletions(-) diff --git a/openpower/sv/cookbook/pospopcnt.mdwn b/openpower/sv/cookbook/pospopcnt.mdwn index 516db6d2c..3a584ba56 100644 --- a/openpower/sv/cookbook/pospopcnt.mdwn +++ b/openpower/sv/cookbook/pospopcnt.mdwn @@ -16,10 +16,10 @@ This project is funded through the [NGI Assure Fund](https://nlnet.nl/assure), a **Introduction** -Positional popcount in optimised assembler is typically done on SIMD ISAs in -around 500 lines. Power ISA thanks to `bpermd` can be much more efficient: -with SVP64 even more so. The reference implementation showing the concept -is below. +Positional popcount in optimised assembler is typically done on SIMD +ISAs in around 500 lines. Power ISA thanks to `bpermd` can be much +more efficient: with SVP64 even more so. The reference implementation +showing the concept is below. ``` // Copyright (c) 2020 Robert Clausecker @@ -70,49 +70,47 @@ bit-position, of an array of input values. Refer to Fig.2 # Visual representation of the pospopcount algorithm -In order to perform positional popcount we need to go -through series of steps shown below in figures 3, 4, 5 & 6. -The limitation to overcome is that the CPU can only work -on registers (horizontally) but the requirement of pospopcount -is to add *vertically*. Part of the challenge is therefore -to perform an appropriate transpose of the data, in blocks -that suit the processor and the ISA capacity. - + +In order to perform positional popcount we need to go through series of +steps shown below in figures 3, 4, 5 & 6. The limitation to overcome is +that the CPU can only work on registers (horizontally) but the requirement +of pospopcount is to add *vertically*. Part of the challenge is therefore +to perform an appropriate transpose of the data, in blocks that suit +the processor and the ISA capacity. + Fig.3 depicts how the data is divided into blocks of 8*8-bit. pospopcnt -Fig.4 shows that a transpose is needed on each block. -The gbbd instruction is used for implementing the transpose -function, preparing the data for using the standard popcount -instruction. +Fig.4 shows that a transpose is needed on each block. The gbbd +instruction is used for implementing the transpose function, preparing +the data for using the standard popcount instruction. pospopcnt -Now on each 8*8 block, 8 popcount instructions -can be run each of which is independent and therefore can -be parallelised even by In-Order multi-issue hardware(Fig.5). +Now on each 8*8 block, 8 popcount instructions can be run each of +which is independent and therefore can be parallelised even by In-Order +multi-issue hardware(Fig.5). pospopcnt -Fig.6 depicts how each of the intermediate results are -accumulated. It is worth noting that each intermediate result -is independent of the other intermediate results and also -parallel reduction can be applied to all of them -individually. This gives *two* opportunities for -hardware parallelism rather than one. +Fig.6 depicts how each of the intermediate results are accumulated. It +is worth noting that each intermediate result is independent of the +other intermediate results and also parallel reduction can be applied +to all of them individually. This gives *two* opportunities for hardware +parallelism rather than one. pospopcnt -In short this algorithm is very straightforward to implement thanks to the two -crucial instructions, `gbbd` and `popcntd`. Below is a walkthrough of the -assembler, keeping it very simple, and exploiting only one of the opportunities -for parallelism (by not including the Parallel Reduction opportunity mentioned -above). +In short this algorithm is very straightforward to implement thanks to the +two crucial instructions, `gbbd` and `popcntd`. Below is a walkthrough +of the assembler, keeping it very simple, and exploiting only one of the +opportunities for parallelism (by not including the Parallel Reduction +opportunity mentioned above). # Walkthrough of the assembler @@ -123,19 +121,17 @@ as outlined further, below mtspr 9, 3 # move r3 to CTR ``` -The Vector Length, which is limited to 8 (MVL - Maximum -Vector Length) is set up. A special "CTR" Mode is used -which automatically uses the CTR SPR rather than register -RA. (*Note that RA is set to zero to indicate this, because there is -limited encoding space. See [[openpower/sv/setvl]] instruction -specification for details)*. - -The result of this instruction is that if CTR is greater than -8, VL is set to 8. If however CTR is less than or equal to 8, -then VL is set to CTR. Additionally, a copy of VL is placed -into RT (r3 in this case), which again is necessary as part -of the limited encoding space but in some cases (not here) -this is desirable, and avoids a `mfspr` instruction to take +The Vector Length, which is limited to 8 (MVL - Maximum Vector Length) +is set up. A special "CTR" Mode is used which automatically uses the CTR +SPR rather than register RA. (*Note that RA is set to zero to indicate +this, because there is limited encoding space. See [[openpower/sv/setvl]] +instruction specification for details)*. + +The result of this instruction is that if CTR is greater than 8, VL is set +to 8. If however CTR is less than or equal to 8, then VL is set to CTR. +Additionally, a copy of VL is placed into RT (r3 in this case), which +again is necessary as part of the limited encoding space but in some cases +(not here) this is desirable, and avoids a `mfspr` instruction to take a copy of VL into a GPR. ``` @@ -143,24 +139,21 @@ a copy of VL into a GPR. setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR) ``` -Next the data must be loaded in batches (of up to QTY 8of -8-bit values). We must also not over-run the end of the -array (which a normal SIMD ISA would do, potentially causing -a pagefault) and this is achieved by when CTR <= 8: VL -(Vector Length) in such circumstances being set to the -remaining elements. - -The next instruction is `sv.lbzu/pi` - a Post-Increment -variant of `lbzu` which updates the Effective Address (in RA) -**after** it has been used, not before. Also of note is the -element-width override of `dw=8` which writes to *individual -bytes* of r6, not to r6/7/8...13. - -Of note here however is the precursor `addi` instruction, which -clears out the contents of r6 to zero before beginning the -Vector/Looped Load sequence. This is *important* because -SV does **not** zero-out parts of a register when element-width -overrides are in use. +Next the data must be loaded in batches (of up to QTY 8of 8-bit +values). We must also not over-run the end of the array (which a normal +SIMD ISA would do, potentially causing a pagefault) and this is achieved +by when CTR <= 8: VL (Vector Length) in such circumstances being set to +the remaining elements. + +The next instruction is `sv.lbzu/pi` - a Post-Increment variant of `lbzu` +which updates the Effective Address (in RA) **after** it has been used, +not before. Also of note is the element-width override of `dw=8` which +writes to *individual bytes* of r6, not to r6/7/8...13. + +Of note here however is the precursor `addi` instruction, which clears +out the contents of r6 to zero before beginning the Vector/Looped Load +sequence. This is *important* because SV does **not** zero-out parts +of a register when element-width overrides are in use. ``` # load VL bytes (update r4 addr) but compressed (dw=8) @@ -168,39 +161,37 @@ addi 6, 0, 0 # initialise all 64-bits of r6 to zero sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well ``` -The next instruction is quite straightforward, and has the -effect of turning (transposing) 8 values. Each bit zero -of each input byte is placed into the first byte; each bit one -of each input byte is placed into the second byte and so on. +The next instruction is quite straightforward, and has the effect of +turning (transposing) 8 values. Each bit zero of each input byte is +placed into the first byte; each bit one of each input byte is placed +into the second byte and so on. -(*This instruction in VSX is called `vgbbd`, and it is present -in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, -x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others -all have this extremely powerful and useful instruction*) +(*This instruction in VSX is called `vgbbd`, and it is present in many +other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`, x86 calls it +GF2P7AFFINEQB. Power ISA, Cray, and many others all have this extremely +powerful and useful instruction*) ``` # gather performs the transpose (which gets us to positional..) gbbd 8,6 ``` -Now the bits have been transposed in bytes, it is plain sailing -to perform QTY8 parallel popcounts. However there is a trick -going on: we have set VL=MVL=8. To explain this: it covers the -last case where CTR may be between 1 and 8. +Now the bits have been transposed in bytes, it is plain sailing to +perform QTY8 parallel popcounts. However there is a trick going on: +we have set VL=MVL=8. To explain this: it covers the last case where +CTR may be between 1 and 8. -Remember what happened back at the Vector-Load, where r6 -was cleared to zero before-hand? This filled out the 8x8 transposed -grid (`gbbd`) fully with zeros prior to the actual transpose. -Now when we do the popcount, there will be upper-numbered -columns that are *not part of the result* that contain *zero* -and *consequently have no impact on the result*. +Remember what happened back at the Vector-Load, where r6 was cleared to +zero before-hand? This filled out the 8x8 transposed grid (`gbbd`) fully +with zeros prior to the actual transpose. Now when we do the popcount, +there will be upper-numbered columns that are *not part of the result* +that contain *zero* and *consequently have no impact on the result*. This elegant trick extends even to the accumulation of the -results. However before we get there it is necessary to -describe why `sw=8` has been added to `sv.popcntd`. What -this is doing is treating each **byte** of its input -(starting at the first byte of r8) as an independent -quantity to be popcounted, but the result is *zero-extended* +results. However before we get there it is necessary to describe why +`sw=8` has been added to `sv.popcntd`. What this is doing is treating +each **byte** of its input (starting at the first byte of r8) as an +independent quantity to be popcounted, but the result is *zero-extended* to 64-bit and stored in r24, r25, r26 ... r31. Therefore: @@ -216,21 +207,19 @@ setvl 0,0,8,0,1,1 # set MVL=VL=8 sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount ``` -With VL being hard-set to 8, we can now Vector-Parallel -sum the partial results r24-31 into the array of accumulators -r16-r23. There was no need to set VL=8 again because it was -set already by the `setvl` above. +With VL being hard-set to 8, we can now Vector-Parallel sum the partial +results r24-31 into the array of accumulators r16-r23. There was no need +to set VL=8 again because it was set already by the `setvl` above. ``` sv.add *16,*16,*24 # and accumulate in results ``` -Finally, `sv.bc` loops back, subtracting VL from CTR. -Being based on Power ISA Scalar Branch-Conditional, if -CTR goes negative (which it will because VL was set -hard-coded to 8, above) it does not matter, the loop -will still be correctly exited. In other words we complete -the correct number of *blocks* (of length 8). +Finally, `sv.bc` loops back, subtracting VL from CTR. Being based on +Power ISA Scalar Branch-Conditional, if CTR goes negative (which it will +because VL was set hard-coded to 8, above) it does not matter, the loop +will still be correctly exited. In other words we complete the correct +number of *blocks* (of length 8). ``` # branch back if CTR still non-zero. works even though VL=8 @@ -239,39 +228,37 @@ sv.bc/all 16, *0, -0x28 # reduce CTR by VL and stop if -ve # Improvements -There exist many opportunities for parallelism that simpler hardware -would need to have in order to maximise performance. On Out-of-Order -hardware the above extremely simple and very clear algorithm will -achieve extreme high levels of performance simply by exploiting -standard Multi-Issue Register Hazard Management. +There exist many opportunities for parallelism that simpler hardware would +need to have in order to maximise performance. On Out-of-Order hardware +the above extremely simple and very clear algorithm will achieve extreme +high levels of performance simply by exploiting standard Multi-Issue +Register Hazard Management. However simpler hardware - in-order - will need a little bit of -assistance, and that can come in the form of expanding to QTY4 or -QTY8 64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), -`gbbd` becomes an `sv.gbbd` but VL being set to the block count -(QTY4 or QTY8), and the SV REMAP Parallel Reduction Schedule being -applied to each intermediary result rather than using an array -of straight accumulators `r16-r23`. However this starts to push -the boundaries of the number of registers needed, so as an -exercise is left for another time. +assistance, and that can come in the form of expanding to QTY4 or QTY8 +64-bit blocks (so that sv.popcntd uses MVL=VL=32 or MVL=VL=64), `gbbd` +becomes an `sv.gbbd` but VL being set to the block count (QTY4 or QTY8), +and the SV REMAP Parallel Reduction Schedule being applied to each +intermediary result rather than using an array of straight accumulators +`r16-r23`. However this starts to push the boundaries of the number of +registers needed, so as an exercise is left for another time. # Conclusion -Where a normal SIMD ISA requires explicit hand-crafted optimisation -in order to achieve full utilisation of the underlying hardware, -Simple-V instead can rely to a large extent on standard Multi-Issue -hardware to achieve similar performance, whilst crucially keeping the -algorithm implementation down to a shockingly-simple degree that makes -it easy to understand an easy to review. Again also as with many -other algorithms when implemented in Simple-V SVP54, by keeping to -a LOAD-COMPUTE-STORE paradigm the L1 Data Cache usage is minimised, -and in this case just as with chacha20 the entire algorithm, being -only 9 lines of assembler fitting into 13 4-byte words it can fit -into a single L1 I-Cache Line without triggering Virtual Memory TLB -misses. - -Further performance improvements are achievable by using REMAP -Parallel Reduction, still fitting into a single L1 Cache line, -but beginning to approach the limit of the 128-long register file. +Where a normal SIMD ISA requires explicit hand-crafted optimisation in +order to achieve full utilisation of the underlying hardware, Simple-V +instead can rely to a large extent on standard Multi-Issue hardware +to achieve similar performance, whilst crucially keeping the algorithm +implementation down to a shockingly-simple degree that makes it easy to +understand an easy to review. Again also as with many other algorithms +when implemented in Simple-V SVP54, by keeping to a LOAD-COMPUTE-STORE +paradigm the L1 Data Cache usage is minimised, and in this case just +as with chacha20 the entire algorithm, being only 9 lines of assembler +fitting into 13 4-byte words it can fit into a single L1 I-Cache Line +without triggering Virtual Memory TLB misses. + +Further performance improvements are achievable by using REMAP Parallel +Reduction, still fitting into a single L1 Cache line, but beginning to +approach the limit of the 128-long register file. [[!tag svp64_cookbook ]] -- 2.30.2