a copy of VL into a GPR.
```
-# VL = MIN(CTR,MAXVL=8)
+# VL = r3 = MIN(CTR,MAXVL=8)
setvl 3,0,8,0,1,1 # set MVL=8, VL=MIN(MVL,CTR)
```
+Next the data must be loaded in batches (of up to QTY 8of
+8-bit values). We must also not over-run the end of the
+array (which a normal SIMD ISA would do, potentially causing
+a pagefault) and this is achieved by when CTR <= 8: VL
+(Vector Length) in such circumstances being set to the
+remaining elements.
+
+The next instruction is `sv.lbzu/pi` - a Post-Increment
+variant of `lbzu` which updates the Effective Address (in RA)
+**after** it has been used, not before. Also of note is the
+element-width override of `dw=8` which writes to *individual
+bytes* of r6, not to r6/7/8...13.
+
+Of note here however is the precursor `addi` instruction, which
+clears out the contents of r6 to zero before beginning the
+Vector/Looped Load sequence. This is *important* because
+SV does **not** zero-out parts of a register when element-width
+overrides are in use.
+
```
# load VL bytes (update r4 addr) but compressed (dw=8)
addi 6, 0, 0 # initialise all 64-bits of r6 to zero
sv.lbzu/pi/dw=8 *6, 1(4) # should be /lf here as well
```
+The next instruction is quite straightforward, and has the
+effect of turning (transposing) 8 values. Each bit zero
+of each input byte is placed into the first byte; each bit one
+of each input byte is placed into the second byte and so on.
+
+(*This instruction in VSX is called `vgbbd`, and it is present
+in many other ISAs. RISC-V bitmanip-draft-0.94 calls it `bmatflip`,
+x86 calls it GF2P7AFFINEQB. Power ISA, Cray, and many others
+all have this extremely powerful and useful instruction*)
+
```
# gather performs the transpose (which gets us to positional..)
gbbd 8,6
```
+Now the bits have been transposed in bytes, it is plain sailing
+to perform QTY8 parallel popcounts. However there is a trick
+going on: we have set VL=MVL=8. To explain this: it covers the
+last case where CTR may be between 1 and 8.
+
+Remember what happened back at the Vector-Load, where r6
+was cleared to zero before-hand? This filled out the 8x8 transposed
+grid (`gbbd`) fully with zeros prior to the actual transpose.
+Now when we do the popcount, there will be upper-numbered
+columns that are *not part of the result* that contain *zero*
+and *consequently have no impact on the result*.
+
+This elegant trick extends even to the accumulation of the
+results. However before we get there it is necessary to
+describe why `sw=8` has been added to `sv.popcntd`. What
+this is doing is treating each **byte** of its input
+(starting at the first byte of r8) as an independent
+quantity to be popcounted, but the result is *zero-extended*
+to 64-bit and stored in r24, r25, r26 ... r31.
+
+Therefore:
+
+* r24 contains the popcount of the first byte of r8
+* r25 contains the popcount of the second byte of r8
+* ...
+* r31 contains the popcount of the last (7th) byte of r8
+
```
# now those bits have been turned around, popcount and sum them
setvl 0,0,8,0,1,1 # set MVL=VL=8
sv.popcntd/sw=8 *24,*8 # do the (now transposed) popcount
```
+With VL being hard-set to 8, we can now Vector-Parallel
+sum the partial results r24-31 into the array of accumulators
+r16-r23. There was no need to set VL=8 again because it was
+set already by the `setvl` above.
+
```
sv.add *16,*16,*24 # and accumulate in results
```