nm = i;
}
}
- return nm;
+ return nm;
}
```
subroutine below, explaining that it finds the index of a minimum value within
a group of eight unsigned bytes. It is necessary to use a second outer loop
to perform many of these searches in parallel, followed by conditionally
-offsetting each of the block-results.
+offsetting each of the block-results.
<https://stackoverflow.com/questions/49683866/find-min-and-position-of-the-min-element-in-uint8x8-t-neon-register>
```
# while (i<n)
-setvl 2,0,4,0,1,1 # set MVL=4, VL=MIN(MVL,CTR)
+setvl 2,0,4,0,1,1 # set MVL=4, VL=MIN(MVL,CTR)
# while (i<n and a[i]<=m) : i += 1
-sv.cmp/ff=gt/m=ge *0,0,*10,4 # truncates VL to min
-sv.creqv *16,*16,*16 # set mask on already-tested
-setvl 2,0,4,0,1,1 # set MVL=4, VL=MIN(MVL,CTR)
-mtcrf 128, 0 # clear CR0 (in case VL=0?)
+sv.cmp/ff=gt/m=ge *0,0,*10,4 # truncates VL to min
+sv.creqv *16,*16,*16 # set mask on already-tested
+setvl 2,0,4,0,1,1 # set MVL=4, VL=MIN(MVL,CTR)
+mtcrf 128, 0 # clear CR0 (in case VL=0?)
# while (i<n and a[i]>m):
-sv.minmax./ff=le/m=ge 4,*10,4,1 # uses r4 as accumulator
-crternlogi 0,1,2,127 # test greater/equal or VL=0
-sv.crand *19,*16,0 # clear if CR0.eq=0
+sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
+crternlogi 0,1,2,127 # test greater/equal or VL=0
+sv.crand *19,*16,0 # clear if CR0.eq=0
# nm = i (count masked bits. could use crweirds here TODO)
-sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep
-sv.creqv *16,*16,*16 # set mask on already-tested
-bc 12,0, -0x40 # CR0 lt bit clear, branch back
+sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep
+sv.creqv *16,*16,*16 # set mask on already-tested
+bc 12,0, -0x40 # CR0 lt bit clear, branch back
```
`sv.cmp` can be used in the first while loop because m (r4, the current
CR7.SO = CR7.EQ AND CR0.EQ (if VL = 4)
```
+**Converting the unary mask to binary**
+
+Now that the `CR4/5/6/7.SO` bits have been set, it is necessary to
+count them, i.e. convert an unary mask into a binary number. There
+are several ways to do this, one of which is
+the `crweird` suite of instructions, combined with `popcnt`. However
+there is a very straightforward way to it: use `sv.svstep`.
+
+```
+crternlogi 0,1,2,127
+
+ i ----> 0 1 2 3
+ CR4.EQ CR5.EQ CR6.EQ CR7.EQ
+ & CR0 & CR0 & CR0 & CR0
+ | | | |
+ v v v v
+ CR4.SO CR5.SO CR6.SO CR7.SO
+
+sv.svstep/mr/m=so 1, 0, 6, 1
+
+ | | | |
+ count <--+---------+--------+---------+
+```
+
+In reality what is happening is that svstep is requested to return
+the current `dststep` (destination step index), into a scalar
+destination (`RT=r1`), but in "mapreduce" mode. Mapreduce mode
+will keep running as if the destination was a Vector, overwriting
+previously-written results. Normally, one of the *sources* would
+be the same register (`RA=r1` for example) which would turn r1 into
+an accumulator, however in this particular case we simply consistently
+overwrite `r1` and end up with the last `dststep`.
+
+There `is` an alternative to this approach: `getvl` followed by
+subtracting 1. `VL` being the total Vector Length as set by
+Data-Dependent Fail-First, is the *length* of the Vector, whereas
+what we actually want is the index of the last element: hence
+using `sv.svstep`. Given that using `getvl` followed by `addi -1`
+is two instructions rather than one, we use `sv.svstep` instead,
+although they are the same amount of words (SVP64 is 64-bit sized).
+
+Lastly the `creqv` sets up the mask (based on the current
+Vector Length), and the loop-back (`bc`) jumps to reset the
+Vector Length back to the maximum (4). Thus, the mask (CR4/5/6/7 EQ
+bit) which began as empty will exclude nothing but on subsequent
+loops will exclude previously-tested elements (up to the previous
+value of VL) before it was reset back to the maximum.
+
+This is actually extremely important to note, here, because the
+Data-Dependent Fail-First testing starts at the **first non-masked**
+element, not the first element. This fact is exploited here, and
+the only thing to contend with is if VL is ever set to zero
+(first element fails). If that happens then CR0 will never get set,
+(because no element succeeded and the Vector Length is truncated to zero)
+hence the need to clear CR0 prior to starting that instruction.
+
+Overall this algorithm is extremely short but quite complex in its
+intricate use of SVP64 capabilities, yet still remains parallelliseable
+in hardware. It could potentially be shorter: there may be opportunities
+to use predicate masking with the CR field manipulation. However
+the number of instructions is already at the point where, even when the
+LOAD outer loop is added it will still remain short enough to fit in
+a single line of L1 Cache.
+
[[!tag svp64_cookbook ]]