\lstinputlisting[language={}]{maxloc.py}
\begin{itemize}
- \item "TODO
+ \item FORTRAN MAXLOC - find the index of largest number
+ \item notoriously difficult to optimally implement for SIMD
+ \item algorithms include \textit{depth-first} recursive
+ descent (!) mapreduce-style, offsetting the
+ locally-computed largest index (plus value) which
+ are then tested in upper level(s)
+ \item SVP64 through Data-Dependent Fail-First can perform
+ each of the two key while-loop tests with
+ \textit{single instructions}.
+ \item There is however quite a bit of "housekeeping".
+ Full analysis: \\
+ https://libre-soc.org/openpower/sv/cookbook/fortran\_maxloc
\end{itemize}
}
-\frame{\frametitle{maxlocassembler}
+\frame{\frametitle{maxloc assembler}
\lstinputlisting[language={}]{maxloc.s}
# while (i<n and a[i]>m):
sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # r4 accumulate
crternlogi 0,1,2,127 # test >= (or VL=0)
-sv.crand *19,*16,0 # clear if CR0.eq=0
+sv.crnand/m=lt/zz *19,*16,0 # SO=~LT, if CR0.eq=0
# nm = i: count masked bits. could use crweirds
sv.svstep/mr/m=so 1,0,6,1 # get vector dststep
sv.creqv *16,*16,*16 # set mask on already-tested
mtcrf 128, 0 # clear CR0 (in case VL=0?)
# while (i<n and a[i]>m):
sv.minmax./ff=le/m=ge/mr 4,*10,4,1 # uses r4 as accumulator
-crternlogi 0,1,2,127 # test greater/equal or VL=0
-sv.crand *19,*16,0 # clear if CR0.eq=0
+sv.crnand/m=lt/zz *19,*16,0 # SO=~LT, if CR0.eq=0
# nm = i (count masked bits. could use crweirds here TODO)
sv.svstep/mr/m=so 1, 0, 6, 1 # svstep: get vector dststep
sv.creqv *16,*16,*16 # set mask on already-tested
-bc 12,0, -0x40 # CR0 lt bit clear, branch back
+bc 12,0, -0x3c # CR0 lt bit clear, branch back
```
`sv.cmp` can be used in the first while loop because m (r4, the current
that happens then CR0 would be left it in its previous state: a
very much undesirable behaviour!)
-`crternlogi 0,1,2,127` will combine the setting of CR0.EQ and CR0.LT
-to give us a true Greater-than-or-equal, including under the circumstance
-where VL=0. The `sv.crand` will then take a copy of the `i-in-unary`
-mask, but only when CR0.EQ is set. This is why the third operand `BB`
-is a Scalar not a Vector (BT=16/Vector, BA=19/Vector, BB=0/Scalar)
-which effectively performs a broadcast-splat-ANDing, as follows:
+`sv.crnand/m=lt/zz` is quite sophisticated - a lot is going on behind
+the scenes. The effect is (through the NAND) to invert the Less-than
+to give us a Greater-than-or-equal, including under the circumstance
+where VL=0, but only when CR0.EQ is set. Note that the third operand
+`BB` (CR0.EQ) is a *scalar*, but that zeroing is used here. Therefore
+whenever the Vector of `LT` bits is zero, a zero is put into the
+Vector `SO` result. In effect, the predication is being exploited
+as a way to combine a third operand into what would otherwise be a
+2-in 1-out Condition Register operation, making it effectively 3-in 1-out.
+Obscure but effective!
```
CR4.SO = CR4.EQ AND CR0.EQ (if VL >= 1)
there is a very straightforward way to it: use `sv.svstep`.
```
-crternlogi 0,1,2,127
+sv.crnand/m=lt/zz *19,*16,0 # Vector SO = Vector ~LT, if CR0.eq=0
i ----> 0 1 2 3
CR4.EQ CR5.EQ CR6.EQ CR7.EQ