+ \item See ``SIMD Instructions Considered Harmful'' for SIMD/RVV analysis\\
+ https://sigarch.org/simd-instructions-considered-harmful/
+ \end{itemize}
+
+
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{RVV DAXPY assembly (RV32V)}
+
+\begin{semiverbatim}
+# a0 is n, a1 is ptr to x[0], a2 is ptr to y[0], fa0 is a
+ li t0, 2<<25
+ vsetdcfg t0 # enable 2 64b Fl.Pt. registers
+loop:
+ setvl t0, a0 # vl = t0 = min(mvl, n)
+ vld v0, a1 # load vector x
+ slli t1, t0, 3 # t1 = vl * 8 (in bytes)
+ vld v1, a2 # load vector y
+ add a1, a1, t1 # increment pointer to x by vl*8
+ vfmadd v1, v0, fa0, v1 # v1 += v0 * fa0 (y = a * x + y)
+ sub a0, a0, t0 # n -= vl (t0)
+ vst v1, a2 # store vector y
+ add a2, a2, t1 # increment pointer to y by vl*8
+ bnez a0, loop # repeat if n != 0
+\end{semiverbatim}
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{SV DAXPY assembly (RV64D)}
+
+\begin{semiverbatim}
+# a0 is n, a1 is ptr to x[0], a2 is ptr to y[0], fa0 is a
+ CSRvect1 = \{type: F, key: a3, val: a3, elwidth: dflt\}
+ CSRvect2 = \{type: F, key: a7, val: a7, elwidth: dflt\}
+loop:
+ setvl t0, a0, 4 # vl = t0 = min(4, n)
+ ld a3, a1 # load 4 registers a3-6 from x
+ slli t1, t0, 3 # t1 = vl * 8 (in bytes)
+ ld a7, a2 # load 4 registers a7-10 from y
+ add a1, a1, t1 # increment pointer to x by vl*8
+ fmadd a7, a3, fa0, a7 # a7-10 += a3-6 * fa0 (y = a * x + y)
+ sub a0, a0, t0 # n -= vl (t0)
+ st a7, a2 # store 4 registers a7-10 to y
+ add a2, a2, t1 # increment pointer to y by vl*8
+ bnez a0, loop # repeat if n != 0
+\end{semiverbatim}
+\end{frame}
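+
+
+\begin{frame}[fragile]
+\frametitle{DAXPY strip-mining: scalar C sketch}
+
+For reference, a minimal scalar C sketch of what the two assembly loops
+above compute. It is illustrative only: \texttt{MVL} and
+\texttt{daxpy\_stripmined} are hypothetical names, and \texttt{MVL = 4}
+is an assumption chosen to mirror \texttt{setvl t0, a0, 4} in the SV version.
+
+\begin{semiverbatim}
+#include <stddef.h>
+
+#define MVL 4  /* assumed maximum vector length */
+
+void daxpy_stripmined(size_t n, double a,
+                      const double *x, double *y)
+\{
+    while (n != 0) \{
+        size_t vl = n < MVL ? n : MVL; /* setvl: min(MVL, n) */
+        for (size_t i = 0; i < vl; i++)
+            y[i] = a * x[i] + y[i];    /* ld, ld, fmadd, st */
+        x += vl;                       /* advance x by vl */
+        y += vl;                       /* advance y by vl */
+        n -= vl;                       /* n -= vl */
+    \}
+\}
+\end{semiverbatim}
+\end{frame}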
+
+
+\frame{\frametitle{Under consideration}
+
+ \begin{itemize}
+ \item Is C.FNE actually needed? Should it be added if it is?
+ \item Element type implies polymorphism. Should it be in SV?
+ \item Should use of registers be allowed to ``wrap'' (x30 x31 x1 x2)?
+ \item Is detection of all-scalar ops ok (without slowing pipeline)?
+ \item Can VSELECT be removed? (it's really complex)
+ \item Can CLIP be done as a CSR (mode, like elwidth)?
+ \item Should SIMD saturation (etc.) also be set as a mode?
+ \item Include src1/src2 predication on Comparison Ops?\\
+ (same arrangement as C.MV, with same flexibility/power)
+ \item 8/16-bit ops: is it worthwhile adding a ``start offset''? \\
+ (a bit like misaligned addressing... for registers)\\
+ or just use predication to skip start?