+
+\frame{\frametitle{How are SIMD Instructions Vectorised?}
+
+ \begin{itemize}
+ \item SIMD ALU(s) remain largely unchanged
+ \item Predication extends down to each SIMD element (if requested;
+ otherwise the entire block is predicated as a whole)
+ \item Predicate bits are sent in groups to the ALU (if requested;
+ otherwise just one bit covers the entire packed block)
+ \item End of Vector acts as (additional) predication:
+ completely eliminates end-case code (ONLY in multi-bit
+ predication mode)
+ \end{itemize}
+ Considerations:
+ \begin{itemize}
+ \item Many SIMD ALUs possible (parallel execution)
+ \item Implementor free to choose (API remains the same)
+ \item Unused ALU units wasted, but s/w DRASTICALLY simpler
+ \item Very long SIMD ALUs could waste significant die area
+ \end{itemize}
+}
+% With multiple SIMD ALUs of, for example, 32-bit width, they can be
+% combined to issue 64-bit, 128-bit or 256-bit wide SIMD operations,
+% or they can be used to cover several operations on entirely different
+% vectors / registers.
+
+\frame{\frametitle{Predicated 9-parallel SIMD ADD (Packed=Y)}
+ \begin{center}
+ \includegraphics[height=2.5in]{padd9_simd.png}\\
+ {\bf \red 4-wide 8-bit SIMD, 4 bits of predicate passed to ALU}
+ \end{center}
+}
+
+
+\frame{\frametitle{Why are overlaps allowed in Regfiles?}
+
+ \begin{itemize}
+ \item Same target register(s) can have multiple ``interpretations''
+ \item CSRs are costly to write to (do it once)
+ \item Set ``real'' (scalar) register without needing to set/unset CSRs
+ \item xBitManip plus SIMD plus xBitManip = Hi/Lo bitops
+ \item (32-bit GREV plus 4x8-bit SIMD plus 32-bit GREV:\\
+ GREV @ VL=N,wid=32; SIMD @ VL=Nx4,wid=8)
+ \item RGB 565 (video): BEXTW plus 4x8-bit SIMD plus BDEPW\\
+ (BEXT/BDEP @ VL=N,wid=32; SIMD @ VL=Nx4,wid=8)
+ \item Same register(s) can be offset (no need for VSLIDE)\vspace{6pt}
+ \end{itemize}
+ Note:
+ \begin{itemize}
+ \item xBitManip reduces O($N^{6}$) SIMD down to O($N^{3}$) on its own.
+ \item Hi-Performance: Macro-op fusion (more pipeline stages?)
+ \end{itemize}
+}
+
+
+\frame{\frametitle{C.MV extremely flexible!}
+
+ \begin{itemize}
+ \item scalar-to-vector (w/ no pred): VSPLAT
+ \item scalar-to-vector (w/ dest-pred): Sparse VSPLAT
+ \item scalar-to-vector (w/ 1-bit dest-pred): VINSERT
+ \item vector-to-scalar (w/ [1-bit?] src-pred): VEXTRACT
+ \item vector-to-vector (w/ no pred): Vector Copy
+ \item vector-to-vector (w/ src pred): Vector Gather (inc. VSLIDE)
+ \item vector-to-vector (w/ dest pred): Vector Scatter (inc. VSLIDE)
+ \item vector-to-vector (w/ src \& dest pred): Vector Gather/Scatter
+ \end{itemize}
+ \vspace{4pt}
+ Notes:
+ \begin{itemize}
+ \item Surprisingly powerful! Zero-predication makes it even more so
+ \item Same arrangement for FCVT, FMV, FSGNJ etc.
+ \end{itemize}
+}
+
+