-\frame{\frametitle{Implementation Options}
-
- \begin{itemize}
- \item Absolute minimum: Exceptions (if CSRs indicate ``V'', trap)
- \item Hardware loop, single-instruction issue\\
- (Do / Don't send through predication to ALU)
- \item Hardware loop, parallel (multi-instruction) issue\\
- (Do / Don't send through predication to ALU)
- \item Hardware loop, full parallel ALU (not recommended)
- \end{itemize}
- Notes:\vspace{6pt}
- \begin{itemize}
- \item The 4 (or more?) options above may be deployed on a per-op basis
- \item SIMD always sends predication bits through to ALU
- \item Minimum MVL MUST be sufficient to cover regfile LD/ST
- \item Instr. FIFO may repeatedly split off N scalar ops at a time
- \end{itemize}
-}
-% Instr. FIFO may need its own slide. Basically, the vectorised op
-% gets pushed into the FIFO, where it is then "processed". Processing
-% removes the first set of ops from the vector numbering (taking
-% predication into account) and pushes them **BACK** into the FIFO
-% as scalar ops, while MODIFYING the remaining "vectorised" op by
-% subtracting the now-scalar ops from it.
-
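-% A minimal sketch of the FIFO splitting described in the comment above,
-% under stated assumptions: the function name fifo_expand, the bitmask
-% encoding of the predicate and the example predicate value are all
-% hypothetical, and the FIFO is simplified to hold only the start offset
-% of the remaining vectorised op. It is illustrative, not a reference
-% implementation.

```python
from collections import deque


def fifo_expand(vl, predicate, n):
    """Split a predicated vector op into passes of at most n scalar ops.

    vl        -- vector length (number of elements)
    predicate -- bitmask; bit i set means element i is active
    n         -- how many scalar ops may be split off per pass
    Returns a list of passes, each a list of element indices.
    """
    # The FIFO initially holds the whole vectorised op, starting at element 0.
    fifo = deque([0])
    passes = []
    while fifo:
        start = fifo.popleft()
        batch = []
        elem = start
        # Take the first n active elements, skipping masked-out ones.
        while elem < vl and len(batch) < n:
            if predicate & (1 << elem):
                batch.append(elem)
            elem += 1
        if batch:
            passes.append(batch)
        # Push the shortened vectorised op back if any elements remain.
        if elem < vl:
            fifo.append(elem)
    return passes


# Hypothetical predicate with six of eight elements active.
pred = 0b10111011
wide = fifo_expand(8, pred, 4)    # models a 4-wide ALU
narrow = fifo_expand(8, pred, 1)  # models a 1-wide ALU
```

-% With this assumed predicate, the 4-wide case yields two passes (four
-% elements, then two) and the 1-wide case yields six passes, which is
-% consistent with the cycle counts quoted on the slides that follow.
-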
-\frame{\frametitle{Predicated 8-parallel ADD: 1-wide ALU}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_alu1.png}\\
- {\bf \red Predicated adds are shuffled down: 6 cycles in total}
- \end{center}
-}
-
-
-\frame{\frametitle{Predicated 8-parallel ADD: 4-wide ALU}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_alu4.png}\\
- {\bf \red Predicated adds are shuffled down: 4 in 1st cycle, 2 in 2nd}
- \end{center}
-}
-
-
-\frame{\frametitle{Predicated 8-parallel ADD: 3-phase FIFO expansion}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_fifo.png}\\
- {\bf \red First cycle takes first four 1s; second takes the rest}
- \end{center}
-}
-
-
-\frame{\frametitle{How are SIMD Instructions Vectorised?}
-
- \begin{itemize}
- \item SIMD ALU(s) primarily unchanged\vspace{6pt}
- \item Predication is added to each SIMD element\vspace{6pt}
- \item Predication bits sent in groups to the ALU\vspace{6pt}
- \item End of Vector enables (additional) predication\vspace{10pt}
- \end{itemize}
- Considerations:\vspace{4pt}
- \begin{itemize}
- \item Many SIMD ALUs possible (parallel execution)
- \item Implementor free to choose (API remains the same)
- \item Unused ALU units wasted, but s/w DRASTICALLY simpler
- \item Very long SIMD ALUs could waste significant die area
- \end{itemize}
-}
-% Multiple SIMD ALUs, each for example 32 bits wide, can either be
-% combined to issue 64-bit, 128-bit or 256-bit wide SIMD operations,
-% or be used to cover several operations on totally different
-% vectors / registers.
-
-\frame{\frametitle{Predicated 9-parallel SIMD ADD}
- \begin{center}
- \includegraphics[height=2.5in]{padd9_simd.png}\\
- {\bf \red 4-wide 8-bit SIMD, 4 bits of predicate passed to ALU}
- \end{center}
-}
-
-