\frametitle{Reminder of Simple-V}
\begin{semiverbatim}
+https://libre-soc.org/openpower/sv/overview/
Greatly simplified (like x86 "REP" instruction):
for (i = 0; i < VL; i++)
- ireg[RT+i] <= ireg[RA+i] + ireg[RB+i];
+ GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
-function op\_add(rd, rs1, rs2, predr) # add not VADD!
+function op\_add(RT, RA, RB, predr) # add not VADD!
int i, id=0, irs1=0, irs2=0;
for (i = 0; i < VL; i++)
- if (ireg[predr] & 1<<i) # predication uses intregs
- ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
- if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
- if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
- if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
+ if (GPR[predr] & 1<<i) # predication
+ GPR[RT+id] <= GPR[RA+irs1] + GPR[RB+irs2];
+ if (reg\_is\_vectorised[RT]) \{ id += 1; \}
+ if (reg\_is\_vectorised[RA]) \{ irs1 += 1; \}
+ if (reg\_is\_vectorised[RB]) \{ irs2 += 1; \}
\end{semiverbatim}
\end{frame}
+\begin{frame}[fragile]
+\frametitle{SVP64 REMAP system}
+
+\begin{semiverbatim}
+Register offsets are "REMAP"ed through a Hardware FSM
+https://libre-soc.org/openpower/sv/remap/
+remarkably similar to ZOLC
+https://www.researchgate.net/publication/224647569
+
+function op\_add(RT, RA, RB, predr) # add not VADD!
+  int i, id=0, irs1=0, irs2=0;
+  for (i = 0; i < VL; i++)
+    if (GPR[predr] & 1<<i) # predication
+      GPR[RT+REMAP(id)] <= GPR[RA+REMAP(irs1)] +
+                           GPR[RB+REMAP(irs2)];
+    if (reg\_is\_vectorised[RT]) \{ id += 1; \}
+    if (reg\_is\_vectorised[RA]) \{ irs1 += 1; \}
+    if (reg\_is\_vectorised[RB]) \{ irs2 += 1; \}
+\end{semiverbatim}
+
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Matrix Multiply Basics}
+
+\begin{semiverbatim}
+(a00 a01 a02  x  (b00 b01   =
+ a10 a11 a12)     b10 b11
+                  b20 b21)
+
+(a00*b00 + a01*b10 + a02*b20   a00*b01 + a01*b11 + a02*b21
+ a10*b00 + a11*b10 + a12*b20   a10*b01 + a11*b11 + a12*b21)
+
+(b00 b01   x  (a00 a01 a02  =
+ b10 b11      a10 a11 a12)
+ b20 b21)
+
+(b00*a00 + b01*a10   b00*a01 + b01*a11   b00*a02 + b01*a12
+ b10*a00 + b11*a10   b10*a01 + b11*a11   b10*a02 + b11*a12
+ b20*a00 + b21*a10   b20*a01 + b21*a11   b20*a02 + b21*a12)
+
+\end{semiverbatim}
+
+\end{frame}
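+
+\begin{frame}
+\frametitle{Matrix Multiply: general form}
+
+As a compact restatement of the two worked examples (the symbols
+$m$, $n$, $p$ are introduced here purely for illustration): for
+$A$ of size $m \times n$ and $B$ of size $n \times p$,
+\[
+  C_{ij} = \sum_{v=0}^{n-1} A_{iv} B_{vj},
+  \qquad 0 \le i < m, \quad 0 \le j < p
+\]
+so $C$ is $m \times p$: the $2 \times 3$ by $3 \times 2$ product is
+$2 \times 2$, and the reversed $3 \times 2$ by $2 \times 3$ product
+is $3 \times 3$.
+
+\end{frame}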
+
+
+
+\begin{frame}[fragile]
+\frametitle{Naive Matrix Multiply with python for-loops}
+
+\begin{semiverbatim}
+result = []                      # final result
+for i in range(len(A)):
+
+    row = []                     # the new row in new matrix
+    for j in range(len(B[0])):
+
+        product = 0              # the new element in the new row
+        for v in range(len(A[i])):
+            product += A[i][v] * B[v][j]
+        row.append(product)      # add sum of product to new row
+
+    result.append(row)           # add new row into final result
+\end{semiverbatim}
+
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Matrix Multiply suitable for Hardware scheduling}
+
+\begin{semiverbatim}
+Unsuitable: creates massive Read-After-Write chains
+
+for i in range(len(A)):
+    for j in range(len(B[0])):
+        for v in range(len(A[i])):
+            product[i][j] += A[i][v] * B[v][j]
+
+Suitable: can be parallelised / pipelined. RaW avoided
+
+for i in range(len(A)):
+    for v in range(len(A[i])):      # swapped
+        for j in range(len(B[0])):  # with this
+            product[i][j] += A[i][v] * B[v][j]
+
+\end{semiverbatim}
+
+\end{frame}
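+
+\begin{frame}
+\frametitle{Why the loop swap helps: a rough count}
+
+A back-of-envelope sketch, assuming one FMAC is issued per inner-loop
+iteration in program order (the symbols $n$, $p$ and $k$ are
+introduced here for illustration). With $n$ = len(A[i]) and
+$p$ = len(B[0]), number the FMACs $k = 0, 1, 2, \ldots$ in issue order:
+\[
+  \mbox{order } (i, j, v): \quad k = (i \cdot p + j) \cdot n + v
+\]
+\[
+  \mbox{order } (i, v, j): \quad k = (i \cdot n + v) \cdot p + j
+\]
+In the first order, stepping $v$ by one moves $k$ by $1$, so each FMAC
+reads the accumulator written by the FMAC immediately before it. In
+the second, stepping $v$ by one moves $k$ by $p$, so $p - 1$
+FMACs on other accumulators sit between two updates of the same
+product[i][j].
+
+\end{frame}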
+
+
+\frame{\frametitle{Generalise but Specialise}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item Why not make a general-purpose nested "Loop" system?\\
+ - Allow any arbitrary-sized loops\\
+ - Allow any permutation of nesting\\
+ - Allow reversing per-dimension\vspace{8pt}
+ \item Specialise by making Matrix Multiply "setup" quick/easy\\
+ - two 32-bit instructions to set up A, B, C sizes\\
+ - one 64-bit SVP64 FMAC instruction.\\
+ - Nothing else needed. Saves on I-Cache\vspace{8pt}
+ \item Hardware turns out to be near-identical to ZOLC\\
+ https://opencores.org/projects/hwlu\\
+ https://libre-soc.org/openpower/sv/remap/\vspace{15pt}
+ \end{itemize}
+}
+
+\begin{frame}[fragile]
+\frametitle{Matrix Multiply unit test / example}
+
+\begin{semiverbatim}
+def test_sv_remap2(self):
+    lst = ["svshape 5, 4, 3, 0, 0",
+           "svremap 0b11111, 1, 2, 3, 0, 0, 0, 0",
+           "sv.fmadds 0.v, 8.v, 16.v, 0.v"
+          ]
+    # REMAP fmadds FRT, FRA, FRC, FRB
+
+svshape 5, 4, 3, 0, 0 => A: 3x5 B: 3x4
+ => C: 3x3
+svremap (enable) (F)RS, (F)RT, (F)RA, (F)RB, (F)RC
+sv.fmadds: uses fp0 as accumulator
+ product[i][j] += A[i][v] * B[v][j]
+\end{semiverbatim}
+
+\end{frame}
+
+\frame{\frametitle{Ehm that's all Folks}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item Really is that straightforward: no actual Vector ops\\
+ - Does not dictate or limit micro-architectural detail\\
+ - Issues Scalar FMACs into existing back-end hardware\\
+ - Can use any 4-operand instruction (GF, INT, Bitmanip)\\
+ - Any operand width (8/16/32/64), up to 127 ops\vspace{8pt}
+ \item Specialise by making Matrix Multiply "setup" quick/easy\\
+ - two 32-bit instructions to set up A, B, C sizes\\
+ - one 64-bit SVP64 FMAC instruction.\\
+ - Nothing else needed. Saves on I-Cache\vspace{8pt}
+ \item Hardware turns out to be near-identical to ZOLC\\
+ https://opencores.org/projects/hwlu\\
+ https://libre-soc.org/openpower/sv/remap/\vspace{15pt}
+ \end{itemize}
+}
+
\frame{\frametitle{Summary}
\item Combination of which is that Board Support Package is 100\%
upstream, app and product development by customer is hugely
simplified and much more attractive
-
+
\end{itemize}
}
Questions?\vspace{15pt}
}
\end{center}
-
+
\begin{itemize}
\item Discussion: Libre-SOC-dev mailing list
\item Freenode IRC \#libre-soc