\vspace{24pt}
\Large{OpenPOWER Summit 2021}\\
\vspace{16pt}
- \large{Sponsored by NLnet's PET Programme}\\
+ \large{Sponsored by NLnet Grants}\\
+ \large{and NGI POINTER Grants}\\
\vspace{6pt}
\large{28th Oct 2021}
\end{center}
}
-\frame{\frametitle{}
-
-\vspace{15pt}
-
- \begin{itemize}
- \item \vspace{15pt}
- \item \vspace{15pt}
- \item \vspace{15pt}
- \end{itemize}
-}
-
\frame{\frametitle{Overview of Libre-SOC goals}
-\vspace{15pt}
+\vspace{8pt}
\begin{itemize}
- \item To create power-efficient mass-volume products\vspace{15pt}
- \item To leverage the OpenPOWER ecosystem to do so\vspace{15pt}
- \item To be entirely transparent for Security reasons\vspace{15pt}
+ \item To create power-efficient mass-volume products\vspace{8pt}
+ \item To leverage the OpenPOWER ecosystem to do so\vspace{8pt}
+ \item To be entirely transparent for Security reasons\vspace{8pt}
\item To empower businesses to bring Secure transparent\\
- mass-volume products to market\vspace{15pt}
+ mass-volume products to market\vspace{8pt}
+ \item Mass-volume end-user products need 3D, Video, Audio
+ \textbf{therefore we require small-size Matrices (3x3 but not with
+ 75\% utilisation, and 4x4) and the core strategic parts
+ of A/V CODECs and that means DCT and FFT.}
+ Anything else is a bonus (NTT with Galois Field bitmanip)
\end{itemize}
}
\vspace{15pt}
\begin{itemize}
- \item High performance and high performance/watt\vspace{15pt}
+ \item High performance and high performance/watt\vspace{10pt}
\item Reduced code density (reduced I-Cache usage)\\
https://arxiv.org/abs/2002.10143 - 3.5x power reduction\vspace{8pt}
- \item Remain accessible for assembler writers and compilers alike\vspace{15pt}
+ \item Remain accessible for assembler writers and compilers alike\vspace{10pt}
\item Introduce true Vectorisation to the Power ISA\\
(VSX is Packed SIMD)\vspace{8pt}
\item Be adopted via the external OPF ISA WG RFC process\\
(not: be a non-official custom extension. proprietary\\
- custom extensions conflict with mass-volume adoption)\vspace{15pt}
+ custom extensions conflict with mass-volume adoption)\vspace{10pt}
\end{itemize}
}
-%%\frame{\frametitle{nmigen PowerISA Decoder}
-
-%%\begin{center}
-%%\includegraphics[width=0.55\textwidth]{2020-09-09_21-04.png}
-%%\end{center}
-
-%%}
-
\begin{frame}[fragile]
\frametitle{Reminder of Simple-V}
\frametitle{Matrix Multiply: Basics}
\begin{semiverbatim}
-(a00 a01 a02 x (b00 b01 =
- a10 a11 a12) b10 b11
+(a00 a01 a02 x (b00 b01 = (c00 c01
+ a10 a11 a12) b10 b11 c10 c11) = ...
b20 b21)
(a00*b00 + a01*b10 + a02*b20 a00*b01 + a01*b11 + a02*b21
a10*b00 + a11*b10 + a12*b20 a10*b01 + a11*b11 + a12*b21)
- (b00 b01 x (a00 a01 a02 =
- b10 b11 a10 a11 a12)
- b20 b21)
+ (b00 b01 x (a00 a01 a02 = (c00 c01 c02
+ b10 b11 a10 a11 a12) c10 c11 c12
+ b20 b21) c20 c21 c22) = ...
(b00*a00 + b01*a10 b00*a01 + b01*a11 b00*a02 + b01*a12
b10*a00 + b11*a10 b10*a01 + b11*a11 b10*a02 + b11*a12
\frame{\frametitle{Matrix Multiply: Generalise but Specialise}
-\vspace{8pt}
-
\begin{itemize}
\item Why not make a general-purpose nested "Loop" system?\\
- Other uses (algorithms) beyond Matrix Multiplication\\
- Allow any arbitrary-sized loops\\
- Allow any permutation of nesting\\
- - Allow reversing per-dimension\vspace{5pt}
+ - Allow reversing per-dimension
\item Specialise by making Matrix Multiply "setup" quick/easy\\
- two 32-bit instructions to set up A, B, C sizes\\
- - one 64-bit SVP64 FMAC instruction.\\
- - Nothing else needed. Saves on I-Cache\vspace{5pt}
+ - one 64-bit SVP64 FMAC instruction (hot-loop).\\
+ - Nothing else needed. Saves on I-Cache
\item Hardware turns out to be near-identical to ZOLC\\
https://opencores.org/projects/hwlu\\
- https://libre-soc.org/openpower/sv/remap/\vspace{15pt}
+ https://libre-soc.org/openpower/sv/remap/
+ \item Concept is actually borrowed from Aspex Array-String Processor
+ 1D/2D/3D Memory DMA "reordering" Engine (except applied to
+ the register file)
\end{itemize}
}
}
-\frame{\frametitle{TODO rewrite Summary}
+\frame{\frametitle{DCT / FFT / DFT / NTT: what if we could REMAP?}
+
+\vspace{6pt}
+
+ \begin{itemize}
+ \item Can we create a REMAP Schedule for FFT (etc)? YES\\
+ - More complicated than Matrix Schedules but same principle\\
+ - Again: issues Scalar instructions into back-end micro-arch\\
+ - Requires 5-operand (3-in, 2-out) new Scalar Instructions\\
+ - Any operand width (8/16/32/64)\vspace{8pt}
+ \item Limited to in-place registers and Power-of-Two. Future?\\
+ - Again: CISC-like auto-load/store-and-increment\\
+ - https://arxiv.org/abs/2002.10143\\
+ - Again: still power-efficient (no I-Cache usage in loops)\vspace{8pt}
+ \item Again: can be investigated as part of EUR 22.6m EU Grant\\
+ https://libre-soc.org/SEP-210803722-Libre-SOC-8-core/\vspace{15pt}
+ \end{itemize}
+}
+
+
+\frame{\frametitle{DCT / FFT / DFT / NTT: other implementations?}
+
+\vspace{6pt}
+
+ \begin{itemize}
+ \item Texas Instruments TMS320 and C6xx DSPs (VLIW) \\
+ - 14 u-Ops per VLIW (including Zero-Overhead Looping)\\
+ - Performs Odd/Even FP32 looping single-instruction FFT\\
+ - Cannot do anything other than FP32\\
+ - Otherwise absolutely brilliant and elegant (20+ years)
+ \item Qualcom Hexagon DSP\\
+ - Again: VLIW (29 RISC-like u-Ops in 1 cycle)\\
+ - Seriously power-efficient and effective\\
+ - Has ZOLC and Complex-number Multiply\\
+ - Only seems to handle the inner loop of FFT though
+ \item SVP64\\
+ - not limited to inner loop (handles entire triple-loop) \\
+ - like Hexagon, not limted to type of operation inside loop \\
+ - Complex-number ops: a bit too heavy-duty for now (later?)
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Discrete Cosine Transform (DCT): Basics}
+
+ \begin{itemize}
+ \item Standard DCT Schedule (messy, impossible for SIMD)
+ \item Output is in bit-reversed order\\
+ 0b000 = 0b000 (in: 0 out: 0)\\
+ 0b001 = 0b100 (in: 1 out: 4) ...\\
+ 0b110 = 0b011 (in: 6 out: 3)\\
+ 0b111 = 0b111 (in: 7 out: 7)
+ \end{itemize}
+
+\begin{center}
+\includegraphics[width=0.75\textwidth]{plonka_dct1.png}
+\end{center}
+
+}
+
+\frame{\frametitle{Fast Fourier Transform (FFT/DFT): Butterfly Basics}
+
+ \begin{itemize}
+ \item Standard Butterfly Schedule (again: messy, but less so)
+ \item Output, again, is in bit-reversed order
+ \end{itemize}
+
+\begin{center}
+\includegraphics[width=0.70\textwidth]{fft_butterfly.png}
+\end{center}
+
+}
+
+\frame{\frametitle{FFT: 3-in, 2-out butterfly}
+
+ \begin{itemize}
+ \item One multiply (by coefficient), one add, one subtract
+ \item inputs: X[0] X[1] C(oeff) outputs: X[0] X[1]
+ \end{itemize}
+
+\begin{center}
+\includegraphics[width=0.90\textwidth]{2-point-dft.png}
+\end{center}
+
+}
+
+
+\begin{frame}[fragile]
+\frametitle{DFT: Project Nayuki Radix-2 Cooley-Tukey DIT (1)}
+
+\begin{semiverbatim}
+coef = (2 if inverse else -2) * cmath.pi / n
+exptable = [cmath.rect(1, i*coef) for i in range(n // 2)]
+vec = [vec[reverse_bits(i, levels)] for i in range(n)]
+size = 2
+while size <= n:
+ halfsize, tablestep = size // 2, n // size
+ for i in range(0, n, size):
+ k = 0
+ for j in range(i, i + halfsize):
+ temp = vec[j + halfsize] * exptable[k]
+ vec[j + halfsize] = vec[j] - temp
+ vec[j] += temp
+ k += tablestep
+ size *= 2
+\end{semiverbatim}
+
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{DFT: Project Nayuki Radix-2 Cooley-Tukey DIT (2)}
+
+\begin{semiverbatim}
+coef = (2 if inverse else -2) * cmath.pi / n
+exptable = [cmath.rect(1, i*coef) for i in range(n // 2)]
+vec = [vec[reverse_bits(i, levels)] for i in range(n)]
+size = 2
+while size <= n:
+ hs, tablestep = size // 2, n // size
+ for i in range(0, n, size):
+ k = 0
+ for j in range(i, i+hs):
+ # Twin-Butterfly 3-in 2-out: one instruction
+ C = exptable[k]
+ vec[j+hs], vec[j] = 2B(vec[j+hs], vec[j], C)
+ k += tablestep
+ size *= 2
+\end{semiverbatim}
+
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{DFT: Project Nayuki Radix-2 Cooley-Tukey DIT (3)}
+
+ \begin{itemize}
+ \item What if the Triple Loop could be done with REMAP?
+ \item Register Offsets j, j+hs, k created automatically?
+ \item Only one actual inner loop instruction (Twin-butterfly)
+ \item 3-in (X0/X1/C) 2-out (X0/X1) allows for in-place FFT
+ \item Hardware FSM (like ZOLC) creates offset triplet\\
+ - Technically not that hard to implement (for Radix-2)\\
+ - Exact same principle as Triple-loop for Matrices
+ \end{itemize}
+
+\begin{semiverbatim}
+for j,k,hs in REMAP_TRIPLE_LOOP_GENERATOR():
+ # Twin-Butterfly 3-in 2-out: one instruction
+ C = exptable[k]
+ vec[j+hs], vec[j] = 2B(vec[j+hs], vec[j], C)
+\end{semiverbatim}
+
+\end{frame}
+
+\frame{\frametitle{DCT: pre-arrange (pre-load) data}
+
+ \begin{itemize}
+ \item Arrange input data such that output falls into place
+ \item (another) Twin 3-in 2-out Mul-Add in-place instruction
+ \end{itemize}
+
+\begin{center}
+\includegraphics[width=0.70\textwidth]{dct_butterfly.png}
+\end{center}
+
+}
+
+
+\frame{\frametitle{FFT (Complex numbers) and DCT coefficients?}
+
+ \begin{itemize}
+ \item Problem (DCT): DCT Cosine Coefficients change (cos + 0.5)
+ depending on the layer. Cannot do as single instruction
+ \item Problem (FFT): Complex number butterfly multiplication involves
+ 4 multiplies. Cannot do in-place as single instruction\vspace{12pt}
+ \item Solution: "Vertical-First" Vectors (Mitch Alsup 66000 ISA)\vspace{12pt}
+ \item Understanding of SVP64 "Vertical-First" 30min video
+ https://youtube.com/watch?v=fn2KJvWyBKg
+ \item Basically involves stepping "vertically" through instructions
+ then ("stepping") to the next offset (REMAP), loop with bc
+ \item Horizontal-first: run through the entire REMAP schedule on a single
+ instruction before repeating looping on the next
+
+ \end{itemize}
+}
+
+\frame{\frametitle{Summary}
\begin{itemize}
\item Goal is to create a mass-volume low-power embedded SoC suitable
for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
- \item No DRM. 'Trustable' (by the users, not by Media Moguls) design
- ethos as a \textit{business} objective: requires full transparency
- as well as Formal Correctness Proofs
- \item Collaboration with OpenPOWER Foundation and Members absolutely
- essential. No short-cuts. Standards to be developed and ratified
- so that everyone benefits.
- \item Working on the back of huge stability of POWER ecosystem
- \item Combination of which is that Board Support Package is 100\%
- upstream, app and product development by customer is hugely
- simplified and much more attractive
+ \item This means a computational focus on 3D and Audio/Video.\\
+ - Critical not to waste 75\% of Power-2 SIMD Lanes on 3x3
+ \item Reducing core work to a one-instruction hot-loop inherently
+ reduces power consumption because the I-Cache is 100\% idle.
+ \item REMAP system completely independent from the instructions it
+ REMAPs. Applies to future scalar ops (GF, Bitmanip)
+ \item Future versions involve proper Zero-Overhead Loop-Control
+ and hidden "tags" to automatically perform CISC-like
+ auto-load/store-and-inc (for much larger data sets)
+ \item Please help contribute: it's your Open Power ISA too.
\end{itemize}
}
\frame{
\begin{center}
- {\Huge The end\vspace{15pt}\\
- Thank you\vspace{15pt}\\
- Questions?\vspace{15pt}
+ {\Huge The end\vspace{10pt}\\
+ Thank you\vspace{10pt}\\
+ Questions?\vspace{10pt}
}
\end{center}
\begin{itemize}
- \item Discussion: Libre-SOC-dev mailing list
- \item Freenode IRC \#libre-soc
+ \item Discussion: Libre-SOC-ISA mailing list\\
+ http://lists.libre-soc.org/mailman/listinfo/libre-soc-isa
+ \item Libera IRC \#libre-soc
\item http://libre-soc.org/
- \item http://nlnet.nl/PET
+ \item http://nlnet.nl/PET\\
+ https://www.ngi.eu/ngi-projects/ngi-pointer/
\end{itemize}
}