--- /dev/null
+\documentclass[slidestop]{beamer}
+\usepackage{beamerthemesplit}
+\usepackage{graphics}
+\usepackage{pstricks}
+
+\graphicspath{{./}}
+
+\title{The Libre-SOC Hybrid 3D CPU}
+\author{Luke Kenneth Casson Leighton}
+
+
+\begin{document}
+
+\frame{
+ \begin{center}
+ \huge{The Libre-SOC Hybrid 3D CPU}\\
+ \vspace{32pt}
+ \Large{Augmenting the OpenPOWER ISA}\\
+ \Large{to provide 3D and Video instructions}\\
+ \Large{(properly and officially) and make a GPU}\\
+ \vspace{24pt}
+ \Large{FOSDEM2021}\\
+ \vspace{16pt}
+ \large{Sponsored by NLnet's PET Programme}\\
+ \vspace{6pt}
+ \large{\today}
+ \end{center}
+}
+
+
+\frame{\frametitle{Why another SoC?}
+
+ \begin{itemize}
+ \item Intel Management Engine, Apple QA issues, Spectre\vspace{6pt}
+ \item Endless proprietary drivers, "simplest" solution: \\
+ License proprietary hard macros (with proprietary firmware)\\
+ Adversely affects product development cost\\
+ due to opaque driver bugs (Samsung S3C6410 / S5P100)
+ \vspace{6pt}
+ \item Alternative: Intel and Valve-Steam collaboration\\
+ "Most productive business meeting ever!"\\
+ https://tinyurl.com/valve-steam-intel
+ \vspace{6pt}
+ \item Because for 30 years I Always Wanted To Design A CPU
+ \vspace{6pt}
+ \item Ultimately it is a strategic \textit{business} objective to
+ develop entirely Libre hardware, firmware and drivers.
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Why OpenPOWER?}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item Good ecosystem essential\\
+ linux kernel, u-boot, compilers, OSes,\\
+ Reference Implementation(s)\vspace{10pt}
+ \item Supportive Foundation and Members\\
+ need to be able to submit ISA augmentations\\
+ (for proper peer review)\vspace{10pt}
+ \item No NDAs, full transparency must be acceptable\\
+ due to being funded under NLnet's PET Programme\vspace{10pt}
+ \item OpenPOWER: established for decades, excellent Foundation,\\
+ Microwatt as Reference, approachable and friendly.
+ \end{itemize}
+}
+
+\frame{\frametitle{How can you help?}
+
+\vspace{5pt}
+
+ \begin{itemize}
+ \item Start here! https://libre-soc.org \\
+ Mailing lists https://lists.libre-soc.org \\
+ IRC Freenode libre-soc \\
+ etc. etc. (it's a Libre project, go figure) \\
+ \vspace{3pt}
+ \item Can I get paid? Yes! NLnet funded\\
+ See https://libre-soc.org/nlnet/\#faq \\
+ \vspace{3pt}
+ \item Also profit-sharing in any commercial ventures \\
+ \vspace{3pt}
+ \item How many opportunities to develop Libre SoCs exist,\\
+ and actually get paid for it?
+ \vspace{3pt}
+ \item I'm not a developer, how can I help?\\
+ - Plenty of research needed, artwork, website \\
+ - Help find customers and OEMs willing to commit (LOI)
+ \end{itemize}
+}
+
+
+
+\frame{\frametitle{What goes into a typical SoC?}
+\vspace{9pt}
+ \begin{itemize}
+ \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
+ heat sink normally not required (simplifies overall design)
+ \vspace{3pt}
+ \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
+ USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
+ \vspace{3pt}
+ \item Built-in GPU (shared memory bus, 3rd party licensed) \vspace{3pt}
+ \item Built-in VPU (likewise, proprietary)\vspace{3pt}
+ \item Target price between \$2.50 and \$30 depending on market\\
+ Radically different from IBM POWER9 Core (200 Watt)
+ \vspace{3pt}
+ \item We're doing the same, just with a hybrid architecture.\\
+ CPU == GPU == VPU
+ \end{itemize}
+}
+
+
+
+\frame{\frametitle{Simple SBC-style SoC}
+
+\begin{center}
+\includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
+\end{center}
+
+}
+
+\frame{\frametitle{What's different about Libre-SOC?}
+
+ \begin{itemize}
+ \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
+ The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
+ \textit{There is No Separate VPU/GPU Pipeline or Processor}\\
+ \vspace{9pt}
+ \item written in nmigen (a python-based HDL). Not VHDL\\
+ not Verilog (definitely not Chisel3/Scala)\\
+ This is an extremely important strategic decision.\\
+ \vspace{9pt}
+ \item Simple-V Vector Extension. See `SIMD Considered harmful'.\\
+ https://tinyurl.com/simd-considered-harmful\\
+ SV effectively a "hardware for-loop" on standard scalar ISA\\
+ (conceptually similar to Zero-Overhead Loops in DSPs)
+ \vspace{6pt}
+ \item Yes great, but what's different compared to Intel, AMD, NVIDIA,
+ ARM and IBM?
+ \end{itemize}
+}
+
+\frame{\frametitle{OpenPOWER Cell Processor and upwards}
+
+ \begin{itemize}
+ \item OpenPOWER ISA developed from PowerPC, with the RS6000 in the 90s.
+ \vspace{6pt}
+ \item Sony, IBM and Toshiba began the Cell Processor in 2001 \\
+ (Sony Playstation 3) - NUMA approach
+ \vspace{6pt}
+ \item Raw brute-force performance pissed all over the competition
+ at the time
+ \vspace{6pt}
+ \item VSX later evolved out of this initiative.
+ \vspace{6pt}
+ \item VSX, a SIMD extension, now showing its age. \\
+ Fixed-width, no predication, limited pixel formats (15 bit)
+ \vspace{6pt}
+ \item (Vulkan requires dozens of pixel formats)
+ \end{itemize}
+}
+
+\frame{\frametitle{Apple M1 (ARM) vs Intel / AMD (x86)}
+
+ \begin{itemize}
+ \item Very interesting article: tinyurl.com/apple-m1-review
+ \item Apple M1: uses ARM. Intel: implements x86
+ \item Apple M1: RISC multi-issue. Intel: CISC multi-issue.
+ \item Apple M1: uniform (easy) instruction decode \\
+ Intel: \textit{Cannot easily identify start of instruction}
+ \item Result: multi-issue x86 decoder is so complex, it misses
+ opportunities to keep back-end execution engines 100 percent
+ occupied
+ \item OpenPOWER happens to be RISC (easy decode), which is why POWER10
+ has 8-way multi-issue.
+ \item Libre-SOC can do the same tricks that IBM POWER10 and Apple M1
+ can. Intel (x86) literally cannot keep up.
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Hybrid Architecture: Augmented 6600}
+
+ \begin{itemize}
+ \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
+ Help from Mitch Alsup includes \textit{precise exceptions}, \\
+ multi-issue and more. Academic literature on 6600 utterly misleading.
+ 6600 Scoreboards completely underestimated (Seymour Cray and
+ James Thornton
+ solved problems they didn't realise existed elsewhere!)
+ \item Front-end Vector ISA, back-end "Predicated (masked) SIMD"\\
+ nmigen (python OO) strategically critical to achieving this.
+ \item Out-of-order combined with Simple-V allows scalar operations\\
+ at the developer end to be turned into SIMD at the back-end\\
+ \textit{without the developer needing to do SIMD}
+ \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
+ all automatically vectorised.
+ \end{itemize}
+}
+
+\frame{\frametitle{Learning from these and putting it together}
+
+ \begin{itemize}
+ \item Apple M1 and IBM POWER10 show that RISC plus superscalar
+ multi-issue produces insane performance
+ \item Intel AVX 512 and CISC in general are getting out of hand (what's
+ next: 256-bit length instructions, AVX 1024?)
+ \item RISC-V RVV shows Cray-style Vectors can save power. Simple-V
+ has the same benefits with far fewer instructions (188 for RVV,
+ 3 to 5 new instructions for Simple-V).
+ \item CDC 6600 shows that intelligently-implemented designs can do the
+ job, with far fewer resources.
+ \item Libre-SOC combines the best of historical processor designs,
+ co-opting and innovating on them (pissing in the back yard of
+ every incumbent CPU and GPU company in the process).
+ \item It's a Libre project: you get to help
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Why nmigen?}
+
+ \begin{itemize}
+ \item Uses python to build an AST (Abstract Syntax Tree).
+ Actually hands that over to yosys (to create ILANG file)
+ after which verilog can (if necessary) be created
+ \item Deterministic synthesisable behaviour (Signals are declared
+ with their reset pattern: no more forgetting "if rst" block).
+ \item python OO programming techniques can be deployed: classes
+ and functions take parameters that change
+ what HDL is generated (IEEE754 FP16 / 32 / 64 for example)
+ \item python-based for-loops can e.g. read CSV files then generate
+ a hierarchical nested suite of HDL Switch / Case statements
+ (this is how the Libre-soc PowerISA decoder is implemented)
+ \item extreme OO abstraction can even be used to create "dynamic
+ partitioned Signals" that have the same operator-overloaded
+ "add", "subtract", "greater-than" operators
+
+ \end{itemize}
+}
+
+\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}
+
+ \begin{itemize}
+ \item Simple-V is a 'register tag' system. \textit{There are no opcodes}\\
+ SV 'tags' scalar operations (scalar regfiles) as 'vectorised'
+ \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
+ able to fit a PowerISA decoder in only one clock cycle)
+ \item Effectively a 'hardware sub-counter for-loop': pauses the PC\\
+ then rolls incrementally through the operand register numbers\\
+ issuing \textit{multiple} scalar instructions into the pipelines\\
+ (hence the reason for a multi-issue OoO microarchitecture)
+ \item Current \textit{and future} PowerISA scalar opcodes inherently
+ \textit{and automatically} become 'vectorised' by SV without
+ needing an explicit new Vector opcode.
+ \item Predication and element width polymorphism are also 'tags'.
+ elwidth polymorphism allows for BF16 / FP16 / FP80 / FP128 to be added to
+ the ISA \textit{without modifying the ISA}
+
+ \end{itemize}
+}
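+
+\begin{frame}[fragile]
+\frametitle{Sketch: the 'hardware for-loop' at issue time}
+
+ A minimal sketch of the sub-counter loop described on the previous
+ slide, not the actual Libre-SOC implementation. The names sv\_issue,
+ tagged, step and issue\_scalar are hypothetical, and predication
+ and element width are left out.
+
+\begin{semiverbatim}
+function sv\_issue(insn)   # insn: any scalar instruction
+  for (step = 0; step < VL; step++) \{ # PC held here
+    rd  = insn.rd  + (tagged[insn.rd]  ? step : 0)
+    rs1 = insn.rs1 + (tagged[insn.rs1] ? step : 0)
+    rs2 = insn.rs2 + (tagged[insn.rs2] ? step : 0)
+    issue\_scalar(insn.opcode, rd, rs1, rs2)
+  \}
+  PC <= PC + 4   # assumes a 4-byte instruction
+\end{semiverbatim}
+
+ The opcode is passed through untouched: only the register numbers
+ change, which is why no new Vector opcodes are needed.
+\end{frame}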
+
+\frame{\frametitle{Quick refresher on SIMD}
+
+ \begin{itemize}
+ \item SIMD very easy to implement (and very seductive)
+ \item Parallelism is in the ALU
+ \item Zero-to-Negligible impact for rest of core
+ \end{itemize}
+ Where SIMD Goes Wrong:\vspace{6pt}
+ \begin{itemize}
+ \item See "SIMD instructions considered harmful"
+ https://sigarch.org/simd-instructions-considered-harmful
+ \item Setup and corner-cases alone are extremely complex.\\
+ Hardware is easy, but software is hell.\\
+ strncpy VSX patch for POWER9: 250 hand-written asm lines!\\
+ (RVV / SimpleV strncpy is 14 instructions)
+ \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\
+ opcode, elwidth, veclen, src1-src2-dest hi/lo
+ \end{itemize}
+}
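+
+\frame{\frametitle{Opcode proliferation: a rough count}
+
+ A back-of-envelope illustration of the O($N^{6}$) growth, reading
+ "src1-src2-dest hi/lo" as one hi/lo choice per operand. The figures
+ are assumptions chosen for illustration, not a count taken from
+ any real ISA:
+ \begin{itemize}
+ \item 10 base operations, 4 element widths, 3 vector lengths
+ \item hi/lo variants for each of src1, src2 and dest: $2^{3} = 8$
+ \end{itemize}
+ \[
+ 10 \times 4 \times 3 \times 2 \times 2 \times 2 = 960
+ \]
+ Every extra element width or vector length multiplies the total,
+ which is how a SIMD ISA reaches 1000s of instructions.
+}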
+
+\begin{frame}[fragile]
+\frametitle{Simple-V ADD in a nutshell}
+
+\begin{semiverbatim}
+function op\_add(rd, rs1, rs2, predr) # add not VADD!
+  int i, id=0, irs1=0, irs2=0;
+  for (i = 0; i < VL; i++) \{
+    if (ireg[predr] & 1<<i) # predication uses intregs
+       ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+    if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
+    if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
+    if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
+  \}
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item Above is oversimplified: Reg. indirection left out (for clarity).
+ \item SIMD slightly more complex (case above is elwidth = default)
+ \item Scalar-scalar and scalar-vector and vector-vector now all in one
+ \item OoO may choose to push ADDs into instr. queue (v. busy!)
+ \end{itemize}
+\end{frame}
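+
+\begin{frame}[fragile]
+\frametitle{Simple-V ADD: a worked trace}
+
+ A worked trace of the op\_add pseudocode, as a sketch. The register
+ numbers, VL and predicate value are assumptions picked for
+ illustration only.
+
+\begin{semiverbatim}
+VL = 4, ireg[predr] = 0b1011
+rd=8 (vector), rs1=16 (vector), rs2=24 (scalar)
+
+i=0: ireg[8]  <= ireg[16] + ireg[24]
+i=1: ireg[9]  <= ireg[17] + ireg[24]
+i=2: predicate bit clear, ireg[10] unchanged
+i=3: ireg[11] <= ireg[19] + ireg[24]
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item rs2 is not tagged, so its offset stays at zero (vector-scalar add)
+ \item Offsets still advance on the masked-out element i=2
+ \end{itemize}
+\end{frame}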
+
+\frame{\frametitle{Additional Simple-V features}
+
+ \begin{itemize}
+ \item "fail-on-first" (POWER9 VSX strncpy segfaults on boundary!)
+ \item "Twin Predication" (covers VSPLAT, VGATHER, VSCATTER, VINDEX etc.)
+ \item SVP64: extensive "tag" (Vector context) augmentation
+ \item "Context propagation": a VLIW-like context. Allows contexts
+ to be repeatedly applied.
+ Effectively a "hardware compression algorithm" for ISAs.
+ \item Ultimate goal: cut down I-Cache usage, which in turn cuts down on power
+ \item Typical GPU has its own I-Cache and small shaders.\\
+ \textit{We are a Hybrid CPU/GPU: I-Cache is not separate!}
+ \item Needs to go through OpenPOWER Foundation `approval'
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Summary}
+
+ \begin{itemize}
+ \item Goal is to create a mass-volume low-power embedded SoC suitable
+ for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
+ \item No way we could implement a project of this magnitude without
+ nmigen (being able to use python OO to HDL)
+ \item Collaboration with OpenPOWER Foundation and Members absolutely
+ essential. No short-cuts. Standards to be developed and ratified
+ so that everyone benefits.
+ \item Riding the wave of huge stability of OpenPOWER ecosystem
+ \item Greatly simplified open 3D and Video drivers reduce product
+ development costs for customers
+ \item It also happens to be fascinating, deeply rewarding, technically
+ challenging, and funded by NLnet
+
+ \end{itemize}
+}
+
+
+\frame{
+ \begin{center}
+ {\Huge The end\vspace{12pt}\\
+ Thank you\vspace{12pt}\\
+ Questions?\vspace{12pt}
+ }
+ \end{center}
+
+ \begin{itemize}
+ \item Discussion: http://lists.libre-soc.org
+ \item Freenode IRC \#libre-soc
+ \item http://libre-soc.org/
+ \item http://nlnet.nl/PET
+ \item https://libre-soc.org/nlnet/\#faq
+ \end{itemize}
+}
+
+
+\end{document}
+++ /dev/null
-\documentclass[slidestop]{beamer}
-\usepackage{beamerthemesplit}
-\usepackage{graphics}
-\usepackage{pstricks}
-
-\graphicspath{{./}}
-
-\title{The Libre-SOC Hybrid 3D CPU}
-\author{Luke Kenneth Casson Leighton}
-
-
-\begin{document}
-
-\frame{
- \begin{center}
- \huge{The Libre-SOC Hybrid 3D CPU}\\
- \vspace{32pt}
- \Large{Augmenting the OpenPOWER ISA}\\
- \Large{to provide 3D and Video instructions}\\
- \Large{(properly and officially) and make a GPU}\\
- \vspace{24pt}
- \Large{FOSDEM2021}\\
- \vspace{16pt}
- \large{Sponsored by NLnet's PET Programme}\\
- \vspace{6pt}
- \large{\today}
- \end{center}
-}
-
-
-\frame{\frametitle{Why another SoC?}
-
- \begin{itemize}
- \item Intel Management Engine, Apple QA issues, Spectre\vspace{6pt}
- \item Endless proprietary drivers, "simplest" solution: \\
- License proprietary hard macros (with proprietary firmware)\\
- Adversely affects product development cost\\
- due to opaque driver bugs (Samsung S3C6410 / S5P100)
- \vspace{6pt}
- \item Alternative: Intel and Valve-Steam collaboration\\
- "Most productive business meeting ever!"\\
- https://tinyurl.com/valve-steam-intel
- \vspace{6pt}
- \item Because for 30 years I Always Wanted To Design A CPU
- \vspace{6pt}
- \item Ultimately it is a strategic \textit{business} objective to
- develop entirely Libre hardware, firmware and drivers.
- \end{itemize}
-}
-
-
-\frame{\frametitle{Why OpenPOWER?}
-
-\vspace{15pt}
-
- \begin{itemize}
- \item Good ecosystem essential\\
- linux kernel, u-boot, compilers, OSes,\\
- Reference Implementation(s)\vspace{10pt}
- \item Supportive Foundation and Members\\
- need to be able to submit ISA augmentations\\
- (for proper peer review)\vspace{10pt}
- \item No NDAs, full transparency must be acceptable\\
- due to being funded under NLnet's PET Programme\vspace{10pt}
- \item OpenPOWER: established for decades, excellent Foundation,\\
- Microwatt as Reference, approachable and friendly.
- \end{itemize}
-}
-
-\frame{\frametitle{How can you help?}
-
-\vspace{5pt}
-
- \begin{itemize}
- \item Start here! https://libre-soc.org \\
- Mailing lists https://lists.libre-soc.org \\
- IRC Freenode libre-soc \\
- etc. etc. (it's a Libre project, go figure) \\
- \vspace{3pt}
- \item Can I get paid? Yes! NLnet funded\\
- See https://libre-soc.org/nlnet/\#faq \\
- \vspace{3pt}
- \item Also profit-sharing in any commercial ventures \\
- \vspace{3pt}
- \item How many opportunities to develop Libre SoCs exist,\\
- and actually get paid for it?
- \vspace{3pt}
- \item I'm not a developer, how can I help?\\
- - Plenty of research needed, artwork, website \\
- - Help find customers and OEMs willing to commit (LOI)
- \end{itemize}
-}
-
-
-
-\frame{\frametitle{What goes into a typical SoC?}
-\vspace{9pt}
- \begin{itemize}
- \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
- heat sink normally not required (simplifies overall design)
- \vspace{3pt}
- \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
- USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
- \vspace{3pt}
- \item Built-in GPU (shared memory bus, 3rd party licensed) \vspace{3pt}
- \item Built-in VPU (likewise, proprietary)\vspace{3pt}
- \item Target price between \$2.50 and \$30 depending on market\\
- Radically different from IBM POWER9 Core (200 Watt)
- \vspace{3pt}
- \item We're doing the same, just with a hybrid architecture.\\
- CPU == GPU == VPU
- \end{itemize}
-}
-
-
-
-\frame{\frametitle{Simple SBC-style SoC}
-
-\begin{center}
-\includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
-\end{center}
-
-}
-
-\frame{\frametitle{What's different about Libre-SOC?}
-
- \begin{itemize}
- \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
- The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
- \textit{There is No Separate VPU/GPU Pipeline or Processor}\\
- \vspace{9pt}
- \item written in nmigen (a python-based HDL). Not VHDL\\
- not Verilog (definitely not Chisel3/Scala)\\
- This is an extremely important strategic decision.\\
- \vspace{9pt}
- \item Simple-V Vector Extension. See `SIMD Considered harmful'.\\
- https://tinyurl.com/simd-considered-harmful\\
- SV effectively a "hardware for-loop" on standard scalar ISA\\
- (conceptually similar to Zero-Overhead Loops in DSPs)
- \vspace{6pt}
- \item Yes great, but what's different compared to Intel, AMD, NVIDIA,
- ARM and IBM?
- \end{itemize}
-}
-
-\frame{\frametitle{OpenPOWER Cell Processor and upwards}
-
- \begin{itemize}
- \item OpenPOWER ISA developed from PowerPC, with the RS6000 in the 90s.
- \vspace{6pt}
- \item Sony, IBM and Toshiba began the Cell Processor in 2001 \\
- (Sony Playstation 3) - NUMA approach
- \vspace{6pt}
- \item Raw brute-force performance pissed all over the competition
- at the time
- \vspace{6pt}
- \item VSX later evolved out of this initiative.
- \vspace{6pt}
- \item VSX, a SIMD extension, now showing its age. \\
- Fixed-width, no predication, limited pixel formats (15 bit)
- \vspace{6pt}
- \item (Vulkan requires dozens of pixel formats)
- \end{itemize}
-}
-
-\frame{\frametitle{Apple M1 (ARM) vs Intel / AMD (x86)}
-
- \begin{itemize}
- \item Very interesting article: tinyurl.com/apple-m1-review
- \item Apple M1: uses ARM. Intel: implements x86
- \item Apple M1: RISC multi-issue. Intel: CISC multi-issue.
- \item Apple M1: uniform (easy) instruction decode \\
- Intel: \textit{Cannot easily identify start of instruction}
- \item Result: multi-issue x86 decoder is so complex, it misses
- opportunities to keep back-end execution engines 100 percent
- occupied
- \item OpenPOWER happens to be RISC (easy decode), which is why POWER10
- has 8-way multi-issue.
- \item Libre-SOC can do the same tricks that IBM POWER10 and Apple M1
- can. Intel (x86) literally cannot keep up.
- \end{itemize}
-}
-
-
-\frame{\frametitle{Hybrid Architecture: Augmented 6600}
-
- \begin{itemize}
- \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
- Help from Mitch Alsup includes \textit{precise exceptions}, \\
- multi-issue and more. Academic literature on 6600 utterly misleading.
- 6600 Scoreboards completely underestimated (Seymour Cray and
- James Thornton
- solved problems they didn't realise existed elsewhere!)
- \item Front-end Vector ISA, back-end "Predicated (masked) SIMD"\\
- nmigen (python OO) strategically critical to achieving this.
- \item Out-of-order combined with Simple-V allows scalar operations\\
- at the developer end to be turned into SIMD at the back-end\\
- \textit{without the developer needing to do SIMD}
- \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
- all automatically vectorised.
- \end{itemize}
-}
-
-\frame{\frametitle{Learning from these and putting it together}
-
- \begin{itemize}
- \item Apple M1 and IBM POWER10 show that RISC plus superscalar
- multi-issue produces insane performance
- \item Intel AVX 512 and CISC in general is getting out of hand (what's
- next: 256-bit length instructions, AVX 1024?)
- \item RISC-V RVV shows Cray-style Vectors can save power. Simple-V
- has the same benefits with far less instructions (188 for RVV,
- 3 to 5 new instructions for Simple-V).
- \item CDC 6600 shows that intelligently-implemented designs can do the
- job, with far less resources.
- \item Libre-SOC combines the best of historical processor designs,
- co-opting and innovating on them (pissing in the back yard of
- every incumbent CPU and GPU company in the process).
- \item It's a Libre project: you get to help
- \end{itemize}
-}
-
-
-\frame{\frametitle{Why nmigen?}
-
- \begin{itemize}
- \item Uses python to build an AST (Abstract Syntax Tree).
- Actually hands that over to yosys (to create ILANG file)
- after which verilog can (if necessary) be created
- \item Deterministic synthesiseable behaviour (Signals are declared
- with their reset pattern: no more forgetting "if rst" block).
- \item python OO programming techniques can be deployed. classes
- and functions created which pass in parameters which change
- what HDL is created (IEEE754 FP16 / 32 / 64 for example)
- \item python-based for-loops can e.g. read CSV files then generate
- a hierarchical nested suite of HDL Switch / Case statements
- (this is how the Libre-soc PowerISA decoder is implemented)
- \item extreme OO abstraction can even be used to create "dynamic
- partitioned Signals" that have the same operator-overloaded
- "add", "subtract", "greater-than" operators
-
- \end{itemize}
-}
-
-\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}
-
- \begin{itemize}
- \item Simple-V is a 'register tag' system. \textit{There are no opcodes}\\
- SV 'tags' scalar operations (scalar regfiles) as 'vectorised'
- \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
- able to fit a PowerISA decoder in only one clock cycle)
- \item Effectively a 'hardware sub-counter for-loop': pauses the PC\\
- then rolls incrementally through the operand register numbers\\
- issuing \textit{multiple} scalar instructions into the pipelines\\
- (hence the reason for a multi-issue OoO microarchitecture)
- \item Current \textit{and future} PowerISA scalar opcodes inherently
- \textit{and automatically} become 'vectorised' by SV without
- needing an explicit new Vector opcode.
- \item Predication and element width polymorphism are also 'tags'.
- elwidth polymorphism allows for BF16 / FP16 / 80 / 128 to be added to
- the ISA \textit{without modifying the ISA}
-
- \end{itemize}
-}
-
-\frame{\frametitle{Quick refresher on SIMD}
-
- \begin{itemize}
- \item SIMD very easy to implement (and very seductive)
- \item Parallelism is in the ALU
- \item Zero-to-Negligeable impact for rest of core
- \end{itemize}
- Where SIMD Goes Wrong:\vspace{6pt}
- \begin{itemize}
- \item See "SIMD instructions considered harmful"
- https://sigarch.org/simd-instructions-considered-harmful
- \item Setup and corner-cases alone are extremely complex.\\
- Hardware is easy, but software is hell.\\
- strncpy VSX patch for POWER9: 250 hand-written asm lines!\\
- (RVV / SimpleV strncpy is 14 instructions)
- \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\
- opcode, elwidth, veclen, src1-src2-dest hi/lo
- \end{itemize}
-}
-
-\begin{frame}[fragile]
-\frametitle{Simple-V ADD in a nutshell}
-
-\begin{semiverbatim}
-function op\_add(rd, rs1, rs2, predr) # add not VADD!
- int i, id=0, irs1=0, irs2=0;
- for (i = 0; i < VL; i++)
- if (ireg[predr] & 1<<i) # predication uses intregs
- ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
- if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
- if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
- if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
-\end{semiverbatim}
-
- \begin{itemize}
- \item Above is oversimplified: Reg. indirection left out (for clarity).
- \item SIMD slightly more complex (case above is elwidth = default)
- \item Scalar-scalar and scalar-vector and vector-vector now all in one
- \item OoO may choose to push ADDs into instr. queue (v. busy!)
- \end{itemize}
-\end{frame}
-
-\frame{\frametitle{Additional Simple-V features}
-
- \begin{itemize}
- \item "fail-on-first" (POWER9 VSX strncpy segfaults on boundary!)
- \item "Twin Predication" (covers VSPLAT, VGATHER, VSCATTER, VINDEX etc.)
- \item SVP64: extensive "tag" (Vector context) augmentation
- \item "Context propagation": a VLIW-like context. Allows contexts
- to be repeatedly applied.
- Effectively a "hardware compression algorithm" for ISAs.
- \item Ultimate goal: cut down I-Cache usage, cuts down on power
- \item Typical GPU has its own I-Cache and small shaders.\\
- \textit{We are a Hybrid CPU/GPU: I-Cache is not separate!}
- \item Needs to go through OpenPOWER Foundation `approval'
- \end{itemize}
-}
-
-
-\frame{\frametitle{Summary}
-
- \begin{itemize}
- \item Goal is to create a mass-volume low-power embedded SoC suitable
- for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
- \item No way we could implement a project of this magnitude without
- nmigen (being able to use python OO to HDL)
- \item Collaboration with OpenPOWER Foundation and Members absolutely
- essential. No short-cuts. Standards to be developed and ratified
- so that everyone benefits.
- \item Riding the wave of huge stability of OpenPOWER ecosystem
- \item Greatly simplified open 3D and Video drivers reduces product
- development costs for customers
- \item It also happens to be fascinating, deeply rewarding technically
- challenging, and funded by NLnet
-
- \end{itemize}
-}
-
-
-\frame{
- \begin{center}
- {\Huge The end\vspace{12pt}\\
- Thank you\vspace{12pt}\\
- Questions?\vspace{12pt}
- }
- \end{center}
-
- \begin{itemize}
- \item Discussion: http://lists.libre-soc.org
- \item Freenode IRC \#libre-soc
- \item http://libre-soc.org/
- \item http://nlnet.nl/PET
- \item https://libre-soc.org/nlnet/\#faq
- \end{itemize}
-}
-
-
-\end{document}
+++ /dev/null
-\documentclass[slidestop]{beamer}
-\usepackage{beamerthemesplit}
-\usepackage{graphics}
-\usepackage{pstricks}
-
-\graphicspath{{./}}
-
-\title{The delicate disadvantage of Reverse-Engineering}
-\author{Luke Kenneth Casson Leighton}
-
-
-\begin{document}
-
-\frame{
- \begin{center}
- \huge{The delicate disadvantage of Reverse-Engineering}\\
- \vspace{32pt}
- \Large{The consequences of maintaining}\\
- \Large{proprietary hardware}\\
- \Large{Can we do better?}\\
- \vspace{24pt}
- \Large{[proposed for] OFSC 2020}\\
- \vspace{16pt}
- \large{\today}
- \end{center}
-}
-
-
-\frame{\frametitle{Background (about me)}
-
-\vspace{15pt}
-
- \begin{itemize}
- \item First reverse-engineering was Samba-TNG\\
- NTBugTraq, August 1996\\
- "Welcome to the SAMBA Domain"\\
- 3 years later...\vspace{6pt}
- \item 2002: Exchange 5.5, enhancing FreeDCE \\
- Copied by an "Open" team that removed all attribution\vspace{6pt}
- \item 2003-2005: Xanadux Project\\
- 9 HTC smartphones reverse-engineered\\
- Zero income earned.\vspace{6pt}
- \item Lesson learned: everyone else makes money from your work.
- \end{itemize}
-}
-
-
-\frame{\frametitle{How come I could do this but others couldn't?}
-
-\vspace{10pt}
-
- \begin{itemize}
- \item Self-analysis time: what capability did I have \\
- that others do not?\\
- \vspace{12pt}
- \item Definition of Reverse-Engineering:\\
- \vspace{4pt}
- The ability to infer knowledge.\\
- (That's really it. No prior-knowledge is required:\\
- you DERIVE knowledge)
- \vspace{12pt}
- \item Definitions of knowledge were a clue:\\
- Demster-Shafer (generalisation of Bayes)\\
- Epistemology (Advaita Vedanta wikipedia page)
- \end{itemize}
-}
-
-\frame{\frametitle{Advaita Vedanta, Epistemology section}
-
-
- \begin{itemize}
-
- \item Pratyakṣa - perception (includes senses, but also "intuition")
- \item Anumana - inference (where there's smoke there's fire)
- \item Upamana - comparison, analogy (A is to B as C is to D;\\
- also included here is the "difference" between two things)
- \item Arthapatti - postulation, derivation from
- circumstances\\
- (Joe is gaining weight; we do not see Joe eat during the day.
- Therefore Joe is eating at night)
- \item Anupalabdi - non-perception, negative/cognitive proof\\
- ("there is no jug in this room")
- \item Sabda - relying on word, testimony of past/present experts
- \end{itemize}
- \bf{ Reverse-Engineers develop these knowledge-derivation skills
- without knowing that they have them! It's incredible and valuable!}
-
-}
-
-\frame{\frametitle{Why do products need reverse-engineering?}
-\vspace{9pt}
- \begin{itemize}
- \item The profit-maximising Corporation can't be bothered to provide
- documentation or source code
- \vspace{4pt}
- \item The profit-maximising Corporation is based in China and is
- happy to blatantly disregard Copyright law.
- \vspace{4pt}
- \item The profit-maximising Corporation could be bothered but has
- realised that they make more money through entrapment of end-users
- \vspace{4pt}
- \item Bottom line: helping such Corporations helps keep their products
- in circulation.
- \vspace{4pt}
- \end{itemize}
- \bf{ Reverse-Engineers by applying their amazing skills actively support
- unethical and pathological Corporations to do harm to end-users
- and to the environment}
-
-}
-
-
-
-\frame{\frametitle{How about an alternative?}
-
-
- \begin{itemize}
- \item You have unbelievably empowering and powerful skills, far
- beyond those of an average programmer!
- \item Instead of supporting unethical Corporations, why not support
- yourselves?
- \item Transition from full-time to part-time (or work evenings)
- \item How about creating your own products? (You're a Reverse-Engineer:
- you know how products work, and what you don't know, you already
- know you can to find out!)
- \item How about designing a product and put it on Crowdsupply?
- \item The internet exists: you can find others to team up with,
- in an area of technology that interests you.
- \end{itemize}
-
- \bf{Ultimately you could do a huge amount of good. With your skill
- there is nothing that can stop you except yourself}
-
-}
-
-
-\end{document}
-
--- /dev/null
+\documentclass[slidestop]{beamer}
+\usepackage{beamerthemesplit}
+\usepackage{graphics}
+\usepackage{pstricks}
+
+\graphicspath{{./}}
+
+\title{The delicate disadvantage of Reverse-Engineering}
+\author{Luke Kenneth Casson Leighton}
+
+
+\begin{document}
+
+\frame{
+ \begin{center}
+ \huge{The delicate disadvantage of Reverse-Engineering}\\
+ \vspace{32pt}
+ \Large{The consequences of maintaining}\\
+ \Large{proprietary hardware}\\
+ \Large{Can we do better?}\\
+ \vspace{24pt}
+ \Large{[proposed for] OFSC 2020}\\
+ \vspace{16pt}
+ \large{\today}
+ \end{center}
+}
+
+
+\frame{\frametitle{Background (about me)}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item First reverse-engineering was Samba-TNG\\
+ NTBugTraq, August 1996\\
+ "Welcome to the SAMBA Domain"\\
+ 3 years later...\vspace{6pt}
+ \item 2002: Exchange 5.5, enhancing FreeDCE \\
+ Copied by an "Open" team that removed all attribution\vspace{6pt}
+ \item 2003-2005: Xanadux Project\\
+ 9 HTC smartphones reverse-engineered\\
+ Zero income earned.\vspace{6pt}
+ \item Lesson learned: everyone else makes money from your work.
+ \end{itemize}
+}
+
+
+\frame{\frametitle{How come I could do this but others couldn't?}
+
+\vspace{10pt}
+
+ \begin{itemize}
+ \item Self-analysis time: what capability did I have \\
+ that others do not?\\
+ \vspace{12pt}
+ \item Definition of Reverse-Engineering:\\
+ \vspace{4pt}
+ The ability to infer knowledge.\\
+ (That's really it. No prior-knowledge is required:\\
+ you DERIVE knowledge)
+ \vspace{12pt}
+ \item Definitions of knowledge were a clue:\\
+ Dempster-Shafer (generalisation of Bayes)\\
+ Epistemology (Advaita Vedanta wikipedia page)
+ \end{itemize}
+}
+
+\frame{\frametitle{Advaita Vedanta, Epistemology section}
+
+
+ \begin{itemize}
+
+ \item Pratyakṣa - perception (includes senses, but also "intuition")
+ \item Anumana - inference (where there's smoke there's fire)
+ \item Upamana - comparison, analogy (A is to B as C is to D;\\
+ also included here is the "difference" between two things)
+ \item Arthapatti - postulation, derivation from
+ circumstances\\
+ (Joe is gaining weight; we do not see Joe eat during the day.
+ Therefore Joe is eating at night)
+ \item Anupalabdhi - non-perception, negative/cognitive proof\\
+ ("there is no jug in this room")
+ \item Sabda - relying on word, testimony of past/present experts
+ \end{itemize}
+ \textbf{Reverse-Engineers develop these knowledge-derivation skills
+ without knowing that they have them! It's incredible and valuable!}
+
+}
+
+\frame{\frametitle{Why do products need reverse-engineering?}
+\vspace{9pt}
+ \begin{itemize}
+ \item The profit-maximising Corporation can't be bothered to provide
+ documentation or source code
+ \vspace{4pt}
+ \item The profit-maximising Corporation is based in China and is
+ happy to blatantly disregard Copyright law.
+ \vspace{4pt}
+ \item The profit-maximising Corporation could be bothered but has
+ realised that they make more money through entrapment of end-users
+ \vspace{4pt}
+ \item Bottom line: helping such Corporations helps keep their products
+ in circulation.
+ \vspace{4pt}
+ \end{itemize}
+ \textbf{By applying their amazing skills, Reverse-Engineers actively help
+ unethical and pathological Corporations to do harm to end-users
+ and to the environment.}
+
+}
+
+
+
+\frame{\frametitle{How about an alternative?}
+
+
+ \begin{itemize}
+ \item You have unbelievably empowering and powerful skills, far
+ beyond those of an average programmer!
+ \item Instead of supporting unethical Corporations, why not support
+ yourselves?
+ \item Transition from full-time to part-time (or work evenings)
+ \item How about creating your own products? (You're a Reverse-Engineer:
+ you know how products work, and what you don't know, you already
+ know you can find out!)
+ \item How about designing a product and putting it on Crowd Supply?
+ \item The internet exists: you can find others to team up with,
+ in an area of technology that interests you.
+ \end{itemize}
+
+ \textbf{Ultimately you could do a huge amount of good. With your skill,
+ there is nothing that can stop you except yourself.}
+
+}
+
+
+\end{document}
+
+++ /dev/null
-\documentclass[slidestop]{beamer}
-\usepackage{beamerthemesplit}
-\usepackage{graphics}
-\usepackage{pstricks}
-
-\graphicspath{{./}}
-
-\title{The Libre-SOC Hybrid 3D CPU}
-\author{Luke Kenneth Casson Leighton}
-
-
-\begin{document}
-
-\frame{
- \begin{center}
- \huge{The Libre-SOC Hybrid 3D CPU}\\
- \vspace{32pt}
- \Large{Augmenting the OpenPOWER ISA}\\
- \Large{to provide 3D and Video instructions}\\
- \Large{(properly and officially)}\\
- \vspace{24pt}
- \Large{[proposed for] OpenPOWER Summit 2020}\\
- \vspace{16pt}
- \large{Sponsored by NLnet's PET Programme}\\
- \vspace{6pt}
- \large{\today}
- \end{center}
-}
-
-
-\frame{\frametitle{Why another SoC?}
-
-\vspace{15pt}
-
- \begin{itemize}
- \item Intel Management Engine, QA issues, Spectre\vspace{15pt}
- \item Endless proprietary drivers \\
- (affects product development cost)\vspace{15pt}
- \item Opportunity to drastically simplify driver development\\
- and engage in "long-tail" markets\vspace{15pt}
- \item Because for 30 years I Always Wanted To Design A CPU\vspace{10pt}
- \end{itemize}
-}
-
-
-\frame{\frametitle{Why OpenPOWER? (but first: Evaluation Criteria)}
-
-\vspace{15pt}
-
- \begin{itemize}
- \item Good ecosystem essential\\
- linux kernel, u-boot, compilers, OSes,\\
- Reference Implementation(s)\vspace{12pt}
- \item Supportive Foundation and Members\\
- need to be able to submit ISA augmentations\\
- (for proper peer review)\vspace{12pt}
- \item No NDAs, full transparency must be acceptable\\
- due to being funded under NLnet's PET Programme\vspace{12pt}
- \end{itemize}
-}
-
-\frame{\frametitle{Why OpenPOWER?}
-
-
- \begin{itemize}
- \item RISC-V: closed secretive mailing lists, closed secretive\\
- ISA Working Groups, no acceptance of transparency\\
- requirements, not well-established enough
- \item MIPS Open Initiative website was offline
- \item ARM and x86 are proprietary (x86 too complex)
- \item OpenRISC 1200 not enough adoption
- \item Nyuzi GPU too specialist (not a general-purpose ISA)
- \item MIAOW GPU is not a GPU (it's an AMD Vector Engine)
- \item "rolling your own" out of the question (20+ man-years)
- \item OpenPOWER: established for decades, excellent Foundation,\\
- Microwatt as Reference, approachable and friendly.
- \end{itemize}
-}
-
-\frame{\frametitle{What goes into a typical SoC?}
-\vspace{9pt}
- \begin{itemize}
- \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
- heat sink normally not required (simplifies overall design)
- \vspace{10pt}
- \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
- USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
- \vspace{10pt}
- \item Built-in GPU (shared memory bus, 3rd party licensed) \vspace{10pt}
- \item Build-in VPU (likewise)\vspace{10pt}
- \item Target price between \$2.50 and \$30 depending on market\\
- Radically different from IBM POWER9 Core (200 Watt)
- \vspace{10pt}
- \end{itemize}
-}
-
-
-
-\frame{\frametitle{Simple SBC-style SoC}
-
-\begin{center}
-\includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
-\end{center}
-
-}
-
-
-\frame{\frametitle{Where to start? (roadmap)}
-
- \begin{itemize}
- \item First thing: get a basic core working on an FPGA\\
- (use Microwatt as a reference)
- \item Next: create a low-cost test ASIC (180nm).\\
- (first OpenPOWER ASIC since IBM's POWER9, 10 years ago)
- \item (in parallel): Develop Vector ISA with 3D and Video\\
- extensions, under watchful eye of OpenPOWER Foundation
- \item Implement Vector ISA in simulator, then HDL, then FPGA\\
- and finally (only when ratified by OPF) into silicon
- \item Sell chips, make \$\$\$.
- \end{itemize}
-}
-
-\frame{\frametitle{What's different about Libre-SOC?}
-
- \begin{itemize}
- \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
- The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
- \textit{There is No Separate VPU/GPU Pipeline}\\
- \vspace{9pt}
- \item written in nmigen (a python-based HDL). Not VHDL\\
- not Verilog (definitely not Chisel3/Scala)\\
- This is an extremely important strategic decision.
- \vspace{9pt}
- \item Simple-V Vector Extension. See "SIMD Considered harmful".\\
- SV effectively a "hardware for-loop" on standard scalar ISA\\
- (conceptually similar to Zero-Overhead Loops in DSPs)
- \vspace{9pt}
- \end{itemize}
-}
-
-\frame{\frametitle{Hybrid Architecture: Augmented 6600}
-
- \begin{itemize}
- \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
- Help from Mitch Alsup includes \textit{precise exceptions}, \\
- multi-issue and more. Academic literature on 6600 utterly misleading.
- 6600 Scoreboards completely underestimated (Seymour Cray and
- James Thornton
- solved problems they didn't realise existed elsewhere!)
- \item Front-end Vector ISA, back-end "Predicated (masked) SIMD"\\
- nmigen (python OO) strategically critical to achieving this.
- \item Out-of-order combined with Simple-V allows scalar operations\\
- at the developer end to be turned into SIMD at the back-end\\
- \textit{without the developer needing to do SIMD}
- \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
- all automatically vectorised.
- \end{itemize}
-}
-
-\frame{\frametitle{Why nmigen? (but first: evaluate other HDLs)}
-
- \begin{itemize}
- \item Verilog: designed in the 1980s purely for doing unit tests (!)
- \item VHDL: again, a 1980s-era "Procedural" language (BASIC, Fortran).
- Does now have "records" which is nice.
- \item Chisel3 / Scala: OO, but very obscure (20th on index)
- \item pyrtl: not large enough community
- \item MyHDL: subset of python only
- \vspace{9pt}
- \item Slowly forming a set of criteria: must be OO (python), must have
- wide adoption (python), must have good well-established
- programming practices already in place (python), must be
- easy to learn (python)
- \item HDL itself although a much smaller community must have the same
- criteria. Only nmigen meets that criteria.
-
- \end{itemize}
-}
-
-\frame{\frametitle{Why nmigen?}
-
- \begin{itemize}
- \item Uses python to build an AST (Abstract Syntax Tree).
- Actually hands that over to yosys (to create ILANG file)
- after which verilog can (if necessary) be created
- \item Deterministic synthesiseable behaviour (Signals are declared
- with their reset pattern: no more forgetting "if rst" block).
- \item python OO programming techniques can be deployed. classes
- and functions created which pass in parameters which change
- what HDL is created (IEEE754 FP16 / 32 / 64 for example)
- \item python-based for-loops can e.g. read CSV files then generate
- a hierarchical nested suite of HDL Switch / Case statements
- (this is how the Libre-soc PowerISA decoder is implemented)
- \item extreme OO abstraction can even be used to create "dynamic
- partitioned Signals" that have the same operator-overloaded
- "add", "subtract", "greater-than" operators
-
- \end{itemize}
-}
-
-\frame{\frametitle{nmigen (dynamic) vs VHDL (static)}
-
-\begin{center}
-\includegraphics[width=1.0\textwidth]{2020-09-10_11-53.png}
-\end{center}
-
-}
-
-\frame{\frametitle{nmigen PowerISA Decoder}
-
-\begin{center}
-\includegraphics[width=1.0\textwidth]{2020-09-10_11-46.png}
-\end{center}
-
-}
-
-\frame{\frametitle{nmigen PowerISA Decoder}
-
-\begin{center}
-\includegraphics[width=0.55\textwidth]{2020-09-09_21-04.png}
-\end{center}
-
-}
-
-\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}
-
- \begin{itemize}
- \item Simple-V is a 'register tag' system. \textit{There are no opcodes}\\
- SV 'tags' scalar operations (scalar regfiles) as 'vectorised'
- \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
- able to fit a PowerISA decoder in only one clock cycle)
- \item Effectively a 'hardware sub-counter for-loop': pauses the PC\\
- then rolls incrementally through the operand register numbers\\
- issuing \textit{multiple} scalar instructions into the pipelines\\
- (hence the reason for a multi-issue OoO microarchitecture)
- \item Current \textit{and future} PowerISA scalar opcodes inherently
- \textit{and automatically} become 'vectorised' by SV without
- needing an explicit new Vector opcode.
- \item Predication and element width polymorphism are also 'tags'.
- elwidth polymorphism allows for FP16 / 80 / 128 to be added to
- the ISA \textit{without modifying the ISA}
-
- \end{itemize}
-}
-
-
-\begin{frame}[fragile]
-\frametitle{Simple-V ADD in a nutshell}
-
-\begin{semiverbatim}
-function op\_add(rd, rs1, rs2, predr) # add not VADD!
- int i, id=0, irs1=0, irs2=0;
- for (i = 0; i < VL; i++)
- if (ireg[predr] & 1<<i) # predication uses intregs
- ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
- if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
- if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
- if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
-\end{semiverbatim}
-
- \begin{itemize}
- \item Above is oversimplified: Reg. indirection left out (for clarity).
- \item SIMD slightly more complex (case above is elwidth = default)
- \item Scalar-scalar and scalar-vector and vector-vector now all in one
- \item OoO may choose to push ADDs into instr. queue (v. busy!)
- \end{itemize}
-\end{frame}
-
-
-\frame{\frametitle{Summary}
-
- \begin{itemize}
- \item Goal is to create a mass-volume low-power embedded SoC suitable
- for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
- \item No DRM. 'Trustable' (by the users, not by Media Moguls) design
- ethos as a \textit{business} objective: requires full transparency
- as well as Formal Correctness Proofs
- \item Collaboration with OpenPOWER Foundation and Members absolutely
- essential. No short-cuts. Standards to be developed and ratified
- so that everyone benefits.
- \item Working on the back of huge stability of POWER ecosystem
- \item Combination of which is that Board Support Package is 100\%
- upstream, app and product development by customer is hugely
- simplified and much more attractive
-
- \end{itemize}
-}
-
-
-\frame{
- \begin{center}
- {\Huge The end\vspace{15pt}\\
- Thank you\vspace{15pt}\\
- Questions?\vspace{15pt}
- }
- \end{center}
-
- \begin{itemize}
- \item Discussion: Libre-SOC-dev mailing list
- \item Freenode IRC \#libre-soc
- \item http://libre-soc.org/
- \item http://nlnet.nl/PET
- \end{itemize}
-}
-
-
-\end{document}
--- /dev/null
+\documentclass[slidestop]{beamer}
+\usepackage{beamerthemesplit}
+\usepackage{graphics}
+\usepackage{pstricks}
+
+\graphicspath{{./}}
+
+\title{The Libre-SOC Hybrid 3D CPU}
+\author{Luke Kenneth Casson Leighton}
+
+
+\begin{document}
+
+\frame{
+ \begin{center}
+ \huge{The Libre-SOC Hybrid 3D CPU}\\
+ \vspace{32pt}
+ \Large{Augmenting the OpenPOWER ISA}\\
+ \Large{to provide 3D and Video instructions}\\
+ \Large{(properly and officially)}\\
+ \vspace{24pt}
+ \Large{[proposed for] OpenPOWER Summit 2020}\\
+ \vspace{16pt}
+ \large{Sponsored by NLnet's PET Programme}\\
+ \vspace{6pt}
+ \large{\today}
+ \end{center}
+}
+
+
+\frame{\frametitle{Why another SoC?}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item Intel Management Engine, QA issues, Spectre\vspace{15pt}
+ \item Endless proprietary drivers \\
+ (affects product development cost)\vspace{15pt}
+ \item Opportunity to drastically simplify driver development\\
+ and engage in "long-tail" markets\vspace{15pt}
+ \item Because for 30 years I Always Wanted To Design A CPU\vspace{10pt}
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Why OpenPOWER? (but first: Evaluation Criteria)}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item Good ecosystem essential\\
+ linux kernel, u-boot, compilers, OSes,\\
+ Reference Implementation(s)\vspace{12pt}
+ \item Supportive Foundation and Members\\
+ need to be able to submit ISA augmentations\\
+ (for proper peer review)\vspace{12pt}
+ \item No NDAs, full transparency must be acceptable\\
+ due to being funded under NLnet's PET Programme\vspace{12pt}
+ \end{itemize}
+}
+
+\frame{\frametitle{Why OpenPOWER?}
+
+
+ \begin{itemize}
+ \item RISC-V: closed secretive mailing lists, closed secretive\\
+ ISA Working Groups, no acceptance of transparency\\
+ requirements, not well-established enough
+ \item MIPS Open Initiative website was offline
+ \item ARM and x86 are proprietary (x86 too complex)
+ \item OpenRISC 1200 not enough adoption
+ \item Nyuzi GPU too specialist (not a general-purpose ISA)
+ \item MIAOW GPU is not a GPU (it's an AMD Vector Engine)
+ \item "rolling your own" out of the question (20+ man-years)
+ \item OpenPOWER: established for decades, excellent Foundation,\\
+ Microwatt as Reference, approachable and friendly.
+ \end{itemize}
+}
+
+\frame{\frametitle{What goes into a typical SoC?}
+\vspace{9pt}
+ \begin{itemize}
+ \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
+ heat sink normally not required (simplifies overall design)
+ \vspace{10pt}
+ \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
+ USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
+ \vspace{10pt}
+ \item Built-in GPU (shared memory bus, 3rd party licensed) \vspace{10pt}
+ \item Built-in VPU (likewise)\vspace{10pt}
+ \item Target price between \$2.50 and \$30 depending on market\\
+ Radically different from IBM POWER9 Core (200 Watt)
+ \vspace{10pt}
+ \end{itemize}
+}
+
+
+
+\frame{\frametitle{Simple SBC-style SoC}
+
+\begin{center}
+\includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
+\end{center}
+
+}
+
+
+\frame{\frametitle{Where to start? (roadmap)}
+
+ \begin{itemize}
+ \item First thing: get a basic core working on an FPGA\\
+ (use Microwatt as a reference)
+ \item Next: create a low-cost test ASIC (180nm).\\
+ (first OpenPOWER ASIC since IBM's POWER9, 10 years ago)
+ \item (in parallel): Develop Vector ISA with 3D and Video\\
+ extensions, under watchful eye of OpenPOWER Foundation
+ \item Implement Vector ISA in simulator, then HDL, then FPGA\\
+ and finally (only when ratified by OPF) into silicon
+ \item Sell chips, make \$\$\$.
+ \end{itemize}
+}
+
+\frame{\frametitle{What's different about Libre-SOC?}
+
+ \begin{itemize}
+ \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
+ The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
+ \textit{There is No Separate VPU/GPU Pipeline}\\
+ \vspace{9pt}
+ \item written in nmigen (a python-based HDL). Not VHDL\\
+ not Verilog (definitely not Chisel3/Scala)\\
+ This is an extremely important strategic decision.
+ \vspace{9pt}
+ \item Simple-V Vector Extension. See "SIMD Considered harmful".\\
+ SV effectively a "hardware for-loop" on standard scalar ISA\\
+ (conceptually similar to Zero-Overhead Loops in DSPs)
+ \vspace{9pt}
+ \end{itemize}
+}
+
+\frame{\frametitle{Hybrid Architecture: Augmented 6600}
+
+ \begin{itemize}
+ \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
+ Help from Mitch Alsup includes \textit{precise exceptions}, \\
+ multi-issue and more. Academic literature on 6600 utterly misleading.
+ 6600 Scoreboards completely underestimated (Seymour Cray and
+ James Thornton
+ solved problems they didn't realise existed elsewhere!)
+ \item Front-end Vector ISA, back-end "Predicated (masked) SIMD"\\
+ nmigen (python OO) strategically critical to achieving this.
+ \item Out-of-order combined with Simple-V allows scalar operations\\
+ at the developer end to be turned into SIMD at the back-end\\
+ \textit{without the developer needing to do SIMD}
+ \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
+ all automatically vectorised.
+ \end{itemize}
+}
+
+\frame{\frametitle{Why nmigen? (but first: evaluate other HDLs)}
+
+ \begin{itemize}
+ \item Verilog: designed in the 1980s purely for doing unit tests (!)
+ \item VHDL: again, a 1980s-era "Procedural" language (BASIC, Fortran).
+ Does now have "records" which is nice.
+ \item Chisel3 / Scala: OO, but very obscure (20th on index)
+ \item pyrtl: not large enough community
+ \item MyHDL: subset of python only
+ \vspace{9pt}
+ \item Slowly forming a set of criteria: must be OO (python), must have
+ wide adoption (python), must have good well-established
+ programming practices already in place (python), must be
+ easy to learn (python)
+ \item The HDL itself, although a much smaller community, must meet the
+ same criteria. Only nmigen meets them.
+
+ \end{itemize}
+}
+
+\frame{\frametitle{Why nmigen?}
+
+ \begin{itemize}
+ \item Uses python to build an AST (Abstract Syntax Tree).
+ Actually hands that over to yosys (to create ILANG file)
+ after which verilog can (if necessary) be created
+ \item Deterministic synthesisable behaviour (Signals are declared
+ with their reset pattern: no more forgetting the "if rst" block).
+ \item python OO programming techniques can be deployed: classes
+ and functions take parameters that change
+ what HDL is generated (IEEE754 FP16 / 32 / 64 for example)
+ \item python-based for-loops can e.g. read CSV files then generate
+ a hierarchical nested suite of HDL Switch / Case statements
+ (this is how the Libre-soc PowerISA decoder is implemented)
+ \item extreme OO abstraction can even be used to create "dynamic
+ partitioned Signals" that have the same operator-overloaded
+ "add", "subtract", "greater-than" operators
+
+ \end{itemize}
+}
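+
+\begin{frame}[fragile]
+\frametitle{Why nmigen? (parameterised HDL sketch)}
+
+\begin{semiverbatim}
+# illustrative only: ParamAdder is a hypothetical name,
+# not a class taken from the Libre-SOC source tree
+from nmigen import Elaboratable, Module, Signal
+
+class ParamAdder(Elaboratable):
+    def \_\_init\_\_(self, width):
+        # the python parameter decides the HDL bit-width
+        self.a = Signal(width)
+        self.b = Signal(width)
+        # reset value is declared with the Signal itself
+        self.o = Signal(width, reset=0)
+
+    def elaborate(self, platform):
+        m = Module()
+        m.d.sync += self.o.eq(self.a + self.b)
+        return m
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item A minimal sketch of the "parameters change the HDL" point,
+ assuming a plain registered adder (not Libre-SOC code)
+ \item ParamAdder(16) and ParamAdder(64) would yield different HDL
+ from the same python class
+ \end{itemize}
+\end{frame}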
+
+\frame{\frametitle{nmigen (dynamic) vs VHDL (static)}
+
+\begin{center}
+\includegraphics[width=1.0\textwidth]{2020-09-10_11-53.png}
+\end{center}
+
+}
+
+\frame{\frametitle{nmigen PowerISA Decoder}
+
+\begin{center}
+\includegraphics[width=1.0\textwidth]{2020-09-10_11-46.png}
+\end{center}
+
+}
+
+\frame{\frametitle{nmigen PowerISA Decoder}
+
+\begin{center}
+\includegraphics[width=0.55\textwidth]{2020-09-09_21-04.png}
+\end{center}
+
+}
+
+\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}
+
+ \begin{itemize}
+ \item Simple-V is a 'register tag' system. \textit{There are no opcodes}\\
+ SV 'tags' scalar operations (scalar regfiles) as 'vectorised'
+ \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
+ able to fit a PowerISA decoder in only one clock cycle)
+ \item Effectively a 'hardware sub-counter for-loop': pauses the PC\\
+ then rolls incrementally through the operand register numbers\\
+ issuing \textit{multiple} scalar instructions into the pipelines\\
+ (hence the reason for a multi-issue OoO microarchitecture)
+ \item Current \textit{and future} PowerISA scalar opcodes inherently
+ \textit{and automatically} become 'vectorised' by SV without
+ needing an explicit new Vector opcode.
+ \item Predication and element width polymorphism are also 'tags'.
+ elwidth polymorphism allows for FP16 / 80 / 128 to be added to
+ the ISA \textit{without modifying the ISA}
+
+ \end{itemize}
+}
+
+
+\begin{frame}[fragile]
+\frametitle{Simple-V ADD in a nutshell}
+
+\begin{semiverbatim}
+function op\_add(rd, rs1, rs2, predr) # add not VADD!
+  int i, id=0, irs1=0, irs2=0;
+  for (i = 0; i < VL; i++) \{
+    if (ireg[predr] & 1<<i) # predication uses intregs
+      ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+    if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
+    if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
+    if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
+  \}
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item Above is oversimplified: Reg. indirection left out (for clarity).
+ \item SIMD slightly more complex (case above is elwidth = default)
+ \item Scalar-scalar and scalar-vector and vector-vector now all in one
+ \item OoO may choose to push ADDs into instr. queue (v. busy!)
+ \end{itemize}
+\end{frame}
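+
+\begin{frame}[fragile]
+\frametitle{Simple-V ADD: worked trace (illustrative)}
+
+\begin{semiverbatim}
+# assumed for illustration: VL=3, ireg[predr]=0b101
+# rd=8 (vector), rs1=16 (vector), rs2=24 (scalar)
+i=0: bit 0 set    ireg[8]  <= ireg[16] + ireg[24]
+i=1: bit 1 clear  (skipped: ireg[9] left unchanged)
+i=2: bit 2 set    ireg[10] <= ireg[18] + ireg[24]
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item A sketch only: the register numbers and predicate value are
+ made up, and it follows the simplified loop on the previous slide
+ \item id and irs1 step on every iteration (rd and rs1 are vectorised),
+ even when the element is masked out
+ \item irs2 stays at 0 (rs2 is scalar), so ireg[24] is reused
+ for every element: this is the scalar-vector case
+ \end{itemize}
+\end{frame}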
+
+
+\frame{\frametitle{Summary}
+
+ \begin{itemize}
+ \item Goal is to create a mass-volume low-power embedded SoC suitable
+ for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
+ \item No DRM. 'Trustable' (by the users, not by Media Moguls) design
+ ethos as a \textit{business} objective: requires full transparency
+ as well as Formal Correctness Proofs
+ \item Collaboration with OpenPOWER Foundation and Members absolutely
+ essential. No short-cuts. Standards to be developed and ratified
+ so that everyone benefits.
+ \item Working on the back of huge stability of POWER ecosystem
+ \item The combined result: the Board Support Package is 100\%
+ upstream, so app and product development by the customer is hugely
+ simplified and much more attractive
+
+ \end{itemize}
+}
+
+
+\frame{
+ \begin{center}
+ {\Huge The end\vspace{15pt}\\
+ Thank you\vspace{15pt}\\
+ Questions?\vspace{15pt}
+ }
+ \end{center}
+
+ \begin{itemize}
+ \item Discussion: Libre-SOC-dev mailing list
+ \item Freenode IRC \#libre-soc
+ \item http://libre-soc.org/
+ \item http://nlnet.nl/PET
+ \end{itemize}
+}
+
+
+\end{document}
+++ /dev/null
-\documentclass[slidestop]{beamer}
-\usepackage{beamerthemesplit}
-\usepackage{graphics}
-\usepackage{pstricks}
-
-\graphicspath{{./}}
-
-\title{The Libre-SOC Hybrid 3D CPU}
-\author{Luke Kenneth Casson Leighton}
-
-
-\begin{document}
-
-\frame{
- \begin{center}
- \huge{The Libre-SOC Hybrid 3D CPU}\\
- \vspace{32pt}
- \Large{Augmenting the OpenPOWER ISA}\\
- \Large{to provide 3D and Video instructions}\\
- \Large{(properly and officially) and make a GPU}\\
- \vspace{24pt}
- \Large{XDC2020}\\
- \vspace{16pt}
- \large{Sponsored by NLnet's PET Programme}\\
- \vspace{6pt}
- \large{\today}
- \end{center}
-}
-
-
-\frame{\frametitle{Why another SoC?}
-
- \begin{itemize}
- \item Intel Management Engine, Apple QA issues, Spectre\vspace{6pt}
- \item Endless proprietary drivers, "simplest" solution: \\
- License proprietary hard macros (with proprietary firmware)\\
- Adversely affects product development cost\\
- due to opaque driver bugs (Samsung S3C6410 / S5P100)
- \vspace{6pt}
- \item Alternative: Intel and Valve-Steam collaboration\\
- "Most productive business meeting ever!"\\
- https://tinyurl.com/valve-steam-intel
- \vspace{6pt}
- \item Because for 30 years I Always Wanted To Design A CPU
- \vspace{6pt}
- \item Ultimately it is a strategic \textit{business} objective to
- develop entirely Libre hardware, firmware and drivers.
- \end{itemize}
-}
-
-
-\frame{\frametitle{Why OpenPOWER?}
-
-\vspace{15pt}
-
- \begin{itemize}
- \item Good ecosystem essential\\
- linux kernel, u-boot, compilers, OSes,\\
- Reference Implementation(s)\vspace{10pt}
- \item Supportive Foundation and Members\\
- need to be able to submit ISA augmentations\\
- (for proper peer review)\vspace{10pt}
- \item No NDAs, full transparency must be acceptable\\
- due to being funded under NLnet's PET Programme\vspace{10pt}
- \item OpenPOWER: established for decades, excellent Foundation,\\
- Microwatt as Reference, approachable and friendly.
- \end{itemize}
-}
-
-
-\frame{\frametitle{What goes into a typical SoC?}
-\vspace{9pt}
- \begin{itemize}
- \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
- heat sink normally not required (simplifies overall design)
- \vspace{10pt}
- \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
- USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
- \vspace{10pt}
- \item Built-in GPU (shared memory bus, 3rd party licensed) \vspace{10pt}
- \item Build-in VPU (likewise)\vspace{10pt}
- \item Target price between \$2.50 and \$30 depending on market\\
- Radically different from IBM POWER9 Core (200 Watt)
- \vspace{10pt}
- \end{itemize}
-}
-
-
-
-\frame{\frametitle{Simple SBC-style SoC}
-
-\begin{center}
-\includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
-\end{center}
-
-}
-
-\frame{\frametitle{What's different about Libre-SOC?}
-
- \begin{itemize}
- \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
- The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
- \textit{There is No Separate VPU/GPU Pipeline}\\
- \vspace{9pt}
- \item written in nmigen (a python-based HDL). Not VHDL\\
- not Verilog (definitely not Chisel3/Scala)\\
- This is an extremely important strategic decision.\\
- \vspace{9pt}
- \item Simple-V Vector Extension. See `SIMD Considered harmful'.\\
- https://tinyurl.com/simd-considered-harmful\\
- SV effectively a "hardware for-loop" on standard scalar ISA\\
- (conceptually similar to Zero-Overhead Loops in DSPs)
- \vspace{9pt}
- \end{itemize}
-}
-
-\frame{\frametitle{Hybrid Architecture: Augmented 6600}
-
- \begin{itemize}
- \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
- Help from Mitch Alsup includes \textit{precise exceptions}, \\
- multi-issue and more. Academic literature on 6600 utterly misleading.
- 6600 Scoreboards completely underestimated (Seymour Cray and
- James Thornton
- solved problems they didn't realise existed elsewhere!)
- \item Front-end Vector ISA, back-end "Predicated (masked) SIMD"\\
- nmigen (python OO) strategically critical to achieving this.
- \item Out-of-order combined with Simple-V allows scalar operations\\
- at the developer end to be turned into SIMD at the back-end\\
- \textit{without the developer needing to do SIMD}
- \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
- all automatically vectorised.
- \end{itemize}
-}
-
-\frame{\frametitle{Why nmigen?}
-
- \begin{itemize}
- \item Uses python to build an AST (Abstract Syntax Tree).
- Actually hands that over to yosys (to create ILANG file)
- after which verilog can (if necessary) be created
- \item Deterministic synthesiseable behaviour (Signals are declared
- with their reset pattern: no more forgetting "if rst" block).
- \item python OO programming techniques can be deployed. classes
- and functions created which pass in parameters which change
- what HDL is created (IEEE754 FP16 / 32 / 64 for example)
- \item python-based for-loops can e.g. read CSV files then generate
- a hierarchical nested suite of HDL Switch / Case statements
- (this is how the Libre-soc PowerISA decoder is implemented)
- \item extreme OO abstraction can even be used to create "dynamic
- partitioned Signals" that have the same operator-overloaded
- "add", "subtract", "greater-than" operators
-
- \end{itemize}
-}
-
-\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}
-
- \begin{itemize}
- \item Simple-V is a 'register tag' system. \textit{There are no opcodes}\\
- SV 'tags' scalar operations (scalar regfiles) as 'vectorised'
- \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
- able to fit a PowerISA decoder in only one clock cycle)
- \item Effectively a 'hardware sub-counter for-loop': pauses the PC\\
- then rolls incrementally through the operand register numbers\\
- issuing \textit{multiple} scalar instructions into the pipelines\\
- (hence the reason for a multi-issue OoO microarchitecture)
- \item Current \textit{and future} PowerISA scalar opcodes inherently
- \textit{and automatically} become 'vectorised' by SV without
- needing an explicit new Vector opcode.
- \item Predication and element width polymorphism are also 'tags'.
- elwidth polymorphism allows for FP16 / 80 / 128 to be added to
- the ISA \textit{without modifying the ISA}
-
- \end{itemize}
-}
-
-\frame{\frametitle{Quick refresher on SIMD}
-
- \begin{itemize}
- \item SIMD very easy to implement (and very seductive)
- \item Parallelism is in the ALU
- \item Zero-to-Negligeable impact for rest of core
- \end{itemize}
- Where SIMD Goes Wrong:\vspace{6pt}
- \begin{itemize}
- \item See "SIMD instructions considered harmful"
- https://sigarch.org/simd-instructions-considered-harmful
- \item Setup and corner-cases alone are extremely complex.\\
- Hardware is easy, but software is hell.\\
- strncpy VSX patch for POWER9: 250 hand-written asm lines!\\
- (RVV / SimpleV strncpy is 14 instructions)
- \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\
- opcode, elwidth, veclen, src1-src2-dest hi/lo
- \end{itemize}
-}
-
-\begin{frame}[fragile]
-\frametitle{Simple-V ADD in a nutshell}
-
-\begin{semiverbatim}
-function op\_add(rd, rs1, rs2, predr) # add not VADD!
- int i, id=0, irs1=0, irs2=0;
- for (i = 0; i < VL; i++)
- if (ireg[predr] & 1<<i) # predication uses intregs
- ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
- if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
- if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
- if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
-\end{semiverbatim}
-
- \begin{itemize}
- \item Above is oversimplified: Reg. indirection left out (for clarity).
- \item SIMD slightly more complex (case above is elwidth = default)
- \item Scalar-scalar and scalar-vector and vector-vector now all in one
- \item OoO may choose to push ADDs into instr. queue (v. busy!)
- \end{itemize}
-\end{frame}
-
-\begin{frame}[fragile]
-\frametitle{Predication-Branch (overload meaning of "branch")}
-
-\begin{semiverbatim}
-s1 = reg\_is\_vectorised(src1);
-s2 = reg\_is\_vectorised(src2);
-if (!s2 && !s1) goto branch;
-for (int i = 0; i < VL; ++i)
- if (cmp(s1 ? reg[src1+i]:reg[src1],
- s2 ? reg[src2+i]:reg[src2])
- ireg[rs3] |= 1<<i;
-\end{semiverbatim}
-
- \begin{itemize}
- \item Above is oversimplified (case above is elwidth = default)
- \item If s1 and s2 both scalars, Standard branch occurs
- \item Predication stored in integer regfile as a bitfield
- \item Scalar-vector and vector-vector supported
- \item Overload Branch immediate to be predication target rs3
- \end{itemize}
-\end{frame}
-
-\begin{frame}[fragile]
-\frametitle{Register element width and packed SIMD}
-
-\begin{semiverbatim}
- typedef union \{
- uint8\_t actual\_bytes[8]; // actual SRAM bytes
- uint8\_t b[]; // array of type uint8\_t
- uint16\_t s[]; // etc
- uint32\_t i[];
- uint64\_t l[];
- \} reg\_t;
-
- reg\_t int\_regfile[128];
-\end{semiverbatim}
-
- \begin{itemize}
- \item Regfile is treated (sort-of) as a byte-level SRAM
- \item Each "register" starts at an 8-byte offset into SRAM
- \item requires byte-level "select" lines on SRAM
- \end{itemize}
-
-\end{frame}
-
-\frame{\frametitle{Register element width and packed SIMD}
-
- \begin{itemize}
- \item default: elements behave as defined by the standard ISA
- \item override for Integer operations: 8/16/32 bit SIMD
- \item override for IEEE754 FP: FP16/FP32 (and later FP80 or FP128)
- \item Effectively "typecasts" regfile to union of arrays
- \item Does not require modification of ISA! This is "tagging"\\
- (similar to the `Mill' ISA)
- \item FPADD64 RT, RA, RB becomes `actually please do FP16'\\
- (but without needing to add an actual FPADD16 opcode)
- \item Note: no zeroing unless explicitly requested!\\
- (unused elements e.g. VL=3 when elwidth=16 are
- predicated out: int\_regfile[RA].s[3] is not zero'd)
- \end{itemize}
-
-}
-
-\frame{\frametitle{Additional Simple-V features}
-
- \begin{itemize}
- \item "fail-on-first" (POWER9 VSX strncpy segfaults on boundary!)
- \item "Twin Predication" (covers VSPLAT, VGATHER, VSCATTER, VINDEX etc.)
- \item SVPrefix: 16-bit and 32-bit prefix to scalar operations\\
- (SVP-64 allows more extensive "tag" augmentation)
- \item VBLOCK: a VLIW-like context. Allows space for `swizzle' tags
- and more. Effectively a "hardware compression algorithm" for ISAs.
- \item Ultimate goal: cut down I-Cache usage, cuts down on power
- \item Typical GPU has its own I-Cache and small shaders.\\
- \textit{We are a Hybrid CPU/GPU: I-Cache is not separate!}
- \item Needs to go through OpenPOWER Foundation `approval'
- \end{itemize}
-}
-
-
-\frame{\frametitle{Summary}
-
- \begin{itemize}
- \item Goal is to create a mass-volume low-power embedded SoC suitable
- for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
- \item No way we could implement a project of this magnitude without
- nmigen (being able to use python OO to HDL)
- \item Collaboration with OpenPOWER Foundation and Members absolutely
- essential. No short-cuts. Standards to be developed and ratified
- so that everyone benefits.
- \item Working on the back of huge stability of POWER ecosystem
- \item Greatly simplified open 3D and Video drivers reduces product
- development costs for customers
- \item It also happens to be fascinating, deeply rewarding technically
- challenging, and funded by NLnet
-
- \end{itemize}
-}
-
-
-\frame{
- \begin{center}
- {\Huge The end\vspace{15pt}\\
- Thank you\vspace{15pt}\\
- Questions?\vspace{15pt}
- }
- \end{center}
-
- \begin{itemize}
- \item Discussion: http://lists.libre-soc.org
- \item Freenode IRC \#libre-soc
- \item http://libre-soc.org/
- \item http://nlnet.nl/PET
- \end{itemize}
-}
-
-
-\end{document}
--- /dev/null
+\documentclass[slidestop]{beamer}
+\usepackage{beamerthemesplit}
+\usepackage{graphics}
+\usepackage{pstricks}
+
+\graphicspath{{./}}
+
+\title{The Libre-SOC Hybrid 3D CPU}
+\author{Luke Kenneth Casson Leighton}
+
+
+\begin{document}
+
+\frame{
+ \begin{center}
+ \huge{The Libre-SOC Hybrid 3D CPU}\\
+ \vspace{32pt}
+ \Large{Augmenting the OpenPOWER ISA}\\
+ \Large{to provide 3D and Video instructions}\\
+ \Large{(properly and officially) and make a GPU}\\
+ \vspace{24pt}
+ \Large{XDC2020}\\
+ \vspace{16pt}
+ \large{Sponsored by NLnet's PET Programme}\\
+ \vspace{6pt}
+ \large{\today}
+ \end{center}
+}
+
+
+\frame{\frametitle{Why another SoC?}
+
+ \begin{itemize}
+ \item Intel Management Engine, Apple QA issues, Spectre\vspace{6pt}
+ \item Endless proprietary drivers, "simplest" solution: \\
+ License proprietary hard macros (with proprietary firmware)\\
+ Adversely affects product development cost\\
+ due to opaque driver bugs (Samsung S3C6410 / S5P100)
+ \vspace{6pt}
+ \item Alternative: Intel and Valve-Steam collaboration\\
+ "Most productive business meeting ever!"\\
+ https://tinyurl.com/valve-steam-intel
+ \vspace{6pt}
+ \item Because for 30 years I Always Wanted To Design A CPU
+ \vspace{6pt}
+ \item Ultimately it is a strategic \textit{business} objective to
+ develop entirely Libre hardware, firmware and drivers.
+ \end{itemize}
+}
+
+
+\frame{\frametitle{Why OpenPOWER?}
+
+\vspace{15pt}
+
+ \begin{itemize}
+ \item Good ecosystem essential\\
+ linux kernel, u-boot, compilers, OSes,\\
+ Reference Implementation(s)\vspace{10pt}
+ \item Supportive Foundation and Members\\
+ need to be able to submit ISA augmentations\\
+ (for proper peer review)\vspace{10pt}
+ \item No NDAs, full transparency must be acceptable\\
+ due to being funded under NLnet's PET Programme\vspace{10pt}
+ \item OpenPOWER: established for decades, excellent Foundation,\\
+ Microwatt as Reference, approachable and friendly.
+ \end{itemize}
+}
+
+
+\frame{\frametitle{What goes into a typical SoC?}
+\vspace{9pt}
+ \begin{itemize}
+ \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
+ heat sink normally not required (simplifies overall design)
+ \vspace{10pt}
+ \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
+ USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
+ \vspace{10pt}
+ \item Built-in GPU (shared memory bus, 3rd party licensed) \vspace{10pt}
+ \item Built-in VPU (likewise)\vspace{10pt}
+ \item Target price between \$2.50 and \$30 depending on market\\
+ Radically different from IBM POWER9 Core (200 Watt)
+ \vspace{10pt}
+ \end{itemize}
+}
+
+
+
+\frame{\frametitle{Simple SBC-style SoC}
+
+\begin{center}
+\includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
+\end{center}
+
+}
+
+\frame{\frametitle{What's different about Libre-SOC?}
+
+ \begin{itemize}
+ \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
+ The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
+ \textit{There is No Separate VPU/GPU Pipeline}\\
+ \vspace{9pt}
+ \item written in nmigen (a python-based HDL). Not VHDL\\
+ not Verilog (definitely not Chisel3/Scala)\\
+ This is an extremely important strategic decision.\\
+ \vspace{9pt}
+ \item Simple-V Vector Extension. See `SIMD Considered harmful'.\\
+ https://tinyurl.com/simd-considered-harmful\\
+ SV effectively a "hardware for-loop" on standard scalar ISA\\
+ (conceptually similar to Zero-Overhead Loops in DSPs)
+ \vspace{9pt}
+ \end{itemize}
+}
+
+\frame{\frametitle{Hybrid Architecture: Augmented 6600}
+
+ \begin{itemize}
+ \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
+ Help from Mitch Alsup includes \textit{precise exceptions}, \\
+ multi-issue and more. Academic literature on 6600 utterly misleading.
+ 6600 Scoreboards completely underestimated (Seymour Cray and
+ James Thornton
+ solved problems they didn't realise existed elsewhere!)
+ \item Front-end Vector ISA, back-end "Predicated (masked) SIMD"\\
+ nmigen (python OO) strategically critical to achieving this.
+ \item Out-of-order combined with Simple-V allows scalar operations\\
+ at the developer end to be turned into SIMD at the back-end\\
+ \textit{without the developer needing to do SIMD}
+ \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
+ all automatically vectorised.
+ \end{itemize}
+}
+
+\frame{\frametitle{Why nmigen?}
+
+ \begin{itemize}
+ \item Uses python to build an AST (Abstract Syntax Tree), then
+ hands that over to yosys (to create an ILANG file),
+ after which Verilog can (if necessary) be generated
+ \item Deterministic synthesisable behaviour (Signals are declared
+ with their reset pattern: no more forgetting the "if rst" block).
+ \item python OO programming techniques can be deployed: classes
+ and functions take parameters which change
+ what HDL is generated (IEEE754 FP16 / 32 / 64 for example)
+ \item python-based for-loops can e.g. read CSV files then generate
+ a hierarchical nested suite of HDL Switch / Case statements
+ (this is how the Libre-SOC PowerISA decoder is implemented)
+ \item extreme OO abstraction can even be used to create "dynamic
+ partitioned Signals" that have the same operator-overloaded
+ "add", "subtract", "greater-than" operators
+
+ \end{itemize}
+}
+
+\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}
+
+ \begin{itemize}
+ \item Simple-V is a 'register tag' system. \textit{There are no opcodes}\\
+ SV 'tags' scalar operations (scalar regfiles) as 'vectorised'
+ \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
+ able to fit a PowerISA decoder in only one clock cycle)
+ \item Effectively a 'hardware sub-counter for-loop': pauses the PC\\
+ then rolls incrementally through the operand register numbers\\
+ issuing \textit{multiple} scalar instructions into the pipelines\\
+ (hence the reason for a multi-issue OoO microarchitecture)
+ \item Current \textit{and future} PowerISA scalar opcodes inherently
+ \textit{and automatically} become 'vectorised' by SV without
+ needing an explicit new Vector opcode.
+ \item Predication and element width polymorphism are also 'tags'.
+ elwidth polymorphism allows for FP16 / 80 / 128 to be added to
+ the ISA \textit{without modifying the ISA}
+
+ \end{itemize}
+}
+
+\frame{\frametitle{Quick refresher on SIMD}
+
+ \begin{itemize}
+ \item SIMD very easy to implement (and very seductive)
+ \item Parallelism is in the ALU
+ \item Zero-to-Negligible impact for rest of core
+ \end{itemize}
+ Where SIMD Goes Wrong:\vspace{6pt}
+ \begin{itemize}
+ \item See "SIMD instructions considered harmful"
+ https://sigarch.org/simd-instructions-considered-harmful
+ \item Setup and corner-cases alone are extremely complex.\\
+ Hardware is easy, but software is hell.\\
+ strncpy VSX patch for POWER9: 250 hand-written asm lines!\\
+ (RVV / SimpleV strncpy is 14 instructions)
+ \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\
+ opcode, elwidth, veclen, src1-src2-dest hi/lo
+ \end{itemize}
+}
+
+\begin{frame}[fragile]
+\frametitle{Simple-V ADD in a nutshell}
+
+\begin{semiverbatim}
+function op\_add(rd, rs1, rs2, predr) # add not VADD!
+  int i, id=0, irs1=0, irs2=0;
+  for (i = 0; i < VL; i++) \{
+    if (ireg[predr] & 1<<i) # predication uses intregs
+      ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+    if (reg\_is\_vectorised[rd] ) \{ id += 1; \}
+    if (reg\_is\_vectorised[rs1]) \{ irs1 += 1; \}
+    if (reg\_is\_vectorised[rs2]) \{ irs2 += 1; \}
+  \}
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item Above is oversimplified: Reg. indirection left out (for clarity).
+ \item SIMD slightly more complex (case above is elwidth = default)
+ \item Scalar-scalar and scalar-vector and vector-vector now all in one
+ \item OoO may choose to push ADDs into instr. queue (v. busy!)
+ \end{itemize}
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Predication-Branch (overload meaning of "branch")}
+
+\begin{semiverbatim}
+s1 = reg\_is\_vectorised(src1);
+s2 = reg\_is\_vectorised(src2);
+if (!s2 && !s1) goto branch;
+for (int i = 0; i < VL; ++i)
+  if (cmp(s1 ? reg[src1+i]:reg[src1],
+          s2 ? reg[src2+i]:reg[src2]))
+    ireg[rs3] |= 1<<i;
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item Above is oversimplified (case above is elwidth = default)
+ \item If s1 and s2 both scalars, Standard branch occurs
+ \item Predication stored in integer regfile as a bitfield
+ \item Scalar-vector and vector-vector supported
+ \item Overload Branch immediate to be predication target rs3
+ \end{itemize}
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Register element width and packed SIMD}
+
+\begin{semiverbatim}
+ typedef union \{
+ uint8\_t actual\_bytes[8]; // actual SRAM bytes
+ uint8\_t b[]; // array of type uint8\_t
+ uint16\_t s[]; // etc
+ uint32\_t i[];
+ uint64\_t l[];
+ \} reg\_t;
+
+ reg\_t int\_regfile[128];
+\end{semiverbatim}
+
+ \begin{itemize}
+ \item Regfile is treated (sort-of) as a byte-level SRAM
+ \item Each "register" starts at an 8-byte offset into SRAM
+ \item requires byte-level "select" lines on SRAM
+ \end{itemize}
+
+\end{frame}
+
+\frame{\frametitle{Register element width and packed SIMD}
+
+ \begin{itemize}
+ \item default: elements behave as defined by the standard ISA
+ \item override for Integer operations: 8/16/32 bit SIMD
+ \item override for IEEE754 FP: FP16/FP32 (and later FP80 or FP128)
+ \item Effectively "typecasts" regfile to union of arrays
+ \item Does not require modification of ISA! This is "tagging"\\
+ (similar to the `Mill' ISA)
+ \item FPADD64 RT, RA, RB becomes `actually please do FP16'\\
+ (but without needing to add an actual FPADD16 opcode)
+ \item Note: no zeroing unless explicitly requested!\\
+ (unused elements e.g. VL=3 when elwidth=16 are
+ predicated out: int\_regfile[RA].s[3] is not zero'd)
+ \end{itemize}
+
+}
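+
+\begin{frame}[fragile]
+\frametitle{Element width override: a C sketch}
+
+A minimal sketch, not taken from the specification. It assumes one
+64-bit register viewed as four 16-bit lanes, and a hypothetical
+helper \texttt{add16} standing in for an ADD with elwidth=16.
+With VL=3, lane \texttt{s[3]} of the destination keeps its old value.
+
+\begin{semiverbatim}
+#include <stdint.h>
+
+typedef union \{
+  uint8\_t  actual\_bytes[8]; // one 64-bit register
+  uint16\_t s[4];             // same bytes, 16-bit lanes
+\} reg\_t;
+
+// add the first VL 16-bit lanes; others untouched
+void add16(reg\_t *rt, const reg\_t *ra,
+           const reg\_t *rb, int VL)
+\{
+  for (int i = 0; i < VL && i < 4; i++)
+    rt->s[i] = (uint16\_t)(ra->s[i] + rb->s[i]);
+\}
+\end{semiverbatim}
+\end{frame}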
+
+\frame{\frametitle{Additional Simple-V features}
+
+ \begin{itemize}
+ \item "fail-on-first" (POWER9 VSX strncpy segfaults on boundary!)
+ \item "Twin Predication" (covers VSPLAT, VGATHER, VSCATTER, VINDEX etc.)
+ \item SVPrefix: 16-bit and 32-bit prefix to scalar operations\\
+ (SVP-64 allows more extensive "tag" augmentation)
+ \item VBLOCK: a VLIW-like context. Allows space for `swizzle' tags
+ and more. Effectively a "hardware compression algorithm" for ISAs.
+ \item Ultimate goal: cut down I-Cache usage, which cuts down on power
+ \item Typical GPU has its own I-Cache and small shaders.\\
+ \textit{We are a Hybrid CPU/GPU: I-Cache is not separate!}
+ \item Needs to go through OpenPOWER Foundation `approval'
+ \end{itemize}
+}
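+
+\begin{frame}[fragile]
+\frametitle{Twin Predication: a C sketch}
+
+A simplified sketch of a twin-predicated move. Assumptions (not from
+these slides): \texttt{ps} and \texttt{pd} are source and destination
+predicate masks, \texttt{vs} and \texttt{vd} flag vectorised registers,
+and VL is at most 64. A scalar source with a vector destination
+then acts as a splat.
+
+\begin{semiverbatim}
+#include <stdint.h>
+
+void sv\_mv(uint64\_t *ireg, int rd, int rs, int VL,
+           uint64\_t ps, uint64\_t pd, int vs, int vd)
+\{
+  int i = 0, j = 0;  // src and dest element index
+  while (i < VL && j < VL) \{
+    // skip masked-out elements on each side
+    if (vs) while (i < VL && !((ps >> i) & 1)) i++;
+    if (vd) while (j < VL && !((pd >> j) & 1)) j++;
+    if (i == VL || j == VL) break;
+    ireg[rd+j] = ireg[rs+i];
+    if (vs) i++;
+    if (vd) j++; else break; // scalar dest: one result
+  \}
+\}
+\end{semiverbatim}
+\end{frame}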
+
+
+\frame{\frametitle{Summary}
+
+ \begin{itemize}
+ \item Goal is to create a mass-volume low-power embedded SoC suitable
+ for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
+ \item No way we could implement a project of this magnitude without
+ nmigen (being able to use python OO to HDL)
+ \item Collaboration with OpenPOWER Foundation and Members absolutely
+ essential. No short-cuts. Standards to be developed and ratified
+ so that everyone benefits.
+ \item Working on the back of huge stability of POWER ecosystem
+ \item Greatly simplified open 3D and Video drivers reduce product
+ development costs for customers
+ \item It also happens to be fascinating, deeply rewarding, technically
+ challenging, and funded by NLnet
+
+ \end{itemize}
+}
+
+
+\frame{
+ \begin{center}
+ {\Huge The end\vspace{15pt}\\
+ Thank you\vspace{15pt}\\
+ Questions?\vspace{15pt}
+ }
+ \end{center}
+
+ \begin{itemize}
+ \item Discussion: http://lists.libre-soc.org
+ \item Freenode IRC \#libre-soc
+ \item http://libre-soc.org/
+ \item http://nlnet.nl/PET
+ \end{itemize}
+}
+
+
+\end{document}