\documentclass[slidestop]{beamer}
\usepackage{beamerthemesplit}
\title{The Libre-SOC Hybrid 3D CPU}
\author{Luke Kenneth Casson Leighton}
\huge{The Libre-SOC Hybrid 3D CPU}\\
\Large{Augmenting the OpenPOWER ISA}\\
\Large{to provide 3D and Video instructions}\\
\Large{(properly and officially) and make a GPU}\\
\large{Sponsored by NLnet's PET Programme}\\
\frame{\frametitle{Why another SoC?}

 \begin{itemize}
   \item Intel Management Engine, Apple QA issues, Spectre
         \vspace{6pt}
   \item Endless proprietary drivers, ``simplest'' solution: \\
         License proprietary hard macros (with proprietary firmware)\\
         Adversely affects product development cost\\
         due to opaque driver bugs (Samsung S3C6410 / S5P100)
   \item Alternative: Intel and Valve-Steam collaboration\\
         ``Most productive business meeting ever!''\\
         https://tinyurl.com/valve-steam-intel
   \item Because for 30 years I Always Wanted To Design A CPU
   \item Ultimately it is a strategic \textit{business} objective to
         develop entirely Libre hardware, firmware and drivers.
  \end{itemize}
}
\frame{\frametitle{Why OpenPOWER?}

 \begin{itemize}
   \item Good ecosystem essential\\
         Linux kernel, u-boot, compilers, OSes,\\
         Reference Implementation(s)
         \vspace{10pt}
   \item Supportive Foundation and Members\\
         need to be able to submit ISA augmentations\\
         (for proper peer review)
         \vspace{10pt}
   \item No NDAs, full transparency must be acceptable\\
         due to being funded under NLnet's PET Programme
         \vspace{10pt}
   \item OpenPOWER: established for decades, excellent Foundation,\\
         Microwatt as Reference, approachable and friendly.
  \end{itemize}
}
\frame{\frametitle{How can you help?}

 \begin{itemize}
   \item Start here! https://libre-soc.org \\
         Mailing lists https://lists.libre-soc.org \\
         IRC Freenode \#libre-soc \\
         etc. etc. (it's a Libre project, go figure)
   \item Can I get paid? Yes! NLnet funded\\
         See https://libre-soc.org/nlnet/\#faq
   \item Also profit-sharing in any commercial ventures
   \item How many opportunities to develop Libre SoCs exist,\\
         and actually get paid for it?
   \item I'm not a developer, how can I help?\\
         - Plenty of research needed, artwork, website \\
         - Help find customers and OEMs willing to commit (LOI)
  \end{itemize}
}
\frame{\frametitle{What goes into a typical SoC?}

 \begin{itemize}
   \item 15 to 20mm BGA package: 2.5 to 5 watt power consumption\\
         heat sink normally not required (simplifies overall design)
   \item Fully-integrated peripherals (not Northbridge/Southbridge)\\
         USB, HDMI, RGB/TTL, SD/MMC, I2C, UART, SPI, GPIO etc. etc.
   \item Built-in GPU (shared memory bus, 3rd party licensed)
         \vspace{3pt}
   \item Built-in VPU (likewise, proprietary)
         \vspace{3pt}
   \item Target price between \$2.50 and \$30 depending on market\\
         Radically different from IBM POWER9 Core (200 Watt)
   \item We're doing the same, just with a hybrid architecture.
  \end{itemize}
}
\frame{\frametitle{Simple SBC-style SoC}

 \includegraphics[width=0.9\textwidth]{shakti_libre_soc.jpg}
}
\frame{\frametitle{What's different about Libre-SOC?}

 \begin{itemize}
   \item Hybrid - integrated. The CPU \textit{is} the GPU.\\
         The GPU \textit{is} the CPU. The VPU \textit{is} the CPU.\\
         \textit{There is No Separate VPU/GPU Pipeline or Processor}
   \item Written in nmigen (a Python-based HDL). Not VHDL,\\
         not Verilog (definitely not Chisel3/Scala).\\
         This is an extremely important strategic decision.
   \item Simple-V Vector Extension. See `SIMD Considered harmful'.\\
         https://tinyurl.com/simd-considered-harmful\\
         SV is effectively a ``hardware for-loop'' on a standard scalar ISA\\
         (conceptually similar to Zero-Overhead Loops in DSPs)
   \item Yes great, but what's different compared to Intel, AMD, NVIDIA,
  \end{itemize}
}
\frame{\frametitle{OpenPOWER Cell Processor and upwards}

 \begin{itemize}
   \item OpenPOWER ISA developed from PowerPC, with the RS6000 in the 90s.
   \item Sony, IBM and Toshiba began the Cell Processor in 2001 \\
         (Sony Playstation 3) - NUMA approach
   \item Raw brute-force performance pissed all over the competition
   \item VSX later evolved out of this initiative.
   \item VSX, a SIMD extension, now showing its age. \\
         Fixed-width, no predication, limited pixel formats (15 bit)
   \item (Vulkan requires dozens of pixel formats)
  \end{itemize}
}
\frame{\frametitle{Apple M1 (ARM) vs Intel / AMD (x86)}

 \begin{itemize}
   \item Very interesting article: tinyurl.com/apple-m1-review
   \item Apple M1: uses ARM. Intel: implements x86
   \item Apple M1: RISC multi-issue. Intel: CISC multi-issue.
   \item Apple M1: uniform (easy) instruction decode \\
         Intel: \textit{Cannot easily identify start of instruction}
   \item Result: multi-issue x86 decoder is so complex, it misses
         opportunities to keep back-end execution engines 100 percent
   \item OpenPOWER happens to be RISC (easy decode), which is why POWER10
         has 8-way multi-issue.
   \item Libre-SOC can do the same tricks that IBM POWER10 and Apple M1
         can. Intel (x86) literally cannot keep up.
  \end{itemize}
}
\frame{\frametitle{Hybrid Architecture: Augmented 6600}

 \begin{itemize}
   \item CDC 6600 is a design from 1965. The \textit{augmentations} are not.\\
         Help from Mitch Alsup includes \textit{precise exceptions}, \\
         multi-issue and more. Academic literature on 6600 utterly misleading.
         6600 Scoreboards completely underestimated (Seymour Cray and
         solved problems they didn't realise existed elsewhere!)
   \item Front-end Vector ISA, back-end ``Predicated (masked) SIMD''\\
         nmigen (Python OO) strategically critical to achieving this.
   \item Out-of-order combined with Simple-V allows scalar operations\\
         at the developer end to be turned into SIMD at the back-end\\
         \textit{without the developer needing to do SIMD}
   \item IEEE754 sin / cos / atan2, Texturisation opcodes, YUV2RGB\\
         all automatically vectorised.
  \end{itemize}
}
\frame{\frametitle{Learning from these and putting it together}

 \begin{itemize}
   \item Apple M1 and IBM POWER10 show that RISC plus superscalar
         multi-issue produces insane performance
   \item Intel AVX-512 and CISC in general are getting out of hand (what's
         next: 256-bit length instructions, AVX-1024?)
   \item RISC-V RVV shows Cray-style Vectors can save power. Simple-V
         has the same benefits with far fewer instructions (188 for RVV,
         3 to 5 new instructions for Simple-V).
   \item CDC 6600 shows that intelligently-implemented designs can do the
         job, with far fewer resources.
   \item Libre-SOC combines the best of historical processor designs,
         co-opting and innovating on them (pissing in the back yard of
         every incumbent CPU and GPU company in the process).
   \item It's a Libre project: you get to help
  \end{itemize}
}
\frame{\frametitle{Why nmigen?}

 \begin{itemize}
   \item Uses Python to build an AST (Abstract Syntax Tree).
         Actually hands that over to yosys (to create an ILANG file),
         after which Verilog can (if necessary) be created
   \item Deterministic synthesisable behaviour (Signals are declared
         with their reset pattern: no more forgetting the ``if rst'' block).
   \item Python OO programming techniques can be deployed. Classes
         and functions are created which take parameters that change
         what HDL is generated (IEEE754 FP16 / 32 / 64 for example)
   \item Python-based for-loops can e.g. read CSV files then generate
         a hierarchical nested suite of HDL Switch / Case statements
         (this is how the Libre-SOC PowerISA decoder is implemented)
   \item Extreme OO abstraction can even be used to create ``dynamic
         partitioned Signals'' that have the same operator-overloaded
         ``add'', ``subtract'', ``greater-than'' operators
  \end{itemize}
}
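\begin{frame}[fragile]
\frametitle{Sketch: a CSV-driven decoder table}

A minimal sketch of the CSV idea above, in plain Python rather than
nmigen, and not the actual Libre-SOC decoder. A nested dict stands in
for the nested Switch / Case; the CSV columns and the names
\texttt{build\_decoder} and \texttt{decode} are assumptions made for
illustration.

\scriptsize
\begin{verbatim}
import csv
import io

# Hypothetical CSV: column names and rows are illustrative only
TABLE = """major,minor,unit,op
31,266,ALU,add
31,40,ALU,subf
31,28,LOGICAL,and
14,,ALU,addi
"""

def build_decoder(text):
    # outer key = major opcode, inner key = minor opcode (or None)
    tree = {}
    for row in csv.DictReader(io.StringIO(text)):
        minor = int(row["minor"]) if row["minor"] else None
        entry = (row["unit"], row["op"])
        tree.setdefault(int(row["major"]), {})[minor] = entry
    return tree

def decode(tree, major, minor=None):
    cases = tree[major]
    # fall back to the "no minor opcode" entry when there is one
    return cases.get(minor, cases.get(None))

decoder = build_decoder(TABLE)
print(decode(decoder, 31, 266))
\end{verbatim}
\end{frame}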
\frame{\frametitle{Why another Vector ISA? (or: not-exactly another)}

 \begin{itemize}
   \item Simple-V is a `register tag' system. \textit{There are no opcodes}\\
         SV `tags' scalar operations (scalar regfiles) as `vectorised'
   \item (PowerISA SIMD is around 700 opcodes, making it unlikely to be
         able to fit a PowerISA decoder in only one clock cycle)
   \item Effectively a `hardware sub-counter for-loop': pauses the PC\\
         then rolls incrementally through the operand register numbers\\
         issuing \textit{multiple} scalar instructions into the pipelines\\
         (hence the reason for a multi-issue OoO microarchitecture)
   \item Current \textit{and future} PowerISA scalar opcodes inherently
         \textit{and automatically} become `vectorised' by SV without
         needing an explicit new Vector opcode.
   \item Predication and element width polymorphism are also `tags'.
         elwidth polymorphism allows for BF16 / FP16 / 80 / 128 to be added to
         the ISA \textit{without modifying the ISA}
  \end{itemize}
}
\frame{\frametitle{Quick refresher on SIMD}

 \begin{itemize}
   \item SIMD very easy to implement (and very seductive)
   \item Parallelism is in the ALU
   \item Zero-to-Negligible impact for rest of core
 \end{itemize}
 Where SIMD Goes Wrong:\vspace{6pt}
 \begin{itemize}
   \item See ``SIMD instructions considered harmful''
         https://sigarch.org/simd-instructions-considered-harmful
   \item Setup and corner-cases alone are extremely complex.\\
         Hardware is easy, but software is hell.\\
         strncpy VSX patch for POWER9: 250 hand-written asm lines!\\
         (RVV / SimpleV strncpy is 14 instructions)
   \item O($N^{6}$) ISA opcode proliferation (1000s of instructions)\\
         opcode, elwidth, veclen, src1-src2-dest hi/lo
  \end{itemize}
}
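\begin{frame}[fragile]
\frametitle{Sketch: why SIMD opcode counts multiply}

A rough worked example of the O($N^{6}$) point above: each independent
choice multiplies the number of opcodes needed. Every size below is a
made-up assumption for illustration, not a count from any real ISA.

\scriptsize
\begin{verbatim}
from math import prod

# Hypothetical number of choices along each of the six dimensions
dims = {
    "base operations": 10,
    "element widths": 4,    # e.g. 8 / 16 / 32 / 64 bit
    "vector lengths": 3,    # e.g. three SIMD register widths
    "src1 hi/lo": 2,
    "src2 hi/lo": 2,
    "dest hi/lo": 2,
}

# One opcode per combination: the sizes multiply, they do not add
total = prod(dims.values())
print(total)   # 10 * 4 * 3 * 2 * 2 * 2 = 960
\end{verbatim}
\end{frame}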
\begin{frame}[fragile]
\frametitle{Simple-V ADD in a nutshell}

\begin{semiverbatim}
function op\_add(rd, rs1, rs2, predr) # add not VADD!
  int i, id=0, irs1=0, irs2=0;
  for (i = 0; i < VL; i++)
    if (ireg[predr] & 1<<i) # predication uses intregs
       ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
    if (reg\_is\_vectorised[rd] )  \{ id += 1; \}
    if (reg\_is\_vectorised[rs1])  \{ irs1 += 1; \}
    if (reg\_is\_vectorised[rs2])  \{ irs2 += 1; \}
\end{semiverbatim}

 \begin{itemize}
   \item Above is oversimplified: Reg. indirection left out (for clarity).
   \item SIMD slightly more complex (case above is elwidth = default)
   \item Scalar-scalar and scalar-vector and vector-vector now all in one
   \item OoO may choose to push ADDs into instr. queue (v. busy!)
  \end{itemize}
\end{frame}
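\begin{frame}[fragile]
\frametitle{Sketch: the ADD pseudocode as runnable Python}

The pseudocode above can be tried out as a small Python model. This is
a sketch under the same simplifications (no register indirection,
default elwidth); passing \texttt{VL} and the tag table
\texttt{is\_vec} as arguments is an assumption made to keep it
self-contained.

\scriptsize
\begin{verbatim}
def op_add(ireg, is_vec, rd, rs1, rs2, predr, VL):
    """One scalar 'add', looped VL times by the register tags."""
    i_d = i_s1 = i_s2 = 0
    for i in range(VL):
        if ireg[predr] & (1 << i):   # predicate bit i, from an int reg
            ireg[rd + i_d] = ireg[rs1 + i_s1] + ireg[rs2 + i_s2]
        # only registers tagged as vectors step to the next element
        i_d += int(is_vec[rd])
        i_s1 += int(is_vec[rs1])
        i_s2 += int(is_vec[rs2])

ireg = [0] * 32
ireg[8:12] = [1, 2, 3, 4]        # vector source in r8..r11
ireg[16] = 10                    # scalar source in r16
ireg[3] = 0b1011                 # predicate mask: element 2 is skipped
is_vec = [False] * 32
is_vec[8] = is_vec[20] = True    # r8 and r20 are tagged as vectors
op_add(ireg, is_vec, rd=20, rs1=8, rs2=16, predr=3, VL=4)
print(ireg[20:24])               # [11, 12, 0, 14]
\end{verbatim}
\end{frame}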
\frame{\frametitle{Additional Simple-V features}

 \begin{itemize}
   \item ``fail-on-first'' (POWER9 VSX strncpy segfaults on boundary!)
   \item ``Twin Predication'' (covers VSPLAT, VGATHER, VSCATTER, VINDEX etc.)
   \item SVP64: extensive ``tag'' (Vector context) augmentation
   \item ``Context propagation'': a VLIW-like context. Allows contexts
         to be repeatedly applied.
         Effectively a ``hardware compression algorithm'' for ISAs.
   \item Ultimate goal: cut down I-Cache usage, which cuts down on power
   \item Typical GPU has its own I-Cache and small shaders.\\
         \textit{We are a Hybrid CPU/GPU: I-Cache is not separate!}
   \item Needs to go through OpenPOWER Foundation `approval'
  \end{itemize}
}
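\begin{frame}[fragile]
\frametitle{Sketch: twin predication}

A small Python model of one reading of ``Twin Predication'': the source
and the destination each get their own predicate mask. This is an
illustrative assumption, not the SVP64 specification; it only covers the
case where both registers are vectors, and \texttt{twin\_pred\_mv} is a
made-up name.

\scriptsize
\begin{verbatim}
def twin_pred_mv(reg, rd, rs, src_mask, dst_mask, VL):
    """Copy the n-th enabled source element to the n-th enabled dest."""
    src = [i for i in range(VL) if src_mask & (1 << i)]
    dst = [j for j in range(VL) if dst_mask & (1 << j)]
    for i, j in zip(src, dst):   # stops when either mask runs out
        reg[rd + j] = reg[rs + i]

reg = [0] * 32
reg[8:12] = [10, 20, 30, 40]
# source mask selects elements 1 and 3, dest mask places them at 0 and 1
twin_pred_mv(reg, rd=16, rs=8, src_mask=0b1010, dst_mask=0b0011, VL=4)
print(reg[16:20])                # [20, 40, 0, 0]
\end{verbatim}
\end{frame}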
\frame{\frametitle{Summary}

 \begin{itemize}
   \item Goal is to create a mass-volume low-power embedded SoC suitable
         for use in netbooks, chromebooks, tablets, smartphones, IoT SBCs.
   \item No way we could implement a project of this magnitude without
         nmigen (being able to use Python OO to generate HDL)
   \item Collaboration with OpenPOWER Foundation and Members absolutely
         essential. No short-cuts. Standards to be developed and ratified
         so that everyone benefits.
   \item Riding the wave of huge stability of OpenPOWER ecosystem
   \item Greatly simplified open 3D and Video drivers reduce product
         development costs for customers
   \item It also happens to be fascinating, deeply rewarding, technically
         challenging, and funded by NLnet
  \end{itemize}
}
{\Huge The end\vspace{12pt}\\
Thank you\vspace{12pt}\\
Questions?\vspace{12pt}
}
 \begin{itemize}
   \item Discussion: http://lists.libre-soc.org
   \item Freenode IRC \#libre-soc
   \item http://libre-soc.org/
   \item http://nlnet.nl/PET
   \item https://libre-soc.org/nlnet/\#faq
  \end{itemize}