From: Luke Kenneth Casson Leighton
Date: Thu, 5 May 2022 16:41:44 +0000 (+0100)
Subject: add GPU paragraph
X-Git-Tag: opf_rfc_ls005_v1~2433
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=b195d9580efdbc9cd3c61415161f0a4fec6576d4;p=libreriscv.git

add GPU paragraph
---

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index 8323571fe..6e5d14a03 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -28,12 +28,14 @@ First hints are that whilst memory bitcells have not increased in speed
 since the 90s (around 150 mhz), increasing the datapath widths has
 allowed significant apparent speed increases: 3200 mhz DDR4 and even
 faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI,
-all make an effort, but these efforts are dwarfed by the two nearly
-three orders of magnitude increase in CPU horsepower. Seymour Cray,
-from his amazing in-depth knowledge, predicted that the mismatch would
-become a serious limitation. Some systems at the time of writing are
-approaching a *Gigabyte* of L4 Cache, by way of compensation, and as we
-know from experience even that will be considered inadequate in future.
+all make an effort (all simply increasing the parallel deployment of
+the underlying 150 mhz bitcells), but these efforts are dwarfed by the
+two to nearly three orders of magnitude increase in CPU horsepower.
+Seymour Cray, from his amazing in-depth knowledge, predicted over two
+decades ago that the mismatch would become a serious limitation. Some
+systems at the time of writing are now approaching a *Gigabyte* of L4
+Cache, by way of compensation, and as we know from experience even that
+will be considered inadequate in future.
 
 Efforts to solve this problem by moving the processing closer to or
 directly integrated into the memory have traditionally not gone well:
@@ -48,6 +50,14 @@ for massive wide Baseband FFTs saved it from going under.
 Any "better AI mousetrap" that comes along will quickly render both
 D-Matrix and Graphcore obsolete.
 
+NVIDIA and other GPUs have taken a different approach again: massive
+parallelism with more Turing-complete ISAs in each core, and dedicated
+slower parallel memory paths (GDDR5) suited to the specific tasks of
+3D, Parallel Compute and AI. The complexity of this approach is dwarfed
+only by the amount of money poured into the software ecosystem in order
+to make it accessible, and even then, GPU Programmers are a specialist
+and rare (expensive) breed.
+
 Second hints as to the answer emerge from an article "[SIMD considered
 harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
 which illustrates a catastrophic rabbit-hole taken by Industry Giants