From: Luke Kenneth Casson Leighton
Date: Thu, 5 May 2022 16:33:39 +0000 (+0100)
Subject: whitespace
X-Git-Tag: opf_rfc_ls005_v1~2435
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=acb1ef7a53e3c5c4bf3c2489b9d9b0aca1adc0e4;p=libreriscv.git

---

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index a862bb3b8..a997acf77 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -2,108 +2,103 @@

# Why in the 2020s would you invent a new Vector ISA

Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
mature. A Vector or Packed SIMD ISA reaching stable general-purpose
auto-vectorisation compiler support has never been achieved in the
history of computing, not with the combined resources of ARM, Intel,
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. Rather: GPUs
have ultra-specialist compilers that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards managed by
the Khronos Group, with multi-man-century development commitment from
multiple billion-dollar-revenue companies, to sustain them.
This begs the question: why on earth would anyone consider this task?
First hints are that whilst memory bitcells have not increased in speed
since the 90s (around 150 MHz), increasing the datapath widths has
allowed significant apparent speed increases: 3200 MHz DDR4 and even
faster DDR5, and other advanced Memory interfaces such as HBM, Gen-Z,
and OpenCAPI, all make an effort, but these efforts are dwarfed by the
two, nearly three, orders of magnitude increase in CPU horsepower.
Seymour Cray, from his amazing in-depth knowledge, predicted that the
mismatch would become a serious limitation. Some systems at the time of
writing are approaching a *Gigabyte* of L4 Cache, by way of
compensation, and as we know from experience even that will be
considered inadequate in future.

Efforts to solve this problem by moving the processing closer to, or
directly integrating it into, the memory have traditionally not gone
well. Aspex Microelectronics and Elixent are parallel processing
companies that very few have heard of, because their software stack was
so specialist that it required heavy investment by customers to utilise.
D-Matrix and Graphcore are a modern incarnation of the exact same
"specialist parallel processing" mistake, betting heavily on AI with
Matrix and Convolution Engines that can do no other task. Aspex only
survived by being bought by Ericsson, where its specialised suitability
for massive wide Baseband FFTs saved it from going under. Any "better
AI mousetrap" that comes along will quickly render both D-Matrix and
Graphcore obsolete.
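The scale of that mismatch is easy to sketch with back-of-the-envelope
arithmetic. The DDR4-3200 channel figure below follows directly from
the interface's transfer rate and bus width; the CPU throughput figures
are purely illustrative assumptions, not measurements of any real part:

```python
# Back-of-the-envelope memory-wall arithmetic (CPU figures are
# illustrative assumptions, not measurements of any real part).

# One DDR4-3200 channel: 3200 MT/s on a 64-bit (8-byte) bus.
ddr4_channel_gbs = 3200e6 * 8 / 1e9        # = 25.6 GB/s

# A hypothetical 16-core CPU at 3 GHz, each core sustaining one
# 8-lane double-precision FMA (2 FLOPs per lane) per cycle.
flops_per_sec = 16 * 3e9 * 8 * 2           # = 768 GFLOP/s

# Memory bytes available per FLOP from that single channel:
bytes_per_flop = (ddr4_channel_gbs * 1e9) / flops_per_sec

print(f"{ddr4_channel_gbs:.1f} GB/s vs {flops_per_sec / 1e9:.0f} GFLOP/s"
      f" -> {bytes_per_flop:.3f} bytes per FLOP")
```

Even under these modest assumptions the compute side outruns a single
memory channel by well over an order of magnitude, and streaming
workloads need many bytes per FLOP, not a small fraction of one: this
is the mismatch Cray predicted.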
Second hints as to the answer emerge from an article "[SIMD considered
harmful](https://www.sigarch.org/simd-instructions-considered-harmful/)"
which illustrates a catastrophic rabbit-hole taken by Industry Giants
ARM, Intel, AMD, since the 90s (over 3 decades) whereby SIMD, an
Order(N^6) opcode proliferation nightmare with its mantra "make it easy
for hardware engineers, let software sort out the mess", has literally
overwhelmed programmers.
Worse than that, specialists who charge clients for Optimisation
Services are finding that AVX-512, to take an example, is anything but
optimal: overall performance with AVX-512 actually *decreases* even as
power consumption goes up.

Cray-style Vectors solved, over thirty years ago, the opcode
proliferation nightmare. Only the NEC SX Aurora, however, truly kept
the Cray Vector flame alive, until RISC-V RVV and now SVP64 and
recently MRISC32 joined it.
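The key to the Cray-style escape from opcode proliferation is the
`setvl` concept: software asks for N elements, the hardware answers
with how many it will process this pass, and the same binary then runs
unmodified on any vector register width. A minimal software model of
the idiom (the `MAXVL` value and function names here are illustrative,
not any real ISA's encoding):

```python
# Software model of a Cray-style vector loop using a setvl-like idiom.
# setvl(n) returns min(n, MAXVL): the hardware chooses the vector
# length each pass, so one binary runs on any register-file width.

MAXVL = 4  # hypothetical hardware maximum vector length


def setvl(n, maxvl=MAXVL):
    return min(n, maxvl)


def daxpy(a, x, y):
    """y[:] += a * x[:], strip-mined the way a setvl loop would be."""
    i, n = 0, len(x)
    while n > 0:
        vl = setvl(n)            # "set vector length" for this pass
        for lane in range(vl):   # one vector operation in hardware
            y[i + lane] += a * x[i + lane]
        i += vl
        n -= vl
    return y


print(daxpy(2.0, [1, 2, 3, 4, 5], [10, 10, 10, 10, 10]))
# → [12.0, 14.0, 16.0, 18.0, 20.0]
```

A narrower implementation simply returns a smaller `vl` each pass; the
loop structure, and the binary, never change — which fixed-width Packed
SIMD, by construction, cannot offer.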
ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
instruction that makes a truly ubiquitous Vector ISA) in ways that
will become apparent over time as adoption increases. In the meantime
programmers are, in direct violation of ARM's advice on how to use
SVE2, trying desperately to use it as if it were Packed SIMD NEON. The
advice not to create SVE2 assembler that is hardcoded to fixed widths
is being disregarded, in favour of writing *multiple identical
implementations* of a function, each with a different hardware width,
and compelling software to choose one at runtime after probing the
hardware.

Even RISC-V, for all that we can be grateful to the RISC-V Founders
for reviving Cray Vectors, has severe performance and implementation
limitations that are only really apparent to exceptionally experienced
assembly-level developers with a wide, diverse depth in multiple ISAs:
one of the best and clearest explanations is a
[ycombinator post](https://news.ycombinator.com/item?id=24459041)
by adrian_b.
Adrian logically and concisely points out that the fundamental design
assumptions and simplifications that went into the RISC-V ISA have an
irrevocably damaging effect on its viability for high performance use.
That is not to say that its use in low-performance embedded scenarios
is not ideal: in private custom secretive commercial usage it is
perfect. Trinamic, an early adopter, is a classic case study: it
created its TMC2660 Stepper IC, replacing ARM with RISC-V and saving
itself USD 1 in licensing royalties per product. Ubiquitous and common
everyday usage in scenarios currently occupied by ARM, Intel, AMD and
IBM? Not so much. Even though RISC-V has Cray-style Vectors, the whole
ISA is, unfortunately, fundamentally flawed as far as power-efficient
high performance is concerned.

Slowly, at this point, a realisation should be sinking in that,
actually, there aren't as many really truly viable Vector ISAs out
there as it first appears: the ones that are evolving in the general
direction of Vectorisation are, in various completely different ways,
flawed.

**Successfully identifying a limitation marks the beginning of an
opportunity**

We are nowhere near done, however, because a Vector ISA is a superset
of a Scalar ISA, and even a Scalar ISA takes over a decade to develop
compiler support, and even longer to get the software ecosystem up and
running.

Which ISAs, therefore, have or have had, at one point in time, a
decent Software Ecosystem?
Debian supports most of these, including s390:

* SPARC, created by Sun Microsystems and all but abandoned by Oracle.
  Gaisler Research maintains the LEON Open Source Cores but with Oracle's

@@ -163,10 +158,9 @@ candidates for further advancement are:

particularly disingenuous group from performing illegal
Reverse-Engineering.

By asking the question, "which ISA would be the best and most stable
to base a Vector Supercomputing-class Extension on?" where patent
protection, software ecosystem, open-ness and pedigree all combine to
reduce risk and increase the chances of success, there is really only
one candidate.

**Of all of these, the only one with the most going for it is the
Power ISA.**