[[!tag whitepapers]]

# Increasing average area efficiency and reducing resource utilisation for the Power ISA

originally posted at: <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-February/004505.html>

in between attempting to compile microwatt and Libre-SOC for an 85k LUT4 FPGA, which took 4 hours (and then did not run), i decided to see what level of resource reduction could be achieved in Libre-SOC's HDL by going to 32 bit ALUs and register files.

the difference was an astounding 1.4 to 1.

* the MUL pipeline dropped an astonishing 75%, which given that multiply is O(N^2) is, retrospectively, not surprising
* SHIFT dropped to 50%
* ALU (add) dropped over 50%
* Logical dropped over 60%
* BRAM usage dropped by over 75%

i then took a look at the I-Cache, D-Cache and MMU, and i am not seeing any practical barriers to setting them to 32 bit either, other than needing to define a new RADIX32 data format, which looks to be as simple as reducing the PTE and PDE lengths.

why consider this at all? surely "32 bit is dead, Jim, dead, Jim, dead, Jim, dead" [https://m.youtube.com/watch?v=FCARADb9asE]

the answers are multiple:

* Anton had a hard time getting Microwatt into the Sky130 MPW1, which is limited to 10 mm^2.
  (Libre-SOC's 180 nm test ASIC was 30 mm^2, and that was with no MMU or L1 D/I-Cache)
* Compared to RISC-V, which can easily fit into only 3000 LUT4s of an FPGA,
  a Power ISA implementation is completely missing out on opportunities
  to be taken up by Hardware enthusiasts, because it requires a bare minimum of 40K LUT4s.
  (Without L1/MMU, Libre-SOC 64-bit is still 20k LUT4s)
* The high resource utilisation is making life difficult for Libre-BMC,
  and given that running a 64 bit OS just for a bootloader is not the slightest bit justified,
  i am left puzzled as to the justification for what is inherently a self-inflicted handicap

The high resource utilisation is (including for Libre-BMC) pressurising everyone, hindering adoption, and slowing down the iteration cycle on development. The 4 hour turnaround was not a throwaway comment, it was deeply significant: designs using 95% of a 45K LUT4 ECP5 can usually complete on nextpnr-ecp5 in around 12-15 minutes. 4 hours is insane and a waste of time.

it is also worth reiterating that larger designs give FPGA tools a much harder job, dramatically reducing the maximum achievable clock rate.

based on the above analysis, a 32 bit implementation of an MMU-capable Power ISA core could easily fit into a lower-cost Digilent Arty A7-35t or a 45K LUT4 VERSA_ECP5, and with a little corner-cutting (no MMU/L1) could even potentially fit into the low-cost 25K orangecrab with plenty of room.

this would make it affordable and accessible to e.g. students in India, as well as increase general adoption.

not only that, but it would cleanly fit into sky130's 10 mm^2 budget (with reduced I/D-Caches), retain an MMU, and have room for some peripherals (kinda important, that)

this in turn allows for a faster iterative cycle on ASIC development, through access every couple of *months* to an MPW Shuttle run.

the next step requires a little explanation and context. SVP64 has been designed as a "Sub-Program-Counter for-loop in hardware" (similar to x86 "REP"). it is not a new idea: Peter Hsu, designer of the MIPS R8000, came up with the exact same concept in 1994.
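
to make the "Sub-Program-Counter for-loop" idea concrete, here is a minimal Python sketch (purely illustrative: the names `regs`, `VL` and `svp64_add` are invented for this example and are not the SVP64 specification pseudocode) of a single scalar add being repeated by a hardware sub-loop:

    # illustrative sketch only: a "Sub-Program-Counter for-loop" wrapped
    # around a *scalar* add.  names (regs, VL, svp64_add) are invented
    # for clarity and are not the actual SVP64 specification pseudocode.
    def svp64_add(regs, RT, RA, RB, VL):
        """repeat the scalar 'add RT,RA,RB' VL times, incrementing the
        register numbers each iteration - conceptually the same as x86
        "REP" wrapped around a single scalar instruction."""
        for i in range(VL):
            regs[RT + i] = (regs[RA + i] + regs[RB + i]) & 0xFFFFFFFFFFFFFFFF

    # usage: with VL=4 this issues four back-to-back scalar adds
    regs = list(range(32))
    svp64_add(regs, RT=0, RA=8, RB=16, VL=4)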

the register file is treated as a byte-addressable SRAM (with byte-level masks this is not difficult to envisage) and the ALUs end up being conceptually similar to MMX, which can do 8x8, 4x16, 2x32 or 1x64 bit operations, except that SVP64 introduces predicate masks, which of course map directly and simply onto the write-select lines of the underlying SRAM of the register file.
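
as a rough sketch of that mapping (again purely illustrative: the helper name `regfile_write` and the one-byte element width are assumptions, not Libre-SOC's actual HDL), each predicate bit simply becomes the byte-enable for its element's slice of the SRAM row:

    # illustrative sketch: predicate mask bits become per-byte write-enables
    # on a byte-addressable register-file SRAM.  regfile_write and the
    # one-byte element width are assumptions for this example only.
    def regfile_write(row, new_row, predicate, elwidth_bytes=1):
        """overwrite row with new_row, but only for elements whose
        predicate bit is set - each predicate bit drives the
        byte-enables (write-select lines) for that element."""
        out = bytearray(row)
        for element in range(len(row) // elwidth_bytes):
            if (predicate >> element) & 1:   # predicate bit == write-select
                start = element * elwidth_bytes
                out[start:start + elwidth_bytes] = \
                    new_row[start:start + elwidth_bytes]
        return bytes(out)

    # usage: 8 one-byte elements, with only elements 0 and 3 enabled
    old = bytes(8)              # all zeros
    new = bytes(range(1, 9))    # 1..8
    print(regfile_write(old, new, predicate=0b00001001).hex())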

however, as an intermediary step on the path to converting Libre-SOC's HDL to cope with 8/16/32/64, we actually have to define and implement *scalar* operations at 8, 16 and 32 bit in addition to those already present in the 64-bit Power ISA. this is underway with a Draft RFC proposal to define the Power ISA in terms of "XLEN", where XLEN=64 very deliberately, thoroughly and intentionally matches precisely, and by definition, exactly that which is currently in Power ISA 3.0/3.1.

let that sink in a moment because the implications are startling:

we are in effect defining not only a 32 bit Draft
variant of the Power ISA, we (Libre-SOC) are also
defining a 16 bit *and an 8 bit* variant of Power
[and we anticipate someone in the future defining
a 128-bit variant to match RISC-V RV128].
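
to illustrate what defining the ISA "in terms of XLEN" means in practice, here is a hedged Python sketch (the Draft RFC itself uses Power ISA specification pseudocode; this merely mirrors the idea): one add definition, with the register width as a parameter, so that XLEN=64 reproduces today's behaviour and the 8, 16 and 32 bit variants fall out of the very same definition:

    # illustrative sketch of XLEN-parameterised scalar semantics.
    # the Draft RFC expresses this in Power ISA specification pseudocode;
    # this Python only mirrors the concept.
    def add(ra, rb, XLEN=64):
        """RT <- (RA) + (RB), truncated to XLEN bits."""
        return (ra + rb) & ((1 << XLEN) - 1)

    # XLEN=64 matches Power ISA 3.0/3.1 exactly; the same definition
    # also yields the 8, 16 and 32 bit Draft variants.
    assert add(0xFFFFFFFFFFFFFFFF, 1, XLEN=64) == 0
    assert add(0xFF, 1, XLEN=8) == 0
    assert add(0x7FFF, 1, XLEN=16) == 0x8000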

bear in mind that SVP64 *has* to have Scalar Operations first, because by design and by definition *only Scalar operations may be Vectorized*. SVP64 *DOES NOT* add *ANY* Vector Instructions. SVP64 is a generic loop around *Scalar* operations and it is up to the Architecture to take advantage of that, at the back-end.

without SVP64 Sub-Looping it would on the face of it seem absolutely mental and a total waste of time and resources to define an 8 or 16 bit General-Purpose ISA in the year 2022, until you recall that:

* students cannot possibly fit a Power ISA 64 bit implementation into a USD $10 ICE40 FPGA, but they might achieve a 16 bit one, and potentially do so in a few short weeks

* the primary focus of AI is FP16, BF16, and in some cases even FP8, running on massive parallel banks of cores numbering in the thousands, often with SIMD ALUs.

* a typical GPU has over 30% of its area dedicated to parallel computational resources (SIMD ALUs), whereas in a General-purpose RISC Core the computational resources are typically dwarfed, by literally two orders of magnitude, by routing, register files, caches and peripherals.

the inherent downside of such massively parallel task-centric cores is that they are absolutely useless at anything other than that specialist task, and are additionally a pig to program, lacking a useful ISA and compiler or, worse, having one but under proprietary licenses.

the delicate balance of massively parallel supercomputing architecture is not to overcook the performance of a single core above all else (hint: Intel), but to focus instead on *average* efficiency per *total* area or power.

what if there was a way to leverage the Power ISA to have high-end AI performance yet be able to allow programmers to use standard compiler tools to run general-purpose programs on all of those massively-parallel cores?

anyone who has tried either CUDA, 3D Shader programs, deep or wide SIMD Programming, or tried to get their heads twisted round GPU SIMT threads would celebrate and welcome the opportunity.

(in particular, anyone who remembers how hard programming the Cell Processor turned out to be will be having that familiar "lightbulb moment" right about now)

more than that: what if those 8 and 16 bit cores had a Supercomputing-class Vectorization option in the ISA, and there were implementations out there with back-end ALUs that could perform 64 or 128 operations of 8 or 16 bits per clock cycle?

Quantity several thousand per processor, all of them capable of adapting to run massive AI number crunching or (at lower IPC than "normal" processors) general-purpose compute?

To achieve this requires some insights:

1. access (addressing memory) beyond 8-bit, 16-bit, or 32-bit can easily be achieved by allowing LD/STs to leverage *multiple* 8/16/32-bit registers to create 32 or 64 bit addresses (see the sketch after this list).

   SVP64 *already* has the concept of allowing consecutive 8/16/32/64 bit registers to be considered a "Vector", so typecasting to create 32 or 64 bit addresses fits easily

2. If the Power ISA did not already have Carry-In/Out and Condition Registers, this entire idea would have much less merit.

   the idea of using multiple instructions to construct bigger integer values is nothing new, but doing so is far easier and more efficient if the ISA has Carry Flags. that particularly hits home if the basic arithmetic width is only 8 or 16 bit! (again, see the sketch after this list)

3. SVP64 already has the concept of extending the GPRs and FPRs to 128 entries. however if those are, say, 16 bit registers, the actual size of the regfile is back down to exactly the same total number of bytes as Power ISA 3.0 (128 x 16-bit = 32 x 64-bit = 256 bytes)
   - with only 32 16-bit registers, register pressure would be alarming, particularly given that 4 of them would be needed to construct a 64 bit LD/ST address
   - 128 16-bit registers, on the other hand, are equivalent in total storage to 32 64-bit regs, and Computer Science shows we are comfortable with that quantity.

4. Power ISA already has load-quad and store-quad which span 2 registers: transparent, seamless spanning of more than one register to create larger operands and larger results is therefore not even a foreign concept to Power.
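
the following Python sketch (purely illustrative: the 16-bit register width, the little-endian register ordering and the helper names are all assumptions, not Power ISA or SVP64 definitions) shows the two tricks from insights 1 and 2: pairing narrow consecutive registers to form a wide LD/ST address, and carry-chaining narrow adds to build a wide addition:

    # illustrative sketch for insights 1 and 2 above.  assumes 16-bit
    # registers, little-endian register pairing and invented helper names;
    # none of this is actual Power ISA or SVP64 specification pseudocode.
    XLEN = 16
    MASK = (1 << XLEN) - 1

    def make_address(regs, base, nregs=4):
        """insight 1: treat nregs consecutive 16-bit registers as one
        wide value (here, a 64-bit LD/ST address)."""
        addr = 0
        for i in range(nregs):
            addr |= (regs[base + i] & MASK) << (i * XLEN)
        return addr

    def wide_add(regs, rt, ra, rb, nregs):
        """insight 2: carry-chained narrow adds (addc then adde, in
        Power ISA terms) building an nregs*16-bit addition."""
        carry = 0
        for i in range(nregs):
            total = (regs[ra + i] & MASK) + (regs[rb + i] & MASK) + carry
            regs[rt + i] = total & MASK
            carry = total >> XLEN   # Carry-Out feeds the next Carry-In

    # usage: four 16-bit registers form a 64-bit address; a 32-bit add
    # becomes two carry-chained 16-bit adds
    regs = [0] * 32
    regs[8:12] = [0x5678, 0x1234, 0x0000, 0x8000]
    assert make_address(regs, base=8) == 0x8000000012345678
    regs[0:2] = [0xFFFF, 0x0001]    # 0x0001FFFF
    regs[4:6] = [0x0001, 0x0000]    # 0x00000001
    wide_add(regs, rt=16, ra=0, rb=4, nregs=2)
    assert regs[16:18] == [0x0000, 0x0002]   # result 0x00020000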

given the ease with which both 32 and 64 bit addresses may be constructed, and 32 and 64 bit integer arithmetic (and beyond) may be built up using multiple instructions, *and* how much more efficiently that can be done by leveraging SVP64, what at first sounded like an absolutely insane-to-the-point-of-laughable idea would instead be not only workable but would combine General-Purpose Compute and AI workloads into a single hybrid ISA.

as you are no doubt aware, this has been the focus of so many unsuccessful ventures over so many decades that it would be nice to have one that worked. but, by definition, being "General" Purpose Compute (that happens to also be Supercomputing-AI capable), it starts at the ISA and grows from there.

bottom line, i would very much like to see the Power ISA take on Esperanto, but without having to define a custom proprietary extension to the ISA that nobody but they have access to.