updates/020_2019aug28_intriguing_ideas.mdwn

Pixilica starts a 3D Open Graphics Alliance initiative; we decide to
go with a "reconfigurable" pipeline; seven additional 50,000 EUR NLnet
grant proposals submitted.

### The Possibility of a 3D Open Graphics Alliance

youtube: HeVz-z4D8os

At SIGGRAPH this year, there was a very interesting BoF (Birds of a
Feather session), where the
[idea was put forward](https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d62b6282dc27100170a4a05)
by Atif, of Pixilica, to use RISC-V as the core basis of a 3D embedded
flexible GPGPU (hybrid / general purpose GPU).  Whilst the idea of a
GPGPU has been floated before (in particular by ICubeCorp), the
reasons *why* were what particularly caught people's attention at the
BoF.

The current 3D GPU designs - NVIDIA, AMD, Intel - are hugely optimised
for mass volume appeal. Niche markets, by virtue of the profit
opportunities being lower or even negative given the design choices of
the incumbents, are inherently penalised.  Not only that: whilst
things are slowly changing due to ongoing multi-man-year
reverse-engineering efforts, 3D driver source code is often
proprietary as well.

At the BoF, one attendee described how they are implementing *transparent*
shader algorithms. Most shader hardware provides fixed-function triangle
algorithms that assume a solid surface. Using such hardware for transparent
shaders is a two-pass process which clearly comes with an inherent *100%*
performance penalty. If, on the other hand, they had some input into a
new 3D core, one that was designed to be flexible...

The level of interest was sufficiently high that Atif is reaching out
to people (including our team) to set up an Open 3D Graphics
Alliance. The basic idea is to have people work together to create
an appropriate efficient "Hybrid CPU/GPU" instruction set architecture
(ISA) suitable for a diverse range of requirements, from small
embedded softcores, to embedded GPUs for use in mobile processors, all
the way to HPC servers and high-end machine learning and robotics
applications.

One interesting thing that has to be made clear - the lesson from
Nyuzi and Larrabee - is that a good vector processor does **not**
automatically make a good 3D GPU. Jeff Bush designed Nyuzi very
specifically to replicate the Larrabee team's work - in particular, their
use of a recursive software-based tiling algorithm. By deliberately
not including custom 3D hardware accelerated opcodes, Nyuzi has only
25% of the performance of a modern GPU consuming the same amount of power.
Put another way, if you want to use a pure vector engine to get the same
performance as a commercially competitive GPU, you need *four times*
the power consumption and four times the silicon area.

Thus, we simply cannot use an off-the-shelf vector extension such as the
upcoming RISC-V vector extension, or even SimpleV, and expect to
automatically have a commercially competitive 3D GPU. It takes texture
opcodes, Z-buffers, pixel conversion, linear interpolation, transcendentals
(sin, cos, exp, log), and much more, all of which has to be designed,
thought through, implemented, *and then used behind a suitable API*.

In addition, given that the Alliance is to meet the needs of "unusual"
markets, it is no good creating an ISA that has such a high barrier to
entry and such a power-performance penalty that it inherently excludes
the very implementors it is targeted at, particularly in embedded markets.

Thus, we need a hybrid architecture, not just to reduce complexity, not
just to meet Libre criteria, but to meet the long tail of innovation in
3D and kick-start some real innovation.

These were the challenges discussed at the first
[meetup](https://www.meetup.com/Bay-Area-RISC-V-Meetup/events/264231095/)
at Western Digital's Milpitas HQ. Experts at the meetup from the 3D
industry, who have worked for decades for ATI, NVIDIA, and Intel, were
[really enthusiastic](https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/JlKZdzS6VtQ/eDStaf8vAQAJ)
and praised this approach, saying that it was exactly
the kind of shake-up the 3D industry needs.

### Reconfigurable Pipelines

Jacob came up with a fascinating idea: a reconfigurable pipeline. The
basic idea behind pipelines is that combinatorial blocks are separated
by latches. The reason is that when gates are chained together,
there is a ripple effect which has to have time to stabilise. If the
clock is run too fast, computations no longer have time to become valid.

So the solution is to split the combinatorial blocks into shorter chains,
and have "latches" in between them which capture the intermediary
results. This is termed a "pipeline." Actually it's more like an
escalator.

The problem comes when you want to vary the clock speed. Being able to do
so is desirable, but if the pipeline is long and the clock rate is slow,
clearly the latency (completion time of an instruction) is also long.

Conversely, if the pipeline is short (few stages, each with a large number
of gates chained together) then, as mentioned above, this inherently limits
the maximum frequency that the processor can run at because, due to the
"ripple" effect in each pipeline stage, a longer chain of gates clearly
needs a longer time to stabilise.

What if there was a solution which allowed *both* options? What if you
could actually reconfigure the pipeline to be shorter or longer?

It turns out that by using what is termed "transparent latches," it
is possible to do precisely that. The advantages are enormous and were
described in detail on comp.arch.

Earlier in
[this thread](https://groups.google.com/d/msg/comp.arch/fcq-GLQqvas/SY2F9Hd8AQAJ),
someone kindly pointed out that IBM published
papers on the technique. Basically, the latches normally present in the
pipeline have an additional combinatorial "bypass" in the form of a
mux. The output is dynamically selected from either the input *or* the
input after it has been put through a flip-flop. The flip-flop basically
stores (and delays) its input for one clock cycle, or it can be bypassed,
i.e., just be another part of that "ripple" effect mentioned earlier.
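To make the bypass mux concrete, here is a minimal behavioural sketch in Python. It is not our design, nor IBM's: `run_pipeline`, the four toy stage functions and the `bypass` flags are all invented for illustration, and the final register is assumed never to be bypassed. Each stage is followed by a register that either captures its input at the clock edge or, when bypassed, lets the value ripple straight into the next stage. Both configurations below compute the same results, but with every other register bypassed they appear after two clock edges instead of four.

```python
from typing import Callable, List, Optional, Sequence

Stage = Callable[[int], int]


def run_pipeline(stages: Sequence[Stage],
                 bypass: Sequence[bool],
                 inputs: Sequence[int]) -> List[Optional[int]]:
    """Return the contents of the final register after each clock edge.

    bypass[i] is True when the register after stage i is transparent.
    Assumption: the final register is never bypassed.
    """
    regs: List[Optional[int]] = [None] * len(stages)
    outputs: List[Optional[int]] = []
    # Feed the inputs, then empty "bubbles" so that the pipeline drains.
    for value in list(inputs) + [None] * len(stages):
        next_regs = list(regs)
        signal = value
        for i, stage in enumerate(stages):
            signal = None if signal is None else stage(signal)
            if bypass[i]:
                continue           # mux picks the raw input: value ripples on
            next_regs[i] = signal  # flip-flop captures at the clock edge
            signal = regs[i]       # downstream sees the previous cycle's value
        regs = next_regs
        outputs.append(regs[-1])
    return outputs


# Four toy combinatorial blocks: x -> ((x + 1) * 2 + 3) * 4
STAGES = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 3, lambda x: x * 4]

full = run_pipeline(STAGES, [False, False, False, False], [1, 2])
merged = run_pipeline(STAGES, [True, False, True, False], [1, 2])
print(full)    # [None, None, None, 28, 36, None]
print(merged)  # [None, 28, 36, None, None, None]
```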

By putting these transparent latches on every other combinatorial stage
in the processing chain, the length of the pipeline may be halved, such
that when the clock rate is also halved the *instruction completion time
remains the same*.

As described earlier, lowering the processor speed would normally have
an adverse impact on instruction latency.  With the transparent
latches bypassed, and with plenty of time to stabilise at the lower speed,
two back-to-back stages now comprise a *single* pipeline stage. Thus,
even if the processor speed is halved,
*so is the length of the overall pipeline*, and the instruction
completion time remains the same.
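The arithmetic behind that claim can be sketched in a few lines of Python. The 8-stage pipeline and the 750 MHz figure (exactly half of 1.5 GHz) are illustrative assumptions, not measurements of our design, and the model simply assumes one clock period per pipeline stage.

```python
import math


def latency_ns(stages: int, clock_hz: float) -> float:
    """Completion time in nanoseconds, assuming one clock period per stage."""
    return stages / clock_hz * 1e9


# Hypothetical 8-stage pipeline, for illustration only.
fast = latency_ns(8, 1.5e9)          # every latch clocked, full speed
slow_fixed = latency_ns(8, 0.75e9)   # clock halved, pipeline left unchanged
slow_merged = latency_ns(4, 0.75e9)  # clock halved, every other latch bypassed

print(f"{fast:.2f} ns, {slow_fixed:.2f} ns, {slow_merged:.2f} ns")
# 5.33 ns, 10.67 ns, 5.33 ns
```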

It's a fantastic idea that will allow us to reconfigure the processor
either to reach a 1.5 GHz clock rate for high performance bursts, or to
run at 800 MHz in reduced-power mode.

### NLnet Funding Proposals

The next step is to put in over half a dozen NLnet funding proposals. No,
literally:
[seven new proposals](https://libre-riscv.org/nlnet_proposals/),
each for 50,000 EUR. One for gcc, one for a port of MESA RADV to the
new processor, another for writing experimental assembly code to go into
libswscale, libx264 etc. ultimately for use in VLC and ffmpeg, and so on.

Best of all, two for actually doing a test ASIC: one working with
[chips4makers](http://chips4makers.io/blog), the other with
[lip6.fr](https://www-soc.lip6.fr/en/). It turns out that 180 nm ASIC shuttle
services cost only 600 USD per square mm, and we can get away with around
20 square mm, which comes to about 12,000 USD for an estimated 800,000 gates.
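As a quick sanity check of those figures, here is a small Python sketch. The function and constant names are made up for this post, and it assumes that the shuttle price scales linearly with die area, which a real shuttle service may not do exactly.

```python
def shuttle_cost_usd(area_mm2: float, usd_per_mm2: float = 600.0) -> float:
    """Shuttle cost, assuming the price scales linearly with die area."""
    return area_mm2 * usd_per_mm2


AREA_MM2 = 20.0   # "around 20 square mm"
GATES = 800_000   # estimated gate count

cost = shuttle_cost_usd(AREA_MM2)
print(cost)              # 12000.0 USD in total
print(GATES / AREA_MM2)  # 40000.0 gates per square mm
print(cost / GATES)      # 0.015 USD per gate
```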

At that low cost, we can iterate before going to lower geometries, and
also have something which, even at 350 MHz, if it were dual-issue,
would be a reasonably saleable product in its own right.  The only thing
we have to watch out for there is that it will be a bit of a monster,
so power consumption is going to be high at 350 MHz. Still, for our first
ASIC ever, it's just exciting to think that it's possible at all.

Regarding the NLnet proposals: we need people! In particular, we need two
EU citizens to come forward, to satisfy NLnet's backers' requirements.
Thanks to [NGI.eu](https://ngi.eu), NLnet has received its money under
the EU Horizon 2020 Programme, so at least one EU citizen has to be
part of a proposal. We need one for gcc, another for the MESA/RADV port.
Please do contact me for details. There's no contract or obligation,
because these are charitable donations.

In addition, if anyone wants to receive tax deductible charitable
donations direct from NLnet for working on aspects of this project,
do get in touch: there is plenty to do.  Application reviews start in two
weeks; we will hear from NLnet by December as to what has been approved,
and will be able to expand the project scope around January 2020,
which is just in time for FOSDEM 2020.

Also, remember, if you work for a corporation that could financially
benefit from this project being a reality, sponsorship, via NLnet,
is tax deductible because it is a charitable donation.

(Update: covered in a
[Slashdot](https://hardware.slashdot.org/story/19/09/29/1845252/libre-risc-v-3d-cpugpu-seeks-grants-for-ambitious-expansion#comments)
article)
 175 [Slashdot](https://hardware.slashdot.org/story/19/09/29/1845252/libre-risc-v-3d-cpugpu-seeks-grants-for-ambitious-expansion#comments)
 176 article)