From 3b1801834e9d1993a5b357fd96b29ba2ee79f93b Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Wed, 28 Aug 2019 12:05:18 +0100
Subject: [PATCH] add first draft 2019aug28 update

---
 updates/020_2019aug28_intriguing_ideas.mdwn | 109 ++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 updates/020_2019aug28_intriguing_ideas.mdwn

diff --git a/updates/020_2019aug28_intriguing_ideas.mdwn b/updates/020_2019aug28_intriguing_ideas.mdwn
new file mode 100644
index 0000000..589bf09
--- /dev/null
+++ b/updates/020_2019aug28_intriguing_ideas.mdwn
@@ -0,0 +1,109 @@
+Intriguing Ideas
+
+Pixilica starts a 3D Open Graphics Alliance initiative;
+We decide to go with a "reconfigurable" pipeline;
+
+# The possibility of a 3D Open Graphics Alliance
+
+At SIGGRAPH 2019 there was a very interesting BoF ("Birds of a
+Feather" session), where the
+[idea was put forward](https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d62b6282dc27100170a4a05)
+by Atif, of Pixilica, to use RISC-V as the core
+basis of a 3D Embedded flexible GPGPU (hybrid / general purpose GPU).
+Whilst the idea of a GPGPU has been floated before (in particular by
+ICubeCorp), the reasons *why* were what particularly caught people's
+attention at the BoF.
+
+The current 3D GPU designs (NVIDIA, AMD, Intel) are hugely optimised
+for mass volume appeal. Niche markets, by virtue of the profit
+opportunities being lower or even negative given the design choices of
+the incumbents, are inherently penalised. Not only that, but the source
+code of the 3D engines is proprietary, meaning that anything outside of
+what is dictated by the incumbents is out of the question.
+
+At the BoF, one attendee described how they are implementing *transparent*
+shader algorithms. Most shader hardware provides triangle algorithms that
+assume a solid surface. Using such hardware for transparent shaders is a
+two-pass process which clearly comes with an inherent *100%* performance
+penalty.
+If, on the other hand, they had some input into a new 3D core, one that was designed to be flexible...
+
+The level of interest was sufficiently high that Atif is reaching out to
+people (including our team) to set up a 3D Open Graphics Alliance. The
+basic idea is to have people work together to create an appropriate,
+efficient "Hybrid CPU/GPU" Instruction Set Architecture (ISA) suitable
+for a diverse range of architectures and requirements: all the way from
+small embedded softcores, to embedded GPUs for use in mobile processors,
+to HPC servers, to high-end Machine Learning and Robotics applications.
+
+One interesting thing that has to be made clear, the lesson from Nyuzi
+and Larrabee, is that a good Vector Processor does **not** automatically
+make a good 3D GPU. Jeff Bush designed Nyuzi very specifically to
+replicate the Larrabee team's work. By deliberately not including custom
+3D Hardware Accelerated Opcodes, Nyuzi has only 25% of the performance of a
+modern GPU consuming the same amount of power. Put another way: if you want
+to use a pure Vector Engine to get the same performance as a
+commercially-competitive GPU, you need *four times* the power consumption
+and four times the silicon area.
+
+Thus we simply cannot use the upcoming RISC-V Vector Extension, or even
+SimpleV, and expect to automatically have a commercially competitive
+3D GPU. It takes texture opcodes, Z-Buffers, pixel conversion, Linear
+Interpolation, Transcendentals (sin, cos, exp, log), and much more, all
+of which have to be designed, thought through, implemented *and then used
+behind a suitable API*.
+
+In addition, given that the Alliance is to meet the needs of "unusual"
+markets, it is no good creating an ISA that has such a high barrier to
+entry and such a power-performance penalty that it inherently excludes
+the very implementors it is targeted at, particularly in Embedded markets.
+
+https://youtu.be/HeVz-z4D8os
+
+# Reconfigurable Pipelines
+
+Jacob came up with a fascinating idea: a reconfigurable pipeline. The
+basic idea behind pipelines is that combinatorial blocks are separated
+by latches. The reason is that when gates are chained together,
+there is a ripple effect which has to have time to stabilise. If the
+clock is run too fast, computations no longer have time to become valid.
+
+So the solution is to split the combinatorial blocks into shorter chains,
+and have "latches" in between them which capture the intermediary
+results. This is termed a "pipeline". Actually it's more like an
+escalator.
+
+The problem comes when you want to vary the clock speed. This matters
+because if the pipeline is long and the clock rate is slow, the latency
+(completion time of an instruction) is also long.
+
+Conversely, if the pipeline is short (large numbers of gates connected
+together) then, as mentioned above, this can inherently limit the maximum
+frequency that the processor could run at.
+
+What if there was a solution which allowed *both* options? What if you
+could actually reconfigure the pipeline to be shorter or longer?
+
+It turns out that, by using what are termed "transparent latches", it
+is possible to do precisely that. The advantages are enormous and were
+described in detail on comp.arch:
+
+https://groups.google.com/d/msg/comp.arch/fcq-GLQqvas/SY2F9Hd8AQAJ  
+Earlier in that thread, someone kindly pointed out that IBM published
+papers on the technique. Basically, the latches normally present in the
+pipeline have a combinatorial "bypass" in the form of a Mux. The output
+is dynamically selected from either the input *or* the input after it
+has been put through a flip-flop. The flip-flop basically stores (and
+delays) its input for one clock cycle.
+
+By putting these transparent latches on every other combinatorial stage
+in the processing chain, the length of the pipeline may be halved, such
+that when the clock rate is also halved the *instruction completion time
+remains the same*.
+
+Normally, lowering the processor speed would have an adverse
+impact on instruction latency.
+
+It's a fantastic idea that will allow us to reconfigure the processor
+to reach a 1.5GHz clock rate for high-performance bursts.
+
-- 
2.30.2
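
The latency argument in the "Reconfigurable Pipelines" section can be checked with a little arithmetic. The following is only a rough sketch in Python, not anything taken from the actual design: the `Pipeline` class, the eight-stage depth and the 1.0 ns per-stage delay are all made-up assumptions for illustration. It treats each latch as either registered (costing one clock cycle) or transparent (bypassed through the mux, so the signal ripples into the next stage).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pipeline:
    """Toy timing model (hypothetical) of a pipeline with bypassable latches."""
    stage_delays_ns: List[float]  # combinatorial delay per stage (illustrative values)
    transparent: List[bool]       # True = latch after that stage is bypassed by the mux

    def segments(self) -> List[float]:
        """Combinatorial delay between each pair of latches that still register."""
        if len(self.stage_delays_ns) != len(self.transparent):
            raise ValueError("one latch setting is needed per stage")
        if not self.transparent or self.transparent[-1]:
            raise ValueError("the final latch must stay registered")
        segs, chain = [], 0.0
        for delay, bypass in zip(self.stage_delays_ns, self.transparent):
            chain += delay      # a bypassed latch lets the signal ripple onwards
            if not bypass:      # a registered latch ends the combinatorial chain
                segs.append(chain)
                chain = 0.0
        return segs

    def min_period_ns(self) -> float:
        """Shortest clock period that lets the longest chain settle."""
        return max(self.segments())

    def latency_ns(self, period_ns: float) -> float:
        """Completion time: one clock period per registered latch."""
        if period_ns < self.min_period_ns():
            raise ValueError("clock too fast for the longest combinatorial chain")
        return len(self.segments()) * period_ns


# Assumed example: 8 stages of 1.0 ns each.
full = Pipeline([1.0] * 8, [False] * 8)                     # every latch registered
half = Pipeline([1.0] * 8, [i % 2 == 0 for i in range(8)])  # every other latch bypassed

print(full.latency_ns(1.0))  # 8.0  (long pipeline, full clock rate)
print(full.latency_ns(2.0))  # 16.0 (long pipeline, clock halved: latency doubles)
print(half.latency_ns(2.0))  # 8.0  (halved pipeline, clock halved: latency unchanged)
```

Under these assumptions the halved pipeline needs a clock period twice as long, but only passes through half as many registered latches, so the completion time comes out the same as the full-speed case.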