# RISC-V 3D GPU
-See [[libre_3d_gpu]]
+See:
-at FOSDEM 2018 when Yunsup and the team announced the U540 there was
-some discussion about this: it was one of the questions asked. one of
-the possibilities raised there was that maddog was heading something:
-i've looked for that effort, and have not been able to find it [jon is
-getting quite old, now, bless him. he had to have an operation last
-year. he's recovered well].
+* [[libre_3d_gpu]]
+* [[discussion]]
+* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/>
+* <https://www.crowdsupply.com/libre-risc-v/m-class>
-also at the Barcelona Conference i mentioned in the
-very-very-very-rapid talk on the Libre RISC-V chip that i have been
-tasked with, that if there is absolutely absolutely no other option,
-it will use Vivante GC800 (and, obviously, use etnaviv). what *that*
-means is that there's a definite budget of USD $250,000 available
-which the (anonymous) sponsor is definitely willing to spend... so if
-anyone can come up with an alternative that is entirely libre and
-open, i can put that initiative to the sponsor for evaluation.
+Progress:
-basically i've been looking at this for several months, so have been
-talking to various people (jeff bush from nyuzi [1] and chiselgpu [2],
-frank from gplgpu [3], VRG for MIAOW [4]) to get a feel for what would
-be involved.
-
-* miaow is just an OpenCL engine that is compatible with a subset of
- AMD/ATI's OpenCL assembly code. it is NOT a GPU. they have
- preliminary plans to *make* one... however the development process is
- not open. we'll hear about it if and when it succeeds, probably as
- part of a published research paper.
-
-* nyuzi is a *modern* "software shader / renderer" and is a
- replication of the intel larrabee architecture. it explored the
- concept of doing recursive software-driven rasterisation (as did
- larrabee) where hardware rasterisation uses brute force and often
- wastes time and power. jeff went to a lot of trouble to find out
- *why* intel's researchers were um "not permitted" to actually put
- performance numbers into their published papers. he found out why :)
- one of the main facts that jeff's research reveals (and there are a
- lot of them) is that most of the energy of a GPU is spent getting data
- each way past the L2/L1 cache barrier, and secondly much of the time
- (if doing software-only rendering) you have several instruction cycles
- where in a hardware design you issue one and a separate pipeline takes
- over (see videocore-iv below)
-
-* chiselgpu was an additional effort by jeff to create the absolute
- minimum required tile-based "triangle renderer" in hardware, for
- comparative purposes in the nyuzi raster engine research. synthesis
- of such a block he pointed out to me would actually be *enormous*,
- despite appearances from how little code there is in the chiselgpu
- repository. in his paper he mentions that the majority of the time
- when such hardware-renderers are deployed, the rest of the GPU is
- really struggling to keep up feeding the hardware-rasteriser, so you
- have to put in multiple threads, and that brings its own problems.
- it's all in the paper, it's fascinating stuff.
-
-* gplgpu was done by one of the original developers of the "Number
- Nine" GPU, and is based around a "fixed function" design and as such
- is no longer considered suitable for use in the modern 3D developer
- community (they hate having to code for it), and its performance would
- be *really* hard to optimise and extend. however in speaking to jeff,
- who analysed it quite comprehensively, he said that there were a large
- number of features (4-tuple floating-point colour to 16/32-bit ARGB
- fixed functions) that have retained a presence in modern designs, so
- it's still useful for inspiration and analysis purposes. you can see
- jeff's analysis here [7]
-
-* an extremely useful resource has been the videocore-iv project [8]
- which has collected documentation and part-implemented compiler tools.
- the architecture is quite interesting, it's a hybrid of a
- Software-driven Vector architecture similar to Nyuzi plus
- fixed-functions on separate pipelines such as that "take 4-tuple FP,
- turn it into fixed-point ARGB and overlay it into the tile"
- instruction. that's done as a *single* instruction to cover i think 4
- pixels, where Nyuzi requires an average of 4 cycles per pixel. the
- other thing about videocore-iv is that there is a separate internal
- "scratch" memory area of size 4x4 (x32-bit) which is the "tile" area,
- and focussing on filling just that is one of the things that saves
- power. jeff did a walkthrough, you can read it here [10] [11]
-
-so on this basis i have been investigating a couple of proposals for
-RISC-V extensions: one is Simple-V [9] and the other is a *small*
-general-purpose memory-scratch area extension, which would be
-accessible only on the *other* side of the L1/L2 cache area and *ONLY*
-accessible by an individual core [or its hyperthreads]. small would
-be essential because if a context-switch occurs it would be necessary
-to swap the scratch-area out to main memory (and back).
-general-purpose so that it's useful and useable in other contexts and
-situations.
-
-whilst there are many additional reasons - justifications that make
-it attractive for *general-purpose* usage (such as accidentally
-providing LD.MULTI and ST.MULTI for context-switching and efficient
-function call parameter stack storing, and an accidental
-single-instruction "memcpy" and "memzero") - the primary driver behind
-Simple-V has been as the basis for turning RISC-V into an
-embedded-style (low-power) GPU (and also a VPU).
-
-one of the things that's lacking from
-[RVV](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc)
-is parallelisation of
-Bit-Manipulation. RVV has been primarily designed based on input from
-the Supercomputer community, and as such it's *incredible*.
-absolutely amazing... but only desirable to implementt if you need to
-build a Supercomputer.
-
-Simple-V i therefore designed to parallelise *everything*. custom
-extensions, future extensions, current extensions, current
-instructions, *everything*. RVV, once it's been implemented in gcc
-for example, would require heavy-customisation to support e.g.
-Bit-Manipulation, would require special Bit-Manipulation Vector
-instructions to be added *to RVV*... all of which would need to AGAIN
-go through the Extension Proposal process... you can imagine how that
-would go, and the subsequent cost of maintenance of gcc, binutils and
-so on as a long-term preliminary (or if the extension to RVV is not
-accepted, after all the hard work) even a permanent hard-fork.
-
-in other words once you've been through the "Extension Proposal
-Process" with Simple-V, it need never be done again, not for one
-single parallel / vector / SIMD instruction, ever again.
-
-that would include for example creating a fixed-function 3D "FP to
-ARGB" custom instruction. a custom extension with special 3D
-pipelines would, with Simple-V, not need to also have to worry about
-how those operations would be parallelised.
-
-this is not a new concept: it's borrowed directly from videocore-iv
-(which in turn probably borrowed it from somewhere else).
-videocore-iv call it "virtual parallelism". the Vector Unit
-*actually* has a 4-wide FPU for certain heavily-used operations such
-as ADD, and a ***ONE*** wide FPU for less-used operations such as
-RECIPSQRT.
-
-however at the *instruction* level each of those operations,
-regardless of whether they're heavily-used or less-used they *appear*
-to be 16 parallel operations all at once, as far as the compiler and
-assembly writers are concerned. Simple-V just borrows this exact same
-concept and lets implementors decide where to deploy it, to best
-advantage.
-
-
-> 2. If it’s a good idea to implement, are there any projects currently
-> working on it?
-
-i haven't been able to find any: if you do please do let me know, i
-would like to speak to them and find out how much time and money they
-would need to complete the work.
-
-> If the answer is yes, would you mind mention the project’s name and
-> website?
->
-> If the answer is no, are there any special reasons that nobody not
-> implement it yet?
-
-it's damn hard, it requires a *lot* of resources, and if the idea is
-to make it entirely libre-licensed and royalty-free there is an extra
-step required which a proprietary GPU company would not normally do,
-and that is to follow the example of the BBC when they created their
-own Video CODEC called Dirac [5].
-
-what the BBC did there was create the algorithm *exclusively* from
-prior art and expired patents... they applied for their own patents...
-and then *DELIBERATELY* let them lapse. the way that the patent
-system works, the patents will *still be published*, there will be an
-official priority filing date in the patent records with the full text
-and details of the patents.
-
-this strategy, where you MUST actually pay for the first filing
-otherwise the records are REMOVED and never published, acts as a way
-of preventing and prohibiting unscrupulous people from grabbing the
-whitepapers and source code, and trying to patent details of the
-algorithm themselves just like Google did very recently [6]
-
-* [0] https://www.youtube.com/watch?v=7z6xjIRXcp4
-* [1] https://github.com/jbush001/NyuziProcessor/wiki
-* [2] https://github.com/asicguy/gplgpu
-* [3] https://github.com/jbush001/ChiselGPU/
-* [4] http://miaowgpu.org/
-* [5] https://en.wikipedia.org/wiki/Dirac_(video_compression_format)
-* [6] https://yro.slashdot.org/story/18/06/11/2159218/inventor-says-google-is-patenting-his-public-domain-work
-* [7] https://jbush001.github.io/2016/07/24/gplgpu-walkthrough.html
-* [8] https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Programmers-Manual
-* [9] libre-riscv.org/simple_v_extension/
-* [10] https://jbush001.github.io/2016/03/02/videocore-qpu-pipeline.html
-* [11] https://jbush001.github.io/2016/02/27/life-of-triangle.html
-* OpenPiton https://openpiton-blog.princeton.edu/2018/11/announcing-openpiton-with-ariane/
+* 2017 to Nov 2018: Simple-V specification preliminary draft completed
+* Aug 2018 - Nov 2018: spike-sv implementation of draft spec completed
+* Aug 2018: Kazan Vulkan Driver initiated
+* Sep 2018: mailing list established
+* Sep 2018: Crowdsupply pre-launch page up (for updates)
+* Dec 2018: preliminary floorplan and architecture designed (comp.arch)
# News Articles
--- /dev/null
+# Notes
+
+at FOSDEM 2018 when Yunsup and the team announced the U540 there was
+some discussion about this: it was one of the questions asked. one of
+the possibilities raised there was that maddog was heading something:
+i've looked for that effort, and have not been able to find it [jon is
+getting quite old, now, bless him. he had to have an operation last
+year. he's recovered well].
+
+also at the Barcelona Conference i mentioned in the
+very-very-very-rapid talk on the Libre RISC-V chip that i have been
+tasked with, that if there is absolutely absolutely no other option,
+it will use Vivante GC800 (and, obviously, use etnaviv). what *that*
+means is that there's a definite budget of USD $250,000 available
+which the (anonymous) sponsor is definitely willing to spend... so if
+anyone can come up with an alternative that is entirely libre and
+open, i can put that initiative to the sponsor for evaluation.
+
+basically i've been looking at this for several months, so have been
+talking to various people (jeff bush from nyuzi [1] and chiselgpu [2],
+frank from gplgpu [3], VRG for MIAOW [4]) to get a feel for what would
+be involved.
+
+* miaow is just an OpenCL engine that is compatible with a subset of
+ AMD/ATI's OpenCL assembly code. it is NOT a GPU. they have
+ preliminary plans to *make* one... however the development process is
+ not open. we'll hear about it if and when it succeeds, probably as
+ part of a published research paper.
+
+* nyuzi is a *modern* "software shader / renderer" and is a
+ replication of the intel larrabee architecture. it explored the
+ concept of doing recursive software-driven rasterisation (as did
+ larrabee) where hardware rasterisation uses brute force and often
+ wastes time and power. jeff went to a lot of trouble to find out
+ *why* intel's researchers were um "not permitted" to actually put
+ performance numbers into their published papers. he found out why :)
+ one of the main facts that jeff's research reveals (and there are a
+ lot of them) is that most of the energy of a GPU is spent getting data
+ each way past the L2/L1 cache barrier, and secondly much of the time
+ (if doing software-only rendering) you have several instruction cycles
+ where in a hardware design you issue one and a separate pipeline takes
+ over (see videocore-iv below)
+
+* chiselgpu was an additional effort by jeff to create the absolute
+ minimum required tile-based "triangle renderer" in hardware, for
+ comparative purposes in the nyuzi raster engine research. synthesis
+ of such a block he pointed out to me would actually be *enormous*,
+ despite appearances from how little code there is in the chiselgpu
+ repository. in his paper he mentions that the majority of the time
+ when such hardware-renderers are deployed, the rest of the GPU is
+ really struggling to keep up feeding the hardware-rasteriser, so you
+ have to put in multiple threads, and that brings its own problems.
+ it's all in the paper, it's fascinating stuff.
+
+* gplgpu was done by one of the original developers of the "Number
+ Nine" GPU, and is based around a "fixed function" design and as such
+ is no longer considered suitable for use in the modern 3D developer
+ community (they hate having to code for it), and its performance would
+ be *really* hard to optimise and extend. however in speaking to jeff,
+ who analysed it quite comprehensively, he said that there were a large
+ number of features (4-tuple floating-point colour to 16/32-bit ARGB
+ fixed functions) that have retained a presence in modern designs, so
+ it's still useful for inspiration and analysis purposes. you can see
+ jeff's analysis here [7]
+
+* an extremely useful resource has been the videocore-iv project [8]
+ which has collected documentation and part-implemented compiler tools.
+ the architecture is quite interesting, it's a hybrid of a
+ Software-driven Vector architecture similar to Nyuzi plus
+ fixed-functions on separate pipelines such as that "take 4-tuple FP,
+ turn it into fixed-point ARGB and overlay it into the tile"
+ instruction. that's done as a *single* instruction to cover i think 4
+ pixels, where Nyuzi requires an average of 4 cycles per pixel. the
+ other thing about videocore-iv is that there is a separate internal
+ "scratch" memory area of size 4x4 (x32-bit) which is the "tile" area,
+ and focussing on filling just that is one of the things that saves
+ power. jeff did a walkthrough, you can read it here [10] [11]
+
+so on this basis i have been investigating a couple of proposals for
+RISC-V extensions: one is Simple-V [9] and the other is a *small*
+general-purpose memory-scratch area extension, which would be
+accessible only on the *other* side of the L1/L2 cache area and *ONLY*
+accessible by an individual core [or its hyperthreads]. small would
+be essential because if a context-switch occurs it would be necessary
+to swap the scratch-area out to main memory (and back).
+general-purpose so that it's useful and useable in other contexts and
+situations.
+
+whilst there are many additional reasons - justifications that make
+it attractive for *general-purpose* usage (such as accidentally
+providing LD.MULTI and ST.MULTI for context-switching and efficient
+function call parameter stack storing, and an accidental
+single-instruction "memcpy" and "memzero") - the primary driver behind
+Simple-V has been as the basis for turning RISC-V into an
+embedded-style (low-power) GPU (and also a VPU).
+
+one of the things that's lacking from
+[RVV](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc)
+is parallelisation of
+Bit-Manipulation. RVV has been primarily designed based on input from
+the Supercomputer community, and as such it's *incredible*.
+absolutely amazing... but only desirable to implementt if you need to
+build a Supercomputer.
+
+Simple-V i therefore designed to parallelise *everything*. custom
+extensions, future extensions, current extensions, current
+instructions, *everything*. RVV, once it's been implemented in gcc
+for example, would require heavy-customisation to support e.g.
+Bit-Manipulation, would require special Bit-Manipulation Vector
+instructions to be added *to RVV*... all of which would need to AGAIN
+go through the Extension Proposal process... you can imagine how that
+would go, and the subsequent cost of maintenance of gcc, binutils and
+so on as a long-term preliminary (or if the extension to RVV is not
+accepted, after all the hard work) even a permanent hard-fork.
+
+in other words once you've been through the "Extension Proposal
+Process" with Simple-V, it need never be done again, not for one
+single parallel / vector / SIMD instruction, ever again.
+
+that would include for example creating a fixed-function 3D "FP to
+ARGB" custom instruction. a custom extension with special 3D
+pipelines would, with Simple-V, not need to also have to worry about
+how those operations would be parallelised.
+
+this is not a new concept: it's borrowed directly from videocore-iv
+(which in turn probably borrowed it from somewhere else).
+videocore-iv call it "virtual parallelism". the Vector Unit
+*actually* has a 4-wide FPU for certain heavily-used operations such
+as ADD, and a ***ONE*** wide FPU for less-used operations such as
+RECIPSQRT.
+
+however at the *instruction* level each of those operations,
+regardless of whether they're heavily-used or less-used they *appear*
+to be 16 parallel operations all at once, as far as the compiler and
+assembly writers are concerned. Simple-V just borrows this exact same
+concept and lets implementors decide where to deploy it, to best
+advantage.
+
+
+> 2. If it’s a good idea to implement, are there any projects currently
+> working on it?
+
+i haven't been able to find any: if you do please do let me know, i
+would like to speak to them and find out how much time and money they
+would need to complete the work.
+
+> If the answer is yes, would you mind mention the project’s name and
+> website?
+>
+> If the answer is no, are there any special reasons that nobody not
+> implement it yet?
+
+it's damn hard, it requires a *lot* of resources, and if the idea is
+to make it entirely libre-licensed and royalty-free there is an extra
+step required which a proprietary GPU company would not normally do,
+and that is to follow the example of the BBC when they created their
+own Video CODEC called Dirac [5].
+
+what the BBC did there was create the algorithm *exclusively* from
+prior art and expired patents... they applied for their own patents...
+and then *DELIBERATELY* let them lapse. the way that the patent
+system works, the patents will *still be published*, there will be an
+official priority filing date in the patent records with the full text
+and details of the patents.
+
+this strategy, where you MUST actually pay for the first filing
+otherwise the records are REMOVED and never published, acts as a way
+of preventing and prohibiting unscrupulous people from grabbing the
+whitepapers and source code, and trying to patent details of the
+algorithm themselves just like Google did very recently [6]
+
+* [0] https://www.youtube.com/watch?v=7z6xjIRXcp4
+* [1] https://github.com/jbush001/NyuziProcessor/wiki
+* [2] https://github.com/asicguy/gplgpu
+* [3] https://github.com/jbush001/ChiselGPU/
+* [4] http://miaowgpu.org/
+* [5] https://en.wikipedia.org/wiki/Dirac_(video_compression_format)
+* [6] https://yro.slashdot.org/story/18/06/11/2159218/inventor-says-google-is-patenting-his-public-domain-work
+* [7] https://jbush001.github.io/2016/07/24/gplgpu-walkthrough.html
+* [8] https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Programmers-Manual
+* [9] libre-riscv.org/simple_v_extension/
+* [10] https://jbush001.github.io/2016/03/02/videocore-qpu-pipeline.html
+* [11] https://jbush001.github.io/2016/02/27/life-of-triangle.html
+* OpenPiton https://openpiton-blog.princeton.edu/2018/11/announcing-openpiton-with-ariane/
+