From: Luke Kenneth Casson Leighton
Date: Mon, 17 Dec 2018 14:33:09 +0000 (+0000)
Subject: add update 005
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=7d403fc695d997ed7f24f24c6e37eb6dde57a159;p=crowdsupply.git

add update 005
---

diff --git a/updates/005_2018dec14_simd_without_simd.mdwn b/updates/005_2018dec14_simd_without_simd.mdwn
new file mode 100644
index 0000000..ce952f7
--- /dev/null
+++ b/updates/005_2018dec14_simd_without_simd.mdwn

# Microarchitectural Design by Osmosis

In a series of different descriptions and evaluations, a picture of a
suitable microarchitecture is beginning to emerge, as the process of
talking on [videos](https://youtu.be/DoZrGJIltgU),
[writing out thoughts](https://groups.google.com/forum/#!topic/comp.arch/2kYGFU4ppow),
and then discussing the resultant feedback
[elsewhere](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000261.html)
begins to crystallise, without overloading any one group of people.

There are several things to remember about this design. The primary one is
that it is not explicitly intended as a discrete GPU (although one could
be made from it): it is primarily for an efficient, battery-operated
hand-held device, with just about enough performance to also cover, say,
a low to mid-range Chromebook. Power consumption *for the entire chip* is
targeted at 2.5 watts.

We learned quite quickly that, paradoxically, even a mobile embedded 3D
GPU *requires* extreme numbers of registers (128 floating-point registers),
because it is handling vectors (or "quads", as they are called), and even
pixel data in floating-point format is four 32-bit numbers (including
the transparency). So where a "normal" RISC processor has 32 registers,
a GPU typically has to have four times that number, simply because it is
dealing with four lots of numbers simultaneously. If you don't do this,
that data has to go back down to memory (or at best the L1 cache), and, as
the L1 cache runs a CAM, it is guaranteed to be power-hungry.

A 128-entry register file brings some unique challenges not normally faced
by general-purpose CPUs, and when it becomes possible (or a requirement) to
access those 64-bit registers even down to the byte level, as "elements" in
a vector operation, things get more challenging still. Recall Mitch Alsup's
scoreboard dependency floorplan (reproduced here with kind permission):

{{mitch_ld_st_augmentation.jpg}}

There are two key Dependency Matrices here. On the left is the Function
Unit (rows) to Register File (columns) Matrix, where you can see that, at
the bottom, the CDC 6600's Register File is divided into A, B and X.
On the right is the Function Unit to Function Unit Dependency Matrix,
which ensures that each FU only starts its arithmetic operation once the
FUs it depends on have created the results it needs. That Matrix thus
expresses source-register-to-destination-register dependencies.

Now let's do something hair-raising. Let's do two crazed things at once:
increase the number of registers to a whopping 256 in total (128
floating-point and 128 integer), and at the same time allow those 64-bit
registers to be broken down into **eight** separate 8-bit values... *and
allow Function Unit dependencies to exist on them*!
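As a concrete (and deliberately simplified) illustration, here is a minimal
sketch, in plain Python, of whole-register dependency tracking, loosely in the
style of the FU-to-Register Matrix described above. It is purely illustrative:
the class and method names are invented for this example, the real matrices
are unary, bit-level hardware structures, and a real 6600-style scoreboard
resolves hazards at the operand-read and result-write stages rather than
simply refusing to issue. The one thing it does capture is that every
Function Unit reserves *entire* registers, no matter how few bits it
actually touches:

    # Illustrative sketch only -- not the actual hardware design.
    class WholeRegScoreboard:
        def __init__(self, num_fus, num_regs):
            # dest[fu][reg] is True while FU 'fu' has a pending write to 'reg'
            self.dest = [[False] * num_regs for _ in range(num_fus)]
            # src[fu][reg] is True while FU 'fu' still needs to read 'reg'
            self.src = [[False] * num_regs for _ in range(num_fus)]

        def can_issue(self, dest_reg, src_regs):
            # A new operation may issue only if no in-flight FU is writing or
            # reading its destination register (WAW / WAR), and no in-flight
            # FU is writing any of its source registers (RAW).
            for fu_dest, fu_src in zip(self.dest, self.src):
                if fu_dest[dest_reg] or fu_src[dest_reg]:
                    return False
                if any(fu_dest[r] for r in src_regs):
                    return False
            return True

        def issue(self, fu, dest_reg, src_regs):
            # Reserve the whole destination and all whole source registers.
            self.dest[fu][dest_reg] = True
            for r in src_regs:
                self.src[fu][r] = True

        def complete(self, fu):
            # On write-back, the FU releases everything it reserved.
            self.dest[fu] = [False] * len(self.dest[fu])
            self.src[fu] = [False] * len(self.src[fu])

    sb = WholeRegScoreboard(num_fus=4, num_regs=128)
    sb.issue(fu=0, dest_reg=5, src_regs=[1, 2])       # an 8-bit ADD writing into R5
    print(sb.can_issue(dest_reg=5, src_regs=[3, 4]))  # False: all 64 bits of R5 are "locked"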
What would happen, if we did not properly take this into account in the
design, is that an 8-bit ADD would require us to "lock", say, Register R5
(all 64 bits of it), completely preventing the other 7 bytes of R5 from
being used until such time as that extremely small 8-bit ADD had completed.
Such a design would be laughed at, its performance would be so low. Only
one 8-bit ADD per clock cycle, when Intel has recently added 512-bit SIMD?

So here is a proposed solution. What if, when an 8-bit operation needs to
do a calculation that goes into the 1st byte, the other 7 bytes have their
own **completely separate** dependency lines in the Register and Function
Unit Matrices? It looks like this:

{{reorder_alias_bytemask_scheme.png}}
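To make that picture concrete, here is the same illustrative Python sketch
extended with the byte-mask idea (again, the names and structure are invented
for this example and are not the actual hardware design): each reservation on
a 64-bit register carries an 8-bit mask of the bytes it actually touches, and
two operations conflict only when their masks overlap on the same register.

    # Illustrative sketch only -- not the actual hardware design.
    class ByteMaskScoreboard:
        def __init__(self, num_fus, num_regs):
            # One 8-bit mask per (FU, register): which bytes the FU will write...
            self.dest = [[0] * num_regs for _ in range(num_fus)]
            # ...and which bytes it still needs to read.
            self.src = [[0] * num_regs for _ in range(num_fus)]

        def can_issue(self, dest_reg, dest_mask, srcs):
            # srcs is a list of (register, byte-mask) pairs.
            for fu_dest, fu_src in zip(self.dest, self.src):
                # WAW / WAR hazards only on *overlapping* destination bytes.
                if (fu_dest[dest_reg] | fu_src[dest_reg]) & dest_mask:
                    return False
                # RAW hazards only on overlapping source bytes.
                if any(fu_dest[r] & m for r, m in srcs):
                    return False
            return True

        def issue(self, fu, dest_reg, dest_mask, srcs):
            # Reserve only the byte lanes this operation actually uses.
            self.dest[fu][dest_reg] |= dest_mask
            for r, m in srcs:
                self.src[fu][r] |= m

        def complete(self, fu):
            self.dest[fu] = [0] * len(self.dest[fu])
            self.src[fu] = [0] * len(self.src[fu])

    # An 8-bit ADD into byte 0 of R5 no longer blocks a second 8-bit ADD into
    # byte 1 of R5: the masks 0b0000_0001 and 0b0000_0010 do not overlap, so
    # both can be in flight at the same time.
    sb = ByteMaskScoreboard(num_fus=4, num_regs=128)
    sb.issue(fu=0, dest_reg=5, dest_mask=0b0000_0001,
             srcs=[(1, 0b0000_0001), (2, 0b0000_0001)])
    print(sb.can_issue(dest_reg=5, dest_mask=0b0000_0010,
                       srcs=[(1, 0b0000_0010), (2, 0b0000_0010)]))  # True
    print(sb.can_issue(dest_reg=5, dest_mask=0b0000_0001,
                       srcs=[(3, 0b0000_0001)]))                    # False: byte 0 busy

In effect, each cell of the Dependency Matrices is widened from one bit per
register to eight, one per byte lane, which is what the extra, completely
separate dependency lines in the diagram above represent.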