updates/005_2018dec14_simd_without_simd.mdwn

Spread over various [videos](https://youtu.be/DoZrGJIltgU),
[writings](https://groups.google.com/forum/#!topic/comp.arch/2kYGFU4ppow),
and [mailing list
discussions](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000261.html),
a picture is beginning to emerge of a suitable microarchitecture.

There are several things to remember about this design, the primary
being that it is not explicitly intended as a discrete GPU (although
one could be made). Instead, it is primarily for a battery-operated,
power-efficient hand-held device, where it happens to just about pass
muster in, say, a low to mid-range Chromebook.  Power consumption *for
the entire chip* is targeted at 2.5 watts.

We learned quite quickly that, paradoxically, even a mobile embedded
3D GPU *requires* an extreme number of registers (128 floating-point
registers) because it is handling vectors (or quads as they are
called), and even pixel data, in floating-point format, which means
four 32-bit numbers (including the transparency).  So, where a
"normal" RISC processor has 32 registers, a GPU typically has to have
four times that many simply because it is dealing with four lots of
numbers simultaneously.  If you don't do this, then that data has to
go back down to memory (even to L1 cache), and, as the L1 cache runs a
CAM, it's guaranteed to be power-hungry.

Dealing with 128 registers brings some unique challenges not normally
faced by general purpose CPUs, and when it becomes possible (or a
requirement) to access even down to the byte level of those 64-bit
registers as "elements" in a vector operation, it is even more
challenging.  Recall Mitch Alsup's scoreboard dependency floor plan
(reproduced with kind permission, here):

{mitch-ld-st-augmentation | link}

There are two key dependency matrices here: on the left is the
function unit (rows) to register file (columns) matrix, where you can
see at the bottom that in the CDC 6600 the register file is divided
into A, B and X.  On the right is the function unit to function unit
dependency matrix, which ensures that each function unit only starts
its arithmetic operation when the function units it depends on have
created the results it needs.  Thus, that matrix expresses source
register to destination register dependencies.

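To make the role of the function unit to function unit matrix a little more concrete, here is a minimal software sketch. It is a toy model, not the CDC 6600 logic or the design in the floor plan: the class name `DependencyMatrix`, its methods, and the `pending_writer` dictionary (standing in for the function unit to register matrix) are all made up for illustration, and it only tracks read-after-write dependencies.

```python
class DependencyMatrix:
    """Toy FU-to-FU dependency matrix (read-after-write hazards only)."""

    def __init__(self, num_fus):
        # waits_on[i][j] is True while FU i needs a result FU j has yet to produce.
        self.waits_on = [[False] * num_fus for _ in range(num_fus)]
        # Hypothetical stand-in for the FU-to-register matrix:
        # register name -> function unit that will write it.
        self.pending_writer = {}

    def issue(self, fu, dest, srcs):
        """Place an instruction in function unit `fu` and record what it waits on."""
        for reg in srcs:
            producer = self.pending_writer.get(reg)
            if producer is not None and producer != fu:
                self.waits_on[fu][producer] = True
        self.pending_writer[dest] = fu

    def can_start(self, fu):
        """A function unit may start once its row is entirely clear."""
        return not any(self.waits_on[fu])

    def complete(self, fu):
        """Clear column `fu`, releasing every function unit that waited on it."""
        for row in self.waits_on:
            row[fu] = False
        self.pending_writer = {
            reg: writer for reg, writer in self.pending_writer.items() if writer != fu
        }


matrix = DependencyMatrix(num_fus=3)
matrix.issue(fu=0, dest="r3", srcs=["r1", "r2"])  # e.g. an ADD producing r3
matrix.issue(fu=1, dest="r4", srcs=["r3", "r1"])  # e.g. a MUL that needs r3
print(matrix.can_start(0), matrix.can_start(1))
matrix.complete(0)
print(matrix.can_start(1))
```

Each row answers "what am I still waiting for?" and each column answers "who is waiting for me?", which is why completing an operation is just a matter of clearing one column.
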
Now, let's do something hair-raising.  Let's do two crazed things at once:
increase the number of registers to a whopping 256 total (128
floating-point and 128 integer), and at the same time allow those 64-bit
registers to be broken down into **eight** separate 8-bit values... *and
allow function unit dependencies to exist on them*!

If we didn't properly take this into account in the design, then an
8-bit ADD would require us to "lock", say, Register R5 (all 64 bits of
it), absolutely preventing and prohibiting the other seven bytes of R5
from being used, until such time as that extremely small 8-bit ADD had
completed.  Such a design would be laughed at: its performance would
be far too low.  Only one 8-bit ADD per clock cycle, when Intel has
recently added 512-bit [SIMD](https://en.wikipedia.org/wiki/SIMD)?

Here's a proposed solution.  What if, when an 8-bit operation needs to
do a calculation to go into the first byte, the other seven bytes have
their own **completely separate** dependency lines in the register and
function unit matrices? It looks like this:

{reorder-alias-bytemask-scheme | link}

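The idea of separate per-byte dependency lines can be modelled in software as one 8-bit "busy" mask per 64-bit register. This is only a sketch under that assumption: the real design uses dependency lines in the matrices, not an integer mask, and the names `ByteScoreboard`, `try_claim`, and `release` are hypothetical.

```python
class ByteScoreboard:
    """Toy model: one busy bit per byte of each 64-bit register."""

    def __init__(self, num_regs=128):
        # Bit i of busy[reg] is set while byte i of that register is claimed.
        self.busy = [0] * num_regs

    @staticmethod
    def byte_mask(width_bytes, index):
        """Mask covering element `index` of the given width (1, 2, 4 or 8 bytes)."""
        return ((1 << width_bytes) - 1) << (index * width_bytes)

    def try_claim(self, reg, width_bytes, index):
        """Claim the element if none of its bytes are in use; report success."""
        mask = self.byte_mask(width_bytes, index)
        if self.busy[reg] & mask:
            return False
        self.busy[reg] |= mask
        return True

    def release(self, reg, width_bytes, index):
        """Free the bytes of a previously claimed element."""
        self.busy[reg] &= ~self.byte_mask(width_bytes, index)


sb = ByteScoreboard()
print(sb.try_claim(5, 1, 0))  # 8-bit op on the lowest byte of R5
print(sb.try_claim(5, 1, 1))  # the next byte of R5 is unaffected
print(sb.try_claim(5, 8, 0))  # a full 64-bit op on R5 has to wait
```

With a whole-register lock, the second claim would have been refused as well, which is exactly the performance problem described above.
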
If you recall from the [previous updates about
scoreboards](https://www.crowdsupply.com/libre-risc-v/m-class/updates),
the "scoreboard" itself is not the key: the key, so often misunderstood,
is the pair of register to function unit and function unit to function
unit dependency matrices.  Let's explain the above diagram.
Firstly, in purple in the bottom left, is a massive matrix of function
units to function units, just as with the standard CDC 6600, except
now there are separate 32-bit function units, 16-bit function units,
and 8-bit function units.  In this way, we can have a 32-bit ADD
depending on and waiting for an 8-bit computation, or a 16-bit MUL on
a 32-bit SQRT and so on.  Nothing obviously different there.

Likewise, in the bottom right, in red, we see matrices that have
function units along rows, and registers along the columns, again
exactly as with the CDC 6600 standard scoreboard. However, again, we
note that because there are separate 32-bit function units and
separate 16-bit and 8-bit function units, there are *three* separate
sets of function unit to register matrices.  Also, note that these are
drawn separately, where they would be expected to be grouped together.
Except, they're *not* independent, and that's where the diagram at the
top (middle) comes in.

The diagram at the top says, in words, "if you need a 32-bit register
for an operation (using a 32-bit function unit), operations on the
16-bit and 8-bit function units *also* connected to that exact same
register **must** be prevented from occurring.  Also, if you need eight
bits of a register, whilst it does not prevent the other bytes of the
register from being used, it *does* prevent the overlapping 16-bit
portion **and the 32-bit and the 64-bit** portions of that same named
register from being used."

This "cascading" relationship is absolutely essential to understand.
If you need register R1 (all of it), you **cannot** go and allocate
any of that register for use in any 32-bit, 16-bit, or 8-bit
operations.  This is common sense!  However, if you use the lowest
byte (byte 1), you can still use the top three 16-bit portions of R1,
and you can also still use byte 2.  This is also common sense!

So in fact, it's actually quite simple, and this "cascade" is simply and
easily propagated down to the function unit dependency matrices, stopping
32-bit operations from overwriting 8-bit ones and vice versa.

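The cascade rule can be written down as a small overlap check: claiming one element of a register blocks every portion, at every width, that shares at least one byte with it. The sketch below is purely illustrative; the function names are hypothetical, and indices are counted from zero, so "byte 1" in the text above is index 0 here.

```python
WIDTHS = (8, 16, 32, 64)  # element widths in bits within one 64-bit register


def byte_span(width_bits, index):
    """Byte positions occupied by element `index` of the given width."""
    size = width_bits // 8
    return range(index * size, (index + 1) * size)


def blocked_portions(width_bits, index):
    """For each width, list the element indices that overlap the claimed element."""
    claimed = set(byte_span(width_bits, index))
    blocked = {}
    for width in WIDTHS:
        count = 64 // width
        blocked[width] = [
            i for i in range(count) if claimed & set(byte_span(width, i))
        ]
    return blocked


# Claiming the lowest byte blocks only the portions that contain it.
print(blocked_portions(8, 0))
# Claiming the whole register blocks everything.
print(blocked_portions(64, 0))
```

For the lowest byte, only index 0 appears at each width, so the other seven bytes and the top three 16-bit portions stay available, matching the description above.
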
### Virtual Registers

The fourth part of the above diagram is the grid in green, in the top
left corner.  This is a "virtual" to "real" one-bit table.  It's here
because the size of these matrices is so enormous that there is deep
concern about the line driver strength, as well as the actual size.
128 registers means that one single gate, when it goes high or low,
has to "drive" the input of 128 other gates.  That takes longer and
longer to do, the higher the number of gates, so it becomes a critical
factor in determining the maximum speed of the entire processor.  We
will have to keep an eye on this.

So, to keep the function unit to register matrix size down, this
"virtual" register concept was introduced.  Only one bit in each row
of the green table may be active: it says, for example, "IR1 actually
represents that there is an instruction being executed using R3."
This does mean, however, that if this table is not tall enough (not
enough IRs), the processor has to stall until an instruction
completes and one virtual register becomes free.  Again, another thing
to keep an eye on, in simulations.

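As a rough illustration of the virtual-to-real table, here is a small model in which each row (an "IR") is either free or points at exactly one real register, and allocation fails when every row is in use. The class and method names are hypothetical, and the policy of giving each in-flight use its own row is an assumption, not something taken from the diagram.

```python
class VirtualRegisterTable:
    """Toy virtual-to-real table: each row maps one IR to at most one real register."""

    def __init__(self, num_virtual, num_real=128):
        self.num_real = num_real
        # rows[v] is None when IR v is free, otherwise the real register number.
        self.rows = [None] * num_virtual

    def allocate(self, real_reg):
        """Return the index of a free IR now mapped to real_reg, or None to stall."""
        for v, current in enumerate(self.rows):
            if current is None:
                self.rows[v] = real_reg
                return v
        return None

    def free(self, v):
        """Release an IR once its instruction has completed."""
        self.rows[v] = None

    def one_hot_row(self, v):
        """The row as it would appear in the one-bit table (at most one bit set)."""
        real = self.rows[v]
        return [int(real == r) for r in range(self.num_real)]


table = VirtualRegisterTable(num_virtual=2, num_real=8)
ir = table.allocate(3)
print(ir, table.one_hot_row(ir))
table.allocate(5)
print(table.allocate(6))  # no rows left: the processor would have to stall
```

Storing a register number per row guarantees the "only one bit active" property by construction; the number of rows is the knob that trades matrix size against stalls.
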
### Refinements

The second major concern is the purple matrix, the function unit to
function unit one.  Basically, where previously we would have FU1
cover all ADDs, FU2 cover all MUL operations, FU3 cover BRANCH,
and so on, now we have to multiply those numbers by **four** (64-bit
ops, 32-bit ops, 16-bit, and 8-bit).  With four times as many rows and
four times as many columns, the size of the FU-to-FU matrix has gone up
by a staggering **sixteen** times.  This is not really acceptable, so we
have to do something different.

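The sixteen-fold figure is just the square of the width multiplier, as this short calculation shows. The function name and the base count of 8 function units are arbitrary, chosen only for illustration.

```python
def fu_matrix_cells(base_fus, widths=1):
    """Cells in a square FU-to-FU matrix with `widths` copies of each base function unit."""
    total_fus = base_fus * widths
    return total_fus * total_fus


base = 8  # hypothetical number of function units before splitting by width
single = fu_matrix_cells(base)
split = fu_matrix_cells(base, widths=4)
print(single, split, split // single)
```

The ratio is independent of the base count, because both the rows and the columns are multiplied by the number of widths.
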
The refinement is based on an observation that 16-bit operations of
course may be constructed from 8-bit values, and that 64-bit
operations can be constructed from 32-bit ones.  So, what if we
skipped the cascade on 64-bit and 16-bit, and made the cascade out of
just 32-bit and 8-bit?  Then, very simply, the top half of a 64-bit
source register is allocated to one function unit, the bottom half to
the one next to it, and when it comes to actually passing the source
registers to the relevant ALU, the values are taken from *both*
function units.

For 3D, the primary focus is on 32-bit (single-precision
floating-point) performance anyway, so if 64-bit operations happen to
have half the number of reservation stations / function units, and
block more often, we actually don't mind so much.  Also, we can still
apply the same "banks" trick on the register file, except this time
with four-way multiplexing on 32-bit wide banks, and 4 x 4 crossbars
on the bytes as well:

{register-file-multiplexing | link}

To cope with 16-bit operations, pairs of 8-bit values in adjacent function
units are reserved.  Likewise for 64-bit operations, the 8-bit crossbars
are not used, and pairs of 32-bit source values in adjacent Function Units
in the *32-bit* function unit area are reserved.

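The pairing scheme can be sketched as a reservation routine that hands out either one function unit or two adjacent ones, depending on the operation width. Everything here is illustrative: `FunctionUnitBank` and `reserve_for` are hypothetical names, and requiring pairs to start on an even index is an assumption made to keep the example simple.

```python
class FunctionUnitBank:
    """Toy pool of same-width function units that can be reserved singly or in pairs."""

    def __init__(self, count):
        self.busy = [False] * count

    def reserve(self, units_needed):
        """Reserve an aligned group of adjacent free units; return its first index or None."""
        last_start = len(self.busy) - units_needed
        for start in range(0, last_start + 1, units_needed):
            group = range(start, start + units_needed)
            if not any(self.busy[i] for i in group):
                for i in group:
                    self.busy[i] = True
                return start
        return None


def reserve_for(op_bits, bank32, bank8):
    """Map an operation width onto the 32-bit or 8-bit function unit area."""
    if op_bits == 64:
        return bank32.reserve(2)  # two adjacent 32-bit units
    if op_bits == 32:
        return bank32.reserve(1)
    if op_bits == 16:
        return bank8.reserve(2)   # two adjacent 8-bit units
    if op_bits == 8:
        return bank8.reserve(1)
    raise ValueError(f"unsupported width: {op_bits}")


bank32, bank8 = FunctionUnitBank(4), FunctionUnitBank(4)
print(reserve_for(32, bank32, bank8))
print(reserve_for(64, bank32, bank8))
print(reserve_for(64, bank32, bank8))  # no free pair left, so this one blocks
```

With four 32-bit units there are only two possible 64-bit slots, which is the "half the number of function units, and block more often" trade-off mentioned earlier.
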
However, the gate count in such a staggered crossbar arrangement is
insane: bear in mind that this will be 3R1W or 2R1W (3 or 2 reads, 1
write per register), and that means up to **three** sets of crossbars,
each spanning **four** banks, with effectively 16 byte to 16 byte
routing.

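To get a feel for the scale, here is a deliberately crude count of switching points, assuming each read port gets a full 16 byte to 16 byte crossbar. That assumption, the function name, and the parameter names are ours; the result is a count of crosspoints, not a gate count, and the actual staggered arrangement would differ.

```python
def crossbar_crosspoints(read_ports=3, banks=4, bank_bytes=4, bits_per_byte=8):
    """Crude crosspoint count for full byte-level crossbars, one per read port."""
    lanes = banks * bank_bytes                   # byte lanes across all banks
    per_port = lanes * lanes * bits_per_byte     # every input byte to every output byte
    return lanes, per_port, read_ports * per_port


lanes, per_port, total = crossbar_crosspoints()
print(f"{lanes} byte lanes, {per_port} crosspoints per port, {total} in total")
```

Even before counting the control logic, the quadratic growth in byte lanes is what makes this arrangement so expensive.
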
 164 routing.
 165
 166 It's too much - so in later updates, this will be explored further.