updates/004_2018dec06_microarchitecture_cont.mdwn

Firstly, many thanks to
[Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
for publishing a story on this project. I replied to some of the
[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
comments, using translation software out of respect for the fact that
the forum is in German.

In this update, following on from [the analysis of the Tomasulo
algorithm](https://www.crowdsupply.com/libre-risc-v/m-class/updates/microarchitectural-decisions),
by a process of osmosis I was finally able to make out a light at the
end of the "scoreboard" tunnel, and it is not an oncoming
train. Conversations with [Mitch
Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
are starting to make sense, providing insights that, as we will find
out below, have not made it into the academic literature in over 20
years.

In the previous update, I really did not like the
[scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
for doing out-of-order superscalar execution, because, *as described*,
it is hopelessly inadequate. There is no roll-back method for
exceptions; no method for coping with register "hazards" (e.g., write
after read), so register "renaming" has to be done as a precursor
step; no way to do branch prediction; and only a single LOAD/STORE can
be done at any one time.

All of these things have to be added, and the best way to do so is to
absorb the feature known as the "reorder buffer" (and associated
reservation stations) normally associated with the Tomasulo
algorithm. At which point, as noted on
[comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ),
there really is no functional difference between "scoreboarding plus
reorder buffer" and "Tomasulo algorithm plus reorder buffer." Even
the Tomasulo common data bus is present in a functionally equivalent
form (see later for details).

The only *well-known* documentation on the CDC 6600 scoreboarding
technique is the 1967 patent. Here's the kicker: the patent *does
not* describe the key strategic part of scoreboarding, the part that
makes it so powerful and much more power-efficient than the Tomasulo
algorithm combined with reorder buffers: the functional unit
dependency matrices.

Before getting to that stage, I thought it would be a good idea to
make people aware of a book that Mitch told me about, called "Design
of a Computer: the Control Data 6600" by James Thornton. James worked
with Seymour Cray on the *original design* of the 6600. It was
literally constructed from PCB modules using hand-soldered
transistors. Memory was made of magnetic rings (which is where we get
the term "core memory" from), and the bootloader was a bank of
toggle-switches. The design was absolutely revolutionary: where all
other computers were managing an instruction every 11 clock cycles,
the 6600 reduced that to **four**. The 7600, its successor, took that
figure even lower.

In 2002, someone named Tom Uban sought permission from James and his
wife to make the book available online because, historically, the CDC
6600 is quite literally the precursor to modern supercomputing:

{design-of-a-computer-6600-permission | link}

I particularly wanted to show the dependency matrix, which is the
key strategic part of the scoreboard:

{design-of-a-computer-6600 | link}

Basically, the patent shows a table with src1 and src2, and "ready"
signals. What it does *not* show is the "go read" and "go write"
signals, which allowed an instruction to *begin* execution without
*committing* execution, a feature that's usually believed to be
exclusive to reorder buffers. Furthermore, the patent certainly does
not show the way in which one function unit blocks others, which is
via the dependency matrix.

It is well known that the Tomasulo reorder buffer requires a CAM on
the destination register, which is power-hungry and expensive. This
is described in academic literature as data coming "to." The
scoreboard technique is described as data coming "from" source
registers. However, because the dependency matrix is left out of these
discussions (not being part of the patent), what they fail to mention
is that there are *multiple single-line* source wires, thus achieving
the exact same purpose as the reorder buffer's CAM, with *far less
power and die area*.

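To make the contrast concrete, here is a minimal Python sketch. It is an illustration only, not taken from the patent or from Thornton's book: `cam_match` models a CAM that compares a multi-bit register tag against every entry, while `unary_match` models the same lookup on one-hot register vectors, where testing a column is a single AND per row. The function names and the example values are hypothetical.

```python
def cam_match(entries, tag):
    """CAM-style lookup: compare a binary register tag against every entry.

    In hardware, each entry needs a comparator as wide as the tag.
    """
    return [i for i, entry_tag in enumerate(entries) if entry_tag == tag]


def unary_match(rows, reg):
    """Dependency-matrix-style lookup on one-hot (unary) register vectors.

    Each row is an integer with one bit per register, so testing one
    register column is a single AND per row.
    """
    column = 1 << reg
    return [i for i, row in enumerate(rows) if row & column]


# Three entries waiting on registers 3, 5 and 3 respectively.
binary_tags = [3, 5, 3]
unary_rows = [1 << 3, 1 << 5, 1 << 3]

# Both lookups find the same entries; only the cost per entry differs.
assert cam_match(binary_tags, 3) == unary_match(unary_rows, 3) == [0, 2]
```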
Mitch's description of this on comp.arch was that the dependency
matrix columns effectively may be viewed as a single-bit-wide "CAM,"
which of course is far less hardware, being just AND gates. However,
it wasn't until he very kindly sent me the chapters of his unpublished
book on the 6600 that the significance of what he was saying actually
sank in. Namely, that instead of a merged, multi-wire, very expensive
"destination register" CAM, copying the *value* of the dependent src
register into the reorder buffer (and then having to match it up
afterwards on every clock cycle), the dependency matrix breaks this
down into multiple, very simple single-wire comparators that
*preserve* a **direct** link between the src register(s) and the
destination(s) where they're needed. Consequently, the scoreboard and
dependency matrix logic gates take up far less space, and use
significantly less power.

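The following is a toy Python sketch of how such a matrix might behave, under simplifying assumptions of my own choosing rather than the 6600's actual logic. Registers are one-hot bit vectors, and each function unit records, at issue time, which earlier units it must wait on before it may read (read after write) or write (write after read). The class and method names are hypothetical, and write-after-write hazards are deliberately left out.

```python
class Scoreboard:
    """Toy scoreboard: one-hot register vectors plus FU-to-FU wait bits."""

    def __init__(self, num_fus):
        self.num_fus = num_fus
        self.busy = [False] * num_fus
        self.src = [0] * num_fus         # one-hot OR of registers still to be read
        self.dest = [0] * num_fus        # one-hot destination register
        self.read_wait = [0] * num_fus   # bit j: wait for FU j to write first
        self.write_wait = [0] * num_fus  # bit j: wait for FU j to read first

    def issue(self, fu, dest, srcs):
        src_mask = 0
        for reg in srcs:
            src_mask |= 1 << reg
        dest_mask = 1 << dest
        self.read_wait[fu] = 0
        self.write_wait[fu] = 0
        for other in range(self.num_fus):
            if other == fu or not self.busy[other]:
                continue
            if self.dest[other] & src_mask:   # earlier FU writes one of our sources
                self.read_wait[fu] |= 1 << other
            if self.src[other] & dest_mask:   # earlier FU still reads our destination
                self.write_wait[fu] |= 1 << other
        self.busy[fu] = True
        self.src[fu] = src_mask
        self.dest[fu] = dest_mask

    def go_read(self, fu):
        return self.busy[fu] and self.read_wait[fu] == 0

    def go_write(self, fu):
        # Operands must already be read, and no earlier reader may remain.
        return self.busy[fu] and self.src[fu] == 0 and self.write_wait[fu] == 0

    def did_read(self, fu):
        self.src[fu] = 0
        for other in range(self.num_fus):
            self.write_wait[other] &= ~(1 << fu)

    def did_write(self, fu):
        self.busy[fu] = False
        self.dest[fu] = 0
        for other in range(self.num_fus):
            self.read_wait[other] &= ~(1 << fu)


sb = Scoreboard(num_fus=2)
sb.issue(0, dest=1, srcs=[2, 3])   # FU0: r1 = r2 op r3
sb.issue(1, dest=4, srcs=[1, 5])   # FU1: r4 = r1 op r5, needs FU0's result
assert sb.go_read(0) and not sb.go_read(1)
sb.did_read(0)
sb.did_write(0)
assert sb.go_read(1)               # FU0 has written r1, so FU1 may now read
```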
Not only that, but it is quite easy to add incremental
register-renaming tags on top of the scoreboard + dependency matrix:
again, no need for a CAM. Moreover, in the second unpublished book
chapter, Mitch describes several techniques that, between them, bring
in all of the features that are usually exclusively associated with
reorder buffers, such as branch prediction, speculative execution,
precise exceptions, and multi-issue LOAD/STORE hazard avoidance. The
diagram below is reproduced with Mitch's permission:

{mitch-ld-st-augmentation | link}

This high-level diagram includes some subtle modifications that
augment a standard CDC 6600 design to allow speculative execution. A
"Schroedinger" wire is added ("neither alive nor dead"), which, very
simply put, prohibits function unit "write" of results (mentioned
earlier as a pre-existing under-recognised key part of the 6600
design). In this way, because the "read" signals were independent of
"write" (something that is again completely missing from the academic
literature in discussions of 6600 scoreboards), the instruction may
*begin* execution, but is prevented from *completing* execution.

All that is required to gain speculative execution on branches is to
add to the dependency matrix one extra line per "branch" that is to be
speculatively executed. The "branch speculation" unit is just like
any other functional unit, in effect. In this way, we gain *exactly*
the same capability as a reorder buffer, including all of the
benefits. The same trick will work just as well for exceptions.

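As a rough illustration of the idea, the Python sketch below models one "shadow" bit per unresolved branch on each function unit. It is a simplification under my own assumptions, not a transcription of Mitch's diagram: a unit's result may exist, but it cannot be written while any shadow bit is set, and a mispredicted branch cancels every unit issued under it. The names `ShadowedUnit` and `resolve_branch` are hypothetical.

```python
class ShadowedUnit:
    """A function unit whose result write can be held back by unresolved branches."""

    def __init__(self, name):
        self.name = name
        self.shadow = 0            # bit b set: issued under unresolved branch b
        self.cancelled = False
        self.result_ready = False

    def may_write(self):
        # "Schroedinger" condition: the result exists but is not yet committed.
        return self.result_ready and self.shadow == 0 and not self.cancelled


def resolve_branch(units, branch, predicted_correctly):
    """Clear one branch's shadow line, or cancel everything issued under it."""
    bit = 1 << branch
    for unit in units:
        if unit.shadow & bit:
            if predicted_correctly:
                unit.shadow &= ~bit
            else:
                unit.cancelled = True


add = ShadowedUnit("add")
add.shadow = 1 << 0                # issued speculatively, after branch 0
add.result_ready = True            # execution began and finished...
assert not add.may_write()         # ...but the write is held back
resolve_branch([add], branch=0, predicted_correctly=True)
assert add.may_write()             # prediction confirmed, write released
```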
Mitch also has a high-level diagram of an additional LOAD/STORE matrix
that has, again, extremely simple rules: LOADs block STOREs, and
STOREs block LOADs, and the signals "read / write" are then passed
down to the function unit dependency matrix as well. The rules for the
blocking need only be based on "there is no possibility of a conflict"
rather than "on which exact and precise address does a conflict
occur". This in turn means that the number of address bits needed to
detect a conflict may be significantly reduced, i.e., only the top
bits are needed.

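A minimal Python sketch of such a conservative check follows. It is an assumption-laden illustration, not Mitch's design: the low address bits are dropped (6 bits here, i.e., a 64-byte granule, chosen arbitrarily), and two accesses are treated as a possible conflict whenever the remaining upper bits match. The function names and the `(kind, address)` tuple layout are hypothetical.

```python
def may_conflict(addr_a, addr_b, dropped_bits=6):
    """Return True unless the two addresses are certainly distinct.

    Only the upper bits are compared. Equal upper bits mean "possible
    conflict"; different upper bits prove the addresses differ. Dropping
    more bits costs less hardware but gives more false positives.
    """
    return (addr_a >> dropped_bits) == (addr_b >> dropped_bits)


def blocked(new_op, pending):
    """LOADs block STOREs and STOREs block LOADs on a possible conflict.

    new_op and each pending entry are (kind, address) tuples, with kind
    being "LD" or "ST".
    """
    kind, addr = new_op
    return any(
        other_kind != kind and may_conflict(addr, other_addr)
        for other_kind, other_addr in pending
    )


pending = [("ST", 0x1000)]
assert blocked(("LD", 0x1008), pending)        # same 64-byte granule
assert not blocked(("LD", 0x2000), pending)    # certainly no conflict
```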
Interestingly, RISC-V "fence" instruction rules are based on the same
idea, and it may turn out to be possible to leverage the L1 cache line
numbers instead of the (full) address.

Also, Mitch's unpublished book chapters make clear that the CDC 6600's
register file is designed with "write-through" capability, i.e., that
a register that's written will go through *on the same clock cycle* to
a "read" request. This makes the 6600's register file pretty much
synonymous with the Tomasulo algorithm's "common data bus." This
same-cycle feature *also provides operand forwarding for free*!

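The behaviour can be sketched in a few lines of Python. This is a cycle-level toy model under my own assumptions, not a description of the 6600's circuitry: a read of a register that is written in the same cycle returns the new value, which is exactly what operand forwarding provides. The class name and the `cycle` interface are hypothetical.

```python
class WriteThroughRegfile:
    """Register file where same-cycle writes are visible to reads."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs

    def cycle(self, writes, reads):
        """Simulate one clock cycle.

        writes: dict mapping register number to the value written this cycle.
        reads: list of register numbers read this cycle.
        Returns the values seen by the reads, in order.
        """
        # A register written this cycle is read with its new value.
        values = [writes.get(reg, self.regs[reg]) for reg in reads]
        for reg, value in writes.items():
            self.regs[reg] = value
        return values


rf = WriteThroughRegfile(num_regs=4)
# r1 is written and read in the same cycle: the reader sees 42 at once.
assert rf.cycle(writes={1: 42}, reads=[1, 2]) == [42, 0]
```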
This is just amazing. Let's recap. It's 2018, there are absolutely no
Libre SoCs in existence anywhere on our planet of 8 billion people,
and we're looking for inspiration on how to make a modern,
power-efficient 3D-capable processor, only to find it in a literally
55-year-old design for a computer that occupied an entire room and was
hand-built with transistors!

Not only that, but the project has accidentally unearthed incredibly
valuable historic processor design information that has eluded the
Intels and ARMs (billion-dollar companies), as well as the academic
community, for several decades.

I'd like to take a minute to especially thank Mitch Alsup for his time
in ongoing discussions, without which there would be absolutely no
chance I could possibly have learned about, let alone understood, any
of the above. As I mentioned in the [very first project
update](https://www.crowdsupply.com/libre-risc-v/m-class/updates/why-make-a-quad-core-64-bit-soc-surely-there-are-enough-already):
new processor designs get one shot at success. Basing the core of the
design on a 55-year-old, well-documented, and extremely compact and
efficient design is a reasonable strategy: it's just that, without
Mitch's help, there would have been no way to understand the 6600's
true value.

The bottom line is, we have a way forward that will result in
significantly less hardware and a simpler design, using a lot less
power than modern designs, yet providing all of the features normally
the exclusive domain of top-end processors, all thanks to a refresh of
a 55-year-old processor and the willingness of Mitch Alsup and James
Thornton to share their expertise with the world.