Firstly, many thanks to
[Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
for publishing a story on this project. I replied to some of the
[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
comments, endeavouring to use translation software to respect that the
forum is in German.

In this update, following on from [the analysis of the Tomasulo
algorithm](https://www.crowdsupply.com/libre-risc-v/m-class/updates/microarchitectural-decisions),
by a process of osmosis I was finally able to make out a light at the
end of the "scoreboard" tunnel, and it is not an oncoming
train. Conversations with [Mitch
Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
are becoming clear, providing insights that, as we will find out
below, have not made it into the academic literature in over 20 years.

In the previous update, I really did not like the
[scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
for doing out-of-order superscalar execution, because, *as described*,
it is hopelessly inadequate. There's no roll-back method for
exceptions; no method for coping with register "hazards" (e.g., read
after write), so register "renaming" has to be done as a precursor
step; no way to do branch prediction; and only a single LOAD/STORE can
be done at any one time.

All of these things have to be added, and the best way to do so is to
absorb the feature known as the "reorder buffer" (and associated
reservation stations) normally associated with the Tomasulo
algorithm. At that point, as noted on
[comp.arch](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/AIMVVS3DBwAJ),
there really is no functional difference between "scoreboarding plus
reorder buffer" and "Tomasulo algorithm plus reorder buffer." Even
the Tomasulo common data bus is present in a functionally-orthogonal
way (see later for details).

The only *well-known* documentation on the CDC 6600 scoreboarding
technique is the 1967 patent. Here's the kicker: the patent *does
not* describe the key strategic part of scoreboarding that makes it so
powerful and much more power-efficient than the Tomasulo algorithm
when combined with reorder buffers: the functional unit's dependency
matrices.

Before getting to that stage, I thought it would be a good idea to
make people aware of a book that Mitch told me about, called "Design
of a Computer: the Control Data 6600" by James Thornton. James worked
with Seymour Cray on the *original design* of the 6600. It was
literally constructed from PCB modules using hand-soldered
transistors. Memory was magnetic rings (which is where we get the
term "core memory" from), and the bootloader was a bank of
toggle-switches. The design was absolutely revolutionary: where all
other computers were managing an instruction every 11 clock cycles,
the 6600 reduced that to **four**. The 7600, its successor, took that
figure even lower.

In 2002, someone named Tom Uban sought permission from James and his
wife to make the book available online, as, historically, the CDC
6600 is quite literally the precursor to modern supercomputing:

{design-of-a-computer-6600-permission | link}

I particularly wanted to show the dependency matrix, which is the
key strategic part of the scoreboard:

{design-of-a-computer-6600 | link}

Basically, the patent shows a table with src1 and src2, and "ready"
signals. What it does *not* show is the "go read" and "go write"
signals, which allowed an instruction to *begin* execution without
*committing* execution, a feature that's usually believed to be
exclusive to Reorder Buffers. Furthermore, the patent certainly does
not show the way in which one function unit blocks others, which is
via the dependency matrix.

It is well known that the Tomasulo Reorder Buffer requires a CAM on
the Destination Register, which is power-hungry and expensive. This
is described in academic literature as data coming "to." The
scoreboard technique is described as data coming "from" source
registers. However, because the dependency matrix is left out of these
discussions (not being part of the patent), what they fail to mention
is that there are *multiple single-line* source wires, thus achieving
the exact same purpose as the reorder buffer's CAM, with *far less
power and die area*.

Mitch's description of this on comp.arch was that the dependency
matrix columns effectively may be viewed as a single-bit-wide "CAM,"
which of course is far less hardware, being just AND gates. However,
it wasn't until he very kindly sent me the chapters of his unpublished
book on the 6600 that the significance of what he was saying actually
sank in. Namely, instead of a merged, multi-wire, very expensive
"destination register" CAM that copies the *value* of the dependent src
register into the reorder buffer (and then has to match it up
afterwards on every clock cycle), the dependency matrix breaks this
down into multiple really simple single-wire comparators that
*preserve* a **direct** link between the src register(s) and the
destination(s) where they're needed. Consequently, the scoreboard and
dependency matrix logic gates take up far less space, and use
significantly less power.

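To make the single-bit "CAM" idea concrete, here is a minimal Python sketch of a register dependency matrix, with one row per function unit and one bit per register. The class and method names are hypothetical, and this is a simplified illustration rather than the logic in Thornton's book or Mitch's chapters: it tracks only read-after-write dependencies and assumes a unit is idle when an instruction is issued to it.

```python
class DependencyMatrix:
    """Toy model: one row of bits per function unit (FU), one bit per register.

    Hypothetical illustration only. Each row is a Python int used as a bit
    vector, so releasing every FU that waits on a register is a single AND
    per row rather than a multi-bit tag comparison.
    """

    def __init__(self, num_units):
        self.pending_write = [0] * num_units  # bit r set: FU will write reg r
        self.waiting_on = [0] * num_units     # bit r set: FU still needs reg r

    def issue(self, unit, dest, srcs):
        """Issue an instruction to an idle `unit`, recording what it waits for."""
        in_flight = 0
        for row in self.pending_write:
            in_flight |= row                  # registers with a write outstanding
        src_bits = 0
        for s in srcs:
            src_bits |= 1 << s
        self.waiting_on[unit] = src_bits & in_flight   # RAW hazards only
        self.pending_write[unit] = 1 << dest

    def can_read(self, unit):
        """'Go read' may be raised once no source bit remains set."""
        return self.waiting_on[unit] == 0

    def write_back(self, unit):
        """FU writes its result: clear that register's column in every row."""
        done = self.pending_write[unit]
        self.pending_write[unit] = 0
        for u in range(len(self.waiting_on)):
            self.waiting_on[u] &= ~done       # one AND per row, no tag compare


dm = DependencyMatrix(num_units=3)
dm.issue(0, dest=1, srcs=[2, 3])   # r1 = r2 op r3
dm.issue(1, dest=4, srcs=[1, 5])   # r4 = r1 op r5, depends on unit 0
print(dm.can_read(0), dm.can_read(1))
dm.write_back(0)
print(dm.can_read(1))
```

In this sketch no register *value* or multi-bit tag is ever copied or compared: the link from producer to consumer is just a bit position that stays put until the producer writes.
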
Not only that, but it is quite easy to add incremental
register-renaming tags on top of the scoreboard + dependency matrix,
again with no need for a CAM. Moreover, in the second unpublished book
chapter, Mitch describes several modifications that together bring in
all of the features usually associated exclusively with reorder
buffers, such as branch prediction, speculative execution, precise
exceptions, and multi-issue LOAD/STORE hazard avoidance. The diagram
below is reproduced with Mitch's permission:

{mitch-ld-st-augmentation | link}

This high-level diagram includes some subtle modifications that
augment a standard CDC 6600 design to allow speculative execution. A
"Schroedinger" wire is added ("neither alive nor dead"), which, very
simply put, prohibits function unit "write" of results (mentioned
earlier as a pre-existing under-recognised key part of the 6600
design). In this way, because the "read" signals were independent of
"write" (something that is again completely missing from the academic
literature in discussions of 6600 scoreboards), the instruction may
*begin* execution, but is prevented from *completing* execution.

All that is required to gain speculative execution on branches is to
add to the dependency matrix one extra line per "branch" that is to be
speculatively executed. The "branch speculation" unit is just like
any other functional unit, in effect. In this way, we gain *exactly*
the same capability as a reorder buffer, including all of the
benefits. The same trick will work just as well for exceptions.

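As a rough illustration of how separate "go read" and "go write" phases allow speculation, the Python sketch below models a single function unit whose write is held back by a "shadow" flag standing in for the "Schroedinger" wire. All names here are hypothetical and the model is deliberately simplistic (one instruction per unit, one outstanding branch); it is not taken from Mitch's diagram.

```python
class FunctionUnit:
    """Toy function unit with separate 'go read' and 'go write' phases.

    Hypothetical model: `shadow` stands in for the "Schroedinger" wire and is
    held high while an earlier branch is still unresolved.
    """

    def __init__(self, name):
        self.name = name
        self.dest = None
        self.result = None     # computed but not yet written
        self.shadow = False    # True: the result may not be written yet

    def go_read(self, regs, op, src1, src2, dest, shadow=False):
        """Read operands and begin execution; nothing is committed yet."""
        self.dest = dest
        self.result = op(regs[src1], regs[src2])
        self.shadow = shadow

    def go_write(self, regs):
        """Commit the result, unless the shadow still prohibits the write."""
        if self.result is None or self.shadow:
            return False
        regs[self.dest] = self.result
        self.result = None
        return True

    def branch_resolved(self, predicted_correctly):
        """Drop the shadow; on a misprediction also discard the result."""
        self.shadow = False
        if not predicted_correctly:
            self.result = None   # cancelled: the register file was never touched


regs = [0, 10, 20, 0]
fu = FunctionUnit("add")
fu.go_read(regs, lambda a, b: a + b, 1, 2, dest=3, shadow=True)
print(fu.go_write(regs), regs[3])    # write is held back while shadowed
fu.branch_resolved(predicted_correctly=True)
print(fu.go_write(regs), regs[3])    # write proceeds once the branch resolves
```

Because the register file is only touched in `go_write`, cancelling a mispredicted instruction needs no roll-back in this model: the result is simply dropped.
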
Mitch also has a high-level diagram of an additional LOAD/STORE matrix
that has, again, extremely simple rules: LOADs block STOREs, and
STOREs block LOADs, and the "read / write" signals are then passed
down to the function unit dependency matrix as well. The rules for the
blocking need only be based on "there is no possibility of a conflict"
rather than "on which exact and precise address does a conflict
occur." This in turn means that the number of address bits needed to
detect a conflict may be significantly reduced, i.e., only the top
bits are needed.

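The "no possibility of a conflict" rule can be sketched in a few lines of Python. The function names and the choice of mask are assumptions made for illustration; the sketch also assumes 32-bit addresses and that no single access crosses a 4 KiB boundary, so that two accesses whose compared bits differ cannot overlap.

```python
def may_conflict(addr_a, addr_b, compare_mask):
    """Conservative check: False means 'definitely no overlap'.

    True only means an overlap cannot be ruled out from the compared bits.
    `compare_mask` (a hypothetical parameter) selects which bits are compared.
    """
    return (addr_a & compare_mask) == (addr_b & compare_mask)


def load_blocked(load_addr, pending_store_addrs, compare_mask):
    """A LOAD waits while any earlier, unfinished STORE might alias it."""
    return any(may_conflict(load_addr, s, compare_mask)
               for s in pending_store_addrs)


# Compare only the top 20 bits of a 32-bit address (assumed widths).
TOP_BITS = 0xFFFF_F000
pending_stores = [0x1234_5010, 0x1234_7F00]

print(load_blocked(0x1234_5ABC, pending_stores, TOP_BITS))  # same top bits as a store
print(load_blocked(0x1234_6000, pending_stores, TOP_BITS))  # top bits match no store
```

The first load is blocked even though it does not touch the same bytes as the store at `0x1234_5010`: a false positive costs a little performance but never correctness, which is what allows the comparators to be narrower than the full address.
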
Interestingly, RISC-V "fence" instruction rules are based on the same
idea, and it may turn out to be possible to leverage the L1 cache line
numbers instead of the (full) address.

Mitch's unpublished book chapters also make clear that the CDC 6600's
register file is designed with "write-through" capability, i.e., that
a register that's written will go through *on the same clock cycle* to
a "read" request. This makes the 6600's register file pretty much
synonymous with the Tomasulo algorithm's "common data bus." This
same-cycle feature *also provides operand forwarding for free*!

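A small behavioural model may help show why write-through amounts to operand forwarding. The Python class below is a hypothetical sketch, not a description of the 6600's actual circuitry: a value written during a cycle is returned to any read of the same register in that cycle, before it has been latched into the storage array.

```python
class WriteThroughRegFile:
    """Toy register file where a write is visible to reads in the same cycle."""

    def __init__(self, num_regs):
        self.regs = [0] * num_regs
        self.writes_this_cycle = {}   # reg index -> value being written now

    def write(self, reg, value):
        """Present a write; it is latched into the array at end of cycle."""
        self.writes_this_cycle[reg] = value

    def read(self, reg):
        """Same-cycle bypass: prefer a value that is being written right now."""
        if reg in self.writes_this_cycle:
            return self.writes_this_cycle[reg]
        return self.regs[reg]

    def end_cycle(self):
        """Clock edge: commit this cycle's writes to the storage array."""
        for reg, value in self.writes_this_cycle.items():
            self.regs[reg] = value
        self.writes_this_cycle.clear()


rf = WriteThroughRegFile(4)
rf.write(2, 99)            # producer writes r2 this cycle
forwarded = rf.read(2)     # consumer reads r2 in the same cycle
rf.end_cycle()
print(forwarded, rf.read(2))
```

The consumer gets the new value without waiting an extra cycle for the array to be updated, which is the effect a dedicated forwarding path would otherwise provide.
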
This is just amazing. Let's recap. It's 2018, there are absolutely no
Libre SoCs in existence anywhere on our planet of 8 billion people,
and we're looking for inspiration on how to make a modern,
power-efficient 3D-capable processor, only to find it in a literally
55-year-old design for a computer that occupied an entire room and was
hand-built with transistors!

Not only that, but the project has accidentally unearthed incredibly
valuable historic processor design information that has eluded the
Intels and ARMs (billion-dollar companies), as well as the academic
community, for several decades.

I'd like to take a minute to especially thank Mitch Alsup for his time
in ongoing discussions, without which there would be absolutely no
chance I could possibly have learned about, let alone understood, any
of the above. As I mentioned in the [very first project
update](https://www.crowdsupply.com/libre-risc-v/m-class/updates/why-make-a-quad-core-64-bit-soc-surely-there-are-enough-already):
new processor designs get one shot at success. Basing the core of the
design on a 55-year-old, well-documented, and extremely compact and
efficient design is a reasonable strategy: it's just that, without
Mitch's help, there would have been no way to understand the 6600's
true value.

The bottom line is, we have a way forward that will result in
significantly less hardware and a simpler design, using a lot less
power than modern designs, yet providing all of the features normally
the exclusive domain of top-end processors, all thanks to a refresh of
a 55-year-old processor and the willingness of Mitch Alsup and James
Thornton to share their expertise with the world.