From: Luke Kenneth Casson Leighton
Date: Sun, 9 Dec 2018 05:25:22 +0000 (+0000)
Subject: scoreboard update
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=4f0045893e5dbcd145b9364476c783ae891f749e;p=crowdsupply.git

# Modernising 1960s Computer Technology: what can be learned from the CDC 6600

Firstly, many thanks to
[Heise.de](https://www.heise.de/newsticker/meldung/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant-4242802.html)
for publishing a story on this project. I replied to some of the
[Heise Forum](https://www.heise.de/forum/heise-online/News-Kommentare/Mobilprozessor-mit-freier-GPU-Libre-RISC-V-M-Class-geplant/forum-414986/comment/)
comments here, using translation software out of respect for the fact
that the forum is in German.

In this update, following on from the analysis of the Tomasulo Algorithm,
I was, by a process of osmosis, finally able to make out a light at the
end of the "Scoreboard" tunnel, and it is not an oncoming train.
My conversations with
[Mitch Alsup](https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/-9JNF0cUCAAJ)
are becoming clearer.

In the previous update, I made it clear that I really did not like the
[Scoreboard](https://en.wikipedia.org/wiki/Scoreboarding) technique
for out-of-order superscalar execution because, *as described*,
it is hopelessly inadequate. There is no roll-back method for
exceptions; there is no method for coping with register "hazards"
(Read-after-Write and so on), so register "renaming" has to be done as
a precursor step; there is no way to do branch prediction; and only a
single LOAD/STORE can be done at any one time.

The only *well-known* documentation on the CDC 6600 Scoreboarding technique
is the 1967 patent.
Here's the kicker: the patent *does not* describe the key strategic
part of Scoreboarding, the part that makes it so powerful and so much
more power-efficient than the Tomasulo Algorithm with Reorder Buffers:
the Dependency Matrices.

Before getting to that, I thought it would be a good idea to
make people aware of a book that Mitch told me about:
"Design of a Computer: the Control Data 6600" by James Thornton.
James worked with Seymour Cray on the 6600, which was literally
constructed from PCB modules using hand-soldered transistors.
Memory was made from magnetic rings (which is where the term "core
memory" comes from), and the bootloader was a bank of toggle switches.

In 2002, someone named Tom Uban sought permission from James and his
wife to make the book available online since, historically, the
CDC 6600 is quite literally the precursor to modern supercomputing:

[[design_of_a_computer_6600_permission.jpg]]

In particular, I wanted to show the Dependency Matrix, which is the
key strategic part of the Scoreboard:

[[design_of_a_computer_6600.jpg]]

Basically, the patent shows a table with src1 and src2, and "ready"
signals. What it does *not* show are the "Go Read" and "Go Write"
signals, nor the way in which one Function Unit blocks others
via the Dependency Matrix.

It is well-known that the Tomasulo Reorder Buffer requires a CAM
on the Destination Register (which is power-hungry and expensive).
The academic literature describes this as data coming "to" the
destination, whereas the Scoreboard technique is described as data
coming "from" the source registers. However, because the Dependency
Matrix is left out of these discussions, what they fail to mention is
that there are *multiple single-line* source wires, which achieve exactly
the same purpose as the Reorder Buffer's CAM with *far less power and
die area*.
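
To make the separate "Go Read" / "Go Write" signals and the "no CAM" point a little more concrete, here is a minimal Python sketch of a Function Unit Dependency Matrix. It is a software toy built under stated assumptions, not a gate-level model and not Thornton's or Mitch's actual circuit. Each row is an integer bit-vector standing in for the single-bit wires. Per-register "writer" and "readers" vectors are indexed directly by register number, so no associative search is needed. WAW hazards simply stall issue, as in the textbook scoreboard. The class and method names (`DependencyMatrix`, `issue`, `go_read`, `go_write`, `read_operands`, `write_result`) are hypothetical. Renaming, speculation and the LOAD/STORE extensions are not modelled.

```python
class DependencyMatrix:
    """Toy software model (assumption, not a hardware design) of a
    CDC 6600-style Function Unit Dependency Matrix.

    Each row is a Python int used as a bit-vector with one bit per
    Function Unit, standing in for the single-bit wires of the matrix.
    """

    def __init__(self, n_fus: int, n_regs: int) -> None:
        self.busy = [False] * n_fus
        self.has_read = [False] * n_fus
        self.dest = [0] * n_fus
        self.srcs = [()] * n_fus
        # read_wait[i] bit j: FU i must wait for FU j's result (Read-after-Write)
        self.read_wait = [0] * n_fus
        # write_wait[i] bit j: FU i must not write until FU j has read (Write-after-Read)
        self.write_wait = [0] * n_fus
        # Per-register vectors, indexed directly by register number: no CAM search
        self.writer = [0] * n_regs   # one-hot: the FU that will write this register
        self.readers = [0] * n_regs  # FUs that have yet to read this register

    def issue(self, fu: int, dest: int, srcs: tuple) -> bool:
        """Allocate a Function Unit; return False if issue must stall."""
        if self.busy[fu] or self.writer[dest]:
            return False  # FU busy, or WAW hazard: the textbook scoreboard stalls
        self.busy[fu], self.has_read[fu] = True, False
        self.dest[fu], self.srcs[fu] = dest, tuple(srcs)
        self.read_wait[fu] = 0
        for r in srcs:
            self.read_wait[fu] |= self.writer[r]  # wait on pending writers
        self.write_wait[fu] = self.readers[dest]  # wait on pending readers of dest
        self.writer[dest] = 1 << fu
        for r in srcs:
            self.readers[r] |= 1 << fu
        return True

    def go_read(self, fu: int) -> bool:
        # "Go Read": every bit in the row is clear (a NOR across the wires)
        return self.busy[fu] and not self.has_read[fu] and self.read_wait[fu] == 0

    def go_write(self, fu: int) -> bool:
        # "Go Write" is independent of "Go Read": operands already read, and
        # no earlier FU still needs the old value of the destination register
        return self.busy[fu] and self.has_read[fu] and self.write_wait[fu] == 0

    def read_operands(self, fu: int) -> None:
        assert self.go_read(fu)
        self.has_read[fu] = True
        bit = 1 << fu
        for r in self.srcs[fu]:
            self.readers[r] &= ~bit
        for i in range(len(self.write_wait)):
            self.write_wait[i] &= ~bit  # release FUs waiting to overwrite our sources

    def write_result(self, fu: int) -> None:
        assert self.go_write(fu)
        bit = 1 << fu
        for i in range(len(self.read_wait)):
            self.read_wait[i] &= ~bit  # release FUs waiting for this result
        self.writer[self.dest[fu]] = 0
        self.busy[fu] = False


dm = DependencyMatrix(n_fus=3, n_regs=8)
dm.issue(0, dest=3, srcs=(1, 2))  # FU0: r3 = r1 op r2
dm.issue(1, dest=1, srcs=(4, 5))  # FU1: r1 = r4 op r5 (must not clobber r1 before FU0 reads it)
dm.issue(2, dest=6, srcs=(3, 1))  # FU2: r6 = r3 op r1 (needs the results of FU0 and FU1)
dm.read_operands(1)
print(dm.go_write(1))  # False: FU0 has not yet read the old r1
dm.read_operands(0)
print(dm.go_write(1))  # True
```

In this sketch FU1 is allowed to read its operands and start computing, but is held back from writing `r1` until FU0 has read the old value. Keeping "Read" and "Write" as separate signals is the property that lets an instruction begin execution without being allowed to complete.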

Not only that: it is quite easy to add incremental register-renaming
tags on top of the Scoreboard + Dependency Matrix, again with no need
for a CAM. Beyond that, in an unpublished book chapter, Mitch describes
techniques that bring in all of the features usually associated
exclusively with Reorder Buffers, such as Branch Prediction, speculative
execution, precise exceptions and multi-issue LOAD/STORE hazard avoidance.
The diagram below is reproduced with Mitch's permission:

[[mitch_ld_st_augmentation.jpg]]

This high-level diagram includes some subtle modifications that
augment a standard CDC 6600 design to allow speculative execution.
A "Schroedinger" wire is added ("neither alive nor dead") which, very
simply put, prohibits a Function Unit from writing its result. Because
the "Read" signals are independent of "Write" (something that, again,
is completely missing from academic discussions of 6600 Scoreboards),
the instruction may *begin* execution but is prevented from *completing*
it.

All that is required is to add one extra line to the Dependency
Matrix per "branch" that is to be speculatively executed: in effect,
each branch is treated just like any other Function Unit.

Mitch also has a high-level diagram of an additional LOAD/STORE Matrix
that, again, has extremely simple rules: LOADs block STOREs and
STOREs block LOADs, and the resulting "Read" / "Write" signals are then
passed down to the Function Unit Dependency Matrix as well. The blocking
rules need only establish that "there is no possibility of a conflict",
rather than "at which exact address a conflict occurs". This in turn
means that the number of address bits needed to detect a conflict may
be significantly reduced. Interestingly, the rules for the RISC-V "Fence"
instruction are based on the same idea.

So this is just amazing. Let's recap.
It's 2018. There are absolutely zero
Libre SoCs in existence anywhere on our planet of nearly 8 billion people,
and we're looking to a 55-year-old computer design, one that occupied an
entire room and was hand-built with transistors, for inspiration on how
to make a modern, power-efficient 3D-capable processor.

Not only that: the project has accidentally unearthed incredibly valuable
historic processor design information that has eluded the Intels and
ARMs (billion-dollar companies) as well as the academic community
for several decades.

I'd like to take a minute to especially thank Mitch Alsup for the time
he has given to ongoing discussions, without which there would be
absolutely no chance that I could have learned about, let alone
understood, any of the above. As I mentioned in the very first update,
new processor designs get one shot at success. Basing the core of the
design on a 55-year-old, well-documented, extremely compact and efficient
design is a reasonable strategy: it's just that, without Mitch's help,
there would have been no way to understand the 6600's true value.

Bottom line: we do not need to follow Intel's power-inefficient lead here.