From: Luke Kenneth Casson Leighton
Date: Wed, 2 Jan 2019 01:16:15 +0000 (+0000)
Subject: add register overwrites update
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=66bc5cb6ac3c6fbe71ca77e672cbb94bf61490f0;p=crowdsupply.git

add register overwrites update
---

diff --git a/updates/009_register_overwrites.mdwn b/updates/009_register_overwrites.mdwn
new file mode 100644
index 0000000..396e4c0
--- /dev/null
+++ b/updates/009_register_overwrites.mdwn
@@ -0,0 +1,165 @@
+# "Name-less" register exception handling
+
+In this
+[comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0)
+post a scheme has been outlined that, if added to a precise-exception
+augmented CDC 6600 style Scoreboard, would allow less load on the register
+file (fewer reads and writes) and still guarantee precise exception handling.
+
+The goal here is to reduce the number of reads and writes to the register
+file, because, quite simply put, doing so saves power and reduces contention
+for the limited resource of the data buses between the ALUs and the register
+file. Why limited resource? Because keeping four or more ALUs fully occupied
+with, for example, an FMAC operation requires 3 READ ports and 1 WRITE port
+*per ALU*. If those are vectorised predicated FMAC operations, it's an even
+higher READ count than that.
+
+Four parallel FMACs initiated per clock require a whopping **TWELVE** READ
+ports and four WRITE ports. This is completely insane and it is why the
+register file has been subdivided into four separate banks.
+
+There are certain standard "cells", including in FPGAs - pre-designed layouts -
+for register files. The typical layout is 2R1W (2 read ports, 1 write port,
+per clock cycle). Therefore, keeping to that will not only reduce power
+consumption, it will reduce the development cost for the project, as well.
+
+It turns out that with FMAC (Floating-point Multiply and Accumulate) operations,
+the destination register is usually also the (additive) source register,
+in a sequential chain of FMACs. So, actually... aside from the very first
+FMAC in the chain, if operand "forwarding" is available in the architecture,
+then it is only the two numbers being multiplied (and then added)
+that need to be read from the register file. That nicely meshes with the
+whole "2R1W" thing.
+
+[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding)
+is basically that the result from one instruction is "forwarded"
+*directly* to the source input of a dependent instruction. In the CDC 6600
+this, interestingly, is achieved through a special design of Register File,
+where if a register is being read at the same time (on the same clock cycle)
+as it is being written, it is "passed through" (literally). "Normal"
+(modern) Register File designs simply do not do this, meaning that a
+dependent operation would have to wait an additional cycle: hence the
+reason why the concept of "Operand Forwarding" was "invented"... even though
+the 6600 had implemented it 55 years ago.
+
+The "Banks" which are planned to be used in the Libre RISC-V SoC present a
+bit of a problem as far as forwarding is concerned, even if they include
+6600-style same-clock "write-through" capability (aka Operand Forwarding).
+The issue is that whilst there are multiplexers planned to be added to the
+source (**after** the reads are performed), there are **no** multiplexers
+planned to be added before the **destination** registers are written.
+Therefore, the plan is to add an additional "forwarding" Bus which can
+"bypass" the register file entirely.
+
+This is apparently fairly standard practice in high-performance modern
+micro-architectures.
+The problem is, however: if the register is identified and marked as
+"not to be written back to the register file", and an exception occurs,
+how on earth do you ensure that the system state is stable, i.e. not
+corrupted? Most modern systems have a "rollback" mechanism to deal with this.
+
+Before we get there, however, let's back up a little bit, and go over
+the example shown
+[here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
+in more depth. This is the sequence:
+
+    ADD r1, r2, #5
+    ADD r2, r1, #5
+    ADD r1, r2, #5
+
+Note that instruction 3 actually overwrites R1; however, R1 is used as
+a *source* register in instruction 2. So what that means is, if we have
+Tomasulo-style Reservation Stations on the Function Units, we don't
+*actually* need to write R1 from instruction 1 into the Register File
+at all! We can in fact simply use the fact that it will be sitting in
+the Function Unit's Reservation Station, use "Operand Forwarding" to
+pass it to instruction 2, and, once instruction 2 is underway, throw
+the instruction 1 R1 result **away**. We achieve this by noting that
+Instruction 3 "overwrites" Instruction 1's R1 as a destination, and,
+whilst all three ALUs are still busy with pipelined processing, "mark"
+the Function Unit handling instruction 1 as "nameless". The "name"
+of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
+FU 1 is handling instruction 1).
+
+Now we have the context, let's return to the bit about exceptions, and
+assume that instruction 2 throws one (ADDs do not normally do that:
+let's assume that they can, for now). Note that these are the conditions:
+
+* Precise Exception handling has been added (by adding a "schroedinger"
+  wire plus a write-hazard block that prevents down-stream instructions
+  from "committing" (writing) until such time as the up-stream instruction
+  absolutely knows that there will not be an exception).
+* When an up-stream instruction knows that it has passed (cleared) the
+  hurdle of potentially needing to throw an exception, it **drops** the
+  write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
+  dependent instructions to be free of write hazards, and thus commit.
+  (However, that's not happening here: instruction 2 **has** flipped
+  the "schroedinger" wire to "Go\_Die").
+* R1 from instruction 1 has been **specifically** marked as **not** to
+  be written to the Register File: it has been renamed to "nameless"
+  (FU1.#n).
+* R1 from instruction 1 is also a source register of instruction 2.
+* Instruction 2 is to be "rolled back".
+* Instruction 3 is to be told to die as well (instruction 2 has flipped
+  the "Go\_Die" signal).
+
+...um, what do we do about the value "FU1.#n"? Instruction 3 told it
+that it was no longer permitted to write to the Register File, except that
+now Instruction 3 is dead! Instruction 1 has absolutely no place to put
+that value. Should we discard Instruction 1 **as well**?? How far back does
+this go? This is completely wasteful of resources! More than that, what
+if we have a multi-issue engine, which issues multiple
+instructions in this "nameless" fashion, where they get rolled back
+again and again in an endless loop?
+
+This is where modern micro-architectures get a little unstuck: apparently
+what they do is roll back to where there are **no** "nameless" registers,
+then **disable** multi-issue instruction execution, **disable** the
+"nameless" capability, and slowly move forward one instruction at a time
+until the exception is re-encountered.
+This basically ensures that when the exception is encountered, absolutely
+all of the registers may be (or are already) committed to the Register File.
+At that point, a trap handler knows that it can safely context-switch, or
+do whatever it likes, confident that the Register File Architectural State
+is sane.
+
+This approach is extremely wasteful of resources, and sub-optimal. In a
+design that is supposed to be power-efficient, there's an obligation to
+"Do Better". Hence the scheme below.
+
+# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement.
+
+In CDC 6600 Terminology there is something called the "Q-Table", which
+is basically an array, indexed by Register number, which keeps a record
+of which Function Unit (relative to instruction order) last had that
+Register as its Destination. This is directly equivalent to and
+completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
+entry (except that in 6600 Scoreboarding it's not a CAM).
+
+The problem with "nameless" Operand Forwarding is: whenever a Q-Table
+entry (for any given FU) is overwritten, that's it: that instruction
+**absolutely cannot** "roll back". The critical information that would allow
+the prior Function Unit (the "overwritee") to be restored has just been destroyed.
+
+There is a simple solution to that: provide a *Queue* of Q-Table entries.
+
+Now we have exactly the information needed to "roll back", should an
+exception occur. Like many augmentations and enhancements to the 6600
+Scoreboard system, it's kind-of obvious in retrospect. However, the *real*
+"duh" moment, as posted on comp.arch, is to always ensure that FUs that
+are providing "nameless" data in their destination latches will never
+let down-stream dependent instructions commit if any of those down-stream
+instructions could potentially hit an exception.
+
+Why is that important? It's because it's not enough to know that the
+down-stream (dependent) instructions have all initiated (read the
+FU's dest latch and taken it as a forwarded src operand). If **even one**
+of those instructions throws an exception, the "nameless" FU is hosed.
+So, firstly: the "nameless" FU absolutely has to wait until its dependencies
+are clear of exceptions (and then, **and only then**, may it safely drop (throw
+away) the data without writing it to the Register File); and secondly,
+the "nameless" FU absolutely has to know that it can "roll back" from
+"nameless" to a "named" state, in the event that one of its dependent
+instructions does indeed throw an exception. This is where the "History"
+Q-Table Entries come into play.
+
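
The "Queue of Q-Table entries" idea can be sketched roughly as below. This is only an illustration in Python, not the planned hardware design: the class name `QTableHistory`, its method names, and the use of strings such as `"FU1"` as Function Unit identifiers are all assumptions made for the example. Each register keeps a history of the FUs that targeted it, so that when a down-stream instruction is told to die, the overwritten ("nameless") FU can get its name back.

```python
from collections import deque


class QTableHistory:
    """Hypothetical model: per-register history of destination Function Units."""

    def __init__(self, num_regs):
        # history[r] lists the FUs targeting register r, oldest first.
        # The last entry is the FU that currently "owns" the register name.
        self.history = [deque() for _ in range(num_regs)]

    def current(self, reg):
        """FU that currently owns `reg`, or None if the Register File does."""
        h = self.history[reg]
        return h[-1] if h else None

    def issue(self, reg, fu):
        """Record `fu` as the newest writer of `reg`.

        Returns the FU it overwrites (a candidate for going "nameless"),
        or None if nothing was in flight for that register.
        """
        overwritten = self.current(reg)
        self.history[reg].append(fu)
        return overwritten

    def rollback(self, reg, fu):
        """Cancel `fu` (it was told to die); the previous writer is restored.

        Returns the FU that regains the register name, or None.
        """
        h = self.history[reg]
        if not h or h[-1] != fu:
            raise ValueError("only the newest writer can be rolled back")
        h.pop()
        return self.current(reg)

    def retire(self, reg, fu):
        """Forget `fu` once its result is committed or safely dropped."""
        self.history[reg].remove(fu)


# The three-ADD sequence from the text, one (assumed) FU per instruction.
q = QTableHistory(num_regs=4)
q.issue(1, "FU1")                # ADD r1, r2, #5
q.issue(2, "FU2")                # ADD r2, r1, #5
nameless = q.issue(1, "FU3")     # ADD r1, r2, #5 overwrites r1: FU1 goes nameless

# Instruction 2 throws an exception, so instruction 3 is told to die.
restored = q.rollback(1, "FU3")  # FU1 becomes "named" (R1) again
```

With a single-entry Q-Table the second `issue` on r1 would have destroyed the record of FU1; keeping the history is what lets `rollback` hand the name R1 back to it.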