# "Name-less" register exception handling

In this [comp.arch](https://groups.google.com/forum/#!topic/comp.arch/8pAGuX6UBu0) post a scheme has been outlined that, if added to a precise-exception augmented CDC 6600 style Scoreboard, would put less load on the register file (fewer reads and writes) and still guarantee precise exception handling.

The goal here is to reduce the number of reads and writes to the register file because, quite simply put, doing so saves power and reduces contention for a limited resource: the data buses between the ALUs and the register file.

Why a limited resource? Because keeping four or more ALUs fully occupied with, for example, an FMAC operation requires 3 READ ports and 1 WRITE port *per ALU*. If those are vectorised predicated FMAC operations, the READ count is even higher than that. Four parallel FMACs initiated per clock require a whopping **TWELVE** read ports and four WRITE ports. This is completely insane, and it is why the register file has been subdivided into four separate banks.

There are certain standard "cells" (pre-designed layouts), including in FPGAs, for register files. The typical layout is 2R1W (2 read ports, 1 write port, per clock cycle). Therefore, keeping to that will not only reduce power consumption, it will reduce the development cost for the project as well.

It turns out that with FMAC (Floating-point Multiply and Accumulate) operations, the destination register is usually also the (additive) source register, in a sequential chain of FMACs. So, aside from the very first FMAC in the chain, if operand "forwarding" is available in the architecture, it is only the two numbers being multiplied (and then added) that need to be read from the register file. That nicely meshes with the whole "2R1W" thing.

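
The port arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the helper name `ports_needed` is made up for illustration, and it assumes that accesses spread evenly across the four banks (no bank conflicts) and that the first FMAC of each chain is ignored.

```python
def ports_needed(alus, reads_per_op, writes_per_op=1):
    """Return (read_ports, write_ports) needed to start one op per ALU per clock."""
    return alus * reads_per_op, alus * writes_per_op

# Four FMACs per clock, all three operands read from the register file.
naive = ports_needed(alus=4, reads_per_op=3)

# Same, but the accumulator arrives by operand forwarding, so only the
# two multiplicands are read from the register file.
forwarded = ports_needed(alus=4, reads_per_op=2)

# Four banks of standard 2R1W cells (assumption: no bank conflicts).
banks = 4
available = (banks * 2, banks * 1)

print(naive)      # (12, 4)
print(forwarded)  # (8, 4)
print(available)  # (8, 4)
```

Under those assumptions, the forwarded case fits exactly into four 2R1W banks, whereas the naive case is four read ports short.
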

[Operand "forwarding"](https://en.wikipedia.org/wiki/Operand_forwarding) basically means that the result from one instruction is "forwarded" *directly* to the source input of a dependent instruction. In the CDC 6600 this, interestingly, is achieved through a special design of Register File, where if a register is being read at the same time (on the same clock cycle) as it is being written, the value is "passed through" (literally). "Normal" (modern) Register File designs simply do not do this, meaning that a dependent operation would have to wait an additional cycle: hence the reason why the concept of "Operand Forwarding" was "invented"... even though the 6600 had already implemented it back in the 1960s.

The "Banks" which are planned to be used in the Libre RISC-V SoC present a bit of a problem as far as forwarding is concerned, even if they include 6600-style same-clock "write-through" capability (aka Operand Forwarding). The issue is that whilst there are multiplexers planned to be added to the source (**after** the reads are performed), there are **no** multiplexers planned to be added before the **destination** registers are written.

Therefore, the plan is to add an additional "forwarding" Bus which can "bypass" the register file entirely. This is apparently fairly standard practice in high-performance modern micro-architectures.

The problem is, however: if the register is identified and marked as "not to be written back to the register file", and an exception occurs, how on earth do you ensure that the system state is stable, i.e. not corrupted? Most modern systems have a "rollback" mechanism to deal with this.

Before we get there, however, let's back up a little bit, and go over the example shown [here](https://groups.google.com/forum/#!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ) in more depth. This is the sequence:

    ADD r1, r2, #5
    ADD r2, r1, #5
    ADD r1, r2, #5

Note that instruction 3 overwrites R1, however the R1 produced by instruction 1 is used as a *source* register in instruction 2.
So what that means is, if we have Tomasulo-style Reservation Stations on the Function Units, we don't *actually* need to write R1 from instruction 1 into the Register File at all! We can simply use the fact that it will be sitting in the Function Unit's Reservation Station, use "Operand Forwarding" to pass it to instruction 2, and, once instruction 2 is underway, throw the instruction 1 R1 result **away**.

We achieve this by noting that Instruction 3 "overwrites" Instruction 1's R1 as a destination, and, whilst all three ALUs are still busy with pipelined processing, "mark" the Function Unit handling instruction 1 as "nameless". The "name" of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming FU 1 is handling instruction 1).

Now that we have the context, let's return to the bit about exceptions, and assume that instruction 2 throws one (ADDs do not normally do that: let's assume that they can, for now). Note that these are the conditions:

* Precise Exception handling has been added, by adding a "schroedinger" wire plus a write-hazard block that prevents down-stream instructions from "committing" (writing) until such time as the up-stream instruction absolutely knows that there will not be an exception.
* When an up-stream instruction knows that it has passed (cleared) the hurdle of potentially needing to throw an exception, it **drops** the write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream dependent instructions to be free of write hazards, and thus commit. (However, that's not happening here: instruction 2 **has** flipped the "schroedinger" wire to "Go\_Die").
* R1 from instruction 1 has been **specifically** marked as **not** to be written to the Register File: it has been renamed to "nameless" (FU1.#n).
* R1 from instruction 1 is also a source register of instruction 2.
* Instruction 2 is to be "rolled back".
* Instruction 3 is to be told to die as well (instruction 2 has flipped the "Go\_Die" signal).

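
The "nameless" marking described before the list of conditions can be sketched in a few lines. This is a simplified software model, not the hardware: `Instr` and `find_nameless` are hypothetical names, and it assumes that every instruction in the window is still in flight and that every reader of an overwritten result receives it by forwarding.

```python
from collections import namedtuple

# Hypothetical record: the FU handling the instruction, its destination
# register and its source registers.
Instr = namedtuple("Instr", "fu dest srcs")

def find_nameless(window):
    """Return the FUs whose destination is overwritten by a later
    in-flight instruction, so their result need only be forwarded."""
    nameless = []
    for i, early in enumerate(window):
        for late in window[i + 1:]:
            if late.dest == early.dest:
                nameless.append(early.fu)
                break
    return nameless

window = [
    Instr(fu=1, dest="r1", srcs=("r2",)),  # ADD r1, r2, #5
    Instr(fu=2, dest="r2", srcs=("r1",)),  # ADD r2, r1, #5
    Instr(fu=3, dest="r1", srcs=("r2",)),  # ADD r1, r2, #5
]
print(find_nameless(window))  # [1]
```

Only FU 1 is marked: its R1 is overwritten by instruction 3, which is exactly the value that ends up stranded if instruction 2 throws an exception.
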

...um, what do we do about the value "FU1.#n"? Instruction 3 told it that it was no longer permitted to write to the Register File, except that now Instruction 3 is dead! Instruction 1 has absolutely no place to put that value. Should we discard Instruction 1 **as well**?? How far back does this go? This is completely wasteful of resources! More than that, what if we have a multi-issue engine, which issues multiple instructions in this "nameless" fashion, where they get rolled back again and again in an endless loop?

This is where modern micro-architectures get a little unstuck: apparently what they do is roll back to where there are **no** "nameless" registers, then **disable** multi-issue instruction execution, **disable** the "nameless" capability, and slowly move forward one instruction at a time until the exception is re-encountered. This basically ensures that when the exception is encountered, absolutely all of the registers may be (or are already) committed to the Register File. At that point, a trap handler knows that it can safely context-switch, or do whatever it likes, confident that the Register File Architectural State is sane.

This approach is extremely wasteful of resources, and sub-optimal. In a design that is supposed to be power-efficient, there's an obligation to "Do Better". Hence the scheme below.

# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement

In CDC 6600 Terminology there is something called the "Q-Table", which is basically an array, indexed by Register number, which keeps a record of which Function Unit (relative to instruction order) last had that Register as its Destination. This is directly equivalent to and completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM entry (except that in 6600 Scoreboarding it's not a CAM).

The problem with "nameless" Operand Forwarding is: whenever a Q-Table entry (for any given FU) is overwritten, that's it: that instruction **absolutely cannot** "roll back".
The critical information that would allow the prior Function Unit (the "overwritee") to be restored has just been destroyed. There is a simple solution to that: provide a *Queue* of Q-Table entries.

Below is what a 6600 Q-Table looks like (image courtesy of Mitch Alsup). In the original 6600 it is a binary table with a unary decoder on the left and a pair of unary encoders on the right.

{{6600_q_table.png}}

The plan is, therefore, to add effectively *multiple* Q-Tables (or, multiple entries), recording the "history" of which *prior* Function Units had any given register as their destination. Now we have exactly the information needed to "roll back", should an exception occur.

Like many augmentations and enhancements to the 6600 Scoreboard system, it's kind-of obvious in retrospect. However the *real* "duh" moment, as posted on comp.arch, is to always ensure that FUs that are providing "nameless" data in their destination latches will never let down-stream dependent instructions commit if any of those down-stream instructions could potentially hit an exception.

Why is that important? It's because it's not enough to know that the down-stream (dependent) instructions have all initiated (read the FU's dest latch and taken it as a forwarded src operand). If **even one** of those instructions throws an exception, the "nameless" FU from which that value came is hosed, as it has nowhere to put its result.

So, firstly: the "nameless" FU absolutely has to wait until its dependencies are clear of exceptions, and then (**and only** then) may it safely drop (throw away) the data without writing it to the Register File; and secondly, the "nameless" FU absolutely has to know that it can "roll back" from "nameless" to a "named" state, in the event that one of its dependent instructions does indeed throw an exception. This is where the "History" Q-Table Entries come into play.

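
As a rough software model of the idea (not the hardware design, which would be a set of binary tables rather than Python lists), the "history" can be pictured as a per-register stack of Function Unit numbers. The class name `QTableHistory` and its methods are invented for this sketch, and it assumes that each FU holds at most one destination reservation at a time.

```python
class QTableHistory:
    """Per-register history of the FUs that claimed it as destination."""

    def __init__(self, num_regs):
        self.history = [[] for _ in range(num_regs)]

    def current(self, reg):
        """FU that most recently claimed `reg`, or None if unclaimed."""
        h = self.history[reg]
        return h[-1] if h else None

    def issue(self, reg, fu):
        """Record `fu` as the new owner of `reg`; return the displaced FU
        (the one that may now become "nameless"), or None."""
        prev = self.current(reg)
        self.history[reg].append(fu)
        return prev

    def cancel(self, fu):
        """Drop a killed FU's reservation. Return {reg: restored_owner}
        for each register whose current owner changed as a result."""
        restored = {}
        for reg, h in enumerate(self.history):
            if fu in h:
                was_current = h[-1] == fu
                h.remove(fu)
                if was_current:
                    restored[reg] = self.current(reg)
        return restored

q = QTableHistory(num_regs=4)
q.issue(1, fu=1)              # instruction 1: dest r1
q.issue(2, fu=2)              # instruction 2: dest r2
displaced = q.issue(1, fu=3)  # instruction 3: dest r1, displaces FU 1

# Instruction 2 throws: instruction 3 dies, then instruction 2 is rolled back.
restored = q.cancel(3)
q.cancel(2)
print(displaced, restored)    # 1 {1: 1}
```

With a plain single-entry Q-Table the `issue` for instruction 3 would have destroyed the record that FU 1 owned r1; with the history kept, cancelling FU 3 hands r1 back to FU 1, which can return from "nameless" to "named".
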

So there are a few potential ways to go about this:

* Using the Historical Q-Table Entries, in chronological and Dependency Order, store all "Nameless" Registers (using the "history" to determine where), even if they are going to get overwritten in the next cycle.
* After the Exception has triggered the "Go\_Die" wire and all dependent instructions have been removed (including their Destination Register Reservations), use the "history" information to work out which (formerly nameless) Function Unit(s) now actually have the Destination Reservation for each "vacated" Register.
* Any remaining "nameless" Registers, if their results are available, are likewise either stored or trigger their shadow (dependent) instructions to die (even if it's the original exception).
* Once the dust settles, carry on.

Realistically, this is going to need to be investigated with simulations. It's quite complicated, however the payoff is a significant reduction in the workload on the register file. It basically means the difference between 12 GFLOPs and 6 GFLOPs when doing 32-bit FMACs, at 800MHz (quad-core), and still being able to keep to a "standard" 2R1W register file. So it's a big deal!