# "Name-less" register exception handling

In this
[comp.arch](!topic/comp.arch/8pAGuX6UBu0)
post a scheme is outlined that, if added to a CDC 6600 style Scoreboard
augmented with precise exceptions, would reduce the load on the register
file (fewer reads and writes) while still guaranteeing precise exception handling.

The goal here is to reduce the number of reads and writes to the register
file because, quite simply put, doing so saves power and reduces contention
for a limited resource: the data buses between the ALUs and the register
file. Why limited? Because keeping four or more ALUs fully occupied
with, for example, an FMAC operation requires 3 READ ports and 1 WRITE port *per ALU*.
If those are vectorised predicated FMAC operations, the READ count is
higher still.

Four parallel FMACs initiated per clock require a whopping **TWELVE** READ
ports and four WRITE ports. This is completely insane, and it is why the
register file has been subdivided into four separate banks.
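
The port arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the function names are made up for illustration, and it assumes that accesses spread perfectly evenly across the banks, which real code will not always manage.

```python
import math

def ports_needed(alus, reads_per_op=3, writes_per_op=1):
    """Total (read, write) ports needed to start one operation per ALU per clock."""
    return alus * reads_per_op, alus * writes_per_op

def ports_per_bank(alus, banks, reads_per_op=3, writes_per_op=1):
    """Ports needed on each bank, ASSUMING accesses spread evenly over the banks."""
    reads, writes = ports_needed(alus, reads_per_op, writes_per_op)
    return math.ceil(reads / banks), math.ceil(writes / banks)

print(ports_needed(4))       # (12, 4): four FMACs on one un-banked register file
print(ports_per_bank(4, 4))  # (3, 1): the same load spread over four banks
```

Even with four banks, a three-operand FMAC still asks for 3R1W on each bank, which is one read port more than a 2R1W cell provides.
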

There are certain standard "cells" (pre-designed layouts) for register
files, including in FPGAs. The typical layout is 2R1W (2 read ports, 1 write
port, per clock cycle). Keeping to that layout will therefore not only reduce
power consumption but also reduce the development cost of the project.

It turns out that in a sequential chain of FMAC (Floating-point Multiply
and Accumulate) operations, the destination register is usually also the
(additive) source register. So, aside from the very first FMAC in the chain,
if operand "forwarding" is available in the architecture, only the two
numbers being multiplied need to be read from the register file: the
accumulator arrives by forwarding. That nicely meshes with the
whole "2R1W" thing.
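
To put a number on that saving, here is a small sketch that counts register-file reads for a chain of dependent FMACs of the form `acc = a * b + acc`. The function name and the cost model (one read per operand that is not forwarded) are assumptions made purely for illustration.

```python
def regfile_reads(chain_length, forwarding):
    """Register-file reads for a chain of dependent FMACs (acc = a * b + acc)."""
    if chain_length == 0:
        return 0
    if not forwarding:
        # Every FMAC fetches both multiplicands and the accumulator.
        return 3 * chain_length
    # The first FMAC reads all three operands; every later one receives the
    # accumulator by forwarding and only reads the two multiplicands.
    return 3 + 2 * (chain_length - 1)

print(regfile_reads(8, forwarding=False))  # 24
print(regfile_reads(8, forwarding=True))   # 17
```
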

Operand "forwarding"
basically means that the result from one instruction is "forwarded"
*directly* to the source input of a dependent instruction. In the CDC 6600
this, interestingly, is achieved through a special design of Register File:
if a register is being read at the same time (on the same clock cycle)
as it is being written, the value is "passed through" (literally). "Normal"
(modern) Register File designs simply do not do this, meaning that a
dependent operation would have to wait an additional cycle: hence the
reason why the concept of "Operand Forwarding" was "invented"... even though
the 6600 had implemented it 55 years ago.
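
The difference between the two kinds of Register File can be shown with a toy cycle-level model. This is a behavioural sketch only: the `RegFile` class and its `cycle` method are hypothetical names, and one write per clock is assumed.

```python
class RegFile:
    """Toy register file: one optional write and any number of reads per clock."""

    def __init__(self, nregs, write_through):
        self.regs = [0] * nregs
        self.write_through = write_through

    def cycle(self, read_addrs, write=None):
        """Simulate one clock. `write` is (addr, value) or None. Returns read data."""
        out = []
        for addr in read_addrs:
            if self.write_through and write is not None and addr == write[0]:
                out.append(write[1])          # 6600-style pass-through
            else:
                out.append(self.regs[addr])   # contents before this clock's write
        if write is not None:
            self.regs[write[0]] = write[1]    # the write lands at the end of the clock
        return out

through = RegFile(4, write_through=True)
plain = RegFile(4, write_through=False)
print(through.cycle([1], write=(1, 42)))  # [42]: dependent op sees the value at once
print(plain.cycle([1], write=(1, 42)))    # [0]:  old value, new one not visible yet
print(plain.cycle([1]))                   # [42]: visible one clock later
```
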
The "Banks" which are planned to be used in the Libre RISC-V SoC present a
bit of a problem as far as forwarding is concerned, even if they include
6600-style same-clock "write-through" capability (aka Operand Forwarding).
The issue is that whilst there are multiplexers planned to be added to the
source (**after** the reads are performed), there are **no** multiplexers
planned to be added before the **destination** registers are written.
Therefore, the plan is to add an additional "forwarding" Bus which can
"bypass" the register file entirely.
This is apparently fairly standard practice in high-performance modern
micro-architectures. The problem is, however: if the register is
identified and marked as "not to be written back to the register file",
and an exception occurs, how on earth do you ensure that the system state
is stable i.e. not corrupted? Most modern systems have a "rollback"
mechanism to deal with this.
Before we get there, however, let's back up a little bit, and go over
the example shown
[here](!msg/comp.arch/gedwgWzCK4A/mRcfK8IODwAJ)
in more depth. This is the sequence:

    ADD r1, r2, #5
    ADD r2, r1, #5
    ADD r1, r2, #5

Note that instruction 3 overwrites R1, whereas R1 is used as
a *source* register in instruction 2. What that means is, if we have
Tomasulo-style Reservation Stations on the Function Units, we don't
*actually* need to write R1 from instruction 1 into the Register File
at all! We can simply rely on the fact that the value will be sitting in
the Function Unit's Reservation Station, use "Operand Forwarding" to
pass it to instruction 2, and, once instruction 2 is underway, throw
the instruction 1 R1 result **away**. We achieve this by noting that
instruction 3 "overwrites" instruction 1's R1 as a destination, and,
whilst all three ALUs are still busy with pipelined processing, "marking"
the Function Unit handling instruction 1 as "nameless". The "name"
of Register R1 effectively changes from "R#1" to "FU1.#n" (assuming
FU 1 is handling instruction 1).
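
Spotting which results could become "nameless" comes down to finding in-flight instructions whose destination is overwritten by a later in-flight instruction. The sketch below does that for the three ADDs. The function name and the `(dest, sources)` tuple format are assumptions for illustration, and indices are 0-based, so index 0 is instruction 1.

```python
def nameless_candidates(window):
    """Indices of in-flight instructions whose result need not reach the Register File.

    `window` is a list of (dest, sources) tuples in program order. A result is a
    candidate when a later instruction in the window overwrites the same dest.
    """
    last_writer = {}   # register name -> index of its most recent writer
    candidates = []
    for idx, (dest, _sources) in enumerate(window):
        if dest in last_writer:
            candidates.append(last_writer[dest])   # the earlier writer is overwritten
        last_writer[dest] = idx
    return candidates

window = [
    ("r1", ("r2",)),   # ADD r1, r2, #5
    ("r2", ("r1",)),   # ADD r2, r1, #5
    ("r1", ("r2",)),   # ADD r1, r2, #5
]
print(nameless_candidates(window))  # [0]
```
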
Now we have the context, let's return to the bit about exceptions, and
assume that instruction 2 throws one (ADDs do not normally do that:
let's assume that they can, for now). Note that these are the conditions:
* Precise Exception handling has been added, by adding a "schroedinger"
wire plus a write-hazard block that prevents down-stream instructions
from "committing" (writing) until such time as the up-stream instruction
absolutely knows that there will not be an exception.
* When an up-stream instruction knows that it has passed (cleared) the
hurdle of potentially needing to throw an exception, it **drops** the
write-hazard and DEASSERTs the "schroedinger" wire, thus allowing down-stream
dependent instructions to be free of write hazards, and thus commit.
(However, that's not happening here: instruction 2 **has** flipped
the "schroedinger" wire to "Go\_Die").
* R1 from instruction 1 has been **specifically** marked as **not** to
be written to the Register File: it has been renamed to "nameless"
(FU1.#n).
* R1 from instruction 1 is also a source register of instruction 2.
* Instruction 2 is to be "rolled back".
* Instruction 3 is to be told to die as well (instruction 2 has flipped
the "Go\_Die" signal).
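
The write-hazard and "Go\_Die" behaviour in these conditions can be modelled very roughly as follows. This is a simplified sketch, not the actual Scoreboard logic: `Instr`, `can_commit`, `clear` and `go_die` are hypothetical names, and every later instruction in program order is treated as "down-stream".

```python
from dataclasses import dataclass

@dataclass
class Instr:
    text: str
    may_except: bool = True   # "schroedinger" wire still asserted
    alive: bool = True        # False once told to die

def can_commit(instrs, idx):
    """Writing is allowed only once this and every live up-stream instruction is clear."""
    me = instrs[idx]
    upstream_clear = all(not i.may_except for i in instrs[:idx] if i.alive)
    return me.alive and not me.may_except and upstream_clear

def clear(instrs, idx):
    """The instruction knows it will not throw: drop its write-hazard."""
    instrs[idx].may_except = False

def go_die(instrs, idx):
    """The instruction throws: it and everything down-stream is cancelled."""
    for instr in instrs[idx:]:
        instr.alive = False

prog = [Instr("ADD r1, r2, #5"), Instr("ADD r2, r1, #5"), Instr("ADD r1, r2, #5")]
clear(prog, 0)
clear(prog, 2)
print(can_commit(prog, 2))      # False: instruction 2 could still throw
go_die(prog, 1)                 # instruction 2 throws
print([i.alive for i in prog])  # [True, False, False]
```
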
So, what do we do about the value "FU1.#n"? Instruction 3 told it
that it was no longer permitted to write to the Register File, except that
now Instruction 3 is dead! Instruction 1 has absolutely no place to put
that value. Should we discard Instruction 1 **as well**?? How far back does
this go? This is completely wasteful of resources! More than that, what
if we have a multi-issue engine, which issues multiple
instructions in this "nameless" fashion, where they get rolled back
again and again in an endless loop?

This is where modern micro-architectures get a little unstuck: apparently
what they do is roll back to where there are **no** "nameless" registers,
then **disable** multi-issue instruction execution, **disable** the
"nameless" capability, and slowly move forward one instruction at a time
until the exception is re-encountered.
This basically ensures that when the exception is encountered, absolutely
all of the registers may be (or are already) committed to the Register File.
At that point, a trap handler knows that it can safely context-switch, or
do whatever it likes, confident that the Register File Architectural State
is sane.
This approach is extremely wasteful of resources, and sub-optimal. In a
design that is supposed to be power-efficient, there's an obligation to
"Do Better". Hence the scheme below.

# CDC 6600 Q-Table (FU-to-Register lookup) "History" Enhancement

In CDC 6600 terminology there is something called the "Q-Table", which
is basically an array, indexed by Register number, which keeps a record
of which Function Unit (relative to instruction order) last had that
Register as its Destination. This is directly equivalent to and
completely synonymous with the Tomasulo Reorder Buffer's "Dest Reg" CAM
entry (except that in 6600 Scoreboarding it's not a CAM).

The problem with "nameless" Operand Forwarding is: whenever a Q-Table
entry (for any given FU) is overwritten, that's it: that instruction
**absolutely cannot** "roll back". The critical information that would
allow the prior Function Unit (the "overwritee") to be restored has just
been destroyed.

There is a simple solution to that: provide a *Queue* of Q-Table entries.
Below is what a 6600 Q-Table looks like (image courtesy of Mitch Alsup).
In the original 6600 it is a binary table with a unary decoder on the
left and a pair of unary encoders on the right.
{{6600_q_table.png}}

The plan is, therefore, to add effectively *multiple* Q-Tables
(or, multiple entries), recording the "history" of which *prior*
Function Units had any given register as their destination.
Now we have exactly the information needed to "roll back", should an
exception occur. Like many augmentations and enhancements to the 6600
Scoreboard system, it's kind-of obvious in retrospect. However the *real*
"duh" moment, as posted on comp.arch, is to always ensure that FUs that
are providing "nameless" data in their destination latches will never
let down-stream dependent instructions commit if any of those down-stream
instructions could potentially hit an exception.
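
As a data structure, a Q-Table with "history" might look something like the sketch below: each register keeps a list of the Function Units that have claimed it as a destination, oldest first, instead of a single entry. The class and method names are hypothetical, and nothing here reflects the actual hardware encoding.

```python
class QTableHistory:
    """Per-register list of Function Units that claimed it as destination, oldest first."""

    def __init__(self, nregs):
        self.writers = [[] for _ in range(nregs)]

    def current(self, reg):
        """The FU that currently holds the name of `reg` (None: the Register File does)."""
        return self.writers[reg][-1] if self.writers[reg] else None

    def issue(self, reg, fu):
        """`fu` becomes the latest writer of `reg`. Returns the FU it overwrote, if any."""
        previous = self.current(reg)
        self.writers[reg].append(fu)   # append instead of overwrite: history is kept
        return previous

    def history(self, reg):
        return list(self.writers[reg])

q = QTableHistory(nregs=4)
q.issue(1, "FU1")                # ADD r1, r2, #5
q.issue(2, "FU2")                # ADD r2, r1, #5
overwritten = q.issue(1, "FU3")  # ADD r1, r2, #5
print(overwritten, q.current(1), q.history(1))  # FU1 FU3 ['FU1', 'FU3']
```

With a plain single-entry Q-Table, the third `issue` would have erased the record that FU1 ever owned r1.
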
Why is that important? It's because it's not enough to know that the
down-stream (dependent) instructions have all initiated (read the
FU's dest latch and taken it as a forwarded src operand). If **even one**
of those instructions throws an exception, the "nameless" FU from which that
value came is hosed, as it has nowhere to put its result.

So, firstly: the "nameless" FU absolutely has to wait until its dependencies
are clear of exceptions, and then (**and only** then) may it safely drop (throw
away) the data without writing it to the Register File; and secondly,
the "nameless" FU absolutely has to know that it can "roll back" from
"nameless" to a "named" state, in the event that one of its dependent
instructions does indeed throw an exception. This is where the "History"
Q-Table Entries come into play.
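
Those two rules amount to a three-way decision for a "nameless" FU. The sketch below spells it out. The function name and the status strings are assumptions made for illustration, and "dependents" is taken to mean every down-stream instruction that reads or overwrites the nameless result.

```python
def nameless_fu_action(dependents):
    """Decide what a "nameless" FU may do with the result sitting in its dest latch.

    `dependents` holds one status per down-stream instruction:
      "pending"   - could still throw an exception
      "clear"     - has passed the point where it could throw
      "cancelled" - threw an exception, or was told to die
    """
    if "cancelled" in dependents:
        return "write_back"   # roll back from "nameless" to "named" and store
    if all(status == "clear" for status in dependents):
        return "drop"         # safe to throw the value away
    return "hold"             # keep the value: an exception is still possible

print(nameless_fu_action(["clear", "pending"]))    # hold
print(nameless_fu_action(["clear", "clear"]))      # drop
print(nameless_fu_action(["cancelled", "clear"]))  # write_back
```
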

So there are a few steps:
* Using the Historical Q-Table Entries, in chronological and Dependency
Order, store all "Nameless" Registers (using the "history" to determine
where), even if they are going to get overwritten in the next cycle.
* After triggering the "Go\_Die" wire from the Exception, and once all
dependent instructions have been removed (including their Destination
Register Reservations), use the "history" information to work out
which (formerly nameless) Function Unit(s) now actually have the
Destination Reservation for each "vacated" Register.
* Any remaining "nameless" Registers, if their results are available,
are likewise either stored or trigger their shadow (dependent)
instructions to die (even if it's the original exception).
* Once the dust settles, carry on.
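
The second step, working out who owns each "vacated" Register once the cancelled instructions are gone, can be sketched as a filter over the history. The `recover` function and the dictionary layout are hypothetical and only illustrate the bookkeeping, not the timing or the hardware.

```python
def recover(history, cancelled):
    """Remove cancelled FUs from the Q-Table history and find each register's owner.

    `history` maps register name -> list of FUs that claimed it, oldest first.
    `cancelled` is the set of FUs killed by Go_Die.
    Returns (new_history, owners); an owner of None means no FU holds a
    reservation, so the Register File already has the architectural value.
    """
    new_history = {}
    owners = {}
    for reg, fus in history.items():
        survivors = [fu for fu in fus if fu not in cancelled]
        new_history[reg] = survivors
        owners[reg] = survivors[-1] if survivors else None
    return new_history, owners

# The three-ADD example: instruction 2 (FU2) throws, taking instruction 3 (FU3) with it.
history = {"r1": ["FU1", "FU3"], "r2": ["FU2"]}
new_history, owners = recover(history, cancelled={"FU2", "FU3"})
print(owners)  # {'r1': 'FU1', 'r2': None}
```

Here FU1, formerly "nameless", is the owner of r1 again and therefore has somewhere to store its result.
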

Realistically, this is going to need to be investigated with simulations.
It's quite complicated, however the payoff is a significant reduction in
the workload on the register file. It basically means the difference between
12 GFLOPs and 6 GFLOPs when doing 32-bit FMACs, at 800MHz (quad-core),
and still being able to keep to a "standard" 2R1W register file.
So it's a big deal!