# FPUs and nmigen

[nmigen](https://github.com/m-labs/nmigen) by [m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very interesting choice. It's not a panacea: however, the fact remains that verilog is simply never going to be as up-to-date as python, or have python's advanced and powerful features added to it, and, in all seriousness, it never should be. Instead, it is quite reasonable to treat verilog in effect as machine code: a compiler target. However, it is critical to remember that despite writing code in python, the underlying rules to obey are those of hardware, not software: instantiating and using a module is not the same thing - at all - as calling a function, and classes are definitely not synonymous with modules. This update outlines some of the quirks encountered.

# Modules != classes

The initial conversion of John Dawson's IEEE754 FPU verilog code to nmigen went extremely well and very rapidly. Where things began to come unstuck, for over a week, was in the effort to "pythonify" the code, with a view to converting a Finite State Machine design into a pipeline. The initial efforts focussed on splitting out the relevant sections of code into python classes and functions, to be followed by converting those into modules (actual verilog modules, rather than "python" modules).

John's design is based around the use of global variables. The code moves from state to state, using the global variables to store forward progress. A pipeline requires the use of *local* variables (local to each stage), where the output from one stage is connected, time-synchronised, as the input to the next. Aleksander encountered some excellent work by Dan Gisselquist on [zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html), which describes various pipeline strategies, including one that involves buffered handshakes. It turns out that John's code (as a unit) in fact conforms to the very paradigm that Dan describes.

However, John's code also has stages that perform shifting one bit at a time, for normalisation of the floating-point result. The global internal variable is updated by one bit every cycle, and that's not how pipelines work: it's an imperative prerequisite that a pipeline stage do its work in a *single* cycle.

So the first incremental change was to split out each stage (alignment, normalisation, the actual add, rounding and so on) into separate classes. It was a mess. The problem is that where in computer programming languages it is normal to have a variable that can be updated (overwritten), hardware is parallel and doesn't like it when more than one piece of "code" tries to update the same "variable". Outputs *have* to be separated from inputs. So although the "code" to do some work may be split out into a separate class, it's necessary to also cleanly separate the inputs from the outputs: *no* variables may be overwritten without being properly protected, and in a pipeline paradigm, global variables are not an option.

In addition, modules need to be "connected" to the places where they are used. It's not possible to "call" a module and expect the parameters to be passed in and the inputs and outputs to magically work: nmigen is a different paradigm, because you can use either "sync" or "comb" - clock-synchronised or combinatorial logic. If you use "comb", nmigen generates hardware that is updated immediately from its inputs; if you use "sync", nmigen knows to auto-generate hardware where the result is updated from its inputs on the **next** cycle.
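By way of illustration, here is a minimal sketch (not taken from the FPU source) of the same expression assigned in each of the two domains:

```python
from nmigen import Module, Signal

m = Module()
a = Signal(8)
b = Signal(8)
sum_comb = Signal(9)
sum_sync = Signal(9)

# "comb": pure combinatorial logic - sum_comb follows a + b immediately
m.d.comb += sum_comb.eq(a + b)

# "sync": the same expression, but latched into a register, so sum_sync
# only takes on the value of a + b on the *next* clock cycle
m.d.sync += sum_sync.eq(a + b)
```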
The problem in converting code over to a module, using local inputs and outputs, *and* removing globals, is that it's too many things to tackle at once. It took about ten days to work all this out, keeping the unit tests running at all times and using their success or failure as an indicator of whether things were on track. Eventually, however, it all worked out.

# Add Example module

It's worthwhile showing some of the pipeline stages. Here's the python nmigen code for the adder first stage:

{add_code_screenshot.png}

A prerequisite is that an "alignment" phase has already run, which ensured that the exponents were both the same, so there is no need in this phase to perform bit-shifting of the mantissas: that's already been handled. There are two inputs (in_a and in_b) and one output (out_z): these are modules in their own right, each containing a sign, mantissa and exponent. in_a.m is the mantissa of input A, for example.

So the first thing is that four intermediate variables are created: one for testing whether the signs of A and B are equal (or not), a second for comparing the mantissas of A and B, and two further intermediates to store the mantissas of A and B zero-extended by one bit. Next we have some simple combinatorial tests: if the signs are the same, we perform an add of A and B's mantissas, storing the result in Z's mantissa. If we get to the next "If" statement, we know that this is to be a subtraction, not an addition. However, for subtraction, it matters which way round the subtraction is done, depending on which of A or B is the larger. It's really quite straightforward, and the important thing to note here is that the code is properly commented. It's not the most compact code in the world, and it's not the prettiest-looking: python cannot handle overloading of the assignment operator (not without overloading getattr and setattr, that is), so nmigen creates and uses a method named "eq" to handle assignment.

One aspect of this project that's considered to be extremely important is to do a visual inspection of each module. Here's what add looks like when the yosys "show" command is run on it:

{add_graph.png}

On the left it can be seen that the names are a bit of a mess: the members of A and B (s, e and m) are extracted and, because they clash, are given auto-generated names. m can be seen to go into a square (a graphviz module) with "e" and "m" on it, in a box named "add0_n_a". That's the name we chose to give to the submodule in the nmigen code, shown above, purely so that it would be easy to identify visually in the graphviz output. Note that there is an arrow into a block that takes m (bits 26 down to 0) and a single-bit zero, and outputs the two concatenated together: these then go into a diamond-block named "am0". We've identified am0 from the python code! The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater-or-Equal) operator, which in turn goes to a diamond-block named "mge": this is the check to see which of the mantissas is larger! Then we can see that $15, $12 and $18 are add and subtract operations, which feed through to a selection procedure ($group_5), which ultimately goes into the "out_tot" variable. This is the mantissa output of the addition.

So, with a little careful analysis, by tracking the names of the inputs, intermediates and outputs, we can verify that the resultant auto-generated output from nmigen looks reasonable. The thing is: the module has been *deliberately* kept small so as to be able to do *exactly this*. One of the reasons for this is illustrated in the next section.
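Since the stage itself is only shown as a screenshot above, here is a rough sketch of what it boils down to. The names am0, mge and out_tot are taken from the graph output; the flattened port structure, the class shape, the widths and the name seq are assumptions for illustration, not the actual code:

```python
from nmigen import Module, Signal, Cat

class FPAddStage:
    """First stage of the adder: mantissas are already aligned."""
    def __init__(self):
        # inputs: sign and 27-bit mantissa of A and B (exponents equal)
        self.in_a_s = Signal()
        self.in_b_s = Signal()
        self.in_a_m = Signal(27)
        self.in_b_m = Signal(27)
        # outputs: result sign and mantissa (one bit wider, for the carry)
        self.out_z_s = Signal()
        self.out_tot = Signal(28)

    def elaborate(self, platform):
        m = Module()
        # intermediates: sign-equality test, mantissa comparison, and the
        # two mantissas zero-extended by one bit.  seq is a hypothetical
        # name: the graph only shows am0, mge and out_tot.
        seq = Signal()
        mge = Signal()
        am0 = Signal(28)
        bm0 = Signal(28)
        m.d.comb += [seq.eq(self.in_a_s == self.in_b_s),
                     mge.eq(self.in_a_m >= self.in_b_m),
                     am0.eq(Cat(self.in_a_m, 0)),
                     bm0.eq(Cat(self.in_b_m, 0))]
        with m.If(seq):          # same sign: straightforward add
            m.d.comb += [self.out_tot.eq(am0 + bm0),
                         self.out_z_s.eq(self.in_a_s)]
        with m.Elif(mge):        # signs differ, A >= B: A - B, sign of A
            m.d.comb += [self.out_tot.eq(am0 - bm0),
                         self.out_z_s.eq(self.in_a_s)]
        with m.Else():           # signs differ, B > A: B - A, sign of B
            m.d.comb += [self.out_tot.eq(bm0 - am0),
                         self.out_z_s.eq(self.in_b_s)]
        return m
```

Whatever the exact details, the key point stands: the If/Elif conditions are read from named intermediate signals rather than repeated inline, for reasons shown next.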
# Where things go horribly wrong

In nmigen, it's perfectly possible to use python variables to hold (accumulate) intermediate results, without actually storing them in actual "named" hardware (so to speak). Note how, in the add code above, the tests for the If and Elif statements were placed into intermediate variables? The reason is that if they were not, yosys **duplicated** the expressions. Here's an example of where that goes horribly wrong. Note the innocuous-looking code, below:

{shift_screenshot.png}

sm.rshift basically does a variable-length right shift (the ">>" operator in both python and verilog). Note the assignment of the intermediary calculation m_mask to a python temporary variable. Note the commented-out code which uses the "reduce" operator to OR together all of the bits of a *secondary* expression, which ANDs all of the bits of "m_mask" with the input mantissa? Watch what happens when that's handed over to yosys:

{align_single_fail.png}

It's an absolute mess. If you zoom in close on the left side, what's happened is that the shift expression has been **multiplied** (duplicated) a whopping **twenty-four** times (this is a 32-bit FP number, so the mantissa is 24 bits). The reason is that the reduce operation needed 24 copies of the input, in order to select one bit at a time. Then, on the right-hand side, each bit is ORed in a chain with the previous bit, exactly as would be expected of a sequential processor performing a "reduce" operation. On seeing this graph output, it was immediately apparent that it would be totally unacceptable, yet from the python nmigen source it is not in the slightest bit obvious that there's a problem. **This is why the yosys "show" output is so important**.

On further investigation, it was discovered that nmigen has a "bool" function, which ORs all bits of a variable together. In yosys it even has a name: "reduce_bool". Here's the graph output once that function has been used instead:

{align_single_clean.png}

*Now* we are looking at something that's much clearer, smaller, cleaner and easier to understand. It's still quite amazing how so few lines of code can turn into something so comprehensive. The line of "1s" (11111111...) is where the variable "m_mask" gets created: this line of "1s" is right-shifted to create the mask. In the box named "$43" it is then ANDed with the original mantissa, reduced to a single boolean ($44, a $reduce_bool operation), and so on. This shift-and-mask is basically for the creation of the "sticky" bit in IEEE754 rounding: it's essential to get right, and it's an essential part of IEEE754 Floating-Point.
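To make the contrast concrete, here is a minimal, hedged recreation of the two approaches. The signal names and widths are stand-ins, and a plain ">>" stands in for sm.rshift:

```python
from functools import reduce
from nmigen import Module, Signal, Const

m = Module()
mantissa = Signal(24)
shift = Signal(5)
m_mask = Signal(24)
sticky_slow = Signal()
sticky_fast = Signal()

# the mask: a line of 24 "1s", right-shifted by the alignment amount
m.d.comb += m_mask.eq(Const((1 << 24) - 1, 24) >> shift)

# a python variable holding an expression: no named hardware is created
masked = mantissa & m_mask

# BAD: ORing the bits together one at a time with reduce.  yosys ends up
# duplicating the whole shift/mask expression once per selected bit.
m.d.comb += sticky_slow.eq(reduce(lambda a, b: a | b,
                                  [masked[i] for i in range(24)]))

# GOOD: nmigen's built-in bool() method collapses all the bits into a
# single $reduce_bool cell.
m.d.comb += sticky_fast.eq(masked.bool())
```

Both compute the same sticky bit; only the second leaves yosys one cell to draw.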
By doing this kind of close visual inspection, keeping things to small, compact modules, maintaining comprehensive unit test coverage, and making incremental, minimalist changes, we stand a reasonable chance of not making huge, glaring design errors, and of being able to bootstrap up to a decent design. Not knowing how to do something is not an excuse for not trying. Having a strategy for working things out is essential to succeeding, even when faced with a huge number of unknowns.

Go from known-good to known-good: create the building blocks first, make sure that they're reasonable, make sure that they're comprehensively unit-tested; then incremental changes can be attempted with the confidence that mistakes will be weeded out immediately by a unit test failing when it should not. However, as this update demonstrates, both versions of the normalisation alignment produced the correct answer, yet one of them was deeply flawed. Even code that produces the correct answer may have design flaws: that's what the visual inspection is for.