# FPUs and nmigen

[nmigen](https://github.com/m-labs/nmigen) by [m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very interesting choice. It's not a panacea: however, the fact remains that verilog is simply never going to be as up-to-date as python, or have python's advanced and powerful features added to it, and, in all seriousness, it never should be. Instead, it is quite reasonable to treat verilog in effect as machine code: a compiler target. However, it is critical to remember that despite writing code in python, the underlying rules to obey are those of hardware, not software: instantiating and using a module is not the same thing - at all - as calling a function, and classes are definitely not synonymous with modules. This update outlines some of the quirks encountered.

# Modules != classes

The initial conversion of John Dawson's IEEE754 FPU verilog code to nmigen went extremely well and very rapidly. Where things began to come unstuck, for over a week, was in the effort to "pythonify" the code, with a view to converting a Finite State Machine design into a pipeline. The initial efforts focussed on splitting out the relevant sections of code into python classes and functions, to be followed by converting those into modules (actual verilog modules, rather than "python" modules).

John's design is based around the use of global variables. The code moves from state to state, using the global variables to store forward progress. A pipeline requires the use of *local* variables (local to each stage), where the output from one stage is connected, time-synchronised, as the input to the next. Aleksander encountered some excellent work by Dan Gisselquist on [zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html), which describes various pipeline strategies, including one that involves buffered handshakes. It turns out that John's code (as a unit) in fact conforms to the very paradigm that Dan describes.

However, John's code also has stages that perform shifting one bit at a time, for normalisation of the floating-point result. The global internal variable is updated by one bit every cycle, and that's not how pipelines work: it's an imperative prerequisite that a pipeline stage do its work in a *single* cycle.

So the first incremental change was to split out each stage (alignment, normalisation, the actual add, rounding and so on) into separate classes. It was a mess. The problem is that where in computer programming languages it is normal to have a variable that can be updated (overwritten), hardware is parallel and doesn't like it when more than one piece of "code" tries to update the same "variable". Outputs *have* to be separated from inputs. So although the "code" to do some work may be split out into a separate class, it's necessary to also cleanly separate the inputs from the outputs: *no* variables may be overwritten without being properly protected, and in a pipeline paradigm, global variables are not an option.

In addition, modules need to be "connected" to the places where they are used. It's not possible to "call" a module and expect the parameters to be passed in and the inputs and outputs to magically work: nmigen is a different paradigm, because you can use either "sync" or "comb" - clock-synchronised or combinatorial logic. If you use "comb", nmigen generates hardware that is updated immediately from its inputs; if you use "sync", nmigen knows to auto-generate hardware where the result is updated from its inputs on the **next** cycle.
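By way of illustration, here is a minimal sketch (not taken from the FPU source) of the same expression assigned in each of the two domains:

```python
from nmigen import Module, Signal

m = Module()
a = Signal(8)
b = Signal(8)
sum_comb = Signal(9)
sum_sync = Signal(9)

# "comb": pure combinatorial logic - sum_comb follows a + b immediately
m.d.comb += sum_comb.eq(a + b)

# "sync": the same expression, but latched into a register, so sum_sync
# only takes on the value of a + b on the *next* clock cycle
m.d.sync += sum_sync.eq(a + b)
```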
The problem in converting code over to a module, using local inputs and outputs, *and* removing globals, is that it's too many things to tackle at once. It took about ten days to work all this out, keeping the unit tests running at all times and using their success or failure as an indicator of whether things were on track. Eventually, however, it all worked out.

# Add Example module

It's worthwhile showing some of the pipeline stages. Here's the python nmigen code for the adder first stage:

{add_code_screenshot.png}

A prerequisite is that an "alignment" phase has already run, which ensured that the exponents were both the same, so there is no need in this phase to perform bit-shifting of the mantissas: that's already been handled. There are two inputs (in_a and in_b) and one output (out_z): these are modules in their own right, each containing a sign, mantissa and exponent. in_a.m is the mantissa of input A, for example.

So the first thing is that four intermediate variables are created: one for testing whether the signs of A and B are equal (or not), a second for comparing the mantissas of A and B, and two further intermediates to store the mantissas of A and B zero-extended by one bit. Next we have some simple combinatorial tests: if the signs are the same, we perform an add of A and B's mantissas, storing the result in Z's mantissa. If we get to the next "If" statement, we know that this is to be a subtraction, not an addition. However, for subtraction, it matters which way round the subtraction is done, depending on which of A or B is the larger. It's really quite straightforward, and the important thing to note here is that the code is properly commented. It's not the most compact code in the world, and it's not the prettiest-looking: python cannot handle overloading of the assignment operator (not without overloading getattr and setattr, that is), so nmigen creates and uses a method named "eq" to handle assignment.

One aspect of this project that's considered to be extremely important is to do a visual inspection of each module. Here's what add looks like when the yosys "show" command is run on it:

{add_graph.png}

On the left it can be seen that the names are a bit of a mess: the members of A and B (s, e and m) are extracted and, because they clash, are given auto-generated names. m can be seen to go into a square (a graphviz module) with "e" and "m" on it, in a box named "add0_n_a". That's the name we chose to give to the submodule in the nmigen code, shown above, purely so that it would be easy to identify visually in the graphviz output. Note that there is an arrow into a block that takes m (bits 26 down to 0) and a single-bit zero, and outputs the two concatenated together: these then go into a diamond-block named "am0". We've identified am0 from the python code! The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater-or-Equal) operator, which in turn goes to a diamond-block named "mge": this is the check to see which of the mantissas is larger! Then we can see that $15, $12 and $18 are add and subtract operations, which feed through to a selection procedure ($group_5), which ultimately goes into the "out_tot" variable. This is the mantissa output of the addition.

So, with a little careful analysis, by tracking the names of the inputs, intermediates and outputs, we can verify that the resultant auto-generated output from nmigen looks reasonable. The thing is: the module has been *deliberately* kept small so as to be able to do *exactly this*. One of the reasons for this is illustrated in the next section.
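Since the stage itself is only shown as a screenshot above, here is a rough sketch of what it boils down to. The names am0, mge and out_tot are taken from the graph output; the flattened port structure, the class shape, the widths and the name seq are assumptions for illustration, not the actual code:

```python
from nmigen import Module, Signal, Cat

class FPAddStage:
    """First stage of the adder: mantissas are already aligned."""
    def __init__(self):
        # inputs: sign and 27-bit mantissa of A and B (exponents equal)
        self.in_a_s = Signal()
        self.in_b_s = Signal()
        self.in_a_m = Signal(27)
        self.in_b_m = Signal(27)
        # outputs: result sign and mantissa (one bit wider, for the carry)
        self.out_z_s = Signal()
        self.out_tot = Signal(28)

    def elaborate(self, platform):
        m = Module()
        # intermediates: sign-equality test, mantissa comparison, and the
        # two mantissas zero-extended by one bit.  seq is a hypothetical
        # name: the graph only shows am0, mge and out_tot.
        seq = Signal()
        mge = Signal()
        am0 = Signal(28)
        bm0 = Signal(28)
        m.d.comb += [seq.eq(self.in_a_s == self.in_b_s),
                     mge.eq(self.in_a_m >= self.in_b_m),
                     am0.eq(Cat(self.in_a_m, 0)),
                     bm0.eq(Cat(self.in_b_m, 0))]
        with m.If(seq):          # same sign: straightforward add
            m.d.comb += [self.out_tot.eq(am0 + bm0),
                         self.out_z_s.eq(self.in_a_s)]
        with m.Elif(mge):        # signs differ, A >= B: A - B, sign of A
            m.d.comb += [self.out_tot.eq(am0 - bm0),
                         self.out_z_s.eq(self.in_a_s)]
        with m.Else():           # signs differ, B > A: B - A, sign of B
            m.d.comb += [self.out_tot.eq(bm0 - am0),
                         self.out_z_s.eq(self.in_b_s)]
        return m
```

Whatever the exact details, the key point stands: the If/Elif conditions are read from named intermediate signals rather than repeated inline, for reasons shown next.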
# Where things go horribly wrong

In nmigen, it's perfectly possible to use python variables to hold (accumulate) intermediate results, without actually storing them in actual "named" hardware (so to speak). Note how, in the add code above, the tests for the If and Elif statements were placed into intermediate variables? The reason is that if they were not, yosys **duplicated** the expressions. Here's an example of where that goes horribly wrong. Note the innocuous-looking code, below:

{shift_screenshot.png}

sm.rshift basically does a variable-length right shift (the ">>" operator in both python and verilog). Note the assignment of the intermediary calculation m_mask to a python temporary variable. Note the commented-out code which uses the "reduce" operator to OR together all of the bits of a *secondary* expression, which ANDs all of the bits of "m_mask" with the input mantissa? Watch what happens when that's handed over to yosys:

{align_single_fail.png}

It's an absolute mess. If you zoom in close on the left side, what's happened is that the shift expression has been **multiplied** (duplicated) a whopping **twenty-four** times (this is a 32-bit FP number, so the mantissa is 24 bits). The reason is that the reduce operation needed 24 copies of the input, in order to select one bit at a time. Then, on the right-hand side, each bit is ORed in a chain with the previous bit, exactly as would be expected of a sequential processor performing a "reduce" operation. On seeing this graph output, it was immediately apparent that it would be totally unacceptable, yet from the python nmigen source it is not in the slightest bit obvious that there's a problem. **This is why the yosys "show" output is so important**.

On further investigation, it was discovered that nmigen has a "bool" function, which ORs all bits of a variable together. In yosys it even has a name: "reduce_bool". Here's the graph output once that function has been used instead:

{align_single_clean.png}

*Now* we are looking at something that's much clearer, smaller, cleaner and easier to understand. It's still quite amazing how so few lines of code can turn into something so comprehensive. The line of "1s" (11111111...) is where the variable "m_mask" gets created: this line of "1s" is right-shifted to create the mask. In the box named "$43" it is then ANDed with the original mantissa, reduced to a single boolean ($44, a $reduce_bool operation), and so on. This shift-and-mask is basically for the creation of the "sticky" bit in IEEE754 rounding: it's essential to get right, and it's an essential part of IEEE754 Floating-Point.
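To make the contrast concrete, here is a minimal, hedged recreation of the two approaches. The signal names and widths are stand-ins, and a plain ">>" stands in for sm.rshift:

```python
from functools import reduce
from nmigen import Module, Signal, Const

m = Module()
mantissa = Signal(24)
shift = Signal(5)
m_mask = Signal(24)
sticky_slow = Signal()
sticky_fast = Signal()

# the mask: a line of 24 "1s", right-shifted by the alignment amount
m.d.comb += m_mask.eq(Const((1 << 24) - 1, 24) >> shift)

# a python variable holding an expression: no named hardware is created
masked = mantissa & m_mask

# BAD: ORing the bits together one at a time with reduce.  yosys ends up
# duplicating the whole shift/mask expression once per selected bit.
m.d.comb += sticky_slow.eq(reduce(lambda a, b: a | b,
                                  [masked[i] for i in range(24)]))

# GOOD: nmigen's built-in bool() method collapses all the bits into a
# single $reduce_bool cell.
m.d.comb += sticky_fast.eq(masked.bool())
```

Both compute the same sticky bit; only the second leaves yosys one cell to draw.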
By doing this kind of close visual inspection, keeping things to small, compact modules, maintaining comprehensive unit test coverage, and making incremental, minimalist changes, we stand a reasonable chance of not making huge, glaring design errors, and of being able to bootstrap up to a decent design. Not knowing how to do something is not an excuse for not trying. Having a strategy for working things out is essential to succeeding, even when faced with a huge number of unknowns.

Go from known-good to known-good: create the building blocks first, make sure that they're reasonable, make sure that they're comprehensively unit-tested; then incremental changes can be attempted with the confidence that mistakes will be weeded out immediately by a unit test failing when it should not. However, as this update demonstrates, both versions of the normalisation alignment produced the correct answer, yet one of them was deeply flawed. Even code that produces the correct answer may have design flaws: that's what the visual inspection is for.