From: Luke Kenneth Casson Leighton
Date: Sun, 3 Mar 2019 00:17:06 +0000 (+0000)
Subject: FPU update
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=c756d2f8b95bfc84305d04dfb639533528655e2d;p=crowdsupply.git

FPU update
---

diff --git a/images/shift_screenshot.png b/images/shift_screenshot.png
new file mode 100644
index 0000000..17fab04
Binary files /dev/null and b/images/shift_screenshot.png differ
diff --git a/updates/017_2019mar02_nmigen_learning_curve.mdwn b/updates/017_2019mar02_nmigen_learning_curve.mdwn
new file mode 100644
index 0000000..6765f9a
--- /dev/null
+++ b/updates/017_2019mar02_nmigen_learning_curve.mdwn
@@ -0,0 +1,210 @@

# FPUs and nmigen

[nmigen](https://github.com/m-labs/nmigen) by
[m-labs](http://m-labs.hk/migen/index.html) is turning out to be a very
interesting choice. It's not a panacea: however, the fact remains that
verilog is never going to be as up-to-date as python, or gain the advanced
and powerful features that python has, and, in all seriousness, nor should
it. Instead, it is quite reasonable to treat verilog, in effect, as machine
code: a compiler target.

However, it is critical to remember that despite the code being written in
python, the underlying rules to obey are those of hardware, not software.
Modules (and how to use them) are not the same thing - at all - as calling
a function, and classes are definitely not synonymous with modules.
This update outlines some of the quirks encountered.

# Modules != classes

The initial conversion of John Dawson's IEEE754 FPU verilog code to nmigen
went extremely well and very rapidly. Where things began to come unstuck,
for over a week, was in the effort to "pythonify" the code, with a view to
converting a Finite State Machine design into a pipeline. The initial work
focussed on splitting the relevant sections of code out into python classes
and functions, to be followed by converting those into modules (actual
verilog modules, rather than "python" modules).

John's design is based around the use of global variables. The code moves
from state to state, using the global variables to store forward progress.
A pipeline, by contrast, requires the use of *local* variables (local to
each stage), where the output from one stage is connected, time-synchronised,
as the input to the next. Aleksander encountered some excellent work by
Dan Gisselquist on
[zipcpu](https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html),
which describes various pipeline strategies, including one involving
buffered handshakes. It turns out that John's code (as a unit) in fact
conforms to the very paradigm that Dan describes. However, John's code
also has stages that perform shifting one bit at a time, for normalisation
of the floating-point result: a global internal variable is updated by one
bit every cycle, and that's not how pipelines work. It is an imperative
prerequisite that a pipeline stage do its work in a *single* cycle.

So the first incremental change was to split out each stage (alignment,
normalisation, the actual add, rounding and so on) into separate classes.

It was a mess.

The problem is that, where in computer programming languages it is normal
to have a variable that can be updated (overwritten), hardware is parallel,
and does not like it when more than one piece of "code" tries to update the
same "variable". Outputs *have* to be kept separate from inputs.
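To illustrate, here is a minimal sketch of the shape that each stage
eventually took (the names and widths are illustrative only, not the
actual FPU code): a stage's inputs and outputs are distinct Signals,
and the stage only ever drives its outputs.

```python
from nmigen import Module, Signal

class Stage:
    def __init__(self, width=28):
        self.i = Signal(width)  # input: driven by the *previous* stage
        self.o = Signal(width)  # output: driven here and nowhere else

    def elaborate(self, platform):
        m = Module()
        # the stage's "work" is a pure function from input to output.
        # there is deliberately no equivalent of "i = i + 1": the input
        # is never overwritten.
        m.d.comb += self.o.eq(self.i + 1)
        return m
```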
So although the "code" to do some work may be split out into a separate
class, it is also necessary to cleanly separate the inputs from the
outputs. *No* variables may be overwritten without being properly
protected, and in a pipeline paradigm, global variables are not an option.

In addition, modules need to be "connected" to the places where they are
used. It's not possible to "call" a module and expect the parameters to be
passed in, with the inputs and outputs magically working by themselves:
nmigen is a different paradigm, because you may use either "sync" or
"comb" - clock-synchronised or combinatorial logic.

If you use "comb", nmigen generates hardware whose outputs update
immediately from the inputs. If you use "sync", however, nmigen knows to
auto-generate hardware in which the result is updated from its inputs on
the **next** cycle. The problem with converting code over to a module,
using local inputs and outputs, *and* removing globals, is that it's too
many things to tackle at once.

It took about ten days to work all this out, keeping the unit tests running
at all times and using their success or failure as an indicator of whether
things were on track. Eventually, however, it all worked out.

# Add Example module

It's worthwhile showing some of the pipeline stages. Here's the python
nmigen code for the adder first stage:

{add_code_screenshot.png}

A prerequisite is that an "alignment" phase has already run, ensuring that
the exponents are both the same, so there is no need in this phase to
perform bit-shifting of the mantissas: that has already been handled.

There are two inputs (in_a and in_b) and one output (out_z): these are
modules in their own right, each containing a sign, mantissa and exponent
(in_a.m is the mantissa of input A, for example). The first thing that
happens is that four intermediate variables are created: one for testing
whether the signs of A and B are equal, a second for comparing the
mantissas of A and B, and two further intermediates to store the mantissas
of A and B zero-extended by one bit.

Next come some simple combinatorial tests: if the signs are the same, we
perform an add of A and B's mantissas, storing the result in Z's mantissa.
If we get to the next "If" statement, we know that this is to be a
subtraction, not an addition. However, for subtraction it matters which
way round the operands go, depending on which of A or B is the larger.

It's really quite straightforward, and the important thing to note here is
that the code is properly commented. It's not the most compact code in the
world, nor the prettiest-looking. Python cannot handle overloading of the
assignment operator (not without overloading getattr and setattr, that is),
so nmigen creates and uses a method named "eq" to handle assignment.

One aspect of this project that's considered to be extremely important
is to do a visual inspection of each module. Here's what add looks like
when the yosys "show" command is run on it:

{add_graph.png}

On the left it can be seen that the names are a bit of a mess: the members
of A and B (s, e and m) are extracted and, because they clash, are given
auto-generated names. m can be seen to go into a square (the graphviz node
for a submodule) with "e" and "m" on it, in a box named "add0_n_a". That's
the name we chose to give to the submodule in the nmigen code, shown above,
purely so that it would be easy to visually identify in the graphviz output.
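Before digging further into the graph, here is a rough reconstruction of
that first adder stage, pieced together from the description above (the
names am0, mge and out_tot match the graph; the widths, the class layout
and the a_/b_ signal names are assumptions, not the actual code in the
screenshot):

```python
from nmigen import Module, Signal, Cat, Const

class AddStage0:
    def __init__(self, width=24):
        self.width = width
        self.a_s = Signal()               # sign of A
        self.b_s = Signal()               # sign of B
        self.a_m = Signal(width)          # mantissa of A (pre-aligned)
        self.b_m = Signal(width)          # mantissa of B (pre-aligned)
        self.out_tot = Signal(width + 1)  # result mantissa, 1 bit wider

    def elaborate(self, platform):
        m = Module()
        # intermediates go into named signals so that yosys does not
        # duplicate the expressions (see the next section for why)
        seq = Signal()                # are the two signs equal?
        mge = Signal()                # is mantissa A >= mantissa B?
        am0 = Signal(self.width + 1)  # mantissa A, zero-extended by 1 bit
        bm0 = Signal(self.width + 1)  # mantissa B, zero-extended by 1 bit
        m.d.comb += [
            seq.eq(self.a_s == self.b_s),
            mge.eq(self.a_m >= self.b_m),
            am0.eq(Cat(self.a_m, Const(0, 1))),
            bm0.eq(Cat(self.b_m, Const(0, 1))),
        ]
        with m.If(seq):    # same sign: add the mantissas
            m.d.comb += self.out_tot.eq(am0 + bm0)
        with m.Elif(mge):  # signs differ: subtract smaller from larger
            m.d.comb += self.out_tot.eq(am0 - bm0)
        with m.Else():
            m.d.comb += self.out_tot.eq(bm0 - am0)
        return m
```

Note that everything is in the "comb" domain: the stage does all of its
work combinatorially, within a single cycle. A registered stage would use
m.d.sync instead, and its outputs would appear on the following clock edge.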
Note that there is an arrow into a block that takes m (bits 26 down to 0)
and a single-bit zero, and outputs the two Concatenated together: these
then go into a diamond-block named "am0". We've identified am0 from the
python code!

The m (mantissa A) and m$2 (mantissa B) also go into $9, a "ge" (Greater
or Equal) operator, which in turn goes to a diamond-block named "mge":
this is the check to see which of the mantissas is larger!

Then we can see that $15, $12 and $18 are the add and subtract operations,
which feed through a selection procedure ($group_5) and ultimately into
the "out_tot" variable. This is the mantissa output of the addition.

So with a little careful analysis, tracking the names of the inputs,
intermediates and outputs, we can verify that the auto-generated output
from nmigen looks reasonable. The thing is: the module has been
*deliberately* kept small so as to be able to do *exactly this*. One of
the reasons why is illustrated below.

# Where things go horribly wrong

In nmigen, it's perfectly possible to use python variables to hold
(accumulate) intermediate results, without them actually being stored in
"named" hardware (so to speak). Note how, in the add example above, the
tests for the If and Elif statements were placed into intermediate
variables? The reason is that, if they were not, yosys **duplicated** the
expressions. Here's an example of where that goes horribly wrong. Note
the innocuous-looking code below:

{shift_screenshot.png}

sm.rshift basically does a variable-length right shift (the ">>" operator
in both python and verilog). Note the assignment of the intermediary
calculation m_mask to a python temporary variable, and note the
commented-out code, which uses the "reduce" operator to OR together all
of the bits of a *secondary* expression: one which ANDs all of the bits
of "m_mask" with the input mantissa. Watch what happens when that's
handed over to yosys:

{align_single_fail.png}

It's an absolute mess. If you zoom in close on the left side, what's
happened is that the shift expression has been **multiplied** (duplicated)
a whopping **twenty-four** times (this is a 32-bit FP number, so the
mantissa is 24 bits). The reason is that the reduce operation needed 24
copies of the input in order to select one bit at a time. Then, on the
right-hand side, each bit is ORed in a chain with the previous bit,
exactly as a sequential processor would carry out a "reduce" operation.

On seeing this graph output, it was immediately apparent that it was
totally unacceptable, yet from the python nmigen source it is not in the
slightest bit obvious that there's a problem. **This is why the yosys
"show" output is so important.**

On further investigation, it was discovered that there is a "bool"
function in nmigen, which ORs all the bits of a variable together. In
yosys it even has a name: "reduce_bool". Here's the graph output once
that function is used instead:

{align_single_clean.png}

*Now* we are looking at something that's much clearer, smaller, cleaner
and easier to understand. It's still quite amazing how so few lines of
code can turn into something so comprehensive. The line of "1s"
(11111111...) is where the variable "m_mask" gets created: this line of
"1s" is right-shifted to create the mask. In the box named "$43" it is
then ANDed with the original mantissa, reduced to a single boolean OR
($44) with a $reduce_bool operation, and so on.
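To make the difference concrete, here is a small sketch of the two
approaches side by side (the 24-bit width and the signal names are
assumptions based on the description above, and sm.rshift is modelled
with a plain ">>"):

```python
from functools import reduce
import operator
from nmigen import Module, Signal, Const

m = Module()
mantissa = Signal(24)  # the input mantissa
shift = Signal(5)      # variable right-shift amount from the aligner
sticky = Signal()      # OR of all the bits that get shifted out

# m_mask is a *python* variable holding an nmigen expression, not a
# named Signal: every place it is used re-instantiates the shifter.
m_mask = Const(0xffffff, 24) >> shift

# naive version (the commented-out code): a python-level reduce that
# selects and ORs the masked bits one at a time.  yosys ends up with
# 24 copies of the mask/shift logic plus a 24-deep OR chain.
#m.d.comb += sticky.eq(reduce(operator.or_,
#                             [(mantissa & m_mask)[i] for i in range(24)]))

# clean version: a single AND feeding one $reduce_bool cell.
m.d.comb += sticky.eq((mantissa & m_mask).bool())
```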
This shift-mask exists to create the "sticky" bit used in IEEE754
rounding. It is an essential part of IEEE754 floating-point, and it has
to be exactly right. By doing this kind of close visual inspection,
keeping things to small, compact modules, maintaining comprehensive unit
test coverage and making incremental, minimalist changes, we stand a
reasonable chance of avoiding huge glaring design errors and of
bootstrapping up to a decent design.

Not knowing how to do something is not an excuse for not trying. Having a
strategy for working things out is essential to succeeding, even when
faced with a huge number of unknowns. Go from known-good to known-good:
create the building blocks first, make sure that they're reasonable, make
sure that they're comprehensively unit-tested; then incremental changes
can be attempted with the confidence that mistakes will be weeded out
immediately, by a unit test failing when it should not.

However, as this update demonstrates, both those versions of the
normalisation alignment produced the correct answer, yet one of them was
deeply flawed. Even code that produces the correct answer may have design
flaws: that's what the visual inspection is for.