From: Luke Kenneth Casson Leighton Date: Tue, 4 Dec 2018 11:47:08 +0000 (+0000) Subject: add 003 microarch update X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=9ab4c6bc1ae930b6569569413484ccf3ebbda230;p=crowdsupply.git add 003 microarch update --- diff --git a/updates/003_2018dec04_microarchitecture.mdwn b/updates/003_2018dec04_microarchitecture.mdwn new file mode 100644 index 0000000..e97788a --- /dev/null +++ b/updates/003_2018dec04_microarchitecture.mdwn @@ -0,0 +1,83 @@ +# Microarchitectural Decisions + +The Libre-RISCV core is planning to deploy innovative vectorisation +system, known as +[Simple-V](https://libre-riscv.org/simple_v_extension/specification). +Honestly, it's not very simple at all! The principle is straightforward: +mark ordinary registers as "vectorised", and when an instruction uses one +such "tagged" register, go into a hardware-unrolled version of a software +macro loop, issuing otherwise identical instructions with *contiguous* +sequentially-increasing register numbers. + +It sounds easy... except that the "tag" table behind the registers has +been extended to: + +* add predication (switching on and off individual "elements" of the vector) +* change the element width so that what was previously a 64-bit ADD can be + used to do 16-bit or even 8-bit ADDs +* reordering of the elements, for 2D and 3D arrays and Matrices: very + useful for in-place transposing for Matrix Multiplication) +* extend the instructions so that they can access up to 128 registers, + where previously that was limited to 32 (and only 8 for Compressed + instructions) + +All of these turn out to be important for GPU workloads. + +One of the most challenging aspects of SV is that there is no restriction +on the "redirection". Whilst one instruction could use register 5 and +another uses register 10, *both* of them could actually be "redirected" +to use register 112, for example. One of those could even be changed +to 32-bit operations whilst the other is set to 16-bit element widths. + +Our initial thoughts were to try a standard simple in-order SIMD architecture, +with predication bits passed down into the SIMD ALUs. If a bit is "off", +that "lane" within the ALU does not calculate a result, saving power. +However, a pre-analysis engine is required that re-orders the registers, +packs lanes of data together so that it fits into one SIMD ALU, and, on +exit from the ALU, it may be necessary to split and "redirect" parts of the +data to *multiple* actual 64-bit registers. In other words, bit-level +(or byte-level) manipulation is required, both pre- and post- ALU. + +This is complicated! + +As part of the initial design of SV, there was an accidental assumption +that it would be perfectly reasonable to use a multi-issue execution +engine, and to simply drop multiple of those "hardware-loop-unrolled" +operations into the instruction queue. This turns out to be a radically +different paradigm from standard vector processors, where a loop allocates +elements to "lanes", and if a predication bit is not set, the lane +runs "empty". By contrast, with the multi-issue execution model, an +operation that is predicated out means that the element-based instruction +simply does not make it into the instruction queue. + +Simple! + +There are many other benefits to a multi-issue microarchitecture, and +these are being discussed +[here](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000198.html) +on the mailing list. Personally I favour a modified version of the +[Tomasulo Algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm), +which includes what is known as a +[Reorder Buffer](https://en.wikipedia.org/wiki/Re-order_buffer). +What is particularly nice about this algorithm is that it was first introduced +in 1967, and came to prominence in the early 1990s when Moore's Law started +to hit a speed wall. That in particular means that, firstly, it's extremely +commonly taught in Universities, and, secondly, patents on the algorithm +have long since expired. + +Also, there are both memory hazards and register hazards that a Reorder +Buffer augmented Tomasulo algorithm takes care of, whilst also allowing +for branch prediction and really simple roll-back, preservation of +execution order even though instructions may actually be done out of order, +and, crucially, some ALUs may take longer than others, and the algorithm +simply does not care. In addition, there may be a really simple way to +extend the Reorder Buffer tags to accomodate SIMD-style characteristics. + +We also may need to have simple Branch Prediction, because some of the +loops in [Kazan](https://salsa.debian.org/Kazan-team/kazan/) are particularly +tight. A Reorder Buffer can easily be used to implement Branch Prediction, +because, just as with an Exception, the ROB needs to be cleared out +(flushed) if the branch is mispredicted. As it is necessary to respect +Exceptions, the logic has to exist to clear out the ROB: Branch Prediction +simply uses this pre-existing logic. +