From df9a9732f702f34647efe9d83ed871f00facd8c4 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Sun, 23 Dec 2018 06:19:30 +0000
Subject: [PATCH] add 23dec floorplan update

---
 updates/006_2018dec23_floorplan.mdwn | 145 +++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 updates/006_2018dec23_floorplan.mdwn

diff --git a/updates/006_2018dec23_floorplan.mdwn b/updates/006_2018dec23_floorplan.mdwn
new file mode 100644
index 0000000..fbd8d4b
--- /dev/null
+++ b/updates/006_2018dec23_floorplan.mdwn
@@ -0,0 +1,145 @@
+# A Reasonably Sane Plan
+
+Honestly there is nothing sane about merging a variable-size polymorphic
+vectorisation front-end onto a standard RISC register file in an MMX/SSE
+fashion, right down to the byte level, however that's what we've chosen
+to do. Why? Well, because it's not been done before, and we'd like to
+see how it works out. Plus, there are no new instructions needed, and
+unlike a traditional vector system, which has its own pipeline and its
+own register file, we don't need special instructions to transfer between
+the vector register file (which will contain both integer and floating
+point numbers), and we can leverage an absolutely standard superscalar
+out-of-order microarchitecture, to save on design development effort.
+That's the theory.
+
+One of the things that's proved to be rather scary is both the size of
+the register files (128 FP and 128 INT 64-bit registers), and the number
+of ports needed for high-end processors: reports of 8R3W are not uncommon.
+We're going for an odd-even hi-lo approach: 4 banks with a 32-32 bit
+split and dividing further into odd register numbers and even register
+numbers.
+
+In the previous update it was explained that we will fully route source
+registers (and sub-register "elements") down to the byte level, so that
+after they have been processed through the ALU, there is absolutely no
+need to do any further routing.
+This is akin to a standard vectorisation
+system's "lanes". Additionally, every byte on the register file will
+have its own separate "write" line, such that for 16-bit and 8-bit
+element widths we do not need to do extraneous read-merge-write cycles.
+
+It was also explained that byte-level source register routing across
+all four banks requires a 16-to-16 crossbar, routing 8-bit values from
+any of 16 source locations to any of 16 destination locations. This is
+simply too much, particularly given that if we use 2R1W we will need
+*two* 16-to-16 crossbars. The number of gates is massive.
+
+We have an accompanying [[video]](https://www.youtube.com/watch?v=78het1cfz_8)
+walkthrough, however here is a photo of the scheme currently under discussion:
+
+{{libreriscv_floorplan.jpg}}
+
+What we will likely go with is a hybrid arrangement. In the top right of
+the above photo is a 4-bank arrangement, 32-bit wide as before. However
+there is only 4-to-4 crossbar routing, 32-bit wide. Again, this is only
+on the source registers. Two of these crossbars will be needed: one for
+src1, one for src2.
+
+In the bottom middle you can see that we decided to put xBitManip
+Function Units into the 8-bit Function Unit Area. These are actually
+32-bit bit manipulation ALUs, however we are putting them in the *8-bit* area.
+The reason is very simple: these xBitManip ALUs will *also* be used, in
+a pseudo-micro-code fashion, to serve the dual purpose of reordering
+and routing source element register bytes to the correct "lane".
+
+What will happen is:
+
+* Each 8-bit Function Unit (synonymous in this scheme with a
+  "Reservation Station" row), will have src1 and src2 latches
+  for incoming registers.
+* 32-bit data will be latched into the **wrong** 8-bit Function Unit,
+  along with the remainder of the element "address" to which the
+  source value **should** be directed.
+* The "wrong" data will be sent through the xBitManip ALUs, to shuffle
+  and permute it to the **right** order.
+* **Pre-existing** operand "forwarding" routing will take the output
+  from the xBitManip ALUs and put it **back** into the Function Unit
+  Reservation Station src1 (or src2) latches.
+* With the source sub-register 8-bit values now in their correct "lanes",
+  the actual required 8-bit ALU operation may now proceed.
+
+So it's a multi-stage process that's very similar to micro-code operations:
+it is however easier to hard-wire the use of the xBitManip ALUs than it
+is to create multiple micro-code instructions, which was one possibility
+that was considered.
+
+In essence, the xBitManip ALU can handle 4x4 crossbar routing at the byte
+level with no difficulties whatsoever, so we might as well use it for
+precisely that job. What's nice is that we can decide how many xBitManip
+ALUs to put in, depending on how the VPU workload works out. Plus, the
+infrastructure to handle queueing, routing and temporary storage of the
+in-flight source register values *already exists*. The alternative previously
+discussed was to have massive duplicated dedicated 16x16 crossbars: now
+we have only 4x4 32-bit crossbars plus a *small* number of 4x4 8-bit
+crossbars (aka xBitManip ALUs), saving significantly on the number of gates.
+
+# Reducing Register-FU Matrix sizes
+
+Also, one significant detail. Recall in the previous update that a scheme
+was finally envisaged where 64-bit Function Units would cascade-block
+32-bit Function Units right down to 8-bit, on any given register. We decided
+that this, too, was insane, given that it would result in a whopping 16
+fold increase in the Function Unit Matrices.
+
+Instead we decided to go with 32-bit to 8-bit cascade-blocking, where
+two adjacent 32-bit Function Units would be required to perform 64-bit
+operations, and two adjacent 8-bit Function Units required to do 16-bit
+operations.
+In this way the FU-to-FU Dependency Matrices are reduced
+down to only a four-fold size increase when compared to a more traditional
+SIMD arrangement.
+
+In the middle towards the top of the above picture, we can therefore
+see a four-wide group of 32-bit Function Units: FU1 through FU4. These,
+unsurprisingly, are dedicated to *destination* register banks, i.e. the
+write port is connected very specifically and exclusively to their
+respective RegFile bank.
+
+Function Units 8 through 12 are the 8-bit FUs. Really there should
+be sixteen of these, because it is likely that we will need one for
+every byte of the full width of 4 32-bit register banks. If we do not
+have 16 of them, having say only 8, it will be necessary to do *destination*
+routing to the correct 32-bit-wide RegFile bank. This is something that
+we are keen to avoid.
+
+Also bear in mind that we have not shown, in the above diagram,
+the enhancements designed by Mitch Alsup, to the 6600 Scoreboard system.
+These enhancements basically add LOAD/STORE "Function Units", which cover
+the exact same role as the Tomasulo scheme's LOAD/STORE queues (provide
+out-of-order correctly sequenced LOAD/STORE operations). One Function
+Unit (aka Reservation Station) is required per outstanding LOAD/STORE
+needed, and we need LOAD/STOREs on **both** the 32-bit FUs **and** the
+8-bit FUs. It **may** be possible to merge these into one: we will have
+to see.
+
+Also, Branch Prediction (including speculative execution) requires individual
+Function Units: one for each branch that is intended to run ahead. Remember
+that it was previously mentioned that there would be a "Schroedinger" wire
+indicating that the instructions operating in the "shadow" of the branch
+would be neither alive nor dead, and that until this was determined they
+would be treated as "Write Hazards", allowing them to *execute* but **not**
+commit (write) their results.
+We will need such Function Units on both
+the 32-bit **and** the 8-bit areas. Exceptions likewise.
+
+So if we are not careful we could easily end up with 64 Function Units:
+32 for the 32-bit area and 32 for the 8-bit area. This is going to need
+some experimentation and some detailed thought, when it comes to actual
+implementation. A 64x64 Function Unit Dependency Matrix is pretty massive,
+even if the cell size (and power consumption) is very small compared to
+Tomasulo plus Reorder Buffers, with associated CAMs.
+
+There is a lot of detail that still needs to be worked out: we are however
+reaching the end of the critical "overview" planning phase. Really it is
+time to start implementing a first iteration, to see how it works out. For
+that, we will be looking closely at Mitch Alsup's unpublished book chapters,
+as there really is no reason why we should not just implement the gate-level
+diagrams that he has kindly given permission to use (with credit).
+
-- 
2.30.2