From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Sun, 23 Dec 2018 06:19:30 +0000 (+0000)
Subject: add 23dec floorplan update
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=df9a9732f702f34647efe9d83ed871f00facd8c4;p=crowdsupply.git

add 23dec floorplan update
---

diff --git a/updates/006_2018dec23_floorplan.mdwn b/updates/006_2018dec23_floorplan.mdwn
new file mode 100644
index 0000000..fbd8d4b
--- /dev/null
+++ b/updates/006_2018dec23_floorplan.mdwn
@@ -0,0 +1,145 @@
+# A Reasonably Sane Plan
+
+Honestly there is nothing sane about merging a variable-size polymorphic
+vectorisation front-end onto a standard RISC register file in an MMX/SSE
+fashion, right down to the byte level, however that's what we've chosen
+to do.  Why? Well, because it's not been done before, and we'd like to
+see how it works out.  Plus, there are no new instructions needed, and
+unlike a traditional vector system, which has its own pipeline and its
+own register file, we don't need special instructions to transfer to and
+from a separate vector register file (which would contain both integer and
+floating point numbers), and we can leverage an absolutely standard
+superscalar out-of-order microarchitecture, to save on design effort.
+That's the theory.
+
+One of the things that's proved to be rather scary is both the size of
+the register files (128 FP and 128 INT 64-bit registers), and the number
+of ports needed for high-end processors: reports of 8R3W are not uncommon.
+We're going for an odd-even hi-lo approach: 4 banks, with each 64-bit
+register split into hi and lo 32-bit halves, dividing further into odd
+register numbers and even register numbers.
+
+In the previous update it was explained that we will fully route source
+registers (and sub-register "elements") down to the byte level, so that
+after they have been processed through the ALU, there is absolutely no
+need to do any further routing.  This is akin to a standard vectorisation
+system's "lanes".  Additionally, every byte on the register file will
+have its own separate "write" line, such that for 16-bit and 8-bit
+element widths we do not need to do extraneous read-merge-write cycles.
+
+It was also explained that byte-level source register routing across all
+four banks needs a 16-to-16 crossbar, routing 8-bit values from any of 16
+source locations to any of 16 destination locations.  This is simply too much,
+particularly given that if we use 2R1W we will need *two* 16-to-16
+crossbars.  The number of gates is massive.
+
+We have an accompanying [[video]](https://www.youtube.com/watch?v=78het1cfz_8)
+walkthrough, however here is a photo of the scheme currently under discussion:
+
+{{libreriscv_floorplan.jpg}}
+
+What we will likely go with is a hybrid arrangement.  In the top right of
+the above photo is a 4-bank arrangement, 32-bit wide as before.  However
+there is only 4-to-4 crossbar routing, 32-bit wide.  Again, this is only
+on the source registers.  Two of these crossbars will be needed: one for
+src1, one for src2.
+
+In the bottom middle you can see that we decided to put xBitManip
+Function Units into the 8-bit Function Unit Area.  These are actually
+32-bit bit manipulation ALUs, however we are putting them in the *8-bit* area.
+The reason is very simple: these xBitManip ALUs will *also* be used, in
+a pseudo-micro-code fashion, to serve the dual purpose of reordering
+and routing source element register bytes to the correct "lane".
+
+What will happen is:
+
+* Each 8-bit Function Unit (synonymous in this scheme with a
+  "Reservation Station" row) will have src1 and src2 latches
+  for incoming registers.
+* 32-bit data will be latched into the **wrong** 8-bit Function Unit,
+  along with the remainder of the element "address" to which the
+  source value **should** be directed.
+* The "wrong" data will be sent through the xBitManip ALUs, to shuffle
+  and permute it to the **right** order.
+* **Pre-existing** operand "forwarding" routing will take the output
+  from the xBitManip ALUs and put it **back** into the Function Unit
+  Reservation Station src1 (or src2) latches.
+* With the source sub-register 8-bit values now in their correct "lanes",
+  the actual required 8-bit ALU operation may now proceed.
+
+So it's a multi-stage process that's very similar to micro-code operations:
+it is however easier to hard-wire the use of the xBitManip ALUs than it
+is to create multiple micro-code instructions, which was one possibility
+that was considered.
+
+In essence, the xBitManip ALU can handle 4x4 crossbar routing at the byte
+level with no difficulties whatsoever, so we might as well use it for
+precisely that job.  What's nice is that we can decide how many xBitManip
+ALUs to put in, depending on how the VPU workload works out.  Plus, the
+infrastructure to handle queueing, routing and temporary storage of the
+in-flight source register values *already exists*.  The alternative previously
+discussed was to have massive duplicated dedicated 16x16 crossbars: now
+we have only 4x4 32-bit crossbars plus a *small* number of 4x4 8-bit
+crossbars (aka xBitManip ALUs), saving significantly on the number of gates.
+
+# Reducing Register-FU Matrix sizes
+
+Also, one significant detail.  Recall from the previous update that a scheme
+was finally envisaged where 64-bit Function Units would cascade-block
+32-bit Function Units right down to 8-bit, on any given register.  We decided
+that this, too, was insane, given that it would result in a whopping 16-fold
+increase in the size of the Function Unit Matrices.
+
+Instead we decided to go with 32-bit to 8-bit cascade-blocking, where
+two adjacent 32-bit Function Units would be required to perform 64-bit
+operations, and two adjacent 8-bit Function Units required to do 16-bit
+operations.  In this way the FU-to-FU Dependency Matrices are reduced
+down to only a four-fold size increase when compared to a more traditional
+SIMD arrangement.
+
+In the middle towards the top of the above picture, we can therefore
+see a four-wide group of 32-bit Function Units: FU1 through FU4.  These,
+unsurprisingly, are dedicated to *destination* register banks, i.e. the
+write port is connected very specifically and exclusively to their
+respective RegFile bank.
+
+Function Units 8 through 12 are the 8-bit FUs.  Really there should
+be sixteen of these, because it is likely that we will need one for
+every byte of the full width of 4 32-bit register banks.  If we do not
+have 16 of them, having say only 8, it will be necessary to do *destination*
+routing to the correct 32-bit-wide RegFile bank.  This is something that
+we are keen to avoid.
+
+Also bear in mind that we have not shown, in the above diagram,
+the enhancements designed by Mitch Alsup, to the 6600 Scoreboard system.
+These enhancements basically add LOAD/STORE "Function Units", which fill
+the exact same role as the Tomasulo scheme's LOAD/STORE queues (providing
+out-of-order but correctly sequenced LOAD/STORE operations).  One Function
+Unit (aka Reservation Station) is required per outstanding LOAD/STORE
+needed, and we need LOAD/STOREs on **both** the 32-bit FUs **and** the
+8-bit FUs.  It **may** be possible to merge these into one: we will have
+to see.
+
+Also, Branch Prediction (including speculative execution) requires individual
+Function Units: one for each branch that is intended to run ahead.  Remember
+that it was previously mentioned that there would be a "Schroedinger" wire
+indicating that the instructions operating in the "shadow" of the branch
+would be neither alive nor dead, and that until this was determined they
+would be treated as "Write Hazards", allowing them to *execute* but **not**
+commit (write) their results.  We will need such Function Units on both
+the 32-bit **and** the 8-bit areas.  Exceptions likewise.
+
+So if we are not careful we could easily end up with 64 Function Units:
+32 for the 32-bit area and 32 for the 8-bit area.  This is going to need
+some experimentation and some detailed thought, when it comes to actual
+implementation.  A 64x64 Function Unit Dependency Matrix is pretty massive,
+even if the cell size (and power consumption) is very small compared to
+Tomasulo plus Reorder Buffers, with associated CAMs.
+
+There is a lot of detail that still needs to be worked out: we are however reaching
+the end of the critical "overview" planning phase.  Really it is time to
+start implementing a first iteration, to see how it works out.  For that,
+we will be looking closely at Mitch Alsup's unpublished book chapters,
+as there really is no reason why we should not just implement the gate-level
+diagrams that he has kindly given permission to use (with credit).
+