From fe70e17aed062bf5a0b0d655e63559b4e96d0681 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Fri, 27 Mar 2020 11:29:45 +0000
Subject: [PATCH] add LD/ST buffer section

---
 ...23_2020mar26_decoder_emulator_started.mdwn | 74 ++++++++++++++++++-
 1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/updates/023_2020mar26_decoder_emulator_started.mdwn b/updates/023_2020mar26_decoder_emulator_started.mdwn
index 838b977..3cecbda 100644
--- a/updates/023_2020mar26_decoder_emulator_started.mdwn
+++ b/updates/023_2020mar26_decoder_emulator_started.mdwn
@@ -197,7 +197,79 @@ TODO

# LOAD/STORE Buffer and 6600 design documentation

-TODO
+A critical part of this project is not just to create a chip: it is to
*document* the chip design and the decisions made along the way, for
educational, research, and ongoing maintenance purposes. With an
augmented CDC 6600 design chosen as the fundamental basis,
[documenting that](https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/),
as well as the key differences, is particularly important. At the very least,
the circular loops of the 6600 are extremely simple and highly effective
hardware, yet timing-critical: James Thornton (the co-designer of the 6600)
acknowledged how paradoxically challenging it is to understand why so few
gates could be so effective. Consequently, documenting the design just to
be able to *develop* it is extremely important.

We are getting to the point where we need to connect the LOAD/STORE
Computation Units up to an actual memory architecture. We have chosen
[minerva](https://github.com/lambdaconcept/minerva/blob/master/minerva/units/loadstore.py)
as the basis because it is written in nmigen, it works, and, crucially, it
uses wishbone (which we decided to use as the main Bus Backbone a few
months ago).

However, minerva is a single-issue 32-bit embedded chip, where it is
perfectly fine to have one single LD/ST operation per clock, and for that
operation to take a few clock cycles. To get anything like the level of
performance needed of a GPU, we need at least four 64-bit LOADs or STOREs
*every clock cycle*.

For a first ASIC from a team that has never done a chip before, this is,
officially, "Bonkers Territory". Where minerva uses 32-bit-wide Buses
(and does not support 64-bit LD/ST at all), we need internal data buses
that are, at minimum, a whopping **2000** wires wide.

Let that sink in for a moment.

The reason the internal buses need to be 2000 wires wide comes down to
the fact that we need, realistically, six to eight LOAD/STORE Computation
Units: four of them operational, and another two to four waiting with
pending instructions from the multi-issue Vectorisation Engine.

We chose a system which expands the bottom 4 bits of the address, plus
the operation width (1, 2, 4 or 8 bytes), into a "bitmap" - a byte-mask -
that corresponds directly with the 16-byte "cache line" byte-enable
columns in the L1 Cache. These bitmaps can then be "merged" such that
requests going to the same cache line can be served to multiple
LOAD/STORE Computation Units *in the same clock cycle*. This is
absolutely critical for effective Vector Processing.

Additionally, in order to deal with misaligned memory requests, each of
those units needs to put out *two* such 16-byte-wide requests (see where
this is going?) to the L1 Cache. So we now have eight times two times
128 bits, which is a staggering 2048 wires *just for the data*.
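To make the byte-mask idea a little more concrete, here is a minimal,
purely illustrative Python sketch (not the actual nmigen implementation;
the function names `byte_masks` and `merge_requests` are invented for this
example). It shows how the bottom 4 bits of an address plus the operation
width expand into a 16-bit byte-enable mask, how a misaligned access spills
over into a *second* request on the next cache line, and how requests
hitting the same cache line can be OR-merged:

    # Illustrative sketch only: the real design is written in nmigen and
    # differs in detail.  Function names are invented for this example.

    def byte_masks(addr, width):
        """Expand the bottom address bits plus the operation width
        (1, 2, 4 or 8 bytes) into (cache_line, 16-bit byte-mask) requests.
        A misaligned access produces a second request for the next line."""
        assert width in (1, 2, 4, 8)
        line = addr >> 4               # which 16-byte cache line
        offset = addr & 0xF            # bottom 4 bits: byte position in line
        mask = ((1 << width) - 1) << offset
        requests = [(line, mask & 0xFFFF)]
        if mask >> 16:                 # spilled over the cache-line boundary
            requests.append((line + 1, mask >> 16))
        return requests

    def merge_requests(requests):
        """OR together byte-masks that target the same cache line, so that
        several LD/ST Computation Units can be served in one clock cycle."""
        merged = {}
        for line, mask in requests:
            merged[line] = merged.get(line, 0) | mask
        return merged

    # two 8-byte accesses: one aligned, one crossing a cache-line boundary
    reqs = byte_masks(0x1008, 8) + byte_masks(0x101C, 8)
    print(merge_requests(reqs))
    # lines 0x100, 0x101, 0x102 get masks 0xFF00, 0xF000, 0x000F

In the real hardware these masks are wires feeding the L1 byte-enable
columns rather than Python integers, but the merging step is what allows
multiple LOAD/STORE Computation Units to be served in the same clock cycle.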
There are ways to get that down (potentially to half), and ways to cut it
in half again; however, doing so would miss opportunities for merging
requests into cache lines.

At that point, thanks to Mitch Alsup's input (Mitch is the designer of
the Motorola 68000 and 88120, a key architect on AMD's Opteron Series,
the AMD K9, AMDGPU and Samsung's latest GPU), we learned that L1 cache
design critically depends on what type of SRAM you have. We initially,
naively, wanted dual-ported L1 SRAM, and that is when Staf and Mitch
taught us that this results in a half-duty rate. Only 1-Read **or**
1-Write SRAM Cells give you data rates fast enough (single-cycle) to be
usable for L1 Caches.

Part of the conversation wandered into
[why we chose dynamic pipelines](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005459.html),
and it is also where we received that
[important advice](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005354.html)
from both Mitch Alsup and Staf Verhaegen.

(Staf is also [sponsored by NLNet](https://nlnet.nl/project/Chips4Makers/)
to create Libre-licensed Cell Libraries, busting through one of the -
many - layers of NDAs and reducing NREs for ASIC development: I helped him
put in the submission, and he was really happy to do the Cell Libraries
that we will be using for LibreSOC's 180nm test tape-out in October 2020.)

# Public-Inbox and Domain Migration
-- 
2.30.2