From 428e899e0e97f24f8b5c9aeb12dcc2d715f74348 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Tue, 11 Dec 2018 14:14:24 +0000
Subject: [PATCH] add conversation notes

---
 3d_gpu/microarchitecture.mdwn | 78 +++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 1f677dc41..0cb7d6188 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -354,6 +354,84 @@ have to forward multiple times from the miss buffers and avoid damaging
 the cache until the instruction retires. VVM uses this to avoid having a
 vector strip mine the data cache.
 
+# Design Layout
+
+ok, so continuing some thoughts-in-order notes:
+
+* scoreboards are not just scoreboards, they are dependency matrices,
+  and there are several of them:
+  - one for LOAD/STORE-to-LOAD/STORE:
+    + most recent LOADs prevent later STOREs
+    + most recent STOREs prevent later LOADs.
+  - one for Function-Unit to Function-Unit.
+    + it expresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
+      signals, which are stopped from proceeding by dependent 1-bit CAM latches
+    + exceptions may ALSO be made "precise" by holding a "Write prevention"
+      signal.  only when the Function Unit knows that an exception is not going
+      to occur (memory has been fetched, for example), does it release the
+      signal
+    + speculative branch execution likewise may hold a "Write prevention",
+      however it also needs a "Go die" signal, to clear out the
+      incorrectly-taken branch.
+    + LOADs/STOREs *also* must be considered as "Functional Units" and thus
+      must also have corresponding entries (plural) in the FU-to-FU Matrix
+    + it is permitted for ALUs to *BEGIN* execution (read operands are valid)
+      without being permitted to *COMMIT*.  thus, each FU must store (buffer)
+      results, until such time as a "commit" signal is received
+    + we may need to express an inter-dependence on the instruction order
+      (raising the WAW hazard line to do so) as a way to preserve execution
+      order.  only the oldest instructions will have this flag dropped,
+      permitting execution that has *begun* to also reach "commit" phase.
+  - one for Function-Unit to Registers.
+    + it expresses the read and write requirements: the source and destination
+      registers on which the operation depends.  source registers are marked
+      "need read", dest registers marked "need write".
+    + by having *more than one* Functional Unit matrix row per ALU it becomes
+      possible to effectively achieve "Reservation Stations" orthogonality with
+      the Tomasulo Algorithm.  the FU row must, like RS's, take and store a
+      copy of the src register values.
+* we may potentially have 2-issue (or 4-issue) and a simpler issue and
+  detection by "striping" the register file according to modulo 2 (or 4)
+  on the destination register number
+  - the Function Unit rows are multiplied up by 2 (or 4) however they are
+    actually connected to the same ALUs (pipelined and with both src and
+    dest register buffers/latches).
+  - the Register Read and Write signals are then "striped" such that read/write
+    requests for every 2nd (or 4th) register are "grouped" and will have to
+    fight for access to a multiplexer in order to access registers that do not
+    have the same modulo 2 (or 4) match.
+  - we MAY potentially be able to drop the destination (write) multiplexer(s)
+    by only permitting FU rows with the same modulo to write to that destination
+    bank.  FUs with indices 0,4,8,12 may only write to registers similarly
+    numbered.
+  - there will therefore be FOUR separate register-data buses, with (at least)
+    the Read buses multiplexed so that all FU banks may read all src registers
+    (even if there is contention for the multiplexers)
+* an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
+  write/read enable signals already exist as single-bits.  "normal" processors
+  store the src/dest registers as an index (5 bits == 0-31), whereas in this
+  design, that has been expanded out to 32 individual Read/Write wires,
+  already.
+  - the register file verilog implementation therefore must take in an
+    array of 128-bit write-enable and 128-bit read-enable signals.
+  - however the data buses will be multiplexed modulo 2 (or 4) according
+    to the lower bits of the register number, in order to cross "lanes".
+* with so many Function Units in RISC-V (dozens of instructions, times 2
+  to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
+  we almost certainly are going to have to deploy a "grouping" scheme:
+  - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
+    to MUL etc., instead we group the FUs by how many src and dest
+    registers are required, and *pass the opcode down to them*
+  - only FUs with the exact same number (and type) of register profile
+    will receive like-minded opcodes.
+  - when src and dest are free for a particular op (and an ALU pipeline is
+    not stalled) the FU is at liberty to push the operands into the
+    appropriate free ALU.
+  - FUs therefore only really express the register, memory, and execution
+    dependencies: they don't actually do the execution.
+
+
+
 # References
 
 * 
-- 
2.30.2
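
The FU-to-Registers matrix and the modulo striping rule from the notes in this patch can be illustrated with a small software model. This is only a sketch and not part of the patch: `FURow`, `one_hot`, `hazards` and `may_write` are hypothetical names, 32 registers and 4 banks are assumed, and the real design would be written in an HDL rather than Python.

```python
NUM_REGS = 32   # assumed: 5-bit register index expanded to 32 one-hot wires
NUM_BANKS = 4   # assumed: quad issue; use 2 for dual issue


def one_hot(regs):
    """Expand register indices into a bit vector, one bit per register."""
    mask = 0
    for r in regs:
        if not 0 <= r < NUM_REGS:
            raise ValueError(f"register {r} out of range")
        mask |= 1 << r
    return mask


class FURow:
    """One row of a (hypothetical) FU-to-Registers dependency matrix."""

    def __init__(self, index):
        self.index = index
        self.need_read = 0    # source registers marked "need read"
        self.need_write = 0   # destination registers marked "need write"

    def issue(self, srcs, dests):
        self.need_read = one_hot(srcs)
        self.need_write = one_hot(dests)


def hazards(new, older_rows):
    """Return (raw, waw) flags for `new` against rows issued before it."""
    pending_writes = 0
    for row in older_rows:
        pending_writes |= row.need_write
    raw = bool(new.need_read & pending_writes)    # reads a pending result
    waw = bool(new.need_write & pending_writes)   # overwrites a pending result
    return raw, waw


def may_write(fu_index, dest_reg):
    """Striping rule: an FU row only writes to its own modulo bank."""
    return fu_index % NUM_BANKS == dest_reg % NUM_BANKS


add = FURow(0)
add.issue(srcs=[1, 2], dests=[4])         # r4 = r1 + r2
mul = FURow(1)
mul.issue(srcs=[4, 3], dests=[5])         # r5 = r4 * r3, reads r4 (RAW)
print(hazards(mul, [add]))                # (True, False)
print(may_write(0, 4), may_write(1, 4))   # True False
```

Because the enables are already one-hot, hazard detection reduces to a bitwise AND per row, which is the property the notes point out as an artefact of the matrix. The `may_write` check mirrors the "FUs with indices 0,4,8,12" rule that would allow the write multiplexers to be dropped.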