From be66d53a2e4b046e4d2c9b174a5933517db811f6 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Tue, 11 Dec 2018 14:18:52 +0000
Subject: [PATCH] add conversation notes

---
 3d_gpu/microarchitecture.mdwn | 119 +++++++++++++++++-----------------
 1 file changed, 61 insertions(+), 58 deletions(-)

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 0cb7d6188..0ca845c4f 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -360,75 +360,78 @@ ok,so continuing some thoughts-in-order notes:
 * scoreboards are not just scoreboards, they are dependency matrices, and there are several of them:
- - one for LOAD/STORE-to-LOAD/STORE:
-   + most recent LOADs prevent later STOREs
-   + most recent STOREs prevent later LOADs.
- - one for Function-Unit to Function-Unit.
-   + it exxpresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
-      signals, which are stopped from proceeding by dependent 1-bit CAM latches
-   + exceptions may ALSO be made "precise" by holding a "Write prevention"
-      signal.  only when the Function Unit knows that an exception is not going
-      to occur (memory has been fetched, for example), does it release the
- signal
-    + speculative branch execution likewise may hold a "Write prevention",
- however it also needs a "Go die" signal, to clear out the
- incorrectly-taken branch.
-    + LOADs/STOREs *also* must be considered as "Functional Units" and thus
-       must also have corresponding entries (plural) in the FU-to-FU Matrix
-    + it is permitted for ALUs to *BEGIN* execution (read operands are valid)
-       without being permitted to *COMMIT*.  thus, each FU must store (buffer)
-       results, until such time as a "commit" signal is received
-    + we may need to express an inter-dependence on the instruction order
-       (raising the WAW hazard line to do so) as a way to preserve execution
-       order.  only the oldest instructions will have this flag dropped,
- permitting execution that has *begun* to also reach "commit" phase.
-   - one for Function-Unit to Registers.
-    + it expresses the read and write requirements: the source and destination
-       registers on which the operation depends.  source registers are marked
-       "need read", dest registers marked "need write".
-    + by having *more than one* Functional Unit matrix row per ALU it becomes
-       possible to effectively achieve "Reservation Stations" orthogonality with
-       the Tomasulo Algorithm.  the FU row must, like RS's, take and store a
- copy of the src register values.
+ - one for LOAD/STORE-to-LOAD/STORE:
+    + most recent LOADs prevent later STOREs
+    + most recent STOREs prevent later LOADs.
+ - one for Function-Unit to Function-Unit.
+    + it exxpresses both RAW and WAW hazards through "Go_Write"
+ and "Go_Read" signals, which are stopped from proceeding by
+ dependent 1-bit CAM latches
+    + exceptions may ALSO be made "precise" by holding a "Write prevention"
+      signal.  only when the Function Unit knows that an exception is
+ not going to occur (memory has been fetched, for example), does
+ it release the signal
+     + speculative branch execution likewise may hold a "Write prevention",
+ however it also needs a "Go die" signal, to clear out the
+ incorrectly-taken branch.
+     + LOADs/STOREs *also* must be considered as "Functional Units" and thus
+        must also have corresponding entries (plural) in the FU-to-FU Matrix
+     + it is permitted for ALUs to *BEGIN* execution (read operands are
+ valid) without being permitted to *COMMIT*.  thus, each FU must
+ store (buffer) results, until such time as a "commit" signal is
+ received
+     + we may need to express an inter-dependence on the instruction order
+        (raising the WAW hazard line to do so) as a way to preserve execution
+        order.  only the oldest instructions will have this flag dropped,
+ permitting execution that has *begun* to also reach "commit" phase.
+ - one for Function-Unit to Registers.
+     + it expresses the read and write requirements: the source
+ and destination registers on which the operation depends.  source
+ registers are marked "need read", dest registers marked
+ "need write".
+     + by having *more than one* Functional Unit matrix row per ALU
+ it becomes possible to effectively achieve "Reservation Stations"
+ orthogonality with the Tomasulo Algorithm.  the FU row must, like
+ RS's, take and store a copy of the src register values.
 * we may potentially have 2-issue (or 4-issue) and a simpler issue and detection by "striping" the register file according to modulo 2 (or 4) on the destination
   register number
-  - the Function Unit rows are multiplied up by 2 (or 4) however they are
-    actually connected to the same ALUs (pipelined and with both src and
-    dest register buffers/latches).
-  - the Register Read and Write signals are then "striped" such that read/write
-    requests for every 2nd (or 4th) register are "grouped" and will have to
-    fight for access to a multiplexer in order to access registers that do not
-    have the same modulo 2 (or 4) match.
-  - we MAY potentially be able to drop the destination (write) multiplexer(s)
-    by only permitting FU rows with the same modulo to write to that destination
-    bank.  FUs with indices 0,4,8,12 may only write to registers similarly
-    numbered.
-  - there will therefore be FOUR separate register-data buses, with (at least)
-    the Read buses multiplexed so that all FU banks may read all src registers
-    (even if there is contention for the multiplexers)
+ - the Function Unit rows are multiplied up by 2 (or 4) however they are
+   actually connected to the same ALUs (pipelined and with both src and
+   dest register buffers/latches).
+ - the Register Read and Write signals are then "striped" such that read/write
+   requests for every 2nd (or 4th) register are "grouped" and will have to
+   fight for access to a multiplexer in order to access registers that do not
+   have the same modulo 2 (or 4) match.
+ - we MAY potentially be able to drop the destination (write) multiplexer(s)
+   by only permitting FU rows with the same modulo to write to that destination
+   bank.  FUs with indices 0,4,8,12 may only write to registers similarly
+   numbered.
+ - there will therefore be FOUR separate register-data buses, with (at least)
+   the Read buses multiplexed so that all FU banks may read all src registers
+   (even if there is contention for the multiplexers)
 * an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
   write/read enable signals already exist as single-bits.  "normal" processors
   store the src/dest registers as an index (5 bits == 0-31), where in this
   design, that has been expanded out to 32 individual Read/Write wires,
   already.
-  - the register file verilog implementation therefore must take in an
-    array of 128-bit write-enable and 128-bit read-enable signals.
- - however the data buses will be multiplexed modulo 2 (or 4) according
-   to the lower bits of the register number, in order to cross "lanes".
+  - the register file verilog implementation therefore must take in an
+    array of 128-bit write-enable and 128-bit read-enable signals.
+  - however the data buses will be multiplexed modulo 2 (or 4) according
+    to the lower bits of the register number, in order to cross "lanes".
 * with so many Function Units in RISC-V (dozens of instructions, times 2
   to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
   we almost certainly are going to have to deploy a "grouping" scheme:
-  - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
-    to MUL etc., instead we group the FUs by how many src and dest
-    registers are required, and *pass the opcode down to them*
-  - only FUs with the exact same number (and type) of register profile
-    will receive like-minded opcodes.
-  - when src and dest are free for a particular op (and an ALU pipeline is
-    not stalled) the FU is at liberty to push the operands into the
-    appropriate free ALU.
-  - FUs therefore only really express the register, memory, and execution
-    dependencies: they don't actually do the execution.
+  - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
+    to MUL etc., instead we group the FUs by how many src and dest
+    registers are required, and *pass the opcode down to them*
+  - only FUs with the exact same number (and type) of register profile
+    will receive like-minded opcodes.
+  - when src and dest are free for a particular op (and an ALU pipeline is
+    not stalled) the FU is at liberty to push the operands into the
+    appropriate free ALU.
+  - FUs therefore only really express the register, memory, and execution
+    dependencies: they don't actually do the execution.
-- 
2.30.2
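
note on the "striping" and single-bit enable bullets in the notes patched above: a minimal python sketch of how a Function-Unit-to-Registers row and the modulo banking might be modelled. it is not taken from the patch or from any existing implementation. the names (`one_hot`, `bank_of`, `fu_row`, `may_write_without_mux`, `reads_needing_mux`) and the constants are hypothetical, and it assumes a 32-entry register file with modulo-4 striping (the notes also allow modulo 2).

```python
NUM_REGS = 32   # assumption: 5-bit register index (0-31), as in the notes
NUM_BANKS = 4   # assumption: quad-issue striping; the notes also allow 2


def one_hot(reg):
    """Expand a register index into a one-bit-per-register enable mask."""
    if not 0 <= reg < NUM_REGS:
        raise ValueError("register index out of range: %d" % reg)
    return 1 << reg


def bank_of(index):
    """Bank ("lane") selected by the low bits of a register or FU index."""
    return index % NUM_BANKS


def fu_row(srcs, dest):
    """One FU-to-Registers matrix row as (need_read, need_write) bit masks."""
    need_read = 0
    for reg in srcs:
        need_read |= one_hot(reg)
    return need_read, one_hot(dest)


def may_write_without_mux(fu_index, dest):
    """True if the FU row and the destination register share a bank,
    i.e. the case where the write multiplexer could be dropped."""
    return bank_of(fu_index) == bank_of(dest)


def reads_needing_mux(fu_index, srcs):
    """Source registers in a different bank from the FU row: these reads
    would have to contend for a multiplexer to cross lanes."""
    return [reg for reg in srcs if bank_of(reg) != bank_of(fu_index)]


if __name__ == "__main__":
    need_read, need_write = fu_row([1, 2], 3)
    print(bin(need_read), bin(need_write))
    print(may_write_without_mux(4, 8), may_write_without_mux(4, 9))
    print(reads_needing_mux(4, [8, 9, 12]))
```

the masks here are plain integers standing in for the individual Read/Write wires; a hardware description would carry them as bit vectors, and the contention for the read multiplexers is not modelled at all.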