From be66d53a2e4b046e4d2c9b174a5933517db811f6 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Tue, 11 Dec 2018 14:18:52 +0000
Subject: [PATCH] add conversation notes

---
 3d_gpu/microarchitecture.mdwn | 119 +++++++++++++++++-----------------
 1 file changed, 61 insertions(+), 58 deletions(-)
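
Note (illustrative only, not part of the change): the register "striping" and the one-wire-per-register enable signals described in the hunk below could be modelled roughly as follows. This is a minimal Python sketch under stated assumptions, not the project's implementation. The names `NUM_REGS`, `ISSUE_WIDTH`, `bank_of`, `one_hot_enable`, `needs_crossbar` and `write_allowed` are hypothetical. The register count of 128 is assumed from the "128-bit write-enable" wording, and the bank count of 4 from the quad-issue case.

```python
NUM_REGS = 128     # assumption: matches the 128-bit enable vectors in the notes
ISSUE_WIDTH = 4    # assumption: quad-issue; use 2 for dual-issue


def bank_of(index: int, width: int = ISSUE_WIDTH) -> int:
    """Bank ("lane") selected by the low bits of a register or FU-row index."""
    return index % width


def one_hot_enable(regs, num_regs: int = NUM_REGS) -> int:
    """Expand register indices into a one-bit-per-register enable vector."""
    mask = 0
    for r in regs:
        if not 0 <= r < num_regs:
            raise ValueError(f"register {r} out of range")
        mask |= 1 << r
    return mask


def needs_crossbar(fu_row: int, reg: int, width: int = ISSUE_WIDTH) -> bool:
    """True when an FU row accesses a register outside its own bank,
    so the access has to contend for a multiplexer."""
    return bank_of(fu_row, width) != bank_of(reg, width)


def write_allowed(fu_row: int, dest: int, width: int = ISSUE_WIDTH) -> bool:
    """Mux-free write option: an FU row may only write to its own bank."""
    return not needs_crossbar(fu_row, dest, width)
```

With these assumptions, FU rows 0, 4, 8 and 12 all map to bank 0 and so may write only to registers whose number is also a multiple of 4, while reads from other banks remain possible through the multiplexers.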

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 0cb7d6188..0ca845c4f 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -360,75 +360,78 @@ ok,so continuing some thoughts-in-order notes:
 
 * scoreboards are not just scoreboards, they are dependency matrices,
   and there are several of them:
- - one for LOAD/STORE-to-LOAD/STORE:
-   + most recent LOADs prevent later STOREs
-   + most recent STOREs prevent later LOADs.
- - one for Function-Unit to Function-Unit.
-   + it exxpresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
-      signals, which are stopped from proceeding by dependent 1-bit CAM latches
-   + exceptions may ALSO be made "precise" by holding a "Write prevention"
-      signal.  only when the Function Unit knows that an exception is not going
-      to occur (memory has been fetched, for example), does it release the
-      signal
-    + speculative branch execution likewise may hold a "Write prevention",
-       however it also needs a "Go die" signal, to clear out the
-       incorrectly-taken branch.
-    + LOADs/STOREs *also* must be considered as "Functional Units" and thus
-       must also have corresponding entries (plural) in the FU-to-FU Matrix
-    + it is permitted for ALUs to *BEGIN* execution (read operands are valid)
-       without being permitted to *COMMIT*.  thus, each FU must store (buffer)
-       results, until such time as a "commit" signal is received
-    + we may need to express an inter-dependence on the instruction order
-       (raising the WAW hazard line to do so) as a way to preserve execution
-       order.  only the oldest instructions will have this flag dropped,
-       permitting execution that has *begun* to also reach "commit" phase.
-   - one for Function-Unit to Registers.
-    + it expresses the read and write requirements: the source and destination
-       registers on which the operation depends.  source registers are marked
-       "need read", dest registers marked "need write".
-    + by having *more than one* Functional Unit matrix row per ALU it becomes
-       possible to effectively achieve "Reservation Stations" orthogonality with
-       the Tomasulo Algorithm.  the FU row must, like RS's, take and store a
-       copy of the src register values.
+    - one for LOAD/STORE-to-LOAD/STORE:
+       + most recent LOADs prevent later STOREs
+       + most recent STOREs prevent later LOADs.
+    - one for Function-Unit to Function-Unit.
+       + it expresses both RAW and WAW hazards through "Go_Write"
+         and "Go_Read" signals, which are stopped from proceeding by
+         dependent 1-bit CAM latches
+       + exceptions may ALSO be made "precise" by holding a "Write prevention"
+         signal.  only when the Function Unit knows that an exception is
+         not going to occur (memory has been fetched, for example), does
+         it release the signal
+        + speculative branch execution likewise may hold a "Write prevention",
+           however it also needs a "Go die" signal, to clear out the
+           incorrectly-taken branch.
+        + LOADs/STOREs *also* must be considered as "Functional Units" and thus
+           must also have corresponding entries (plural) in the FU-to-FU Matrix
+        + it is permitted for ALUs to *BEGIN* execution (read operands are
+           valid) without being permitted to *COMMIT*.  thus, each FU must
+           store (buffer) results, until such time as a "commit" signal is
+           received
+        + we may need to express an inter-dependence on the instruction order
+           (raising the WAW hazard line to do so) as a way to preserve execution
+           order.  only the oldest instructions will have this flag dropped,
+           permitting execution that has *begun* to also reach "commit" phase.
+    - one for Function-Unit to Registers.
+        + it expresses the read and write requirements: the source
+          and destination registers on which the operation depends.  source
+          registers are marked "need read", dest registers marked
+          "need write".
+        + by having *more than one* Functional Unit matrix row per ALU
+          it becomes possible to effectively achieve "Reservation Stations"
+          orthogonality with the Tomasulo Algorithm.  the FU row must, like
+          RS's, take and store a copy of the src register values.
 * we may potentially have 2-issue (or 4-issue) and a simpler issue and
   detection by "striping" the register file according to modulo 2 (or 4)
   on the destination register number
-  - the Function Unit rows are multiplied up by 2 (or 4) however they are
-    actually connected to the same ALUs (pipelined and with both src and
-    dest register buffers/latches).
-  - the Register Read and Write signals are then "striped" such that read/write
-    requests for every 2nd (or 4th) register are "grouped" and will have to
-    fight for access to a multiplexer in order to access registers that do not
-    have the same modulo 2 (or 4) match.
-  - we MAY potentially be able to drop the destination (write) multiplexer(s)
-    by only permitting FU rows with the same modulo to write to that destination
-    bank.  FUs with indices 0,4,8,12 may only write to registers similarly
-    numbered.
-  - there will therefore be FOUR separate register-data buses, with (at least)
-    the Read buses multiplexed so that all FU banks may read all src registers
-    (even if there is contention for the multiplexers)
+    - the Function Unit rows are multiplied up by 2 (or 4) however they are
+      actually connected to the same ALUs (pipelined and with both src and
+      dest register buffers/latches).
+    - the Register Read and Write signals are then "striped" such that read/write
+      requests for every 2nd (or 4th) register are "grouped" and will have to
+      fight for access to a multiplexer in order to access registers that do not
+      have the same modulo 2 (or 4) match.
+    - we MAY potentially be able to drop the destination (write) multiplexer(s)
+      by only permitting FU rows with the same modulo to write to that destination
+      bank.  FUs with indices 0,4,8,12 may only write to registers similarly
+      numbered.
+    - there will therefore be FOUR separate register-data buses, with (at least)
+      the Read buses multiplexed so that all FU banks may read all src registers
+      (even if there is contention for the multiplexers)
 * an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
   write/read enable signals already exist as single-bits.  "normal" processors
   store the src/dest registers as an index (5 bits == 0-31), where in this
   design, that has been expanded out to 32 individual Read/Write wires,
   already.
-  - the register file verilog implementation therefore must take in an
-    array of 128-bit write-enable and 128-bit read-enable signals.
- - however the data buses will be multiplexed modulo 2 (or 4) according
-   to the lower bits of the register number, in order to cross "lanes".
+    - the register file verilog implementation therefore must take in an
+      array of 128-bit write-enable and 128-bit read-enable signals.
+    - however the data buses will be multiplexed modulo 2 (or 4) according
+      to the lower bits of the register number, in order to cross "lanes".
 * with so many Function Units in RISC-V (dozens of instructions, times 2
   to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
   we almost certainly are going to have to deploy a "grouping" scheme:
-  - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
-    to MUL etc., instead we group the FUs by how many src and dest
-    registers are required, and *pass the opcode down to them*
-  - only FUs with the exact same number (and type) of register profile
-    will receive like-minded opcodes.
-  - when src and dest are free for a particular op (and an ALU pipeline is
-    not stalled) the FU is at liberty to push the operands into the
-    appropriate free ALU.
-  - FUs therefore only really express the register, memory, and execution
-    dependencies: they don't actually do the execution.
+    - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
+      to MUL etc., instead we group the FUs by how many src and dest
+      registers are required, and *pass the opcode down to them*
+    - only FUs with the exact same number (and type) of register profile
+      will receive like-minded opcodes.
+    - when src and dest are free for a particular op (and an ALU pipeline is
+      not stalled) the FU is at liberty to push the operands into the
+      appropriate free ALU.
+    - FUs therefore only really express the register, memory, and execution
+      dependencies: they don't actually do the execution.
 
 
 
-- 
2.30.2