the cache until the instruction retires. VVM uses this to avoid having
a vector strip mine the data cache.
+# Design Layout
+
+ok, so continuing some thoughts-in-order notes:
+
+* scoreboards are not just scoreboards, they are dependency matrices,
+ and there are several of them:
+ - one for LOAD/STORE-to-LOAD/STORE:
+ + most recent LOADs prevent later STOREs
+ + most recent STOREs prevent later LOADs.
+ - one for Function-Unit to Function-Unit.
+ + it expresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
+ signals, which are stopped from proceeding by dependent 1-bit CAM latches
+ + exceptions may ALSO be made "precise" by holding a "Write prevention"
+ signal. only when the Function Unit knows that an exception is not going
+ to occur (memory has been fetched, for example), does it release the
+ signal
+ + speculative branch execution likewise may hold a "Write prevention",
+ however it also needs a "Go die" signal, to clear out the
+ incorrectly-taken branch.
+ + LOADs/STOREs *also* must be considered as "Functional Units" and thus
+ must also have corresponding entries (plural) in the FU-to-FU Matrix
+ + it is permitted for ALUs to *BEGIN* execution (read operands are valid)
+ without being permitted to *COMMIT*. thus, each FU must store (buffer)
+ results, until such time as a "commit" signal is received
+ + we may need to express an inter-dependence on the instruction order
+ (raising the WAW hazard line to do so) as a way to preserve execution
+ order. only the oldest instructions will have this flag dropped,
+ permitting execution that has *begun* to also reach "commit" phase.
+ - one for Function-Unit to Registers.
+ + it expresses the read and write requirements: the source and destination
+ registers on which the operation depends. source registers are marked
+ "need read", dest registers marked "need write".
+ + by having *more than one* Functional Unit matrix row per ALU it becomes
+ possible to effectively achieve "Reservation Stations" orthogonality with
+ the Tomasulo Algorithm. the FU row must, like RS's, take and store a
+ copy of the src register values.
+* we may potentially have 2-issue (or 4-issue), with simpler issue and
+ detection logic, by "striping" the register file according to modulo 2
+ (or 4) on the destination register number
+ - the Function Unit rows are multiplied up by 2 (or 4) however they are
+ actually connected to the same ALUs (pipelined and with both src and
+ dest register buffers/latches).
+ - the Register Read and Write signals are then "striped" such that read/write
+ requests for every 2nd (or 4th) register are "grouped" and will have to
+ fight for access to a multiplexer in order to access registers that do not
+ have the same modulo 2 (or 4) match.
+ - we MAY potentially be able to drop the destination (write) multiplexer(s)
+ by only permitting FU rows with the same modulo to write to that destination
+ bank. FUs with indices 0,4,8,12 may only write to registers similarly
+ numbered.
+ - there will therefore be FOUR separate register-data buses, with (at least)
+ the Read buses multiplexed so that all FU banks may read all src registers
+ (even if there is contention for the multiplexers)
+* an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
+ write/read enable signals already exist as single bits. "normal" processors
+ store the src/dest registers as an index (5 bits == 0-31), whereas in this
+ design that index has already been expanded out to 32 individual Read/Write
+ wires.
+ - the register file verilog implementation therefore must take in an
+ array of 128-bit write-enable and 128-bit read-enable signals.
+ - however the data buses will be multiplexed modulo 2 (or 4) according
+ to the lower bits of the register number, in order to cross "lanes".
+* with so many Function Units in RISC-V (dozens of instructions, times 2
+ to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue),
+ we almost certainly are going to have to deploy a "grouping" scheme:
+ - rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs
+ to MUL etc., instead we group the FUs by how many src and dest
+ registers are required, and *pass the opcode down to them*
+ - only FUs with exactly the same register profile (number and type of
+ registers) will receive like-minded opcodes.
+ - when src and dest are free for a particular op (and an ALU pipeline is
+ not stalled) the FU is at liberty to push the operands into the
+ appropriate free ALU.
+ - FUs therefore only really express the register, memory, and execution
+ dependencies: they don't actually do the execution.
+
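As a rough illustration of the Function-Unit to Registers matrix described above, here is a minimal Python sketch. It is a behavioural model only, not the hardware design: the class `FURegMatrix`, its method names, and the 32-register size are assumptions made for the example. Each FU row holds a "need read" and a "need write" bit-vector with one bit per register, which is the expanded (unary) form of the usual 5-bit register index. The check deliberately ignores issue order, which in the notes is the job of the FU-to-FU matrix.

```python
NUM_REGS = 32  # assumption: 5-bit register index (0-31), as in the notes


def one_hot(reg):
    """Expand a register index into a single-bit (unary) enable vector."""
    assert 0 <= reg < NUM_REGS
    return 1 << reg


class FURegMatrix:
    """Hypothetical model: one row per Function Unit, one bit per register."""

    def __init__(self, num_fus):
        self.need_read = [0] * num_fus   # source registers per FU row
        self.need_write = [0] * num_fus  # destination registers per FU row

    def issue(self, fu, srcs, dest):
        """Mark the src registers "need read" and the dest "need write"."""
        for s in srcs:
            self.need_read[fu] |= one_hot(s)
        self.need_write[fu] |= one_hot(dest)

    def retire(self, fu):
        """Clear the row once the FU has committed its result."""
        self.need_read[fu] = 0
        self.need_write[fu] = 0

    def read_blocked(self, fu):
        """True if any register this FU reads is pending a write elsewhere.

        Simplification: issue order is not tracked here.
        """
        others = 0
        for other, vec in enumerate(self.need_write):
            if other != fu:
                others |= vec
        return (self.need_read[fu] & others) != 0


m = FURegMatrix(num_fus=2)
m.issue(0, srcs=[1, 2], dest=3)  # FU0: r3 = r1 op r2
m.issue(1, srcs=[3, 4], dest=5)  # FU1: r5 = r3 op r4, needs FU0's result
```

In this model FU1 is blocked from reading until FU0 retires, because bit 3 is set both in FU1's "need read" row and in FU0's "need write" row.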
+
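The register-file "striping" idea can be sketched in the same way. The helper names below (`bank_of`, `write_allowed`, `reads_crossing_lanes`) are hypothetical and the default width of 4 is just the quad-issue case from the notes. The sketch assumes that the bank is selected by the lower bits of the register number, that an FU row may only write to the bank matching its own index modulo the issue width, and that reads from any other bank have to go through a (contended) multiplexer.

```python
def bank_of(reg, width=4):
    """Register bank ("lane") selected by the lower bits of the number."""
    return reg % width


def write_allowed(fu_row, dest, width=4):
    """An FU row may write only to the bank with the same modulo."""
    return fu_row % width == bank_of(dest, width)


def reads_crossing_lanes(fu_row, srcs, width=4):
    """Source registers that need the cross-lane read multiplexer."""
    return [s for s in srcs if bank_of(s, width) != fu_row % width]
```

For example, under these assumptions FU row 4 may write to r8 but not to r9, and of the sources r8, r9 and r12 only r9 would have to compete for a multiplexer.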
+
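The "grouping" scheme might look something like the following sketch. The opcode table, the `FUGroup` class and the `dispatch` function are all illustrative assumptions, not part of the design: the point is only that FU rows are grouped by register profile (number of src registers, number of dest registers, register type) and that the opcode is passed down to whichever free row in the matching group accepts it.

```python
# Hypothetical profile table: opcode -> (num src, num dest, register type).
PROFILES = {
    "add": (2, 1, "int"),
    "sub": (2, 1, "int"),
    "mul": (2, 1, "int"),
    "addi": (1, 1, "int"),
    "fmadd.s": (3, 1, "fp"),
}


class FUGroup:
    """A set of FU rows that all share one register profile."""

    def __init__(self, profile, num_rows):
        self.profile = profile
        self.rows = [None] * num_rows  # None = free, else the held opcode

    def allocate(self, opcode):
        """Pass the opcode down to the first free row; None means stall."""
        for i, held in enumerate(self.rows):
            if held is None:
                self.rows[i] = opcode
                return i
        return None


def build_groups(rows_per_group):
    """One group per distinct register profile, not one per opcode."""
    return {p: FUGroup(p, rows_per_group) for p in set(PROFILES.values())}


def dispatch(groups, opcode):
    """Route an opcode to the group matching its register profile."""
    group = groups[PROFILES[opcode]]
    return group.profile, group.allocate(opcode)
```

With two rows per group, `add` and `sub` would occupy both rows of the (2, 1, "int") group, so a following `mul` has to wait, while `addi` goes to a different group and is unaffected.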
# References
* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>