----
+Justification for Branch Prediction
+
+<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>
+
+We can combine several branch predictors to make a decent predictor:
+
+* call/return predictor -- important as it can predict calls and
+  returns with around 99.8% accuracy
+* loop predictor -- basically counts loop iterations
+* some kind of global predictor -- handles everything else
+
+We will also want a BTB (Branch Target Buffer); a smaller one will
+work. It reduces the average branch cycle count from 2-3 to 1, since
+it predicts which instructions are taken branches while those
+instructions are still being fetched, allowing the fetch to go to the
+target address on the next clock rather than having to wait for the
+fetched instructions to be decoded.
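as a rough sketch of why the BTB removes the decode delay: the lookup
happens on the fetch PC, before decode. the direct-mapped organisation
and sizes below are illustrative assumptions, not a proposed design.

```python
# Minimal direct-mapped Branch Target Buffer sketch (sizes and
# organisation are assumptions for illustration only).
class BTB:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries     # each slot: (tag, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries   # word-aligned PCs

    def lookup(self, pc):
        """Return a predicted target while the fetch is still in flight,
        or None if this PC is not known to be a taken branch."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, target):
        """Record a taken branch once it resolves."""
        self.table[self._index(pc)] = (pc, target)

btb = BTB()
btb.update(0x1000, 0x2000)
assert btb.lookup(0x1000) == 0x2000   # hit: redirect fetch next clock
assert btb.lookup(0x1004) is None     # miss: fall through, wait for decode
```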
+
+----
+
+> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures
+
+well, there is this concept:
+https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf
+
+it is a 2-level hierarchy for register caching. honestly, though, the
+reservation stations of the Tomasulo algorithm are similar to a cache,
+although only of the intermediate results, not of the initial operands.
+
+i have a feeling we should investigate putting a 2-level register cache
+in front of a multiplexed SRAM.
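a minimal sketch of that idea: a small LRU register cache sitting in
front of the full (multiplexed SRAM) register file. the write-through
policy, capacity, and names are all assumptions for illustration.

```python
# Illustrative 2-level register lookup: small LRU cache over a full
# register file (the "SRAM"). Not a proposed implementation.
from collections import OrderedDict

class RegCache:
    def __init__(self, backing, capacity=8):
        self.backing = backing        # full register file (dict reg -> value)
        self.cache = OrderedDict()    # LRU order: regnum -> value
        self.capacity = capacity

    def read(self, regnum):
        if regnum in self.cache:
            self.cache.move_to_end(regnum)    # fast path, refresh LRU
            return self.cache[regnum]
        value = self.backing[regnum]          # slow path: SRAM port
        self._fill(regnum, value)
        return value

    def write(self, regnum, value):
        self.backing[regnum] = value          # write-through (assumed policy)
        self._fill(regnum, value)

    def _fill(self, regnum, value):
        self.cache[regnum] = value
        self.cache.move_to_end(regnum)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least-recently-used

rf = {r: 0 for r in range(32)}
rc = RegCache(rf)
rc.write(5, 42)
assert rc.read(5) == 42   # served from the small cache
assert rf[5] == 42        # backing file kept coherent
```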
+
+----
+
For GPU workloads FP64 is not common so I think having one FP64 ALU would
be sufficient. Since indexed loads and stores are not supported, it will
be important to support 4x64 integer operations to generate addresses
the head of the ROB, and hence, no earlier loads or stores can still
be pending
* RAW hazards are maintained by two restrictions:
- 1. not allowing a load to initiate the second step of its execution if
- any active ROB entry occupied by a store has a destination
- field that matches the value of the A field of the load and
- 2. maintaining the program order for the computation of an effective
+ 1. not allowing a load to initiate the second step of its execution if
+ any active ROB entry occupied by a store has a destination
+ field that matches the value of the A field of the load and
+ 2. maintaining the program order for the computation of an effective
address of a load with respect to all earlier stores
+* These restrictions ensure that any load that accesses a memory location
+  written to by an earlier store cannot perform the memory access until
+  the store has written the data.
+Advantages of Speculation, Load and Store hazards:
+
+* A store updates memory only when it reaches the head of the ROB
+* WAW and WAR type of hazards are eliminated with speculation
+ (actual updating of memory occurs in order)
+* RAW hazards through memory are maintained by not allowing a load
+ to initiate the second step of its execution
+* Check if any store has a destination field that matches the
+  value of the A field of the load:
+ - SD F1 100(R2)
+ - LD F2 100(R2)
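the check above can be sketched like this (field names are illustrative
assumptions, not from any particular implementation):

```python
# RAW-through-memory check: before a load performs its memory access,
# scan the active ROB entries for an earlier store whose destination
# address (A field) matches the load's address.
def load_may_proceed(rob, load_addr):
    """True only if no earlier in-flight store targets load_addr."""
    for entry in rob:                        # entries in program order
        if entry["op"] == "store" and entry["dest_addr"] == load_addr:
            return False                     # must wait for the store's data
    return True

rob = [{"op": "store", "dest_addr": 100}]    # SD F1, 100(R2) in flight
assert load_may_proceed(rob, 100) is False   # LD F2, 100(R2) must wait
assert load_may_proceed(rob, 200) is True    # different address: go ahead
```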
+
+Exceptions
+
+* Exceptions are handled by not recognising the exception until the
+  instruction that caused it is ready to commit (i.e. reaches the
+  head of the ROB)
+
+Reorder Buffer
+
+* Results of an instruction become visible externally when it leaves
+ the ROB
+ - Registers updated
+ - Memory updated
+
+Reorder Buffer Entry
+
+* Instruction type
+  - branch (no destination result)
+  - store (has a memory address destination)
+  - register operation (ALU operation or load, which has a register
+    destination)
+* Destination
+ - register number (for loads and ALU ops) or
+ - memory address (for stores) where the result should be written
+* Value
+ - value of instruction result, pending a commit
+* Ready
+ - indicates that the instruction has completed execution: value is ready
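the four fields above, written out as a sketch (field names are assumed
for illustration):

```python
# One Reorder Buffer entry: type, destination, value, ready.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    itype: str                   # "branch", "store" or "regop"
    dest: Optional[int]          # register number, or memory address (store)
    value: Optional[int] = None  # instruction result, held until commit
    ready: bool = False          # execution complete, value is valid

e = ROBEntry(itype="regop", dest=3)
e.value, e.ready = 42, True        # functional unit writes back
assert e.ready and e.value == 42   # safe to commit once at the ROB head
```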
+
+----
+
+Register Renaming resources
+
+* <https://www.youtube.com/watch?v=p4SdrUhZrBM>
+* <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
+* ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>
+
+Video @ 3:24, "RAT" table - Register Aliasing Table:
+
+<img src="/3d_gpu/rat_table.png" />
+
+This scheme looks very much like a Reservation Station.
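a minimal sketch of a RAT of this kind (illustrative only): each
architectural register is either backed by the committed register file
or aliased to the ROB entry that will produce its newest value.

```python
# Register Alias Table sketch. WAW/WAR hazards disappear because a
# newer rename simply overwrites the alias.
class RAT:
    def __init__(self, nregs=32):
        # None => value lives in the register file; else a ROB entry index
        self.alias = [None] * nregs

    def rename_dest(self, reg, rob_idx):
        self.alias[reg] = rob_idx          # newest producer wins

    def lookup_src(self, reg):
        if self.alias[reg] is not None:
            return ("rob", self.alias[reg])
        return ("rf", reg)

    def commit(self, reg, rob_idx):
        # clear only if no younger instruction has renamed it since
        if self.alias[reg] == rob_idx:
            self.alias[reg] = None

rat = RAT()
rat.rename_dest(1, rob_idx=7)            # r1 now produced by ROB entry 7
assert rat.lookup_src(1) == ("rob", 7)   # consumers read the ROB entry
rat.commit(1, rob_idx=7)
assert rat.lookup_src(1) == ("rf", 1)    # after commit, back to the RF
```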
+
+----
+
+There is another way to get precise ordering of the writes in a scoreboard.
+First, one has to implement forwarding in the scoreboard.
+Second, the function units need an output queue (of, say, 4 registers).
+Now, one can launch an instruction and pick up its operand either
+from the RF or from the function unit output while the result sits
+in the function unit waiting for its GO_Write signal.
+
+Thus the launching of instructions is not delayed due to hazards
+but the results are delivered to the RF in program order.
+
+This looks surprisingly like a 'belt' at the end of the function unit.
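a sketch of that scheme (GO_Write is from the text; every other name is
an illustrative assumption): results park in a small per-FU output
queue where they are immediately forwardable, but only reach the RF
when their in-order GO_Write arrives.

```python
# Function-unit output queue: out-of-order completion, forwarding from
# the queue, in-order delivery to the register file.
from collections import deque

class FUOutputQueue:
    def __init__(self, depth=4):
        self.q = deque(maxlen=depth)      # entries: (seq, reg, value)

    def complete(self, seq, reg, value):
        self.q.append((seq, reg, value))  # result parks here, forwardable

    def forward(self, reg):
        """Bypass: a dependent instruction reads the parked result."""
        for _, r, v in self.q:
            if r == reg:
                return v
        return None

    def go_write(self, seq, regfile):
        """GO_Write for sequence number seq: drain just that entry.
        The caller issues GO_Writes in program order."""
        for entry in list(self.q):
            if entry[0] == seq:
                _, r, v = entry
                regfile[r] = v
                self.q.remove(entry)
                return True
        return False

regs = {1: 0, 2: 0}
fu = FUOutputQueue()
fu.complete(seq=1, reg=2, value=7)        # younger result finishes first
fu.complete(seq=0, reg=1, value=3)
assert fu.forward(2) == 7                 # forwarding hides the wait
fu.go_write(0, regs)                      # but the RF is written
fu.go_write(1, regs)                      # strictly in program order
assert regs == {1: 3, 2: 7}
```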
+
+----
+
+> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg
+
+ so, coming back to this diagram, i think if we stratify the
+Functional Units into lanes as well, we may get a multi-issue
+architecture.
+
+  the 6600 scoreboard rules - which are awesomely simple and actually
+involve D-Latches (3 gates) *not* flip-flops (10 gates) - can be executed
+in parallel because there will be no overlap between stratified registers.
+
+ if using that odd-even / msw-lsw division (instead of modulo 4 on the
+register number) it will be more like a 2-issue for standard RV
+instructions and a 4-issue for when SV 32-bit ops are loop-generated.
+
+ by subdividing the registers into odd-even banks we will need a
+_pair_ of (completely independent) register-renaming tables:
+ https://libre-riscv.org/3d_gpu/rat_table.png
+
+ for SIMD'd operations, if we have the same type of reservation
+station queue as with Tomasulo, it can be augmented with the byte-mask:
+if the byte-masks in the queue of both the src and dest registers do
+not overlap, the operations may be done in parallel.
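that byte-mask test can be sketched as follows (bit-per-byte masks
assumed, purely for illustration):

```python
# Two SIMD operations touching the same register may issue in parallel
# iff their byte-masks do not overlap.
def may_run_in_parallel(op_a, op_b):
    """op_* are (regnum, bytemask) pairs; masks have one bit per byte."""
    reg_a, mask_a = op_a
    reg_b, mask_b = op_b
    if reg_a != reg_b:
        return True                   # different registers: no conflict
    return (mask_a & mask_b) == 0     # same register: bytes must not overlap

# a 64-bit register as 8 bytes: low half and high half do not collide
assert may_run_in_parallel((5, 0b00001111), (5, 0b11110000)) is True
assert may_run_in_parallel((5, 0b00111100), (5, 0b00000100)) is False
```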
+
+  i still have not yet thought through how the Reorder Buffer would
+work: here, again, i am tempted to recommend that we "stratify"
+the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries,
+where the CAM is only 4-bit or 3-bit wide.
+
+ if an instruction's destination register does not meet the modulo
+requirements, that ROB entry is *left empty*. this does mean that,
+for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
+4), and there are 4 sequential instructions that happen e.g. to have
+a destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
+etc.... the ROB will only hold 8 such instructions.
+
+and that i think is perfectly fine, because, statistically, it'll balance
+out, and SV generates sequentially-incrementing instruction registers,
+so *that* is fine, too.
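a toy model of that modulo-4 allocation, reproducing the 8-of-32 worst
case described above (purely illustrative):

```python
# ROB entry i may only hold an instruction whose destination register
# satisfies dest % 4 == i % 4; mismatched slots are left empty.
def allocate(dests, rob_size=32, lanes=4):
    """Place destination regs into the ROB in program order, skipping
    slots whose lane does not match; returns the filled ROB."""
    rob = [None] * rob_size
    i = 0
    for d in dests:
        while i < rob_size and i % lanes != d % lanes:
            i += 1                    # slot left empty
        if i >= rob_size:
            break                     # ROB exhausted (gaps plus entries)
        rob[i] = d
        i += 1
    return rob

# worst case from the text: every destination lands in the same lane
rob = allocate([4, 24, 16, 8, 12, 28, 20, 0, 4])
assert sum(e is not None for e in rob) == 8   # only 8 of 32 slots usable
```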
+
+i'll keep working on diagrams, and also reading mitch alsup's chapters
+on the 6600. they're frickin awesome. the 6600 could do multi-issue
+LD and ST by way of having registers dedicated to LD and ST: X1-X5
+were for LD, X6 and X7 for ST.
+
+----
+
+i took a shot at explaining this also on comp.arch today, and that
+allowed me to identify a problem with the proposed modulo-4 "lanes"
+stratification.
+
+when a result is created in one lane, it may need to be passed to the next
+lane. that means that each of the other lanes needs to keep a watchful
+eye on when another lane updates the other regfiles (all 3 of them).
+
+when an incoming update occurs, there may be up to 3 register writes
+(that need to be queued?) that need to be broadcast (written) into
+reservation stations.
+
+what i'm not sure of is: can data consistency be preserved, even if
+there's a delay? my big concern is that while the data is being
+broadcast from one lane, the head of the ROB arrives at that instruction
+(which is the "commit" condition), it gets committed, then, unfortunately,
+the same ROB# gets *reused*.
+
+now that i think about it, as long as the length of the queue is below
+the size of the Reorder Buffer (preferably well below), and as long as
+it's guaranteed to be emptied by the time the ROB cycles through the
+whole buffer, it *should* be okay.
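the invariant being argued can be written down as a simple check (the
numbers are assumptions, just to make the condition concrete):

```python
# A queued cross-lane broadcast is safe as long as (a) the queue is
# shorter than the ROB and (b) it is guaranteed to drain before the ROB
# cycles through the whole buffer and reuses the same ROB#.
def broadcast_is_safe(queue_depth, rob_size, max_drain_cycles):
    """True if a parked broadcast cannot outlive a full ROB wraparound
    (assuming at most one ROB entry is allocated per cycle)."""
    return queue_depth < rob_size and max_drain_cycles < rob_size

assert broadcast_is_safe(queue_depth=4, rob_size=32, max_drain_cycles=8)
assert not broadcast_is_safe(queue_depth=4, rob_size=32, max_drain_cycles=40)
```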
+
+----
+
+> Don't forget that in these days of Spectre and Meltdown, merely
+> preventing dead instruction results from being written to registers or
+> memory is NOT ENOUGH. You also need to prevent load instructions from
+> altering cache and branch instructions from altering branch prediction
+> state.
+
+Which, oddly enough, provides a necessity for being able to consume
+multiple containers from the cache miss buffers - which, in turn, is
+what makes a crucial mechanism in the Virtual Vector Method work.
+
+In the past, one would forward the demand container to the waiting
+memref and then write the whole line into the cache. Spectre and
+Meltdown mean you have to forward multiple times from the miss buffers
+and avoid damaging the cache until the instruction retires. VVM uses
+this to avoid having a vector strip-mine the data cache.
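a sketch of that miss-buffer behaviour (all names are illustrative
assumptions, not VVM's actual implementation): speculative consumers
are served from the buffer repeatedly, and the cache line is only
installed once the triggering instruction retires.

```python
# Miss buffer that forwards many times but fills the cache only after
# retirement, so mis-speculated loads leave no cache footprint.
class MissBuffer:
    def __init__(self, line_addr, data):
        self.line_addr = line_addr
        self.data = data              # the fetched line (list of words)
        self.retired = False

    def forward(self, offset):
        """Serve a waiting memref straight from the buffer; may happen
        many times (e.g. once per element of a vector loop)."""
        return self.data[offset]

    def try_fill(self, cache):
        """The cache is only 'damaged' (line installed) after retire."""
        if self.retired:
            cache[self.line_addr] = self.data
            return True
        return False

cache = {}
mb = MissBuffer(0x80, [10, 11, 12, 13])
assert mb.forward(2) == 12                        # speculative consumer served
assert not mb.try_fill(cache) and cache == {}     # cache untouched pre-retire
mb.retired = True
assert mb.try_fill(cache) and 0x80 in cache       # line installed at retire
```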
+
# References
* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* <https://en.wikipedia.org/wiki/Register_renaming> points out that
reservation stations take a *lot* of power.
+* <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
* MESI cache protocol, python <https://github.com/sunkarapk/mesi-cache.git>
<https://github.com/afwolfe/mesi-simulator>
* <https://kshitizdange.github.io/418CacheSim/final-report> report on
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
* <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
+* <https://www.eda.ncsu.edu/wiki/FreePDK>