# High-level architectural Requirements

* SMP Cache coherency (TileLink?)
* Minimum 800MHz
* Minimum 2-core SMP, more likely 4-core uniform design,
  each core with full 4-wide SIMD-style predicated ALUs
* 6GFLOPS single-precision FP
* 128 64-bit FP and 128 64-bit INT register files
* RV64GC compliance for running full GNU/Linux-based OS
* SimpleV compliance
* xBitManip (required for VPU and ideal for predication)
* On-chip tile buffer (memory-mapped SRAM), likely shared
  between all cores, for the collaborative creation of pixel "tiles".
* 4-lane 2Rx1W SRAMs for registers numbered 32 and above;
  Multi-R x Multi-W for registers 1-31.
  TODO: consider 2R for registers to be used as predication targets
  if >= 32.
* Idea: generic implementation of ports on register file so as to be able
  to experiment with different arrangements.
* Potentially: Lane-swapping / crossing / data-multiplexing
  bus on register data (particularly because of SHAPE-REMAP (1D/2D/3D))
* Potentially: Registers subdivided into 16-bit, to match
  elwidth down to 16-bit (for FP16).  8-bit elwidth only
  goes down as far as twin-SIMD (with predication).  This
  requires registers to have extra hidden bits: register
  x30 is now "x30.0+x30.1+x30.2+x30.3".  To be discussed.

# Conversation Notes

----

I'm thinking about using TileLink (or something similar) internally, as
having a cache-coherent protocol is required for implementing Vulkan
(unless you want to turn off the cache for the GPU memory, which I
don't think is a good idea). AXI is not a cache-coherent protocol,
and TileLink already has atomic RMW operations built into the protocol.
We can use an AXI-to-TileLink bridge to interface with the memory.

I'm thinking we will want to have a dual-core GPU since a single
core with 4xSIMD is too slow to achieve 6GFLOPS with a reasonable
clock speed. Additionally, that allows us to use an 800MHz core clock
instead of the 1.6GHz we would otherwise need, allowing us to lower the
core voltage and save power, since the power used is proportional to
F\*V^2. (just guessing on clock speeds.)
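
The arithmetic behind that trade-off can be sketched as follows. This is only a back-of-envelope check: it assumes one single-precision FLOP per SIMD lane per cycle (which is what makes 4 lanes at 1.6GHz land near 6GFLOPS), and the voltages are hypothetical values picked purely to illustrate the F\*V^2 scaling.

```python
# Back-of-envelope check of the clock/voltage trade-off.
# Assumption: one single-precision FLOP per SIMD lane per cycle.

def peak_gflops(cores, lanes, clock_ghz, flops_per_lane_per_cycle=1):
    """Peak throughput in GFLOPS for a uniform multi-core SIMD design."""
    return cores * lanes * clock_ghz * flops_per_lane_per_cycle


def relative_dynamic_power(cores, clock_ghz, volts):
    """Dynamic power in arbitrary units: proportional to F * V^2 per core."""
    return cores * clock_ghz * volts ** 2


single = peak_gflops(cores=1, lanes=4, clock_ghz=1.6)
dual = peak_gflops(cores=2, lanes=4, clock_ghz=0.8)

# Hypothetical core voltages, for illustration only.
p_single = relative_dynamic_power(cores=1, clock_ghz=1.6, volts=1.0)
p_dual = relative_dynamic_power(cores=2, clock_ghz=0.8, volts=0.8)

print(f"1 core  @ 1.6GHz: {single:.1f} GFLOPS, relative power {p_single:.2f}")
print(f"2 cores @ 0.8GHz: {dual:.1f} GFLOPS, relative power {p_dual:.2f}")
```

At equal voltage the two configurations would draw the same dynamic power; the saving comes entirely from being able to lower V at the lower clock.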

----

I don't know about power; however, I have done some research and a 4Kbyte
(or 16, I can't recall) SRAM (what I was thinking of for a tile buffer) takes in the
ballpark of 1000 um^2 in 28nm.
Using a 4xFMA with a banked register file where the bank is selected by the
lower order register number means we could probably get away with 1Rx1W
SRAM as the backing memory for the register file, similarly to Hwacha. I
would suggest 8 banks allowing us to do more in parallel since we could run
other units in parallel with a 4xFMA. 8 banks would also allow us to clock
gate the SRAM banks that are not in use for the current clock cycle
allowing us to save more power. Note that the 4xFMA could be 4 separately
allocated FMA units, it doesn't have to be SIMD style. If we have enough hw
parallelism, we can under-volt and under-clock the GPU cores allowing for a
more efficient GPU. If we are using the GPU cores as CPU cores as well, I
think it would be important to be able to use a faster clock speed when not
using the extended registers (similar to how Intel processors use a lower
clock rate when AVX512 is in use) so that scalar code is not slowed down
too much.

> > Using a 4xFMA with a banked register file where the bank is selected by
> > the lower order register number means we could probably get away with
> > 1Rx1W SRAM as the backing memory for the register file, similarly to
> > Hwacha.
>
>  okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).
>
Yeah pretty much, though I had meant the bank number comes from the
least-significant bits of the 7-bit register number.
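
A minimal sketch of that bank selection, assuming 8 banks and a 7-bit register number (128 registers). The function names are illustrative only, not taken from any existing codebase.

```python
NUM_BANKS = 8    # bank count suggested in the discussion above
BANK_BITS = 3    # log2(NUM_BANKS)
REG_BITS = 7     # 128 registers -> 7-bit register number


def bank_of(reg):
    """Bank index: the least-significant bits of the register number."""
    assert 0 <= reg < (1 << REG_BITS)
    return reg & (NUM_BANKS - 1)


def row_of(reg):
    """Row inside the bank: the remaining upper bits."""
    return reg >> BANK_BITS


def has_bank_conflict(regs):
    """True if two distinct registers accessed in the same cycle share a
    bank, which a single-ported (1R) bank could not serve at once."""
    banks = [bank_of(r) for r in set(regs)]
    return len(banks) != len(set(banks))


# Four consecutive registers (a typical vectorised access) hit four
# different banks; registers 8 apart collide.
print(has_bank_conflict([32, 33, 34, 35]))
print(has_bank_conflict([32, 40]))
```

With this mapping, sequentially numbered vector elements spread across banks, which is what makes the 1Rx1W backing SRAM plausible.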

----

Assuming 64-bit operands:
You could organize 2 SRAM macros and use the pair of them to
read/write 4 registers at a time (256 bits). The pipeline will then allow you to
dedicate 3 cycles to reading and 1 cycle to writing (4 registers each).

<pre>
RS1 = Read of operand S1
WRd = Write of result Dst
FMx = Floating Point Multiplier, x = stage.

   |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                   |FWD|FM1|FM2|FM3|FM4|
                       |FWD|FM1|FM2|FM3|FM4|
                           |FWD|FM1|FM2|FM3|FM4|WRd|
                   |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                                   |FWD|FM1|FM2|FM3|FM4|
                                       |FWD|FM1|FM2|FM3|FM4|
                                           |FWD|FM1|FM2|FM3|FM4|WRd|
                                   |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                                                   |FWD|FM1|FM2|FM3|FM4|
                                                       |FWD|FM1|FM2|FM3|FM4|
                                                           |FWD|FM1|FM2|FM3|FM4|WRd|
</pre>

The only trick is getting the read and write dedicated on different clocks.
When the RS3 operand is not needed (60% of the time) you can use
the time slot for reading or writing on behalf of memory refs; STs read,
LDs write.

You will find doing VRFs a lot more compact this way. In GPU land we
called the flip-flops orchestrating the timing "collectors".
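
One way to read the diagram above is as a fixed 4-beat rotation on the register file port: three read beats followed by one write beat. The sketch below models only that rotation and the lending of the RS3 beat to memory references; the slot names and function names are assumptions made for illustration.

```python
# 4-beat rotation on the register file port: 3 reads, then 1 write.
SLOTS = ("RS1", "RS2", "RS3", "WR")


def port_use(cycle):
    """Which access the register file port is dedicated to on this cycle."""
    return SLOTS[cycle % len(SLOTS)]


def free_for_memory(cycle, needs_rs3):
    """The RS3 beat can be lent to a load (write) or store (read) when the
    current FP operation has only two source operands."""
    return port_use(cycle) == "RS3" and not needs_rs3


for cycle in range(12):
    print(cycle, port_use(cycle))
```

In the diagram the first group's WRd lands in column 11 (counting from 0), which matches the write beat of this rotation.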

----

Justification for Branch Prediction

<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000212.html>

We can combine several branch predictors to make a decent predictor:

* call/return predictor -- important as it can predict calls and returns
  with around 99.8% accuracy
* loop predictor -- basically counts loop iterations
* some kind of global predictor -- handles everything else

We will also want a BTB (a small one will work): it reduces the average
branch cycle count from 2-3 to 1 since it predicts which instructions
are taken branches while the instructions are still being fetched,
allowing the fetch to go to the target address on the next clock rather
than having to wait for the fetched instructions to be decoded.
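
As a rough behavioural sketch of two of these pieces, here is a return-address stack (the call/return predictor) and a small direct-mapped BTB. The sizes, the indexing by `pc >> 1` (assuming 2-byte alignment because of the C extension) and the class names are illustrative assumptions, not a proposed design.

```python
class ReturnAddressStack:
    """Call/return predictor: push on call, pop on return."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: forget the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


class BranchTargetBuffer:
    """Small direct-mapped BTB, looked up with the fetch address so the
    next fetch can be redirected before the instruction is decoded."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}                # index -> (tag_pc, target)

    def _index(self, pc):
        return (pc >> 1) % self.entries

    def lookup(self, pc):
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]            # predicted-taken branch target
        return None                    # not a known taken branch

    def update(self, pc, target):
        self.table[self._index(pc)] = (pc, target)


ras = ReturnAddressStack()
ras.on_call(0x104)
ras.on_call(0x204)
btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)
```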

----

> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures

well, there is this concept:
https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf

it is a 2-level hierarchy for register cacheing.  honestly, though, the
reservation stations of the tomasulo algorithm are similar to a cache,
although only of the intermediate results, not of the initial operands.

i have a feeling we should investigate putting a 2-level register cache
in front of a multiplexed SRAM.

----

For GPU workloads FP64 is not common so I think having 1 FP64 ALU would
be sufficient. Since indexed loads and stores are not supported, it will
be important to support 4x64 integer operations to generate addresses
for loads/stores.

I was thinking we would use scoreboarding to keep track of operations
and dependencies since it doesn't need a CAM per ALU. We should be able
to design it to forward past the register file to allow for 0-latency
forwarding. If we combined that with register renaming it should prevent
most WAR and WAW data hazards.

I think branch prediction will be essential if only to fetch and decode
operations since it will reduce the branch penalty substantially.

Note that even if we have a zero-overhead loop extension, branch
prediction will still be useful as we will want to be able to run code
like compilers and standard RV code with decent performance. Additionally,
quite a few shaders have branching in their internal loops so
zero-overhead loops won't be able to fix all the branching problems.

----

> you would need a 4-wide cdb anyway, since that's the performance we're
> trying for.

 if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU,
then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle
per core, which means only a 2-wide CDB, a heck of a lot better than
4.

 oh: i thought of another way to cut the power-impact of the Reorder
Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address
length equal to the number of registers, 2 is because of 2-issue).

 the CAM of a ROB is on the instruction destination register.  key:
ROBnum, value: instr-dest-reg.  if you have a bitfield that says "this
destreg has no ROB tag", it's dead-easy to check that bitfield, first.
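
A small behavioural sketch of that idea, with the CAM stood in for by a dictionary search and a counter showing how often it is actually exercised. The class and method names are hypothetical, and ROB number reuse is not modelled.

```python
class RobTagFilter:
    """One bit per register: 'some ROB entry has this register as its
    destination'. The (expensive) CAM search is only done when set."""

    def __init__(self, num_regs=128):
        self.pending = [False] * num_regs
        self.rob_dest = {}        # ROB number -> destination register
        self.cam_searches = 0     # how many times the CAM was exercised

    def allocate(self, rob_num, dest_reg):
        self.rob_dest[rob_num] = dest_reg
        self.pending[dest_reg] = True

    def retire(self, rob_num):
        dest = self.rob_dest.pop(rob_num)
        # Clear the bit only if no other in-flight entry targets the register.
        self.pending[dest] = dest in self.rob_dest.values()

    def lookup(self, src_reg):
        if not self.pending[src_reg]:
            return None           # cheap path: bitfield says no ROB tag
        self.cam_searches += 1
        matches = [n for n, d in self.rob_dest.items() if d == src_reg]
        return matches[-1]        # most recently allocated producer


f = RobTagFilter()
f.allocate(rob_num=0, dest_reg=5)
```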

----

Avoiding Memory Hazards

* WAW and WAR hazards through memory are eliminated with speculation
because actual updating of memory occurs in order, when a store is at
the head of the ROB, and hence, no earlier loads or stores can still
be pending
* RAW hazards are maintained by two restrictions:
  1. not allowing a load to initiate the second step of its execution if
    any active ROB entry occupied by a store has a destination
    field that matches the value of the A field of the load and
  2. maintaining the program order for the computation of an effective
      address of a load with respect to all earlier stores
* These restrictions ensure that any load that accesses a memory location
  written to by an earlier store cannot perform the memory access until
  the store has written the data.
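
The two restrictions can be sketched as a single check made before a load is allowed to access memory. This is a simplified model: ROB indices are assumed to increase in program order (no wrap-around), and the names `PendingStore` and `load_may_access_memory` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingStore:
    rob_index: int               # position in program order
    address: Optional[int]       # None until the effective address is known


def load_may_access_memory(load_address, load_rob_index, stores):
    """Return True if the load may start its memory access."""
    for st in stores:
        if st.rob_index >= load_rob_index:
            continue             # later in program order: irrelevant
        if st.address is None:
            return False         # rule 2: earlier store address not yet known
        if st.address == load_address:
            return False         # rule 1: earlier store to the same address
    return True


stores = [PendingStore(rob_index=1, address=0x100)]
print(load_may_access_memory(0x100, 2, stores))   # blocked by the store
print(load_may_access_memory(0x108, 2, stores))   # different address
```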

Advantages of Speculation, Load and Store hazards:

* A store updates memory only when it reaches the head of the ROB
* WAW and WAR type of hazards are eliminated with speculation
  (actual updating of memory occurs in order)
* RAW hazards through memory are maintained by not allowing a load
  to initiate the second step of its execution
* Check if any store has a destination field that matches the
  value of the load:
    - SD F1 100(R2)
    - LD F2 100(R2)

Exceptions

* Exceptions are handled by not recognising the exception until the
  instruction that caused it is ready to commit in ROB (reaches head
  of ROB)

Reorder Buffer

* Results of an instruction become visible externally when it leaves
  the ROB
    - Registers updated
    - Memory updated

Reorder Buffer Entry

* Instruction type
    - branch (no destination result)
    - store (has a memory address destination)
    - register operation (ALU operation or load, which has reg dests)
* Destination
    - register number (for loads and ALU ops) or
    - memory address (for stores) where the result should be written
* Value
    - value of instruction result, pending a commit
* Ready
    - indicates that the instruction has completed execution: value is ready
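
The entry layout above, together with in-order commit from the head of the ROB, can be sketched like this. The field names follow the list; the `commit_ready` helper and the use of plain dictionaries for registers and memory are assumptions made to keep the example self-contained.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    kind: str                    # "branch", "store" or "reg"
    dest: Optional[int]          # register number or memory address
    value: Optional[int] = None  # result, pending commit
    ready: bool = False          # execution complete, value is valid


def commit_ready(rob, regs, mem):
    """Retire entries from the head while they are ready. Results only
    become externally visible here, in program order."""
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        if entry.kind == "reg":
            regs[entry.dest] = entry.value
        elif entry.kind == "store":
            mem[entry.dest] = entry.value
        committed += 1
    return committed


rob = deque([
    RobEntry("reg", dest=3, value=42, ready=True),
    RobEntry("store", dest=0x100, value=7, ready=False),
    RobEntry("reg", dest=4, value=9, ready=True),   # finished out of order
])
regs, mem = {}, {}
first = commit_ready(rob, regs, mem)   # only the head entry can retire
```

The third entry has finished executing but stays in the ROB until the store ahead of it is ready, which is also what makes exceptions precise.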

----

Register Renaming resources

* <https://www.youtube.com/watch?v=p4SdrUhZrBM>
* <https://www.d.umn.edu/~gshute/arch/register-renaming.xhtml>
* ROBs + Rename <http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323010.html>

Video @ 3:24, "RAT" table - Register Aliasing Table:

<img src="/3d_gpu/rat_table.png" />

This scheme looks very much like a Reservation Station.
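
For reference, a generic register alias table can be sketched as below. This is the textbook architectural-to-physical mapping, not necessarily the exact scheme shown in the video; freeing of physical registers at commit is left out, and all names are illustrative.

```python
class RegisterAliasTable:
    """Maps architectural register numbers to physical registers."""

    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))            # identity at reset
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction: sources read the current mapping, then
        the destination gets a fresh physical register."""
        phys_srcs = [self.map[s] for s in srcs]
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


rat = RegisterAliasTable(num_arch=4, num_phys=8)
d1, s1 = rat.rename(dest=1, srcs=[2, 3])   # r1 = r2 op r3
d2, s2 = rat.rename(dest=1, srcs=[1, 2])   # r1 = r1 op r2
```

Both instructions write architectural r1 but receive different physical destinations, so the WAW hazard disappears, while the second instruction's source still points at the first one's result (the true RAW dependency is kept).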

----

There is another way to get precise ordering of the writes in a scoreboard.
First, one has to implement forwarding in the scoreboard.
Second, the function units need an output queue (of, say, 4 registers).
Now, one can launch an instruction and pick up its operand either
from the RF or from the function unit output while the result sits
in the function unit waiting for its GO_Write signal.

Thus the launching of instructions is not delayed due to hazards
but the results are delivered to the RF in program order.
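
A behavioural sketch of that arrangement: results wait in the function unit output queues, operands can be forwarded from there, and writes to the RF are released strictly in program order. Sequence numbers stand in for the GO_Write ordering logic, and forwarding simply picks the newest held result for a register, which is a simplification.

```python
class WriteOrderer:
    def __init__(self, regfile):
        self.regfile = regfile
        self.next_seq = 0     # next instruction allowed to write the RF
        self.held = {}        # seq -> (dest_reg, value) in FU output queues

    def complete(self, seq, dest, value):
        """A function unit finished; the result waits for GO_Write."""
        self.held[seq] = (dest, value)

    def read(self, reg):
        """Operand fetch: forward from an output queue if possible."""
        for seq in sorted(self.held, reverse=True):
            dest, value = self.held[seq]
            if dest == reg:
                return value
        return self.regfile[reg]

    def drain(self):
        """Issue GO_Write to held results, in program order only."""
        written = []
        while self.next_seq in self.held:
            dest, value = self.held.pop(self.next_seq)
            self.regfile[dest] = value
            written.append(self.next_seq)
            self.next_seq += 1
        return written


rf = {1: 0, 2: 0}
w = WriteOrderer(rf)
w.complete(seq=1, dest=2, value=20)   # younger instruction finishes first
early = w.drain()                     # nothing may be written yet
forwarded = w.read(2)                 # but the value can be forwarded
```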

This looks surprisingly like a 'belt' at the end of the function unit.

----

> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg

 so, coming back to this diagram, i think if we stratify the
Functional Units into lanes as well, we may get a multi-issue
architecture.

 the 6600 scoreboard rules - which are awesomely simple and actually
involve D-Latches (3 gates) *not* flip-flops (10 gates) can be executed
in parallel because there will be no overlap between stratified registers.

 if using that odd-even / msw-lsw division (instead of modulo 4 on the
register number) it will be more like a 2-issue for standard RV
instructions and a 4-issue for when SV 32-bit ops are loop-generated.

 by subdividing the registers into odd-even banks we will need a
_pair_ of (completely independent) register-renaming tables:
  https://libre-riscv.org/3d_gpu/rat_table.png

 for SIMD'd operations, if we have the same type of reservation
station queue as with Tomasulo, it can be augmented with the byte-mask:
if the byte-masks in the queue of both the src and dest registers do
not overlap, the operations may be done in parallel.
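
 as a minimal sketch of that check (assuming one mask bit per byte of a 64-bit register, and a made-up function name):

```python
def masks_conflict(reg_a, mask_a, reg_b, mask_b):
    """Two queued operations touch the same data only if they name the
    same register AND their byte-masks share at least one byte."""
    return reg_a == reg_b and (mask_a & mask_b) != 0


# Lower and upper 32-bit halves of the same 64-bit register: independent.
print(masks_conflict(5, 0x0F, 5, 0xF0))
# Overlapping bytes of the same register: must be serialised.
print(masks_conflict(5, 0x0F, 5, 0x18))
```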

 i still have not yet thought through how the Reorder Buffer would
work: here, again, i am tempted to recommend that, again, we "stratify"
the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries,
however the CAM is only 4-bit or 3-bit wide.

 if an instruction's destination register does not meet the modulo
requirements, that ROB entry is *left empty*.  this does mean that,
for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo
4), and there are 4 sequential instructions that happen e.g. to have
a destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc.
etc.... the ROB will only hold 8 such instructions

and that i think is perfectly fine, because, statistically, it'll balance
out, and SV generates sequentially-incrementing instruction registers,
so *that* is fine, too.
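
the occupancy argument can be checked with a small model.  this assumes a particular allocation policy (slot i of the ROB may only hold a destination register with reg % lanes == i % lanes, and non-matching slots are skipped in program order), which is one reading of the scheme above, not a settled design.

```python
def rob_capacity(dest_regs, entries=32, lanes=4):
    """How many instructions (in program order) fit in the stratified ROB
    before it is full, leaving non-matching slots empty."""
    slot = 0
    held = 0
    for reg in dest_regs:
        # Leave slots empty until one belongs to this register's lane.
        while slot < entries and slot % lanes != reg % lanes:
            slot += 1
        if slot >= entries:
            break
        held += 1
        slot += 1
    return held


worst = rob_capacity([4, 24, 16] * 20)   # every destination is lane 0
best = rob_capacity(range(64))           # sequentially incrementing (SV)
print(worst, best)
```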

i'll keep working on diagrams, and also reading mitch alsup's chapters
on the 6600.  they're frickin awesome.  the 6600 could do multi-issue
LD and ST by way of having dedicated registers to LD and ST.  X1-X5 were
for ST, X6 and X7 for LD.

----

i took a shot at explaining this also on comp.arch today, and that
allowed me to identify a problem with the proposed modulo-4 "lanes"
stratification.

when a result is created in one lane, it may need to be passed to the next
lane.  that means that each of the other lanes needs to keep a watchful
eye on when another lane updates the other regfiles (all 3 of them).

when an incoming update occurs, there may be up to 3 register writes
(that need to be queued?) that need to be broadcast (written) into
reservation stations.

what i'm not sure of is: can data consistency be preserved, even if
there's a delay?  my big concern is that during the time where the data is
broadcast from one lane, the head of the ROB arrives at that instruction
(which is the "commit" condition), it gets committed, then, unfortunately,
the same ROB# gets *reused*.

now that i think about it, as long as the length of the queue is below
the size of the Reorder Buffer (preferably well below), and as long as
it's guaranteed to be emptied by the time the ROB cycles through the
whole buffer, it *should* be okay.

----

> Don't forget that in these days of Spectre and Meltdown, merely
> preventing dead instruction results from being written to registers or
> memory is NOT ENOUGH. You also need to prevent load instructions from
> altering cache and branch instructions from altering branch prediction
> state.

Which, oddly enough, provides a necessity for being able to consume
multiple containers from the cache Miss buffers, which oddly enough,
are what makes a crucial mechanism in the Virtual Vector Method work.

In the past, one would forward the demand container to the waiting
memref and then write the whole line into the cache. S&M means you
have to forward multiple times from the miss buffers and avoid damaging
the cache until the instruction retires. VVM uses this to avoid having
a vector strip mine the data cache.

# References

* <https://en.wikipedia.org/wiki/Tomasulo_algorithm>
* <https://en.wikipedia.org/wiki/Reservation_station>
* <https://en.wikipedia.org/wiki/Register_renaming> points out that
  reservation stations take a *lot* of power.
* <http://home.deib.polimi.it/silvano/FilePDF/AAC/Lesson_4_ILP_PartII_Scoreboard.pdf> scoreboarding
* MESI cache protocol, python <https://github.com/sunkarapk/mesi-cache.git>
  <https://github.com/afwolfe/mesi-simulator>
* <https://kshitizdange.github.io/418CacheSim/final-report> report on
  types of caches
* <https://github.com/ssc3?tab=repositories> interesting stuff
* <https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing>
  pipeline bypassing
* <http://ece-research.unm.edu/jimp/611/slides/chap4_7.html> Tomasulo / Reorder
* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
* Discussion <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-November/000157.html>
* <https://github.com/UCSBarchlab/PyRTL/blob/master/examples/example5-instrospection.py>
* <https://github.com/ataradov/riscv/blob/master/rtl/riscv_core.v#L210>
* <https://www.eda.ncsu.edu/wiki/FreePDK>