# High-level Architectural Requirements

* SMP cache coherency (TileLink?)
* Minimum 800 MHz
* Minimum 2-core SMP, more likely 4-core uniform design, each core with
  full 4-wide SIMD-style predicated ALUs
* 6 GFLOPS single-precision FP
* 128 64-bit FP and 128 64-bit INT register files
* RV64GC compliance for running a full GNU/Linux-based OS
* SimpleV compliance
* xBitManip (required for the VPU and ideal for predication)
* On-chip tile buffer (memory-mapped SRAM), likely shared between all cores,
  for the collaborative creation of pixel "tiles".
* 4-lane 2Rx1W SRAMs for registers numbered 32 and above; multi-R x multi-W
  for registers 1-31. TODO: consider 2R for registers to be used as
  predication targets if >= 32.
* Idea: generic implementation of ports on the register file so as to be able
  to experiment with different arrangements.
* Potentially: lane-swapping / crossing / data-multiplexing bus on register
  data (particularly because of SHAPE-REMAP (1D/2D/3D)).
* Potentially: registers subdivided into 16-bit, to match elwidth down to
  16-bit (for FP16). 8-bit elwidth only goes down as far as twin-SIMD (with
  predication). This requires registers to have extra hidden bits: register
  x30 is now "x30.0 + x30.1 + x30.2 + x30.3". Have to discuss (a small
  behavioural sketch of the sub-element view follows this list).

# Conversation Notes

----

I'm thinking about using TileLink (or something similar) internally, as having
a cache-coherent protocol is required for implementing Vulkan (unless you want
to turn off the cache for the GPU memory, which I don't think is a good idea).
AXI is not a cache-coherent protocol, and TileLink already has atomic RMW
operations built into the protocol. We can use an AXI-to-TileLink bridge to
interface with the memory.

I'm thinking we will want to have a dual-core GPU, since a single core with
4xSIMD is too slow to achieve 6 GFLOPS at a reasonable clock speed.
Additionally, that allows us to use an 800 MHz core clock instead of the
1.6 GHz we would otherwise need, allowing us to lower the core voltage and
save power, since the power used is proportional to F\*V^2. (Just guessing on
clock speeds.)

----

I don't know about power, however I have done some research and a 4 Kbyte (or
16, I can't recall) SRAM (what I was thinking of for a tile buffer) takes in
the ballpark of 1000 um^2 in 28nm.

Using a 4xFMA with a banked register file where the bank is selected by the
lower-order register number means we could probably get away with 1Rx1W SRAM
as the backing memory for the register file, similarly to Hwacha. I would
suggest 8 banks, allowing us to do more in parallel since we could run other
units in parallel with a 4xFMA. 8 banks would also allow us to clock-gate the
SRAM banks that are not in use for the current clock cycle, allowing us to
save more power.

Note that the 4xFMA could be 4 separately allocated FMA units; it doesn't have
to be SIMD style. If we have enough hw parallelism, we can under-volt and
under-clock the GPU cores, allowing for a more efficient GPU. If we are using
the GPU cores as CPU cores as well, I think it would be important to be able
to use a faster clock speed when not using the extended registers (similar to
how Intel processors use a lower clock rate when AVX512 is in use) so that
scalar code is not slowed down too much.

> > Using a 4xFMA with a banked register file where the bank is selected by
> > the lower-order register number means we could probably get away with
> > 1Rx1W SRAM as the backing memory for the register file, similarly to
> > Hwacha.
>
> okaaay.... sooo... we make an assumption that the top higher "banks"
> are pretty much always going to be "vectorised", such that, actually,
> they genuinely don't need to be 6R-4W (or whatever).

Yeah, pretty much, though I had meant the bank number comes from the
least-significant bits of the 7-bit register number.
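A quick behavioural sketch of the 16-bit sub-element idea from the last bullet
(illustration only: the class and method names are invented here, and neither
the actual SimpleV elwidth semantics nor the "hidden bits" question are
decided yet):

    # Hypothetical model of one 64-bit register viewed as four 16-bit
    # sub-elements ("x30.0" .. "x30.3"), as per the elwidth=16 idea above.
    class SubdividedReg:
        def __init__(self, value=0):
            self.value = value & 0xFFFF_FFFF_FFFF_FFFF

        def get_elem(self, idx, elwidth=16):
            # read sub-element `idx` of width `elwidth` bits
            mask = (1 << elwidth) - 1
            return (self.value >> (idx * elwidth)) & mask

        def set_elem(self, idx, val, elwidth=16):
            # write sub-element `idx`, leaving the other sub-elements intact
            mask = (1 << elwidth) - 1
            shift = idx * elwidth
            self.value = (self.value & ~(mask << shift)) | ((val & mask) << shift)

    x30 = SubdividedReg()
    x30.set_elem(2, 0x3C00)            # e.g. FP16 1.0 into "x30.2"
    assert x30.get_elem(2) == 0x3C00
    print(hex(x30.value))              # 0x3c0000000000: only bits 47:32 set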
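A minimal sketch of that bank-selection idea (bank number taken from the
least-significant bits of the 7-bit register number), showing that a group of
four consecutively-numbered registers, as a 4xFMA would consume, lands in four
different banks, so each bank can be a plain 1R1W SRAM. The bank count of 8
and the helper names are taken from / invented for the conversation above, not
decided:

    # Sketch: modulo banking of a 128-entry register file into 8 x 1R1W banks,
    # with the bank chosen by the low-order bits of the register number.
    NBANKS = 8          # suggestion from the conversation above
    NREGS  = 128

    def bank_of(regnum):
        return regnum % NBANKS       # least-significant bits select the bank

    def index_in_bank(regnum):
        return regnum // NBANKS      # remaining bits address within the bank

    # four consecutive vector registers (one "row" of a vectorised group)
    group = [64, 65, 66, 67]
    print([bank_of(r) for r in group])   # [0, 1, 2, 3]: no bank conflict,
                                         # so one read port per bank suffices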
----

Assuming 64-bit operands: you could organize 2 SRAM macros and use the pair of
them to read/write 4 registers at a time (256 bits). The pipeline will allow
you to dedicate 3 cycles for reading and 1 cycle for writing (4 registers
each).

    RS1 = Read of operand S1
    WRd = Write of result Dst
    FMx = Floating Point Multiplier, x = stage

    |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                    |FWD|FM1|FM2|FM3|FM4|
                        |FWD|FM1|FM2|FM3|FM4|
                            |FWD|FM1|FM2|FM3|FM4|WRd|
                |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                                |FWD|FM1|FM2|FM3|FM4|
                                    |FWD|FM1|FM2|FM3|FM4|
                                        |FWD|FM1|FM2|FM3|FM4|WRd|
                                |RS1|RS2|RS3|FWD|FM1|FM2|FM3|FM4|
                                                |FWD|FM1|FM2|FM3|FM4|
                                                    |FWD|FM1|FM2|FM3|FM4|
                                                        |FWD|FM1|FM2|FM3|FM4|WRd|

The only trick is getting the read and write dedicated on different clocks.
When the RS3 operand is not needed (60% of the time) you can use the time slot
for reading or writing on behalf of memory refs; STs read, LDs write. You will
find doing VRFs a lot more compact this way. In GPU land we called the
flip-flops orchestrating the timing "collectors".
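Those "collectors" can be thought of as a serial-to-parallel capture stage:
operand groups arrive one per cycle over the shared SRAM port, get latched,
and are only released to the FMA once complete. A rough behavioural sketch
(the class and method names are invented for illustration, not taken from any
existing design):

    # Rough model of an operand "collector": it soaks up reads that arrive
    # one per cycle over a shared port, then presents all operands at once.
    class Collector:
        def __init__(self, n_operands):
            self.slots = [None] * n_operands

        def capture(self, slot, value):
            # one read-port transfer per cycle lands in one slot
            self.slots[slot] = value

        def ready(self):
            return all(v is not None for v in self.slots)

        def release(self):
            # deliver all operands in parallel to the FMA, then clear
            assert self.ready()
            ops, self.slots = self.slots, [None] * len(self.slots)
            return ops

    c = Collector(3)              # one slot each for RS1, RS2, RS3
    c.capture(0, 1.5)             # cycle 1: RS1 arrives over the shared port
    c.capture(1, 2.0)             # cycle 2: RS2
    c.capture(2, 0.25)            # cycle 3: RS3
    a, b, addend = c.release()    # all three delivered in parallel
    print(a * b + addend)         # 3.25 (what FM1..FM4 would then compute)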
----

Justification for Branch Prediction

    |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
         |F123|F123|F123|F123|
         |Esk1|EsK2|EsK3|EsK4|
         |EfK1|EfK2|EfK3|EfK4|

4-cycle FU shown. Read as much as you need in 4 cycles for one operand, read
as much as you need in 4 cycles for another operand, read as much as you need
in 4 cycles for the last operand, then write as much as you can for the
result. This simply requires flip-flops to capture the width and then deliver
operands in parallel (a serial-to-parallel converter), and similarly for
writing.

# Design Layout

ok, so continuing some thoughts-in-order notes:

## Scoreboards

scoreboards are not just scoreboards, they are dependency matrices, and there
are several of them (a toy behavioural model of the FU-to-Registers matrix
follows this list):

* one for LOAD/STORE-to-LOAD/STORE
  - most recent LOADs prevent later STOREs
  - most recent STOREs prevent later LOADs
  - a separate process analyses LOAD-STORE addresses for conflicts, based on
    sufficient bits to assess uniqueness, as opposed to precise and exact
    matches
* one for Function-Unit to Function-Unit.
  - it expresses both RAW and WAW hazards through "Go_Write" and "Go_Read"
    signals, which are stopped from proceeding by dependent 1-bit CAM latches
  - exceptions may ALSO be made "precise" by holding a "Write prevention"
    signal. only when the Function Unit knows that an exception is not going
    to occur (memory has been fetched, for example) does it release the signal
  - speculative branch execution likewise may hold a "Write prevention",
    however it also needs a "Go die" signal, to clear out the
    incorrectly-taken branch.
  - LOADs/STOREs *also* must be considered as "Functional Units" and thus must
    also have corresponding entries (plural) in the FU-to-FU Matrix
  - it is permitted for ALUs to *BEGIN* execution (read operands are valid)
    without being permitted to *COMMIT*. thus, each FU must store (buffer)
    results, until such time as a "commit" signal is received
  - we may need to express an inter-dependence on the instruction order
    (raising the WAW hazard line to do so) as a way to preserve execution
    order. only the oldest instructions will have this flag dropped,
    permitting execution that has *begun* to also reach "commit" phase.
* one for Function-Unit to Registers.
  - it expresses the read and write requirements: the source and destination
    registers on which the operation depends. source registers are marked
    "need read", dest registers marked "need write".
  - by having *more than one* Functional Unit matrix row per ALU it becomes
    possible to effectively achieve "Reservation Stations" orthogonality with
    the Tomasulo Algorithm. the FU row must, like RS's, take and store a copy
    of the src register values.
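the toy model mentioned above: a heavily simplified behavioural sketch of the
FU-to-Registers matrix only (one row per FU, "need read" / "need write" bits
per register column, Go_Read blocked while any other FU holds a pending write
on a source register). it is an illustration under those assumptions, not the
real scoreboard logic: it ignores the FU-to-FU matrix, write-prevention /
Go-die shadows, commit buffering, and the "most recent writer" bit described
in the next section. all names are invented:

    # Toy FU-to-Registers dependency matrix.
    NREGS = 8      # tiny register file, purely for illustration

    class FURow:
        def __init__(self, name):
            self.name = name
            self.need_read  = [0] * NREGS   # one column per register
            self.need_write = [0] * NREGS

        def issue(self, srcs, dest):
            for s in srcs:
                self.need_read[s] = 1
            self.need_write[dest] = 1

    def can_go_read(fu, all_fus):
        # blocked while any *other* FU has a pending write on one of our
        # source registers (a RAW hazard)
        srcs = [r for r, need in enumerate(fu.need_read) if need]
        return not any(other is not fu and other.need_write[r]
                       for other in all_fus for r in srcs)

    fus = [FURow("mul-0"), FURow("add-0")]
    fus[0].issue(srcs=[2, 3], dest=1)     # MUL r1, r2, r3
    fus[1].issue(srcs=[1, 3], dest=4)     # ADD r4, r1, r3  (RAW on r1)
    print(can_go_read(fus[0], fus))       # True:  mul-0 may read its operands
    print(can_go_read(fus[1], fus))       # False: add-0 waits for mul-0's r1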
## Register Renaming

There are several potential well-known schemes for register renaming: *none of
them will be used here*. The scheme below is a new form of renaming that is a
topologically and functionally **direct** equivalent of the Tomasulo Algorithm
with a Reorder Buffer; it grew out of the "Register Alias Table" concept but
is better suited to Scoreboards.

It works by flattening out Reservation Stations to one per FU (requiring more
FUs as a result). On top of this, the function normally carried out by the
"tags" of the RAT table may be merged-morphed into the role carried out by the
ROB Destination Register CAM, which in turn may be merged-morphed into a
single vector (per register) of 1-bit mutually-exclusive "CAMs" that are
added, very simply, to the FU-Register Dependency Matrix. In this way, exactly
as in the Tomasulo Algorithm, there is absolutely no need whatsoever for a
separate PRF-ARF scheme. The PRF *is* the ARF.

Register renaming will be done with a single extra mutually-exclusive bit in
the FUxReg Dependency Matrix, which may be set on only one FU (per register).
This bit indicates which of the FUs has the **most recent** destination
register value pending. It is **directly** functionally equivalent to the
Reorder Buffer Dest Reg# CAM value, except that now it is a string of 1-bit
"CAMs". When an FU needs a src reg and finds that it needs to create a
dependency waiting for a result to be created, it must use this bit to
determine which FU it creates a dependency on.

If there is a destination register that already has a bit set (anywhere in the
column), it is **cleared** and **replaced** with a bit in the FU's row and the
destination register's column.

See https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/c80jRn4PCQAJ

    MUL r1, r2, r3

    FU name  Reg name
             12345678
    add-0    ........
    add-1    ........
    mul-0    X.......
    mul-1    ........

    ADD r4, r1, r3

    FU name  Reg name
             12345678
    add-0    ...X....
    add-1    ........
    mul-0    X.......
    mul-1    ........

    ADD r1, r5, r6

    FU name  Reg name
             12345678
    add-0    ...X....
    add-1    X.......
    mul-0    ........
    mul-1    ........

note how on the 3rd instruction, the (mul-0, R1) entry is **cleared** and
**replaced** with an (add-1, R1) entry. future instructions now know that if
their src operands require R1, they are to place a RAW dependency on
**add-1**, not mul-0.
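the worked example above can be reproduced with a tiny model of the "most
recent destination" bit. a dict mapping each register column to at most one FU
row is used here as a stand-in for the one-bit-per-FU column vector (at most
one bit set per column); the function names are invented for illustration:

    # Toy model of the per-column "most recent writer" bit: setting a new
    # destination clears any previous bit in that column (clear-and-replace).
    latest = {}          # reg column -> FU row currently holding the bit

    def issue(fu, dest, srcs):
        # a src whose column has a bit set creates a RAW dependency on that FU
        deps = {s: latest[s] for s in srcs if s in latest}
        latest[dest] = fu         # only one bit per column: clear and replace
        return deps

    issue("mul-0", dest=1, srcs=[2, 3])       # MUL r1, r2, r3
    issue("add-0", dest=4, srcs=[1, 3])       # ADD r4, r1, r3 (waits on mul-0)
    issue("add-1", dest=1, srcs=[5, 6])       # ADD r1, r5, r6 (takes over R1)
    print(issue("mul-1", dest=7, srcs=[1]))   # {1: 'add-1'}: new readers of R1
                                              # now depend on add-1, not mul-0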
## Multi-issue

we may potentially have 2-issue (or 4-issue) and a simpler issue-and-detection
scheme by "striping" the register file according to modulo 2 (or 4) on the
destination register number:

* the Function Unit rows are multiplied up by 2 (or 4), however they are
  actually connected to the same ALUs (pipelined and with both src and dest
  register buffers/latches).
* the Register Read and Write signals are then "striped" such that read/write
  requests for every 2nd (or 4th) register are "grouped" and will have to
  fight for access to a multiplexer in order to access registers that do not
  have the same modulo 2 (or 4) match.
* we MAY potentially be able to drop the destination (write) multiplexer(s) by
  only permitting FU rows with the same modulo to write to that destination
  bank. FUs with indices 0,4,8,12 may only write to registers similarly
  numbered.
* there will therefore be FOUR separate register-data buses, with (at least)
  the Read buses multiplexed so that all FU banks may read all src registers
  (even if there is contention for the multiplexers)

## FU-to-Register address de-muxed already

an oddity / artefact of the FU-to-Registers Dependency Matrix is that the
write/read enable signals already exist as single bits. "normal" processors
store the src/dest registers as an index (5 bits == 0-31), where in this
design that has been expanded out to 32 individual Read/Write wires, already.

* the register file verilog implementation therefore must take in an array of
  128-bit write-enable and 128-bit read-enable signals.
* however the data buses will be multiplexed modulo 2 (or 4) according to the
  lower bits of the register number, in order to cross "lanes".

## FU "Grouping"

with so many Function Units in RISC-V (dozens of instructions, times 2 to
provide Reservation Stations, times 2 OR 4 for dual (or quad) issue), we
almost certainly are going to have to deploy a "grouping" scheme:

* rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs to MUL etc.,
  instead we group the FUs by how many src and dest registers are required,
  and *pass the opcode down to them*
* only FUs with the exact same number (and type) of register profile will
  receive like-minded opcodes.
* when src and dest are free for a particular op (and an ALU pipeline is not
  stalled) the FU is at liberty to push the operands into the appropriate free
  ALU.
* FUs therefore only really express the register, memory, and execution
  dependencies: they don't actually do the execution.

## Recommendations

* Include a merged address-generator in the INT ALU
* Have simple ALU units duplicated and allow more than one FU to receive (and
  process) the src operands.

## Register file workloads

Note: vectorisation also includes predication, which is one extra integer read.

Integer workloads:

* 43% Integer
* 21% Load
* 12% Store
* 24% Branch
* 100% of the instruction stream can be integer instructions
* 75% utilize two source operand registers
* 50% of the instruction stream can be Load instructions
* 25% can be Store instructions
* 25% can be Branch instructions

FP workloads:

* 30% Integer
* 25% Load
* 10% Store
* 13% Multiplication
* 17% Addition
* 5% Branch

----

> in particular i found it fascinating that analysis of INT
> instructions found a 50% LD, 25% ST and 25% branch, and that
> 70% were 2-src ops. therefore you made sure that the number
> of read and write ports matched these, to ensure no bottlenecks,
> bearing in mind that ST requires reading an address *and*
> a data register.

I never had a problem in "reading the write slot" in any of my pipelines. That
is, take a pipeline where LD (cache hit) has a latency of 3 cycles (AGEN,
Cache, Align). Align would be in the cycle where the data was being forwarded,
and in the subsequent cycle the data could be written into the RF.

    |dec|AGN|$$$|ALN|LDW|

For stores I would read the LD's write slot, align the store data, and merge
into the cache as:

    |dec|AGEN|tag|---|STR|ALN|$$$|

You know 4 cycles in advance that a store is coming, 2 cycles after hit, so
there is easy logic to decide to read the write slot (or not), and it costs 2
address comparators to disambiguate this short shadow in the pipeline. This is
a lower expense than building another read port into the RF, in both area and
power, and uses the pipeline efficiently.

# References

*