X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=3d_gpu%2Fmicroarchitecture.mdwn;h=c9473f5d5a78512c434938d77932e14a5c6ea158;hb=eb93d93bb6fa4f31f71816f06473a6589c0bb7c5;hp=bd5a0ac875627790c1a0f333ebbed6d0239cc952;hpb=5eb1d608a7932e9f9619301688c15f4c91bd8145;p=libreriscv.git diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn index bd5a0ac87..c9473f5d5 100644 --- a/3d_gpu/microarchitecture.mdwn +++ b/3d_gpu/microarchitecture.mdwn @@ -9,6 +9,8 @@ * RV64GC compliance for running full GNU/Linux-based OS * SimpleV compliance * xBitManip (required for VPU and ideal for predication) +* On-chip tile buffer (memory-mapped SRAM), likely shared + between all cores, for the collaborative creation of pixel "tiles". * 4-lane 2Rx1W SRAMs for registers numbered 32 and above; Multi-R x Multi-W for registers 1-31. TODO: consider 2R for registers to be used as predication targets @@ -23,10 +25,21 @@ requires registers to have extra hidden bits: register x30 is now "x30:0+x30.1+x30.2+x30.3". have to discuss. +See [[requirements_specification]] + # Conversation Notes ---- +http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000310.html + +> We will need fast f32 <-> i16 at least since that is used for 16-bit +> z-buffers. Since we don't have indexed load/store and need to manually +> construct pointer vectors we will need fast i32 -> i64. We will also need +> fast i32 <-> f32. + +---- + 'm thinking about using tilelink (or something similar) internally as having a cache-coherent protocol is required for implementing Vulkan (unless you want to turn off the cache for the GPU memory, which I @@ -107,12 +120,697 @@ LDs write. You will find doing VRFs a lot more compact this way. In GPU land we called the flip-flops orchestrating the timing "collectors". 
+---- + +Justification for Branch Prediction + + + +We can combine several branch predictors to make a decent predictor: + +* call/return predictor -- important as it can predict calls and returns + with around 99.8% accuracy +* loop predictor -- basically counts loop iterations +* some kind of global predictor -- handles everything else + +We will also want a BTB (a smaller one will work): it reduces average +branch cycle count from 2-3 to 1 since it predicts which instructions +are taken branches while the instructions are still being fetched, +allowing the fetch to go to the target address on the next clock rather +than having to wait for the fetched instructions to be decoded. + +---- + +> https://www.researchgate.net/publication/316727584_A_case_for_standard-cell_based_RAMs_in_highly-ported_superscalar_processor_structures + +well, there is this concept: +https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf + +it is a 2-level hierarchy for register caching. honestly, though, the +reservation stations of the tomasulo algorithm are similar to a cache, +although only of the intermediate results, not of the initial operands. + +i have a feeling we should investigate putting a 2-level register cache +in front of a multiplexed SRAM. + +---- + +For GPU workloads FP64 is not common so I think having 1 FP64 ALU would +be sufficient. Since indexed loads and stores are not supported, it will +be important to support 4x64 integer operations to generate addresses +for loads/stores. + +I was thinking we would use scoreboarding to keep track of operations +and dependencies since it doesn't need a CAM per ALU. We should be able +to design it to forward past the register file to allow for 0-latency +forwarding. If we combined that with register renaming it should prevent +most WAR and WAW data hazards. + +I think branch prediction will be essential if only to fetch and decode +operations since it will reduce the branch penalty substantially. 
+ +Note that even if we have a zero-overhead loop extension, branch +prediction will still be useful as we will want to be able to run code +like compilers and standard RV code with decent performance. Additionally, +quite a few shaders have branching in their internal loops so +zero-overhead loops won't be able to fix all the branching problems. + +---- + +> you would need a 4-wide cdb anyway, since that's the performance we're +> trying for. + + if the 32-bit ops can be grouped as 2x SIMD to a 64-bit-wide ALU, +then only 2 such ALUs would be needed to give 4x 32-bit FP per cycle +per core, which means only a 2-wide CDB, a heck of a lot better than +4. + + oh: i thought of another way to cut the power-impact of the Reorder +Buffer CAMs: a simple bit-field (a single-bit 2RWW memory, of address +length equal to the number of registers, 2 is because of 2-issue). + + the CAM of a ROB is on the instruction destination register. key: +ROBnum, value: instr-dest-reg. if you have a bitfield that says "this +destreg has no ROB tag", it's dead-easy to check that bitfield, first. + +---- + +Avoiding Memory Hazards + +* WAW and WAR hazards through memory are eliminated with speculation +because actual updating of memory occurs in order, when a store is at +the head of the ROB, and hence, no earlier loads or stores can still +be pending +* RAW hazards are maintained by two restrictions: + 1. not allowing a load to initiate the second step of its execution if + any active ROB entry occupied by a store has a destination + field that matches the value of the A field of the load and + 2. maintaining the program order for the computation of an effective + address of a load with respect to all earlier stores +* These restrictions ensure that any load that accesses a memory location + written to by an earlier store cannot perform the memory access until + the store has written the data. 
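The two RAW restrictions above can be sketched in Python. This is only an illustration under stated assumptions: `RobEntry`, its fields and `load_may_access_memory` are hypothetical names, the ROB is modelled as a list ordered oldest-first, and restriction 2 is approximated by making the load wait while any earlier store has no computed address yet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RobEntry:
    """Hypothetical, simplified Reorder Buffer entry (not from the notes)."""
    kind: str                  # "store", "load" or "alu" (assumed encoding)
    dest_addr: Optional[int]   # store address; None while not yet computed
    active: bool = True        # False once the entry has committed


def load_may_access_memory(rob: List[RobEntry], load_index: int,
                           load_addr: int) -> bool:
    """Return True if the load at rob[load_index] may start its memory access.

    rob is ordered oldest-first, so rob[:load_index] holds every
    instruction that is earlier in program order than the load.
    """
    for entry in rob[:load_index]:
        if not entry.active or entry.kind != "store":
            continue
        if entry.dest_addr is None:
            # Assumed reading of restriction 2: an earlier store has not
            # computed its effective address yet, so the load must wait.
            return False
        if entry.dest_addr == load_addr:
            # Restriction 1: an active earlier store targets the same
            # address as the load (the load's "A field").
            return False
    return True
```

With an earlier `SD` to `100(R2)` still active in the ROB, a later `LD` from `100(R2)` is held back, while a load from a different address may proceed.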
+ +Advantages of Speculation, Load and Store hazards: + +* A store updates memory only when it reaches the head of the ROB +* WAW and WAR type of hazards are eliminated with speculation + (actual updating of memory occurs in order) +* RAW hazards through memory are maintained by not allowing a load + to initiate the second step of its execution +* Check if any store has a destination field that matches the + value of the load: + - SD F1 100(R2) + - LD F2 100(R2) + +Exceptions + +* Exceptions are handled by not recognising the exception until + the instruction that caused it is ready to commit in the ROB (reaches + the head of the ROB) + +Reorder Buffer + +* Results of an instruction become visible externally when it leaves + the ROB + - Registers updated + - Memory updated + +Reorder Buffer Entry + +* Instruction type + - branch (no destination result) + - store (has a memory address destination) + - register operation (ALU operation or load, which has reg dests) +* Destination + - register number (for loads and ALU ops) or + - memory address (for stores) where the result should be written +* Value + - value of instruction result, pending a commit +* Ready + - indicates that the instruction has completed execution: value is ready + +---- + +Register Renaming resources + +* +* +* ROBs + Rename + +Video @ 3:24, "RAT" table - Register Aliasing Table: + + + +This scheme looks very much like a Reservation Station. + +---- + +There is another way to get precise ordering of the writes in a scoreboard. +First, one has to implement forwarding in the scoreboard. +Second, the function units need an output queue. +Now, one can launch an instruction and pick up its operand either +from the RF or from the function unit output while the result sits +in the function unit waiting for its GO_Write signal. + +Thus the launching of instructions is not delayed due to hazards +but the results are delivered to the RF in program order. 
+ +This looks surprisingly like a 'belt' at the end of the function unit. + +---- + +> https://salsa.debian.org/Kazan-team/kazan/blob/e4b516e29469e26146e717e0ef4b552efdac694b/docs/ALU%20lanes.svg + + so, coming back to this diagram, i think if we stratify the +Functional Units into lanes as well, we may get a multi-issue +architecture. + + the 6600 scoreboard rules - which are awesomely simple and actually +involve D-Latches (3 gates) *not* flip-flops (10 gates) can be executed +in parallel because there will be no overlap between stratified registers. + + if using that odd-even / msw-lsw division (instead of modulo 4 on the +register number) it will be more like a 2-issue for standard RV +instructions and a 4-issue for when SV 32-bit ops are loop-generated. + + by subdividing the registers into odd-even banks we will need a +_pair_ of (completely independent) register-renaming tables: + https://libre-riscv.org/3d_gpu/rat_table.png + + for SIMD'd operations, if we have the same type of reservation +station queue as with Tomasulo, it can be augmented with the byte-mask: +if the byte-masks in the queue of both the src and dest registers do +not overlap, the operations may be done in parallel. + + i still have not yet thought through how the Reorder Buffer would +work: here, again, i am tempted to recommend that, again, we "stratify" +the ROB into odd-even (modulo 2) or perhaps modulo 4, with 32 entries, +however the CAM is only 4-bit or 3-bit wide. + + if an instruction's destination register does not meet the modulo +requirements, that ROB entry is *left empty*. this does mean that, +for a 32-entry Reorder Buffer, if the stratification is 4-wide (modulo +4), and there are 4 sequential instructions that happen e.g. to have +a destination of r4 for insn1, r24 for insn2, r16 for insn3.... etc. +etc.... 
the ROB will only hold 8 such instructions + +and that i think is perfectly fine, because, statistically, it'll balance +out, and SV generates sequentially-incrementing instruction registers, +so *that* is fine, too. + +i'll keep working on diagrams, and also reading mitch alsup's chapters +on the 6600. they're frickin awesome. the 6600 could do multi-issue +LD and ST by way of having dedicated registers to LD and ST. X1-X5 were +for ST, X6 and X7 for LD. + +---- + +i took a shot at explaining this also on comp.arch today, and that +allowed me to identify a problem with the proposed modulo-4 "lanes" +stratification. + +when a result is created in one lane, it may need to be passed to the next +lane. that means that each of the other lanes needs to keep a watchful +eye on when another lane updates the other regfiles (all 3 of them). + +when an incoming update occurs, there may be up to 3 register writes +(that need to be queued?) that need to be broadcast (written) into +reservation stations. + +what i'm not sure of is: can data consistency be preserved, even if +there's a delay? my big concern is that during the time where the data is +broadcast from one lane, the head of the ROB arrives at that instruction +(which is the "commit" condition), it gets committed, then, unfortunately, +the same ROB# gets *reused*. + +now that i think about it, as long as the length of the queue is below +the size of the Reorder Buffer (preferably well below), and as long as +it's guaranteed to be emptied by the time the ROB cycles through the +whole buffer, it *should* be okay. + +---- + +> Don't forget that in these days of Spectre and Meltdown, merely +> preventing dead instruction results from being written to registers or +> memory is NOT ENOUGH. You also need to prevent load instructions from +> altering cache and branch instructions from altering branch prediction +> state. 
+ +Which, oddly enough, provides a necessity for being able to consume +multiple containers from the cache Miss buffers, which oddly enough, +are what makes a crucial mechanism in the Virtual Vector Method work. + +In the past, one would forward the demand container to the waiting +memref and then write the whole line into the cache. S&M means you +have to forward multiple times from the miss buffers and avoid damaging +the cache until the instruction retires. VVM uses this to avoid having +a vector strip mine the data cache. + +---- + +> I meant the renaming done as part of the SV extension, not the +> microarchitectural renaming. + +ah ok, yes. right. ok, so i don't know what to name that, and i'd +been thinking of it in terms of "post-renaming", as in my mind, it's +not really renaming, at all, it's... remapping. or, vector +"elements". + +as in: architecturally we already have a name (vector "elements"). +physically we already have a name: register file. + +i was initially thinking that the issue stage would take care of it, +by producing: + +* post-remapped elements which are basically post-remapped register indices +* a byte-mask indicating which *bytes* of the register are to be + modified and which left alone +* an element-width that is effectively an augmentation of (part of) the opcode + +the element width goes into the ALU as an augmentation of the opcode +because the 64-bit "register" now contains e.g. 16-bit "elements" +indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a +SIMD-style (predicated) operation to take place. + +now that i think about it, i think we may need to have the three +phases be part of a pipeline, in a single dependency matrix. + +---- + +I had a state machine in one chip that could come up out of power on in a +state it could not get out of. Since this experience, I have a rule with +state machines: a state machine must be able to go from any state to idle +when the reset line is asserted. 
+ +You have to prove that the logic can never create a circular dependency, +not a proof with test vectors, a logical proof like what we do with FP +arithmetic these days. + +---- + + +> however... we don't mind that, as the vectorisation engine will +> be, for the most part, generating sequentially-increasing index +> dest *and* src registers, so we kinda get away with it. + +In this case:: you could simply design a 1R or 1W file (A.K.A. SRAM) +and read 4 registers at a time or write 4 registers at a time. Timing +looks like: + +
+     |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
+                    |F123|F123|F123|F123|
+                         |Esk1|EsK2|EsK3|EsK4|
+                                        |EfK1|EfK2|EfK3|EfK4|
+
+ +4 cycle FU shown. Read as much as you need in 4 cycles for one operand, +read as much as you need in 4 cycles for another operand, read as much +as you need in 4 cycles for the last operand, then write as much as you +can for the result. This simply requires flip-flops to capture the width +and then deliver operands in parallel (serial to parallel converter) and +similarly for writing. + +---- + +* + +discussion of how to do dest-latches rather than src-latches. + +also includes need for forwarding to achieve it (synonymous with +Tomasulo CDB). + +also, assigning a result number at issue time allows multiple results +to be stored-and-forwarded, meaning that multiplying up the FUs is +not needed. + +also, discussion of how to have multiple instructions issued even with +the same dest reg: drop the reg-store and effectively rename them +to "R.FU#". exceptions under discussion. + +---- + +Speculation + + + +There is a minimal partial order that is immune to Spectre and friends. +You have the dependence matrix that imposes a minimal partial order on +executing instructions (at least in the architecture you have been +discussing herein). You just have to prove that your matrix provides +that minimal partial order for instructions. + +Then you have to prove that no cache/TLB state can be updated prior to the +causing instruction being made retirable (not retired, retirable). + +As to cache updates, all "reasonable" interfaces that service cache misses +will have line buffers to deal with the inbound and outbound memory traffic. +These buffers will provide the appropriate data to the execution stream, +but not update the cache until the causing instruction becomes transitively +retirable. This will put "a little" extra pressure on these buffers. + +As to the TLB it is easy enough on a TLB miss to fetch the mapping tables +transitively and arrive at a PTE. This PTE cannot be installed until the +causing instruction becomes retirable.
The miss buffers are probably the +right place, and if a second TLB miss occurs, you might just as well walk +the tables again and if it hits the line in the buffer use the data from +there. When we looked at this a long time ago, there was little benefit +for being able to walk more than one TLB miss at a time. + +---- + +Register Prefixes + +
+|           3      |           2      |           1      |           0      |
+| ---------------- | ---------------- | ---------------- | ---------------- |
+|                  | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+
+ +
+2x16-bit / 32-bit:
+
+| 9 8   | 7 6 5 |     4 3 |     2 1 | 0 |
+| ----- | ----- | ------- | ------- | - |
+| elwid | VL    | rs[6:5] | rd[6:5] | 0 |
+
+| 9 8 7 6 5 |      4 3 |   2 |   1 | 0 |
+| --------- | -------- | --- | --- | - |
+| predicate | predtarg | end | inv | 1 |
+
+
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+
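As a rough illustration of the two 10-bit layouts in the "2x16-bit / 32-bit" tables above, the following Python sketch unpacks the fields. It assumes the 10 bits have already been extracted from the instruction and that bit 0 selects between the two layouts, as the tables suggest; the function name `decode_prefix_fields` and the "vector"/"predicate" labels are made up for this example.

```python
def decode_prefix_fields(field: int) -> dict:
    """Unpack a 10-bit prefix field according to the two tables.

    Bit 0 == 0: elwid[9:8], VL[7:5], rs[6:5] in bits 4:3, rd[6:5] in bits 2:1.
    Bit 0 == 1: predicate[9:5], predtarg[4:3], end[2], inv[1].
    """
    if not 0 <= field < (1 << 10):
        raise ValueError("prefix field must fit in 10 bits")
    if field & 1 == 0:
        return {
            "format": "vector",            # label chosen for this sketch
            "elwid": (field >> 8) & 0b11,
            "vl": (field >> 5) & 0b111,
            "rs_hi": (field >> 3) & 0b11,  # bits 6:5 of the source register
            "rd_hi": (field >> 1) & 0b11,  # bits 6:5 of the dest register
        }
    return {
        "format": "predicate",             # label chosen for this sketch
        "predicate": (field >> 5) & 0b11111,
        "predtarg": (field >> 3) & 0b11,
        "end": (field >> 2) & 1,
        "inv": (field >> 1) & 1,
    }
```

For example, `decode_prefix_fields(0b1001101110)` gives elwid 2, VL 3, rs[6:5] = 1 and rd[6:5] = 3.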
+ +# MVX and other reg-shuffling + +
+> Crucial strategic op missing is MVX:
+> regs[rd]= regs[regs[rs1]]
+>
+we could modify the definition slightly:
+for i in 0..VL {
+    let offset = regs[rs1 + i];
+    // we could also limit on out-of-range
+    assert!(offset < VL); // trap on fail
+    regs[rd + i] = regs[rs2 + offset];
+}
+
+The dependency matrix would have the instruction depend on everything from
+rs2 to rs2 + VL and we let the execution unit figure it out. for
+simplicity, we could extend the dependencies to a power of 2 or something.
+
+We should add some constrained swizzle instructions for the more
+pipeline-friendly cases. One that will be important is:
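+for i in 0..VL {
+    let i = i * 4; // each iteration handles one group of 4 registers
+    let mut s1 = [0; 4]; // copy the group first so rd may overlap rs1
+    for j in 0..4 {
+        s1[j] = regs[rs1 + i + j];
+    }
+    for j in 0..4 {
+        // 2 bits of imm select the source lane for each destination lane
+        regs[rd + i + j] = s1[(imm >> (j * 2)) & 0x3];
+    }
+}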
+Another is matrix transpose for (2-4)x(2-4) matrices which we can implement
+as similar to a strided ld/st except for registers.
+
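The two loops above can be tried out with a small Python model. This is a sketch only: the register file is a plain list, the "trap" is modelled as an `IndexError`, and the function names `mvx` and `swizzle4` are chosen for this example rather than taken from any specification.

```python
def mvx(regs, rd, rs1, rs2, vl):
    """Vectorised MVX: regs[rd + i] = regs[rs2 + regs[rs1 + i]]."""
    for i in range(vl):
        offset = regs[rs1 + i]
        if not 0 <= offset < vl:
            # stands in for the "trap on fail" of the pseudocode
            raise IndexError(f"MVX offset {offset} out of range 0..{vl - 1}")
        regs[rd + i] = regs[rs2 + offset]


def swizzle4(regs, rd, rs1, imm, vl):
    """Constrained swizzle on groups of 4 registers.

    Each destination lane j picks source lane (imm >> (2 * j)) & 3
    from the same group of 4.
    """
    for group in range(vl):
        base = group * 4
        s1 = [regs[rs1 + base + j] for j in range(4)]  # snapshot of the group
        for j in range(4):
            regs[rd + base + j] = s1[(imm >> (j * 2)) & 0x3]
```

With `regs[8:12] = [3, 2, 1, 0]` as indices and `regs[16:20] = [10, 11, 12, 13]` as data, `mvx(regs, 24, 8, 16, 4)` writes the data reversed into `regs[24:28]`. An `imm` of `0b00011011` makes `swizzle4` reverse each group of four.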
+ +# TLBs / Virtual Memory
+ +---- + +We were specifically looking for ways to not need large CAMs since they are +power-hungry when designing the instruction scheduling logic, so it may be +a good idea to have a smaller L1 TLB and a larger, slower, more +power-efficient, L2 TLB. I would have the L1 be 4-32 entries and the L2 can +be 32-128 as long as the L2 cam isn't being activated every clock cycle. We +can also share the L2 between the instruction and data caches. + +# Register File having same-cycle "forwarding" + +discussion about CDC 6600 Register File: it was capable of forwarding +operands being written out to "reads", *in the same cycle*. this +effectively turns the Reg File *into* a "Forwarding Bus". + +we aim to only have (4 banks of) 2R1W ported register files, +with *additional* Forwarding Multiplexers (which look exactly +like multi-port regfile gate logic). + +suggestion by Mitch is to have a "demon" on the front of the regfile, +, +which: + + basically, you are going to end up with a "demon" at the RF and when + all read reservations have been satisfied the demon determines if the + result needs to be written to the RF or discarded. The demon sees + the instruction issue process, the branch resolutions, and the FU + exceptions, and keeps track of whether the result needs to be written. + It then forwards the result from the FU and clears the slot, then writes + the result to the RF if needed. + +# Design Layout + +ok,so continuing some thoughts-in-order notes: + +## Scoreboards + +scoreboards are not just scoreboards, they are dependency matrices, +and there are several of them: + +* one for LOAD/STORE-to-LOAD/STORE + - most recent LOADs prevent later STOREs + - most recent STOREs prevent later LOADs. + - a separate process analyses LOAD-STORE addresses for + conflicts, based on sufficient bits to assess uniqueness + as opposed to precise and exact matches +* one for Function-Unit to Function-Unit. 
+ - it expresses both RAW and WAW hazards through "Go_Write" + and "Go_Read" signals, which are stopped from proceeding by + dependent 1-bit CAM latches + - exceptions may ALSO be made "precise" by holding a "Write prevention" +    signal.  only when the Function Unit knows that an exception is + not going to occur (memory has been fetched, for example), does + it release the signal + - speculative branch execution likewise may hold a "Write prevention", + however it also needs a "Go die" signal, to clear out the + incorrectly-taken branch. + - LOADs/STOREs *also* must be considered as "Functional Units" and thus +    must also have corresponding entries (plural) in the FU-to-FU Matrix + - it is permitted for ALUs to *BEGIN* execution (read operands are + valid) without being permitted to *COMMIT*.  thus, each FU must + store (buffer) results, until such time as a "commit" signal is + received + - we may need to express an inter-dependence on the instruction order +    (raising the WAW hazard line to do so) as a way to preserve execution +    order.  only the oldest instructions will have this flag dropped, + permitting execution that has *begun* to also reach "commit" phase. +* one for Function-Unit to Registers. + - it expresses the read and write requirements: the source + and destination registers on which the operation depends.  source + registers are marked "need read", dest registers marked + "need write". + - by having *more than one* Functional Unit matrix row per ALU + it becomes possible to effectively achieve "Reservation Stations" + orthogonality with the Tomasulo Algorithm.  the FU row must, like + RS's, take and store a copy of the src register values. + +## Register Renaming + +There are several potential well-known schemes for register-renaming: +*none of them will be used here*. 
The scheme below is a new form of +renaming that is a topologically and functionally **direct** equivalent +of the Tomasulo Algorithm with a Reorder Buffer, that came from the +"Register Alias Table" concept that is better suited to Scoreboards. +It works by flattening out Reservation Stations to one per FU (requiring +more FUs as a result). On top of this the function normally carried +out by "tags" of the RAT table may be merged-morphed into the role +carried out by the ROB Destination Register CAM which may be merged-morphed +into a single vector (per register) of 1-bit mutually-exclusive "CAMs" +that are added, very simply, to the FU-Register Dependency Matrix. + +In this way, exactly as in the Tomasulo Algorithm, there is absolutely no +need whatsoever for a separate PRF-ARF scheme. The PRF *is* the ARF. + +Register-renaming will be done with a single extra mutually-exclusive bit +in the FUxReg Dependency Matrix, which may be set on only one FU (per register). +This bit indicates which of the FUs has the **most recent** destination +register value pending. It is **directly** functionally equivalent to +the Reorder Buffer Dest Reg# CAM value, except that now it is a +string of 1-bit "CAMs". + +When an FU needs a src reg and finds that it needs to create a +dependency waiting for a result to be created, it must use this +bit to determine which FU it creates a dependency on. + +If there is a destination register that already has a bit set +(anywhere in the column), it is **cleared** and **replaced** +with a bit in the FU's row and the destination register's column. + +See https://groups.google.com/d/msg/comp.arch/w5fUBkrcw-s/c80jRn4PCQAJ + +MUL r1, r2, r3 + + FU name Reg name + 12345678 + add-0 ........ + add-1 ........ + mul-0 X....... + mul-1 ........ + +ADD r4, r1, r3 + + FU name Reg name + 12345678 + add-0 ...X.... + add-1 ........ + mul-0 X....... + mul-1 ........ + +ADD r1, r5, r6 + + FU name Reg name + 12345678 + add-0 ...X.... + add-1 X....... 
+ mul-0 ........ + mul-1 ........ + +note how on the 3rd instruction, the (mul-0,R1) entry is **cleared** +and **replaced** with an (add-1,R1) entry. future instructions now +know that if their src operands require R1, they are to place a +RaW dependency on **add-1**, not mul-0 + +## Multi-issue + +we may potentially have 2-issue (or 4-issue) and a simpler issue and +detection by "striping" the register file according to modulo 2 (or 4) +on the destination   register number + +* the Function Unit rows are multiplied up by 2 (or 4) however they are +  actually connected to the same ALUs (pipelined and with both src and +  dest register buffers/latches). +* the Register Read and Write signals are then "striped" such that + read/write requests for every 2nd (or 4th) register are "grouped" and + will have to fight for access to a multiplexer in order to access + registers that do not have the same modulo 2 (or 4) match. +* we MAY potentially be able to drop the destination (write) multiplexer(s) +  by only permitting FU rows with the same modulo to write to that + destination bank.  FUs with indices 0,4,8,12 may only write to registers + similarly numbered. +* there will therefore be FOUR separate register-data buses, with (at least) +  the Read buses multiplexed so that all FU banks may read all src registers +  (even if there is contention for the multiplexers) + +## FU-to-Register address de-muxed already + +an oddity / artefact of the FU-to-Registers Dependency Matrix is that the +write/read enable signals already exist as single-bits.  "normal" processors +store the src/dest registers as an index (5 bits == 0-31), where in this +design, that has been expanded out to 32 individual Read/Write wires, +already. + +* the register file verilog implementation therefore must take in an + array of 128-bit write-enable and 128-bit read-enable signals. 
+* however the data buses will be multiplexed modulo 2 (or 4) according + to the lower bits of the register number, in order to cross "lanes". + +## FU "Grouping" + +with so many Function Units in RISC-V (dozens of instructions, times 2 +to provide Reservation Stations, times 2 OR 4 for dual (or quad) issue), +we almost certainly are going to have to deploy a "grouping" scheme: + +* rather than dedicate 2x4 FUs to ADD, 2x4 FUs to SUB, 2x4 FUs + to MUL etc., instead we group the FUs by how many src and dest + registers are required, and *pass the opcode down to them* +* only FUs with the exact same number (and type) of register profile + will receive like-minded opcodes. +* when src and dest are free for a particular op (and an ALU pipeline is + not stalled) the FU is at liberty to push the operands into the + appropriate free ALU. +* FUs therefore only really express the register, memory, and execution + dependencies: they don't actually do the execution. + +## Recommendations + +* Include a merged address-generator in the INT ALU +* Have simple ALU units duplicated and allow more than one FU to + receive (and process) the src operands. + +## Register file workloads + +Note: Vectorisation also includes predication, which is one extra integer read + +Integer workloads: + +* 43% Integer +* 21% Load +* 12% store +* 24% branch + +* 100% of the instruction stream can be integer instructions +* 75% utilize two source operand registers. +* 50% of the instruction stream can be Load instructions +* 25% can be store instructions, +* 25% can be branch instructions + +FP workloads: + +* 30% Integer +* 25% Load +* 10% Store +* 13% Multiplication +* 17% Addition +* 5% branch + +---- + +> in particular i found it fascinating that analysis of INT +> instructions found a 50% LD, 25% ST and 25% branch, and that +> 70% were 2-src ops. 
therefore you made sure that the number +> of read and write ports matched these, to ensure no bottlenecks, +> bearing in mind that ST requires reading an address *and* +> a data register. + +I never had a problem in "reading the write slot" in any of my pipelines. +That is, take a pipeline where LD (cache hit) has a latency of 3 cycles +(AGEN, Cache, Align). Align would be in the cycle where the data was being +forwarded, and the subsequent cycle, data could be written into the RF. + +|dec|AGN|$$$|ALN|LDW| + +For stores I would read the LDs write slot Align the store data and merge +into the cache as:: + +|dec|AGEN|tag|---|STR|ALN|$$$| + +You know 4 cycles in advance that a store is coming, 2 cycles after hit +so there is easy logic to decide to read the write slot (or not), and it +costs 2 address comparators to disambiguate this short shadow in the pipeline. + +This is a lower expense than building another read port into the RF, in +both area and power, and uses the pipeline efficiently. + +# Explicit Vector Length (EVL) extension to LLVM + +* +* +* + # References * * * points out that reservation stations take a *lot* of power. +* scoreboarding * MESI cache protocol, python * report on @@ -125,3 +823,4 @@ called the flip-flops orchestrating the timing "collectors". * Discussion * * +*