X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=3d_gpu%2Fmicroarchitecture.mdwn;h=d26d8948478b9e9c3c5e76b2c6489c2b6cbc50c6;hb=1361c489ba642cbf1395097f10b730edc146b81d;hp=7975be84c063dc5e0eb71d69137156f9bd46ed2f;hpb=152ffba5919446044f3bbb3b8892913d672af3db;p=libreriscv.git

diff --git a/3d_gpu/microarchitecture.mdwn b/3d_gpu/microarchitecture.mdwn
index 7975be84c..d26d89484 100644
--- a/3d_gpu/microarchitecture.mdwn
+++ b/3d_gpu/microarchitecture.mdwn
@@ -25,10 +25,21 @@ requires registers to have extra hidden bits: register x30 is now
 "x30:0+x30.1+x30.2+x30.3". have to discuss.
 
+See [[requirements_specification]]
+
 # Conversation Notes
 
 ----
 
+http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000310.html
+
+> We will need fast f32 <-> i16 at least since that is used for 16-bit
+> z-buffers. Since we don't have indexed load/store and need to manually
+> construct pointer vectors we will need fast i32 -> i64. We will also need
+> fast i32 <-> f32.
+
+----
+
 'm thinking about using tilelink (or something similar) internally as
 having a cache-coherent protocol is required for implementing Vulkan
 (unless you want to turn off the cache for the GPU memory, which I
@@ -354,6 +365,216 @@ have to forward multiple times from the miss buffers and avoid damaging
 the cache until the instruction retires. VVM uses this to avoid having
 a vector strip mine the data cache.
 
+----
+
+> I meant the renaming done as part of the SV extension, not the
+> microarchitectural renaming.
+
+ah ok, yes. right. ok, so i don't know what to name that, and i'd
+been thinking of it in terms of "post-renaming", as in my mind, it's
+not really renaming, at all, it's... remapping. or, vector
+"elements".
+
+as in: architecturally we already have a name (vector "elements").
+physically we already have a name: register file.
+
+i was initially thinking that the issue stage would take care of it,
+by producing:
+
+* post-remapped elements which are basically post-remapped register indices
+* a byte-mask indicating which *bytes* of the register are to be
+  modified and which left alone
+* an element-width that is effectively an augmentation of (part of) the opcode
+
+the element width goes into the ALU as an augmentation of the opcode
+because the 64-bit "register" now contains e.g. 16-bit "elements"
+indexed 0-3, or 8-bit "elements" indexed 0-7, and we now want a
+SIMD-style (predicated) operation to take place.
+
+now that i think about it, i think we may need to have the three
+phases be part of a pipeline, in a single dependency matrix.
+
+----
+
+I had a state machine in one chip that could come up out of power on in a
+state it could not get out of. Since this experience, I have a rule with
+state machines, A state machine must be able to go from any state to idle
+when the reset line is asserted.
+
+You have to prove that the logic can never create a circular dependency,
+not a proof with test vectors, a logical proof like what we do with FP
+arithmetic these days.
+
+----
+
+
+> however... we don't mind that, as the vectorisation engine will
+> be, for the most part, generating sequentially-increasing index
+> dest *and* src registers, so we kinda get away with it.
+
+In this case:: you could simply design a 1R or 1W file (A.K.A. SRAM)
+and read 4 registers at a time or write 4 registers at a time. Timing
+looks like:
+
+
+ |RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|RdS1|RdS2|RdS3|WtRd|
+ |F123|F123|F123|F123|
+ |Esk1|EsK2|EsK3|EsK4|
+ |EfK1|EfK2|EfK3|EfK4|
+
+
+4 cycle FU shown. Read as much as you need in 4 cycles for one operand,
+Read as much as you need in 4 cycles for another operand, read as much
+as you need in 4 cycles for the last operand, then write as much as you
+can for the result. This simply requires flip-flops to capture the width
+and then deliver operands in parallel (serial to parallel converter) and
+similarly for writing.
+
+----
+
+*
+| 3                | 2                | 1                | 0                |
+| ---------------- | ---------------- | ---------------- | ---------------- |
+|                  | xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+
+
+
+2x16-bit / 32-bit:
+
+| 9 8   | 7 6 5 | 4 3     | 2 1     | 0 |
+| ----- | ----- | ------- | ------- | - |
+| elwid | VL    | rs[6:5] | rd[6:5] | 0 |
+
+| 9 8 7 6 5 | 4 3      | 2   | 1   | 0 |
+| --------- | -------- | --- | --- | - |
+| predicate | predtarg | end | inv | 1 |
+
+
+|                  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXX011111 |
+|                  | xxxxxxxxxxxxxxaa | XXXXXXXXXX011111 | XXXXXXXXXX011111 |
+| xxxxxxxxxxxxxxaa | xxxxxxxxxxxxxxaa | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+| xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | XXXXXXXXXXXXXXXX | XXXXXXXXX0111111 |
+
+
+# MVX and other reg-shuffling
+
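The two register-shuffling operations discussed in this section (a vectorised, register-indexed MVX and a constrained 4-element swizzle) can be sketched as plain Rust over a software model of the register file. This is only an illustration under stated assumptions: `NUM_REGS`, the `u64` element type, the function names `mvx` and `swizzle4`, and the use of `Err` to stand in for a trap are all hypothetical and not part of any specification. The sketch also reads every source before writing any destination, which is a choice made here and not something the notes settle.

```rust
/// Hypothetical register-file size; an assumption made for this sketch only.
pub const NUM_REGS: usize = 128;

/// Vectorised MVX sketch: regs[rd + i] = regs[rs2 + regs[rs1 + i]] for i in 0..vl.
/// An out-of-range offset (>= vl) is reported as Err, standing in for a trap.
pub fn mvx(
    regs: &mut [u64; NUM_REGS],
    rd: usize,
    rs1: usize,
    rs2: usize,
    vl: usize,
) -> Result<(), String> {
    // Assumption: gather all sources first, so overlapping rd/rs2 ranges
    // see only pre-instruction values and a trap leaves regs untouched.
    let mut gathered = Vec::with_capacity(vl);
    for i in 0..vl {
        let offset = regs[rs1 + i] as usize;
        if offset >= vl {
            return Err(format!("element {}: offset {} out of range", i, offset));
        }
        gathered.push(regs[rs2 + offset]);
    }
    for (i, value) in gathered.into_iter().enumerate() {
        regs[rd + i] = value;
    }
    Ok(())
}

/// Constrained swizzle sketch: each of the `vl` groups of 4 registers is
/// permuted by an 8-bit immediate holding four 2-bit lane selectors.
pub fn swizzle4(regs: &mut [u64; NUM_REGS], rd: usize, rs1: usize, imm: u8, vl: usize) {
    for group in 0..vl {
        let base = group * 4;
        // Copy the source group first so rd may alias rs1.
        let mut s1 = [0u64; 4];
        for j in 0..4 {
            s1[j] = regs[rs1 + base + j];
        }
        for j in 0..4 {
            // Selector j occupies bits [2j+1 : 2j] of imm.
            let sel = ((imm >> (j * 2)) & 0x3) as usize;
            regs[rd + base + j] = s1[sel];
        }
    }
}
```

With index registers holding 3, 2, 1, 0 and data registers holding 10, 11, 12, 13, `mvx` reverses the data into the destination range, and `swizzle4` with `imm = 0b00_01_10_11` produces the same reversal for a single group of four.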
+> Crucial strategic op missing is MVX:
+> regs[rd]= regs[regs[rs1]]
+>
+we could modify the definition slightly:
+for i in 0..VL {
+    let offset = regs[rs1 + i];
+    // we could also limit on out-of-range
+    assert!(offset < VL); // trap on fail
+    regs[rd + i] = regs[rs2 + offset];
+}
+
+The dependency matrix would have the instruction depend on everything from
+rs2 to rs2 + VL and we let the execution unit figure it out. for
+simplicity, we could extend the dependencies to a power of 2 or something.
+
+We should add some constrained swizzle instructions for the more
+pipeline-friendly cases. One that will be important is:
+for i in 0..VL {
+    let i = i * 4; // each iteration handles one group of 4 registers
+    let mut s1 = [0; 4];
+    for j in 0..4 {
+        s1[j] = regs[rs1 + i + j];
+    }
+    for j in 0..4 {
+        // 2-bit selector j of imm picks the source lane
+        regs[rd + i + j] = s1[(imm >> (j * 2)) & 0x3];
+    }
+}
+Another is matrix transpose for (2-4)x(2-4) matrices which we can implement
+as similar to a strided ld/st except for registers.
+
+
+# TLBs / Virtual Memory
+
+----
+
+We were specifically looking for ways to not need large CAMs since they are
+power-hungry when designing the instruction scheduling logic, so it may be
+a good idea to have a smaller L1 TLB and a larger, slower, more
+power-efficient, L2 TLB. I would have the L1 be 4-32 entries and the L2 can
+be 32-128 as long as the L2 CAM isn't being activated every clock cycle. We
+can also share the L2 between the instruction and data caches.
+
+# Register File having same-cycle "forwarding"
+
+discussion about CDC 6600 Register File: it was capable of forwarding
+operands being written out to "reads", *in the same cycle*. this
+effectively turns the Reg File *into* a "Forwarding Bus".
+
+we aim to only have (4 banks of) 2R1W ported register files,
+with *additional* Forwarding Multiplexers (which look exactly
+like multi-port regfile gate logic).
+
+suggestion by Mitch is to have a "demon" on the front of the regfile,
+