X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=0642ce926d702469769bae2210773b1035398a95;hb=df271280bc596d26430655337c5998fd24a2f050;hp=d8cb615d29fefecc14049225a5b918bc1b04f9a5;hpb=dc67c035bdd97d3b4d6681fa48da4674a0d3c003;p=libreriscv.git

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index d8cb615d2..0642ce926 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,13 +1,6 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
-* TODO 23may2018: CSR-CAM-ify regfile tables
-* TODO 23may2018: zero-mark predication CSR
-* TODO 28may2018: sort out VSETVL: CSR length to be removed?
-* TODO 09jun2018: Chennai Presentation more up-to-date
-* TODO 09jun2019: elwidth only 4 values (dflt, dflt/2, 8, 16)
-* TODO 09jun2019: extra register banks (future option)
-* TODO 09jun2019: new Reg CSR table (incl. packed=Y/N)
-
+**Note: this document is out of date and records early ideas and discussions.**
 
 Key insight: Simple-V is intended as an abstraction layer to provide a
 consistent "API" to parallelisation of existing *and future* operations.
@@ -18,8 +11,13 @@ instruction queue (FIFO), pending execution.
 
 *Actual* parallelism, if added independently of Simple-V in the form
 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
+
+* Talk slides:
+* Specification: now moved to its own page: [[specification]]
 
 [[!toc ]]
 
@@ -135,7 +133,8 @@ reducing power consumption for the same.
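The "hardware macro-op fusion, but in reverse" idea above — one scalar opcode plus per-register vector-length state expanding into a queue of element operations pending execution — can be sketched behaviourally. This is a hedged illustration only: the `expand` function, the `vector_len` table and the register numbers are invented for the example, not part of the specification.

```python
def expand(opcode, rd, rs1, rs2, vector_len):
    """Expand one instruction into the scalar element operations
    that would be pushed into the instruction queue (FIFO).

    vector_len maps register number -> vector length; registers
    absent from the table are treated as scalars (length 1).
    """
    # the operation iterates over the longest vector involved
    vl = max(vector_len.get(r, 1) for r in (rd, rs1, rs2))
    queue = []
    for i in range(vl):
        # vectorised registers step through the register file;
        # scalar registers are re-used unchanged on every element
        step = lambda r, i=i: r + i if vector_len.get(r, 1) > 1 else r
        queue.append((opcode, step(rd), step(rs1), step(rs2)))
    return queue

# r7 and r1 marked as 3-long vectors, r4 left scalar:
print(expand("ADD", 7, 4, 1, {7: 3, 1: 3}))
```

The second call reproduces, under these assumptions, the vector/scalar expansion of "ADD r7 r4 r1" into ADD r7 r4 r1 / ADD r8 r4 r2 / ADD r9 r4 r3 that appears later in this document.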
SIMD again has a severe disadvantage here, over Vector: huge proliferation of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and have to then have operations *for each and between each*. It gets very -messy, very quickly. +messy, very quickly: *six* separate dimensions giving an O(N^6) instruction +proliferation profile. The V-Extension on the other hand proposes to set the bit-width of future instructions on a per-register basis, such that subsequent instructions @@ -363,588 +362,218 @@ absolute bare minimum level of compliance with the "API" (software-traps when vectorisation is detected), all the way up to full supercomputing level all-hardware parallelism. Options are covered in the Appendix. -# CSRs - -There are a number of CSRs needed, which are used at the instruction -decode phase to re-interpret RV opcodes (a practice that has -precedent in the setting of MISA to enable / disable extensions). - -* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Integer Register N is a Predication Register (note: a key-value store) -* Vector Length CSR (VSETVL, VGETVL) - -Also (see Appendix, "Context Switch Example") it may turn out to be important -to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that -Vectorised LOAD / STORE may be used to load and store multiple registers: -something that is missing from the Base RV ISA. - -Notes: - -* for the purposes of LOAD / STORE, Integer Registers which are - marked as a Vector will result in a Vector LOAD / STORE. -* Vector Lengths are *not* the same as vsetl but are an integral part - of vsetl. -* Actual vector length is *multipled* by how many blocks of length - "bitwidth" may fit into an XLEN-sized register file. 
-* Predication is a key-value store due to the implicit referencing, - as opposed to having the predicate register explicitly in the instruction. -* Whilst the predication CSR is a key-value store it *generates* easier-to-use - state information. -* TODO: assess whether the same technique could be applied to the other - Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4, - V2.3-Draft ISA Reference) it becomes possible to greatly reduce state - needed for context-switches (empty slots need never be stored). - -## Predication CSR - -The Predication CSR is a key-value store indicating whether, if a given -destination register (integer or floating-point) is referred to in an -instruction, it is to be predicated. The first entry is whether predication -is enabled. The second entry is whether the register index refers to a -floating-point or an integer register. The third entry is the index -of that register which is to be predicated (if referred to). The fourth entry -is the integer register that is treated as a bitfield, indexable by the -vector element index. - -| PrCSR | 7 | 6 | 5 | (4..0) | (4..0) | -| ----- | - | - | - | ------- | ------- | -| 0 | zero0 | inv0 | i/f | regidx | predidx | -| 1 | zero1 | inv1 | i/f | regidx | predidx | -| .. | zero.. | inv.. 
| i/f | regidx | predidx | -| 15 | zero15 | inv15 | i/f | regidx | predidx | - -The Predication CSR Table is a key-value store, so implementation-wise -it will be faster to turn the table around (maintain topologically -equivalent state): - - struct pred { - bool zero; - bool inv; - bool enabled; - int predidx; // redirection: actual int register to use - } - - struct pred fp_pred_reg[32]; - struct pred int_pred_reg[32]; - - for (i = 0; i < 16; i++) - tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg; - idx = CSRpred[i].regidx - tb[idx].zero = CSRpred[i].zero - tb[idx].inv = CSRpred[i].inv - tb[idx].predidx = CSRpred[i].predidx - tb[idx].enabled = true - -So when an operation is to be predicated, it is the internal state that -is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following -pseudo-code for operations is given, where p is the explicit (direct) -reference to the predication register to be used: - - for (int i=0; i +These are again identical in form to C.MV, except that they cover +floating-point to integer and integer to floating-point. When element +width in each vector is set to default, the instructions behave exactly +as they are defined for standard RV (scalar) operations, except vectorised +in exactly the same fashion as outlined in C.MV. -So the issue is as follows: +However when the source or destination element width is not set to default, +the opcode's explicit element widths are *over-ridden* to new definitions, +and the opcode's element width is taken as indicative of the SIMD width +(if applicable i.e. if packed SIMD is requested) instead. -* CSRs are used to set the "span" of a vector (how many of the standard - register file to contiguously use) -* VSETVL in RVV works as follows: it sets the vector length (copy of which - is placed in a dest register), and if the "required" length is longer - than the *available* length, the dest reg is set to the MIN of those - two. -* **HOWEVER**... 
in SV, *EVERY* vector register has its own separate - length and thus there is no way (at the time that VSETVL is called) to - know what to set the vector length *to*. -* At first glance it seems that it would be perfectly fine to just limit - the vector operation to the length specified in the destination - register's CSR, at the time that each instruction is issued... - except that that cannot possibly be guaranteed to match - with the value *already loaded into the target register from VSETVL*. +For example FCVT.S.L would normally be used to convert a 64-bit +integer in register rs1 to a 64-bit floating-point number in rd. +If however the source rs1 is set to be a vector, where elwidth is set to +default/2 and "packed SIMD" is enabled, then the first 32 bits of +rs1 are converted to a floating-point number to be stored in rd's +first element and the higher 32-bits *also* converted to floating-point +and stored in the second. The 32 bit size comes from the fact that +FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to +divide that by two it means that rs1 element width is to be taken as 32. -Therefore a different approach is needed. +Similar rules apply to the destination register. -Possible options include: +# Exceptions -* Removing the CSR "Vector Length" and always using the value from - VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and* - destreg equal to MIN(counterreg, lenimmed), with register-based - variant "VSETVL destreg, counterreg, lenreg" doing the same. -* Keeping the CSR "Vector Length" and having the lenreg version have - a "twist": "if lengreg is vectorised, read the length from the CSR" -* Other (TBD) +> What does an ADD of two different-sized vectors do in simple-V? -The first option (of the ones brainstormed so far) is a lot simpler. 
-It does however mean that the length set in VSETVL will apply across-the-board -to all src1, src2 and dest vectorised registers until it is otherwise changed -(by another VSETVL call). This is probably desirable behaviour. +* if the two source operands are not the same, throw an exception. +* if the destination operand is also a vector, and the source is longer + than the destination, throw an exception. -## Branch Instruction: +> And what about instructions like JALR?  +> What does jumping to a vector do? -Branch operations use standard RV opcodes that are reinterpreted to be -"predicate variants" in the instance where either of the two src registers -have their corresponding CSRvectorlen[src] entry as non-zero. When this -reinterpretation is enabled the predicate target register rs3 is to be -treated as a bitfield (up to a maximum of XLEN bits corresponding to a -maximum of XLEN elements). +* Throw an exception. Whether that actually results in spawning threads + as part of the trap-handling remains to be seen. -If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison -goes ahead as vector-scalar or scalar-vector. Implementors should note that -this could require considerable multi-porting of the register file in order -to parallelise properly, so may have to involve the use of register cacheing -and transparent copying (see Multiple-Banked Register File Architectures -paper). +# Under consideration -In instances where no vectorisation is detected on either src registers -the operation is treated as an absolutely standard scalar branch operation. +From the Chennai 2018 slides the following issues were raised. +Efforts to analyse and answer these questions are below. + +* Should future extra bank be included now? +* How many Register and Predication CSRs should there be? + (and how many in RV32E) +* How many in M-Mode (for doing context-switch)? +* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)? 
+* Can CLIP be done as a CSR (mode, like elwidth) +* SIMD saturation (etc.) also set as a mode? +* Include src1/src2 predication on Comparison Ops? + (same arrangement as C.MV, with same flexibility/power) +* 8/16-bit ops is it worthwhile adding a "start offset"? + (a bit like misaligned addressing... for registers) + or just use predication to skip start? -This is the overloaded table for Integer-base Branch operations. Opcode -(bits 6..0) is set in all cases to 1100011. +## Future (extra) bank be included (made mandatory) -[[!table data=""" -31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 | -imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | -7 | 5 | 5 | 3 | 4 | 1 | 7 | -reserved | src2 | src1 | BPR | predicate rs3 || BRANCH | -reserved | src2 | src1 | 000 | predicate rs3 || BEQ | -reserved | src2 | src1 | 001 | predicate rs3 || BNE | -reserved | src2 | src1 | 010 | predicate rs3 || rsvd | -reserved | src2 | src1 | 011 | predicate rs3 || rsvd | -reserved | src2 | src1 | 100 | predicate rs3 || BLE | -reserved | src2 | src1 | 101 | predicate rs3 || BGE | -reserved | src2 | src1 | 110 | predicate rs3 || BLTU | -reserved | src2 | src1 | 111 | predicate rs3 || BGEU | -"""]] +The implications of expanding the *standard* register file from +32 entries per bank to 64 per bank is quite an extensive architectural +change. Also it has implications for context-switching. -Note that just as with the standard (scalar, non-predicated) branch -operations, BLT, BGT, BLEU and BTGU may be synthesised by inverting -src1 and src2. - -Below is the overloaded table for Floating-point Predication operations. -Interestingly no change is needed to the instruction format because -FP Compare already stores a 1 or a zero in its "rd" integer register -target, i.e. it's not actually a Branch at all: it's a compare. -The target needs to simply change to be a predication bitfield (done -implicitly). 
- -As with -Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011. -Likewise Single-precision, fmt bits 26..25) is still set to 00. -Double-precision is still set to 01, whilst Quad-precision -appears not to have a definition in V2.3-Draft (but should be unaffected). - -It is however noted that an entry "FNE" (the opposite of FEQ) is missing, -and whilst in ordinary branch code this is fine because the standard -RVF compare can always be followed up with an integer BEQ or a BNE (or -a compressed comparison to zero or non-zero), in predication terms that -becomes more of an impact as an explicit (scalar) instruction is needed -to invert the predicate bitmask. An additional encoding funct3=011 is -therefore proposed to cater for this. +Therefore, on balance, it is not recommended and certainly should +not be made a *mandatory* requirement for the use of SV. SV's design +ethos is to be minimally-disruptive for implementors to shoe-horn +into an existing design. -[[!table data=""" -31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 | -funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode | -5 | 2 | 5 | 5 | 3 | 4 | 7 | -10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ | -10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | FNE | -10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT | -10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE | -"""]] +## How large should the Register and Predication CSR key-value stores be? -Note (**TBD**): floating-point exceptions will need to be extended -to cater for multiple exceptions (and statuses of the same). The -usual approach is to have an array of status codes and bit-fields, -and one exception, rather than throw separate exceptions for each -Vector element. +This is something that definitely needs actual evaluation and for +code to be run and the results analysed. At the time of writing +(12jul2018) that is too early to tell. An approximate best-guess +however would be 16 entries. 
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given -for predicated compare operations of function "cmp": +RV32E however is a special case, given that it is highly unlikely +(but not outside the realm of possibility) that it would be used +for performance reasons but instead for reducing instruction count. +The number of CSR entries therefore has to be considered extremely +carefully. - for (int i=0; i 1; - s2 = CSRvectorlen[src2] > 1; - for (int i=0; i +On balance it's a neat idea however it does seem to be one where the +benefits are not really clear. It would however obviate the need for +an exception to be raised if the VL runs out of registers to put +things in (gets to x31, tries a non-existent x32 and fails), however +the "fly in the ointment" is that x0 is hard-coded to "zero". The +increment therefore would need to be double-stepped to skip over x0. +Some microarchitectures could run into difficulties (SIMD-like ones +in particular) so it needs a lot more thought. -For full analysis of topological adaptation of RVV LOAD/STORE -see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X) -may be implicitly overloaded into the one base RV LOAD instruction, -and likewise for STORE. +## Can CLIP be done as a CSR (mode, like elwidth) -Revised LOAD: +RVV appears to be going this way. At the time of writing (12jun2018) +it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do +clip by way of exactly this method: setting a "clip mode" in a CSR. -[[!table data=""" -31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 | -imm[11:0] |||| rs1 | funct3 | rd | opcode | -1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 | -? | s | rs2 | imm[4:0] | base | width | dest | LOAD | -"""]] +No details are given however the most sensible thing to have would be +to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have +extra bits specifying the type of clipping to be carried out, on +a per-register basis. 
Other bits may be used for other purposes +(see SIMD saturation below) -The exact same corresponding adaptation is also carried out on the single, -double and quad precision floating-point LOAD-FP and STORE-FP operations, -which fit the exact same instruction format. Thus all three types -(unit, stride and indexed) may be fitted into FLW, FLD and FLQ, -as well as FSW, FSD and FSQ. +## SIMD saturation (etc.) also set as a mode? -Notes: +Similar to "CLIP" as an extension to the CSR key-value store, "saturate" +may also need extra details (what the saturation maximum is for example). -* LOAD remains functionally (topologically) identical to RVV LOAD - (for both integer and floating-point variants). -* Predication CSR-marking register is not explicitly shown in instruction, it's - implicit based on the CSR predicate state for the rd (destination) register -* rs2, the source, may *also be marked as a vector*, which implicitly - is taken to indicate "Indexed Load" (LD.X) -* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S) -* Bit 31 is reserved (ideas under consideration: auto-increment) -* **TODO**: include CSR SIMD bitwidth in the pseudo-code below. -* **TODO**: clarify where width maps to elsize +## Include src1/src2 predication on Comparison Ops? -Pseudo-code (excludes CSR SIMD bitwidth for simplicity): +In the C.MV (and other ops - see "C.MV Instruction"), the decision +was taken, unlike in ADD (etc.) which are 3-operand ops, to use +*both* the src *and* dest predication masks to give an extremely +powerful and flexible instruction that covers a huge number of +"traditional" vector opcodes. - if (unit-strided) stride = elsize; - else stride = areg[as2]; // constant-strided +The natural question therefore to ask is: where else could this +flexibility be deployed? What about comparison operations? 
- pred_enabled = int_pred_enabled - preg = int_pred_reg[rd] +Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst +predicated comparison operations are actually a *three* operand +instruction: + + regs[pred] |= 1<< (cmp(regs[src1], regs[src2]) ? 1 : 0) - for (int i=0; i What does an ADD of two different-sized vectors do in simple-V? +an offset of 1 would result in four operations as follows, instead: -* if the two source operands are not the same, throw an exception. -* if the destination operand is also a vector, and the source is longer - than the destination, throw an exception. + r3[0] = add r4[1], r6[0] + r3[1] = add r4[2], r6[1] + r3[2] = add r4[3], r6[2] + r3[3] = add r5[0], r6[3] -> And what about instructions like JALR?  -> What does jumping to a vector do? +In non-packed-SIMD mode there is no benefit at all, as a vector may +be created using a different CSR that has the offset built-in. So this +leaves just the packed-SIMD case to consider. -* Throw an exception. Whether that actually results in spawning threads - as part of the trap-handling remains to be seen. +Two ways in which this could be implemented / emulated (without special +hardware): + +* bit-manipulation that shuffles the data along by one byte (or one word) + either prior to or as part of the operation requiring the offset. +* just use an unaligned Load/Store sequence, even if there are performance + penalties for doing so. + +The question then is whether the performance hit is worth the extra hardware +involving byte-shuffling/shifting the data by an arbitrary offset. On +balance given that there are two reasonable instruction-based options, the +hardware-offset option should be left out for the initial version of SV, +with the option to consider it in an "advanced" version of the specification. 
# Impementing V on top of Simple-V @@ -1227,37 +856,35 @@ the question is asked "How can each of the proposals effectively implement ### Example Instruction translation: -Instructions "ADD r2 r4 r4" would result in three instructions being -generated and placed into the FIFO: +Instructions "ADD r7 r4 r4" would result in three instructions being +generated and placed into the FIFO. r7 and r4 are marked as "vectorised": -* ADD r2 r4 r4 -* ADD r2 r5 r5 -* ADD r2 r6 r6 +* ADD r7 r4 r4 +* ADD r8 r5 r5 +* ADD r9 r6 r6 + +Instructions "ADD r7 r4 r1" would result in three instructions being +generated and placed into the FIFO. r7 and r1 are marked as "vectorised" +whilst r4 is not: + +* ADD r7 r4 r1 +* ADD r8 r4 r2 +* ADD r9 r4 r3 ## Example of vector / vector, vector / scalar, scalar / scalar => vector add - register CSRvectorlen[XLEN][4]; # not quite decided yet about this one... - register CSRpredicate[XLEN][4]; # 2^4 is max vector length - register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well - register x[32][XLEN]; - - function op_add(rd, rs1, rs2, predr) - { -    /* note that this is ADD, not PADD */ -    int i, id, irs1, irs2; -    # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored -    # also destination makes no sense as a scalar but what the hell... -    for (i = 0, id=0, irs1=0, irs2=0; i @@ -1625,7 +1252,7 @@ To illustrate how this works, here is some example code from FreeRTOS ... 
STORE x30, 29 * REGBYTES(sp) STORE x31, 30 * REGBYTES(sp) - + /* Store current stackpointer in task control block (TCB) */ LOAD t0, pxCurrentTCB //pointer STORE sp, 0x0(t0) @@ -1686,11 +1313,11 @@ bank of registers is to be loaded/saved: .macroVectorSetup MVECTORCSRx1 = 31, defaultlen MVECTORCSRx4 = 28, defaultlen - + /* Save Context */ SETVL x0, x0, 31 /* x0 ignored silently */ - STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth - + STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth + /* Restore registers, Skip global pointer because that does not change */ LOAD x1, 0x0(sp) @@ -1896,7 +1523,7 @@ or may not require an additional pipeline stage) >> FIFO). > Those bits cannot be known until after the registers are decoded from the -> instruction and a lookup in the "vector length table" has completed. +> instruction and a lookup in the "vector length table" has completed. > Considering that one of the reasons RISC-V keeps registers in invariant > positions across all instructions is to simplify register decoding, I expect > that inserting an SRAM read would lengthen the critical path in most @@ -1939,6 +1566,45 @@ summary: don't restrict / remove. it's fine. +## Under review / discussion: remove CSR vector length, use VSETVL + +**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines +length on all regs**. This section kept for historical reasons. + +So the issue is as follows: + +* CSRs are used to set the "span" of a vector (how many of the standard + register file to contiguously use) +* VSETVL in RVV works as follows: it sets the vector length (copy of which + is placed in a dest register), and if the "required" length is longer + than the *available* length, the dest reg is set to the MIN of those + two. +* **HOWEVER**... in SV, *EVERY* vector register has its own separate + length and thus there is no way (at the time that VSETVL is called) to + know what to set the vector length *to*. 
+* At first glance it seems that it would be perfectly fine to just limit
+  the vector operation to the length specified in the destination
+  register's CSR, at the time that each instruction is issued...
+  except that that cannot possibly be guaranteed to match
+  with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+  VSETVL.  "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+  destreg equal to MIN(counterreg, lenimmed), with register-based
+  variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call).  This is probably desirable behaviour.
+
 ## Implementation Paradigms
 
 TODO: assess various implementation paradigms.  These are listed roughly
@@ -1993,6 +1659,61 @@ in advance to avoid.
 
 TBD: floating-point compare and other exception handling
 
+------
+
+Multi-LR/SC
+
+Please don't try to use the L1 itself.
+
+Use the Load and Store buffers, which capture instruction state prior
+to being accessed in the L1 (and prior to data arriving, in the case of
+the Store buffer).
+
+Also, use the L1 Miss buffers, as these already HAVE to be snooped by
+coherence traffic.  These are used to monitor that all participating
+cache lines remain interference-free, and amalgamate same into a CPU
+signal accessible via branch or predicate.
+
+The Load buffers manage inbound traffic.
+The Store buffers manage outbound traffic.
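The buffer-based reservation tracking described above — a set of participating cache lines, monitored for interference by snooping coherence traffic, amalgamated into a single success/fail signal — can be modelled behaviourally. A hedged sketch: the class, the method names and the 64-byte line size are invented for illustration and say nothing about the actual buffer micro-architecture.

```python
class ReservationSet:
    """Behavioural model of a multi-line LR/SC reservation."""

    LINE_BYTES = 64  # assumed cache-line size

    def __init__(self):
        self.lines = set()       # participating cache lines
        self.interfered = False  # the amalgamated CPU signal

    def lr(self, addr):
        # each LR adds another participating cache line
        self.lines.add(addr // self.LINE_BYTES)

    def snoop(self, addr):
        # called for every snooped coherence transaction: any hit on
        # a participating line marks the whole reservation as dirty
        if addr // self.LINE_BYTES in self.lines:
            self.interfered = True

    def sc(self):
        # SC succeeds only if a reservation exists and every
        # participating line remained interference-free
        ok = bool(self.lines) and not self.interfered
        self.lines.clear()
        self.interfered = False
        return ok
```

Note that the model is agnostic about *how many* lines participate, matching the observation that the participating lines may exceed the L1 associativity at some latency cost.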
+ +Done properly, the participating cache lines can exceed the associativity +of the L1 cache without architectural harm (may incur additional latency). + + + +> > > so, let's say instead of another LR *cancelling* the load +> > > reservation, the SMP core / hardware thread *blocks* for +> > > up to 63 further instructions, waiting for the reservation +> > > to clear. +> > +> > Can you explain what you mean by this paragraph? +> +> best put in sequential events, probably. +> +> LR <-- 64-instruction countdown starts here +> ... 63 +> ... 62 +> LR same address <--- notes that core1 is on 61, +> so pauses for **UP TO** 61 cycles +> ... 32 +> SC <- core1 didn't reach zero, therefore valid, therefore +> core2 is now **UNBLOCKED**, is granted the +> load-reservation (and begins its **own** 64-cycle +> LR instruction countdown) +> ... 63 +> ... 62 +> ... +> ... +> SC <- also valid + +Looks to me that you could effect the same functionality by simply +holding onto the cache line in core 1 preventing core 2 from + getting past the LR. + +On the other hand, the freeze is similar to how the MP CRAYs did +ATOMIC stuff. + # References * SIMD considered harmful @@ -2024,9 +1745,11 @@ TBD: floating-point compare and other exception handling * Dot Product Vector * RVV slides 2017 -* Wavefront skipping using BRAMS +* Wavefront skipping using BRAMS * Streaming Pipelines * Barcelona SIMD Presentation * * Full Description (last page) of RVV instructions +* PULP Low-energy Cluster Vector Processor +