X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=723cc07259960db0b00ed61b10873a310b9e3ab8;hb=73743225d7873f2803c5481be1772211205c5be7;hp=1ae1dfce3ef681119a8a39beb041d3cbbfa269aa;hpb=911c91e9a76ad3388c93e8ed0ee99299bb7c09d6;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 1ae1dfce3..723cc0725 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -9,8 +9,8 @@ instruction queue (FIFO), pending execution.
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
[[!toc ]]
@@ -126,7 +126,8 @@ reducing power consumption for the same.
SIMD again has a severe disadvantage here, over Vector: huge proliferation
of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
have to then have operations *for each and between each*. It gets very
-messy, very quickly.
+messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
+proliferation profile.
The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
@@ -334,26 +335,282 @@ basis* whether and how much "Virtual Parallelism" to deploy.
It is absolutely critical to note that it is proposed that such choices MUST
be **entirely transparent** to the end-user and the compiler. Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
+a Vector (variable-width SIMD) may not precisely match the width of the
parallelism within the implementation, the end-user **should not care**
and in this way the performance benefits are gained but the ISA remains
straightforward. All that happens at the end of an instruction run is: some
parallel units (if there are any) would remain offline, completely
transparently to the ISA, the program, and the compiler.
-The "SIMD considered harmful" trap of having huge complexity and extra
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
instructions to deal with corner-cases is thus avoided, and implementors
get to choose precisely where to focus and target the benefits of their
implementation efforts, without "extra baggage".
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism. Options are covered in the Appendix.
+
+# CSRs
+
+There are two CSR tables needed to create lookup tables which are used at
+the register decode phase.
+
+* Integer Register N is Vector
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+ marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+ of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+  "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+ as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+ state information.
+* TODO: assess whether the same technique could be applied to the other
+ Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+ V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+ needed for context-switches (empty slots need never be stored).
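+
+The note above (vector length scaled by how many bitwidth-sized blocks
+fit into XLEN) may be illustrated with a short non-normative Python
+sketch; the function name and defaults are purely illustrative:
+
```python
def effective_vl(vector_len, elwidth_bits, xlen=64):
    """Sketch: scale a CSR vector length by the number of elements of
    elwidth_bits that fit into one XLEN-wide register.  elwidth_bits=0
    is taken to mean "default" (one element per register)."""
    if elwidth_bits == 0:
        return vector_len
    blocks_per_reg = xlen // elwidth_bits
    return vector_len * blocks_per_reg

# A 4-register vector of 16-bit elements on RV64 covers 16 elements.
print(effective_vl(4, 16, xlen=64))
```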
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. However it is important to note
+that the *actual* register is *different* from the one that ends up
+being used, due to the level of indirection through the lookup table.
+This includes redirecting to a *second* bank of integer registers
+(a future option).
+
+* regidx is the actual register which, in combination with the
+  i/f flag, causes the lookup table to be referenced whenever that
+  integer or floating-point register is referred to, in order to find
+  the predication mask to use on the operation in which that (regidx)
+  register has been used
+* predidx (in combination with the bank bit in the future) is the
+ *actual* register to be used for the predication mask. Note:
+ in effect predidx is actually a 6-bit register address, as the bank
+ bit is the MSB (and is nominally set to zero for now).
+* inv indicates that the predication mask bits are to be inverted
+ prior to use *without* actually modifying the contents of the
+ register itself.
+* zeroing is either 1 or 0, and if set to 1, the operation must
+  place zeros in any element position where the predication mask is
+  set to zero. If zeroing is set to 0, unpredicated elements *must*
+  be left alone. Some microarchitectures may choose to interpret
+ this as skipping the operation entirely. Others which wish to
+ stick more closely to a SIMD architecture may choose instead to
+ interpret unpredicated elements as an internal "copy element"
+ operation (which would be necessary in SIMD microarchitectures
+ that perform register-renaming)
+
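+The inv and zeroing rules above may be illustrated with a short
+non-normative Python sketch (the helper name and argument layout are
+purely illustrative, not spec pseudocode):
+
```python
def apply_predicated_op(op, src, dest, mask, inv=False, zeroing=False):
    """Sketch: bit i of mask predicates element i.  inv inverts the
    mask bits without modifying the register; zeroing=1 writes zeros
    into masked-out element positions, zeroing=0 leaves them alone."""
    out = list(dest)
    for i in range(len(dest)):
        bit = (mask >> i) & 1
        if inv:
            bit ^= 1
        if bit:
            out[i] = op(src[i])
        elif zeroing:
            out[i] = 0   # zeroing=1: masked-out elements become zero
        # zeroing=0: masked-out elements keep their old contents
    return out

r = apply_predicated_op(lambda x: x + 1, [10, 20, 30, 40],
                        [0, 1, 2, 3], mask=0b0101, zeroing=True)
print(r)  # masked-out slots (bits 1 and 3) are zeroed
```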
+| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
+| ----- | - | - | - | - | ------- | ------- |
+| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
+| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
+| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
+| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+ struct pred {
+ bool zero;
+ bool inv;
+ bool bank; // 0 for now, 1=rsvd
+ bool enabled;
+ int predidx; // redirection: actual int register to use
+ }
+
+ struct pred fp_pred_reg[32]; // 64 in future (bank=1)
+ struct pred int_pred_reg[32]; // 64 in future (bank=1)
+
+ for (i = 0; i < 16; i++)
+ tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+ idx = CSRpred[i].regidx
+ tb[idx].zero = CSRpred[i].zero
+ tb[idx].inv = CSRpred[i].inv
+ tb[idx].bank = CSRpred[i].bank
+ tb[idx].predidx = CSRpred[i].predidx
+ tb[idx].enabled = true
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         (d ? vreg[rd][i] : sreg[rd]) =
+          iop(s1 ? vreg[rs1][i] : sreg[rs1],
+              s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-    s1 = CSRvectorlen[src1] > 1;
-    s2 = CSRvectorlen[src2] > 1;
-    for (int i=0; i<CSRvectorlen; i++)
For full analysis of topological adaptation of RVV LOAD/STORE
see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
@@ -541,15 +869,14 @@ Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided
- pred_enabled = int_pred_enabled
preg = int_pred_reg[rd]
-    for (int i=0; i<vl; i++)
+## Vectorised Copy/Move (and conversion) instructions
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
+There is a series of 2-operand instructions involving copying (and
+alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
+follow the same pattern, as it is *both* the source *and* destination
+predication masks that are taken into account. This is different from
+the three-operand arithmetic instructions, where the predication mask
+is taken from the *destination* register, and applied uniformly to the
+elements of the source register(s), element-for-element.
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
+### C.MV Instruction
-Notes:
+There is no MV instruction in RV however there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
-* for the purposes of LOAD / STORE, Integer Registers which are
- marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
- of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
- "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
- as opposed to having the predicate register explicitly in the instruction.
+If either the source or the destination register are marked as vectors
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation. The actual instruction's format does not change:
-## Predication CSR
+[[!table data="""
+15 12 | 11 7 | 6 2 | 1 0 |
+funct4 | rd | rs | op |
+4 | 5 | 5 | 2 |
+C.MV | dest | src | C0 |
+"""]]
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated. The first entry is whether predication
-is enabled. The second entry is whether the register index refers to a
-floating-point or an integer register. The third entry is the index
-of that register which is to be predicated (if referred to). The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6 | 5 | (4..0) | (4..0) |
-| ----- | - | - | ------- | ------- |
-| r0 | pren0 | i/f | regidx | predidx |
-| r1 | pren1 | i/f | regidx | predidx |
-| .. | pren.. | i/f | regidx | predidx |
-| r15 | pren15 | i/f | regidx | predidx |
+A simplified version of the pseudocode for this operation is as follows:
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
+    function op_mv(rd, rs) # MV not VMV!
+      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
+      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_vec[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_vec[rd].isvec) while (!(pd & 1<<j)) j++;
+        xSTATE.regs[rd+j] <= xSTATE.regs[rs+i];
+        if (int_vec[rs].isvec) i++;
+        if (int_vec[rd].isvec) j++;
+
+> What does an ADD of two different-sized vectors do in simple-V?
- CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+* if the two source operands are not the same, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+ than the destination, throw an exception.
-This is in contrast to RVV:
+> And what about instructions like JALR?
+> What does jumping to a vector do?
- CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
+* Throw an exception. Whether that actually results in spawning threads
+ as part of the trap-handling remains to be seen.
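+
+The twin-predication behaviour of C.MV described above (source *and*
+destination masks respected independently) may be illustrated with a
+behavioural Python sketch; this is non-normative and the names are
+illustrative, not the specification's pseudocode:
+
```python
def twin_pred_mv(regs, rs, rd, ps, pd, vl):
    """Sketch of a twin-predicated move: masked-out source elements
    and masked-out destination slots are each skipped independently."""
    i = j = 0
    while i < vl and j < vl:
        while i < vl and not ((ps >> i) & 1):  # skip masked-out sources
            i += 1
        while j < vl and not ((pd >> j) & 1):  # skip masked-out dests
            j += 1
        if i < vl and j < vl:
            regs[rd + j] = regs[rs + i]
            i += 1
            j += 1
    return regs

regs = list(range(16))               # r0..r3 act as the source vector
twin_pred_mv(regs, rs=0, rd=8, ps=0b1011, pd=0b0111, vl=4)
print(regs[8:12])                    # r10 receives r3 (r2 was masked out)
```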
-## Element (SIMD) bitwidth CSRs
+# Under consideration
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+ (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+ (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops: is it worthwhile adding a "start offset"?
+  (a bit like misaligned addressing... for registers)
+  or just use predication to skip start?
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0 | vew0 |
-| r1 | vew1 |
-| .. | vew.. |
-| r31 | vew31 |
+## Should the future (extra) bank be included (made mandatory)?
-vew may be one of the following (giving a table "bytestable", used below):
+Expanding the *standard* register file from 32 entries per bank to
+64 per bank is quite an extensive architectural change. It also has
+implications for context-switching.
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV. SV's design
+ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
+## How large should the Register and Predication CSR key-value stores be?
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
+This is something that definitely needs actual evaluation and for
+code to be run and the results analysed. At the time of writing
+(12jul2018) that is too early to tell. An approximate best-guess
+however would be 16 entries.
- vew = CSRbitwidth[rs1]
- if (vew == 0)
- bytesperreg = (XLEN/8) # or FLEN as appropriate
- else:
- bytesperreg = bytestable[vew] # 1 2 4 8 16
- simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
- vlen = CSRvectorlen[rs1] * simdmult
- CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+RV32E however is a special case: it is highly unlikely (but not
+outside the realm of possibility) that it would be used for
+performance reasons, being aimed instead at reducing instruction
+count. The number of CSR entries therefore has to be considered
+extremely carefully.
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point. However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
-# Exceptions
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if a need for vectors is ever demonstrated then there would need
+to be *at least* one more free entry. However just one does not
+make much sense (as one entry only covers scalar-vector ops) so it
+is more likely that at least two extra would be needed.
-> What does an ADD of two different-sized vectors do in simple-V?
+This applies *in addition* in the RV32E case, if an RV32E
+implementation happens also to support U/S/M modes. This would be
+considered quite rare, but is not outside the realm of possibility.
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
- than the destination, throw an exception.
+Conclusion: all needs careful analysis and future work.
-> And what about instructions like JALR?Â
-> What does jumping to a vector do?
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-* Throw an exception. Whether that actually results in spawning threads
- as part of the trap-handling remains to be seen.
+On balance it's a neat idea, however the benefits are not really
+clear. It would obviate the need for an exception to be raised when
+VL runs out of registers (reaches x31, tries a non-existent x32 and
+fails); the "fly in the ointment" however is that x0 is hard-coded
+to "zero". The increment therefore would need to be double-stepped
+to skip over x0. Some microarchitectures could run into difficulties
+(SIMD-like ones in particular) so it needs a lot more thought.
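+
+The double-stepped increment may be sketched as follows (illustrative
+only, given that the idea itself is not being recommended):
+
```python
def next_reg(idx):
    """Sketch of register "wrap": advance through x1..x31 and wrap
    back to x1, double-stepping over x0 (hard-wired to zero)."""
    return 1 if idx == 31 else idx + 1

print([next_reg(r) for r in (29, 30, 31)])  # x31 wraps to x1, not x0
```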
+
+## Can CLIP be done as a CSR (mode, like elwidth)
+
+RVV appears to be going this way. At the time of writing (12jun2018)
+it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
+
+No details are given however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis. Other bits may be used for other purposes
+(see SIMD saturation below).
+
+## SIMD saturation (etc.) also set as a mode?
+
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is for example).
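+
+For illustration, what per-element saturation would mean behaviourally
+(a non-normative sketch; the mode encoding is undecided):
+
```python
def sat_add(a, b, bits=8, signed=False):
    """Sketch of a saturating add: clamp the result to the element's
    representable range instead of wrapping modulo 2^bits."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, a + b))

print(sat_add(200, 100))               # unsigned 8-bit clamps at 255
print(sat_add(100, 100, signed=True))  # signed 8-bit clamps at 127
```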
+
+## Include src1/src2 predication on Comparison Ops?
+
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
+
+The natural question therefore to ask is: where else could this
+flexibility be deployed? What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i # element i
+
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
+
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
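+
+As an illustration of the three-operand form above, a behavioural
+Python sketch (non-normative; names illustrative):
+
```python
def pred_cmp(src1, src2, cmp, vl):
    """Sketch: a vectorised comparison writes one result bit per
    element into a predicate register (the *destination* here *is*
    a predication register)."""
    pred = 0
    for i in range(vl):
        if cmp(src1[i], src2[i]):
            pred |= 1 << i
    return pred

p = pred_cmp([1, 5, 3, 7], [4, 4, 4, 4], lambda a, b: a < b, 4)
print(bin(p))  # elements 0 and 2 compare true
```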
+
+## 8/16-bit ops: is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example, rather than take two 32-bit registers each divided into
+four 8-bit elements and have them ADDed element-for-element as follows:
+
+ r3[0] = add r4[0], r6[0]
+ r3[1] = add r4[1], r6[1]
+ r3[2] = add r4[2], r6[2]
+ r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+ r3[0] = add r4[1], r6[0]
+ r3[1] = add r4[2], r6[1]
+ r3[2] = add r4[3], r6[2]
+ r3[3] = add r5[0], r6[3]
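+
+Treating the packed registers as one flat element list, the offset
+mapping above may be sketched as (non-normative, illustrative names):
+
```python
def packed_add_offset(ra, rb, offset, elems_per_reg=4):
    """Sketch: operand A is read starting `offset` elements in,
    spilling into the next register (ra covers r4 then r5)."""
    return [ra[i + offset] + rb[i] for i in range(elems_per_reg)]

ra = [1, 2, 3, 4, 5]        # r4[0..3] followed by r5[0]
rb = [10, 20, 30, 40]       # r6[0..3]
print(packed_add_offset(ra, rb, offset=1))
```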
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+ either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+ penalties for doing so.
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
# Implementing V on top of Simple-V
@@ -839,7 +1253,7 @@ levels: Base and reserved future functionality.
up to 16 (TBD) of either the floating-point or integer registers to
be marked as "predicated" (key), and if so, which integer register to
use as the predication mask (value).
-
+
**TODO**
# Implementing P (renamed to DSP) on top of Simple-V
@@ -864,6 +1278,11 @@ This section compares the various parallelism proposals as they stand,
including traditional SIMD, in terms of features, ease of implementation,
complexity, flexibility, and die area.
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
### [[alt_rvp]]
Primary benefit of Alt-RVP is the simplicity with which parallelism
@@ -894,9 +1313,9 @@ from actual (internal) parallel hardware. It's an API in effect that's
designed to be slotted in to an existing implementation (just after
instruction decode) with minimum disruption and effort.
-* minus: the complexity of having to use register renames, OoO, VLIW,
- register file cacheing, all of which has been done before but is a
- pain
+* minus: the complexity (if full parallelism is to be exploited)
+ of having to use register renames, OoO, VLIW, register file cacheing,
+ all of which has been done before but is a pain
* plus: transparent re-use of existing opcodes as-is just indirectly
saying "this register's now a vector" which
* plus: means that future instructions also get to be inherently
@@ -1005,7 +1424,7 @@ the question is asked "How can each of the proposals effectively implement
a SIMD architecture where the ALU becomes responsible for the parallelism,
Alt-RVP ALUs would likewise be so responsible... with *additional*
(lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
at least one dimension are avoided (architectural upgrades introducing
128-bit then 256-bit then 512-bit variants of the exact same 64-bit
SIMD block)
@@ -1083,37 +1502,35 @@ the question is asked "How can each of the proposals effectively implement
### Example Instruction translation:
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+Instructions "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO. r7 and r4 are marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+Instructions "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO. r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
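+
+The FIFO expansion above may be sketched behaviourally (illustrative,
+non-normative):
+
```python
def expand_add(rd, rs1, rs2, vectorised, vl=3):
    """Sketch: registers marked as vectorised advance by one per
    element; scalar registers repeat unchanged."""
    def step(r, i):
        return r + i if r in vectorised else r
    return [("ADD", step(rd, i), step(rs1, i), step(rs2, i))
            for i in range(vl)]

# r7 and r1 vectorised, r4 scalar -- matches the second example above.
for op in expand_add(7, 4, 1, vectorised={7, 1}):
    print(op)
```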
## Example of vector / vector, vector / scalar, scalar / scalar => vector add
- register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
- register CSRpredicate[XLEN][4]; # 2^4 is max vector length
- register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
- register x[32][XLEN];
-
- function op_add(rd, rs1, rs2, predr)
- {
- Â Â /* note that this is ADD, not PADD */
- Â Â int i, id, irs1, irs2;
- Â Â # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
- Â Â # also destination makes no sense as a scalar but what the hell...
- Â Â for (i = 0, id=0, irs1=0, irs2=0; i
@@ -1141,10 +1558,10 @@ There is, in the standard Conditional Branch instruction, more than
adequate space to interpret it in a similar fashion:
[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
+imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]
This would become:
@@ -1164,19 +1581,19 @@ not only to add in a second source register, but also use some of the bits as
a predication target as well.
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | op |
+3 | 3 | 3 | 5 | 2 |
+C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
"""]]
Now uses the CS format:
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
"""]]
Bit 6 would be decoded as "operation refers to Integer or Float" including
@@ -1279,16 +1696,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
vew may be one of the following (giving a table "bytestable", used below):
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
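+
+The bytestable above, combined with the SIMD-multiplier rule from the
+(pre-existing) vsetl pseudocode, may be exercised with a short
+non-normative sketch:
+
```python
def vlen_from_vew(csr_vectorlen, vew, xlen=64):
    """Sketch: vew selects the element width; the vector length is
    multiplied by how many such elements fit in one register."""
    bytestable = {0b000: xlen // 8, 0b001: 1, 0b010: 2,
                  0b011: 4, 0b100: 8, 0b101: 16}
    simdmult = (xlen // 8) // bytestable[vew]
    return csr_vectorlen * simdmult

# 4 registers of 16-bit elements on RV64: 4 elements each, 16 total.
print(vlen_from_vew(4, 0b010, xlen=64))
```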
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
@@ -1406,7 +1823,7 @@ So the question boils down to:
Whilst the above may seem to be severe minuses, there are some strong
pluses:
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
@@ -1456,6 +1873,112 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.
+## Context Switch Example
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+ /* Macro for saving task context */
+ .macro portSAVE_CONTEXT
+ .global pxCurrentTCB
+ /* make room in stack */
+ addi sp, sp, -REGBYTES * 32
+
+ /* Save Context */
+ STORE x1, 0x0(sp)
+ STORE x2, 1 * REGBYTES(sp)
+ STORE x3, 2 * REGBYTES(sp)
+ ...
+ ...
+ STORE x30, 29 * REGBYTES(sp)
+ STORE x31, 30 * REGBYTES(sp)
+
+ /* Store current stackpointer in task control block (TCB) */
+ LOAD t0, pxCurrentTCB //pointer
+ STORE sp, 0x0(t0)
+ .endm
+
+ /* Saves current error program counter (EPC) as task program counter */
+ .macro portSAVE_EPC
+ csrr t0, mepc
+ STORE t0, 31 * REGBYTES(sp)
+ .endm
+
+ /* Saves current return address (RA) as task program counter */
+ .macro portSAVE_RA
+ STORE ra, 31 * REGBYTES(sp)
+ .endm
+
+ /* Macro for restoring task context */
+ .macro portRESTORE_CONTEXT
+
+ .global pxCurrentTCB
+ /* Load stack pointer from the current TCB */
+ LOAD sp, pxCurrentTCB
+ LOAD sp, 0x0(sp)
+
+ /* Load task program counter */
+ LOAD t0, 31 * REGBYTES(sp)
+ csrw mepc, t0
+
+ /* Run in machine mode */
+ li t0, MSTATUS_PRV1
+ csrs mstatus, t0
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ LOAD x4, 3 * REGBYTES(sp)
+ LOAD x5, 4 * REGBYTES(sp)
+ ...
+ ...
+ LOAD x30, 29 * REGBYTES(sp)
+ LOAD x31, 30 * REGBYTES(sp)
+
+ addi sp, sp, REGBYTES * 32
+ mret
+ .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+ /* a few things are assumed here: (a) that when switching to
+ M-Mode an entirely different set of CSRs is used from that
+ which is used in U-Mode and (b) that the M-Mode x1 and x4
+ vectors are also not used anywhere else in M-Mode, consequently
+ only need to be set up just the once.
+ */
+ .macro VectorSetup
+ MVECTORCSRx1 = 31, defaultlen
+ MVECTORCSRx4 = 28, defaultlen
+
+ /* Save Context */
+ SETVL x0, x0, 31 /* x0 ignored silently */
+ STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ SETVL x0, x0, 28 /* x0 ignored silently */
+ LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear
+not to be restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save /
+Restore Context assembly code, into the VectorSetup macro, as long as
+vectors are never used anywhere else (i.e. VL is never altered by
+M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
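+
+The expansion performed by that single instruction may be sketched
+behaviourally (non-normative, illustrative names):
+
```python
def expand_vector_store(base_reg, vl, regbytes=8):
    """Sketch: one vectorised STORE with VL=vl expands into vl
    contiguous register stores at consecutive stack offsets."""
    return [("STORE", "x%d" % (base_reg + i), i * regbytes)
            for i in range(vl)]

ops = expand_vector_store(1, 31)
print(len(ops), ops[0], ops[-1])  # 31 stores, x1..x31
```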
+
## Virtual Memory page-faults on LOAD/STORE
@@ -1584,9 +2107,9 @@ would still be there (and stalled). hmmm.
>
> Thrown away.
-discussion then led to the question of OoO architectures
+discussion then led to the question of OoO architectures
-> The costs of the imprecise-exception model are greater than the benefit.
+> The costs of the imprecise-exception model are greater than the benefit.
> Software doesn't want to cope with it. It's hard to debug. You can't
> migrate state between different microarchitectures--unless you force all
> implementations to support the same imprecise-exception model, which would
@@ -1594,8 +2117,141 @@ discussion then led to the question of OoO architectures
> relevant, is that the imprecise model increases the size of the context
> structure, as the microarchitectural guts have to be spilled to memory.)
+## Zero/Non-zero Predication
+
+>> > it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed. Â if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> > if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> > whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density-time masking. It doesn't apply only to your
+> model where the "standard register file" is used: it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a "zero bit" per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
+
+
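+The difference between mandatory zeroing and "leave the data alone" can be
+sketched directly (Python, illustrative names):

```python
# Predicated vector add with zeroing vs non-zeroing semantics.
# Non-zeroing skips masked-out elements entirely, so their old
# destination values survive (enabling the overlap optimisation).

def pred_add(dest, src1, src2, mask, zeroing):
    for i, m in enumerate(mask):
        if m:
            dest[i] = src1[i] + src2[i]
        elif zeroing:
            dest[i] = 0  # mandatory zeroing destroys the old data
        # else: element skipped entirely, old value survives
    return dest

old = [9, 9, 9, 9]
nonzeroed = pred_add(list(old), [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], False)
zeroed = pred_add(list(old), [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], True)
```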
+## Implementation detail for scalar-only op detection
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation. Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers an implementor's freedom to minimise
+disruption to an existing design by detecting "scalar-only operations" and
+bypassing the vectorisation phase (which may or may not require an
+additional pipeline stage).
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
+
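+As a sketch of the lookup itself (Python; `vbits` models the 1-bit-per-register
+"is vectorised" SRAM described above, names illustrative):

```python
# Fast-path detection: an op is fully scalar iff none of src1, src2,
# dest have their "vectorised" bit set in the per-register table.
# Equivalent in hardware to ANDing the three inverted bits together.

def is_fully_scalar(vbits, src1, src2, dest):
    return ((vbits >> src1) | (vbits >> src2) | (vbits >> dest)) & 1 == 0

vbits = 0b10000  # only x4 marked as vectorised
```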
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
+
+## element width conversion: restrict or remove?
+
+Summary: don't restrict or remove it. It's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay. so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
+
+
+
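+The rule (upconvert inputs, operate at the largest width, downconvert only
+at the output) can be sketched for unsigned integers as follows; the names
+and widths are illustrative:

```python
# Unsigned add under the "compute at the largest size" rule.
# Widths are in bits; zero-extension is the (trivial) upconversion.

def widened_add(a, a_width, b, b_width, dest_width):
    w = max(a_width, b_width)                # operate at the largest size
    result = (a + b) & ((1 << w) - 1)        # wrap at the widest width
    return result & ((1 << dest_width) - 1)  # downconvert at output only
```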
+## Under review / discussion: remove CSR vector length, use VSETVL
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**. This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+ register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+ is placed in a dest register), and if the "required" length is longer
+ than the *available* length, the dest reg is set to the MIN of those
+ two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+ length and thus there is no way (at the time that VSETVL is called) to
+ know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+ the vector operation to the length specified in the destination
+ register's CSR, at the time that each instruction is issued...
+ except that that cannot possibly be guaranteed to match
+ with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+ VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+ destreg equal to MIN(counterreg, lenimmed), with register-based
+ variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
-## Implementation Paradigms
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call). This is probably desirable behaviour.
+
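+A sketch of the chosen semantics (Python, illustrative names):

```python
# VSETVL per the first option: one global VL, set to the MIN of the
# loop counter and the requested length, copied into destreg (writes
# to x0 silently discarded), applying to all vectorised regs thereafter.

def vsetvl(state, destreg, counter, length):
    state["VL"] = min(counter, length)
    if destreg != 0:  # x0 ignored silently
        state["x"][destreg] = state["VL"]
    return state["VL"]

state = {"VL": 0, "x": [0] * 32}
```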
+## Implementation Paradigms
TODO: assess various implementation paradigms. These are listed roughly
in order of simplicity (minimum compliance, for ultra-light-weight
@@ -1617,6 +2273,20 @@ Also to be taken into consideration:
* Comprehensive vectorisation: FIFOs and internal parallelism
* Hybrid Parallelism
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
# TODO Research
> For great floating point DSPs check TI's C3x, C4X, and C6xx DSPs
@@ -1633,6 +2303,8 @@ ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
Destruction of destination indices requires a copy of the entire vector
in advance to avoid.
+TBD: floating-point compare and other exception handling
+
# References
* SIMD considered harmful
@@ -1663,3 +2335,12 @@ in advance to avoid.
restarted if an exception occurs (VM page-table miss)
* Dot Product Vector
+* RVV slides 2017
+* Wavefront skipping using BRAMS
+* Streaming Pipelines
+* Barcelona SIMD Presentation
+*
+* Full Description (last page) of RVV instructions
+
+* PULP Low-energy Cluster Vector Processor
+