From cd4a24bbcfaada2f3ea748e4f246cd9f7fc76a7d Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Tue, 17 Apr 2018 02:17:09 +0100
Subject: [PATCH] rewrite CSR section

---
 simple_v_extension.mdwn | 319 +++++++++++++++++++++++++++-------------
 1 file changed, 217 insertions(+), 102 deletions(-)

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 0a101c835..d0cc8ba30 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -82,108 +82,6 @@ of not being widely adopted. I'm inclined towards recommending:
 **TODO**: propose "mask" (predication) registers likewise.
 combination with standard RV instructions and overflow registers extremely powerful
 
-## CSRs marking registers as Vector
-
-A 32-bit CSR would be needed (1 bit per integer register) to indicate
-whether a register was, if referred to, implicitly to be treated as
-a vector.
-
-A second 32-bit CSR would be needed (1 bit per floating-point register)
-to indicate whether a floating-point register was to be treated as a
-vector.
-
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
-
-## CSR vector-length and CSR SIMD packed-bitwidth
-
-**TODO** analyse each of these:
-
-* splitting out the loop-aspects, vector aspects and data-width aspects
-* integer reg 0 *and* fp reg0 share CSR vlen 0 *and* CSR packed-bitwidth 0
-* integer reg 1 *and* fp reg1 share CSR vlen 1 *and* CSR packed-bitwidth 1
-* ....
-* ....
-
-instead:
-
-* CSR vlen 0 *and* CSR packed-bitwidth 0 register contain extra bits
-  specifying an *INDEX* of WHICH int/fp register they refer to
-* CSR vlen 1 *and* CSR packed-bitwidth 1 register contain extra bits
-  specifying an *INDEX* of WHICH int/fp register they refer to
-* ...
-* ...
-
-Have to be very *very* careful about not implementing too few of those
-(or too many).  Assess implementation impact on decode latency.  Is it
-worth it?
-
-Implementation of the latter:
-
-Operation involving (referring to) register M:
-
-    bitwidth = default # default for opcode?
-    vectorlen = 1 # scalar
-
-    for (o = 0, o < 2, o++)
-        if (CSR-Vector_registernum[o] == M)
-            bitwidth = CSR-Vector_bitwidth[o]
-            vectorlen = CSR-Vector_len[o]
-            break
-
-and for the former it would simply be:
-
-    bitwidth = CSR-Vector_bitwidth[M]
-    vectorlen = CSR-Vector_len[M]
-
-Alternatives:
-
-* One single "global" vector-length CSR
-
-## Stride
-
-**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular
-register as being "if you use this reg in LOAD/STORE, use the offset
-amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous".
-can be used for matrix spanning.
-
-> For LOAD/STORE, could a better option be to interpret the offset in the
-> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is
-> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12),
-> t5 = *(t2+24), t6 = *(t2+32)?  Perhaps include a bit in the
-> vector-control CSRs to select between offset-as-stride and unit-stride
-> memory accesses?
-
-So there would be an instruction like this:
-
-| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM |
-| opcode | 5 bit | 1 bit | 1 bit | 5 bit, OFFn=XLEN |
-
-
-which would mean:
-
-* CSR-Offset register n <= (float|int) register number N
-* CSR-Offset Stride-mode = offset or unit
-* CSR-Offset amount register n = contents of register M
-
-LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
-
-    offs = 0
-    stride = 1
-    vector-len = CSR-Vector-length register N
-
-    for (o = 0, o < 2, o++)
-        if (CSR-Offset register o == M)
-            offs = CSR-Offset amount register o
-            if CSR-Offset Stride-mode == offset:
-                stride = ldoffs
-            break
-
-    for (i = 0, i < vector-len; i++)
-        r[N+i] = mem[(offs*i + r[M+i])*stride]
-
 # Analysis and discussion of Vector vs SIMD
 
 There are four combined areas between the two proposals that help with
@@ -487,6 +385,222 @@ instructions to deal with corner-cases is thus avoided, and implementors
 get to choose precisely where to focus and target the benefits of their
 implementation efforts, without "extra baggage".
 
+# CSRs
+
+There are a number of CSRs needed, which are used at the instruction
+decode phase to re-interpret standard RV opcodes (a practice that has precedent
+in the setting of MISA to enable / disable extensions).
+
+* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (key-value store)
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+  marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+  of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+  "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+  as opposed to having the predicate register explicitly in the instruction.
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. The first entry is whether predication
+is enabled. The second entry is whether the register index refers to a
+floating-point or an integer register. The third entry is the index
+of that register which is to be predicated (if referred to). The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6 | 5 | (4..0) | (4..0) |
+| ----- | - | - | ------- | ------- |
+| r0 | pren0 | i/f | regidx | predidx |
+| r1 | pren1 | i/f | regidx | predidx |
+| .. | pren.. | i/f | regidx | predidx |
+| r15 | pren15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    fp_pred_enabled[32];
+    int_pred_enabled[32];
+    for (i = 0; i < 16; i++)
+        if CSRpred[i].pren:
+            idx = CSRpred[i].regidx
+            predidx = CSRpred[i].predidx
+            if CSRpred[i].type == 0: # integer
+                int_pred_enabled[idx] = 1
+                int_pred_reg[idx] = predidx
+            else:
+                fp_pred_enabled[idx] = 1
+                fp_pred_reg[idx] = predidx
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i
+> For LOAD/STORE, could a better option be to interpret the offset in the
+> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is
+> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12),
+> t5 = *(t2+24), t6 = *(t2+32)?  Perhaps include a bit in the
+> vector-control CSRs to select between offset-as-stride and unit-stride
+> memory accesses?
+
+So there would be an instruction like this:
+
+| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM |
+| opcode | 5 bit | 1 bit | 1 bit | 5 bit, OFFn=XLEN |
+
+
+which would mean:
+
+* CSR-Offset register n <= (float|int) register number N
+* CSR-Offset Stride-mode = offset or unit
+* CSR-Offset amount register n = contents of register M
+
+LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
+
+    offs = 0
+    stride = 1
+    vector-len = CSR-Vector-length register N
+
+    for (o = 0, o < 2, o++)
+        if (CSR-Offset register o == M)
+            offs = CSR-Offset amount register o
+            if CSR-Offset Stride-mode == offset:
+                stride = ldoffs
+            break
+
+    for (i = 0, i < vector-len; i++)
+        r[N+i] = mem[(offs*i + r[M+i])*stride]
+
 # Example of vector / vector, vector / scalar, scalar / scalar => vector add
 
     register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
@@ -788,6 +902,7 @@ Notes:
 * j is multiplied by stride, not elsize, including in the rs2 vectorised case.
 * There may be more sophisticated variants involving the 31st bit, however
   it would be nice to reserve that bit for post-increment of address registers
+*
 
 ## 17.19 Vector Register Gather
 
-- 
2.30.2