# Simple-V (Parallelism Extension Proposal) Appendix
* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 25 Jun 2019
* main spec [[specification]]
[[!toc ]]
# Element bitwidth polymorphism
Element bitwidth is best covered as its own special section, as it
is quite involved and applies uniformly across-the-board. SV restricts
bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
The effect of setting an element bitwidth is to re-cast each entry
in the register table, and for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register effectively
now looks like this:
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
where the CSR Register table entry (not the instruction alone) determines
which of those union entries is to be used on each operation, and the
VL element offset in the hardware-loop specifies the index into each array.
However, a naive interpretation of the data structure above masks the
fact that, when VL is set greater than 8 (for example) with a bitwidth of 8,
accessing one specific register "spills over" into the following entries
of the register file, sequentially. A much more accurate way
to reflect this would therefore be:
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0];            // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register
bytes is ever attempted.
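The overspill behaviour (and the end-of-regfile exception) can be modelled
directly. The following is a minimal sketch, not taken from the spec,
assuming RV64 (8 bytes per register), a 128-entry regfile and little-endian
byte order; `write_elem` is an illustrative helper name:

```python
XLEN_BYTES = 8          # assumption: RV64
NREGS = 128             # SV 7-bit regfile
regfile = bytearray(XLEN_BYTES * NREGS)   # one flat byte view of all regs

def write_elem(reg, elwidth_bytes, index, value):
    # write element `index` of width `elwidth_bytes`, starting at `reg`
    addr = reg * XLEN_BYTES + index * elwidth_bytes
    if addr + elwidth_bytes > len(regfile):
        raise IndexError("access beyond the real register file")
    regfile[addr:addr + elwidth_bytes] = value.to_bytes(elwidth_bytes, "little")

# with elwidth=8-bit, element 9 of "register 5" lands in register 6:
write_elem(5, 1, 9, 0xAB)
assert regfile[6 * XLEN_BYTES + 1] == 0xAB

# and an access past the end of the regfile raises the expected exception:
try:
    write_elem(127, 8, 1, 0)
    raise AssertionError("expected IndexError")
except IndexError:
    pass
```

Note how the flat byte array makes the "spill over" completely transparent:
there is no per-register boundary check other than the final one.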
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non-polymorphic" version (above):
    function op_add(rd, rs1, rs2) # add not VADD!
    ...
    ...
        for (i = 0; i < VL; i++)
    ...
    ...
            // TODO, calculate if over-run occurs, for each elwidth
            if (elwidth == 8) {
                int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                         int_regfile[rs2].b[irs2];
            } else if elwidth == 16 {
                int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                         int_regfile[rs2].s[irs2];
            } else if elwidth == 32 {
                int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                         int_regfile[rs2].i[irs2];
            } else { // elwidth == 64
                int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                         int_regfile[rs2].l[irs2];
            }
    ...
    ...
Here we can see clearly that for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them) are "type-cast"
to 8-bit; likewise for 16-bit entries, and so on.
However that only covers the case where the element widths are the same.
Where the element widths are different, the following algorithm applies:
* Analyse the bitwidth of all source operands and work out the
maximum. Record this as "maxsrcbitwidth"
* If any given source operand requires sign-extension or zero-extension
(ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
sign-extension / zero-extension or whatever is specified in the standard
RV specification, **change** that to sign-extending from the respective
individual source operand's bitwidth from the CSR table out to
"maxsrcbitwidth" (previously calculated), instead.
* Following separate and distinct (optional) sign/zero-extension of all
source operands as specifically required for that operation, carry out the
operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
into a copy).
* If the destination operand requires sign-extension or zero-extension,
instead of a mandatory fixed size (typically 32-bit for arithmetic such
as subw, and otherwise various: 8-bit for sb, 16-bit for sh, 32-bit for sw
etc.), overload the RV specification with the bitwidth from the
destination register's elwidth entry.
* Finally, store the (optionally) sign/zero-extended value into its
destination: memory for sb/sw etc., or an offset section of the register
file for an arithmetic operation.
In this way, polymorphic bitwidths are achieved without requiring a
massive 64-way permutation of calculations **per opcode**, for example
(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
rd bitwidths). The pseudo-code is therefore as follows:
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry
    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res
    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val
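A minimal executable sketch of the two helpers above, assuming a
little-endian flat byte-array register file (the function names follow the
pseudo-code; the scalar sign-extension branch is omitted for brevity, and
all constants are illustrative):

```python
XLEN = 64                              # assumption: RV64
REGBYTES = XLEN // 8
regfile = bytearray(REGBYTES * 128)    # flat little-endian byte view

def get_polymorphed_reg(reg, bitwidth, offset):
    # read element `offset` of width `bitwidth` bits, starting at `reg`
    nbytes = bitwidth // 8
    addr = reg * REGBYTES + offset * nbytes
    return int.from_bytes(regfile[addr:addr + nbytes], "little")

def set_polymorphed_reg(reg, bitwidth, offset, val):
    # write element `offset` (scalar sign/zero-extend branch omitted here)
    nbytes = bitwidth // 8
    addr = reg * REGBYTES + offset * nbytes
    regfile[addr:addr + nbytes] = \
        (val & ((1 << bitwidth) - 1)).to_bytes(nbytes, "little")

# elements written at 16-bit width, read back at the same width:
set_polymorphed_reg(3, 16, 0, 0x1234)
set_polymorphed_reg(3, 16, 1, 0x5678)
assert get_polymorphed_reg(3, 16, 1) == 0x5678
# the same bytes, reinterpreted at 8-bit element width (little-endian):
assert get_polymorphed_reg(3, 8, 0) == 0x34
assert get_polymorphed_reg(3, 8, 2) == 0x78
```

The final two assertions show the key property: source and destination
element widths are simply different reinterpretations of the same bytes.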
    maxsrcwid = get_max_elwidth(rs1, rs2)   # source element width(s)
    destwid = int_csr[rd].elwidth           # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i)                 # predication uses intregs
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            # TODO: sign/zero-extend src1 and src2, as the operation requires
            result = src1 + src2            # actual add here
            # TODO: sign/zero-extend result out to destwid, as required
            set_polymorphed_reg(rd, bw(destwid), ird, result)
            if (!int_csr[rd].isvec) break
        if (int_csr[rd].isvec)  { ird += 1; }
        if (int_csr[rs1].isvec) { irs1 += 1; }
        if (int_csr[rs2].isvec) { irs2 += 1; }

## Polymorphic elwidths and LOAD/STORE
Polymorphic element widths in vectorised form means that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.
This makes for a slightly more complex algorithm when using indirection
on the "addressed" register (source for LOAD and destination for STORE),
particularly given that the LOAD/STORE instruction provides important
information about the width of the data to be reinterpreted.
Let's illustrate the "load" part, where the pseudo-code for elwidth=default
was as follows (i being the hardware-loop index from 0 to VL-1):

    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits
Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
chunks are taken from the source memory location addressed by the current
indexed source address register, and only when a full 32-bits-worth
are taken will the index be moved on to the next contiguous source
address register:
    bitwidth = bw(elwidth);            // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth        // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock;            // modulo
    return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
and 128 for LQ.
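The chunking arithmetic above can be checked with a short sketch
(`load_addrs` is an illustrative name; elsperblock is clamped to a minimum
of 1 so that an element wider than the operation still advances, as
discussed later in this section):

```python
def load_addrs(op_bitwidth, element_bitwidth, imm, srcregs, n_elements):
    # srcregs holds the values of the indexed source address registers.
    # Offsets are in element units; for 8-bit elements these are bytes.
    elsperblock = max(1, op_bitwidth // element_bitwidth)  # clamp to >= 1
    addrs = []
    for i in range(n_elements):
        srcbase = srcregs[i // elsperblock]   # integer divide
        offs = i % elsperblock                # modulo
        addrs.append(srcbase + imm + offs)
    return addrs

# LW (32-bit) with 8-bit elements: 4 elements per source address register,
# then the index moves on to the next contiguous source address register
assert load_addrs(32, 8, 0, [0x100, 0x200], 8) == \
       [0x100, 0x101, 0x102, 0x103, 0x200, 0x201, 0x202, 0x203]
```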
The principle is basically exactly the same as if the srcbase were pointing
at the memory of the *register* file: memory is re-interpreted as containing
groups of elwidth-wide discrete elements.
When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*. Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth. Also, the storing uses the exact same analogous algorithm as
above: in fact it is just the set\_polymorphed\_reg pseudocode
(completely unchanged) from earlier.
One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB for example
cannot possibly obtain 16-bit-wide data. This condition may be detected
where, when using integer divide, elsperblock (the width of the LOAD
divided by the bitwidth of the element) is zero.
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
The elements, if the element bitwidth is larger than the LD operation's
size, will then be sign/zero-extended to the full LD operation size, as
specified by the LOAD (LDU instead of LD, LBU instead of LB), before
being passed on to the second phase.
As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except where in previous
pseudo-code (elwidth=default for both source and target) it was
the *registers* that the predication was applied to, it is now the
**elements** that the predication is applied to.
Thus the full pseudocode for all LD operations may be written out
as follows:
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)
    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16...
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
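A runnable sketch of load_memory against a small dictionary-backed fake
memory (register and memory contents, and the explicit `elwidth_bits`
parameter, are all illustrative assumptions):

```python
mem = {0x100 + k: 0x10 + k for k in range(4)}   # hypothetical memory contents

def load_memory(srcregs, imm, i, opwidth, elwidth_bits):
    # gather one element: elsperblock chunks per indexed address register
    elsperblock = max(1, opwidth // elwidth_bits)
    srcbase = srcregs[i // elsperblock]
    offs = i % elsperblock                      # element-unit offset
    nbytes = min(opwidth, elwidth_bits) // 8    # bytes actually fetched
    addr = srcbase + imm + offs * nbytes
    return int.from_bytes(bytes(mem[addr + k] for k in range(nbytes)),
                          "little")

# LW (opwidth=32) with 8-bit elements: four successive bytes from one base
assert [load_memory([0x100], 0, i, 32, 8) for i in range(4)] == \
       [0x10, 0x11, 0x12, 0x13]
```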
    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bw(destwid)))
            else:
                val = sign_extend(val, min(opwidth, bw(destwid)))
            set_polymorphed_reg(rd, bw(destwid), j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++;
            else break

### add

Standard Scalar RV64I:

* RS1 @ xlen bits
* RS2 @ xlen bits
* add @ xlen bits
* RD @ xlen bits

Polymorphic variant:

* RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
* RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
* add @ max(rs1, rs2) bits
* RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
Note here that polymorphic add zero-extends its source operands,
where addw sign-extends.
### addw
The RV Specification specifically states that "W" variants of arithmetic
operations always produce 32-bit signed values. In a polymorphic
environment it is reasonable to assume that the signed aspect is
preserved, where it is the length of the operands and the result
that may be changed.
Standard Scalar RV64 (xlen):
* RS1 @ xlen bits
* RS2 @ xlen bits
* add @ xlen bits
* RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
Polymorphic variant:
* RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
* RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
* add @ max(rs1, rs2) bits
* RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
Note here that polymorphic addw sign-extends its source operands,
where add zero-extends.
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.

Effectively, however, both rs1 and rs2 are treated as sign-extended (or
truncated), where for add they are both zero-extended. This holds true
for all arithmetic operations ending with "W".
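The zero- versus sign-extension difference between polymorphic add and
addw can be sketched with a small (illustrative) extend helper:

```python
def extend(val, frombits, tobits, signed):
    # zero- or sign-extend `val` from `frombits` up to `tobits`
    mask = (1 << frombits) - 1
    val &= mask
    if signed and (val >> (frombits - 1)) & 1:
        val |= ((1 << tobits) - 1) & ~mask   # replicate the sign bit
    return val

# 8-bit source 0xFF widened to a 16-bit operation width:
assert extend(0xFF, 8, 16, signed=False) == 0x00FF   # add: zero-extend
assert extend(0xFF, 8, 16, signed=True)  == 0xFFFF   # addw: sign-extend
```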
### addiw
Standard Scalar RV64I:
* RS1 @ xlen bits, truncated to 32-bit
* immed @ 12 bits, sign-extended to 32-bit
* add @ 32 bits
* RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
Polymorphic variant:
* RS1 @ rs1 bits
* immed @ 12 bits, sign-extend to max(rs1, 12) bits
* add @ max(rs1, 12) bits
* RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
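A worked sketch of the polymorphic addiw rules listed above (all names are
illustrative; `sext` follows the sign-extension convention used for "W"
operations in this section):

```python
def sext(val, frombits, tobits):
    # sign-extend `val` from `frombits` up to `tobits`
    mask = (1 << frombits) - 1
    val &= mask
    if (val >> (frombits - 1)) & 1:
        val |= ((1 << tobits) - 1) & ~mask
    return val

def poly_addiw(rs1_val, rs1_bits, imm12, rd_bits):
    opwidth = max(rs1_bits, 12)                   # operation width
    a = sext(rs1_val, rs1_bits, opwidth)          # RS1 widened to opwidth
    b = sext(imm12, 12, opwidth)                  # immed sign-extended
    result = (a + b) & ((1 << opwidth) - 1)       # add @ opwidth bits
    if rd_bits > opwidth:                         # sign-extend to rd...
        return sext(result, opwidth, rd_bits)
    return result & ((1 << rd_bits) - 1)          # ...otherwise truncate

# 8-bit rs1 holding 0xFE (-2), immediate 5, 16-bit destination:
assert poly_addiw(0xFE, 8, 5, 16) == 3
# truncating case: 8-bit destination keeps only the low byte
assert poly_addiw(0x7F, 8, 1, 8) == 0x80
```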
# Predication Element Zeroing
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, allowing them to save power by avoiding a register read on
elements that are passed en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs, entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios), and
given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
## Single-predication (based on destination register)
Zeroing on predication for arithmetic operations is taken from
the destination register's predicate. i.e. the predication *and*
zeroing settings to be applied to the whole operation come from the
CSR Predication table entry for the destination register.
Thus when zeroing is set on predication of a destination element,
if the predication bit is clear, then the destination element is *set*
to zero (twin-predication is slightly different, and will be covered
next).
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:

    for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
            while (!(predval & 1<<i) && i < VL)
                if (int_csr[rd].isvec)  { ird  += 1; }
                if (int_csr[rs1].isvec) { irs1 += 1; }
                if (int_csr[rs2].isvec) { irs2 += 1; }
            if i == VL:
                break
        if (predval & 1<<i)
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            set_polymorphed_reg(rd, bw(destwid), ird, result)
            if (!int_csr[rd].isvec) break
        else if zeroing:
            set_polymorphed_reg(rd, bw(destwid), ird, 0)
        if (int_csr[rd].isvec)  { ird  += 1; }
        if (int_csr[rs1].isvec) { irs1 += 1; }
        if (int_csr[rs2].isvec) { irs2 += 1; }
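The zeroing versus non-zeroing behaviour can be illustrated with a
simplified element loop (scalar/vector index handling and element widths
are omitted; all names are illustrative):

```python
def predicated_add(vl, predval, src1, src2, dest, zeroing):
    # with zeroing, masked-out destination elements are *set* to zero;
    # without, they are left completely untouched
    for i in range(vl):
        if (predval >> i) & 1:
            dest[i] = src1[i] + src2[i]
        elif zeroing:
            dest[i] = 0

d = [9, 9, 9, 9]
predicated_add(4, 0b0101, [1, 2, 3, 4], [10, 20, 30, 40], d, zeroing=False)
assert d == [11, 9, 33, 9]      # masked-out elements untouched

d = [9, 9, 9, 9]
predicated_add(4, 0b0101, [1, 2, 3, 4], [10, 20, 30, 40], d, zeroing=True)
assert d == [11, 0, 33, 0]      # masked-out elements zeroed
```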
# Example Usage

## strncpy
RVV version:

    strncpy:
        mv a3, a0               # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vseq.vi v0, v1, 0       # Flag zero bytes
        vmfirst a4, v0          # Zero found?
        vmsif.v v0, v0          # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t    # Write out bytes
        bgez a4, exit           # Done
        csrr t1, vl             # Get number of bytes fetched
        add a1, a1, t1          # Bump src pointer
        sub a2, a2, t1          # Decrement count.
        add a3, a3, t1          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
SV version (WIP):

    strncpy:
        mv a3, a0
        SETMVLI 8               # set max vector to 8
        RegCSR[a3] = 8bit, a3, scalar
        RegCSR[a1] = 8bit, a1, scalar
        RegCSR[t0] = 8bit, t0, vector
        PredTb[t0] = ffirst, x0, inv
    loop:
        SETVLI a2, t4           # t4 and VL now 1..8
        ldb t0, (a1)            # t0 fail first mode
        bne t0, x0, allnonzero  # still ff
        # VL points to last nonzero
        GETVL t4                # from bne tests
        addi t4, t4, 1          # include zero
        SETVL t4                # set exactly to t4
        stb t0, (a3)            # store incl zero
        ret                     # end subroutine
    allnonzero:
        stb t0, (a3)            # VL legal range
        GETVL t4                # from bne tests
        add a1, a1, t4          # Bump src pointer
        sub a2, a2, t4          # Decrement count.
        add a3, a3, t4          # Bump dst pointer
        bnez a2, loop           # Anymore?
    exit:
        ret
Notes:
* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
* with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
* ldb and bne are both using t0, both in ffirst mode
* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
* however, as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
* SETVL sets *exactly* the requested amount into VL.
* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
* this would cause the stb to copy up to the end of the legal memory
* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
## strcpy
RVV version:
        mv a3, a0               # Save start
    loop:
        setvli a1, x0, vint8    # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)       # Get bytes
        csrr a1, vl             # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0       # Set v0[i] where v1[i] = 0
        add a3, a3, a1          # Bump pointer
        vmfirst a2, v0          # Find first set bit in mask, returns -1 if none
        bltz a2, loop           # Not found?
        add a0, a0, a1          # Sum start + bump
        add a3, a3, a2          # Add index of zero byte
        sub a0, a3, a0          # Subtract start address+bump
        ret