# Simple-V (Parallelism Extension Proposal) Appendix
* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 28 Jun 2019
* main spec [[specification]]
[[!toc ]]
# Fail-on-first modes
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
that is zero"; with traps, however, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (with VL=1).
Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as future extension writers.
## Fail-on-first traps
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
subsequent indexed elements are instead ignored (or cancelled in
out-of-order designs), and VL is truncated to point at the *last*
in-sequence element that did not take the trap.
Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur). However,
note that the loop still has to test the predicate bit: thus on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out (not tested), up to the
point where the trap occurred.
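To illustrate the VL calculation, here is a minimal c sketch (purely
illustrative: element_would_trap, take_trap and process_element are made-up
placeholder names, and "first element" is taken here to mean the first
non-masked element):

    #include <stdint.h>
    #include <stdbool.h>

    extern bool element_would_trap(int i);  // e.g. would element i page-fault?
    extern void take_trap(void);            // assumed not to return here
    extern void process_element(int i);

    // Returns the new VL.  The first active element traps exactly as the
    // equivalent scalar operation would; a trap on any later element
    // instead truncates VL to cover every element before it, with
    // masked-out elements included in the count.
    int ffirst_trap_loop(int VL, uint64_t predmask) {
        bool first_active = true;
        for (int i = 0; i < VL; i++) {
            if (!(predmask & (1ULL << i)))
                continue;                // not tested, but still counted
            if (element_would_trap(i)) {
                if (first_active)
                    take_trap();         // scalar-identical trap behaviour
                return i;                // truncate: elements 0..i-1 survive
            }
            first_active = false;
            process_element(i);
        }
        return VL;                       // no trap: VL unchanged
    }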
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); in subsequent
*sub-groups* of elements, a trap must not occur. SUBVL will **NOT**
be modified: trap handlers must analyse (x)eSTATE (subvl offset indices)
to determine the element that caused the trap.
Given that predication bits apply to SUBVL groups, the same rules apply
to predicated-out (masked-out) sub-groups in calculating the value that
VL is set to.
## Fail-on-first conditional tests
ffirst stops sequential (or sequentially-appearing in the case of
out-of-order designs) element conditional testing on the first element
result being zero (or other "fail" condition). VL is set to the number
of elements that were (sequentially) processed before the fail-condition
was encountered.
Note that, just as with traps, if SUBVL!=1, the first failed test in the
*sub-group* will cause processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
software must analyse (x)eSTATE (subvl offset indices) to determine the
element that caused the fail-condition.
Note again that, just as with traps, predicated-out (masked-out) elements
are included in the (sequential) count leading up to the fail-condition,
even though they were not tested.
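A minimal c sketch of the conditional ffirst rules above (test_element is
an illustrative placeholder for whatever per-element test the instruction
performs):

    #include <stdint.h>
    #include <stdbool.h>

    extern bool test_element(int elem);  // the per-element condition test

    // Returns the new VL, counted in whole sub-groups: a failed test
    // anywhere in a sub-group excludes that entire sub-group, even if
    // other elements within it passed.  SUBVL itself is never modified.
    int ffirst_cond_loop(int VL, int SUBVL, uint64_t predmask) {
        for (int i = 0; i < VL; i++) {
            if (!(predmask & (1ULL << i)))
                continue;            // masked-out sub-group: counted, untested
            for (int s = 0; s < SUBVL; s++)
                if (!test_element(i * SUBVL + s))
                    return i;        // sub-groups fully passed so far
        }
        return VL;
    }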
# Instructions
Despite SV being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped; however, xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all RVV opcodes,
with the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever gained
a MV.X as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V. VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied). Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.
With some exceptions, where it does not make sense or is simply too
challenging, all RV-Base instructions are parallelised:
* CSR instructions are the fundamental core basis of SV, so are not
parallelised. Whilst a case could be made for fast-polling of
a CSR into multiple registers, or for being able to copy multiple
contiguously addressed CSRs into contiguous registers, and so on,
extreme care would need to be taken if that were done. Additionally,
CSR reads are done using x0, and it is *really* inadvisable to tag x0.
* LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
left as scalar.
* LR/SC could hypothetically be parallelised however their purpose is
single (complex) atomic memory operations where the LR must be followed
up by a matching SC. A sequence of parallel LR instructions followed
by a sequence of parallel SC instructions therefore is guaranteed to
not be useful. Not least: the guarantees of a Multi-LR/SC
would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers so are not inherently
parallelisable anyway.
All other operations using registers are automatically parallelised.
This includes AMOMAX, AMOSWAP and so on, where particular care and
attention must be paid.
Example pseudo-code for an integer ADD operation (including scalar
operations). Floating-point uses the FP Register Table.
[[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
Note that for simplicity there is quite a lot missing from the above
pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
reshaping and offsets and so on. However it demonstrates the basic
principle. Augmentations that produce the full pseudo-code are covered in
other sections.
## SUBVL Pseudocode
Adding in support for SUBVL is a matter of adding an extra inner
for-loop, where the register src and dest are still incremented inside the
inner part. Note that predication is still taken from the VL index:
whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
            # actual add is here (at last)
            ireg[rd+id*SUBVL+s] <= ireg[rs1+irs1*SUBVL+s] +
                                   ireg[rs2+irs2*SUBVL+s];
            if (!int_vec[rd].isvector) break;
        if (int_vec[rd ].isvector) { id += 1; }
        if (int_vec[rs1].isvector) { irs1 += 1; }
        if (int_vec[rs2].isvector) { irs2 += 1; }

## Branch Instructions
Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).
Note that the predication register to use (if one is enabled) is taken from
the *first* src register, and that this is used, just as with predicated
arithmetic operations, to mask whether the comparison operations take
place or not. The target (destination) predication register
to use (if one is enabled) is taken from the *second* src register.
If either of src1 or src2 is a scalar (whether by there being no
CSR register entry or by the CSR entry specifically marking
the register as "scalar"), the comparison goes ahead as vector-scalar
or scalar-vector.
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) be set to zero. Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared. This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate is always set to zero:
this is **not** the case. The destination predicate is only set
to zero if **zeroing** is enabled.
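The following c sketch summarises the destination predicate update
described above (illustrative only: compare stands in for the element
test, and treating a failing non-masked compare under zeroing as also
clearing its bit is an interpretation of "the result of the compare is
set to false"):

    #include <stdint.h>
    #include <stdbool.h>

    extern bool compare(int i);  // the element-wise branch test (e.g. BEQ)

    // Destination predicate update for a vectorised branch-compare.
    // Without zeroing, bits are only ever *set*; with zeroing, a "false"
    // result (masked-out, or a failing compare) clears its bit.
    uint64_t update_dest_pred(uint64_t destpred, uint64_t srcpred,
                              bool zeroing, int VL) {
        for (int i = 0; i < VL; i++) {
            bool active = (srcpred >> i) & 1;
            if (active && compare(i))
                destpred |= 1ULL << i;     // test passed: set the bit
            else if (zeroing)
                destpred &= ~(1ULL << i);  // result forced to zero
            // otherwise: bit left untouched (set-only behaviour)
        }
        return destpred;
    }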
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by reversing
src1 and src2.
In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":

    for (int i = 0; i < vl; ++i)
      if ([!]preg[p][i])
        preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                          s2 ? vreg[rs2][i] : sreg[rs2]);

## C.MV Instruction
There is no MV instruction in RV; however there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).
If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation. The actual instruction's format does not change:
[[!table data="""
15 12 | 11 7 | 6 2 | 1 0 |
funct4 | rd | rs | op |
4 | 5 | 5 | 2 |
C.MV | dest | src | C0 |
"""]]
A simplified version of the pseudocode for this operation is as follows:
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        ireg[rd+j] <= ireg[rs+i];
        if (!int_csr[rs].isvec && !int_csr[rd].isvec) break; # scalar MV
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;

## LOAD / STORE Instructions and LOAD-FP/STORE-FP
An earlier draft of SV modified the behaviour of LOAD/STORE (modified
the interpretation of the instruction fields). This
actually undermined the fundamental principle of SV, namely that there
be no modifications to scalar behaviour (except where absolutely
necessary), in order to simplify an implementor's task when
converting a pre-existing scalar design to support parallelism.
So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
does not change in SV; however, just as with C.MV, it is important to note
that twin-predication is possible.
In vectorised architectures there are usually at least two different modes
for LOAD/STORE:
* Read (or write for STORE) from sequential locations, where one
register specifies the address, and that address is incremented
by a fixed amount. This is usually known as "Unit Stride" mode.
* Read (or write) from multiple indirected addresses, where the
vector elements each specify separate and distinct addresses.
To support these different addressing modes, the CSR Register "isvector"
bit is used. So, for a LOAD, when the src register is set to
scalar, the LOADs are sequentially incremented by the src register
element width, and when the src register is set to "vector", the
elements are treated as indirection addresses. Simplified
pseudo-code would look like this:
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # multi-indirection mode: each element is an address
          srcbase = ireg[rsv+i];
        else
          # unit stride mode: sequential from the one base address
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec && !int_csr[rd].isvec) break; # scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;

## Compressed Stack LOAD / STORE Instructions
C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
It is therefore possible to use predicated C.LWSP to efficiently
pop registers off the stack (by predicating x2 as the source), cherry-picking
which registers to store to (by predicating the destination). Likewise
for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
The two modes ("unit stride" and multi-indirection) are still supported,
as with standard LD/ST. Essentially, the only difference is that the
use of x2 is hard-coded into the instruction.
**Note**: it is still possible to redirect x2 to an alternative target
register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
general-purpose LOAD/STORE operations.
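A minimal c sketch of the twin-predicated stack-pop idea (illustrative
assumptions: RV64, no elwidth overrides or zeroing, and predicated_lwsp
and mem_read32 are made-up names):

    #include <stdint.h>

    extern uint64_t ireg[128];                  // SV integer register file
    extern int32_t  mem_read32(uint64_t addr);  // illustrative memory access

    // Twin-predicated C.LWSP-style "stack pop": x2 (sp) is the implicit
    // source base; ps selects which sequential stack slots are read, pd
    // cherry-picks which destination registers receive them.
    void predicated_lwsp(int rd, uint64_t ps, uint64_t pd, int VL,
                         uint64_t imm)
    {
        for (int i = 0, j = 0; i < VL && j < VL; i++, j++) {
            while (i < VL && !((ps >> i) & 1)) i++;  // skip masked src slots
            while (j < VL && !((pd >> j) & 1)) j++;  // skip masked dest regs
            if (i >= VL || j >= VL)
                break;
            // unit-stride from sp: slot i -> register rd+j, sign-extended
            ireg[rd + j] = (uint64_t)(int64_t)
                           mem_read32(ireg[2] + imm + 4u * (unsigned)i);
        }
    }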
## Compressed LOAD / STORE Instructions
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
on the src for LOAD (or on the dest for STORE) selects "Unit Stride"
or "Multi-indirection" mode, respectively.
# Element bitwidth polymorphism
Element bitwidth is best covered as its own special section, as it
is quite involved and applies uniformly across-the-board. SV restricts
bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
The effect of setting an element bitwidth is to re-cast each entry
in the register table, and, for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, each register
effectively now looks like this:
typedef union {
uint8_t b[8];
uint16_t s[4];
uint32_t i[2];
uint64_t l[1];
} reg_t;
// integer table: assume maximum SV 7-bit regfile size
reg_t int_regfile[128];
where the CSR Register table entry (not the instruction alone) determines
which of those union entries is to be used on each operation, and the
VL element offset in the hardware-loop specifies the index into each array.
However a naive interpretation of the data structure above masks the
fact that setting VL greater than 8, for example, when the bitwidth is 8,
means that accessing one specific register "spills over" to the following
parts of the register file in a sequential fashion. So a much more
accurate way to reflect this would be:
typedef union {
uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
uint8_t b[0]; // array of type uint8_t
uint16_t s[0];
uint32_t i[0];
uint64_t l[0];
uint128_t d[0];
} reg_t;
reg_t int_regfile[128];
where, when accessing any individual regfile[n].b entry, it is permitted
(in c) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any attempt to access beyond the "real" register
bytes is made.
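As a self-contained c illustration of the overspill principle (using a
byte-addressed view of the whole register file rather than zero-length
arrays, for portability):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef union {
        uint8_t actual_bytes[8];       // RV64: 8 "real" bytes per register
    } reg_t;

    static reg_t int_regfile[128];

    // byte-addressed view of the *whole* register file: element k of
    // register n (elwidth=8) may overspill past register n into n+1...
    uint8_t get_byte_elem(int n, int k) {
        return ((uint8_t *)int_regfile)[n * 8 + k];
    }

    int main(void) {
        memset(int_regfile, 0, sizeof int_regfile);
        int_regfile[6].actual_bytes[2] = 0x5a;  // byte 2 of register 6
        // element 10 of register 5 overspills into register 6 (10 = 8 + 2)
        printf("0x%x\n", get_byte_elem(5, 10)); // prints 0x5a
        return 0;
    }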
Now we may modify the pseudo-code of an operation where all element
bitwidths have been set to the same size; this pseudo-code is otherwise
identical to its "non"-polymorphic version (above):
function op_add(rd, rs1, rs2) # add not VADD!
...
...
for (i = 0; i < VL; i++)
...
...
// TODO, calculate if over-run occurs, for each elwidth
    if (elwidth == 8) {
        int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                 int_regfile[rs2].b[irs2];
    } else if (elwidth == 16) {
        int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                 int_regfile[rs2].s[irs2];
    } else if (elwidth == 32) {
        int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                 int_regfile[rs2].i[irs2];
    } else { // elwidth == 64
        int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                 int_regfile[rs2].l[irs2];
    }
...
...
So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers sequentially following on from each of them) are "type-cast"
to 8-bit; likewise for 16-bit entries, and so on.
However that only covers the case where the element widths are the same.
Where the element widths are different, the following algorithm applies:
* Analyse the bitwidth of all source operands and work out the
maximum. Record this as "maxsrcbitwidth"
* If any given source operand requires sign-extension or zero-extension
(ldb, div, rem, mul, sll, srl, sra etc.), then, instead of the mandatory
32-bit sign-extension / zero-extension (or whatever is specified in the
standard RV specification), **change** that to sign-extending from the
respective individual source operand's bitwidth (from the CSR table) out
to "maxsrcbitwidth" (previously calculated).
* Following separate and distinct (optional) sign/zero-extension of all
source operands as specifically required for that operation, carry out the
operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
into a copy).
* If the destination operand requires sign-extension or zero-extension,
instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
etc.), overload the RV specification with the bitwidth from the
destination register's elwidth entry.
* Finally, store the (optionally) sign/zero-extended value into its
destination: memory for sb/sw etc., or an offset section of the register
file for an arithmetic operation.
In this way, polymorphic bitwidths are achieved without requiring a
massive 64-way permutation of calculations **per opcode**, for example
(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
rd bitwidths). The pseudo-code is therefore as follows:
typedef union {
uint8_t b;
uint16_t s;
uint32_t i;
uint64_t l;
} el_reg_t;
bw(elwidth):
if elwidth == 0: return xlen
if elwidth == 1: return 8
if elwidth == 2: return 16
// elwidth == 3:
return 32
get_max_elwidth(rs1, rs2):
return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
bw(int_csr[rs2].elwidth)) # again XLEN if no entry
    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res
set_polymorphed_reg(reg, bitwidth, offset, val):
if (!int_csr[reg].isvec):
# sign/zero-extend depending on opcode requirements, from
# the reg's bitwidth out to the full bitwidth of the regfile
val = sign_or_zero_extend(val, bitwidth, xlen)
int_regfile[reg].l[0] = val
elif bitwidth == 8:
int_regfile[reg].b[offset] = val
elif bitwidth == 16:
int_regfile[reg].s[offset] = val
elif bitwidth == 32:
int_regfile[reg].i[offset] = val
elif bitwidth == 64:
int_regfile[reg].l[offset] = val
    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth          # destination element width
    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # predication uses intregs
        // TODO, calculate if over-run occurs, for each elwidth
        src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
        src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
        result = src1 + src2 # actual add (or other op) here
        # sign/zero-extend or truncate result to destination width
        val = adjust_wid(result, maxsrcwid, destwid)
        set_polymorphed_reg(rd, destwid, id, val)
        if (!int_csr[rd].isvec) break
      if (int_csr[rd ].isvec) { id += 1; }
      if (int_csr[rs1].isvec) { irs1 += 1; }
      if (int_csr[rs2].isvec) { irs2 += 1; }

## Polymorphic elwidths on LOAD/STORE
Polymorphic element widths in vectorised form means that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.
This makes for a slightly more complex algorithm when using indirection
on the "addressed" register (source for LOAD and destination for STORE),
particularly given that the LOAD/STORE instruction provides important
information about the width of the data to be reinterpreted.
Let's illustrate the "load" part, where the pseudo-code for elwidth=default
was as follows, and i is the loop from 0 to VL-1:
srcbase = ireg[rs+i];
return mem[srcbase + imm]; // returns XLEN bits
Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
chunks are taken from the source memory location addressed by the current
indexed source address register, and only when a full 32-bits-worth
are taken will the index be moved on to the next contiguous source
address register:
bitwidth = bw(elwidth); // source elwidth from CSR reg entry
elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
srcbase = ireg[rs+i/(elsperblock)]; // integer divide
offs = i % elsperblock; // modulo
return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
and 128 for LQ.
The principle is basically exactly the same as if the srcbase were pointing
at the memory of the *register* file: memory is re-interpreted as containing
groups of elwidth-wide discrete elements.
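As a concrete illustration, this small self-contained c program prints the
srcbase/offs mapping for LW (a 32-bit operation) with source elwidth=8 and
VL=8:

    #include <stdio.h>

    // For LW (a 32-bit operation) with source elwidth=8, four 8-bit
    // elements are consumed per 32-bit block before the index moves on
    // to the next contiguous source address register.
    int main(void) {
        int opwidth = 32, bitwidth = 8;
        int elsperblock = opwidth / bitwidth;  // 4 elements per block
        for (int i = 0; i < 8; i++) {          // VL = 8
            printf("element %d: srcbase = ireg[rs+%d], offs = %d\n",
                   i, i / elsperblock, i % elsperblock);
        }
        return 0;
    }

Elements 0 to 3 are read from ireg[rs+0] at offsets 0 to 3; elements 4 to
7 from ireg[rs+1], and so on.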
When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*. Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth. Also, the storing uses the exact same analogous algorithm as
above: in fact it is just the set\_polymorphed\_reg pseudocode
(completely unchanged) that is used.
One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB, for example,
cannot possibly obtain 16-bit-wide data. This condition may be detected
where, when using integer divide, elsperblock (the width of the LOAD
divided by the bitwidth of the element) is zero.
The issue is fixed by clamping elsperblock to a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
The elements, if the element bitwidth is larger than the LD operation's
size, will then be sign/zero-extended from the LD operation's size up to
the full element bitwidth, with signedness as specified by the LOAD opcode
(e.g. LBU zero-extends where LB sign-extends), before being passed on to
the second phase.
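A small c sketch of the clamp and the subsequent widening (elsperblock and
widen_lb_to_16 are illustrative helper names, not part of the
specification):

    #include <stdint.h>

    // When the element is wider than the LOAD operation itself (e.g. LB,
    // an 8-bit operation, with elwidth=16), the integer divide yields 0,
    // so clamp to a minimum of one element per block, i.e. max(1, ...).
    int elsperblock(int opwidth, int element_bitwidth) {
        int e = opwidth / element_bitwidth;  // 0 whenever element > opwidth
        return e > 0 ? e : 1;
    }

    // The 8 bits actually fetched by LB are then sign-extended up to the
    // 16-bit element width (an LBU would zero-extend instead).
    int16_t widen_lb_to_16(uint8_t fetched) {
        return (int16_t)(int8_t)fetched;
    }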
As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except where in previous
pseudo-code (elwidth=default for both source and target) it was
the *registers* that the predication was applied to, it is now the
**elements** that the predication is applied to.
Thus the full pseudocode for all LD operations may be written out
as follows:
function LBU(rd, rs):
load_elwidthed(rd, rs, 8, true)
function LB(rd, rs):
load_elwidthed(rd, rs, 8, false)
function LH(rd, rs):
load_elwidthed(rd, rs, 16, false)
...
...
function LQ(rd, rs):
load_elwidthed(rd, rs, 128, false)
# returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
      elwidth = int_csr[rs].elwidth
      bitwidth = bw(elwidth);
      elsperblock = max(1, opwidth / bitwidth)
      srcbase = ireg[rs+i/(elsperblock)];
      offs = i % elsperblock;
      return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
    function load_elwidthed(rd, rs, opwidth, unsigned):
      destwid = int_csr[rd].elwidth # destination element width
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        val = load_memory(rs, imm, i, opwidth)
        # sign/zero-extend (or truncate) to the *destination* width
        if unsigned:
          val = zero_extend(val, bw(destwid))
        else:
          val = sign_extend(val, bw(destwid))
        set_polymorphed_reg(rd, bw(destwid), j, val)
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;

## Polymorphic elwidths on arithmetic operations

### add

Standard Scalar RV64 (xlen):

* RS1 @ xlen bits
* RS2 @ xlen bits
* add @ xlen bits
* RD @ xlen bits

Polymorphic variant:

* RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
* RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
* add @ max(rs1, rs2) bits
* RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
Note here that polymorphic add zero-extends its source operands,
where addw sign-extends.
### addw
The RV Specification specifically states that "W" variants of arithmetic
operations always produce 32-bit signed values. In a polymorphic
environment it is reasonable to assume that the signed aspect is
preserved, where it is the length of the operands and the result
that may be changed.
Standard Scalar RV64 (xlen):
* RS1 @ xlen bits
* RS2 @ xlen bits
* add @ xlen bits
* RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
Polymorphic variant:
* RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
* RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
* add @ max(rs1, rs2) bits
* RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
Note here that polymorphic addw sign-extends its source operands,
where add zero-extends.
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
Effectively however, both rs1 and rs2 are being sign-extended (or
truncated), whereas for add they are both zero-extended. This holds true
for all arithmetic operations ending with "W".
### addiw
Standard Scalar RV64I:
* RS1 @ xlen bits, truncated to 32-bit
* immed @ 12 bits, sign-extended to 32-bit
* add @ 32 bits
* RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
Polymorphic variant:
* RS1 @ rs1 bits
* immed @ 12 bits, sign-extend to max(rs1, 12) bits
* add @ max(rs1, 12) bits
* RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
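To make the polymorphic addw rules concrete, here is a c sketch
(poly_addw and sext are made-up helper names; widths are in bits). For
example, poly_addw(0xFF, 8, 0x0001, 16, 32) sign-extends 0xFF to -1, adds
1 at 16 bits, and stores 0 into a 32-bit destination element:

    #include <stdint.h>

    // Sign-extend "val" from "bits" wide up to 64 bits (illustrative helper)
    int64_t sext(uint64_t val, int bits) {
        int shift = 64 - bits;
        return (int64_t)(val << shift) >> shift;
    }

    // Polymorphic addw: sources sign-extended to max(rs1w, rs2w), added
    // at that width, result sign-extended (rdw > opw) or truncated
    // (rdw < opw) to the destination element width rdw.
    uint64_t poly_addw(uint64_t rs1, int rs1w,
                       uint64_t rs2, int rs2w, int rdw) {
        int opw = rs1w > rs2w ? rs1w : rs2w;             // operation width
        int64_t sum = sext(rs1, rs1w) + sext(rs2, rs2w); // add @ opw bits
        sum = sext((uint64_t)sum, opw);                  // opw-bit signed result
        uint64_t mask = (rdw == 64) ? ~0ULL : ((1ULL << rdw) - 1);
        return (uint64_t)sum & mask;
    }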
# Predication Element Zeroing
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming to be able to save power by avoiding a register read on elements
that are passed through en-masse through the ALU. Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs, entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(i.e. whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
## Single-predication (based on destination register)
Zeroing on predication for arithmetic operations is taken from
the destination register's predicate. i.e. the predication *and*
zeroing settings to be applied to the whole operation come from the
CSR Predication table entry for the destination register.
Thus when zeroing is set on predication of a destination element,
if the predication bit is clear, then the destination element is *set*
to zero (twin-predication is slightly different, and will be covered
next).
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:

    for (i = 0; i < VL; i++)
      if not zeroing: # an optimisation
        while (!(predval & 1<<i)) i++;
        if (i == VL) break;
      if (predval & 1<<i) # predication uses intregs
        ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
        if (!int_vec[rd].isvector) break;
      else if zeroing:
        ireg[rd+id] = 0 # masked-out element is zeroed
      if (int_vec[rd ].isvector) { id += 1; }
      if (int_vec[rs1].isvector) { irs1 += 1; }
      if (int_vec[rs2].isvector) { irs2 += 1; }
# Example Programs

## strncpy
RVV version:
strncpy:
mv a3, a0 # Copy dst
loop:
setvli x0, a2, vint8 # Vectors of bytes.
vlbff.v v1, (a1) # Get src bytes
vseq.vi v0, v1, 0 # Flag zero bytes
vmfirst a4, v0 # Zero found?
vmsif.v v0, v0 # Set mask up to and including zero byte.
vsb.v v1, (a3), v0.t # Write out bytes
bgez a4, exit # Done
csrr t1, vl # Get number of bytes fetched
add a1, a1, t1 # Bump src pointer
sub a2, a2, t1 # Decrement count.
add a3, a3, t1 # Bump dst pointer
bnez a2, loop # Anymore?
exit:
ret
SV version (WIP):
strncpy:
mv a3, a0
SETMVLI 8 # set max vector to 8
RegCSR[a3] = 8bit, a3, scalar
RegCSR[a1] = 8bit, a1, scalar
RegCSR[t0] = 8bit, t0, vector
PredTb[t0] = ffirst, x0, inv
loop:
SETVLI a2, t4 # t4 and VL now 1..8
ldb t0, (a1) # t0 fail first mode
bne t0, x0, allnonzero # still ff
# VL points to last nonzero
GETVL t4 # from bne tests
addi t4, t4, 1 # include zero
SETVL t4 # set exactly to t4
stb t0, (a3) # store incl zero
ret # end subroutine
allnonzero:
stb t0, (a3) # VL legal range
GETVL t4 # from bne tests
add a1, a1, t4 # Bump src pointer
sub a2, a2, t4 # Decrement count.
add a3, a3, t4 # Bump dst pointer
bnez a2, loop # Anymore?
exit:
ret
Notes:
* Setting MVL to 8 is just an example. If enough registers are spare it
may be set to XLEN which will require a bank of 8 scalar registers for
a1, a3 and t0.
* obviously if that is done, t0 is not separated by 8 full registers, and
would overwrite t1 thru t7. x80 would work well, as an example, instead.
* with the exception of the GETVL (a pseudo code alias for csrr), every
single instruction above may use RVC.
* RVC C.BNEZ can be used because rs1' may be extended to the full 128
registers through redirection
* RVC C.LW and C.SW may be used because the W format may be overridden by
the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
* with the exception of the GETVL, all Vector Context may be done in
VBLOCK form.
* setting predication to x0 (zero) and invert on t0 is a trick to enable
just ffirst on t0
* ldb and bne are both using t0, both in ffirst mode
* t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
vectorised, no (un)sign-extension or truncation" mode.
* ldb will end on illegal mem, reduce VL, but will have copied all sorts
of stuff into t0 (which could contain zeros).
* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the
compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
the zero.
* SETVL sets *exactly* the requested amount into VL.
* the SETVL just after the bne (in the fall-through path) is needed in
case the ldb ffirst activates but the bne ffirst does not.
* this would cause the stb to copy up to the end of the legal memory
* of course, on the next loop the ldb would throw a trap, as a1 now
points to the first illegal mem location.
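For reference, a plain-c equivalent of what the loop above computes (note:
unlike ISO strncpy, both the SV example and this sketch stop after storing
the first zero byte rather than zero-padding the remainder):

    #include <stddef.h>

    // Copy up to n bytes, stopping after (and including) the first zero
    // byte, exactly as the ffirst-predicated ldb/bne/stb sequence does,
    // 8 bytes at a time.
    char *strncpy_ffirst(char *dst, const char *src, size_t n) {
        char *d = dst;
        while (n--) {
            char c = *src++;
            *d++ = c;
            if (c == 0)
                break;       // zero found: stored, then stop (the "ret")
        }
        return dst;
    }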
## strcpy
RVV version:
mv a3, a0 # Save start
loop:
setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
vldbff.v v1, (a3) # Get bytes
csrr a1, vl # Get bytes actually read e.g. if fault
vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
add a3, a3, a1 # Bump pointer
vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
bltz a2, loop # Not found?
add a0, a0, a1 # Sum start + bump
add a3, a3, a2 # Add index of zero byte
sub a0, a3, a0 # Subtract start address+bump
ret
## DAXPY
[[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]