--- /dev/null
+# Simple-V (Parallelism Extension Proposal) Appendix
+
+* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
+* Status: DRAFTv0.6
+* Last edited: 25 jun 2019
+* main spec [[specification]]
+
+[[!toc ]]
+
+# Element bitwidth polymorphism <a name="elwidth"></a>
+
+Element bitwidth is best covered as its own special section, as it
+is quite involved and applies uniformly across-the-board. SV restricts
+bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
+
+The effect of setting an element bitwidth is to re-cast each entry
+in the register table, and for all memory operations involving
+load/stores of certain specific sizes, to a completely different width.
+Thus, in C-style terms, on an RV64 architecture, each register effectively
+now looks like this:
+
+ typedef union {
+ uint8_t b[8];
+ uint16_t s[4];
+ uint32_t i[2];
+ uint64_t l[1];
+ } reg_t;
+
+ // integer table: assume maximum SV 7-bit regfile size
+ reg_t int_regfile[128];
+
+where the CSR Register table entry (not the instruction alone) determines
+which of those union entries is to be used on each operation, and the
+VL element offset in the hardware-loop specifies the index into each array.
+
+However a naive interpretation of the data structure above masks the
+fact that when VL is set greater than 8 (for example) and the bitwidth is 8,
+access to one specific register "spills over" to the following parts of
+the register file in a sequential fashion. So a much more accurate way
+to reflect this would be:
+
+ typedef union {
+ uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
+ uint8_t b[0]; // array of type uint8_t
+ uint16_t s[0];
+ uint32_t i[0];
+ uint64_t l[0];
+ uint128_t d[0];
+ } reg_t;
+
+ reg_t int_regfile[128];
+
+where, when accessing any individual regfile[n].b entry, it is permitted
+(in C) to arbitrarily over-run the *declared* length of the array (zero),
+and thus "overspill" into consecutive register file entries, in a fashion
+that is completely transparent to a greatly-simplified software / pseudo-code
+representation.
+It is however critical to note that it is clearly the responsibility of
+the implementor to ensure that, towards the end of the register file,
+an exception is thrown if an access beyond the "real" register
+bytes is ever attempted.
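+
+To make the "overspill" and bounds-check behaviour concrete, here is a
+minimal C sketch (an illustration for a software model, not part of the
+specification): the regfile is held as one flat byte array, and the
+hypothetical helper `regfile_byte_ptr` computes the byte index of an
+element, aborting (standing in for raising an exception) if the access
+would run off the end of the register file.
+
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    #define NREGS    128
+    #define REGBYTES 8          /* RV64: 8 bytes per register */
+
+    /* flat byte view of the regfile: "overspill" falls out naturally */
+    static uint8_t int_regfile[NREGS * REGBYTES];
+
+    /* pointer to element "offset" of width "ew" bytes starting at register
+       "reg"; abort if the access runs off the end of the register file */
+    static uint8_t *regfile_byte_ptr(int reg, int ew, int offset)
+    {
+        size_t byteidx = (size_t)reg * REGBYTES + (size_t)offset * ew;
+        if (byteidx + ew > sizeof(int_regfile)) {
+            fprintf(stderr, "illegal access beyond end of regfile\n");
+            abort();            /* stands in for throwing an exception */
+        }
+        return &int_regfile[byteidx];
+    }
+
+    int main(void)
+    {
+        /* elwidth=8: element 9 of "register" 5 lands in byte 1 of x6 */
+        *regfile_byte_ptr(5, 1, 9) = 0x7f;
+        printf("%02x\n", int_regfile[6 * REGBYTES + 1]);  /* prints 7f */
+        return 0;
+    }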
+
+Now we may modify the pseudo-code of an operation where all element bitwidths
+have been set to the same size, where this pseudo-code is otherwise identical
+to its non-polymorphic version (above):
+
+ function op_add(rd, rs1, rs2) # add not VADD!
+ ...
+ ...
+ for (i = 0; i < VL; i++)
+ ...
+ ...
+ // TODO, calculate if over-run occurs, for each elwidth
+ if (elwidth == 8) {
+         int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
+                                  int_regfile[rs2].b[irs2];
+ } else if elwidth == 16 {
+ int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
+ int_regfile[rs2].s[irs2];
+ } else if elwidth == 32 {
+ int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
+ int_regfile[rs2].i[irs2];
+ } else { // elwidth == 64
+ int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
+ int_regfile[rs2].l[irs2];
+ }
+ ...
+ ...
+
+So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
+registers following on sequentially from them, respectively) are "type-cast"
+to 8-bit; for 16-bit entries likewise, and so on.
+
+However that only covers the case where the element widths are the same.
+Where the element widths are different, the following algorithm applies:
+
+* Analyse the bitwidth of all source operands and work out the
+ maximum. Record this as "maxsrcbitwidth"
+* If any given source operand requires sign-extension or zero-extension
+  (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
+  sign-extension / zero-extension or whatever is specified in the standard
+  RV specification, **change** that to sign-extending from the respective
+  individual source operand's bitwidth from the CSR table out to
+  "maxsrcbitwidth" (previously calculated).
+* Following separate and distinct (optional) sign/zero-extension of all
+ source operands as specifically required for that operation, carry out the
+ operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
+ this may be a "null" (copy) operation, and that with FCVT, the changes
+  to the source and destination bitwidths may also turn FCVT effectively
+ into a copy).
+* If the destination operand requires sign-extension or zero-extension,
+ instead of a mandatory fixed size (typically 32-bit for arithmetic,
+  for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
+ etc.), overload the RV specification with the bitwidth from the
+ destination register's elwidth entry.
+* Finally, store the (optionally) sign/zero-extended value into its
+ destination: memory for sb/sw etc., or an offset section of the register
+ file for an arithmetic operation.
+
+In this way, polymorphic bitwidths are achieved without requiring a
+massive 64-way permutation of calculations **per opcode**, for example
+(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
+rd bitwidths). The pseudo-code is therefore as follows:
+
+ typedef union {
+ uint8_t b;
+ uint16_t s;
+ uint32_t i;
+ uint64_t l;
+ } el_reg_t;
+
+ bw(elwidth):
+ if elwidth == 0: return xlen
+ if elwidth == 1: return 8
+ if elwidth == 2: return 16
+ // elwidth == 3:
+ return 32
+
+ get_max_elwidth(rs1, rs2):
+ return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
+ bw(int_csr[rs2].elwidth)) # again XLEN if no entry
+
+ get_polymorphed_reg(reg, bitwidth, offset):
+ el_reg_t res;
+ res.l = 0; // TODO: going to need sign-extending / zero-extending
+      if bitwidth == 8:
+        res.b = int_regfile[reg].b[offset]
+      elif bitwidth == 16:
+        res.s = int_regfile[reg].s[offset]
+      elif bitwidth == 32:
+        res.i = int_regfile[reg].i[offset]
+      elif bitwidth == 64:
+        res.l = int_regfile[reg].l[offset]
+ return res
+
+ set_polymorphed_reg(reg, bitwidth, offset, val):
+ if (!int_csr[reg].isvec):
+ # sign/zero-extend depending on opcode requirements, from
+ # the reg's bitwidth out to the full bitwidth of the regfile
+ val = sign_or_zero_extend(val, bitwidth, xlen)
+ int_regfile[reg].l[0] = val
+ elif bitwidth == 8:
+ int_regfile[reg].b[offset] = val
+ elif bitwidth == 16:
+ int_regfile[reg].s[offset] = val
+ elif bitwidth == 32:
+ int_regfile[reg].i[offset] = val
+ elif bitwidth == 64:
+ int_regfile[reg].l[offset] = val
+
+ maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
+  destwid = bw(int_csr[rd].elwidth) # destination element width
+ for (i = 0; i < VL; i++)
+ if (predval & 1<<i) # predication uses intregs
+ // TODO, calculate if over-run occurs, for each elwidth
+ src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
+ // TODO, sign/zero-extend src1 and src2 as operation requires
+ if (op_requires_sign_extend_src1)
+ src1 = sign_extend(src1, maxsrcwid)
+ src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
+ result = src1 + src2 # actual add here
+ // TODO, sign/zero-extend result, as operation requires
+ if (op_requires_sign_extend_dest)
+ result = sign_extend(result, maxsrcwid)
+ set_polymorphed_reg(rd, destwid, ird, result)
+ if (!int_vec[rd].isvector) break
+ if (int_vec[rd ].isvector) { id += 1; }
+ if (int_vec[rs1].isvector) { irs1 += 1; }
+ if (int_vec[rs2].isvector) { irs2 += 1; }
+
+Whilst the specifics of the sign-extension and zero-extension pseudocode
+calls are left out, due to each operation being different, the above should
+make clear that:
+
+* the source operands are extended out to the maximum bitwidth of all
+ source operands
+* the operation takes place at that maximum source bitwidth (the
+ destination bitwidth is not involved at this point, at all)
+* the result is extended (or potentially even, truncated) before being
+ stored in the destination. i.e. truncation (if required) to the
+ destination width occurs **after** the operation **not** before.
+* when the destination is not marked as "vectorised", the **full**
+ (standard, scalar) register file entry is taken up, i.e. the
+ element is either sign-extended or zero-extended to cover the
+ full register bitwidth (XLEN) if it is not already XLEN bits long.
+
+Implementors are entirely free to optimise the above, particularly
+if it is specifically known that any given operation will complete
+accurately in less bits, as long as the results produced are
+directly equivalent and equal, for all inputs and all outputs,
+to those produced by the above algorithm.
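+
+As an illustration of the rules above (a sketch only, with hypothetical
+helper names, covering just the zero-extending case used by e.g. add), the
+following self-contained C program widens both sources to the maximum source
+width, operates at that width, and then truncates (or implicitly
+zero-extends) to the destination element width:
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* keep only the low "width" bits of v (width <= 64) */
+    static uint64_t trunc_bits(uint64_t v, int width)
+    {
+        return width >= 64 ? v : (v & ((1ULL << width) - 1));
+    }
+
+    /* one element of a polymorphic unsigned add: sources at srcwid1/srcwid2
+       bits, destination element at dstwid bits */
+    static uint64_t poly_add_elem(uint64_t rs1, int srcwid1,
+                                  uint64_t rs2, int srcwid2, int dstwid)
+    {
+        int maxsrcwid = srcwid1 > srcwid2 ? srcwid1 : srcwid2;
+        uint64_t a = trunc_bits(rs1, srcwid1);   /* zero-extend to maxsrcwid */
+        uint64_t b = trunc_bits(rs2, srcwid2);   /* (upper bits implicitly 0) */
+        uint64_t result = trunc_bits(a + b, maxsrcwid); /* op at maxsrcwid */
+        return trunc_bits(result, dstwid);       /* truncate/extend to dest */
+    }
+
+    int main(void)
+    {
+        /* 8-bit 0xFF + 16-bit 0x0001 at 16 bits = 0x0100, 32-bit destination */
+        printf("%08llx\n",
+               (unsigned long long)poly_add_elem(0xFF, 8, 1, 16, 32));
+        return 0;
+    }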
+
+## Polymorphic floating-point operation exceptions and error-handling
+
+For floating-point operations, conversion takes place without
+raising any kind of exception. Exactly as specified in the standard
+RV specification, NaN (or the appropriate value) is stored if the result
+is beyond the range of the destination; and, exactly as with standard
+RV scalar operations, the floating-point flag is raised (in FCSR).
+Again, just as with scalar operations, it is software's responsibility
+to check this flag.
+Given that the FCSR flags are "accrued", the fact that multiple element
+operations could have occurred is not a problem.
+
+Note that it is perfectly legitimate for floating-point bitwidths of
+only 8 to be specified. However whilst it is possible to apply IEEE 754
+principles, no actual standard yet exists. Implementors wishing to
+provide hardware-level 8-bit support rather than throw a trap to emulate
+in software should contact the author of this specification before
+proceeding.
+
+## Polymorphic shift operators
+
+A special note is needed for changing the element width of left and right
+shift operators, particularly right-shift. Even for standard RV base,
+in order for correct results to be returned, the second operand RS2 must
+be truncated to be within the range of RS1's bitwidth. spike's implementation
+of sll for example is as follows:
+
+ WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
+
+which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
+range 0..31 so that RS1 will only be left-shifted by the amount that
+is possible to fit into a 32-bit register. Whilst this appears not
+to matter for hardware, it matters greatly in software implementations,
+and it also matters where an RV64 system is set to "RV32" mode, such
+that the underlying registers RS1 and RS2 comprise 64 hardware bits
+each.
+
+For SV, where each operand's element bitwidth may be over-ridden, the
+rule about determining the operation's bitwidth *still applies*, being
+defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
+**also applies to the truncation of RS2**. In other words, *after*
+determining the maximum bitwidth, RS2's range must **also be truncated**
+to ensure a correct answer. Example:
+
+* RS1 is over-ridden to a 16-bit width
+* RS2 is over-ridden to an 8-bit width
+* RD is over-ridden to a 64-bit width
+* the maximum bitwidth is thus determined to be 16-bit: max(8, 16)
+* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
+
+Pseudocode (in spike) for this example would therefore be:
+
+ WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
+
+This example illustrates that considerable care therefore needs to be
+taken to ensure that left and right shift operations are implemented
+correctly. The key is that:
+
+* The operation bitwidth is determined by the maximum bitwidth
+ of the *source registers*, **not** the destination register bitwidth
+* The result is then sign-extended (or truncated) as appropriate.
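+
+A minimal C sketch of the rule (illustrative names; plain truncation is
+assumed for the final step, with sign-extension applied instead wherever the
+operation's rules require it):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    static uint64_t lowbits(uint64_t v, int w)
+    {
+        return w >= 64 ? v : (v & ((1ULL << w) - 1));
+    }
+
+    /* one element of a polymorphic SLL: the shift amount in rs2 is masked to
+       the operation width, the maximum of the two *source* element widths */
+    static uint64_t poly_sll_elem(uint64_t rs1, int rs1wid,
+                                  uint64_t rs2, int rs2wid, int rdwid)
+    {
+        int opwid = rs1wid > rs2wid ? rs1wid : rs2wid;
+        uint64_t a = lowbits(rs1, rs1wid);            /* zero-extend RS1 */
+        uint64_t shamt = rs2 & (uint64_t)(opwid - 1); /* RS2 & (opwid-1) */
+        uint64_t result = lowbits(a << shamt, opwid); /* operate at opwid */
+        /* truncate to the destination width (or sign-extend, as required) */
+        return lowbits(result, rdwid);
+    }
+
+    int main(void)
+    {
+        /* RS1 16-bit, RS2 8-bit, RD 64-bit: opwid = max(8,16) = 16, so a
+           shift amount of 20 is masked down to 20 & (16-1) = 4 */
+        printf("0x%llx\n",
+               (unsigned long long)poly_sll_elem(1, 16, 20, 8, 64));
+        return 0;
+    }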
+
+## Polymorphic MULH/MULHU/MULHSU
+
+MULH is designed to take the top half MSBs of a multiply that
+does not fit within the range of the source operands, such that
+smaller width operations may produce a full double-width multiply
+in two cycles. The issue is: SV allows the source operands to
+have variable bitwidth.
+
+Here again special attention has to be paid to the rules regarding
+bitwidth, which, again, are that the operation is performed at
+the maximum bitwidth of the **source** registers. Therefore:
+
+* An 8-bit x 8-bit multiply will create a 16-bit result that must
+ be shifted down by 8 bits
+* A 16-bit x 8-bit multiply will create a 24-bit result that must
+ be shifted down by 16 bits (top 8 bits being zero)
+* A 16-bit x 16-bit multiply will create a 32-bit result that must
+ be shifted down by 16 bits
+* A 32-bit x 16-bit multiply will create a 48-bit result that must
+ be shifted down by 32 bits
+* A 32-bit x 8-bit multiply will create a 40-bit result that must
+ be shifted down by 32 bits
+
+So again, just as with shift-left and shift-right, the result
+is shifted down by the maximum of the two source register bitwidths.
+And, exactly again, truncation or sign-extension is performed on the
+result. If sign-extension is to be carried out, it is performed
+from the same maximum of the two source register bitwidths out
+to the result element's bitwidth.
+
+If truncation occurs, i.e. the top MSBs of the result are lost,
+this is "Officially Not Our Problem", i.e. it is assumed that the
+programmer actually desires the result to be truncated. If the
+programmer wanted all of the bits, they would have set the destination
+elwidth to accommodate them.
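+
+A minimal C sketch of the MULHU case (illustrative only, and valid for
+source widths up to 32 bits so that the full product fits in a uint64_t):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* one element of a polymorphic MULHU: multiply at the maximum of the two
+       source element widths, then take the top half by shifting down by that
+       same maximum width (destination truncation/extension follows as usual) */
+    static uint64_t poly_mulhu_elem(uint64_t rs1, int rs1wid,
+                                    uint64_t rs2, int rs2wid)
+    {
+        int maxsrcwid = rs1wid > rs2wid ? rs1wid : rs2wid;
+        uint64_t a = rs1 & ((1ULL << rs1wid) - 1);  /* zero-extend sources */
+        uint64_t b = rs2 & ((1ULL << rs2wid) - 1);
+        uint64_t product = a * b;      /* fits in 64 bits for widths <= 32 */
+        return product >> maxsrcwid;   /* top half of the product          */
+    }
+
+    int main(void)
+    {
+        /* 8-bit x 8-bit: 0xFF * 0xFF = 0xFE01, so MULHU returns 0xFE */
+        printf("0x%llx\n",
+               (unsigned long long)poly_mulhu_elem(0xFF, 8, 0xFF, 8));
+        return 0;
+    }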
+
+## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
+
+Polymorphic element widths in vectorised form means that the data
+being loaded (or stored) across multiple registers needs to be treated
+(reinterpreted) as a contiguous stream of elwidth-wide items, where
+the source register's element width is **independent** from the destination's.
+
+This makes for a slightly more complex algorithm when using indirection
+on the "addressed" register (source for LOAD and destination for STORE),
+particularly given that the LOAD/STORE instruction provides important
+information about the width of the data to be reinterpreted.
+
+Let's illustrate the "load" part, where the pseudo-code for elwidth=default
+was as follows, where i is the loop index from 0 to VL-1:
+
+ srcbase = ireg[rs+i];
+ return mem[srcbase + imm]; // returns XLEN bits
+
+Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
+chunks are taken from the source memory location addressed by the current
+indexed source address register, and only when a full 32-bits-worth
+are taken will the index be moved on to the next contiguous source
+address register:
+
+ bitwidth = bw(elwidth); // source elwidth from CSR reg entry
+ elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
+ srcbase = ireg[rs+i/(elsperblock)]; // integer divide
+ offs = i % elsperblock; // modulo
+ return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
+
+Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
+and 128 for LQ.
+
+The principle is basically exactly the same as if the srcbase were pointing
+at the memory of the *register* file: memory is re-interpreted as containing
+groups of elwidth-wide discrete elements.
+
+When storing the result from a load, it's important to respect the fact
+that the destination register has its *own separate element width*. Thus,
+when each element is loaded (at the source element width), any sign-extension
+or zero-extension (or truncation) needs to be done to the *destination*
+bitwidth. Also, the storing has the exact same analogous algorithm as
+above, where in fact it is just the set\_polymorphed\_reg pseudocode
+(completely unchanged) used above.
+
+One issue remains: when the source element width is **greater** than
+the width of the operation, it is obvious that a single LB for example
+cannot possibly obtain 16-bit-wide data. This condition may be detected
+where, when using integer divide, elsperblock (the width of the LOAD
+divided by the bitwidth of the element) is zero.
+
+The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
+
+    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
+
+The elements, if the element bitwidth is larger than the LD operation's
+size, will then be sign/zero-extended to the full LD operation size, as
+specified by the LOAD (LDU instead of LD, LBU instead of LB), before
+being passed on to the second phase.
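+
+A small self-contained C sketch of just the address-selection arithmetic
+described above, including the max(1, ...) clamp (the helper name
+load_addr_calc is purely illustrative):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* for loop index i, compute which indexed source-address register is
+       used and the element offset at that address, for a LOAD of "opwidth"
+       bits whose source elwidth is "bitwidth" bits */
+    static void load_addr_calc(int i, int opwidth, int bitwidth,
+                               int *regstep, int *offs)
+    {
+        int elsperblock = opwidth / bitwidth;
+        if (elsperblock < 1)
+            elsperblock = 1;    /* max(1, ...): elwidth wider than the LOAD */
+        *regstep = i / elsperblock; /* added to rs: next contiguous addr reg */
+        *offs    = i % elsperblock; /* element offset at that address        */
+    }
+
+    int main(void)
+    {
+        /* LD (64-bit) with a 16-bit source elwidth: 4 elements per addr reg */
+        for (int i = 0; i < 7; i++) {
+            int regstep, offs;
+            load_addr_calc(i, 64, 16, &regstep, &offs);
+            printf("element %d: ireg[rs+%d], offset %d\n", i, regstep, offs);
+        }
+        return 0;
+    }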
+
+As LOAD/STORE may be twin-predicated, it is important to note that
+the rules on twin predication still apply, except where in previous
+pseudo-code (elwidth=default for both source and target) it was
+the *registers* that the predication was applied to, it is now the
+**elements** that the predication is applied to.
+
+Thus the full pseudocode for all LD operations may be written out
+as follows:
+
+ function LBU(rd, rs):
+ load_elwidthed(rd, rs, 8, true)
+ function LB(rd, rs):
+ load_elwidthed(rd, rs, 8, false)
+ function LH(rd, rs):
+ load_elwidthed(rd, rs, 16, false)
+ ...
+ ...
+ function LQ(rd, rs):
+ load_elwidthed(rd, rs, 128, false)
+
+ # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
+ function load_memory(rs, imm, i, opwidth):
+ elwidth = int_csr[rs].elwidth
+ bitwidth = bw(elwidth);
+        elsperblock = max(1, opwidth / bitwidth)
+ srcbase = ireg[rs+i/(elsperblock)];
+ offs = i % elsperblock;
+ return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
+
+ function load_elwidthed(rd, rs, opwidth, unsigned):
+        destwid = bw(int_csr[rd].elwidth) # destination element width
+        bitwidth = bw(int_csr[rs].elwidth) # source element width
+ rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+ rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+ ps = get_pred_val(FALSE, rs); # predication on src
+ pd = get_pred_val(FALSE, rd); # ... AND on dest
+ for (int i = 0, int j = 0; i < VL && j < VL;):
+ if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+ if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+ val = load_memory(rs, imm, i, opwidth)
+ if unsigned:
+ val = zero_extend(val, min(opwidth, bitwidth))
+ else:
+ val = sign_extend(val, min(opwidth, bitwidth))
+            set_polymorphed_reg(rd, destwid, j, val)
+ if (int_csr[rs].isvec) i++;
+ if (int_csr[rd].isvec) j++; else break;
+
+Note:
+
+* when comparing against for example the twin-predicated c.mv
+ pseudo-code, the pattern of independent incrementing of rd and rs
+ is preserved unchanged.
+* just as with the c.mv pseudocode, zeroing is not included and must be
+ taken into account (TODO).
+* that due to the use of a twin-predication algorithm, LOAD/STORE also
+ take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
+ VSCATTER characteristics.
+* that due to the use of the same set\_polymorphed\_reg pseudocode,
+ a destination that is not vectorised (marked as scalar) will
+ result in the element being fully sign-extended or zero-extended
+ out to the full register file bitwidth (XLEN). When the source
+ is also marked as scalar, this is how the compatibility with
+ standard RV LOAD/STORE is preserved by this algorithm.
+
+### Example Tables showing LOAD elements
+
+This section contains examples of vectorised LOAD operations, showing
+how the two stage process works (three if zero/sign-extension is included).
+
+
+#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
+
+This is:
+
+* a 64-bit load, with an offset of zero
+* with a source-address elwidth of 16-bit
+* into a destination-register with an elwidth of 32-bit
+* where VL=7
+* from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
+* RV64, where XLEN=64 is assumed.
+
+First, the memory table: because the element width is 16 and the operation
+is LD (64-bit), the 64 bits loaded from memory are subdivided into groups
+of **four** elements.
+And, with VL being 7 (deliberately, to illustrate that this is reasonable
+and possible), the first four are sourced from the offset address pointed
+to by x5, and the next three from the offset address pointed to by
+the next contiguous register, x6:
+
+[[!table data="""
+addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
+@x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
+@x6 | elem 4 || elem 5 || elem 6 || not loaded ||
+"""]]
+
+Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
+the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
+
+[[!table data="""
+byte 3 | byte 2 | byte 1 | byte 0 |
+0x0 | 0x0 | elem0 ||
+0x0 | 0x0 | elem1 ||
+0x0 | 0x0 | elem2 ||
+0x0 | 0x0 | elem3 ||
+0x0 | 0x0 | elem4 ||
+0x0 | 0x0 | elem5 ||
+0x0 | 0x0 | elem6 ||
+"""]]
+
+Lastly, the elements are stored in contiguous blocks, as if x8 was also
+byte-addressable "memory". That "memory" happens to cover registers
+x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
+
+[[!table data="""
+reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
+x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
+x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
+x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
+x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
+"""]]
+
+Thus we have data that is loaded from the **addresses** pointed to by
+x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
+x8 through to half of x11.
+The end result is that elements 0 and 1 end up in x8, with element 1 being
+shifted up 32 bits, and so on, until finally element 6 is in the
+LSBs of x11.
+
+Note that whilst the memory addressing table is shown left-to-right byte order,
+the registers are shown in right-to-left (MSB) order. This does **not**
+imply that bit or byte-reversal is carried out: it's just easier to visualise
+memory as being contiguous bytes, and emphasises that registers are not
+really actually "memory" as such.
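+
+The worked example above can be reproduced with the following small C sketch
+(element values are arbitrary placeholders; x11 is pre-set to an arbitrary
+value purely to show that its top 32 bits are preserved):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    int main(void)
+    {
+        /* the 7 16-bit elements loaded via x5/x6 (placeholder values) */
+        uint16_t elem[7] = {0x1111, 0x2222, 0x3333, 0x4444,
+                            0x5555, 0x6666, 0x7777};
+        /* destination "registers" x8..x11 */
+        uint64_t xreg[4] = {0, 0, 0, 0xDEADBEEFCAFEF00DULL};
+
+        for (int j = 0; j < 7; j++) {
+            uint32_t widened = elem[j];      /* zero-extend 16 -> 32        */
+            int regidx = j / 2;              /* two 32-bit elements per reg */
+            int shift  = (j % 2) * 32;
+            xreg[regidx] &= ~(0xFFFFFFFFULL << shift); /* clear only that lane */
+            xreg[regidx] |= (uint64_t)widened << shift; /* other bits untouched */
+        }
+        for (int r = 0; r < 4; r++)          /* x11 top half stays DEADBEEF */
+            printf("x%d = %016llx\n", r + 8, (unsigned long long)xreg[r]);
+        return 0;
+    }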
+
+## Why SV bitwidth specification is restricted to 4 entries
+
+The four entries for SV element bitwidths only allow three over-rides:
+
+* 8 bit
+* 16 bit
+* 32 bit
+
+This would seem inadequate: surely it would be better to have 3 bits or
+more and allow 64, 128 and some other options besides? The answer here
+is that it gets too complex, no RV128 implementation yet exists, and
+RV64's default is already 64-bit, so the 4 major element widths are
+covered anyway.
+
+There is an absolutely crucial aspect of SV here that explicitly
+needs spelling out, and it's whether the "vectorised" bit is set in
+the Register's CSR entry.
+
+If "vectorised" is clear (not set), this indicates that the operation
+is "scalar". Under these circumstances, when an elwidth override is set
+on a destination (RD), sign-extension and zero-extension, whilst changed
+to match the override bitwidth, will overwrite the **full** register entry
+(64-bit if RV64).
+
+When vectorised is *set*, this indicates that the operation now treats
+**elements** as if they were independent registers, so regardless of
+the length, any parts of a given actual register that are not involved
+in the operation are **NOT** modified, but are **PRESERVED**.
+
+For example:
+
+* when the vector bit is clear and elwidth set to 16 on the destination
+ register, operations are truncated to 16 bit and then sign or zero
+ extended to the *FULL* XLEN register width.
+* when the vector bit is set, elwidth is 16 and VL=1 (or other value where
+ groups of elwidth sized elements do not fill an entire XLEN register),
+ the "top" bits of the destination register do *NOT* get modified, zero'd
+ or otherwise overwritten.
+
+SIMD micro-architectures may implement this by using predication on
+any elements in a given actual register that are beyond the end of
+the multi-element operation.
+
+Other microarchitectures may choose to provide byte-level write-enable
+lines on the register file, such that each 64 bit register in an RV64
+system requires 8 WE lines. Scalar RV64 operations would require
+activation of all 8 lines, where SV elwidth based operations would
+activate the required subset of those byte-level write lines.
+
+Example:
+
+* rs1, rs2 and rd are all set to 8-bit
+* VL is set to 3
+* RV64 architecture is set (UXL=64)
+* add operation is carried out
+* bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
+ concatenated with similar add operations on bits 15..8 and 7..0
+* bits 24 through 63 **remain as they originally were**.
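+
+The byte-level write-enable idea can be expressed for this example as follows
+(a sketch, assuming a single destination register into which all written
+elements land; the helper name byte_we_mask is purely illustrative):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* byte-level write-enable mask for one RV64 destination register:
+       elwidth-wide elements, "nelems" of them written into this register
+       (the remaining bytes keep their old value) */
+    static uint8_t byte_we_mask(int elwidth_bytes, int nelems)
+    {
+        int bytes = elwidth_bytes * nelems;
+        return bytes >= 8 ? 0xFF : (uint8_t)((1u << bytes) - 1);
+    }
+
+    int main(void)
+    {
+        /* elwidth 8-bit, VL=3: only bytes 0..2 of rd are written, so the
+           write-enable mask is 0x07 and bytes 3..7 remain as they were */
+        printf("WE mask = 0x%02x\n", byte_we_mask(1, 3));
+        return 0;
+    }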
+
+Example SIMD micro-architectural implementation:
+
+* SIMD architecture works out the nearest round number of elements
+ that would fit into a full RV64 register (in this case: 8)
+* SIMD architecture creates a hidden predicate, binary 0b00000111
+ i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
+* SIMD architecture goes ahead with the add operation as if it
+ was a full 8-wide batch of 8 adds
+* SIMD architecture passes the top 5 elements through the adders
+  (which are "disabled" due to zero-bit predication)
+* SIMD architecture gets the top 5 8-bit elements back unmodified
+  and stores them in rd.
+
+This requires a read on rd, however this is required anyway in order
+to support non-zeroing mode.
+
+## Polymorphic floating-point
+
+Standard scalar RV integer operations base the register width on XLEN,
+which may be changed (MXL in misa, and the corresponding SXL and UXL
+fields in mstatus respectively). Integer LOAD, STORE and
+arithmetic operations are therefore restricted to an active XLEN bits,
+with sign or zero extension to pad out the upper bits when XLEN has
+been dynamically set to less than the actual register size.
+
+For scalar floating-point, the active (used / changed) bits are
+specified exclusively by the operation: ADD.S specifies an active
+32-bits, with the upper bits of the source registers needing to
+be all 1s ("NaN-boxed"), and the destination upper bits being
+*set* to all 1s (including on LOAD/STOREs).
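+
+For reference, the standard scalar NaN-boxing of a 32-bit value into a
+64-bit FP register can be expressed as follows (a sketch of the standard RV
+behaviour described above, not anything SV-specific):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* standard scalar RV NaN-boxing: a 32-bit single-precision result
+       written into a 64-bit FP register has its upper 32 bits set to all 1s */
+    static uint64_t nanbox32(uint32_t bits)
+    {
+        return 0xFFFFFFFF00000000ULL | (uint64_t)bits;
+    }
+
+    int main(void)
+    {
+        printf("%016llx\n", (unsigned long long)nanbox32(0x3F800000)); /* 1.0f */
+        return 0;
+    }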
+
+Where elwidth is set to default (on any source or the destination)
+it is obvious that this NaN-boxing behaviour can and should be
+preserved. When elwidth is non-default things are less obvious,
+so need to be thought through. Here is a normal (scalar) sequence,
+assuming an RV64 which supports Quad (128-bit) FLEN:
+
+* FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
+* ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
+* FSD stores lowest 64-bits from the 128-bit-wide register to memory:
+ top 64 MSBs ignored.
+
+Therefore it makes sense to mirror this behaviour when, for example,
+elwidth is set to 32. Assume elwidth set to 32 on all source and
+destination registers:
+
+* FLD loads 64-bit wide from memory as **two** 32-bit single-precision
+ floating-point numbers.
+* ADD.D performs **two** 32-bit-wide adds, storing one of the adds
+ in bits 0-31 and the second in bits 32-63.
+* FSD stores lowest 64-bits from the 128-bit-wide register to memory
+
+Here's the thing: it does not make sense to overwrite the top 64 MSBs
+of the registers either during the FLD **or** the ADD.D. The reason
+is that, effectively, the top 64 MSBs actually represent a completely
+independent 64-bit register, so overwriting it is not only gratuitous
+but may actually be harmful for a future extension to SV which may
+have a way to directly access those top 64 bits.
+
+The decision is therefore **not** to touch the upper parts of floating-point
+registers wherever elwidth is set to non-default values, including
+when "isvec" is false in a given register's CSR entry. Only when the
+elwidth is set to default **and** isvec is false will the standard
+RV behaviour be followed, namely that the upper bits be modified.
+
+Ultimately if elwidth is default and isvec false on *all* source
+and destination registers, a SimpleV instruction defaults completely
+to standard RV scalar behaviour (this holds true for **all** operations,
+right across the board).
+
+The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
+to a non-default value, are effectively all the same: they all still perform
+multiple ADD operations, just at different widths. A future extension
+to SimpleV may actually allow ADD.S to access the upper bits of the
+register, effectively breaking down a 128-bit register into a bank
+of 4 independently-accessible 32-bit registers.
+
+In the meantime, although when e.g. setting VL to 8 it would technically
+make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
+using ADD.Q may be an easy way to signal to the microarchitecture that
+it is to receive a higher VL value. On a superscalar OoO architecture
+there may be absolutely no difference; however, simpler SIMD-style
+microarchitectures may not have the infrastructure in place to know the
+difference, such that when VL=8 and an ADD.D instruction is issued, it
+completes in 2 cycles (or more) rather than one, whereas on such simpler
+microarchitectures an ADD.Q would complete in one.
+
+## Specific instruction walk-throughs
+
+This section covers walk-throughs of the above-outlined procedure
+for converting standard RISC-V scalar arithmetic operations to
+polymorphic widths, to ensure that it is correct.
+
+### add
+
+Standard Scalar RV32/RV64 (xlen):
+
+* RS1 @ xlen bits
+* RS2 @ xlen bits
+* add @ xlen bits
+* RD @ xlen bits
+
+Polymorphic variant:
+
+* RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
+* RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
+* add @ max(rs1, rs2) bits
+* RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
+
+Note here that polymorphic add zero-extends its source operands,
+where addw sign-extends.
+
+### addw
+
+The RV Specification specifically states that "W" variants of arithmetic
+operations always produce 32-bit signed values. In a polymorphic
+environment it is reasonable to assume that the signed aspect is
+preserved, where it is the length of the operands and the result
+that may be changed.
+
+Standard Scalar RV64 (xlen):
+
+* RS1 @ xlen bits
+* RS2 @ xlen bits
+* add @ xlen bits
+* RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
+
+Polymorphic variant:
+
+* RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
+* RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
+* add @ max(rs1, rs2) bits
+* RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
+
+Note here that polymorphic addw sign-extends its source operands,
+where add zero-extends.
+
+This requires a little more in-depth analysis. Where the bitwidth of
+rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
+only where the bitwidths of rs1 and rs2 differ that the
+lesser-width operand will be sign-extended.
+
+Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
+where for add they are both zero-extended. This holds true for all arithmetic
+operations ending with "W".
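+
+The difference can be seen in a small self-contained C example (illustrative
+values only; rs1 is an 8-bit element, rs2 a 16-bit element, so the operation
+width is max(8, 16) = 16 bits):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* sign-extend the low "width" bits of v to 64 bits */
+    static int64_t sext(uint64_t v, int width)
+    {
+        int shift = 64 - width;
+        return (int64_t)(v << shift) >> shift;
+    }
+
+    int main(void)
+    {
+        uint64_t rs1 = 0xFF, rs2 = 0x0001;
+
+        /* polymorphic add: both sources zero-extended to 16 bits */
+        uint64_t add_result = ((rs1 & 0xFF) + rs2) & 0xFFFF;       /* 0x0100 */
+
+        /* polymorphic addw: both sources sign-extended to 16 bits,
+           so the 8-bit 0xFF is treated as -1 */
+        uint64_t addw_result =
+            (uint64_t)(sext(rs1, 8) + sext(rs2, 16)) & 0xFFFF;     /* 0x0000 */
+
+        printf("add: 0x%04llx  addw: 0x%04llx\n",
+               (unsigned long long)add_result,
+               (unsigned long long)addw_result);
+        return 0;
+    }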
+
+### addiw
+
+Standard Scalar RV64I:
+
+* RS1 @ xlen bits, truncated to 32-bit
+* immed @ 12 bits, sign-extended to 32-bit
+* add @ 32 bits
+* RD @ xlen bits, sign-extend the 32-bit result to xlen.
+
+Polymorphic variant:
+
+* RS1 @ rs1 bits
+* immed @ 12 bits, sign-extend to max(rs1, 12) bits
+* add @ max(rs1, 12) bits
+* RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
+
+# Predication Element Zeroing
+
+The introduction of zeroing on traditional vector predication is usually
+intended as an optimisation for lane-based microarchitectures with register
+renaming to be able to save power by avoiding a register read on elements
+that are passed en-masse through the ALU. Simpler microarchitectures
+do not have this issue: they simply do not pass the element through to
+the ALU at all, and therefore do not store it back in the destination.
+More complex non-lane-based micro-architectures can, when zeroing is
+not set, use the predication bits to simply avoid sending element-based
+operations to the ALUs, entirely: thus, over the long term, potentially
+keeping all ALUs 100% occupied even when elements are predicated out.
+
+SimpleV's design principle is not based on or influenced by
+microarchitectural design factors: it is a hardware-level API.
+Therefore, looking purely at whether zeroing is *useful* or not,
+(whether fewer instructions are needed for certain scenarios),
+given that a case can be made for zeroing *and* non-zeroing, the
+decision was taken to add support for both.
+
+## Single-predication (based on destination register)
+
+Zeroing on predication for arithmetic operations is taken from
+the destination register's predicate. i.e. the predication *and*
+zeroing settings to be applied to the whole operation come from the
+CSR Predication table entry for the destination register.
+Thus when zeroing is set on predication of a destination element,
+if the predication bit is clear, then the destination element is *set*
+to zero (twin-predication is slightly different, and will be covered
+next).
+
+Thus the pseudo-code loop for a predicated arithmetic operation
+is modified to as follows:
+
+    for (i = 0; i < VL; i++)
+       if not zeroing: # an optimisation
+          while (!(predval & 1<<i) && i < VL)
+             if (int_vec[rd ].isvector)  { id += 1; }
+             if (int_vec[rs1].isvector)  { irs1 += 1; }
+             if (int_vec[rs2].isvector)  { irs2 += 1; }
+          if i == VL:
+             return
+       if (predval & 1<<i)
+          src1 = ....
+          src2 = ...
+          result = src1 + src2 # actual add (or other op) here
+          set_polymorphed_reg(rd, destwid, ird, result)
+          if int_vec[rd].ffirst and result == 0:
+             VL = i # result was zero, end loop early, return VL
+             return
+          if (!int_vec[rd].isvector) return
+       else if zeroing:
+          result = 0
+          set_polymorphed_reg(rd, destwid, ird, result)
+       if (int_vec[rd ].isvector)  { id += 1; }
+       else if (predval & 1<<i) return
+       if (int_vec[rs1].isvector)  { irs1 += 1; }
+       if (int_vec[rs2].isvector)  { irs2 += 1; }
+       if (id == VL or irs1 == VL or irs2 == VL): return
+
+The optimisation to skip elements entirely is only possible for certain
+micro-architectures when zeroing is not set. However for lane-based
+micro-architectures this optimisation may not be practical, as it
+implies that elements end up in different "lanes". Under these
+circumstances it is perfectly fine to simply have the lanes
+"inactive" for predicated elements, even though it results in
+less than 100% ALU utilisation.
+
+## Twin-predication (based on source and destination register)
+
+Twin-predication is not that much different, except that
+the source is independently zero-predicated from the destination.
+This means that the source may be zero-predicated *or* the
+destination zero-predicated *or both*, or neither.
+
+When, with twin-predication, zeroing is set on the source and not
+the destination, then if a source predicate bit is *clear*, a zero
+data element is passed through the operation (the exception being:
+if the source data element is to be treated as an address - a LOAD -
+then the data returned *from* the LOAD is zero, rather than looking up an
+*address* of zero).
+
+When zeroing is set on the destination and not the source, then just
+as with single-predicated operations, a zero is stored into the destination
+element (or target memory address for a STORE).
+
+Zeroing on both source and destination effectively results in a bitwise
+AND of the source and destination predicates: the result is that
+where either the source predicate OR the destination predicate is 0,
+a zero element will ultimately end up in the destination register.
+
+However: this may not necessarily be the case for all operations;
+implementors, particularly of custom instructions, clearly need to
+think through the implications in each and every case.
+
+Here is pseudo-code for a twin zero-predicated operation:
+
+ function op_mv(rd, rs) # MV not VMV!
+ rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+ rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+ ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
+ pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
+ for (int i = 0, int j = 0; i < VL && j < VL):
+ if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
+ if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
+       if ((pd & 1<<j))
+          if ((ps & 1<<i))
+ sourcedata = ireg[rs+i];
+ else
+ sourcedata = 0
+ ireg[rd+j] <= sourcedata
+ else if (zerodst)
+ ireg[rd+j] <= 0
+ if (int_csr[rs].isvec)
+ i++;
+ if (int_csr[rd].isvec)
+ j++;
+ else
+ if ((pd & 1<<j))
+ break;
+
+Note that in the instance where the destination is a scalar, the hardware
+loop is ended the moment a value *or a zero* is placed into the destination
+register/element. Also note that, for clarity, variable element widths
+have been left out of the above.
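+
+As a cross-check of the twin-predication zeroing rules, the following C
+sketch enumerates, for a single MV element with zeroing enabled on *both*
+the source and destination predicates, what ends up in the destination for
+each combination of predicate bits (mirroring the branches of the op_mv
+pseudocode above):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    int main(void)
+    {
+        /* MV with zeroing enabled on both predicates: real data reaches the
+           destination only where BOTH predicate bits are set; everywhere
+           else a zero element is written */
+        uint64_t src = 0x1234;
+        for (int ps = 0; ps <= 1; ps++)
+            for (int pd = 0; pd <= 1; pd++) {
+                uint64_t dest;
+                if (pd)
+                    dest = ps ? src : 0; /* source zeroing: pass zero through */
+                else
+                    dest = 0;            /* destination zeroing: write zero   */
+                printf("ps=%d pd=%d -> dest=0x%04llx\n",
+                       ps, pd, (unsigned long long)dest);
+            }
+        return 0;
+    }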
+
+# Subsets of RV functionality
+
+This section describes the differences when SV is implemented on top of
+different subsets of RV.
+
+## Common options
+
+It is permitted to only implement SVprefix and not the VBLOCK instruction
+format option, and vice-versa. UNIX Platforms **MUST** raise illegal
+instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
+traps may emulate the format.
+
+It is permitted in SVprefix to either not implement VL or not implement
+SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
+*MUST* raise illegal instruction on implementations that do not support
+VL or SUBVL.
+
+It is permitted to limit the size of either (or both) the register files
+down to the original size of the standard RV architecture. However, reducing
+below the mandatory limits set in the RV standard will result in
+non-compliance with the SV Specification.
+
+## RV32 / RV32F
+
+When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
+maximum limit for predication is also restricted to 32 bits. Whilst not
+actually specifically an "option" it is worth noting.
+
+## RV32G
+
+Normally in standard RV32 it does not make much sense to have
+RV32G. The critical instructions that are missing in standard RV32
+are those for moving data between the double-width floating-point
+registers and the integer ones, as well as the FCVT routines.
+
+In an earlier draft of SV, it was possible to specify an elwidth
+of double the standard register size: this had to be dropped,
+and may be reintroduced in future revisions.
+
+## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
+
+When floating-point is not implemented, the size of the User Register and
+Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
+per table).
+
+## RV32E
+
+In embedded scenarios the User Register and Predication CSRs may be
+dropped entirely, or optionally limited to 1 CSR, such that the combined
+number of entries from the M-Mode CSR Register table plus U-Mode
+CSR Register table is either 4 16-bit entries or (if the U-Mode is
+zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
+the Predication CSR tables.
+
+RV32E is the most likely candidate for simply detecting that registers
+are marked as "vectorised", and generating an appropriate exception
+for the VL loop to be implemented in software.
+
+## RV128
+
+RV128 has not been especially considered, here, however it has some
+extremely large possibilities: double the element width implies
+256-bit operands, spanning 2 128-bit registers each, and predication
+of total length 128 bit given that XLEN is now 128.
+
+# Example usage
+
+TODO evaluate strncpy and strlen
+<https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
+
+## strncpy
+
+RVV version: <a name="strncpy"></a>
+
+ strncpy:
+ mv a3, a0 # Copy dst
+ loop:
+ setvli x0, a2, vint8 # Vectors of bytes.
+ vlbff.v v1, (a1) # Get src bytes
+ vseq.vi v0, v1, 0 # Flag zero bytes
+ vmfirst a4, v0 # Zero found?
+      vmsif.v v0, v0          # Set mask up to and including zero byte.
+ vsb.v v1, (a3), v0.t # Write out bytes
+ bgez a4, exit # Done
+ csrr t1, vl # Get number of bytes fetched
+ add a1, a1, t1 # Bump src pointer
+ sub a2, a2, t1 # Decrement count.
+ add a3, a3, t1 # Bump dst pointer
+ bnez a2, loop # Anymore?
+
+ exit:
+ ret
+
+SV version (WIP):
+
+ strncpy:
+ mv a3, a0
+ SETMVLI 8 # set max vector to 8
+ RegCSR[a3] = 8bit, a3, scalar
+ RegCSR[a1] = 8bit, a1, scalar
+ RegCSR[t0] = 8bit, t0, vector
+ PredTb[t0] = ffirst, x0, inv
+ loop:
+ SETVLI a2, t4 # t4 and VL now 1..8
+ ldb t0, (a1) # t0 fail first mode
+ bne t0, x0, allnonzero # still ff
+ # VL points to last nonzero
+ GETVL t4 # from bne tests
+ addi t4, t4, 1 # include zero
+ SETVL t4 # set exactly to t4
+ stb t0, (a3) # store incl zero
+ ret # end subroutine
+ allnonzero:
+ stb t0, (a3) # VL legal range
+ GETVL t4 # from bne tests
+ add a1, a1, t4 # Bump src pointer
+ sub a2, a2, t4 # Decrement count.
+ add a3, a3, t4 # Bump dst pointer
+ bnez a2, loop # Anymore?
+ exit:
+ ret
+
+Notes:
+
+* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
+* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
+* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
+* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
+* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
+* with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
+* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
+* ldb and bne are both using t0, both in ffirst mode
+* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
+* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
+* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
+* the branch only goes to allnonzero if all tests succeed
+* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
+* SETVL sets *exactly* the requested amount into VL.
+* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
+* this would cause the stb to copy up to the end of the legal memory
+* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
+
+## strcpy
+
+RVV version:
+
+ mv a3, a0 # Save start
+ loop:
+ setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
+ vldbff.v v1, (a3) # Get bytes
+ csrr a1, vl # Get bytes actually read e.g. if fault
+ vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
+ add a3, a3, a1 # Bump pointer
+ vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
+ bltz a2, loop # Not found?
+ add a0, a0, a1 # Sum start + bump
+ add a3, a3, a2 # Add index of zero byte
+ sub a0, a3, a0 # Subtract start address+bump
+ ret
on the src for LOAD and dest for STORE switches mode from "Unit Stride"
to "Multi-indirection", respectively.
-# Element bitwidth polymorphism <a name="elwidth"></a>
-
-Element bitwidth is best covered as its own special section, as it
-is quite involved and applies uniformly across-the-board. SV restricts
-bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
-
-The effect of setting an element bitwidth is to re-cast each entry
-in the register table, and for all memory operations involving
-load/stores of certain specific sizes, to a completely different width.
-Thus In c-style terms, on an RV64 architecture, effectively each register
-now looks like this:
-
- typedef union {
- uint8_t b[8];
- uint16_t s[4];
- uint32_t i[2];
- uint64_t l[1];
- } reg_t;
-
- // integer table: assume maximum SV 7-bit regfile size
- reg_t int_regfile[128];
-
-where the CSR Register table entry (not the instruction alone) determines
-which of those union entries is to be used on each operation, and the
-VL element offset in the hardware-loop specifies the index into each array.
-
-However a naive interpretation of the data structure above masks the
-fact that setting VL greater than 8, for example, when the bitwidth is 8,
-accessing one specific register "spills over" to the following parts of
-the register file in a sequential fashion. So a much more accurate way
-to reflect this would be:
-
- typedef union {
- uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
- uint8_t b[0]; // array of type uint8_t
- uint16_t s[0];
- uint32_t i[0];
- uint64_t l[0];
- uint128_t d[0];
- } reg_t;
-
- reg_t int_regfile[128];
-
-where when accessing any individual regfile[n].b entry it is permitted
-(in c) to arbitrarily over-run the *declared* length of the array (zero),
-and thus "overspill" to consecutive register file entries in a fashion
-that is completely transparent to a greatly-simplified software / pseudo-code
-representation.
-It is however critical to note that it is clearly the responsibility of
-the implementor to ensure that, towards the end of the register file,
-an exception is thrown if attempts to access beyond the "real" register
-bytes is ever attempted.
-
-Now we may modify pseudo-code an operation where all element bitwidths have
-been set to the same size, where this pseudo-code is otherwise identical
-to its "non" polymorphic versions (above):
-
- function op_add(rd, rs1, rs2) # add not VADD!
- ...
- ...
- for (i = 0; i < VL; i++)
- ...
- ...
- // TODO, calculate if over-run occurs, for each elwidth
- if (elwidth == 8) {
- int_regfile[rd].b[id] <= int_regfile[rs1].i[irs1] +
- int_regfile[rs2].i[irs2];
- } else if elwidth == 16 {
- int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
- int_regfile[rs2].s[irs2];
- } else if elwidth == 32 {
- int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
- int_regfile[rs2].i[irs2];
- } else { // elwidth == 64
- int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
- int_regfile[rs2].l[irs2];
- }
- ...
- ...
-
-So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
-following sequentially on respectively from the same) are "type-cast"
-to 8-bit; for 16-bit entries likewise and so on.
-
-However that only covers the case where the element widths are the same.
-Where the element widths are different, the following algorithm applies:
-
-* Analyse the bitwidth of all source operands and work out the
- maximum. Record this as "maxsrcbitwidth"
-* If any given source operand requires sign-extension or zero-extension
- (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
- sign-extension / zero-extension or whatever is specified in the standard
- RV specification, **change** that to sign-extending from the respective
- individual source operand's bitwidth from the CSR table out to
- "maxsrcbitwidth" (previously calculated), instead.
-* Following separate and distinct (optional) sign/zero-extension of all
- source operands as specifically required for that operation, carry out the
- operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
- this may be a "null" (copy) operation, and that with FCVT, the changes
- to the source and destination bitwidths may also turn FVCT effectively
- into a copy).
-* If the destination operand requires sign-extension or zero-extension,
- instead of a mandatory fixed size (typically 32-bit for arithmetic,
- for subw for example, and otherwise various: 8-bit for sb, 16-bit for sw
- etc.), overload the RV specification with the bitwidth from the
- destination register's elwidth entry.
-* Finally, store the (optionally) sign/zero-extended value into its
- destination: memory for sb/sw etc., or an offset section of the register
- file for an arithmetic operation.
-
-In this way, polymorphic bitwidths are achieved without requiring a
-massive 64-way permutation of calculations **per opcode**, for example
-(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
-rd bitwidths). The pseudo-code is therefore as follows:
-
- typedef union {
- uint8_t b;
- uint16_t s;
- uint32_t i;
- uint64_t l;
- } el_reg_t;
-
- bw(elwidth):
- if elwidth == 0: return xlen
- if elwidth == 1: return 8
- if elwidth == 2: return 16
- // elwidth == 3:
- return 32
-
- get_max_elwidth(rs1, rs2):
- return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
- bw(int_csr[rs2].elwidth)) # again XLEN if no entry
-
- get_polymorphed_reg(reg, bitwidth, offset):
- el_reg_t res;
- res.l = 0; // TODO: going to need sign-extending / zero-extending
- if bitwidth == 8:
- reg.b = int_regfile[reg].b[offset]
- elif bitwidth == 16:
- reg.s = int_regfile[reg].s[offset]
- elif bitwidth == 32:
- reg.i = int_regfile[reg].i[offset]
- elif bitwidth == 64:
- reg.l = int_regfile[reg].l[offset]
- return res
-
- set_polymorphed_reg(reg, bitwidth, offset, val):
- if (!int_csr[reg].isvec):
- # sign/zero-extend depending on opcode requirements, from
- # the reg's bitwidth out to the full bitwidth of the regfile
- val = sign_or_zero_extend(val, bitwidth, xlen)
- int_regfile[reg].l[0] = val
- elif bitwidth == 8:
- int_regfile[reg].b[offset] = val
- elif bitwidth == 16:
- int_regfile[reg].s[offset] = val
- elif bitwidth == 32:
- int_regfile[reg].i[offset] = val
- elif bitwidth == 64:
- int_regfile[reg].l[offset] = val
-
- maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
- destwid = int_csr[rs1].elwidth # destination element width
- for (i = 0; i < VL; i++)
- if (predval & 1<<i) # predication uses intregs
- // TODO, calculate if over-run occurs, for each elwidth
- src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
- // TODO, sign/zero-extend src1 and src2 as operation requires
- if (op_requires_sign_extend_src1)
- src1 = sign_extend(src1, maxsrcwid)
- src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
- result = src1 + src2 # actual add here
- // TODO, sign/zero-extend result, as operation requires
- if (op_requires_sign_extend_dest)
- result = sign_extend(result, maxsrcwid)
- set_polymorphed_reg(rd, destwid, ird, result)
- if (!int_vec[rd].isvector) break
- if (int_vec[rd ].isvector) { id += 1; }
- if (int_vec[rs1].isvector) { irs1 += 1; }
- if (int_vec[rs2].isvector) { irs2 += 1; }
-
-Whilst specific sign-extension and zero-extension pseudocode call
-details are left out, due to each operation being different, the above
-should be clear that;
-
-* the source operands are extended out to the maximum bitwidth of all
- source operands
-* the operation takes place at that maximum source bitwidth (the
- destination bitwidth is not involved at this point, at all)
-* the result is extended (or potentially even, truncated) before being
- stored in the destination. i.e. truncation (if required) to the
- destination width occurs **after** the operation **not** before.
-* when the destination is not marked as "vectorised", the **full**
- (standard, scalar) register file entry is taken up, i.e. the
- element is either sign-extended or zero-extended to cover the
- full register bitwidth (XLEN) if it is not already XLEN bits long.
-
-Implementors are entirely free to optimise the above, particularly
-if it is specifically known that any given operation will complete
-accurately in less bits, as long as the results produced are
-directly equivalent and equal, for all inputs and all outputs,
-to those produced by the above algorithm.
-
-## Polymorphic floating-point operation exceptions and error-handling
-
-For floating-point operations, conversion takes place without
-raising any kind of exception. Exactly as specified in the standard
-RV specification, NAN (or appropriate) is stored if the result
-is beyond the range of the destination, and, again, exactly as
-with the standard RV specification just as with scalar
-operations, the floating-point flag is raised (FCSR). And, again, just as
-with scalar operations, it is software's responsibility to check this flag.
-Given that the FCSR flags are "accrued", the fact that multiple element
-operations could have occurred is not a problem.
-
-Note that it is perfectly legitimate for floating-point bitwidths of
-only 8 to be specified. However whilst it is possible to apply IEEE 754
-principles, no actual standard yet exists. Implementors wishing to
-provide hardware-level 8-bit support rather than throw a trap to emulate
-in software should contact the author of this specification before
-proceeding.
-
-## Polymorphic shift operators
-
-A special note is needed for changing the element width of left and right
-shift operators, particularly right-shift. Even for standard RV base,
-in order for correct results to be returned, the second operand RS2 must
-be truncated to be within the range of RS1's bitwidth. spike's implementation
-of sll for example is as follows:
-
- WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
-
-which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
-range 0..31 so that RS1 will only be left-shifted by the amount that
-is possible to fit into a 32-bit register. Whilst this appears not
-to matter for hardware, it matters greatly in software implementations,
-and it also matters where an RV64 system is set to "RV32" mode, such
-that the underlying registers RS1 and RS2 comprise 64 hardware bits
-each.
-
-For SV, where each operand's element bitwidth may be over-ridden, the
-rule about determining the operation's bitwidth *still applies*, being
-defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
-**also applies to the truncation of RS2**. In other words, *after*
-determining the maximum bitwidth, RS2's range must **also be truncated**
-to ensure a correct answer. Example:
-
-* RS1 is over-ridden to a 16-bit width
-* RS2 is over-ridden to an 8-bit width
-* RD is over-ridden to a 64-bit width
-* the maximum bitwidth is thus determined to be 16-bit - max(8,16)
-* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
-
-Pseudocode (in spike) for this example would therefore be:
-
- WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
-
-This example illustrates that considerable care therefore needs to be
-taken to ensure that left and right shift operations are implemented
-correctly. The key is that
-
-* The operation bitwidth is determined by the maximum bitwidth
- of the *source registers*, **not** the destination register bitwidth
-* The result is then sign-extend (or truncated) as appropriate.
-
-## Polymorphic MULH/MULHU/MULHSU
-
-MULH is designed to take the top half MSBs of a multiply that
-does not fit within the range of the source operands, such that
-smaller width operations may produce a full double-width multiply
-in two cycles. The issue is: SV allows the source operands to
-have variable bitwidth.
-
-Here again special attention has to be paid to the rules regarding
-bitwidth, which, again, are that the operation is performed at
-the maximum bitwidth of the **source** registers. Therefore:
-
-* An 8-bit x 8-bit multiply will create a 16-bit result that must
- be shifted down by 8 bits
-* A 16-bit x 8-bit multiply will create a 24-bit result that must
- be shifted down by 16 bits (top 8 bits being zero)
-* A 16-bit x 16-bit multiply will create a 32-bit result that must
- be shifted down by 16 bits
-* A 32-bit x 16-bit multiply will create a 48-bit result that must
- be shifted down by 32 bits
-* A 32-bit x 8-bit multiply will create a 40-bit result that must
- be shifted down by 32 bits
-
-So again, just as with shift-left and shift-right, the result
-is shifted down by the maximum of the two source register bitwidths.
-And, exactly again, truncation or sign-extension is performed on the
-result. If sign-extension is to be carried out, it is performed
-from the same maximum of the two source register bitwidths out
-to the result element's bitwidth.
-
-If truncation occurs, i.e. the top MSBs of the result are lost,
-this is "Officially Not Our Problem", i.e. it is assumed that the
-programmer actually desires the result to be truncated. i.e. if the
-programmer wanted all of the bits, they would have set the destination
-elwidth to accommodate them.
-
-## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
-
-Polymorphic element widths in vectorised form means that the data
-being loaded (or stored) across multiple registers needs to be treated
-(reinterpreted) as a contiguous stream of elwidth-wide items, where
-the source register's element width is **independent** from the destination's.
-
-This makes for a slightly more complex algorithm when using indirection
-on the "addressed" register (source for LOAD and destination for STORE),
-particularly given that the LOAD/STORE instruction provides important
-information about the width of the data to be reinterpreted.
-
-Let's illustrate the "load" part, where the pseudo-code for elwidth=default
-was as follows, and i is the loop from 0 to VL-1:
-
- srcbase = ireg[rs+i];
- return mem[srcbase + imm]; // returns XLEN bits
-
-Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
-chunks are taken from the source memory location addressed by the current
-indexed source address register, and only when a full 32-bits-worth
-are taken will the index be moved on to the next contiguous source
-address register:
-
- bitwidth = bw(elwidth); // source elwidth from CSR reg entry
- elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
- srcbase = ireg[rs+i/(elsperblock)]; // integer divide
- offs = i % elsperblock; // modulo
- return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
-
-Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
-and 128 for LQ.
-
-The principle is basically exactly the same as if the srcbase were pointing
-at the memory of the *register* file: memory is re-interpreted as containing
-groups of elwidth-wide discrete elements.
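-
-As a purely illustrative stand-alone demonstration of that indexing,
-the following prints, for a 32-bit LW with an 8-bit source elwidth and
-VL=7, which address register and which byte offset each element is
-fetched from:
-
-    #include <stdio.h>
-
-    int main(void)
-    {
-        int opwidth = 32, bitwidth = 8, VL = 7, imm = 0;
-        int elsperblock = opwidth / bitwidth;  // 4 elements per 32-bit block
-        for (int i = 0; i < VL; i++) {
-            int regidx = i / elsperblock;      // index into ireg[rs + regidx]
-            int offs   = i % elsperblock;      // byte offset within the block
-            printf("element %d: srcbase = ireg[rs+%d], byte offset = %d\n",
-                   i, regidx, imm + offs);
-        }
-        return 0;
-    }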
-
-When storing the result from a load, it's important to respect the fact
-that the destination register has its *own separate element width*. Thus,
-when each element is loaded (at the source element width), any sign-extension
-or zero-extension (or truncation) needs to be done to the *destination*
-bitwidth. Also, the storing has the exact same analogous algorithm as
-above, where in fact it is just the set\_polymorphed\_reg pseudocode
-(completely unchanged) used above.
-
-One issue remains: when the source element width is **greater** than
-the width of the operation, it is obvious that a single LB for example
-cannot possibly obtain 16-bit-wide data. This condition may be detected
-where, when using integer divide, elsperblock (the width of the LOAD
-divided by the bitwidth of the element) is zero.
-
-The issue is "fixed" by ensuring that elsperblock is a minimum of 1
-(for example, an LB with a 16-bit source element width gives
-max(1, 8/16) = 1):
-
- elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
-
-The elements, if the element bitwidth is larger than the LD operation's
-size, will then be sign/zero-extended to the full LD operation size, as
-specified by the LOAD (LDU instead of LD, LBU instead of LB), before
-being passed on to the second phase.
-
-As LOAD/STORE may be twin-predicated, it is important to note that
-the rules on twin predication still apply, except where in previous
-pseudo-code (elwidth=default for both source and target) it was
-the *registers* that the predication was applied to, it is now the
-**elements** that the predication is applied to.
-
-Thus the full pseudocode for all LD operations may be written out
-as follows:
-
- function LBU(rd, rs):
- load_elwidthed(rd, rs, 8, true)
- function LB(rd, rs):
- load_elwidthed(rd, rs, 8, false)
- function LH(rd, rs):
- load_elwidthed(rd, rs, 16, false)
- ...
- ...
- function LQ(rd, rs):
- load_elwidthed(rd, rs, 128, false)
-
- # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
- function load_memory(rs, imm, i, opwidth):
- elwidth = int_csr[rs].elwidth
- bitwidth = bw(elwidth);
- elsperblock = max(1, opwidth / bitwidth)
- srcbase = ireg[rs+i/(elsperblock)];
- offs = i % elsperblock;
- return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
-
- function load_elwidthed(rd, rs, opwidth, unsigned):
- destwid = int_csr[rd].elwidth # destination element width
- rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
- rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
- ps = get_pred_val(FALSE, rs); # predication on src
- pd = get_pred_val(FALSE, rd); # ... AND on dest
- for (int i = 0, int j = 0; i < VL && j < VL;):
- if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
- if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
- val = load_memory(rs, imm, i, opwidth)
- srcwid = bw(int_csr[rs].elwidth) # source element width in bits
- if unsigned:
- val = zero_extend(val, min(opwidth, srcwid))
- else:
- val = sign_extend(val, min(opwidth, srcwid))
- set_polymorphed_reg(rd, destwid, j, val)
- if (int_csr[rs].isvec) i++;
- if (int_csr[rd].isvec) j++; else break;
-
-Note:
-
-* when comparing against for example the twin-predicated c.mv
- pseudo-code, the pattern of independent incrementing of rd and rs
- is preserved unchanged.
-* just as with the c.mv pseudocode, zeroing is not included and must be
- taken into account (TODO).
-* that due to the use of a twin-predication algorithm, LOAD/STORE also
- take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
- VSCATTER characteristics.
-* that due to the use of the same set\_polymorphed\_reg pseudocode,
- a destination that is not vectorised (marked as scalar) will
- result in the element being fully sign-extended or zero-extended
- out to the full register file bitwidth (XLEN). When the source
- is also marked as scalar, this is how the compatibility with
- standard RV LOAD/STORE is preserved by this algorithm.
-
-### Example Tables showing LOAD elements
-
-This section contains examples of vectorised LOAD operations, showing
-how the two stage process works (three if zero/sign-extension is included).
-
-
-#### Example: LD x8, 0(x5), with x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
-
-This is:
-
-* a 64-bit load, with an offset of zero
-* with a source-address elwidth of 16-bit
-* into a destination-register with an elwidth of 32-bit
-* where VL=7
-* from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
-* RV64, where XLEN=64 is assumed.
-
-First, the memory table: because the element width is 16 and the
-operation is LD (64-bit), the 64 bits loaded from memory are subdivided
-into groups of **four** elements.  And, with VL being 7 (deliberately, to
-illustrate that this is reasonable and possible), the first four are
-sourced from the offset addresses pointed to by x5, and the next three
-from the offset addresses pointed to by the next contiguous register, x6:
-
-[[!table data="""
-addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
-@x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
-@x6 | elem 4 || elem 5 || elem 6 || not loaded ||
-"""]]
-
-Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
-the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
-
-[[!table data="""
-byte 3 | byte 2 | byte 1 | byte 0 |
-0x0 | 0x0 | elem0 ||
-0x0 | 0x0 | elem1 ||
-0x0 | 0x0 | elem2 ||
-0x0 | 0x0 | elem3 ||
-0x0 | 0x0 | elem4 ||
-0x0 | 0x0 | elem5 ||
-0x0 | 0x0 | elem6 ||
-"""]]
-
-Lastly, the elements are stored in contiguous blocks, as if x8 was also
-byte-addressable "memory". That "memory" happens to cover registers
-x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
-
-[[!table data="""
-reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
-x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
-x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
-x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
-x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
-"""]]
-
-Thus we have data that is loaded from the **addresses** pointed to by
-x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
-x8 through to half of x11.
-The end result is that elements 0 and 1 end up in x8, with element 1 being
-shifted up 32 bits, and so on, until finally element 6 is in the
-LSBs of x11.
-
-Note that whilst the memory addressing table is shown in left-to-right byte
-order, the registers are shown in right-to-left (MSB-first) order. This does **not**
-imply that bit or byte-reversal is carried out: it's just easier to visualise
-memory as being contiguous bytes, and emphasises that registers are not
-really actually "memory" as such.
-
-## Why SV bitwidth specification is restricted to 4 entries
-
-The four entries for SV element bitwidths only allow three over-rides:
-
-* 8 bit
-* 16 bit
-* 32 bit
-
-This would seem inadequate: surely it would be better to have 3 bits or
-more and allow 64, 128 and some other options besides?  The answer is
-that it gets too complex, that no RV128 implementation yet exists, and
-that RV64's default elwidth is already 64 bit, so the 4 major element
-widths are covered anyway.
-
-There is an absolutely crucial aspect of SV here that explicitly
-needs spelling out, and it's whether the "vectorised" bit is set in
-the Register's CSR entry.
-
-If "vectorised" is clear (not set), this indicates that the operation
-is "scalar". Under these circumstances, when an elwidth over-ride is set
-on a destination (RD), sign-extension and zero-extension, whilst performed
-at the over-ride bitwidth, will overwrite the **full** register entry
-(64-bit if RV64).
-
-When vectorised is *set*, this indicates that the operation now treats
-**elements** as if they were independent registers, so regardless of
-the length, any parts of a given actual register that are not involved
-in the operation are **NOT** modified, but are **PRESERVED**.
-
-For example:
-
-* when the vector bit is clear and elwidth set to 16 on the destination
- register, operations are truncated to 16 bit and then sign or zero
- extended to the *FULL* XLEN register width.
-* when the vector bit is set, elwidth is 16 and VL=1 (or other value where
- groups of elwidth sized elements do not fill an entire XLEN register),
- the "top" bits of the destination register do *NOT* get modified, zero'd
- or otherwise overwritten.
-
-SIMD micro-architectures may implement this by using predication on
-any elements in a given actual register that are beyond the end of
-the multi-element operation.
-
-Other microarchitectures may choose to provide byte-level write-enable
-lines on the register file, such that each 64 bit register in an RV64
-system requires 8 WE lines. Scalar RV64 operations would require
-activation of all 8 lines, whereas SV elwidth-based operations would
-activate only the required subset of those byte-level write lines.
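-
-As an illustration only (this is not a requirement of the
-specification), the byte-level write-enable lines needed to store a
-single element into a 64-bit register could be derived as follows:
-
-    #include <stdint.h>
-
-    // Illustrative only: which of the 8 byte-level write-enable lines are
-    // needed to write element index j, of element width "ew" bits
-    // (8/16/32/64), into a 64-bit register, leaving all other bytes alone.
-    static uint8_t byte_write_enables(int j, int ew)
-    {
-        int bytes_per_el  = ew / 8;
-        int els_per_reg   = 8 / bytes_per_el;
-        int el_within_reg = j % els_per_reg;
-        uint8_t el_mask   = (uint8_t)((1u << bytes_per_el) - 1);
-        return (uint8_t)(el_mask << (el_within_reg * bytes_per_el));
-    }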
-
-Example:
-
-* rs1, rs2 and rd are all set to 8-bit
-* VL is set to 3
-* RV64 architecture is set (UXL=64)
-* add operation is carried out
-* bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
- concatenated with similar add operations on bits 15..8 and 7..0
-* bits 24 through 63 **remain as they originally were**.
-
-Example SIMD micro-architectural implementation:
-
-* SIMD architecture works out the nearest round number of elements
- that would fit into a full RV64 register (in this case: 8)
-* SIMD architecture creates a hidden predicate, binary 0b00000111
- i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
-* SIMD architecture goes ahead with the add operation as if it
- was a full 8-wide batch of 8 adds
-* SIMD architecture passes top 5 elements through the adders
- (which are "disabled" due to zero-bit predication)
-* SIMD architecture gets the top 5 8-bit elements back unmodified
- and stores them in rd.
-
-This requires a read on rd; however this is required anyway in order
-to support non-zeroing mode.
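-
-Putting the example together, here is an illustrative C model (not
-normative, and using invented names) of what such a SIMD implementation
-computes for the elwidth=8, VL=3 case: rd is read, the three active byte
-lanes are added, and bytes 3 to 7 are written back unmodified.
-
-    #include <stdint.h>
-
-    // Illustrative model of the example above: 8-bit elements, VL=3, RV64.
-    static uint64_t simd_add8_vl3(uint64_t rd_old, uint64_t rs1, uint64_t rs2)
-    {
-        uint64_t result = rd_old;               // bits 24..63 stay as they were
-        uint8_t hidden_pred = 0x07;             // VL=3: bottom 3 lanes active
-        for (int lane = 0; lane < 8; lane++) {
-            if (!(hidden_pred & (1u << lane)))
-                continue;                       // predicated-out lane: untouched
-            uint64_t a = (rs1 >> (lane * 8)) & 0xff;
-            uint64_t b = (rs2 >> (lane * 8)) & 0xff;
-            uint64_t sum = (a + b) & 0xff;      // independent 8-bit lane add
-            result &= ~(0xffULL << (lane * 8)); // clear the destination byte
-            result |=  (sum     << (lane * 8)); // insert the lane result
-        }
-        return result;
-    }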
-
-## Polymorphic floating-point
-
-Standard scalar RV integer operations base the register width on XLEN,
-which may be changed (UXL in USTATUS, and the corresponding MXL and
-SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
-arithmetic operations are therefore restricted to an active XLEN bits,
-with sign or zero extension to pad out the upper bits when XLEN has
-been dynamically set to less than the actual register size.
-
-For scalar floating-point, the active (used / changed) bits are
-specified exclusively by the operation: ADD.S specifies an active
-32-bits, with the upper bits of the source registers needing to
-be all 1s ("NaN-boxed"), and the destination upper bits being
-*set* to all 1s (including on LOAD/STOREs).
-
-Where elwidth is set to default (on any source or the destination)
-it is obvious that this NaN-boxing behaviour can and should be
-preserved. When elwidth is non-default things are less obvious,
-so need to be thought through. Here is a normal (scalar) sequence,
-assuming an RV64 which supports Quad (128-bit) FLEN:
-
-* FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
-* ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
-* FSD stores lowest 64-bits from the 128-bit-wide register to memory:
- top 64 MSBs ignored.
-
-Therefore it makes sense to mirror this behaviour when, for example,
-elwidth is set to 32. Assume elwidth set to 32 on all source and
-destination registers:
-
-* FLD loads 64-bit wide from memory as **two** 32-bit single-precision
- floating-point numbers.
-* ADD.D performs **two** 32-bit-wide adds, storing one of the adds
- in bits 0-31 and the second in bits 32-63.
-* FSD stores lowest 64-bits from the 128-bit-wide register to memory
-
-Here's the thing: it does not make sense to overwrite the top 64 MSBs
-of the registers either during the FLD **or** the ADD.D. The reason
-is that, effectively, the top 64 MSBs actually represent a completely
-independent 64-bit register, so overwriting it is not only gratuitous
-but may actually be harmful for a future extension to SV which may
-have a way to directly access those top 64 bits.
-
-The decision is therefore **not** to touch the upper parts of floating-point
-registers wherever elwidth is set to non-default values, including
-when "isvec" is false in a given register's CSR entry. Only when the
-elwidth is set to default **and** isvec is false will the standard
-RV behaviour be followed, namely that the upper bits be modified.
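-
-By way of illustration (using FLEN=64 rather than the Quad example
-above, and with invented names that are not part of the specification),
-writing a 32-bit single-precision result would look something like this:
-
-    #include <stdint.h>
-
-    // Illustrative only: store a 32-bit FP result into a 64-bit FP register.
-    // Default elwidth, scalar destination: NaN-box (upper 32 bits all 1s).
-    // Non-default elwidth: only the addressed element is touched, and the
-    // rest of the register is preserved.
-    static uint64_t write_fp32_result(uint64_t freg_old, uint32_t result,
-                                      int elwidth_is_default, int element)
-    {
-        if (elwidth_is_default)
-            return 0xffffffff00000000ULL | result;        // standard NaN-boxing
-        uint64_t mask = 0xffffffffULL << (element * 32);  // element 0 or 1
-        return (freg_old & ~mask) | ((uint64_t)result << (element * 32));
-    }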
-
-Ultimately if elwidth is default and isvec false on *all* source
-and destination registers, a SimpleV instruction defaults completely
-to standard RV scalar behaviour (this holds true for **all** operations,
-right across the board).
-
-The nice thing here is that ADD.S, ADD.D and ADD.Q when elwidth are
-non-default values are effectively all the same: they all still perform
-multiple ADD operations, just at different widths. A future extension
-to SimpleV may actually allow ADD.S to access the upper bits of the
-register, effectively breaking down a 128-bit register into a bank
-of 4 independently-accessible 32-bit registers.
-
-In the meantime, although when e.g. setting VL to 8 it would technically
-make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
-using ADD.Q may be an easy way to signal to the microarchitecture that
-it is to receive a higher VL value. On a superscalar OoO architecture
-there may be absolutely no difference; however, simpler SIMD-style
-microarchitectures may not have the infrastructure in place to know the
-difference, such that when VL=8 an ADD.D instruction completes in 2
-cycles (or more) rather than one, whereas an ADD.Q issued instead on
-such simpler microarchitectures would complete in one.
-
-## Specific instruction walk-throughs
-
-This section covers walk-throughs of the above-outlined procedure
-for converting standard RISC-V scalar arithmetic operations to
-polymorphic widths, to ensure that it is correct.
-
-### add
-
-Standard Scalar RV32/RV64 (xlen):
-
-* RS1 @ xlen bits
-* RS2 @ xlen bits
-* add @ xlen bits
-* RD @ xlen bits
-
-Polymorphic variant:
-
-* RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
-* RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
-* add @ max(rs1, rs2) bits
-* RD @ rd bits; zero-extend to rd if rd > max(rs1, rs2), otherwise truncate
-
-Note here that polymorphic add zero-extends its source operands,
-whereas addw sign-extends.
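-
-For clarity, here is a small C sketch of that rule (illustrative only;
-the helper name is not from the specification).  The "W" variants
-described next follow the same shape, with sign-extension in place of
-zero-extension.
-
-    #include <stdint.h>
-
-    // Illustrative only: polymorphic add.  Sources are zero-extended to
-    // max(rs1w, rs2w), the add is performed at that width, and the result
-    // is truncated (or zero-extended) to the destination element width.
-    static uint64_t poly_add(uint64_t rs1, int rs1w,
-                             uint64_t rs2, int rs2w, int rdw)
-    {
-        int opw = rs1w > rs2w ? rs1w : rs2w;
-        uint64_t opmask = (opw == 64) ? ~0ULL : ((1ULL << opw) - 1);
-        uint64_t rdmask = (rdw == 64) ? ~0ULL : ((1ULL << rdw) - 1);
-        uint64_t sum = ((rs1 & opmask) + (rs2 & opmask)) & opmask;
-        return sum & rdmask;  // rdw < opw: truncate; rdw > opw: zero-extend
-    }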
-
-### addw
-
-The RV Specification specifically states that "W" variants of arithmetic
-operations always produce 32-bit signed values. In a polymorphic
-environment it is reasonable to assume that the signed aspect is
-preserved, where it is the length of the operands and the result
-that may be changed.
-
-Standard Scalar RV64 (xlen):
-
-* RS1 @ xlen bits
-* RS2 @ xlen bits
-* add @ xlen bits
-* RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
-
-Polymorphic variant:
-
-* RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
-* RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
-* add @ max(rs1, rs2) bits
-* RD @ rd bits; sign-extend to rd if rd > max(rs1, rs2), otherwise truncate
-
-Note here that polymorphic addw sign-extends its source operands,
-where add zero-extends.
-
-This requires a little more in-depth analysis. Where the bitwidth of
-rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
-only where the bitwidths of rs1 and rs2 differ that the
-lesser-width operand will be sign-extended.
-
-Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
-whereas for add they are both zero-extended. This holds true for all arithmetic
-operations ending with "W".
-
-### addiw
-
-Standard Scalar RV64I:
-
-* RS1 @ xlen bits, truncated to 32-bit
-* immed @ 12 bits, sign-extended to 32-bit
-* add @ 32 bits
-* RD @ rd bits; sign-extend to rd if rd > 32, otherwise truncate.
-
-Polymorphic variant:
-
-* RS1 @ rs1 bits
-* immed @ 12 bits, sign-extend to max(rs1, 12) bits
-* add @ max(rs1, 12) bits
-* RD @ rd bits; sign-extend to rd if rd > max(rs1, 12), otherwise truncate
-
-# Predication Element Zeroing
-
-The introduction of zeroing on traditional vector predication is usually
-intended as an optimisation for lane-based microarchitectures with register
-renaming to be able to save power by avoiding a register read on elements
-that are passed through en-masse through the ALU. Simpler microarchitectures
-do not have this issue: they simply do not pass the element through to
-the ALU at all, and therefore do not store it back in the destination.
-More complex non-lane-based micro-architectures can, when zeroing is
-not set, use the predication bits to simply avoid sending element-based
-operations to the ALUs, entirely: thus, over the long term, potentially
-keeping all ALUs 100% occupied even when elements are predicated out.
-
-SimpleV's design principle is not based on or influenced by
-microarchitectural design factors: it is a hardware-level API.
-Therefore, looking purely at whether zeroing is *useful* or not
-(whether fewer instructions are needed for certain scenarios),
-given that a case can be made for zeroing *and* non-zeroing, the
-decision was taken to add support for both.
-
-## Single-predication (based on destination register)
-
-Zeroing on predication for arithmetic operations is taken from
-the destination register's predicate. i.e. the predication *and*
-zeroing settings to be applied to the whole operation come from the
-CSR Predication table entry for the destination register.
-Thus when zeroing is set on predication of a destination element,
-if the predication bit is clear, then the destination element is *set*
-to zero (twin-predication is slightly different, and will be covered
-next).
-
-Thus the pseudo-code loop for a predicated arithmetic operation
-is modified as follows:
-
- for (i = 0; i < VL; i++)
- if not zeroing: # an optimisation
- while (!(predval & 1<<i) && i < VL)
- if (int_vec[rd ].isvector) { id += 1; }
- if (int_vec[rs1].isvector) { irs1 += 1; }
- if (int_vec[rs2].isvector) { irs2 += 1; }
- i += 1 # move on to the next element
- if i == VL:
- return
- if (predval & 1<<i)
- src1 = ....
- src2 = ...
- result = src1 + src2 # actual add (or other op) here
- set_polymorphed_reg(rd, destwid, id, result)
- if int_vec[rd].ffirst and result == 0:
- VL = i # result was zero, end loop early, return VL
- return
- if (!int_vec[rd].isvector) return
- else if zeroing:
- result = 0
- set_polymorphed_reg(rd, destwid, id, result)
- if (int_vec[rd ].isvector) { id += 1; }
- else if (predval & 1<<i) return
- if (int_vec[rs1].isvector) { irs1 += 1; }
- if (int_vec[rs2].isvector) { irs2 += 1; }
- if (rd == VL or rs1 == VL or rs2 == VL): return
-
-The optimisation to skip elements entirely is only possible for certain
-micro-architectures when zeroing is not set. However for lane-based
-micro-architectures this optimisation may not be practical, as it
-implies that elements end up in different "lanes". Under these
-circumstances it is perfectly fine to simply have the lanes
-"inactive" for predicated elements, even though it results in
-less than 100% ALU utilisation.
-
-## Twin-predication (based on source and destination register)
-
-Twin-predication is not that much different, except that
-the source is independently zero-predicated from the destination.
-This means that the source may be zero-predicated *or* the
-destination zero-predicated *or both*, or neither.
-
-When, with twin-predication, zeroing is set on the source and not
-the destination, a *clear* source predicate bit indicates that a zero
-data element is passed through the operation (the exception being:
-if the source data element is to be treated as an address - a LOAD -
-then the data returned *from* the LOAD is zero, rather than looking up an
-*address* of zero).
-
-When zeroing is set on the destination and not the source, then just
-as with single-predicated operations, a zero is stored into the destination
-element (or target memory address for a STORE).
-
-Zeroing on both source and destination effectively results in an AND of
-the source and destination predicates: the result is that
-where either the source predicate OR the destination predicate bit is 0,
-a zero element will ultimately end up in the destination register, and
-real source data reaches the destination only where *both* bits are set.
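-
-Expressed compactly in the same pseudo-code style as below (this is a
-restatement of the above, not additional semantics), the per-element
-effect when both zeroing bits are set is:
-
-    ireg[rd+j] <= ((pd & 1<<j) && (ps & 1<<i)) ? ireg[rs+i] : 0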
-
-However: this may not necessarily be the case for all operations;
-implementors, particularly of custom instructions, clearly need to
-think through the implications in each and every case.
-
-Here is pseudo-code for a twin zero-predicated operation:
-
- function op_mv(rd, rs) # MV not VMV!
- rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
- rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
- ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
- pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
- for (int i = 0, int j = 0; i < VL && j < VL):
- if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
- if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
- if ((pd & 1<<j))
- if ((ps & 1<<i))
- sourcedata = ireg[rs+i];
- else
- sourcedata = 0
- ireg[rd+j] <= sourcedata
- else if (zerodst)
- ireg[rd+j] <= 0
- if (int_csr[rs].isvec)
- i++;
- if (int_csr[rd].isvec)
- j++;
- else
- if ((pd & 1<<j))
- break;
-
-Note that in the instance where the destination is a scalar, the hardware
-loop is ended the moment a value *or a zero* is placed into the destination
-register/element. Also note that, for clarity, variable element widths
-have been left out of the above.
-
# Exceptions
TODO: expand. Exceptions may occur at any time, in any given underlying
See ancillary resource: [[vblock_format]]
-# Subsets of RV functionality
-
-This section describes the differences when SV is implemented on top of
-different subsets of RV.
-
-## Common options
-
-It is permitted to only implement SVprefix and not the VBLOCK instruction
-format option, and vice-versa. UNIX Platforms **MUST** raise illegal
-instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
-traps may emulate the format.
-
-It is permitted in SVprefix to either not implement VL or not implement
-SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
-*MUST* raise illegal instruction on implementations that do not support
-VL or SUBVL.
-
-It is permitted to limit the size of either (or both) the register files
-down to the original size of the standard RV architecture. However, reducing
-them below the mandatory limits set in the RV standard will result in
-non-compliance with the SV Specification.
-
-## RV32 / RV32F
-
-When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
-maximum limit for predication is also restricted to 32 bits. Whilst not
-actually specifically an "option" it is worth noting.
-
-## RV32G
-
-Normally in standard RV32 it does not make much sense to have
-RV32G: the critical instructions that are missing in standard RV32
-are those for moving data between the double-width floating-point
-registers and the integer ones, as well as the FCVT routines.
-
-In an earlier draft of SV, it was possible to specify an elwidth
-of double the standard register size: this had to be dropped,
-and may be reintroduced in future revisions.
-
-## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
-
-When floating-point is not implemented, the size of the User Register and
-Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
-per table).
-
-## RV32E
-
-In embedded scenarios the User Register and Predication CSRs may be
-dropped entirely, or optionally limited to 1 CSR, such that the combined
-number of entries from the M-Mode CSR Register table plus U-Mode
-CSR Register table is either 4 16-bit entries or (if the U-Mode is
-zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
-the Predication CSR tables.
-
-RV32E is the most likely candidate for simply detecting that registers
-are marked as "vectorised", and generating an appropriate exception
-for the VL loop to be implemented in software.
-
-## RV128
-
-RV128 has not been especially considered, here, however it has some
-extremely large possibilities: double the element width implies
-256-bit operands, spanning 2 128-bit registers each, and predication
-of total length 128 bit given that XLEN is now 128.
-
# Under consideration <a name="issues"></a>
for element-grouping, if there is unused space within a register
Expand the range of SUBVL and its associated svsrcoffs and svdestoffs by
adding a 2nd STATE CSR (or extending STATE to 64 bits). Future version?
---
-
-TODO evaluate strncpy and strlen
-<https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
-
-RVV version: <a name="strncpy"></a>
-
- strncpy:
- mv a3, a0 # Copy dst
- loop:
- setvli x0, a2, vint8 # Vectors of bytes.
- vlbff.v v1, (a1) # Get src bytes
- vseq.vi v0, v1, 0 # Flag zero bytes
- vmfirst a4, v0 # Zero found?
- vmsif.v v0, v0 # Set mask up to and including zero byte.
- vsb.v v1, (a3), v0.t # Write out bytes
- bgez a4, exit # Done
- csrr t1, vl # Get number of bytes fetched
- add a1, a1, t1 # Bump src pointer
- sub a2, a2, t1 # Decrement count.
- add a3, a3, t1 # Bump dst pointer
- bnez a2, loop # Anymore?
-
- exit:
- ret
-
-SV version (WIP):
-
- strncpy:
- mv a3, a0
- SETMVLI 8 # set max vector to 8
- RegCSR[a3] = 8bit, a3, scalar
- RegCSR[a1] = 8bit, a1, scalar
- RegCSR[t0] = 8bit, t0, vector
- PredTb[t0] = ffirst, x0, inv
- loop:
- SETVLI a2, t4 # t4 and VL now 1..8
- ldb t0, (a1) # t0 fail first mode
- bne t0, x0, allnonzero # still ff
- # VL points to last nonzero
- GETVL t4 # from bne tests
- addi t4, t4, 1 # include zero
- SETVL t4 # set exactly to t4
- stb t0, (a3) # store incl zero
- ret # end subroutine
- allnonzero:
- stb t0, (a3) # VL legal range
- GETVL t4 # from bne tests
- add a1, a1, t4 # Bump src pointer
- sub a2, a2, t4 # Decrement count.
- add a3, a3, t4 # Bump dst pointer
- bnez a2, loop # Anymore?
- exit:
- ret
-
-Notes:
-
-* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
-* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
-* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
-* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
-* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
-* with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
-* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
-* ldb and bne are both using t0, both in ffirst mode
-* ldb will end on illegal mem, reduce VL, but will have copied all sorts of stuff into t0
-* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
-* however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
-* the branch only goes to allnonzero if all tests succeed
-* if it did not, we can safely increment VL by 1 (using t4) to include the zero.
-* SETVL sets *exactly* the requested amount into VL.
-* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
-* this would cause the stb to copy up to the end of the legal memory
-* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
-
-RVV version (strlen):
-
- mv a3, a0 # Save start
- loop:
- setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
- vldbff.v v1, (a3) # Get bytes
- csrr a1, vl # Get bytes actually read e.g. if fault
- vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
- add a3, a3, a1 # Bump pointer
- vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
- bltz a2, loop # Not found?
- add a0, a0, a1 # Sum start + bump
- add a3, a3, a2 # Add index of zero byte
- sub a0, a3, a0 # Subtract start address+bump
- ret