# Simple-V (Parallelism Extension Proposal) Appendix * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton * Status: DRAFTv0.6 * Last edited: 25 jun 2019 * main spec [[specification]] [[!toc ]] # Element bitwidth polymorphism Element bitwidth is best covered as its own special section, as it is quite involved and applies uniformly across-the-board. SV restricts bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit. The effect of setting an element bitwidth is to re-cast each entry in the register table, and for all memory operations involving load/stores of certain specific sizes, to a completely different width. Thus In c-style terms, on an RV64 architecture, effectively each register now looks like this: typedef union { uint8_t b[8]; uint16_t s[4]; uint32_t i[2]; uint64_t l[1]; } reg_t; // integer table: assume maximum SV 7-bit regfile size reg_t int_regfile[128]; where the CSR Register table entry (not the instruction alone) determines which of those union entries is to be used on each operation, and the VL element offset in the hardware-loop specifies the index into each array. However a naive interpretation of the data structure above masks the fact that setting VL greater than 8, for example, when the bitwidth is 8, accessing one specific register "spills over" to the following parts of the register file in a sequential fashion. So a much more accurate way to reflect this would be: typedef union { uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128 uint8_t b[0]; // array of type uint8_t uint16_t s[0]; uint32_t i[0]; uint64_t l[0]; uint128_t d[0]; } reg_t; reg_t int_regfile[128]; where when accessing any individual regfile[n].b entry it is permitted (in c) to arbitrarily over-run the *declared* length of the array (zero), and thus "overspill" to consecutive register file entries in a fashion that is completely transparent to a greatly-simplified software / pseudo-code representation. It is however critical to note that it is clearly the responsibility of the implementor to ensure that, towards the end of the register file, an exception is thrown if attempts to access beyond the "real" register bytes is ever attempted. Now we may modify pseudo-code an operation where all element bitwidths have been set to the same size, where this pseudo-code is otherwise identical to its "non" polymorphic versions (above): function op_add(rd, rs1, rs2) # add not VADD! ... ... for (i = 0; i < VL; i++) ... ... // TODO, calculate if over-run occurs, for each elwidth if (elwidth == 8) { int_regfile[rd].b[id] <= int_regfile[rs1].i[irs1] + int_regfile[rs2].i[irs2]; } else if elwidth == 16 { int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] + int_regfile[rs2].s[irs2]; } else if elwidth == 32 { int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] + int_regfile[rs2].i[irs2]; } else { // elwidth == 64 int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] + int_regfile[rs2].l[irs2]; } ... ... So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers following sequentially on respectively from the same) are "type-cast" to 8-bit; for 16-bit entries likewise and so on. However that only covers the case where the element widths are the same. Where the element widths are different, the following algorithm applies: * Analyse the bitwidth of all source operands and work out the maximum. Record this as "maxsrcbitwidth" * If any given source operand requires sign-extension or zero-extension (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit sign-extension / zero-extension or whatever is specified in the standard RV specification, **change** that to sign-extending from the respective individual source operand's bitwidth from the CSR table out to "maxsrcbitwidth" (previously calculated), instead. * Following separate and distinct (optional) sign/zero-extension of all source operands as specifically required for that operation, carry out the operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV this may be a "null" (copy) operation, and that with FCVT, the changes to the source and destination bitwidths may also turn FVCT effectively into a copy). * If the destination operand requires sign-extension or zero-extension, instead of a mandatory fixed size (typically 32-bit for arithmetic, for subw for example, and otherwise various: 8-bit for sb, 16-bit for sw etc.), overload the RV specification with the bitwidth from the destination register's elwidth entry. * Finally, store the (optionally) sign/zero-extended value into its destination: memory for sb/sw etc., or an offset section of the register file for an arithmetic operation. In this way, polymorphic bitwidths are achieved without requiring a massive 64-way permutation of calculations **per opcode**, for example (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible rd bitwidths). The pseudo-code is therefore as follows: typedef union { uint8_t b; uint16_t s; uint32_t i; uint64_t l; } el_reg_t; bw(elwidth): if elwidth == 0: return xlen if elwidth == 1: return 8 if elwidth == 2: return 16 // elwidth == 3: return 32 get_max_elwidth(rs1, rs2): return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set bw(int_csr[rs2].elwidth)) # again XLEN if no entry get_polymorphed_reg(reg, bitwidth, offset): el_reg_t res; res.l = 0; // TODO: going to need sign-extending / zero-extending if bitwidth == 8: reg.b = int_regfile[reg].b[offset] elif bitwidth == 16: reg.s = int_regfile[reg].s[offset] elif bitwidth == 32: reg.i = int_regfile[reg].i[offset] elif bitwidth == 64: reg.l = int_regfile[reg].l[offset] return res set_polymorphed_reg(reg, bitwidth, offset, val): if (!int_csr[reg].isvec): # sign/zero-extend depending on opcode requirements, from # the reg's bitwidth out to the full bitwidth of the regfile val = sign_or_zero_extend(val, bitwidth, xlen) int_regfile[reg].l[0] = val elif bitwidth == 8: int_regfile[reg].b[offset] = val elif bitwidth == 16: int_regfile[reg].s[offset] = val elif bitwidth == 32: int_regfile[reg].i[offset] = val elif bitwidth == 64: int_regfile[reg].l[offset] = val maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s) destwid = int_csr[rs1].elwidth # destination element width for (i = 0; i < VL; i++) if (predval & 1< Polymorphic element widths in vectorised form means that the data being loaded (or stored) across multiple registers needs to be treated (reinterpreted) as a contiguous stream of elwidth-wide items, where the source register's element width is **independent** from the destination's. This makes for a slightly more complex algorithm when using indirection on the "addressed" register (source for LOAD and destination for STORE), particularly given that the LOAD/STORE instruction provides important information about the width of the data to be reinterpreted. Let's illustrate the "load" part, where the pseudo-code for elwidth=default was as follows, and i is the loop from 0 to VL-1: srcbase = ireg[rs+i]; return mem[srcbase + imm]; // returns XLEN bits Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide chunks are taken from the source memory location addressed by the current indexed source address register, and only when a full 32-bits-worth are taken will the index be moved on to the next contiguous source address register: bitwidth = bw(elwidth); // source elwidth from CSR reg entry elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8 srcbase = ireg[rs+i/(elsperblock)]; // integer divide offs = i % elsperblock; // modulo return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc. Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD and 128 for LQ. The principle is basically exactly the same as if the srcbase were pointing at the memory of the *register* file: memory is re-interpreted as containing groups of elwidth-wide discrete elements. When storing the result from a load, it's important to respect the fact that the destination register has its *own separate element width*. Thus, when each element is loaded (at the source element width), any sign-extension or zero-extension (or truncation) needs to be done to the *destination* bitwidth. Also, the storing has the exact same analogous algorithm as above, where in fact it is just the set\_polymorphed\_reg pseudocode (completely unchanged) used above. One issue remains: when the source element width is **greater** than the width of the operation, it is obvious that a single LB for example cannot possibly obtain 16-bit-wide data. This condition may be detected where, when using integer divide, elsperblock (the width of the LOAD divided by the bitwidth of the element) is zero. The issue is "fixed" by ensuring that elsperblock is a minimum of 1: elsperblock = min(1, LD_OP_BITWIDTH / element_bitwidth) The elements, if the element bitwidth is larger than the LD operation's size, will then be sign/zero-extended to the full LD operation size, as specified by the LOAD (LDU instead of LD, LBU instead of LB), before being passed on to the second phase. As LOAD/STORE may be twin-predicated, it is important to note that the rules on twin predication still apply, except where in previous pseudo-code (elwidth=default for both source and target) it was the *registers* that the predication was applied to, it is now the **elements** that the predication is applied to. Thus the full pseudocode for all LD operations may be written out as follows: function LBU(rd, rs): load_elwidthed(rd, rs, 8, true) function LB(rd, rs): load_elwidthed(rd, rs, 8, false) function LH(rd, rs): load_elwidthed(rd, rs, 16, false) ... ... function LQ(rd, rs): load_elwidthed(rd, rs, 128, false) # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16.. function load_memory(rs, imm, i, opwidth): elwidth = int_csr[rs].elwidth bitwidth = bw(elwidth); elsperblock = min(1, opwidth / bitwidth) srcbase = ireg[rs+i/(elsperblock)]; offs = i % elsperblock; return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes function load_elwidthed(rd, rs, opwidth, unsigned): destwid = int_csr[rd].elwidth # destination element width rd = int_csr[rd].active ? int_csr[rd].regidx : rd; rs = int_csr[rs].active ? int_csr[rs].regidx : rs; ps = get_pred_val(FALSE, rs); # predication on src pd = get_pred_val(FALSE, rd); # ... AND on dest for (int i = 0, int j = 0; i < VL && j < VL;): if (int_csr[rs].isvec) while (!(ps & 1<