From: Luke Kenneth Casson Leighton
Date: Tue, 25 Jun 2019 11:43:34 +0000 (+0100)
Subject: add new appendix, separate page
X-Git-Tag: convert-csv-opcode-to-binary~4449
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=1914cd83a9d0b70e28b363170240bdce4c6848fc;p=libreriscv.git

add new appendix, separate page
---

diff --git a/simple_v_extension/appendix.mdwn b/simple_v_extension/appendix.mdwn
new file mode 100644
index 000000000..b17320105
--- /dev/null
+++ b/simple_v_extension/appendix.mdwn
@@ -0,0 +1,1012 @@

# Simple-V (Parallelism Extension Proposal) Appendix

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 25 jun 2019
* main spec [[specification]]

[[!toc ]]

# Element bitwidth polymorphism

Element bitwidth is best covered as its own special section, as it
is quite involved and applies uniformly across-the-board. SV restricts
bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.

The effect of setting an element bitwidth is to re-cast each entry
in the register table, and for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register
effectively now looks like this:

    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];

where the CSR Register table entry (not the instruction alone) determines
which of those union entries is to be used on each operation, and the
VL element offset in the hardware loop specifies the index into each array.

However a naive interpretation of the data structure above masks the
fact that setting VL greater than 8, for example, when the bitwidth is 8,
causes accesses to one specific register to "spill over" into the
following entries of the register file in a sequential fashion. A much
more accurate way to reflect this would be:

    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];

where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries, in a fashion
that is completely transparent to a greatly-simplified software /
pseudo-code representation. It is however critical to note that it is
clearly the responsibility of the implementor to ensure that, towards the
end of the register file, an exception is thrown if an access beyond the
"real" register bytes is ever attempted.
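To make this "overspill" behaviour concrete, here is a minimal,
self-contained C sketch (illustrative only, not part of the specification:
the flat `regfile_bytes` array and the `get_b_element` helper are
hypothetical names) showing how, with elwidth=8 and VL=9, the ninth
element of a vector starting at x3 lands in the first byte of x4:

    #include <stdint.h>
    #include <stdio.h>

    enum { NREGS = 128, REGBYTES = 8 };             // RV64: 8 bytes/register
    static uint8_t regfile_bytes[NREGS * REGBYTES]; // regfile as raw bytes

    // hypothetical helper: element i of an 8-bit vector starting at
    // register r; element 8 of a vector at x3 is byte 0 of x4, and so on
    static uint8_t get_b_element(int r, int i) {
        return regfile_bytes[r * REGBYTES + i];
    }

    int main(void) {
        // write 0x0807060504030201 into x3 (little-endian byte layout)
        uint64_t x3 = 0x0807060504030201ULL;
        for (int k = 0; k < 8; k++)
            regfile_bytes[3 * REGBYTES + k] = (uint8_t)(x3 >> (8 * k));
        // and 0x11 into the lowest byte of x4
        regfile_bytes[4 * REGBYTES + 0] = 0x11;

        // with VL = 9 and elwidth = 8, element 8 "overspills" into x4
        for (int i = 0; i < 9; i++)
            printf("element %d = 0x%02x\n", i, get_b_element(3, i));
        return 0;
    }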
Now we may modify the pseudo-code of an operation for the case where all
element bitwidths have been set to the same size, where this pseudo-code
is otherwise identical to its "non"-polymorphic versions (above):

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...

So here we can see clearly: for 8-bit element widths, rd, rs1 and rs2
(and the registers following sequentially on from each of them
respectively) are "type-cast" to 8-bit; for 16-bit entries likewise,
and so on.

However that only covers the case where the element widths are the same.
Where the element widths are different, the following algorithm applies
(see the worked example below):

* Analyse the bitwidth of all source operands and work out the
  maximum.  Record this as "maxsrcbitwidth".
* If any given source operand requires sign-extension or zero-extension
  (lb, div, rem, mul, sll, srl, sra etc.), instead of the mandatory 32-bit
  sign-extension / zero-extension (or whatever is specified in the standard
  RV specification), **change** that to sign/zero-extending from the
  respective individual source operand's bitwidth (from the CSR table)
  out to "maxsrcbitwidth" (previously calculated), instead.
* Following the separate and distinct (optional) sign/zero-extension of
  all source operands as specifically required for that operation, carry
  out the operation at "maxsrcbitwidth".  (Note that in the case of
  LOAD/STORE or MV this may be a "null" (copy) operation, and that with
  FCVT, the changes to the source and destination bitwidths may also turn
  FCVT effectively into a copy).
* If the destination operand requires sign-extension or zero-extension,
  instead of a mandatory fixed size (typically 32-bit for arithmetic,
  subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
  etc.), overload the RV specification with the bitwidth from the
  destination register's elwidth entry.
* Finally, store the (optionally) sign/zero-extended value into its
  destination: memory for sb/sh/sw etc., or an offset section of the
  register file for an arithmetic operation.

In this way, polymorphic bitwidths are achieved without requiring a
massive 64-way permutation of calculations **per opcode**, for example
(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
rd bitwidths).
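As a hand-worked illustration of the algorithm above (the concrete widths
here are an assumption chosen for the sketch: a signed add with rs1
elwidth=8, rs2 elwidth=16 and rd elwidth=32, so that maxsrcbitwidth
is 16), the following self-contained C fragment walks a single element
through the three steps:

    #include <stdint.h>
    #include <stdio.h>

    // illustrative only: one element of an add with rs1 elwidth=8,
    // rs2 elwidth=16, rd elwidth=32 (so maxsrcbitwidth = 16)
    int main(void) {
        int8_t  src1 = -5;    // 8-bit source element
        int16_t src2 = 1000;  // 16-bit source element

        // step 1: sign-extend both sources to maxsrcbitwidth (16)
        int16_t a = (int16_t)src1;   // -5, now 16 bits wide
        int16_t b = src2;            // already 16 bits wide

        // step 2: carry out the operation at maxsrcbitwidth
        int16_t result16 = (int16_t)(a + b);   // 995

        // step 3: sign-extend the result to the destination elwidth (32)
        int32_t dest = (int32_t)result16;

        printf("dest element = %d\n", dest);   // 995
        return 0;
    }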
The full pseudo-code is therefore as follows:

    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth         # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # skip if predicated out
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            # TODO: sign/zero-extend src1/src2 as the operation requires
            result = src1 + src2 # actual add here
            set_polymorphed_reg(rd, destwid, id, result)
        if (int_csr[rd].isvec)  { id += 1; }
        if (int_csr[rs1].isvec) { irs1 += 1; }
        if (int_csr[rs2].isvec) { irs2 += 1; }

## Polymorphic elwidth on LOAD/STORE

Polymorphic element widths in vectorised form means that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the
destination's.

This makes for a slightly more complex algorithm when using indirection
on the "addressed" register (source for LOAD and destination for STORE),
particularly given that the LOAD/STORE instruction provides important
information about the width of the data to be reinterpreted.

Let's illustrate the "load" part, where the pseudo-code for
elwidth=default is as follows, i being the hardware-loop index running
from 0 to VL-1:

    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits

Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
chunks are taken from the source memory location addressed by the current
indexed source address register, and only when a full 32-bits-worth
have been taken will the index move on to the next contiguous source
address register:

    bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock;             // modulo
    return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.

Note that the constant "32" above is replaced by 8 for LB, 16 for LH,
64 for LD and 128 for LQ.

The principle is basically exactly the same as if the srcbase were
pointing at the memory of the *register* file: memory is re-interpreted
as containing groups of elwidth-wide discrete elements.

When storing the result from a load, it is important to respect the fact
that the destination register has its *own separate element width*.  Thus,
when each element is loaded (at the source element width), any
sign-extension or zero-extension (or truncation) needs to be done to the
*destination* bitwidth.
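Before moving on to the store side, the chunked indexing may be made
concrete with a small self-contained C sketch (illustrative assumptions:
`mem` modelled as a flat byte array, `element_addr` a hypothetical
helper) for a LW (opwidth 32) with source elwidth 8, giving
elsperblock = 4:

    #include <stdint.h>
    #include <stdio.h>

    // illustrative model: byte-addressable memory and the x-regfile
    static uint8_t  mem[1024];
    static uint64_t ireg[32];

    // effective byte address of element i for a LOAD whose source
    // register's elwidth is 8 and whose operation width is 32 (LW):
    // elsperblock = 32/8 = 4
    static uint64_t element_addr(int rs, int i, int64_t imm) {
        int elsperblock = 32 / 8;                      // 4 elements/block
        uint64_t srcbase = ireg[rs + i / elsperblock]; // integer divide
        int offs = i % elsperblock;                    // modulo
        // elwidth=8, so offs is already in bytes; wider elwidths would
        // scale through the re-cast pointer type instead
        return srcbase + imm + offs;
    }

    int main(void) {
        ireg[5] = 0x100;   // first source address register
        ireg[6] = 0x200;   // next contiguous source address register
        for (int i = 0; i < 5; i++)
            printf("element %d -> address 0x%llx\n",
                   i, (unsigned long long)element_addr(5, i, 0));
        return 0;
    }

Elements 0 to 3 resolve to consecutive bytes at the address held in the
first source register; element 4 moves on to the address held in the next
contiguous register.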
The storing of the result uses the exact same analogous algorithm: in
fact it is just the set\_polymorphed\_reg pseudo-code (completely
unchanged) from above.

One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB, for example,
cannot possibly obtain 16-bit-wide data.  This condition may be detected
where, when using integer divide, elsperblock (the width of the LOAD
divided by the bitwidth of the element) is zero.

The issue is "fixed" by ensuring that elsperblock is a minimum of 1
(note: max, not min):

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)

The elements, if the element bitwidth is larger than the LD operation's
size, will then be sign/zero-extended to the full LD operation size, as
specified by the LOAD (LDU instead of LD, LBU instead of LB), before
being passed on to the second phase.

As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except where in the previous
pseudo-code (elwidth=default for both source and target) it was
the *registers* that the predication was applied to, it is now the
**elements** that the predication is applied to.

Thus the full pseudo-code for all LD operations may be written out
as follows:

    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            # sign/zero-extend (or truncate) val to the dest elwidth
            set_polymorphed_reg(rd, bw(destwid), j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
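The sign_or_zero_extend step (and the sign/zero-extension of loaded
elements generally) is deliberately left abstract in the pseudo-code
above; one possible C realisation, an illustrative sketch only and not
normative, is:

    #include <stdint.h>
    #include <stdio.h>

    // widen "val" from "bits" wide out to 64 bits, discarding upper bits
    static uint64_t zero_extend(uint64_t val, int bits) {
        return (bits >= 64) ? val : (val & ((1ULL << bits) - 1));
    }

    static uint64_t sign_extend(uint64_t val, int bits) {
        if (bits >= 64) return val;
        uint64_t sign = 1ULL << (bits - 1);
        val &= (1ULL << bits) - 1;
        return (val ^ sign) - sign;   // branch-free sign-extension idiom
    }

    int main(void) {
        // e.g. an 8-bit element 0xfe loaded by LB (signed) or LBU
        // (unsigned), then widened to a 64-bit destination element
        printf("sign: 0x%016llx\n", (unsigned long long)sign_extend(0xfe, 8));
        printf("zero: 0x%016llx\n", (unsigned long long)zero_extend(0xfe, 8));
        return 0;
    }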