1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 25 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Element bitwidth polymorphism <a name="elwidth"></a>
11
12 Element bitwidth is best covered as its own special section, as it
13 is quite involved and applies uniformly across-the-board. SV restricts
14 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
15
16 The effect of setting an element bitwidth is to re-cast each entry
17 in the register table, and for all memory operations involving
18 load/stores of certain specific sizes, to a completely different width.
19 Thus, in C-style terms, on an RV64 architecture, effectively each register
20 now looks like this:
21
22     typedef union {
23         uint8_t b[8];
24         uint16_t s[4];
25         uint32_t i[2];
26         uint64_t l[1];
27     } reg_t;
28
29     // integer table: assume maximum SV 7-bit regfile size
30     reg_t int_regfile[128];
31
32 where the CSR Register table entry (not the instruction alone) determines
33 which of those union entries is to be used on each operation, and the
34 VL element offset in the hardware-loop specifies the index into each array.
35
36 However a naive interpretation of the data structure above masks the
37 fact that, when for example VL is set greater than 8 and the bitwidth is 8,
38 accessing one specific register "spills over" into the following parts of
39 the register file in a sequential fashion. So a much more accurate way
40 to reflect this would be:
41
42     typedef union {
43         uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
44         uint8_t b[0]; // array of type uint8_t
45         uint16_t s[0];
46         uint32_t i[0];
47         uint64_t l[0];
48         uint128_t d[0];
49     } reg_t;
50
51     reg_t int_regfile[128];
52
53 where, when accessing any individual regfile[n].b entry, it is permitted
54 (in C) to arbitrarily over-run the *declared* length of the array (zero),
55 and thus "overspill" into consecutive register file entries in a fashion
56 that is completely transparent to a greatly-simplified software / pseudo-code
57 representation.
58 It is however critical to note that it is the responsibility of
59 the implementor to ensure that, towards the end of the register file,
60 an exception is thrown if an access beyond the "real" register
61 bytes is ever attempted.
62
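To make the "overspill" rule concrete, here is a minimal C sketch (illustrative only, not part of the specification; the helper name and the assumption of an RV64 8-byte register are mine) that computes which actual register, and which byte within it, a given hardware-loop element index lands in:

    #include <stdint.h>
    #include <stdio.h>

    /* hypothetical helper: given the starting register number, the element
     * width in bytes and the hardware-loop element index, work out which
     * *actual* 64-bit register the element lands in and the byte offset
     * within that register (RV64 assumed: 8 bytes per register).          */
    static void elem_location(int startreg, int elwidth_bytes, int i,
                              int *actual_reg, int *byte_offset)
    {
        int byte_index = i * elwidth_bytes;       /* linear byte index      */
        *actual_reg  = startreg + byte_index / 8; /* which register         */
        *byte_offset = byte_index % 8;            /* which byte within it   */
    }

    int main(void)
    {
        int reg, off;
        /* elwidth=8 (1 byte), starting at x5: element 11 overspills into x6 */
        elem_location(5, 1, 11, &reg, &off);
        printf("element 11 -> x%d, byte %d\n", reg, off);
        return 0;
    }
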
63 Now we may modify the pseudo-code of an operation where all element bitwidths
64 have been set to the same size; this pseudo-code is otherwise identical
65 to its "non"-polymorphic version (above):
66
67     function op_add(rd, rs1, rs2) # add not VADD!
68       ...
69       ...
70       for (i = 0; i < VL; i++)
71         ...
72         ...
73         // TODO, calculate if over-run occurs, for each elwidth
74         if (elwidth == 8) {
75            int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
76                                     int_regfile[rs2].b[irs2];
77         } else if elwidth == 16 {
78            int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
79                                     int_regfile[rs2].s[irs2];
80         } else if elwidth == 32 {
81            int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
82                                     int_regfile[rs2].i[irs2];
83         } else { // elwidth == 64
84            int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
85                                     int_regfile[rs2].l[irs2];
86         }
87         ...
88         ...
89
90 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
91 following sequentially on respectively from the same) are "type-cast"
92 to 8-bit; for 16-bit entries likewise and so on.
93
94 However that only covers the case where the element widths are the same.
95 Where the element widths are different, the following algorithm applies:
96
97 * Analyse the bitwidth of all source operands and work out the
98 maximum. Record this as "maxsrcbitwidth"
99 * If any given source operand requires sign-extension or zero-extension
100 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
101 sign-extension / zero-extension or whatever is specified in the standard
102 RV specification, **change** that to sign-extending from the respective
103 individual source operand's bitwidth from the CSR table out to
104 "maxsrcbitwidth" (previously calculated), instead.
105 * Following separate and distinct (optional) sign/zero-extension of all
106 source operands as specifically required for that operation, carry out the
107 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
108 this may be a "null" (copy) operation, and that with FCVT, the changes
109 to the source and destination bitwidths may also turn FCVT effectively
110 into a copy).
111 * If the destination operand requires sign-extension or zero-extension,
112 instead of a mandatory fixed size (typically 32-bit for arithmetic,
113 for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
114 etc.), overload the RV specification with the bitwidth from the
115 destination register's elwidth entry.
116 * Finally, store the (optionally) sign/zero-extended value into its
117 destination: memory for sb/sw etc., or an offset section of the register
118 file for an arithmetic operation.
119
120 In this way, polymorphic bitwidths are achieved without requiring a
121 massive 64-way permutation of calculations **per opcode**, for example
122 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
123 rd bitwidths). The pseudo-code is therefore as follows:
124
125     typedef union {
126         uint8_t b;
127         uint16_t s;
128         uint32_t i;
129         uint64_t l;
130     } el_reg_t;
131
132     bw(elwidth):
133         if elwidth == 0: return xlen
134         if elwidth == 1: return 8
135         if elwidth == 2: return 16
136         // elwidth == 3:
137         return 32
138
139     get_max_elwidth(rs1, rs2):
140         return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
141                    bw(int_csr[rs2].elwidth)) # again XLEN if no entry
142
143     get_polymorphed_reg(reg, bitwidth, offset):
144         el_reg_t res;
145         res.l = 0; // TODO: going to need sign-extending / zero-extending
146         if bitwidth == 8:
147             res.b = int_regfile[reg].b[offset]
148         elif bitwidth == 16:
149             res.s = int_regfile[reg].s[offset]
150         elif bitwidth == 32:
151             res.i = int_regfile[reg].i[offset]
152         elif bitwidth == 64:
153             res.l = int_regfile[reg].l[offset]
154         return res
155
156     set_polymorphed_reg(reg, bitwidth, offset, val):
157         if (!int_csr[reg].isvec):
158             # sign/zero-extend depending on opcode requirements, from
159             # the reg's bitwidth out to the full bitwidth of the regfile
160             val = sign_or_zero_extend(val, bitwidth, xlen)
161             int_regfile[reg].l[0] = val
162         elif bitwidth == 8:
163             int_regfile[reg].b[offset] = val
164         elif bitwidth == 16:
165             int_regfile[reg].s[offset] = val
166         elif bitwidth == 32:
167             int_regfile[reg].i[offset] = val
168         elif bitwidth == 64:
169             int_regfile[reg].l[offset] = val
170
171     maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
172     destwid = bw(int_csr[rd].elwidth) # destination element width
173     for (i = 0; i < VL; i++)
174         if (predval & 1<<i) # predication uses intregs
175             // TODO, calculate if over-run occurs, for each elwidth
176             src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
177             // TODO, sign/zero-extend src1 and src2 as operation requires
178             if (op_requires_sign_extend_src1)
179                 src1 = sign_extend(src1, maxsrcwid)
180             src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
181             result = src1 + src2 # actual add here
182             // TODO, sign/zero-extend result, as operation requires
183             if (op_requires_sign_extend_dest)
184                 result = sign_extend(result, maxsrcwid)
185             set_polymorphed_reg(rd, destwid, ird, result)
186             if (!int_vec[rd].isvector) break
187         if (int_vec[rd ].isvector)  { id += 1; }
188         if (int_vec[rs1].isvector)  { irs1 += 1; }
189         if (int_vec[rs2].isvector)  { irs2 += 1; }
190
191 Whilst specific sign-extension and zero-extension pseudocode call
192 details are left out, due to each operation being different, the above
193 should make clear that:
194
195 * the source operands are extended out to the maximum bitwidth of all
196 source operands
197 * the operation takes place at that maximum source bitwidth (the
198 destination bitwidth is not involved at this point, at all)
199 * the result is extended (or potentially even, truncated) before being
200 stored in the destination. i.e. truncation (if required) to the
201 destination width occurs **after** the operation **not** before.
202 * when the destination is not marked as "vectorised", the **full**
203 (standard, scalar) register file entry is taken up, i.e. the
204 element is either sign-extended or zero-extended to cover the
205 full register bitwidth (XLEN) if it is not already XLEN bits long.
206
207 Implementors are entirely free to optimise the above, particularly
208 if it is specifically known that any given operation will complete
209 accurately in less bits, as long as the results produced are
210 directly equivalent and equal, for all inputs and all outputs,
211 to those produced by the above algorithm.
212
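As a quick illustration of the ordering (a hedged C sketch, not part of the specification): with rs1 elwidth set to 8, rs2 elwidth set to 16 and rd elwidth set to 8, the add is carried out at the maximum source width (16 bits), and only afterwards truncated to the 8-bit destination:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t  rs1 = 200;   /* 8-bit source element  */
        uint16_t rs2 = 100;   /* 16-bit source element */

        /* sources zero-extended to max(8,16) = 16, add performed at 16-bit */
        uint16_t result16 = (uint16_t)rs1 + rs2;   /* 300: no wrap at 16-bit */

        /* truncation to the 8-bit destination happens *after* the add */
        uint8_t  rd = (uint8_t)result16;           /* 300 & 0xff = 44 */

        printf("16-bit intermediate=%u, 8-bit destination=%u\n", result16, rd);
        return 0;
    }
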
213 ## Polymorphic floating-point operation exceptions and error-handling
214
215 For floating-point operations, conversion takes place without
216 raising any kind of exception. Exactly as specified in the standard
217 RV specification, NAN (or appropriate) is stored if the result
218 is beyond the range of the destination, and, again exactly as
219 with standard scalar RV
220 operations, the floating-point flag is raised (FCSR). And, again, just as
221 with scalar operations, it is software's responsibility to check this flag.
222 Given that the FCSR flags are "accrued", the fact that multiple element
223 operations could have occurred is not a problem.
224
225 Note that it is perfectly legitimate for floating-point bitwidths of
226 only 8 to be specified. However whilst it is possible to apply IEEE 754
227 principles, no actual standard yet exists. Implementors wishing to
228 provide hardware-level 8-bit support rather than throw a trap to emulate
229 in software should contact the author of this specification before
230 proceeding.
231
232 ## Polymorphic shift operators
233
234 A special note is needed for changing the element width of left and right
235 shift operators, particularly right-shift. Even for standard RV base,
236 in order for correct results to be returned, the second operand RS2 must
237 be truncated to be within the range of RS1's bitwidth. spike's implementation
238 of sll for example is as follows:
239
240 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
241
242 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
243 range 0..31 so that RS1 will only be left-shifted by the amount that
244 is possible to fit into a 32-bit register. Whilst this appears not
245 to matter for hardware, it matters greatly in software implementations,
246 and it also matters where an RV64 system is set to "RV32" mode, such
247 that the underlying registers RS1 and RS2 comprise 64 hardware bits
248 each.
249
250 For SV, where each operand's element bitwidth may be over-ridden, the
251 rule about determining the operation's bitwidth *still applies*, being
252 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
253 **also applies to the truncation of RS2**. In other words, *after*
254 determining the maximum bitwidth, RS2's range must **also be truncated**
255 to ensure a correct answer. Example:
256
257 * RS1 is over-ridden to a 16-bit width
258 * RS2 is over-ridden to an 8-bit width
259 * RD is over-ridden to a 64-bit width
260 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
261 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
262
263 Pseudocode (in spike) for this example would therefore be:
264
265 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
266
267 This example illustrates that considerable care therefore needs to be
268 taken to ensure that left and right shift operations are implemented
269 correctly. The key is that
270
271 * The operation bitwidth is determined by the maximum bitwidth
272 of the *source registers*, **not** the destination register bitwidth
273 * The result is then sign-extended (or truncated) as appropriate, as sketched below.
274
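A minimal C sketch of the rule (illustrative only; sign-extension of the result to the destination, where an operation requires it, is left out, and the helper name is hypothetical):

    #include <stdint.h>

    /* the shift is performed at the maximum of the *source* element widths,
     * and RS2 is masked to that same width; the destination width only
     * matters when the result is finally stored.                          */
    static uint64_t sv_sll(uint64_t rs1, int rs1_width,
                           uint64_t rs2, int rs2_width, int rd_width)
    {
        int opwidth = rs1_width > rs2_width ? rs1_width : rs2_width;

        /* zero-extend (mask) rs1 to opwidth, truncate rs2 to 0..opwidth-1 */
        uint64_t src = opwidth == 64 ? rs1 : rs1 & ((1ULL << opwidth) - 1);
        uint64_t amt = rs2 & (uint64_t)(opwidth - 1);
        uint64_t res = src << amt;

        /* finally truncate (or pass through) to the destination width */
        return rd_width == 64 ? res : res & ((1ULL << rd_width) - 1);
    }

With the example above (RS1 16-bit, RS2 8-bit, RD 64-bit), opwidth is 16, so the shift amount is RS2 & (16-1), exactly as in the spike-style pseudocode.
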
275 ## Polymorphic MULH/MULHU/MULHSU
276
277 MULH is designed to take the top half MSBs of a multiply that
278 does not fit within the range of the source operands, such that
279 smaller width operations may produce a full double-width multiply
280 in two cycles. The issue is: SV allows the source operands to
281 have variable bitwidth.
282
283 Here again special attention has to be paid to the rules regarding
284 bitwidth, which, again, are that the operation is performed at
285 the maximum bitwidth of the **source** registers. Therefore:
286
287 * An 8-bit x 8-bit multiply will create a 16-bit result that must
288 be shifted down by 8 bits
289 * A 16-bit x 8-bit multiply will create a 24-bit result that must
290 be shifted down by 16 bits (top 8 bits being zero)
291 * A 16-bit x 16-bit multiply will create a 32-bit result that must
292 be shifted down by 16 bits
293 * A 32-bit x 16-bit multiply will create a 48-bit result that must
294 be shifted down by 32 bits
295 * A 32-bit x 8-bit multiply will create a 40-bit result that must
296 be shifted down by 32 bits
297
298 So again, just as with shift-left and shift-right, the result
299 is shifted down by the maximum of the two source register bitwidths.
300 And, exactly again, truncation or sign-extension is performed on the
301 result. If sign-extension is to be carried out, it is performed
302 from the same maximum of the two source register bitwidths out
303 to the result element's bitwidth.
304
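The following hedged C sketch (for source element widths of 8, 16 or 32 only; 64-bit sources would need a 128-bit intermediate, just as scalar MULHU does) shows the product being formed at full precision and then shifted down by the maximum of the two source widths:

    #include <stdint.h>

    static uint64_t sv_mulhu(uint64_t rs1, int rs1_width,
                             uint64_t rs2, int rs2_width)
    {
        int maxw = rs1_width > rs2_width ? rs1_width : rs2_width;

        /* zero-extension: mask each source to its own element width */
        uint64_t a = rs1 & ((1ULL << rs1_width) - 1);
        uint64_t b = rs2 & ((1ULL << rs2_width) - 1);

        /* e.g. 16x8 gives a 24-bit product shifted down by 16,
         *      32x16 gives a 48-bit product shifted down by 32 */
        return (a * b) >> maxw;
    }
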
305 If truncation occurs, i.e. the top MSBs of the result are lost,
306 this is "Officially Not Our Problem", i.e. it is assumed that the
307 programmer actually desires the result to be truncated. i.e. if the
308 programmer wanted all of the bits, they would have set the destination
309 elwidth to accommodate them.
310
311 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
312
313 Polymorphic element widths in vectorised form mean that the data
314 being loaded (or stored) across multiple registers needs to be treated
315 (reinterpreted) as a contiguous stream of elwidth-wide items, where
316 the source register's element width is **independent** from the destination's.
317
318 This makes for a slightly more complex algorithm when using indirection
319 on the "addressed" register (source for LOAD and destination for STORE),
320 particularly given that the LOAD/STORE instruction provides important
321 information about the width of the data to be reinterpreted.
322
323 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
324 was as follows, and i is the loop from 0 to VL-1:
325
326 srcbase = ireg[rs+i];
327 return mem[srcbase + imm]; // returns XLEN bits
328
329 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
330 chunks are taken from the source memory location addressed by the current
331 indexed source address register, and only when a full 32-bits-worth
332 are taken will the index be moved on to the next contiguous source
333 address register:
334
335 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
336 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
337 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
338 offs = i % elsperblock; // modulo
339 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
340
341 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
342 and 128 for LQ.
343
344 The principle is basically exactly the same as if the srcbase were pointing
345 at the memory of the *register* file: memory is re-interpreted as containing
346 groups of elwidth-wide discrete elements.
347
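A small illustrative trace of the indexing above (plain C, values chosen to match a LW with a source elwidth of 16): elsperblock is 32/16 = 2, so two elements are taken per source address register before the index moves on to the next one:

    #include <stdio.h>

    int main(void)
    {
        int opwidth = 32, bitwidth = 16, VL = 7;
        int elsperblock = opwidth / bitwidth;   /* 2 elements per register */

        for (int i = 0; i < VL; i++)
            printf("element %d: address register rs+%d, offset %d\n",
                   i, i / elsperblock, i % elsperblock);
        return 0;
    }
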
348 When storing the result from a load, it's important to respect the fact
349 that the destination register has its *own separate element width*. Thus,
350 when each element is loaded (at the source element width), any sign-extension
351 or zero-extension (or truncation) needs to be done to the *destination*
352 bitwidth. Also, the storing has the exact same analogous algorithm as
353 above, where in fact it is just the set\_polymorphed\_reg pseudocode
354 (completely unchanged) used above.
355
356 One issue remains: when the source element width is **greater** than
357 the width of the operation, it is obvious that a single LB for example
358 cannot possibly obtain 16-bit-wide data. This condition may be detected
359 where, when using integer divide, elsperblock (the width of the LOAD
360 divided by the bitwidth of the element) is zero.
361
362 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
363
364 elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
365
366 The elements, if the element bitwidth is larger than the LD operation's
367 size, will then be sign/zero-extended to the full LD operation size, as
368 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
369 being passed on to the second phase.
370
371 As LOAD/STORE may be twin-predicated, it is important to note that
372 the rules on twin predication still apply, except where in previous
373 pseudo-code (elwidth=default for both source and target) it was
374 the *registers* that the predication was applied to, it is now the
375 **elements** that the predication is applied to.
376
377 Thus the full pseudocode for all LD operations may be written out
378 as follows:
379
380 function LBU(rd, rs):
381 load_elwidthed(rd, rs, 8, true)
382 function LB(rd, rs):
383 load_elwidthed(rd, rs, 8, false)
384 function LH(rd, rs):
385 load_elwidthed(rd, rs, 16, false)
386 ...
387 ...
388 function LQ(rd, rs):
389 load_elwidthed(rd, rs, 128, false)
390
391     # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
392     function load_memory(rs, imm, i, opwidth):
393         elwidth = int_csr[rs].elwidth
394         bitwidth = bw(elwidth);
395         elsperblock = max(1, opwidth / bitwidth)
396         srcbase = ireg[rs+i/(elsperblock)];
397         offs = i % elsperblock;
398         return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
399
400     function load_elwidthed(rd, rs, opwidth, unsigned):
401         destwid = bw(int_csr[rd].elwidth) # destination element width
402         rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
403         rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
404         ps = get_pred_val(FALSE, rs); # predication on src
405         pd = get_pred_val(FALSE, rd); # ... AND on dest
406         for (int i = 0, int j = 0; i < VL && j < VL;):
407             if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
408             if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
409             val = load_memory(rs, imm, i, opwidth)
410             if unsigned:
411                 val = zero_extend(val, min(opwidth, bw(int_csr[rs].elwidth)))
412             else:
413                 val = sign_extend(val, min(opwidth, bw(int_csr[rs].elwidth)))
414             set_polymorphed_reg(rd, destwid, j, val)
415             if (int_csr[rs].isvec) i++;
416             if (int_csr[rd].isvec) j++; else break;
417
418 Note:
419
420 * when comparing against for example the twin-predicated c.mv
421 pseudo-code, the pattern of independent incrementing of rd and rs
422 is preserved unchanged.
423 * just as with the c.mv pseudocode, zeroing is not included and must be
424 taken into account (TODO).
425 * that due to the use of a twin-predication algorithm, LOAD/STORE also
426 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
427 VSCATTER characteristics.
428 * that due to the use of the same set\_polymorphed\_reg pseudocode,
429 a destination that is not vectorised (marked as scalar) will
430 result in the element being fully sign-extended or zero-extended
431 out to the full register file bitwidth (XLEN). When the source
432 is also marked as scalar, this is how the compatibility with
433 standard RV LOAD/STORE is preserved by this algorithm.
434
435 ### Example Tables showing LOAD elements
436
437 This section contains examples of vectorised LOAD operations, showing
438 how the two stage process works (three if zero/sign-extension is included).
439
440
441 #### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
442
443 This is:
444
445 * a 64-bit load, with an offset of zero
446 * with a source-address elwidth of 16-bit
447 * into a destination-register with an elwidth of 32-bit
448 * where VL=7
449 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
450 * RV64, where XLEN=64 is assumed.
451
452 First, the memory table: due to the
453 element width being 16 and the operation being LD (64), the 64 bits
454 loaded from memory are subdivided into groups of **four** elements.
455 And, with VL being 7 (deliberately to illustrate that this is reasonable
456 and possible), the first four are sourced from the offset addresses pointed
457 to by x5, and the next three from the offset addresses pointed to by
458 the next contiguous register, x6:
459
460 [[!table data="""
461 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
462 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
463 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
464 """]]
465
466 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
467 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
468
469 [[!table data="""
470 byte 3 | byte 2 | byte 1 | byte 0 |
471 0x0 | 0x0 | elem0 ||
472 0x0 | 0x0 | elem1 ||
473 0x0 | 0x0 | elem2 ||
474 0x0 | 0x0 | elem3 ||
475 0x0 | 0x0 | elem4 ||
476 0x0 | 0x0 | elem5 ||
477 0x0 | 0x0 | elem6 ||
479 """]]
480
481 Lastly, the elements are stored in contiguous blocks, as if x8 was also
482 byte-addressable "memory". That "memory" happens to cover registers
483 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
484
485 [[!table data="""
486 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
487 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
488 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
489 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
490 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
491 """]]
492
493 Thus we have data that is loaded from the **addresses** pointed to by
494 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
495 x8 through to half of x11.
496 The end result is that elements 0 and 1 end up in x8, with element 1 being
497 shifted up 32 bits, and so on, until finally element 6 is in the
498 LSBs of x11.
499
500 Note that whilst the memory addressing table is shown in left-to-right byte order,
501 the registers are shown in right-to-left (MSB) order. This does **not**
502 imply that bit or byte-reversal is carried out: it's just easier to visualise
503 memory as being contiguous bytes, and emphasises that registers are not
504 really actually "memory" as such.
505
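The placement in the tables above can be cross-checked with a few lines of C (illustrative only): four 16-bit elements are read per source address register and two 32-bit elements are written per 64-bit destination register:

    #include <stdio.h>

    int main(void)
    {
        int VL = 7;
        int src_per_reg = 64 / 16;   /* 4 source elements per address register */
        int dst_per_reg = 64 / 32;   /* 2 destination elements per register    */

        for (int i = 0; i < VL; i++)
            printf("elem %d: read via x%d (offset %d), written to x%d bits %d..%d\n",
                   i, 5 + i / src_per_reg, i % src_per_reg,
                   8 + i / dst_per_reg,
                   (i % dst_per_reg) * 32, (i % dst_per_reg) * 32 + 31);
        return 0;
    }

Element 6, for example, comes out as x6 offset 2 on the source side and bits 0..31 of x11 on the destination side, matching the tables.
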
506 ## Why SV bitwidth specification is restricted to 4 entries
507
508 The four entries for SV element bitwidths allow only three over-rides:
509
510 * 8 bit
511 * 16 bit
512 * 32 bit
513
514 This would seem inadequate: surely it would be better to have 3 bits or
515 more and allow 64, 128 and some other options besides. The answer here
516 is that it gets too complex, that no RV128 implementation yet exists, and that
517 RV64's default is in any case 64 bit, so the 4 major element widths are covered anyway.
518
519 There is an absolutely crucial aspect of SV here that explicitly
520 needs spelling out, and it's whether the "vectorised" bit is set in
521 the Register's CSR entry.
522
523 If "vectorised" is clear (not set), this indicates that the operation
524 is "scalar". Under these circumstances, when set on a destination (RD),
525 then sign-extension and zero-extension, whilst changed to match the
526 override bitwidth (if set), will erase the **full** register entry
527 (64-bit if RV64).
528
529 When vectorised is *set*, this indicates that the operation now treats
530 **elements** as if they were independent registers, so regardless of
531 the length, any parts of a given actual register that are not involved
532 in the operation are **NOT** modified, but are **PRESERVED**.
533
534 For example:
535
536 * when the vector bit is clear and elwidth set to 16 on the destination
537 register, operations are truncated to 16 bit and then sign or zero
538 extended to the *FULL* XLEN register width.
539 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
540 groups of elwidth sized elements do not fill an entire XLEN register),
541 the "top" bits of the destination register do *NOT* get modified, zero'd
542 or otherwise overwritten.
543
544 SIMD micro-architectures may implement this by using predication on
545 any elements in a given actual register that are beyond the end of
546 the multi-element operation.
547
548 Other microarchitectures may choose to provide byte-level write-enable
549 lines on the register file, such that each 64 bit register in an RV64
550 system requires 8 WE lines. Scalar RV64 operations would require
551 activation of all 8 lines, whereas SV elwidth based operations would
552 activate the required subset of those byte-level write lines.
553
554 Example:
555
556 * rs1, rs2 and rd are all set to 8-bit
557 * VL is set to 3
558 * RV64 architecture is set (UXL=64)
559 * add operation is carried out
560 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
561 concatenated with similar add operations on bits 15..8 and 7..0
562 * bits 24 through 63 **remain as they originally were**.
563
564 Example SIMD micro-architectural implementation:
565
566 * SIMD architecture works out the nearest round number of elements
567 that would fit into a full RV64 register (in this case: 8)
568 * SIMD architecture creates a hidden predicate, binary 0b00000111
569 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
570 * SIMD architecture goes ahead with the add operation as if it
571 was a full 8-wide batch of 8 adds
572 * SIMD architecture passes top 5 elements through the adders
573 (which are "disabled" due to zero-bit predication)
574 * SIMD architecture gets the top 5 8-bit elements back unmodified
575 and stores them in rd.
576
577 This requires a read on rd, however this is required anyway in order
578 to support non-zeroing mode.
579
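A tiny C sketch of the masking involved (illustrative only, matching the VL=3, elwidth=8, RV64 example above): the hidden predicate, which doubles as the byte-level write-enable mask, has only the bottom three bits set:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int VL = 3, elwidth_bits = 8, xlen = 64;

        int elems_per_reg = xlen / elwidth_bits;          /* 8 elements     */
        uint8_t hidden_pred = (uint8_t)((1u << VL) - 1);  /* 0b00000111     */

        printf("elements per register: %d\n", elems_per_reg);
        printf("hidden predicate / byte write-enables: 0x%02x\n", hidden_pred);
        return 0;
    }
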
580 ## Polymorphic floating-point
581
582 Standard scalar RV integer operations base the register width on XLEN,
583 which may be changed (UXL in USTATUS, and the corresponding MXL and
584 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
585 arithmetic operations are therefore restricted to an active XLEN bits,
586 with sign or zero extension to pad out the upper bits when XLEN has
587 been dynamically set to less than the actual register size.
588
589 For scalar floating-point, the active (used / changed) bits are
590 specified exclusively by the operation: ADD.S specifies an active
591 32-bits, with the upper bits of the source registers needing to
592 be all 1s ("NaN-boxed"), and the destination upper bits being
593 *set* to all 1s (including on LOAD/STOREs).
594
595 Where elwidth is set to default (on any source or the destination)
596 it is obvious that this NaN-boxing behaviour can and should be
597 preserved. When elwidth is non-default things are less obvious,
598 so need to be thought through. Here is a normal (scalar) sequence,
599 assuming an RV64 which supports Quad (128-bit) FLEN:
600
601 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
602 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
603 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
604 top 64 MSBs ignored.
605
606 Therefore it makes sense to mirror this behaviour when, for example,
607 elwidth is set to 32. Assume elwidth set to 32 on all source and
608 destination registers:
609
610 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
611 floating-point numbers.
612 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
613 in bits 0-31 and the second in bits 32-63.
614 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
615
616 Here's the thing: it does not make sense to overwrite the top 64 MSBs
617 of the registers either during the FLD **or** the ADD.D. The reason
618 is that, effectively, the top 64 MSBs actually represent a completely
619 independent 64-bit register, so overwriting it is not only gratuitous
620 but may actually be harmful for a future extension to SV which may
621 have a way to directly access those top 64 bits.
622
623 The decision is therefore **not** to touch the upper parts of floating-point
624 registers wherever elwidth is set to non-default values, including
625 when "isvec" is false in a given register's CSR entry. Only when the
626 elwidth is set to default **and** isvec is false will the standard
627 RV behaviour be followed, namely that the upper bits be modified.
628
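Expressed as a hedged one-line C helper (names are mine, not the specification's): the upper bits of a floating-point destination are only overwritten with the NaN-boxing pattern when the register's elwidth is default *and* it is not marked as a vector:

    #include <stdbool.h>

    /* elwidth_default and isvec stand in for the destination register's
     * CSR table entry; in every other case the upper bits are untouched. */
    static bool nanbox_upper_bits(bool elwidth_default, bool isvec)
    {
        return elwidth_default && !isvec;
    }
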
629 Ultimately if elwidth is default and isvec false on *all* source
630 and destination registers, a SimpleV instruction defaults completely
631 to standard RV scalar behaviour (this holds true for **all** operations,
632 right across the board).
633
634 The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set to
635 non-default values, are effectively all the same: they all still perform
636 multiple ADD operations, just at different widths. A future extension
637 to SimpleV may actually allow ADD.S to access the upper bits of the
638 register, effectively breaking down a 128-bit register into a bank
639 of 4 independently-accessible 32-bit registers.
640
641 In the meantime, although when e.g. setting VL to 8 it would technically
642 make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
643 using ADD.Q may be an easy way to signal to the microarchitecture that
644 it is to receive a higher VL value. On a superscalar OoO architecture
645 there may be absolutely no difference; however, simpler SIMD-style
646 microarchitectures may not necessarily have the infrastructure in
647 place to know the difference, such that when VL=8 and an ADD.D instruction
648 is issued, it completes in 2 cycles (or more) rather than one, where
649 if an ADD.Q had been issued instead on such simpler microarchitectures
650 it would complete in one.
651
652 ## Specific instruction walk-throughs
653
654 This section covers walk-throughs of the above-outlined procedure
655 for converting standard RISC-V scalar arithmetic operations to
656 polymorphic widths, to ensure that it is correct.
657
658 ### add
659
660 Standard Scalar RV32/RV64 (xlen):
661
662 * RS1 @ xlen bits
663 * RS2 @ xlen bits
664 * add @ xlen bits
665 * RD @ xlen bits
666
667 Polymorphic variant:
668
669 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
670 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
671 * add @ max(rs1, rs2) bits
672 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
673
674 Note here that polymorphic add zero-extends its source operands,
675 whereas addw sign-extends.
676
677 ### addw
678
679 The RV Specification specifically states that "W" variants of arithmetic
680 operations always produce 32-bit signed values. In a polymorphic
681 environment it is reasonable to assume that the signed aspect is
682 preserved, where it is the length of the operands and the result
683 that may be changed.
684
685 Standard Scalar RV64 (xlen):
686
687 * RS1 @ xlen bits
688 * RS2 @ xlen bits
689 * add @ xlen bits
690 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
691
692 Polymorphic variant:
693
694 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
695 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
696 * add @ max(rs1, rs2) bits
697 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
698
699 Note here that polymorphic addw sign-extends its source operands,
700 whereas add zero-extends.
701
702 This requires a little more in-depth analysis. Where the bitwidth of
703 rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
704 only where the bitwidths of rs1 and rs2 differ that the
705 lesser-width operand will be sign-extended.
706
707 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
708 whereas for add they are both zero-extended. This holds true for all arithmetic
709 operations ending with "W".
710
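A hedged C sketch of these addw rules (illustrative only, for element widths up to 64): both sources are sign-extended to the maximum source width, the add is performed there, and the result is then sign-extended or truncated to the destination width:

    #include <stdint.h>

    /* sign-extend the low 'width' bits of v out to 64 bits */
    static int64_t sext(int64_t v, int width)
    {
        int shift = 64 - width;
        return (int64_t)((uint64_t)v << shift) >> shift;
    }

    static int64_t sv_addw(int64_t rs1, int rs1_width,
                           int64_t rs2, int rs2_width, int rd_width)
    {
        int maxw = rs1_width > rs2_width ? rs1_width : rs2_width;

        int64_t a   = sext(rs1, rs1_width);   /* sign-extend each source      */
        int64_t b   = sext(rs2, rs2_width);
        int64_t sum = sext(a + b, maxw);      /* add performed at maxw bits   */

        return sext(sum, rd_width);           /* widen or truncate to rd bits */
    }
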
711 ### addiw
712
713 Standard Scalar RV64I:
714
715 * RS1 @ xlen bits, truncated to 32-bit
716 * immed @ 12 bits, sign-extended to 32-bit
717 * add @ 32 bits
718 * RD @ rd bits. sign-extend to rd if rd > 32, otherwise truncate.
719
720 Polymorphic variant:
721
722 * RS1 @ rs1 bits
723 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
724 * add @ max(rs1, 12) bits
725 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
726
727 # Predication Element Zeroing
728
729 The introduction of zeroing on traditional vector predication is usually
730 intended as an optimisation for lane-based microarchitectures with register
731 renaming to be able to save power by avoiding a register read on elements
732 that are passed en-masse through the ALU. Simpler microarchitectures
733 do not have this issue: they simply do not pass the element through to
734 the ALU at all, and therefore do not store it back in the destination.
735 More complex non-lane-based micro-architectures can, when zeroing is
736 not set, use the predication bits to simply avoid sending element-based
737 operations to the ALUs, entirely: thus, over the long term, potentially
738 keeping all ALUs 100% occupied even when elements are predicated out.
739
740 SimpleV's design principle is not based on or influenced by
741 microarchitectural design factors: it is a hardware-level API.
742 Therefore, looking purely at whether zeroing is *useful* or not,
743 (whether fewer instructions are needed for certain scenarios),
744 given that a case can be made for zeroing *and* non-zeroing, the
745 decision was taken to add support for both.
746
747 ## Single-predication (based on destination register)
748
749 Zeroing on predication for arithmetic operations is taken from
750 the destination register's predicate. i.e. the predication *and*
751 zeroing settings to be applied to the whole operation come from the
752 CSR Predication table entry for the destination register.
753 Thus when zeroing is set on predication of a destination element,
754 if the predication bit is clear, then the destination element is *set*
755 to zero (twin-predication is slightly different, and will be covered
756 next).
757
758 Thus the pseudo-code loop for a predicated arithmetic operation
759 is modified to as follows:
760
761     for (i = 0; i < VL; i++)
762         if not zeroing: # an optimisation
763             while (!(predval & 1<<i) && i < VL)
764                 if (int_vec[rd ].isvector)  { id += 1; }
765                 if (int_vec[rs1].isvector)  { irs1 += 1; }
766                 if (int_vec[rs2].isvector)  { irs2 += 1; }
767             if i == VL:
768                 return
769         if (predval & 1<<i)
770             src1 = ....
771             src2 = ...
773             result = src1 + src2 # actual add (or other op) here
774             set_polymorphed_reg(rd, destwid, ird, result)
775             if int_vec[rd].ffirst and result == 0:
776                 VL = i # result was zero, end loop early, return VL
777                 return
778             if (!int_vec[rd].isvector) return
779         else if zeroing:
780             result = 0
781             set_polymorphed_reg(rd, destwid, ird, result)
782         if (int_vec[rd ].isvector)  { id += 1; }
783         else if (predval & 1<<i) return
784         if (int_vec[rs1].isvector)  { irs1 += 1; }
785         if (int_vec[rs2].isvector)  { irs2 += 1; }
786         if (id == VL or irs1 == VL or irs2 == VL): return
787
788 The optimisation to skip elements entirely is only possible for certain
789 micro-architectures when zeroing is not set. However for lane-based
790 micro-architectures this optimisation may not be practical, as it
791 implies that elements end up in different "lanes". Under these
792 circumstances it is perfectly fine to simply have the lanes
793 "inactive" for predicated elements, even though it results in
794 less than 100% ALU utilisation.
795
796 ## Twin-predication (based on source and destination register)
797
798 Twin-predication is not that much different, except that
799 the source is independently zero-predicated from the destination.
800 This means that the source may be zero-predicated *or* the
801 destination zero-predicated *or both*, or neither.
802
803 When, with twin-predication, zeroing is set on the source and not
804 the destination, a clear predicate bit indicates that a zero
805 data element is passed through the operation (the exception being:
806 if the source data element is to be treated as an address - a LOAD -
807 then the data returned *from* the LOAD is zero, rather than looking up an
808 *address* of zero).
809
810 When zeroing is set on the destination and not the source, then just
811 as with single-predicated operations, a zero is stored into the destination
812 element (or target memory address for a STORE).
813
814 Zeroing on both source and destination effectively results in a bitwise
815 NAND operation of the source and destination predicates: the result is that
816 where either the source predicate OR the destination predicate is set to 0,
817 a zero element will ultimately end up in the destination register.
818
819 However: this may not necessarily be the case for all operations;
820 implementors, particularly of custom instructions, clearly need to
821 think through the implications in each and every case.
822
823 Here is pseudo-code for a twin zero-predicated operation:
824
825     function op_mv(rd, rs) # MV not VMV!
826         rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
827         rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
828         ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
829         pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
830         for (int i = 0, int j = 0; i < VL && j < VL):
831             if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
832             if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
833             if ((pd & 1<<j))
834                 if ((ps & 1<<i))
835                     sourcedata = ireg[rs+i];
836                 else
837                     sourcedata = 0
838                 ireg[rd+j] <= sourcedata
839             else if (zerodst)
840                 ireg[rd+j] <= 0
841             if (int_csr[rs].isvec)
842                 i++;
843             if (int_csr[rd].isvec)
844                 j++;
845             else
846                 if ((pd & 1<<j))
847                     break;
848
849 Note that in the instance where the destination is a scalar, the hardware
850 loop is ended the moment a value *or a zero* is placed into the destination
851 register/element. Also note that, for clarity, variable element widths
852 have been left out of the above.
853
854 # Subsets of RV functionality
855
856 This section describes the differences when SV is implemented on top of
857 different subsets of RV.
858
859 ## Common options
860
861 It is permitted to only implement SVprefix and not the VBLOCK instruction
862 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
863 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
864 traps may emulate the format.
865
866 It is permitted in SVprefix to either not implement VL or not implement
867 SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
868 *MUST* raise illegal instruction on implementations that do not support
869 VL or SUBVL.
870
871 It is permitted to limit the size of either (or both) the register files
872 down to the original size of the standard RV architecture. However, going
873 below the mandatory limits set in the RV standard will result in non-compliance
874 with the SV Specification.
875
876 ## RV32 / RV32F
877
878 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
879 maximum limit for predication is also restricted to 32 bits. Whilst not
880 actually specifically an "option" it is worth noting.
881
882 ## RV32G
883
884 Normally in standard RV32 it does not make much sense to have
885 RV32G: the critical instructions that are missing in standard RV32
886 are those for moving data between the double-width floating-point
887 registers and the integer ones, as well as the FCVT routines.
888
889 In an earlier draft of SV, it was possible to specify an elwidth
890 of double the standard register size: this had to be dropped,
891 and may be reintroduced in future revisions.
892
893 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
894
895 When floating-point is not implemented, the size of the User Register and
896 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
897 per table).
898
899 ## RV32E
900
901 In embedded scenarios the User Register and Predication CSRs may be
902 dropped entirely, or optionally limited to 1 CSR, such that the combined
903 number of entries from the M-Mode CSR Register table plus U-Mode
904 CSR Register table is either 4 16-bit entries or (if the U-Mode is
905 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
906 the Predication CSR tables.
907
908 RV32E is the most likely candidate for simply detecting that registers
909 are marked as "vectorised", and generating an appropriate exception
910 for the VL loop to be implemented in software.
911
912 ## RV128
913
914 RV128 has not been especially considered, here, however it has some
915 extremely large possibilities: double the element width implies
916 256-bit operands, spanning 2 128-bit registers each, and predication
917 of total length 128 bit given that XLEN is now 128.
918
919 # Example usage
920
921 TODO evaluate strncpy and strlen
922 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
923
924 ## strncpy
925
926 RVV version: <a name="strncpy"></a>
927
928 strncpy:
929 mv a3, a0 # Copy dst
930 loop:
931 setvli x0, a2, vint8 # Vectors of bytes.
932 vlbff.v v1, (a1) # Get src bytes
933 vseq.vi v0, v1, 0 # Flag zero bytes
934 vmfirst a4, v0 # Zero found?
935 vmsif.v v0, v0 # Set mask up to and including zero byte.
936 vsb.v v1, (a3), v0.t # Write out bytes
937 bgez a4, exit # Done
938 csrr t1, vl # Get number of bytes fetched
939 add a1, a1, t1 # Bump src pointer
940 sub a2, a2, t1 # Decrement count.
941 add a3, a3, t1 # Bump dst pointer
942 bnez a2, loop # Anymore?
943
944 exit:
945 ret
946
947 SV version (WIP):
948
949 strncpy:
950 mv a3, a0
951 SETMVLI 8 # set max vector to 8
952 RegCSR[a3] = 8bit, a3, scalar
953 RegCSR[a1] = 8bit, a1, scalar
954 RegCSR[t0] = 8bit, t0, vector
955 PredTb[t0] = ffirst, x0, inv
956 loop:
957 SETVLI a2, t4 # t4 and VL now 1..8
958 ldb t0, (a1) # t0 fail first mode
959 bne t0, x0, allnonzero # still ff
960 # VL points to last nonzero
961 GETVL t4 # from bne tests
962 addi t4, t4, 1 # include zero
963 SETVL t4 # set exactly to t4
964 stb t0, (a3) # store incl zero
965 ret # end subroutine
966 allnonzero:
967 stb t0, (a3) # VL legal range
968 GETVL t4 # from bne tests
969 add a1, a1, t4 # Bump src pointer
970 sub a2, a2, t4 # Decrement count.
971 add a3, a3, t4 # Bump dst pointer
972 bnez a2, loop # Anymore?
973 exit:
974 ret
975
976 Notes:
977
978 * Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
979 * obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
980 * with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
981 * RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
982 * RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
983 * with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
984 * setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
985 * ldb and bne are both using t0, both in ffirst mode
986 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
987 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
988 * however as t0 is in ffirst mode, the first fail will ALSO stop the compares, and reduce VL as well
989 * the branch only goes to allnonzero if all tests succeed
990 * if it did not, we can safely increment VL by 1 (using t4) to include the zero.
991 * SETVL sets *exactly* the requested amount into VL.
992 * the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
993 * this would cause the stb to copy up to the end of the legal memory
994 * of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
995
996 ## strcpy
997
998 RVV version:
999
1000 mv a3, a0 # Save start
1001 loop:
1002 setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
1003 vldbff.v v1, (a3) # Get bytes
1004 csrr a1, vl # Get bytes actually read e.g. if fault
1005 vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
1006 add a3, a3, a1 # Bump pointer
1007 vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
1008 bltz a2, loop # Not found?
1009 add a0, a0, a1 # Sum start + bump
1010 add a3, a3, a2 # Add index of zero byte
1011 sub a0, a3, a0 # Subtract start address+bump
1012 ret