   1 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
   2
   3 Key insight: Simple-V is intended as an abstraction layer to provide
   4 a consistent "API" to parallelisation of existing *and future* operations.
   5 *Actual* internal hardware-level parallelism is *not* required, such
   6 that Simple-V may be viewed as providing a "compact" or "consolidated"
   7 means of issuing multiple near-identical arithmetic instructions to an
   8 instruction queue (FIFO), pending execution.
   9
  10 *Actual* parallelism, if added independently of Simple-V in the form
  11 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
  12 implementations, or SIMD, or anything else, would then benefit from
  13 the uniformity of a consistent API.
  14
**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
  16
  17 Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
  18
  19 [[!toc ]]
  20
  21 # Introduction
  22
  23 # CSRs <a name="csrs"></a>
  24
  25 There are two CSR tables needed to create lookup tables which are used at
  26 the register decode phase.
  27
  28 ## MAXVECTORLENGTH
  29
MAXVECTORLENGTH (also written MAXVECTORDEPTH in this document) is the
same concept as MVL in RVV.  However in Simple-V, given that its primary
(base, unextended) purpose is for 3D, Video and other purposes (not
requiring supercomputing capability), it makes sense to limit
MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64 and
so on).

The reason for setting this limit is so that a predication register, when
marked as such, fits into a single register as opposed to fanning out
over several registers.  This keeps the implementation a little simpler.
Note also (as described in the VSETVL section) that the *minimum*
for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
and 31 for RV32 or RV64).
  42
  43 Note that RVV on top of Simple-V may choose to over-ride this decision.
  44
  62 ## Predication CSR <a name="predication_csr_table"></a>
  63
The Predication CSR is a key-value store indicating whether, if a given
destination register (integer or floating-point) is referred to in an
instruction, it is to be predicated.  However it is important to note
that the register named in the instruction is *different* from the
register that ends up supplying the predication mask, due to the level
of indirection through the lookup table.
  69
  70 * regidx is the actual register that in combination with the
  71   i/f flag, if that integer or floating-point register is referred to,
  72   results in the lookup table being referenced to find the predication
  73   mask to use on the operation in which that (regidx) register has
  74   been used
  75 * predidx (in combination with the bank bit in the future) is the
  76   *actual* register to be used for the predication mask.  Note:
  77   in effect predidx is actually a 6-bit register address, as the bank
  78   bit is the MSB (and is nominally set to zero for now).
  79 * inv indicates that the predication mask bits are to be inverted
  80   prior to use *without* actually modifying the contents of the
  81   register itself.
  82 * zeroing is either 1 or 0, and if set to 1, the operation must
  83   place zeros in any element position where the predication mask is
  84   set to zero.  If zeroing is set to 0, unpredicated elements *must*
  85   be left alone.  Some microarchitectures may choose to interpret
  86   this as skipping the operation entirely.  Others which wish to
  87   stick more closely to a SIMD architecture may choose instead to
  88   interpret unpredicated elements as an internal "copy element"
  89   operation (which would be necessary in SIMD microarchitectures
  90   that perform register-renaming)
  91
| PrCSR | 13     | 12     | 11    | 10  | (9..5)  | (4..0)  |
| ----- | -      | -      | -     | -   | ------- | ------- |
| 0     | bank0  | zero0  | inv0  | i/f | regidx  | predidx |
| 1     | bank1  | zero1  | inv1  | i/f | regidx  | predidx |
| ..    | bank.. | zero.. | inv.. | i/f | regidx  | predidx |
| 15    | bank15 | zero15 | inv15 | i/f | regidx  | predidx |
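
As an illustration of the bit layout in the table above, the following
Python sketch unpacks one 14-bit table entry into its fields.  The names
`PredEntry` and `decode_pred_csr` are hypothetical (they are not part of
the proposal), bits 4..0 are taken to be the predication register index
that the list above calls predidx, and the i/f flag is assumed to be 0 for
integer and 1 for floating-point.

```python
from typing import NamedTuple


class PredEntry(NamedTuple):
    bank: int     # bit 13, nominally zero for now
    zero: int     # bit 12, zeroing
    inv: int      # bit 11, invert the mask before use
    is_fp: int    # bit 10, i/f flag (assumed 0 = integer, 1 = floating-point)
    regidx: int   # bits 9..5, register that triggers the lookup
    predidx: int  # bits 4..0, register holding the predication mask


def decode_pred_csr(value: int) -> PredEntry:
    """Split one Predication CSR table entry into its fields."""
    return PredEntry(
        bank=(value >> 13) & 1,
        zero=(value >> 12) & 1,
        inv=(value >> 11) & 1,
        is_fp=(value >> 10) & 1,
        regidx=(value >> 5) & 0x1F,
        predidx=value & 0x1F,
    )


# Integer register x7 predicated by x3, with zeroing and inversion enabled.
entry = decode_pred_csr((1 << 12) | (1 << 11) | (7 << 5) | 3)
```

Holding the decoded fields in a named tuple keeps the later table
inversion (keyed by regidx) a matter of plain attribute access.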
  98
  99 The Predication CSR Table is a key-value store, so implementation-wise
 100 it will be faster to turn the table around (maintain topologically
 101 equivalent state):
 102
    struct pred {
        bool zero;
        bool inv;
        bool bank;    // 0 for now, 1=rsvd
        bool enabled;
        int predidx;  // redirection: actual int register to use
    };

    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
    struct pred int_pred_reg[32];  // 64 in future (bank=1)

    for (i = 0; i < 16; i++) {
        // the i/f flag selects the table: 0 = integer, otherwise FP
        struct pred *tb = (CSRpred[i].type == 0) ? int_pred_reg : fp_pred_reg;
        int idx = CSRpred[i].regidx;
        tb[idx].zero    = CSRpred[i].zero;
        tb[idx].inv     = CSRpred[i].inv;
        tb[idx].bank    = CSRpred[i].bank;
        tb[idx].predidx = CSRpred[i].predidx;
        tb[idx].enabled = true;
    }
 122
 123 So when an operation is to be predicated, it is the internal state that
 124 is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
 125 pseudo-code for operations is given, where p is the explicit (direct)
 126 reference to the predication register to be used:
 127
 128     for (int i=0; i<vl; ++i)
 129         if ([!]preg[p][i])
 130            (d ? vreg[rd][i] : sreg[rd]) =
 131             iop(s1 ? vreg[rs1][i] : sreg[rs1],
 132                 s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
 133
This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows.
 137
    if type(iop) == INT:
        preg = int_pred_reg[rd]
    else:
        preg = fp_pred_reg[rd]

    for (int i=0; i<vl; ++i)
        predidx = preg.predidx; // the indirection takes place HERE
        if (!preg.enabled)
            predicate = ~0x0; // all parallel ops enabled
        else:
            predicate = intregfile[predidx]; // get actual reg contents HERE
            if (preg.inv) // invert if requested
                predicate = ~predicate;
        if (predicate & (1<<i)) // bitwise test of element i's mask bit
           (d ? regfile[rd+i] : regfile[rd]) =
            iop(s1 ? regfile[rs1+i] : regfile[rs1],
                s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
        else if (preg.zero)
           (d ? regfile[rd+i] : regfile[rd]) = 0; // zeroing: clear masked-out element
 157
 158 Note:
 159
 160 * d, s1 and s2 are booleans indicating whether destination,
 161   source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity.  rd, rs1 and rs2 must ALSO go through
  register-level redirection (from the Register CSR table) if they are
  vectors.
 166
 167 If written as a function, obtaining the predication mask (but not whether
 168 zeroing takes place) may be done as follows:
 169
    def get_pred_val(bool is_fp_op, int reg):
       tb = fp_pred_reg if is_fp_op else int_pred_reg
       if (!tb[reg].enabled):
          return ~0x0                   // all ops enabled
       predidx = tb[reg].predidx        // redirection occurs HERE
       predicate = intregfile[predidx]  // actual predicate HERE
       if (tb[reg].inv):
          predicate = ~predicate        // invert ALL bits
       return predicate
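
The lookup and the predicated loop described in this section can be
exercised with a small Python model.  This is only a sketch under stated
assumptions: the table is a plain dictionary keyed by register number,
XLEN is fixed at 32, all three operands are treated as vectors, and the
names `predicated_add` and `demo` are hypothetical.

```python
XLEN = 32
ALL_ONES = (1 << XLEN) - 1


def get_pred_val(pred_table, intregs, reg):
    """Return the mask for `reg`, following the predidx indirection."""
    entry = pred_table.get(reg)
    if entry is None or not entry["enabled"]:
        return ALL_ONES                      # all element operations enabled
    predicate = intregs[entry["predidx"]]    # contents of the mask register
    if entry["inv"]:
        predicate = ~predicate & ALL_ONES    # invert; the register is untouched
    return predicate


def predicated_add(regs, pred_table, rd, rs1, rs2, vl):
    """Vector-vector integer add of vl elements, predicated on rd."""
    entry = pred_table.get(rd)
    zeroing = entry is not None and entry["enabled"] and entry["zero"]
    predicate = get_pred_val(pred_table, regs, rd)
    for i in range(vl):
        if predicate & (1 << i):
            regs[rd + i] = (regs[rs1 + i] + regs[rs2 + i]) & ALL_ONES
        elif zeroing:
            regs[rd + i] = 0                 # zeroing: clear masked-out element
        # otherwise the masked-out element is left alone


def demo(zero, inv=False):
    regs = [0] * 32
    regs[5] = 0b0101                         # the mask lives in x5
    regs[8:12] = [1, 2, 3, 4]
    regs[12:16] = [10, 20, 30, 40]
    regs[16:20] = [99, 99, 99, 99]           # previous destination contents
    table = {16: {"enabled": True, "inv": inv, "zero": zero, "predidx": 5}}
    predicated_add(regs, table, 16, 8, 12, 4)
    return regs[16:20]
```

Calling `demo(False)` leaves the masked-out destination elements at their
previous value, whereas `demo(True)` clears them, which is the distinction
that the zeroing bit draws.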
 179
 180 ## Register CSR key-value (CAM) table
 181
 182 The purpose of the Register CSR table is four-fold:
 183
* To mark integer and floating-point registers as requiring "redirection"
  if they are ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-6-bit lookup table
  (where the 6th bit - bank - is always set to 0 for now).
 188 * To indicate whether, after redirection through the lookup table, the
 189   register is a vector (or remains a scalar).
 190 * To over-ride the implicit or explicit bitwidth that the operation would
 191   normally give the register.
 192 * To indicate if the register is to be interpreted as "packed" (SIMD)
 193   i.e. containing multiple contiguous elements of size equal to "bitwidth".
 194
 195 | RgCSR | 15     | 14     | 13       | (12..11) | 10  | (9..5)  | (4..0)  |
 196 | ----- | -      | -      | -        | -        | -   | ------- | ------- |
 197 | 0     | simd0  | bank0  | isvec0   | vew0     | i/f | regidx  | predidx |
 198 | 1     | simd1  | bank1  | isvec1   | vew1     | i/f | regidx  | predidx |
 199 | ..    | simd.. | bank.. | isvec..  | vew..    | i/f | regidx  | predidx |
 200 | 15    | simd15 | bank15 | isvec15  | vew15    | i/f | regidx  | predidx |
 201
 202 vew may be one of the following (giving a table "bytestable", used below):
 203
 204 | vew | bitwidth  |
 205 | --- | --------- |
 206 | 00  | default   |
 207 | 01  | default/2 |
 208 | 10  | 8         |
 209 | 11  | 16        |
 210
 211 Extending this table (with extra bits) is covered in the section
 212 "Implementing RVV on top of Simple-V".
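
To make the vew encoding concrete, here is a minimal Python sketch of the
table above.  The function names are hypothetical, "default" is passed in
as `default_bits` (the width the operation would normally use), and the
elements-per-register calculation assumes that a packed register is simply
filled with contiguous elements of the selected width.

```python
def element_width(vew: int, default_bits: int) -> int:
    """Map the 2-bit vew field to an element width in bits."""
    if vew == 0b00:
        return default_bits
    if vew == 0b01:
        return default_bits // 2
    if vew == 0b10:
        return 8
    if vew == 0b11:
        return 16
    raise ValueError("vew is a 2-bit field")


def elements_per_register(vew: int, default_bits: int, reg_bits: int) -> int:
    """Number of packed (SIMD) elements that fit in one register."""
    return reg_bits // element_width(vew, default_bits)
```

For example, on RV64 with a 64-bit default, vew=01 selects 32-bit elements
and therefore two elements per packed register.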
 213
 214 As the above table is a CAM (key-value store) it may be appropriate
 215 to expand it as follows:
 216
 217     struct vectorised fp_vec[32], int_vec[32]; // 64 in future
 218
 219     for (i = 0; i < 16; i++) // 16 CSRs?
 220        tb = int_vec if CSRvec[i].type == 0 else fp_vec
 221        idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
 222        tb[idx].elwidth  = CSRvec[i].elwidth
 223        tb[idx].regidx   = CSRvec[i].regidx  // indirection
 224        tb[idx].isvector = CSRvec[i].isvector // 0=scalar
 225        tb[idx].packed   = CSRvec[i].packed  // SIMD or not
 226        tb[idx].bank     = CSRvec[i].bank    // 0 (1=rsvd)
 227
 228 TODO: move elsewhere
 229
 230     # TODO: use elsewhere (retire for now)
 231     vew = CSRbitwidth[rs1]
 232     if (vew == 0)
 233         bytesperreg = (XLEN/8) # or FLEN as appropriate
 234     elif (vew == 1)
 235         bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
 236     else:
 237         bytesperreg = bytestable[vew] # 8 or 16
 238     simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
 239     vlen = CSRvectorlen[rs1] * simdmult
 240     CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
 241
 242 The reason for multiplying the vector length by the number of SIMD elements
 243 (in each individual register) is so that each SIMD element may optionally be
 244 predicated.
 245
 246 An example of how to subdivide the register file when bitwidth != default
 247 is given in the section "Bitwidth Virtual Register Reordering".
 248
 249 # Instructions
 250
 251 Despite being a 98% complete and accurate topological remap of RVV
 252 concepts and functionality, no new instructions are needed.
 253 *All* RVV instructions can be re-mapped, however xBitManip
 254 becomes a critical dependency for efficient manipulation of predication
 255 masks (as a bit-field).  Despite the removal of all but VSETVL and VGETVL,
 256 *all instructions from RVV are topologically re-mapped and retain their
 257 complete functionality, intact*.
 258
 259 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
 260 equivalents, so are left out of Simple-V.  VSELECT could be included if
 261 there existed a MV.X instruction in RV (MV.X is a hypothetical
 262 non-immediate variant of MV that would allow another register to
 263 specify which register was to be copied).  Note that if any of these three
 264 instructions are added to any given RV extension, their functionality
 265 will be inherently parallelised.
 266
 267 ## Instruction Format
 268
The instruction format for Simple-V does not actually have *any* explicit
compare operations, *any* arithmetic, floating point or *any*
memory instructions.  There are in fact **no operations added at all**.
Instead it *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV, FCVT, and
LOAD/STORE depending on CSR configurations for bitwidth and predication.
**Everything** becomes parallelised.  *This includes Compressed
instructions* as well as any future instructions and Custom Extensions.
 279
 280 ## VSETVL
 281
 282 NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
 283 with the instruction format remaining the same.
 284
 285 VSETVL is slightly different from RVV in that the minimum vector length
 286 is required to be at least the number of registers in the register file,
 287 and no more than XLEN.  This allows vector LOAD/STORE to be used to switch
 288 the entire bank of registers using a single instruction (see Appendix,
 289 "Context Switch Example").  The reason for limiting VSETVL to XLEN is
 290 down to the fact that predication bits fit into a single register of length
 291 XLEN bits.
 292
The second change is that when VSETVL is requested to be stored
into x0, it is *ignored* silently (VSETVL x0, x5, #4).

The third change is that there is an additional immediate added to VSETVL,
to which VL is set after first going through MIN-filtering.
So when using the "vsetvl rs1, rs2, #vlen" instruction, it becomes:
 299
 300     VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
 301
where RegfileLen <= MAXVECTORDEPTH <= XLEN
 303
This has implications for the microarchitecture, as VL is required to be
set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
requested in the #immediate parameter.  RVV has the option to set VL
to an arbitrary value that suits the conditions and the micro-architecture:
SV does *not* permit that.
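
The MIN-filtering can be written out as a short Python sketch.  The
function name `vsetvl` and its argument names are illustrative only, and
the value read from rs2 is passed in directly because the treatment of
rs2 = x0 is not spelled out in this section.

```python
def vsetvl(vlen_imm: int, rs2_value: int, max_vector_depth: int) -> int:
    """Return VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2).

    The result is exact: this sketch never picks a smaller,
    implementation-chosen value the way an RVV implementation may.
    """
    return min(min(vlen_imm, max_vector_depth), rs2_value)
```

With MAXVECTORDEPTH set to 32, a request for 4 elements with plenty of
work remaining yields exactly 4, a request for 40 is clamped to 32, and a
request for 8 with only 3 elements remaining in rs2 yields 3.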
 309
 310 The reason is so that if SV is to be used for a context-switch or as a
 311 substitute for LOAD/STORE-Multiple, the operation can be done with only
 312 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
 313 single LD/ST operation).  If VL does *not* get set to the register file
 314 length when VSETVL is called, then a software-loop would be needed.
 315 To avoid this need, VL *must* be set to exactly what is requested
 316 (limits notwithstanding).
 317
 318 Therefore, in turn, unlike RVV, implementors *must* provide
 319 pseudo-parallelism (using sequential loops in hardware) if actual
 320 hardware-parallelism in the ALUs is not deployed.  A hybrid is also
 321 permitted (as used in Broadcom's VideoCore-IV) however this must be
 322 *entirely* transparent to the ISA.
 323
 324 ## Branch Instruction:
 325
Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (isvector=1).  When this reinterpretation
is enabled the "immediate" field of the branch operation is taken to be a
predication target register, rs3.  The predicate target register rs3 is
to be treated as a bitfield (up to a maximum of XLEN bits corresponding
to a maximum of XLEN elements).

If either src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
goes ahead as vector-scalar or scalar-vector.  Implementors should note that
this could require considerable multi-porting of the register file in order
to parallelise properly, so may have to involve the use of register caching
and transparent copying (see Multiple-Banked Register File Architectures
paper).

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
 343
 344 This is the overloaded table for Integer-base Branch operations.  Opcode
 345 (bits 6..0) is set in all cases to 1100011.
 346
 347 [[!table  data="""
 348 31    .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
 349 imm[12,10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
 350 7           | 5        | 5     | 3      | 4             | 1  | 7       |
 351 reserved    | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
 352 reserved    | src2     | src1  | 000    | predicate rs3     || BEQ     |
 353 reserved    | src2     | src1  | 001    | predicate rs3     || BNE     |
 354 reserved    | src2     | src1  | 010    | predicate rs3     || rsvd    |
 355 reserved    | src2     | src1  | 011    | predicate rs3     || rsvd    |
 356 reserved    | src2     | src1  | 100    | predicate rs3     || BLT     |
 357 reserved    | src2     | src1  | 101    | predicate rs3     || BGE     |
 358 reserved    | src2     | src1  | 110    | predicate rs3     || BLTU    |
 359 reserved    | src2     | src1  | 111    | predicate rs3     || BGEU    |
 360 """]]
 361
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
src1 and src2.
 365
 366 Below is the overloaded table for Floating-point Predication operations.
 367 Interestingly no change is needed to the instruction format because
 368 FP Compare already stores a 1 or a zero in its "rd" integer register
 369 target, i.e. it's not actually a Branch at all: it's a compare.
 370 The target needs to simply change to be a predication bitfield (done
 371 implicitly).
 372
As with Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to
1010011.  Likewise Single-precision (fmt bits 26..25) is still set to 00.
Double-precision is still set to 01, whilst Quad-precision
appears not to have a definition in V2.3-Draft (but should be unaffected).
 378
It is however noted that an entry "FNE" (the opposite of FEQ) is missing.
In ordinary branch code this is fine, because the standard RVF compare
can always be followed up with an integer BEQ or a BNE (or a compressed
comparison to zero or non-zero); in predication terms the omission has
more of an impact.  To deal with this, SV's predication has
had "invert" added to it.
 385
 386 [[!table  data="""
 387 31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 .. 7  | 6 ... 0 |
 388 funct5  | fmt      | rs2      | rs1   | funct3 | rd       | opcode  |
5       | 2        | 5        | 5     | 3      | 5        | 7       |
 390 10100   | 00/01/11 | src2     | src1  | 010    | pred rs3 | FEQ     |
 391 10100   | 00/01/11 | src2     | src1  | **011**| pred rs3 | rsvd    |
 392 10100   | 00/01/11 | src2     | src1  | 001    | pred rs3 | FLT     |
 393 10100   | 00/01/11 | src2     | src1  | 000    | pred rs3 | FLE     |
 394 """]]
 395
 396 Note (**TBD**): floating-point exceptions will need to be extended
 397 to cater for multiple exceptions (and statuses of the same).  The
 398 usual approach is to have an array of status codes and bit-fields,
 399 and one exception, rather than throw separate exceptions for each
 400 Vector element.
 401
 402 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
 403 for predicated compare operations of function "cmp":
 404
 405     for (int i=0; i<vl; ++i)
 406       if ([!]preg[p][i])
 407          preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
 408                            s2 ? vreg[rs2][i] : sreg[rs2]);
 409
 410 With associated predication, vector-length adjustments and so on,
 411 and temporarily ignoring bitwidth (which makes the comparisons more
 412 complex), this becomes:
 413
 414     if I/F == INT: # integer type cmp
 415         preg = int_pred_reg[rd]
 416         reg = int_regfile
 417     else:
 418         preg = fp_pred_reg[rd]
 419         reg = fp_regfile
 420
 421     s1 = reg_is_vectorised(src1);
 422     s2 = reg_is_vectorised(src2);
 423     if (!s2 && !s1) goto branch;
 424     for (int i = 0; i < VL; ++i)
      if (cmp(s1 ? reg[src1+i]:reg[src1],
              s2 ? reg[src2+i]:reg[src2]))
             preg[rs3] |= 1<<i;  # bitfield not vector
 428
 429 Notes:
 430
 431 * Predicated SIMD comparisons would break src1 and src2 further down
 432   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 433   Reordering") setting Vector-Length times (number of SIMD elements) bits
 434   in Predicate Register rs3 as opposed to just Vector-Length bits.
* Predicated Branches do not actually have an adjustment to the Program
  Counter, so bits 25 through 30 are not needed in any case.
* There are plenty of reserved opcodes for which bits 25 through 30 could
  be put to good use if there is a suitable use-case.
* FLT and FLE may be inverted to FGT and FGE if needed by swapping
  src1 and src2 (likewise the integer counterparts).
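
The compare-into-bitfield behaviour described above can be modelled in a
few lines of Python.  This is a sketch rather than a definitive
implementation: it ignores bitwidth and SIMD packing, returns a freshly
built mask instead of OR-ing into an existing rs3 value, and the name
`predicated_compare` is hypothetical.

```python
import operator


def predicated_compare(regs, cmp, src1, src2, s1_vec, s2_vec, vl):
    """Compare vl elements and return the result as a bitfield."""
    if not (s1_vec or s2_vec):
        raise ValueError("scalar-scalar is an ordinary branch, not modelled")
    mask = 0
    for i in range(vl):
        a = regs[src1 + i] if s1_vec else regs[src1]
        b = regs[src2 + i] if s2_vec else regs[src2]
        if cmp(a, b):
            mask |= 1 << i          # one bit per element, not one register
    return mask


regs = [0] * 32
regs[8:12] = [5, 1, 7, 3]           # vector operand starting at x8
regs[20] = 4                        # scalar operand in x20

lt_mask = predicated_compare(regs, operator.lt, 8, 20, True, False, 4)
# "greater than" is synthesised by swapping the two sources of "less than"
gt_mask = predicated_compare(regs, operator.lt, 20, 8, False, True, 4)
```

Here `lt_mask` has bits 1 and 3 set (the elements 1 and 3 are below 4) and
`gt_mask` has bits 0 and 2 set.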
 441
 442 ## Compressed Branch Instruction:
 443
 444 Compressed Branch instructions are likewise re-interpreted as predicated
 445 2-register operations, with the result going into rs3.  All the bits of
 446 the immediate are re-interpreted for different purposes, to extend the
 447 number of comparator operations to beyond the original specification,
 448 but also to cater for floating-point comparisons as well as integer ones.
 449
 450 [[!table  data="""
 451 15..13 | 12...10  | 9..7 | 6..5  | 4..2 | 1..0 | name |
 452 funct3 | imm      | rs10 | imm   |      | op   |      |
 453 3      | 3        | 3    | 2     |  3   | 2    |      |
 454 C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
 455 110    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.EQ |
 456 111    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.NE |
 457 110    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LT |
 458 111    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LE |
 459 """]]
 460
 461 Notes:
 462
* Bits 5, 13, 14 and 15 make up the comparator type
* Bit 6 indicates whether to use integer or floating-point comparisons
* In both floating-point and integer cases there are four predication
  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping
  src1 and src2).
 468
 469 ## LOAD / STORE Instructions <a name="load_store"></a>
 470
 471 For full analysis of topological adaptation of RVV LOAD/STORE
 472 see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
 473 may be implicitly overloaded into the one base RV LOAD instruction,
 474 and likewise for STORE.
 475
 476 Revised LOAD:
 477
 478 [[!table  data="""
 479 31 | 30 | 29 25 | 24    20 | 19 15 | 14   12 | 11 7 | 6    0 |
 480 imm[11:0]               |||| rs1   | funct3  | rd   | opcode |
 481 1  | 1  |  5    | 5        | 5     | 3       | 5    | 7      |
 482 ?  | s  |  rs2  | imm[4:0] | base  | width   | dest | LOAD   |
 483 """]]
 484
 485 The exact same corresponding adaptation is also carried out on the single,
 486 double and quad precision floating-point LOAD-FP and STORE-FP operations,
 487 which fit the exact same instruction format.  Thus all three types
 488 (unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
 489 as well as FSW, FSD and FSQ.
 490
 491 Notes:
 492
 493 * LOAD remains functionally (topologically) identical to RVV LOAD
 494   (for both integer and floating-point variants).
* The predication register is not explicitly shown in the instruction: it
  is implicit, based on the CSR predicate state for the rd (destination)
  register
 497 * rs2, the source, may *also be marked as a vector*, which implicitly
 498   is taken to indicate "Indexed Load" (LD.X)
 499 * Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
 500 * Bit 31 is reserved (ideas under consideration: auto-increment)
 501 * **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
 502 * **TODO**: clarify where width maps to elsize
 503
 504 Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
 505
 506     if (unit-strided) stride = elsize;
 507     else stride = areg[as2]; // constant-strided
 508
 509     preg = int_pred_reg[rd]
 510
 511     for (int i=0; i<vl; ++i)
 512       if ([!]preg[rd] & 1<<i)
 513         for (int j=0; j<seglen+1; j++)
 514         {
          if (CSRvectorised[rs2])
 516              offs = vreg[rs2+i]
 517           else
 518              offs = i*(seglen+1)*stride;
 519           vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
 520         }
 521
 522 Taking CSR (SIMD) bitwidth into account involves using the vector
 523 length and register encoding according to the "Bitwidth Virtual Register
 524 Reordering" scheme shown in the Appendix (see function "regoffs").
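
As a rough illustration of the three addressing modes, the Python sketch
below models a predicated vector LOAD with the segment length fixed at
zero (one register per element).  Memory is modelled as a dictionary from
byte address to value, the predicate is passed in as an already-resolved
mask, and the name `vector_load` and its keyword arguments are
hypothetical.

```python
def vector_load(mem, regs, rd, base, vl, elsize, predicate,
                stride=None, index_reg=None):
    """Load up to vl elements into regs[rd], regs[rd+1], ..."""
    for i in range(vl):
        if not predicate & (1 << i):
            continue                         # element masked out: skip
        if index_reg is not None:
            offs = regs[index_reg + i]       # indexed: offset from a vector
        elif stride is not None:
            offs = i * stride                # constant-strided
        else:
            offs = i * elsize                # unit-strided
        regs[rd + i] = mem[regs[base] + offs]


def fresh_regs():
    regs = [-1] * 32
    regs[10] = 0x100                         # base address in x10
    regs[4:8] = [12, 0, 4, 8]                # byte offsets for the indexed case
    return regs


# Sixteen 4-byte words at 0x100 upwards, holding 100, 101, 102, ...
mem = {0x100 + 4 * k: 100 + k for k in range(16)}
```

The same function body serves all three modes, which mirrors the way the
three RVV load types are folded into the one base LOAD encoding.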
 525
 526 A similar instruction exists for STORE, with identical topological
 527 translation of all features.  **TODO**
 528
 529 ## Compressed LOAD / STORE Instructions
 530
 531 Compressed LOAD and STORE are of the same format, where bits 2-4 are
 532 a src register instead of dest:
 533
 534 [[!table  data="""
 535 15  13 | 12       10 | 9    7 | 6         5 | 4  2 | 1  0 |
 536 funct3 | imm         | rs10   | imm         | rd0  | op   |
 537 3      | 3           | 3      | 2           | 3    | 2    |
 538 C.LW   | offset[5:3] | base   | offset[2|6] | dest | C0   |
 539 """]]
 540
 541 Unfortunately it is not possible to fit the full functionality
 542 of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
 543 require another operand (rs2) in addition to the operand width
 544 (which is also missing), offset, base, and src/dest.
 545
 546 However a close approximation may be achieved by taking the top bit
 547 of the offset in each of the five types of LD (and ST), reducing the
 548 offset to 4 bits and utilising the 5th bit to indicate whether "stride"
 549 is to be enabled.  In this way it is at least possible to introduce
 550 that functionality.
 551
 552 (**TODO**: *assess whether the loss of one bit from offset is worth having
 553 "stride" capability.*)
 554
 555 We also assume (including for the "stride" variant) that the "width"
 556 parameter, which is missing, is derived and implicit, just as it is
 557 with the standard Compressed LOAD/STORE instructions.  For C.LW, C.LD
 558 and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
 559 C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
 560
 561 Interestingly we note that the Vectorised Simple-V variant of
 562 LOAD/STORE (Compressed and otherwise), due to it effectively using the
 563 standard register file(s), is the direct functional equivalent of
 564 standard load-multiple and store-multiple instructions found in other
 565 processors.
 566
Section 12.3 of the riscv-isa manual V2.3-draft comments (page 76) that
"For virtual memory systems some data accesses could be resident
in physical memory and some not".  The interesting question then arises:
how does RVV deal with the exact same scenario?
Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
of detecting early page / segmentation faults and adjusting the TLB
in advance, accordingly: other strategies are explored in the Appendix
Section "Virtual Memory Page Faults".
 575
 576 ## Vectorised Copy/Move (and conversion) instructions
 577
 578 There is a series of 2-operand instructions involving copying (and
 579 alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ.  These operations all
 580 follow the same pattern, as it is *both* the source *and* destination
 581 predication masks that are taken into account.  This is different from
 582 the three-operand arithmetic instructions, where the predication mask
 583 is taken from the *destination* register, and applied uniformly to the
 584 elements of the source register(s), element-for-element.
 585
 586 ### C.MV Instruction <a name="c_mv"></a>
 587
There is no MV instruction in RV; however, there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation.  The actual instruction's format does not change:
 595
 596 [[!table  data="""
 597 15  12 | 11   7 | 6  2 | 1  0 |
 598 funct4 | rd     | rs   | op   |
 599 4      | 5      | 5    | 2    |
C.MV   | dest   | src  | C2   |
 601 """]]
 602
 603 A simplified version of the pseudocode for this operation is as follows:
 604
    function op_mv(rd, rs) # MV not VMV!
      rd_isvec = int_vec[rd].isvector; # record before redirection
      rs_isvec = int_vec[rs].isvector;
      rd = rd_isvec ? int_vec[rd].regidx : rd;
      rs = rs_isvec ? int_vec[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (rs_isvec) while (i < VL && !(ps & 1<<i)) i++; # skip masked src
        if (rd_isvec) while (j < VL && !(pd & 1<<j)) j++; # skip masked dest
        if (i == VL || j == VL) break; # ran out of enabled elements
        ireg[rd+j] <= ireg[rs+i];
        if (rs_isvec) i++;
        if (rd_isvec) j++;
 616
 617 Note that:
 618
 619 * elwidth (SIMD) is not covered above
 620 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
 621   not covered
 622
 623 There are several different instructions from RVV that are covered by
 624 this one opcode:
 625
 626 [[!table  data="""
 627 src    | dest    | predication   | op             |
 628 scalar | vector  | none          | VSPLAT         |
 629 scalar | vector  | destination   | sparse VSPLAT  |
 630 scalar | vector  | 1-bit dest    | VINSERT        |
 631 vector | scalar  | 1-bit? src    | VEXTRACT       |
 632 vector | vector  | none          | VCOPY          |
 633 vector | vector  | src           | Vector Gather  |
 634 vector | vector  | dest          | Vector Scatter |
 635 vector | vector  | src & dest    | Gather/Scatter |
 636 vector | vector  | src == dest   | sparse VCOPY   |
 637 """]]
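
A few rows of the table above can be reproduced with a small Python model
of the twin-predicated move.  This is a sketch under simplifying
assumptions: register redirection and elwidth are left out, the two masks
are passed in already resolved, and the name `vector_mv` is hypothetical.

```python
def vector_mv(regs, rd, rs, rd_vec, rs_vec, pd, ps, vl):
    """Twin-predicated move: ps masks the source, pd the destination."""
    i = j = 0
    while i < vl and j < vl:
        if rs_vec:
            while i < vl and not ps & (1 << i):
                i += 1                       # skip masked-out source elements
        if rd_vec:
            while j < vl and not pd & (1 << j):
                j += 1                       # skip masked-out dest elements
        if i == vl or j == vl:
            break                            # no enabled elements remain
        regs[rd + j] = regs[rs + i]
        if rs_vec:
            i += 1
        if rd_vec:
            j += 1
        if not (rs_vec or rd_vec):
            break                            # plain scalar move: one copy


def run(rd_vec, rs_vec, pd, ps):
    regs = [0] * 32
    regs[12:16] = [10, 11, 12, 13]           # source at x12
    vector_mv(regs, 16, 12, rd_vec, rs_vec, pd, ps, 4)
    return regs[16:20]                       # destination at x16


splat = run(True, False, 0b1111, 0b1111)     # scalar src, vector dest
gather = run(True, True, 0b1111, 0b1010)     # predication on src only
scatter = run(True, True, 0b1010, 0b1111)    # predication on dest only
```

In this model `splat` is `[10, 10, 10, 10]`, `gather` packs the two
enabled source elements into the first two destination slots, and
`scatter` spreads the first two source elements over the enabled
destination slots.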
 638
 639 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
 640 operations with inversion on the src and dest predication for one of the
 641 two C.MV operations.
 642
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
 648
 649 ### FMV, FNEG and FABS Instructions
 650
These are identical in form to C.MV, except covering floating-point
register copying.  The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type
conversion operation of the appropriate size covering the source and
destination register bitwidths.
 657
 658 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
 659
### FCVT Instructions
 661
 662 These are again identical in form to C.MV, except that they cover
 663 floating-point to integer and integer to floating-point.  When element
 664 width in each vector is set to default, the instructions behave exactly
 665 as they are defined for standard RV (scalar) operations, except vectorised
 666 in exactly the same fashion as outlined in C.MV.
 667
 668 However when the source or destination element width is not set to default,
 669 the opcode's explicit element widths are *over-ridden* to new definitions,
 670 and the opcode's element width is taken as indicative of the SIMD width
 671 (if applicable i.e. if packed SIMD is requested) instead.
 672
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision (32-bit) floating-point
number in rd.
 675 If however the source rs1 is set to be a vector, where elwidth is set to
 676 default/2 and "packed SIMD" is enabled, then the first 32 bits of
 677 rs1 are converted to a floating-point number to be stored in rd's
 678 first element and the higher 32-bits *also* converted to floating-point
 679 and stored in the second.  The 32 bit size comes from the fact that
 680 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
 681 divide that by two it means that rs1 element width is to be taken as 32.
 682
 683 Similar rules apply to the destination register.
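
The arithmetic in the FCVT.S.L example can be sketched as follows. This is only an illustration under stated assumptions: the function name `packed_fcvt_s_l` is hypothetical, the integers are treated as signed, "first" is taken to mean the least-significant 32 bits, and the results are returned as a Python list rather than being packed into a destination register.

```python
import struct


def packed_fcvt_s_l(rs1_value, op_int_width=64, elwidth_divisor=2):
    """Convert each packed integer lane of rs1_value to single precision.

    op_int_width is the integer width implied by the opcode (64 for
    FCVT.S.L); elwidth_divisor models elwidth = default/2.
    """
    lane_bits = op_int_width // elwidth_divisor   # 64 / 2 = 32
    lane_mask = (1 << lane_bits) - 1
    results = []
    for lane in range(elwidth_divisor):
        raw = (rs1_value >> (lane * lane_bits)) & lane_mask
        if raw >> (lane_bits - 1):                # sign-extend the lane
            raw -= 1 << lane_bits
        # Round-trip through a 4-byte float to model single precision.
        single = struct.unpack("<f", struct.pack("<f", float(raw)))[0]
        results.append(single)
    return results


# Low 32 bits hold 5, high 32 bits hold 0xFFFFFFFF (-1 as a signed lane).
lanes = packed_fcvt_s_l((0xFFFFFFFF << 32) | 5)
```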
 684
 685 # Exceptions
 686
 687 > What does an ADD of two different-sized vectors do in simple-V?
 688
* if the two source operands are not the same length, throw an exception.
 690 * if the destination operand is also a vector, and the source is longer
 691   than the destination, throw an exception.
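
The two rules above amount to a simple check before the operation is issued. A minimal sketch, assuming vector lengths are known as plain integers; the exception class `VectorLengthTrap` and the function `check_vector_lengths` are hypothetical names used only for illustration.

```python
class VectorLengthTrap(Exception):
    """Stands in for the exception the rules above say is thrown."""


def check_vector_lengths(src1_len, src2_len, dest_len=None):
    """Raise if the operand lengths break either rule.

    dest_len is None when the destination is not a vector.
    """
    if src1_len != src2_len:
        raise VectorLengthTrap("source operands differ in length")
    if dest_len is not None and src1_len > dest_len:
        raise VectorLengthTrap("source is longer than destination")


def traps(*lengths):
    # Small helper for the usage lines below.
    try:
        check_vector_lengths(*lengths)
    except VectorLengthTrap:
        return True
    return False


mismatched_sources = traps(4, 3)       # rule 1 applies
short_destination = traps(4, 4, 2)     # rule 2 applies
acceptable = traps(4, 4, 8)            # neither rule applies
```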
 692
 693 > And what about instructions like JALR?
 694 > What does jumping to a vector do?
 695
 696 * Throw an exception.  Whether that actually results in spawning threads
 697   as part of the trap-handling remains to be seen.
 698
 699 # Under consideration <a name="issues"></a>
 700
 701 ## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
 702
One of the goals of this parallelism proposal is to avoid instruction
duplication.  However, with the base ISA having been designed explicitly
to *avoid* condition-codes entirely, shoe-horning predication into it
becomes quite challenging.
 707
 708 However what if all branch instructions, if referencing a vectorised
 709 register, were instead given *completely new analogous meanings* that
 710 resulted in a parallel bit-wise predication register being set?  This
 711 would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
 712 BLT and BGE.
 713
We might imagine that FEQ, FLT and FLE would also need to be converted,
 715 however these are effectively *already* in the precise form needed and
 716 do not need to be converted *at all*!  The difference is that FEQ, FLT
 717 and FLE *specifically* write a 1 to an integer register if the condition
 718 holds, and 0 if not.  All that needs to be done here is to say, "if
 719 the integer register is tagged with a bit that says it is a predication
 720 register, the **bit** in the integer register is set based on the
 721 current vector index" instead.
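
A minimal sketch of that reinterpretation, assuming the predication register is modelled as a Python integer and the vector operands as lists of floats. The function name `vector_compare_to_predicate` and the `COMPARE_OPS` table are hypothetical.

```python
import operator

# Scalar FEQ/FLT/FLE write 1 or 0; here the same result lands in bit i.
COMPARE_OPS = {"FEQ": operator.eq, "FLT": operator.lt, "FLE": operator.le}


def vector_compare_to_predicate(op, src1, src2, pred=0):
    """Set or clear bit i of pred according to op(src1[i], src2[i])."""
    compare = COMPARE_OPS[op]
    for i, (a, b) in enumerate(zip(src1, src2)):
        if compare(a, b):
            pred |= 1 << i        # condition holds: set the bit
        else:
            pred &= ~(1 << i)     # condition fails: clear the bit
    return pred


a = [1.0, 5.0, 2.0, 9.0]
b = [2.0, 3.0, 2.0, 10.0]
flt_mask = vector_compare_to_predicate("FLT", a, b)
fle_mask = vector_compare_to_predicate("FLE", a, b)
feq_mask = vector_compare_to_predicate("FEQ", a, b)
```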
 722
 723 There is, in the standard Conditional Branch instruction, more than
 724 adequate space to interpret it in a similar fashion:
 725
 726 [[!table  data="""
 727 31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7       | 6....0 |
 728 imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
 729  1      | 6          | 5    | 5    | 3      | 4        | 1       |   7    |
 730    offset[12,10:5]  || src2 | src1 | BEQ    | offset[11,4:1]    || BRANCH |
 731 """]]
 732
 733 This would become:
 734
 735 [[!table  data="""
31      | 30..25    | 24..20 | 19..15 | 14..12 | 11..8    | 7       | 6..0   |
imm[12] | imm[10:5] | rs2    | rs1    | funct3 | imm[4:1] | imm[11] | opcode |
1       | 6         | 5      | 5      | 3      | 4        | 1       | 7      |
reserved           || src2   | src1   | BEQ    | predicate rs3     || BRANCH |
 740 """]]
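
The field boundaries in the table above can be sketched as a decoder. This is an illustration only: the function name `decode_predicated_branch` and the dictionary keys are hypothetical, and treating bits 11..7 as a single 5-bit register number is an assumption read off the "predicate rs3" column.

```python
def decode_predicated_branch(word):
    """Split a 32-bit branch encoding using the reinterpreted layout."""
    return {
        "opcode":   word & 0x7F,           # bits 6..0
        "pred_rs3": (word >> 7) & 0x1F,    # bits 11..7 (was imm[4:1], imm[11])
        "funct3":   (word >> 12) & 0x7,    # bits 14..12
        "rs1":      (word >> 15) & 0x1F,   # bits 19..15
        "rs2":      (word >> 20) & 0x1F,   # bits 24..20
        "reserved": (word >> 25) & 0x7F,   # bits 31..25 (was imm[12], imm[10:5])
    }


# BEQ (funct3 = 000) comparing x1 with x2, predicate target in register 5.
BRANCH = 0b1100011
word = (2 << 20) | (1 << 15) | (0b000 << 12) | (5 << 7) | BRANCH
fields = decode_predicated_branch(word)
```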
 741
 742 Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
 743 with the interesting side-effect that there is space within what is presently
 744 the "immediate offset" field to reinterpret that to add in not only a bit
 745 field to distinguish between floating-point compare and integer compare,
 746 not only to add in a second source register, but also use some of the bits as
 747 a predication target as well.
 748
 749 [[!table  data="""
 750 15..13 | 12 ....... 10 | 9...7 | 6 ......... 2     | 1 .. 0 |
funct3 | imm           | rs1'  | imm               | op     |
 752 3      | 3             | 3     | 5                 | 2      |
 753 C.BEQZ | offset[8,4:3] | src   | offset[7:6,2:1,5] | C1     |
 754 """]]
 755
The retro-fitted version would instead use the CS format:
 757
 758 [[!table  data="""
 759 15..13 | 12 .  10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
funct3 | imm      | rs1'   | imm    | rs2' | op     |
 761 3      | 3        | 3      | 2      | 3    | 2      |
 762 C.BEQZ | pred rs3 | src1   | I/F B  | src2 | C1     |
 763 """]]
 764
Bit 6 would be decoded as "operation refers to Integer or Float", including
interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
"C" Standard, version 2.0.  Bit 5 would allow the operation to be extended:
in combination with funct3 = 110 or 111, it gives four distinct (predicated)
comparison operators.  In both floating-point and integer cases those could
be EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping src1 and src2).
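
The retro-fitted compressed layout can be sketched in the same way. The function name `decode_predicated_cbranch` and the key names are hypothetical. The sketch returns the raw 3-bit register fields and the raw values of bits 6 and 5, since neither the polarity of the Integer/Float bit nor the exact operator assignment is fixed by the text above.

```python
def decode_predicated_cbranch(half):
    """Split a 16-bit C.BEQZ/C.BNEZ encoding using the proposed layout."""
    return {
        "op":            half & 0x3,          # bits 1..0
        "src2":          (half >> 2) & 0x7,   # bits 4..2
        "ext_bit":       (half >> 5) & 0x1,   # bit 5: extends the operator set
        "int_float_bit": (half >> 6) & 0x1,   # bit 6: Integer or Float
        "src1":          (half >> 7) & 0x7,   # bits 9..7
        "pred_rs3":      (half >> 10) & 0x7,  # bits 12..10
        "funct3":        (half >> 13) & 0x7,  # bits 15..13
    }


# funct3 = 110 (C.BEQZ), op = 01 (C1), with arbitrary register fields.
half = (0b110 << 13) | (0b011 << 10) | (0b010 << 7) | (1 << 6) | (0 << 5) | (0b101 << 2) | 0b01
cfields = decode_predicated_cbranch(half)

# funct3 (110 or 111) combined with bit 5 gives four operator selections.
selector = ((cfields["funct3"] & 1) << 1) | cfields["ext_bit"]
```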
 772
 773