[libreriscv.git] / simple_v_extension / specification.mdwn
1 # Simple-V (Parallelism Extension Proposal) Specification
2
3 * Status: DRAFTv0.2
4 * Last edited: 17 Oct 2018
5 * Ancillary resource: [[opcodes]]
6
7 With thanks to:
8
9 * Allen Baum
10 * Jacob Bachmeyer
11 * Guy Lemurieux
12 * Jacob Lifshay
13 * The RISC-V Founders, without whom this all would not be possible.
14
15 [[!toc ]]
16
17 # Summary and Background: Rationale
18
19 Simple-V is a uniform parallelism API for RISC-V hardware that has several
20 unplanned side-effects including code-size reduction, expansion of
21 HINT space and more. The reason for
22 creating it is to provide a manageable way to turn a pre-existing design
23 into a parallel one, in a step-by-step incremental fashion, allowing
24 the implementor to focus on adding hardware where it is needed and necessary.
25
26 Critically: **No new instructions are added**. The parallelism (if any
27 is implemented) is implicitly added by tagging *standard* scalar registers
28 for redirection. When such a tagged register is used in any instruction,
29 it indicates that the PC shall **not** be incremented; instead a loop
30 is activated where *multiple* instructions are issued to the pipeline
31 (as determined by a length CSR), with contiguously incrementing register
32 numbers starting from the tagged register. When the last "element"
33 has been reached, only then is the PC permitted to move on. Thus
34 Simple-V effectively sits (slots) *in between* the instruction decode phase
35 and the ALU(s).
36
37 The barrier to entry with SV is therefore very low. The minimum
38 compliant implementation is software-emulation (traps), requiring
39 only the CSRs and CSR tables, and that an exception be thrown if an
40 instruction's registers are detected to have been tagged. The looping
41 that would otherwise be done in hardware is thus carried out in software,
42 instead. Whilst much slower, it is "compliant" with the SV specification,
43 and may be suited for implementation in RV32E and also in situations
44 where the implementor wishes to focus on certain aspects of SV, without
45 investing unnecessary time and resources in silicon, whilst also conforming
46 strictly with the API. A good area to punt to software would be the
47 polymorphic element width capability for example.
48
49 Hardware Parallelism, if any, is therefore added at the implementor's
50 discretion to turn what would otherwise be a sequential loop into a
51 parallel one.
52
53 To emphasise that clearly: Simple-V (SV) is *not*:
54
55 * A SIMD system
56 * A SIMT system
57 * A Vectorisation Microarchitecture
58 * A microarchitecture of any specific kind
59 * A mandatory parallel processor microarchitecture of any kind
60 * A supercomputer extension
61
62 SV does **not** tell implementors how or even if they should implement
63 parallelism: it is a hardware "API" (Application Programming Interface)
64 that, if implemented, presents a uniform and consistent way to *express*
65 parallelism, at the same time leaving the choice of if, how, how much,
66 when and whether to parallelise operations **entirely to the implementor**.
67
68 # Basic Operation
69
70 The principle of SV is as follows:
71
72 * CSRs indicating which registers are "tagged" as parallel are set up
73 * A "Vector Length" CSR is set, indicating the span of any future
74 "parallel" operations.
75 * A **scalar** operation, just after the decode phase and before the
76 execution phase, checks the CSR register tables to see if any of
77 its registers have been marked as "vectorised"
78 * If so, a hardware "macro-unrolling loop" is activated, of length
79 VL, that effectively issues **multiple** identical instructions (whether
80 they be sequential or parallel is entirely up to the implementor),
81 using contiguous sequentially-incrementing registers.
82
83 In this way an entire scalar algorithm may be vectorised with
84 the minimum of modification to the hardware and to compiler toolchains.
85 There are **no** new opcodes.
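As an illustration only (not part of the specification), the loop described above can be modelled in a few lines of software. All names here (`regs`, `vectorised`, `sv_add`) are invented for the sketch:

```python
# Minimal software model of SV's "macro-unrolling loop" (illustrative only).
# A standard scalar ADD whose registers are tagged as "vectorised" is
# expanded into VL element operations on contiguous register numbers.

regs = [0] * 32        # model of the integer register file
vectorised = set()     # registers tagged via the (modelled) CSR table
VL = 4                 # model of the Vector Length CSR

def sv_add(rd, rs1, rs2):
    # after decode, check the CSR table: any tagged register
    # activates the macro-unrolling loop of length VL
    if {rd, rs1, rs2} & vectorised:
        for i in range(VL):
            regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]
    else:
        regs[rd] = regs[rs1] + regs[rs2]

# tag x8, fill x16..x19 and x24..x27, then issue ONE scalar add
vectorised.add(8)
for i in range(4):
    regs[16 + i] = i + 1      # x16..x19 = 1, 2, 3, 4
    regs[24 + i] = 10         # x24..x27 = 10
sv_add(8, 16, 24)             # expands into 4 element adds
```

Note that `sv_add` is issued exactly as a scalar add would be: the expansion happens entirely behind the decode phase, which is the point of the exercise.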
86
87 # CSRs <a name="csrs"></a>
88
89 For U-Mode there are two CSR key-value stores needed to create lookup
90 tables which are used at the register decode phase.
91
92 * A register CSR key-value table (typically 8 32-bit CSRs of 2 16-bits each)
93 * A predication CSR key-value table (again, 8 32-bit CSRs of 2 16-bits each)
94 * Small U-Mode and S-Mode register and predication CSR key-value tables
95 (2 32-bit CSRs of 2x 16-bit entries each).
96 * An optional "reshaping" CSR key-value table which remaps from a 1D
97 linear shape to 2D or 3D, including full transposition.
98
99 There are also four additional CSRs for User-Mode:
100
101 * CFG subsets the CSR tables
102 * MVL (the Maximum Vector Length)
103 * VL (which has different characteristics from standard CSRs)
104 * STATE (useful for saving and restoring during context switch,
105 and for providing fast transitions)
106
107 There are also three additional CSRs for Supervisor-Mode:
108
109 * SMVL
110 * SVL
111 * SSTATE
112
113 And likewise for M-Mode:
114
115 * MMVL
116 * MVL
117 * MSTATE
118
119 Both Supervisor and M-Mode have their own (small) CSR register and
120 predication tables of only 4 entries each.
121
122 ## CFG
123
124 This CSR may be used to switch between subsets of the CSR Register and
125 Predication Tables: it is kept to 5 bits so that a single CSRRWI instruction
126 can be used. A setting of all ones is reserved to indicate that SimpleV
127 is disabled.
128
129 | (4..3) | (2...0) |
130 | ------ | ------- |
131 | size | bank |
132
133 Bank is 3 bits in size, and indicates the starting index of the CSR
134 Register and Predication Table entries that are "enabled". Given that
135 each CSR table row is 16 bits and contains 2 CAM entries each, there
136 are only 8 CSRs to cover in each table, so 3 bits is sufficient.
137
138 Size is 2 bits. With the exception of when bank == 7 and size == 3,
139 the number of elements enabled is taken by left-shifting 2 by size:
140
141 | size | elements |
142 | ------ | -------- |
143 | 0 | 2 |
144 | 1 | 4 |
145 | 2 | 8 |
146 | 3 | 16 |
147
148 Given that there are 2 16-bit CAM entries per CSR table row, this
149 may also be viewed as the number of CSR rows to enable, by raising 2 to
150 the power of size.
151
152 Examples:
153
154 * When bank = 0 and size = 3, SVREGCFG0 through to SVREGCFG7 are
155 enabled, and SVPREDCFG0 through to SVPREDCFG7 are enabled.
156 * When bank = 1 and size = 3, SVREGCFG1 through to SVREGCFG7 are
157 enabled, and SVPREDCFG1 through to SVPREDCFG7 are enabled.
158 * When bank = 3 and size = 0, SVREGCFG3 and SVPREDCFG3 are enabled.
160 * When bank = 7 and size = 1, SVREGCFG7 and SVPREDCFG7 are enabled.
161 * When bank = 7 and size = 3, SimpleV is entirely disabled.
162
163 In this way it is possible to enable and disable SimpleV with a
164 single instruction, and, furthermore, on context-switching the quantity
165 of CSRs to be saved and restored is greatly reduced.
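The bank/size rules above can be sketched as a small decode helper (an illustrative model, not normative; the clamping of the enabled range to the 8 available rows is inferred from the examples above):

```python
# Decode the 5-bit CFG CSR: bits (4..3) = size, bits (2..0) = bank.
# Returns the list of enabled SVREGCFG/SVPREDCFG row indices, or None
# when SimpleV is disabled (all ones: bank == 7 and size == 3).
def cfg_decode(cfg):
    size = (cfg >> 3) & 0b11
    bank = cfg & 0b111
    if bank == 7 and size == 3:
        return None               # SimpleV entirely disabled
    nrows = 1 << size             # 2 raised to the power of size
    return list(range(bank, min(bank + nrows, 8)))
```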
166
167 ## MAXVECTORLENGTH (MVL)
168
169 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
170 is variable length and may be dynamically set. MVL is
171 however limited to the regfile bitwidth XLEN (1-32 for RV32,
172 1-64 for RV64 and so on).
173
174 The reason for setting this limit is so that predication registers, when
175 marked as such, may fit into a single register as opposed to fanning out
176 over several registers. This keeps the implementation a little simpler.
177
178 The other important factor to note is that the actual MVL is **offset
179 by one**, so that it can fit into only 6 bits (for RV64) and still cover
180 a range up to XLEN bits. So, when setting the MVL CSR to 0, this actually
181 means that MVL==1. When setting the MVL CSR to 3, this actually means
182 that MVL==4, and so on. This is expressed more clearly in the "pseudocode"
183 section, where there are subtle differences between CSRRW and CSRRWI.
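The offset-by-one encoding may be sketched as a pair of helper functions (illustrative only; `XLEN` is assumed to be 64 here):

```python
XLEN = 64  # assumed RV64 for this sketch

def mvl_to_csr(mvl):
    # the CSR field stores MVL-1, so 1..XLEN fits in 6 bits on RV64
    assert 1 <= mvl <= XLEN
    return mvl - 1

def csr_to_mvl(field):
    # decoding adds the offset back: a field of 0 means MVL==1
    return field + 1
```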
184
185 ## Vector Length (VL)
186
187 VSETVL is slightly different from RVV. Like RVV, VL is set to be within
188 the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN)
189
190 VL = rd = MIN(vlen, MVL)
191
192 where 1 <= MVL <= XLEN
193
194 However just like MVL it is important to note that the range for VL has
195 subtle design implications, covered in the "CSR pseudocode" section
196
197 The fixed (specific) setting of VL allows vector LOAD/STORE to be used
198 to switch the entire bank of registers using a single instruction (see
199 Appendix, "Context Switch Example"). The reason for limiting VL to XLEN
200 is down to the fact that predication bits fit into a single register of
201 length XLEN bits.
202
203 The second change is that when VSETVL is requested to be stored
204 into x0, it is *ignored* silently (VSETVL x0, x5)
205
206 The third and most important change is that, within the limits set by
207 MVL, the value passed in **must** be set in VL (and in the
208 destination register).
209
210 This has implication for the microarchitecture, as VL is required to be
211 set (limits from MVL notwithstanding) to the actual value
212 requested. RVV has the option to set VL to an arbitrary value that suits
213 the conditions and the micro-architecture: SV does *not* permit this.
214
215 The reason is so that if SV is to be used for a context-switch or as a
216 substitute for LOAD/STORE-Multiple, the operation can be done with only
217 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
218 single LD/ST operation). If VL does *not* get set to the register file
219 length when VSETVL is called, then a software-loop would be needed.
220 To avoid this need, VL *must* be set to exactly what is requested
221 (limits notwithstanding).
222
223 Therefore, in turn, unlike RVV, implementors *must* provide
224 pseudo-parallelism (using sequential loops in hardware) if actual
225 hardware-parallelism in the ALUs is not deployed. A hybrid is also
226 permitted (as used in Broadcom's VideoCore-IV) however this must be
227 *entirely* transparent to the ISA.
228
229 The fourth change is that VSETVL is implemented as a CSR, where the
230 behaviour of CSRRW (and CSRRWI) must be changed to specifically store
231 the *new* value in the destination register, **not** the old value.
232 Where context-load/save is to be implemented in the usual fashion
233 by using a single CSRRW instruction to obtain the old value, the
234 *secondary* CSR must be used (SVSTATE). This CSR behaves
235 exactly as standard CSRs, and contains more than just VL.
236
237 One interesting side-effect of using CSRRWI to set VL is that this
238 may be done with a single instruction, useful particularly for a
239 context-load/save. There are however limitations: CSRRWI's immediate
240 is limited to 0-31.
241
242 ## STATE
243
244 This is a standard CSR that contains sufficient information for a
245 full context save/restore. It contains (and permits setting of)
246 MVL, VL, CFG, the destination element offset of the current parallel
247 instruction being executed, and, for twin-predication, the source
248 element offset as well. Interestingly it may hypothetically
249 also be used to cause the immediately-following instruction to skip a
250 certain number of elements, however the recommended method to do
251 this is predication.
252
253 Setting destoffs and srcoffs is realistically intended for saving state
254 so that exceptions (page faults in particular) may be serviced and the
255 hardware-loop that was being executed at the time of the trap, from
256 user-mode (or Supervisor-mode), may be returned to and continued from
257 where it left off. The reason why this works is that the User-Mode
258 STATE CSR is neither used nor changed in M-Mode or S-Mode
259 (and is entirely why M-Mode and S-Mode have their own STATE CSRs).
260
261 The format of the STATE CSR is as follows:
262
263 | (28..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5...0) |
264 | -------- | -------- | -------- | -------- | ------- | ------- |
265 | size | bank | destoffs | srcoffs | vl | maxvl |
266
267 When setting this CSR, the following characteristics will be enforced:
268
269 * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
270 * **VL** will be truncated (after offset) to be within the range 1 to MAXVL
271 * **srcoffs** will be truncated to be within the range 0 to VL-1
272 * **destoffs** will be truncated to be within the range 0 to VL-1
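A software sketch of packing and unpacking STATE, with the truncation rules applied on the way in (illustrative only; CFG is treated here as a single 5-bit field at bits 28..24, following the pseudocode later in this section):

```python
XLEN = 64  # assumed RV64 for this sketch

def state_pack(mvl, vl, srcoffs, destoffs, cfg):
    # enforce the truncation rules listed above
    mvl = min(max(mvl, 1), XLEN)
    vl = min(max(vl, 1), mvl)
    srcoffs = min(srcoffs, vl - 1)
    destoffs = min(destoffs, vl - 1)
    # MVL and VL are stored offset by one
    return ((mvl - 1) | (vl - 1) << 6 | srcoffs << 12
            | destoffs << 18 | cfg << 24)

def state_unpack(state):
    return ((state & 0x3f) + 1,          # MVL
            ((state >> 6) & 0x3f) + 1,   # VL
            (state >> 12) & 0x3f,        # srcoffs
            (state >> 18) & 0x3f,        # destoffs
            (state >> 24) & 0x1f)        # CFG
```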
273
274 ## MVL, VL and CSR Pseudocode
275
276 The pseudo-code for get and set of VL and MVL are as follows:
277
278 set_mvl_csr(value, rd):
279 regs[rd] = MVL
280 MVL = MIN(value, MVL)
281
282 get_mvl_csr(rd):
283 regs[rd] = MVL
284
285 set_vl_csr(value, rd):
286 VL = MIN(value, MVL)
287 regs[rd] = VL # yes returning the new value NOT the old CSR
288
289 get_vl_csr(rd):
290 regs[rd] = VL
291
292 Note that where setting MVL behaves as a normal CSR, unlike standard CSR
293 behaviour, setting VL will return the **new** value of VL **not** the old
294 one.
295
296 For CSRRWI, the range of the immediate is restricted to 5 bits. In order to
297 maximise the effectiveness, an immediate of 0 is used to set VL=1,
298 an immediate of 1 is used to set VL=2 and so on:
299
300 CSRRWI_Set_MVL(value):
301 set_mvl_csr(value+1, x0)
302
303 CSRRWI_Set_VL(value):
304 set_vl_csr(value+1, x0)
305
306 However for CSRRW the following pseudocode is used for MVL and VL,
307 where setting the value to zero will cause an exception to be raised.
308 The reason is that if VL or MVL are set to zero, the STATE CSR is
309 not capable of returning that value.
310
311 CSRRW_Set_MVL(rs1, rd):
312 value = regs[rs1]
313 if value == 0:
314 raise Exception
315 set_mvl_csr(value, rd)
316
317 CSRRW_Set_VL(rs1, rd):
318 value = regs[rs1]
319 if value == 0:
320 raise Exception
321 set_vl_csr(value, rd)
322
323 In this way, when CSRRW is utilised with a loop variable, the value
324 that goes into VL (and into the destination register) may be used
325 in an instruction-minimal fashion:
326
327 CSRvect1 = {type: F, key: a3, val: a3, elwidth: dflt}
328 CSRvect2 = {type: F, key: a7, val: a7, elwidth: dflt}
329 CSRRWI MVL, 3 # sets MVL == **4** (not 3)
330 j zerotest # in case loop counter a0 already 0
331 loop:
332 CSRRW VL, t0, a0 # vl = t0 = min(mvl, a0)
333 ld a3, a1 # load 4 registers a3-6 from x
334 slli t1, t0, 3 # t1 = vl * 8 (in bytes)
335 ld a7, a2 # load 4 registers a7-10 from y
336 add a1, a1, t1 # increment pointer to x by vl*8
337 fmadd a7, a3, fa0, a7 # v1 += v0 * fa0 (y = a * x + y)
338 sub a0, a0, t0 # n -= vl (t0)
339 st a7, a2 # store 4 registers a7-10 to y
340 add a2, a2, t1 # increment pointer to y by vl*8
341 zerotest:
342 bnez a0, loop # repeat if n != 0
343
344 With the STATE CSR, just like with CSRRWI, in order to maximise the
345 utilisation of the limited bitspace, "000000" in binary represents
346 VL==1, "000001" represents VL==2 and so on (likewise for MVL):
347
348 CSRRW_Set_SV_STATE(rs1, rd):
349 value = regs[rs1]
350 get_state_csr(rd)
351 MVL = set_mvl_csr(value[5:0]+1)
352 VL = set_vl_csr(value[11:6]+1)
353 srcoffs = value[17:12]
354 destoffs = value[23:18]
355 CFG = value[28:24]
356
357 get_state_csr(rd):
358 regs[rd] = (MVL-1) | (VL-1)<<6 | (srcoffs)<<12 |
359 (destoffs)<<18 | (CFG)<<24
360 return regs[rd]
361
362 In both cases, whilst CSR read of VL and MVL return the exact values
363 of VL and MVL respectively, reading and writing the STATE CSR returns
364 those values **minus one**. This is absolutely critical to implement
365 if the STATE CSR is to be used for fast context-switching.
366
367 ## Register CSR key-value (CAM) table
368
369 The purpose of the Register CSR table is four-fold:
370
371 * To mark integer and floating-point registers as requiring "redirection"
372 if it is ever used as a source or destination in any given operation.
373 This involves a level of indirection through a 5-to-7-bit lookup table,
374 such that **unmodified** operands with 5 bits (3 for Compressed) may
375 access up to **64** registers.
376 * To indicate whether, after redirection through the lookup table, the
377 register is a vector (or remains a scalar).
378 * To over-ride the implicit or explicit bitwidth that the operation would
379 normally give the register.
380
381 | RgCSR | | 15 | (14..8) | 7 | (6..5) | (4..0) |
382 | ----- | | - | - | - | ------ | ------- |
383 | 0 | | isvec0 | regidx0 | i/f | vew0 | regkey |
384 | 1 | | isvec1 | regidx1 | i/f | vew1 | regkey |
385 | .. | | isvec.. | regidx.. | i/f | vew.. | regkey |
386 | 15 | | isvec15 | regidx15 | i/f | vew15 | regkey |
387
388 i/f is set to "1" to indicate that the redirection/tag entry is to be applied
389 to integer registers; 0 indicates that it is relevant to floating-point
390 registers. vew has the following meanings, indicating that the instruction's
391 operand size is "over-ridden" in a polymorphic fashion:
392
393 | vew | bitwidth |
394 | --- | ---------- |
395 | 00 | default |
396 | 01 | default/2 |
397 | 10 | default\*2 |
398 | 11 | 8 |
399
400 As the above table is a CAM (key-value store) it may be appropriate
401 (faster, implementation-wise) to expand it as follows:
402
403 struct vectorised fp_vec[32], int_vec[32];
404
405 for (i = 0; i < 16; i++) // 16 CSRs?
406 tb = int_vec if CSRvec[i].type == 0 else fp_vec
407 idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
408 tb[idx].elwidth = CSRvec[i].elwidth
409 tb[idx].regidx = CSRvec[i].regidx // indirection
410 tb[idx].isvector = CSRvec[i].isvector // 0=scalar
411 tb[idx].packed = CSRvec[i].packed // SIMD or not
412
413 The actual size of the CSR Register table depends on the platform
414 and on whether other Extensions are present (RV64G, RV32E, etc.).
415 For details see "Subsets" section.
416
417 16-bit CSR Register CAM entries are mapped directly into 32-bit
418 on any RV32-based system, however RV64 (XLEN=64) and RV128 (XLEN=128)
419 are slightly different: the 16-bit entries appear (and can be set)
420 multiple times, in an overlapping fashion. Here is the table for RV64:
421
422 | CSR# | 63..48 | 47..32 | 31..16 | 15..0 |
| ----- | ------- | ------- | ------- | ------- |
423 | 0x4c0 | RgCSR3 | RgCSR2 | RgCSR1 | RgCSR0 |
424 | 0x4c1 | RgCSR5 | RgCSR4 | RgCSR3 | RgCSR2 |
425 | 0x4c2 | ... | ... | ... | ... |
426 | 0x4c6 | RgCSR15 | RgCSR14 | RgCSR13 | RgCSR12 |
427 | 0x4c7 | n/a | n/a | RgCSR15 | RgCSR14 |
428
429 The rules for writing to these CSRs are that any entries above the ones
430 being set will be automatically wiped (to zero), so to fill several entries
431 they must be written in a sequentially increasing manner. This functionality
432 was in an early draft of RVV and it means that, firstly, compilers do not have
433 to spend time zero-ing out CSRs unnecessarily, and secondly, that on
434 context-switching (and function calls) the number of CSRs that may need
435 saving is implicitly known.
436
437 The reason for the overlapping entries is that in the worst-case on an
438 RV64 system, only 4 64-bit CSR reads/writes are required for a full
439 context-switch (and an RV128 system, only 2 128-bit CSR reads/writes).
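The overlap can be sketched as follows (illustrative; it assumes each successive CSR advances the window by two 16-bit entries, so CSR row k at 0x4c0+k holds entries 2k to 2k+3):

```python
# For a given Register CSR table entry e (0..15), list the RV64 CSR
# addresses in which it appears, assuming a stride-2 overlapping window.
def csrs_holding_entry(e):
    rows = []
    for k in range(8):                   # CSRs 0x4c0 .. 0x4c7
        if 2 * k <= e <= 2 * k + 3:
            rows.append(0x4c0 + k)
    return rows

# a full 16-entry context-switch needs only the even-numbered windows:
# 0x4c0, 0x4c2, 0x4c4 and 0x4c6 -- i.e. 4 64-bit reads/writes
```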
440
441 --
442
443 TODO: move elsewhere
444
445 # TODO: use elsewhere (retire for now)
446 vew = CSRbitwidth[rs1]
447 if (vew == 0)
448 bytesperreg = (XLEN/8) # or FLEN as appropriate
449 elif (vew == 1)
450 bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
451 else:
452 bytesperreg = bytestable[vew] # 8 or 16
453 simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
454 vlen = CSRvectorlen[rs1] * simdmult
455 CSRvlength = MIN(MIN(vlen, MAXVECTORLENGTH), rs2)
456
457 The reason for multiplying the vector length by the number of SIMD elements
458 (in each individual register) is so that each SIMD element may optionally be
459 predicated.
460
461 An example of how to subdivide the register file when bitwidth != default
462 is given in the section "Bitwidth Virtual Register Reordering".
463
464 ## Predication CSR <a name="predication_csr_table"></a>
465
466 TODO: update CSR tables, now 7-bit for regidx
467
468 The Predication CSR is a key-value store indicating whether, if a given
469 destination register (integer or floating-point) is referred to in an
470 instruction, it is to be predicated. It is particularly important to note
471 that the *actual* register used can be *different* from the one that is
472 in the instruction, due to the redirection through the lookup table.
473
474 * regidx is the actual register that in combination with the
475 i/f flag, if that integer or floating-point register is referred to,
476 results in the lookup table being referenced to find the predication
477 mask to use on the operation in which that (regidx) register has
478 been used
479 * predidx (in combination with the bank bit in the future) is the
480 *actual* register to be used for the predication mask. Note:
481 in effect predidx is actually a 6-bit register address, as the bank
482 bit is the MSB (and is nominally set to zero for now).
483 * inv indicates that the predication mask bits are to be inverted
484 prior to use *without* actually modifying the contents of the
485 register itself.
486 * zeroing is either 1 or 0, and if set to 1, the operation must
487 place zeros in any element position where the predication mask is
488 set to zero. If zeroing is set to 0, unpredicated elements *must*
489 be left alone. Some microarchitectures may choose to interpret
490 this as skipping the operation entirely. Others which wish to
491 stick more closely to a SIMD architecture may choose instead to
492 interpret unpredicated elements as an internal "copy element"
493 operation (which would be necessary in SIMD microarchitectures
494 that perform register-renaming)
495 * "packed" indicates if the register is to be interpreted as SIMD
496 i.e. containing multiple contiguous elements of size equal to "bitwidth".
497 (Note: in earlier drafts this was in the Register CSR table.
498 However after extending to 7 bits there was not enough space.
499 To use "unpredicated" packed SIMD, set the predicate to x0 and
500 set "invert". This has the effect of setting a predicate of all 1s)
501
502 | PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
503 | ----- | - | - | - | - | ------- | ------- |
504 | 0 | bank0 | zero0 | inv0 | i/f | regidx | predkey |
505 | 1 | bank1 | zero1 | inv1 | i/f | regidx | predkey |
506 | .. | bank.. | zero.. | inv.. | i/f | regidx | predkey |
507 | 15 | bank15 | zero15 | inv15 | i/f | regidx | predkey |
508
509 The Predication CSR Table is a key-value store, so implementation-wise
510 it will be faster to turn the table around (maintain topologically
511 equivalent state):
512
513 struct pred {
514 bool zero;
515 bool inv;
516 bool enabled;
517 int predidx; // redirection: actual int register to use
518 }
519
520 struct pred fp_pred_reg[32]; // 64 in future (bank=1)
521 struct pred int_pred_reg[32]; // 64 in future (bank=1)
522
523 for (i = 0; i < 16; i++)
524 tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
525 idx = CSRpred[i].regidx
526 tb[idx].zero = CSRpred[i].zero
527 tb[idx].inv = CSRpred[i].inv
528 tb[idx].predidx = CSRpred[i].predidx
529 tb[idx].enabled = true
530
531 So when an operation is to be predicated, it is the internal state that
532 is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
533 pseudo-code for operations is given, where p is the explicit (direct)
534 reference to the predication register to be used:
535
536 for (int i=0; i<vl; ++i)
537 if ([!]preg[p][i])
538 (d ? vreg[rd][i] : sreg[rd]) =
539 iop(s1 ? vreg[rs1][i] : sreg[rs1],
540 s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
541
542 This instead becomes an *indirect* reference using the *internal* state
543 table generated from the Predication CSR key-value store, which is used
544 as follows.
545
546 if type(iop) == INT:
547 preg = int_pred_reg[rd]
548 else:
549 preg = fp_pred_reg[rd]
550
551 for (int i=0; i<vl; ++i)
552 predicate, zeroing = get_pred_val(type(iop) == INT, rd):
553 if (predicate && (1<<i))
554 (d ? regfile[rd+i] : regfile[rd]) =
555 iop(s1 ? regfile[rs1+i] : regfile[rs1],
556 s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
557 else if (zeroing)
558 (d ? regfile[rd+i] : regfile[rd]) = 0
559
560 Note:
561
562 * d, s1 and s2 are booleans indicating whether destination,
563 source1 and source2 are vector or scalar
564 * key-value CSR-redirection of rd, rs1 and rs2 have NOT been included
565 above, for clarity. rd, rs1 and rs2 all also must ALSO go through
566 register-level redirection (from the Register CSR table) if they are
567 vectors.
568
569 If written as a function, obtaining the predication mask (and whether
570 zeroing takes place) may be done as follows:
571
572 def get_pred_val(bool is_fp_op, int reg):
573 tb = fp_reg if is_fp_op else int_reg
574 if (!tb[reg].enabled):
575 return ~0x0, False // all enabled; no zeroing
576 tb = fp_pred if is_fp_op else int_pred
577 if (!tb[reg].enabled):
578 return ~0x0, False // all enabled; no zeroing
579 predidx = tb[reg].predidx // redirection occurs HERE
580 predicate = intreg[predidx] // actual predicate HERE
581 if (tb[reg].inv):
582 predicate = ~predicate // invert ALL bits
583 return predicate, tb[reg].zero
584
585 Note here, critically, that **only** if the register is marked
586 in its CSR **register** table entry as being "active" does the testing
587 proceed further to check if the CSR **predicate** table entry is
588 also active.
589
590 Note also that this is in direct contrast to branch operations
591 for the storage of comparisons: in these specific circumstances
592 the requirement for there to be an active CSR *register* entry
593 is removed.
594
595 ## REMAP CSR
596
597 (Note: both the REMAP and SHAPE sections are best read after the
598 rest of the document has been read)
599
600 There is one 32-bit CSR which may be used to indicate which registers,
601 if used in any operation, must be "reshaped" (re-mapped) from a linear
602 form to a 2D or 3D transposed form. The 32-bit REMAP CSR may reshape
603 up to 3 registers:
604
605 | 29..28 | 27..26 | 25..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
606 | ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
607 | shape2 | shape1 | shape0 | 0 | regidx2 | 0 | regidx1 | 0 | regidx0 |
608
609 regidx0-2 refer not to the Register CSR CAM entry but to the underlying
610 *real* register (see regidx, the value) and consequently is 7-bits wide.
611 shape0-2 refers to one of three SHAPE CSRs. A value of 0x3 is reserved.
612 Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
613
614 ## SHAPE 1D/2D/3D vector-matrix remapping CSRs
615
616 (Note: both the REMAP and SHAPE sections are best read after the
617 rest of the document has been read)
618
619 There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
620 which have the same format. When each SHAPE CSR is set entirely to zeros,
621 remapping is disabled: the register's elements are a linear (1D) vector.
622
623 | 26..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
624 | ------- | -- | ------- | -- | ------- | -- | ------- |
625 | permute | 0 | zdimsz | 0 | ydimsz | 0 | xdimsz |
626
627 xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
628 that the array dimensionality for that dimension is 1. A value of xdimsz=2
629 would indicate that in the first dimension there are 3 elements in the
630 array. The format of the array is therefore as follows:
631
632 array[xdim+1][ydim+1][zdim+1]
633
634 However whilst illustrative of the dimensionality, that does not take the
635 "permute" setting into account. "permute" may be any one of six values
636 (0-5, with values of 6 and 7 being reserved, and not legal). The table
637 below shows how the permutation dimensionality order works:
638
639 | permute | order | array format |
640 | ------- | ----- | ------------------------ |
641 | 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
642 | 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
643 | 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
644 | 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
645 | 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
646 | 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
647
648 In other words, the "permute" option changes the order in which
649 nested for-loops over the array would be done. The algorithm below
650 shows this more clearly, and may be executed as a python program:
651
652 # mapidx = REMAP.shape2
653 xdim = 3 # SHAPE[mapidx].xdim_sz+1
654 ydim = 4 # SHAPE[mapidx].ydim_sz+1
655 zdim = 5 # SHAPE[mapidx].zdim_sz+1
656
657 lims = [xdim, ydim, zdim]
658 idxs = [0,0,0] # starting indices
659 order = [1,0,2] # experiment with different permutations, here
660
661 for idx in range(xdim * ydim * zdim):
662     new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
663     print(new_idx, end=" ")
664     for i in range(3):
665         idxs[order[i]] = idxs[order[i]] + 1
666         if (idxs[order[i]] != lims[order[i]]):
667             break
668         print()
669         idxs[order[i]] = 0
670
671 Here, it is assumed that this algorithm be run within all pseudo-code
672 throughout this document where a (parallelism) for-loop would normally
673 run from 0 to VL-1 to refer to contiguous register
674 elements; instead, where REMAP indicates to do so, the element index
675 is run through the above algorithm to work out the **actual** element
676 index, instead. Given that there are three possible SHAPE entries, up to
677 three separate registers in any given operation may be simultaneously
678 remapped:
679
680 function op_add(rd, rs1, rs2) # add not VADD!
681     ...
682     ...
683     for (i = 0; i < VL; i++)
684         if (predval & 1<<i) # predication uses intregs
685             ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
686                                   ireg[rs2+remap(irs2)];
687             if (!int_vec[rd ].isvector) break;
688         if (int_vec[rd ].isvector)  { id += 1; }
689         if (int_vec[rs1].isvector)  { irs1 += 1; }
690         if (int_vec[rs2].isvector)  { irs2 += 1; }
691
692 By changing remappings, 2D matrices may be transposed "in-place" for one
693 operation, followed by setting a different permutation order without
694 having to move the values in the registers to or from memory. Also,
695 the reason for having REMAP separate from the three SHAPE CSRs is so
696 that in a chain of matrix multiplications and additions, for example,
697 the SHAPE CSRs need only be set up once; only the REMAP CSR need be
698 changed to target different registers.
699
700 Note that:
701
702 * If permute option 000 is utilised, the actual order of the
703 reindexing does not change!
704 * If two or more dimensions are set to zero, the actual order does not change!
705 * The above algorithm is pseudo-code **only**. Actual implementations
706 will need to take into account the fact that the element for-looping
707 must be **re-entrant**, due to the possibility of exceptions occurring.
708 See MSTATE CSR, which records the current element index.
709 * Twin-predicated operations require **two** separate and distinct
710 element offsets. The above pseudo-code algorithm will be applied
711 separately and independently to each, should each of the two
712 operands be remapped. *This even includes C.LDSP* and other operations
713 in that category, where in that case it will be the **offset** that is
714 remapped (see Compressed Stack LOAD/STORE section).
715 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
716 less than MVL is **perfectly legal**, albeit very obscure. It permits
717 entries to be regularly presented to operands **more than once**, thus
718 allowing the same underlying registers to act as an accumulator of
719 multiple vector or matrix operations, for example.
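For reference, the reindexing algorithm above can be wrapped as a self-contained function (a sketch: `dims` holds the dimension sizes after the +1 offset, and `order` is the permute order):

```python
# Map a linear element index (0 .. VL-1) to its remapped offset, by
# stepping the permuted nested counters idx times. Pseudo-code model
# only: a real implementation must additionally be re-entrant.
def remap(idx, dims, order):
    xdim, ydim, zdim = dims
    idxs = [0, 0, 0]
    for _ in range(idx):
        for i in range(3):
            idxs[order[i]] += 1
            if idxs[order[i]] != dims[order[i]]:
                break
            idxs[order[i]] = 0
    return idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
```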
720
721 Clearly here some considerable care needs to be taken as the remapping
722 could hypothetically create arithmetic operations that target the
723 exact same underlying registers, resulting in data corruption due to
724 pipeline overlaps. Out-of-order / Superscalar micro-architectures with
725 register-renaming will have an easier time dealing with this than
726 DSP-style SIMD micro-architectures.
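The Notes above (permute option 000 leaving the order unchanged, and so on) can be checked with a small executable sketch. This is a hypothetical reading of the reindexing, assuming an odometer-style increment of the x/y/z indices in the permuted dimension order; `remap_indices` and its parameters are illustrative names, not taken from the specification:

```python
def remap_indices(xd, yd, zd, permute, vl):
    # dims are (xdim+1), (ydim+1), (zdim+1); permute gives the order in
    # which the three indices are incremented (0 = x fastest, by default)
    lims = [xd, yd, zd]
    idxs = [0, 0, 0]
    out = []
    for _ in range(vl):
        # flatten the 3D index back to a linear register element offset
        out.append(idxs[0] + idxs[1] * lims[0] + idxs[2] * lims[0] * lims[1])
        for d in permute:  # odometer-style increment, in permuted order
            idxs[d] += 1
            if idxs[d] != lims[d]:
                break
            idxs[d] = 0
    return out

# permute (0,1,2) is the identity: element order is unchanged
print(remap_indices(2, 2, 2, (0, 1, 2), 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
# swapping x and y transposes each 2x2 plane
print(remap_indices(2, 2, 2, (1, 0, 2), 8))  # [0, 2, 1, 3, 4, 6, 5, 7]
```

Note how the default permute order reproduces a straight 0..VL-1 sequence, which is exactly the first bullet point above.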
727
728 # Instruction Execution Order
729
730 Simple-V behaves as if it is a hardware-level "macro expansion system",
731 substituting and expanding a single instruction into multiple sequential
732 instructions with contiguous and sequentially-incrementing registers.
733 As such, it does **not** modify - or specify - the behaviour and semantics of
734 the execution order: that may be deduced from the **existing** RV
735 specification in each and every case.
736
So for example, if a particular micro-architecture permits out-of-order
execution, and it is augmented with Simple-V, then wherever instructions
may be executed out-of-order, so may the "post-expansion" SV ones.
740
741 If on the other hand there are memory guarantees which specifically
742 prevent and prohibit certain instructions from being re-ordered
743 (such as the Atomicity Axiom, or FENCE constraints), then clearly
744 those constraints **MUST** also be obeyed "post-expansion".
745
It should be absolutely clear that SV is **not** about providing new
functionality or changing the existing behaviour of a micro-architectural
design, or about changing the RISC-V Specification.
It is **purely** about compacting what would otherwise be contiguous
instructions that use sequentially-increasing register numbers down
to the **one** instruction.
752
753 # Instructions
754
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however
xBitManip becomes a critical dependency for efficient manipulation of
predication masks (as a bit-field). With the exception of CLIP and
VSELECT.X, *all instructions from RVV Base are topologically re-mapped
and retain their complete functionality, intact*. Note that if RV64G
ever had a MV.X added as well as FCLIP, the full functionality of
RVV-Base would be obtained in SV.
765
766 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
767 equivalents, so are left out of Simple-V. VSELECT could be included if
768 there existed a MV.X instruction in RV (MV.X is a hypothetical
769 non-immediate variant of MV that would allow another register to
770 specify which register was to be copied). Note that if any of these three
771 instructions are added to any given RV extension, their functionality
772 will be inherently parallelised.
773
774 With some exceptions, where it does not make sense or is simply too
775 challenging, all RV-Base instructions are parallelised:
776
777 * CSR instructions, whilst a case could be made for fast-polling of
778 a CSR into multiple registers, would require guarantees of strict
779 sequential ordering that SV does not provide. Therefore, CSRs are
780 not really suitable and are left out.
781 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
782 left as scalar.
783 * LR/SC could hypothetically be parallelised however their purpose is
784 single (complex) atomic memory operations where the LR must be followed
785 up by a matching SC. A sequence of parallel LR instructions followed
786 by a sequence of parallel SC instructions therefore is guaranteed to
787 not be useful. Not least: the guarantees of LR/SC
788 would be impossible to provide if emulated in a trap.
789 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
790 paralleliseable anyway.
791
792 All other operations using registers are automatically parallelised.
793 This includes AMOMAX, AMOSWAP and so on, where particular care and
794 attention must be paid.
795
Example pseudo-code for an integer ADD operation (covering scalar
operations as well). Floating-point versions use the FP CSRs instead.
798
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
          ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
          if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
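The same loop can be expressed as a small runnable model (Python standing in for the pseudo-code; the flat register-file layout, the tag table and the `lookup` helper are simplifications invented for this sketch):

```python
VL = 3
ireg = [0] * 64  # flat integer register file

# CSR tag table: register -> (isvector, redirected regidx)
int_vec = {5: (True, 32), 6: (True, 40), 7: (True, 48)}

def lookup(r):
    # untagged registers are scalar and not redirected
    return int_vec.get(r, (False, r))

def op_add(rd, rs1, rs2, predval=~0):  # add not VADD!
    dvec, d = lookup(rd)
    s1vec, s1 = lookup(rs1)
    s2vec, s2 = lookup(rs2)
    id = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):  # predication uses intregs
            ireg[d + id] = ireg[s1 + irs1] + ireg[s2 + irs2]
            if not dvec:
                break  # scalar destination: a single result suffices
        if dvec:
            id += 1
        if s1vec:
            irs1 += 1
        if s2vec:
            irs2 += 1

ireg[40:43] = [1, 2, 3]
ireg[48:51] = [10, 20, 30]
op_add(5, 6, 7)     # all three registers tagged: vector-vector add
print(ireg[32:35])  # [11, 22, 33]
```

Untagged registers fall straight through: `op_add` on three plain registers performs one scalar add and breaks out of the loop, exactly as a standard RV ADD would.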
812
813 ## Instruction Format
814
815 There are **no operations added to SV, at all**.
816 Instead SV *overloads* pre-existing branch operations into predicated
817 variants, and implicitly overloads arithmetic operations, MV,
818 FCVT, and LOAD/STORE
819 depending on CSR configurations for bitwidth and
820 predication. **Everything** becomes parallelised. *This includes
821 Compressed instructions* as well as any
822 future instructions and Custom Extensions.
823
824 ## Branch Instructions
825
826 ### Standard Branch <a name="standard_branch"></a>
827
Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).
831
Note that the predication register to use (if one is enabled) is taken
from the *first* src register. The target (destination) predication
register to use (if one is enabled) is taken from the *second* src
register.
835
836 If either of src1 or src2 are scalars (whether by there being no
837 CSR register entry or whether by the CSR entry specifically marking
838 the register as "scalar") the comparison goes ahead as vector-scalar
839 or scalar-vector.
840
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
846
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2.
850
851 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
852 for predicated compare operations of function "cmp":
853
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
        preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                          s2 ? vreg[rs2][i] : sreg[rs2]);
858
859 With associated predication, vector-length adjustments and so on,
860 and temporarily ignoring bitwidth (which makes the comparisons more
861 complex), this becomes:
862
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd)
        temporary_result = 0
    else
        preg[rd] = 0; # initialise to zero

    for (int i = 0; i < VL; ++i)
        if ((ps & (1<<i)) && cmp(s1 ? reg[src1+i] : reg[src1],
                                 s2 ? reg[src2+i] : reg[src2]))
            if not exists(rd)
                temporary_result |= 1<<i;
            else
                preg[rd] |= 1<<i; # bitfield not vector

    if not exists(rd)
        if temporary_result == ps
            goto branch
    else
        if preg[rd] == ps
            goto branch
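A compact executable model of the bitfield accumulation and the final all-active-tests-must-pass check (names are illustrative, and the exists-`rd` temporary-result path is collapsed into a single function):

```python
VL = 4

def sv_branch_taken(vals1, vals2, ps, cmp=lambda a, b: a == b):
    # accumulate per-element compare results into a bitfield, leaving
    # clear any element that is predicated out by ps
    result = 0
    for i in range(VL):
        if (ps & (1 << i)) and cmp(vals1[i], vals2[i]):
            result |= 1 << i
    # branch goes ahead if and only if every *active* test succeeded
    return result == ps

print(sv_branch_taken([1, 2, 3, 4], [1, 2, 3, 4], 0b1111))  # True
print(sv_branch_taken([1, 2, 3, 4], [1, 9, 3, 4], 0b1111))  # False
print(sv_branch_taken([1, 2, 3, 4], [1, 9, 3, 4], 0b1101))  # True
```

The third call shows the effect of predicating out the one mismatching element: the remaining active compares all pass, so the branch is taken.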
896
897 Notes:
898
899 * zeroing has been temporarily left out of the above pseudo-code,
900 for clarity
901 * Predicated SIMD comparisons would break src1 and src2 further down
902 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
903 Reordering") setting Vector-Length times (number of SIMD elements) bits
904 in Predicate Register rd, as opposed to just Vector-Length bits.
905
TODO: predication is now taken from src2. Note also that the branch
goes ahead only if all compares are successful.
908
909 Note also that where normally, predication requires that there must
910 also be a CSR register entry for the register being used in order
911 for the **predication** CSR register entry to also be active,
912 for branches this is **not** the case. src2 does **not** have
913 to have its CSR register entry marked as active in order for
914 predication on src2 to be active.
915
916 ### Floating-point Comparisons
917
Floating-point branch operations do not exist: there are only compares.
Interestingly, no change is needed to the instruction format, because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it is not actually a Branch at all: it is a compare.
Thus, no change is made to the floating-point comparison operations.

It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
and whilst in ordinary branch code this is fine because the standard
RVF compare can always be followed up with an integer BEQ or a BNE (or
a compressed comparison to zero or non-zero), in predication terms that
becomes more of an impact. To deal with this, SV's predication has
had "invert" added to it.
930
931 ### Compressed Branch Instruction
932
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a1 is equivalent to beqz a1,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
939
The specific required use of x0 is, with a little thought, quite
logical, yet counterintuitive at first. Clearly it is **not** recommended
to redirect x0 with a CSR register entry, however as a means to opaquely
obtain a predication target it is the only sensible option that does not
involve additional special CSRs (or, worse, additional special opcodes).
945
946 Note also that, just as with standard branches, the 2nd source
947 (in this case x0 rather than src2) does **not** have to have its CSR
948 register table marked as "active" in order for predication to work.
949
950 ## Vectorised Dual-operand instructions
951
952 There is a series of 2-operand instructions involving copying (and
953 sometimes alteration):
954
955 * C.MV
956 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
957 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
958 * LOAD(-FP) and STORE(-FP)
959
960 All of these operations follow the same two-operand pattern, so it is
961 *both* the source *and* destination predication masks that are taken into
962 account. This is different from
963 the three-operand arithmetic instructions, where the predication mask
964 is taken from the *destination* register, and applied uniformly to the
965 elements of the source register(s), element-for-element.
966
967 The pseudo-code pattern for twin-predicated operations is as
968 follows:
969
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
981
982 This pattern covers scalar-scalar, scalar-vector, vector-scalar
983 and vector-vector, and predicated variants of all of those.
984 Zeroing is not presently included (TODO). As such, when compared
985 to RVV, the twin-predicated variants of C.MV and FMV cover
986 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
987 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
988
989 Note that:
990
991 * elwidth (SIMD) is not covered in the pseudo-code above
992 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
993 not covered
994 * zero predication is also not shown (TODO).
995
996 ### C.MV Instruction <a name="c_mv"></a>
997
998 There is no MV instruction in RV however there is a C.MV instruction.
999 It is used for copying integer-to-integer registers (vectorised FMV
1000 is used for copying floating-point).
1001
1002 If either the source or the destination register are marked as vectors
1003 C.MV is reinterpreted to be a vectorised (multi-register) predicated
1004 move operation. The actual instruction's format does not change:
1005
1006 [[!table data="""
1007 15 12 | 11 7 | 6 2 | 1 0 |
1008 funct4 | rd | rs | op |
1009 4 | 5 | 5 | 2 |
1010 C.MV | dest | src | C0 |
1011 """]]
1012
1013 A simplified version of the pseudocode for this operation is as follows:
1014
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
1026
1027 There are several different instructions from RVV that are covered by
1028 this one opcode:
1029
1030 [[!table data="""
1031 src | dest | predication | op |
1032 scalar | vector | none | VSPLAT |
1033 scalar | vector | destination | sparse VSPLAT |
1034 scalar | vector | 1-bit dest | VINSERT |
1035 vector | scalar | 1-bit? src | VEXTRACT |
1036 vector | vector | none | VCOPY |
1037 vector | vector | src | Vector Gather |
1038 vector | vector | dest | Vector Scatter |
1039 vector | vector | src & dest | Gather/Scatter |
1040 vector | vector | src == dest | sparse VCOPY |
1041 """]]
1042
1043 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
1044 operations with inversion on the src and dest predication for one of the
1045 two C.MV operations.
1046
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
1052
1053 ### FMV, FNEG and FABS Instructions
1054
These are identical in form to C.MV, except covering floating-point
register copying. The same twin-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type-conversion
operation of the appropriate size, covering the source and destination
register bitwidths.
1061
1062 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
1063
### FCVT Instructions
1065
1066 These are again identical in form to C.MV, except that they cover
1067 floating-point to integer and integer to floating-point. When element
1068 width in each vector is set to default, the instructions behave exactly
1069 as they are defined for standard RV (scalar) operations, except vectorised
1070 in exactly the same fashion as outlined in C.MV.
1071
1072 However when the source or destination element width is not set to default,
1073 the opcode's explicit element widths are *over-ridden* to new definitions,
1074 and the opcode's element width is taken as indicative of the SIMD width
1075 (if applicable i.e. if packed SIMD is requested) instead.
1076
1077 For example FCVT.S.L would normally be used to convert a 64-bit
1078 integer in register rs1 to a 64-bit floating-point number in rd.
1079 If however the source rs1 is set to be a vector, where elwidth is set to
1080 default/2 and "packed SIMD" is enabled, then the first 32 bits of
1081 rs1 are converted to a floating-point number to be stored in rd's
1082 first element and the higher 32-bits *also* converted to floating-point
1083 and stored in the second. The 32 bit size comes from the fact that
1084 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
1085 divide that by two it means that rs1 element width is to be taken as 32.
1086
1087 Similar rules apply to the destination register.
1088
1089 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
1090
1091 An earlier draft of SV modified the behaviour of LOAD/STORE. This
1092 actually undermined the fundamental principle of SV, namely that there
1093 be no modifications to the scalar behaviour (except where absolutely
1094 necessary), in order to simplify an implementor's task if considering
1095 converting a pre-existing scalar design to support parallelism.
1096
1097 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
1098 do not change in SV, however just as with C.MV it is important to note
1099 that dual-predication is possible. Using the template outlined in
1100 the section "Vectorised dual-op instructions", the pseudo-code covering
1101 scalar-scalar, scalar-vector, vector-scalar and vector-vector applies,
1102 where SCALAR\_OPERATION is as follows, exactly as for a standard
1103 scalar RV LOAD operation:
1104
1105 srcbase = ireg[rs+i];
1106 return mem[srcbase + imm];
1107
1108 Whilst LOAD and STORE remain as-is when compared to their scalar
1109 counterparts, the incrementing on the source register (for LOAD)
1110 means that pointers-to-structures can be easily implemented, and
1111 if contiguous offsets are required, those pointers (the contents
1112 of the contiguous source registers) may simply be set up to point
1113 to contiguous locations.
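A sketch of the pointers-to-structures pattern: each element of a vectorised LOAD dereferences its own source pointer register (the memory model, tag table and names are invented for illustration):

```python
VL = 3
mem = {0x100: 11, 0x108: 22, 0x110: 33}  # toy memory
ireg = [0] * 64
int_csr = {10: (True, 32), 11: (True, 40)}  # both regs tagged as vectors
ireg[32:35] = [0x100, 0x108, 0x110]         # three independent pointers

def op_load(rd, rs, imm):
    dvec, d = int_csr.get(rd, (False, rd))
    svec, s = int_csr.get(rs, (False, rs))
    i = j = 0
    while i < VL and j < VL:
        # SCALAR_OPERATION: exactly the scalar LOAD, once per element
        ireg[d + j] = mem[ireg[s + i] + imm]
        if svec:
            i += 1
        if dvec:
            j += 1
        else:
            break

op_load(11, 10, 0)
print(ireg[40:43])  # [11, 22, 33]
```

The three pointers here are deliberately non-contiguous in memory: nothing requires them to be, which is exactly the point made above.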
1114
1115 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
1116
1117 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
1118 where it is implicit in C.LWSP/FLWSP that x2 is the source register.
1119 It is therefore possible to use predicated C.LWSP to efficiently
1120 pop registers off the stack (by predicating x2 as the source), cherry-picking
1121 which registers to store to (by predicating the destination). Likewise
1122 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
1123
1124 However, to do so, the behaviour of C.LWSP/C.SWSP needs to be slightly
1125 different: where x2 is marked as vectorised, instead of incrementing
1126 the register on each loop (x2, x3, x4...), instead it is the *immediate*
1127 that must be incremented. Pseudo-code follows:
1128
    function lwsp(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = x2 # effectively no redirection on x2.
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        reg[rd+j] = mem[x2 + ((offset+i) * 4)]
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break;
1140
1141 For C.LDSP, the offset (and loop) multiplier would be 8, and for
1142 C.LQSP it would be 16. Effectively this makes C.LWSP etc. a Vector
1143 "Unit Stride" Load instruction.
1144
1145 **Note**: it is still possible to redirect x2 to an alternative target
1146 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
1147 general-purpose Vector "Unit Stride" LOAD/STORE operations.
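A runnable sketch of the immediate-incrementing behaviour, showing both a straight multi-register pop and a cherry-picked one via the source predicate (the stack contents, tag table and names are invented):

```python
VL = 4
mem = {0x1000 + k * 4: 100 + k for k in range(8)}  # toy stack of words
ireg = [0] * 64
int_csr = {2: (True, 2), 8: (True, 8)}  # x2 and x8 marked as vectors

def c_lwsp(rd, offset, pd=~0, ps=~0):
    dvec, d = int_csr.get(rd, (False, rd))
    svec = int_csr.get(2, (False, 2))[0]  # x2: effectively no redirection
    i = j = 0
    while i < VL and j < VL:
        if svec:
            while not (ps & (1 << i)):
                i += 1
        if dvec:
            while not (pd & (1 << j)):
                j += 1
        # x2 itself stays fixed: it is the *immediate* that increments
        ireg[d + j] = mem[ireg[2] + (offset + i) * 4]
        if svec:
            i += 1
        if dvec:
            j += 1
        else:
            break

ireg[2] = 0x1000
c_lwsp(8, 0)             # unit-stride: pops four words into x8..x11
print(ireg[8:12])        # [100, 101, 102, 103]
c_lwsp(8, 0, ps=0b1010)  # cherry-pick stack words 1 and 3
print(ireg[8:10])        # [101, 103]
```

For C.LDSP the `* 4` scaling would become `* 8`, and for C.LQSP `* 16`, matching the text above.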
1148
1149 ## Compressed LOAD / STORE Instructions
1150
1151 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
1152 where the same rules apply and the same pseudo-code apply as for
1153 non-compressed LOAD/STORE. This is **different** from Compressed Stack
1154 LOAD/STORE (C.LWSP / C.SWSP), which have been augmented to become
1155 Vector "Unit Stride" capable.
1156
Just as with uncompressed LOAD/STORE, C.LD / C.ST increment the *register*
during the hardware loop, **not** the offset.
1159
1160 # Element bitwidth polymorphism <a name="elwidth"></a>
1161
1162 Element bitwidth is best covered as its own special section, as it
1163 is quite involved and applies uniformly across-the-board. SV restricts
1164 bitwidth polymorphism to default, default/2, default\*2 and 8-bit
1165 (whilst this seems limiting, the justification is covered in a later
1166 sub-section).
1167
The effect of setting an element bitwidth is to re-cast each entry
in the register table, and for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register
effectively now looks like this:
1173
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
1183
1184 where the CSR Register table entry (not the instruction alone) determines
1185 which of those union entries is to be used on each operation, and the
1186 VL element offset in the hardware-loop specifies the index into each array.
1187
However a naive interpretation of the data structure above masks the
fact that when VL is set greater than 8 (for example) and the bitwidth
is 8, accessing one specific register "spills over" into the following
parts of the register file in a sequential fashion. So a much more
accurate way to reflect this would be:
1193
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
1204
where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries, in a fashion
that is completely transparent to a greatly-simplified software /
pseudo-code representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an attempt is ever made to access beyond the
"real" register bytes.
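The byte-level "overspill" can be modelled directly in software by treating the register file as one flat byte array (a sketch only; the struct packing and little-endian layout are assumptions of this model, not statements from the specification):

```python
import struct

XLEN_BYTES = 8                            # RV64
regfile = bytearray(128 * XLEN_BYTES)     # flat byte view of 128 regs
FMT = {8: 'B', 16: 'H', 32: 'I', 64: 'Q'}

def get_elem(reg, bitwidth, offset):
    # element `offset` of width `bitwidth`, starting at register `reg`,
    # spilling transparently into the following registers
    addr = reg * XLEN_BYTES + offset * (bitwidth // 8)
    return struct.unpack_from('<' + FMT[bitwidth], regfile, addr)[0]

def set_elem(reg, bitwidth, offset, val):
    addr = reg * XLEN_BYTES + offset * (bitwidth // 8)
    struct.pack_into('<' + FMT[bitwidth], regfile, addr, val)

set_elem(5, 8, 9, 0xAB)   # 10th byte-element of "register 5"...
print(get_elem(6, 8, 1))  # ...is physically register 6's 2nd byte: 171
```

`struct.pack_into` raises an error if the address runs past the end of the bytearray, which is the software analogue of the end-of-regfile exception mentioned above.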
1214
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non"-polymorphic version (above):
1218
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
1241
1242 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
1243 following sequentially on respectively from the same) are "type-cast"
1244 to 8-bit; for 16-bit entries likewise and so on.
1245
1246 However that only covers the case where the element widths are the same.
1247 Where the element widths are different, the following algorithm applies:
1248
1249 * Analyse the bitwidth of all source operands and work out the
1250 maximum. Record this as "maxsrcbitwidth"
* If any given source operand requires sign-extension or zero-extension
(lb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
sign-extension / zero-extension or whatever is specified in the standard
RV specification, **change** that to sign-extending from the respective
individual source operand's bitwidth from the CSR table out to
"maxsrcbitwidth" (previously calculated), instead.
1257 * Following separate and distinct (optional) sign/zero-extension of all
1258 source operands as specifically required for that operation, carry out the
1259 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
1260 this may be a "null" (copy) operation, and that with FCVT, the changes
1261 to the source and destination bitwidths may also turn FVCT effectively
1262 into a copy).
* If the destination operand requires sign-extension or zero-extension,
instead of a mandatory fixed size (typically 32-bit for arithmetic,
subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
etc.), overload the RV specification with the bitwidth from the
destination register's elwidth entry.
1268 * Finally, store the (optionally) sign/zero-extended value into its
1269 destination: memory for sb/sw etc., or an offset section of the register
1270 file for an arithmetic operation.
1271
1272 In this way, polymorphic bitwidths are achieved without requiring a
1273 massive 64-way permutation of calculations **per opcode**, for example
1274 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
1275 rd bitwidths). The pseudo-code is therefore as follows:
1276
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0:
            return xlen
        if elwidth == 1:
            return xlen / 2
        if elwidth == 2:
            return xlen * 2
        // elwidth == 3:
        return 8
1293
    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res
1310
    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val
1325
    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { id += 1; } else break
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
1344
Whilst the specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
1348
1349 * the source operands are extended out to the maximum bitwidth of all
1350 source operands
1351 * the operation takes place at that maximum source bitwidth (the
1352 destination bitwidth is not involved at this point, at all)
1353 * the result is extended (or potentially even, truncated) before being
1354 stored in the destination. i.e. truncation (if required) to the
1355 destination width occurs **after** the operation **not** before.
1356 * when the destination is not marked as "vectorised", the **full**
1357 (standard, scalar) register file entry is taken up, i.e. the
1358 element is either sign-extended or zero-extended to cover the
1359 full register bitwidth (XLEN) if it is not already XLEN bits long.
1360
1361 Implementors are entirely free to optimise the above, particularly
1362 if it is specifically known that any given operation will complete
1363 accurately in less bits, as long as the results produced are
1364 directly equivalent and equal, for all inputs and all outputs,
1365 to those produced by the above algorithm.
1366
1367 ## Polymorphic floating-point operation exceptions and error-handling
1368
1369 For floating-point operations, conversion takes place without
1370 raising any kind of exception. Exactly as specified in the standard
1371 RV specification, NAN (or appropriate) is stored if the result
1372 is beyond the range of the destination, and, again, exactly as
1373 with the standard RV specification just as with scalar
1374 operations, the floating-point flag is raised (FCSR). And, again, just as
1375 with scalar operations, it is software's responsibility to check this flag.
1376 Given that the FCSR flags are "accrued", the fact that multiple element
1377 operations could have occurred is not a problem.
1378
1379 Note that it is perfectly legitimate for floating-point bitwidths of
1380 only 8 to be specified. However whilst it is possible to apply IEEE 754
1381 principles, no actual standard yet exists. Implementors wishing to
1382 provide hardware-level 8-bit support rather than throw a trap to emulate
1383 in software should contact the author of this specification before
1384 proceeding.
1385
1386 ## Polymorphic shift operators
1387
1388 A special note is needed for changing the element width of left and right
1389 shift operators, particularly right-shift. Even for standard RV base,
1390 in order for correct results to be returned, the second operand RS2 must
1391 be truncated to be within the range of RS1's bitwidth. spike's implementation
1392 of sll for example is as follows:
1393
1394 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
1395
1396 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
1397 range 0..31 so that RS1 will only be left-shifted by the amount that
1398 is possible to fit into a 32-bit register. Whilst this appears not
1399 to matter for hardware, it matters greatly in software implementations,
1400 and it also matters where an RV64 system is set to "RV32" mode, such
1401 that the underlying registers RS1 and RS2 comprise 64 hardware bits
1402 each.
1403
1404 For SV, where each operand's element bitwidth may be over-ridden, the
1405 rule about determining the operation's bitwidth *still applies*, being
1406 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
1407 **also applies to the truncation of RS2**. In other words, *after*
1408 determining the maximum bitwidth, RS2's range must **also be truncated**
1409 to ensure a correct answer. Example:
1410
1411 * RS1 is over-ridden to a 16-bit width
1412 * RS2 is over-ridden to an 8-bit width
1413 * RD is over-ridden to a 64-bit width
1414 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
1415 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
1416
1417 Pseudocode for this example would therefore be:
1418
1419 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
1420
1421 This example illustrates that considerable care therefore needs to be
1422 taken to ensure that left and right shift operations are implemented
1423 correctly.
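
As a behavioural sketch of the worked example above (Python; the function
name and parameters are illustrative, and sign-extension of the result is
omitted for brevity, so only the unsigned case is modelled):

```python
def polymorphic_sll(rs1_val, rs2_val, rs1_w, rs2_w, rd_w):
    # operation width is the maximum of the source element widths
    opwidth = max(rs1_w, rs2_w)              # max(16, 8) = 16 in the example
    # RS2 is truncated *after* the maximum is determined
    shift = rs2_val & (opwidth - 1)          # range 0..15 for a 16-bit op
    val = (rs1_val & ((1 << rs1_w) - 1)) << shift  # zero-extend RS1, shift
    return val & ((1 << rd_w) - 1)           # truncate to RD's width

# RS1 16-bit, RS2 8-bit, RD 64-bit: a shift amount of 0x11 (17) is
# truncated to 17 & 15 == 1
print(hex(polymorphic_sll(0x00F0, 0x11, 16, 8, 64)))  # 0x1e0
```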
1424
1425 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
1426
Polymorphic element widths in vectorised form mean that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.
1431
1432 This makes for a slightly more complex algorithm when using indirection
1433 on the "addressed" register (source for LOAD and destination for STORE),
1434 particularly given that the LOAD/STORE instruction provides important
1435 information about the width of the data to be reinterpreted.
1436
Let's illustrate the "load" part, where the pseudo-code for elwidth=default
is as follows, with i being the loop index from 0 to VL-1:
1439
1440 srcbase = ireg[rs+i];
1441 return mem[srcbase + imm]; // returns XLEN bits
1442
1443 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
1444 chunks are taken from the source memory location addressed by the current
1445 indexed source address register, and only when a full 32-bits-worth
1446 are taken will the index be moved on to the next contiguous source
1447 address register:
1448
1449 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
1450 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
1451 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
1452 offs = i % elsperblock; // modulo
1453 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
1454
1455 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
1456 and 128 for LQ.
1457
1458 The principle is basically exactly the same as if the srcbase were pointing
1459 at the memory of the *register* file: memory is re-interpreted as containing
1460 groups of elwidth-wide discrete elements.
1461
When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*. Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth. The storing phase follows exactly the same algorithm as above;
in fact it is just the set\_polymorphed\_reg pseudocode (completely
unchanged).
1469
One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB, for example,
cannot possibly obtain 16-bit-wide data. This condition may be detected
when, using integer divide, elsperblock (the width of the LOAD divided
by the bitwidth of the element) is zero.
1475
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
1479
1480 The elements, if the element bitwidth is larger than the LD operation's
1481 size, will then be sign/zero-extended to the full LD operation size, as
1482 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1483 being passed on to the second phase.
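
A Python sketch of the addressing (first) phase, under the rules above
(function and variable names are illustrative, not part of the
specification):

```python
def src_element_byte_addr(ireg, rs, imm, i, opwidth, elwidth):
    # elements per address-register block, clamped to a minimum of 1
    # (covers the case where elwidth is wider than the LOAD itself)
    elsperblock = max(1, opwidth // elwidth)
    srcbase = ireg[rs + i // elsperblock]   # integer divide selects register
    offs = i % elsperblock                  # element offset within the block
    return srcbase + imm + offs * (elwidth // 8)  # scale offset to bytes

# LW (opwidth=32) with 8-bit source elements: 4 elements per block, so
# elements 0..3 come from the address in ireg[5], 4..7 from ireg[6]
ireg = {5: 0x1000, 6: 0x2000}
print([hex(src_element_byte_addr(ireg, 5, 0, i, 32, 8)) for i in range(8)])
```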
1484
As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except that where in the
previous pseudo-code (elwidth=default for both source and target)
the predication was applied to the *registers*, it is now applied
to the **elements**.
1490
1491 Thus the full pseudocode for all LD operations may be written out
1492 as follows:
1493
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(destwid)        # destination element bitwidth
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, bitwidth, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1531
1532 Note:
1533
1534 * when comparing against for example the twin-predicated c.mv
1535 pseudo-code, the pattern of independent incrementing of rd and rs
1536 is preserved unchanged.
1537 * just as with the c.mv pseudocode, zeroing is not included and must be
1538 taken into account (TODO).
1539 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1540 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1541 VSCATTER characteristics.
1542 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1543 a destination that is not vectorised (marked as scalar) will
1544 result in the element being fully sign-extended or zero-extended
1545 out to the full register file bitwidth (XLEN). When the source
1546 is also marked as scalar, this is how the compatibility with
1547 standard RV LOAD/STORE is preserved by this algorithm.
1548
1549 ## Why SV bitwidth specification is restricted to 4 entries
1550
The four entries for SV element bitwidths only allow three over-rides:
1552
1553 * default bitwidth for a given operation *divided* by two
1554 * default bitwidth for a given operation *multiplied* by two
1555 * 8-bit
1556
At first glance this seems completely inadequate: for example, RV64
cannot possibly perform 16-bit operations, because 64 divided by
2 is 32. However, the reader may have forgotten that it is possible,
at run-time, to switch a 64-bit application into 32-bit mode, by
setting UXL. Once switched, opcodes that formerly had 64-bit
meanings now have 32-bit meanings, and in this way, "default/2"
now reaches **16-bit** where previously it meant "32-bit".
1564
There is however an absolutely crucial aspect of SV here that explicitly
needs spelling out: whether the "vectorised" bit is set in the
register's CSR entry.
1568
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth over-ride is
set on a destination (RD), sign-extension and zero-extension, whilst
changed to match the over-ride bitwidth, will overwrite the **full**
register entry (64-bit if RV64).
1574
1575 When vectorised is *set*, this indicates that the operation now treats
1576 **elements** as if they were independent registers, so regardless of
1577 the length, any parts of a given actual register that are not involved
1578 in the operation are **NOT** modified, but are **PRESERVED**.
1579
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of the
multi-element operation.
1583
1584 Example:
1585
1586 * rs1, rs2 and rd are all set to 8-bit
1587 * VL is set to 3
1588 * RV64 architecture is set (UXL=64)
1589 * add operation is carried out
1590 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1591 concatenated with similar add operations on bits 15..8 and 7..0
1592 * bits 24 through 63 **remain as they originally were**.
1593
1594 Example SIMD micro-architectural implementation:
1595
1596 * SIMD architecture works out the nearest round number of elements
1597 that would fit into a full RV64 register (in this case: 8)
1598 * SIMD architecture creates a hidden predicate, binary 0b00000111
1599 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1600 * SIMD architecture goes ahead with the add operation as if it
1601 was a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
  (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.
1606
This requires a read of rd; however, a read is required anyway in order
to support non-zeroing mode.
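
A behavioural sketch of the hidden-predicate approach (Python; names are
illustrative), showing that the register bits beyond the VL elements are
preserved:

```python
def simd_add_hidden_pred(rd_val, rs1_val, rs2_val, vl, elwidth=8, xlen=64):
    nelems = xlen // elwidth   # 8 8-bit elements fit in an RV64 register
    mask = (1 << elwidth) - 1
    result = 0
    for j in range(nelems):
        if j < vl:
            # hidden predicate bit set: take the elementwise add result
            el = ((rs1_val >> (j * elwidth)) +
                  (rs2_val >> (j * elwidth))) & mask
        else:
            # predicate bit clear: pass rd's element through unmodified
            el = (rd_val >> (j * elwidth)) & mask
        result |= el << (j * elwidth)
    return result

# VL=3: only the bottom 3 bytes change; bytes 3..7 of rd are preserved
print(hex(simd_add_hidden_pred(0xFFFFFFFFFFFFFFFF,
                               0x0101010101010101,
                               0x0202020202020202, 3)))
```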
1609
1610 ## Specific instruction walk-throughs
1611
1612 This section covers walk-throughs of the above-outlined procedure
1613 for converting standard RISC-V scalar arithmetic operations to
1614 polymorphic widths, to ensure that it is correct.
1615
1616 ### add
1617
1618 Standard Scalar RV32/RV64 (xlen):
1619
1620 * RS1 @ xlen bits
1621 * RS2 @ xlen bits
1622 * add @ xlen bits
1623 * RD @ xlen bits
1624
1625 Polymorphic variant:
1626
1627 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1628 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1629 * add @ max(rs1, rs2) bits
* RD @ rd bits: zero-extend to rd if rd > max(rs1, rs2), otherwise truncate

Note here that polymorphic add zero-extends its source operands,
whereas addw sign-extends.
1634
1635 ### addw
1636
1637 The RV Specification specifically states that "W" variants of arithmetic
1638 operations always produce 32-bit signed values. In a polymorphic
1639 environment it is reasonable to assume that the signed aspect is
1640 preserved, where it is the length of the operands and the result
1641 that may be changed.
1642
1643 Standard Scalar RV64 (xlen):
1644
1645 * RS1 @ xlen bits
1646 * RS2 @ xlen bits
1647 * add @ xlen bits
1648 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1649
1650 Polymorphic variant:
1651
1652 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1653 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1654 * add @ max(rs1, rs2) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, rs2), otherwise truncate

Note here that polymorphic addw sign-extends its source operands,
whereas add zero-extends.
1659
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1664
1665 Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
1666 where for add they are both zero-extended. This holds true for all arithmetic
1667 operations ending with "W".
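
Both variants can be captured in one Python sketch (the function and
parameter names are illustrative; `signed=True` models the "W"
behaviour):

```python
def poly_add(rs1, rs2, rs1_w, rs2_w, rd_w, signed=False):
    opw = max(rs1_w, rs2_w)   # operation width

    def ext(v, frm, to):
        # zero-extend, or sign-extend when modelling the "W" variant
        v &= (1 << frm) - 1
        if signed and v & (1 << (frm - 1)):
            v -= 1 << frm
        return v & ((1 << to) - 1)

    res = (ext(rs1, rs1_w, opw) + ext(rs2, rs2_w, opw)) & ((1 << opw) - 1)
    return ext(res, opw, rd_w)  # extend or truncate to RD's width

# 8-bit 0xFF + 16-bit 0x0001, 16-bit result:
print(hex(poly_add(0xFF, 0x0001, 8, 16, 16)))               # 0x100 (add: 0xFF zero-extends)
print(hex(poly_add(0xFF, 0x0001, 8, 16, 16, signed=True)))  # 0x0 (addw: 0xFF is -1)
```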
1668
1669 ### addiw
1670
1671 Standard Scalar RV64I:
1672
1673 * RS1 @ xlen bits, truncated to 32-bit
1674 * immed @ 12 bits, sign-extended to 32-bit
1675 * add @ 32 bits
* RD @ xlen bits: sign-extend the 32-bit result to xlen.
1677
1678 Polymorphic variant:
1679
1680 * RS1 @ rs1 bits
1681 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1682 * add @ max(rs1, 12) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, 12), otherwise truncate
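
A sketch of the immediate case (Python; illustrative names, and it
assumes rs1's width is at least 12 bits, so RS1 itself needs no
extension):

```python
def poly_addiw(rs1, imm12, rs1_w, rd_w):
    opw = max(rs1_w, 12)

    def sext(v, frm, to):
        v &= (1 << frm) - 1
        if v & (1 << (frm - 1)):   # sign bit set
            v -= 1 << frm
        return v & ((1 << to) - 1)

    # sign-extend the 12-bit immediate to the operation width, add,
    # then sign-extend or truncate the result to RD's width
    res = ((rs1 & ((1 << rs1_w) - 1)) +
           sext(imm12, 12, opw)) & ((1 << opw) - 1)
    return sext(res, opw, rd_w)

# 16-bit RS1 = 0x10 plus immediate 0xFFF (-1): result 0xf
print(hex(poly_addiw(0x10, 0xFFF, 16, 16)))
```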
1684
1685
1686 # Exceptions
1687
TODO: expand. Exceptions may occur at any time, in any given underlying
scalar operation. This implies that context-switching (traps) may
occur, and that execution must resume where it left off. That in
turn implies that the full state - including the current parallel
element being processed - has to be saved and restored. This is
what the **STATE** CSR is for.
1694
1695 The implications are that all underlying individual scalar operations
1696 "issued" by the parallelisation have to appear to be executed sequentially.
1697 The further implications are that if two or more individual element
1698 operations are underway, and one with an earlier index causes an exception,
1699 it may be necessary for the microarchitecture to **discard** or terminate
1700 operations with higher indices.
1701
1702 This being somewhat dissatisfactory, an "opaque predication" variant
1703 of the STATE CSR is being considered.
1704
1705 # Hints
1706
1707 A "HINT" is an operation that has no effect on architectural state,
1708 where its use may, by agreed convention, give advance notification
1709 to the microarchitecture: branch prediction notification would be
1710 a good example. Usually HINTs are where rd=x0.
1711
1712 With Simple-V being capable of issuing *parallel* instructions where
1713 rd=x0, the space for possible HINTs is expanded considerably. VL
1714 could be used to indicate different hints. In addition, if predication
1715 is set, the predication register itself could hypothetically be passed
1716 in as a *parameter* to the HINT operation.
1717
No specific hints are yet defined in Simple-V.
1719
1720 # Subsets of RV functionality
1721
1722 This section describes the differences when SV is implemented on top of
1723 different subsets of RV.
1724
1725 ## Common options
1726
It is permitted to limit the size of either (or both) of the register
files down to the original size of the standard RV architecture. However,
reducing either register file below the mandatory limits set in the RV
standard will result in non-compliance with the SV Specification.
1731
1732 ## RV32 / RV32F
1733
1734 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1735 maximum limit for predication is also restricted to 32 bits. Whilst not
1736 actually specifically an "option" it is worth noting.
1737
1738 ## RV32G
1739
Normally, in standard RV32, it does not make much sense to have
RV32G; however, it is automatically implied to exist in RV32+SV due to
the option for the element width to be doubled. This may be sufficient
for implementors, such that actually needing RV32G itself (which makes
no sense given that the RV32 integer register file is 32-bit) may be
redundant.
1746
1747 It is a strange combination that may make sense on closer inspection,
1748 particularly given that under the standard RV32 system many of the opcodes
1749 to convert and sign-extend 64-bit integers to 64-bit floating-point will
1750 be missing, as they are assumed to only be present in an RV64 context.
1751
1752 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1753
1754 When floating-point is not implemented, the size of the User Register and
1755 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1756 per table).
1757
1758 ## RV32E
1759
1760 In embedded scenarios the User Register and Predication CSRs may be
1761 dropped entirely, or optionally limited to 1 CSR, such that the combined
1762 number of entries from the M-Mode CSR Register table plus U-Mode
1763 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1764 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1765 the Predication CSR tables.
1766
1767 RV32E is the most likely candidate for simply detecting that registers
1768 are marked as "vectorised", and generating an appropriate exception
1769 for the VL loop to be implemented in software.
1770
1771 ## RV128
1772
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits, given that XLEN is now 128.
1777
1778 # Under consideration <a name="issues"></a>
1779
For element-grouping, if there is unused space within a register
(3 16-bit elements in a 64-bit register for example), recommend:
1782
1783 * For the unused elements in an integer register, the used element
1784 closest to the MSB is sign-extended on write and the unused elements
1785 are ignored on read.
1786 * The unused elements in a floating-point register are treated as-if
1787 they are set to all ones on write and are ignored on read, matching the
1788 existing standard for storing smaller FP values in larger registers.
1789
1790 ---
1791
1792 info register,
1793
> One solution is to just not support LR/SC wider than a fixed
> implementation-dependent size, which must be at least
> 1 XLEN word, which can be read from a read-only CSR
> that can also be used for info like the kind and width of
> hw parallelism supported (128-bit SIMD, minimal virtual
> parallelism, etc.) and other things (like maybe the number
> of registers supported).
1801
1802 > That CSR would have to have a flag to make a read trap so
1803 > a hypervisor can simulate different values.
1804
1805 ----
1806
1807 > And what about instructions like JALR? 
1808
1809 answer: they're not vectorised, so not a problem
1810
1811 ----
1812
1813 * if opcode is in the RV32 group, rd, rs1 and rs2 bitwidth are
1814 XLEN if elwidth==default
1815 * if opcode is in the RV32I group, rd, rs1 and rs2 bitwidth are
1816 *32* if elwidth == default
1817
1818 ---
1819
1820 TODO: update elwidth to be default / 8 / 16 / 32
1821
1822 ---
1823
1824 TODO: document different lengths for INT / FP regfiles, and provide
1825 as part of info register. 00=32, 01=64, 10=128, 11=reserved.
1826
1827