simple_v_extension/specification.mdwn

   1 # Simple-V (Parallelism Extension Proposal) Specification
   2
   3 * Status: DRAFTv0.1
   4 * Last edited: 30 sep 2018
   5
   6 With thanks to:
   7
   8 * Allen Baum
   9 * Jacob Bachmeyer
  10 * Guy Lemieux
  11 * Jacob Lifshay
  12 * The RISC-V Founders, without whom this all would not be possible.
  13
  14 [[!toc ]]
  15
  16 # Summary and Background: Rationale
  17
  18 Simple-V is a uniform parallelism API for RISC-V hardware that has several
  19 unplanned side-effects including code-size reduction, expansion of
  20 HINT space and more.  The reason for
  21 creating it is to provide a manageable way to turn a pre-existing design
  22 into a parallel one, in a step-by-step incremental fashion, allowing
  23 the implementor to focus on adding hardware where it is needed and necessary.
  24
  25 Critically: **No new instructions are added**.  The parallelism (if any
  26 is implemented) is implicitly added by tagging *standard* scalar registers
  27 for redirection.  When such a tagged register is used in any instruction,
  28 it indicates that the PC shall **not** be incremented; instead a loop
  29 is activated where *multiple* instructions are issued to the pipeline
  30 (as determined by a length CSR), with contiguously incrementing register
  31 numbers starting from the tagged register.  When the last "element"
  32 has been reached, only then is the PC permitted to move on.  Thus
  33 Simple-V effectively sits (slots) *in between* the instruction decode phase
  34 and the ALU(s).
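
As a rough illustration of the expansion just described, the following
Python sketch models one tagged instruction being turned into a run of
element operations.  It is a sketch only: `expand`, the `tagged` set and
the tuple format are hypothetical names invented here, and predication,
element widths and scalar-destination corner cases are deliberately ignored.

```python
def expand(op, rd, rs1, rs2, tagged, vl):
    """Expand one scalar instruction into the element operations issued.

    `tagged` is a hypothetical set of register numbers marked as vectors.
    Untagged registers stay fixed (scalar); tagged ones increment per element.
    """
    if not ({rd, rs1, rs2} & tagged):
        return [(op, rd, rs1, rs2)]      # plain scalar: PC moves on at once
    issued = []
    for i in range(vl):                  # PC is held until the loop ends
        issued.append((op,
                       rd + i if rd in tagged else rd,
                       rs1 + i if rs1 in tagged else rs1,
                       rs2 + i if rs2 in tagged else rs2))
    return issued

# x8 and x16 are tagged as vectors, x5 remains a scalar, VL is 4
ops = expand("add", 8, 16, 5, tagged={8, 16}, vl=4)
```

Here `ops` holds four `add` operations writing registers 8 to 11, reading
registers 16 to 19 and, each time, the untagged scalar register 5.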
  35
  36 The barrier to entry with SV is therefore very low.  The minimum
  37 compliant implementation is software-emulation (traps), requiring
  38 only the CSRs and CSR tables, and that an exception be thrown if an
  39 instruction's registers are detected to have been tagged.  The looping
  40 that would otherwise be done in hardware is thus carried out in software,
  41 instead.  Whilst much slower, it is "compliant" with the SV specification,
  42 and may be suited for implementation in RV32E and also in situations
  43 where the implementor wishes to focus on certain aspects of SV without
  44 committing unnecessary time and resources to the silicon, whilst still
  45 conforming strictly to the API.  A good area to punt to software would
  46 be the polymorphic element width capability, for example.
  47
  48 Hardware Parallelism, if any, is therefore added at the implementor's
  49 discretion to turn what would otherwise be a sequential loop into a
  50 parallel one.
  51
  52 To emphasise that clearly: Simple-V (SV) is *not*:
  53
  54 * A SIMD system
  55 * A SIMT system
  56 * A Vectorisation Microarchitecture
  57 * A microarchitecture of any specific kind
  58 * A mandatory parallel processor microarchitecture of any kind
  59 * A supercomputer extension
  60
  61 SV does **not** tell implementors how or even if they should implement
  62 parallelism: it is a hardware "API" (Application Programming Interface)
  63 that, if implemented, presents a uniform and consistent way to *express*
  64 parallelism, at the same time leaving the choice of if, how, how much,
  65 when and whether to parallelise operations **entirely to the implementor**.
  66
  67 # CSRs <a name="csrs"></a>
  68
  69 For U-Mode there are two CSR key-value stores needed to create lookup
  70 tables which are used at the register decode phase.
  71
  72 * A register CSR key-value table (typically 8 32-bit CSRs of 2 16-bit entries each)
  73 * A predication CSR key-value table (again, 8 32-bit CSRs of 2 16-bit entries each)
  74 * Small M-Mode and S-Mode register and predication CSR key-value tables
  75   (2 32-bit CSRs of 2x 16-bit entries each).
  76 * An optional "reshaping" CSR key-value table which remaps from a 1D
  77   linear shape to 2D or 3D, including full transposition.
  78
  79 There are also four additional CSRs for User-Mode:
  80
  81 * CFG subsets the CSR tables
  82 * MVL (the Maximum Vector Length)
  83 * VL (which has different characteristics from standard CSRs)
  84 * STATE (useful for saving and restoring during context switch,
  85   and for providing fast transitions)
  86
  87 There are also three additional CSRs for Supervisor-Mode:
  88
  89 * SMVL
  90 * SVL
  91 * SSTATE
  92
  93 And likewise for M-Mode:
  94
  95 * MMVL
  96 * MVL
  97 * MSTATE
  98
  99 Both Supervisor and M-Mode have their own (small) CSR register and
 100 predication tables of only 4 entries each.
 101
 102 ## CFG
 103
 104 This CSR may be used to switch between subsets of the CSR Register and
 105 Predication Tables: it is kept to 5 bits so that a single CSRRWI instruction
 106 can be used.  A setting of all ones is reserved to indicate that SimpleV
 107 is disabled.
 108
 109 | (4..3) | (2...0) |
 110 | ------ | ------- |
 111 | size   | bank    |
 112
 113 Bank is 3 bits in size, and indicates the starting index of the CSR
 114 Register and Predication Table entries that are "enabled".  Given that
 115 each CSR table row is 32 bits and contains 2 16-bit CAM entries, there
 116 are only 8 CSRs to cover in each table, so 3 bits is sufficient.
 117
 118 Size is 2 bits.  With the exception of when bank == 7 and size == 3,
 119 the number of elements enabled is taken by left-shifting 2 by size:
 120
 121 | size   | elements |
 122 | ------ | -------- |
 123 | 0      | 2        |
 124 | 1      | 4        |
 125 | 2      | 8        |
 126 | 3      | 16       |
 127
 128 Given that there are 2 16-bit CAM entries per CSR table row, this
 129 may also be viewed as the number of CSR rows to enable, by raising 2 to
 130 the power of size.
 131
 132 Examples:
 133
 134 * When bank = 0 and size = 3, SVREGCFG0 through to SVREGCFG7 are
 135   enabled, and SVPREDCFG0 through to SVPREDCFG7 are enabled.
 136 * When bank = 1 and size = 3, SVREGCFG1 through to SVREGCFG7 are
 137   enabled, and SVPREDCFG1 through to SVPREDCFG7 are enabled.
 138 * When bank = 3 and size = 0, SVREGCFG3 and SVPREDCFG3 are enabled.
 140 * When bank = 7 and size = 1, SVREGCFG7 and SVPREDCFG7 are enabled.
 141 * When bank = 7 and size = 3, SimpleV is entirely disabled.
 142
 143 In this way it is possible to enable and disable SimpleV with a
 144 single instruction, and, furthermore, on context-switching the quantity
 145 of CSRs to be saved and restored is greatly reduced.
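
The bank/size rules and the examples above can be checked with a short
Python sketch.  The function names are hypothetical, and the clamping at
row 7 is an assumption inferred from the bank = 1, size = 3 example rather
than something stated normatively.

```python
def make_cfg(bank, size):
    """Build a 5-bit CFG value: size in bits 4..3, bank in bits 2..0."""
    return ((size & 0x3) << 3) | (bank & 0x7)

def enabled_rows(cfg):
    """Return the CSR table row numbers enabled by a 5-bit CFG value.

    Assumes 1 << size rows are enabled starting at `bank`, never running
    past row 7, and that all-ones means SimpleV is disabled.
    """
    cfg &= 0x1F
    if cfg == 0x1F:
        return []                         # bank == 7, size == 3: disabled
    size = (cfg >> 3) & 0x3
    bank = cfg & 0x7
    last = min(bank + (1 << size), 8)     # clamp at the final row
    return list(range(bank, last))

rows = enabled_rows(make_cfg(bank=1, size=3))   # SVREGCFG1..SVREGCFG7
```

Each enabled row holds two CAM entries, so the element count in the size
table is simply twice the (unclamped) row count.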
 146
 147 ## MAXVECTORLENGTH (MVL)
 148
 149 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
 150 is variable length and may be dynamically set.  MVL is
 151 however limited to the regfile bitwidth XLEN (1-32 for RV32,
 152 1-64 for RV64 and so on).
 153
 154 The reason for setting this limit is so that predication registers, when
 155 marked as such, may fit into a single register as opposed to fanning out
 156 over several registers.  This keeps the implementation a little simpler.
 157
 158 The other important factor to note is that the actual MVL is **offset
 159 by one**, so that it can fit into only 6 bits (for RV64) and still cover
 160 a range up to XLEN bits.  So, when setting the MVL CSR to 0, this actually
 161 means that MVL==1.  When setting the MVL CSR to 3, this actually means
 162 that MVL==4, and so on.  This is expressed more clearly in the "pseudocode"
 163 section, where there are subtle differences between CSRRW and CSRRWI.
 164
 165 ## Vector Length (VL)
 166
 167 VSETVL is slightly different from RVV.  Like RVV, VL is set to be within
 168 the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN):
 169
 170     VL = rd = MIN(vlen, MVL)
 171
 172 where 1 <= MVL <= XLEN
 173
 174 However, just like MVL, it is important to note that the range for VL has
 175 subtle design implications, covered in the "CSR pseudocode" section.
 176
 177 The fixed (specific) setting of VL allows vector LOAD/STORE to be used
 178 to switch the entire bank of registers using a single instruction (see
 179 Appendix, "Context Switch Example").  The reason for limiting VL to XLEN
 180 is down to the fact that predication bits fit into a single register of
 181 length XLEN bits.
 182
 183 The second change is that when VSETVL is requested to be stored
 184 into x0, it is *ignored* silently (VSETVL x0, x5).
 185
 186 The third and most important change is that, within the limits set by
 187 MVL, the value passed in **must** be set in VL (and in the
 188 destination register).
 189
 190 This has implications for the microarchitecture, as VL is required to be
 191 set (limits from MVL notwithstanding) to the actual value
 192 requested.  RVV has the option to set VL to an arbitrary value that suits
 193 the conditions and the micro-architecture: SV does *not* permit this.
 194
 195 The reason is so that if SV is to be used for a context-switch or as a
 196 substitute for LOAD/STORE-Multiple, the operation can be done with only
 197 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
 198 single LD/ST operation).  If VL does *not* get set to the register file
 199 length when VSETVL is called, then a software-loop would be needed.
 200 To avoid this need, VL *must* be set to exactly what is requested
 201 (limits notwithstanding).
 202
 203 Therefore, in turn, unlike RVV, implementors *must* provide
 204 pseudo-parallelism (using sequential loops in hardware) if actual
 205 hardware-parallelism in the ALUs is not deployed.  A hybrid is also
 206 permitted (as used in Broadcom's VideoCore-IV) however this must be
 207 *entirely* transparent to the ISA.
 208
 209 The fourth change is that VSETVL is implemented as a CSR, where the
 210 behaviour of CSRRW (and CSRRWI) must be changed to specifically store
 211 the *new* value in the destination register, **not** the old value.
 212 Where context-load/save is to be implemented in the usual fashion
 213 by using a single CSRRW instruction to obtain the old value, the
 214 *secondary* CSR must be used (SVSTATE).  This CSR behaves
 215 exactly as standard CSRs, and contains more than just VL.
 216
 217 One interesting side-effect of using CSRRWI to set VL is that this
 218 may be done with a single instruction, useful particularly for a
 219 context-load/save.  There are however limitations: CSRRWI's immediate
 220 is limited to 0-31.
 221
 222 ## STATE
 223
 224 This is a standard CSR that contains sufficient information for a
 225 full context save/restore.  It contains (and permits setting of)
 226 MVL, VL, CFG, the destination element offset of the current parallel
 227 instruction being executed, and, for twin-predication, the source
 228 element offset as well.  Interestingly it may hypothetically
 229 also be used to make the immediately-following instruction skip a
 230 certain number of elements, however the recommended method to do
 231 this is predication.
 232
 233 Setting destoffs and srcoffs is realistically intended for saving state
 234 so that exceptions (page faults in particular) may be serviced and the
 235 hardware-loop that was being executed at the time of the trap, from
 236 user-mode (or Supervisor-mode), may be returned to and continued from
 237 where it left off.  This works because the User-Mode STATE CSR is
 238 neither changed nor used whilst in M-Mode or S-Mode (which is
 239 entirely why M-Mode and S-Mode have their own STATE CSRs).
 240
 241 The format of the STATE CSR is as follows:
 242
 243 | (28..27) | (26..24) | (23..18) | (17..12) | (11..6) | (5...0) |
 244 | -------- | -------- | -------- | -------- | ------- | ------- |
 245 | size     | bank     | destoffs | srcoffs  | vl      | maxvl   |
 246
 247 When setting this CSR, the following characteristics will be enforced:
 248
 249 * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
 250 * **VL** will be truncated (after offset) to be within the range 1 to MAXVL
 251 * **srcoffs** will be truncated to be within the range 0 to VL-1
 252 * **destoffs** will be truncated to be within the range 0 to VL-1
 253
 254 ## MVL, VL and CSR Pseudocode
 255
 256 The pseudo-code for get and set of VL and MVL are as follows:
 257
 258     set_mvl_csr(value, rd):
 259         regs[rd] = MVL
 260         MVL = MIN(value, XLEN) # MVL is limited to XLEN
 261
 262     get_mvl_csr(rd):
 263         regs[rd] = MVL
 264
 265     set_vl_csr(value, rd):
 266         VL = MIN(value, MVL)
 267         regs[rd] = VL # yes returning the new value NOT the old CSR
 268
 269     get_vl_csr(rd):
 270         regs[rd] = VL
 271
 272 Note that where setting MVL behaves as a normal CSR, unlike standard CSR
 273 behaviour, setting VL will return the **new** value of VL **not** the old
 274 one.
 275
 276 For CSRRWI, the range of the immediate is restricted to 5 bits.  In order to
 277 maximise the effectiveness, an immediate of 0 is used to set VL=1,
 278 an immediate of 1 is used to set VL=2 and so on:
 279
 280     CSRRWI_Set_MVL(value):
 281         set_mvl_csr(value+1, x0)
 282
 283     CSRRWI_Set_VL(value):
 284         set_vl_csr(value+1, x0)
 285
 286 However for CSRRW the following pseudocode is used for MVL and VL,
 287 where setting the value to zero will cause an exception to be raised.
 288 The reason is that if VL or MVL are set to zero, the STATE CSR is
 289 not capable of returning that value.
 290
 291     CSRRW_Set_MVL(rs1, rd):
 292         value = regs[rs1]
 293         if value == 0:
 294             raise Exception
 295         set_mvl_csr(value, rd)
 296
 297     CSRRW_Set_VL(rs1, rd):
 298         value = regs[rs1]
 299         if value == 0:
 300             raise Exception
 301         set_vl_csr(value, rd)
 302
 303 In this way, when CSRRW is utilised with a loop variable, the value
 304 that goes into VL (and into the destination register) may be used
 305 in an instruction-minimal fashion:
 306
 307      CSRvect1 = {type: F, key: a3, val: a3, elwidth: dflt}
 308      CSRvect2 = {type: F, key: a7, val: a7, elwidth: dflt}
 309      CSRRWI MVL, 3          # sets MVL == **4** (not 3)
 310      j zerotest             # in case loop counter a0 already 0
 311     loop:
 312      CSRRW VL, t0, a0       # vl = t0 = min(mvl, a0)
 313      ld     a3, a1          # load 4 registers a3-6 from x
 314      slli   t1, t0, 3       # t1 = vl * 8 (in bytes)
 315      ld     a7, a2          # load 4 registers a7-10 from y
 316      add    a1, a1, t1      # increment pointer to x by vl*8
 317      fmadd a7, a3, fa0, a7 # v1 += v0 * fa0 (y = a * x + y)
 318      sub    a0, a0, t0      # n -= vl (t0)
 319      st     a7, a2          # store 4 registers a7-10 to y
 320      add    a2, a2, t1      # increment pointer to y by vl*8
 321     zerotest:
 322      bnez   a0, loop        # repeat if n != 0
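
The control flow of the assembly loop above can be mirrored in Python to
see how VL shrinks on the final pass.  This is a behavioural sketch under
stated assumptions: `set_vl` and `daxpy` are hypothetical names, Python
lists stand in for memory, and the inner `for` stands in for the hardware
element loop.

```python
def set_vl(requested, mvl):
    """Model of CSRRW to VL: returns the *new* VL, clamped to MVL."""
    if requested == 0:
        raise ValueError("VL of zero cannot be set with CSRRW")
    return min(requested, mvl)

def daxpy(a, x, y, mvl=4):
    """Compute y = a*x + y in chunks of at most `mvl` elements.

    Returns the VL used on each pass so the strip-mining can be inspected.
    """
    n, pos, passes = len(x), 0, []
    while n != 0:                      # zerotest: bnez a0, loop
        vl = set_vl(n, mvl)            # CSRRW VL, t0, a0
        for i in range(vl):            # implicit hardware loop (fmadd)
            y[pos + i] += a * x[pos + i]
        pos += vl                      # pointer increments (add a1 / add a2)
        n -= vl                        # sub a0, a0, t0
        passes.append(vl)
    return passes

x = [float(i) for i in range(10)]
y = [1.0] * 10
passes = daxpy(2.0, x, y)
```

With ten elements and MVL of 4, the loop runs three times, the last pass
using a shorter VL for the two remaining elements.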
 323
 324 With the STATE CSR, just like with CSRRWI, in order to maximise the
 325 utilisation of the limited bitspace, "000000" in binary represents
 326 VL==1, "000001" represents VL==2 and so on (likewise for MVL):
 327
 328     CSRRW_Set_SV_STATE(rs1, rd):
 329         value = regs[rs1]
 330         get_state_csr(rd)
 331         set_mvl_csr(value[5:0]+1, x0)   # maxvl field, stored minus one
 332         set_vl_csr(value[11:6]+1, x0)   # vl field, stored minus one
 333         CFG = value[28:24]
 334         destoffs = value[23:18]
 335         srcoffs = value[17:12]
 336
 337     get_state_csr(rd):
 338         regs[rd] = (MVL-1) | (VL-1)<<6 | (srcoffs)<<12 |
 339                    (destoffs)<<18 | (CFG)<<24
 340         return regs[rd]
 341
 342 In both cases, whilst CSR read of VL and MVL return the exact values
 343 of VL and MVL respectively, reading and writing the STATE CSR returns
 344 those values **minus one**.  This is absolutely critical to implement
 345 if the STATE CSR is to be used for fast context-switching.
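
A round-trip through the STATE encoding can be sketched in Python as
follows.  The field positions are taken from `get_state_csr` above; the
function names and the `xlen` parameter are assumptions made for
illustration, and the clamping follows the "truncated" rules listed for
the STATE CSR.

```python
def pack_state(mvl, vl, srcoffs, destoffs, cfg):
    """Encode STATE: MVL and VL are stored minus one, 6 bits each."""
    return (((mvl - 1) & 0x3F)
            | (((vl - 1) & 0x3F) << 6)
            | ((srcoffs & 0x3F) << 12)
            | ((destoffs & 0x3F) << 18)
            | ((cfg & 0x1F) << 24))

def unpack_state(value, xlen=64):
    """Decode STATE, applying the range limits in dependency order."""
    mvl = min((value & 0x3F) + 1, xlen)             # 1 .. XLEN
    vl = min(((value >> 6) & 0x3F) + 1, mvl)        # 1 .. MVL
    srcoffs = min((value >> 12) & 0x3F, vl - 1)     # 0 .. VL-1
    destoffs = min((value >> 18) & 0x3F, vl - 1)    # 0 .. VL-1
    cfg = (value >> 24) & 0x1F
    return mvl, vl, srcoffs, destoffs, cfg

saved = pack_state(mvl=64, vl=17, srcoffs=3, destoffs=5, cfg=0b11000)
restored = unpack_state(saved)
```

Storing the lengths minus one is what allows a 6-bit field to represent
the full range 1 to 64, at the cost of being unable to represent zero.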
 346
 347 ## Register CSR key-value (CAM) table
 348
 349 The purpose of the Register CSR table is three-fold:
 350
 351 * To mark integer and floating-point registers as requiring "redirection"
 352   if they are ever used as a source or destination in any given operation.
 353   This involves a level of indirection through a 5-to-7-bit lookup table,
 354   such that **unmodified** operands with 5 bits (3 for Compressed) may
 355   access up to **128** registers.
 356 * To indicate whether, after redirection through the lookup table, the
 357   register is a vector (or remains a scalar).
 358 * To over-ride the implicit or explicit bitwidth that the operation would
 359   normally give the register.
 360
 361 | RgCSR | 15       | (14..8)  | 7   | (6..5) | (4..0)  |
 362 | ----- | -        | -        | -   | ------ | ------- |
 363 | 0     | isvec0   | regidx0  | i/f | vew0   | regkey  |
 364 | 1     | isvec1   | regidx1  | i/f | vew1   | regkey  |
 365 | ..    | isvec..  | regidx.. | i/f | vew..  | regkey  |
 366 | 15    | isvec15  | regidx15 | i/f | vew15  | regkey  |
 367
 368 i/f is set to "1" to indicate that the redirection/tag entry is to be applied
 369 to integer registers; 0 indicates that it is relevant to floating-point
 370 registers.  vew has the following meanings, indicating that the instruction's
 371 operand size is "over-ridden" in a polymorphic fashion:
 372
 373 | vew | bitwidth   |
 374 | --- | ---------- |
 375 | 00  | default    |
 376 | 01  | default/2  |
 377 | 10  | default\*2 |
 378 | 11  | 8          |
 379
 380 As the above table is a CAM (key-value store) it may be appropriate
 381 (faster, implementation-wise) to expand it as follows:
 382
 383     struct vectorised fp_vec[32], int_vec[32];
 384
 385     for (i = 0; i < 16; i++) // 16 CSRs?
 386        tb = int_vec if CSRvec[i].type == 0 else fp_vec
 387        idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
 388        tb[idx].elwidth  = CSRvec[i].elwidth
 389        tb[idx].regidx   = CSRvec[i].regidx  // indirection
 390        tb[idx].isvector = CSRvec[i].isvector // 0=scalar
 391        tb[idx].packed   = CSRvec[i].packed  // SIMD or not
 392
 393 The actual size of the CSR Register table depends on the platform
 394 and on whether other Extensions are present (RV64G, RV32E, etc.).
 395 For details see "Subsets" section.
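
Decoding a single 16-bit entry can be sketched as below.  The bit
positions come from the RgCSR table above and the width mapping from the
vew table; `RegEntry`, `decode_reg_entry` and `element_bitwidth` are
hypothetical names, and the i/f polarity follows the prose description
(1 = integer), which is an assumption where the pseudocode differs.

```python
from collections import namedtuple

RegEntry = namedtuple("RegEntry", "isvec regidx is_int vew regkey")

def decode_reg_entry(entry):
    """Split one 16-bit Register CSR CAM entry into its fields."""
    return RegEntry(
        isvec=bool((entry >> 15) & 0x1),    # bit 15
        regidx=(entry >> 8) & 0x7F,         # bits 14..8: real register
        is_int=bool((entry >> 7) & 0x1),    # bit 7: i/f, 1 = integer
        vew=(entry >> 5) & 0x3,             # bits 6..5
        regkey=entry & 0x1F,                # bits 4..0: register in opcode
    )

def element_bitwidth(vew, default):
    """Apply the vew override to an operation's default bitwidth."""
    return {0: default, 1: default // 2, 2: default * 2, 3: 8}[vew]

# integer register x10 redirected to real register 40, as a vector,
# with elements half the default width
raw = (1 << 15) | (40 << 8) | (1 << 7) | (0b01 << 5) | 10
entry = decode_reg_entry(raw)
width = element_bitwidth(entry.vew, default=64)
```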
 396
 397 16-bit CSR Register CAM entries are mapped directly into 32-bit CSRs
 398 on any RV32-based system, however RV64 (XLEN=64) and RV128 (XLEN=128)
 399 are slightly different: the 16-bit entries appear (and can be set)
 400 multiple times, in an overlapping fashion.  Here is the table for RV64:
 401
 402 | CSR#  | 63..48  | 47..32  | 31..16  | 15..0   |
     | ----- | ------- | ------- | ------- | ------- |
 403 | 0x4c0 | RgCSR3  | RgCSR2  | RgCSR1  | RgCSR0  |
 404 | 0x4c1 | RgCSR5  | RgCSR4  | RgCSR3  | RgCSR2  |
 405 | 0x4c2 | ...     | ...     | ...     | ...     |
 406 | 0x4c6 | RgCSR15 | RgCSR14 | RgCSR13 | RgCSR12 |
 407 | 0x4c7 | n/a     | n/a     | RgCSR15 | RgCSR14 |
 408
 409 The rules for writing to these CSRs are that any entries above the ones
 410 being set will be automatically wiped (to zero), so to fill several entries
 411 they must be written in a sequentially increasing manner.  This functionality
 412 was in an early draft of RVV and it means that, firstly, compilers do not have
 413 to spend time zero-ing out CSRs unnecessarily, and secondly, that on
 414 context-switching (and function calls) the number of CSRs that may need
 415 saving is implicitly known.
 416
 417 The reason for the overlapping entries is that in the worst-case on an
 418 RV64 system, only 4 64-bit CSR reads/writes are required for a full
 419 context-switch (and an RV128 system, only 2 128-bit CSR reads/writes).
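
The overlapping view may be easier to follow as a small Python model.
It assumes, from the first two rows of the RV64 table, that each
successive CSR number advances by two 16-bit entries whatever XLEN is;
`entries_in_csr` and `read_csr` are hypothetical helper names.

```python
REG_CSR_BASE = 0x4C0   # first Register CSR number in the RV64 table

def entries_in_csr(csr, xlen):
    """Indices of the 16-bit entries visible through CSR number `csr`."""
    first = (csr - REG_CSR_BASE) * 2          # two entries per CSR number
    return [i for i in range(first, first + xlen // 16) if i < 16]

def read_csr(table, csr, xlen):
    """Assemble the XLEN-wide value read from `csr`, lowest entry first."""
    value = 0
    for slot, idx in enumerate(entries_in_csr(csr, xlen)):
        value |= (table[idx] & 0xFFFF) << (16 * slot)
    return value

# A full save on RV64 needs only every other CSR number
table = list(range(16))                       # dummy entry contents
saved = [read_csr(table, csr, 64) for csr in (0x4C0, 0x4C2, 0x4C4, 0x4C6)]
```

On RV64 the four non-overlapping reads cover all 16 entries, which is the
worst-case context-switch cost quoted above.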
 420
 421 --
 422
 423 TODO: move elsewhere
 424
 425     # TODO: use elsewhere (retire for now)
 426     vew = CSRbitwidth[rs1]
 427     if (vew == 0)
 428         bytesperreg = (XLEN/8) # or FLEN as appropriate
 429     elif (vew == 1)
 430         bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
 431     else:
 432         bytesperreg = bytestable[vew] # 8 or 16
 433     simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
 434     vlen = CSRvectorlen[rs1] * simdmult
 435     CSRvlength = MIN(MIN(vlen, MAXVECTORLENGTH), rs2)
 436
 437 The reason for multiplying the vector length by the number of SIMD elements
 438 (in each individual register) is so that each SIMD element may optionally be
 439 predicated.
 440
 441 An example of how to subdivide the register file when bitwidth != default
 442 is given in the section "Bitwidth Virtual Register Reordering".
 443
 444 ## Predication CSR <a name="predication_csr_table"></a>
 445
 446 TODO: update CSR tables, now 7-bit for regidx
 447
 448 The Predication CSR is a key-value store indicating whether, if a given
 449 destination register (integer or floating-point) is referred to in an
 450 instruction, it is to be predicated.  It is particularly important to note
 451 that the *actual* register used can be *different* from the one that is
 452 in the instruction, due to the redirection through the lookup table.
 453
 454 * regidx, in combination with the i/f flag, identifies the integer or
 455   floating-point register which, when referred to in an operation,
 456   causes the lookup table to be consulted to find the predication
 457   mask to apply to that operation.
 459 * predidx (in combination with the bank bit in the future) is the
 460   *actual* register to be used for the predication mask.  Note:
 461   in effect predidx is actually a 6-bit register address, as the bank
 462   bit is the MSB (and is nominally set to zero for now).
 463 * inv indicates that the predication mask bits are to be inverted
 464   prior to use *without* actually modifying the contents of the
 465   register itself.
 466 * zeroing is either 1 or 0, and if set to 1, the operation must
 467   place zeros in any element position where the predication mask is
 468   set to zero.  If zeroing is set to 0, unpredicated elements *must*
 469   be left alone.  Some microarchitectures may choose to interpret
 470   this as skipping the operation entirely.  Others which wish to
 471   stick more closely to a SIMD architecture may choose instead to
 472   interpret unpredicated elements as an internal "copy element"
 473   operation (which would be necessary in SIMD microarchitectures
 474   that perform register-renaming)
 475 * "packed" indicates if the register is to be interpreted as SIMD
 476   i.e. containing multiple contiguous elements of size equal to "bitwidth".
 477   (Note: in earlier drafts this was in the Register CSR table.
 478   However after extending to 7 bits there was not enough space.
 479   To use "unpredicated" packed SIMD, set the predicate to x0 and
 480   set "invert".  This has the effect of setting a predicate of all 1s)
 481
 482 | PrCSR | 13     | 12     | 11    | 10  | (9..5)  | (4..0)  |
 483 | ----- | -      | -      | -     | -   | ------- | ------- |
 484 | 0     | bank0  | zero0  | inv0  | i/f | regidx  | predkey |
 485 | 1     | bank1  | zero1  | inv1  | i/f | regidx  | predkey |
 486 | ..    | bank.. | zero.. | inv.. | i/f | regidx  | predkey |
 487 | 15    | bank15 | zero15 | inv15 | i/f | regidx  | predkey |
 488
 489 The Predication CSR Table is a key-value store, so implementation-wise
 490 it will be faster to turn the table around (maintain topologically
 491 equivalent state):
 492
 493     struct pred {
 494         bool zero;
 495         bool inv;
 496         bool enabled;
 497         int predidx; // redirection: actual int register to use
 498     }
 499
 500     struct pred fp_pred_reg[32];   // 64 in future (bank=1)
 501     struct pred int_pred_reg[32];  // 64 in future (bank=1)
 502
 503     for (i = 0; i < 16; i++)
 504       tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
 505       idx = CSRpred[i].regidx
 506       tb[idx].zero = CSRpred[i].zero
 507       tb[idx].inv  = CSRpred[i].inv
 508       tb[idx].predidx  = CSRpred[i].predidx
 509       tb[idx].enabled  = true
 510
 511 So when an operation is to be predicated, it is the internal state that
 512 is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
 513 pseudo-code for operations is given, where p is the explicit (direct)
 514 reference to the predication register to be used:
 515
 516     for (int i=0; i<vl; ++i)
 517         if ([!]preg[p][i])
 518            (d ? vreg[rd][i] : sreg[rd]) =
 519             iop(s1 ? vreg[rs1][i] : sreg[rs1],
 520                 s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
 521
 522 This instead becomes an *indirect* reference using the *internal* state
 523 table generated from the Predication CSR key-value store, which is used
 524 as follows.
 525
 526     if type(iop) == INT:
 527         preg = int_pred_reg[rd]
 528     else:
 529         preg = fp_pred_reg[rd]
 530
 531     predicate, zeroing = get_pred_val(type(iop) == INT, rd)
 532     for (int i=0; i<vl; ++i)
 533         if (predicate & (1<<i))
 534            (d ? regfile[rd+i] : regfile[rd]) =
 535             iop(s1 ? regfile[rs1+i] : regfile[rs1],
 536                 s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
 537         else if (zeroing)
 538            (d ? regfile[rd+i] : regfile[rd]) = 0
 539
 540 Note:
 541
 542 * d, s1 and s2 are booleans indicating whether destination,
 543   source1 and source2 are vector or scalar
 544 * key-value CSR-redirection of rd, rs1 and rs2 have NOT been included
 545   above, for clarity.  rd, rs1 and rs2 all must ALSO go through
 546   register-level redirection (from the Register CSR table) if they are
 547   vectors.
 548
 549 If written as a function, obtaining the predication mask (and whether
 550 zeroing takes place) may be done as follows:
 551
 552     def get_pred_val(bool is_int_op, int reg):
 553        tb = int_reg if is_int_op else fp_reg
 554        if (!tb[reg].enabled):
 555           return ~0x0, False       // all enabled; no zeroing
 556        tb = int_pred if is_int_op else fp_pred
 557        if (!tb[reg].enabled):
 558           return ~0x0, False       // all enabled; no zeroing
 559        predidx = tb[reg].predidx   // redirection occurs HERE
 560        predicate = intreg[predidx] // actual predicate HERE
 561        if (tb[reg].inv):
 562           predicate = ~predicate   // invert ALL bits
 563        return predicate, tb[reg].zero
 564
 565 Note here, critically, that **only** if the register is marked
 566 in its CSR **register** table entry as being "active" does the testing
 567 proceed further to check if the CSR **predicate** table entry is
 568 also active.
 569
 570 Note also that this is in direct contrast to branch operations
 571 for the storage of comparisons: in these specific circumstances
 572 the requirement for there to be an active CSR *register* entry
 573 is removed.
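
To make the inversion and zeroing behaviour concrete, here is a runnable
Python sketch of a predicated vector add.  It is a simplification under
stated assumptions: only the predication table is consulted (the register
table check described above is omitted), a single list stands in for the
integer register file, and `Pred`, `get_pred_val` and `predicated_add` are
illustrative names.

```python
class Pred:
    """One expanded predication-table entry (mirrors `struct pred` above)."""
    def __init__(self, predidx, inv=False, zero=False, enabled=True):
        self.predidx, self.inv = predidx, inv
        self.zero, self.enabled = zero, enabled

def get_pred_val(pred_tb, intregs, reg, xlen=64):
    """Return (mask, zeroing) for destination register `reg`."""
    all_ones = (1 << xlen) - 1
    entry = pred_tb.get(reg)
    if entry is None or not entry.enabled:
        return all_ones, False            # unpredicated: all elements active
    mask = intregs[entry.predidx]         # redirection to the mask register
    if entry.inv:
        mask = ~mask & all_ones           # inverted copy; register untouched
    return mask, entry.zero

def predicated_add(regfile, rd, rs1, rs2, vl, mask, zeroing):
    """Element-wise add of two vectors under a predicate mask."""
    for i in range(vl):
        if mask & (1 << i):
            regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i]
        elif zeroing:
            regfile[rd + i] = 0           # masked-out element is cleared
        # otherwise the masked-out element is left alone

regs = [0] * 32
regs[5] = 0b0101                          # predicate mask held in x5
regs[16:20] = [1, 2, 3, 4]
regs[20:24] = [10, 20, 30, 40]
regs[8:12] = [99] * 4
table = {8: Pred(predidx=5, zero=True)}
mask, zeroing = get_pred_val(table, regs, 8)
predicated_add(regs, 8, 16, 20, 4, mask, zeroing)
```

With zeroing enabled the two masked-out destination elements become 0;
with it disabled they would keep their previous value of 99.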
 574
 575 ## REMAP CSR
 576
 577 (Note: both the REMAP and SHAPE sections are best read after the
 578  rest of the document has been read)
 579
 580 There is one 32-bit CSR which may be used to indicate which registers,
 581 if used in any operation, must be "reshaped" (re-mapped) from a linear
 582 form to a 2D or 3D transposed form.  The 32-bit REMAP CSR may reshape
 583 up to 3 registers:
 584
 585 | 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
 586 | ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
 587 | shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
 588
 589 regidx0-2 refer not to the Register CSR CAM entry but to the underlying
 590 *real* register (see regidx, the value) and consequently are 7 bits wide.
 591 shape0-2 each refer to one of three SHAPE CSRs.  A value of 0x3 is reserved.
 592 Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
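
The REMAP layout can be decoded with a few shifts and masks, as in the
Python sketch below.  `decode_remap` is a hypothetical helper, and raising
`ValueError` for reserved encodings is an assumption made for the sketch:
how an implementation reacts to them is not defined here.

```python
RESERVED_MASK = (1 << 7) | (1 << 15) | (1 << 23) | (0x3 << 30)

def decode_remap(value):
    """Split a 32-bit REMAP value into three (regidx, shape) pairs."""
    if value & RESERVED_MASK:
        raise ValueError("reserved REMAP bits must be zero")
    pairs = []
    for n in range(3):
        regidx = (value >> (8 * n)) & 0x7F        # bits 6..0, 14..8, 22..16
        shape = (value >> (24 + 2 * n)) & 0x3     # bits 25..24, 27..26, 29..28
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")
        pairs.append((regidx, shape))
    return pairs

# regidx0=10 uses SHAPE1, regidx1=20 uses SHAPE2, regidx2=30 uses SHAPE0
remap_value = 10 | (20 << 8) | (30 << 16) | (1 << 24) | (2 << 26) | (0 << 28)
pairs = decode_remap(remap_value)
```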
 593
 594 ## SHAPE 1D/2D/3D vector-matrix remapping CSRs
 595
 596 (Note: both the REMAP and SHAPE sections are best read after the
 597  rest of the document has been read)
 598
 599 There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
 600 which have the same format.  When each SHAPE CSR is set entirely to zeros,
 601 remapping is disabled: the register's elements are a linear (1D) vector.
 602
 603 | 26..24  | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
 604 | ------- | -- | ------- | -- | ------- | -- | ------- |
 605 | permute | 0  | zdimsz  | 0  | ydimsz  | 0  | xdimsz  |
 606
 607 xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
 608 that the array dimensionality for that dimension is 1.  A value of xdimsz=2
 609 would indicate that in the first dimension there are 3 elements in the
 610 array.  The format of the array is therefore as follows:
 611
 612     array[xdim+1][ydim+1][zdim+1]
 613
 614 However whilst illustrative of the dimensionality, that does not take the
 615 "permute" setting into account.  "permute" may be any one of six values
 616 (0-5, with values of 6 and 7 being reserved, and not legal).  The table
 617 below shows how the permutation dimensionality order works:
 618
 619 | permute | order | array format             |
 620 | ------- | ----- | ------------------------ |
 621 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 622 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 623 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 624 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 625 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 626 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 627
 628 In other words, the "permute" option changes the order in which
 629 nested for-loops over the array would be done.  The algorithm below
 630 shows this more clearly, and may be executed as a python program:
 631
 632     # mapidx = REMAP.shape2
 633     xdim = 3 # SHAPE[mapidx].xdim_sz+1
 634     ydim = 4 # SHAPE[mapidx].ydim_sz+1
 635     zdim = 5 # SHAPE[mapidx].zdim_sz+1
 636
 637     lims = [xdim, ydim, zdim]
 638     idxs = [0,0,0] # starting indices
 639     order = [1,0,2] # experiment with different permutations, here
 640
 641     for idx in range(xdim * ydim * zdim):
 642         new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
 643         print(new_idx, end=" ")
 644         for i in range(3):
 645             idxs[order[i]] = idxs[order[i]] + 1
 646             if (idxs[order[i]] != lims[order[i]]):
 647                 break
 648             print()
 649             idxs[order[i]] = 0
 650
 651 Here, it is assumed that this algorithm is run within all pseudo-code
 652 throughout this document where a (parallelism) for-loop would normally
 653 run from 0 to VL-1 to refer to contiguous register
 654 elements; instead, where REMAP indicates to do so, the element index
 655 is run through the above algorithm to work out the **actual** element
 656 index, instead.  Given that there are three possible SHAPE entries, up to
 657 three separate registers in any given operation may be simultaneously
 658 remapped:
 659
 660     function op_add(rd, rs1, rs2) # add not VADD!
 661       ...
 662       ...
 663       for (i = 0; i < VL; i++)
 664         if (predval & 1<<i) # predication uses intregs
 665            ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
 666                                  ireg[rs2+remap(irs2)];
 667         if (int_vec[rd ].isvector)  { id += 1; }
 668         if (int_vec[rs1].isvector)  { irs1 += 1; }
 669         if (int_vec[rs2].isvector)  { irs2 += 1; }
 670
 671 By changing remappings, 2D matrices may be transposed "in-place" for one
 672 operation, followed by setting a different permutation order without
 673 having to move the values in the registers to or from memory.  Also,
 674 the reason for having REMAP separate from the three SHAPE CSRs is so
 675 that in a chain of matrix multiplications and additions, for example,
 676 the SHAPE CSRs need only be set up once; only the REMAP CSR need be
 677 changed to target different registers.
 678
 679 Note that:
 680
 681 * If permute option 000 is utilised, the actual order of the
 682   reindexing does not change!
 683 * If two or more dimensions are set to zero, the actual order does not change!
 684 * The above algorithm is pseudo-code **only**.  Actual implementations
 685   will need to take into account the fact that the element for-looping
 686   must be **re-entrant**, due to the possibility of exceptions occurring.
 687   See MSTATE CSR, which records the current element index.
 688 * Twin-predicated operations require **two** separate and distinct
 689   element offsets.  The above pseudo-code algorithm will be applied
 690   separately and independently to each, should each of the two
 691   operands be remapped.  *This even includes C.LDSP* and other operations
 692   in that category, where in that case it will be the **offset** that is
 693   remapped (see Compressed Stack LOAD/STORE section).
 694 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 695   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 696   entries to be regularly presented to operands **more than once**, thus
 697   allowing the same underlying registers to act as an accumulator of
 698   multiple vector or matrix operations, for example.
 699
 700 Clearly here some considerable care needs to be taken as the remapping
 701 could hypothetically create arithmetic operations that target the
 702 exact same underlying registers, resulting in data corruption due to
 703 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 704 register-renaming will have an easier time dealing with this than
 705 DSP-style SIMD micro-architectures.
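
The reindexing can be sketched in Python as follows.  This is only an
illustration under stated assumptions: the function name `remap_offsets`
is hypothetical, `lims` holds the actual size of each dimension (not
whatever encoding the SHAPE CSRs use), and the offset formula assumes that
x is the fastest-varying dimension in the register layout.  The inner
stepping loop is the one from the algorithm above.

```python
def remap_offsets(lims, order, vl):
    """Return vl remapped element offsets for a shape and a permutation order.

    lims  -- (xdim, ydim, zdim) as actual sizes (assumption, see lead-in)
    order -- permutation of (0, 1, 2): which dimension is stepped first
    """
    xdim, ydim, _zdim = lims
    idxs = [0, 0, 0]
    offsets = []
    for _ in range(vl):
        # Assumed layout: x varies fastest, then y, then z.
        offsets.append(idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim)
        # Step the indices in the requested order, carrying on wrap-around.
        for i in range(3):
            dim = order[i]
            idxs[dim] += 1
            if idxs[dim] != lims[dim]:
                break
            idxs[dim] = 0
    return offsets
```

With a 3x2 shape, `order=(0, 1, 2)` would walk the registers in their
natural order, whereas `order=(1, 0, 2)` steps through y first, giving a
transposed walk over the same registers.  Asking for more elements than
the shape contains makes the sequence wrap around, which is the
accumulator case noted above.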
 706
 707 # Instruction Execution Order
 708
 709 Simple-V behaves as if it is a hardware-level "macro expansion system",
 710 substituting and expanding a single instruction into multiple sequential
 711 instructions with contiguous and sequentially-incrementing registers.
 712 As such, it does **not** modify - or specify - the behaviour and semantics of
 713 the execution order: that may be deduced from the **existing** RV
 714 specification in each and every case.
 715
So for example if a particular micro-architecture permits out-of-order
execution, and it is augmented with Simple-V, then wherever the original
instructions may be executed out-of-order, so may the "post-expansion"
SV ones.
 719
 720 If on the other hand there are memory guarantees which specifically
 721 prevent and prohibit certain instructions from being re-ordered
 722 (such as the Atomicity Axiom, or FENCE constraints), then clearly
 723 those constraints **MUST** also be obeyed "post-expansion".
 724
It should be absolutely clear that SV is **not** about providing new
functionality or changing the existing behaviour of a micro-architectural
design, or about changing the RISC-V Specification.
It is **purely** about compacting what would otherwise be contiguous
instructions that use sequentially-increasing register numbers down
to the **one** instruction.
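
The "macro expansion" view can be illustrated with a short Python sketch
that lists the scalar instructions which one tagged instruction stands
for.  The helper `expand` and its `is_vector` dictionary are hypothetical,
and predication, register redirection and element width are deliberately
ignored here.

```python
def expand(op, rd, rs1, rs2, vl, is_vector):
    """List the scalar instructions that a single SV instruction expands to.

    is_vector maps a register number to True if it is tagged as a vector
    (hypothetical stand-in for the CSR tables).
    """
    regs = (rd, rs1, rs2)
    tagged = [is_vector.get(r, False) for r in regs]
    # An instruction with no tagged registers is an ordinary scalar one.
    count = vl if any(tagged) else 1
    insns = []
    for i in range(count):
        nums = [r + i if t else r for r, t in zip(regs, tagged)]
        insns.append("%s x%d, x%d, x%d" % (op, *nums))
    return insns
```

For instance, an `add` with rd=x8 and rs1=x16 tagged as vectors, rs2=x5
left scalar and VL=3 would be listed as three adds on x8/x16, x9/x17 and
x10/x18, each using x5 as the second source.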
 731
 732 # Instructions
 733
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field).  Despite the removal of all operations,
*all instructions from RVV Base (with the exception of CLIP and VSELECT.X)
are topologically re-mapped and retain their complete functionality,
intact*.  Note that if RV64G ever had a MV.X added as well as FCLIP,
the full functionality of RVV-Base would be obtained in SV.
 744
 745 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
 746 equivalents, so are left out of Simple-V.  VSELECT could be included if
 747 there existed a MV.X instruction in RV (MV.X is a hypothetical
 748 non-immediate variant of MV that would allow another register to
 749 specify which register was to be copied).  Note that if any of these three
 750 instructions are added to any given RV extension, their functionality
 751 will be inherently parallelised.
 752
 753 With some exceptions, where it does not make sense or is simply too
 754 challenging, all RV-Base instructions are parallelised:
 755
 756 * CSR instructions, whilst a case could be made for fast-polling of
 757   a CSR into multiple registers, would require guarantees of strict
 758   sequential ordering that SV does not provide.  Therefore, CSRs are
 759   not really suitable and are left out.
 760 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
 761   left as scalar.
* LR/SC could hypothetically be parallelised, however their purpose is
  single (complex) atomic memory operations where the LR must be followed
  up by a matching SC.  A sequence of parallel LR instructions followed
  by a sequence of parallel SC instructions is therefore guaranteed to
  not be useful.  Not least: the guarantees of LR/SC
  would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers so are not inherently
  parallelisable anyway.
 770
 771 All other operations using registers are automatically parallelised.
 772 This includes AMOMAX, AMOSWAP and so on, where particular care and
 773 attention must be paid.
 774
 775 Example pseudo-code for an integer ADD operation (including scalar operations).
 776 Floating-point uses fp csrs.
 777
 778     function op_add(rd, rs1, rs2) # add not VADD!
 779       int i, id=0, irs1=0, irs2=0;
 780       predval = get_pred_val(FALSE, rd);
 781       rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
 782       rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
 783       rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
 784       for (i = 0; i < VL; i++)
 785         if (predval & 1<<i) # predication uses intregs
 786            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
 787         if (int_vec[rd ].isvector)  { id += 1; }
 788         if (int_vec[rs1].isvector)  { irs1 += 1; }
 789         if (int_vec[rs2].isvector)  { irs2 += 1; }
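
The pseudo-code above can be modelled in Python roughly as follows.  This
is a sketch, not a reference implementation: the `Tag` class stands in for
a register-table entry, the predicate value is passed in directly rather
than looked up with `get_pred_val`, and the 64-bit wrap-around mask assumes
RV64.  All of these names are hypothetical.

```python
from dataclasses import dataclass

XLEN_MASK = (1 << 64) - 1  # assumption: RV64 integer registers


@dataclass
class Tag:
    """One (hypothetical) register-table entry."""
    isvector: bool = False
    regidx: int = 0


def op_add(ireg, int_vec, rd, rs1, rs2, vl, predval):
    """Model of the op_add pseudo-code; ireg is a list of register values."""
    nums = (rd, rs1, rs2)
    tags = [int_vec.get(r, Tag()) for r in nums]
    # Redirect only the registers that are tagged as vectors.
    base = [t.regidx if t.isvector else r for t, r in zip(tags, nums)]
    offs = [0, 0, 0]  # id, irs1, irs2 in the pseudo-code
    for i in range(vl):
        if predval & (1 << i):
            total = ireg[base[1] + offs[1]] + ireg[base[2] + offs[2]]
            ireg[base[0] + offs[0]] = total & XLEN_MASK
        # Only vector operands advance to the next register.
        for k, tag in enumerate(tags):
            if tag.isvector:
                offs[k] += 1
```

A scalar source (here it would be an untagged rs2) keeps its offset at
zero, so the same register is re-read for every element, which is how
vector-scalar addition falls out of the same loop.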
 790
 791 ## Instruction Format
 792
There are **no operations added to SV, at all**.
Instead SV *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV, FCVT, and
LOAD/STORE depending on CSR configurations for bitwidth and predication.
**Everything** becomes parallelised.  *This includes Compressed
instructions* as well as any future instructions and Custom Extensions.
 801
 802 ## Branch Instructions
 803
 804 ### Standard Branch <a name="standard_branch"></a>
 805
Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).

Note that the predication register to use (if one is enabled) is taken from
the *first* src register.  The target (destination) predication register
to use (if one is enabled) is taken from the *second* src register.
 813
 814 If either of src1 or src2 are scalars (whether by there being no
 815 CSR register entry or whether by the CSR entry specifically marking
 816 the register as "scalar") the comparison goes ahead as vector-scalar
 817 or scalar-vector.
 818
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).

Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
src1 and src2.
 828
 829 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
 830 for predicated compare operations of function "cmp":
 831
 832     for (int i=0; i<vl; ++i)
 833       if ([!]preg[p][i])
 834          preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
 835                            s2 ? vreg[rs2][i] : sreg[rs2]);
 836
 837 With associated predication, vector-length adjustments and so on,
 838 and temporarily ignoring bitwidth (which makes the comparisons more
 839 complex), this becomes:
 840
 841     s1 = reg_is_vectorised(src1);
 842     s2 = reg_is_vectorised(src2);
 843
 844     if not s1 && not s2
 845         if cmp(rs1, rs2) # scalar compare
 846             goto branch
 847         return
 848
 849     preg = int_pred_reg[rd]
 850     reg = int_regfile
 851
 852     ps = get_pred_val(I/F==INT, rs1);
 853     rd = get_pred_val(I/F==INT, rs2); # this may not exist
 854
 855     if not exists(rd)
 856         temporary_result = 0
 857     else
 858         preg[rd] = 0; # initialise to zero
 859
    for (int i = 0; i < VL; ++i)
      if (ps & (1<<i)) && (cmp(s1 ? reg[src1+i]:reg[src1],
                               s2 ? reg[src2+i]:reg[src2]))
          if not exists(rd)
              temporary_result |= 1<<i;
          else
              preg[rd] |= 1<<i;  # bitfield not vector

    if not exists(rd)
        if temporary_result == ps
            goto branch
    else
        if preg[rd] == ps
            goto branch
 874
 875 Notes:
 876
 877 * zeroing has been temporarily left out of the above pseudo-code,
 878   for clarity
 879 * Predicated SIMD comparisons would break src1 and src2 further down
 880   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 881   Reordering") setting Vector-Length times (number of SIMD elements) bits
 882   in Predicate Register rd, as opposed to just Vector-Length bits.
 883
 884 TODO: predication now taken from src2.  also branch goes ahead
 885 if all compares are successful.
 886
 887 Note also that where normally, predication requires that there must
 888 also be a CSR register entry for the register being used in order
 889 for the **predication** CSR register entry to also be active,
 890 for branches this is **not** the case.  src2 does **not** have
 891 to have its CSR register entry marked as active in order for
 892 predication on src2 to be active.
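
The predicated branch pseudo-code in this section can be modelled in
Python along the following lines.  The function `sv_branch` is a
hypothetical sketch: the source predicate `ps` is passed in directly, the
result mask is simply returned instead of being written to a predicate
register, and `ps` is masked to VL bits before the final comparison (an
assumption, since the pseudo-code compares against `ps` as-is).

```python
import operator


def sv_branch(reg, src1, src2, s1, s2, vl, ps, cmp=operator.eq):
    """Return (taken, result_mask) for a possibly vectorised branch.

    s1 / s2 say whether src1 / src2 are tagged as vectors.
    """
    if not s1 and not s2:
        # Plain scalar branch: one comparison, no mask produced.
        return cmp(reg[src1], reg[src2]), None
    result = 0
    for i in range(vl):
        if ps & (1 << i):
            a = reg[src1 + i] if s1 else reg[src1]
            b = reg[src2 + i] if s2 else reg[src2]
            if cmp(a, b):
                result |= 1 << i
    # The branch is taken only if every non-masked-out test succeeded.
    active = ps & ((1 << vl) - 1)
    return result == active, result
```

In this model a failing element that is predicated out does not prevent
the branch from being taken, matching the "all tests succeed, excluding
those predicated out" rule described earlier.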
 893
 894 ### Floating-point Comparisons
 895
There are no floating-point branch operations, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
Thus, no change is made to the floating-point comparison.

It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
and whilst in ordinary branch code this is fine because the standard
RVF compare can always be followed up with an integer BEQ or a BNE (or
a compressed comparison to zero or non-zero), in predication terms the
omission has more of an impact.  To deal with this, SV's predication has
had "invert" added to it.
 908
 909 ### Compressed Branch Instruction
 910
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries.  As however there is only the one source register,
given that c.beqz a10 is equivalent to beq a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
 917
 918 The specific required use of x0 is, with a little thought, quite obvious,
 919 but is counterintuitive.  Clearly it is **not** recommended to redirect
 920 x0 with a CSR register entry, however as a means to opaquely obtain
 921 a predication target it is the only sensible option that does not involve
 922 additional special CSRs (or, worse, additional special opcodes).
 923
 924 Note also that, just as with standard branches, the 2nd source
 925 (in this case x0 rather than src2) does **not** have to have its CSR
 926 register table marked as "active" in order for predication to work.
 927
 928 ## Vectorised Dual-operand instructions
 929
 930 There is a series of 2-operand instructions involving copying (and
 931 sometimes alteration):
 932
 933 * C.MV
 934 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
 935 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
 936 * LOAD(-FP) and STORE(-FP)
 937
 938 All of these operations follow the same two-operand pattern, so it is
 939 *both* the source *and* destination predication masks that are taken into
 940 account.  This is different from
 941 the three-operand arithmetic instructions, where the predication mask
 942 is taken from the *destination* register, and applied uniformly to the
 943 elements of the source register(s), element-for-element.
 944
 945 The pseudo-code pattern for twin-predicated operations is as
 946 follows:
 947
 948     function op(rd, rs):
 949       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
 950       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
 951       ps = get_pred_val(FALSE, rs); # predication on src
 952       pd = get_pred_val(FALSE, rd); # ... AND on dest
 953       for (int i = 0, int j = 0; i < VL && j < VL;):
 954         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
 955         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
 956         reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
 957         if (int_csr[rs].isvec) i++;
 958         if (int_csr[rd].isvec) j++;
 959
 960 This pattern covers scalar-scalar, scalar-vector, vector-scalar
 961 and vector-vector, and predicated variants of all of those.
 962 Zeroing is not presently included (TODO).  As such, when compared
 963 to RVV, the twin-predicated variants of C.MV and FMV cover
 964 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
 965 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
 966
 967 Note that:
 968
 969 * elwidth (SIMD) is not covered in the pseudo-code above
 970 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
 971   not covered
 972 * zero predication is also not shown (TODO).
 973
 974 ### C.MV Instruction <a name="c_mv"></a>
 975
 976 There is no MV instruction in RV however there is a C.MV instruction.
 977 It is used for copying integer-to-integer registers (vectorised FMV
 978 is used for copying floating-point).
 979
If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation.  The actual instruction's format does not change:
 983
[[!table  data="""
15  12 | 11   7 | 6  2 | 1  0 |
funct4 | rd     | rs   | op   |
4      | 5      | 5    | 2    |
C.MV   | dest   | src  | C2   |
"""]]
 990
 991 A simplified version of the pseudocode for this operation is as follows:
 992
 993     function op_mv(rd, rs) # MV not VMV!
 994       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
 995       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
 996       ps = get_pred_val(FALSE, rs); # predication on src
 997       pd = get_pred_val(FALSE, rd); # ... AND on dest
 998       for (int i = 0, int j = 0; i < VL && j < VL;):
 999         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
1000         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
1001         ireg[rd+j] <= ireg[rs+i];
1002         if (int_csr[rs].isvec) i++;
1003         if (int_csr[rd].isvec) j++;
1004
1005 There are several different instructions from RVV that are covered by
1006 this one opcode:
1007
1008 [[!table  data="""
1009 src    | dest    | predication   | op             |
1010 scalar | vector  | none          | VSPLAT         |
1011 scalar | vector  | destination   | sparse VSPLAT  |
1012 scalar | vector  | 1-bit dest    | VINSERT        |
1013 vector | scalar  | 1-bit? src    | VEXTRACT       |
1014 vector | vector  | none          | VCOPY          |
1015 vector | vector  | src           | Vector Gather  |
1016 vector | vector  | dest          | Vector Scatter |
1017 vector | vector  | src & dest    | Gather/Scatter |
1018 vector | vector  | src == dest   | sparse VCOPY   |
1019 """]]
1020
1021 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
1022 operations with inversion on the src and dest predication for one of the
1023 two C.MV operations.
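
The twin-predicated move can be modelled with the following Python sketch,
which follows the C.MV pseudo-code above.  The function name and argument
list are hypothetical, register redirection is left out, and two guards
not present in the pseudo-code have been added as assumptions: the
skip-loops stop at VL, and a scalar-to-scalar move performs one copy only.

```python
def twin_pred_mv(reg, rd, rs, rd_isvec, rs_isvec, vl, ps, pd):
    """Copy elements from rs to rd under source (ps) and dest (pd) masks."""
    i = j = 0
    while i < vl and j < vl:
        # Skip masked-out source and destination elements independently.
        if rs_isvec:
            while i < vl and not ps & (1 << i):
                i += 1
        if rd_isvec:
            while j < vl and not pd & (1 << j):
                j += 1
        if i >= vl or j >= vl:
            break
        reg[rd + j] = reg[rs + i]
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        if not rs_isvec and not rd_isvec:
            break  # assumption: plain scalar move, a single copy
```

With only the source mask sparse this behaves as a gather (selected
elements are packed into consecutive destinations); with only the
destination mask sparse it behaves as a scatter; and with a scalar source
and a vector destination it behaves as a splat, as listed in the table
above.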
1024
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
1030
1031 ### FMV, FNEG and FABS Instructions
1032
These are identical in form to C.MV, except covering floating-point
register copying.  The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.
1039
1040 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
1041
### FCVT Instructions
1043
1044 These are again identical in form to C.MV, except that they cover
1045 floating-point to integer and integer to floating-point.  When element
1046 width in each vector is set to default, the instructions behave exactly
1047 as they are defined for standard RV (scalar) operations, except vectorised
1048 in exactly the same fashion as outlined in C.MV.
1049
1050 However when the source or destination element width is not set to default,
1051 the opcode's explicit element widths are *over-ridden* to new definitions,
1052 and the opcode's element width is taken as indicative of the SIMD width
1053 (if applicable i.e. if packed SIMD is requested) instead.
1054
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 32-bit (single-precision) floating-point
number in rd.  If however the source rs1 is set to be a vector, where
elwidth is set to default/2 and "packed SIMD" is enabled, then the first
32 bits of rs1 are converted to a floating-point number to be stored in
rd's first element and the higher 32-bits *also* converted to
floating-point and stored in the second.  The 32 bit size comes from the
fact that FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set
to divide that by two it means that rs1 element width is to be taken as 32.
1064
1065 Similar rules apply to the destination register.
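
The FCVT.S.L example can be worked through in Python as below.  This is a
sketch under assumptions that the text does not settle: "first 32 bits"
is read as the least-significant half of the register, each half is
treated as a signed two's-complement integer, and the helper names are
hypothetical.

```python
import struct


def split_elements(value, reg_bits=64, elwidth=32):
    """Split a register value into signed elements, lowest bits first."""
    mask = (1 << elwidth) - 1
    elements = []
    for k in range(reg_bits // elwidth):
        chunk = (value >> (k * elwidth)) & mask
        if chunk >> (elwidth - 1):      # sign bit set: two's complement
            chunk -= 1 << elwidth
        elements.append(chunk)
    return elements


def to_single(x):
    """Round an integer to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def packed_int_to_float(value):
    """Convert both 32-bit halves of a 64-bit register to floats."""
    return [to_single(e) for e in split_elements(value)]
```

A register whose low half holds 7 and whose high half holds the bit
pattern of -2 would, in this model, produce two floating-point elements
rather than one conversion of the whole 64-bit value.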
1066
1067 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
1068
1069 An earlier draft of SV modified the behaviour of LOAD/STORE.  This
1070 actually undermined the fundamental principle of SV, namely that there
1071 be no modifications to the scalar behaviour (except where absolutely
1072 necessary), in order to simplify an implementor's task if considering
1073 converting a pre-existing scalar design to support parallelism.
1074
So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
does not change in SV, however just as with C.MV it is important to note
that dual-predication is possible.  Using the template outlined in
the section "Vectorised Dual-operand instructions", the pseudo-code covering
scalar-scalar, scalar-vector, vector-scalar and vector-vector applies,
where SCALAR\_OPERATION is as follows, exactly as for a standard
scalar RV LOAD operation:
1082
1083         srcbase = ireg[rs+i];
1084         return mem[srcbase + imm];
1085
1086 Whilst LOAD and STORE remain as-is when compared to their scalar
1087 counterparts, the incrementing on the source register (for LOAD)
1088 means that pointers-to-structures can be easily implemented, and
1089 if contiguous offsets are required, those pointers (the contents
1090 of the contiguous source registers) may simply be set up to point
1091 to contiguous locations.
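
The pointers-to-structures case can be sketched in Python as follows.
The function `sv_load` is hypothetical, memory is modelled as a dictionary
from address to value, and predication is left out (all elements active)
to keep the example short.

```python
def sv_load(reg, mem, rd, rs, imm, rd_isvec, rs_isvec, vl):
    """Vectorised LOAD: each source *register* supplies its own base address."""
    for i in range(vl):
        src = rs + i if rs_isvec else rs
        dst = rd + i if rd_isvec else rd
        # Same address calculation as a scalar LOAD: base register + immediate.
        reg[dst] = mem[reg[src] + imm]
```

If three consecutive source registers hold pointers to three separate
structures, the immediate selects the same field in each, and the three
field values arrive in three consecutive destination registers.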
1092
1093 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
1094
1095 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
1096 where it is implicit in C.LWSP/FLWSP that x2 is the source register.
1097 It is therefore possible to use predicated C.LWSP to efficiently
1098 pop registers off the stack (by predicating x2 as the source), cherry-picking
1099 which registers to store to (by predicating the destination).  Likewise
1100 for C.SWSP.  In this way, LOAD/STORE-Multiple is efficiently achieved.
1101
However, to do so, the behaviour of C.LWSP/C.SWSP needs to be slightly
different: where x2 is marked as vectorised, rather than incrementing
the register on each loop (x2, x3, x4...), it is the *immediate*
that must be incremented.  Pseudo-code follows:
1106
1107     function lwsp(rd, rs):
1108       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
1109       rs = x2 # effectively no redirection on x2.
1110       ps = get_pred_val(FALSE, rs); # predication on src
1111       pd = get_pred_val(FALSE, rd); # ... AND on dest
1112       for (int i = 0, int j = 0; i < VL && j < VL;):
1113         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
1114         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
1115         reg[rd+j] = mem[x2 + ((offset+i) * 4)]
1116         if (int_csr[rs].isvec) i++;
1117         if (int_csr[rd].isvec) j++;
1118
1119 For C.LDSP, the offset (and loop) multiplier would be 8, and for
1120 C.LQSP it would be 16.  Effectively this makes C.LWSP etc. a Vector
1121 "Unit Stride" Load instruction.
1122
1123 **Note**: it is still possible to redirect x2 to an alternative target
1124 register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
1125 general-purpose Vector "Unit Stride" LOAD/STORE operations.
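
The "pop several registers off the stack" use of C.LWSP can be modelled
with the Python sketch below.  It assumes x2 is marked as vectorised with
no source predication, takes the destination predicate `pd` as an
argument, and models memory as a dictionary; the function name and the
`width` parameter (4 for C.LWSP, 8 for C.LDSP) are hypothetical.

```python
def sv_lwsp(reg, mem, rd, offset, vl, pd, sp=2, width=4):
    """Unit-stride stack load: the *offset* advances, not the x2 register."""
    i = j = 0
    while i < vl and j < vl:
        # Skip destination registers that are predicated out.
        while j < vl and not pd & (1 << j):
            j += 1
        if j >= vl:
            break
        reg[rd + j] = mem[reg[sp] + (offset + i) * width]
        i += 1
        j += 1
```

Consecutive stack slots are consumed in order, while the destination mask
decides which registers receive them, so a sparse set of registers can be
restored from a densely packed stack frame.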
1126
1127 ## Compressed LOAD / STORE Instructions
1128
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE.  This is **different** from Compressed Stack
LOAD/STORE (C.LWSP / C.SWSP), which have been augmented to become
Vector "Unit Stride" capable.

Just as with uncompressed LOAD/STORE, C.LD / C.SD increment the *register*
during the hardware loop, **not** the offset.
1137
1138 # Element bitwidth polymorphism
1139
1140 Element bitwidth is best covered as its own special section, as it
1141 is quite involved and applies uniformly across-the-board.
1142
1143 # Exceptions
1144
1145 TODO: expand.  Exceptions may occur at any time, in any given underlying
1146 scalar operation.  This implies that context-switching (traps) may
1147 occur, and operation must be returned to where it left off.  That in
1148 turn implies that the full state - including the current parallel
1149 element being processed - has to be saved and restored.  This is
1150 what the **STATE** CSR is for.
1151
1152 The implications are that all underlying individual scalar operations
1153 "issued" by the parallelisation have to appear to be executed sequentially.
1154 The further implications are that if two or more individual element
1155 operations are underway, and one with an earlier index causes an exception,
1156 it may be necessary for the microarchitecture to **discard** or terminate
1157 operations with higher indices.
1158
This being somewhat unsatisfactory, an "opaque predication" variant
of the STATE CSR is being considered.
1161
1162 > And what about instructions like JALR?
1163
1164 answer: they're not vectorised, so not a problem
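
The save-and-resume requirement described in this section can be
illustrated with a small Python sketch.  It is not a description of the
STATE CSR layout: the `state` dictionary, its `"element"` key and the
`ElementTrap` exception are all hypothetical stand-ins.

```python
class ElementTrap(Exception):
    """Stands in for an exception raised by one underlying scalar operation."""


def run_elements(state, vl, element_op):
    """Run element_op for each element, resuming from the saved index.

    If element_op raises, state["element"] still names the faulting
    element, so calling run_elements again continues from there.
    """
    while state["element"] < vl:
        element_op(state["element"])   # may raise ElementTrap
        state["element"] += 1
    state["element"] = 0               # instruction complete: reset for the next
```

Because the index is only advanced after an element completes, a trap
handler could service the fault and re-enter the loop without any
completed element being repeated.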
1165
1166 # Hints
1167
1168 A "HINT" is an operation that has no effect on architectural state,
1169 where its use may, by agreed convention, give advance notification
1170 to the microarchitecture: branch prediction notification would be
1171 a good example.  Usually HINTs are where rd=x0.
1172
1173 With Simple-V being capable of issuing *parallel* instructions where
1174 rd=x0, the space for possible HINTs is expanded considerably.  VL
1175 could be used to indicate different hints.  In addition, if predication
1176 is set, the predication register itself could hypothetically be passed
1177 in as a *parameter* to the HINT operation.
1178
No specific hints are yet defined in Simple-V.
1180
1181 # Subsets of RV functionality
1182
1183 This section describes the differences when SV is implemented on top of
1184 different subsets of RV.
1185
1186 ## Common options
1187
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture.  However, reducing
them below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1192
1193 ## RV32 / RV32F
1194
1195 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1196 maximum limit for predication is also restricted to 32 bits.  Whilst not
1197 actually specifically an "option" it is worth noting.
1198
1199 ## RV32G
1200
1201 Normally in standard RV32 it does not make much sense to have
1202 RV32G, however it is automatically implied to exist in RV32+SV due to
1203 the option for the element width to be doubled.  This may be sufficient
1204 for implementors, such that actually needing RV32G itself (which makes
1205 no sense given that the RV32 integer register file is 32-bit) may be
1206 redundant.
1207
1208 It is a strange combination that may make sense on closer inspection,
1209 particularly given that under the standard RV32 system many of the opcodes
1210 to convert and sign-extend 64-bit integers to 64-bit floating-point will
1211 be missing, as they are assumed to only be present in an RV64 context.
1212
1213 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1214
1215 When floating-point is not implemented, the size of the User Register and
1216 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1217 per table).
1218
1219 ## RV32E
1220
1221 In embedded scenarios the User Register and Predication CSRs may be
1222 dropped entirely, or optionally limited to 1 CSR, such that the combined
1223 number of entries from the M-Mode CSR Register table plus U-Mode
1224 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1225 zero) only 2 16-bit entries (M-Mode CSR table only).  Likewise for
1226 the Predication CSR tables.
1227
1228 RV32E is the most likely candidate for simply detecting that registers
1229 are marked as "vectorised", and generating an appropriate exception
1230 for the VL loop to be implemented in software.
1231
1232 ## RV128
1233
1234 RV128 has not been especially considered, here, however it has some
1235 extremely large possibilities: double the element width implies
1236 256-bit operands, spanning 2 128-bit registers each, and predication
1237 of total length 128 bit given that XLEN is now 128.
1238
1239 # Under consideration <a name="issues"></a>
1240
For element-grouping, if there is unused space within a register
(3 16-bit elements in a 64-bit register for example), recommend:
1243
1244 * For the unused elements in an integer register, the used element
1245   closest to the MSB is sign-extended on write and the unused elements
1246   are ignored on read.
1247 * The unused elements in a floating-point register are treated as-if
1248   they are set to all ones on write and are ignored on read, matching the
1249   existing standard for storing smaller FP values in larger registers.
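
The integer recommendation above can be worked through with a short
Python sketch.  The helper `pack_elements` is hypothetical and only
models the proposed write behaviour for integer registers: elements are
packed lowest-first and the topmost used element is sign-extended into
the unused space.

```python
def pack_elements(elems, elwidth=16, reg_bits=64):
    """Pack elements into one register, sign-extending into unused bits."""
    mask = (1 << elwidth) - 1
    value = 0
    for k, e in enumerate(elems):
        value |= (e & mask) << (k * elwidth)
    used = len(elems) * elwidth
    # If the top used bit is set, fill the unused upper bits with ones.
    if used < reg_bits and (value >> (used - 1)) & 1:
        value |= ((1 << (reg_bits - used)) - 1) << used
    return value
```

With three 16-bit elements in a 64-bit register, the upper 16 bits would
therefore be all ones when the third element is negative and all zeros
otherwise.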
1250