   1 # Simple-V (Parallelism Extension Proposal) Specification
   2
   3 * Status: DRAFTv0.1
   4 * Last edited: 30 sep 2018
   5
   6 With thanks to:
   7
   8 * Allen Baum
   9 * Jacob Bachmeyer
  10 * Guy Lemurieux
  11 * Jacob Lifshay
  12 * The RISC-V Founders, without whom this all would not be possible.
  13
  14 [[!toc ]]
  15
  16 # Summary and Background: Rationale
  17
  18 Simple-V is a uniform parallelism API for RISC-V hardware that has several
  19 unplanned side-effects including code-size reduction, expansion of
  20 HINT space and more.  The reason for
  21 creating it is to provide a manageable way to turn a pre-existing design
  22 into a parallel one, in a step-by-step incremental fashion, allowing
  23 the implementor to focus on adding hardware where it is needed and necessary.
  24
Critically: **No new instructions are added**.  The parallelism (if any
is implemented) is implicitly added by tagging *standard* scalar registers
for redirection.  When such a tagged register is used in any instruction,
it indicates that the PC shall **not** be incremented; instead a loop
is activated where *multiple* instructions are issued to the pipeline
(as determined by a length CSR), with contiguously incrementing register
numbers starting from the tagged register.  Only when the last "element"
has been reached is the PC permitted to move on.  Thus
Simple-V effectively sits (slots) *in between* the instruction decode phase
and the ALU(s).
  35
  36 The barrier to entry with SV is therefore very low.  The minimum is
  37 software-emulation (traps), requiring only the CSRs and CSR tables, and that
  38 an exception be thrown if an instruction is detected to have been
  39 parallelised.  The looping that would otherwise be done in hardware is
  40 thus carried out in software, instead.  Whilst much slower, it is "compliant"
  41 with the SV specification, and may be suited for implementation in RV32E
  42 and also in situations where the implementor wishes to focus on certain
  43 aspects of SV, whilst also conforming strictly with the API.
  44
  45 Hardware Parallelism, if any, is therefore added at the implementor's
  46 discretion to turn what would otherwise be a sequential loop into a
  47 parallel one.
  48
To emphasise that clearly: Simple-V (SV) is *not*:

* A SIMD system
* A SIMT system
* A Vectorisation Microarchitecture
* A microarchitecture of any specific kind
* A mandatory parallel processor microarchitecture of any kind
* A supercomputer extension
  57
  58 SV does **not** tell implementors how or even if they should implement
  59 parallelism: it is a hardware "API" (Application Programming Interface)
  60 that, if implemented, presents a uniform and consistent way to *express*
  61 parallelism, at the same time leaving the choice of if, how, how much and
  62 when to parallelise operations **entirely to the implementor**.
  63
  64 # CSRs <a name="csrs"></a>
  65
For U-Mode there are two CSR key-value stores needed to create lookup
tables which are used at the register decode phase, plus a set of
"reshaping" CSRs:

* A register CSR key-value table (eight 32-bit CSRs, each holding two
  16-bit entries)
* A predication CSR key-value table (again, eight 32-bit CSRs, each holding
  two 16-bit entries)
* A "reshaping" (REMAP) CSR together with three SHAPE CSRs, described in
  the REMAP and SHAPE sections
  72
  73 There are also four additional CSRs for User-Mode:
  74
  75 * CFG subsets the CSR tables
  76 * MVL (the Maximum Vector Length)
  77 * VL (which has different characteristics from standard CSRs)
  78 * STATE (useful for saving and restoring during context switch,
  79   and for providing fast transitions)
  80
  81 There are also three additional CSRs for Supervisor-Mode:
  82
  83 * SMVL
  84 * SVL
  85 * SSTATE
  86
  87 And likewise for M-Mode:
  88
  89 * MMVL
  90 * MVL
  91 * MSTATE
  92
  93 Both Supervisor and M-Mode have their own (small) CSR register and
  94 predication tables of only 4 entries each.
  95
  96 ## CFG
  97
  98 This CSR may be used to switch between subsets of the CSR Register and
  99 Predication Tables: it is kept to 5 bits so that a single CSRRWI instruction
 100 can be used.  A setting of all ones is reserved to indicate that SimpleV
 101 is disabled.
 102
| (4..3) | (2..0) |
| ------ | ------ |
| size   | bank   |

Bank is 3 bits in size, and indicates the starting index of the CSR
entries that are "enabled".  Given that each CSR table row is 32 bits
and contains two 16-bit CAM entries, there are only 8 CSRs to cover in
each table, so 3 bits is sufficient.
 111
Size is 2 bits.  With the exception of when bank == 7 and size == 3,
the number of elements enabled is obtained by left-shifting 2 by size
(that is, 2 << size):
 114
 115 | size   | elements |
 116 | ------ | -------- |
 117 | 0      | 2        |
 118 | 1      | 4        |
 119 | 2      | 8        |
 120 | 3      | 16       |
 121
Given that there are two 16-bit CAM entries per CSR table row, this
may also be viewed as the number of CSR rows to enable, obtained by
raising 2 to the power of size.
 125
Examples:

* When bank = 0 and size = 3, SVREGCFG0 through to SVREGCFG7 are
  enabled, and SVPREDCFG0 through to SVPREDCFG7 are enabled.
* When bank = 1 and size = 3, SVREGCFG1 through to SVREGCFG7 are
  enabled, and SVPREDCFG1 through to SVPREDCFG7 are enabled.
* When bank = 3 and size = 0, SVREGCFG3 and SVPREDCFG3 are enabled.
* When bank = 7 and size = 1, SVREGCFG7 and SVPREDCFG7 are enabled.
* When bank = 7 and size = 3, SimpleV is entirely disabled.
 136
 137 In this way it is possible to enable and disable SimpleV with a
 138 single instruction, and, furthermore, on context-switching the quantity
 139 of CSRs to be saved and restored is greatly reduced.
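
As a rough illustration of the bank/size rules in this section, the following
Python sketch computes which CSR table rows a given CFG value enables.  The
names `enabled_rows` and `NUM_ROWS` are hypothetical, and dropping rows that
would run past row 7 (rather than wrapping around) is an assumption drawn
from the bank = 1, size = 3 example.

```python
NUM_ROWS = 8  # eight 32-bit CSRs per table, two 16-bit CAM entries each


def enabled_rows(cfg):
    """Return the CSR table row indices enabled by a 5-bit CFG value."""
    cfg &= 0x1F
    if cfg == 0x1F:              # all ones is reserved: SimpleV disabled
        return []
    bank = cfg & 0x7             # bits 2..0: starting row
    size = (cfg >> 3) & 0x3      # bits 4..3
    rows = 1 << size             # 2**size rows, i.e. (2 << size) CAM entries
    # Assumption: rows past the end of the table are dropped, not wrapped.
    return list(range(bank, min(bank + rows, NUM_ROWS)))


if __name__ == "__main__":
    for bank, size in [(0, 3), (1, 3), (3, 0), (7, 1), (7, 3)]:
        print(bank, size, enabled_rows((size << 3) | bank))
```

The same row indices apply to both the SVREGCFG and the SVPREDCFG tables,
since a single CFG value selects the subset for both.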
 140
 141 ## MAXVECTORLENGTH
 142
 143 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
 144 is variable length and may be dynamically set.  MAXVECTORLENGTH is
 145 however limited to the regfile bitwidth minus one (31 for RV32, 63 for RV64
 146 and so on).
 147
 148 The reason for setting this limit is so that predication registers, when
 149 marked as such, may fit into a single register as opposed to fanning out
 150 over several registers.  This keeps the implementation a little simpler.
 151
 152 ## VSETVL (VL and CSRs)
 153
 154 VSETVL is slightly different from RVV.  Like RVV, VL is set to be limited
 155 to the MAXVECTORLENGTH, which in turn is limited to XLEN.
 156
 157     VL = rd = MIN(vlen, MAXVECTORLENGTH)
 158
 159 where MAXVECTORLENGTH <= XLEN
 160
 161 This allows vector LOAD/STORE to be used to switch
 162 the entire bank of registers using a single instruction (see Appendix,
 163 "Context Switch Example").  The reason for limiting VSETVL to XLEN is
 164 down to the fact that predication bits fit into a single register of length
 165 XLEN bits.
 166
The second change is that when the result of VSETVL is requested to be
stored into x0, the write to x0 is *ignored* silently (VSETVL x0, x5).
 169
 170 The third and most important change is that, within the limits set by
 171 MAXVECTORLENGTH, the value passed in **must** be set in VL (and in the
 172 destination register).
 173
 174 This has implication for the microarchitecture, as VL is required to be
 175 set (limits from MAXVECTORLENGTH notwithstanding) to the actual value
 176 requested.  RVV has the option to set VL to an arbitrary value that suits
 177 the conditions and the micro-architecture: SV does *not* permit this.
 178
 179 The reason is so that if SV is to be used for a context-switch or as a
 180 substitute for LOAD/STORE-Multiple, the operation can be done with only
 181 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
 182 single LD/ST operation).  If VL does *not* get set to the register file
 183 length when VSETVL is called, then a software-loop would be needed.
 184 To avoid this need, VL *must* be set to exactly what is requested
 185 (limits notwithstanding).
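
The rules so far can be modelled in a few lines of Python.  This is only a
behavioural sketch: the function name `vsetvl` and the list `regs` standing
in for the integer register file are assumptions made for illustration, not
part of the specification.

```python
def vsetvl(regs, rd, requested, maxvl):
    """Model SV's VSETVL: VL is exactly the request, capped at MAXVECTORLENGTH."""
    vl = min(requested, maxvl)   # no implementation-chosen smaller value allowed
    if rd != 0:                  # a write to x0 is silently ignored
        regs[rd] = vl
    return vl                    # the new value of the VL CSR


if __name__ == "__main__":
    regs = [0] * 32
    print(vsetvl(regs, 5, 12, 31), regs[5])   # request fits within MAXVECTORLENGTH
    print(vsetvl(regs, 0, 40, 31), regs[0])   # request is capped; x0 stays zero
```

Because the result is always `min(requested, maxvl)`, software can rely on a
single VSETVL to cover the whole register file, with no strip-mining loop.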
 186
 187 Therefore, in turn, unlike RVV, implementors *must* provide
 188 pseudo-parallelism (using sequential loops in hardware) if actual
 189 hardware-parallelism in the ALUs is not deployed.  A hybrid is also
 190 permitted (as used in Broadcom's VideoCore-IV) however this must be
 191 *entirely* transparent to the ISA.
 192
 193 The fourth change is that VSETVL is implemented as a CSR, where the
 194 behaviour of CSRRW (and CSRRWI) must be changed to specifically store
 195 the *new* value in the destination register, **not** the old value.
 196 Where context-load/save is to be implemented in the usual fashion
 197 by using a single CSRRW instruction to obtain the old value, the
 198 *secondary* CSR must be used (SVSTATE).  This CSR behaves
 199 exactly as standard CSRs, and contains more than just VL.
 200
One interesting side-effect of using CSRRWI to set VL is that this
may be done with a single instruction, useful particularly for a
context-load/save.  There are however limitations: CSRRWI's immediate
is limited to 0-31.
 205
 206 ## STATE
 207
 208 This is a standard CSR that contains sufficient information for a
 209 full context save/restore.  It contains (and permits setting of)
 210 MAXVL, VL, the destination element offset of the current parallel
 211 instruction being executed, and, for twin-predication, the source
 212 element offset as well.  Interestingly it may hypothetically
 213 also be used to get the immediately-following instruction to skip a
 214 certain number of elements, however the recommended method to do
 215 this is predication.
 216
 217 The format of the SVSTATE CSR is as follows:
 218
| (23..18) | (17..12) | (11..6) | (5..0) |
| -------- | -------- | ------- | ------ |
| destoffs | srcoffs  | vl      | maxvl  |
 222
 223 When setting this CSR, the following characteristics will be enforced:
 224
 225 * **MAXVL** will be truncated to be within the range 0 to XLEN-1
 226 * **VL** will be truncated to be within the range 0 to MAXVL
 227 * **srcoffs** will be truncated to be within the range 0 to VL
 228 * **destoffs** will be truncated to be within the range 0 to VL
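
A minimal sketch of packing and unpacking the SVSTATE fields is shown below.
It assumes that "truncated to be within the range" means clamping to the
upper bound, and that all inputs are non-negative; the function names and the
`xlen` parameter are hypothetical.

```python
def pack_svstate(maxvl, vl, srcoffs, destoffs, xlen=64):
    """Build an SVSTATE value, enforcing the range rules listed above."""
    maxvl = min(maxvl, xlen - 1)      # MAXVL: 0 to XLEN-1
    vl = min(vl, maxvl)               # VL: 0 to MAXVL
    srcoffs = min(srcoffs, vl)        # srcoffs: 0 to VL
    destoffs = min(destoffs, vl)      # destoffs: 0 to VL
    return (destoffs << 18) | (srcoffs << 12) | (vl << 6) | maxvl


def unpack_svstate(value):
    """Split an SVSTATE value into its four 6-bit fields."""
    return {
        "maxvl": value & 0x3F,
        "vl": (value >> 6) & 0x3F,
        "srcoffs": (value >> 12) & 0x3F,
        "destoffs": (value >> 18) & 0x3F,
    }


if __name__ == "__main__":
    print(unpack_svstate(pack_svstate(maxvl=8, vl=4, srcoffs=2, destoffs=1)))
    print(unpack_svstate(pack_svstate(maxvl=100, vl=80, srcoffs=70, destoffs=3)))
```

Saving and restoring this single value is what allows a trap handler to
resume a partially completed vector operation at the recorded element offsets.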
 229
 230 ## Register CSR key-value (CAM) table
 231
 232 TODO: update CSR tables, now 7-bit for regidx
 233
 234 The purpose of the Register CSR table is four-fold:
 235
* To mark integer and floating-point registers as requiring "redirection"
  if they are ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-6-bit lookup table,
  such that **unmodified** operands with 5 bits (3 for Compressed) may
  access up to **64** registers.
 241 * To indicate whether, after redirection through the lookup table, the
 242   register is a vector (or remains a scalar).
 243 * To over-ride the implicit or explicit bitwidth that the operation would
 244   normally give the register.
 245 * To indicate if the register is to be interpreted as "packed" (SIMD)
 246   i.e. containing multiple contiguous elements of size equal to "bitwidth".
 247
 248 | RgCSR | 15     | 14     | 13       | (12..11) | 10  | (9..5)  | (4..0)  |
 249 | ----- | -      | -      | -        | -        | -   | ------- | ------- |
 250 | 0     | simd0  | bank0  | isvec0   | vew0     | i/f | regidx  | predidx |
 251 | 1     | simd1  | bank1  | isvec1   | vew1     | i/f | regidx  | predidx |
 252 | ..    | simd.. | bank.. | isvec..  | vew..    | i/f | regidx  | predidx |
 253 | 15    | simd15 | bank15 | isvec15  | vew15    | i/f | regidx  | predidx |
 254
 255 vew may be one of the following (giving a table "bytestable", used below):
 256
 257 | vew | bitwidth   |
 258 | --- | ---------- |
 259 | 00  | default    |
 260 | 01  | default/2  |
 261 | 10  | default\*2 |
 262 | 11  | 8          |
 263
 264 As the above table is a CAM (key-value store) it may be appropriate
 265 to expand it as follows:
 266
 267     struct vectorised fp_vec[32], int_vec[32]; // 64 in future
 268
 269     for (i = 0; i < 16; i++) // 16 CSRs?
 270        tb = int_vec if CSRvec[i].type == 0 else fp_vec
 271        idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
 272        tb[idx].elwidth  = CSRvec[i].elwidth
 273        tb[idx].regidx   = CSRvec[i].regidx  // indirection
 274        tb[idx].isvector = CSRvec[i].isvector // 0=scalar
 275        tb[idx].packed   = CSRvec[i].packed  // SIMD or not
 276
 277 TODO: move elsewhere
 278
 279     # TODO: use elsewhere (retire for now)
 280     vew = CSRbitwidth[rs1]
 281     if (vew == 0)
 282         bytesperreg = (XLEN/8) # or FLEN as appropriate
 283     elif (vew == 1)
 284         bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
 285     else:
 286         bytesperreg = bytestable[vew] # 8 or 16
 287     simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
 288     vlen = CSRvectorlen[rs1] * simdmult
 289     CSRvlength = MIN(MIN(vlen, MAXVECTORLENGTH), rs2)
 290
 291 The reason for multiplying the vector length by the number of SIMD elements
 292 (in each individual register) is so that each SIMD element may optionally be
 293 predicated.
 294
 295 An example of how to subdivide the register file when bitwidth != default
 296 is given in the section "Bitwidth Virtual Register Reordering".
 297
 298 ## Predication CSR <a name="predication_csr_table"></a>
 299
 300 TODO: update CSR tables, now 7-bit for regidx
 301
The Predication CSR is a key-value store indicating whether, if a given
destination register (integer or floating-point) is referred to in an
instruction, it is to be predicated.  It is particularly important to note
that the *actual* register used can be *different* from the one that is
in the instruction, due to the redirection through the lookup table.
 307
* regidx, in combination with the i/f flag, identifies the actual integer
  or floating-point register which, when referred to in an operation,
  causes the lookup table to be consulted to find the predication
  mask to use on that operation.
 313 * predidx (in combination with the bank bit in the future) is the
 314   *actual* register to be used for the predication mask.  Note:
 315   in effect predidx is actually a 6-bit register address, as the bank
 316   bit is the MSB (and is nominally set to zero for now).
 317 * inv indicates that the predication mask bits are to be inverted
 318   prior to use *without* actually modifying the contents of the
 319   register itself.
 320 * zeroing is either 1 or 0, and if set to 1, the operation must
 321   place zeros in any element position where the predication mask is
 322   set to zero.  If zeroing is set to 0, unpredicated elements *must*
 323   be left alone.  Some microarchitectures may choose to interpret
 324   this as skipping the operation entirely.  Others which wish to
 325   stick more closely to a SIMD architecture may choose instead to
 326   interpret unpredicated elements as an internal "copy element"
 327   operation (which would be necessary in SIMD microarchitectures
 328   that perform register-renaming)
 329
 330 | PrCSR | 13     | 12     | 11    | 10  | (9..5)  | (4..0)  |
 331 | ----- | -      | -      | -     | -   | ------- | ------- |
 332 | 0     | bank0  | zero0  | inv0  | i/f | regidx  | predkey |
 333 | 1     | bank1  | zero1  | inv1  | i/f | regidx  | predkey |
 334 | ..    | bank.. | zero.. | inv.. | i/f | regidx  | predkey |
 335 | 15    | bank15 | zero15 | inv15 | i/f | regidx  | predkey |
 336
 337 The Predication CSR Table is a key-value store, so implementation-wise
 338 it will be faster to turn the table around (maintain topologically
 339 equivalent state):
 340
 341     struct pred {
 342         bool zero;
 343         bool inv;
 344         bool enabled;
 345         int predidx; // redirection: actual int register to use
 346     }
 347
 348     struct pred fp_pred_reg[32];   // 64 in future (bank=1)
 349     struct pred int_pred_reg[32];  // 64 in future (bank=1)
 350
 351     for (i = 0; i < 16; i++)
 352       tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
 353       idx = CSRpred[i].regidx
 354       tb[idx].zero = CSRpred[i].zero
 355       tb[idx].inv  = CSRpred[i].inv
 356       tb[idx].predidx  = CSRpred[i].predidx
 357       tb[idx].enabled  = true
 358
 359 So when an operation is to be predicated, it is the internal state that
 360 is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
 361 pseudo-code for operations is given, where p is the explicit (direct)
 362 reference to the predication register to be used:
 363
 364     for (int i=0; i<vl; ++i)
 365         if ([!]preg[p][i])
 366            (d ? vreg[rd][i] : sreg[rd]) =
 367             iop(s1 ? vreg[rs1][i] : sreg[rs1],
 368                 s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
 369
This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows.
 373
    // get_pred_val performs the int_pred_reg / fp_pred_reg lookup internally
    predicate, zeroing = get_pred_val(type(iop) == FP, rd)

    for (int i=0; i<vl; ++i)
        if (predicate & (1<<i))   // bitwise test of element i's mask bit
           (d ? regfile[rd+i] : regfile[rd]) =
            iop(s1 ? regfile[rs1+i] : regfile[rs1],
                s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
        else if (zeroing)
           (d ? regfile[rd+i] : regfile[rd]) = 0
 387
 388 Note:
 389
 390 * d, s1 and s2 are booleans indicating whether destination,
 391   source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity.  rd, rs1 and rs2 must all ALSO go through
  register-level redirection (from the Register CSR table) if they are
  vectors.
 396
 397 If written as a function, obtaining the predication mask (and whether
 398 zeroing takes place) may be done as follows:
 399
    def get_pred_val(bool is_fp_op, int reg):
       tb = fp_vec if is_fp_op else int_vec   // Register CSR state
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       tb = fp_pred_reg if is_fp_op else int_pred_reg // Predication CSR state
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       predidx = tb[reg].predidx   // redirection occurs HERE
       predicate = intreg[predidx] // actual predicate HERE
       if (tb[reg].inv):
          predicate = ~predicate   // invert ALL bits
       return predicate, tb[reg].zero
 412
 413 Note here, critically, that **only** if the register is marked
 414 in its CSR **register** table entry as being "active" does the testing
 415 proceed further to check if the CSR **predicate** table entry is
 416 also active.
 417
Note also that this is in direct contrast to branch operations
for the storage of comparisons: in these specific circumstances
the requirement for there to be an active CSR *register* entry
is removed.
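
The two-stage check described in this section (Register CSR entry first, then
Predication CSR entry) can be tried out with the runnable Python sketch below.
The `RegEntry` and `PredEntry` classes, the explicit table arguments and the
fixed `XLEN` of 64 are all assumptions made for illustration; only the order
of the checks and the handling of `inv` and `zero` follow the text.

```python
from dataclasses import dataclass

XLEN = 64
ALL_ONES = (1 << XLEN) - 1


@dataclass
class RegEntry:              # hypothetical state derived from the Register CSR table
    enabled: bool = False


@dataclass
class PredEntry:             # hypothetical state derived from the Predication CSR table
    enabled: bool = False
    zero: bool = False
    inv: bool = False
    predidx: int = 0


def get_pred_val(reg_tb, pred_tb, intregs, reg):
    """Return (predicate mask, zeroing flag) for register number reg."""
    if not reg_tb[reg].enabled:        # no active register entry: unpredicated
        return ALL_ONES, False
    entry = pred_tb[reg]
    if not entry.enabled:              # no active predicate entry: unpredicated
        return ALL_ONES, False
    predicate = intregs[entry.predidx]     # redirection to the mask register
    if entry.inv:
        predicate = ~predicate & ALL_ONES  # invert without modifying the register
    return predicate, entry.zero


if __name__ == "__main__":
    reg_tb = [RegEntry() for _ in range(32)]
    pred_tb = [PredEntry() for _ in range(32)]
    intregs = [0] * 32
    reg_tb[8].enabled = True
    pred_tb[8] = PredEntry(enabled=True, zero=True, predidx=3)
    intregs[3] = 0b0101
    print(get_pred_val(reg_tb, pred_tb, intregs, 8))
```

Passing the integer or floating-point tables explicitly takes the place of
the `is_fp_op` flag used in the pseudo-code.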
 422
 423 ## REMAP CSR
 424
 425 (Note: both the REMAP and SHAPE sections are best read after the
 426  rest of the document has been read)
 427
 428 There is one 32-bit CSR which may be used to indicate which registers,
 429 if used in any operation, must be "reshaped" (re-mapped) from a linear
 430 form to a 2D or 3D transposed form.  The 32-bit REMAP CSR may reshape
 431 up to 3 registers:
 432
 433 | 29..28 | 27..26 | 25..24 | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
 434 | ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
 435 | shape2 | shape1 | shape0 | 0  | regidx2 | 0  | regidx1 | 0  | regidx0 |
 436
regidx0-2 refer not to the Register CSR CAM entry but to the underlying
*real* register (see regidx, the value) and consequently are 7 bits wide.
shape0-2 each refer to one of the three SHAPE CSRs.  A value of 0x3 is reserved.
Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
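
The field layout of the REMAP CSR can be decoded as in the Python sketch
below.  The function name `decode_remap` is hypothetical, and raising an
exception on reserved values is purely an assumption for the sake of the
example: the text only states that such values are reserved.

```python
RESERVED_BITS = (1 << 7) | (1 << 15) | (1 << 23) | (0x3 << 30)


def decode_remap(remap):
    """Return [(regidx0, shape0), (regidx1, shape1), (regidx2, shape2)]."""
    if remap & RESERVED_BITS:
        raise ValueError("reserved REMAP bits must be zero")
    entries = []
    for n in range(3):
        regidx = (remap >> (8 * n)) & 0x7F       # bits 6..0, 14..8, 22..16
        shape = (remap >> (24 + 2 * n)) & 0x3    # bits 25..24, 27..26, 29..28
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")
        entries.append((regidx, shape))
    return entries


if __name__ == "__main__":
    value = (2 << 28) | (1 << 26) | (0 << 24) | (40 << 16) | (33 << 8) | 5
    print(decode_remap(value))
```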
 441
 442 ## SHAPE 1D/2D/3D vector-matrix remapping CSRs
 443
 444 (Note: both the REMAP and SHAPE sections are best read after the
 445  rest of the document has been read)
 446
There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide,
which all have the same format.  When a SHAPE CSR is set entirely to zeros,
remapping is disabled: the register's elements are a linear (1D) vector.
 450
 451 | 26..24  | 23 | 22..16  | 15 | 14..8   | 7  | 6..0    |
 452 | ------- | -- | ------- | -- | ------- | -- | ------- |
 453 | permute | 0  | zdimsz  | 0  | ydimsz  | 0  | xdimsz  |
 454
 455 xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
 456 that the array dimensionality for that dimension is 1.  A value of xdimsz=2
 457 would indicate that in the first dimension there are 3 elements in the
 458 array.  The format of the array is therefore as follows:
 459
 460     array[xdim+1][ydim+1][zdim+1]
 461
 462 However whilst illustrative of the dimensionality, that does not take the
 463 "permute" setting into account.  "permute" may be any one of six values
 464 (0-5, with values of 6 and 7 being reserved, and not legal).  The table
 465 below shows how the permutation dimensionality order works:
 466
 467 | permute | order | array format             |
 468 | ------- | ----- | ------------------------ |
 469 | 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
 470 | 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
 471 | 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
 472 | 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
 473 | 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
 474 | 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
 475
 476 In other words, the "permute" option changes the order in which
 477 nested for-loops over the array would be done.  The algorithm below
 478 shows this more clearly, and may be executed as a python program:
 479
    # mapidx = REMAP.shape2
    xdim = 3 # SHAPE[mapidx].xdim_sz+1
    ydim = 4 # SHAPE[mapidx].ydim_sz+1
    zdim = 5 # SHAPE[mapidx].zdim_sz+1

    lims = [xdim, ydim, zdim]
    idxs = [0,0,0] # starting indices
    order = [1,0,2] # experiment with different permutations, here

    for idx in range(xdim * ydim * zdim):
        new_idx = idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        print(new_idx, end=" ")  # Python 3 print; stay on the same row
        for i in range(3):
            idxs[order[i]] = idxs[order[i]] + 1
            if (idxs[order[i]] != lims[order[i]]):
                break
            print()  # this dimension wrapped: start a new output row
            idxs[order[i]] = 0
 498
Here, it is assumed that this algorithm is run within all pseudo-code
throughout this document where a (parallelism) for-loop would normally
run from 0 to VL-1 to refer to contiguous register
elements; instead, where REMAP indicates to do so, the element index
is run through the above algorithm to work out the **actual** element
index.  Given that there are three possible SHAPE entries, up to
three separate registers in any given operation may be simultaneously
remapped:
 507
 508     function op_add(rd, rs1, rs2) # add not VADD!
 509       ...
 510       ...
 511       for (i = 0; i < VL; i++)
 512         if (predval & 1<<i) # predication uses intregs
 513            ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
 514                                  ireg[rs2+remap(irs2)];
 515         if (int_vec[rd ].isvector)  { id += 1; }
 516         if (int_vec[rs1].isvector)  { irs1 += 1; }
 517         if (int_vec[rs2].isvector)  { irs2 += 1; }
 518
 519 By changing remappings, 2D matrices may be transposed "in-place" for one
 520 operation, followed by setting a different permutation order without
 521 having to move the values in the registers to or from memory.  Also,
 522 the reason for having REMAP separate from the three SHAPE CSRs is so
 523 that in a chain of matrix multiplications and additions, for example,
 524 the SHAPE CSRs need only be set up once; only the REMAP CSR need be
 525 changed to target different registers.
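
To make the in-place transposition concrete, the index algorithm can be
wrapped in a generator, as in the sketch below.  The names `remap_indices`
and `PERMUTE_ORDER` are hypothetical; the permutation orders are copied from
the "permute" table, and `dims` holds the actual dimension sizes (that is,
the SHAPE field values plus one).

```python
# permute field value -> dimension order, as listed in the table above
PERMUTE_ORDER = {
    0b000: (0, 1, 2), 0b001: (0, 2, 1), 0b010: (1, 0, 2),
    0b011: (1, 2, 0), 0b100: (2, 0, 1), 0b101: (2, 1, 0),
}


def remap_indices(dims, permute):
    """Yield the remapped element index for each step of the element loop."""
    xdim, ydim, _zdim = dims
    order = PERMUTE_ORDER[permute]
    idxs = [0, 0, 0]
    for _ in range(dims[0] * dims[1] * dims[2]):
        yield idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        for axis in order:            # increment fastest-moving dimension first
            idxs[axis] += 1
            if idxs[axis] != dims[axis]:
                break
            idxs[axis] = 0            # wrapped: carry into the next dimension


if __name__ == "__main__":
    print(list(remap_indices((3, 2, 1), 0b000)))  # linear order
    print(list(remap_indices((3, 2, 1), 0b010)))  # y moves fastest
```

With a 3x2 array, permute 000 visits the elements in their stored order, while
permute 010 walks down each column first, which is the access pattern of the
transposed matrix.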
 526
 527 Note that:
 528
 529 * If permute option 000 is utilised, the actual order of the
 530   reindexing does not change!
 531 * If two or more dimensions are set to zero, the actual order does not change!
 532 * The above algorithm is pseudo-code **only**.  Actual implementations
 533   will need to take into account the fact that the element for-looping
 534   must be **re-entrant**, due to the possibility of exceptions occurring.
 535   See MSTATE CSR, which records the current element index.
 536 * Twin-predicated operations require **two** separate and distinct
 537   element offsets.  The above pseudo-code algorithm will be applied
 538   separately and independently to each, should each of the two
 539   operands be remapped.  *This even includes C.LDSP* and other operations
 540   in that category, where in that case it will be the **offset** that is
 541   remapped (see Compressed Stack LOAD/STORE section).
 542 * Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to
 543   less than MVL is **perfectly legal**, albeit very obscure.  It permits
 544   entries to be regularly presented to operands **more than once**, thus
 545   allowing the same underlying registers to act as an accumulator of
 546   multiple vector or matrix operations, for example.
 547
 548 Clearly here some considerable care needs to be taken as the remapping
 549 could hypothetically create arithmetic operations that target the
 550 exact same underlying registers, resulting in data corruption due to
 551 pipeline overlaps.  Out-of-order / Superscalar micro-architectures with
 552 register-renaming will have an easier time dealing with this than
 553 DSP-style SIMD micro-architectures.
 554
 555 # Instruction Execution Order
 556
 557 Simple-V behaves as if it is a hardware-level "macro expansion system",
 558 substituting and expanding a single instruction into multiple sequential
 559 instructions with contiguous and sequentially-incrementing registers.
 560 As such, it does **not** modify - or specify - the behaviour and semantics of
 561 the execution order: that may be deduced from the **existing** RV
 562 specification in each and every case.
 563
So for example if a particular micro-architecture permits out-of-order
execution, and it is augmented with Simple-V, then wherever scalar
instructions may be executed out-of-order, so may the "post-expansion" SV ones.
 567
 568 If on the other hand there are memory guarantees which specifically
 569 prevent and prohibit certain instructions from being re-ordered
 570 (such as the Atomicity Axiom, or FENCE constraints), then clearly
 571 those constraints **MUST** also be obeyed "post-expansion".
 572
It should be absolutely clear that SV is **not** about providing new
functionality or changing the existing behaviour of a micro-architectural
design, or about changing the RISC-V Specification.
It is **purely** about compacting what would otherwise be contiguous
instructions that use sequentially-increasing register numbers down
to **one** instruction.
 579
 580 # Instructions
 581
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field).  Despite the removal of all operations,
and with the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*.  Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
 592
 593 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
 594 equivalents, so are left out of Simple-V.  VSELECT could be included if
 595 there existed a MV.X instruction in RV (MV.X is a hypothetical
 596 non-immediate variant of MV that would allow another register to
 597 specify which register was to be copied).  Note that if any of these three
 598 instructions are added to any given RV extension, their functionality
 599 will be inherently parallelised.
 600
 601 With some exceptions, where it does not make sense or is simply too
 602 challenging, all RV-Base instructions are parallelised:
 603
 604 * CSR instructions, whilst a case could be made for fast-polling of
 605   a CSR into multiple registers, would require guarantees of strict
 606   sequential ordering that SV does not provide.  Therefore, CSRs are
 607   not really suitable and are left out.
 608 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
 609   left as scalar.
 610 * LR/SC could hypothetically be parallelised however their purpose is
 611   single (complex) atomic memory operations where the LR must be followed
 612   up by a matching SC.  A sequence of parallel LR instructions followed
 613   by a sequence of parallel SC instructions therefore is guaranteed to
 614   not be useful. Not least: the guarantees of LR/SC
 615   would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers so are not inherently
  parallelisable anyway.
 618
 619 All other operations using registers are automatically parallelised.
 620 This includes AMOMAX, AMOSWAP and so on, where particular care and
 621 attention must be paid.
 622
 623 Example pseudo-code for an integer ADD operation (including scalar operations).
 624 Floating-point uses fp csrs.
 625
 626     function op_add(rd, rs1, rs2) # add not VADD!
 627       int i, id=0, irs1=0, irs2=0;
 628       predval = get_pred_val(FALSE, rd);
 629       rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
 630       rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
 631       rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
 632       for (i = 0; i < VL; i++)
 633         if (predval & 1<<i) # predication uses intregs
 634            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
 635         if (int_vec[rd ].isvector)  { id += 1; }
 636         if (int_vec[rs1].isvector)  { irs1 += 1; }
 637         if (int_vec[rs2].isvector)  { irs2 += 1; }
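
For experimentation, the ADD pseudo-code can be approximated by the runnable
Python sketch below.  The `VecEntry` class and the argument list are
hypothetical.  Two behaviours are assumptions of this sketch rather than
statements from the pseudo-code: the vector flags are read before the
register numbers are redirected, and the loop stops after the first write
when the destination is scalar.

```python
from dataclasses import dataclass


@dataclass
class VecEntry:                  # hypothetical decoded Register CSR state
    isvector: bool = False
    regidx: int = 0


def op_add(ireg, int_vec, rd, rs1, rs2, vl, predval=-1):
    """Element-wise ADD over vl elements; predval=-1 means all elements active."""
    # Read the vector flags first, then redirect vector registers via regidx.
    d, s1, s2 = [int_vec[r].isvector for r in (rd, rs1, rs2)]
    rd, rs1, rs2 = [int_vec[r].regidx if int_vec[r].isvector else r
                    for r in (rd, rs1, rs2)]
    i_d = i_s1 = i_s2 = 0
    for i in range(vl):
        if predval & (1 << i):               # predication uses integer registers
            ireg[rd + i_d] = ireg[rs1 + i_s1] + ireg[rs2 + i_s2]
            if not d:
                break                        # assumption: scalar rd is written once
        if d:
            i_d += 1
        if s1:
            i_s1 += 1
        if s2:
            i_s2 += 1


if __name__ == "__main__":
    ireg = [0] * 64
    int_vec = [VecEntry() for _ in range(32)]
    int_vec[8] = VecEntry(isvector=True, regidx=32)   # x8 -> vector at r32
    int_vec[9] = VecEntry(isvector=True, regidx=40)   # x9 -> vector at r40
    ireg[40:44] = [1, 2, 3, 4]
    ireg[5] = 10                                      # x5 stays scalar
    op_add(ireg, int_vec, 8, 9, 5, vl=4)
    print(ireg[32:36])
```

The example performs a vector-scalar add: x9 is read as four consecutive
elements while x5 supplies the same scalar value to every element.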
 638
 639 ## Instruction Format
 640
There are **no operations added to SV, at all**.
Instead SV *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV, FCVT, and
LOAD/STORE depending on CSR configurations for bitwidth and
predication.  **Everything** becomes parallelised.  *This includes
Compressed instructions* as well as any future instructions and
Custom Extensions.
 649
 650 ## Branch Instructions
 651
 652 ### Standard Branch <a name="standard_branch"></a>
 653
Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).

Note that the predication register to use (if one is enabled) is taken from
the *first* src register.  The target (destination) predication register
to use (if one is enabled) is taken from the *second* src register.
 661
If either src1 or src2 is a scalar (whether by there being no
CSR register entry or whether by the CSR entry specifically marking
the register as "scalar") the comparison goes ahead as vector-scalar
or scalar-vector.

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
 672
 673 Note that just as with the standard (scalar, non-predicated) branch
 674 operations, BLE, BGT, BLEU and BTGU may be synthesised by inverting
 675 src1 and src2.
 676
 677 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
 678 for predicated compare operations of function "cmp":
 679
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
 684
 685 With associated predication, vector-length adjustments and so on,
 686 and temporarily ignoring bitwidth (which makes the comparisons more
 687 complex), this becomes:
 688
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd)
        temporary_result = 0
    else
        preg[rd] = 0; # initialise to zero

    for (int i = 0; i < VL; ++i)
        if (ps & (1<<i)) && cmp(s1 ? reg[src1+i] : reg[src1],
                                s2 ? reg[src2+i] : reg[src2])
            if not exists(rd)
                temporary_result |= 1<<i;
            else
                preg[rd] |= 1<<i;  # bitfield not vector

    if not exists(rd)
        if temporary_result == ps
            goto branch
    else
        if preg[rd] == ps
            goto branch
 722
 723 Notes:
 724
 725 * zeroing has been temporarily left out of the above pseudo-code,
 726   for clarity
 727 * Predicated SIMD comparisons would break src1 and src2 further down
 728   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 729   Reordering") setting Vector-Length times (number of SIMD elements) bits
 730   in Predicate Register rd, as opposed to just Vector-Length bits.
 731
 732 TODO: predication now taken from src2.  also branch goes ahead
 733 if all compares are successful.
 734
 735 Note also that where normally, predication requires that there must
 736 also be a CSR register entry for the register being used in order
 737 for the **predication** CSR register entry to also be active,
 738 for branches this is **not** the case.  src2 does **not** have
 739 to have its CSR register entry marked as active in order for
 740 predication on src2 to be active.
 741
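The vectorised branch pseudo-code above can be exercised with a small Python sketch.  It makes several simplifying assumptions that are not taken from the specification: the source predicate `ps` is passed in as a plain bitmask, the result bitfield is returned rather than written to a predicate register, and `ps` is masked down to VL bits before the final "all tests succeeded" comparison.

```python
import operator


def vector_branch(reg, cmp, src1, src2, s1, s2, ps, vl):
    """Return (branch_taken, result_bitfield).

    s1 / s2 say whether src1 / src2 are vectorised; `cmp` is the scalar
    comparison (e.g. operator.lt for BLT).
    """
    if not s1 and not s2:
        # neither operand is a vector: ordinary scalar branch
        return cmp(reg[src1], reg[src2]), 0
    active = ps & ((1 << vl) - 1)        # assumption: ignore bits >= VL
    result = 0
    for i in range(vl):
        if active & (1 << i):
            a = reg[src1 + i] if s1 else reg[src1]
            b = reg[src2 + i] if s2 else reg[src2]
            if cmp(a, b):
                result |= 1 << i         # bitfield, not a vector
    # branch only if every non-predicated-out test succeeded
    return result == active, result


reg = [0] * 32
reg[8:12] = [1, 2, 0, 4]
# "x8..x11 > x0" as a vector-scalar compare: element 2 fails
print(vector_branch(reg, operator.gt, 8, 0, True, False, 0b1111, 4))
# (False, 11)
# with element 2 predicated out, all remaining tests succeed
print(vector_branch(reg, operator.gt, 8, 0, True, False, 0b1011, 4))
# (True, 11)
```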
 742 ### Floating-point Comparisons
 743
There are no floating-point branch operations, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
Thus, no change is made to the floating-point comparison.

It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
and whilst in ordinary branch code this is fine because the standard
RVF compare can always be followed up with an integer BEQ or a BNE (or
a compressed comparison to zero or non-zero), in predication terms that
has more of an impact.  To deal with this, SV's predication has
had "invert" added to it.
 756
 757 ### Compressed Branch Instruction
 758
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries.  However, as there is only the one source register,
and given that c.beqz a10 is equivalent to beq a10,x0, the optional target
to store the results of the comparisons is taken from the CSR predication
table entries for **x0**.

The specific required use of x0 is, with a little thought, quite obvious,
but is counterintuitive.  Clearly it is **not** recommended to redirect
x0 with a CSR register entry; however, as a means to opaquely obtain
a predication target it is the only sensible option that does not involve
additional special CSRs (or, worse, additional special opcodes).
 771
 772 Note also that, just as with standard branches, the 2nd source
 773 (in this case x0 rather than src2) does **not** have to have its CSR
 774 register table marked as "active" in order for predication to work.
 775
 776 ## Vectorised Dual-operand instructions
 777
 778 There is a series of 2-operand instructions involving copying (and
 779 sometimes alteration):
 780
 781 * C.MV
 782 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
 783 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
 784 * LOAD(-FP) and STORE(-FP)
 785
 786 All of these operations follow the same two-operand pattern, so it is
 787 *both* the source *and* destination predication masks that are taken into
 788 account.  This is different from
 789 the three-operand arithmetic instructions, where the predication mask
 790 is taken from the *destination* register, and applied uniformly to the
 791 elements of the source register(s), element-for-element.
 792
 793 The pseudo-code pattern for twin-predicated operations is as
 794 follows:
 795
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++; # skip masked-out src
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++; # skip masked-out dest
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
 807
 808 This pattern covers scalar-scalar, scalar-vector, vector-scalar
 809 and vector-vector, and predicated variants of all of those.
 810 Zeroing is not presently included (TODO).  As such, when compared
 811 to RVV, the twin-predicated variants of C.MV and FMV cover
 812 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
 813 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
 814
 815 Note that:
 816
 817 * elwidth (SIMD) is not covered in the pseudo-code above
 818 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
 819   not covered
 820 * zero predication is also not shown (TODO).
 821
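A minimal Python sketch of the twin-predication pattern is given here for illustration.  It is not a reference implementation: CSR lookup is replaced by explicit `rd_isvec` / `rs_isvec` flags and bitmask arguments, the skip loops are bounded by VL, and the scalar-scalar case is ended after one operation.  These last two points are assumptions of the sketch, since the pseudo-code above leaves them open.  As in the pseudo-code, a vector-to-scalar copy is not ended early.

```python
def twin_pred_op(reg, rd, rs, rd_isvec, rs_isvec, pd, ps, vl,
                 fn=lambda x: x):
    """Apply `fn` from source elements to destination elements.

    ps / pd are the source / destination predicate bitmasks.
    """
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:                     # skip masked-out source elements
            while i < vl and not (ps >> i) & 1:
                i += 1
        if rd_isvec:                     # skip masked-out dest elements
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        reg[rd + j] = fn(reg[rs + i])
        if not (rs_isvec or rd_isvec):
            break                        # scalar-scalar: one operation only
        i += rs_isvec
        j += rd_isvec
    return reg


# Gather: source predicate picks x9 and x11, packed into x16, x17
r = twin_pred_op(list(range(32)), 16, 8, True, True, 0b1111, 0b1010, 4)
print(r[16:20])                          # [9, 11, 18, 19]

# Splat: scalar x5 copied into vector x20..x23
r = twin_pred_op(list(range(32)), 20, 5, True, False, 0b1111, 0b1111, 4)
print(r[20:24])                          # [5, 5, 5, 5]

# Scatter: x8, x9 spread out to x24 and x26
r = twin_pred_op(list(range(32)), 24, 8, True, True, 0b0101, 0b1111, 4)
print(r[24:28])                          # [8, 25, 9, 27]
```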
 822 ### C.MV Instruction <a name="c_mv"></a>
 823
There is no MV instruction in RV; however, there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation.  The actual instruction's format does not change:
 831
[[!table  data="""
15  12 | 11   7 | 6  2 | 1  0 |
funct4 | rd     | rs   | op   |
4      | 5      | 5    | 2    |
C.MV   | dest   | src  | C2   |
"""]]
 838
 839 A simplified version of the pseudocode for this operation is as follows:
 840
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
 852
 853 There are several different instructions from RVV that are covered by
 854 this one opcode:
 855
 856 [[!table  data="""
 857 src    | dest    | predication   | op             |
 858 scalar | vector  | none          | VSPLAT         |
 859 scalar | vector  | destination   | sparse VSPLAT  |
 860 scalar | vector  | 1-bit dest    | VINSERT        |
 861 vector | scalar  | 1-bit? src    | VEXTRACT       |
 862 vector | vector  | none          | VCOPY          |
 863 vector | vector  | src           | Vector Gather  |
 864 vector | vector  | dest          | Vector Scatter |
 865 vector | vector  | src & dest    | Gather/Scatter |
 866 vector | vector  | src == dest   | sparse VCOPY   |
 867 """]]
 868
 869 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
 870 operations with inversion on the src and dest predication for one of the
 871 two C.MV operations.
 872
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
 878
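The VMERGE construction mentioned above (two back-to-back C.MV operations, one of them with inverted predication) might be sketched as follows.  This is illustrative only: `pred_mv` models the "src == dest" predicated vector-vector case from the table, where both masks are identical so elements line up one-for-one, and the function names are hypothetical.

```python
def pred_mv(reg, rd, rs, mask, vl):
    """Vector-vector move with the same predicate on src and dest."""
    for i in range(vl):
        if (mask >> i) & 1:
            reg[rd + i] = reg[rs + i]


def vmerge(reg, rd, ra, rb, mask, vl):
    """rd[i] = ra[i] where the mask bit is set, otherwise rb[i]."""
    all_ones = (1 << vl) - 1
    pred_mv(reg, rd, ra, mask, vl)               # first C.MV
    pred_mv(reg, rd, rb, ~mask & all_ones, vl)   # second C.MV, inverted
    return reg


regs = vmerge(list(range(32)), rd=24, ra=8, rb=16, mask=0b0101, vl=4)
print(regs[24:28])                               # [8, 17, 10, 19]
```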
 879 ### FMV, FNEG and FABS Instructions
 880
These are identical in form to C.MV, except covering floating-point
register copying.  The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions,
mapping onto FSGNJ, FSGNJN and FSGNJX respectively.)
 889
### FCVT Instructions
 891
 892 These are again identical in form to C.MV, except that they cover
 893 floating-point to integer and integer to floating-point.  When element
 894 width in each vector is set to default, the instructions behave exactly
 895 as they are defined for standard RV (scalar) operations, except vectorised
 896 in exactly the same fashion as outlined in C.MV.
 897
 898 However when the source or destination element width is not set to default,
 899 the opcode's explicit element widths are *over-ridden* to new definitions,
 900 and the opcode's element width is taken as indicative of the SIMD width
 901 (if applicable i.e. if packed SIMD is requested) instead.
 902
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 32-bit (single-precision) floating-point
number in rd.  If however the source rs1 is set to be a vector, where
elwidth is set to default/2 and "packed SIMD" is enabled, then the first
32 bits of rs1 are converted to a floating-point number to be stored in
rd's first element and the higher 32 bits are *also* converted to
floating-point and stored in the second.  The 32 bit size comes from the
fact that FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set
to divide that by two it means that rs1 element width is to be taken as 32.
 912
 913 Similar rules apply to the destination register.
 914
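The FCVT.S.L example above can be illustrated with a short Python sketch.  It assumes (this is not stated in the text) that the halved-width source elements are treated as signed, as FCVT.S.L treats its 64-bit source, and it models single-precision rounding by a round trip through `struct`.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fcvt_packed(rs1_value, elwidth=32, xlen=64):
    """Convert each packed signed integer element of rs1 to a float."""
    mask = (1 << elwidth) - 1
    sign = 1 << (elwidth - 1)
    out = []
    for k in range(xlen // elwidth):
        raw = (rs1_value >> (k * elwidth)) & mask
        if raw & sign:                   # assumption: elements are signed
            raw -= 1 << elwidth
        out.append(to_f32(float(raw)))
    return out


# low 32 bits hold -2, high 32 bits hold 7
print(fcvt_packed((7 << 32) | 0xFFFFFFFE))   # [-2.0, 7.0]
```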
 915 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
 916
 917 An earlier draft of SV modified the behaviour of LOAD/STORE.  This
 918 actually undermined the fundamental principle of SV, namely that there
 919 be no modifications to the scalar behaviour (except where absolutely
 920 necessary), in order to simplify an implementor's task if considering
 921 converting a pre-existing scalar design to support parallelism.
 922
So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
does not change in SV; however, just as with C.MV, it is important to note
that dual-predication is possible.  Using the template outlined in
the section "Vectorised Dual-operand instructions", the pseudo-code covering
scalar-scalar, scalar-vector, vector-scalar and vector-vector applies,
where SCALAR\_OPERATION is as follows, exactly as for a standard
scalar RV LOAD operation:

        srcbase = ireg[rs+i];
        return mem[srcbase + imm];
 933
 934 Whilst LOAD and STORE remain as-is when compared to their scalar
 935 counterparts, the incrementing on the source register (for LOAD)
 936 means that pointers-to-structures can be easily implemented, and
 937 if contiguous offsets are required, those pointers (the contents
 938 of the contiguous source registers) may simply be set up to point
 939 to contiguous locations.
 940
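To illustrate the pointers-to-structures usage described above, here is a minimal Python sketch of a vectorised LOAD.  Predication is omitted for brevity, memory is modelled as a dictionary, and the function and parameter names are hypothetical.

```python
def vec_load(ireg, mem, rd, rs, imm, rd_isvec, rs_isvec, vl):
    """LOAD where the *register number* increments, not the offset."""
    for k in range(vl):
        srcbase = ireg[rs + k] if rs_isvec else ireg[rs]
        ireg[rd + k if rd_isvec else rd] = mem[srcbase + imm]
        if not (rs_isvec or rd_isvec):
            break                        # plain scalar LOAD
    return ireg


ireg = [0] * 32
ireg[8:12] = [100, 200, 300, 400]        # x8..x11: four struct pointers
mem = {104: 11, 204: 22, 304: 33, 404: 44}
vec_load(ireg, mem, rd=16, rs=8, imm=4, rd_isvec=True, rs_isvec=True, vl=4)
print(ireg[16:20])                       # [11, 22, 33, 44]
```

Each element loads the field at offset 4 of a different structure; setting the pointers up 4 bytes apart would instead give a contiguous load.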
 941 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
 942
 943 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
 944 where it is implicit in C.LWSP/FLWSP that x2 is the source register.
 945 It is therefore possible to use predicated C.LWSP to efficiently
 946 pop registers off the stack (by predicating x2 as the source), cherry-picking
 947 which registers to store to (by predicating the destination).  Likewise
 948 for C.SWSP.  In this way, LOAD/STORE-Multiple is efficiently achieved.
 949
However, to do so, the behaviour of C.LWSP/C.SWSP needs to be slightly
different: where x2 is marked as vectorised, rather than incrementing
the register on each loop (x2, x3, x4...), it is the *immediate*
that must be incremented.  Pseudo-code follows:
 954
    function lwsp(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = x2 # effectively no redirection on x2.
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        reg[rd+j] = mem[x2 + ((offset+i) * 4)]
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
 966
 967 For C.LDSP, the offset (and loop) multiplier would be 8, and for
 968 C.LQSP it would be 16.  Effectively this makes C.LWSP etc. a Vector
 969 "Unit Stride" Load instruction.
 970
 971 **Note**: it is still possible to redirect x2 to an alternative target
 972 register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
 973 general-purpose Vector "Unit Stride" LOAD/STORE operations.
 974
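The "pop registers off the stack" usage of vectorised C.LWSP can be sketched in Python as below.  The sketch assumes that both x2 and rd are marked as vectors, passes the two predicates in as bitmasks, bounds the skip loops by VL, and models memory as a dictionary; none of these details are mandated by the text above.

```python
def c_lwsp(reg, mem, rd, offset, ps, pd, vl, width=4):
    """Unit-stride stack load: the immediate increments, not x2."""
    sp = reg[2]
    i = j = 0
    while True:
        while i < vl and not (ps >> i) & 1:   # skip masked-out stack slots
            i += 1
        while j < vl and not (pd >> j) & 1:   # skip masked-out dest regs
            j += 1
        if i >= vl or j >= vl:
            break
        reg[rd + j] = mem[sp + (offset + i) * width]
        i += 1
        j += 1
    return reg


reg = [0] * 32
reg[2] = 1000                                  # stack pointer
mem = {1000 + 4 * k: 10 + k for k in range(8)}
# consecutive stack words go to x8, x10, x11 (x9 is predicated out)
c_lwsp(reg, mem, rd=8, offset=0, ps=0b1111, pd=0b1101, vl=4)
print(reg[8:12])                               # [10, 0, 11, 12]
```

Passing `width=8` would correspond to the C.LDSP multiplier mentioned above.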
 975 ## Compressed LOAD / STORE Instructions
 976
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE.  This is **different** from Compressed Stack
LOAD/STORE (C.LWSP / C.SWSP), which have been augmented to become
Vector "Unit Stride" capable.

Just as with uncompressed LOAD/STORE, C.LD / C.SD increment the *register*
during the hardware loop, **not** the offset.
 985
 986 # Exceptions
 987
 988 TODO: expand.  Exceptions may occur at any time, in any given underlying
 989 scalar operation.  This implies that context-switching (traps) may
 990 occur, and operation must be returned to where it left off.  That in
 991 turn implies that the full state - including the current parallel
 992 element being processed - has to be saved and restored.  This is
 993 what the **STATE** CSR is for.
 994
 995 The implications are that all underlying individual scalar operations
 996 "issued" by the parallelisation have to appear to be executed sequentially.
 997 The further implications are that if two or more individual element
 998 operations are underway, and one with an earlier index causes an exception,
 999 it may be necessary for the microarchitecture to **discard** or terminate
1000 operations with higher indices.
1001
This being somewhat unsatisfactory, an "opaque predication" variant
of the STATE CSR is being considered.
1004
1005 > And what about instructions like JALR?
1006
1007 answer: they're not vectorised, so not a problem
1008
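The requirement that a trapped element loop be resumable from where it left off can be illustrated with a small Python sketch.  The `state["element"]` field is a hypothetical stand-in for the element index held in the STATE CSR (the actual CSR layout is not described in this section), and `ElementTrap` simply models an exception raised by one underlying scalar operation.

```python
class ElementTrap(Exception):
    """Models an exception raised by one scalar element operation."""


def run_elements(op, vl, state):
    """Issue element operations in order, starting at the saved index."""
    while state["element"] < vl:
        op(state["element"])             # may raise ElementTrap
        state["element"] += 1            # only advance once it completed
    state["element"] = 0                 # instruction finished


completed = []
faulted = set()


def op(i):
    if i == 2 and i not in faulted:      # element 2 faults the first time
        faulted.add(i)
        raise ElementTrap(i)
    completed.append(i)


state = {"element": 0}
try:
    run_elements(op, 4, state)
except ElementTrap:
    saved = dict(state)                  # trap handler saves the state
run_elements(op, 4, state)               # after return, resume at element 2
print(saved, completed)                  # {'element': 2} [0, 1, 2, 3]
```

Because the index only advances after an element completes, elements with higher indices never appear to have executed ahead of the faulting one.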
1009 # Hints
1010
1011 A "HINT" is an operation that has no effect on architectural state,
1012 where its use may, by agreed convention, give advance notification
1013 to the microarchitecture: branch prediction notification would be
1014 a good example.  Usually HINTs are where rd=x0.
1015
1016 With Simple-V being capable of issuing *parallel* instructions where
1017 rd=x0, the space for possible HINTs is expanded considerably.  VL
1018 could be used to indicate different hints.  In addition, if predication
1019 is set, the predication register itself could hypothetically be passed
1020 in as a *parameter* to the HINT operation.
1021
No specific hints are yet defined in Simple-V.
1023
1024 # Subsets of RV functionality
1025
1026 This section describes the differences when SV is implemented on top of
1027 different subsets of RV.
1028
1029 ## Common options
1030
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture.  However, reducing
them below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1035
1036 ## RV32 / RV32F
1037
1038 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1039 maximum limit for predication is also restricted to 32 bits.  Whilst not
1040 actually specifically an "option" it is worth noting.
1041
1042 ## RV32G
1043
1044 Normally in standard RV32 it does not make much sense to have
1045 RV32G, however it is automatically implied to exist in RV32+SV due to
1046 the option for the element width to be doubled.  This may be sufficient
1047 for implementors, such that actually needing RV32G itself (which makes
1048 no sense given that the RV32 integer register file is 32-bit) may be
1049 redundant.
1050
1051 It is a strange combination that may make sense on closer inspection,
1052 particularly given that under the standard RV32 system many of the opcodes
1053 to convert and sign-extend 64-bit integers to 64-bit floating-point will
1054 be missing, as they are assumed to only be present in an RV64 context.
1055
1056 ## RV32
1057
1058 When floating-point is not implemented, the size of the User Register and
1059 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1060 per table).
1061
1062 ## RV32E
1063
1064 In embedded scenarios the User Register and Predication CSRs may be
1065 dropped entirely, or optionally limited to 1 CSR, such that the combined
1066 number of entries from the M-Mode CSR Register table plus U-Mode
1067 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1068 zero) only 2 16-bit entries (M-Mode CSR table only).  Likewise for
1069 the Predication CSR tables.
1070
1071 ## RV128
1072
1073 RV128 has not been especially considered, here, however it has some
1074 extremely large possibilities: double the element width implies
1075 256-bit operands, spanning 2 128-bit registers each, and predication
1076 of total length 128 bit given that XLEN is now 128.
1077
1078 # Under consideration <a name="issues"></a>
1079
For element-grouping, if there is unused space within a register
(3 16-bit elements in a 64-bit register for example), the recommendation is:
1082
1083 * For the unused elements in an integer register, the used element
1084   closest to the MSB is sign-extended on write and the unused elements
1085   are ignored on read.
1086 * The unused elements in a floating-point register are treated as-if
1087   they are set to all ones on write and are ignored on read, matching the
1088   existing standard for storing smaller FP values in larger registers.
1089
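The two recommendations above can be sketched in Python as follows, using the example of three 16-bit elements in a 64-bit register.  The function names are hypothetical, and both functions assume a non-empty element list that fits within the register.

```python
def pack_int_elements(elems, elwidth=16, xlen=64):
    """Pack integer elements; sign-extend the topmost used element."""
    mask = (1 << elwidth) - 1
    value = 0
    for k, e in enumerate(elems):
        value |= (e & mask) << (k * elwidth)
    used = len(elems) * elwidth
    top = elems[-1] & mask               # used element closest to the MSB
    if top >> (elwidth - 1):             # negative: fill unused bits with 1s
        value |= ((1 << (xlen - used)) - 1) << used
    return value


def pack_fp_elements(elems, elwidth=16, xlen=64):
    """Pack FP bit patterns; unused upper bits are written as all ones."""
    mask = (1 << elwidth) - 1
    value = 0
    for k, e in enumerate(elems):
        value |= (e & mask) << (k * elwidth)
    used = len(elems) * elwidth
    return value | (((1 << (xlen - used)) - 1) << used)


print(hex(pack_int_elements([1, 2, 0x8000])))          # 0xffff800000020001
print(hex(pack_int_elements([1, 2, 3])))               # 0x300020001
print(hex(pack_fp_elements([0x3C00, 0x4000, 0x4200]))) # 0xffff420040003c00
```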