1 # Simple-V (Parallelism Extension Proposal) Specification
2
3 * Status: DRAFTv0.2
4 * Last edited: 17 Oct 2018
5 * Ancillary resource: [[opcodes]]
6
7 With thanks to:
8
9 * Allen Baum
10 * Jacob Bachmeyer
11 * Guy Lemurieux
12 * Jacob Lifshay
13 * The RISC-V Founders, without whom this all would not be possible.
14
15 [[!toc ]]
16
17 # Summary and Background: Rationale
18
19 Simple-V is a uniform parallelism API for RISC-V hardware that has several
20 unplanned side-effects including code-size reduction, expansion of
21 HINT space and more. The reason for
22 creating it is to provide a manageable way to turn a pre-existing design
23 into a parallel one, in a step-by-step incremental fashion, allowing
24 the implementor to focus on adding hardware where it is needed and necessary.
25 The primary target is mobile-class 3D GPUs and VPUs, with the secondary
26 goals of reducing executable size and context-switch latency.
27
28 Critically: **No new instructions are added**. The parallelism (if any
29 is implemented) is implicitly added by tagging *standard* scalar registers
30 for redirection. When such a tagged register is used in any instruction,
31 it indicates that the PC shall **not** be incremented; instead a loop
32 is activated where *multiple* instructions are issued to the pipeline
33 (as determined by a length CSR), with contiguously incrementing register
34 numbers starting from the tagged register. When the last "element"
35 has been reached, only then is the PC permitted to move on. Thus
36 Simple-V effectively sits (slots) *in between* the instruction decode phase
37 and the ALU(s).
38
39 The barrier to entry with SV is therefore very low. The minimum
40 compliant implementation is software-emulation (traps), requiring
41 only the CSRs and CSR tables, and that an exception be thrown if an
42 instruction's registers are detected to have been tagged. The looping
43 that would otherwise be done in hardware is thus carried out in software,
44 instead. Whilst much slower, it is "compliant" with the SV specification,
45 and may be suited for implementation in RV32E, and also in situations
46 where the implementor wishes to focus on certain aspects of SV without
47 investing unnecessary time and resources in silicon, whilst still conforming
48 strictly with the API. A good candidate to punt to software would be the
49 polymorphic element width capability, for example.
50
51 Hardware Parallelism, if any, is therefore added at the implementor's
52 discretion to turn what would otherwise be a sequential loop into a
53 parallel one.
54
55 To emphasise that clearly: Simple-V (SV) is *not*:
56
57 * A SIMD system
58 * A SIMT system
59 * A Vectorisation Microarchitecture
60 * A microarchitecture of any specific kind
61 * A mandatory parallel processor microarchitecture of any kind
62 * A supercomputer extension
63
64 SV does **not** tell implementors how or even if they should implement
65 parallelism: it is a hardware "API" (Application Programming Interface)
66 that, if implemented, presents a uniform and consistent way to *express*
67 parallelism, at the same time leaving the choice of if, how, how much,
68 when and whether to parallelise operations **entirely to the implementor**.
69
70 # Basic Operation
71
72 The principle of SV is as follows:
73
74 * CSRs indicating which registers are "tagged" as "vectorised"
75 (potentially parallel, depending on the microarchitecture)
76 must be set up
77 * A "Vector Length" CSR is set, indicating the span of any future
78 "parallel" operations.
79 * A **scalar** operation, just after the decode phase and before the
80 execution phase, checks the CSR register tables to see if any of
81 its registers have been marked as "vectorised"
82 * If so, a hardware "macro-unrolling loop" is activated, of length
83 VL, that effectively issues **multiple** identical instructions
84 using contiguous sequentially-incrementing registers.
85 **Whether they be executed sequentially or in parallel or a
86 mixture of both is entirely up to the implementor**.
87
88 In this way an entire scalar algorithm may be vectorised with
89 the minimum of modification to the hardware and to compiler toolchains.
90 There are **no** new opcodes.
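
The principle above may be sketched in a few lines of Python. This is a hedged illustration only: register tagging, VL handling and the actual issue mechanism are microarchitecture-specific, and the names `regs`, `isvector` and `issue_add` are assumptions for the sketch, not part of the specification:

```python
# Minimal sketch of SV "macro-unrolling": a standard scalar ADD whose
# registers may be tagged as vectors expands into VL element operations.
def issue_add(regs, isvector, vl, rd, rs1, rs2):
    for i in range(vl):
        # each tagged (vectorised) register advances per element;
        # scalar registers stay fixed for every element operation
        d  = rd  + i if isvector[rd]  else rd
        s1 = rs1 + i if isvector[rs1] else rs1
        s2 = rs2 + i if isvector[rs2] else rs2
        regs[d] = regs[s1] + regs[s2]
        if not isvector[rd]:
            break  # scalar destination: the loop ends immediately
```

If no register of the instruction is tagged, the loop body runs exactly once and the behaviour is indistinguishable from a plain scalar add.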
91
92 # CSRs <a name="csrs"></a>
93
94 For U-Mode there are two CSR key-value stores needed to create lookup
95 tables which are used at the register decode phase.
96
97 * A register CSR key-value table (typically 8 32-bit CSRs of two 16-bit entries each)
98 * A predication CSR key-value table (again, 8 32-bit CSRs of two 16-bit entries each)
99 * Small M-Mode and S-Mode register and predication CSR key-value tables
100 (2 32-bit CSRs of two 16-bit entries each).
101 * An optional "reshaping" CSR key-value table which remaps from a 1D
102 linear shape to 2D or 3D, including full transposition.
103
104 There are also four additional CSRs for User-Mode:
105
106 * CFG subsets the CSR tables
107 * MVL (the Maximum Vector Length)
108 * VL (which has different characteristics from standard CSRs)
109 * STATE (useful for saving and restoring during context switch,
110 and for providing fast transitions)
111
112 There are also three additional CSRs for Supervisor-Mode:
113
114 * SMVL
115 * SVL
116 * SSTATE
117
118 And likewise for M-Mode:
119
120 * MMVL
121 * MVL
122 * MSTATE
123
124 Both Supervisor and M-Mode have their own (small) CSR register and
125 predication tables of only 4 entries each.
126
127 ## CFG
128
129 This CSR may be used to switch between subsets of the CSR Register and
130 Predication Tables: it is kept to 5 bits so that a single CSRRWI instruction
131 can be used. A setting of all ones is reserved to indicate that SimpleV
132 is disabled.
133
134 | (4..3) | (2...0) |
135 | ------ | ------- |
136 | size | bank |
137
138 Bank is 3 bits in size, and indicates the starting index of the CSR
139 Register and Predication Table entries that are "enabled". Given that
140 each CSR table row is 16 bits and contains 2 CAM entries each, there
141 are only 8 CSRs to cover in each table, so 3 bits is sufficient.
142
143 Size is 2 bits. With the exception of when bank == 7 and size == 3,
144 the number of elements enabled is taken by left-shifting 2 by size:
145
146 | size | elements |
147 | ------ | -------- |
148 | 0 | 2 |
149 | 1 | 4 |
150 | 2 | 8 |
151 | 3 | 16 |
152
153 Given that there are 2 16-bit CAM entries per CSR table row, this
154 may also be viewed as the number of CSR rows to enable, given by
155 raising 2 to the power of size.
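
The decoding described above may be expressed as a short sketch (hypothetical helper name; the all-ones "disabled" rule is taken from the paragraph at the start of this section):

```python
# Decode a 5-bit CFG value into (bank, size, enabled elements, enabled rows).
def decode_cfg(cfg):
    bank = cfg & 0b111          # bits 2..0: starting CSR row index
    size = (cfg >> 3) & 0b11    # bits 4..3
    if bank == 7 and size == 3:
        return None             # all-ones setting: SimpleV disabled
    elements = 2 << size        # 2, 4, 8 or 16 CAM elements
    rows = 1 << size            # 2 entries per 16-bit row: 2**size rows
    return bank, size, elements, rows
```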
156
157 Examples:
158
159 * When bank = 0 and size = 3, SVREGCFG0 through to SVREGCFG7 are
160 enabled, and SVPREDCFG0 through to SVPREDCFG7 are enabled.
161 * When bank = 1 and size = 3, SVREGCFG1 through to SVREGCFG7 are
162 enabled, and SVPREDCFG1 through to SVPREDCFG7 are enabled.
163 * When bank = 3 and size = 0, SVREGCFG3 and SVPREDCFG3 are enabled.
164 * When bank = 3 and size = 1, SVREGCFG3-4 and SVPREDCFG3-4 are enabled.
165 * When bank = 7 and size = 1, SVREGCFG7 and SVPREDCFG7 are enabled
166 (because there are only 8 32-bit CSRs there does not exist a
167 SVREGCFG8 or SVPREDCFG8 to enable).
168 * When bank = 7 and size = 3, SimpleV is entirely disabled.
169
170 In this way it is possible to enable and disable SimpleV with a
171 single instruction, and, furthermore, on context-switching the quantity
172 of CSRs to be saved and restored is greatly reduced.
173
174 ## MAXVECTORLENGTH (MVL)
175
176 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
177 is variable length and may be dynamically set. MVL is
178 however limited to the regfile bitwidth XLEN (1-32 for RV32,
179 1-64 for RV64 and so on).
180
181 The reason for setting this limit is so that predication registers, when
182 marked as such, may fit into a single register as opposed to fanning out
183 over several registers. This keeps the implementation a little simpler.
184
185 The other important factor to note is that the actual MVL is **offset
186 by one**, so that it can fit into only 6 bits (for RV64) and still cover
187 a range up to XLEN bits. So, when setting the MVL CSR to 0, this actually
188 means that MVL==1. When setting the MVL CSR to 3, this actually means
189 that MVL==4, and so on. This is expressed more clearly in the "pseudocode"
190 section, where there are subtle differences between CSRRW and CSRRWI.
191
192 ## Vector Length (VL)
193
194 VSETVL is slightly different from RVV. Like RVV, VL is set to be within
195 the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN)
196
197 VL = rd = MIN(vlen, MVL)
198
199 where 1 <= MVL <= XLEN
200
201 However, just like MVL, it is important to note that the range for VL has
202 subtle design implications, covered in the "CSR pseudocode" section.
203
204 The fixed (specific) setting of VL allows vector LOAD/STORE to be used
205 to switch the entire bank of registers using a single instruction (see
206 Appendix, "Context Switch Example"). The reason for limiting VL to XLEN
207 is down to the fact that predication bits fit into a single register of
208 length XLEN bits.
209
210 The second change is that when VSETVL is requested to be stored
211 into x0, it is *ignored* silently (VSETVL x0, x5)
212
213 The third and most important change is that, within the limits set by
214 MVL, the value passed in **must** be set in VL (and in the
215 destination register).
216
217 This has implication for the microarchitecture, as VL is required to be
218 set (limits from MVL notwithstanding) to the actual value
219 requested. RVV has the option to set VL to an arbitrary value that suits
220 the conditions and the micro-architecture: SV does *not* permit this.
221
222 The reason is so that if SV is to be used for a context-switch or as a
223 substitute for LOAD/STORE-Multiple, the operation can be done with only
224 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
225 single LD/ST operation). If VL does *not* get set to the register file
226 length when VSETVL is called, then a software-loop would be needed.
227 To avoid this need, VL *must* be set to exactly what is requested
228 (limits notwithstanding).
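
The required behaviour may be condensed into a sketch (hypothetical helper name; contrast with RVV, where the implementation may legally return a shorter VL):

```python
# Sketch of SV's VSETVL semantics: VL is always set to exactly
# min(requested, MVL), so software can rely on the value it asked for.
def sv_vsetvl(requested, mvl, xlen=64):
    assert 1 <= mvl <= xlen      # MVL is bounded by the regfile width
    vl = min(requested, mvl)     # no implementation-chosen shorter VL
    return vl                    # also written to rd (the *new* value)
```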
229
230 Therefore, in turn, unlike RVV, implementors *must* provide
231 pseudo-parallelism (using sequential loops in hardware) if actual
232 hardware-parallelism in the ALUs is not deployed. A hybrid is also
233 permitted (as used in Broadcom's VideoCore-IV) however this must be
234 *entirely* transparent to the ISA.
235
236 The fourth change is that VSETVL is implemented as a CSR, where the
237 behaviour of CSRRW (and CSRRWI) must be changed to specifically store
238 the *new* value in the destination register, **not** the old value.
239 Where context-load/save is to be implemented in the usual fashion
240 by using a single CSRRW instruction to obtain the old value, the
241 *secondary* CSR must be used (SVSTATE). This CSR behaves
242 exactly as standard CSRs, and contains more than just VL.
243
244 One interesting side-effect of using CSRRWI to set VL is that this
245 may be done with a single instruction, useful particularly for a
246 context-load/save. There are however limitations: CSRRWI's immediate
247 is limited to 0-31.
248
249 ## STATE
250
251 This is a standard CSR that contains sufficient information for a
252 full context save/restore. It contains (and permits setting of)
253 MVL, VL, CFG, the destination element offset of the current parallel
254 instruction being executed, and, for twin-predication, the source
255 element offset as well. Interestingly it may hypothetically
256 also be used to make the immediately-following instruction to skip a
257 certain number of elements, however the recommended method to do
258 this is predication or using the offset mode of the REMAP CSRs.
259
260 Setting destoffs and srcoffs is realistically intended for saving state
261 so that exceptions (page faults in particular) may be serviced and the
262 hardware-loop that was being executed at the time of the trap, from
263 user-mode (or Supervisor-mode), may be returned to and continued from
264 where it left off. The reason why this works is that the User-Mode
265 STATE CSR is neither changed nor used in M-Mode or S-Mode
266 (which is entirely why M-Mode and S-Mode have their own STATE CSRs).
267
268 The format of the STATE CSR is as follows:
269
270 | (28..27) | (26..24) | (23..18) | (17..12) | (11..6) | (5...0) |
271 | -------- | -------- | -------- | -------- | ------- | ------- |
272 | size | bank | destoffs | srcoffs | vl | maxvl |
273
274 When setting this CSR, the following characteristics will be enforced:
275
276 * **MAXVL** will be truncated (after offset) to be within the range 1 to XLEN
277 * **VL** will be truncated (after offset) to be within the range 1 to MAXVL
278 * **srcoffs** will be truncated to be within the range 0 to VL-1
279 * **destoffs** will be truncated to be within the range 0 to VL-1
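
The offset-by-one storage of MVL and VL may be illustrated with a pack/unpack sketch (hypothetical helper names; CFG, i.e. size plus bank, is treated here as one 5-bit field at bits 28..24, matching the get_state_csr pseudocode later in this section):

```python
# Pack/unpack the STATE CSR fields. MVL and VL are stored minus one so
# that the ranges 1..64 fit into 6 bits each.
def pack_state(maxvl, vl, srcoffs, destoffs, cfg):
    return ((maxvl - 1) | (vl - 1) << 6 | srcoffs << 12
            | destoffs << 18 | cfg << 24)

def unpack_state(v):
    return ((v & 0x3f) + 1, (v >> 6 & 0x3f) + 1,
            v >> 12 & 0x3f, v >> 18 & 0x3f, v >> 24 & 0x1f)
```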
280
281 ## MVL, VL and CSR Pseudocode
282
283 The pseudo-code for get and set of VL and MVL are as follows:
284
285 set_mvl_csr(value, rd):
286     regs[rd] = MVL
287     MVL = MIN(value, XLEN)
288
289 get_mvl_csr(rd):
290     regs[rd] = MVL
291
292 set_vl_csr(value, rd):
293     VL = MIN(value, MVL)
294     regs[rd] = VL # yes, returning the new value, NOT the old CSR value
295
296 get_vl_csr(rd):
297     regs[rd] = VL
298
299 Note that whilst setting MVL behaves as a normal CSR, setting VL does
300 not: unlike standard CSR behaviour, it returns the **new** value of VL,
301 **not** the old one.
302
303 For CSRRWI, the range of the immediate is restricted to 5 bits. In order to
304 maximise the effectiveness, an immediate of 0 is used to set VL=1,
305 an immediate of 1 is used to set VL=2 and so on:
306
307 CSRRWI_Set_MVL(value):
308     set_mvl_csr(value+1, x0)
309
310 CSRRWI_Set_VL(value):
311     set_vl_csr(value+1, x0)
312
313 However for CSRRW the following pseudocode is used for MVL and VL,
314 where setting the value to zero will cause an exception to be raised.
315 The reason is that if VL or MVL are set to zero, the STATE CSR is
316 not capable of returning that value.
317
318 CSRRW_Set_MVL(rs1, rd):
319     value = regs[rs1]
320     if value == 0:
321         raise Exception
322     set_mvl_csr(value, rd)
323
324 CSRRW_Set_VL(rs1, rd):
325     value = regs[rs1]
326     if value == 0:
327         raise Exception
328     set_vl_csr(value, rd)
329
330 In this way, when CSRRW is utilised with a loop variable, the value
331 that goes into VL (and into the destination register) may be used
332 in an instruction-minimal fashion:
333
334     CSRvect1 = {type: F, key: a3, val: a3, elwidth: dflt}
335     CSRvect2 = {type: F, key: a7, val: a7, elwidth: dflt}
336     CSRRWI MVL, 3 # sets MVL == **4** (not 3)
337     j zerotest # in case loop counter a0 already 0
338 loop:
339     CSRRW VL, t0, a0 # vl = t0 = min(mvl, a0)
340     ld a3, a1 # load 4 registers a3-6 from x
341     slli t1, t0, 3 # t1 = vl * 8 (in bytes)
342     ld a7, a2 # load 4 registers a7-10 from y
343     add a1, a1, t1 # increment pointer to x by vl*8
344     fmadd a7, a3, fa0, a7 # v1 += v0 * fa0 (y = a * x + y)
345     sub a0, a0, t0 # n -= vl (t0)
346     st a7, a2 # store 4 registers a7-10 to y
347     add a2, a2, t1 # increment pointer to y by vl*8
348 zerotest:
349     bnez a0, loop # repeat if n != 0
350
351 With the STATE CSR, just like with CSRRWI, in order to maximise the
352 utilisation of the limited bitspace, "000000" in binary represents
353 VL==1, "000001" represents VL==2 and so on (likewise for MVL):
354
355 CSRRW_Set_SV_STATE(rs1, rd):
356     value = regs[rs1]
357     get_state_csr(rd)
358     set_mvl_csr(value[5:0]+1, x0)
359     set_vl_csr(value[11:6]+1, x0)
360     srcoffs = value[17:12]
361     destoffs = value[23:18]
362     CFG = value[28:24]
363
364 get_state_csr(rd):
365     regs[rd] = (MVL-1) | (VL-1)<<6 | (srcoffs)<<12 |
366                (destoffs)<<18 | (CFG)<<24
367     return regs[rd]
368
369 In both cases, whilst CSR read of VL and MVL return the exact values
370 of VL and MVL respectively, reading and writing the STATE CSR returns
371 those values **minus one**. This is absolutely critical to implement
372 if the STATE CSR is to be used for fast context-switching.
373
374 ## Register CSR key-value (CAM) table
375
376 The purpose of the Register CSR table is four-fold:
377
378 * To mark integer and floating-point registers as requiring "redirection"
379 if they are ever used as a source or destination in any given operation.
380 This involves a level of indirection through a 5-to-7-bit lookup table,
381 such that **unmodified** operands with 5 bits (3 for Compressed) may
382 access up to **64** registers.
383 * To indicate whether, after redirection through the lookup table, the
384 register is a vector (or remains a scalar).
385 * To over-ride the implicit or explicit bitwidth that the operation would
386 normally give the register.
387
388 | RgCSR | 15 | (14..8) | 7 | (6..5) | (4..0) |
389 | ----- | ------- | -------- | --- | ------ | ------- |
390 | 0 | isvec0 | regidx0 | i/f | vew0 | regkey |
391 | 1 | isvec1 | regidx1 | i/f | vew1 | regkey |
392 | .. | isvec.. | regidx.. | i/f | vew.. | regkey |
393 | 15 | isvec15 | regidx15 | i/f | vew15 | regkey |
394
395 i/f is set to "1" to indicate that the redirection/tag entry is to be applied
396 to integer registers; 0 indicates that it is relevant to floating-point
397 registers. vew has the following meanings, indicating that the instruction's
398 operand size is "over-ridden" in a polymorphic fashion:
399
400 | vew | bitwidth |
401 | --- | ---------- |
402 | 00 | default |
403 | 01 | default/2 |
404 | 10 | default\*2 |
405 | 11 | 8 |
406
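As a quick sanity check, the vew encoding may be written as a lookup (a sketch only; `default` stands for the operation's normal element width, and the helper name is hypothetical):

```python
# vew decoding per the table above: element width override in bits.
def elwidth_bits(vew, default=64):
    return {0b00: default,       # operation's default width
            0b01: default // 2,  # half width
            0b10: default * 2,   # double width
            0b11: 8}[vew]        # fixed 8-bit elements
```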
407 As the above table is a CAM (key-value store) it may be appropriate
408 (faster, implementation-wise) to expand it as follows:
409
410 struct vectorised fp_vec[32], int_vec[32];
411
412 for (i = 0; i < 16; i++) // 16 CSR table entries
413     tb = int_vec if CSRvec[i].type == 0 else fp_vec
414     idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
415     tb[idx].elwidth = CSRvec[i].elwidth
416     tb[idx].regidx = CSRvec[i].regidx // indirection
417     tb[idx].isvector = CSRvec[i].isvector // 0=scalar
418     tb[idx].packed = CSRvec[i].packed // SIMD or not
419
420 The actual size of the CSR Register table depends on the platform
421 and on whether other Extensions are present (RV64G, RV32E, etc.).
422 For details see "Subsets" section.
423
424 16-bit CSR Register CAM entries are mapped directly into 32-bit
425 on any RV32-based system, however RV64 (XLEN=64) and RV128 (XLEN=128)
426 are slightly different: the 16-bit entries appear (and can be set)
427 multiple times, in an overlapping fashion. Here is the table for RV64:
428
429 | CSR# | 63..48 | 47..32 | 31..16 | 15..0 |
| ----- | ------- | ------- | ------- | ------- |
430 | 0x4c0 | RgCSR3 | RgCSR2 | RgCSR1 | RgCSR0 |
431 | 0x4c1 | RgCSR5 | RgCSR4 | RgCSR3 | RgCSR2 |
432 | 0x4c2 | ... | ... | ... | ... |
433 | 0x4c6 | RgCSR15 | RgCSR14 | RgCSR13 | RgCSR12 |
434 | 0x4c7 | n/a | n/a | RgCSR15 | RgCSR14 |
435
436 The rules for writing to these CSRs are that any entries above the ones
437 being set will be automatically wiped (to zero), so to fill several entries
438 they must be written in a sequentially increasing manner. This functionality
439 was in an early draft of RVV and it means that, firstly, compilers do not have
440 to spend time zero-ing out CSRs unnecessarily, and secondly, that on
441 context-switching (and function calls) the number of CSRs that may need
442 saving is implicitly known.
443
444 The reason for the overlapping entries is that in the worst-case on an
445 RV64 system, only 4 64-bit CSR reads/writes are required for a full
446 context-switch (and an RV128 system, only 2 128-bit CSR reads/writes).
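
The overlap rule may be sketched as follows (hypothetical helper; assumes the RV64 layout in the table above, where 64-bit CSR row n holds the four 16-bit entries RgCSR[2n] through RgCSR[2n+3]):

```python
# Which 64-bit CSR rows (offsets from 0x4c0) contain a given RgCSR entry?
# Consecutive rows overlap by two entries, so most entries appear twice.
def covering_csr_rows(entry):
    return [n for n in range(8) if 2 * n <= entry <= 2 * n + 3]
```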
447
448 --
449
450 TODO: move elsewhere
451
452 # TODO: use elsewhere (retire for now)
453 vew = CSRbitwidth[rs1]
454 if (vew == 0)
455 bytesperreg = (XLEN/8) # or FLEN as appropriate
456 elif (vew == 1)
457 bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
458 else:
459 bytesperreg = bytestable[vew] # 8 or 16
460 simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
461 vlen = CSRvectorlen[rs1] * simdmult
462 CSRvlength = MIN(MIN(vlen, MAXVECTORLENGTH), rs2)
463
464 The reason for multiplying the vector length by the number of SIMD elements
465 (in each individual register) is so that each SIMD element may optionally be
466 predicated.
467
468 An example of how to subdivide the register file when bitwidth != default
469 is given in the section "Bitwidth Virtual Register Reordering".
470
471 ## Predication CSR <a name="predication_csr_table"></a>
472
473 TODO: update CSR tables, now 7-bit for regidx
474
475 The Predication CSR is a key-value store indicating whether, if a given
476 destination register (integer or floating-point) is referred to in an
477 instruction, it is to be predicated. It is particularly important to note
478 that the *actual* register used can be *different* from the one that is
479 in the instruction, due to the redirection through the lookup table.
480
481 * regidx is the actual register that in combination with the
482 i/f flag, if that integer or floating-point register is referred to,
483 results in the lookup table being referenced to find the predication
484 mask to use on the operation in which that (regidx) register has
485 been used
486 * predidx (in combination with the bank bit in the future) is the
487 *actual* register to be used for the predication mask. Note:
488 in effect predidx is actually a 6-bit register address, as the bank
489 bit is the MSB (and is nominally set to zero for now).
490 * inv indicates that the predication mask bits are to be inverted
491 prior to use *without* actually modifying the contents of the
492 register itself.
493 * zeroing is either 1 or 0, and if set to 1, the operation must
494 place zeros in any element position where the predication mask is
495 set to zero. If zeroing is set to 0, unpredicated elements *must*
496 be left alone. Some microarchitectures may choose to interpret
497 this as skipping the operation entirely. Others which wish to
498 stick more closely to a SIMD architecture may choose instead to
499 interpret unpredicated elements as an internal "copy element"
500 operation (which would be necessary in SIMD microarchitectures
501 that perform register-renaming)
502 * "packed" indicates if the register is to be interpreted as SIMD
503 i.e. containing multiple contiguous elements of size equal to "bitwidth".
504 (Note: in earlier drafts this was in the Register CSR table.
505 However after extending to 7 bits there was not enough space.
506 To use "unpredicated" packed SIMD, set the predicate to x0 and
507 set "invert". This has the effect of setting a predicate of all 1s)
508
509 | PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
510 | ----- | - | - | - | - | ------- | ------- |
511 | 0 | bank0 | zero0 | inv0 | i/f | regidx | predkey |
512 | 1 | bank1 | zero1 | inv1 | i/f | regidx | predkey |
513 | .. | bank.. | zero.. | inv.. | i/f | regidx | predkey |
514 | 15 | bank15 | zero15 | inv15 | i/f | regidx | predkey |
515
516 The Predication CSR Table is a key-value store, so implementation-wise
517 it will be faster to turn the table around (maintain topologically
518 equivalent state):
519
520 struct pred {
521 bool zero;
522 bool inv;
523 bool enabled;
524 int predidx; // redirection: actual int register to use
525 }
526
527 struct pred fp_pred_reg[32]; // 64 in future (bank=1)
528 struct pred int_pred_reg[32]; // 64 in future (bank=1)
529
530 for (i = 0; i < 16; i++)
531 tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
532 idx = CSRpred[i].regidx
533 tb[idx].zero = CSRpred[i].zero
534 tb[idx].inv = CSRpred[i].inv
535 tb[idx].predidx = CSRpred[i].predidx
536 tb[idx].enabled = true
537
538 So when an operation is to be predicated, it is the internal state that
539 is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
540 pseudo-code for operations is given, where p is the explicit (direct)
541 reference to the predication register to be used:
542
543 for (int i=0; i<vl; ++i)
544 if ([!]preg[p][i])
545 (d ? vreg[rd][i] : sreg[rd]) =
546 iop(s1 ? vreg[rs1][i] : sreg[rs1],
547 s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
548
549 This instead becomes an *indirect* reference using the *internal* state
550 table generated from the Predication CSR key-value store, which is used
551 as follows:
552
553 if type(iop) == INT:
554 preg = int_pred_reg[rd]
555 else:
556 preg = fp_pred_reg[rd]
557
558 for (int i=0; i<vl; ++i)
559 predicate, zeroing = get_pred_val(type(iop) == INT, rd):
560 if (predicate && (1<<i))
561 (d ? regfile[rd+i] : regfile[rd]) =
562 iop(s1 ? regfile[rs1+i] : regfile[rs1],
563 s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
564 else if (zeroing)
565 (d ? regfile[rd+i] : regfile[rd]) = 0
566
567 Note:
568
569 * d, s1 and s2 are booleans indicating whether destination,
570 source1 and source2 are vector or scalar
571 * key-value CSR-redirection of rd, rs1 and rs2 have NOT been included
572 above, for clarity. rd, rs1 and rs2 all also must ALSO go through
573 register-level redirection (from the Register CSR table) if they are
574 vectors.
575
576 If written as a function, obtaining the predication mask (and whether
577 zeroing takes place) may be done as follows:
578
579 def get_pred_val(bool is_fp_op, int reg):
580     tb = fp_vec if is_fp_op else int_vec // Register CSR table
581     if (!tb[reg].isvector):
582         return ~0x0, False // all enabled; no zeroing
583     tb = fp_pred_reg if is_fp_op else int_pred_reg
584     if (!tb[reg].enabled):
585         return ~0x0, False // all enabled; no zeroing
586     predidx = tb[reg].predidx // redirection occurs HERE
587     predicate = intreg[predidx] // actual predicate HERE
588     if (tb[reg].inv):
589         predicate = ~predicate // invert ALL bits
590     return predicate, tb[reg].zero
591
592 Note here, critically, that **only** if the register is marked
593 in its CSR **register** table entry as being "active" does the testing
594 proceed further to check if the CSR **predicate** table entry is
595 also active.
596
597 Note also that this is in direct contrast to branch operations
598 for the storage of comparisons: in those specific circumstances
599 the requirement for there to be an active CSR *register* entry
600 is removed.
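
A runnable version of that two-stage lookup may be sketched as follows (an illustration only: the table layouts are simplified to dicts, XLEN is assumed to be 64, and all names are assumptions made for the sketch):

```python
XLEN = 64
ALL_ONES = (1 << XLEN) - 1

def get_pred_val(is_fp_op, reg, reg_table, pred_table, intregs):
    rt = reg_table["fp" if is_fp_op else "int"]
    if not rt[reg]["isvector"]:
        return ALL_ONES, False            # no active register entry: all enabled
    pt = pred_table["fp" if is_fp_op else "int"]
    if not pt[reg]["enabled"]:
        return ALL_ONES, False            # no predication entry: all enabled
    predicate = intregs[pt[reg]["predidx"]]   # redirection occurs HERE
    if pt[reg]["inv"]:
        predicate = ~predicate & ALL_ONES     # invert ALL bits, register untouched
    return predicate, pt[reg]["zero"]
```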
601
602 ## REMAP CSR
603
604 (Note: both the REMAP and SHAPE sections are best read after the
605 rest of the document has been read)
606
607 There is one 32-bit CSR which may be used to indicate which registers,
608 if used in any operation, must be "reshaped" (re-mapped) from a linear
609 form to a 2D or 3D transposed form, or "offset" to permit arbitrary
610 access to elements within a register.
611
612 The 32-bit REMAP CSR may reshape up to 3 registers:
613
614 | 29..28 | 27..26 | 25..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
615 | ------ | ------ | ------ | -- | ------- | -- | ------- | -- | ------- |
616 | shape2 | shape1 | shape0 | 0 | regidx2 | 0 | regidx1 | 0 | regidx0 |
617
618 regidx0-2 refer not to the Register CSR CAM entry but to the underlying
619 *real* register (see regidx, the value) and are consequently 7 bits wide.
620 shape0-2 each refer to one of three SHAPE CSRs. A value of 0x3 is reserved.
621 Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.
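
Decoding those fields may be sketched as follows (hypothetical helper name; returns `(regidx, shape)` pairs for the three remappable registers):

```python
# Decode the 32-bit REMAP CSR: three 7-bit real-register indices at bits
# 0, 8 and 16, and three 2-bit SHAPE selectors at bits 24, 26 and 28.
def decode_remap(csr):
    regidxs = [(csr >> s) & 0x7f for s in (0, 8, 16)]
    shapes  = [(csr >> s) & 0x3  for s in (24, 26, 28)]
    return list(zip(regidxs, shapes))
```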
622
623 ## SHAPE 1D/2D/3D vector-matrix remapping CSRs
624
625 (Note: both the REMAP and SHAPE sections are best read after the
626 rest of the document has been read)
627
628 There are three "shape" CSRs, SHAPE0, SHAPE1, SHAPE2, 32-bits in each,
629 which have the same format. When each SHAPE CSR is set entirely to zeros,
630 remapping is disabled: the register's elements are a linear (1D) vector.
631
632 | 26..24 | 23 | 22..16 | 15 | 14..8 | 7 | 6..0 |
633 | ------- | -- | ------- | -- | ------- | -- | ------- |
634 | permute | offs[2] | zdimsz | offs[1] | ydimsz | offs[0] | xdimsz |
635
636 offs is a 3-bit field, spread out across bits 7, 15 and 23, which
637 is added to the element index during the loop calculation.
638
639 xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
640 that the array dimensionality for that dimension is 1. A value of xdimsz=2
641 would indicate that in the first dimension there are 3 elements in the
642 array. The format of the array is therefore as follows:
643
644 array[xdim+1][ydim+1][zdim+1]
645
646 However whilst illustrative of the dimensionality, that does not take the
647 "permute" setting into account. "permute" may be any one of six values
648 (0-5, with values of 6 and 7 being reserved, and not legal). The table
649 below shows how the permutation dimensionality order works:
650
651 | permute | order | array format |
652 | ------- | ----- | ------------------------ |
653 | 000 | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
654 | 001 | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
655 | 010 | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
656 | 011 | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
657 | 100 | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
658 | 101 | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
659
660 In other words, the "permute" option changes the order in which
661 nested for-loops over the array would be done. The algorithm below
662 shows this more clearly, and may be executed as a python program:
663
664 # mapidx = REMAP.shape2
665 xdim = 3 # SHAPE[mapidx].xdim_sz+1
666 ydim = 4 # SHAPE[mapidx].ydim_sz+1
667 zdim = 5 # SHAPE[mapidx].zdim_sz+1
668
669 lims = [xdim, ydim, zdim]
670 idxs = [0,0,0] # starting indices
671 order = [1,0,2] # experiment with different permutations, here
672 offs = 0 # experiment with different offsets, here
673
674 for idx in range(xdim * ydim * zdim):
675     new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
676     print(new_idx, end=' ')
677     for i in range(3):
678         idxs[order[i]] = idxs[order[i]] + 1
679         if (idxs[order[i]] != lims[order[i]]):
680             break
681         print()
682         idxs[order[i]] = 0
683
684 Here, it is assumed that this algorithm be run within all pseudo-code
685 throughout this document where a (parallelism) for-loop would normally
686 run from 0 to VL-1 to refer to contiguous register
687 elements; instead, where REMAP indicates to do so, the element index
688 is run through the above algorithm to work out the **actual** element
689 index, instead. Given that there are three possible SHAPE entries, up to
690 three separate registers in any given operation may be simultaneously
691 remapped:
692
693 function op_add(rd, rs1, rs2) # add not VADD!
694     ...
695     ...
696     for (i = 0; i < VL; i++)
697         if (predval & 1<<i) # predication uses intregs
698             ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
699                                   ireg[rs2+remap(irs2)];
700             if (!int_vec[rd].isvector) break;
701         if (int_vec[rd].isvector)  { id += 1; }
702         if (int_vec[rs1].isvector) { irs1 += 1; }
703         if (int_vec[rs2].isvector) { irs2 += 1; }
704
705 By changing remappings, 2D matrices may be transposed "in-place" for one
706 operation, followed by setting a different permutation order without
707 having to move the values in the registers to or from memory. Also,
708 the reason for having REMAP separate from the three SHAPE CSRs is so
709 that in a chain of matrix multiplications and additions, for example,
710 the SHAPE CSRs need only be set up once; only the REMAP CSR need be
711 changed to target different registers.
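
The in-place transpose may be demonstrated with the 2D special case of the remapping algorithm (hypothetical helper name; permute order [1,0], i.e. y incrementing fastest):

```python
# 2D special case of SHAPE remapping: walking i from 0 to xdim*ydim-1
# visits the register elements in transposed order, with no data movement.
def transpose_remap(i, xdim, ydim, offs=0):
    y, x = i % ydim, i // ydim        # y increments fastest (order [1,0])
    return offs + x + y * xdim        # index into the linear element layout
```

For a 2x3 array this walks elements 0,2,4,1,3,5: each column of the original layout is visited as a row.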

Note that:

* Over-running the register file clearly has to be detected and
an illegal instruction exception thrown.
* When non-default elwidths are set, the exact same algorithm still
applies (i.e. it offsets elements *within* registers rather than
entire registers).
* If permute option 000 is utilised, the actual order of the
reindexing does not change.
* If two or more dimensions are set to zero, the actual order also
does not change.
* The above algorithm is pseudo-code **only**. Actual implementations
will need to take into account the fact that the element for-looping
must be **re-entrant**, due to the possibility of exceptions occurring.
See MSTATE CSR, which records the current element index.
* Twin-predicated operations require **two** separate and distinct
element offsets. The above pseudo-code algorithm will be applied
separately and independently to each, should each of the two
operands be remapped. *This even includes C.LDSP* and other operations
in that category, where in that case it is the **offset** that is
remapped (see the Compressed Stack LOAD/STORE section).
* Offset is especially useful, on its own, for accessing elements
within the middle of a register. Without offsets, it is necessary
either to use a predicated MV, skipping the first elements, or
to perform a LOAD/STORE cycle to memory.
With offsets, the data does not have to be moved.
* Setting the total number of elements, (xdim+1) times (ydim+1) times
(zdim+1), to less than MVL is **perfectly legal**, albeit very obscure. It
permits entries to be regularly presented to operands **more than once**,
thus allowing the same underlying registers to act as an accumulator of
multiple vector or matrix operations, for example.

Clearly, considerable care needs to be taken here, as the remapping
could hypothetically create arithmetic operations that target the
exact same underlying registers, resulting in data corruption due to
pipeline overlaps. Out-of-order / Superscalar micro-architectures with
register-renaming will have an easier time dealing with this than
DSP-style SIMD micro-architectures.

# Instruction Execution Order

Simple-V behaves as if it is a hardware-level "macro expansion system",
substituting and expanding a single instruction into multiple sequential
instructions with contiguous and sequentially-incrementing registers.
As such, it does **not** modify - or specify - the behaviour and semantics of
the execution order: that may be deduced from the **existing** RV
specification in each and every case.

So for example if a particular micro-architecture permits out-of-order
execution, and it is augmented with Simple-V, then wherever instructions
may be executed out-of-order, so may the "post-expansion" SV ones.

If on the other hand there are memory guarantees which specifically
prevent and prohibit certain instructions from being re-ordered
(such as the Atomicity Axiom, or FENCE constraints), then clearly
those constraints **MUST** also be obeyed "post-expansion".

It should be absolutely clear that SV is **not** about providing new
functionality or changing the existing behaviour of a micro-architectural
design, or about changing the RISC-V Specification.
It is **purely** about compacting what would otherwise be contiguous
instructions that use sequentially-increasing register numbers down
to the **one** instruction.
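
The "macro expansion" can be illustrated with a trivial runnable sketch
(assumptions: VL=4, a plain Python list standing in for the integer
register file):

```python
# One SV-tagged add with VL=4 behaves exactly like four scalar adds on
# contiguous, sequentially-incrementing register numbers.
VL = 4
regs = list(range(32))          # stand-in integer register file x0..x31

def sv_add(rd, rs1, rs2):
    # hardware-level "macro expansion": one instruction, VL element ops
    for i in range(VL):
        regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]

sv_add(1, 8, 16)   # expands to: add x1,x8,x16 ... add x4,x11,x19
# regs[1:5] is now [24, 26, 28, 30]
```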

# Instructions

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, with xBitManip
becoming a critical dependency for efficient manipulation of predication
masks (as a bit-field). With the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever gained
a MV.X as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.

Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V. VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied). Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.

With some exceptions, where it does not make sense or is simply too
challenging, all RV-Base instructions are parallelised:

* CSR instructions: whilst a case could be made for fast-polling of
a CSR into multiple registers, this would require guarantees of strict
sequential ordering that SV does not provide. Therefore, CSRs are
not really suitable and are left out.
* LUI, C.J, C.JR, WFI and AUIPC are not suitable for parallelising, so are
left as scalar.
* LR/SC could hypothetically be parallelised, however their purpose is
single (complex) atomic memory operations, where the LR must be followed
up by a matching SC. A sequence of parallel LR instructions followed
by a sequence of parallel SC instructions is therefore guaranteed not
to be useful. Not least: the guarantees of LR/SC
would be impossible to provide if emulated in a trap.
* EBREAK, NOP, FENCE and others do not use registers, so are not inherently
parallelisable anyway.

All other operations using registers are automatically parallelised.
This includes AMOMAX, AMOSWAP and so on, where particular care and
attention must be paid.

Example pseudo-code for an integer ADD operation (including scalar
operations). Floating-point uses the FP CSRs.

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

Note that for simplicity there is quite a lot missing from the above
pseudo-code: element widths, zeroing on predication, dimensional
reshaping, offsets and so on. However it demonstrates the basic
principle. Augmentations that produce the full pseudo-code are covered in
other sections.
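
The pseudo-code above can be rendered directly executable as a sketch.
Assumptions: VL=4, an all-ones predicate, a dict-based `int_vec` table
standing in for the CSR register table, and the regidx redirection step
omitted for brevity.

```python
# Executable sketch of op_add: per-register vector tagging, with a scalar
# destination terminating the loop after the first active element.
VL = 4
ireg = [0] * 64
int_vec = {r: {"isvector": False, "regidx": r} for r in range(64)}

def op_add(rd, rs1, rs2):
    id_ = irs1 = irs2 = 0
    predval = (1 << VL) - 1                 # no predicate set: all active
    rd_v, rs1_v, rs2_v = (int_vec[r]["isvector"] for r in (rd, rs1, rs2))
    for i in range(VL):
        if predval & (1 << i):
            ireg[rd + id_] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
            if not rd_v:                    # scalar destination: done
                break
        if rd_v:  id_  += 1
        if rs1_v: irs1 += 1
        if rs2_v: irs2 += 1

for r in (16, 20, 24):
    int_vec[r]["isvector"] = True           # tag rd, rs1, rs2 as vectors
ireg[20:24] = [1, 2, 3, 4]
ireg[24:28] = [10, 20, 30, 40]
op_add(16, 20, 24)
# ireg[16:20] is now [11, 22, 33, 44]
```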

## Instruction Format

It is critical to appreciate that there are
**no operations added to SV, at all**.

Instead, by using CSRs to tag registers as an indication of "changed
behaviour", SV *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV,
FCVT, and LOAD/STORE depending on CSR configurations for bitwidth
and predication. **Everything** becomes parallelised. *This includes
Compressed instructions* as well as any future instructions and Custom
Extensions.

Note: using CSR tags to change the behaviour of instructions is nothing
new, including in RISC-V. UXL, SXL and MXL change the behaviour so that
XLEN=32/64/128. FRM changes the behaviour of the floating-point unit, to
alter the rounding mode. Other architectures change the LOAD/STORE
byte-order from big-endian to little-endian on a per-instruction basis.
SV is just a little more... comprehensive in its effect on instructions.

## Branch Instructions

### Standard Branch <a name="standard_branch"></a>

Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).

Note that the predication register to use (if one is enabled) is taken from
the *first* src register. The target (destination) predication register
to use (if one is enabled) is taken from the *second* src register.

If either of src1 or src2 is a scalar (whether by there being no
CSR register entry or by the CSR entry specifically marking
the register as "scalar"), the comparison goes ahead as vector-scalar
or scalar-vector.

In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).

Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2.

In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":

    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);

With associated predication, vector-length adjustments and so on,
and temporarily ignoring bitwidth (which makes the comparisons more
complex), this becomes:

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd)
        temporary_result = 0
    else
        preg[rd] = 0; # initialise to zero

    for (int i = 0; i < VL; ++i)
        if (ps & (1<<i)) && (cmp(s1 ? reg[src1+i]:reg[src1],
                                 s2 ? reg[src2+i]:reg[src2]))
            if not exists(rd)
                temporary_result |= 1<<i;
            else
                preg[rd] |= 1<<i; # bitfield not vector

    if not exists(rd)
        if temporary_result == ps
            goto branch
    else
        if preg[rd] == ps
            goto branch

Notes:

* Zeroing has been temporarily left out of the above pseudo-code,
for clarity.
* Predicated SIMD comparisons would break src1 and src2 further down
into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
Reordering"), setting Vector-Length times (number of SIMD elements) bits
in Predicate Register rd, as opposed to just Vector-Length bits.

TODO: predication now taken from src2. also branch goes ahead
if all compares are successful.

Note also that where normally predication requires that there must
also be a CSR register entry for the register being used, in order
for the **predication** CSR register entry to also be active,
for branches this is **not** the case: src2 does **not** have
to have its CSR register entry marked as active in order for
predication on src2 to be active.
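
A directly runnable sketch of the all-elements-must-pass rule follows.
This is a simplification: register redirection, zeroing and the
temporary-result/preg distinction are omitted, and `cmp` is passed in as
a plain function.

```python
# The vectorised branch is taken if and only if every element compare
# that is not predicated out succeeds.
VL = 4

def vec_branch_taken(reg, src1, src2, ps, s1_vec, s2_vec, cmp):
    result = 0
    for i in range(VL):
        a = reg[src1 + i] if s1_vec else reg[src1]
        b = reg[src2 + i] if s2_vec else reg[src2]
        if (ps & (1 << i)) and cmp(a, b):
            result |= 1 << i
    return result == ps          # all active compares succeeded

reg = [0, 1, 2, 3, 4, 2, 3, 4, 5]
lt = lambda a, b: a < b
taken = vec_branch_taken(reg, 1, 5, 0b1111, True, True, lt)       # True
not_taken = vec_branch_taken(reg, 5, 1, 0b1111, True, True, lt)   # False
```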

### Floating-point Comparisons

There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format, because
FP Compare already stores a 1 or a zero in its "rd" integer register
target: it is not actually a Branch at all, it is a compare.
Thus, no change is made to the floating-point comparison operations
themselves.

It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
and whilst in ordinary branch code this is fine, because the standard
RVF compare can always be followed up with an integer BEQ or a BNE (or
a compressed comparison to zero or non-zero), in predication terms it
has more of an impact. To deal with this, SV's predication has
had "invert" added to it.

### Compressed Branch Instruction

Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated, based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz rs1 is equivalent to beqz rs1, x0, the optional target
to store the results of the comparisons is taken from the CSR predication
table entries for **x0**.

The specific required use of x0 is, with a little thought, quite obvious,
albeit initially counterintuitive. Clearly it is **not** recommended to
redirect x0 with a CSR register entry, however as a means to opaquely obtain
a predication target it is the only sensible option that does not involve
additional special CSRs (or, worse, additional special opcodes).

Note also that, just as with standard branches, the 2nd source
(in this case x0 rather than src2) does **not** have to have its CSR
register table marked as "active" in order for predication to work.

## Vectorised Dual-operand instructions

There is a series of 2-operand instructions involving copying (and
sometimes alteration):

* C.MV
* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
* LOAD(-FP) and STORE(-FP)

All of these operations follow the same two-operand pattern, so it is
*both* the source *and* destination predication masks that are taken into
account. This is different from
the three-operand arithmetic instructions, where the predication mask
is taken from the *destination* register, and applied uniformly to the
elements of the source register(s), element-for-element.

The pseudo-code pattern for twin-predicated operations is as
follows:

    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break

This pattern covers scalar-scalar, scalar-vector, vector-scalar
and vector-vector, and predicated variants of all of those.
Zeroing is not presently included (TODO). As such, when compared
to RVV, the twin-predicated variants of C.MV and FMV cover
**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.

Note that:

* elwidth (SIMD) is not covered in the pseudo-code above
* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
not covered
* zero predication is also not shown (TODO).
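
The pattern above can be exercised as a runnable sketch (assumptions:
VL=4, a flat list register file, and predicates passed in directly
rather than read from CSRs):

```python
# Twin-predicated copy: with a scalar source and vector destination this
# is VSPLAT; with the roles swapped it is VEXTRACT.
VL = 4

def twin_pred_mv(reg, rd, rs, rd_vec, rs_vec, pd=0b1111, ps=0b1111):
    i = j = 0
    while i < VL and j < VL:
        if rs_vec:
            while not (ps & (1 << i)): i += 1   # skip masked-out src
        if rd_vec:
            while not (pd & (1 << j)): j += 1   # skip masked-out dest
        reg[rd + j] = reg[rs + i]
        if rs_vec:
            i += 1
        if rd_vec:
            j += 1
        else:
            break                               # scalar dest: done

reg = [0] * 8 + [7]
twin_pred_mv(reg, 0, 8, rd_vec=True, rs_vec=False)  # VSPLAT of reg[8]
# reg[0:4] is now [7, 7, 7, 7]
```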

### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV, however there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation. The actual instruction's format does not change:

[[!table data="""
15 12 | 11 7 | 6 2 | 1 0 |
funct4 | rd | rs | op |
4 | 5 | 5 | 2 |
C.MV | dest | src | C0 |
"""]]

A simplified version of the pseudocode for this operation is as follows:

    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break

There are several different instructions from RVV that are covered by
this one opcode:

[[!table data="""
src | dest | predication | op |
scalar | vector | none | VSPLAT |
scalar | vector | destination | sparse VSPLAT |
scalar | vector | 1-bit dest | VINSERT |
vector | scalar | 1-bit? src | VEXTRACT |
vector | vector | none | VCOPY |
vector | vector | src | Vector Gather |
vector | vector | dest | Vector Scatter |
vector | vector | src & dest | Gather/Scatter |
vector | vector | src == dest | sparse VCOPY |
"""]]

Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
operations with inversion on the src and dest predication for one of the
two C.MV operations.

Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV, because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].

### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying. The same twin-predication rules also apply.
However when elwidth is not set to default, the instruction is implicitly
and automatically converted to a (vectorised) floating-point type-conversion
operation of the appropriate size, covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions.)

### FCVT Instructions

These are again identical in form to C.MV, except that they cover
floating-point to integer and integer to floating-point conversions. When
element width in each vector is set to default, the instructions behave
exactly as they are defined for standard RV (scalar) operations, except
vectorised in exactly the same fashion as outlined in C.MV.

However when the source or destination element width is not set to default,
the opcode's explicit element widths are *over-ridden* to the new
definitions, and the opcode's element width is taken as indicative of the
SIMD width (if applicable, i.e. if packed SIMD is requested) instead.

For example, FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 64-bit floating-point number in rd.
If however the source rs1 is set to be a vector, where elwidth is set to
default/2 and "packed SIMD" is enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element, and the higher 32 bits are *also* converted to
floating-point and stored in the second. The 32-bit size comes from the
fact that FCVT.S.L's integer width is 64-bit, and with elwidth on rs1 set
to divide that by two, rs1's element width is to be taken as 32.

Similar rules apply to the destination register.
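
The FCVT.S.L packed-SIMD example can be sketched numerically. This is a
sketch only: real hardware would produce IEEE 754 single-precision
elements, modelled here with Python floats.

```python
# elwidth=default/2 on rs1 (RV64): the 64-bit source register is split
# into two 32-bit integers, each converted to a floating-point element.
def fcvt_packed(rs1_val):
    lo = rs1_val & 0xFFFFFFFF
    hi = (rs1_val >> 32) & 0xFFFFFFFF
    return [float(lo), float(hi)]     # two FP elements destined for rd

elems = fcvt_packed((7 << 32) | 3)
# elems is [3.0, 7.0]: low 32 bits first, then the high 32 bits
```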

## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>

An earlier draft of SV modified the behaviour of LOAD/STORE. This
actually undermined the fundamental principle of SV, namely that there
be no modifications to the scalar behaviour (except where absolutely
necessary), in order to simplify an implementor's task if considering
converting a pre-existing scalar design to support parallelism.

So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
does not change in SV; however, just as with C.MV, it is important to note
that twin-predication is possible. Using the template outlined in
the section "Vectorised Dual-operand instructions", the pseudo-code covering
scalar-scalar, scalar-vector, vector-scalar and vector-vector applies,
where SCALAR\_OPERATION is as follows, exactly as for a standard
scalar RV LOAD operation:

    srcbase = ireg[rs+i];
    return mem[srcbase + imm];

Whilst LOAD and STORE remain as-is when compared to their scalar
counterparts, the incrementing of the source register (for LOAD)
means that pointers-to-structures can be easily implemented, and,
if contiguous offsets are required, those pointers (the contents
of the contiguous source registers) may simply be set up to point
to contiguous locations.

## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>

C.LWSP / C.SWSP and their floating-point equivalents are also source-dest
twin-predicated, where it is implicit in C.LWSP/C.FLWSP that x2 is the
source register. It is therefore possible to use predicated C.LWSP to
efficiently pop registers off the stack (by predicating x2 as the source),
cherry-picking which registers to store to (by predicating the
destination). Likewise for C.SWSP. In this way, LOAD/STORE-Multiple is
efficiently achieved.

However, to do so, the behaviour of C.LWSP/C.SWSP needs to be slightly
different: where x2 is marked as vectorised, instead of incrementing
the register on each loop (x2, x3, x4...), it is the *immediate*
that must be incremented. Pseudo-code follows:

    function lwsp(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = x2 # effectively no redirection on x2.
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        reg[rd+j] = mem[x2 + ((offset+i) * 4)]
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break;

For C.LDSP, the offset (and loop) multiplier would be 8, and for
C.LQSP it would be 16. Effectively this makes C.LWSP etc. a Vector
"Unit Stride" Load instruction.

**Note**: it is still possible to redirect x2 to an alternative target
register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
general-purpose Vector "Unit Stride" LOAD/STORE operations.
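
The unit-stride behaviour can be demonstrated with a runnable sketch.
Assumptions: VL=4, all predicate bits set, x2 vectorised so that the
immediate increments, and a dict standing in for memory.

```python
# Vectorised C.LWSP sketch: the *offset* increments on each element, so
# four stack words are popped into four destination registers.
VL = 4
WORD = 4

def lwsp(reg, mem, rd, offset):
    sp = reg[2]                       # x2 is implicitly the source base
    for i in range(VL):
        reg[rd + i] = mem[sp + (offset + i) * WORD]

mem = {100 + k * WORD: 50 + k for k in range(VL)}   # stack words at sp=100
reg = [0] * 8
reg[2] = 100
lwsp(reg, mem, 4, 0)
# reg[4:8] is now [50, 51, 52, 53]
```

For C.LDSP the WORD multiplier would become 8, and 16 for C.LQSP.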

## Compressed LOAD / STORE Instructions

Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. This is **different** from Compressed Stack
LOAD/STORE (C.LWSP / C.SWSP), which have been augmented to become
Vector "Unit Stride" capable.

Just as with uncompressed LOAD/STORE, C.LD / C.ST increment the *register*
during the hardware loop, **not** the offset.

# Element bitwidth polymorphism <a name="elwidth"></a>

Element bitwidth is best covered as its own special section, as it
is quite involved and applies uniformly across-the-board. SV restricts
bitwidth polymorphism to default, default/2, default\*2 and 8-bit
(whilst this seems limiting, the justification is covered in a later
sub-section).

The effect of setting an element bitwidth is to re-cast each entry
in the register table, and, for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, each register effectively
now looks like this:

    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];

where the CSR Register table entry (not the instruction alone) determines
which of those union entries is to be used on each operation, and the
VL element offset in the hardware-loop specifies the index into each array.

However a naive interpretation of the data structure above masks the
fact that, when VL is set greater than 8, for example, and the bitwidth
is 8, accessing one specific register "spills over" into the following
parts of the register file in a sequential fashion. So a much more
accurate way to reflect this would be:

    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];

where, when accessing any individual regfile[n].b entry, it is permitted
(in c) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries, in a fashion
that is completely transparent to a greatly-simplified software /
pseudo-code representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register
bytes is ever attempted.
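
The "overspill" can be modelled runnably by treating the whole register
file as one flat byte array (a sketch only, RV64 assumed):

```python
# With elwidth=8 and VL=12, elements of a vector starting at register 5
# simply run on into register 6's bytes.
XLEN_BYTES = 8
regfile = bytearray(32 * XLEN_BYTES)    # 32 x 64-bit registers

def set_elem_8(reg, offset, val):
    # element "offset" of an 8-bit-elwidth vector based at "reg"
    regfile[reg * XLEN_BYTES + offset] = val & 0xFF

for i in range(12):                     # VL = 12 > 8
    set_elem_8(5, i, i)
# elements 0..7 occupy register 5; elements 8..11 spill into register 6
```

A real implementation must additionally raise an illegal instruction
exception if the spill would run past the end of the register file.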

Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non" polymorphic versions (above):

    function op_add(rd, rs1, rs2) # add not VADD!
    ...
    ...
      for (i = 0; i < VL; i++)
    ...
    ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
    ...
    ...

So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each) are "type-cast"
to 8-bit; for 16-bit entries likewise, and so on.
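
The "type-cast" can be illustrated in Python via `struct` (a sketch of
the union semantics; little-endian byte order assumed):

```python
# The same 8 register bytes viewed as eight uint8s, four uint16s,
# two uint32s or one uint64, depending on the elwidth setting.
import struct

reg = struct.pack("<Q", 0x0807060504030201)   # one 64-bit register
as_b = struct.unpack("<8B", reg)   # elwidth 8:  (1, 2, 3, 4, 5, 6, 7, 8)
as_s = struct.unpack("<4H", reg)   # elwidth 16: (0x0201, 0x0403, ...)
as_i = struct.unpack("<2I", reg)   # elwidth 32: (0x04030201, 0x08070605)
```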

However that only covers the case where the element widths are the same.
Where the element widths are different, the following algorithm applies:

* Analyse the bitwidth of all source operands and work out the
maximum. Record this as "maxsrcbitwidth".
* If any given source operand requires sign-extension or zero-extension
(lb, div, rem, mul, sll, srl, sra etc.), instead of the mandatory 32-bit
sign-extension / zero-extension (or whatever is specified in the standard
RV specification), **change** that to sign-extending from the respective
individual source operand's bitwidth (from the CSR table) out to
"maxsrcbitwidth" (previously calculated), instead.
* Following separate and distinct (optional) sign/zero-extension of all
source operands, as specifically required for that operation, carry out the
operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
this may be a "null" (copy) operation, and that with FCVT, the changes
to the source and destination bitwidths may also turn FCVT effectively
into a copy).
* If the destination operand requires sign-extension or zero-extension,
instead of a mandatory fixed size (typically 32-bit for arithmetic,
for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
etc.), overload the RV specification with the bitwidth from the
destination register's elwidth entry.
* Finally, store the (optionally) sign/zero-extended value into its
destination: memory for sb/sw etc., or an offset section of the register
file for an arithmetic operation.

In this way, polymorphic bitwidths are achieved without requiring a
massive 64-way permutation of calculations **per opcode**, for example
(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
rd bitwidths). The pseudo-code is therefore as follows:

    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0:
            return xlen
        if elwidth == 1:
            return xlen / 2
        if elwidth == 2:
            return xlen * 2
        // elwidth == 3:
        return 8

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, ird, result)
            if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

Whilst the specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:

* the source operands are extended out to the maximum bitwidth of all
source operands
* the operation takes place at that maximum source bitwidth (the
destination bitwidth is not involved at this point, at all)
* the result is extended (or potentially even truncated) before being
stored in the destination, i.e. truncation (if required) to the
destination width occurs **after** the operation, **not** before
* when the destination is not marked as "vectorised", the **full**
(standard, scalar) register file entry is taken up, i.e. the
element is either sign-extended or zero-extended to cover the
full register bitwidth (XLEN) if it is not already XLEN bits long.

Implementors are entirely free to optimise the above, particularly
if it is specifically known that any given operation will complete
accurately in fewer bits, as long as the results produced are
directly equivalent and equal, for all inputs and all outputs,
to those produced by the above algorithm.
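
A numeric sketch of these rules follows (zero-extension assumed
throughout for simplicity; a real implementation would pick sign- or
zero-extension per opcode):

```python
# Mixed-elwidth add: sources extended to max source width, the operation
# performed at that width, truncation applied only at destination-store.
def mask(bits):
    return (1 << bits) - 1

def poly_add(src1, w1, src2, w2, destw):
    maxsrcwid = max(w1, w2)                 # operation bitwidth
    a = src1 & mask(w1)                     # zero-extend each source
    b = src2 & mask(w2)
    result = (a + b) & mask(maxsrcwid)      # operate at max source width
    return result & mask(destw)             # truncate at store time

r = poly_add(0xFF, 8, 0x01, 16, 8)
# 0xFF + 1 = 0x100 at the 16-bit operation width; the 8-bit destination
# truncates it to 0
```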

## Polymorphic floating-point operation exceptions and error-handling

For floating-point operations, conversion takes place without
raising any kind of exception. Exactly as specified in the standard
RV specification, NaN (or the appropriate value) is stored if the result
is beyond the range of the destination, and, again exactly as with
standard scalar operations, the floating-point flag is raised (FCSR).
And, again, just as with scalar operations, it is software's
responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.

Note that it is perfectly legitimate for floating-point bitwidths of
only 8 to be specified. However, whilst it is possible to apply IEEE 754
principles, no actual standard yet exists. Implementors wishing to
provide hardware-level 8-bit support, rather than throwing a trap to
emulate it in software, should contact the author of this specification
before proceeding.
1424
## Polymorphic shift operators

A special note is needed for changing the element width of left and right
shift operators, particularly right-shift. Even for standard RV base,
in order for correct results to be returned, the second operand RS2 must
be truncated to be within the range of RS1's bitwidth. spike's implementation
of sll for example is as follows:

    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));

which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
range 0..31 so that RS1 will only be left-shifted by the amount that
is possible to fit into a 32-bit register. Whilst this appears not
to matter for hardware, it matters greatly in software implementations,
and it also matters where an RV64 system is set to "RV32" mode, such
that the underlying registers RS1 and RS2 comprise 64 hardware bits
each.

For SV, where each operand's element bitwidth may be over-ridden, the
rule about determining the operation's bitwidth *still applies*, being
defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
**also applies to the truncation of RS2**. In other words, *after*
determining the maximum bitwidth, RS2's range must **also be truncated**
to ensure a correct answer. Example:

* RS1 is over-ridden to a 16-bit width
* RS2 is over-ridden to an 8-bit width
* RD is over-ridden to a 64-bit width
* the maximum bitwidth is thus determined to be 16-bit - max(8, 16)
* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)

Pseudocode for this example would therefore be:

    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));

This example illustrates that considerable care therefore needs to be
taken to ensure that left and right shift operations are implemented
correctly.

## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>

Polymorphic element widths in vectorised form means that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the
destination's.

This makes for a slightly more complex algorithm when using indirection
on the "addressed" register (source for LOAD and destination for STORE),
particularly given that the LOAD/STORE instruction provides important
information about the width of the data to be reinterpreted.

Let's illustrate the "load" part, where the pseudo-code for elwidth=default
was as follows, and i is the loop from 0 to VL-1:

    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits

Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
chunks are taken from the source memory location addressed by the current
indexed source address register, and only when a full 32-bits-worth
are taken will the index be moved on to the next contiguous source
address register:

    bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock;             // modulo
    return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.

Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
and 128 for LQ.

The principle is basically exactly the same as if the srcbase were pointing
at the memory of the *register* file: memory is re-interpreted as containing
groups of elwidth-wide discrete elements.
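A sketch of the chunked addressing (Python; names are illustrative, and
byte addressing is made explicit here rather than relying on the pointer
re-cast used in the pseudocode above):

```python
def lw_element_addr(ireg, rs, imm, i, elwidth_bits):
    """Sketch of elwidth-chunked LOAD addressing for LW (32-bit wide):
    several narrow elements share one source address register before
    the indexed address register moves on to the next one."""
    elsperblock = 32 // elwidth_bits       # 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs + i // elsperblock]  # integer divide selects register
    offs = i % elsperblock                 # element offset within the block
    # byte address of element i (each element is elwidth_bits/8 bytes wide)
    return srcbase + imm + offs * (elwidth_bits // 8)

# hypothetical address-register contents, elwidth=16:
# elements 0 and 1 come from the address in x5, element 2 from x6
ireg = {5: 0x1000, 6: 0x2000}
a0 = lw_element_addr(ireg, 5, 0, 0, 16)
a1 = lw_element_addr(ireg, 5, 0, 1, 16)
a2 = lw_element_addr(ireg, 5, 0, 2, 16)
```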

When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*. Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth. Also, the storing has the exact same analogous algorithm as
above, where in fact it is just the set\_polymorphed\_reg pseudocode
(completely unchanged) used above.

One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB for example
cannot possibly obtain 16-bit-wide data. This condition may be detected
where, when using integer divide, elsperblock (the width of the LOAD
divided by the bitwidth of the element) is zero.

The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)

The elements, if the element bitwidth is larger than the LD operation's
size, will then be sign/zero-extended to the full LD operation size, as
specified by the LOAD (LDU instead of LD, LBU instead of LB), before
being passed on to the second phase.

As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except where in previous
pseudo-code (elwidth=default for both source and target) it was
the *registers* that the predication was applied to, it is now the
**elements** that the predication is applied to.

Thus the full pseudocode for all LD operations may be written out
as follows:

    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth    # destination element width
        srcwid = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, srcwid))
            else:
                val = sign_extend(val, min(opwidth, srcwid))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;

Note:

* when comparing against for example the twin-predicated c.mv
  pseudo-code, the pattern of independent incrementing of rd and rs
  is preserved unchanged.
* just as with the c.mv pseudocode, zeroing is not included and must be
  taken into account (TODO).
* due to the use of a twin-predication algorithm, LOAD/STORE also
  take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
  VSCATTER characteristics.
* due to the use of the same set\_polymorphed\_reg pseudocode,
  a destination that is not vectorised (marked as scalar) will
  result in the element being fully sign-extended or zero-extended
  out to the full register file bitwidth (XLEN). When the source
  is also marked as scalar, this is how the compatibility with
  standard RV LOAD/STORE is preserved by this algorithm.

### Example Tables showing LOAD elements

This section contains examples of vectorised LOAD operations, showing
how the two-stage process works (three if zero/sign-extension is included).

#### Example: LD x8, 0(x5), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7

This is:

* a 64-bit load, with an offset of zero
* with a source-address elwidth of 16-bit
* into a destination-register with an elwidth of 32-bit
* where VL=7
* from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
* RV64, where XLEN=64 is assumed.

First, the memory table: due to the element width being 16 and the
operation being LD (64), the 64 bits loaded from memory are subdivided
into groups of **four** elements. And, with VL being 7 (deliberately,
to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register,
x6:

[[!table data="""
addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
@x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
@x6 | elem 4 || elem 5 || elem 6 || not loaded ||
"""]]

Next, the seven elements are zero-extended from 16-bit to 32-bit, as whilst
the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.

[[!table data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0 | 0x0 | elem0 ||
0x0 | 0x0 | elem1 ||
0x0 | 0x0 | elem2 ||
0x0 | 0x0 | elem3 ||
0x0 | 0x0 | elem4 ||
0x0 | 0x0 | elem5 ||
0x0 | 0x0 | elem6 ||
"""]]

Lastly, the elements are stored in contiguous blocks, as if x8 was also
byte-addressable "memory". That "memory" happens to cover registers
x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:

[[!table data="""
reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
"""]]

Thus we have data that is loaded from the **addresses** pointed to by
x5 and x6, zero-extended from 16-bit to 32-bit, and stored in the
**registers** x8 through to half of x11. The end result is that elements
0 and 1 end up in x8, with element 1 being shifted up 32 bits, and so
on, until finally element 6 is in the LSBs of x11.

Note that whilst the memory addressing table is shown in left-to-right
byte order, the registers are shown in right-to-left (MSB-first) order.
This does **not** imply that bit or byte-reversal is carried out: it's
just easier to visualise memory as being contiguous bytes, and it
emphasises that registers are not really actually "memory" as such.
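The whole walk-through can be reproduced with a short model (Python;
the memory contents, addresses and the pre-existing value in x11 are
illustrative, chosen only to make the preservation visible):

```python
def simulate_ld_example():
    """Reproduces the LD x8, 0(x5) walk-through: VL=7, source elwidth=16,
    destination elwidth=32, RV64 (XLEN=64)."""
    VL = 7
    mem = {0x1000: [1, 2, 3, 4],   # four 16-bit elements at the addr in x5
           0x2000: [5, 6, 7, 0]}   # three more (plus one not loaded) via x6
    ireg = {5: 0x1000, 6: 0x2000,
            8: 0, 9: 0, 10: 0, 11: 0xDEADBEEF00000000}
    elsperblock = 64 // 16          # LD is 64-bit wide, elwidth=16 -> 4
    for i in range(VL):
        srcbase = ireg[5 + i // elsperblock]        # x5 then x6
        elem = mem[srcbase][i % elsperblock] & 0xFFFF  # 16-bit element
        elem &= 0xFFFFFFFF                          # zero-extend to 32-bit
        # pack two 32-bit elements per 64-bit destination register
        reg, shift = 8 + i // 2, 32 * (i % 2)
        ireg[reg] = (ireg[reg] & ~(0xFFFFFFFF << shift)) | (elem << shift)
    return ireg

regs = simulate_ld_example()
```

Note how the top 32 bits of x11 survive untouched, exactly as the last
table shows.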

## Why SV bitwidth specification is restricted to 4 entries

The four entries for SV element bitwidths only allow three over-rides:

* default bitwidth for a given operation *divided* by two
* default bitwidth for a given operation *multiplied* by two
* 8-bit

At first glance this seems completely inadequate: for example, RV64
cannot possibly operate on 16-bit operations, because 64 divided by
2 is 32. However, the reader may have forgotten that it is possible,
at run-time, to switch a 64-bit application into 32-bit mode, by
setting UXL. Once switched, opcodes that formerly had 64-bit
meanings now have 32-bit meanings, and in this way, "default/2"
now reaches **16-bit** where previously it meant "32-bit".

There is however an absolutely crucial aspect of SV here that explicitly
needs spelling out, and it's whether the "vectorised" bit is set in
the register's CSR entry.

If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when set on a destination (RD),
then sign-extension and zero-extension, whilst changed to match the
override bitwidth (if set), will erase the **full** register entry
(64-bit if RV64).

When vectorised is *set*, this indicates that the operation now treats
**elements** as if they were independent registers, so regardless of
the length, any parts of a given actual register that are not involved
in the operation are **NOT** modified, but are **PRESERVED**.

SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of the
multi-element operation.

Example:

* rs1, rs2 and rd are all set to 8-bit
* VL is set to 3
* RV64 architecture is set (UXL=64)
* add operation is carried out
* bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
  concatenated with similar add operations on bits 15..8 and 7..0
* bits 24 through 63 **remain as they originally were**.
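A sketch of this element-preservation behaviour (Python; values are
illustrative):

```python
def vec_add8_preserving(rd_val, rs1_val, rs2_val, VL):
    """VL lanes of 8-bit add into a 64-bit destination register,
    with all untouched parts of the destination preserved."""
    result = rd_val
    for i in range(VL):
        shift = 8 * i
        a = (rs1_val >> shift) & 0xFF
        b = (rs2_val >> shift) & 0xFF
        s = (a + b) & 0xFF                        # 8-bit wraparound add
        result = (result & ~(0xFF << shift)) | (s << shift)
    return result                                 # bits 8*VL..63 unchanged

# rd starts as all-1s; only bits 0-23 (VL=3 lanes of 8) are modified
r = vec_add8_preserving(0xFFFFFFFFFFFFFFFF, 0x010203, 0x101010, 3)
```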

Example SIMD micro-architectural implementation:

* SIMD architecture works out the nearest round number of elements
  that would fit into a full RV64 register (in this case: 8)
* SIMD architecture creates a hidden predicate, binary 0b00000111,
  i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
* SIMD architecture goes ahead with the add operation as if it
  was a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
  (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.

This requires a read on rd; however this is required anyway in order
to support non-zeroing mode.

## Polymorphic floating-point

Standard scalar RV integer operations base the register width on XLEN,
which may be changed (UXL in USTATUS, and the corresponding MXL and
SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
arithmetic operations are therefore restricted to an active XLEN bits,
with sign or zero extension to pad out the upper bits when XLEN has
been dynamically set to less than the actual register size.

For scalar floating-point, the active (used / changed) bits are
specified exclusively by the operation: ADD.S specifies an active
32-bits, with the upper bits of the source registers needing to
be all 1s ("NaN-boxed"), and the destination upper bits being
*set* to all 1s (including on LOAD/STOREs).
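A minimal sketch of this NaN-boxing rule, modelling the common case of a
32-bit single stored in a 64-bit FP register (Python; the function names
are illustrative, not from the spec):

```python
import struct

def nanbox32_in_64(f):
    """Model of scalar NaN-boxing: a 32-bit float stored in a 64-bit
    FP register has all upper 32 bits set to 1s."""
    bits32 = struct.unpack('<I', struct.pack('<f', f))[0]
    return (0xFFFFFFFF << 32) | bits32

def is_nanboxed(reg64):
    # a valid single-precision value must have its upper 32 bits all 1s
    return (reg64 >> 32) == 0xFFFFFFFF
```

The same principle scales up: with FLEN=128, a 64-bit double would be
boxed with the top 64 bits all 1s.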

Where elwidth is set to default (on any source or the destination)
it is obvious that this NaN-boxing behaviour can and should be
preserved. When elwidth is non-default things are less obvious,
so need to be thought through. Here is a normal (scalar) sequence,
assuming an RV64 which supports Quad (128-bit) FLEN:

* FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
* ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
* FSD stores lowest 64-bits from the 128-bit-wide register to memory:
  top 64 MSBs ignored.

Therefore it makes sense to mirror this behaviour when, for example,
elwidth is set to 32. Assume elwidth set to 32 on all source and
destination registers:

* FLD loads 64-bit wide from memory as **two** 32-bit single-precision
  floating-point numbers.
* ADD.D performs **two** 32-bit-wide adds, storing one of the adds
  in bits 0-31 and the second in bits 32-63.
* FSD stores lowest 64-bits from the 128-bit-wide register to memory

Here's the thing: it does not make sense to overwrite the top 64 MSBs
of the registers either during the FLD **or** the ADD.D. The reason
is that, effectively, the top 64 MSBs actually represent a completely
independent 64-bit register, so overwriting it is not only gratuitous
but may actually be harmful for a future extension to SV which may
have a way to directly access those top 64 bits.

The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
when "isvec" is false in a given register's CSR entry. Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.

Ultimately if elwidth is default and isvec false on *all* source
and destination registers, a SimpleV instruction defaults completely
to standard RV scalar behaviour (this holds true for **all** operations,
right across the board).

The nice thing here is that ADD.S, ADD.D and ADD.Q with non-default
elwidth values are effectively all the same: they all still perform
multiple ADD operations, just at different widths. A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.

In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however, simpler SIMD-style
microarchitectures may not have the infrastructure in place to know
the difference, such that when VL=8 and an ADD.D instruction is issued,
it completes in 2 cycles (or more) rather than one, where if an ADD.Q
had been issued instead on such simpler microarchitectures it would
complete in one.

## Specific instruction walk-throughs

This section covers walk-throughs of the above-outlined procedure
for converting standard RISC-V scalar arithmetic operations to
polymorphic widths, to ensure that it is correct.

### add

Standard Scalar RV32/RV64 (xlen):

* RS1 @ xlen bits
* RS2 @ xlen bits
* add @ xlen bits
* RD @ xlen bits

Polymorphic variant:

* RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
* RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
* add @ max(rs1, rs2) bits
* RD @ rd bits: zero-extend to rd if rd > max(rs1, rs2), otherwise truncate

Note here that polymorphic add zero-extends its source operands,
where addw sign-extends.
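A minimal sketch of the polymorphic add rules above (Python, with
illustrative names):

```python
def poly_add(rs1_val, rs1_w, rs2_val, rs2_w, rd_w):
    """Polymorphic add: zero-extend sources to max(rs1_w, rs2_w),
    add at that width, then zero-extend/truncate to rd_w."""
    opwidth = max(rs1_w, rs2_w)
    a = rs1_val & ((1 << rs1_w) - 1)          # zero-extension is implicit
    b = rs2_val & ((1 << rs2_w) - 1)
    result = (a + b) & ((1 << opwidth) - 1)   # add wraps at opwidth
    return result & ((1 << rd_w) - 1)         # adjust to rd width last

# 8-bit 0xFF + 16-bit 0x0001: op at 16 bits gives 0x0100,
# zero-extended into a 32-bit rd
r = poly_add(0xFF, 8, 0x0001, 16, 32)
```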

### addw

The RV Specification specifically states that "W" variants of arithmetic
operations always produce 32-bit signed values. In a polymorphic
environment it is reasonable to assume that the signed aspect is
preserved, where it is the length of the operands and the result
that may be changed.

Standard Scalar RV64 (xlen):

* RS1 @ xlen bits
* RS2 @ xlen bits
* add @ xlen bits
* RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.

Polymorphic variant:

* RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
* RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
* add @ max(rs1, rs2) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, rs2), otherwise truncate

Note here that polymorphic addw sign-extends its source operands,
where add zero-extends.

This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.

Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
where for add they are both zero-extended. This holds true for all arithmetic
operations ending with "W".
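The corresponding sketch for addw's sign-extension behaviour (Python,
illustrative names):

```python
def sign_extend(val, frombits):
    # interpret the low 'frombits' bits of val as a signed integer
    val &= (1 << frombits) - 1
    if val & (1 << (frombits - 1)):
        val -= 1 << frombits
    return val

def poly_addw(rs1_val, rs1_w, rs2_val, rs2_w, rd_w):
    """Polymorphic addw: sign-extend sources to max(rs1_w, rs2_w),
    add at that width, then sign-extend/truncate to rd_w."""
    opwidth = max(rs1_w, rs2_w)
    result = sign_extend(rs1_val, rs1_w) + sign_extend(rs2_val, rs2_w)
    result &= (1 << opwidth) - 1               # add wraps at opwidth
    return sign_extend(result, opwidth) & ((1 << rd_w) - 1)

# 8-bit -1 (0xFF) + 16-bit 0: sign-extends to 0xFFFF at 16 bits,
# then sign-extends again into a 32-bit destination
r = poly_addw(0xFF, 8, 0x0000, 16, 32)
```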

### addiw

Standard Scalar RV64I:

* RS1 @ xlen bits, truncated to 32-bit
* immed @ 12 bits, sign-extended to 32-bit
* add @ 32 bits
* RD @ xlen bits: sign-extend the 32-bit result to xlen.

Polymorphic variant:

* RS1 @ rs1 bits
* immed @ 12 bits, sign-extend to max(rs1, 12) bits
* add @ max(rs1, 12) bits
* RD @ rd bits: sign-extend to rd if rd > max(rs1, 12), otherwise truncate


# Exceptions

TODO: expand. Exceptions may occur at any time, in any given underlying
scalar operation. This implies that context-switching (traps) may
occur, and operation must be returned to where it left off. That in
turn implies that the full state - including the current parallel
element being processed - has to be saved and restored. This is
what the **STATE** CSR is for.

The implications are that all underlying individual scalar operations
"issued" by the parallelisation have to appear to be executed sequentially.
The further implications are that if two or more individual element
operations are underway, and one with an earlier index causes an exception,
it may be necessary for the microarchitecture to **discard** or terminate
operations with higher indices.

This being somewhat dissatisfactory, an "opaque predication" variant
of the STATE CSR is being considered.

# Hints

A "HINT" is an operation that has no effect on architectural state,
where its use may, by agreed convention, give advance notification
to the microarchitecture: branch prediction notification would be
a good example. Usually HINTs are where rd=x0.

With Simple-V being capable of issuing *parallel* instructions where
rd=x0, the space for possible HINTs is expanded considerably. VL
could be used to indicate different hints. In addition, if predication
is set, the predication register itself could hypothetically be passed
in as a *parameter* to the HINT operation.

No specific hints are yet defined in Simple-V.

# Subsets of RV functionality

This section describes the differences when SV is implemented on top of
different subsets of RV.

## Common options

It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.

## RV32 / RV32F

When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
maximum limit for predication is also restricted to 32 bits. Whilst not
actually specifically an "option" it is worth noting.

## RV32G

Normally in standard RV32 it does not make much sense to have
RV32G, however it is automatically implied to exist in RV32+SV due to
the option for the element width to be doubled. This may be sufficient
for implementors, such that actually needing RV32G itself (which makes
no sense given that the RV32 integer register file is 32-bit) may be
redundant.

It is a strange combination that may make sense on closer inspection,
particularly given that under the standard RV32 system many of the opcodes
to convert and sign-extend 64-bit integers to 64-bit floating-point will
be missing, as they are assumed to only be present in an RV64 context.

## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)

When floating-point is not implemented, the size of the User Register and
Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
per table).

## RV32E

In embedded scenarios the User Register and Predication CSRs may be
dropped entirely, or optionally limited to 1 CSR, such that the combined
number of entries from the M-Mode CSR Register table plus U-Mode
CSR Register table is either 4 16-bit entries or (if the U-Mode is
zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
the Predication CSR tables.

RV32E is the most likely candidate for simply detecting that registers
are marked as "vectorised", and generating an appropriate exception
for the VL loop to be implemented in software.

## RV128

RV128 has not been especially considered here, however it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits given that XLEN is now 128.

# Under consideration <a name="issues"></a>

For element-grouping, if there is unused space within a register
(3 16-bit elements in a 64-bit register for example), recommend:

* For the unused elements in an integer register, the used element
  closest to the MSB is sign-extended on write and the unused elements
  are ignored on read.
* The unused elements in a floating-point register are treated as-if
  they are set to all ones on write and are ignored on read, matching the
  existing standard for storing smaller FP values in larger registers.

---

Info register:

> One solution is to just not support LR/SC wider than a fixed
> implementation-dependent size, which must be at least
> 1 XLEN word, which can be read from a read-only CSR
> that can also be used for info like the kind and width of
> hw parallelism supported (128-bit SIMD, minimal virtual
> parallelism, etc.) and other things (like maybe the number
> of registers supported).

> That CSR would have to have a flag to make a read trap so
> a hypervisor can simulate different values.

----

> And what about instructions like JALR?

answer: they're not vectorised, so not a problem

----

* if opcode is in the RV32 group, rd, rs1 and rs2 bitwidth are
  XLEN if elwidth==default
* if opcode is in the RV32I group, rd, rs1 and rs2 bitwidth are
  *32* if elwidth == default

---

TODO: update elwidth to be default / 8 / 16 / 32

---

TODO: document different lengths for INT / FP regfiles, and provide
as part of info register. 00=32, 01=64, 10=128, 11=reserved.
