[[!tag standards]]

# Appendix

* Saturation
* Parallel Prefix
* Reduce Modes
* parallel prefix simulator
* OV sv.addex discussion

This is the appendix to [[sv/svp64]], providing explanations of modes
etc., leaving the main svp64 page's primary purpose as outlining the
instruction format.

Table of contents:

[[!toc]]

# Partial Implementations

It is perfectly legal to implement subsets of SVP64 as long as illegal
instruction traps are always raised on unimplemented features, so that
soft-emulation is possible, even for future revisions of SVP64. With
SVP64 being partly controlled through contextual SPRs, a little care
has to be taken.

**All** SPRs not implemented, including reserved ones for future use,
must raise an illegal instruction trap if read or written. This allows
software the opportunity to emulate the context created by the given
SPR.

See [[sv/compliancy_levels]] for full details.

# XER, SO and other global flags

Vector systems are expected to be high performance. This is achieved
through parallelism, which requires that elements in the vector be
independent. XER.SO/OV and other global "accumulation" flags (CR.SO)
cause Read-Write Hazards on single-bit global resources, having a
significant detrimental effect.

Consequently in SV, XER.SO behaviour is disregarded (including in
`cmp` instructions). XER.SO is not read, but XER.OV may be written,
breaking the Read-Modify-Write Hazard Chain that complicates
microarchitectural implementations. This includes when `scalar
identity behaviour` occurs. If precise OpenPOWER v3.0/1 scalar
behaviour is desired then OpenPOWER v3.0/1 instructions should be used
without an SV Prefix.

TODO jacob add about OV
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

Of note here is that XER.SO and OV may already be disregarded in the
Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
SVP64 simply makes it mandatory to disregard XER.SO even for other
Subsets, but only for SVP64 Prefixed Operations.

XER.CA/CA32 on the other hand is expected and required to be
implemented according to standard Power ISA Scalar behaviour.
Interestingly, due to SVP64 being in effect a hardware for-loop around
Scalar instructions executing in precise Program Order, a little
thought shows that a Vectorised Carry-In-Out add is in effect a Big
Integer Add, taking a single bit Carry In and producing, at the end, a
single bit Carry out. High performance implementations may exploit
this observation to deploy efficient Parallel Carry Lookahead.

    # assume VL=4, this results in 4 sequential ops (below)
    sv.adde r0.v, r4.v, r8.v

    # instructions that get executed in backend hardware:
    adde r0, r4, r8 # takes carry-in, produces carry-out
    adde r1, r5, r9 # takes carry from previous
    ...
    adde r3, r7, r11 # likewise

It can clearly be seen that the carry chains from one 64-bit add to the
next, the end result being that a 256-bit "Big Integer Add with Carry"
has been performed, and that CA contains the 257th bit. A
one-instruction 512-bit Add-with-Carry may be performed by setting
VL=8, and a one-instruction 1024-bit Add-with-Carry by setting VL=16,
and so on. More on this in [[openpower/sv/biginteger]]
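As an illustration of that equivalence, the following is a minimal
executable Python sketch (not part of the specification) that models
the VL=4 `sv.adde` loop above as four chained 64-bit limb additions;
the function name and little-endian limb layout are assumptions for
illustration only:

    # Sketch: a VL=4 chain of adde operations is a 256-bit add.
    MASK64 = (1 << 64) - 1

    def sv_adde(a_limbs, b_limbs, carry_in=0):
        """Model sv.adde: VL sequential adde ops, carry chained
        from one element to the next."""
        result, ca = [], carry_in
        for a, b in zip(a_limbs, b_limbs):
            s = a + b + ca
            result.append(s & MASK64)
            ca = s >> 64     # carry-out feeds the next element's carry-in
        return result, ca    # final ca is the "257th bit"

    # 256-bit operands as four 64-bit limbs, least-significant first
    a = [MASK64, MASK64, MASK64, MASK64]       # 2**256 - 1
    b = [1, 0, 0, 0]                           # 1
    limbs, ca = sv_adde(a, b)
    assert limbs == [0, 0, 0, 0] and ca == 1   # wrapped; carry out set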
# v3.0B/v3.1 relevant instructions

SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
CPU ISA.

Vectorisation of the VSX Packed SIMD system makes no sense whatsoever,
the sole exceptions potentially being any operations with 128-bit
operands such as `vrlq` (Rotate Quad Word) and `xsaddqp` (Scalar
Quad-precision Add). SV effectively *replaces* the majority of VSX,
requiring far fewer instructions, and provides, at the very minimum,
predication (which VSX was designed without).

Likewise, Load/Store Multiple makes no sense to have: not only is the
equivalent functionality provided by SV, the SV alternatives may be
predicated as well, making them far better suited to use in function
calls and context-switching.

Additionally, some v3.0/1 instructions simply make no sense at all in a
Vector context: `rfid` falls into this category, as well as `sc` and
`scv`. Here there is simply no point trying to Vectorise them: the
standard OpenPOWER v3.0/1 instructions should be called instead.

Fortuitously this leaves several Major Opcodes free for use by SV to
fit alternative future instructions. In a 3D context this means Vector
Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST operations,
and others critical to an efficient, effective 3D GPU and VPU ISA. With
such instructions being included as standard in other
commercially-successful GPU ISAs it is likewise critical that a 3D
GPU/VPU based on svp64 also have such instructions.

Note however that svp64 is stand-alone and is in no way critically
dependent on the existence or provision of 3D GPU or VPU instructions.
These should be considered extensions, and their discussion and
specification is out of scope for this document.

Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
v3.1B is *not* altered by svp64 in any way.

## Major opcode map (v3.0B)

This table is taken from v3.0B. Table 9: Primary Opcode Map (opcode
bits 0:5)

```
      | 000    | 001    | 010   | 011   | 100    | 101    | 110   | 111
  000 |        |        | tdi   | twi   | EXT04  |        |       | mulli | 000
  001 | subfic |        | cmpli | cmpi  | addic  | addic. | addi  | addis | 001
  010 | bc/l/a | EXT17  | b/l/a | EXT19 | rlwimi | rlwinm |       | rlwnm | 010
  011 | ori    | oris   | xori  | xoris | andi.  | andis. | EXT30 | EXT31 | 011
  100 | lwz    | lwzu   | lbz   | lbzu  | stw    | stwu   | stb   | stbu  | 100
  101 | lhz    | lhzu   | lha   | lhau  | sth    | sthu   | lmw   | stmw  | 101
  110 | lfs    | lfsu   | lfd   | lfdu  | stfs   | stfsu  | stfd  | stfdu | 110
  111 | lq     | EXT57  | EXT58 | EXT59 | EXT60  | EXT61  | EXT62 | EXT63 | 111
      | 000    | 001    | 010   | 011   | 100    | 101    | 110   | 111
```

## Suitable for svp64-only

This is the same table containing v3.0B Primary Opcodes except those
that make no sense in a Vectorisation Context have been removed. These
removed POs can, *in the SV Vector Context only*, be assigned to
alternative (Vectorised-only) instructions, including future
extensions. EXT04 retains the scalar `madd*` operations but would have
all PackedSIMD (aka VSX) operations removed.

Note, again, to emphasise: outside of svp64 these opcodes **do not**
change. When not prefixed with svp64 these opcodes **specifically**
retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.

```
      | 000    | 001    | 010   | 011   | 100    | 101    | 110   | 111
  000 |        |        |       |       | EXT04  |        |       | mulli | 000
  001 | subfic |        | cmpli | cmpi  | addic  | addic. | addi  | addis | 001
  010 | bc/l/a |        |       | EXT19 | rlwimi | rlwinm |       | rlwnm | 010
  011 | ori    | oris   | xori  | xoris | andi.  | andis. | EXT30 | EXT31 | 011
  100 | lwz    | lwzu   | lbz   | lbzu  | stw    | stwu   | stb   | stbu  | 100
  101 | lhz    | lhzu   | lha   | lhau  | sth    | sthu   |       |       | 101
  110 | lfs    | lfsu   | lfd   | lfdu  | stfs   | stfsu  | stfd  | stfdu | 110
  111 |        |        | EXT58 | EXT59 |        | EXT61  |       | EXT63 | 111
      | 000    | 001    | 010   | 011   | 100    | 101    | 110   | 111
```

It is important to note that giving an opcode an SVP64 meaning that
differs from its v3.0B Scalar meaning is highly undesirable: the
complexity in the decoder is greatly increased.
# EXTRA Field Mapping

The purpose of the 9-bit EXTRA field mapping is to mark individual
registers (RT, RA, BFA) as either scalar or vector, and to extend
their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
Three of the 9 bits may also be used up for a 2nd Predicate (Twin
Predication), leaving a mere 6 bits for qualifying registers. As can
be seen there is significant pressure on these (and in fact all) SVP64
bits.

In Power ISA v3.1 prefixing there are bits which describe and classify
the prefix in a fashion that is independent of the suffix (MLS, for
example). For SVP64 there is insufficient space to make the SVP64
Prefix "self-describing", and consequently every single Scalar
instruction had to be individually analysed, by rote, to craft an
EXTRA Field Mapping. This process was semi-automated and is described
in this section. The final results, which are part of the SVP64
Specification, are here:

* [[openpower/opcode_regs_deduped]]

Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
from reading the markdown-formatted version of the Scalar pseudocode
which is machine-readable and found in [[openpower/isatables]]. The
analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
for example is given a designation `RM-2R-1W` because it requires two
GPR reads and one GPR write.

Secondly, the total number of registers was added up (2R-1W is 3
registers) and if less than or equal to three then that instruction
could be given an EXTRA3 designation. Four or more is given an EXTRA2
designation because there are only 9 bits available.

Thirdly, the instruction was analysed to see if Twin or Single
Predication was suitable. As a general rule this was if there was only
a single operand and a single result (`exts*` and LD/ST); however it
was found that some 2- or 3-operand instructions also qualify. Given
that 3 of the 9 bits of EXTRA had to be sacrificed for use in Twin
Predication, some compromises were made, here. LDST is Twin but also
has 3 operands in some operations, so only EXTRA2 can be used.

Fourthly, a packing format was decided: for 2R-1W, for example, an
EXTRA3 indexing was chosen whereby RA is indexed 0 (EXTRA bits 0-2),
RB indexed 1 (EXTRA bits 3-5) and RT indexed 2 (EXTRA bits 6-8). In
some cases (LD/ST with update) RA-as-a-source is given a **different**
EXTRA index from RA-as-a-result (because it is possible to do, and
perceived to be useful). Rc=1 co-results (CR0, CR1) are always given
the same EXTRA index as their main result (RT, FRT).

Fifthly, in an automated process the results of the analysis were
output in CSV Format for use in machine-readable form by
sv_analysis.py.

This process was laborious but logical, and, crucially, once a
decision is made (and ratified) cannot be reversed. Those qualifying
future Power ISA Scalar instructions for SVP64 are **strongly**
advised to utilise this same process and the same sv_analysis.py
program as a canonical method of maintaining the relationships.
Alterations to that same program which change the Designation are
**prohibited** once finalised (ratified through the Power ISA WG
Process). It would be similar to deciding that `add` should be changed
from X-Form to D-Form.
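To make the register-extension half of the mapping concrete, the
following is a small Python sketch of the INT/FP EXTRA decode as
described on the main [[sv/svp64]] page (the function name is
illustrative, not normative). A 5-bit field plus a 3-bit EXTRA3 spec
produces a 7-bit register number and a scalar/vector flag; EXTRA2 is
the same decode with its two bits occupying the top of the spec:

    # Sketch: extend a 5-bit GPR/FPR field using its EXTRA2/3 spec.
    # spec is a 3-bit value; its MSB selects scalar (0) or vector (1).
    def decode_extra(RA, spec):
        if spec & 0b100:
            # vector: RA supplies the high bits, spec the low 2 bits,
            # so RA alone reaches r0-r124 in increments of 4
            return (RA << 2) | (spec & 0b11), True
        else:
            # scalar: RA stays in the low bits, spec extends upwards,
            # preserving the v3.0B meaning when spec is zero
            return ((spec & 0b11) << 5) | RA, False

    print(decode_extra(1, 0b000))   # (1, False)  scalar r1, unchanged
    print(decode_extra(1, 0b100))   # (4, True)   vector starting at r4
    print(decode_extra(1, 0b101))   # (5, True)   vector starting at r5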
# Single Predication

This is a standard mode normally found in Vector ISAs. Every element
in every source Vector and in the destination uses the same bit of one
single predicate mask.

In SVSTATE, for Single-predication, implementors MUST increment both
srcstep and dststep, but depending on whether sz and/or dz are set,
srcstep and dststep can still potentially become different indices.
Only when sz=dz is srcstep guaranteed to equal dststep at all times.

Note that in some Mode Formats there is only one flag (zz). This
indicates that *both* sz *and* dz are set to the same value.

Example 1:

* VL=4
* mask=0b1101
* sz=1, dz=0

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 1 | 2 | sz=1 but dz=0: dst skips mask[1], src does not |
| 2 | 3 | mask[src=2] and mask[dst=3] are 1 |
| end | end | loop has ended because dst reached VL-1 |

Example 2:

* VL=4
* mask=0b1101
* sz=0, dz=1

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 1 | sz=0 but dz=1: src skips mask[1], dst does not |
| 3 | 2 | mask[src=3] and mask[dst=2] are 1 |
| end | end | loop has ended because src reached VL-1 |

In both these examples it is crucial to note that despite there being
a single predicate mask, with sz and dz being different, srcstep and
dststep are being requested to react differently.

Example 3:

* VL=4
* mask=0b1101
* sz=0, dz=0

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 2 | sz=0 and dz=0: both src and dst skip mask[1] |
| 3 | 3 | mask[src=3] and mask[dst=3] are 1 |
| end | end | loop has ended because src and dst reached VL-1 |

Here, both srcstep and dststep remain in lockstep because sz=dz=0.

# EXTRA Pack/Unpack Modes

The pack/unpack concept of VSX `vpack` is abstracted out as a
Sub-Vector reordering Schedule, named `RM-2P-1S1D-PU`. The usual
RM-2P-1S1D is reduced from EXTRA3 to EXTRA2, making room for 2 extra
bits that enable either "packing" or "unpacking" on the subvectors
vec2/3/4.

Illustrating a "normal" SVP64 operation with `SUBVL!=1` (assuming no
elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For pack/unpack (again, no elwidth overrides):

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_p(outer):
        if outer:
            for j in range(SUBVL):   # subvector index is the outer loop
                for i in range(VL):
                    yield i*SUBVL+j
        else:
            for i in range(VL):      # element index is the outer loop
                for j in range(SUBVL):
                    yield i*SUBVL+j

    # walk through both source and dest indices simultaneously
    for src_idx, dst_idx in zip(index_p(PACK), index_p(UNPACK)):
        move_operation(RT+dst_idx, RA+src_idx)

"yield" from python is used here for simplicity and clarity. The two
Finite State Machines for the generation of the source and destination
element offsets progress incrementally in lock-step.

Setting of both `PACK_en` and `UNPACK_en` is neither prohibited nor
`UNDEFINED` because the reordering is fully deterministic, and
additional REMAP reordering may be applied. For Matrix this would give
potentially up to 4 Dimensions of reordering.

Pack/Unpack applies to mv operations and some other single-source
single-destination operations such as Indexed LD/ST and extsw.
[[sv/mv.swizzle]] has a slightly different pseudocode algorithm for
Vertical-First Mode.
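Since the `index_p` generator above is directly executable, the
following quick demonstration (illustrative values VL=3, SUBVL=2 only)
shows the orderings it produces and the resulting pack behaviour:

    VL, SUBVL = 3, 2   # illustrative values only

    def index_p(outer):
        # same algorithm as the pseudocode above
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    yield i*SUBVL+j
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    yield i*SUBVL+j

    print(list(index_p(False)))  # [0, 1, 2, 3, 4, 5] plain element order
    print(list(index_p(True)))   # [0, 2, 4, 1, 3, 5] subelement-0s first

    # PACK=1, UNPACK=0: a vec2 Vector [x0 y0 x1 y1 x2 y2] is gathered
    # into [x0 x1 x2 y0 y1 y2]
    src = ["x0", "y0", "x1", "y1", "x2", "y2"]
    dst = [None] * (VL * SUBVL)
    for src_idx, dst_idx in zip(index_p(True), index_p(False)):
        dst[dst_idx] = src[src_idx]
    print(dst)                   # ['x0', 'x1', 'x2', 'y0', 'y1', 'y2']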
# Twin Predication

This is a novel concept that allows predication to be applied to a
single source and a single dest register. The following types of
traditional Vector operations may be encoded with it, *without
requiring explicit opcodes to do so*:

* VSPLAT (a single scalar distributed across a vector)
* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))

Those patterns (and more) may be applied to:

* mv (the usual way that V\* ISA operations are created)
* exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding those
  that take RA as both a src and dest. These are not 1-src 1-dest,
  they are 2-src, 1-dest)
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

This is a huge list that creates extremely powerful combinations,
particularly given that one of the predicate options is `(1<<r3)`.
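The following minimal Python sketch (illustrative only: it ignores
zeroing, elwidth overrides and the `(1<<r3)` option, and all names are
assumptions, not from the specification) shows how a single
twin-predicated `mv` yields several of the patterns above purely
through the choice of scalar/vector marking and the two masks:

    def twin_pred_mv(regs, rt, rt_vec, ra, ra_vec, dst_mask, src_mask, VL):
        """One twin-predicated mv: separate source and dest predicates."""
        srcstep = dststep = 0
        while srcstep < VL and dststep < VL:
            if ra_vec:  # vector source: skip masked-out elements (sz=0)
                while srcstep < VL and not (src_mask >> srcstep) & 1:
                    srcstep += 1
            if rt_vec:  # vector dest: skip masked-out elements (dz=0)
                while dststep < VL and not (dst_mask >> dststep) & 1:
                    dststep += 1
            if srcstep == VL or dststep == VL:
                break
            regs[rt + (dststep if rt_vec else 0)] = \
                regs[ra + (srcstep if ra_vec else 0)]
            if not rt_vec:
                break           # scalar destination: one write ends the loop
            srcstep += 1 if ra_vec else 0  # scalar source re-read: VSPLAT
            dststep += 1
        return regs

    regs = list(range(64))
    # VSPLAT: scalar r3 broadcast to four elements starting at r8
    twin_pred_mv(regs, rt=8, rt_vec=True, ra=3, ra_vec=False,
                 dst_mask=0b1111, src_mask=0b1111, VL=4)
    assert regs[8:12] == [3, 3, 3, 3]

    regs = list(range(64))
    # VCOMPRESS: active elements (mask 0b1010) of r16.v gathered to r0.v
    twin_pred_mv(regs, rt=0, rt_vec=True, ra=16, ra_vec=True,
                 dst_mask=0b1111, src_mask=0b1010, VL=4)
    assert regs[0:2] == [17, 19]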
# CR fields

Numbering relationships for CR fields are already complex due to being
in BE format (*the relationship is not clearly explained in the v3.0B
or v3.1 specification*). However with some care and consideration the
exact same mapping used for INT and FP regfiles may be applied, just
to the upper bits, as explained below.

The notation `CR{field number}` is used to indicate access to a
particular Condition Register Field (as opposed to the notation
`CR[bit]` which accesses one bit of the 32-bit Power ISA v3.0B
Condition Register). `CR{n}` refers to `CR0` when `n=0` and
consequently, for CR0-7, is defined, in v3.0B pseudocode, as:

    CR{7-n} = CR[32+n*4:35+n*4]

For SVP64 the relationship for the sequential numbering of elements is
to the CR **fields** within the CR Register, not to individual bits
within the CR register.

In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2)
select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
*in* that CR (EQ/LT/GT/SO). The numbering was determined (after 4
months of analysis and research) to be as follows:

    CR_index  = 7-(BA>>2)      # top 3 bits but BE
    bit_index = 3-(BA & 0b11)  # low 2 bits but BE
    CR_reg    = CR{CR_index}   # get the CR
    # finally get the bit from the CR.
    CR_bit    = (CR_reg & (1<<bit_index)) != 0

When it comes to applying SV, it is the CR\_reg number to which SV
EXTRA2/3 applies, **not** the CR\_bit portion (bits 3-4):

    if spec[0]:
       # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
       return ((BA >> 2)<<6) | # hi 3 bits shifted up
              (spec[1:2]<<4) | # to make room for these
              (BA & 0b11)      # CR_bit on the end
    else:
       # scalar constructs "00 spec[1:2] BA[0:4]"
       return (spec[1:2] << 5) | BA

Thus, for example, to access a given bit for a CR in SV mode, the
v3.0B algorithm to determine CR\_reg is modified as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
       # vector mode, 0-124 increments of 4
       CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
       # scalar mode, 0-32 increments of 1
       CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg    = CR{CR_index}  # get the CR
    # finally get the bit from the CR.
    CR_bit    = (CR_reg & (1<<bit_index)) != 0

Note that the decoding pattern for CR\_bit itself does not change.
With Rc=1, Vectorised operations therefore produce a *Vector* of CR
fields, one per element, each set from its own element result in the
usual way (`eq` set if the result is zero, `gt` set if the result is
> 0 ... etc).

If a "cumulated" CR-based analysis of results is desired (a la VSX
CR6) then a followup instruction must be performed, setting "reduce"
mode on the Vector of CRs, using cr ops (crand, crnor) to do so. This
provides far more flexibility in analysing vectors than standard
Vector ISAs. Normal Vector ISAs are typically restricted to "were all
results nonzero" and "were some results nonzero". The application of
mapreduce to Vectorised cr operations allows far more sophisticated
analysis, particularly in conjunction with the new crweird operations:
see [[sv/cr_int_predication]].

Note in particular that the use of a separate instruction in this way
ensures that high-performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.

Additionally, SVP64 [[sv/branches]] may be used, even when the branch
itself is to the following instruction. The combined side-effects of
CTR reduction and VL truncation provide several benefits. (See
[[discussion]], where some alternative schemes are described.)

## Rc=1 when SUBVL!=1

Sub-vectors are effectively a form of Packed SIMD (length 2 to 4).
Only 1 bit of predicate is allocated per subvector; likewise only one
CR is allocated per subvector. This leaves a conundrum as to how to
apply CR computation per subvector, when normally Rc=1 is exclusively
applied to scalar elements. A solution is to perform a bitwise OR or
AND of the subvector tests. Given that OE is ignored in SVP64, this
field may (when available) be used to select OR or AND behaviour.
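A minimal sketch of that proposed OR/AND combining, in Python; the
helper names and the dict representation of a CR field are
illustrative assumptions, not part of the specification:

    # Sketch (assumption): combine per-element Rc=1 tests across a
    # vec2/3/4 subvector into one CR field, selecting bitwise OR or
    # AND via the (otherwise ignored) OE field.

    def cr_test(result):
        """Standard Rc=1 style test of one element result."""
        return {"lt": result < 0, "gt": result > 0, "eq": result == 0}

    def combine_subvector_crs(results, use_or=True):
        """OR (or AND) the lt/gt/eq tests of one subvector into
        a single CR field: one CR per subvector."""
        op = any if use_or else all
        return {bit: op(cr_test(r)[bit] for r in results)
                for bit in ("lt", "gt", "eq")}

    # vec2 subvector: one element zero, one positive
    print(combine_subvector_crs([0, 5], use_or=True))   # eq and gt set
    print(combine_subvector_crs([0, 5], use_or=False))  # neither set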
"Non-SV" indicates that the operations with this Register Profile cannot be Vectorised (mtspr, bc, dcbz, twi) TODO generate table which will be here [[svp64/reg_profiles]] # SV pseudocode illilustration ## Single-predicated Instruction illustration of normal mode add operation: zeroing not included, elwidth overrides not included. if there is no predicate, it is set to all 1s function op_add(rd, rs1, rs2) # add not VADD! int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd); for (i = 0; i < VL; i++) STATE.srcoffs = i # save context if (predval & 1< # Assembly Annotation Assembly code annotation is required for SV to be able to successfully mark instructions as "prefixed". A reasonable (prototype) starting point: svp64 [field=value]* Fields: * ew=8/16/32 - element width * sew=8/16/32 - source element width * vec=2/3/4 - SUBVL * mode=mr/satu/sats/crpred * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne similar to x86 "rex" prefix. For actual assembler: sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s Qualifiers: * m={pred}: predicate mask mode * sm={pred}: source-predicate mask mode (only allowed in Twin-predication) * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4 * ew={N}: ew=8/16/32 - sets elwidth override * sw={N}: sw=8/16/32 - sets source elwidth override * ff={xx}: see fail-first mode * pr={xx}: see predicate-result mode * sat{x}: satu / sats - see saturation mode * mr: see map-reduce mode * mr.svm see map-reduce with sub-vector mode * crm: see map-reduce CR mode * crm.svm see map-reduce CR with sub-vector mode * sz: predication with source-zeroing * dz: predication with dest-zeroing For modes: * pred-result: - pm=lt/gt/le/ge/eq/ne/so/ns - RC1 mode * fail-first - ff=lt/gt/le/ge/eq/ne/so/ns - RC1 mode * saturation: - sats - satu * map-reduce: - mr OR crm: "normal" map-reduce mode or CR-mode. - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled # Parallel-reduction algorithm The principle of SVP64 is that SVP64 is a fully-independent Abstraction of hardware-looping in between issue and execute phases that has no relation to the operation it issues. Additional state cannot be saved on context-switching beyond that of SVSTATE, making things slightly tricky. Executable demo pseudocode, full version [here](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/test_preduce.py;hb=HEAD) ``` [[!inline raw="yes" pages="openpower/sv/preduce.py" ]] ``` This algorithm works by noting when data remains in-place rather than being reduced, and referring to that alternative position on subsequent layers of reduction. It is re-entrant. If however interrupted and restored, some implementations may take longer to re-establish the context. Its application by default is that: * RA, FRA or BFA is the first register as the first operand (ci index offset in the above pseudocode) * RB, FRB or BFB is the second (co index offset) * RT (result) also uses ci **if RA==RT** For more complex applications a REMAP Schedule must be used *Programmers's note: if passed a predicate mask with only one bit set, this algorithm takes no action, similar to when a predicate mask is all zero.* *Implementor's Note: many SIMD-based Parallel Reduction Algorithms are implemented in hardware with MVs that ensure lane-crossing is minimised. The mistake which would be catastrophic to SVP64 to make is to then limit the Reduction Sequence for all implementors based solely and exclusively on what one specific internal microarchitecture does. 
# Parallel-reduction algorithm

The principle of SVP64 is that it is a fully-independent Abstraction
of hardware-looping in between the issue and execute phases, one that
has no relation to the operation it issues. Additional state cannot be
saved on context-switching beyond that of SVSTATE, making things
slightly tricky.

Executable demo pseudocode, full version
[here](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/test_preduce.py;hb=HEAD)

```
[[!inline raw="yes" pages="openpower/sv/preduce.py" ]]
```

This algorithm works by noting when data remains in-place rather than
being reduced, and referring to that alternative position on
subsequent layers of reduction. It is re-entrant; if interrupted and
restored, however, some implementations may take longer to
re-establish the context.

Its application by default is that:

* RA, FRA or BFA is the first register as the first operand (ci index
  offset in the above pseudocode)
* RB, FRB or BFB is the second (co index offset)
* RT (result) also uses ci **if RA==RT**

For more complex applications a REMAP Schedule must be used.

*Programmer's note: if passed a predicate mask with only one bit set,
this algorithm takes no action, similar to when a predicate mask is
all zero.*

*Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
implemented in hardware with MVs that ensure lane-crossing is
minimised. The mistake which would be catastrophic for SVP64 is to
limit the Reduction Sequence for all implementors based solely and
exclusively on what one specific internal microarchitecture does. In
SIMD ISAs the internal SIMD Architectural design is exposed and
imposed on the programmer. Cray-style Vector ISAs on the other hand
provide convenient, compact and efficient encodings of abstract
concepts.*

**It is the Implementor's responsibility to produce a design that
complies with the above algorithm, utilising internal Micro-coding and
other techniques to transparently insert micro-architectural
lane-crossing Move operations if necessary or desired, to give the
level of efficiency or performance required.**

# Element-width overrides

Element-width overrides are best illustrated with a packed structure
union in the C programming language. The following should be taken
literally, and assume always a little-endian layout:

    typedef union {
        uint8_t  b[];
        uint16_t s[];
        uint32_t i[];
        uint64_t l[];
        uint8_t  actual_bytes[8];
    } el_reg_t;

    el_reg_t int_regfile[128];

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!reg.isvec):
            # not a vector: first element only, overwrites high bits
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

In effect the GPR registers r0 to r127 (and corresponding FPRs fp0 to
fp127) are reinterpreted to be "starting points" in a byte-addressable
memory. Vectors - which become just a virtual naming construct -
effectively overlap.

It is extremely important for implementors to note that the only
circumstance where upper portions of an underlying 64-bit register are
zero'd out is when the destination is a scalar. The ideal register
file has byte-level write-enable lines, just like most SRAMs, in order
to avoid READ-MODIFY-WRITE.

An example ADD operation with predication and element-width overrides:

    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication
            src1 = get_polymorphed_reg(RA, srcwid, irs1)
            src2 = get_polymorphed_reg(RB, srcwid, irs2)
            result = src1 + src2 # actual add here
            set_polymorphed_reg(RT, destwid, id, result)
            if (!RT.isvec) break
        if (RT.isvec)  { id += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }

A more complex example is a 3-in 2-out operation with a double-width
result, where the second, implicit destination is RS=RT+MAXVL (see the
DRAFT instruction links at the end of this section). With
element-width overrides the two halves of each result are stored as
follows:

    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication
            src1 = get_polymorphed_reg(RA, srcwid, irs1)
            src2 = get_polymorphed_reg(RB, srcwid, irs2)
            src3 = get_polymorphed_reg(RC, srcwid, irs3)
            result = src1 * src2 + src3 # double-width result
            # store both halves, the second at element offset MAXVL
            set_polymorphed_reg(RT, destwid, id, result & ((1<<destwid)-1))
            set_polymorphed_reg(RT, destwid, id+MAXVL, result>>destwid)
            if (!RT.isvec) break
        if (RT.isvec)  { id += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }
        if (RC.isvec)  { irs3 += 1; }

The significant part here is that the second half is stored starting
not from RT+MAXVL at all: it is the *element* index that is offset by
MAXVL, both halves actually starting from RT. If VL is 3, MAXVL is 5,
RT is 1, and dest elwidth is 32 then the elements RT0 to RT2 are
stored:

          0..31     32..63
    r0  unchanged  unchanged
    r1   RT0.lo     RT1.lo
    r2   RT2.lo    unchanged
    r3  unchanged   RT0.hi
    r4   RT1.hi     RT2.hi
    r5  unchanged  unchanged

Note that all of the LO halves start from r1, but that the HI halves
start from half-way into r3. The reason is that with MAXVL being 5 and
elwidth being 32, this is the 5th element offset (in 32-bit
quantities) counting from r1.

*Programmer's note: accessing registers that have been placed starting
on a non-contiguous boundary (half-way along a scalar register) can be
inconvenient: REMAP can provide an offset but it requires extra
instructions to set up. A simple solution is to ensure that MAXVL is
rounded up such that the Vector ends cleanly on a contiguous register
boundary. MAXVL=6 in the above example would achieve that.*
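To make the overlap concrete, here is a small self-contained Python
model of the byte-addressable register file described above (an
illustrative sketch only; the names are not from the specification):

    # Sketch: GPRs as "starting points" in byte-addressable storage.
    # 128 x 64-bit registers = one flat 1024-byte array.
    import struct

    regfile = bytearray(128 * 8)
    FMT = {8: "<B", 16: "<H", 32: "<I", 64: "<Q"}  # little-endian

    def set_elem(reg, bitwidth, offset, val):
        """Write element `offset` of width `bitwidth`, starting at `reg`."""
        step = bitwidth // 8
        struct.pack_into(FMT[bitwidth], regfile, reg * 8 + offset * step, val)

    def get_elem(reg, bitwidth, offset):
        step = bitwidth // 8
        return struct.unpack_from(FMT[bitwidth], regfile,
                                  reg * 8 + offset * step)[0]

    # a Vector of four 16-bit elements starting at r2 occupies exactly
    # the single underlying 64-bit register r2:
    for i in range(4):
        set_elem(2, 16, i, 0x1100 * (i + 1))
    print(hex(get_elem(2, 64, 0)))  # 0x4400330022001100: vectors overlap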
Additional DRAFT Scalar instructions in 3-in 2-out form with an
implicit 2nd destination:

* [[isa/svfixedarith]]
* [[isa/svfparith]]