[[!tag standards]]

# SV Load and Store

Links:

*
*
*
*
*
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires creation, from the scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and the one destination may be so marked.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride or
element stride.  With there being no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element-stride mode: immediate multiplied by element index
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit-stride mode: contiguous elements starting at RA+immed
          srcbase = ireg[RA]
          offs = immed + (i * op_width)
        elif RA.isvec:
          # quirky Vector-indexed mode, but with an immediate
          srcbase = ireg[RA+i]
          offs = immed
        else:
          # standard scalar mode (but predicated)
          srcbase = ireg[RA]
          offs = immed
        # compute EA, optionally update RA, then read the memory
        EA = srcbase + offs
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA]
        if (!RT.isvec)
          break # destination scalar - end immediately
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        # compute EA from the RA and RB elements, optionally update RA
        EA = ireg[RA+i] + ireg[RB+k]
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA]
        if (!RT.isvec)
          break # destination scalar - end immediately
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note that a strided LD, where the base is taken from RA and the stride is
taken from a register (RB) rather than from the immediate, would have the
following behaviour:

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations, where any kind of speculative probing of
memory pages is considered a risk, should take advantage of the fact that
implementations may truncate VL at any point, without requiring software
to be rewritten and made non-portable.  Such implementations may choose to
*always* set VL=1, which will have the effect of terminating any
speculative probing (and also adversely affect performance), but will at
least not require applications to be rewritten.

# LOAD/STORE Elwidths

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation to dest elwidth
* Saturated mode:
  - sign-extension or truncation from operation width to source elwidth
  - signed/unsigned saturation down to dest elwidth

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side is
treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.
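To make the ordering of these transformations concrete, below is a minimal
Python sketch of the non-saturated path only.  The helper names
(`byteswap`, `adjust_width`, `ld_elwidth_chain`) are illustrative
assumptions, not part of the specification, and the merging of `brev` with
MSR.LE as well as the saturated path are deliberately omitted:

    # Illustrative sketch only (not the normative pseudocode):
    # load at operation width, optional byteswap, then zero-extend or
    # truncate first to the source elwidth and then to the dest elwidth.

    def byteswap(value, width_bytes):
        """Reverse the byte order of a value of the given byte width."""
        return int.from_bytes(value.to_bytes(width_bytes, 'little'), 'big')

    def adjust_width(value, from_bits, to_bits):
        """Zero-extend or truncate a value from one bit-width to another."""
        return value & ((1 << min(from_bits, to_bits)) - 1)

    def ld_elwidth_chain(mem_bytes, op_width, src_elwidth, dest_elwidth,
                         bytereverse=False):
        # 1. read at the operation width (lb=1, lh=2, lw=4, ld=8 bytes)
        memread = int.from_bytes(mem_bytes[:op_width], 'little')
        # 2. optional byte-reversal at the operation width
        if bytereverse:
            memread = byteswap(memread, op_width)
        # 3. zero-extension/truncation to the source element width
        memread = adjust_width(memread, op_width * 8, src_elwidth)
        # 4. zero-extension/truncation to the destination element width
        return adjust_width(memread, src_elwidth, dest_elwidth)

    # e.g. a 4-byte (lw) read narrowed to a 16-bit source elwidth and
    # then to an 8-bit dest elwidth
    print(hex(ld_elwidth_chain(b'\x78\x56\x34\x12', 4, 16, 8)))  # 0x78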
It is unfortunately possible to request an elwidth override on the memory
side which does not mesh with the operation width: these result in
`UNDEFINED` behaviour.  The reason is that the effect of attempting a
64-bit `sv.ld` operation with a source elwidth override of 8/16/32 would
result in overlapping memory requests, particularly on unit and element
strided operations.  Thus it is `UNDEFINED` when the elwidth is smaller
than the memory operation width.  Examples include `sv.lw/sw=16/els`,
which requests (overlapping) 4-byte memory reads offset from each other at
2-byte intervals.  Store likewise is also `UNDEFINED` where the dest
elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as a
  "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as source and
  destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).  Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx bytereverse
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides a
way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements
worth of LDs or STs.  The usual interest in such re-mapping is for example
in separating out 24-bit RGB channel data into separate contiguous
registers.  NEON covers this as shown in the diagram below:

Remap easily covers this capability, and with dest elwidth overrides and
saturation may do so with built-in conversion that would normally require
additional width-extension, sign-extension and min/max Vectorised
instructions as post-processing stages.
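As an illustrative sketch only (not the normative [[sv/remap]] algorithm),
the kind of index remapping that separates packed 24-bit RGB data into
three contiguous register groups can be modelled as follows; `remap_rgb`
and the flat `regs` list are assumptions made purely for this example:

    # Software model of the remapping effect described above: gather
    # interleaved R,G,B bytes into three contiguous groups, mimicking
    # what a remapped sequence of element LDs would produce.

    def remap_rgb(mem, num_pixels):
        regs = [0] * (3 * num_pixels)      # stand-in for target registers
        for i in range(num_pixels):
            for c in range(3):             # 0=R, 1=G, 2=B
                # linear element index i*3+c is remapped so that all R
                # elements land first, then all G, then all B
                regs[c * num_pixels + i] = mem[i * 3 + c]
        return regs

    packed = bytes([10, 20, 30,  11, 21, 31,  12, 22, 32])  # 3 RGB pixels
    print(remap_rgb(packed, 3))  # [10, 11, 12, 20, 21, 22, 30, 31, 32]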
Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that, in immediate mode for LD, it is not clear from the
opcode alone that, if the destination register is Vectorised (`RT.v`) but
the source `imm(RA)` is scalar, the memory being read is *still a vector
load*, known as "unit or element stride".  This anomaly is made clear with
the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    sv.ld r#.v, ofst(r#).v    -> the whole vector is at ofst+r#
    sv.ld r#.v, ofst(r#.v)    -> r# is a vector of addresses
    similarly
    sv.ldx r#.v, r#, r#.v     -> whole vector at r#+r#
    whereas
    sv.ldx r#.v, r#.v, r#     -> vector of addresses

The point being, you take an operand with the "m" constraint (or other
memory-operand constraints), append `.v` to it, and you're done addressing
the in-memory vector, as in:

    asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));

(and ld%U1 got mangled into underline; %U expands to x if the address is a
sum of registers)

Permutations of vector selection, to identify the above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@     0+r#2   offs+(r#2+1)   offs+(r#2+2)
        destreg  r#      r#+1           r#+2

    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode

    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2  +0    +1    +2
        destreg  r#    r#+1  r#+2

        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2  +0   ... +offs ... +offs*2
        destreg  r#       r#+1      r#+2

    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

    indexed mode:

    RA,RB  RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s  RA.v  RB.v
    RA,RB  RT.s  RA.s  RB.v
    RA,RB  RT.s  RA.v  RB.s
    RA,RB  RT.s  RA.s  RB.s  not vectorised
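The immediate-mode permutations above can also be modelled executably.
The following Python sketch is illustrative only: the name
`sv_ld_addresses` and its parameters are assumptions, predication is
omitted, and the address formulas follow the `op_load` pseudocode modes
earlier in this page (vector of addresses, element stride, unit stride,
scalar):

    # Which effective addresses a sv.ld would access for each
    # scalar/vector combination of RT and RA (immediate mode only).

    def sv_ld_addresses(ra_vals, ofst, rt_is_vec, ra_is_vec, VL,
                        els=False, op_width=8):
        if not rt_is_vec and not ra_is_vec:
            return [ra_vals[0] + ofst]                   # not vectorised
        addrs = []
        for i in range(VL):
            if ra_is_vec:
                addrs.append(ra_vals[i] + ofst)          # vector of addresses
            elif els:
                addrs.append(ra_vals[0] + ofst * i)      # element stride
            else:
                addrs.append(ra_vals[0] + ofst + i * op_width)  # unit stride
            if not rt_is_vec:
                break    # scalar destination terminates after one element
        return addrs

    # RT.v, RA.s, unit stride: whole vector at ofst+RA
    print(sv_ld_addresses([0x1000], 0x10, True, False, VL=4))
    # -> [4112, 4120, 4128, 4136]  i.e. 0x1010, 0x1018, 0x1020, 0x1028

    # RT.v, RA.s, element stride: vector at ofst*elidx+RA
    print(sv_ld_addresses([0x1000], 0x10, True, False, VL=4, els=True))
    # -> [4096, 4112, 4128, 4144]  i.e. 0x1000, 0x1010, 0x1020, 0x1030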