From: Luke Kenneth Casson Leighton
Date: Sun, 2 Apr 2023 17:06:03 +0000 (+0100)
Subject: remove LD/ST from ls010 because it is added via pandoc explicitly
X-Git-Tag: opf_rfc_ls012_v1~180
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=3c7b3a084258697aba8c11b4c375adf4e8fb8401;p=libreriscv.git

remove LD/ST from ls010 because it is added via pandoc explicitly
---

diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn
index ff3b8a558..eabb471b1 100644
--- a/openpower/sv/ldst.mdwn
+++ b/openpower/sv/ldst.mdwn
@@ -1,5 +1,3 @@
-[[!tag standards]]
-
 # SV Load and Store
 
 Links:
@@ -660,4 +658,6 @@ REMAP will need to be used.
 
 --------
 
+[[!tag standards]]
+
 \newpage{}
diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn
index b2731eb3e..acdba62ab 100644
--- a/openpower/sv/rfc/ls010.mdwn
+++ b/openpower/sv/rfc/ls010.mdwn
@@ -1315,652 +1315,6 @@ different: elements that fail the CR test *or* are masked out are zero'd.
 \newpage{}
 
-# SV Load and Store
-
-**Rationale**
-
-All Vector ISAs dating back fifty years have extensive and comprehensive
-Load and Store operations that go far beyond the capabilities of Scalar
-RISC and most CISC processors, yet, at their heart, on an individual
-element basis they may be found to be no different from RISC Scalar
-equivalents.
-
-The resource savings from Vector LD/ST are significant and stem
-from the fact that one single instruction can trigger a dozen (or, in
-some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
-element-level Memory accesses.
-
-Additionally, and simply: if the Arithmetic side of an ISA supports
-Vector Operations, then in order to keep the ALUs 100% occupied the
-Memory infrastructure (and the ISA itself) correspondingly needs Vector
-Memory Operations as well.
-
-Vectorised Load and Store also presents an extra dimension (literally),
-creating scenarios unique to Vector applications that a Scalar (and
-even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
-modes typically found in *all* Scalable Vector ISAs, without changing the
-behaviour of the underlying Base (Scalar) v3.0B operations in any way.
-(The sole apparent exception is Post-Increment Mode on LD/ST-update
-instructions.)
-
-## Modes overview
-
-Vectorisation of Load and Store requires the creation, from scalar
-operations, of a number of different modes:
-
-* **fixed aka "unit" stride** - contiguous sequence with no gaps
-* **element strided** - sequential but regularly offset, with gaps
-* **vector indexed** - vector of base addresses and vector of offsets
-* **Speculative fail-first** - where it makes sense to do so
-* **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
-
-*Despite being constructed from Scalar LD/ST, none of these Modes exist
-or make sense in any Scalar ISA. They **only** exist in Vector ISAs.*
-
-Also included in SVP64 LD/ST is both signed and unsigned Saturation,
-as well as Element-width overrides and Twin-Predication.
-
-Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
-LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
-LD/ST-Indexed should not be conflated with Indexed REMAP mode:
-clarification is provided below.
-
-**Determining the LD/ST Modes**
-
-A minor complication (caused by the retro-fitting of modern Vector
-features to a Scalar ISA) is that certain features do not exactly make
-sense or are considered a security risk. Fail-first on Vector Indexed
-would allow attackers to probe large numbers of pages from userspace,
-whereas strided fail-first (by creating contiguous sequential LDs)
-does not.
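-
-The distinction can be seen from the addresses each mode generates. The
-following Python sketch is illustrative only (it is not specification
-pseudocode, and `stride_eas`/`indexed_eas` are hypothetical helper
-names): element-strided EAs form one predictable contiguous sequence,
-whereas Vector Indexed EAs may touch arbitrary pages:
-
-```
-# illustrative sketch: pages touched by element-strided LD (contiguous,
-# predictable) versus Vector Indexed LD (arbitrary, hence the probe risk)
-PAGE = 4096
-
-def stride_eas(ra, imm, vl):
-    # element-strided: EA = RA + i*imm, a predictable sequence
-    return [ra + i * imm for i in range(vl)]
-
-def indexed_eas(ra, rb_vec):
-    # vector indexed: EA = RA + RB[i], each element may be anywhere
-    return [ra + off for off in rb_vec]
-
-pages = lambda eas: sorted({ea // PAGE for ea in eas})
-print(pages(stride_eas(0x10000, 8, 8)))                   # [16]: one page
-print(pages(indexed_eas(0x10000, [0, 1 << 20, 1 << 30]))) # scattered
-```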
-
-In addition, reduce mode makes no sense for LD/ST. Realistically we
-need an alternative table definition for [[sv/svp64]] `RM.MODE`. The
-following modes make sense:
-
-* saturation
-* predicate-result (mostly for cache-inhibited LD/ST)
-* simple (no augmentation)
-* fail-first (where Vector Indexed is banned)
-* Signed Effective Address computation (Vector Indexed only)
-
-More than that, however, it is necessary to fit the usual Vector ISA
-capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
-Indexed. They present subtly different Mode tables, which, due to lack
-of space, have the following quirks:
-
-* LD/ST Immediate has no individual control over src/dest zeroing,
-  whereas LD/ST Indexed does.
-* LD/ST Indexed has limited zeroing on pred-result; LD/ST Immediate has
-  *no* option to select zeroing on pred-result.
-
-## Format and fields
-
-Fields used in the tables below:
-
-* **sz / dz** if predication is enabled will put zeros into the dest
-  (or into the src in the case of twin pred) when the predicate bit is
-  zero. Otherwise the element is ignored or skipped, depending on context.
-* **zz**: both sz and dz are set equal to this flag.
-* **inv CR bit** just as in branches (BO), these bits allow testing of
-  a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **N** sets signed/unsigned saturation.
-* **RC1** as if Rc=1, stores CRs *but not the result*
-* **SEA** - Signed Effective Address: if enabled, performs sign-extension
-  on registers that have been reduced due to elwidth overrides
-* **PI** - post-increment mode (applies to LD/ST with update only).
-  The Effective Address utilised is always just RA, i.e. the computation
-  of EA is stored in RA **after** it is actually used.
-* **LF** - Load/Store Fail or Fault First: for any reason Load or Store
-  Vectors may be truncated to (at least) one element, and VL altered to
-  indicate such.
-
-**LD/ST immediate**
-
-The table for [[sv/svp64]] for `immed(RA)`, which is `RM.MODE`
-(bits 19:23 of `RM`), is:
-
-| 0-1 | 2   | 3 4     | description                    |
-| --- | --- |---------|--------------------------------|
-| 00  | 0   | zz els  | simple mode                    |
-| 00  | 1   | PI LF   | post-increment and Fault-First |
-| 01  | inv | CR-bit  | Rc=1: ffirst CR sel            |
-| 01  | inv | els RC1 | Rc=0: ffirst z/nonz            |
-| 10  | N   | zz els  | sat mode: N=0/1 u/s            |
-| 11  | inv | CR-bit  | Rc=1: pred-result CR sel       |
-| 11  | inv | els RC1 | Rc=0: pred-result z/nonz       |
-
-The `els` bit is only relevant when `RA.isvec` is clear: it indicates
-whether the stride is unit or element:
-
-```
- if RA.isvec:
-     svctx.ldstmode = indexed
- elif els == 0:
-     svctx.ldstmode = unitstride
- elif immediate != 0:
-     svctx.ldstmode = elementstride
-```
-
-An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
-the element index is multiplied by a zero immediate-offset, resulting
-in every element reading from the exact same memory location, *even
-with a Vector register*. (Normally this type of behaviour is reserved
-for the mapreduce modes.)
-
-For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
-the once and be copied, rather than hitting the Data Cache multiple
-times with the same memory read at the same location. The benefit of
-Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
-to have multiple data values read in quick succession and stored in
-sequentially numbered registers (but, see Note below).
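-
-As a minimal Python sketch (illustrative only, not specification
-pseudocode; `mem_read` and `set_reg` are hypothetical helpers), the
-degenerate element-stride calculation behind `LD-VSPLAT` is:
-
-```
-# with imm == 0 every element's EA is identical: LD-VSPLAT
-def ld_elementstride(ra_base, imm, vl, op_width, mem_read, set_reg):
-    for i in range(vl):
-        ea = ra_base + i * imm   # imm == 0 => same EA for every element
-        set_reg(i, mem_read(ea, op_width))
-
-# non-cache-inhibited hardware may legally read MEM(ra_base) once and
-# broadcast it; cache-inhibited hardware must perform vl discrete reads
-```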
-
-For non-cache-inhibited ST from a vector source onto a scalar destination:
-with the Vector loop effectively creating multiple memory writes to
-the same location, we can deduce that the last of these will be the
-"successful" one. Thus, implementations are free and clear to optimise
-out the overwriting STs, leaving just the last one as the "winner".
-Bear in mind that predicate masks will skip some elements (in source
-non-zeroing mode). Cache-inhibited ST operations, on the other hand,
-**MUST** write out a Vector source multiple successive times to the exact
-same Scalar destination. Just like Cache-inhibited LDs, multiple values
-may be written out in quick succession to a memory-mapped peripheral
-from sequentially-numbered registers.
-
-Note that any memory location may be Cache-inhibited
-(Power ISA v3.1, Book III, 1.6.1, p1033).
-
-*Programmer's Note: a "VSPLAT" mode for LD/ST-immediate with a Scalar
-source is simply not possible: there are not enough Mode bits. One
-single Scalar Load operation may be used instead, followed by any
-arithmetic operation (including a simple mv) in "Splat" mode.*
-
-**LD/ST Indexed**
-
-The modes for the `RA+RB` indexed version are slightly different
-but use the same `RM.MODE` bits (19:23 of `RM`):
-
-| 0-1 | 2   | 3 4    | description                  |
-| --- | --- |--------|------------------------------|
-| 00  | SEA | dz sz  | simple mode                  |
-| 01  | SEA | dz sz  | Strided (scalar only source) |
-| 10  | N   | dz sz  | sat mode: N=0/1 u/s          |
-| 11  | inv | CR-bit | Rc=1: pred-result CR sel     |
-| 11  | inv | zz RC1 | Rc=0: pred-result z/nonz     |
-
-Vector Indexed Strided Mode is qualified as follows:
-
-    if mode = 0b01 and !RA.isvec and !RB.isvec:
-        svctx.ldstmode = elementstride
-
-A summary of the effect of Vectorisation of src or dest:
-
-```
- imm(RA)  RT.v  RA.v       no stride allowed
- imm(RA)  RT.s  RA.v       no stride allowed
- imm(RA)  RT.v  RA.s       stride-select allowed
- imm(RA)  RT.s  RA.s       not vectorised
- RA,RB    RT.v  {RA|RB}.v  Standard Indexed
- RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
- RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
- RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)
-```
-
-Signed Effective Address computation is only relevant for Vector Indexed
-Mode, when elwidth overrides are applied. The source override applies to
-RB: if SEA is set, RB is sign-extended from elwidth bits to the full 64
-bits before being added to RA in order to calculate the Effective
-Address. For other Modes (ffirst, saturate), all EA computation with
-elwidth overrides is unsigned.
-
-Note that cache-inhibited LD/ST, when VSPLAT is activated, will perform
-**multiple** LD/ST operations, sequentially. Even with a scalar src
-a Cache-inhibited LD will read the same memory location *multiple
-times*, storing the result in successive Vector destination registers.
-This is because the cache-inhibit instructions are typically used to read
-and write memory-mapped peripherals. If a genuine cache-inhibited
-LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
-be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
-value into multiple register destinations.
-
-Note also that cache-inhibited VSPLAT with Predicate-result is possible.
-This allows, for example, issuing a massive batch of memory-mapped
-peripheral reads, stopping at the first NULL-terminated character and
-truncating VL to that point. No branch is needed to issue that large
-burst of LDs, which may be valuable in Embedded scenarios.
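-
-A Python sketch of that scenario (illustrative only, not specification
-pseudocode; `peripheral_read` and `set_reg` are hypothetical helpers):
-
-```
-# cache-inhibited VSPLAT with Predicate-result: VL reads are issued
-# from the same peripheral location, and VL is truncated at the first
-# zero (NULL) byte, with no branch inside the issued burst
-def ci_vsplat_predresult(ea, vl, peripheral_read, set_reg):
-    for i in range(vl):
-        data = peripheral_read(ea)   # same location, read repeatedly
-        set_reg(i, data)
-        if data == 0:                # first NULL terminates
-            return i + 1             # truncated VL
-    return vl
-```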
-
-## Vectorisation of Scalar Power ISA v3.0B
-
-Scalar Power ISA Load/Store operations may be seen from their
-pseudocode to be of the form:
-
-```
- lbux RT, RA, RB
- EA <- (RA) + (RB)
- RT <- MEM(EA)
-```
-
-and for immediate variants:
-
-```
- lb RT,D(RA)
- EA <- RA + EXTS(D)
- RT <- MEM(EA)
-```
-
-Thus in the first example, the source registers may each be independently
-marked as scalar or vector, and likewise the destination; in the second
-example only the one source and one dest may be marked as scalar or
-vector.
-
-From this we can see that Vector Indexed may be covered, and, as
-demonstrated with the pseudocode below, the immediate can be used to give
-unit stride or element stride. With there being no way to tell which from
-the Power v3.0B Scalar opcode alone, the choice is provided instead by
-the SV Context.
-
-```
- # LD not VLD! format: ldop RT, immed(RA)
- # op_width: lb=1, lh=2, lw=4, ld=8
- op_load(RT, RA, op_width, immed, svctx, RAupdate):
-   ps = get_pred_val(FALSE, RA); # predication on src
-   pd = get_pred_val(FALSE, RT); # ... AND on dest
-   for (i=0, j=0, u=0; i < VL && j < VL;):
-     # skip non-predicated elements
-     if (RA.isvec) while (!(ps & 1<<i)) i++;
-     if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
-     if (RT.isvec) while (!(pd & 1<<j)) j++;
-     # set up srcbase and offs according to the LD/ST mode
-     if svctx.ldstmode == elementstride:
-         srcbase = ireg[RA]
-         offs = i * immed               # gaps of immed between elements
-     elif svctx.ldstmode == unitstride:
-         srcbase = ireg[RA]
-         offs = immed + (i * op_width)  # contiguous elements
-     elif RA.isvec:
-         srcbase = ireg[RA+i]           # vector of base addresses
-         offs = immed
-     else:
-         srcbase = ireg[RA]             # scalar base (predicated);
-         offs = immed                   # immed of zero gives LD-VSPLAT
-     # compute EA, optionally write it back (update form), and load
-     EA = srcbase + offs
-     if RAupdate: ireg[RAupdate+u] = EA
-     ireg[RT+j] <= MEM(EA, op_width)
-     if (!RT.isvec)
-         break # destination scalar, end now
-     if (RA.isvec) i++;
-     if (RAupdate.isvec) u++;
-     if (RT.isvec) j++;
-```
-
-Data-Dependent Fail-First provides a speculative "pointer-chase"
-capability: each element loads the address of the next, and VL is
-truncated at the first element whose loaded data meets the CR test
-(here, a NULL pointer terminating a linked list):
-
-```
- # assume VL=8; RA holds the first ptr, RT overlaps RA by one element
- # imm = offset_of(ptr->next)
- for i in range(VL):
-     EA = GPR(RA+i) + imm # ptr + offset(next)
-     data = MEM(EA, 8) # 64-bit address of ptr->next
-     GPR(RT+i) = data # happens to be read on next loop!
-     # was a normal ld up to this point. now the Data-Fail-First
-     CR.field(i) = conditions(data)
-     if CR.field(i).EQ == testbit: # check if zero
-         if VLI then VL = i+1 # update VL, inclusive
-         else VL = i # update VL
-         break # stop looping
-```
-
-## LOAD/STORE Elwidths
-
-Loads and Stores are almost unique in that the Power Scalar ISA
-provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
-others like it provide an explicit operation width. There are therefore
-*three* widths involved:
-
-* operation width (lb=8, lh=16, lw=32, ld=64)
-* src element width override (8/16/32/default)
-* destination element width override (8/16/32/default)
-
-Some care is therefore needed to express and make clear the
-transformations, which are expressly in this order:
-
-* Calculate the Effective Address from RA at full width,
-  but (on Indexed Load) allow srcwidth overrides on RB
-* Load at the operation width (lb/lh/lw/ld) as usual
-* byte-reversal as usual
-* Non-saturated mode:
-  - zero-extension or truncation from operation width to dest elwidth
-  - place result in destination at dest elwidth
-* Saturated mode:
-  - sign-extension or truncation from operation width to dest width
-  - signed/unsigned saturation down to dest elwidth
-
-In order to respect Power v3.0B Scalar behaviour the memory side
-is treated effectively as completely separate and distinct from SV
-augmentation. This is primarily down to quirks surrounding LE/BE and
-byte-reversal.
-
-It is rather unfortunately possible to request an elwidth override on
-the memory side which does not mesh with the overridden operation width:
-these result in `UNDEFINED` behaviour. The reason is that the effect
-of attempting a 64-bit `sv.ld` operation with a source elwidth override
-of 8/16/32 would result in overlapping memory requests, particularly
-on unit and element strided operations. Thus it is `UNDEFINED` when
-the elwidth is smaller than the memory operation width. Examples include
-`sv.lw/sw=16/els`, which requests (overlapping) 4-byte memory reads
-offset from each other at 2-byte intervals. Store likewise is also
-`UNDEFINED` where the dest elwidth override is less than the operation
-width.
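-
-The overlap can be seen numerically. In this Python sketch (illustrative
-only, not specification pseudocode), four 4-byte reads stepped at 2-byte
-intervals each overlap the next by two bytes:
-
-```
-# why sv.lw/sw=16/els is UNDEFINED: a 4-byte operation width stepped at
-# 2-byte (elwidth=16) element-strided intervals gives overlapping reads
-op_width = 4    # lw: 4-byte memory operation
-elwidth = 2     # 16-bit source elwidth override, in bytes
-base = 0x1000
-reads = [(base + i*elwidth, base + i*elwidth + op_width) for i in range(4)]
-print(reads)  # [(4096, 4100), (4098, 4102), (4100, 4104), (4102, 4106)]
-```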
-
-Note the following regarding the pseudocode to follow:
-
-* `scalar identity behaviour` SV Context parameter conditions turn this
-  into a straight, absolute, fully-compliant Scalar v3.0B LD operation
-* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
-  rather than `ld`)
-* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
-  a "normal" part of Scalar v3.0B LD
-* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
-  as a "normal" part of Scalar v3.0B LD
-* `svctx` specifies the SV Context and includes VL as well as
-  source and destination elwidth overrides.
-
-Below is the pseudocode for Unit-Strided LD (which includes Vector
-capability). Observe in particular that RA, as the base address in both
-Immediate and Indexed LD/ST, does not have element-width overriding
-applied to it.
-
-Note that predication, predication-zeroing, and all other modes except
-saturation have been removed, for clarity and simplicity:
-
-```
- # LD not VLD!
- # this covers unit stride mode and a type of vector offset
- function op_ld(RT, RA, op_width, imm_offs, svctx)
-   for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
-     if not svctx.unit/el-strided:
-         # strange vector mode, compute 64 bit address which is
-         # not polymorphic! elwidth hardcoded to 64 here
-         srcbase = get_polymorphed_reg(RA, 64, i)
-     else:
-         # unit / element stride mode, compute 64 bit address
-         srcbase = get_polymorphed_reg(RA, 64, 0)
-         # adjust for unit/el-stride
-         srcbase += ....
-
-     # read the underlying memory
-     memread <= MEM(srcbase + imm_offs, op_width)
-
-     # check saturation
-     if svctx.saturation_mode:
-         # ... saturation adjustment...
-         memread = clamp(memread, op_width, svctx.dest_elwidth)
-     else:
-         # truncate/extend to over-ridden dest width
-         memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
-
-     # takes care of inserting memory-read (now correctly byteswapped)
-     # into regfile underlying LE-defined order, into the right place
-     # within the NEON-like register, respecting destination element
-     # bitwidth, and the element index (j)
-     set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
-
-     # increments both src and dest element indices (no predication here)
-     i++;
-     j++;
-```
-
-Note above that the source elwidth is *not used at all* in LD-immediate.
-
-For LD/Indexed, the key is that in the calculation of the Effective
-Address, RA has no elwidth override but RB does. The pseudocode below is
-simplified for clarity: predication and all modes except saturation are
-removed:
-
-```
- # LD not VLD! ld*rx if brev else ld*
- function op_ld(RT, RA, RB, op_width, svctx, brev)
-   for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
-     if not svctx.el-strided:
-         # RA not polymorphic! elwidth hardcoded to 64 here
-         srcbase = get_polymorphed_reg(RA, 64, i)
-     else:
-         # element stride mode, again RA not polymorphic
-         srcbase = get_polymorphed_reg(RA, 64, 0)
-     # RB *is* polymorphic
-     offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
-     # sign-extend
-     if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)
-
-     # takes care of (merges) processor LE/BE and ld/ldbrx
-     bytereverse = brev XNOR MSR.LE
-
-     # read the underlying memory
-     memread <= MEM(srcbase + offs, op_width)
-
-     # optionally performs byteswap at op width
-     if (bytereverse):
-         memread = byteswap(memread, op_width)
-
-     if svctx.saturation_mode:
-         # ... saturation adjustment...
-         memread = clamp(memread, op_width, svctx.dest_elwidth)
-     else:
-         # truncate/extend to over-ridden dest width
-         memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
-
-     # takes care of inserting memory-read (now correctly byteswapped)
-     # into regfile underlying LE-defined order, into the right place
-     # within the NEON-like register, respecting destination element
-     # bitwidth, and the element index (j)
-     set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
-
-     # increments both src and dest element indices (no predication here)
-     i++;
-     j++;
-```
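-
-The `bytereverse = brev XNOR MSR.LE` merge above can be sanity-checked in
-isolation with a small Python sketch (illustrative only): the byteswap is
-applied exactly when `brev` equals `MSR.LE`.
-
-```
-# merges ld/ldbrx selection (brev) with processor endianness (MSR.LE):
-# a byteswap is needed when the two agree (XNOR)
-def needs_byteswap(brev: bool, msr_le: bool) -> bool:
-    return not (brev ^ msr_le)   # XNOR
-
-for brev in (False, True):
-    for le in (False, True):
-        print(brev, le, needs_byteswap(brev, le))
-# ld (brev=False) on LE: no swap; ldbrx on LE: swap; and vice-versa on BE
-```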
-
-## Remapped LD/ST
-
-In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
-it is expensive to set up (two 64-bit opcodes minimum), it provides a way
-to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements'
-worth of LDs or STs. The usual interest in such re-mapping is, for
-example, in separating out 24-bit RGB channel data into separate
-contiguous registers.
-
-REMAP easily covers this capability, and with dest elwidth overrides
-and saturation may do so with built-in conversion that would normally
-require additional width-extension, sign-extension and min/max Vectorised
-instructions as post-processing stages.
-
-Thus we do not need to provide specialist LD/ST "Structure Packed"
-opcodes, because the generic abstracted concept of "Remapping", when
-applied to LD/ST, will give that same capability, with far more
-flexibility.
-
-It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
-established through `svstep`, are also an easy way to perform regular
-Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
-REMAP will need to be used.
-
---------
-
-\newpage{}
-
 # Condition Register SVP64 Operations
 
 Condition Register Fields are only 4 bits wide: this presents some
@@ -2971,3 +2325,8 @@ more an actual Vector ISA Branch and as such is not at all appropriate:
 ```
 
 [[!tag opf_rfc]]
+
+--------
+
+\newpage{}
+