X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fldst.mdwn;h=ca67a4f004ba1461da6fd59f5613fa0a9168961e;hb=0a873ae35b9633a01c0090a30d919cc8c979cdfc;hp=4b106928e296a20e93a5ffbe8b02ef7d320345e0;hpb=5be5938b00a48f4e49c23fe94d63de9861283ae1;p=libreriscv.git diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn index 4b106928e..ca67a4f00 100644 --- a/openpower/sv/ldst.mdwn +++ b/openpower/sv/ldst.mdwn @@ -9,6 +9,31 @@ Links: * * * +* [[simple_v_extension/specification/ld.x]] + +# Rationale + +All Vector ISAs dating back fifty years have extensive and comprehensive +Load and Store operations that go far beyond the capabilities of Scalar +RISC or CISC processors, yet at their heart, on an individual element +basis, they may be found to be no different from RISC Scalar equivalents. + +The resource savings from Vector LD/ST are significant and stem from +the fact that one single instruction can trigger a dozen (or, in some +microarchitectures such as Cray or NEC SX Aurora, hundreds) of element-level Memory accesses. + +Additionally, and simply: if the Arithmetic side of an ISA supports +Vector Operations, then in order to keep the ALUs 100% occupied the +Memory infrastructure (and the ISA itself) correspondingly needs Vector +Memory Operations as well. + +Vectorised Load and Store also presents an extra dimension (literally) +which creates scenarios unique to Vector applications, that a Scalar +(and even a SIMD) ISA simply never encounters. SVP64 endeavours to +add such modes without changing the behaviour of the underlying Base +(Scalar) v3.0B operations. 
+ +# Modes overview Vectorisation of Load and Store requires creation, from scalar operations, a number of different modes: @@ -16,9 +41,14 @@ a number of different modes: * fixed stride (contiguous sequence with no gaps) aka "unit" stride * element strided (sequential but regularly offset, with gaps) * vector indexed (vector of base addresses and vector of offsets) -* fail-first on the same (where it makes sense to do so) +* Speculative fail-first (where it makes sense to do so) * Structure Packing (covered in SV by [[sv/remap]]). +Also included in SVP64 LD/ST are both signed and unsigned Saturation, +as well as Element-width overrides and Twin-Predication. + +# Vectorisation of Scalar Power ISA v3.0B + OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and [[isa/fixedstore]] pseudocode to be of the form: @@ -50,13 +80,13 @@ with the pseudocode below, the immediate can be used to give unit stride or elem if (RA.isvec) while (!(ps & 1< +## LD/ST ffirst + +LD/ST ffirst treats the first LD/ST in a vector (element 0) as an +ordinary one. Exceptions occur "as normal". However, for elements 1 +and above, if an exception would occur, then VL is **truncated** to the +previous element: the exception is **not** then raised because the +LD/ST was effectively speculative. + +ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk because it can be abused to probe multiple pages in rapid succession and obtain feedback on which pages would fail. Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit is instead used for element-strided LD/ST. See + + for(i = 0; i < VL; i++) + reg[rt + i] = mem[reg[ra] + i * reg[rb]]; + +High-security implementations, where any kind of speculative probing +of memory pages is considered a risk, should take advantage of the fact that +implementations may truncate VL at any point, without requiring software +to be rewritten and made non-portable. 
Such implementations may choose +to *always* set VL=1 which will have the effect of terminating any +speculative probing (and also adversely affect performance), but will +at least not require applications to be rewritten. + +Simpler, low-performance hardware implementations may +also choose to (always) set VL=1 as the bare minimum compliant implementation of +LD/ST Fail-First. It is however critically important to remember that +the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. +**MUST** raise exceptions exactly like an ordinary LD/ST. + +For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, so that the next iteration of a loop begins its ffirst LD/ST operations on an aligned boundary +such as the beginning of a cache line, or beginning of a Virtual Memory +page. VL may likewise be truncated to reduce workloads or balance resources. + +Vertical-First Mode is slightly strange in that only one element +at a time is ever executed anyway. Given that programmers may +legitimately choose to alter srcstep and dststep in non-sequential +order as part of explicit loops, it is neither possible nor +safe to make speculative assumptions about future LD/STs. +Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`. +This is very different from Arithmetic (Data-dependent) FFirst +where Vertical-First Mode is deterministic, not speculative. + +# LOAD/STORE Elwidths Loads and Stores are almost unique in that the OpenPOWER Scalar ISA provides a width for the operation (lb, lh, lw, ld). Only `extsb` and @@ -235,6 +334,17 @@ is treated effectively as completely separate and distinct from SV augmentation. This is primarily down to quirks surrounding LE/BE and byte-reversal in OpenPOWER. 
+It is unfortunately possible to request an elwidth override on the memory side which +does not mesh with the operation width: this results in `UNDEFINED` +behaviour. The reason is that attempting a 64-bit `sv.ld` +operation with a source elwidth override of 8/16/32 would result in +overlapping memory requests, particularly on unit and element strided +operations. Thus it is `UNDEFINED` when the elwidth is smaller than +the memory operation width. Examples include `sv.lw/sw=16/els` which +requests (overlapping) 4-byte memory reads offset from +each other at 2-byte intervals. Store is likewise `UNDEFINED` +where the dest elwidth override is less than the operation width. + Note the following regarding the pseudocode to follow: * `scalar identity behaviour` SV Context parameter conditions turn this @@ -278,7 +388,6 @@ and other modes have all been removed, for clarity and simplicity: if (bytereverse): memread = byteswap(memread, op_width) - # check saturation. if svpctx.saturation_mode: ... saturation adjustment...
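
The overlap that makes a too-small memory-side elwidth `UNDEFINED` (the `sv.lw/sw=16/els` case in the paragraph added by this patch) can be sketched in Python. This is only an illustrative model, not the specification's pseudocode: the helper names `element_reads` and `has_overlap`, the base address, and the assumption that consecutive elements are spaced exactly one elwidth apart are all hypothetical.

```python
# Illustrative model only (not the SVP64 pseudocode): each element issues a
# read of op_width_bytes, and consecutive elements are assumed to be spaced
# elwidth_bytes apart, as in the sv.lw/sw=16/els example.

def element_reads(base, op_width_bytes, elwidth_bytes, vl):
    """Return one (start, end) byte range per element, end exclusive."""
    return [
        (base + i * elwidth_bytes, base + i * elwidth_bytes + op_width_bytes)
        for i in range(vl)
    ]


def has_overlap(ranges):
    """True if any two consecutive reads share at least one byte."""
    return any(
        prev_end > start
        for (_, prev_end), (start, _) in zip(ranges, ranges[1:])
    )


# 4-byte reads (lw) at 2-byte intervals (elwidth=16): every read overlaps
# the next one by two bytes.
narrow = element_reads(base=0x1000, op_width_bytes=4, elwidth_bytes=2, vl=4)
print([(hex(s), hex(e)) for s, e in narrow], has_overlap(narrow))

# 4-byte reads at 4-byte intervals: contiguous, no overlap.
matched = element_reads(base=0x1000, op_width_bytes=4, elwidth_bytes=4, vl=4)
print([(hex(s), hex(e)) for s, e in matched], has_overlap(matched))
```

Under this model any spacing smaller than the operation width produces overlapping requests, which matches the rule stated in the patch: the case is `UNDEFINED` whenever the elwidth is smaller than the memory operation width.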