From: Luke Kenneth Casson Leighton
Date: Mon, 3 Apr 2023 10:38:35 +0000 (+0100)
Subject: remove duplicate parts of ls010.mdwn that were moved back to normal.mdwn
X-Git-Tag: opf_rfc_ls012_v1~163
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=c86ba5897557426ae02d6219f2e074ef10663ec5;p=libreriscv.git

remove duplicate parts of ls010.mdwn that were moved back to normal.mdwn
---

diff --git a/openpower/sv/rfc/Makefile b/openpower/sv/rfc/Makefile
index 55c08588b..62c80262f 100644
--- a/openpower/sv/rfc/Makefile
+++ b/openpower/sv/rfc/Makefile
@@ -1,7 +1,7 @@
 all: ls001.pdf ls002.pdf ls003.pdf ls004.pdf ls005.pdf ls006.pdf ls007.pdf
 
-LSO10_FILES = ../svp64.mdwn ls010.mdwn
-LSO10_FILES += ../ldst.mdwn ../branches.mdwn ../cr_ops.mdwn
+LSO10_FILES = ls010.mdwn ../svp64.mdwn
+LSO10_FILES += ../normal.mdwn ../ldst.mdwn ../branches.mdwn ../cr_ops.mdwn
 
 ls010.pdf: $(LS010_FILES)
 	cd ../.. && pandoc \
@@ -13,8 +13,10 @@ ls010.pdf: $(LS010_FILES)
 	  -V fontsize=9pt \
 	  -V papersize=legal \
 	  -V linkcolor=blue \
-	  -f markdown sv/svp64.mdwn \
+	  -f markdown \
 	  sv/rfc/ls010.mdwn \
+	  sv/svp64.mdwn \
+	  sv/normal.mdwn \
 	  sv/ldst.mdwn \
 	  sv/branches.mdwn \
 	  sv/cr_ops.mdwn \
diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn
index 8084ba104..ec9c8ba20 100644
--- a/openpower/sv/rfc/ls010.mdwn
+++ b/openpower/sv/rfc/ls010.mdwn
@@ -1,299 +1,10 @@
-# Normal SVP64 Modes, for Arithmetic and Logical Operations
+# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
 
-Normal SVP64 Mode covers Arithmetic and Logical operations
-to provide suitable additional behaviour. The Mode
-field is bits 19-23 of the [[svp64]] RM Field.
+TODO template from other RFCs
 
-## Mode
-
-Mode is an augmentation of SV behaviour, providing additional
-functionality. Some of these alterations are element-based (saturation),
-others involve post-analysis (predicate result) and others are
-Vector-based (mapreduce, fail-on-first).
-
-[[sv/ldst]], [[sv/cr_ops]] and [[sv/branches]] are covered separately:
-the following Modes apply to Arithmetic and Logical SVP64 operations:
-
-* **simple** mode is straight vectorisation. no augmentations: the
-  vector comprises an array of independently created results.
-* **ffirst** or data-dependent fail-on-first: see separate section.
-  the vector may be truncated depending on certain criteria.
-  *VL is altered as a result*.
-* **sat mode** or saturation: clamps each element result to a min/max
-  rather than overflows / wraps. allows signed and unsigned clamping
-  for both INT and FP.
-* **reduce mode**. if used correctly, a mapreduce (or a prefix sum)
-  is performed. see [[svp64/appendix]].
-  note that there are comprehensive caveats when using this mode.
-* **pred-result** will test the result (CR testing selects a bit of CR
-  and inverts it, just like branch conditional testing) and if the
-  test fails it is as if the *destination* predicate bit was zero even
-  before starting the operation. When Rc=1 the CR element however is
-  still stored in the CR regfile, even if the test failed. See appendix
-  for details.
-
-Note that ffirst and reduce modes are not anticipated to be
-high-performance in some implementations. ffirst due to interactions
-with VL, and reduce due to it requiring additional operations to produce
-a result. simple, saturate and pred-result are however inter-element
-independent and may easily be parallelised to give high performance,
-regardless of the value of VL.
-
-The Mode table for Arithmetic and Logical operations is laid out as
-follows:
-
-| 0-1 | 2   | 3 4     | description                    |
-| --- | --- |---------|-------------------------- |
-| 00  | 0   | dz sz   | simple mode                    |
-| 00  | 1   | 0 RG    | scalar reduce mode (mapreduce) |
-| 00  | 1   | 1 /     | reserved                       |
-| 01  | inv | CR-bit  | Rc=1: ffirst CR sel            |
-| 01  | inv | VLi RC1 | Rc=0: ffirst z/nonz            |
-| 10  | N   | dz sz   | sat mode: N=0/1 u/s            |
-| 11  | inv | CR-bit  | Rc=1: pred-result CR sel       |
-| 11  | inv | zz RC1  | Rc=0: pred-result z/nonz       |
-
-Fields:
-
-* **sz / dz** if predication is enabled will put zeros into the dest
-  (or as src in the case of twin pred) when the predicate bit is zero.
-  otherwise the element is ignored or skipped, depending on context.
-* **zz**: both sz and dz are set equal to this flag
-* **inv CR bit** just as in branches (BO) these bits allow testing of
-  a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **RG** inverts the Vector Loop order (VL-1 downto 0) rather
-  than the normal 0..VL-1
-* **N** sets signed/unsigned saturation.
-* **RC1** as if Rc=1, enables access to `VLi`.
-* **VLi** VL inclusive: in fail-first mode, the truncation of
-  VL *includes* the current element at the failure point rather
-  than excludes it from the count.
-
-For LD/ST Modes, see [[sv/ldst]]. For Condition Registers see
-[[sv/cr_ops]]. For Branch modes, see [[sv/branches]].
-
-## Rounding, clamp and saturate
-
-To help ensure for example that audio quality is not compromised by
-overflow, "saturation" is provided, as well as a way to detect when
-saturation occurred if desired (Rc=1). When Rc=1 there will be a *vector*
-of CRs, one CR per element in the result (Note: this is different from
-VSX which has a single CR per block).
-
-When N=0 the result is saturated to within the maximum range of an
-unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
-logic applies to FP operations, with the result being saturated to
-maximum rather than returning INF, and the minimum to +0.0
-
-When N=1 the same occurs except that the result is saturated to the min
-or max of a signed result, and for FP to the min and max value rather
-than returning +/- INF.
-
-When Rc=1, the CR "overflow" bit is set on the CR associated with
-the element, to indicate whether saturation occurred. Note that
-due to the hugely detrimental effect it has on parallel processing,
-XER.SO is **ignored** completely and is **not** brought into play here.
-The CR overflow bit is therefore simply set to zero if saturation did
-not occur, and to one if it did. This behaviour (ignoring XER.SO) is
-actually optional in the SFFS Compliancy Subset: for SVP64 it is made
-mandatory *but only on Vectorised instructions*.
-
-Note also that saturate on operations that set OE=1 must raise an Illegal
-Instruction due to the conflicting use of the CR.so bit for storing
-if saturation occurred. Vectorised Integer Operations that produce a
-Carry-Out (CA, CA32): these two bits will be `UNDEFINED` if saturation
-is also requested.
-
-Note that the operation takes place at the maximum bitwidth (max of
-src and dest elwidth) and that truncation occurs to the range of the
-dest elwidth.
-
-*Programmer's Note: Post-analysis of the Vector of CRs to find out if any
-given element hit saturation may be done using a mapreduced CR op (cror),
-or by using the new crrweird instruction with Rc=1, which will transfer
-the required CR bits to a scalar integer and update CR0, which will allow
-testing the scalar integer for nonzero. see [[sv/cr_int_predication]].
-Alternatively, a Data-Dependent Fail-First may be used to truncate the
-Vector Length to non-saturated elements, greatly increasing the productivity
-of parallelised inner hot-loops.*
-
-## Reduce mode
-
-Reduction in SVP64 is similar in essence to other Vector Processing ISAs,
-but leverages the underlying scalar Base v3.0B operations. Thus it is
-more a convention that the programmer may utilise to give the appearance
-and effect of a Horizontal Vector Reduction. Due to the unusual decoupling
-it is also possible to perform prefix-sum (Fibonacci Series) in certain
-circumstances. Details are in the SVP64 appendix
-
-Reduce Mode should not be confused with Parallel Reduction [[sv/remap]].
-As explained in the [[sv/appendix]] Reduce Mode switches off the check
-which would normally stop looping if the result register is scalar.
-Thus, the result scalar register, if also used as a source scalar,
-may be used to perform sequential accumulation. This *deliberately*
-sets up a chain of Register Hazard Dependencies, whereas Parallel Reduce
-[[sv/remap]] deliberately issues a Tree-Schedule of operations that may
-be parallelised.
-
-## Data-dependent Fail-on-first
-
-Data-dependent fail-on-first is very different from LD/ST Fail-First
-(also known as Fault-First) and is actually CR-field-driven.
-Vector elements are required to appear
-to be executed in sequential Program Order. When REMAP is not active,
-element 0 would be the first.
-
-Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
-CR-creating operation produces a result (including cmp). Similar to
-branch, an analysis of the CR is performed and if the test fails, the
-vector operation terminates and discards all element operations **at and
-above the current one**, and VL is truncated to either the *previous*
-element or the current one, depending on whether VLi (VL "inclusive")
-is clear or set, respectively.
-
-Thus the new VL comprises a contiguous vector of results, all of which
-pass the testing criteria (equal to zero, less than zero etc as defined
-by the CR-bit test).
-
-*Note: when VLi is clear, the behaviour at first seems counter-intuitive.
-A result is calculated but if the test fails it is prohibited from being
-actually written. This becomes intuitive again when it is remembered
-that the length that VL is set to is the number of *written* elements, and
-only when VLI is set will the current element be included in that count.*
-
-The CR-based data-driven fail-on-first is "new" and not found in ARM SVE
-or RVV. At the same time it is "old" because it is almost identical to
-a generalised form of Z80's `CPIR` instruction. It is extremely useful
-for reducing instruction count, however requires speculative execution
-involving modifications of VL to get high performance implementations.
-An additional mode (RC1=1) effectively turns what would otherwise be an
-arithmetic operation into a type of `cmp`. The CR is stored (and the
-CR.eq bit tested against the `inv` field). If the CR.eq bit is equal to
-`inv` then the Vector is truncated and the loop ends.
-
-VLi is only available as an option when `Rc=0` (or for instructions
-which do not have Rc). When set, the current element is always also
-included in the count (the new length that VL will be set to). This may
-be useful in combination with "inv" to truncate the Vector to *exclude*
-elements that fail a test, or, in the case of implementations of strncpy,
-to include the terminating zero.
-
-In CR-based data-driven fail-on-first there is only the option to select
-and test one bit of each CR (just as with branch BO). For more complex
-tests this may be insufficient. If that is the case, a vectorised crop
-such as crand, cror or [[sv/cr_int_predication]] crweirder may be used,
-and ffirst applied to the crop instead of to the arithmetic vector. Note
-that crops are covered by the [[sv/cr_ops]] Mode format.
-
-Use of Fail-on-first with Vertical-First Mode is not prohibited but is
-not really recommended. The effect of truncating VL
-may have unintended and unexpected consequences on subsequent instructions.
-VLi set will be fine: it is when VLi is clear that problems may be faced.
-
-*Programmer's note: `VLi` is only accessible in normal operations which in
-turn limits the CR field bit-testing to only `EQ/NE`. [[sv/cr_ops]] are
-not so limited. Thus it is possible to use for example `sv.cror/ff=gt/vli
-*0,*0,*0`, which is not a `nop` because it allows Fail-First Mode to
-perform a test and truncate VL.*
-
-*Hardware implementor's note: effective Sequential Program Order must
-be preserved. Speculative Execution is perfectly permitted as long as
-the speculative elements are held back from writing to register files
-(kept in Resevation Stations), until such time as the relevant CR Field
-bit(s) has been analysed. All Speculative elements sequentially beyond
-the test-failure point **MUST** be cancelled. This is no different from
-standard Out-of-Order Execution and the modification effort to efficiently
-support Data-Dependent Fail-First within a pre-existing Multi-Issue
-Out-of-Order Engine is anticipated to be minimal. In-Order systems on
-the other hand are expected, unavoidably, to be low-performance*.
-
-Two extremely important aspects of ffirst are:
-
-* LDST ffirst may never set VL equal to zero. This because on the first
-  element an exception must be raised "as normal".
-* CR-based data-dependent ffirst on the other hand **can** set VL equal
-  to zero. This is the only means in the entirety of SV that VL may be set
-  to zero (with the exception of via the SV.STATE SPR). When VL is set
-  zero due to the first element failing the CR bit-test, all subsequent
-  vectorised operations are effectively `nops` which is
-  *precisely the desired and intended behaviour*.
-
-The second crucial aspect, compared to LDST Ffirst:
-
-* LD/ST Failfirst may (beyond the initial first element
-  conditions) truncate VL for any architecturally suitable reason. Beyond
-  the first element LD/ST Failfirst is arbitrarily speculative and 100%
-  non-deterministic.
-* CR-based data-dependent first on the other hand MUST NOT truncate VL
-  arbitrarily to a length decided by the hardware: VL MUST only be
-  truncated based explicitly on whether a test fails. This because it is
-  a precise Deterministic test on which algorithms can and will will rely.
-
-**Floating-point Exceptions**
-
-When Floating-point exceptions are enabled VL must be truncated at
-the point where the Exception appears not to have occurred. If `VLi`
-is set then VL must include the faulting element, and thus the faulting
-element will always raise its exception. If however `VLi` is clear then
-VL **excludes** the faulting element and thus the exception will **never**
-be raised.
-
-Although very strongly discouraged the Exception Mode that permits
-Floating Point Exception notification to arrive too late to unwind
-is permitted (under protest, due it violating the otherwise 100%
-Deterministic nature of Data-dependent Fail-first) and is `UNDEFINED`
-behaviour.
-
-**Use of lax FP Exception Notification Mode could result in parallel
-computations proceeding with invalid results that have to be explicitly
-detected, whereas with the strict FP Execption Mode enabled, FFirst
-truncates VL, allows subsequent parallel computation to avoid the
-exceptions entirely**
-
-## Data-dependent fail-first on CR operations (crand etc)
-
-Operations that actually produce or alter CR Field as a result have
-their own SVP64 Mode, described in [[sv/cr_ops]].
-
-## pred-result mode
-
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the pseudocode for SVP64 CR-ops is slightly different.
-
-```
-    for i in range(VL):
-        # predication test, skip all masked out elements.
-        if predicate_masked_out(i):
-            continue
-        result = op(iregs[RA+i], iregs[RB+i])
-        CRnew = analyse(result) # calculates eq/lt/gt
-        # Rc=1 always stores the CR field
-        if Rc=1 or RC1:
-            CR.field[offs+i] = CRnew
-        # now test CR, similar to branch
-        if RC1 or CR.field[BO[0:1]] != BO[2]:
-            continue # test failed: cancel store
-        # result optionally stored but CR always is
-        iregs[RT+i] = result
-```
-
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out. For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`. The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
+[[!tag opf_rfc]]
 
 --------
 
 \newpage{}
-
-[[!tag opf_rfc]]
-
diff --git a/openpower/sv/svp64.mdwn b/openpower/sv/svp64.mdwn
index d47dc8fe8..99eb7a289 100644
--- a/openpower/sv/svp64.mdwn
+++ b/openpower/sv/svp64.mdwn
@@ -1,4 +1,4 @@
-# RFC ls010 SVP64 Zero-Overhead Loop Prefix Subsystem
+# SVP64 Zero-Overhead Loop Prefix Subsystem
 
 * **DRAFT STATUS v0.1 18sep2021** Release notes
@@ -424,19 +424,20 @@ related. Thus a `VL` of 16 must be used*
 
 ## Future expansion.
 
-With the way that EXTRA fields are defined and applied to register fields,
-future versions of SV may involve 256 or greater registers. Backwards
-binary compatibility may be achieved with a PCR bit (Program Compatibility
-Register) or an MSR bit analogous to SF.
-Further discussion is out of scope for this version of SVP64.
+With the way that EXTRA fields are defined and applied to register
+fields, future versions of SV may involve 256 or greater registers
+in some way as long as the reputation of Power ISA for full backwards
+binary interoperability is preserved. Backwards binary compatibility
+may be achieved with a PCR bit (Program Compatibility Register) or an
+MSR bit analogous to SF. Further discussion is out of scope for this
+version of SVP64.
 
 Additionally, a future variant of SVP64 will be applied to the Scalar
-(Quad-precision and 128-bit) VSX instructions. Element-width overrides
-are an opportunity to expand a future version of the Power ISA
-to 256-bit, 512-bit and
-1024-bit operations, as well as doubling or quadrupling the number
-of VSX registers to 128 or 256. Again further discussion is out of
-scope for this version of SVP64.
+(Quad-precision and 128-bit) VSX instructions. Element-width overrides are
+an opportunity to expand a future version of the Power ISA to 256-bit,
+512-bit and 1024-bit operations, as well as doubling or quadrupling the
+number of VSX registers to 128 or 256. Again further discussion is out
+of scope for this version of SVP64.
 
 --------
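
The data-dependent fail-first rule in the text this commit removes from ls010.mdwn (VL is truncated at the first element that fails the test, with VLi deciding whether that element is counted) can be sketched in Python. This is a minimal illustrative model, not Libre-SOC code: `ffirst_truncate`, `test_passes` and the plain list standing in for a register vector are hypothetical names chosen here, and predication, elwidth overrides and CR storage are deliberately left out.

```python
from typing import Callable, List, Tuple


def ffirst_truncate(
    results: List[int],
    test_passes: Callable[[int], bool],
    vli: bool = False,
) -> Tuple[List[int], int]:
    """Model of CR-driven fail-first truncation (assumed simplification).

    `results` are the per-element results in sequential Program Order.
    `test_passes` stands in for the single CR-bit test.
    Returns the elements that are actually written and the new VL.
    """
    written: List[int] = []
    for value in results:
        if not test_passes(value):
            # Test failed: this element and all later ones are discarded,
            # unless VLi is set, in which case the failing element is
            # still written and counted in the new VL.
            if vli:
                written.append(value)
            break
        written.append(value)
    # VL is the number of written elements, so it may legitimately be zero.
    return written, len(written)


if __name__ == "__main__":
    data = [3, 5, 0, 7]
    nonzero = lambda x: x != 0
    print(ffirst_truncate(data, nonzero))            # ([3, 5], 2)
    print(ffirst_truncate(data, nonzero, vli=True))  # ([3, 5, 0], 3)
```

With VLi set, the `[3, 5, 0]` result mirrors the strncpy case mentioned in the removed text, where the terminating zero is included. If the very first element fails and VLi is clear, the sketch returns a VL of zero, matching the statement that CR-based fail-first can set VL to zero.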