X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsvp64_quirks.mdwn;h=75f3d98bdcf9290456d779c00aea526b3c2d766e;hb=5508a1ba5f5a5e750526b8ee16966b31795fee7c;hp=3085109f1285d1a961b85f3eee6f8e3976a00cf8;hpb=c0ce7ab520b076ce2092a26d29c37c87a48d920f;p=libreriscv.git

diff --git a/openpower/sv/svp64_quirks.mdwn b/openpower/sv/svp64_quirks.mdwn
index 3085109f1..75f3d98bd 100644
--- a/openpower/sv/svp64_quirks.mdwn
+++ b/openpower/sv/svp64_quirks.mdwn
@@ -101,8 +101,8 @@ makes no sense at all, such as `sc` or `mtmsr`). The categories are:
 **Arithmetic**
 
 Arithmetic (known as "normal" mode) is where Scalar and Parallel
-Reduction can be done: Saturation as well, and two new innovative
-modes for Vector ISAs: data-dependent fail-first and predicate result.
+Reduction can be done: Saturation as well, and a new innovative
+mode for Vector ISAs: data-dependent fail-first.
 Reduction and Saturation are common to see in Vector ISAs: it is just
 that they are usually added as explicit instructions, and NEC SX Aurora
 has even more iterative instructions. In SVP64 these
@@ -142,8 +142,7 @@ overrides make absolutely no sense whatsoever. Therefore the elwidth
 override field bits can be used for other purposes when Vectorising
 CR Field instructions. Moreover, Rc=1 is completely invalid for CR
 operations such as `crand`: Rc=1 is for arithmetic operations, producing
-a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
-such as predicate-result make no sense, and neither does Saturation.
+a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense.
 All of these differences, which require quite a lot of logical
 reasoning and deduction, help explain why there is an entirely different
 CR ops Vectorisation Category.
@@ -575,6 +574,18 @@ Where this breaks down is when attempting to do half-width on BF16
 or FP16 operations: there does not exist a BF8 or an IEEE754 FP8 format,
 so these (`sv.fadds/ew=8`) should be avoided.
 
+# Word frequently becomes "half"
+
+Again, related to "Single" becoming "half of element width", unless there
+are compelling reasons the same trick applies to Scalar GPR operations.
+With the pseudocode being "XLEN//2" then of course if XLEN=8 the operation
+becomes a 4-bit one.
+
+Similarly byte operations which use "XLEN//8" when XLEN=8 actually become
+single-bit operations, which is very useful with `sv.extsb/w=8`
+for example. This instruction copies the LSB of each byte in a sequence of bytes,
+and expands it to all 8 bits in each result byte.
+
 # Vertical-First and Subvectors
 
 Documented in the [[sv/setvl]] page, Vertical-First goes through
@@ -650,16 +661,8 @@ As pointed out in the [[sv/ldst]] page there is limited space in only
 
 * LD/ST Immediate has no individual control over src/dest zeroing,
   whereas LD/ST Indexed does.
-* LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
-* LD/ST Indexed has no Pack/Unpack, whereas LD/ST Immediate does.
-
-These are not insurmountable problems: there do exist workarounds.
-For example it is possible to set up Matrix REMAP to perform the same
-job as Pack/Unpack, at which point the LD/ST "Saturation" mode may
-be used, saving on costly intermediary registers *at double the LD
-width* if a Saturated MV had to be involved. Store on the other hand
-it is extremely likely that an arithmetic operation already computed
-a Saturated Vector of results, so is less of a problem than Load.
+* Post-Increment is not possible with Saturation or Data-Dependent Fail-First
+* Element-Strided LD/ST Indexed is not possible with Data-Dependent Fail-First.
 
 Also, the LD/ST Indexed Mode can be element-strided
 (RB as a Scalar, times
@@ -671,3 +674,43 @@ for LDST Immediate only having `zz`.
 
 Simple-V is powerful but it cannot do everything! There is just not
 enough space and so some compromises had to be made.
+
+# sv.mtcr on entire 64-bit Condition Register
+
+Normally, CR operations are either bit-based (where the element numbering actually
+applies to the CR Field) or field-based in which case the elements are still
+fields. The `sv.mtcr` and other instructions are actually full 64-bit Condition
+*Register* operations and are therefore qualified as Normal/Arithmetic not
+CRops.
+
+This is to save on both Vector Length (VL of 16 is sufficient) as well as
+complexity in the Hazard Management when context-switching CR fields, as the
+entire batch of 128 CR Fields may be transferred to 8 GPRs with a VL of 16
+and elwidth overriding of 32. Truncation is sufficient, dropping the top 32 bits
+of the Condition Register(s) which are always zero anyway.
+
+# Separate Scalar and Vector Condition Register files
+
+As explained in the introduction [[sv/svp64]] and [[sv/cr_ops]]
+Scalar Power ISA lacks "Conditional Execution" present in ARM
+Scalar ISA of several decades. When Vectorised the fact that
+Rc=1 Vector results can immediately be used as a Predicate Mask
+back into the following instruction can result in large latency
+unless "Vector Chaining" is used in the Micro-Architecture.
+
+But that aside is not the main problem faced by the introduction
+of Simple-V to the Power ISA: it's that the existing implementations
+(IBM) don't have "Conditional Execution" and to add it to their
+existing designs would be too disruptive a first step.
+
+A compromise is to wipe blank certain entries in the Register Dependency
+Matrices by prohibiting some operations involving the two groups
+of CR Fields: those that fall into the existing Scalar 32-bit CR
+(fields CR0-CR7) and those that fall into the newly-introduced
+CR Fields, CR8-CR127.
+
+This will drive compiler writers nuts, and give assembler writers headaches,
+but it gives IBM the opportunity to implement SVP64 without massive
+disruption. They can add an entirely new Vector CR register file,
+new pipelines etc safe in the knowledge that existing Scalar HDL
+needs no modification.