X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsvp64_quirks.mdwn;h=43d7249df22871f5bf4e230c0bd6acd6501e1873;hb=8a686a6a4482b791e4ad1c83a64706d6a2037b82;hp=7c02217729e16f9367429ef7f57ce527eea4280f;hpb=93aae810e05f60cd118a287dd779b48d9ded6d58;p=libreriscv.git
diff --git a/openpower/sv/svp64_quirks.mdwn b/openpower/sv/svp64_quirks.mdwn
index 7c0221772..43d7249df 100644
--- a/openpower/sv/svp64_quirks.mdwn
+++ b/openpower/sv/svp64_quirks.mdwn
@@ -2,38 +2,362 @@
 [[!toc]]
 
-SVP64 is designed around these fundamental and inviolate principles:
+SVP64 is designed around fundamental and inviolate RISC principles.
+This gives a uniformity and regularity to the ISA, making implementation
+straightforward, which is why RISC as a concept became popular.
 
 1. There are no actual Vector instructions: Scalar instructions
    are the sole exclusive bedrock.
 2. No scalar instruction ever deviates in its encoding or meaning
-   just because it is prefixed (caveats below)
-3. A hardware-level for-loop makes vector elements 100% synonymous
-   with scalar instructions (the suffix)
+   just because it is prefixed (semantic caveats below)
+3. A hardware-level for-loop (the prefix) makes vector elements
+   100% synonymous with scalar instructions (the suffix)
+4. Exactly as with Scalar RISC ISAs, the uniformity does produce
+   "holes" in the encoding or some strange combinations.
 
-That said, there are a few exceptional places where these rules get
+How can a Vector ISA even exist when no actual Vector instructions
+are permitted to be added? It comes down to the strict RISC abstraction.
+First you start from a **scalar** instruction (32-bit). Second, the
+Prefixing is applied *in the abstract* to give the *appearance*
+and ultimately the same effect as if an explicit Vector instruction
+had also been added. Looking at the pseudocode of any Vector ISA
+(RVV, NEC SX Aurora, Cray), it always comprises (a) a for-loop
+around (b) element-based operations.
+It is perfectly reasonable and rational to separate (a) from (b),
+then find a powerful pre-existing
+Supercomputing-class ISA that qualifies for (b).
+
+There are a few exceptional places where these rules get
 bent, and others where the rules take some explaining,
-and this page tracks them.
+and this page tracks them all.
 
-The modification caveat obviously exempts element width overrides,
+The modification caveat in (2) above semantically
+exempts element width overrides,
 which still do not actually modify the meaning of the instruction:
-an add remains an add, even if it is only an 8-bit add rather than
-a 64-bit add. elwidth overrides *definitely* do not alter the 3.0 encoding.
-Other "modifications" such as saturation or Data-dependent Fail-First
-likewise are post-augmentation or post-analysis, and do not actually
-fundamentally change an add operation into a subtract for example.
+an add remains an add, even if its override makes it an 8-bit add rather than
+a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
+that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry
+where the 9th bit becomes Carry-out (not the 65th bit).
+In other words, elwidth overrides **definitely** do not fundamentally
+alter the actual
+Scalar v3.0 ISA encoding itself. Consequently we are still, in
+the strictest semantic sense, not breaking rule (2).
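+
+To make that abstraction concrete, here is a minimal illustrative sketch
+(not the specification pseudocode) of the Prefix acting as a hardware
+for-loop around an unmodified Scalar add. For simplicity each element is
+shown occupying its own register entry (in SVP64 narrower elements are
+actually packed into the 64-bit registers), and the `regs`, `VL` and
+`elwidth` names are generic placeholders rather than the spec's terms:
+
+```
+# illustrative only: the Prefix is conceptually a for-loop (a) wrapped
+# around the unmodified Scalar operation (b)
+def prefixed_add(regs, RT, RA, RB, VL, elwidth):
+    mask = (1 << elwidth) - 1          # e.g. elwidth=8 gives 8-bit elements
+    for i in range(VL):                # (a) the hardware-level for-loop
+        a = regs[RA + i] & mask
+        b = regs[RB + i] & mask
+        result = a + b                 # (b) the Scalar add, completely unchanged
+        carry_out = result >> elwidth  # with elwidth=8 the 9th bit is Carry-out
+        regs[RT + i] = result & mask
+```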
+
+Likewise, other "modifications" such as saturation or Data-dependent
+Fail-First are actually post-augmentation or post-analysis, and do
+not fundamentally change an add operation into a subtract,
+for example; under absolutely no circumstances do the actual 32-bit
+Scalar v3.0 operand field bits change, nor does the number of operands change.
 
-*(An experiment was attempted to modify LD-immediate instructions
+In an early Draft of SVP64,
+an experiment was attempted to modify LD-immediate instructions
 to include a third RC register i.e. reinterpret the normal
-v3.0 32-bit instruction as a
-different encoding if SVP64-prefixed: it did not go well.
+v3.0 32-bit instruction as a completely
+different encoding if SVP64-prefixed. It did not go well.
 The complexity that resulted
-in the decode phase was too great)*
+in the decode phase was too great. The lesson was learned, the
+hard way: it would be infinitely preferable
+to add a 32-bit Scalar Load-with-Shift
+instruction *first*, which then inherently becomes Vectorised.
+Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
+both ARM and x86 have it, because it saves greatly on instruction count in
+hot-loops.
+
+The other reason for not adding an SVP64-Prefixed instruction without
+also having it as a Scalar un-prefixed instruction is that if the
+32-bit encoding is ever allocated in a future revision
+of the Power ISA
+to a completely unrelated operation
+then how can a Vectorised version of that new instruction ever be added?
+The uniformity and RISC Abstraction would be irreparably damaged.
+The bottom line here is that the fundamental RISC Principle is strictly
+adhered to, even though these are Advanced 64-bit Vector instructions.
+Advocates of the RISC Principle will appreciate the uniformity of
+SVP64 and the level of systematic abstraction kept between Prefix and Suffix.
+
+# Instruction Groups
+
+The basic principle of SVP64 is the prefix, which contains mode
+as well as register augmentation and predicates. When thinking of
+instructions and Vectorising them, it is natural for arithmetic
+operations (ADD, OR) to be the first to spring to mind.
+Arithmetic instructions have registers, therefore augmentation
+applies, end of story, right?
+
+Except that Load and Store also deal with Memory, not just registers.
+Power ISA has Condition Register Fields: how can element widths
+apply there? And branches: how can you have Saturation on something
+that does not return an arithmetic result? In short: there are actually
+four different categories (five including those for which Vectorisation
+makes no sense at all, such as `sc` or `mtmsr`). The categories are:
+
+* arithmetic/logical including floating-point
+* Load/Store
+* Condition Register Field operations
+* branch
+
+**Arithmetic**
+
+Arithmetic (known as "normal" mode) is where Scalar and Parallel
+Reduction can be done: Saturation as well, and two new innovative
+modes for Vector ISAs: data-dependent fail-first and predicate result.
+Reduction and Saturation are common to see in Vector ISAs: it is just
+that they are usually added as explicit instructions,
+and NEC SX Aurora has even more iterative instructions. In SVP64 these
+concepts are applied in the abstract general form, which takes some
+getting used to.
+
+Reduction may, when applied incorrectly to non-commutative
+instructions, produce invalid results, but ultimately
+it is critical to think in terms of the "rules": everything is
+Scalar instructions in strict Program Order. Reduction on non-commutative
+Scalar Operations is not *prohibited*: the strict Program Order allows
+the programmer to think through what would happen and thus potentially
+come up with a legitimate use.
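+
+As an illustration of that "strict Program Order" reasoning (a sketch only,
+not the specification pseudocode; `regs`, `VL` and `op` are generic
+placeholder names), a Scalar-destination Reduction simply unrolls into a
+sequence of ordinary Scalar operations, with the destination acting as an
+accumulator:
+
+```
+# illustrative only: a scalar-destination reduction viewed as a sequence of
+# Scalar operations in strict Program Order; RT is re-read every step,
+# so it behaves as an accumulator
+def scalar_reduction(regs, RT, RB, VL, op):
+    for i in range(VL):                       # element loop, in order
+        regs[RT] = op(regs[RT], regs[RB + i])
+
+# even a non-commutative op (e.g. subtract) gives a fully predictable result,
+# simply by stepping through the loop in Program Order
+```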
+
+**Branches**
+
+Branch is the one and only place where the Scalar
+(non-prefixed) operations differ from the Vector (element)
+instructions, as explained in a separate section, although
+a case could be made for the perspective that they are identical,
+since the defaults for the new parameters in the Scalar case make
+the branch identical to Power ISA v3.1 Scalar branches.
+
+The
+RM bits can be used for other purposes because the Arithmetic modes
+make no sense at all for a Branch.
+Almost the entire
+SVP64 RM Field is interpreted differently from other Modes, in
+order to support a wide range of parallel boolean condition options
+which are expected of a Vector / GPU ISA. These save a considerable
+number of instructions in tight inner loop situations.
+
+**CR Field Ops**
+
+Condition Register Fields are 4 bits wide and consequently element-width
+overrides make absolutely no sense whatsoever. Therefore the elwidth
+override field bits can be used for other purposes when Vectorising
+CR Field instructions. Moreover, Rc=1 is completely invalid for
+CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
+a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
+such as predicate-result make no sense, and neither does Saturation.
+All of these differences, which require quite a lot of logical
+reasoning and deduction, help explain why there is an entirely different
+CR ops Vectorisation Category.
+
+A particularly strange quirk of CR-based Vector Operations is that the
+Scalar Power ISA CR Register is 32-bits, but actually comprises eight
+CR Fields, CR0-CR7. With each CR Field being four bits (EQ, LT, GT, SO)
+this makes up 32 bits, and therefore a CR operand referring to one bit
+of the CR will be 5 bits in length (BA, BT).
+*However*, some instructions refer
+to a *CR Field* (CR0-CR7) and consequently these operands
+(BF, BFA etc) are only 3-bits.
+
+(*It helps here to think of the top 3 bits of BA as referring
+to a CR Field, like BFA does, and the bottom 2 bits of BA
+referring to
+EQ/LT/GT/SO within that Field*)
+
+With SVP64 extending the number of CR *Fields* to 128, the number of
+32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
+(8 per CR Register). Then it gets even more strange when it comes
+to Vectorisation, which applies to the CR Field *numbers*. The
+hardware for-loop for Rc=1, for example, starts at CR0 for element 0,
+and moves to CR1 for element 1, and so on. The reason here is quite
+simple: each element result has to have its own CR Field co-result.
+
+In other words, the
+element is the 4-bit CR *Field*, not the bits *of* the 32-bit
+CR Register, and not the CR *Register* (of which there are now 16).
+All quite logical, but a little mind-bending.
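+
+A small sketch may help with the BA/BF distinction and the Field-based
+numbering just described (illustrative only, not the specification's
+decode logic):
+
+```
+# illustrative only: splitting a 5-bit CR bit operand (BA) into a
+# 3-bit CR Field number and a 2-bit position within that Field
+def decode_cr_bit_operand(BA):
+    field = BA >> 2      # which CR Field (what a 3-bit BF operand names directly)
+    bit   = BA & 0b11    # position within the Field (LT, GT, EQ, SO)
+    return field, bit
+
+# Vectorisation then steps the Field *number*: element 0 uses CR Field 0,
+# element 1 uses CR Field 1, and so on
+```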
+
+**Load/Store**
+
+LOAD/STORE is another area that has different needs: this time it is
+down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
+which simply make no sense in a RISC Scalar ISA: element-stride and
+unit-stride and the entire concept of a stride itself (a spacing
+between elements) have no place at all in a Scalar ISA. The problems
+come when trying to *retrofit* the concept of "Vector Elements" onto
+a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64
+RM Prefix to convey the stride mode, changing the Effective Address
+computation as a result. Interestingly, and worth noting for Hardware
+designers: it did turn out to be possible to perform pre-multiplication
+of the D/DS Immediate by the stride amount, making it possible to avoid
+actually modifying the LD/ST Pipeline itself.
+
+Other areas where LD/ST went quirky: element-width overrides, especially
+when combined with Saturation, given that LD/ST operations have byte,
+halfword, word, dword and quad variants. The interaction between these
+widths as part of the actual operation, and the source and destination
+elwidth overrides, was particularly obtuse and hard to derive: some care
+and attention is advised, here, when reading the specification,
+especially on arithmetic loads (lbarx, lharx etc.)
+
+**Non-vectorised**
+
+The concept of a Vectorised halt (`attn`) makes no sense. There is never
+going to be a Vector of global MSRs (Machine Status Register). `mtcr`
+on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
+Even `td` and `tdi` make a strange type of sense to permit them to be
+Vectorised, because a sequence of comparisons could be Vectorised.
+Vectorised System Calls (`sc`) or `tlbie` and other Cache or Virtual
+Memory Management
+instructions make no sense to Vectorise.
+
+However, it is really quite important not to be tempted to conclude that
+just because these instructions are un-vectoriseable, the Prefix opcode space
+must be free for reinterpretation and use for other purposes. This would
+be a serious mistake because a future revision of the specification
+might *retire* the Scalar instruction, and, worse, replace it with another.
+Again this comes down to being quite strict about the rules: only Scalar
+instructions get Vectorised: there are *no* actual explicit Vector
+instructions.
+
+**Summary**
+
+Where a traditional Vector ISA effectively duplicates the entirety
+of a Scalar ISA and then adds additional instructions which only
+make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
+considerable lengths to keep strictly to augmentation and embedding
+of an entire Scalar ISA's instructions into an abstract Vectorisation
+Context. That abstraction subdivides into Categories appropriate
+for the type of operation (Branch, CRs, Memory, Arithmetic),
+and each Category has its own relevant but
+ultimately rational quirks.
+
+# Abstraction between Prefix and Suffix
+
+In the introduction paragraph, a great fuss was made emphasising that
+the Prefix is kept separate from the Suffix. The whole idea there is
+that a Multi-issue Decoder and subsequent pipelines would in no way have
+"back-propagation" of state that can only be determined far too late.
+This *has* been preserved; however, there is a hiccup.
+
+In Power ISA 3.1 a 64-bit Prefix was introduced, EXT001.
+The encoding of the prefix has 6 bits that are dedicated to letting
+the hardware know what the remainder of the Prefix bits mean: how they
+are formatted, even without having to examine the Suffix to which
+they are applied.
+
+SVP64 has such pressure on its 24-bit encoding that it was simply
+not possible to perform the same trick used by Power ISA 3.1 Prefixing.
+Therefore, rather unfortunately, it becomes necessary to perform
+a *partial decoding* of the v3.0 Suffix before the 24-bit SVP64 RM
+Fields may be identified.
Fortunately this is straightforward, and +does not rely on any outside state, and even more fortunately +for a Multi-Issue Execution decoder, the length 32/64 is also +easy to identify by looking for the EXT001 pattern. Once identified +the 32/64 bits may be passed independently to multiple Decoders in +parallel. + +# Predication + +Predication is entirely missing from the Power ISA. +Adding it would be a costly mistake because it cannot be retrofitted +to an ISA without literally duplicating all instructions. Prefixing +is about the only sane way to go. + +CR Fields as predicate masks could be spread across multiple register +file entries, making them costly to read in one hit. Therefore the +possibility exists that an instruction element writing to a CR Field +could *overwrite* the Predicate mask CR Vector during the middle of +a for-loop. + +Clearly this is bad, so don't do it. If there are potential issues +they can be avoided by using the crweird instructions to get CR Field +bits into an Integer GPR (r3, r10 or r30) and use that GPR as a +Predicate mask instead. + +Even in Vertical-First Mode, which is a single Scalar instruction executed +with "offset" registers (in effect), the rule still applies: don't write +to the same register being used as the predicate, it's `UNDEFINED` +behaviour. + +## Single Predication + +So named because there is a Twin Predication concept as well, Single +Predication is also unlike other Vector ISAs because it allows zeroing +on both the source and destination. This takes some explaining. + +In Vector ISAs, there is a Predicate Mask, it applies to the +destination only, and there +is a choice of actions when a Predicate Mask bit +is zero: + +* set the destination element to zero +* skip that element operation entirely, leaving the destination unmodified + +The problem comes if the underlying register file SRAM is say 64-bit wide +write granularity but the Vector elements are say 8-bit wide. +Some Vector ISAs strongly advocate Zeroing because to leave one single +element at a small bitwidth in amongst other elements where the register +file does not have the prerequisite access granularity is very expensive, +requiring a Read-Modify-Write cycle to preserve the untouched elements. +Putting zero into the destination avoids that Read. + +This is technically very easy to solve: use a Register File that does +in fact have the smallest element-level write-enable granularity. +If the elements are 8 bit then allow 8-bit writes! + +With that technical issue solved there is nothing in the way of choosing +to support both zeroing and non-zeroing (skipping) at the ISA level: +SV chooses to further support both *on both the source and destination*. +This can result in the source and destination +element indices getting "out-of-sync" even though the Predicate Mask +is the same because the behaviour is different when zeros in the +Predicate are encountered. + +## Twin Predication + +Twin Predication is an entirely new concept not present in any commercial +Vector ISA of the past forty years. To explain how normal Single-predication +is applied in a standard Vector ISA: + +* Predication on the **source** of a LOAD instruction creates something + called "Vector Compressed Load" (VCOMPRESS). +* Predication on the **destination** of a STORE instruction creates something + called "Vector Expanded Store" (VEXPAND). +* SVP64 allows the two to be put back-to-back: one on source, one on + destination. 
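+
+A rough sketch of that back-to-back behaviour (illustrative only; the
+authoritative pseudocode lives in the SVP64 specification pages, and the
+`src_mask`/`dst_mask` names here are generic placeholders), applied to a
+simple register-to-register move:
+
+```
+# illustrative only: twin predication, with independent source and
+# destination predicate bitmasks and no zeroing
+def twin_predicated_mv(regs, RT, RA, src_mask, dst_mask, VL):
+    sidx, didx = 0, 0
+    while sidx < VL and didx < VL:
+        while sidx < VL and not ((src_mask >> sidx) & 1):
+            sidx += 1              # skip masked-out source elements (compress)
+        while didx < VL and not ((dst_mask >> didx) & 1):
+            didx += 1              # skip masked-out destination elements (expand)
+        if sidx < VL and didx < VL:
+            regs[RT + didx] = regs[RA + sidx]
+            sidx += 1
+            didx += 1
+```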
+ +The above allows a reader familiar with VCOMPRESS and VEXPAND to +conceptualise what the effect of Twin Predication is, but it actually +goes much further: in *any* twin-predicated instruction (extsw, fmv) +it is possible to apply one predicate to the source register (compressing +the source element array) and another *completely separate* predicate +to the destination register, not just on Load/Stores but on *arithmetic* +operations. + +No other Vector ISA in the world has this back-to-back +capability. All true Vector +ISAs have Predicate Masks: it is an absolutely essential characteristic. +However none of them have abstracted dual predicates out to the extent +where this VCOMPRESS-VEXPAND effect is applicable *in general* to a +wide range of arithmetic +instructions, as well as Load/Store. + +It is however important to note that not all instructions can be Twin +Predicated (2P): some remain only Single Predicated (1P), as is normally found +in other Vector ISAs. Arithmetic operations with +four registers (3-in, 1-out, VA-Form for example) are Single. The reason +is that there just wasn't enough space in the 24-bits of the SVP64 Prefix. +Consequently, when using a given instruction, it is necessary to look +up in the ISA Tables whether it is 1P or 2P. caveat emptor! + +Also worth a special mention: all Load/Store operations are Twin-Predicated. +The underlying key to understanding: + +* one Predicate effectively applies to the Array of Memory *Addresses*, +* the other Predicate effectively applies to the Array of Memory *Data*. # CR weird instructions -[[sv/int_cr_predication]] is by far the biggest violator of the SVP64 +[[sv/cr_int_predication]] is by far the biggest violator of the SVP64 rules, for good reasons. Transfers between Vectors of CR Fields and Integers for use as predicates is very awkward without them. @@ -48,7 +372,7 @@ From a hardware implementation perspective however they will need special handling as far as Hazard Dependencies are concerned, due to nonconformance (bit-level management) -# mv.x +# mv.x (vector permute) [[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in terms of Register Hazard Management that its addition to any Scalar @@ -56,9 +380,10 @@ ISA is anathematic. In a Traditional Vector ISA however, where the indices are isolated behind a single Vector Hazard, there is no problem at all. `sv.mv.x` is also fraught, precisely because it sits on top of a Standard Scalar register paradigm, not a Vector -ISA, with separate and distinct Vector registers. +ISA with separate and distinct Vector registers. -To help partly solve this, `sv.mv.x` has to be made relative: +To help partly solve this, `sv.mv.x` would have had to have +been made relative: ``` for i in range(VL): @@ -69,14 +394,15 @@ The reason for doing so is that MAXVL or VL may be used to limit the number of Register Hazards that need to be raised to a fixed quantity, at Issue time. -`mv.x` itself will still have to be added as a Scalar instruction, -but the behaviour of `sv.mv.x` will have to be different from that +`mv.x` itself would still have to be added as a Scalar instruction, +but the behaviour of `sv.mv.x` would have to be different from that Scalar version. Normally, Scalar Instructions have a good justification for being added as Scalar instructions on their own merit. `mv.x` is the -polar opposite, and as such qualifies for a special mention in -this section. +polar opposite, and in the end, the idea was thrown out, and Indexed +REMAP added in its place. 
Indexed REMAP comes with its own quirks,
+solving the Hazard problem, described in a later section.
 
 # Branch-Conditional
 
@@ -93,6 +419,11 @@ One key difference is that LR is only updated if certain additional
 conditions are met, whereas Scalar `bclrl` for example unconditionally
 overwrites LR.
 
+Another is that the Vectorised Branch-Conditional instructions are the
+only ones where there are side-effects on predication when skipping
+is enabled. This is so as to be able to use CTR to count down
+*masked-out* elements.
+
 Well over 500 Vectorised branch instructions exist in SVP64 due to
 the number of options available: close integration and interaction
 with the base Scalar Branch was unavoidable in order to create Conditional
@@ -142,3 +473,170 @@ This is not exactly a violation of SVP64 Rules, more of a breakage
 of user expectations, particularly for LD/ST where exceptions would
 normally be expected to be raised, Fail-First provides for avoidance
 of those exceptions.
+
+For Hardware implementers, a standard Out-of-Order micro-architecture
+allows for Cancellation of speculatively-executed elements that extended
+beyond the Vector Truncation point. In-order systems will have a slightly
+harder time and may choose to execute only one element at a time, reducing
+performance as a result.
+
+# OE=1
+
+The hardware cost of Sticky Overflow in a parallel environment is immense.
+At the SFFS Compliancy Level, support for XER.SO is optional.
+Therefore the decision is made to make it mandatory **not** to
+support XER.SO. However, CR.SO *is* supported such that when Rc=1
+is set the CR.SO flag will contain only the overflow of
+the current instruction, rather than being actually "sticky".
+Hardware Out-of-Order designers will recognise and appreciate
+that the Hazards are
+reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.
+
+This is sort-of a quirk and sort-of not, because the option to support
+XER.SO is already optional at the SFFS Compliancy Level.
+
+# Indexed REMAP and CR Field Predication Hazards
+
+Normal Vector ISAs, and those Packed SIMD ISAs inspired by them, have
+Vector "Permute" or "Shuffle" instructions. These provide a Vector of
+indices whereby another Vector is reordered (permuted, shuffled) according
+to the indices. Register Hazard Management here is trivial because there
+are three registers: indices source vector, elements source vector to
+be shuffled, result vector.
+
+For SVP64, which is based on top of a Scalar Register File paradigm,
+combined with the hard requirement to respect full Register Hazard
+Management as if element instructions were actual Scalar instructions,
+the addition of a Vector permute instruction under these strict
+conditions would result in a catastrophic
+reduction in performance, due to having to consider Read-after-Write
+and Write-after-Read Hazards *at the element level*.
+
+A little leniency and rule-bending is therefore required.
+
+Rather than add explicit Vector permute instructions, the "Indexing"
+has been separated out into a REMAP Schedule. When an Indexed
+REMAP is requested, it is assumed (required, of software) that
+subsequent instructions intending to use those indices *will not*
+attempt to modify the indices. It is *Software* that must consider them
+to be read-only.
+
+This simple relaxation of the rules releases Hardware from having the
+horrendous job of dynamically detecting Write-after-Read Hazards on a
+huge range of registers.
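+
+A rough sketch of what an active Indexed REMAP does to element numbering
+(illustrative only, not the specification pseudocode; the `indices` table
+and function names are placeholders), applied here to one source operand
+of an add:
+
+```
+# illustrative only: Indexed REMAP redirects the element number used for
+# one operand; software must treat `indices` as read-only while it is active
+def indexed_remap_add(regs, RT, RA, RB, indices, VL):
+    for i in range(VL):
+        j = indices[i]        # element i redirected through the index table
+        regs[RT + i] = regs[RA + j] + regs[RB + i]
+```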
+
+A similar Hazard problem exists for CR Field Predicates, in Vertical-First
+Mode. Instructions could modify CR Fields currently being used as Predicate
+Masks: detecting this is so horrendous for hardware resource utilisation
+and hardware complexity that, again, the decision is made to relax these
+constraints and for Software to take that into account.
+
+# Floating-Point "Single" becomes "Half"
+
+In several places in the Power ISA there are operations that are on
+32-bit quantities in 64-bit registers. The best example is FP, which
+has 64-bit operations (`fadd`) and 32-bit operations (`fadds` or
+FP Add "single"). Element-width overrides would seem to
+be unnecessary under these circumstances.
+
+However, it is not possible for `fadds` to fit two elements into
+64-bit: that breaks the simplicity of SVP64.
+Bear in mind that the FP32 bits are spread out across a 64
+bit register in FP64 format. The solution here was to consider the
+"s" at the end of each instruction
+to mean "half of the element's width". Thus, `sv.fadds/ew=32`
+actually stores an FP16 spread out across the 32 bits of an
+element, in FP32 format, where `sv.fadd/ew=32` stores a full
+FP32 result into the full 32 bits.
+
+Where this breaks down is when attempting to do half-width on
+BF16 or FP16 operations: there does not exist a BF8 or an IEEE754 FP8
+format, so these (`sv.fadds/ew=8`) should be avoided.
+
+# Vertical-First and Subvectors
+
+Documented in the [[sv/setvl]] page, Vertical-First goes through
+instructions first and elements second, and requires an explicit
+[[sv/svstep]] instruction to move to the next element
+(whereas Horizontal-First
+loops through elements in full first before moving on to
+the next instruction): *Subvectors are considered "elements"*
+in Vertical-First Mode.
+
+Conceptually this is quite easy: keep in mind that a Vertical-First
+instruction does one element at a time, and when SUBVL is set,
+that "element" in essence becomes a vec2/3/4.
+
+# Swizzle and Pack/Unpack
+
+These are both so weird it's best to just read the pages in full
+and pay attention: [[sv/mv.swizzle]] and [[sv/mv.vec]].
+Swizzle Moves only engage with vec2/3/4, *reordering* the copying
+of the sub-vector elements (including allowing repeats and skips)
+based on an immediate supplied by the instruction. The fun
+comes when Pack/Unpack are enabled, and it is really important
+to be aware of how the Arrays of vec2/3/4 become re-ordered
+*and swizzled at the same time*.
+
+Pack/Unpack started out as
+[[sv/mv.vec]] but became its own distinct Mode over time.
+The main thing to keep in mind about Pack/Unpack
+is that it engages a swap of the ordering of the VL-SUBVL
+nested for-loops, in exactly the same way that Matrix REMAP
+can do.
+When Pack or Unpack is enabled it is the SUBVL for-loop
+that becomes outermost. A bit of thought shows that this is
+a 2D "Transpose" where Dimension X is VL and Dimension Y is SUBVL.
+However *both* source *and* destination may be independently
+"Transposed", which makes no sense at all until the fact that
+Swizzle can have a *different SUBVL* is taken into account.
+
+Basically Pack/Unpack covers everything that VSX `vpkpx` and
+other ops can do, and then some: Saturation included, for arithmetic ops.
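+
+A rough sketch of that loop-order swap (illustrative only; `VL`, `SUBVL`
+and `pack` are generic parameters, not the spec's pseudocode), showing
+why it amounts to a VL-by-SUBVL Transpose of the element ordering:
+
+```
+# illustrative only: the order in which sub-elements of a flat VL*SUBVL
+# array are visited, with and without Pack/Unpack enabled
+def element_order(VL, SUBVL, pack):
+    order = []
+    if pack:                                  # SUBVL for-loop becomes outermost
+        for s in range(SUBVL):
+            for i in range(VL):
+                order.append(i * SUBVL + s)   # a VL x SUBVL "Transpose"
+    else:                                     # normal nesting: VL outer, SUBVL inner
+        for i in range(VL):
+            for s in range(SUBVL):
+                order.append(i * SUBVL + s)
+    return order
+```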
+
+# LD/ST with zero-immediate vs mapreduce mode
+
+LD/ST operations with a zero immediate effectively mean that on a
+Vector operation the element index used to offset the memory location is
+multiplied by zero. Thus, a sequence of LD operations will load from
+the exact same address, and likewise STs will store to the exact same address.
+
+Ordinarily this would make absolutely no sense whatsoever, except
+that Power ISA has cache-inhibited LD/STs (Power ISA v3.1, Book III,
+1.6.1, p1033), for accessing memory-mapped
+peripherals and other crucial uses. Thus, *despite not being a mapreduce mode*,
+zero-immediates cause multiple hits on the same address
+(a rough sketch of the address calculation is given at the end of this page).
+
+Mapreduce mode is not actually mapreduce at all: it is
+a relaxation of the normal rule where, if the destination is a Scalar, the
+Vector for-looping is not terminated on first write to the destination.
+Instead, the developer is expected to exploit the strict Program Order and
+make one of the sources the same as that Scalar destination, effectively
+making that Scalar register an "Accumulator", thus creating the *appearance*
+(effect) of Simple-V having a mapreduce capability, when in fact it is
+more of an artefact.
+
+LD/ST zero-immediate has quirky overwriting similar to the "mapreduce"
+mode, but actually requires the registers to be Vectors. It is simply
+a mathematical artefact of multiplying by zero, which happens to be
+useful for cache-inhibited operations.
+
+# Limited space in LD/ST Mode
+
+As pointed out in the [[sv/ldst]] page there is limited space in only
+5 mode bits to fully express all potential modes of operation.
+
+* LD/ST Immediate has no individual control over src/dest zeroing,
+  whereas LD/ST Indexed does.
+* LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
+* LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
+
+These are not insurmountable problems: there do exist workarounds.
+For example it is possible to set up Matrix REMAP to perform the same
+job as Pack/Unpack, at which point the LD/ST "Saturation" mode may
+be used, saving on costly intermediary registers at double the LD
+width. Also, although potentially costly, it may be possible to
+use Indexed Mode after using `svstep` to compute a sequence of
+Indices, then activate either `sz` or `dz` as required, as a workaround
+for LDST Immediate only having `zz`.
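+
+Finally, referring back to the zero-immediate quirk above, a rough sketch
+(illustrative only, element-strided case; the function and parameter names
+are placeholders, not the specification pseudocode) of why a zero
+immediate pins every element to the same address:
+
+```
+# illustrative only: element-strided effective-address calculation, with
+# the immediate acting as the per-element stride
+def element_ea(regs, RA, imm, i):
+    return regs[RA] + i * imm     # imm=0: every element hits the same address
+```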