Here, both srcstep and dststep remain in lockstep because sz=dz=1
+# Twin Predication <a name="2p"> </a>
+
+This is a novel concept that allows two independent predicates to be
+applied to an operation with a single source and a single destination
+register: one predicate for the source, one for the destination. The
+following types of traditional Vector operations may be encoded with it,
+*without requiring explicit opcodes to do so*:
+
+* VSPLAT (a single scalar distributed across a vector)
+* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
+* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
+* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
+* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
+
+Those patterns (and more) may be applied to:
+
+* mv (the usual way that V\* ISA operations are created)
+* exts\* sign-extension
+* rlwinm and other RS-RA shift operations (**note**: excluding
+ those that take RA as both a source and a destination. These are
+ not 1-src 1-dest, they are 2-src 1-dest)
+* LD and ST (treating AGEN as one source)
+* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
+* Condition Register ops mfcr, mtcr and other similar operations
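+
+The patterns and operations listed above can be illustrated with a
+minimal Python model of a twin-predicated 1-src, 1-dest loop. This is a
+sketch under stated assumptions, not the SVP64 specification pseudocode.
+It assumes non-zeroing predication, ignores the predicate of a scalar
+operand, and ends the loop after the first write to a scalar destination.
+`srcstep` and `dststep` mirror the counters of the same name. `twin_pred`,
+`src_vec`, `dst_vec` and `extsb` are hypothetical names chosen for
+illustration. Registers are modelled as Python lists, with a scalar as a
+one-element list.
+
+```python
+from typing import Callable, List
+
+
+def twin_pred(src: List[int], dst: List[int], vl: int,
+              src_pred: int, dst_pred: int,
+              src_vec: bool, dst_vec: bool,
+              op: Callable[[int], int] = lambda x: x) -> List[int]:
+    """Hypothetical model of a twin-predicated 1-src, 1-dest operation.
+
+    Bit k of src_pred / dst_pred set means element k is active. A scalar
+    operand (src_vec / dst_vec False) always uses element 0 and, in this
+    simplified model, its predicate is ignored.
+    """
+    out = list(dst)
+    srcstep = dststep = 0
+    while srcstep < vl and dststep < vl:
+        # Skip masked-out source elements (vector source only).
+        if src_vec:
+            while srcstep < vl and not (src_pred >> srcstep) & 1:
+                srcstep += 1
+        # Skip masked-out destination elements (vector destination only).
+        if dst_vec:
+            while dststep < vl and not (dst_pred >> dststep) & 1:
+                dststep += 1
+        if srcstep >= vl or dststep >= vl:
+            break  # one of the predicates has no active elements left
+        out[dststep] = op(src[srcstep])  # a scalar side stays at step 0
+        if src_vec:
+            srcstep += 1
+        if not dst_vec:
+            break  # assumption: a scalar destination is written once
+        dststep += 1
+    return out
+
+
+def extsb(x: int) -> int:
+    """Sign-extend the low byte, as a stand-in for exts*."""
+    return ((x & 0xFF) ^ 0x80) - 0x80
+
+
+vec, ALL, r3 = [10, 20, 30, 40], 0b1111, 2  # r3: hypothetical runtime index
+print(twin_pred([7], [0] * 4, 4, ALL, ALL, False, True))    # VSPLAT -> [7, 7, 7, 7]
+print(twin_pred(vec, [0], 4, 1 << r3, ALL, True, False))    # VEXTRACT -> [30]
+print(twin_pred([99], vec, 4, ALL, 1 << r3, False, True))   # VINSERT -> [10, 20, 99, 40]
+print(twin_pred(vec, [0] * 4, 4, 0b1010, ALL, True, True))  # VCOMPRESS -> [20, 40, 0, 0]
+print(twin_pred(vec, [0] * 4, 4, ALL, 0b1010, True, True))  # VEXPAND -> [0, 10, 0, 20]
+print(twin_pred([0x7F, 0x80, 0xFF, 0x01], [0] * 4, 4, 0b0110, ALL,
+                True, True, op=extsb))  # exts* + compress -> [-128, -1, 0, 0]
+```
+
+Only the two predicates and the scalar/vector flags change between
+these calls. The loop itself, like the instruction encoding, stays the same.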
+
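+This is a huge list that creates extremely powerful combinations,
+particularly given that one of the predicate options is `(1<<r3)`.
+This is a single-bit mask whose position is taken from r3 at runtime.
+It selects exactly one element, which gives VEXTRACT and VINSERT a
+dynamic index.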
+
+An additional, unusual capability of Twin Predication is a back-to-back
+VCOMPRESS-VEXPAND, which is effectively the ability to perform multiple
+sequentially ordered VINSERTs. The source predicate selects an ordered
+subset of elements to be inserted; the destination predicate specifies
+the ordered recipient locations. A single instruction is therefore
+equivalent to `llvm.masked.compressstore.*` followed by
+`llvm.masked.expandload.*`.
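+
+The compress-then-expand equivalence can be checked with a short Python
+sketch. `compress` and `expand` are hypothetical stand-ins for the LLVM
+intrinsics, operating on Python lists rather than memory. `twin_pred_vv`
+is a hypothetical model of a single vector-to-vector twin-predicated pass
+that pairs the i-th active source element with the i-th active
+destination element. The sketch assumes non-zeroing predication. It also
+assumes that if the source predicate runs out first, the remaining
+destination elements keep their old values.
+
+```python
+from itertools import product
+from typing import List
+
+
+def compress(src: List[int], mask: int) -> List[int]:
+    """Pack the active elements of src contiguously (compressstore-like)."""
+    return [x for k, x in enumerate(src) if (mask >> k) & 1]
+
+
+def expand(packed: List[int], dst: List[int], mask: int) -> List[int]:
+    """Fill the active lanes of dst, in order, from packed (expandload-like).
+
+    Lanes left over once packed is exhausted keep their old value.
+    """
+    out = list(dst)
+    values = iter(packed)
+    for k in range(len(out)):
+        if (mask >> k) & 1:
+            out[k] = next(values, out[k])
+    return out
+
+
+def twin_pred_vv(src: List[int], dst: List[int],
+                 src_pred: int, dst_pred: int) -> List[int]:
+    """One twin-predicated pass: i-th active source -> i-th active dest."""
+    src_lanes = [k for k in range(len(src)) if (src_pred >> k) & 1]
+    dst_lanes = [k for k in range(len(dst)) if (dst_pred >> k) & 1]
+    out = list(dst)
+    for i, j in zip(src_lanes, dst_lanes):  # stops when either runs out
+        out[j] = src[i]
+    return out
+
+
+# Exhaustively compare both formulations for VL=5 and every mask pair.
+src, dst = [10, 20, 30, 40, 50], [1, 2, 3, 4, 5]
+for sp, dp in product(range(32), repeat=2):
+    assert expand(compress(src, sp), dst, dp) == twin_pred_vv(src, dst, sp, dp)
+print(twin_pred_vv(src, dst, 0b10110, 0b01011))  # -> [20, 30, 3, 50, 5]
+```
+
+Pairing the two lists of active lanes with `zip` gives the same result as
+stepping `srcstep` and `dststep` past masked-out elements.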
+
+This power and flexibility comes down to the fact that SVP64 is not
+actually a Vector ISA: it is a loop-abstraction concept applied *in
+general* to Scalar operations, much like the x86 `REP` prefix (on
+steroids).
+
# EXTRA Pack/Unpack Modes
The pack/unpack concept of VSX `vpack` is abstracted out as a Sub-Vector
[[sv/mv.swizzle] has a slightly different pseudocode algorithm
for Vertical-First Mode.
-# Twin Predication <a name="2p"> </a>
-
-This is a novel concept that allows predication to be applied to a single
-source and a single dest register. The following types of traditional
-Vector operations may be encoded with it, *without requiring explicit
-opcodes to do so*
-
-* VSPLAT (a single scalar distributed across a vector)
-* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
-* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
-* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
-* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
-
-Those patterns (and more) may be applied to:
-
-* mv (the usual way that V\* ISA operations are created)
-* exts\* sign-extension
-* rwlinm and other RS-RA shift operations (**note**: excluding
- those that take RA as both a src and dest. These are not
- 1-src 1-dest, they are 2-src, 1-dest)
-* LD and ST (treating AGEN as one source)
-* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
-* Condition Register ops mfcr, mtcr and other similar
-
-This is a huge list that creates extremely powerful combinations,
-particularly given that one of the predicate options is `(1<<r3)`
-
-Additional unusual capabilities of Twin Predication include a back-to-back
-version of VCOMPRESS-VEXPAND which is effectively the ability to do
-sequentially ordered multiple VINSERTs. The source predicate selects a
-sequentially ordered subset of elements to be inserted; the destination
-predicate specifies the sequentially ordered recipient locations.
-This is equivalent to
-`llvm.masked.compressstore.*`
-followed by
-`llvm.masked.expandload.*`
-with a single instruction.
-
-This extreme power and flexibility comes down to the fact that SVP64
-is not actually a Vector ISA: it is a loop-abstraction-concept that
-is applied *in general* to Scalar operations, just like the x86
-`REP` instruction (if put on steroids).
-
# Reduce modes
Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal