Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
-the usual strict Element-level Program Order is relaxed but only for
-Horizontal-First Mode:
-
-* In Horizontal-First Mode, an overlap between all and any Vectorised
- sources and destination Elements for the entirety of
- the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.
-* In Vertical-First Mode, an overlap on any given one execution of
- the Swizzle instruction requires that all Swizzled source elements be
- copied into intermediary buffers (in-flight Reservation Stations,
- pipeline registers) **before* being swapped and placed in
- destinations. In-place (RT=RA) is required to work correctly.
- Strict Program Order is required in full.
-
-*Implementor's note: the cost of Vertical-First Mode in an Embedded design
-of storing four 64-bit in-flight elements may be considered
-too high. If this is the
-case it is acceptable to throw an Illegal Instruction Trap, and emulate
-the instruction in software. Performance will obviously be adversely affected.
-See [[sv/compliancy_levels]]: all aspects of
-Swizzle are entirely optional in hardware at the Embedded Level.*
-
-Implementors must consider Swizzle instructions to be atomically indivisible,
-even if implemented as Micro-coded. The rest of SVP64 permits element-level
-operations to be Precise-Interrupted: *Swizzle moves do not* because
-the multiple moves are, strictly speaking, one instruction. All XYZW
-elements *must* be completed in full before any Trap or Interrupt is
-permitted
-to be serviced. Out-of-Order Micro-architectures may of course cancel
-the in-flight instruction as usual if the Interrupt requires fast servicing.
+the usual strict Element-level Program Order is relaxed.
+Any overlap between Vectorised source Elements and Vectorised
+destination Elements, taken across the entirety of
+the Vector Loop `0..VL-1`, is `UNDEFINED` behaviour.
+
+This in turn implies that Traps and Exceptions are, as usual,
+permitted in between element-level moves: because there is
+no overlap, there is no risk of an overwrite destroying a
+source that has yet to be read.
Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode: