From de442bcfa03103954b50c005c8eafac22ecce13c Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 12 Jun 2022 14:33:47 +0100
Subject: [PATCH]

---
 openpower/sv/mv.swizzle.mdwn | 53 ++++++++++++++++++++++++++++++++----
 1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/openpower/sv/mv.swizzle.mdwn b/openpower/sv/mv.swizzle.mdwn
index 9e747abc0..d4c141251 100644
--- a/openpower/sv/mv.swizzle.mdwn
+++ b/openpower/sv/mv.swizzle.mdwn
@@ -121,12 +121,13 @@ the instruction in software. Performance will obviously be adversely affected.
 See [[sv/compliancy_levels]]: all aspects of
 Swizzle are entirely optional in hardware at the Embedded Level.*
 
-Implementors must consider `SUBVL` to have been implicitly set by
-the Swizzle instructions. Hardware may statically calculate `SUBVL`
-from the immediate.  "W.0Z" is SUBVL=4, where "X0Z." is SUBVL=3,
-and ".W.." sets SUBVL=2.  Setting `SUBVL` has a different meaning
-in Swizzle Move instructions,
-as explained below.
+Implementors must consider Swizzle instructions to be atomically
+indivisible, even if implemented in micro-code.  Elsewhere, SVP64 permits
+element-level operations to be Precise-Interrupted part-way through a
+vector: *Swizzle moves may not be*.  All XYZW elements *must* be completed
+in full before any Trap or Interrupt is permitted to be serviced.
+Out-of-Order Micro-architectures may of course cancel the in-flight
+instruction as usual if the Interrupt requires fast servicing.
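+
+As a minimal sketch of this rule, the Python below models one Swizzle
+move in a hypothetical simulator.  It assumes that `X`/`Y`/`Z`/`W` select
+source elements 0 to 3 and that `0`/`1` insert constants (an illustrative
+encoding, not taken from the specification); `swizzle_move` and
+`irq_after` are hypothetical names.  An interrupt arriving part-way
+through is only latched: it is serviced after every element is written.
+
+```python
+from typing import List, Optional
+
+# Hypothetical selector encoding (assumption, for illustration only).
+SRC_OFFSET = {"X": 0, "Y": 1, "Z": 2, "W": 3}
+CONSTANT = {"0": 0, "1": 1}
+
+
+def swizzle_move(regs: List[int], rt: int, ra: int, selector: str,
+                 irq_after: Optional[int] = None) -> List[str]:
+    """Model one Swizzle move as an indivisible unit; return an event log.
+
+    irq_after: element index after which an interrupt arrives.  It is
+    latched, not serviced, until all XYZW elements have completed.
+    """
+    log: List[str] = []
+    irq_pending = False
+    # Snapshot the source sub-vector first (a modelling choice, so that
+    # overlapping RT/RA never observes partially-written results).
+    sources = regs[ra:ra + 4]
+    for k, sel in enumerate(selector):
+        if sel in CONSTANT:
+            regs[rt + k] = CONSTANT[sel]
+        else:
+            regs[rt + k] = sources[SRC_OFFSET[sel]]
+        log.append(f"elem{k}")
+        if irq_after == k:
+            irq_pending = True  # no Precise-Interrupt between elements
+    if irq_pending:
+        log.append("irq")  # serviced only after the whole instruction
+    return log
+
+
+regs = [10, 20, 30, 40, 0, 0, 0, 0]
+print(swizzle_move(regs, rt=4, ra=0, selector="WZ0X", irq_after=1))
+# ['elem0', 'elem1', 'elem2', 'elem3', 'irq']
+print(regs)  # [10, 20, 30, 40, 40, 30, 0, 10]
+```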
 
 # RM Mode Concept:
 
@@ -143,3 +144,43 @@ The inclusion of a separate src SUBVL allows
 `sv.mv.swiz RT.vecN RA.vecN` to mean zip/unzip (pack/unpack).
 This is conceptually achieved by having both source and
 destination SUBVL be "outer" loops instead of inner loops.
+
+Illustrating a "normal" SVP64 operation with `SUBVL!=1`
+(assuming no elwidth overrides):
+
+    def index():
+        for i in range(VL):
+            for j in range(SUBVL):
+                yield i*SUBVL+j
+
+    for idx in index():
+        operation_on(RA+idx)
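+
+For comparison, a runnable Python version of that loop (with `VL` and
+`SUBVL` passed as parameters, an assumption made so the sketch stands
+alone) shows that the "normal" ordering simply walks the registers
+contiguously: element `i`, sub-element `j` sits at offset `i*SUBVL+j`,
+so the offsets run from `0` to `VL*SUBVL-1` in order.
+
+```python
+from typing import Iterator
+
+
+def index(vl: int, subvl: int) -> Iterator[int]:
+    """Element offsets for a "normal" SVP64 operation: VL outer, SUBVL inner."""
+    for i in range(vl):
+        for j in range(subvl):
+            yield i * subvl + j
+
+
+# VL=3, SUBVL=2: three 2-element sub-vectors, visited in register order.
+print(list(index(3, 2)))  # [0, 1, 2, 3, 4, 5]
+```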
+
+For a separate source/dest SUBVL (again, no elwidth overrides):
+
+    # yield an outer-SUBVL, inner VL loop with SRC SUBVL
+    def index_src():
+        for j in range(SRC_SUBVL):
+            for i in range(VL):
+                yield i*SRC_SUBVL+j
+
+    # yield an outer-SUBVL, inner VL loop with DEST SUBVL
+    def index_dest():
+        for j in range(SUBVL):
+            for i in range(VL):
+                yield i*SUBVL+j
+
+    # walk through both source and dest indices simultaneously
+    for src_idx, dst_idx in zip(index_src(), index_dest()):
+        move_operation(RT+dst_idx, RA+src_idx)
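+
+A self-contained Python sketch of the same loop follows.  `VL`, `SUBVL`
+and `SRC_SUBVL` are passed as parameters and the register file is
+modelled as a list; both are assumptions, and `swizzle_mv` is a
+hypothetical helper name.  Because `zip()` stops at the shorter
+generator, this sketch (like the loop above) performs
+`VL*min(SRC_SUBVL, SUBVL)` moves when the two SUBVLs differ.
+
+```python
+from typing import Iterator, List
+
+
+def index_src(vl: int, src_subvl: int) -> Iterator[int]:
+    # outer loop over sub-vector component j, inner loop over element i
+    for j in range(src_subvl):
+        for i in range(vl):
+            yield i * src_subvl + j
+
+
+def index_dest(vl: int, subvl: int) -> Iterator[int]:
+    # same transposed walk, using the destination SUBVL
+    for j in range(subvl):
+        for i in range(vl):
+            yield i * subvl + j
+
+
+def swizzle_mv(regs: List[str], rt: int, ra: int,
+               vl: int, subvl: int, src_subvl: int) -> None:
+    # zip() advances both generators in lock-step, one offset each per move
+    for src_idx, dst_idx in zip(index_src(vl, src_subvl),
+                                index_dest(vl, subvl)):
+        regs[rt + dst_idx] = regs[ra + src_idx]
+
+
+# Unpack-style settings: SRC_SUBVL=2 (XY pairs at RA), SUBVL=1 at RT.
+regs = ["x0", "y0", "x1", "y1", "x2", "y2"] + ["-"] * 6
+swizzle_mv(regs, rt=6, ra=0, vl=3, subvl=1, src_subvl=2)
+print(regs[6:])  # ['x0', 'x1', 'x2', '-', '-', '-']
+```
+
+With `SUBVL=2, SRC_SUBVL=1` instead, three contiguous source elements
+`a, b, c` land at destination offsets 0, 2 and 4.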
+
+Python's `yield` is used here for simplicity and clarity.
+The two Finite State Machines that generate the source and
+destination element offsets advance together in lock-step:
+each move consumes exactly one offset from each.
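+
+The `yield`-based generators can equally be written as explicit state
+machines, closer to how hardware might step them.  In this illustrative
+sketch each side holds two counters and is stepped once per move;
+`OffsetFSM` and `lockstep_pairs` are hypothetical names, and stopping as
+soon as either side finishes mirrors the `zip()` in the pseudocode.
+
+```python
+from dataclasses import dataclass
+from typing import List, Tuple
+
+
+@dataclass
+class OffsetFSM:
+    """Outer-SUBVL, inner-VL offset generator as an explicit state machine."""
+    vl: int
+    subvl: int
+    i: int = 0  # element counter (inner loop, 0..VL-1)
+    j: int = 0  # sub-vector component counter (outer loop, 0..SUBVL-1)
+
+    def done(self) -> bool:
+        return self.vl == 0 or self.j >= self.subvl
+
+    def step(self) -> int:
+        """Emit the current offset, then advance the counters by one."""
+        offset = self.i * self.subvl + self.j
+        self.i += 1
+        if self.i == self.vl:  # inner loop wrapped: next component
+            self.i = 0
+            self.j += 1
+        return offset
+
+
+def lockstep_pairs(vl: int, src_subvl: int,
+                   subvl: int) -> List[Tuple[int, int]]:
+    """(src_offset, dst_offset) pairs, both FSMs stepping once per move."""
+    src = OffsetFSM(vl, src_subvl)
+    dst = OffsetFSM(vl, subvl)
+    pairs: List[Tuple[int, int]] = []
+    while not (src.done() or dst.done()):
+        pairs.append((src.step(), dst.step()))
+    return pairs
+
+
+print(lockstep_pairs(3, 2, 1))  # [(0, 0), (2, 1), (4, 2)]
+```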
+
+Although not prohibited, Software is not expected to set both
+source and destination SUBVL to values greater than 1 at the same
+time.  Usually either `SRC_SUBVL=1, SUBVL=2/3/4` is used to give
+a "pack" effect, or `SUBVL=1, SRC_SUBVL=2/3/4` to give an
+"unpack".
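+
+The pairing can also be written in closed form, derived here from the
+pseudocode above rather than quoted from a specification.  Both
+generators iterate `j` in the outer loop and `i` in the inner loop, so
+their k-th outputs share `i = k % VL` and `j = k // VL`:
+
+```latex
+\begin{aligned}
+i &= k \bmod VL, \qquad j = \lfloor k / VL \rfloor,
+  \qquad 0 \le k < VL \cdot \min(\mathit{SRC\_SUBVL}, \mathit{SUBVL}) \\
+\mathit{src}_k &= i \cdot \mathit{SRC\_SUBVL} + j \\
+\mathit{dst}_k &= i \cdot \mathit{SUBVL} + j
+\end{aligned}
+```
+
+With `SRC_SUBVL=1, SUBVL=S` (pack) only `j=0` occurs, so source offset
+`k` is written to destination offset `k*S`; with `SUBVL=1, SRC_SUBVL=S`
+(unpack) the roles swap, and source offset `k*S` is written to
+destination offset `k`.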
-- 
2.30.2