From 898b798b6ec33c0a752645dcc2d46f1116186ed6 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Thu, 24 Dec 2020 14:36:27 +0000
Subject: [PATCH]

---
 openpower/sv/overview.mdwn | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/openpower/sv/overview.mdwn b/openpower/sv/overview.mdwn
index c51731c50..c3f9d1411 100644
--- a/openpower/sv/overview.mdwn
+++ b/openpower/sv/overview.mdwn
@@ -43,6 +43,7 @@ The rest of this document builds on the above simple loop to add:
 * A new concept: Data-dependent fail-first
 * Condition-Register based *post-result* predication (also new)
 * A completely new concept: "Twin Predication"
+* vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
 
 All of this is *without modifying the OpenPOWER v3.0B ISA*, except to add "wrapping context", similar to how v3.1B 64 Prefixes work.
 
@@ -98,6 +99,8 @@ A particularly interesting case is if the destination is scalar, and the first f
 
 If all three registers are marked as Vector then the "traditional" predicated Vector behaviour is provided.  Yet, just as before, all other options are still provided, right the way back to the pure-scalar case, as if this were a straight OpenPOWER v3.0B non-augmented instruction.
 
+Predication therefore provides several modes traditionally seen in Vector ISAs.  In particular, if the predicate can conveniently be set to a single bit, this gives VINSERT (VINDEX) behaviour.  VSPLAT (result broadcasting) is provided by making the sources scalar and the destination a vector.
+
 # Predicate "zeroing" mode
 
 Sometimes with predication it is ok to leave the masked-out element alone (not modify the result) however sometimes it is better to zero the masked-out elrments.  This can be combined with bit-wise ORing to build up vectors from multiple predicate patterns.  Our pseudocode therefore ends up as follows, to take that into account:
@@ -119,9 +122,9 @@ Many Vector systems either have zeroing or they have nonzeroing, they do not hav
 
 # Element Width overrides
 
-All good Vector ISAs have the usual bitwidths for operations: 8/16/32/64 bit integer operations, and IEEE754 FP32 and 64.  Often also included is FP16 and more recently BF16.  The *really* good Vector ISAs have variable-width vectors right down to bitlevel, and as high as 1024 bit arithmetic, as well as IEEE754 FP128.
+All good Vector ISAs have the usual bitwidths for operations: 8/16/32/64 bit integer operations, and IEEE754 FP32 and 64.  Often also included is FP16 and more recently BF16.  The *really* good Vector ISAs have variable-width vectors right down to bitlevel, and as high as 1024 bit arithmetic per element, as well as IEEE754 FP128.
 
-SV has an "override" system that *changes* the bitwidth of operations that were intended by the original scalar ISA designers to have (for example) 64 bit operations.  The override widths are 8, 16 and 32 for integer, and FP16 and FP32 for IEEE754 (with BF16 to be added in the future).
+SV has an "override" system that *changes* the bitwidth of operations that were intended by the original scalar ISA designers to be (for example) 64 bit only.  The override widths are 8, 16 and 32 for integer, and FP16 and FP32 for IEEE754 (with BF16 to be added in the future).
 
 This presents a particularly intriguing conundrum given that the OpenPOWER Scalar ISA was never designed with for example 8 bit operations in mind, let alone Vectors of 8 bit.
 
@@ -171,8 +174,18 @@ These basically provide a convenient parameterised way to access the register fi
        result = src1 + src2 # actual add here
        set_polymorphed_reg(rd, destwid, i, result)
 
+With this loop, if elwidth=16 and VL=3, the first 48 bits of the target register will contain three 16 bit addition results, and the upper 16 bits will be *unaltered*.
+
 Note that things such as zero/sign-extension (and predication) have been left out to illustrate the elwidth concept. Also note that it turns out to be important to perform the operation at the maximum bitwidth - `max(srcwid, destwid)` - such that any truncation, rounding errors or other artefacts may all be ironed out.  This turns out to be important when applying Saturation for Audio DSP workloads.
 
 Other than that, element width overrides, which can be applied to *either* source or destination or both, are pretty straightforward, conceptually.  The details, for hardware engineers, involve byte-level write-enable lines, which is exactly what is used on SRAMs anyway.  Compiler writers have to alter Register Allocation Tables to byte-level granularity.
 
 One critical thing to note: upper parts of the underlying 64 bit register are *not zero'd out* by a write involving a non-aligned Vector Length. An 8 bit operation with VL=7 will *not* overwrite the 8th byte of the destination.  This is extremely important to consider the register file as a byte-level store, not a 64-bit-level store.
+
+# Quick recap so far
+
+The above functionality covers around 85% of Vector ISA needs.  Predication is provided so that parallel if/then/else constructs can be performed: critical given that sequential if/then statements and branches simply do not translate successfully to Vector workloads.  VSPLAT capability is provided, which accounts for approximately 20% of all GPU workload operations.  Also covered, with elwidth overriding, are the smaller arithmetic operations that caused ISAs developed from the late 80s onwards to get themselves into a tizzy when adding "Multimedia" acceleration aka "SIMD" instructions.
+
+Experienced Vector ISA readers will however have noted that VCOMPRESS and VEXPAND are missing, as is Vector "reduce" (mapreduce) capability.  Compress and Expand are covered by Twin Predication; also yet to be covered are fail-on-first, CR-based result predication, and Subvectors and Swizzle.
+
+
-- 
2.30.2