(no commit message)

diff --git a/openpower/sv/svp_rewrite/svp64/discussion.mdwn b/openpower/sv/svp_rewrite/svp64/discussion.mdwn

index 685411667751fc3847bb7dfe33c607fadca2fd46..69fa010bd387a61dc4b94fd51575e218d2d93344 100644
--- a/openpower/sv/svp_rewrite/svp64/discussion.mdwn
+++ b/openpower/sv/svp_rewrite/svp64/discussion.mdwn
@@ -12,9 +12,9 @@ do not try to jam VL or MAXVL in.  go with the flow of 24 bits spare.
  * 1: select INT or CR predication
  * 3: predicate selection and inversion (QTY 2 for tpred)
  * 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
-* 2: saturate mode
+* 3: saturate mode
  
-totals: 24 bits
+totals: 22 bits (dest elwidth shared)
  
  http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html
  
@@ -24,49 +24,88 @@ twin predication and twin elwidth overrides is extremely important to have to be
  
  something like:
  
-| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 12 | 13 18 | 19 21 |
+| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 12 | 13 18 | 19 23 |
  | ----- | --- | --- | ---- | ---- | ----- | ----- | ----- |
-| subvl | sew | dew | ptyp | psrc | pdst  | vspec | sat   |
+| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |
  
  * subvl - 1 to 4 scalar / vec2 / vec3 / vec4
  * sew / dew - DEFAULT / 8 / 16 /32 element width
  * ptyp - predication INT / CR
  * psrc / pdst - predicate mask selector and inversion
  * vspec - 3 bit src / dest scalar-vector extension
-* sat: DEFAULT / 8bit / 16bit / 32bit (signed/unsigned)
+* mode: 5 bits
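
A minimal sketch of packing and unpacking the twin-predication layout in the table above. The field positions come from that table; the MSB0 bit numbering (bit 0 is the most significant of the 24) and the names `FIELDS`, `pack` and `unpack` are assumptions made for illustration, not part of the proposal.

```python
# Field table copied from the layout above: (name, first bit, last bit), inclusive.
FIELDS = [
    ("subvl", 0, 1),
    ("sew", 2, 3),
    ("dew", 4, 5),
    ("ptyp", 6, 6),
    ("psrc", 7, 9),
    ("pdst", 10, 12),
    ("vspec", 13, 18),
    ("mode", 19, 23),
]
TOTAL_BITS = 24


def pack(**values):
    """Pack named fields into a 24-bit integer (ASSUMES MSB0 numbering)."""
    word = 0
    for name, start, end in FIELDS:
        width = end - start + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        # MSB0: bit 0 is the most significant bit of the 24-bit field.
        shift = TOTAL_BITS - 1 - end
        word |= value << shift
    return word


def unpack(word):
    """Split a 24-bit integer back into a dict of named fields."""
    out = {}
    for name, start, end in FIELDS:
        width = end - start + 1
        shift = TOTAL_BITS - 1 - end
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out
```

The field widths add up to exactly 24, so `unpack(pack(...))` round-trips every field without overlap.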
+
+## twin predication, CR based.
+
+separate src and dest predicates are a critical part of SV for provision of VEXPAND, VREDUCE, VSPLAT, VINSERT and many more operations.
+
+Twin CR predication could be done in two ways:
+
+* start from different CRs for the src and dest
+* start from the same CR.
+
+With different bits being selectable (CR[0..3]), starting from the same CR makes some sense.
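
To illustrate why separate src and dest predicates give VEXPAND / VREDUCE style behaviour, here is a rough sketch of a twin-predicated element copy. It assumes plain integer bitmasks rather than CR fields, and assumes that each index independently skips to its next enabled bit; `twin_pred_copy` is a hypothetical name used only for this example.

```python
def twin_pred_copy(src, dest, src_mask, dest_mask, vl):
    """Copy src elements to dest under independent src/dest predicates.

    ASSUMPTION: the src index i and dest index j each skip ahead to their
    next enabled mask bit, and the loop ends when either reaches vl.
    """
    dest = list(dest)  # do not modify the caller's list
    i = j = 0
    while True:
        while i < vl and not (src_mask >> i) & 1:
            i += 1  # skip masked-out source elements
        while j < vl and not (dest_mask >> j) & 1:
            j += 1  # skip masked-out destination elements
        if i >= vl or j >= vl:
            break
        dest[j] = src[i]
        i += 1
        j += 1
    return dest


src = [10, 11, 12, 13]
# sparse src mask, full dest mask: enabled elements are packed together
reduced = twin_pred_copy(src, [0, 0, 0, 0], 0b1010, 0b1111, 4)
# full src mask, sparse dest mask: elements are spread out
expanded = twin_pred_copy(src, [0, 0, 0, 0], 0b1111, 0b1010, 4)
```

With a sparse src mask the result is `[11, 13, 0, 0]`; with a sparse dest mask it is `[0, 10, 0, 11]`.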
  
  # standard arith ops (single predication)
  
  these are of the form res = op(src1, src2, ...)
  
-| 0   1 | 2 3 | 4    | 5  7 | 8  16 | 17 19 |
-| ----- | --- | ---- | ---- | ----- | ----- |
-| subvl | ew  | ptyp | pred | vspec | sat   |
-
+| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 18 | 19  23 |
+| ----- | --- | --- | ---- | ---- | ----- | ------ |
+| subvl | sew | dew | ptyp | pred | vspec | mode   |
  
  * subvl - 1 to 4 scalar / vec2 / vec3 / vec4
-* ew - DEFAULT / 8 / 16 /32 element width
+* sew / dew - DEFAULT / 8 / 16 / 32 element width
  * ptyp - predication INT / CR
  * pred - predicate mask selector and inversion
  * vspec - 2/3 bit src / dest scalar-vector extension
-* sat: DEFAULT / 8bit / 16bit / 32bit (signed/unsigned)
+* mode - 5 bit
  
  For 2 op (dest/src1/src2) the vspec may be 3 bits per reg: total 9 bits.  For 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.
  
+Note:
+
+* the operation should always be done at max(srcwidth, dstwidth), unless it can
+  be proven that using the narrower width leads to the same result
+* saturation is done on the result at the **dest** elwidth
+
+Some examples on different operation widths:
+
+    u16 / u16 = u8
+    256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong
+
+    u8 * u8 = u16
+    255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong
+
+    u16 + u16 = u8
+    256 + 2 = 2 # this is correct whether we use the larger or smaller width
+                # aka hw can optimize narrowing addition
+
+# Mode
+
+     0   1   2    3    4     description
+     ---------------------------------------------
+     0   0   0    0    0     nothing
+     0   1   N         zero  sat mode: N=0/1 u/s
+     1   0   inv  CR bit     Rc=1: ffirst CR sel
+     1   0             zero  Rc=0: pred zero mode
+
+
  # Notes about rounding, clamp and saturate
  
  One of the issues with vector ops is that in integer DSP ops, for example in Audio, the operation must clamp or saturate rather than overflow or ignore the upper bits and become a modulo operation.  This is extremely important for Audio, as is providing an indicator as to whether saturation occurred.  See [[av_opcodes]].
  
  If there are spare bits it would be very good to look at using some of them to specify the mode, because otherwise an SPR has to be used which will need to be set and unset.  This can get costly.
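
As a rough illustration of clamp versus modulo behaviour, the sketch below saturates a value to a given element width and reports whether saturation occurred. The function name `saturate` and the returned flag are assumptions for this example; how the indicator would actually be exposed in hardware is not decided here.

```python
def saturate(value, bits, signed):
    """Clamp value to the range of a bits-wide integer.

    Returns (clamped_value, saturated), where saturated is True if clamping
    changed the value.
    """
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value


total = 200 + 100                     # two unsigned 8-bit samples added
modulo_result = total & 0xFF          # wraps around to 44
sat_result, did_saturate = saturate(total, 8, signed=False)
```

For audio the wrapped value (44) is an audible glitch, whereas the clamped value (255, with the flag set) is merely a clipped sample.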
  
-Idea: 2 bits for clamping mode? similar to elwidth:
+# Fail-on-first
  
-* 0b00 default (no clamp)
-* 0b01 8 bit (sel: -128/127, us:0/255)
-* 0b10 16 bit
-* 0b11 32 bit
+Data-dependent fail-on-first has two distinct variants: one for LD/ST, the other for arithmetic operations (actually, CR-driven).
+
+* LD/ST ffirst treats the first LD/ST in a vector as an ordinary one.  Exceptions occur "as normal".  However for elements 1 and above, if an exception would occur, then VL is **truncated** to the previous element.
+* Data-driven (CR-driven) fail-on-first activates when an Rc=1 or other CR-creating operation (including cmp) produces a result.  Similar to branch, an analysis of the CR is performed and if the test succeeds, the vector operation terminates all element operations at and above the current one, and VL is truncated to the *previous* element.
+
+The CR-based data-driven fail-on-first is new and not found in ARM SVE or RVV. It is extremely useful for reducing instruction count; however, high-performance implementations require speculative execution involving modifications of VL.
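
Both variants can be sketched as simple element loops. This is only an illustration under stated assumptions: "truncated to the previous element" is read as "VL becomes the number of elements before the failing one", the `test` callback stands in for the CR analysis, and memory is modelled as a dict in which a missing address means "would raise an exception". None of these names are part of the proposal.

```python
class Fault(Exception):
    """Stand-in for an ordinary LD/ST exception."""


def ffirst_load(memory, addrs, vl):
    """LD/ST fail-on-first: element 0 faults as normal, later ones truncate VL."""
    values = []
    for i in range(vl):
        addr = addrs[i]
        if addr not in memory:  # ASSUMPTION: missing key models a fault
            if i == 0:
                raise Fault(f"fault at address {addr}")
            return values, i    # VL truncated: only elements 0..i-1 are valid
        values.append(memory[addr])
    return values, vl


def ffirst_data_driven(op, test, src, vl):
    """Data-driven fail-on-first: stop at the first element whose test succeeds.

    ASSUMPTION: the result of the element that triggers the test is discarded.
    """
    results = []
    for i in range(vl):
        res = op(src[i])
        if test(res):           # stands in for the analysis of the CR
            return results, i   # VL truncated to the elements before i
        results.append(res)
    return results, vl


loaded, load_vl = ffirst_load({0: "a", 8: "b"}, [0, 8, 16, 24], 4)
computed, new_vl = ffirst_data_driven(lambda x: x - 1, lambda r: r == 0, [5, 3, 1, 7], 4)
```

In the load case VL drops from 4 to 2 without raising; in the data-driven case the third element produces zero, so VL also becomes 2 and only `[4, 2]` is kept.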
  
-not the same *as* elwidth.
  
  # Notes about Swizzle
  
@@ -82,3 +121,10 @@ with 2x12 this would mean no need to have complex encoding of swizzle.
  
  if we really do need 2 bits spare then the complex encoder of swizzle could be deployed.
  
+# note about INT predicate
+
+001    ALWAYS (implicit)       Operation is not masked
+
+this means by default that 001 will always be in nonpredicated ops, which seems anomalous.  would 000 be better to indicate "no predication"?
+
+000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"