more notes about scalar reduction

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn

index 43afa7e87529959058d67d46e7f309bb92962a3e..96f6021ecc168dfd41722c10c60b7ae202ee7991 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -1,6 +1,10 @@
  # Appendix
  
-This is the appendix to [[sv/svp64]]
+* <https://bugs.libre-soc.org/show_bug.cgi?id=574>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
+
+This is the appendix to [[sv/svp64]].  It provides explanations of modes
+and related details, leaving the main svp64 page free to focus on outlining the instruction format.
  
  Table of contents:
  
@@ -14,7 +18,7 @@ independent.  XER SO and other global "accumulation" flags (CR.OV) cause
  Read-Write Hazards on single-bit global resources, having a significant
  detrimental effect.
  
-Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including in cmp instructions) .  XER is
+Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including in `cmp` instructions).  XER is
  simply neither read nor written.  This includes when `scalar identity behaviour` occurs.  If precise OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1 instructions should be used without an SV Prefix.
  
  An interesting side-effect of this decision is that the OE flag is now free for other uses when SV Prefixing is used.
@@ -106,63 +110,6 @@ This is equivalent to
  followed by
  `llvm.masked.expandload.*`
  
-# LOAD/STORE Elwidths <a name="ldst"></a>
-
-Loads and Stores are almost unique in that the OpenPOWER Scalar ISA provides a width for the operation (lb, lh, lw, ld).  There are therefore three widths involved:
-
-* operation width (lb=8, lh=16, lw=32, ld=64)
-* source width override
-* destination element override
-
-The reason for all three is because Saturation (and other transformations) may occur in between, which rely on the source and destination width, and have nothing to do (per se) with the operation width.
-
-Note the following:
-
-* `scalar identity behaviour` SV Context parameter conditions turn this
-  into a straight absolute fully-compliant Scalar v3.0B LD operation
-* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
-  rather than `ld`)
-* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`)
-* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`
-* `svctx` specifies the SV Context and includes VL as well as source and
-  destination elwidth overrides.
-
-Below is the pseudocode for Unit-Strided LD (which includes Vector capability):
-
-    # LD not VLD! (ldbrx if brev=True)
-    # this covers unit stride mode
-    function op_ld(rd, rs, brev, op_width, imm_offs, svctx)
-      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
-
-        # unit stride mode, compute the address
-        srcbase = ireg[rs] + i * op_width;
-
-        # takes care of (merges) processor LE/BE and ld/ldbrx
-        bytereverse = brev XNOR MSR.LE
-
-        # read the underlying memory
-        memread <= mem[srcbase + imm_offs];
-
-        # optionally performs 8-byte swap (because src_elwidth=64)
-        if (bytereverse):
-            memread = byteswap(memread, op_width)
-
-        # now truncate to source over-ridden width.
-        if (svctx.src_elwidth != default)
-            memread = adjust_wid(memread, op_width, svctx.src_elwidth)
-
-        # note that here we would now do saturation if it was enabled.
-        ... saturation adjustment...
-
-        # takes care of inserting memory-read (now correctly byteswapped)
-        # into regfile underlying LE-defined order, into the right place
-        # within the NEON-like register, respecting destination element
-        # bitwidth, and the element index (j)
-        set_polymorphed_reg(rd, svctx.dest_bitwidth, j, memread)
-
-        # increments both src and dest element indices (no predication here)
-        i++;
-        j++;
  
  # Rounding, clamp and saturate
  
@@ -200,6 +147,64 @@ Note that the operation takes place at the maximum bitwidth (max of src and dest
  
  # Reduce mode
  
+There are two variants here.  The first is when the destination is scalar
+and at least one of the sources is Vector.  The second is more complex
+and involves map-reduction on vectors.
+
+The first defining characteristic distinguishing Scalar-dest reduce mode
+from Vector reduce mode is that Scalar-dest reduce issues VL element
+operations, whereas Vector reduce mode performs an actual map-reduce
+(tree reduction): typically `O(VL log VL)` actual computations.
+
+The second defining characteristic of scalar-dest reduce mode is that it
+is, in simplistic and shallow terms, *serial and sequential in nature*,
+whereas Vector reduce mode is inherently parallelisable.
+
+Scalar-dest reduce mode is described as only "simplistically" serial and
+sequential because in certain circumstances (such as an `OR` operation
+or a MIN/MAX operation) it may be possible to parallelise the reduction.
+
+## Scalar result reduce mode
+
+In this mode, one register is identified as being the "accumulator".
+Scalar reduction is thus categorised by:
+
+* one of the sources is a Vector
+* the destination is a scalar
+* optionally, but most usefully, one source register is also the destination
+* the source register type is the same as the destination register
+  type identified as the "accumulator".  Scalar reduction on `cmp`,
+  `setb` or `isel` is not possible, for example, because of the mixture
+  of CRs and GPRs.
+
+Typical applications include simple operations such as `ADD r3, r10.v,
+r3` where, clearly, r3 is being used to accumulate the addition of all
+elements in the vector starting at r10.
+
+     # add RT, RA,RB but when RT==RA
+     for i in range(VL):
+          iregs[RA] += iregs[RB+i] # RT==RA
+
+However, *unless* the operation is marked as "mapreduce", SV ordinarily
+**terminates** at the first scalar operation.  Only by marking the
+operation as "mapreduce" will it continue to issue multiple sub-looped
+(element) instructions in `Program Order`.
+
+Other examples include shift-mask operations where a Vector of inserts
+into a single destination register is required, as a way to construct
+a value quickly from multiple arbitrary bit-ranges and bit-offsets.
+Using the same register as both the source and destination, with Vectors
+of different offsets, masks and values to be inserted, has multiple
+applications including Video, cryptography and JIT compilation.
+
+Subtract and Divide are still permitted to be executed in this mode,
+although from an algorithmic perspective it is strongly discouraged.
+It would be better to use addition followed by one final subtract,
+or in the case of divide, to get better accuracy, to perform a multiply
+cascade followed by a final divide.
+
+## Vector result reduce mode
+
  1. limited to single predicated dual src operations (add RT, RA, RB).
     triple source operations are prohibited (fma).
  2. limited to operations that make sense.  divide is excluded, as is
@@ -379,12 +384,12 @@ applies, **not** the CR\_bit portion (bits 0:1):
      else:
          spec = EXTRA2<<1 | 0b0
      if spec[2]:
-       # vector constructs "BA[2:4] spec[0:1] 0 BA[0:1]"
-       return ((BA >> 2)<<5) | # hi 3 bits shifted up
-              (spec[0:1]<<3) |  # to make room for these
+       # vector constructs "BA[2:4] spec[0:1] 00 BA[0:1]"
+       return ((BA >> 2)<<6) | # hi 3 bits shifted up
+              (spec[0:1]<<4) | # to make room for these
                (BA & 0b11)      # CR_bit on the end
      else:
-       # scalar constructs "0 spec[0:1] BA[0:4]"
+       # scalar constructs "00 spec[0:1] BA[0:4]"
         return (spec[0:1] << 5) | BA
  
  Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
@@ -392,10 +397,10 @@ algorithm to determine CR\_reg is modified as follows:
  
      CR_index = 7-(BA>>2)      # top 3 bits but BE
      if spec[2]:
-        # vector mode
-        CR_index = (CR_index<<3) | (spec[0:1] << 1)
+        # vector mode, 0-124 increments of 4
+        CR_index = (CR_index<<4) | (spec[0:1] << 2)
      else:
-        # scalar mode
+        # scalar mode, 0-31 increments of 1
          CR_index = (spec[0:1]<<3) | CR_index
      # same as for v3.0/v3.1 from this point onwards
      bit_index = 3-(BA & 0b11) # low 2 bits but BE
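
As a rough illustration of the modified CR\_index calculation in the last hunk, here is a minimal Python sketch. It assumes (the diff does not state this) that the two `spec[0:1]` bits are passed in as a 2-bit integer `spec_ext` and that `spec[2]` is passed as a boolean `is_vector`. The function name `cr_index_and_bit` is hypothetical and not part of the specification.

```python
def cr_index_and_bit(BA, spec_ext, is_vector):
    """Sketch of the SV CR_index / bit_index calculation from the diff.

    Assumptions (hypothetical, for illustration only):
      BA        -- 5-bit v3.0B CR bit field (0..31)
      spec_ext  -- the two spec[0:1] bits as an integer (0..3)
      is_vector -- the spec[2] bit as a boolean
    """
    assert 0 <= BA < 32 and 0 <= spec_ext < 4
    cr_index = 7 - (BA >> 2)           # top 3 bits of BA, but BE
    if is_vector:
        # vector mode: extension bits sit below the 3-bit CR number,
        # followed by two zero bits
        cr_index = (cr_index << 4) | (spec_ext << 2)
    else:
        # scalar mode: extension bits sit above the 3-bit CR number
        cr_index = (spec_ext << 3) | cr_index
    bit_index = 3 - (BA & 0b11)        # low 2 bits of BA, but BE
    return cr_index, bit_index


# enumerate every CR number reachable in each mode
scalar = sorted({cr_index_and_bit(BA, e, False)[0]
                 for BA in range(32) for e in range(4)})
vector = sorted({cr_index_and_bit(BA, e, True)[0]
                 for BA in range(32) for e in range(4)})
print(scalar[0], scalar[-1], len(scalar))
print(vector[0], vector[-1], len(vector))
```

Under these assumptions, scalar mode reaches CR numbers 0 to 31 in increments of 1, and vector mode reaches 0 to 124 in increments of 4, which is 32 distinct values in each case.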