(no commit message)

[libreriscv.git] / openpower / sv / vector_ops.mdwn
diff --git a/openpower/sv/vector_ops.mdwn b/openpower/sv/vector_ops.mdwn

index 770c462109bc7b06c0e91e90a9b4b55b300a1618..1a69d92070cc2980debbb5d6aa30e5ee0487c546 100644 (file)
--- a/openpower/sv/vector_ops.mdwn
+++ b/openpower/sv/vector_ops.mdwn
@@ -1,11 +1,17 @@
+[[!tag standards]]
+
  # SV Vector Operations.
  
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVC512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+TODO merge old standards page [[simple_v_extension/vector_ops/]]
+
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
  
-However some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
-Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+Notes:
+
+* Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  
-.
  Links:
  
  * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
@@ -20,7 +26,7 @@ Links:
  
  ## conflictd
  
-This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion
+This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  
      input = [100, 100,   3, 100,   5, 100, 100,   3]
      conflict result = [
@@ -43,13 +49,13 @@ Pseudocode:
  
  ## iota
  
-Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of cntlz, where instead of stopping at the first zero with a count to produce a single scalar result, the process continues on, producing another element at the next encounter of a 1.
+Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  
-The viota.m instruction reads a source vector mask register and writes to each element of the destination vector register group the sum of all the bits of elements in the mask register whose index is less than the element, e.g., a parallel prefix sum of the mask values.
+When masked, only the bits not masked out are included in the count process.
  
-This instruction can be masked, in which case only the enabled elements contribute to the sum and only the enabled elements are written.
+    viota RT/v, RA, RB
  
-    viota.m vd, vs2, vm
+Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  
  Example
  
@@ -65,11 +71,31 @@ Example
                         viota.m v4, v2, v0.t # Masked
       1 1 1 5 1 7 1 0   v4 results
  
-The result value is zero-extended to fill the destination element if SEW is wider than the result. If the result value would overflow the destination SEW, the least-significant SEW bits are retained.
+     def iota(RT, RA, RB): 
+        mask = RB ? iregs[RB] : 0b111111...1
+        val = RA ? iregs[RA] : 0b111111...1
+        for i in range(VL):
+            if RA.scalar:
+            testmask = (1<<i)-1 # only count below
+            to_test = val & testmask & mask
+            iregs[RT+i] = popcount(to_test)
+
+a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
+where bit 2 is inv, bits 0:1 select the bit of the CR.
  
-Traps on viota.m are always reported with a vstart of 0, and execution is always restarted from the beginning when resuming after a trap handler. An illegal instruction exception is raised if vstart is non-zero.
+     def test_CR_bit(CR, BO):
+         return CR[BO[0:1]] == BO[2]
  
+     def iotacr(RT, BA, BO): 
+        mask = get_src_predicate()
+        count = 0
+        for i in range(VL):
+            if mask & (1<<i) == 0: continue
+            iregs[RT+i] = count
+            if test_CR_bit(CR[i+BA], BO):
+                 count += 1
  
+the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
  
  # Scalar
  
@@ -229,3 +255,41 @@ Pseudo-code:
                  # next loop starts skipping
  
              i += 1
+
+# Carry-lookahead
+
+used not just for carry lookahead, also a special type of predication mask operation.
+
+* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
+* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
+* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
+* <https://i.stack.imgur.com/QSLKY.png>
+* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
+  `((P|G)+G)^P`
+* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
+
+```
+     P = (A | B) & Ci
+     G = (A & B)
+```
+
+Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
+
+```
+    At each id, compute C[id] = A[id]+B[id]+0
+    Get G[id] = C[id] > radix -1
+    Get P[id] = C[id] == radix-1
+    Join all P[id] together, likewise G[id]
+    Compute newC = ((P|G)+G)^P
+    result[id] = (C[id] + newC[id]) % radix
+```   
+
+two versions: scalar int version and CR based version.
+
+scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
+
+vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
+
+if zero (no propagation) then CR0.eq is zero
+
+CR based version, TODO.