From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Mon, 20 Jun 2022 16:34:19 +0000 (+0100)
Subject: add copy of vector_ops page as discussion
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=cd2ab2c86cf4141b39353fa47f9b66d47d7977bf;p=libreriscv.git

add copy of vector_ops page as discussion
---

diff --git a/openpower/sv/vector_ops/discussion.mdwn b/openpower/sv/vector_ops/discussion.mdwn
new file mode 100644
index 000000000..9efe21825
--- /dev/null
+++ b/openpower/sv/vector_ops/discussion.mdwn
@@ -0,0 +1,302 @@
+[[!tag standards]]
+
+# SV Vector Operations.
+
+Links:
+
+* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
+* <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
+* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
+ out of scope for this document [[openpower/sv/3d_vector_ops]]
+* [[simple_v_extension/specification/bitmanip]] previous version,
+  contains pseudocode for sof, sif, sbf
+
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors).
+
+Notes:
+
+* Some of these could actually be added to a scalar ISA as bit-manipulation instructions.  These are separated out into their own section.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document covers the more general-purpose instructions that underpin, and are critical to, general-purpose Vector workloads (including GPU and VPU).
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
+
+# Vector
+
+Both of these instructions may be synthesised from SVP64 Vector
+instructions.  conflictd is an O(N^2) instruction based on
+`sv.cmpi`, and iota is an O(N) instruction based on `sv.addi`
+with the appropriate predication.
+
+## conflictd
+
+This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  The instruction may be used for histograms (computed in parallel).
+
+    input = [100, 100,   3, 100,   5, 100, 100,   3]
+    conflict result = [
+         0b00000000,    // Note: first element always zero
+         0b00000001,    // 100 is present on #0
+         0b00000000,
+         0b00000011,    // 100 is present on #0 and #1
+         0b00000000,
+         0b00001011,    // 100 is present on #0, #1, #3
+         0b00101011,    // .. and #5
+         0b00000100     // 3 is present on #2
+    ]
+
+Pseudocode:
+
+    for i in range(VL):
+        for j in range(i): # every element below i, including element 0
+            if src1[i] == src2[j]:
+                result[i] |= 1<<j # bit j of result i flags the match
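+
+The triangular comparison can be modelled in a few lines of Python.  This is a minimal sketch, not a definitive implementation: the function name `conflictd` and its list-based arguments are illustrative assumptions, and it assumes (as the worked example does) that each element is compared against every lower-numbered element, with matches reported as bits of that element's result.
+
+```python
+def conflictd(src1, src2=None):
+    """Model of conflict detection: result[i] has bit j set (j < i)
+    when src1[i] equals src2[j].  Names and types are illustrative."""
+    if src2 is None:
+        src2 = src1  # AVX512-style use: compare a vector against itself
+    result = [0] * len(src1)
+    for i in range(len(src1)):
+        for j in range(i):  # strictly lower-numbered elements only
+            if src1[i] == src2[j]:
+                result[i] |= 1 << j
+    return result
+
+
+inp = [100, 100, 3, 100, 5, 100, 100, 3]
+for idx, bits in enumerate(conflictd(inp)):
+    print(idx, format(bits, "08b"))
+```
+
+Passing a single list compares the vector against itself, which is the histogram use-case: a zero result marks the first occurrence of a value.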
+
+Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
+using `mfcrweird` and `cmpi`.  The first triangular schedule is on src1,
+the second on src2.
+
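+Idea 2: implement using an outer loop on a varying setvl, in Horizontal-First
+Mode.  A `1<<r3` predicate mask picks out one element of src1 as a scalar, which is compared against the src2 vector to create a Vector of CR Fields.  These are transferred into an INT with mfcrweird, then ORed into the
+result.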
+
+    li r3, 1
+    li result, 0
+    for i in range(target):
+        setvl target
+        sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
+        sv.cmpi src2.v, t0 # compare src2 vector to scalar
+        sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
+        srr t1, t1, i # shift up by i before ORing
+        or result, result, t1
+        srr r3, r3, 1 # shift r3 predicate up by one
+
+See [[sv/cr_int_predication]] for full details on the crweird instructions:
+the primary important aspect here is that a Vector of CR Field's EQ bits is
+transferred into a single GPR.  The secondary important aspect is that VL
+is being adjusted in each loop, testing successively more of the input
+vector against a given scalar, each time.
+
+To investigate:
+
+* <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
+* <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
+
+## iota
+
+Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  Successive iterations include more and more bits of the bitstream being tested.
+
+When masked, only the bits not masked out are included in the count process.
+
+    viota RT/v, RA, RB
+
+Note that RA=0 indicates a test against all 1s, so the instruction generates the vector sequence [0, 1, 2... VL-1].  This is equivalent to RVV vid.v, which here is a pseudo-op (RA=0).
+
+Example
+
+     7 6 5 4 3 2 1 0   Element number
+
+     1 0 0 1 0 0 0 1   v2 contents
+                       viota.m v4, v2 # Unmasked
+     2 2 2 1 1 1 1 0   v4 result
+
+     1 1 1 0 1 0 1 1   v0 contents
+     1 0 0 1 0 0 0 1   v2 contents
+     2 3 4 5 6 7 8 9   v4 contents
+                       viota.m v4, v2, v0.t # Masked
+     1 1 1 5 1 7 1 0   v4 results
+
+     def iota(RT, RA, RB):
+        mask = RB ? iregs[RB] : 0b111111...1
+        val = RA ? iregs[RA] : 0b111111...1
+        for i in range(VL):
+            if RA.scalar:
+                testmask = (1<<i)-1 # only count below
+                to_test = val & testmask & mask
+                iregs[RT+i] = popcount(to_test)
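+
+The pseudocode can be turned into a small runnable model.  This is a sketch under stated assumptions: `val` and `mask` are plain Python integers used as bit-vectors (bit i corresponds to element i), the result is returned as a list instead of being written to `iregs`, and the function and argument names are illustrative only.
+
+```python
+def popcount(x):
+    """Number of set bits in a non-negative integer."""
+    return bin(x).count("1")
+
+
+def iota(val, vl, mask=None):
+    """Cumulative popcount: element i counts the set bits of val
+    strictly below bit i, restricted to bits enabled in mask."""
+    if mask is None:
+        mask = (1 << vl) - 1  # no mask given: test every bit
+    result = []
+    for i in range(vl):
+        testmask = (1 << i) - 1  # only count bits below element i
+        result.append(popcount(val & testmask & mask))
+    return result
+
+
+unmasked = iota(0b10010001, 8)  # element 0 first
+vid = iota(0b11111111, 8)       # all 1s gives the sequence 0..VL-1
+print(unmasked, vid)
+```
+
+Like the pseudocode, this model writes every element.  In the masked RVV example above, the elements whose v0 bit is clear keep their previous contents instead.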
+
+There is also a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2],
+where bit 2 is inv and bits 0:1 select the bit of the CR.
+
+     def test_CR_bit(CR, BO):
+         return CR[BO[0:1]] == BO[2]
+
+     def iotacr(RT, BA, BO): 
+        mask = get_src_predicate()
+        count = 0
+        for i in range(VL):
+            if mask & (1<<i) == 0:
+                count = 0 # reset back to zero
+                continue
+            iregs[RT+i] = count
+            if test_CR_bit(CR[i+BA], BO):
+                 count += 1
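+
+A runnable model of the same loop may help when checking the reset-on-mask behaviour.  This is a sketch, not a definitive implementation: each CR field is modelled as a four-element list, the BO operand is split into a hypothetical `bit` index and an `expect` value, the predicate is passed in as an integer, and unwritten elements are shown as `None`.
+
+```python
+LT, GT, EQ, SO = 0, 1, 2, 3  # bit positions within a CR field (model only)
+
+
+def test_cr_bit(cr_field, bit, expect):
+    """True when the selected CR bit equals the expected value."""
+    return cr_field[bit] == expect
+
+
+def iotacr(cr_fields, bit, expect, predicate):
+    """Running count of CR fields passing the test, per element."""
+    vl = len(cr_fields)
+    result = [None] * vl  # None marks elements that are not written
+    count = 0
+    for i in range(vl):
+        if (predicate & (1 << i)) == 0:
+            count = 0  # masked-out element: reset back to zero
+            continue
+        result[i] = count
+        if test_cr_bit(cr_fields[i], bit, expect):
+            count += 1
+    return result
+
+
+def eq_field(eq):
+    """Build a CR field with only the EQ bit controlled."""
+    return [0, 0, eq, 0]
+
+
+fields = [eq_field(b) for b in [1, 0, 1, 1, 0, 1]]
+all_active = iotacr(fields, EQ, 1, 0b111111)
+one_masked = iotacr(fields, EQ, 1, 0b110111)  # element 3 masked out
+print(all_active, one_masked)
+```
+
+With element 3 masked out, the count restarts from zero at element 4, which is the reset behaviour in the pseudocode.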
+
+There is no vidcr variant of iotacr: it is not appropriate to give BA=0 a special meaning, and the variant would be pointless anyway.  The integer version covers it, by not reading the int regfile at all.
+
+A scalar variant, which can be Vectorised to give iotacr:
+
+     def crtaddi(RT, RA, BA, BO, D): 
+         if test_CR_bit(CR[BA], BO):
+             RT = RA + EXTS(D)
+         else:
+             RT = RA
+
+A Vector for-loop with zeroing on the destination will give the
+mask-out effect of resetting the count back to zero.
+However, close examination shows that the above may actually
+be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`.
+
+# Scalar
+
+These may all be viewed as suitable for fitting into a scalar bitmanip extension.
+
+## sbfm
+
+    sbfm RT, RA, RB!=0
+
+Example
+
+     7 6 5 4 3 2 1 0   Bit index
+
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsbf.m v2, v3
+     0 0 0 0 0 0 1 1   v2 contents
+
+     1 0 0 1 0 1 0 1   v3 contents
+                       vmsbf.m v2, v3
+     0 0 0 0 0 0 0 0   v2
+
+     0 0 0 0 0 0 0 0   v3 contents
+                       vmsbf.m v2, v3
+     1 1 1 1 1 1 1 1   v2
+
+     1 1 0 0 0 0 1 1   RB contents
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsbf.m v2, v3, v0.t
+     0 1 x x x x 1 1   v2 contents
+
+The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
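+
+The rule in the paragraph above can be sketched in Python as follows.  This is an illustrative model only and is not the contents of `sbf.py`: the function name, the 8-bit default width and the `dest` argument (the previous destination value, whose inactive bits are preserved) are all assumptions.
+
+```python
+def sbf(src, mask=0xFF, dest=0, width=8):
+    """Set-before-first on integers used as bit-vectors."""
+    found = False
+    for i in range(width):
+        bit = 1 << i
+        if not (mask & bit):
+            continue  # inactive element: dest bit left unchanged
+        if src & bit:
+            found = True  # first (or a later) set source bit reached
+        if found:
+            dest &= ~bit  # this element and all following: write 0
+        else:
+            dest |= bit   # still before the first set bit: write 1
+    return dest
+
+
+print(format(sbf(0b10010100), "08b"))
+print(format(sbf(0b10010100, mask=0b11000011), "08b"))
+```
+
+In the masked call the inactive bits come out as zero only because `dest` starts at zero; they correspond to the `x` entries in the example.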
+
+Executable demo:
+
+```
+[[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
+```
+
+## sifm
+
+The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
+
+    sifm RT, RA, RB!=0
+
+Example
+
+     7 6 5 4 3 2 1 0   Bit number
+
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsif.m v2, v3
+     0 0 0 0 0 1 1 1   v2 contents
+
+     1 0 0 1 0 1 0 1   v3 contents
+                       vmsif.m v2, v3
+     0 0 0 0 0 0 0 1   v2
+
+     1 1 0 0 0 0 1 1   RB contents
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsif.m v2, v3, v0.t
+     1 1 x x x x 1 1   v2 contents
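+
+A Python sketch of set-including-first, under the same kind of assumptions (integers as bit-vectors, a hypothetical `dest` argument holding the previous destination value, and an 8-bit default width), differs from set-before-first only in when the "found" flag is updated:
+
+```python
+def sif(src, mask=0xFF, dest=0, width=8):
+    """Set-including-first on integers used as bit-vectors."""
+    found = False
+    for i in range(width):
+        bit = 1 << i
+        if not (mask & bit):
+            continue  # inactive element: dest bit left unchanged
+        if found:
+            dest &= ~bit  # after the first set bit: write 0
+        else:
+            dest |= bit   # up to and including the first set bit
+        if src & bit:
+            found = True  # updated after writing, so this bit is kept
+    return dest
+
+
+print(format(sif(0b10010100), "08b"))
+print(format(sif(0b10010100, mask=0b11000011), "08b"))
+```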
+
+Executable demo:
+
+```
+[[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
+```
+
+## sofm
+
+The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
+
+    sofm RT, RA, RB
+
+Example
+
+     7 6 5 4 3 2 1 0   Bit number
+
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsof.m v2, v3
+     0 0 0 0 0 1 0 0   v2 contents
+
+     1 0 0 1 0 1 0 1   v3 contents
+                       vmsof.m v2, v3
+     0 0 0 0 0 0 0 1   v2
+
+     1 1 0 0 0 0 1 1   RB contents
+     1 1 0 1 0 1 0 0   v3 contents
+                       vmsof.m v2, v3, v0.t
+     0 1 x x x x 0 0   v2 contents
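+
+Set-only-first can be sketched in the same style.  As before this is an illustrative model with assumed names: `dest` is the previous destination value and only active bits are written.
+
+```python
+def sof(src, mask=0xFF, dest=0, width=8):
+    """Set-only-first on integers used as bit-vectors."""
+    found = False
+    for i in range(width):
+        bit = 1 << i
+        if not (mask & bit):
+            continue  # inactive element: dest bit left unchanged
+        if (src & bit) and not found:
+            dest |= bit  # the first active set bit: write 1
+            found = True
+        else:
+            dest &= ~bit  # every other active element: write 0
+    return dest
+
+
+print(format(sof(0b10010100), "08b"))
+print(format(sof(0b11010100, mask=0b11000011), "08b"))
+```
+
+With no mask the result equals the familiar bit-trick `src & -src`, which isolates the lowest set bit.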
+
+Executable demo:
+
+```
+[[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
+```
+
+# Carry-lookahead
+
+Used not just for carry lookahead, but also as a special type of predication mask operation.
+
+* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
+* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
+* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
+* <https://i.stack.imgur.com/QSLKY.png>
+* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
+  `((P|G)+G)^P`
+* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
+
+From QSLKY.png:
+
+```
+    x0 = nand(CIn, P0)
+    C0 = nand(x0, ~G0)
+
+    x1 = nand(CIn, P0, P1)
+    y1 = nand(G0, P1)
+    C1 = nand(x1, y1, ~G1)
+
+    x2 = nand(CIn, P0, P1, P2)
+    y2 = nand(G0, P1, P2)
+    z2 = nand(G1, P2)
+    C2 = nand(x2, y2, z2, ~G2)
+
+    # Gen*
+    x3 = nand(G0, P1, P2, P3)
+    y3 = nand(G1, P2, P3)
+    z3 = nand(G2, P3)
+    G* = nand(x3, y3, z3, ~G3)
+```
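+
+The NAND network can be checked exhaustively against the usual recurrence `C[i] = G[i] | (P[i] & C[i-1])`.  The following Python is a sketch written for this page, with assumed helper names (`nand`, `inv`, `lookahead`, `ripple`, `check_all`); it treats the third carry output as C2 and the group-generate output `G*` as `gstar`.
+
+```python
+from itertools import product
+
+
+def nand(*bits):
+    """N-input NAND on 0/1 values."""
+    return 0 if all(bits) else 1
+
+
+def inv(bit):
+    return 1 - bit
+
+
+def lookahead(cin, p, g):
+    """Evaluate the NAND network; p and g are 4-element 0/1 lists."""
+    c0 = nand(nand(cin, p[0]), inv(g[0]))
+    c1 = nand(nand(cin, p[0], p[1]), nand(g[0], p[1]), inv(g[1]))
+    c2 = nand(nand(cin, p[0], p[1], p[2]), nand(g[0], p[1], p[2]),
+              nand(g[1], p[2]), inv(g[2]))
+    gstar = nand(nand(g[0], p[1], p[2], p[3]), nand(g[1], p[2], p[3]),
+                 nand(g[2], p[3]), inv(g[3]))
+    return c0, c1, c2, gstar
+
+
+def ripple(cin, p, g):
+    """Reference: c[i] = g[i] | (p[i] & c[i-1]), one stage at a time."""
+    carries, c = [], cin
+    for pi, gi in zip(p, g):
+        c = gi | (pi & c)
+        carries.append(c)
+    return carries
+
+
+def check_all():
+    """Compare both forms over all 512 combinations of CIn, P and G."""
+    for bits in product((0, 1), repeat=9):
+        cin, p, g = bits[0], list(bits[1:5]), list(bits[5:9])
+        c0, c1, c2, gstar = lookahead(cin, p, g)
+        if [c0, c1, c2] != ripple(cin, p, g)[:3]:
+            return False
+        if gstar != ripple(0, p, g)[3]:  # group generate ignores CIn
+            return False
+    return True
+
+
+print(check_all())
+```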
+
+```
+     P = (A | B) & Ci
+     G = (A & B)
+```
+
+Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
+
+```
+    At each id, compute C[id] = A[id]+B[id]+0
+    Get G[id] = C[id] > radix -1
+    Get P[id] = C[id] == radix-1
+    Join all P[id] together, likewise G[id]
+    Compute newC = ((P|G)+G)^P
+    result[id] = (C[id] + newC[id]) % radix
+```   
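+
+The steps above can be tried out in Python.  This is a minimal sketch under stated assumptions: digits are held in lists with the least-significant digit first, P and G are ordinary Python integers with one bit per element, and the name `lookahead_add` is illustrative.  Note that P and G can never both be set for the same element, which is what makes the `((P|G)+G)^P` step valid.
+
+```python
+def lookahead_add(a, b, radix=10):
+    """Add two equal-length digit lists (least-significant first).
+    Returns (result digits, carry out of the top element)."""
+    c = [x + y for x, y in zip(a, b)]  # per-element sum, no carry-in
+    g = p = 0
+    for i, ci in enumerate(c):
+        if ci > radix - 1:
+            g |= 1 << i   # element generates a carry on its own
+        elif ci == radix - 1:
+            p |= 1 << i   # element propagates an incoming carry
+    newc = ((p | g) + g) ^ p  # bit i = carry into element i
+    digits = [(ci + ((newc >> i) & 1)) % radix for i, ci in enumerate(c)]
+    carry_out = (newc >> len(c)) & 1
+    return digits, carry_out
+
+
+# 999 + 1: the carry ripples through every element
+print(lookahead_add([9, 9, 9], [1, 0, 0]))
+```
+
+Bit N of `newC` (for N elements) is the carry out of the top element, which is where an end bit such as XER.CA would come from.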
+
+Two versions: scalar int version and CR based version.
+
+The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  The end bits go into XER.CA and CR0.ge.
+
+The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR0.ge.
+
+If zero (no propagation) then CR0.eq is zero.
+
+CR based version, TODO.