From: lkcl <lkcl@web>
Date: Mon, 20 Jun 2022 16:52:41 +0000 (+0100)
Subject: (no commit message)
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=010bad250be020198c84de3dc3ece8c38b15cadc;p=libreriscv.git

---

diff --git a/openpower/sv/vector_ops.mdwn b/openpower/sv/vector_ops.mdwn
index faca26c12..06c8f0634 100644
--- a/openpower/sv/vector_ops.mdwn
+++ b/openpower/sv/vector_ops.mdwn
@@ -4,6 +4,7 @@
 
 Links:
 
+* [[discussion]]
 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
@@ -13,7 +14,12 @@ Links:
 * [[simple_v_extension/specification/bitmanip]] previous version,
   contains pseudocode for sof, sif, sbf
 
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism.
+Consequently there are few cases where *actual* Vector
+instructions are needed, and those that are needed are more "assistance"
+functions.  Two traditional Vector instructions were initially
+considered (conflictd and vmiota); however, both may be synthesised
+from existing SVP64 instructions and have been moved to [[discussion]].
 
 Notes:
 
@@ -21,136 +27,7 @@ Notes:
 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
 
-# Vector
-
-Both of these instructions may be synthesised from SVP64 Vector
-instructions.  conflictd is an O(N^2) instruction based on
-`sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
-with the appropriate predication 
-
-## conflictd
-
-This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
-
-    input = [100, 100,   3, 100,   5, 100, 100,   3]
-    conflict result = [
-         0b00000000,    // Note: first element always zero
-         0b00000001,    // 100 is present on #0
-         0b00000000,
-         0b00000011,    // 100 is present on #0 and #1
-         0b00000000,
-         0b00001011,    // 100 is present on #0, #1, #3
-         0b00011011,    // .. and #4
-         0b00000100     // 3 is present on #2
-    ]
-
-Pseudocode:
-
-    for i in range(VL):
-        for j in range(1, i):
-            if src1[i] == src2[j]:
-                result[j] |= 1<<i
-
-Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
-  using `mfcrweird` and `cmpi`. first triangular schedule on src1,
-secpnd on src2.
-
-Idea 2: implement using outer loop on varying setvl Horizontal-First
-with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
-result.
-
-    li r3, 1
-    li result, 0
-    for i in range(target):
-        setvl target
-        sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
-        sv.cmpi src2.v, t0 # compare src2 vector to scalar
-        sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
-        srr t1, t1, i # shift up by i before ORing
-        or result, result, t1
-        srr r3, r3, 1 # shift r3 predicate up by one
-
-See [[sv/cr_int_predication]] for full details on the crweird instructions:
-the primary important aspect here is that a Vector of CR Field's EQ bits is
-transferred into a single GPR.  The secondary important aspect is that VL
-is being adjusted in each loop, testing successively more of the input
-vector against a given scalar, each time.
-
-To investigate:
-
-* <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
-* <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
-
-## iota
-
-Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
-
-When masked, only the bits not masked out are included in the count process.
-
-    viota RT/v, RA, RB
-
-Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
-
-Example
-
-     7 6 5 4 3 2 1 0   Element number
-
-     1 0 0 1 0 0 0 1   v2 contents
-                       viota.m v4, v2 # Unmasked
-     2 2 2 1 1 1 1 0   v4 result
-
-     1 1 1 0 1 0 1 1   v0 contents
-     1 0 0 1 0 0 0 1   v2 contents
-     2 3 4 5 6 7 8 9   v4 contents
-                       viota.m v4, v2, v0.t # Masked
-     1 1 1 5 1 7 1 0   v4 results
-
-     def iota(RT, RA, RB): 
-        mask = RB ? iregs[RB] : 0b111111...1
-        val = RA ? iregs[RA] : 0b111111...1
-        for i in range(VL):
-            if RA.scalar:
-            testmask = (1<<i)-1 # only count below
-            to_test = val & testmask & mask
-            iregs[RT+i] = popcount(to_test)
-
-a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
-where bit 2 is inv, bits 0:1 select the bit of the CR.
-
-     def test_CR_bit(CR, BO):
-         return CR[BO[0:1]] == BO[2]
-
-     def iotacr(RT, BA, BO): 
-        mask = get_src_predicate()
-        count = 0
-        for i in range(VL):
-            if mask & (1<<i) == 0:
-                count = 0 # reset back to zero
-                continue
-            iregs[RT+i] = count
-            if test_CR_bit(CR[i+BA], BO):
-                 count += 1
-
-the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
-
-scalar variant which can be Vectorised to give iotacr:
-
-     def crtaddi(RT, RA, BA, BO, D): 
-         if test_CR_bit(BA, BO):
-             RT = RA + EXTS(D)
-         else:
-             RT = RA
-
-a Vector for-loop with zero-ing on dest will give the
-mask-out effect of resetting the count back to zero.
-However close examination shows that the above may actually
-be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1` 
-
-# Scalar
-
-These may all be viewed as suitable for fitting into a scalar bitmanip extension.
-
-## sbfm
+# sbfm
 
    sbfm RT, RA, RB!=0
 
@@ -183,7 +60,7 @@ Executable pseudocode demo:
 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
 ```
 
-## sifm
+# sifm
 
 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 
@@ -212,7 +89,7 @@ Executable pseudocode demo:
 [[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
 ```
 
-## vmsof
+# vmsof
 
 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.