add twin butterfly page (stub)

[libreriscv.git] / openpower / sv / vector_ops.mdwn
diff --git a/openpower/sv/vector_ops.mdwn b/openpower/sv/vector_ops.mdwn

index e29ac09cad1584b8dd02397a7481aa9893489cce..b2d526af69e34d43f8016980e1b6240c5f3c9c88 100644 (file)
--- a/openpower/sv/vector_ops.mdwn
+++ b/openpower/sv/vector_ops.mdwn
@@ -1,258 +1,115 @@
-# SV Vector Operations.
+[[!tag standards]]
  
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
-
-Notes:
-
-* Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
-* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
-* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
+# SV Vector-assist Operations.
  
  Links:
  
+* [[discussion]]
  * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
-* <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
+* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=865> implementation in simulator
  * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
- out of scope for this document
+ out of scope for this document [[openpower/sv/3d_vector_ops]]
  * [[simple_v_extension/specification/bitmanip]] previous version,
    contains pseudocode for sof, sif, sbf
+* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>
  
-# Vector
-
-## conflictd
-
-This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
-
-    input = [100, 100,   3, 100,   5, 100, 100,   3]
-    conflict result = [
-         0b00000000,    // Note: first element always zero
-         0b00000001,    // 100 is present on #0
-         0b00000000,
-         0b00000011,    // 100 is present on #0 and #1
-         0b00000000,
-         0b00001011,    // 100 is present on #0, #1, #3
-         0b00011011,    // .. and #4
-         0b00000100     // 3 is present on #2
-    ]
-
-Pseudocode:
-
-    for i in range(VL):
-        for j in range(1, i):
-            if src1[i] == src2[j]:
-                result[j] |= 1<<i
-
-## iota
-
-Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
-
-When masked, only the bits not masked out are included in the count process.
-
-    viota RT/v, RA, RB
-
-Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
-
-Example
-
-     7 6 5 4 3 2 1 0   Element number
-
-     1 0 0 1 0 0 0 1   v2 contents
-                       viota.m v4, v2 # Unmasked
-     2 2 2 1 1 1 1 0   v4 result
-
-     1 1 1 0 1 0 1 1   v0 contents
-     1 0 0 1 0 0 0 1   v2 contents
-     2 3 4 5 6 7 8 9   v4 contents
-                       viota.m v4, v2, v0.t # Masked
-     1 1 1 5 1 7 1 0   v4 results
-
-     def iota(RT, RA, RB): 
-        mask = RB ? iregs[RB] : 0b111111...1
-        val = RA ? iregs[RA] : 0b111111...1
-        for i in range(VL):
-            if RA.scalar:
-            testmask = (1<<i)-1 # only count below
-            to_test = val & testmask & mask
-            iregs[RT+i] = popcount(to_test)
-
-a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
-where bit 2 is inv, bits 0:1 select the bit of the CR.
-
-     def test_CR_bit(CR, BO):
-         return CR[BO[0:1]] == BO[2]
-
-     def iotacr(RT, BA, BO): 
-        mask = get_src_predicate()
-        count = 0
-        for i in range(VL):
-            if mask & (1<<i) == 0: continue
-            iregs[RT+i] = count
-            if test_CR_bit(CR[i+BA], BO):
-                 count += 1
-
-the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
-
-# Scalar
-
-These may all be viewed as suitable for fitting into a scalar bitmanip extension.
-
-## sbfm
-
-   sbfm RT, RA, RB!=0
-
-Example
-
-     7 6 5 4 3 2 1 0   Bit index
-
-     1 0 0 1 0 1 0 0   v3 contents
-                       vmsbf.m v2, v3
-     0 0 0 0 0 0 1 1   v2 contents
+The core Power ISA was designed as scalar: SV provides a level of
+abstraction to add variable-length element-independent parallelism.
+Therefore there are not that many cases where *actual* Vector instructions
+are needed. If they are, they are more "assistance" functions.  Two
+traditional Vector instructions were initially considered (conflictd and
+vmiota) however they may be synthesised from existing SVP64 instructions:
+vmiota may use [[svstep]].  Details in [[discussion]]
  
-     1 0 0 1 0 1 0 1   v3 contents
-                       vmsbf.m v2, v3
-     0 0 0 0 0 0 0 0   v2
-
-     0 0 0 0 0 0 0 0   v3 contents
-                       vmsbf.m v2, v3
-     1 1 1 1 1 1 1 1   v2
-
-     1 1 0 0 0 0 1 1   RB vcontents
-     1 0 0 1 0 1 0 0   v3 contents
-                       vmsbf.m v2, v3, v0.t
-     0 1 x x x x 1 1   v2 contents
-
-The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
-
-pseudocode:
-
-    def sbf(rd, rs1, rs2):
-        rd = 0
-        # start setting if no predicate or if 1st predicate bit set
-        setting_mode = rs2 == x0 or (regs[rs2] & 1)
-        while i < XLEN:
-            bit = 1<<i
-            if rs2 != x0 and (regs[rs2] & bit):
-                # reset searching
-                setting_mode = False
-            if setting_mode:
-                if regs[rs1] & bit: # found a bit in rs1: stop setting rd
-                    setting_mode = False
-                else:
-                    regs[rd] |= bit
-            else if rs2 != x0: # searching mode
-                if (regs[rs2] & bit):
-                    setting_mode = True # back into "setting" mode
-            i += 1
-
-## sifm
-
-The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
-
-    sifm RT, RA, RB!=0
-
- # Example
-
-     7 6 5 4 3 2 1 0   Bit number
-
-     1 0 0 1 0 1 0 0   v3 contents
-                       vmsif.m v2, v3
-     0 0 0 0 0 1 1 1   v2 contents
-
-     1 0 0 1 0 1 0 1   v3 contents
-                       vmsif.m v2, v3
-     0 0 0 0 0 0 0 1   v2
-
-     1 1 0 0 0 0 1 1   RB vcontents
-     1 0 0 1 0 1 0 0   v3 contents
-                       vmsif.m v2, v3, v0.t
-     1 1 x x x x 1 1   v2 contents
-
-Pseudo-code:
-
-    def sif(rd, rs1, rs2):
-        rd = 0
-        setting_mode = rs2 == x0 or (regs[rs2] & 1)
-
-        while i < XLEN:
-            bit = 1<<i
-
-            # only reenable when predicate in use, and bit valid
-            if !setting_mode && rs2 != x0:
-                if (regs[rs2] & bit):
-                    # back into "setting" mode
-                    setting_mode = True
-
-            # skipping mode
-            if !setting_mode:
-                # skip any more 1s
-                if regs[rs1] & bit == 1:
-                    i += 1
-                    continue
-
-            # setting mode, search for 1
-            regs[rd] |= bit # always set during search
-            if regs[rs1] & bit: # found a bit in rs1:
-                setting_mode = False
-                # next loop starts skipping
-
-            i += 1
-
-
-## vmsof
-
-The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
+Notes:
  
-    sofm RT, RA, RB
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct,
+  normalise) are out of scope: this document is for more general-purpose
+  instructions that underpin and are critical to general-purpose Vector
+  workloads (including GPU and VPU)
+* Instructions related to the adaptation of CRs for use as
+  predicate masks are covered separately, by crweird operations.
+  See [[sv/cr_int_predication]].
  
-Example
+## Mask-suited Bitmanipulation
  
-     7 6 5 4 3 2 1 0   Bit number
  
-     1 0 0 1 0 1 0 0   v3 contents
-                       vmsof.m v2, v3
-     0 0 0 0 0 1 0 0   v2 contents
+BM2-Form
  
-     1 0 0 1 0 1 0 1   v3 contents
-                       vmsof.m v2, v3
-     0 0 0 0 0 0 0 1   v2
+|0..5  |6..10|11..15|16..20|21-25|26|27..31| Form |
+|------|-----|------|------|-----|--|------|------|
+| PO   |  RS |   RA |   RB |bm   |L |   XO | BM2-Form |
  
-     1 1 0 0 0 0 1 1   RB vcontents
-     1 1 0 1 0 1 0 0   v3 contents
-                       vmsof.m v2, v3, v0.t
-     0 1 x x x x 0 0   v2 content
+* bmask RS,RA,RB,bm,L
  
  Pseudo-code:
  
-    def sof(rd, rs1, rs2):
-        rd = 0
-        setting_mode = rs2 == x0 or (regs[rs2] & 1)
-
-        while i < XLEN:
-            bit = 1<<i
+```
+    if _RB = 0 then mask <- [1] * XLEN
+    else            mask <- (RB)
+    ra <- (RA) & mask
+    a1 <- ra
+    if bm[4] = 0 then a1 <- ¬ra
+    mode2 <- bm[2:3]
+    if mode2 = 0 then a2 <- (¬ra)+1
+    if mode2 = 1 then a2 <- ra-1
+    if mode2 = 2 then a2 <- ra+1
+    if mode2 = 3 then a2 <- ¬(ra+1)
+    a1 <- a1 & mask
+    a2 <- a2 & mask
+    # select operator
+    mode3 <- bm[0:1]
+    if mode3 = 0 then result <- a1 | a2
+    if mode3 = 1 then result <- a1 & a2
+    if mode3 = 2 then result <- a1 ^ a2
+    if mode3 = 3 then result <- undefined([0]*XLEN)
+    # mask output
+    result <- result & mask
+    # optionally restore masked-out bits
+    if L = 1 then
+        result <- result | (RA & ¬mask)
+    RT <- result
+```
+
+* first pattern A: two options `x` or `~x`
+* second pattern B: three options `|` `&` or `^`
+* third pattern C: four options `x+1`, `x-1`, `~(x+1)` or `(~x)+1`
+
+
+The lower two bits of `bm` set to 0b11 are `RESERVED`. An illegal instruction
+trap must be raised.
+
+Special Registers Altered:
+
+```
+    None
+```
+
+## Carry-lookahead
+
+As a single scalar 32-bit instruction, up to 64 carry-propagation bits
+may be computed.  When the output is then used as a Predicate mask it can
+be used to selectively perform the "add carry" of biginteger math, with
+`sv.addi/sm=rN RT.v, RA.v, 1`.
+
+* cprop RT,RA,RB (Rc=0)
+* cprop. RT,RA,RB (Rc=1)
  
-            # only reenable when predicate in use, and bit valid
-            if !setting_mode && rs2 != x0:
-                if (regs[rs2] & bit):
-                    # back into "setting" mode
-                    setting_mode = True
+pseudocode:
  
-            # skipping mode
-            if !setting_mode:
-                # skip any more 1s
-                if regs[rs1] & bit == 1:
-                    i += 1
-                    continue
+```
+    P = (RA)
+    G = (RB)
+    RT = ((P|G)+G)^P 
+```
  
-            # setting mode, search for 1
-            if regs[rs1] & bit: # found a bit in rs1:
-                regs[rd] |= bit # only set when search succeeds
-                setting_mode = False
-                # next loop starts skipping
+X-Form
  
-            i += 1
+| 0:5|6:10|11:15|16:20| 21:30      |31| name      |  Form   |
+| -- | -- | --- | --- | ---------  |--| ----      | ------- |
+| PO | RT | RA  | RB  | XO         |Rc|     cprop | X-Form  |
  
-# Carry-lookahead
+used not just for carry lookahead, also a special type of predication mask operation.
  
-* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
-* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>