[[!tag standards]]

# SV Vector Operations.

The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)

Notes:

* Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].

Links:

* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
 out of scope for this document
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf

# Vector

## conflictd

This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)

    input = [100, 100,   3, 100,   5, 100, 100,   3]
    conflict result = [
         0b00000000,    // Note: first element always zero
         0b00000001,    // 100 is present on #0
         0b00000000,
         0b00000011,    // 100 is present on #0 and #1
         0b00000000,
         0b00001011,    // 100 is present on #0, #1, #3
         0b00011011,    // .. and #4
         0b00000100     // 3 is present on #2
    ]

Pseudocode:

    for i in range(VL):
        for j in range(1, i):
            if src1[i] == src2[j]:
                result[j] |= 1<<i

## iota

Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.

When masked, only the bits not masked out are included in the count process.

    viota RT/v, RA, RB

Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).

Example

     7 6 5 4 3 2 1 0   Element number

     1 0 0 1 0 0 0 1   v2 contents
                       viota.m v4, v2 # Unmasked
     2 2 2 1 1 1 1 0   v4 result

     1 1 1 0 1 0 1 1   v0 contents
     1 0 0 1 0 0 0 1   v2 contents
     2 3 4 5 6 7 8 9   v4 contents
                       viota.m v4, v2, v0.t # Masked
     1 1 1 5 1 7 1 0   v4 results

     def iota(RT, RA, RB): 
        mask = RB ? iregs[RB] : 0b111111...1
        val = RA ? iregs[RA] : 0b111111...1
        for i in range(VL):
            if RA.scalar:
            testmask = (1<<i)-1 # only count below
            to_test = val & testmask & mask
            iregs[RT+i] = popcount(to_test)

a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
where bit 2 is inv, bits 0:1 select the bit of the CR.

     def test_CR_bit(CR, BO):
         return CR[BO[0:1]] == BO[2]

     def iotacr(RT, BA, BO): 
        mask = get_src_predicate()
        count = 0
        for i in range(VL):
            if mask & (1<<i) == 0: continue
            iregs[RT+i] = count
            if test_CR_bit(CR[i+BA], BO):
                 count += 1

the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.

# Scalar

These may all be viewed as suitable for fitting into a scalar bitmanip extension.

## sbfm

   sbfm RT, RA, RB!=0

Example

     7 6 5 4 3 2 1 0   Bit index

     1 0 0 1 0 1 0 0   v3 contents
                       vmsbf.m v2, v3
     0 0 0 0 0 0 1 1   v2 contents

     1 0 0 1 0 1 0 1   v3 contents
                       vmsbf.m v2, v3
     0 0 0 0 0 0 0 0   v2

     0 0 0 0 0 0 0 0   v3 contents
                       vmsbf.m v2, v3
     1 1 1 1 1 1 1 1   v2

     1 1 0 0 0 0 1 1   RB vcontents
     1 0 0 1 0 1 0 0   v3 contents
                       vmsbf.m v2, v3, v0.t
     0 1 x x x x 1 1   v2 contents

The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.

pseudocode:

    def sbf(rd, rs1, rs2):
        rd = 0
        # start setting if no predicate or if 1st predicate bit set
        setting_mode = rs2 == x0 or (regs[rs2] & 1)
        while i < XLEN:
            bit = 1<<i
            if rs2 != x0 and (regs[rs2] & bit):
                # reset searching
                setting_mode = False
            if setting_mode:
                if regs[rs1] & bit: # found a bit in rs1: stop setting rd
                    setting_mode = False
                else:
                    regs[rd] |= bit
            else if rs2 != x0: # searching mode
                if (regs[rs2] & bit):
                    setting_mode = True # back into "setting" mode
            i += 1

## sifm

The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.

    sifm RT, RA, RB!=0

 # Example

     7 6 5 4 3 2 1 0   Bit number

     1 0 0 1 0 1 0 0   v3 contents
                       vmsif.m v2, v3
     0 0 0 0 0 1 1 1   v2 contents

     1 0 0 1 0 1 0 1   v3 contents
                       vmsif.m v2, v3
     0 0 0 0 0 0 0 1   v2

     1 1 0 0 0 0 1 1   RB vcontents
     1 0 0 1 0 1 0 0   v3 contents
                       vmsif.m v2, v3, v0.t
     1 1 x x x x 1 1   v2 contents

Pseudo-code:

    def sif(rd, rs1, rs2):
        rd = 0
        setting_mode = rs2 == x0 or (regs[rs2] & 1)

        while i < XLEN:
            bit = 1<<i

            # only reenable when predicate in use, and bit valid
            if !setting_mode && rs2 != x0:
                if (regs[rs2] & bit):
                    # back into "setting" mode
                    setting_mode = True

            # skipping mode
            if !setting_mode:
                # skip any more 1s
                if regs[rs1] & bit == 1:
                    i += 1
                    continue

            # setting mode, search for 1
            regs[rd] |= bit # always set during search
            if regs[rs1] & bit: # found a bit in rs1:
                setting_mode = False
                # next loop starts skipping

            i += 1


## vmsof

The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.

    sofm RT, RA, RB

Example

     7 6 5 4 3 2 1 0   Bit number

     1 0 0 1 0 1 0 0   v3 contents
                       vmsof.m v2, v3
     0 0 0 0 0 1 0 0   v2 contents

     1 0 0 1 0 1 0 1   v3 contents
                       vmsof.m v2, v3
     0 0 0 0 0 0 0 1   v2

     1 1 0 0 0 0 1 1   RB vcontents
     1 1 0 1 0 1 0 0   v3 contents
                       vmsof.m v2, v3, v0.t
     0 1 x x x x 0 0   v2 content

Pseudo-code:

    def sof(rd, rs1, rs2):
        rd = 0
        setting_mode = rs2 == x0 or (regs[rs2] & 1)

        while i < XLEN:
            bit = 1<<i

            # only reenable when predicate in use, and bit valid
            if !setting_mode && rs2 != x0:
                if (regs[rs2] & bit):
                    # back into "setting" mode
                    setting_mode = True

            # skipping mode
            if !setting_mode:
                # skip any more 1s
                if regs[rs1] & bit == 1:
                    i += 1
                    continue

            # setting mode, search for 1
            if regs[rs1] & bit: # found a bit in rs1:
                regs[rd] |= bit # only set when search succeeds
                setting_mode = False
                # next loop starts skipping

            i += 1

# Carry-lookahead

used not just for carry lookahead, also a special type of predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
  `((P|G)+G)^P`
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

```
     P = (A | B) & Ci
     G = (A & B)
```

Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.

```
    At each id, compute C[id] = A[id]+B[id]+0
    Get G[id] = C[id] > radix -1
    Get P[id] = C[id] == radix-1
    Join all P[id] together, likewise G[id]
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
```   

two versions: scalar int version and CR based version.

scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge

vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.

if zero (no propagation) then CR0.eq is zero

CR based version, TODO.