[[!tag standards]]

# SV Vector Operations

Links:

* [[discussion]]
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=865> implementation in simulator
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops,
  out of scope for this document: see [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf
* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>

The core Power ISA was designed as scalar: SV adds a level of abstraction
that provides variable-length, element-independent parallelism.
There are therefore few cases where *actual* Vector instructions are
needed, and those that are needed tend to be "assistance" functions.
Two traditional Vector instructions were initially considered
(conflictd and vmiota), however they may be synthesised
from existing SVP64 instructions: details are in [[discussion]].

Notes:

* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].

# Mask-suited Bitmanipulation

Based on Cray-style masked set-before-first, set-after-first etc.
and on Intel and AMD Bitmanip instructions, generalised and then
extended to include masks, this is a single instruction
covering 24 individual instructions in other ISAs.
*(sbf/sof/sif moved to [[discussion]])*

BM2-Form

    |0     |6    |11    |16    |21-25|26|27..31|
    |------|-----|------|------|-----|--|------|
    | PO   |  RS |   RA |   RB |mode |L |   XO |

* bmask RT,RA,RB,mode,L

The patterns within the pseudocode for AMD TBM and x86 BMI1 are
as follows:

* first pattern A: `x / ~x`
* second pattern B: `| / & / ^`
* third pattern C: `x+1 / x-1 / ~(x+1) / -x`

Thus it makes sense to create a single instruction
that covers all of these.
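
The three patterns can be combined mechanically. The following Python is a minimal sketch only, not the `bmask` encoding itself: the names `PATTERN_A`, `PATTERN_B`, `PATTERN_C`, `combine` and the 64-bit register width are assumptions made for illustration. The (A, B, C) triples listed are the published x86 BMI1 and AMD TBM definitions.

```python
import operator

MASK64 = (1 << 64) - 1  # assumption: results truncated to 64 bits

# pattern A: the first operand is either x or ~x
PATTERN_A = {"x": lambda x: x, "~x": lambda x: ~x}
# pattern B: the combining operator
PATTERN_B = {"|": operator.or_, "&": operator.and_, "^": operator.xor}
# pattern C: the second operand, derived from the same x
PATTERN_C = {
    "x+1": lambda x: x + 1,
    "x-1": lambda x: x - 1,
    "~(x+1)": lambda x: ~(x + 1),
    "-x": lambda x: -x,
}

def combine(x, a, b, c):
    """Hypothetical helper: compute A(x) <B> C(x), truncated to 64 bits."""
    return PATTERN_B[b](PATTERN_A[a](x), PATTERN_C[c](x)) & MASK64

# every combination of the three patterns: 2 * 3 * 4 = 24
ALL_TRIPLES = [(a, b, c) for a in PATTERN_A
                         for b in PATTERN_B
                         for c in PATTERN_C]

# a few x86 instructions expressed as (A, B, C) triples
X86_EXAMPLES = {
    "blsr":   ("x", "&", "x-1"),   # BMI1: reset lowest set bit
    "blsmsk": ("x", "^", "x-1"),   # BMI1: mask up to lowest set bit
    "blsi":   ("x", "&", "-x"),    # BMI1: isolate lowest set bit
    "blcs":   ("x", "|", "x+1"),   # TBM: set lowest clear bit
    "tzmsk":  ("~x", "&", "x-1"),  # TBM: mask from trailing zeros
}

if __name__ == "__main__":
    for name, triple in X86_EXAMPLES.items():
        print(name, bin(combine(0b10100, *triple)))
```

The count of 24 combinations matches the number of individual instructions quoted above; not every combination corresponds to an existing x86 instruction.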

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/bmask.py"]]
```

# Carry-lookahead

* cprop RT,RA,RB
* cprop. RT,RA,RB

pseudocode:

    P = (RA)
    G = (RB)
    RT = ((P|G)+G)^P 
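
The pseudocode can be tried out directly in Python. This is a sketch under stated assumptions: the 64-bit truncation (`MASK64`) and the reference function `ripple_carry_in` are not part of the pseudocode above and are introduced here only for comparison. The comparison is restricted to inputs where P and G have no bits in common, which is the case in which each result bit reads as the carry *into* that bit position.

```python
MASK64 = (1 << 64) - 1  # assumption: RT is a 64-bit register

def cprop(ra, rb):
    """Direct transcription of the pseudocode: RT = ((P|G)+G)^P."""
    p = ra
    g = rb
    return (((p | g) + g) ^ p) & MASK64

def ripple_carry_in(p, g, width):
    """Hypothetical reference model: bit i of the result is the carry
    into position i, rippling one position at a time from carry-in 0."""
    out = 0
    carry = 0
    for i in range(width + 1):
        out |= carry << i
        p_i = (p >> i) & 1
        g_i = (g >> i) & 1
        carry = g_i | (p_i & carry)  # generate, or propagate incoming carry
    return out

if __name__ == "__main__":
    # position 0 generates, position 1 propagates:
    # carry into positions 1 and 2
    print(bin(cprop(0b10, 0b01)))
```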

X-Form

| 0..5|6..10|11..15|16..20| 21..30    |31| name      |  Form   |
| -- | -- | --- | --- | ---------  |--| ----      | ------- |
| NN | RT | RA  | RB  | 0110001110 |Rc|     cprop | X-Form  |

Used not just for carry lookahead but also as a special type of predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
  `((P|G)+G)^P`
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

From QSLKY.png:

```
    x0 = nand(CIn, P0)
    C0 = nand(x0, ~G0)

    x1 = nand(CIn, P0, P1)
    y1 = nand(G0, P1)
    C1 = nand(x1, y1, ~G1)

    x2 = nand(CIn, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    z2 = nand(G1, P2)
    C2 = nand(x2, y2, z2, ~G2)

    # Gen*
    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    z3 = nand(G2, P3)
    G* = nand(x3, y3, z3, ~G3)
```
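
The NAND network can be checked against a plain ripple-carry recurrence. The following Python is a sketch for illustration: `nand`, `inv`, `lookahead` and `ripple` are hypothetical helper names, bits are modelled as 0/1 integers, and the third carry is named `c2`.

```python
from itertools import product

def nand(*bits):
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1

def inv(bit):
    """Single-bit inversion, standing in for ~G in the network."""
    return 1 - bit

def lookahead(cin, p, g):
    """p and g are 4-element sequences of 0/1; returns (c0, c1, c2, gstar)."""
    p0, p1, p2, p3 = p
    g0, g1, g2, g3 = g
    c0 = nand(nand(cin, p0), inv(g0))
    c1 = nand(nand(cin, p0, p1), nand(g0, p1), inv(g1))
    c2 = nand(nand(cin, p0, p1, p2), nand(g0, p1, p2),
              nand(g1, p2), inv(g2))
    # group generate: carry out of the 4-bit block with no carry in
    gstar = nand(nand(g0, p1, p2, p3), nand(g1, p2, p3),
                 nand(g2, p3), inv(g3))
    return c0, c1, c2, gstar

def ripple(cin, p, g):
    """Reference: carry out of each position, one position at a time."""
    carries = []
    c = cin
    for p_i, g_i in zip(p, g):
        c = g_i | (p_i & c)
        carries.append(c)
    return carries

if __name__ == "__main__":
    bits4 = list(product((0, 1), repeat=4))
    for cin, p, g in product((0, 1), bits4, bits4):
        c0, c1, c2, gstar = lookahead(cin, p, g)
        assert [c0, c1, c2] == ripple(cin, p, g)[:3]
        assert gstar == ripple(0, p, g)[3]
    print("NAND network matches ripple carry")
```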

```
     P = (A | B) & Ci
     G = (A & B)
```

The Stackoverflow algorithm `((P|G)+G)^P` works on the accumulated bits of P and G from the associated vector units (P and G are integers here, holding one bit per element).  The result is the new carry-in, which already includes ripple: one bit of carry per element.

```
    At each id, compute C[id] = A[id]+B[id]+0
    Get G[id] = C[id] > radix-1
    Get P[id] = C[id] == radix-1
    Join all P[id] together, likewise G[id]
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
```   
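
The steps above can be sketched in Python as a multi-limb addition. This is an illustrative model only: `add_limbs`, `to_limbs` and `from_limbs` are hypothetical helper names, limbs are stored least-significant first, and Python's unbounded integers stand in for the joined P and G bits.

```python
def cprop(p, g):
    """newC = ((P|G)+G)^P, on unbounded Python integers."""
    return ((p | g) + g) ^ p

def add_limbs(a, b, radix=10):
    """Add two equal-length limb lists (least-significant limb first).
    Returns (result limbs, carry out)."""
    # per-element sum with no carry-in
    c = [x + y for x, y in zip(a, b)]
    # join the per-element generate and propagate bits into integers
    g = sum(1 << i for i, s in enumerate(c) if s > radix - 1)
    p = sum(1 << i for i, s in enumerate(c) if s == radix - 1)
    newc = cprop(p, g)
    # bit i of newc is the carry into element i
    result = [(s + ((newc >> i) & 1)) % radix for i, s in enumerate(c)]
    carry_out = (newc >> len(c)) & 1
    return result, carry_out

def to_limbs(value, count, radix=10):
    """Split an integer into `count` limbs, least-significant first."""
    return [(value // radix**i) % radix for i in range(count)]

def from_limbs(limbs, radix=10):
    """Inverse of to_limbs."""
    return sum(limb * radix**i for i, limb in enumerate(limbs))

if __name__ == "__main__":
    limbs, carry = add_limbs(to_limbs(999, 3), to_limbs(1, 3))
    print(limbs, carry)  # 999 + 1 = 1000: limbs [0, 0, 0], carry out 1
```

All per-element sums are computed independently; only the single `cprop` step carries information between elements.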

Two versions: a scalar int version and a CR-based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR-based version: TODO.