openpower/sv/vector_ops.mdwn

[[!tag standards]]

# SV Vector Operations

Links:

* [[discussion]]
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  out of scope for this document [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf
* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>

The core Power ISA was designed as scalar: SV provides a level of abstraction that adds variable-length element-independent parallelism on top of it.
There are therefore few cases where *actual* Vector
instructions are needed, and those that are needed act more as "assistance"
functions.  Two traditional Vector instructions were initially
considered (conflictd and vmiota), however they may be synthesised
from existing SVP64 instructions: details in [[discussion]].

Notes:

* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU).
* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].

# sbfm

    sbfm RT, RA, RB!=0

Example

     7 6 5 4 3 2 1 0   Bit number

     1 0 0 1 0 1 0 0   v3 contents
                       vmsbf.m v2, v3
     0 0 0 0 0 0 1 1   v2 contents

     1 0 0 1 0 1 0 1   v3 contents
                       vmsbf.m v2, v3
     0 0 0 0 0 0 0 0   v2 contents

     0 0 0 0 0 0 0 0   v3 contents
                       vmsbf.m v2, v3
     1 1 1 1 1 1 1 1   v2 contents

     1 1 0 0 0 0 1 1   RB contents
     1 0 0 1 0 1 0 0   v3 contents
                       vmsbf.m v2, v3, v0.t
     0 1 x x x x 1 1   v2 contents

The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
```
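
The set-before-first behaviour described above can also be modelled element by element. The following is a minimal sketch, independent of `sbf.py`: the function name `set_before_first` and its `dest` and `width` parameters are hypothetical, and it assumes that inactive elements (shown as `x` in the example) keep their previous destination value.

```python
def set_before_first(src, mask, dest, width=8):
    """Model of masked set-before-first on `width` one-bit elements.

    src, mask and dest are integers; bit i is element i.
    Assumption of this sketch: inactive elements (mask bit clear)
    keep whatever value they had in dest.
    """
    result = dest
    found = False
    for i in range(width):
        if not (mask >> i) & 1:
            continue                 # inactive element: left untouched
        if (src >> i) & 1:
            found = True             # first active set source bit seen
        if found:
            result &= ~(1 << i)      # that element and all later active ones: 0
        else:
            result |= 1 << i         # active elements before it: 1
    return result
```

With an all-ones mask, `set_before_first(0b10010100, 0xFF, 0)` gives `0b00000011`, matching the first example.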

# sifm

The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.

    sifm RT, RA, RB!=0

Example

     7 6 5 4 3 2 1 0   Bit number

     1 0 0 1 0 1 0 0   v3 contents
                       vmsif.m v2, v3
     0 0 0 0 0 1 1 1   v2 contents

     1 0 0 1 0 1 0 1   v3 contents
                       vmsif.m v2, v3
     0 0 0 0 0 0 0 1   v2 contents

     1 1 0 0 0 0 1 1   RB contents
     1 0 0 1 0 1 0 0   v3 contents
                       vmsif.m v2, v3, v0.t
     1 1 x x x x 1 1   v2 contents

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
```
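
When the mask elements are packed into an integer, set-including-first can be written without a loop using the identity `x ^ (x - 1)`, which selects the lowest set bit and every bit below it. This is a sketch only, independent of `sif.py`: the name `set_including_first` and the `dest` and `width` parameters are hypothetical, and inactive elements are assumed to keep their previous destination value.

```python
def set_including_first(src, mask, dest, width=8):
    """Masked set-including-first on `width` one-bit elements.

    Assumption of this sketch: inactive elements keep their dest value.
    """
    full = (1 << width) - 1
    active = src & mask                     # only active source bits count
    # Lowest active set bit plus everything below it.
    # If active == 0 this becomes all ones, as required when no bit is set.
    upto = (active ^ (active - 1)) & full
    return (dest & ~mask & full) | (upto & mask)
```

For example, `set_including_first(0b10010100, 0b11000011, 0)` returns `0b11000011`, which is the masked example with the `x` positions read as zero.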

# sofm

The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.

    sofm RT, RA, RB

Example

     7 6 5 4 3 2 1 0   Bit number

     1 0 0 1 0 1 0 0   v3 contents
                       vmsof.m v2, v3
     0 0 0 0 0 1 0 0   v2 contents

     1 0 0 1 0 1 0 1   v3 contents
                       vmsof.m v2, v3
     0 0 0 0 0 0 0 1   v2 contents

     1 1 0 0 0 0 1 1   RB contents
     1 1 0 1 0 1 0 0   v3 contents
                       vmsof.m v2, v3, v0.t
     0 1 x x x x 0 0   v2 contents

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
```
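
Set-only-first on an integer-packed mask reduces to isolating the lowest set bit of the active source bits, for which `x & -x` is the usual identity. The sketch below is independent of `sof.py`; the name `set_only_first` and the `dest` and `width` parameters are hypothetical, and inactive elements are assumed to keep their previous destination value.

```python
def set_only_first(src, mask, dest, width=8):
    """Masked set-only-first on `width` one-bit elements.

    Assumption of this sketch: inactive elements keep their dest value.
    """
    full = (1 << width) - 1
    active = src & mask             # only active source bits count
    first = active & -active        # lowest set bit alone, or 0 if none
    return (dest & ~mask & full) | first
```

For example, `set_only_first(0b11010100, 0b11000011, 0)` returns `0b01000000`: bits 2 and 4 of the source are inactive, so the first active set bit is bit 6.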

# Carry-lookahead

Used not just for carry lookahead but also as a special type of predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
  `((P|G)+G)^P`
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

From QSLKY.png:

```
    x0 = nand(CIn, P0)
    C0 = nand(x0, ~G0)

    x1 = nand(CIn, P0, P1)
    y1 = nand(G0, P1)
    C1 = nand(x1, y1, ~G1)

    x2 = nand(CIn, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    z2 = nand(G1, P2)
    C2 = nand(x2, y2, z2, ~G2)

    # Gen*
    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    z3 = nand(G2, P3)
    G* = nand(x3, y3, z3, ~G3)
```

```
    P = (A | B)
    G = (A & B)
    Co = G | (P & Ci)
```
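
As a sanity check, the NAND network can be evaluated in software and compared against ordinary integer addition. This is a minimal sketch that assumes the standard per-bit definitions `P = A | B` and `G = A & B`, a 4-bit operand width, and names the third carry `C2`; the helper names `nand`, `lookahead4` and `reference_carry` are hypothetical.

```python
def nand(*bits):
    """N-input NAND on 0/1 integers."""
    return 0 if all(bits) else 1

def lookahead4(a, b, cin):
    """Return (C0, C1, C2, Gstar) for a 4-bit add, via the NAND network."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    P = [x | y for x, y in zip(A, B)]    # propagate (assumed definition)
    G = [x & y for x, y in zip(A, B)]    # generate
    nG = [1 - g for g in G]              # ~G

    c0 = nand(nand(cin, P[0]), nG[0])
    c1 = nand(nand(cin, P[0], P[1]), nand(G[0], P[1]), nG[1])
    c2 = nand(nand(cin, P[0], P[1], P[2]), nand(G[0], P[1], P[2]),
              nand(G[1], P[2]), nG[2])
    # Group generate: carry out of bit 3 when the carry-in is ignored.
    gstar = nand(nand(G[0], P[1], P[2], P[3]), nand(G[1], P[2], P[3]),
                 nand(G[2], P[3]), nG[3])
    return c0, c1, c2, gstar

def reference_carry(a, b, cin, i):
    """Carry out of bit i, taken from ordinary integer addition."""
    low = (1 << (i + 1)) - 1
    return ((a & low) + (b & low) + cin) >> (i + 1)
```

Each of `C0`, `C1` and `C2` should equal `reference_carry(a, b, cin, i)` for `i` = 0, 1, 2, and `Gstar` should equal `reference_carry(a, b, 0, 3)`, since the group-generate term does not depend on the carry-in.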

The Stack Overflow algorithm `((P|G)+G)^P` works on the accumulated P and G bits from the associated vector units (P and G are integers here, one bit per element).  The result of the algorithm is the new carry-in, one bit of carry per element, with ripple already included.

```
    At each id, compute C[id] = A[id]+B[id]+0
    Get G[id] = C[id] > radix-1
    Get P[id] = C[id] == radix-1
    Join all P[id] together, likewise G[id]
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
```
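
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the proposed instruction: digits are held least-significant first in equal-length lists, the initial carry-in is zero as in the pseudocode, and the function name `add_digits` is hypothetical.

```python
def add_digits(a, b, radix):
    """Add two equal-length digit lists (least-significant digit first).

    Returns (result_digits, carry_out).
    """
    n = len(a)
    c = [x + y for x, y in zip(a, b)]   # per-element sums, no carries yet
    p = 0                               # propagate bits, one per element
    g = 0                               # generate bits, one per element
    for i, s in enumerate(c):
        if s > radix - 1:
            g |= 1 << i                 # element overflows on its own
        elif s == radix - 1:
            p |= 1 << i                 # element overflows if carried into
    # Bit i of new_c is the carry into element i, ripple included.
    new_c = ((p | g) + g) ^ p
    result = [(c[i] + ((new_c >> i) & 1)) % radix for i in range(n)]
    carry_out = (new_c >> n) & 1        # carry out of the top element
    return result, carry_out
```

For example, `add_digits([9, 9, 9], [1, 0, 0], 10)` (999 + 1) gives `([0, 0, 0], 1)`: element 0 generates, elements 1 and 2 propagate, and the single integer addition ripples the carry through all of them.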

Two versions: a scalar int version and a CR-based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, taking P and G as registers, and taking a radix argument.  The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry-in and stores the end bits in CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR-based version: TODO.