openpower/sv/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # SV Vector Operations.
   4
   5 Links:
   6
   7 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
   8 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
   9 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  12  out of scope for this document [[openpower/sv/3d_vector_ops]]
  13 * [[simple_v_extension/specification/bitmanip]] previous version,
  14   contains pseudocode for sof, sif, sbf
  15
  16 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
  17
  18 Notes:
  19
  20 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
  21 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  22 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  23
  24 # Vector
  25
  26 Both of these instructions may be synthesised from SVP64 Vector
  27 instructions.  conflictd is an O(N^2) instruction based on
  28 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
  29 with the appropriate predication
  30
  31 ## conflictd
  32
  33 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  34
  35     input = [100, 100,   3, 100,   5, 100, 100,   3]
  36     conflict result = [
  37          0b00000000,    // Note: first element always zero
  38          0b00000001,    // 100 is present on #0
  39          0b00000000,
  40          0b00000011,    // 100 is present on #0 and #1
  41          0b00000000,
  42          0b00001011,    // 100 is present on #0, #1, #3
  43          0b00011011,    // .. and #4
  44          0b00000100     // 3 is present on #2
  45     ]
  46
  47 Pseudocode:
  48
  49     for i in range(VL):
  50         for j in range(1, i):
  51             if src1[i] == src2[j]:
  52                 result[j] |= 1<<i
  53
  54 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
  55   using `mfcrweird` and `cmpi`. first triangular schedule on src1,
  56 secpnd on src2.
  57
  58 Idea 2: implement using outer loop on varying setvl Horizontal-First
  59 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
  60 result.
  61
  62     li r3, 1
  63     li result, 0
  64     for i in range(target):
  65         setvl target
  66         sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
  67         sv.cmpi src2.v, t0 # compare src2 vector to scalar
  68         sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
  69         srr t1, t1, i # shift up by i before ORing
  70         or result, result, t1
  71         srr r3, r3, 1 # shift r3 predicate up by one
  72
  73 See [[sv/cr_int_predication]] for full details on the crweird instructions:
  74 the primary important aspect here is that a Vector of CR Field's EQ bits is
  75 transferred into a single GPR.  The secondary important aspect is that VL
  76 is being adjusted in each loop, testing successively more of the input
  77 vector against a given scalar, each time.
  78
  79 To investigate:
  80
  81 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
  82 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
  83
  84 ## iota
  85
  86 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  87
  88 When masked, only the bits not masked out are included in the count process.
  89
  90     viota RT/v, RA, RB
  91
  92 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  93
  94 Example
  95
  96      7 6 5 4 3 2 1 0   Element number
  97
  98      1 0 0 1 0 0 0 1   v2 contents
  99                        viota.m v4, v2 # Unmasked
 100      2 2 2 1 1 1 1 0   v4 result
 101
 102      1 1 1 0 1 0 1 1   v0 contents
 103      1 0 0 1 0 0 0 1   v2 contents
 104      2 3 4 5 6 7 8 9   v4 contents
 105                        viota.m v4, v2, v0.t # Masked
 106      1 1 1 5 1 7 1 0   v4 results
 107
 108      def iota(RT, RA, RB):
 109         mask = RB ? iregs[RB] : 0b111111...1
 110         val = RA ? iregs[RA] : 0b111111...1
 111         for i in range(VL):
 112             if RA.scalar:
 113             testmask = (1<<i)-1 # only count below
 114             to_test = val & testmask & mask
 115             iregs[RT+i] = popcount(to_test)
 116
 117 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
 118 where bit 2 is inv, bits 0:1 select the bit of the CR.
 119
 120      def test_CR_bit(CR, BO):
 121          return CR[BO[0:1]] == BO[2]
 122
 123      def iotacr(RT, BA, BO):
 124         mask = get_src_predicate()
 125         count = 0
 126         for i in range(VL):
 127             if mask & (1<<i) == 0:
 128                 count = 0 # reset back to zero
 129                 continue
 130             iregs[RT+i] = count
 131             if test_CR_bit(CR[i+BA], BO):
 132                  count += 1
 133
 134 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
 135
 136 scalar variant which can be Vectorised to give iotacr:
 137
 138      def crtaddi(RT, RA, BA, BO, D):
 139          if test_CR_bit(BA, BO):
 140              RT = RA + EXTS(D)
 141          else:
 142              RT = RA
 143
 144 a Vector for-loop with zero-ing on dest will give the
 145 mask-out effect of resetting the count back to zero.
 146 However close examination shows that the above may actually
 147 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
 148
 149 # Scalar
 150
 151 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
 152
 153 ## sbfm
 154
 155    sbfm RT, RA, RB!=0
 156
 157 Example
 158
 159      7 6 5 4 3 2 1 0   Bit index
 160
 161      1 0 0 1 0 1 0 0   v3 contents
 162                        vmsbf.m v2, v3
 163      0 0 0 0 0 0 1 1   v2 contents
 164
 165      1 0 0 1 0 1 0 1   v3 contents
 166                        vmsbf.m v2, v3
 167      0 0 0 0 0 0 0 0   v2
 168
 169      0 0 0 0 0 0 0 0   v3 contents
 170                        vmsbf.m v2, v3
 171      1 1 1 1 1 1 1 1   v2
 172
 173      1 1 0 0 0 0 1 1   RB vcontents
 174      1 0 0 1 0 1 0 0   v3 contents
 175                        vmsbf.m v2, v3, v0.t
 176      0 1 x x x x 1 1   v2 contents
 177
 178 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 179
 180 Executable demo:
 181
 182 ```
 183 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
 184 ```
 185
 186 ## sifm
 187
 188 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 189
 190     sifm RT, RA, RB!=0
 191
 192  # Example
 193
 194      7 6 5 4 3 2 1 0   Bit number
 195
 196      1 0 0 1 0 1 0 0   v3 contents
 197                        vmsif.m v2, v3
 198      0 0 0 0 0 1 1 1   v2 contents
 199
 200      1 0 0 1 0 1 0 1   v3 contents
 201                        vmsif.m v2, v3
 202      0 0 0 0 0 0 0 1   v2
 203
 204      1 1 0 0 0 0 1 1   RB vcontents
 205      1 0 0 1 0 1 0 0   v3 contents
 206                        vmsif.m v2, v3, v0.t
 207      1 1 x x x x 1 1   v2 contents
 208
 209 Executable demo:
 210
 211 ```
 212 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
 213 ```
 214
 215 ## vmsof
 216
 217 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 218
 219     sofm RT, RA, RB
 220
 221 Example
 222
 223      7 6 5 4 3 2 1 0   Bit number
 224
 225      1 0 0 1 0 1 0 0   v3 contents
 226                        vmsof.m v2, v3
 227      0 0 0 0 0 1 0 0   v2 contents
 228
 229      1 0 0 1 0 1 0 1   v3 contents
 230                        vmsof.m v2, v3
 231      0 0 0 0 0 0 0 1   v2
 232
 233      1 1 0 0 0 0 1 1   RB vcontents
 234      1 1 0 1 0 1 0 0   v3 contents
 235                        vmsof.m v2, v3, v0.t
 236      0 1 x x x x 0 0   v2 content
 237
 238 Executable demo:
 239
 240 ```
 241 [[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
 242 ```
 243
 244 # Carry-lookahead
 245
 246 used not just for carry lookahead, also a special type of predication mask operation.
 247
 248 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 249 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 250 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 251 * <https://i.stack.imgur.com/QSLKY.png>
 252 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 253   `((P|G)+G)^P`
 254 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 255
 256 From QLSKY.png:
 257
 258 ```
 259     x0 = nand(CIn, P0)
 260     C0 = nand(x0, ~G0)
 261
 262     x1 = nand(CIn, P0, P1)
 263     y1 = nand(G0, P1)
 264     C1 = nand(x1, y1, ~G1)
 265
 266     x2 = nand(CIn, P0, P1, P2)
 267     y2 = nand(G0, P1, P2)
 268     z2 = nand(G1, P2)
 269     C1 = nand(x2, y2, z2, ~G2)
 270
 271     # Gen*
 272     x3 = nand(G0, P1, P2, P3)
 273     y3 = nand(G1, P2, P3)
 274     z3 = nand(G2, P3)
 275     G* = nand(x3, y3, z3, ~G3)
 276 ```
 277
 278 ```
 279      P = (A | B) & Ci
 280      G = (A & B)
 281 ```
 282
 283 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 284
 285 ```
 286     At each id, compute C[id] = A[id]+B[id]+0
 287     Get G[id] = C[id] > radix -1
 288     Get P[id] = C[id] == radix-1
 289     Join all P[id] together, likewise G[id]
 290     Compute newC = ((P|G)+G)^P
 291     result[id] = (C[id] + newC[id]) % radix
 292 ```
 293
 294 two versions: scalar int version and CR based version.
 295
 296 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 297
 298 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 299
 300 if zero (no propagation) then CR0.eq is zero
 301
 302 CR based version, TODO.