openpower/sv/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # SV Vector Operations.
   4
   5 Links:
   6
   7 * TODO merge old standards page [[simple_v_extension/vector_ops/]]
   8 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
   9 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
  10 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  12 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  13  out of scope for this document [[openpower/sv/3d_vector_ops]]
  14 * [[simple_v_extension/specification/bitmanip]] previous version,
  15   contains pseudocode for sof, sif, sbf
  16
  17
  18 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
  19
  20 Notes:
  21
  22 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
  23 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  24 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  25
  26 # Vector
  27
  28 Both of these instructions may be synthesised from SVP64 Vector
  29 instructions.  conflictd is an O(N^2) instruction based on
  30 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
  31 with the appropriate predication
  32
  33 ## conflictd
  34
  35 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  36
  37     input = [100, 100,   3, 100,   5, 100, 100,   3]
  38     conflict result = [
  39          0b00000000,    // Note: first element always zero
  40          0b00000001,    // 100 is present on #0
  41          0b00000000,
  42          0b00000011,    // 100 is present on #0 and #1
  43          0b00000000,
  44          0b00001011,    // 100 is present on #0, #1, #3
  45          0b00011011,    // .. and #4
  46          0b00000100     // 3 is present on #2
  47     ]
  48
  49 Pseudocode:
  50
  51     for i in range(VL):
  52         for j in range(1, i):
  53             if src1[i] == src2[j]:
  54                 result[j] |= 1<<i
  55
  56 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
  57   using `mfcrweird` and `cmpi`. first triangular schedule on src1,
  58 secpnd on src2.
  59
  60 Idea 2: implement using outer loop on varying setvl Horizontal-First
  61 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
  62 result.
  63
  64     li r3, 1
  65     li result, 0
  66     for i in range(target):
  67         setvl target
  68         sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
  69         sv.cmpi src2.v, t0 # compare src2 vector to scalar
  70         sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
  71         srr t1, t1, i # shift up by i before ORing
  72         or result, result, t1
  73         srr r3, r3, 1 # shift r3 predicate up by one
  74
  75 See [[sv/cr_int_predication]] for full details on the crweird instructions:
  76 the primary important aspect here is that a Vector of CR Field's EQ bits is
  77 transferred into a single GPR.  The secondary important aspect is that VL
  78 is being adjusted in each loop, testing successively more of the input
  79 vector against a given scalar, each time.
  80
  81 To investigate:
  82
  83 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
  84 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
  85
  86 ## iota
  87
  88 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  89
  90 When masked, only the bits not masked out are included in the count process.
  91
  92     viota RT/v, RA, RB
  93
  94 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  95
  96 Example
  97
  98      7 6 5 4 3 2 1 0   Element number
  99
 100      1 0 0 1 0 0 0 1   v2 contents
 101                        viota.m v4, v2 # Unmasked
 102      2 2 2 1 1 1 1 0   v4 result
 103
 104      1 1 1 0 1 0 1 1   v0 contents
 105      1 0 0 1 0 0 0 1   v2 contents
 106      2 3 4 5 6 7 8 9   v4 contents
 107                        viota.m v4, v2, v0.t # Masked
 108      1 1 1 5 1 7 1 0   v4 results
 109
 110      def iota(RT, RA, RB):
 111         mask = RB ? iregs[RB] : 0b111111...1
 112         val = RA ? iregs[RA] : 0b111111...1
 113         for i in range(VL):
 114             if RA.scalar:
 115             testmask = (1<<i)-1 # only count below
 116             to_test = val & testmask & mask
 117             iregs[RT+i] = popcount(to_test)
 118
 119 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
 120 where bit 2 is inv, bits 0:1 select the bit of the CR.
 121
 122      def test_CR_bit(CR, BO):
 123          return CR[BO[0:1]] == BO[2]
 124
 125      def iotacr(RT, BA, BO):
 126         mask = get_src_predicate()
 127         count = 0
 128         for i in range(VL):
 129             if mask & (1<<i) == 0:
 130                 count = 0 # reset back to zero
 131                 continue
 132             iregs[RT+i] = count
 133             if test_CR_bit(CR[i+BA], BO):
 134                  count += 1
 135
 136 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
 137
 138 scalar variant which can be Vectorised to give iotacr:
 139
 140      def crtaddi(RT, RA, BA, BO, D):
 141          if test_CR_bit(BA, BO):
 142              RT = RA + EXTS(D)
 143          else:
 144              RT = RA
 145
 146 a Vector for-loop with zero-ing on dest will give the
 147 mask-out effect of resetting the count back to zero.
 148 However close examination shows that the above may actually
 149 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
 150
 151 # Scalar
 152
 153 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
 154
 155 ## sbfm
 156
 157    sbfm RT, RA, RB!=0
 158
 159 Example
 160
 161      7 6 5 4 3 2 1 0   Bit index
 162
 163      1 0 0 1 0 1 0 0   v3 contents
 164                        vmsbf.m v2, v3
 165      0 0 0 0 0 0 1 1   v2 contents
 166
 167      1 0 0 1 0 1 0 1   v3 contents
 168                        vmsbf.m v2, v3
 169      0 0 0 0 0 0 0 0   v2
 170
 171      0 0 0 0 0 0 0 0   v3 contents
 172                        vmsbf.m v2, v3
 173      1 1 1 1 1 1 1 1   v2
 174
 175      1 1 0 0 0 0 1 1   RB vcontents
 176      1 0 0 1 0 1 0 0   v3 contents
 177                        vmsbf.m v2, v3, v0.t
 178      0 1 x x x x 1 1   v2 contents
 179
 180 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 181
 182 pseudocode:
 183
 184     def sbf(rd, rs1, rs2):
 185         rd = 0
 186         # start setting if no predicate or if 1st predicate bit set
 187         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 188         while i < XLEN:
 189             bit = 1<<i
 190             if rs2 != x0 and (regs[rs2] & bit):
 191                 # reset searching
 192                 setting_mode = False
 193             if setting_mode:
 194                 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
 195                     setting_mode = False
 196                 else:
 197                     regs[rd] |= bit
 198             else if rs2 != x0: # searching mode
 199                 if (regs[rs2] & bit):
 200                     setting_mode = True # back into "setting" mode
 201             i += 1
 202
 203 ## sifm
 204
 205 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 206
 207     sifm RT, RA, RB!=0
 208
 209  # Example
 210
 211      7 6 5 4 3 2 1 0   Bit number
 212
 213      1 0 0 1 0 1 0 0   v3 contents
 214                        vmsif.m v2, v3
 215      0 0 0 0 0 1 1 1   v2 contents
 216
 217      1 0 0 1 0 1 0 1   v3 contents
 218                        vmsif.m v2, v3
 219      0 0 0 0 0 0 0 1   v2
 220
 221      1 1 0 0 0 0 1 1   RB vcontents
 222      1 0 0 1 0 1 0 0   v3 contents
 223                        vmsif.m v2, v3, v0.t
 224      1 1 x x x x 1 1   v2 contents
 225
 226 Pseudo-code:
 227
 228     def sif(rd, rs1, rs2):
 229         rd = 0
 230         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 231
 232         while i < XLEN:
 233             bit = 1<<i
 234
 235             # only reenable when predicate in use, and bit valid
 236             if !setting_mode && rs2 != x0:
 237                 if (regs[rs2] & bit):
 238                     # back into "setting" mode
 239                     setting_mode = True
 240
 241             # skipping mode
 242             if !setting_mode:
 243                 # skip any more 1s
 244                 if regs[rs1] & bit == 1:
 245                     i += 1
 246                     continue
 247
 248             # setting mode, search for 1
 249             regs[rd] |= bit # always set during search
 250             if regs[rs1] & bit: # found a bit in rs1:
 251                 setting_mode = False
 252                 # next loop starts skipping
 253
 254             i += 1
 255
 256
 257 ## vmsof
 258
 259 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 260
 261     sofm RT, RA, RB
 262
 263 Example
 264
 265      7 6 5 4 3 2 1 0   Bit number
 266
 267      1 0 0 1 0 1 0 0   v3 contents
 268                        vmsof.m v2, v3
 269      0 0 0 0 0 1 0 0   v2 contents
 270
 271      1 0 0 1 0 1 0 1   v3 contents
 272                        vmsof.m v2, v3
 273      0 0 0 0 0 0 0 1   v2
 274
 275      1 1 0 0 0 0 1 1   RB vcontents
 276      1 1 0 1 0 1 0 0   v3 contents
 277                        vmsof.m v2, v3, v0.t
 278      0 1 x x x x 0 0   v2 content
 279
 280 Pseudo-code:
 281
 282     def sof(rd, rs1, rs2):
 283         rd = 0
 284         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 285
 286         while i < XLEN:
 287             bit = 1<<i
 288
 289             # only reenable when predicate in use, and bit valid
 290             if !setting_mode && rs2 != x0:
 291                 if (regs[rs2] & bit):
 292                     # back into "setting" mode
 293                     setting_mode = True
 294
 295             # skipping mode
 296             if !setting_mode:
 297                 # skip any more 1s
 298                 if regs[rs1] & bit == 1:
 299                     i += 1
 300                     continue
 301
 302             # setting mode, search for 1
 303             if regs[rs1] & bit: # found a bit in rs1:
 304                 regs[rd] |= bit # only set when search succeeds
 305                 setting_mode = False
 306                 # next loop starts skipping
 307
 308             i += 1
 309
 310 # Carry-lookahead
 311
 312 used not just for carry lookahead, also a special type of predication mask operation.
 313
 314 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 315 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 316 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 317 * <https://i.stack.imgur.com/QSLKY.png>
 318 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 319   `((P|G)+G)^P`
 320 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 321
 322 From QLSKY.png:
 323
 324 ```
 325     x0 = nand(CIn, P0)
 326     C0 = nand(x0, ~G0)
 327
 328     x1 = nand(CIn, P0, P1)
 329     y1 = nand(G0, P1)
 330     C1 = nand(x1, y1, ~G1)
 331
 332     x2 = nand(CIn, P0, P1, P2)
 333     y2 = nand(G0, P1, P2)
 334     z2 = nand(G1, P2)
 335     C1 = nand(x2, y2, z2, ~G2)
 336
 337     # Gen*
 338     x3 = nand(G0, P1, P2, P3)
 339     y3 = nand(G1, P2, P3)
 340     z3 = nand(G2, P3)
 341     G* = nand(x3, y3, z3, ~G3)
 342 ```
 343
 344 ```
 345      P = (A | B) & Ci
 346      G = (A & B)
 347 ```
 348
 349 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 350
 351 ```
 352     At each id, compute C[id] = A[id]+B[id]+0
 353     Get G[id] = C[id] > radix -1
 354     Get P[id] = C[id] == radix-1
 355     Join all P[id] together, likewise G[id]
 356     Compute newC = ((P|G)+G)^P
 357     result[id] = (C[id] + newC[id]) % radix
 358 ```
 359
 360 two versions: scalar int version and CR based version.
 361
 362 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 363
 364 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 365
 366 if zero (no propagation) then CR0.eq is zero
 367
 368 CR based version, TODO.