openpower/sv/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # SV Vector Operations.
   4
   5 * TODO merge old standards page [[simple_v_extension/vector_ops/]]
   6 * bugreport <https://bugs.libre-soc.org/show_bug.cgi?id=142>
   7
   8 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
   9
  10 Notes:
  11
  12 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
  13 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  14 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  15
  16 Links:
  17
  18 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
  19 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
  20 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
  21 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  22 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  23  out of scope for this document
  24 * [[simple_v_extension/specification/bitmanip]] previous version,
  25   contains pseudocode for sof, sif, sbf
  26
  27 # Vector
  28
  29 Both of these instructions may be synthesised from SVP64 Vector
  30 instructions.  conflictd is an O(N^2) instruction based on
  31 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
  32 with the appropriate predication
  33
  34 ## conflictd
  35
  36 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  37
  38     input = [100, 100,   3, 100,   5, 100, 100,   3]
  39     conflict result = [
  40          0b00000000,    // Note: first element always zero
  41          0b00000001,    // 100 is present on #0
  42          0b00000000,
  43          0b00000011,    // 100 is present on #0 and #1
  44          0b00000000,
  45          0b00001011,    // 100 is present on #0, #1, #3
  46          0b00011011,    // .. and #4
  47          0b00000100     // 3 is present on #2
  48     ]
  49
  50 Pseudocode:
  51
  52     for i in range(VL):
  53         for j in range(1, i):
  54             if src1[i] == src2[j]:
  55                 result[j] |= 1<<i
  56
  57 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
  58   using `mfcrweird` and `cmpi`. first triangular schedule on src1,
  59 secpnd on src2.
  60
  61 Idea 2: implement using outer loop on varying setvl Horizontal-First
  62 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
  63 result.
  64
  65     li r3, 1
  66     li result, 0
  67     for i in range(target):
  68         setvl target
  69         sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
  70         sv.cmpi src2.v, t0 # compare src2 vector to scalar
  71         sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
  72         srr t1, t1, i # shift up by i before ORing
  73         or result, result, t1
  74         srr r3, r3, 1 # shift r3 predicate up by one
  75
  76 See [[sv/cr_int_predication]] for full details on the crweird instructions:
  77 the primary important aspect here is that a Vector of CR Field's EQ bits is
  78 transferred into a single GPR.  The secondary important aspect is that VL
  79 is being adjusted in each loop, testing successively more of the input
  80 vector against a given scalar, each time.
  81
  82 To investigate:
  83
  84 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
  85 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
  86
  87 ## iota
  88
  89 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  90
  91 When masked, only the bits not masked out are included in the count process.
  92
  93     viota RT/v, RA, RB
  94
  95 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  96
  97 Example
  98
  99      7 6 5 4 3 2 1 0   Element number
 100
 101      1 0 0 1 0 0 0 1   v2 contents
 102                        viota.m v4, v2 # Unmasked
 103      2 2 2 1 1 1 1 0   v4 result
 104
 105      1 1 1 0 1 0 1 1   v0 contents
 106      1 0 0 1 0 0 0 1   v2 contents
 107      2 3 4 5 6 7 8 9   v4 contents
 108                        viota.m v4, v2, v0.t # Masked
 109      1 1 1 5 1 7 1 0   v4 results
 110
 111      def iota(RT, RA, RB):
 112         mask = RB ? iregs[RB] : 0b111111...1
 113         val = RA ? iregs[RA] : 0b111111...1
 114         for i in range(VL):
 115             if RA.scalar:
 116             testmask = (1<<i)-1 # only count below
 117             to_test = val & testmask & mask
 118             iregs[RT+i] = popcount(to_test)
 119
 120 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
 121 where bit 2 is inv, bits 0:1 select the bit of the CR.
 122
 123      def test_CR_bit(CR, BO):
 124          return CR[BO[0:1]] == BO[2]
 125
 126      def iotacr(RT, BA, BO):
 127         mask = get_src_predicate()
 128         count = 0
 129         for i in range(VL):
 130             if mask & (1<<i) == 0:
 131                 count = 0 # reset back to zero
 132                 continue
 133             iregs[RT+i] = count
 134             if test_CR_bit(CR[i+BA], BO):
 135                  count += 1
 136
 137 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
 138
 139 scalar variant which can be Vectorised to give iotacr:
 140
 141      def crtaddi(RT, RA, BA, BO, D):
 142          if test_CR_bit(BA, BO):
 143              RT = RA + EXTS(D)
 144          else:
 145              RT = RA
 146
 147 a Vector for-loop with zero-ing on dest will give the
 148 mask-out effect of resetting the count back to zero.
 149 However close examination shows that the above may actually
 150 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
 151
 152 # Scalar
 153
 154 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
 155
 156 ## sbfm
 157
 158    sbfm RT, RA, RB!=0
 159
 160 Example
 161
 162      7 6 5 4 3 2 1 0   Bit index
 163
 164      1 0 0 1 0 1 0 0   v3 contents
 165                        vmsbf.m v2, v3
 166      0 0 0 0 0 0 1 1   v2 contents
 167
 168      1 0 0 1 0 1 0 1   v3 contents
 169                        vmsbf.m v2, v3
 170      0 0 0 0 0 0 0 0   v2
 171
 172      0 0 0 0 0 0 0 0   v3 contents
 173                        vmsbf.m v2, v3
 174      1 1 1 1 1 1 1 1   v2
 175
 176      1 1 0 0 0 0 1 1   RB vcontents
 177      1 0 0 1 0 1 0 0   v3 contents
 178                        vmsbf.m v2, v3, v0.t
 179      0 1 x x x x 1 1   v2 contents
 180
 181 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 182
 183 pseudocode:
 184
 185     def sbf(rd, rs1, rs2):
 186         rd = 0
 187         # start setting if no predicate or if 1st predicate bit set
 188         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 189         while i < XLEN:
 190             bit = 1<<i
 191             if rs2 != x0 and (regs[rs2] & bit):
 192                 # reset searching
 193                 setting_mode = False
 194             if setting_mode:
 195                 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
 196                     setting_mode = False
 197                 else:
 198                     regs[rd] |= bit
 199             else if rs2 != x0: # searching mode
 200                 if (regs[rs2] & bit):
 201                     setting_mode = True # back into "setting" mode
 202             i += 1
 203
 204 ## sifm
 205
 206 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 207
 208     sifm RT, RA, RB!=0
 209
 210  # Example
 211
 212      7 6 5 4 3 2 1 0   Bit number
 213
 214      1 0 0 1 0 1 0 0   v3 contents
 215                        vmsif.m v2, v3
 216      0 0 0 0 0 1 1 1   v2 contents
 217
 218      1 0 0 1 0 1 0 1   v3 contents
 219                        vmsif.m v2, v3
 220      0 0 0 0 0 0 0 1   v2
 221
 222      1 1 0 0 0 0 1 1   RB vcontents
 223      1 0 0 1 0 1 0 0   v3 contents
 224                        vmsif.m v2, v3, v0.t
 225      1 1 x x x x 1 1   v2 contents
 226
 227 Pseudo-code:
 228
 229     def sif(rd, rs1, rs2):
 230         rd = 0
 231         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 232
 233         while i < XLEN:
 234             bit = 1<<i
 235
 236             # only reenable when predicate in use, and bit valid
 237             if !setting_mode && rs2 != x0:
 238                 if (regs[rs2] & bit):
 239                     # back into "setting" mode
 240                     setting_mode = True
 241
 242             # skipping mode
 243             if !setting_mode:
 244                 # skip any more 1s
 245                 if regs[rs1] & bit == 1:
 246                     i += 1
 247                     continue
 248
 249             # setting mode, search for 1
 250             regs[rd] |= bit # always set during search
 251             if regs[rs1] & bit: # found a bit in rs1:
 252                 setting_mode = False
 253                 # next loop starts skipping
 254
 255             i += 1
 256
 257
 258 ## vmsof
 259
 260 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 261
 262     sofm RT, RA, RB
 263
 264 Example
 265
 266      7 6 5 4 3 2 1 0   Bit number
 267
 268      1 0 0 1 0 1 0 0   v3 contents
 269                        vmsof.m v2, v3
 270      0 0 0 0 0 1 0 0   v2 contents
 271
 272      1 0 0 1 0 1 0 1   v3 contents
 273                        vmsof.m v2, v3
 274      0 0 0 0 0 0 0 1   v2
 275
 276      1 1 0 0 0 0 1 1   RB vcontents
 277      1 1 0 1 0 1 0 0   v3 contents
 278                        vmsof.m v2, v3, v0.t
 279      0 1 x x x x 0 0   v2 content
 280
 281 Pseudo-code:
 282
 283     def sof(rd, rs1, rs2):
 284         rd = 0
 285         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 286
 287         while i < XLEN:
 288             bit = 1<<i
 289
 290             # only reenable when predicate in use, and bit valid
 291             if !setting_mode && rs2 != x0:
 292                 if (regs[rs2] & bit):
 293                     # back into "setting" mode
 294                     setting_mode = True
 295
 296             # skipping mode
 297             if !setting_mode:
 298                 # skip any more 1s
 299                 if regs[rs1] & bit == 1:
 300                     i += 1
 301                     continue
 302
 303             # setting mode, search for 1
 304             if regs[rs1] & bit: # found a bit in rs1:
 305                 regs[rd] |= bit # only set when search succeeds
 306                 setting_mode = False
 307                 # next loop starts skipping
 308
 309             i += 1
 310
 311 # Carry-lookahead
 312
 313 used not just for carry lookahead, also a special type of predication mask operation.
 314
 315 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 316 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 317 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 318 * <https://i.stack.imgur.com/QSLKY.png>
 319 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 320   `((P|G)+G)^P`
 321 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 322
 323 From QLSKY.png:
 324
 325 ```
 326     x0 = nand(CIn, P0)
 327     C0 = nand(x0, ~G0)
 328
 329     x1 = nand(CIn, P0, P1)
 330     y1 = nand(G0, P1)
 331     C1 = nand(x1, y1, ~G1)
 332
 333     x2 = nand(CIn, P0, P1, P2)
 334     y2 = nand(G0, P1, P2)
 335     z2 = nand(G1, P2)
 336     C1 = nand(x2, y2, z2, ~G2)
 337
 338     # Gen*
 339     x3 = nand(G0, P1, P2, P3)
 340     y3 = nand(G1, P2, P3)
 341     z3 = nand(G2, P3)
 342     G* = nand(x3, y3, z3, ~G3)
 343 ```
 344
 345 ```
 346      P = (A | B) & Ci
 347      G = (A & B)
 348 ```
 349
 350 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 351
 352 ```
 353     At each id, compute C[id] = A[id]+B[id]+0
 354     Get G[id] = C[id] > radix -1
 355     Get P[id] = C[id] == radix-1
 356     Join all P[id] together, likewise G[id]
 357     Compute newC = ((P|G)+G)^P
 358     result[id] = (C[id] + newC[id]) % radix
 359 ```
 360
 361 two versions: scalar int version and CR based version.
 362
 363 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 364
 365 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 366
 367 if zero (no propagation) then CR0.eq is zero
 368
 369 CR based version, TODO.