openpower/sv/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # SV Vector Operations.
   4
   5 Links:
   6
   7 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
   8 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
   9 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  12  out of scope for this document [[openpower/sv/3d_vector_ops]]
  13 * [[simple_v_extension/specification/bitmanip]] previous version,
  14   contains pseudocode for sof, sif, sbf
  15
  16
  17 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
  18
  19 Notes:
  20
  21 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
  22 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  23 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  24
  25 # Vector
  26
  27 Both of these instructions may be synthesised from SVP64 Vector
  28 instructions.  conflictd is an O(N^2) instruction based on
  29 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
  30 with the appropriate predication
  31
  32 ## conflictd
  33
  34 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  35
  36     input = [100, 100,   3, 100,   5, 100, 100,   3]
  37     conflict result = [
  38          0b00000000,    // Note: first element always zero
  39          0b00000001,    // 100 is present on #0
  40          0b00000000,
  41          0b00000011,    // 100 is present on #0 and #1
  42          0b00000000,
  43          0b00001011,    // 100 is present on #0, #1, #3
  44          0b00011011,    // .. and #4
  45          0b00000100     // 3 is present on #2
  46     ]
  47
  48 Pseudocode:
  49
  50     for i in range(VL):
  51         for j in range(1, i):
  52             if src1[i] == src2[j]:
  53                 result[j] |= 1<<i
  54
  55 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
  56   using `mfcrweird` and `cmpi`. first triangular schedule on src1,
  57 secpnd on src2.
  58
  59 Idea 2: implement using outer loop on varying setvl Horizontal-First
  60 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
  61 result.
  62
  63     li r3, 1
  64     li result, 0
  65     for i in range(target):
  66         setvl target
  67         sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
  68         sv.cmpi src2.v, t0 # compare src2 vector to scalar
  69         sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
  70         srr t1, t1, i # shift up by i before ORing
  71         or result, result, t1
  72         srr r3, r3, 1 # shift r3 predicate up by one
  73
  74 See [[sv/cr_int_predication]] for full details on the crweird instructions:
  75 the primary important aspect here is that a Vector of CR Field's EQ bits is
  76 transferred into a single GPR.  The secondary important aspect is that VL
  77 is being adjusted in each loop, testing successively more of the input
  78 vector against a given scalar, each time.
  79
  80 To investigate:
  81
  82 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
  83 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
  84
  85 ## iota
  86
  87 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  88
  89 When masked, only the bits not masked out are included in the count process.
  90
  91     viota RT/v, RA, RB
  92
  93 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  94
  95 Example
  96
  97      7 6 5 4 3 2 1 0   Element number
  98
  99      1 0 0 1 0 0 0 1   v2 contents
 100                        viota.m v4, v2 # Unmasked
 101      2 2 2 1 1 1 1 0   v4 result
 102
 103      1 1 1 0 1 0 1 1   v0 contents
 104      1 0 0 1 0 0 0 1   v2 contents
 105      2 3 4 5 6 7 8 9   v4 contents
 106                        viota.m v4, v2, v0.t # Masked
 107      1 1 1 5 1 7 1 0   v4 results
 108
 109      def iota(RT, RA, RB):
 110         mask = RB ? iregs[RB] : 0b111111...1
 111         val = RA ? iregs[RA] : 0b111111...1
 112         for i in range(VL):
 113             if RA.scalar:
 114             testmask = (1<<i)-1 # only count below
 115             to_test = val & testmask & mask
 116             iregs[RT+i] = popcount(to_test)
 117
 118 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
 119 where bit 2 is inv, bits 0:1 select the bit of the CR.
 120
 121      def test_CR_bit(CR, BO):
 122          return CR[BO[0:1]] == BO[2]
 123
 124      def iotacr(RT, BA, BO):
 125         mask = get_src_predicate()
 126         count = 0
 127         for i in range(VL):
 128             if mask & (1<<i) == 0:
 129                 count = 0 # reset back to zero
 130                 continue
 131             iregs[RT+i] = count
 132             if test_CR_bit(CR[i+BA], BO):
 133                  count += 1
 134
 135 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
 136
 137 scalar variant which can be Vectorised to give iotacr:
 138
 139      def crtaddi(RT, RA, BA, BO, D):
 140          if test_CR_bit(BA, BO):
 141              RT = RA + EXTS(D)
 142          else:
 143              RT = RA
 144
 145 a Vector for-loop with zero-ing on dest will give the
 146 mask-out effect of resetting the count back to zero.
 147 However close examination shows that the above may actually
 148 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
 149
 150 # Scalar
 151
 152 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
 153
 154 ## sbfm
 155
 156    sbfm RT, RA, RB!=0
 157
 158 Example
 159
 160      7 6 5 4 3 2 1 0   Bit index
 161
 162      1 0 0 1 0 1 0 0   v3 contents
 163                        vmsbf.m v2, v3
 164      0 0 0 0 0 0 1 1   v2 contents
 165
 166      1 0 0 1 0 1 0 1   v3 contents
 167                        vmsbf.m v2, v3
 168      0 0 0 0 0 0 0 0   v2
 169
 170      0 0 0 0 0 0 0 0   v3 contents
 171                        vmsbf.m v2, v3
 172      1 1 1 1 1 1 1 1   v2
 173
 174      1 1 0 0 0 0 1 1   RB vcontents
 175      1 0 0 1 0 1 0 0   v3 contents
 176                        vmsbf.m v2, v3, v0.t
 177      0 1 x x x x 1 1   v2 contents
 178
 179 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 180
 181 pseudocode:
 182
 183     def sbf(rd, rs1, rs2):
 184         rd = 0
 185         # start setting if no predicate or if 1st predicate bit set
 186         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 187         while i < XLEN:
 188             bit = 1<<i
 189             if rs2 != x0 and (regs[rs2] & bit):
 190                 # reset searching
 191                 setting_mode = False
 192             if setting_mode:
 193                 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
 194                     setting_mode = False
 195                 else:
 196                     regs[rd] |= bit
 197             else if rs2 != x0: # searching mode
 198                 if (regs[rs2] & bit):
 199                     setting_mode = True # back into "setting" mode
 200             i += 1
 201
 202 ## sifm
 203
 204 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 205
 206     sifm RT, RA, RB!=0
 207
 208  # Example
 209
 210      7 6 5 4 3 2 1 0   Bit number
 211
 212      1 0 0 1 0 1 0 0   v3 contents
 213                        vmsif.m v2, v3
 214      0 0 0 0 0 1 1 1   v2 contents
 215
 216      1 0 0 1 0 1 0 1   v3 contents
 217                        vmsif.m v2, v3
 218      0 0 0 0 0 0 0 1   v2
 219
 220      1 1 0 0 0 0 1 1   RB vcontents
 221      1 0 0 1 0 1 0 0   v3 contents
 222                        vmsif.m v2, v3, v0.t
 223      1 1 x x x x 1 1   v2 contents
 224
 225 Pseudo-code:
 226
 227     def sif(rd, rs1, rs2):
 228         rd = 0
 229         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 230
 231         while i < XLEN:
 232             bit = 1<<i
 233
 234             # only reenable when predicate in use, and bit valid
 235             if !setting_mode && rs2 != x0:
 236                 if (regs[rs2] & bit):
 237                     # back into "setting" mode
 238                     setting_mode = True
 239
 240             # skipping mode
 241             if !setting_mode:
 242                 # skip any more 1s
 243                 if regs[rs1] & bit == 1:
 244                     i += 1
 245                     continue
 246
 247             # setting mode, search for 1
 248             regs[rd] |= bit # always set during search
 249             if regs[rs1] & bit: # found a bit in rs1:
 250                 setting_mode = False
 251                 # next loop starts skipping
 252
 253             i += 1
 254
 255
 256 ## vmsof
 257
 258 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 259
 260     sofm RT, RA, RB
 261
 262 Example
 263
 264      7 6 5 4 3 2 1 0   Bit number
 265
 266      1 0 0 1 0 1 0 0   v3 contents
 267                        vmsof.m v2, v3
 268      0 0 0 0 0 1 0 0   v2 contents
 269
 270      1 0 0 1 0 1 0 1   v3 contents
 271                        vmsof.m v2, v3
 272      0 0 0 0 0 0 0 1   v2
 273
 274      1 1 0 0 0 0 1 1   RB vcontents
 275      1 1 0 1 0 1 0 0   v3 contents
 276                        vmsof.m v2, v3, v0.t
 277      0 1 x x x x 0 0   v2 content
 278
 279 Pseudo-code:
 280
 281     def sof(rd, rs1, rs2):
 282         rd = 0
 283         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 284
 285         while i < XLEN:
 286             bit = 1<<i
 287
 288             # only reenable when predicate in use, and bit valid
 289             if !setting_mode && rs2 != x0:
 290                 if (regs[rs2] & bit):
 291                     # back into "setting" mode
 292                     setting_mode = True
 293
 294             # skipping mode
 295             if !setting_mode:
 296                 # skip any more 1s
 297                 if regs[rs1] & bit == 1:
 298                     i += 1
 299                     continue
 300
 301             # setting mode, search for 1
 302             if regs[rs1] & bit: # found a bit in rs1:
 303                 regs[rd] |= bit # only set when search succeeds
 304                 setting_mode = False
 305                 # next loop starts skipping
 306
 307             i += 1
 308
 309 # Carry-lookahead
 310
 311 used not just for carry lookahead, also a special type of predication mask operation.
 312
 313 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 314 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 315 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 316 * <https://i.stack.imgur.com/QSLKY.png>
 317 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 318   `((P|G)+G)^P`
 319 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 320
 321 From QLSKY.png:
 322
 323 ```
 324     x0 = nand(CIn, P0)
 325     C0 = nand(x0, ~G0)
 326
 327     x1 = nand(CIn, P0, P1)
 328     y1 = nand(G0, P1)
 329     C1 = nand(x1, y1, ~G1)
 330
 331     x2 = nand(CIn, P0, P1, P2)
 332     y2 = nand(G0, P1, P2)
 333     z2 = nand(G1, P2)
 334     C1 = nand(x2, y2, z2, ~G2)
 335
 336     # Gen*
 337     x3 = nand(G0, P1, P2, P3)
 338     y3 = nand(G1, P2, P3)
 339     z3 = nand(G2, P3)
 340     G* = nand(x3, y3, z3, ~G3)
 341 ```
 342
 343 ```
 344      P = (A | B) & Ci
 345      G = (A & B)
 346 ```
 347
 348 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 349
 350 ```
 351     At each id, compute C[id] = A[id]+B[id]+0
 352     Get G[id] = C[id] > radix -1
 353     Get P[id] = C[id] == radix-1
 354     Join all P[id] together, likewise G[id]
 355     Compute newC = ((P|G)+G)^P
 356     result[id] = (C[id] + newC[id]) % radix
 357 ```
 358
 359 two versions: scalar int version and CR based version.
 360
 361 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 362
 363 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 364
 365 if zero (no propagation) then CR0.eq is zero
 366
 367 CR based version, TODO.