openpower/sv/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # SV Vector Operations.
   4
   5 TODO merge old standards page [[simple_v_extension/vector_ops/]]
   6
   7 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
   8
   9 Notes:
  10
  11 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
  12 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  13 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  14
  15 Links:
  16
  17 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
  18 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
  19 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  20 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  21  out of scope for this document
  22 * [[simple_v_extension/specification/bitmanip]] previous version,
  23   contains pseudocode for sof, sif, sbf
  24
  25 # Vector
  26
  27 ## conflictd
  28
  29 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  30
  31     input = [100, 100,   3, 100,   5, 100, 100,   3]
  32     conflict result = [
  33          0b00000000,    // Note: first element always zero
  34          0b00000001,    // 100 is present on #0
  35          0b00000000,
  36          0b00000011,    // 100 is present on #0 and #1
  37          0b00000000,
  38          0b00001011,    // 100 is present on #0, #1, #3
  39          0b00011011,    // .. and #4
  40          0b00000100     // 3 is present on #2
  41     ]
  42
  43 Pseudocode:
  44
  45     for i in range(VL):
  46         for j in range(1, i):
  47             if src1[i] == src2[j]:
  48                 result[j] |= 1<<i
  49
  50 ## iota
  51
  52 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  53
  54 When masked, only the bits not masked out are included in the count process.
  55
  56     viota RT/v, RA, RB
  57
  58 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  59
  60 Example
  61
  62      7 6 5 4 3 2 1 0   Element number
  63
  64      1 0 0 1 0 0 0 1   v2 contents
  65                        viota.m v4, v2 # Unmasked
  66      2 2 2 1 1 1 1 0   v4 result
  67
  68      1 1 1 0 1 0 1 1   v0 contents
  69      1 0 0 1 0 0 0 1   v2 contents
  70      2 3 4 5 6 7 8 9   v4 contents
  71                        viota.m v4, v2, v0.t # Masked
  72      1 1 1 5 1 7 1 0   v4 results
  73
  74      def iota(RT, RA, RB):
  75         mask = RB ? iregs[RB] : 0b111111...1
  76         val = RA ? iregs[RA] : 0b111111...1
  77         for i in range(VL):
  78             if RA.scalar:
  79             testmask = (1<<i)-1 # only count below
  80             to_test = val & testmask & mask
  81             iregs[RT+i] = popcount(to_test)
  82
  83 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
  84 where bit 2 is inv, bits 0:1 select the bit of the CR.
  85
  86      def test_CR_bit(CR, BO):
  87          return CR[BO[0:1]] == BO[2]
  88
  89      def iotacr(RT, BA, BO):
  90         mask = get_src_predicate()
  91         count = 0
  92         for i in range(VL):
  93             if mask & (1<<i) == 0: continue
  94             iregs[RT+i] = count
  95             if test_CR_bit(CR[i+BA], BO):
  96                  count += 1
  97
  98 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
  99
 100 # Scalar
 101
 102 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
 103
 104 ## sbfm
 105
 106    sbfm RT, RA, RB!=0
 107
 108 Example
 109
 110      7 6 5 4 3 2 1 0   Bit index
 111
 112      1 0 0 1 0 1 0 0   v3 contents
 113                        vmsbf.m v2, v3
 114      0 0 0 0 0 0 1 1   v2 contents
 115
 116      1 0 0 1 0 1 0 1   v3 contents
 117                        vmsbf.m v2, v3
 118      0 0 0 0 0 0 0 0   v2
 119
 120      0 0 0 0 0 0 0 0   v3 contents
 121                        vmsbf.m v2, v3
 122      1 1 1 1 1 1 1 1   v2
 123
 124      1 1 0 0 0 0 1 1   RB vcontents
 125      1 0 0 1 0 1 0 0   v3 contents
 126                        vmsbf.m v2, v3, v0.t
 127      0 1 x x x x 1 1   v2 contents
 128
 129 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 130
 131 pseudocode:
 132
 133     def sbf(rd, rs1, rs2):
 134         rd = 0
 135         # start setting if no predicate or if 1st predicate bit set
 136         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 137         while i < XLEN:
 138             bit = 1<<i
 139             if rs2 != x0 and (regs[rs2] & bit):
 140                 # reset searching
 141                 setting_mode = False
 142             if setting_mode:
 143                 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
 144                     setting_mode = False
 145                 else:
 146                     regs[rd] |= bit
 147             else if rs2 != x0: # searching mode
 148                 if (regs[rs2] & bit):
 149                     setting_mode = True # back into "setting" mode
 150             i += 1
 151
 152 ## sifm
 153
 154 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 155
 156     sifm RT, RA, RB!=0
 157
 158  # Example
 159
 160      7 6 5 4 3 2 1 0   Bit number
 161
 162      1 0 0 1 0 1 0 0   v3 contents
 163                        vmsif.m v2, v3
 164      0 0 0 0 0 1 1 1   v2 contents
 165
 166      1 0 0 1 0 1 0 1   v3 contents
 167                        vmsif.m v2, v3
 168      0 0 0 0 0 0 0 1   v2
 169
 170      1 1 0 0 0 0 1 1   RB vcontents
 171      1 0 0 1 0 1 0 0   v3 contents
 172                        vmsif.m v2, v3, v0.t
 173      1 1 x x x x 1 1   v2 contents
 174
 175 Pseudo-code:
 176
 177     def sif(rd, rs1, rs2):
 178         rd = 0
 179         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 180
 181         while i < XLEN:
 182             bit = 1<<i
 183
 184             # only reenable when predicate in use, and bit valid
 185             if !setting_mode && rs2 != x0:
 186                 if (regs[rs2] & bit):
 187                     # back into "setting" mode
 188                     setting_mode = True
 189
 190             # skipping mode
 191             if !setting_mode:
 192                 # skip any more 1s
 193                 if regs[rs1] & bit == 1:
 194                     i += 1
 195                     continue
 196
 197             # setting mode, search for 1
 198             regs[rd] |= bit # always set during search
 199             if regs[rs1] & bit: # found a bit in rs1:
 200                 setting_mode = False
 201                 # next loop starts skipping
 202
 203             i += 1
 204
 205
 206 ## vmsof
 207
 208 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 209
 210     sofm RT, RA, RB
 211
 212 Example
 213
 214      7 6 5 4 3 2 1 0   Bit number
 215
 216      1 0 0 1 0 1 0 0   v3 contents
 217                        vmsof.m v2, v3
 218      0 0 0 0 0 1 0 0   v2 contents
 219
 220      1 0 0 1 0 1 0 1   v3 contents
 221                        vmsof.m v2, v3
 222      0 0 0 0 0 0 0 1   v2
 223
 224      1 1 0 0 0 0 1 1   RB vcontents
 225      1 1 0 1 0 1 0 0   v3 contents
 226                        vmsof.m v2, v3, v0.t
 227      0 1 x x x x 0 0   v2 content
 228
 229 Pseudo-code:
 230
 231     def sof(rd, rs1, rs2):
 232         rd = 0
 233         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 234
 235         while i < XLEN:
 236             bit = 1<<i
 237
 238             # only reenable when predicate in use, and bit valid
 239             if !setting_mode && rs2 != x0:
 240                 if (regs[rs2] & bit):
 241                     # back into "setting" mode
 242                     setting_mode = True
 243
 244             # skipping mode
 245             if !setting_mode:
 246                 # skip any more 1s
 247                 if regs[rs1] & bit == 1:
 248                     i += 1
 249                     continue
 250
 251             # setting mode, search for 1
 252             if regs[rs1] & bit: # found a bit in rs1:
 253                 regs[rd] |= bit # only set when search succeeds
 254                 setting_mode = False
 255                 # next loop starts skipping
 256
 257             i += 1
 258
 259 # Carry-lookahead
 260
 261 used not just for carry lookahead, also a special type of predication mask operation.
 262
 263 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 264 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 265 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 266 * <https://i.stack.imgur.com/QSLKY.png>
 267 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 268   `((P|G)+G)^P`
 269 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 270
 271 From QLSKY.png:
 272
 273 ```
 274     x0 = nand(CIn, P0)
 275     C0 = nand(x0, ~G0)
 276
 277     x1 = nand(CIn, P0, P1)
 278     y1 = nand(G0, P1)
 279     C1 = nand(x1, y1, ~G1)
 280
 281     x2 = nand(CIn, P0, P1, P2)
 282     y2 = nand(G0, P1, P2)
 283     z2 = nand(G1, P2)
 284     C1 = nand(x2, y2, z2, ~G2)
 285
 286     # Gen*
 287     x3 = nand(G0, P1, P2, P3)
 288     y3 = nand(G1, P2, P3)
 289     z3 = nand(G2, P3)
 290     G* = nand(x3, y3, z3, ~G3)
 291 ```
 292
 293 ```
 294      P = (A | B) & Ci
 295      G = (A & B)
 296 ```
 297
 298 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 299
 300 ```
 301     At each id, compute C[id] = A[id]+B[id]+0
 302     Get G[id] = C[id] > radix -1
 303     Get P[id] = C[id] == radix-1
 304     Join all P[id] together, likewise G[id]
 305     Compute newC = ((P|G)+G)^P
 306     result[id] = (C[id] + newC[id]) % radix
 307 ```
 308
 309 two versions: scalar int version and CR based version.
 310
 311 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 312
 313 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 314
 315 if zero (no propagation) then CR0.eq is zero
 316
 317 CR based version, TODO.