openpower/sv/vector_ops.mdwn

   1 [[!tag standards]]
   2
   3 # SV Vector Operations.
   4
   5 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
   6
   7 Notes:
   8
   9 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
  10 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
  11 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  12
  13 Links:
  14
  15 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
  16 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
  17 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  18 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  19  out of scope for this document
  20 * [[simple_v_extension/specification/bitmanip]] previous version,
  21   contains pseudocode for sof, sif, sbf
  22
  23 # Vector
  24
  25 ## conflictd
  26
  27 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  28
  29     input = [100, 100,   3, 100,   5, 100, 100,   3]
  30     conflict result = [
  31          0b00000000,    // Note: first element always zero
  32          0b00000001,    // 100 is present on #0
  33          0b00000000,
  34          0b00000011,    // 100 is present on #0 and #1
  35          0b00000000,
  36          0b00001011,    // 100 is present on #0, #1, #3
  37          0b00011011,    // .. and #4
  38          0b00000100     // 3 is present on #2
  39     ]
  40
  41 Pseudocode:
  42
  43     for i in range(VL):
  44         for j in range(1, i):
  45             if src1[i] == src2[j]:
  46                 result[j] |= 1<<i
  47
  48 ## iota
  49
  50 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  51
  52 When masked, only the bits not masked out are included in the count process.
  53
  54     viota RT/v, RA, RB
  55
  56 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  57
  58 Example
  59
  60      7 6 5 4 3 2 1 0   Element number
  61
  62      1 0 0 1 0 0 0 1   v2 contents
  63                        viota.m v4, v2 # Unmasked
  64      2 2 2 1 1 1 1 0   v4 result
  65
  66      1 1 1 0 1 0 1 1   v0 contents
  67      1 0 0 1 0 0 0 1   v2 contents
  68      2 3 4 5 6 7 8 9   v4 contents
  69                        viota.m v4, v2, v0.t # Masked
  70      1 1 1 5 1 7 1 0   v4 results
  71
  72      def iota(RT, RA, RB):
  73         mask = RB ? iregs[RB] : 0b111111...1
  74         val = RA ? iregs[RA] : 0b111111...1
  75         for i in range(VL):
  76             if RA.scalar:
  77             testmask = (1<<i)-1 # only count below
  78             to_test = val & testmask & mask
  79             iregs[RT+i] = popcount(to_test)
  80
  81 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
  82 where bit 2 is inv, bits 0:1 select the bit of the CR.
  83
  84      def test_CR_bit(CR, BO):
  85          return CR[BO[0:1]] == BO[2]
  86
  87      def iotacr(RT, BA, BO):
  88         mask = get_src_predicate()
  89         count = 0
  90         for i in range(VL):
  91             if mask & (1<<i) == 0: continue
  92             iregs[RT+i] = count
  93             if test_CR_bit(CR[i+BA], BO):
  94                  count += 1
  95
  96 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
  97
  98 # Scalar
  99
 100 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
 101
 102 ## sbfm
 103
 104    sbfm RT, RA, RB!=0
 105
 106 Example
 107
 108      7 6 5 4 3 2 1 0   Bit index
 109
 110      1 0 0 1 0 1 0 0   v3 contents
 111                        vmsbf.m v2, v3
 112      0 0 0 0 0 0 1 1   v2 contents
 113
 114      1 0 0 1 0 1 0 1   v3 contents
 115                        vmsbf.m v2, v3
 116      0 0 0 0 0 0 0 0   v2
 117
 118      0 0 0 0 0 0 0 0   v3 contents
 119                        vmsbf.m v2, v3
 120      1 1 1 1 1 1 1 1   v2
 121
 122      1 1 0 0 0 0 1 1   RB vcontents
 123      1 0 0 1 0 1 0 0   v3 contents
 124                        vmsbf.m v2, v3, v0.t
 125      0 1 x x x x 1 1   v2 contents
 126
 127 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 128
 129 pseudocode:
 130
 131     def sbf(rd, rs1, rs2):
 132         rd = 0
 133         # start setting if no predicate or if 1st predicate bit set
 134         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 135         while i < XLEN:
 136             bit = 1<<i
 137             if rs2 != x0 and (regs[rs2] & bit):
 138                 # reset searching
 139                 setting_mode = False
 140             if setting_mode:
 141                 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
 142                     setting_mode = False
 143                 else:
 144                     regs[rd] |= bit
 145             else if rs2 != x0: # searching mode
 146                 if (regs[rs2] & bit):
 147                     setting_mode = True # back into "setting" mode
 148             i += 1
 149
 150 ## sifm
 151
 152 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 153
 154     sifm RT, RA, RB!=0
 155
 156  # Example
 157
 158      7 6 5 4 3 2 1 0   Bit number
 159
 160      1 0 0 1 0 1 0 0   v3 contents
 161                        vmsif.m v2, v3
 162      0 0 0 0 0 1 1 1   v2 contents
 163
 164      1 0 0 1 0 1 0 1   v3 contents
 165                        vmsif.m v2, v3
 166      0 0 0 0 0 0 0 1   v2
 167
 168      1 1 0 0 0 0 1 1   RB vcontents
 169      1 0 0 1 0 1 0 0   v3 contents
 170                        vmsif.m v2, v3, v0.t
 171      1 1 x x x x 1 1   v2 contents
 172
 173 Pseudo-code:
 174
 175     def sif(rd, rs1, rs2):
 176         rd = 0
 177         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 178
 179         while i < XLEN:
 180             bit = 1<<i
 181
 182             # only reenable when predicate in use, and bit valid
 183             if !setting_mode && rs2 != x0:
 184                 if (regs[rs2] & bit):
 185                     # back into "setting" mode
 186                     setting_mode = True
 187
 188             # skipping mode
 189             if !setting_mode:
 190                 # skip any more 1s
 191                 if regs[rs1] & bit == 1:
 192                     i += 1
 193                     continue
 194
 195             # setting mode, search for 1
 196             regs[rd] |= bit # always set during search
 197             if regs[rs1] & bit: # found a bit in rs1:
 198                 setting_mode = False
 199                 # next loop starts skipping
 200
 201             i += 1
 202
 203
 204 ## vmsof
 205
 206 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 207
 208     sofm RT, RA, RB
 209
 210 Example
 211
 212      7 6 5 4 3 2 1 0   Bit number
 213
 214      1 0 0 1 0 1 0 0   v3 contents
 215                        vmsof.m v2, v3
 216      0 0 0 0 0 1 0 0   v2 contents
 217
 218      1 0 0 1 0 1 0 1   v3 contents
 219                        vmsof.m v2, v3
 220      0 0 0 0 0 0 0 1   v2
 221
 222      1 1 0 0 0 0 1 1   RB vcontents
 223      1 1 0 1 0 1 0 0   v3 contents
 224                        vmsof.m v2, v3, v0.t
 225      0 1 x x x x 0 0   v2 content
 226
 227 Pseudo-code:
 228
 229     def sof(rd, rs1, rs2):
 230         rd = 0
 231         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 232
 233         while i < XLEN:
 234             bit = 1<<i
 235
 236             # only reenable when predicate in use, and bit valid
 237             if !setting_mode && rs2 != x0:
 238                 if (regs[rs2] & bit):
 239                     # back into "setting" mode
 240                     setting_mode = True
 241
 242             # skipping mode
 243             if !setting_mode:
 244                 # skip any more 1s
 245                 if regs[rs1] & bit == 1:
 246                     i += 1
 247                     continue
 248
 249             # setting mode, search for 1
 250             if regs[rs1] & bit: # found a bit in rs1:
 251                 regs[rd] |= bit # only set when search succeeds
 252                 setting_mode = False
 253                 # next loop starts skipping
 254
 255             i += 1
 256
 257 # Carry-lookahead
 258
 259 used not just for carry lookahead, also a special type of predication mask operation.
 260
 261 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 262 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 263 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 264 * <https://i.stack.imgur.com/QSLKY.png>
 265 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 266   `((P|G)+G)^P`
 267 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 268
 269      P = (A | B) & Ci
 270      G = (A & B)
 271
 272 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 273
 274
 275
 276 two versions: scalar int version and CR based version.
 277
 278 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 279
 280 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 281
 282 if zero (no propagation) then CR0.eq is zero
 283
 284 CR based version, TODO.