openpower/sv/vector_ops.mdwn

   1 # SV Vector Operations.
   2
   3 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example.  This section includes such examples.  Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
   4
   5 Notes:
   6
   7 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions.  These are separated out into their own section.
   8 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
   9 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
  10
  11 Links:
  12
  13 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
  14 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
  15 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  16 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  17  out of scope for this document
  18 * [[simple_v_extension/specification/bitmanip]] previous version,
  19   contains pseudocode for sof, sif, sbf
  20
  21 # Vector
  22
  23 ## conflictd
  24
  25 This is based on the AVX512 conflict detection instruction.  Internally the logic is used to detect address conflicts in multi-issue LD/ST operations.  Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion.  the instruction may be used for histograms (computed in parallel)
  26
  27     input = [100, 100,   3, 100,   5, 100, 100,   3]
  28     conflict result = [
  29          0b00000000,    // Note: first element always zero
  30          0b00000001,    // 100 is present on #0
  31          0b00000000,
  32          0b00000011,    // 100 is present on #0 and #1
  33          0b00000000,
  34          0b00001011,    // 100 is present on #0, #1, #3
  35          0b00011011,    // .. and #4
  36          0b00000100     // 3 is present on #2
  37     ]
  38
  39 Pseudocode:
  40
  41     for i in range(VL):
  42         for j in range(1, i):
  43             if src1[i] == src2[j]:
  44                 result[j] |= 1<<i
  45
  46 ## iota
  47
  48 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
  49
  50 When masked, only the bits not masked out are included in the count process.
  51
  52     viota RT/v, RA, RB
  53
  54 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
  55
  56 Example
  57
  58      7 6 5 4 3 2 1 0   Element number
  59
  60      1 0 0 1 0 0 0 1   v2 contents
  61                        viota.m v4, v2 # Unmasked
  62      2 2 2 1 1 1 1 0   v4 result
  63
  64      1 1 1 0 1 0 1 1   v0 contents
  65      1 0 0 1 0 0 0 1   v2 contents
  66      2 3 4 5 6 7 8 9   v4 contents
  67                        viota.m v4, v2, v0.t # Masked
  68      1 1 1 5 1 7 1 0   v4 results
  69
  70      def iota(RT, RA, RB):
  71         mask = RB ? iregs[RB] : 0b111111...1
  72         val = RA ? iregs[RA] : 0b111111...1
  73         for i in range(VL):
  74             if RA.scalar:
  75             testmask = (1<<i)-1 # only count below
  76             to_test = val & testmask & mask
  77             iregs[RT+i] = popcount(to_test)
  78
  79 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
  80 where bit 2 is inv, bits 0:1 select the bit of the CR.
  81
  82      def test_CR_bit(CR, BO):
  83          return CR[BO[0:1]] == BO[2]
  84
  85      def iotacr(RT, BA, BO):
  86         mask = get_src_predicate()
  87         count = 0
  88         for i in range(VL):
  89             if mask & (1<<i) == 0: continue
  90             iregs[RT+i] = count
  91             if test_CR_bit(CR[i+BA], BO):
  92                  count += 1
  93
  94 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
  95
  96 # Scalar
  97
  98 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
  99
 100 ## sbfm
 101
 102    sbfm RT, RA, RB!=0
 103
 104 Example
 105
 106      7 6 5 4 3 2 1 0   Bit index
 107
 108      1 0 0 1 0 1 0 0   v3 contents
 109                        vmsbf.m v2, v3
 110      0 0 0 0 0 0 1 1   v2 contents
 111
 112      1 0 0 1 0 1 0 1   v3 contents
 113                        vmsbf.m v2, v3
 114      0 0 0 0 0 0 0 0   v2
 115
 116      0 0 0 0 0 0 0 0   v3 contents
 117                        vmsbf.m v2, v3
 118      1 1 1 1 1 1 1 1   v2
 119
 120      1 1 0 0 0 0 1 1   RB vcontents
 121      1 0 0 1 0 1 0 0   v3 contents
 122                        vmsbf.m v2, v3, v0.t
 123      0 1 x x x x 1 1   v2 contents
 124
 125 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
 126
 127 pseudocode:
 128
 129     def sbf(rd, rs1, rs2):
 130         rd = 0
 131         # start setting if no predicate or if 1st predicate bit set
 132         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 133         while i < XLEN:
 134             bit = 1<<i
 135             if rs2 != x0 and (regs[rs2] & bit):
 136                 # reset searching
 137                 setting_mode = False
 138             if setting_mode:
 139                 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
 140                     setting_mode = False
 141                 else:
 142                     regs[rd] |= bit
 143             else if rs2 != x0: # searching mode
 144                 if (regs[rs2] & bit):
 145                     setting_mode = True # back into "setting" mode
 146             i += 1
 147
 148 ## sifm
 149
 150 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
 151
 152     sifm RT, RA, RB!=0
 153
 154  # Example
 155
 156      7 6 5 4 3 2 1 0   Bit number
 157
 158      1 0 0 1 0 1 0 0   v3 contents
 159                        vmsif.m v2, v3
 160      0 0 0 0 0 1 1 1   v2 contents
 161
 162      1 0 0 1 0 1 0 1   v3 contents
 163                        vmsif.m v2, v3
 164      0 0 0 0 0 0 0 1   v2
 165
 166      1 1 0 0 0 0 1 1   RB vcontents
 167      1 0 0 1 0 1 0 0   v3 contents
 168                        vmsif.m v2, v3, v0.t
 169      1 1 x x x x 1 1   v2 contents
 170
 171 Pseudo-code:
 172
 173     def sif(rd, rs1, rs2):
 174         rd = 0
 175         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 176
 177         while i < XLEN:
 178             bit = 1<<i
 179
 180             # only reenable when predicate in use, and bit valid
 181             if !setting_mode && rs2 != x0:
 182                 if (regs[rs2] & bit):
 183                     # back into "setting" mode
 184                     setting_mode = True
 185
 186             # skipping mode
 187             if !setting_mode:
 188                 # skip any more 1s
 189                 if regs[rs1] & bit == 1:
 190                     i += 1
 191                     continue
 192
 193             # setting mode, search for 1
 194             regs[rd] |= bit # always set during search
 195             if regs[rs1] & bit: # found a bit in rs1:
 196                 setting_mode = False
 197                 # next loop starts skipping
 198
 199             i += 1
 200
 201
 202 ## vmsof
 203
 204 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 205
 206     sofm RT, RA, RB
 207
 208 Example
 209
 210      7 6 5 4 3 2 1 0   Bit number
 211
 212      1 0 0 1 0 1 0 0   v3 contents
 213                        vmsof.m v2, v3
 214      0 0 0 0 0 1 0 0   v2 contents
 215
 216      1 0 0 1 0 1 0 1   v3 contents
 217                        vmsof.m v2, v3
 218      0 0 0 0 0 0 0 1   v2
 219
 220      1 1 0 0 0 0 1 1   RB vcontents
 221      1 1 0 1 0 1 0 0   v3 contents
 222                        vmsof.m v2, v3, v0.t
 223      0 1 x x x x 0 0   v2 content
 224
 225 Pseudo-code:
 226
 227     def sof(rd, rs1, rs2):
 228         rd = 0
 229         setting_mode = rs2 == x0 or (regs[rs2] & 1)
 230
 231         while i < XLEN:
 232             bit = 1<<i
 233
 234             # only reenable when predicate in use, and bit valid
 235             if !setting_mode && rs2 != x0:
 236                 if (regs[rs2] & bit):
 237                     # back into "setting" mode
 238                     setting_mode = True
 239
 240             # skipping mode
 241             if !setting_mode:
 242                 # skip any more 1s
 243                 if regs[rs1] & bit == 1:
 244                     i += 1
 245                     continue
 246
 247             # setting mode, search for 1
 248             if regs[rs1] & bit: # found a bit in rs1:
 249                 regs[rd] |= bit # only set when search succeeds
 250                 setting_mode = False
 251                 # next loop starts skipping
 252
 253             i += 1
 254
 255 # Carry-lookahead
 256
 257 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 258 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>