openpower/sv/vector_ops/discussion.mdwn

   1 [[!tag standards]]
   2
   3 # sof/sif/sbfm etc.
   4
   5 These all it turns out can be done as bitmanip of the form
   6 x/~x &/|/^ (x / -x / x+1 / x-1) so are being superceded.
   7 needs some work though
   8
   9 ```
  10     if _RB = 0 then mask <- [1] * XLEN else mask = (RB)
  11     a1 <- (RA) & mask
  12     if mode[1] then a1 <- ¬ra
  13     mode2 <- mode[2:3]
  14     if mode2 = 0 then a2 <- (¬ra)+1
  15     if mode2 = 1 then a2 <- ra-1
  16     if mode2 = 2 then a2 <- ra+1
  17     if mode2 = 3 then a2 <- ¬(ra+1)
  18     a1 <- a1 & mask
  19     a2 <- a2 & mask
  20     # select operator
  21     mode3 <- mode[3:4]
  22     if mode3 = 0 then result <- a1 | a2
  23     if mode3 = 1 then result <- a1 & a2
  24     if mode3 = 2 then result <- a1 ^ a2
  25     if mode3 = 3 then result <- UNDEFINED
  26     result <- result & mask
  27     # optionally restore masked-out bits
  28     if L = 1 then
  29         result <- result | (RA & ¬mask)
  30     RT <- result
  31
  32 SBF = 0b01010 # set before first
  33 SOF = 0b01001 # set only first
  34 SIF = 0b10000 # set including first 10011 also works no idea why yet
  35 ```
  36
  37
  38 ## OBSOLETE sbfm
  39
  40    sbfm RT, RA, RB!=0
  41
  42 Example
  43
  44      7 6 5 4 3 2 1 0   Bit index
  45
  46      1 0 0 1 0 1 0 0   v3 contents
  47                        vmsbf.m v2, v3
  48      0 0 0 0 0 0 1 1   v2 contents
  49
  50      1 0 0 1 0 1 0 1   v3 contents
  51                        vmsbf.m v2, v3
  52      0 0 0 0 0 0 0 0   v2
  53
  54      0 0 0 0 0 0 0 0   v3 contents
  55                        vmsbf.m v2, v3
  56      1 1 1 1 1 1 1 1   v2
  57
  58      1 1 0 0 0 0 1 1   RB vcontents
  59      1 0 0 1 0 1 0 0   v3 contents
  60                        vmsbf.m v2, v3, v0.t
  61      0 1 x x x x 1 1   v2 contents
  62
  63 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
  64
  65 Executable pseudocode demo:
  66
  67 ```
  68 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
  69 ```
  70
  71 ## OBSOLETE sifm
  72
  73 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
  74
  75     sifm RT, RA, RB!=0
  76
  77  # Example
  78
  79      7 6 5 4 3 2 1 0   Bit number
  80
  81      1 0 0 1 0 1 0 0   v3 contents
  82                        vmsif.m v2, v3
  83      0 0 0 0 0 1 1 1   v2 contents
  84
  85      1 0 0 1 0 1 0 1   v3 contents
  86                        vmsif.m v2, v3
  87      0 0 0 0 0 0 0 1   v2
  88
  89      1 1 0 0 0 0 1 1   RB vcontents
  90      1 0 0 1 0 1 0 0   v3 contents
  91                        vmsif.m v2, v3, v0.t
  92      1 1 x x x x 1 1   v2 contents
  93
  94 Executable pseudocode demo:
  95
  96 ```
  97 [[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
  98 ```
  99
 100 ## OBSOLETE vmsof
 101
 102 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
 103
 104     sofm RT, RA, RB
 105
 106 Example
 107
 108      7 6 5 4 3 2 1 0   Bit number
 109
 110      1 0 0 1 0 1 0 0   v3 contents
 111                        vmsof.m v2, v3
 112      0 0 0 0 0 1 0 0   v2 contents
 113
 114      1 0 0 1 0 1 0 1   v3 contents
 115                        vmsof.m v2, v3
 116      0 0 0 0 0 0 0 1   v2
 117
 118      1 1 0 0 0 0 1 1   RB vcontents
 119      1 1 0 1 0 1 0 0   v3 contents
 120                        vmsof.m v2, v3, v0.t
 121      0 1 x x x x 0 0   v2 content
 122
 123 Executable pseudocode demo:
 124
 125 ```
 126 [[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
 127 ```
 128
 129
 130 # SV Vector Operations not added
 131
 132 Links:
 133
 134 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
 135 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
 136 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
 137 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
 138
 139 Both of these instructions may be synthesised from SVP64 Vector
 140 instructions.  conflictd is an O(N^2) instruction based on
 141 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
 142 with the appropriate predication
 143
 144 # conflictd
 145
 146 moved to [[sv/cookbook/conflictd]]
 147
 148
 149
 150 # iota
 151
 152 Based on RVV vmiota.  vmiota may be viewed as a cumulative variant of popcount, generating multiple results.  successive iterations include more and more bits of the bitstream being tested.
 153
 154 When masked, only the bits not masked out are included in the count process.
 155
 156     viota RT/v, RA, RB
 157
 158 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
 159
 160 Example
 161
 162      7 6 5 4 3 2 1 0   Element number
 163
 164      1 0 0 1 0 0 0 1   v2 contents
 165                        viota.m v4, v2 # Unmasked
 166      2 2 2 1 1 1 1 0   v4 result
 167
 168      1 1 1 0 1 0 1 1   v0 contents
 169      1 0 0 1 0 0 0 1   v2 contents
 170      2 3 4 5 6 7 8 9   v4 contents
 171                        viota.m v4, v2, v0.t # Masked
 172      1 1 1 5 1 7 1 0   v4 results
 173
 174      def iota(RT, RA, RB):
 175         mask = RB ? iregs[RB] : 0b111111...1
 176         val = RA ? iregs[RA] : 0b111111...1
 177         for i in range(VL):
 178             if RA.scalar:
 179             testmask = (1<<i)-1 # only count below
 180             to_test = val & testmask & mask
 181             iregs[RT+i] = popcount(to_test)
 182
 183 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
 184 where bit 2 is inv, bits 0:1 select the bit of the CR.
 185
 186      def test_CR_bit(CR, BO):
 187          return CR[BO[0:1]] == BO[2]
 188
 189      def iotacr(RT, BA, BO):
 190         mask = get_src_predicate()
 191         count = 0
 192         for i in range(VL):
 193             if mask & (1<<i) == 0:
 194                 count = 0 # reset back to zero
 195                 continue
 196             iregs[RT+i] = count
 197             if test_CR_bit(CR[i+BA], BO):
 198                  count += 1
 199
 200 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway.  The integer version covers it, by not reading the int regfile at all.
 201
 202 scalar variant which can be Vectorized to give iotacr:
 203
 204      def crtaddi(RT, RA, BA, BO, D):
 205          if test_CR_bit(BA, BO):
 206              RT = RA + EXTS(D)
 207          else:
 208              RT = RA
 209
 210 a Vector for-loop with zero-ing on dest will give the
 211 mask-out effect of resetting the count back to zero.
 212 However close examination shows that the above may actually
 213 be `sv.addi/mr/sm=EQ/sz *r1, *r0, 1` TODO check this
 214
 215 this looks promising:
 216
 217     sv.add *v4+1, *v4, *v2
 218
 219 where v2 is guaranteed to contain 0 or 1, the dependency chain
 220 accumulates v2 via the overlap between src1 and dest vectors.
 221
 222 # cprop
 223
 224 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
 225 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
 226 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
 227 * <https://i.stack.imgur.com/QSLKY.png>
 228 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
 229   `((P|G)+G)^P`
 230 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
 231
 232 From QLSKY.png:
 233
 234 ```
 235     x0 = nand(CIn, P0)
 236     C0 = nand(x0, ~G0)
 237
 238     x1 = nand(CIn, P0, P1)
 239     y1 = nand(G0, P1)
 240     C1 = nand(x1, y1, ~G1)
 241
 242     x2 = nand(CIn, P0, P1, P2)
 243     y2 = nand(G0, P1, P2)
 244     z2 = nand(G1, P2)
 245     C1 = nand(x2, y2, z2, ~G2)
 246
 247     # Gen*
 248     x3 = nand(G0, P1, P2, P3)
 249     y3 = nand(G1, P2, P3)
 250     z3 = nand(G2, P3)
 251     G* = nand(x3, y3, z3, ~G3)
 252 ```
 253
 254 ```
 255      P = (A | B) & Ci
 256      G = (A & B)
 257 ```
 258
 259 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
 260
 261 ```
 262     At each id, compute C[id] = A[id]+B[id]+0
 263     Get G[id] = C[id] > radix -1
 264     Get P[id] = C[id] == radix-1
 265     Join all P[id] together, likewise G[id]
 266     Compute newC = ((P|G)+G)^P
 267     result[id] = (C[id] + newC[id]) % radix
 268 ```
 269
 270 two versions: scalar int version and CR based version.
 271
 272 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
 273
 274 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
 275
 276 if zero (no propagation) then CR0.eq is zero
 277
 278 CR based version, TODO.