openpower/sv/vector_ops.mdwn

[[!tag standards]]

# SV Vector Operations.

Links:

* [[discussion]]
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=865> implementation in simulator
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  out of scope for this document [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf
* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>

The core Power ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism.
Therefore there are not that many cases where *actual* Vector
instructions are needed. If they are, they are more "assistance"
functions.  Two traditional Vector instructions were initially
considered (conflictd and vmiota), however they may be synthesised
from existing SVP64 instructions: vmiota may use [[svstep]].
Details are in [[discussion]].

Notes:

* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU).
* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].

# Mask-suited Bitmanipulation

This is a single instruction covering 24 individual instructions in
other ISAs.  It is based on RVV masked set-before-first, set-after-first
etc. and on the Intel and AMD Bitmanip instructions, generalised and then
advanced further to include masks.
*(sbf/sof/sif moved to [[discussion]])*

BM2-Form

|0..5  |6..10|11..15|16..20|21..25|26|27..31| Form |
|------|-----|------|------|------|--|------|------|
| PO   |  RS |   RA |   RB |bm    |L |   XO | BM2-Form |

* bmask RS,RA,RB,bm,L

The patterns within the pseudocode for AMD TBM and x86 BMI1 are
as follows:

* first pattern A: `x / ~x`
* second pattern B: `| / & / ^`
* third pattern C: `x+1 / x-1 / ~(x+1) / (~x)+1`

Thus it makes sense to create a single instruction
that covers all of these.  A crucial addition, essential
for Scalable Vector usage as Predicate Masks, is the second mask parameter
(RB).  The additional parameter, L, if set, will leave bits of RA masked
by RB unaltered, otherwise those bits are set to zero. Note that when `RB=0`
then instead of reading from the register file the mask is set to all ones.

Executable pseudocode demo:

```
[[!inline pages="openpower/sv/bmask.py" quick="yes" raw="yes" ]]
```
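
The three patterns multiply out to 2 x 3 x 4 = 24 operations. The following is a minimal Python sketch of that combination, separate from the inlined `bmask.py` and not a substitute for it. The selector names (`a_invert`, `b_op`, `c_op`), the use of `rb=None` to stand for `RB=0`, the 64-bit width, and the way the mask is applied (computed bits kept only where the mask bit is 1) are all assumptions made for illustration; the real `bm` field encoding is not reproduced here.

```python
XLEN = 64                      # assumed register width
ALL_ONES = (1 << XLEN) - 1

# Pattern C: hypothetical selector strings, not the real bm encoding.
C_OPS = {
    "x+1":    lambda x: x + 1,
    "x-1":    lambda x: x - 1,
    "~(x+1)": lambda x: ~(x + 1),
    "(~x)+1": lambda x: (~x) + 1,
}

# Pattern B: the operator joining pattern A and pattern C.
B_OPS = {
    "|": lambda a, c: a | c,
    "&": lambda a, c: a & c,
    "^": lambda a, c: a ^ c,
}

def bmask_sketch(ra, a_invert, b_op, c_op, rb=None, L=False):
    """Combine patterns A, B and C, then apply the mask (assumed semantics)."""
    # rb=None models "RB=0": the mask is all ones instead of a register read.
    mask = ALL_ONES if rb is None else rb & ALL_ONES
    a = (~ra if a_invert else ra) & ALL_ONES      # pattern A: x or ~x
    c = C_OPS[c_op](ra) & ALL_ONES                # pattern C
    computed = B_OPS[b_op](a, c) & mask           # pattern B, masked
    # L set: bits outside the mask keep their RA value; otherwise they are zero.
    untouched = (ra & ~mask & ALL_ONES) if L else 0
    return computed | untouched

x = 0b10110100
print(bin(bmask_sketch(x, False, "&", "x-1")))     # x & (x-1), as x86 BLSR
print(bin(bmask_sketch(x, False, "&", "(~x)+1")))  # x & -x, as x86 BLSI
print(bin(bmask_sketch(x, False, "^", "x-1")))     # x ^ (x-1), as x86 BLSMSK
```

With these selectors, `x & (x-1)` clears the lowest set bit, `x & ((~x)+1)` isolates it, and `x ^ (x-1)` produces a mask up to and including it, which is the behaviour of the corresponding BMI1 instructions.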

# Carry-lookahead

As a single scalar 32-bit instruction, up to 64 carry-propagation bits
may be computed.  When the output is then used as a Predicate mask it can
be used to selectively perform the "add carry" of biginteger math, with
`sv.addi/sm=rN RT.v, RA.v, 1`.

* cprop RT,RA,RB
* cprop. RT,RA,RB

Pseudocode:

    P = (RA)
    G = (RB)
    RT = ((P|G)+G)^P
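
A minimal Python sketch of the pseudocode above, assuming 64-bit registers and that the result is truncated to the register width. The function name `cprop` and the `XLEN` constant are illustrative, and the Rc=1 behaviour of `cprop.` is not modelled.

```python
XLEN = 64  # assumed register width

def cprop(ra, rb):
    """RT = ((P|G)+G)^P with P=(RA), G=(RB), truncated to XLEN bits."""
    mask = (1 << XLEN) - 1
    p = ra & mask          # propagate bits, one per element
    g = rb & mask          # generate bits, one per element
    return (((p | g) + g) ^ p) & mask

# Element 0 generates a carry, elements 1 and 2 propagate it,
# so the carry reaches elements 1, 2 and 3.
print(bin(cprop(0b0110, 0b0001)))   # 0b1110
```

Bit `i` of the result is the carry into element `i`, which is why the output can be used directly as a predicate mask for the "add 1" step.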

X-Form

| 0..5 | 6..10 | 11..15 | 16..20 | 21..30     | 31 | name      |  Form   |
| ---- | ----- | ------ | ------ | ---------- | -- | --------- | ------- |
| NN   | RT    | RA     | RB     | 0110001110 | Rc | cprop     | X-Form  |

cprop is used not just for carry lookahead, but also as a special type of predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
  `((P|G)+G)^P`
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

From QSLKY.png:

```
    x0 = nand(CIn, P0)
    C0 = nand(x0, ~G0)

    x1 = nand(CIn, P0, P1)
    y1 = nand(G0, P1)
    C1 = nand(x1, y1, ~G1)

    x2 = nand(CIn, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    z2 = nand(G1, P2)
    C2 = nand(x2, y2, z2, ~G2)

    # Gen*
    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    z3 = nand(G2, P3)
    G* = nand(x3, y3, z3, ~G3)
```
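
As a sanity check, the NAND-gate equations above can be compared against the usual ripple recurrence `C[i] = G[i] | (P[i] & C[i-1])`. The Python sketch below is illustrative only: the helper names are made up here, the third carry is called `c2`, and `G*` is assumed to be the group generate, i.e. the carry out of bit 3 when the carry-in is zero.

```python
from itertools import product

def nand(*bits):
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1

def inv(bit):
    """Single-bit inverter on a 0/1 value."""
    return 1 - bit

def lookahead(cin, p, g):
    """Return (C0, C1, C2, Gstar) computed from the NAND equations."""
    p0, p1, p2, p3 = p
    g0, g1, g2, g3 = g
    c0 = nand(nand(cin, p0), inv(g0))
    c1 = nand(nand(cin, p0, p1), nand(g0, p1), inv(g1))
    c2 = nand(nand(cin, p0, p1, p2), nand(g0, p1, p2), nand(g1, p2), inv(g2))
    gstar = nand(nand(g0, p1, p2, p3), nand(g1, p2, p3), nand(g2, p3), inv(g3))
    return c0, c1, c2, gstar

def ripple(cin, p, g):
    """Carry out of each bit using C[i] = G[i] | (P[i] & C[i-1])."""
    carries, c = [], cin
    for pi, gi in zip(p, g):
        c = gi | (pi & c)
        carries.append(c)
    return carries

def check_all():
    """Exhaustively compare both formulations; returns the number of cases."""
    count = 0
    for bits in product((0, 1), repeat=9):
        cin, p, g = bits[0], bits[1:5], bits[5:9]
        c0, c1, c2, gstar = lookahead(cin, p, g)
        assert [c0, c1, c2] == ripple(cin, p, g)[:3]
        assert gstar == ripple(0, p, g)[3]
        count += 1
    return count

print(check_all())   # 512
```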

```
     P = (A | B) & Ci
     G = (A & B)
```

The Stackoverflow algorithm `((P|G)+G)^P` works on the accumulated bits of P and G from the associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in, which already includes ripple: one bit of carry per element.

```
    At each id, compute C[id] = A[id]+B[id]+0
    Get G[id] = C[id] > radix -1
    Get P[id] = C[id] == radix-1
    Join all P[id] together, likewise G[id]
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
```
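
The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the simulator's code: the radix of 256, the little-endian limb lists and the helper names (`to_limbs`, `from_limbs`, `big_add`) are all chosen here for the example, and bit `n` of `new_c` is treated as the overall carry out.

```python
RADIX = 256  # assumed element radix (8-bit limbs), chosen for the example

def to_limbs(x, n):
    """Split integer x into n little-endian 8-bit limbs."""
    return [(x >> (8 * i)) & 0xFF for i in range(n)]

def from_limbs(limbs):
    """Reassemble little-endian 8-bit limbs into an integer."""
    return sum(limb << (8 * i) for i, limb in enumerate(limbs))

def big_add(a, b):
    """Add two equal-length limb lists; returns (result limbs, carry out)."""
    n = len(a)
    # Independent per-element adds, with no carry-in.
    c = [a[i] + b[i] for i in range(n)]
    # One generate bit and one propagate bit per element, joined into integers.
    g = sum(int(c[i] > RADIX - 1) << i for i in range(n))
    p = sum(int(c[i] == RADIX - 1) << i for i in range(n))
    # Bit i of new_c is the carry into element i, ripple already included.
    new_c = ((p | g) + g) ^ p
    result = [(c[i] + ((new_c >> i) & 1)) % RADIX for i in range(n)]
    carry_out = (new_c >> n) & 1
    return result, carry_out

limbs, carry = big_add(to_limbs(0x01FFFF, 3), to_limbs(0x000001, 3))
print(hex(from_limbs(limbs)), carry)   # 0x20000 0
```

Because `P` and `G` are mutually exclusive per element, the single integer add ripples a generated carry through any run of propagating elements, and the final XOR with `P` recovers the carry-in bit at every position.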

Two versions: scalar int version and CR based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR based version, TODO.