# Scalar OpenPOWER Audio and Video Opcodes

The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly 100% of cases the only) place to put Vector operations is in the *scalar* ISA. However, only by analysing those scalar opcodes *in* an SV Vectorisation context does it become clear why they are needed and how they may be designed.

This page therefore has an accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> for evolution of suitable opcodes.

# Audio

The fundamental principle for these instructions is:

* identify the scalar primitive
* assume that longer runs of scalars will have Simple-V vectorisation applied
* assume that "swizzle" may be applied at the (vec2 - SUBVL=2) Vector level,
  in order to perform the necessary HI/LO selection normally hard-coded
  into SIMD ISAs.

Thus, for example, where OpenPOWER VSX has vpkswss, this would be achieved in SV simply with:

* addition of a scalar ext/clamp instruction
* 1st op, swizzle-select vec2 "select X only" from source to dest:
  dest.X = extclamp(src.X)
* 2nd op, swizzle-select vec2 "select Y only" from source to dest:
  dest.Y = extclamp(src.Y)

Macro-op fusion may be used to detect that these two interleave cleanly, overlapping the vec2.X with vec2.Y to produce a single vec2.XY operation.
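
As a rough illustration, here is a minimal C sketch of that sequence, assuming illustrative 32-to-16-bit signed widths (the function names are invented for this sketch, not actual instruction names):

    #include <stdint.h>

    /* hypothetical scalar primitive: clamp a 32-bit signed value
     * into the 16-bit signed range (the "ext/clamp" op above) */
    static int16_t extclamp_s32_s16(int32_t v) {
        if (v > INT16_MAX) return INT16_MAX;
        if (v < INT16_MIN) return INT16_MIN;
        return (int16_t)v;
    }

    /* the two swizzle-selected vec2 ops folded into one plain loop:
     * the dest[i] assignment corresponds to the 1st (X) op, the
     * dest[i+1] assignment to the 2nd (Y) op.  macro-op fusion would
     * merge them into a single vec2.XY pass.  n is assumed even. */
    static void pack_s32_s16(const int32_t *src, int16_t *dest, int n) {
        for (int i = 0; i < n; i += 2) {
            dest[i]     = extclamp_s32_s16(src[i]);     /* dest.X = extclamp(src.X) */
            dest[i + 1] = extclamp_s32_s16(src[i + 1]); /* dest.Y = extclamp(src.Y) */
        }
    }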

## Scalar element operations

* clamping / saturation for signed and unsigned, best done similarly to FP rounding modes, i.e. with an SPR
* average-add: result = (src1 + src2 + 1) >> 1
* abs-diff: result = (src1 > src2) ? (src1-src2) : (src2-src1)
* signed min/max

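Reference semantics for the last three of these, as a minimal C sketch (the 8-bit element widths are chosen purely for illustration):

    #include <stdint.h>

    /* average-add: computed in a wider type so the +1 cannot overflow */
    static uint8_t avg_add(uint8_t a, uint8_t b) {
        return (uint8_t)(((uint16_t)a + b + 1) >> 1);
    }

    /* abs-diff: always non-negative, no overflow */
    static uint8_t abs_diff(uint8_t a, uint8_t b) {
        return (a > b) ? (uint8_t)(a - b) : (uint8_t)(b - a);
    }

    /* signed min/max */
    static int8_t max_s8(int8_t a, int8_t b) { return (a > b) ? a : b; }
    static int8_t min_s8(int8_t a, int8_t b) { return (a < b) ? a : b; }
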
# Video

TODO

* DCT <https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node231.html>
* <https://www.nayuki.io/page/fast-discrete-cosine-transform-algorithms>

# VSX SIMD

Useful parts of VSX, and how they might map.

## vpks[\*][\*]s (vec_pack*)

signed and unsigned, these are N-to-M (N=64/32/16, M=32/16/8) chop/clamp/sign/zero-extend operations. May be implemented by a clamped move to a smaller elwidth.

The other direction, the vec_unpack widening ops, may need some way to tell whether to sign-extend or zero-extend.

*scalar extsw/b/h gives one set, mv gives another. src elwidth override and dest elwidth override provide the pack/unpack*

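A small C sketch of the distinction the widening unpack needs to make, with illustrative 8-to-32-bit widths:

    #include <stdint.h>

    /* a plain mv to a wider dest elwidth zero-extends */
    static uint32_t unpack_zero(uint8_t v) { return (uint32_t)v; }

    /* an extsb-style mv to a wider dest elwidth sign-extends */
    static int32_t unpack_sign(int8_t v) { return (int32_t)v; }
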
## vavgs\* (vec_avg)

signed and unsigned, 8/16/32: these are all of the form:

    result = truncate((a + b + 1) >> 1)

*These do not exist in the scalar ISA and would need to be added. Essentially it is a type of post-processing involving the CA bit, so it could be included in the existing scalar pipeline ALU*

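A C sketch of the semantics, showing both the widened-arithmetic reference (the CA-bit route in hardware) and an identity that avoids widening; the 32-bit width is illustrative:

    #include <stdint.h>

    /* reference: compute in a wider type, as hardware would by
     * keeping the carry (CA) out of the add before the shift */
    static uint32_t vavgu_wide(uint32_t a, uint32_t b) {
        return (uint32_t)(((uint64_t)a + b + 1) >> 1);
    }

    /* the same result without widening, using the identity
     * a + b = (a | b) + (a & b); cannot overflow */
    static uint32_t vavgu_nocarry(uint32_t a, uint32_t b) {
        return (a | b) - ((a ^ b) >> 1);
    }
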
## vabsdu\* (vec_abs)

unsigned 8/16/32: these are all of the form:

    result = (src1 > src2) ? truncate(src1-src2) : truncate(src2-src1)

*These do not exist in the scalar ISA and would need to be added*

## vmaxs\* / vmaxu\* (and min)

signed and unsigned, 8/16/32: these are all of the form:

    result = (src1 > src2) ? src1 : src2 # max
    result = (src1 < src2) ? src1 : src2 # min

*These do not exist in the scalar INTEGER ISA and would need to be added*

## vmerge operations

Their main point was to work around the odd/even multiplies. SV swizzles and mv.x should handle all cases.

These take two src vectors of various widths and splice them together. The best technique to cover them is a simple, straightforward predicated pair of mv operations, inverting the predicate in the second case, or, alternatively, a pair of vec2 (SUBVL=2) swizzled operations.

In the swizzle case the first instruction would be destvec2.X = srcvec2.X and the second would swizzle-select Y: destvec2.Y = srcvec2.Y. Macro-op fusion, in both the predicated variant and the swizzle variant, would interleave the two into the same SIMD backend ALUs.

With twin predication the elwidth can be overridden on both src and dest, such that either a straight scalar mv or extsw/b/h can be used to provide the combinations of coverage needed, with only 2 actual instructions (plus vector prefixing). A C model of the predicated pair is sketched below.

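This is only a sketch of the predicated mv pair, not actual SV semantics; the bitmask predicate and plain loops are illustrative (vl is assumed to be at most 64):

    #include <stdint.h>

    /* first mv copies src1 where the predicate bit is set;
     * second mv, with the predicate inverted, copies src2
     * where it is clear.  together they splice the two vectors. */
    static void merge_predicated(const uint32_t *src1, const uint32_t *src2,
                                 uint32_t *dest, uint64_t pred, int vl) {
        for (int i = 0; i < vl; i++)            /* mv, predicate = pred    */
            if ((pred >> i) & 1) dest[i] = src1[i];
        for (int i = 0; i < vl; i++)            /* mv, predicate inverted  */
            if (!((pred >> i) & 1)) dest[i] = src2[i];
    }
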
## Float estimates

    vec_expte - float 2^x
    vec_loge - float log2(x)
    vec_re - float 1/x
    vec_rsqrte - float 1/sqrt(x)

The spec says the max relative inaccuracy is 1/4096.

*These could be done by assigning meaning to the "sat mode" SVP64 bits in an FP context: 0b00 is IEEE754 FP, 0b01 is 2^12 accuracy for FP32. These can be applied to standard scalar FP ops*

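A small C sketch of what the 1/4096 bound means, using the vec_re-style reciprocal as the example; est stands for the output of a hypothetical estimate instruction, and x is assumed non-zero:

    #include <math.h>

    /* relative error of the estimate must be within 2^-12 */
    static int within_spec(float est, float x) {
        float exact = 1.0f / x;
        return fabsf((est - exact) / exact) <= 1.0f / 4096.0f;
    }
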
## vec_madd(s) - FMA, multiply-add, optionally saturated

    a * b + c

*Standard scalar madd*

## vec_msum(s) - horizontal gather multiply-add, optionally saturated

This should be separated into a horizontal multiply and a horizontal add. How a horizontal operation would work in SV is TBD: how wide is it, etc.

    a.x + a.y + a.z ...
    a.x * a.y * a.z ...

*This would realistically need to be done with a loop doing a mapreduce sequence. I looked very early on at doing this type of operation and concluded it would be better done with a series of halvings each time, as separate instructions: VL=16, then VL=8, then 4, then 2 and finally one scalar, i.e. not an actual part of SV at all. An OoO multi-issue engine would be more than capable of dealing with the Dependencies.*

That has the issue that it's a massive PITA to code, plus it's slow. Plus there's the "access to non-4-offset regs stalls" problem. Even if there's no ready-made operation, it should be made easier and faster than a manual mapreduce loop.

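A C sketch of that halving sequence, where each inner-loop pass stands in for one vector add at successively halved VL; a power-of-two length and in-place update are assumed for illustration:

    #include <stdint.h>

    /* VL=n/2 adds pairs, then VL=n/4, ... down to one scalar:
     * log2(n) vector ops instead of a serial accumulate */
    static uint32_t reduce_add_halving(uint32_t *v, int n) {
        for (int half = n / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)   /* one VL=half vector add */
                v[i] = v[i] + v[i + half];
        return v[0];
    }
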
--

As a middle-ground solution, 4-element gathers were discussed. 4 elements would also make them useful for pixel packing, not just general vector gather. This is because OR and ADD are the same operation when bits don't overlap.

    gather-add: d = a.x + a.y + a.z + a.w
    gather-mul: d = a.x * a.y * a.z * a.w

But can the SV loop increment the src reg # by 4? Hmm.

The idea then leads to the opposite operation, a 1-to-4 bit-scatter instruction. Like gather, it could be implemented with a normal loop, but it's faster for certain uses.

    bit-scatter dest, src, bits

    bit-scatter rd, rs, 8 # assuming source and dest are 32-bit wide
    rd   = (rs >> (0 * 8)) & (2^8 - 1)
    rd+1 = (rs >> (1 * 8)) & (2^8 - 1)
    rd+2 = (rs >> (2 * 8)) & (2^8 - 1)
    rd+3 = (rs >> (3 * 8)) & (2^8 - 1)

So at the start you have an RGBA packed pixel in one 32-bit register; at the end you have each channel separated into its own register, in the low bits, ANDed so that only the relevant bits remain.

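A C model of this bit-scatter and of the inverse 4-gather, illustrating why OR and ADD coincide when the fields don't overlap:

    #include <stdint.h>

    /* bit-scatter: split one 32-bit RGBA pixel into four
     * "registers", one 8-bit channel each, in the low bits */
    static void bit_scatter8(uint32_t rs, uint32_t rd[4]) {
        for (int i = 0; i < 4; i++)
            rd[i] = (rs >> (i * 8)) & 0xff;
    }

    /* 4-gather: repack; since the shifted fields never overlap,
     * rs |= ... and rs += ... give the identical result */
    static uint32_t bit_gather8(const uint32_t rd[4]) {
        uint32_t rs = 0;
        for (int i = 0; i < 4; i++)
            rs |= (rd[i] & 0xff) << (i * 8);
        return rs;
    }
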
## vec_mul*

There should be both a same-width multiply and a widening multiply, in signed and unsigned versions, optionally saturated.

    u8 * u8 = u8
    u8 * u8 = u16

For 8, 16, 32 and 64-bit sources, resulting in 8, 16, 32, 64 and 128-bit results.

*All of these can be done with SV elwidth overrides, as long as the dest is no greater than 128 bits. SV specifically does not do 128-bit arithmetic. Instead, a vec2.X mul-lo followed by a vec2.Y mul-hi can be macro-op fused to get at the full 128-bit internal result. Specifying e.g. src elwidth=8 and dest elwidth=16 will give a widening multiply*

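A C sketch of the mul-lo/mul-hi pairing, shown at 32x32->64 purely for illustration (the same idea scales to 64x64->128):

    #include <stdint.h>

    /* low half of the widened product (mullw-style) */
    static uint32_t mul_lo(uint32_t a, uint32_t b) {
        return (uint32_t)((uint64_t)a * b);
    }

    /* high half of the widened product (mulhwu-style) */
    static uint32_t mul_hi(uint32_t a, uint32_t b) {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }

    /* full result: ((uint64_t)mul_hi(a, b) << 32) | mul_lo(a, b) */
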
## vec_rl - rotate left

    (a << x) | (a >> (WIDTH - x))

*Standard scalar rlwinm*

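The same rotate written as C, with the shift amounts masked so that x = 0 stays well-defined (a shift by the full width is undefined behaviour in C, though the hardware rotate handles it naturally):

    #include <stdint.h>

    static uint32_t rotl32(uint32_t a, unsigned x) {
        return (a << (x & 31)) | (a >> ((32 - x) & 31));
    }
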
## vec_sel - bitwise select

    (a ? b : c)   # applied per bit: result = (a & b) | (~a & c)

*This does not exist in the scalar ISA and would need to be added*

## vec_splat - broadcast

Broadcasts one element to all lanes. Implemented using swizzle/predicate.

## vec_perm - permute

Implemented using swizzle, mv.x.

## vec_*c[tl]z, vec_popcnt - count leading/trailing zeroes, set bits

Bit counts.

    ctz - count trailing zeroes
    clz - count leading zeroes
    popcnt - count set bits

*These all exist in the scalar ISA*