openpower/sv/av_opcodes.mdwn

# Scalar OpenPOWER Audio and Video Opcodes

The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly 100% of cases the only) place to put Vector operations is in the *scalar* ISA. However, only by analysing those scalar opcodes *in* an SV Vectorisation context does it become clear why they are needed and how they may be designed.

This page therefore has an accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> on the evolution of suitable opcodes.

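The "hardware for-loop" idea can be sketched roughly as follows. This is a simplified model only: the register file is a plain Python list, and `sv_apply`, `regs` and `vl` are illustrative names, not part of any specification. Predication, element-width overrides and scalar/vector register marking are deliberately left out.

```python
def sv_apply(scalar_op, regs, rt, ra, rb, vl):
    """Apply one scalar two-operand op to vl consecutive registers."""
    for i in range(vl):
        # each iteration is exactly the unmodified scalar operation
        regs[rt + i] = scalar_op(regs[ra + i], regs[rb + i])
    return regs


regs = list(range(16))
# regs[8..11] = regs[0..3] + regs[4..7]
sv_apply(lambda a, b: a + b, regs, rt=8, ra=0, rb=4, vl=4)
```

The point of the sketch is that `scalar_op` itself knows nothing about vectors, which is why any new capability has to be added at the scalar level first.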
# Audio

The fundamental principle for these instructions is:

* identify the scalar primitive
* assume that longer runs of scalars will have Simple-V vectorisation applied
* assume that "swizzle" may be applied at the (vec2 - SUBVL=2) Vector level,
  in order to perform the necessary HI/LO selection normally hard-coded
  into SIMD ISAs.

Thus, for example, where OpenPOWER VSX has vpkswss, this would be achieved in SV with simply:

* addition of a scalar ext/clamp instruction
* 1st op, swizzle-select vec2 "select X only" from source to dest:
  dest.X = extclamp(src.X)
* 2nd op, swizzle-select vec2 "select Y only" from source to dest:
  dest.Y = extclamp(src.Y)

Macro-op fusion may be used to detect that these two interleave cleanly, overlapping the vec2.X with vec2.Y to produce a single vec2.XY operation.

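The two-op sequence above can be modelled as two passes over a list of vec2 elements. This is a sketch under assumptions: `extclamp` here is a hypothetical scalar primitive that saturates to a signed 16-bit range, and `pack_vec2` is an illustrative name for the pair of swizzle-selected operations.

```python
def extclamp(value, bits=16):
    """Hypothetical scalar primitive: saturate to a signed `bits`-wide range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def pack_vec2(src_vec2s):
    """Narrow a list of (X, Y) pairs using two swizzle-selected passes."""
    dest = [[0, 0] for _ in src_vec2s]
    for d, s in zip(dest, src_vec2s):   # 1st op: select X only
        d[0] = extclamp(s[0])
    for d, s in zip(dest, src_vec2s):   # 2nd op: select Y only
        d[1] = extclamp(s[1])
    return dest


packed = pack_vec2([(70000, -5), (-40000, 123)])
```

Because the two passes write disjoint halves of each vec2, a fused implementation could process X and Y together in a single pass and produce the same result.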
## Scalar element operations

* clamping / saturation for signed and unsigned. Best done similarly to FP rounding modes, i.e. with an SPR.
* average-add: result = (src1 + src2 + 1) >> 1
* abs-diff: result = (src1 > src2) ? (src1-src2) : (src2-src1)
* signed min/max

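One way to picture saturation as a mode, in the same spirit as FP rounding modes, is sketched below. Everything here is an assumption for illustration: `SatMode` stands in for a hypothetical SPR field, and its three values and names are invented for this example rather than taken from any specification.

```python
from enum import Enum


class SatMode(Enum):      # hypothetical SPR field values
    WRAP = 0              # modulo arithmetic, no saturation
    SIGNED = 1            # clamp to [-2^(n-1), 2^(n-1) - 1]
    UNSIGNED = 2          # clamp to [0, 2^n - 1]


def apply_mode(value, bits, mode):
    """Reduce an exact integer result to `bits` wide under the given mode."""
    if mode is SatMode.SIGNED:
        return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))
    if mode is SatMode.UNSIGNED:
        return max(0, min((1 << bits) - 1, value))
    return value & ((1 << bits) - 1)


def sat_add(a, b, bits, mode):
    """The same scalar add, with its result shaped by the current mode."""
    return apply_mode(a + b, bits, mode)
```

With the mode held outside the instruction, a single add opcode covers the wrapping, signed-saturating and unsigned-saturating cases instead of needing three separate encodings.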
# Video

TODO

* DCT <https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node231.html>

# VSX SIMD

## vpkpx

vpkpx is a 32-bit to 16-bit conversion, packing 8888 pixels into 1555.

SV notes:

A single 32-bit to 16-bit operation should suffice, fitting cleanly into one single scalar op:

    dest[0]     = src[7]
    dest[1 : 5] = src[8 :12]
    dest[6 :10] = src[16:20]
    dest[11:15] = src[24:28]

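The bit mapping above can be written out as a small executable sketch. It assumes the bit ranges use OpenPOWER MSB0 numbering (bit 0 is the most significant bit); `bits_msb0` and `pack_pixel` are illustrative helper names, not proposed mnemonics.

```python
def bits_msb0(value, start, end, width):
    """Extract bits start..end (inclusive, MSB0 numbering) of a `width`-bit value."""
    nbits = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << nbits) - 1)


def pack_pixel(src):
    """32-bit 8888 -> 16-bit 1555, following the bit mapping above."""
    flag = bits_msb0(src, 7, 7, 32)     # dest[0]
    c1 = bits_msb0(src, 8, 12, 32)      # dest[1:5]
    c2 = bits_msb0(src, 16, 20, 32)     # dest[6:10]
    c3 = bits_msb0(src, 24, 28, 32)     # dest[11:15]
    return (flag << 15) | (c1 << 10) | (c2 << 5) | c3
```

For instance, `pack_pixel(0x01FF8040)` keeps the low bit of the first byte and the top five bits of each of the remaining three bytes.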
## vpks[\*][\*]s

Signed and unsigned, these are N-to-M (N=64/32/16, M=32/16/8) chop/clamp/sign/zero-extend operations.

## vupkhpx / vupklpx

These are 16-bit to 32-bit conversions, unpacking 1555 pixels into 8888.

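A possible scalar form of the unpack direction is sketched below. The field placement is an assumption that should be checked against the Power ISA: the 1-bit field is taken to be sign-extended to fill the top byte, and each 5-bit field is taken to be zero-extended into the low bits of its own byte. `unpack_pixel` is an illustrative name.

```python
def unpack_pixel(src):
    """16-bit 1555 -> 32-bit 8888 (assumed layout, see the text above)."""
    flag = (src >> 15) & 0x1
    c1 = (src >> 10) & 0x1F
    c2 = (src >> 5) & 0x1F
    c3 = src & 0x1F
    top = 0xFF if flag else 0x00      # sign-extend the 1-bit field to 8 bits
    # each 5-bit channel lands in the low bits of its byte (zero-extended)
    return (top << 24) | (c1 << 16) | (c2 << 8) | c3
```

Under this layout the unpack is not the exact inverse of the pack, since the pack keeps the top five bits of each byte while the unpack places them in the low five.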
## vavgs\*

Signed and unsigned, 8/16/32: these are all of the form:

    result = truncate((a + b + 1) >> 1)

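The formula relies on the intermediate sum having one more bit than the element width. A minimal sketch, where `avg_round` is an illustrative name and Python's unbounded integers stand in for that wider intermediate:

```python
def avg_round(a, b, bits, signed=False):
    """Rounded average of two `bits`-wide values, with no intermediate overflow."""
    total = a + b + 1             # needs bits+1 of precision; Python ints are exact
    result = total >> 1           # floor((a + b + 1) / 2), i.e. halves round up
    result &= (1 << bits) - 1     # truncate to the element width
    if signed and result >> (bits - 1):
        result -= 1 << bits       # reinterpret the bit pattern as two's complement
    return result
```

A hardware implementation would need the adder to carry out into an extra bit before the shift; doing the add at the element width would wrap for inputs such as 255 + 255 at 8 bits.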
## vabsdu\*

Unsigned 8/16/32: these are all of the form:

    result = (src1 > src2) ? truncate(src1-src2) :
                             truncate(src2-src1)

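As a sketch of why the comparison comes before the subtraction, the unsigned form can be written as below. `abs_diff_unsigned` is an illustrative name, and inputs are masked to the element width first so that they are treated as unsigned.

```python
def abs_diff_unsigned(src1, src2, bits):
    """Absolute difference of two unsigned `bits`-wide values."""
    mask = (1 << bits) - 1
    src1 &= mask
    src2 &= mask
    # always subtract the smaller from the larger, so the result never
    # goes negative and always fits in `bits` bits
    return src1 - src2 if src1 > src2 else src2 - src1


# a naive wrapped subtraction gives a different (wrong) answer
naive = (3 - 250) & 0xFF
```

Since the larger operand is always on the left of the subtraction, the truncation in the formula never discards any set bits.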
## vmaxs\* / vmaxu\* (and min)

Signed and unsigned, 8/16/32: these are all of the form:

    result = (src1 > src2) ? src1 : src2 # max
    result = (src1 < src2) ? src1 : src2 # min

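The signed and unsigned variants differ only in how the comparison interprets the bit pattern. A small sketch, with `to_signed`, `elem_max` and `elem_min` as illustrative names:

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def _key(value, bits, signed):
    return to_signed(value, bits) if signed else value & ((1 << bits) - 1)


def elem_max(src1, src2, bits, signed):
    return src1 if _key(src1, bits, signed) > _key(src2, bits, signed) else src2


def elem_min(src1, src2, bits, signed):
    return src1 if _key(src1, bits, signed) < _key(src2, bits, signed) else src2
```

For example, at 8 bits the pattern 0xFF is the largest unsigned value but is -1 when signed, so `elem_max(0xFF, 0x01, 8, signed=False)` and `elem_max(0xFF, 0x01, 8, signed=True)` pick different operands.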
## vmerge operations

These take two src vectors of various widths and splice them together. The best technique to cover these is a simple, straightforward predicated pair of mv operations, inverting the predicate in the second case, or, alternatively, a pair of vec2 (SUBVL=2) swizzled operations.

In the swizzle case the first instruction would be destvec2.X = srcvec2.X and the second would swizzle-select Y. Macro-op fusion in both the predicated variant and the swizzle variant would interleave the two into the same SIMD backend ALUs.

With twin predication the elwidth can be overridden on both src and dest such that either straight scalar mv or extsw/b/h can be used to provide the combinations of coverage needed, with only 2 actual instructions (plus vector prefixing).
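The predicated-pair approach can be modelled roughly as below. This is a sketch under assumptions: the source predicate is taken to be all ones, so source elements are consumed in order, while the destination predicate selects which slots are written; both sources are assumed to be the same length. `predicated_mv` and `merge` are illustrative names, and element-width overrides are not modelled.

```python
def predicated_mv(dest, src, dest_mask):
    """Twin-predicated move sketch: source elements are consumed in order
    and written only to destination slots whose mask bit is set."""
    src_iter = iter(src)
    for i in range(len(dest)):
        if (dest_mask >> i) & 1:
            dest[i] = next(src_iter)
    return dest


def merge(src_a, src_b):
    """Interleave two equal-length vectors using two predicated moves."""
    n = len(src_a) + len(src_b)
    dest = [None] * n
    even = sum(1 << i for i in range(0, n, 2))      # 0b...0101
    all_ones = (1 << n) - 1
    predicated_mv(dest, src_a, even)                # 1st mv
    predicated_mv(dest, src_b, ~even & all_ones)    # 2nd mv, predicate inverted
    return dest


merged = merge([1, 2, 3, 4], [10, 20, 30, 40])
```

The two moves write complementary destination slots, which is what would allow a fused implementation to issue them to the same backend ALUs together.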