[[!tag standards]]

# Scalar OpenPOWER Audio and Video Opcodes
The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly 100% of cases the only) place to put Vector operations is first and foremost in the *scalar* ISA. However, only by analysing those scalar opcodes *in* an SV Vectorisation context does it become clear why they are needed and how they should be designed.
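As a concrete illustration of the principle, the hardware for-loop can be sketched in software. This is a toy conceptual model only: `sv_vectorise`, the flat register file, and the operand numbering are all hypothetical and not part of any specification.

```python
# Toy model of SV vectorisation: a scalar operation becomes a Vector
# operation by wrapping it in a hardware-level for-loop over
# consecutive registers. Names are illustrative, not from the spec.

def scalar_avgadd(a, b):
    # example scalar primitive: average-add (defined later on this page)
    return (a + b + 1) >> 1

def sv_vectorise(scalar_op, regs, ra, rb, rt, vl):
    # the "hardware for-loop": apply the scalar op VL times over
    # sequentially-numbered source and destination registers
    for i in range(vl):
        regs[rt + i] = scalar_op(regs[ra + i], regs[rb + i])

regs = list(range(32))          # toy register file
sv_vectorise(scalar_avgadd, regs, ra=0, rb=8, rt=16, vl=4)
```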
This page therefore has accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> for evolution of suitable opcodes.

# Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=915> add overflow to maxmin.
* <https://bugs.libre-soc.org/show_bug.cgi?id=863> add pseudocode etc.
* <https://bugs.libre-soc.org/show_bug.cgi?id=234> hardware implementation
* <https://bugs.libre-soc.org/show_bug.cgi?id=910> mins/maxs zero-option?
* <https://bugs.libre-soc.org/show_bug.cgi?id=1057> move all int/fp min/max to ls013
* [[vpu]]
* [[sv/int_fp_mv]]
* [[openpower/isa/av]] pseudocode
* [[av_opcodes/analysis]]
* TODO review HP 1994-6 PA-RISC MAX <https://en.m.wikipedia.org/wiki/Multimedia_Acceleration_eXtensions>
* <https://en.m.wikipedia.org/wiki/Sum_of_absolute_differences>

# Summary

The summary of the base scalar operations that need to be added is:

| instruction    | pseudocode                                     |
| -------------- | ---------------------------------------------- |
| average-add    | result = (src1 + src2 + 1) >> 1                |
| abs-diff       | result = abs(src1 - src2)                      |
| abs-accumulate | result += abs(src1 - src2)                     |
| (un)signed min | result = (src1 < src2) ? src1 : src2 [[ls013]] |
| (un)signed max | result = (src1 > src2) ? src1 : src2 [[ls013]] |
| bitwise sel    | (a ? b : c) - use [[sv/bitmanip]] ternary      |
| int/fp move    | covered by REMAP and Pack/Unpack               |

Implemented at the [[openpower/isa/av]] pseudocode page.

All other capabilities (saturate in particular) are achieved with [[sv/svp64]] modes and swizzle. Note that minmax and ternary are added in [[sv/bitmanip]].

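The summary table above can be sketched as plain software functions. This is a minimal illustrative model, assuming XLEN=64 and modular wraparound; the function names are ours, and the normative definitions are the pseudocode on the [[openpower/isa/av]] page.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def avgadd(a, b):
    # (src1 + src2 + 1) >> 1, computed in XLEN+1 bits so the carry
    # out of the addition is not lost
    return ((a + b + 1) & ((1 << (XLEN + 1)) - 1)) >> 1

def absdu(a, b):
    # unsigned absolute difference
    return (a - b) & MASK if a > b else (b - a) & MASK

def minu(a, b):
    return a if a < b else b

def maxu(a, b):
    return a if a > b else b

def ternary(sel, b, c):
    # bitwise select: for each bit, take b where sel=1, else c
    return ((sel & b) | (~sel & c)) & MASK
```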
# Instructions

## Average Add

X-Form

* avgadd RT,RA,RB (Rc=0)
* avgadd. RT,RA,RB (Rc=1)

Pseudo-code:

    a <- [0] * (XLEN+1)
    b <- [0] * (XLEN+1)
    a[1:XLEN] <- (RA)
    b[1:XLEN] <- (RB)
    r <- (a + b + 1)
    RT <- r[0:XLEN-1]

Special Registers Altered:

    CR0 (if Rc=1)

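A quick way to sanity-check the pseudo-code above: zero-extending both operands into XLEN+1 bits means the carry out of the addition cannot be lost, and selecting r[0:XLEN-1] (the top XLEN bits of the widened result) performs the final shift-right-by-one. A sketch of this, assuming XLEN=64:

```python
XLEN = 64

def avgadd(ra, rb):
    # ra and rb are zero-extended into XLEN+1 bits, so (a + b + 1)
    # cannot overflow; dropping the LSB of the XLEN+1-bit result
    # is the final >> 1 (rounding the average upwards)
    r = (ra + rb + 1) & ((1 << (XLEN + 1)) - 1)
    return r >> 1

# a naive XLEN-bit (ra + rb + 1) >> 1 would lose the carry here:
big = (1 << XLEN) - 1
assert avgadd(big, big) == big
```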
## Absolute Signed Difference

X-Form

* absds RT,RA,RB (Rc=0)
* absds. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) < (RB) then RT <- ¬(RA) + (RB) + 1
    else                RT <- ¬(RB) + (RA) + 1

Special Registers Altered:

    CR0 (if Rc=1)

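The expression ¬(RA) + (RB) + 1 is simply two's-complement subtraction, (RB) - (RA), so each branch of the pseudo-code produces the positive difference. A sketch of the operation, assuming XLEN=64 (the helper names are illustrative):

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def signed(x):
    # interpret an XLEN-bit value as two's-complement signed
    return x - (1 << XLEN) if x & (1 << (XLEN - 1)) else x

def absds(ra, rb):
    # ¬(RA) + (RB) + 1 is two's-complement RB - RA, so whichever
    # branch is taken yields the positive (absolute) difference
    if signed(ra) < signed(rb):
        return ((ra ^ MASK) + rb + 1) & MASK
    else:
        return ((rb ^ MASK) + ra + 1) & MASK
```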
## Absolute Unsigned Difference

X-Form

* absdu RT,RA,RB (Rc=0)
* absdu. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) <u (RB) then RT <- ¬(RA) + (RB) + 1
    else                 RT <- ¬(RB) + (RA) + 1

Special Registers Altered:

    CR0 (if Rc=1)

## Absolute Accumulate Unsigned Difference

X-Form

* absdacu RT,RA,RB (Rc=0)
* absdacu. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) <u (RB) then r <- ¬(RA) + (RB) + 1
    else                 r <- ¬(RB) + (RA) + 1
    RT <- (RT) + r

Special Registers Altered:

    CR0 (if Rc=1)

## Absolute Accumulate Signed Difference

X-Form

* absdacs RT,RA,RB (Rc=0)
* absdacs. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) < (RB) then r <- ¬(RA) + (RB) + 1
    else                r <- ¬(RB) + (RA) + 1
    RT <- (RT) + r

Special Registers Altered:

    CR0 (if Rc=1)
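The accumulating variants exist primarily so that a Sum of Absolute Differences (SAD, linked above) reduces to a single Vectorised instruction. A sketch of what a Vectorised absdacu loop computes, assuming XLEN=64:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def absdu(a, b):
    # unsigned absolute difference (scalar primitive)
    return (a - b) & MASK if a > b else (b - a) & MASK

def sad(block_a, block_b):
    # what a Vectorised absdacu computes: the accumulator register RT
    # gathers abs-differences across the whole loop, i.e. the Sum of
    # Absolute Differences kernel used in video motion estimation
    rt = 0
    for a, b in zip(block_a, block_b):
        rt = (rt + absdu(a, b)) & MASK
    return rt
```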