[[!tag standards]]

# Scalar OpenPOWER Audio and Video Opcodes
The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly 100% of cases the only) place to put Vector operations is first and foremost in the *scalar* ISA. However, only by analysing those scalar opcodes *in* an SV Vectorisation context does it become clear why they are needed and how they should be designed.
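As a concrete illustration of the principle, the hardware for-loop can be sketched in software. This is a toy conceptual model only: `sv_vectorise`, the flat register file, and the operand numbering are all hypothetical and not part of any specification.

```python
# Toy model of SV vectorisation: a scalar operation becomes a Vector
# operation by wrapping it in a hardware-level for-loop over
# consecutive registers. Names are illustrative, not from the spec.

def scalar_avgadd(a, b):
    # example scalar primitive: average-add (defined later on this page)
    return (a + b + 1) >> 1

def sv_vectorise(scalar_op, regs, ra, rb, rt, vl):
    # the "hardware for-loop": apply the scalar op VL times over
    # sequentially-numbered source and destination registers
    for i in range(vl):
        regs[rt + i] = scalar_op(regs[ra + i], regs[rb + i])

regs = list(range(32))          # toy register file
sv_vectorise(scalar_avgadd, regs, ra=0, rb=8, rt=16, vl=4)
```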
This page therefore has accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> for evolution of suitable opcodes.

# Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=915> add overflow to maxmin.
* <https://bugs.libre-soc.org/show_bug.cgi?id=863> add pseudocode etc.
* <https://bugs.libre-soc.org/show_bug.cgi?id=234> hardware implementation
* <https://bugs.libre-soc.org/show_bug.cgi?id=910> mins/maxs zero-option?
* <https://bugs.libre-soc.org/show_bug.cgi?id=1057> move all int/fp min/max to ls013
* [[vpu]]
* [[sv/int_fp_mv]]
* [[openpower/isa/av]] pseudocode
* [[av_opcodes/analysis]]
* TODO review HP 1994-6 PA-RISC MAX <https://en.m.wikipedia.org/wiki/Multimedia_Acceleration_eXtensions>
* <https://en.m.wikipedia.org/wiki/Sum_of_absolute_differences>

# Summary

The summary of the base scalar operations that need to be added is:

| instruction    | pseudocode                                     |
| -------------- | ---------------------------------------------- |
| average-add    | result = (src1 + src2 + 1) >> 1                |
| abs-diff       | result = abs(src1 - src2)                      |
| abs-accumulate | result += abs(src1 - src2)                     |
| (un)signed min | result = (src1 < src2) ? src1 : src2 [[ls013]] |
| (un)signed max | result = (src1 > src2) ? src1 : src2 [[ls013]] |
| bitwise sel    | (a ? b : c) - use [[sv/bitmanip]] ternary      |
| int/fp move    | covered by REMAP and Pack/Unpack               |

Implemented at the [[openpower/isa/av]] pseudocode page.

All other capabilities (saturate in particular) are achieved with [[sv/svp64]] modes and swizzle. Note that minmax and ternary are added in [[sv/bitmanip]].

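The summary table above can be sketched as plain software functions. This is a minimal illustrative model, assuming XLEN=64 and modular wraparound; the function names are ours, and the normative definitions are the pseudocode on the [[openpower/isa/av]] page.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def avgadd(a, b):
    # (src1 + src2 + 1) >> 1, computed in XLEN+1 bits so the carry
    # out of the addition is not lost
    return ((a + b + 1) & ((1 << (XLEN + 1)) - 1)) >> 1

def absdu(a, b):
    # unsigned absolute difference
    return (a - b) & MASK if a > b else (b - a) & MASK

def minu(a, b):
    return a if a < b else b

def maxu(a, b):
    return a if a > b else b

def ternary(sel, b, c):
    # bitwise select: for each bit, take b where sel=1, else c
    return ((sel & b) | (~sel & c)) & MASK
```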
# Instructions

## Average Add

X-Form

* avgadd RT,RA,RB (Rc=0)
* avgadd. RT,RA,RB (Rc=1)

Pseudo-code:

    a <- [0] * (XLEN+1)
    b <- [0] * (XLEN+1)
    a[1:XLEN] <- (RA)
    b[1:XLEN] <- (RB)
    r <- (a + b + 1)
    RT <- r[0:XLEN-1]

Special Registers Altered:

    CR0 (if Rc=1)

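A quick way to sanity-check the pseudo-code above: zero-extending both operands into XLEN+1 bits means the carry out of the addition cannot be lost, and selecting r[0:XLEN-1] (the top XLEN bits of the widened result) performs the final shift-right-by-one. A sketch of this, assuming XLEN=64:

```python
XLEN = 64

def avgadd(ra, rb):
    # ra and rb are zero-extended into XLEN+1 bits, so (a + b + 1)
    # cannot overflow; dropping the LSB of the XLEN+1-bit result
    # is the final >> 1 (rounding the average upwards)
    r = (ra + rb + 1) & ((1 << (XLEN + 1)) - 1)
    return r >> 1

# a naive XLEN-bit (ra + rb + 1) >> 1 would lose the carry here:
big = (1 << XLEN) - 1
assert avgadd(big, big) == big
```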
## Absolute Signed Difference

X-Form

* absds RT,RA,RB (Rc=0)
* absds. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) < (RB) then RT <- ¬(RA) + (RB) + 1
    else                RT <- ¬(RB) + (RA) + 1

Special Registers Altered:

    CR0 (if Rc=1)

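The expression ¬(RA) + (RB) + 1 is simply two's-complement subtraction, (RB) - (RA), so each branch of the pseudo-code produces the positive difference. A sketch of the operation, assuming XLEN=64 (the helper names are illustrative):

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def signed(x):
    # interpret an XLEN-bit value as two's-complement signed
    return x - (1 << XLEN) if x & (1 << (XLEN - 1)) else x

def absds(ra, rb):
    # ¬(RA) + (RB) + 1 is two's-complement RB - RA, so whichever
    # branch is taken yields the positive (absolute) difference
    if signed(ra) < signed(rb):
        return ((ra ^ MASK) + rb + 1) & MASK
    else:
        return ((rb ^ MASK) + ra + 1) & MASK
```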
## Absolute Unsigned Difference

X-Form

* absdu RT,RA,RB (Rc=0)
* absdu. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) <u (RB) then RT <- ¬(RA) + (RB) + 1
    else                 RT <- ¬(RB) + (RA) + 1

Special Registers Altered:

    CR0 (if Rc=1)

## Absolute Accumulate Unsigned Difference

X-Form

* absdacu RT,RA,RB (Rc=0)
* absdacu. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) <u (RB) then r <- ¬(RA) + (RB) + 1
    else                 r <- ¬(RB) + (RA) + 1
    RT <- (RT) + r

Special Registers Altered:

    CR0 (if Rc=1)

## Absolute Accumulate Signed Difference

X-Form

* absdacs RT,RA,RB (Rc=0)
* absdacs. RT,RA,RB (Rc=1)

Pseudo-code:

    if (RA) < (RB) then r <- ¬(RA) + (RB) + 1
    else                r <- ¬(RB) + (RA) + 1
    RT <- (RT) + r

Special Registers Altered:

    CR0 (if Rc=1)
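The accumulating variants exist primarily so that a Sum of Absolute Differences (SAD, linked above) reduces to a single Vectorised instruction. A sketch of what a Vectorised absdacu loop computes, assuming XLEN=64:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def absdu(a, b):
    # unsigned absolute difference (scalar primitive)
    return (a - b) & MASK if a > b else (b - a) & MASK

def sad(block_a, block_b):
    # what a Vectorised absdacu computes: the accumulator register RT
    # gathers abs-differences across the whole loop, i.e. the Sum of
    # Absolute Differences kernel used in video motion estimation
    rt = 0
    for a, b in zip(block_a, block_b):
        rt = (rt + absdu(a, b)) & MASK
    return rt
```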