ztrans_proposal.mdwn

   1 # Zftrans - transcendental operations
   2
   3 See:
   4
   5 * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
   6 * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
   7 * Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
   8 * [[rv_major_opcode_1010011]] for opcode listing.
   9 * [[zfpacc_proposal]] for accuracy settings proposal
  10
  11 Extension subsets:
  12
  13 * **Zftrans**: standard transcendentals (best suited to 3D)
  14 * **ZftransExt**: extra functions (useful, not generally needed for 3D,
  15   can be synthesised using Ztrans)
  16 * **Ztrigpi**: trig. xxx-pi sinpi cospi tanpi
  17 * **Ztrignpi**: trig non-xxx-pi sin cos tan
  18 * **Zarctrigpi**: arc-trig. a-xxx-pi: atan2pi asinpi acospi
  19 * **Zarctrignpi**: arc-trig. non-a-xxx-pi: atan2, asin, acos
  20 * **Zfhyp**: hyperbolic/inverse-hyperbolic.  sinh, cosh, tanh, asinh,
  21   acosh, atanh (can be synthesised - see below)
  22 * **ZftransAdv**: much more complex to implement in hardware
  23 * **Zfrsqrt**: Reciprocal square-root.
  24
  25 Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
  26 Zarctrignpi
  27
  28 [[!toc levels=2]]
  29
  30 # TODO:
  31
  32 * Decision on accuracy, moved to [[zfpacc_proposal]]
  33 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
  34 * Errors **MUST** be repeatable.
  35 * How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
  36 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
  37   Accuracy requirements for dual (triple) purpose implementations must
  38   meet the higher standard.
  39 * Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
  40   it is desirable on its own by other implementors.  This to be evaluated.
  41
  42 # Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
  43
  44 This list shows the (direct) equivalence between proposed opcodes and
  45 their Khronos OpenCL equivalents.
  46
  47 See
  48 <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
  49
  50 Special FP16 opcodes are *not* being proposed, except by indirect / inherent
  51 use of the "fmt" field that is already present in the RISC-V Specification.
  52
  53 "Native" opcodes are *not* being proposed: implementors will be expected
  54 to use the (equivalent) proposed opcode covering the same function.
  55
  56 "Fast" opcodes are *not* being proposed, because the Khronos Specification
  57 fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
  58 vectors (or can be done as scalar operations using other RISC-V instructions).
  59
  60 The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
  61 Deviation from conformance with the Khronos Specification - including the
  62 Khronos Specification accuracy requirements - is not an option.
  63
  64 [[!table data="""
  65 Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
  66 FSIN            | sin         | half\_sin   | native\_sin   | NONE        |
  67 FCOS            | cos         | half\_cos   | native\_cos   | NONE        |
  68 FTAN            | tan         | half\_tan   | native\_tan   | NONE        |
  69 NONE (1)        | sincos      | NONE        | NONE          | NONE        |
  70 FASIN           | asin        | NONE        | NONE          | NONE        |
  71 FACOS           | acos        | NONE        | NONE          | NONE        |
  72 FATAN           | atan        | NONE        | NONE          | NONE        |
  73 FSINPI          | sinpi       | NONE        | NONE          | NONE        |
  74 FCOSPI          | cospi       | NONE        | NONE          | NONE        |
  75 FTANPI          | tanpi       | NONE        | NONE          | NONE        |
  76 FASINPI         | asinpi      | NONE        | NONE          | NONE        |
  77 FACOSPI         | acospi      | NONE        | NONE          | NONE        |
  78 FATANPI         | atanpi      | NONE        | NONE          | NONE        |
  79 FSINH           | sinh        | NONE        | NONE          | NONE        |
  80 FCOSH           | cosh        | NONE        | NONE          | NONE        |
  81 FTANH           | tanh        | NONE        | NONE          | NONE        |
  82 FASINH          | asinh       | NONE        | NONE          | NONE        |
  83 FACOSH          | acosh       | NONE        | NONE          | NONE        |
  84 FATANH          | atanh       | NONE        | NONE          | NONE        |
  85 FRSQRT          | rsqrt       | half\_rsqrt | native\_rsqrt | NONE        |
  86 FCBRT           | cbrt        | NONE        | NONE          | NONE        |
  87 FEXP2           | exp2        | half\_exp2  | native\_exp2  | NONE        |
  88 FLOG2           | log2        | half\_log2  | native\_log2  | NONE        |
  89 FEXPM1          | expm1       | NONE        | NONE          | NONE        |
  90 FLOG1P          | log1p       | NONE        | NONE          | NONE        |
  91 FEXP            | exp         | half\_exp   | native\_exp   | NONE        |
  92 FLOG            | log         | half\_log   | native\_log   | NONE        |
  93 FEXP10          | exp10       | half\_exp10 | native\_exp10 | NONE        |
  94 FLOG10          | log10       | half\_log10 | native\_log10 | NONE        |
  95 FATAN2          | atan2       | NONE        | NONE          | NONE        |
  96 FATAN2PI        | atan2pi     | NONE        | NONE          | NONE        |
  97 FPOW            | pow         | NONE        | NONE          | NONE        |
  98 FROOT           | rootn       | NONE        | NONE          | NONE        |
  99 FHYPOT          | hypot       | NONE        | NONE          | NONE        |
 100 FRECIP          | NONE        | half\_recip | native\_recip | NONE        |
 101 """]]
 102
 103 Note (1) FSINCOS is macro-op fused (see below).
 104
 105 # List of 2-arg opcodes
 106
 107 [[!table  data="""
 108 opcode    | Description            | pseudo-code                | Extension   |
 109 FATAN2    | atan2 arc tangent      | rd = atan2(rs2, rs1)       | Zarctrignpi |
 110 FATAN2PI  | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi  | Zarctrigpi  |
 111 FPOW      | x power of y           | rd = pow(rs1, rs2)         | ZftransAdv  |
 112 FROOT     | x power 1/y            | rd = pow(rs1, 1/rs2)       | ZftransAdv  |
 113 FHYPOT    | hypotenuse             | rd = sqrt(rs1^2 + rs2^2)   | Zftrans     |
 114 """]]
 115
 116 # List of 1-arg transcendental opcodes
 117
 118 [[!table  data="""
 119 opcode   | Description              | pseudo-code             | Extension  |
 120 FRSQRT   | Reciprocal Square-root   | rd = sqrt(rs1)          | Zfrsqrt    |
 121 FCBRT    | Cube Root                | rd = pow(rs1, 3)        | Zftrans    |
 122 FRECIP   | Reciprocal               | rd = 1.0 /   rs1        | Zftrans    |
 123 FEXP2    | power-of-2               | rd = pow(2, rs1)        | Zftrans    |
 124 FLOG2    | log2                     | rd = log2(rs1)          | Zftrans    |
 125 FEXPM1   | exponential minus 1      | rd = pow(e, rs1) - 1.0  | Zftrans    |
 126 FLOG1P   | log plus 1               | rd = log(e, 1 + rs1)    | Zftrans    |
 127 FEXP     | exponential              | rd = pow(e, rs1)        | ZftransExt |
 128 FLOG     | natural log (base e)     | rd = log(e, rs1)        | ZftransExt |
 129 FEXP10   | power-of-10              | rd = pow(10, rs1)       | ZftransExt |
 130 FLOG10   | log base 10              | rd = log10(rs1)         | ZftransExt |
 131 """]]
 132
 133 # List of 1-arg trigonometric opcodes
 134
 135 [[!table  data="""
 136 opcode   | Description              | pseudo-code             | Extension |
 137 FSIN     | sin (radians)            | rd = sin(rs1)           | Ztrignpi    |
 138 FCOS     | cos (radians)            | rd = cos(rs1)           | Ztrignpi    |
 139 FTAN     | tan (radians)            | rd = tan(rs1)           | Ztrignpi    |
 140 FASIN    | arcsin (radians)         | rd = asin(rs1)          | Zarctrignpi |
 141 FACOS    | arccos (radians)         | rd = acos(rs1)          | Zarctrignpi |
 142 FATAN    | arctan (radians)         | rd = atan(rs1)          | Zarctrignpi |
 143 FSINPI   | sin times pi             | rd = sin(pi * rs1)      | Ztrigpi |
 144 FCOSPI   | cos times pi             | rd = cos(pi * rs1)      | Ztrigpi |
 145 FTANPI   | tan times pi             | rd = tan(pi * rs1)      | Ztrigpi |
 146 FASINPI  | arcsin / pi              | rd = asin(rs1) / pi     | Zarctrigpi |
 147 FACOSPI  | arccos / pi              | rd = acos(rs1) / pi     | Zarctrigpi |
 148 FATANPI  | arctan / pi              | rd = atan(rs1) / pi     | Zarctrigpi |
 149 FSINH    | hyperbolic sin (radians) | rd = sinh(rs1)          | Zfhyp |
 150 FCOSH    | hyperbolic cos (radians) | rd = cosh(rs1)          | Zfhyp |
 151 FTANH    | hyperbolic tan (radians) | rd = tanh(rs1)          | Zfhyp |
 152 FASINH   | inverse hyperbolic sin   | rd = asinh(rs1)         | Zfhyp |
 153 FACOSH   | inverse hyperbolic cos   | rd = acosh(rs1)         | Zfhyp |
 154 FATANH   | inverse hyperbolic tan   | rd = atanh(rs1)         | Zfhyp |
 155 """]]
 156
 157 # Synthesis, Pseudo-code ops and macro-ops
 158
 159 The pseudo-ops are best left up to the compiler rather than being actual
 160 pseudo-ops, by allocating one scalar FP register for use as a constant
 161 (loop invariant) set to "1.0" at the beginning of a function or other
 162 suitable code block.
 163
 164 * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
 165 * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
 166
 167 FATANPI example pseudo-code:
 168
 169     lui t0, 0x3F800 // upper bits of f32 1.0
 170     fmv.x.s ft0, t0
 171     fatan2pi.s rd, rs1, ft0
 172
 173 Hyperbolic function example (obviates need for Zfhyp except for
 174 high-performance or correctly-rounding):
 175
 176     ASINH( x ) = ln( x + SQRT(x**2+1))
 177
 178 # Reciprocal
 179
 180 Used to be an alias. Some imolementors may wish to implement divide as y times recip(x)
 181
 182 # To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
 183
 184 RISC principle says "exclude LOG because it's covered by LOGP1 plus an ADD".
 185 Research needed to ensure that implementors are not compromised by such
 186 a decision
 187 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
 188
 189 > > correctly-rounded LOG will return different results than LOGP1 and ADD.
 190 > > Likewise for EXP and EXPM1
 191
 192 > ok, they stay in as real opcodes, then.
 193
 194 # ATAN / ATAN2 commentary
 195
 196 Discussion starts here:
 197 <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
 198
 199 from Mitch Alsup:
 200
 201 would like to point out that the general implementations of ATAN2 do a
 202 bunch of special case checks and then simply call ATAN.
 203
 204     double ATAN2( double y, double x )
 205     {   // IEEE 754-2008 quality ATAN2
 206
 207         // deal with NANs
 208         if( ISNAN( x )             ) return x;
 209         if( ISNAN( y )             ) return y;
 210
 211         // deal with infinities
 212         if( x == +∞    && |y|== +∞  ) return copysign(  π/4, y );
 213         if( x == +∞                 ) return copysign(  0.0, y );
 214         if( x == -∞    && |y|== +∞  ) return copysign( 3π/4, y );
 215         if( x == -∞                 ) return copysign(    π, y );
 216         if(               |y|== +∞  ) return copysign(  π/2, y );
 217
 218         // deal with signed zeros
 219         if( x == 0.0  &&  y != 0.0 ) return copysign(  π/2, y );
 220         if( x >=+0.0  &&  y == 0.0 ) return copysign(  0.0, y );
 221         if( x <=-0.0  &&  y == 0.0 ) return copysign(    π, y );
 222
 223         // calculate ATAN2 textbook style
 224         if( x  > 0.0               ) return     ATAN( |y / x| );
 225         if( x  < 0.0               ) return π - ATAN( |y / x| );
 226     }
 227
 228
 229 Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
 230 a constant and then call/use ATAN2.
 231
 232 When one considers an implementation of ATAN, one must consider several
 233 ranges of evaluation::
 234
 235      x  [  -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
 236      x  (-1.0, +1.0]:: ATAN( x ) =      + ATAN(   x );
 237      x  [ 1.0,   +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
 238
 239 I should point out that the add/sub of π/2 can not lose significance
 240 since the result of ATAN(1/x) is bounded 0..π/2
 241
 242 The bottom line is that I think you are choosing to make too many of
 243 these into OpCodes, making the hardware function/calculation unit (and
 244 sequencer) more complicated that necessary.
 245
 246 --------------------------------------------------------
 247
 248 I might suggest that if there were a way for a calculation to be performed
 249 and the result of that calculation
 250
 251 chained to a subsequent calculation such that the precision of the
 252 result-becomes-operand is wider than
 253
 254 what will fit in a register, then you can dramatically reduce the count
 255 of instructions in this category while retaining
 256
 257 acceptable accuracy:
 258
 259      z = x / y
 260
 261 can be calculated as::
 262
 263      z = x * (1/y)
 264
 265 Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
 266 accurate, but GPUs want speed and
 267
 268 1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It
 269 is also not "that inaccurate" displaying 0.625-to-0.52 ULP.
 270
 271 Given that one has the ability to carry (and process) more fraction bits,
 272 one can then do high precision multiplies of  π or other transcendental
 273 radixes.
 274
 275 And GPUs have been doing this almost since the dawn of 3D.
 276
 277     // calculate ATAN2 high performance style
 278     // Note: at this point x != y
 279     //
 280     if( x  > 0.0             )
 281     {
 282         if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
 283         if( y < 0.0 && |y| > |x| ) return       + ATAN( y / x );
 284         if( y > 0.0 && |y| < |x| ) return       + ATAN( y / x );
 285         if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
 286     }
 287     if( x  < 0.0             )
 288     {
 289         if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
 290         if( y < 0.0 && |y| > |x| ) return + π   - ATAN( y / x );
 291         if( y > 0.0 && |y| < |x| ) return + π   - ATAN( y / x );
 292         if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
 293     }
 294
 295 This way the adds and subtracts from the constant are not in a precision
 296 precarious position.