add comments from discussion

[libreriscv.git] / ztrans_proposal.mdwn
diff --git a/ztrans_proposal.mdwn b/ztrans_proposal.mdwn

index 6c41aba5ca394186c7fa6e17c024f95b37c3bd74..8ef6063dcc7098ea363f17b72265115501b9b476 100644
--- a/ztrans_proposal.mdwn
+++ b/ztrans_proposal.mdwn
@@ -1,63 +1,540 @@
-# Ztrans - transcendental operations
+# Zftrans - transcendental operations
  
  See:
  
  * <http://bugs.libre-riscv.org/show_bug.cgi?id=127>
  * <https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
+* Discussion: <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002342.html>
+* [[rv_major_opcode_1010011]] for opcode listing.
+* [[zfpacc_proposal]] for accuracy settings proposal
  
  Extension subsets:
  
-* **Ztrans**: standard transcendentals (best suited to 3D)
-* **ZtransExt**: extra functions (useful, not generally needed for 3D)
-* **ZtransAdv**: much more complex to implement in hardware
+* **Zftrans**: standard transcendentals (best suited to 3D)
+* **ZftransExt**: extra functions (useful, not generally needed for 3D,
+  can be synthesised using Zftrans)
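+* **Ztrigpi**: trigonometric functions with arguments scaled by pi:
+  sinpi, cospi, tanpi
+* **Ztrignpi**: trigonometric functions with arguments in radians:
+  sin, cos, tan
+* **Zarctrigpi**: inverse trigonometric functions with results scaled by
+  1/pi: atan2pi, asinpi, acospi
+* **Zarctrignpi**: inverse trigonometric functions with results in radians:
+  atan2, asin, acos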
+* **Zfhyp**: hyperbolic/inverse-hyperbolic.  sinh, cosh, tanh, asinh,
+  acosh, atanh (can be synthesised - see below)
+* **ZftransAdv**: much more complex to implement in hardware
+* **Zfrsqrt**: Reciprocal square-root.
+
+Minimum recommended requirements for 3D: Zftrans, Ztrigpi, Zarctrigpi,
+Zarctrignpi
  
  [[!toc levels=2]]
  
+# TODO:
+
+* Decision on accuracy, moved to [[zfpacc_proposal]]
+<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002355.html>
+* Errors **MUST** be repeatable.
+* How about four Platform Specifications? 3DUNIX, UNIX, 3DEmbedded and Embedded?
+<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002361.html>
+  Accuracy requirements for dual (triple) purpose implementations must
+  meet the higher standard.
+* Reciprocal Square-root is in its own separate extension (Zfrsqrt) as
+  it is desirable on its own by other implementors.  This is to be evaluated.
+
+# Requirements <a name="requirements"></a>
+
+This proposal is designed to meet a wide range of extremely diverse needs,
+allowing implementors from all of them to benefit from the tools and hardware
+cost reductions associated with common standards adoption.
+
+**There are *four* different, disparate platforms whose needs must be met (two of them new)**:
+
+* 3D Embedded Platform
+* Embedded Platform
+* 3D UNIX Platform
+* UNIX Platform
+
+**The use-cases are**:
+
+* 3D GPUs
+* Numerical Computation
+* (Potentially) A.I. / Machine-learning (1)
+
+(1) although approximations suffice in this field, making it more likely
+to use a custom extension.  High-end ML would, by its nature, be excluded.
+
+**The power and die-area requirements vary from**:
+
+* Ultra-low-power (smartwatches where GPU power budgets are in milliwatts)
+* Mobile-Embedded (good performance with high efficiency for battery life)
+* Desktop Computing
+* Server / HPC (2)
+
+(2) Supercomputing is left out of the requirements as it is traditionally
+covered by Supercomputer Vectorisation Standards (such as RVV).
+
+**The software requirements are**:
+
+* Full public integration into GNU math libraries (libm)
+* Full public integration into well-known Numerical Computation systems (numpy)
+* Full public integration into upstream GNU and LLVM Compiler toolchains
+* Full public integration into Khronos OpenCL SPIR-V compatible Compilers
+  seeking public Certification and Endorsement from the Khronos Group
+  under their Trademarked Certification Programme.
+
+**The "contra"-requirements are**:
+
+* The requirements are **not** for the purposes of developing a full custom
+  proprietary GPU with proprietary firmware.
+* A full custom proprietary GPU ASIC Manufacturer *may* benefit from
+  this proposal.  However, they typically develop proprietary software
+  that is not shared with the rest of the community likely to use this
+  proposal, which means that they have completely different needs.
+* This proposal is for *sharing* of effort in reducing development costs.
+
+# Requirements Analysis <a name="requirements_analysis"></a>
+
+**Platforms**:
+
+3D Embedded will require significantly less accuracy and will need to make
+power budget and die area compromises that other platforms (including Embedded)
+will not need to make.
+
+3D UNIX Platform has to be performance-price-competitive: subtly-reduced
+accuracy in FP32 is acceptable where, conversely, in the UNIX Platform,
+IEEE754 compliance is a hard requirement that would compromise power
+and efficiency on a 3D UNIX Platform.
+
+Even in the Embedded platform, IEEE754 interoperability is beneficial,
+whereas if it were a hard requirement the 3D Embedded platform would be severely
+compromised in its ability to meet the demanding power budgets of that market.
+
+Thus, learning from the lessons of
+[SIMD considered harmful](https://www.sigarch.org/simd-instructions-considered-harmful/),
+this proposal works in conjunction with the [[zfpacc_proposal]], so as
+not to overburden the OP32 ISA space with extra "reduced-accuracy" opcodes.
+
+**Use-cases**:
+
+There really is little else in the way of suitable markets.  3D GPUs
+have extremely competitive power-efficiency and power-budget requirements
+that are completely at odds with the other market at the other end of
+the spectrum: Numerical Computation.
+
+Interoperability in Numerical Computation is absolutely critical: it implies
+IEEE754 compliance.  However, full IEEE754 compliance automatically and
+inherently penalises a GPU, where that level of accuracy is simply not necessary.
+
+To meet the needs of both markets, the two new platforms have to be created,
+and [[zfpacc_proposal]] is a critical dependency.  Runtime selection of
+FP accuracy allows an implementation to be "Hybrid" - cover UNIX IEEE754
+compliance *and* 3D performance in a single ASIC.
+
+**Power and die-area requirements**:
+
+This is where the conflicts really start to hit home.
+
+A "Numerical High performance only" proposal (suitable for Server / HPC
+only) would customise and target the Extension based on a quantitative
+analysis of the value of certain opcodes *for HPC only*.  It would
+conclude, reasonably and rationally, that it is worthwhile adding opcodes
+to RVV as parallel Vector operations, and that further discussion of
+the matter is pointless.
+
+A "Proprietary GPU effort" (even one that was intended for publication
+of its API through, for example, a public libre-licensed Vulkan SPIR-V
+Compiler) would conclude, reasonably and rationally, that, likewise, the
+opcodes were best suited to be added to RVV, and, further, that their
+requirements conflict with the HPC world, due to the reduced accuracy.
+This on the basis that the silicon die area required for IEEE754 is far
+greater than that needed for reduced-accuracy, and thus their product would
+be completely unacceptable in the market.
+
+An "Embedded 3D" GPU has radically different performance, power
+and die-area requirements (and may even target SoftCores in FPGA).
+Sharing of the silicon to cover multi-function uses (CORDIC for example)
+is absolutely essential in order to keep cost and power down, and high
+performance simply is not.  Multi-cycle FSMs instead of pipelines may
+be considered acceptable, and so on.  Subsets of functionality are
+also essential.
+
+An "Embedded Numerical" platform has requirements that are separate and
+distinct from all of the above!
+
+Mobile Computing needs (tablets, smartphones) again pull in a different
+direction: high performance, reasonable accuracy, but efficiency is
+critical.  Screen sizes are not at the 4K range: they are within the
+800x600 range at the low end (320x240 at the extreme budget end), and
+only the high-performance smartphones and tablets provide 1080p (1920x1080).
+With lower resolution, accuracy compromises are possible which the Desktop
+market (4k and soon to be above) would find unacceptable.
+
+Meeting these disparate markets may be achieved, again, through
+[[zfpacc_proposal]], by subdividing into four platforms, yet, in addition
+to that, subdividing the extension into subsets that best suit the different
+market areas.
+
+**Software requirements**:
+
+A "custom" extension is developed in near-complete isolation from the
+rest of the RISC-V Community.  Cost savings to the Corporation are
+large, with no direct beneficial feedback to (or impact on) the rest
+of the RISC-V ecosystem.
+
+However given that 3D revolves around Standards - DirectX, Vulkan, OpenGL,
+OpenCL - users have much more influence than first appears.  Compliance
+with these standards is critical as the userbase (Games writers, scientific
+applications) expects not to have to rewrite large codebases to conform
+with non-standards-compliant hardware.
+
+Therefore, compliance with public APIs is paramount, and compliance with
+Trademarked Standards is critical.  Any deviation from Trademarked Standards
+means that an implementation may not be sold and also make a claim of being,
+for example, "Vulkan compatible".
+
+This in turn reinforces and makes a hard requirement a need for public
+compliance with such standards, over-and-above what would otherwise be
+set by a RISC-V Standards Development Process, including both the
+software compliance and the knock-on implications that has for hardware.
+
+**Collaboration**:
+
+The case for collaboration on any Extension is already well-known.
+In this particular case, the precedent for inclusion of Transcendentals
+in other ISAs, both from Graphics and High-performance Computing, has
+these primitives well-established in high-profile software libraries and
+compilers in both GPU and HPC Computer Science divisions.  Collaboration
+and shared public compliance with those standards brooks no argument.
+
+*Overall this proposal is categorically and wholly unsuited to
+relegation to "custom" status*.
+
+# Quantitative Analysis
+
+This is extremely challenging.  Normally, an Extension would require full,
+comprehensive and detailed analysis of every single instruction, for every
+single possible use-case, in every single market.  The amount of silicon
+area required would be balanced against the benefits of introducing extra
+opcodes, as well as a full market analysis performed to see which divisions
+of Computer Science benefit from the introduction of the instruction,
+in each and every case.
+
+With 34 instructions, four possible Platforms, and sub-categories of
+implementations even within each Platform, carrying out the resulting
+136 or more (34 x 4) separate and distinct analyses is not a practical
+proposition.
+
+A little more intelligence has to be applied to the problem space,
+to reduce it down to manageable levels.
+
+Fortunately, the subdivision by Platform, in combination with the
+identification of only two primary markets (Numerical Computation and
+3D), means that the logical reasoning applies *uniformly* and broadly
+across *groups* of instructions rather than individually.
+
+In addition, hardware algorithms such as CORDIC can cover such a wide
+range of operations (simply by changing the input parameters) that the
+normal argument of compromising and excluding certain opcodes because they
+would significantly increase the silicon area is knocked down.
+
+However, CORDIC, whilst space-efficient, and thus well-suited to
+Embedded, is an old iterative algorithm not well-suited to High-Performance
+Computing or Mid to High-end GPUs, where commercially-competitive
+FP32 pipeline lengths are only around 5 stages.
+
+Not only that, but some operations such as LOG1P, which would normally
+be excluded from one market (due to there being an alternative macro-op
+fused sequence replacing it) are required for other markets due to
+the higher accuracy obtainable at the lower range of input values when
+compared to LOG(1+P).
+
+ATAN and ATAN2 are another example area in which one market's needs
+conflict directly with another's: the only viable solution, without compromising
+one market to the detriment of the other, is to provide both opcodes
+and let implementors make the call as to which (or both) to optimise.
+
+Likewise it is well-known that loops involving "0 to 2 times pi", often
+done in subdivisions of powers of two, are costly to do because they
+involve floating-point multiplication by PI in each and every loop.
+3D GPUs solved this by providing SINPI variants which range from 0 to 1
+and perform the multiply *inside* the hardware itself.  In the case of
+CORDIC, it turns out that the multiply by PI is not even needed (it is
+a loop-invariant magic constant).
+
+However, some markets may not be able to *use* CORDIC, for reasons
+mentioned above, and, again, one market would be penalised if SINPI
+was prioritised over SIN, or vice-versa.
+
+Thus the best that can be done is to use Quantitative Analysis to work
+out which "subsets" - sub-Extensions - to include, and be as "inclusive"
+as possible, and thus allow implementors to decide what to add to their
+implementation, and how best to optimise them.
+
+# Proposed Opcodes vs Khronos OpenCL Opcodes <a name="khronos_equiv"></a>
+
+This list shows the (direct) equivalence between proposed opcodes and
+their Khronos OpenCL equivalents.
+
+See
+<https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html>
+
+Special FP16 opcodes are *not* being proposed, except by indirect / inherent
+use of the "fmt" field that is already present in the RISC-V Specification.
+
+"Native" opcodes are *not* being proposed: implementors will be expected
+to use the (equivalent) proposed opcode covering the same function.
+
+"Fast" opcodes are *not* being proposed, because the Khronos Specification
+fast\_length, fast\_normalise and fast\_distance OpenCL opcodes require
+vectors (or can be done as scalar operations using other RISC-V instructions).
+
+The OpenCL FP32 opcodes are **direct** equivalents to the proposed opcodes.
+Deviation from conformance with the Khronos Specification - including the
+Khronos Specification accuracy requirements - is not an option.
+
+[[!table data="""
+Proposed opcode | OpenCL FP32 | OpenCL FP16 | OpenCL native | OpenCL fast |
+FSIN            | sin         | half\_sin   | native\_sin   | NONE        |
+FCOS            | cos         | half\_cos   | native\_cos   | NONE        |
+FTAN            | tan         | half\_tan   | native\_tan   | NONE        |
+NONE (1)        | sincos      | NONE        | NONE          | NONE        |
+FASIN           | asin        | NONE        | NONE          | NONE        |
+FACOS           | acos        | NONE        | NONE          | NONE        |
+FATAN           | atan        | NONE        | NONE          | NONE        |
+FSINPI          | sinpi       | NONE        | NONE          | NONE        |
+FCOSPI          | cospi       | NONE        | NONE          | NONE        |
+FTANPI          | tanpi       | NONE        | NONE          | NONE        |
+FASINPI         | asinpi      | NONE        | NONE          | NONE        |
+FACOSPI         | acospi      | NONE        | NONE          | NONE        |
+FATANPI         | atanpi      | NONE        | NONE          | NONE        |
+FSINH           | sinh        | NONE        | NONE          | NONE        |
+FCOSH           | cosh        | NONE        | NONE          | NONE        |
+FTANH           | tanh        | NONE        | NONE          | NONE        |
+FASINH          | asinh       | NONE        | NONE          | NONE        |
+FACOSH          | acosh       | NONE        | NONE          | NONE        |
+FATANH          | atanh       | NONE        | NONE          | NONE        |
+FRSQRT          | rsqrt       | half\_rsqrt | native\_rsqrt | NONE        |
+FCBRT           | cbrt        | NONE        | NONE          | NONE        |
+FEXP2           | exp2        | half\_exp2  | native\_exp2  | NONE        |
+FLOG2           | log2        | half\_log2  | native\_log2  | NONE        |
+FEXPM1          | expm1       | NONE        | NONE          | NONE        |
+FLOG1P          | log1p       | NONE        | NONE          | NONE        |
+FEXP            | exp         | half\_exp   | native\_exp   | NONE        |
+FLOG            | log         | half\_log   | native\_log   | NONE        |
+FEXP10          | exp10       | half\_exp10 | native\_exp10 | NONE        |
+FLOG10          | log10       | half\_log10 | native\_log10 | NONE        |
+FATAN2          | atan2       | NONE        | NONE          | NONE        |
+FATAN2PI        | atan2pi     | NONE        | NONE          | NONE        |
+FPOW            | pow         | NONE        | NONE          | NONE        |
+FROOT           | rootn       | NONE        | NONE          | NONE        |
+FHYPOT          | hypot       | NONE        | NONE          | NONE        |
+FRECIP          | NONE        | half\_recip | native\_recip | NONE        |
+"""]]
+
+Note (1) FSINCOS is macro-op fused (see below).
+
  # List of 2-arg opcodes
  
  [[!table  data="""
-opcode    | Description           | pseudo-code                | Extension |
-FATAN2    | atan2 arc tangent     | rd = atan2(rs2, rs1)       | Ztrans    |
-FATAN2PI  | atan arc tangent / pi | rd = atan2(rs2, rs1) / pi  | ZtransExt |
-FPOW      | x power of y          | rd = pow(rs1, rs2)         | ZtransAdv |
-FROOT     | x power 1/y           | rd = pow(rs1, 1/rs2)       | ZtransAdv |
+opcode    | Description            | pseudo-code                | Extension   |
+FATAN2    | atan2 arc tangent      | rd = atan2(rs2, rs1)       | Zarctrignpi |
+FATAN2PI  | atan2 arc tangent / pi | rd = atan2(rs2, rs1) / pi  | Zarctrigpi  |
+FPOW      | x power of y           | rd = pow(rs1, rs2)         | ZftransAdv  |
+FROOT     | x power 1/y            | rd = pow(rs1, 1/rs2)       | ZftransAdv  |
+FHYPOT    | hypotenuse             | rd = sqrt(rs1^2 + rs2^2)   | Zftrans     |
  """]]
  
-# List of 1-arg opcodes
+# List of 1-arg transcendental opcodes
  
  [[!table  data="""
-opcode   | Description              | pseudo-code             | Extension |
-FCBRT    | Cube Root                | rd = pow(rs1, 3)        | Ztrans    |
-FEXP2    | power-of-2               | rd = pow(2, rs1)        | Ztrans    |
-FLOG2    | log2                     | rd = log2(rs1)          | Ztrans    |
-FEXPM1   | exponent minus 1         | rd = pow(e, rs1) - 1.0  | Ztrans    |
-FLOG1P   | log plus 1               | rd = log(e, 1 + rs1)    | Ztrans    |
-FEXP     | exponent                 | rd = pow(e, rs1)        | ZtransExt |
-FLOG     | natural log (base e)     | rd = log(e, rs1)        | ZtransExt |
-FEXP10   | power-of-10              | rd = pow(10, rs1)       | ZtransExt |
-FLOG10   | log base 10              | rd = log10(rs1)         | ZtransExt |
-FSIN     | sin (radians)            |                         | Ztrans    |
-FCOS     | cos (radians)            |                         | Ztrans    |
-FTAN     | tan (radians)            |                         | Ztrans    |
-FASIN    | arcsin (radians)         | rd = asin(rs1)          | Ztrans    |
-FACOS    | arccos (radians)         | rd = acos(rs1)          | Ztrans    |
-FATAN    | arctan (radians)         | rd = atan(rs1)          | Ztrans    |
-FSINPI   | sin times pi             | rd = sin(pi * rs1)      | ZtransExt |
-FCOSPI   | cos times pi             | rd = cos(pi * rs1)      | ZtransExt |
-FTANPI   | tan times pi             | rd = tan(pi * rs1)      | ZtransExt |
-FSINH    | hyperbolic sin (radians) |                         | ZtransExt |
-FCOSH    | hyperbolic cos (radians) |                         | ZtransExt |
-FTANH    | hyperbolic tan (radians) |                         | ZtransExt |
-FASINH   | inverse hyperbolic sin   |                         | ZtransExt |
-FACOSH   | inverse hyperbolic cos   |                         | ZtransExt |
-FATANH   | inverse hyperbolic tan   |                         | ZtransExt |
+opcode   | Description              | pseudo-code             | Extension  |
+FRSQRT   | Reciprocal Square-root   | rd = 1.0 / sqrt(rs1)    | Zfrsqrt    |
+FCBRT    | Cube Root                | rd = pow(rs1, 1.0 / 3)  | Zftrans    |
+FRECIP   | Reciprocal               | rd = 1.0 / rs1          | Zftrans    |
+FEXP2    | power-of-2               | rd = pow(2, rs1)        | Zftrans    |
+FLOG2    | log2                     | rd = log(2, rs1)        | Zftrans    |
+FEXPM1   | exponential minus 1      | rd = pow(e, rs1) - 1.0  | Zftrans    |
+FLOG1P   | log plus 1               | rd = log(e, 1 + rs1)    | Zftrans    |
+FEXP     | exponential              | rd = pow(e, rs1)        | ZftransExt |
+FLOG     | natural log (base e)     | rd = log(e, rs1)        | ZftransExt |
+FEXP10   | power-of-10              | rd = pow(10, rs1)       | ZftransExt |
+FLOG10   | log base 10              | rd = log(10, rs1)       | ZftransExt |
  """]]
  
-# Pseudo-code ops
+# List of 1-arg trigonometric opcodes
+
+[[!table  data="""
+opcode      | Description              | pseudo-code             | Extension   |
+FSIN        | sin (radians)            | rd = sin(rs1)           | Ztrignpi    |
+FCOS        | cos (radians)            | rd = cos(rs1)           | Ztrignpi    |
+FTAN        | tan (radians)            | rd = tan(rs1)           | Ztrignpi    |
+FASIN       | arcsin (radians)         | rd = asin(rs1)          | Zarctrignpi |
+FACOS       | arccos (radians)         | rd = acos(rs1)          | Zarctrignpi |
+FATAN (1)   | arctan (radians)         | rd = atan(rs1)          | Zarctrignpi |
+FSINPI      | sin times pi             | rd = sin(pi * rs1)      | Ztrigpi     |
+FCOSPI      | cos times pi             | rd = cos(pi * rs1)      | Ztrigpi     |
+FTANPI      | tan times pi             | rd = tan(pi * rs1)      | Ztrigpi     |
+FASINPI     | arcsin / pi              | rd = asin(rs1) / pi     | Zarctrigpi  |
+FACOSPI     | arccos / pi              | rd = acos(rs1) / pi     | Zarctrigpi  |
+FATANPI (1) | arctan / pi              | rd = atan(rs1) / pi     | Zarctrigpi  |
+FSINH       | hyperbolic sin (radians) | rd = sinh(rs1)          | Zfhyp       |
+FCOSH       | hyperbolic cos (radians) | rd = cosh(rs1)          | Zfhyp       |
+FTANH       | hyperbolic tan (radians) | rd = tanh(rs1)          | Zfhyp       |
+FASINH      | inverse hyperbolic sin   | rd = asinh(rs1)         | Zfhyp       |
+FACOSH      | inverse hyperbolic cos   | rd = acosh(rs1)         | Zfhyp       |
+FATANH      | inverse hyperbolic tan   | rd = atanh(rs1)         | Zfhyp       |
+"""]]
+
+Note (1): FATAN/FATANPI are pseudo-ops expanding to FATAN2/FATAN2PI (needs deciding).
+
+# Synthesis, Pseudo-code ops and macro-ops
+
+The pseudo-ops are best left up to the compiler rather than being defined
+as actual pseudo-ops: the compiler allocates one scalar FP register for use
+as a constant (loop invariant) set to "1.0" at the beginning of a function
+or other suitable code block.
  
-* FRCP rd, rs1 - pseudo-code alias for rd = 1.0 / rs1
-* FATAN - pseudo-code alias for rd = atan2(rs1, 1.0) - FATAN2
-* FATANPI - pseudo alias for rd = atan2pi(rs1, 1.0) - FATAN2PI
  * FSINCOS - fused macro-op between FSIN and FCOS (issued in that order).
  * FSINCOSPI - fused macro-op between FSINPI and FCOSPI (issued in that order).
  
+FATANPI example pseudo-code:
+
+    lui t0, 0x3F800 // upper bits of f32 1.0
+    fmv.s.x ft0, t0 // move the bit pattern from integer t0 into FP ft0
+    fatan2pi.s rd, rs1, ft0
+
+Hyperbolic function example (obviates need for Zfhyp except for
+high-performance or correctly-rounded implementations):
+
+    ASINH( x ) = ln( x + SQRT(x**2+1))
+
+# Reciprocal
+
+Used to be an alias. Some implementors may wish to implement divide as y times recip(x).
+
+# To evaluate: should LOG be replaced with LOG1P (and EXP with EXPM1)?
+
+RISC principle says "exclude LOG because it's covered by LOG1P plus an ADD".
+Research needed to ensure that implementors are not compromised by such
+a decision:
+<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002358.html>
+
+> > correctly-rounded LOG will return different results than LOGP1 and ADD.
+> > Likewise for EXP and EXPM1
+
+> ok, they stay in as real opcodes, then.
+
+# ATAN / ATAN2 commentary
+
+Discussion starts here:
+<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002470.html>
+
+from Mitch Alsup:
+
+would like to point out that the general implementations of ATAN2 do a
+bunch of special case checks and then simply call ATAN.
+
+    double ATAN2( double y, double x )
+    {   // IEEE 754-2008 quality ATAN2
+
+        // deal with NANs
+        if( ISNAN( x )             ) return x;
+        if( ISNAN( y )             ) return y;
+
+        // deal with infinities
+        if( x == +∞    && |y|== +∞  ) return copysign(  π/4, y );
+        if( x == +∞                 ) return copysign(  0.0, y );
+        if( x == -∞    && |y|== +∞  ) return copysign( 3π/4, y );
+        if( x == -∞                 ) return copysign(    π, y );
+        if(               |y|== +∞  ) return copysign(  π/2, y );
+
+        // deal with signed zeros
+        if( x == 0.0  &&  y != 0.0 ) return copysign(  π/2, y );
+        if( x >=+0.0  &&  y == 0.0 ) return copysign(  0.0, y );
+        if( x <=-0.0  &&  y == 0.0 ) return copysign(    π, y );
+
+        // calculate ATAN2 textbook style
+        if( x  > 0.0               ) return     ATAN( |y / x| );
+        if( x  < 0.0               ) return π - ATAN( |y / x| );
+    }
+
+
+Yet the proposed encoding makes ATAN2 the primitive and has ATAN invent
+a constant and then call/use ATAN2.
+
+When one considers an implementation of ATAN, one must consider several
+ranges of evaluation:
+
+     x  [  -∞, -1.0]:: ATAN( x ) = -π/2 + ATAN( 1/x );
+     x  (-1.0, +1.0]:: ATAN( x ) =      + ATAN(   x );
+     x  [ 1.0,   +∞]:: ATAN( x ) = +π/2 - ATAN( 1/x );
+
+I should point out that the add/sub of π/2 can not lose significance
+since the result of ATAN(1/x) is bounded 0..π/2
+
+The bottom line is that I think you are choosing to make too many of
+these into OpCodes, making the hardware function/calculation unit (and
+sequencer) more complicated than necessary.
+
+--------------------------------------------------------
+
+We therefore, I think, have a case for bringing back ATAN and including ATAN2.
+
+The reason is that whilst a microcode-like GPU-centric platform would do ATAN2 in terms of ATAN, a UNIX-centric platform would do it the other way round.
+
+(that is the hypothesis, to be evaluated for correctness. feedback requested).
+
+This is because we cannot compromise or prioritise one platform's speed/accuracy over another. It is neither reasonable nor desirable to penalise one implementor over another.
+
+Thus, all implementors, to keep interoperability, must have both opcodes and may choose, at the architectural and routing level, which one to implement in terms of the other.
+
+Allowing implementors to choose to add either opcode and let traps sort it out leaves an uncertainty in the software developer's mind: they cannot trust the hardware, available from many vendors, to be performant right across the board.
+
+Standards are a pig.
+
+---
+
+I might suggest that if there were a way for a calculation to be performed
+and the result of that calculation chained to a subsequent calculation
+such that the precision of the result-becomes-operand is wider than
+what will fit in a register, then you can dramatically reduce the count
+of instructions in this category while retaining acceptable accuracy:
+
+     z = x / y
+
+can be calculated as:
+
+     z = x * (1/y)
+
+Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008
+accurate, but GPUs want speed and 1/y is fully pipelined (F32) while x/y
+cannot be (at reasonable area). It is also not "that inaccurate" displaying
+0.625-to-0.52 ULP.
+
+Given that one has the ability to carry (and process) more fraction bits,
+one can then do high precision multiplies of  π or other transcendental
+radixes.
+
+And GPUs have been doing this almost since the dawn of 3D.
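
The idea of chaining a wider-than-register reciprocal into the multiply can be modelled roughly in C. In this sketch, which is an assumption for illustration only, a `double` intermediate stands in for the extra fraction bits carried between the two operations; the function names are hypothetical and no claim is made about matching the ULP figures quoted above.

```c
/* Reciprocal rounded to FP32 before the multiply: register-width chaining.
   The volatile store forces the intermediate to be rounded to FP32. */
static float div_recip_narrow(float x, float y)
{
    volatile float r = 1.0f / y;
    return x * r;
}

/* Reciprocal kept in double: a stand-in for a wider internal result
   that is fed straight into the multiplier and rounded only once to FP32. */
static float div_recip_wide(float x, float y)
{
    double r = 1.0 / (double)y;
    return (float)((double)x * r);
}
```

Comparing both helpers against `x / y` over a range of inputs gives a rough feel for how much the extra fraction bits buy.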
+
+    // calculate ATAN2 high performance style
+    // Note: at this point x != y
+    //
+    if( x  > 0.0             )
+    {
+        if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
+        if( y < 0.0 && |y| > |x| ) return       + ATAN( y / x );
+        if( y > 0.0 && |y| < |x| ) return       + ATAN( y / x );
+        if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
+    }
+    if( x  < 0.0             )
+    {
+        if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
+        if( y < 0.0 && |y| > |x| ) return + π   - ATAN( y / x );
+        if( y > 0.0 && |y| < |x| ) return + π   - ATAN( y / x );
+        if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
+    }
+
+This way the adds and subtracts from the constant are not in a precision
+precarious position.