(no commit message)

[libreriscv.git] / zfpacc_proposal.mdwn
diff --git a/zfpacc_proposal.mdwn b/zfpacc_proposal.mdwn

index 68895d7d3b4a8c4ed905f95ec3ce5f89a3688552..8b22582a5d984dacb72bb766a487a30045959e86 100644 (file)
--- a/zfpacc_proposal.mdwn
+++ b/zfpacc_proposal.mdwn
@@ -1,55 +1,90 @@
+
  # FP Accuracy proposal
  
-TODO: writeup
+Credits:
+
+* Bruce Hoult
+* Allen Baum
+* Dan Petroski
+* Jacob Lifshay
+
+TODO: complete writeup
+
  * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
  * <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>
  
-    A natural place for a standard reduced accuracy extension "Zfpacc"
-    would be in the reserved bits of FCSR.  It could be treated very
-    similarly to how dynamic frm is treated now. Currently, there are 5
-    bits of fflags, 3 bits of frm and 24 Reserved bits. The L (decimal
-    floating-point) extension will presumably use some, but not all of
-    them. I'm unable to find any public proposals for L bit encodings
-    in FCSR.
-
-    For reference, frm is treated as follows: Floating-point operations
-    use either a static rounding mode encoded in the instruction, or
-    a dynamic rounding mode held in frm. Rounding modes are encoded
-    as shown in Table 11.1. A value of 111 in the instruction’s rm
-    field selects the dynamic rounding mode held in frm. If frm is set
-    to an invalid value (101–111), any subsequent attempt to execute
-    a floating-point operation with a dynamic rounding mode will raise
-    an illegal instruction exception.
-
-    Let's say that we wish to support up to 4 accuracy modes -- 2 'fam'
-    bits.  Default would be IEEE-compliant, encoded as 00.  This means
-    that all current hardware would be compliant with the default mode.
-
-    the unsupported modes would cause a trap to allow emulation where
-    traps are supported. emulation of unsupported modes would be required
-    for unix platforms.
-
-    As with frm, an implementation can choose to support any permutation
-    of dynamic fam-instruction pairs. It will illegal-instruction
-    trap upon executing an unsupported fam-instruction pair.
-    The implementation can then emulate the accuracy mode required.
-
-    there would be a mechanism for user mode code to detect which modes
-    are emulated (csr? syscall?) (if the supervisor decides to make the
-    emulation visible) that would allow user code to switch to faster
-    software implementations if it chooses to.
-
-    If the bits are in FCSR, then the switch itself would be exposed
-    to user mode.  User-mode would not be able to detect emulation vs
-    hardware supported instructions, however (by design).  That would
-    require some platform-specific code.
-
-    Now, which accuracy modes should be included is a question outside
-    of my expertise and would require a literature review of instruction
+Zfpacc: a proposal to allow implementations to dynamically set the
+bit-accuracy of floating-point results, trading speed (reduced latency)
+*at runtime* for accuracy (higher latency).  IEEE754 format is preserved:
+instruction operand and result format requirements are unmodified by
+this proposal.  Only ULP (Unit in Last Place) of the instruction *result*
+is permitted to meet alternative accuracy requirements, whilst still
+retaining the instruction's requested format.
+
+This proposal is *only* suitable for adding pre-existing accuracy standards
+where it is clearly established, well in advance of applications being
+written that conform to that standard, that dealing with variations in
+accuracy across hardware implementations is the responsibility of the
+application writer.  This is the case for both Vulkan and OpenCL.
+
+This proposal is *not* suitable for inclusion of "de-facto" (proprietary)
+accuracy standards (historic IBM Mainframe vs Ahmdahl incompatibility)
+where there was no prior agreement or notification to applications
+writers that variations in accuracy across hardware implementations
+would occur.  In the unlikely event that they *are* ever to be included
+(n the future, rather than as a Custom Extension, then, unlike Vulkan
+and OpenCL, they must **only** be added as "bit-for-bit compatible".
+
+# Extension of FCSR
+
+Zfpacc would use some of the the reserved bits of FCSR.  It would be treated
+very similarly to how dynamic frm is treated.
+
+frm is treated as follows:
+
+* Floating-point operations use either a static rounding mode encoded
+  in the instruction, or a dynamic rounding mode held in frm.
+* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec
+* A value of 111 in the instruction’s rm field selects the dynamic rounding
+  mode held in frm. If frm is set to an invalid value (101–111),
+  any subsequent attempt to execute a floating-point operation with a
+  dynamic rounding mode will raise an illegal instruction exception.
+
+If we wish to support up to 4 accuracy modes, that would require 2 'fam'
+bits.  The Default would be IEEE754-compliant, encoded as 00.  This means
+that all current hardware would be compliant with the default mode.
+
+Unsupported modes cause a trap to allow emulation where traps are supported.
+Emulation of unsupported modes would be required for UNIX platforms.
+As with frm, an implementation may choose to support any permutation
+of dynamic fam-instruction pairs. It will illegal-instruction trap upon
+executing an unsupported fam-instruction pair.  The implementation can
+then emulate the accuracy mode required.
+
+If the bits are in FCSR, then the switch itself would be exposed to
+user mode.  User-mode would not be able to detect emulation vs hardware
+supported instructions, however (by design).  That would require some
+platform-specific code.
+
+Emulation of unsupported modes would be required for unix platforms.
+
+TODO:
+
+A mechanism for user mode code to detect which modes are emulated
+(csr? syscall?) (if the supervisor decides to make the emulation visible)
+that would allow user code to switch to faster software implementations
+if it chooses to.
+
+TODO:
+
+Choose which accuracy modes are required
+
+    Which accuracy modes should be included is a question outside of
+    my expertise and would require a literature review of instruction
      frequency in key workloads, PPA analysis of simple and advanced
-    implementations, etc.  (Thanks for the insights, Mitch!)
+    implementations, etc.
  
-    emulation of unsupported modes would be required for unix platforms.
+TODO: reduced accuracy
  
      I don't see why Unix should be required to emulate some arbitrary
      reduced accuracy ML mode.  My guess would be that Unix Platform Spec
@@ -60,3 +95,122 @@ TODO: writeup
      accuracy modes is guaranteed (and therefore does not need discovery
      sequences), while allowing portable code to execute discovery
      sequences to detect support for alternative accuracy modes.
+
+# Dynamic accuracy CSR <a name="dynamic"></a>
+
+FCSR to be modified to include accuracy bits:
+
+| 31....11 | 10..8  | 7..5 | 4....0 |
+| -------- | ------ | ---- | ------ |
+| reserved | facc   | frm  | fflags |
+
+The values for the field facc to include the following:
+
+| facc  | mode    | description         |
+| ----- | ------- | ------------------- |
+| 0b000 | IEEE754 | correctly rounded   | 
+| 0b010 | ULP<1   | Unit Last Place < 1 | 
+| 0b100 | Vulkan  | Vulkan compliant    | 
+| 0b110 | Appx    | Machine Learning    
+
+(TODO: review alternative idea: ULP0.5, ULP1, ULP2, ULP4, ULP16)
+
+Notes: 
+
+* facc=0 to match current RISC-V behaviour, where these bits were formerly reserved and set to zero.
+* The format of the operands and result remain the same for
+all opcodes. The only change is in the *accuracy* of the result, not
+its format.
+* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than is requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed).
+
+## Discussion
+
+maybe a solution would be to add an extra field to the fp control csr
+to allow selecting one of several accurate or fast modes:
+
+- machine-learning-mode: fast as possible
+  (maybe need additional requirements such as monotonicity for atanh?)
+- GPU-mode: accurate to within a few ULP
+  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
+- almost-accurate-mode: accurate to <1 ULP
+     (would 0.51 or some other value be better?)
+- fully-accurate-mode: correctly rounded in all cases
+- maybe more modes?
+
+extra mode suggestions:
+
+    it might be reasonable to add a mode saying you're prepared to accept
+    worse then 0.5 ULP accuracy, perhaps with a few options: 1, 2, 4,
+    16 or something like that.
+
+Question: should better accuracy than is requested be permitted? Example:
+Ahmdahl-370 issues.
+
+Comments:
+
+    Yes, embedded systems typically can do with 12, 16 or 32 bit
+    accuracy. Rarely does it require 64 bits. But the idea of making
+    a low power 32 bit FPU/DSP that can accommodate 64 bits is already
+    being done in other designs such as PIC etc I believe. For embedded
+    graphics 16 bit is more than adequate. In fact, Cornell had a very
+    innovative 18-bit floating point format described here (useful for
+    FPGA designs with 18-bit DSPs):
+
+    <https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>
+
+    A very interesting GPU using the 18-bit FPU is also described here:
+
+    <https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>
+
+    There are also 8 and 9-bit floating point formats that could be useful
+
+    <https://en.wikipedia.org/wiki/Minifloat>
+
+### function accuracy in standards (opencl, vulkan)
+
+[[resources]] for OpenCL and Vulkan
+
+Vulkan requires full ieee754 precision for all F/D instructions except for fdiv and fsqrt.
+
+<https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation>
+
+Source is here:
+<https://github.com/KhronosGroup/Vulkan-Docs/blob/master/appendices/spirvenv.txt#L1172>
+
+OpenCL slightly different, suggest adding as an extra entry.
+
+<https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps>
+
+Link, finds version 2.1 of opencl environment specification, table 8.4.1 however needs checking if it is the same as the above, which has "SPIRV" in it and is 2.2 not 2.1
+
+https://www.google.com/search?q=opencl+environment+specification
+
+2.1 superceded by 2.2
+<https://github.com/KhronosGroup/OpenCL-Docs/blob/master/env/numerical_compliance.asciidoc>
+
+### Compliance
+
+Dan Petroski:
+
+    It’s a bit more complicated than that. Different FP
+    representations/algorithms have different quantization ranges, so you
+    can get more or less precise depending on how large the arguments are.
+
+    For instance, machine A can compute within ULP3 from 0 to 10000, but
+    ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to
+    6000, then ULP3 for 6000+. How do you design a compliance suite which
+    guarantees behavior across all fpaccs?
+
+and from Allen Baum:
+
+    In the example above, you'd need a ratified spec with the defined
+    ranges  (possbily per range and per op) - and then implementations
+    would need to at least meet that spec (but could be more accurate)
+
+    so - not impossible, but a lot more work to write different kinds
+    of tests than standard IEEE compatible test would have.
+
+    And, by the way, if you want it to be a ratified spec, it needs a
+    compliance suite, and whoever has defined the spec is responsible
+    for writing it.,
+