+
# FP Accuracy proposal
-TODO: writeup
+Credits:
+
+* Bruce Hoult
+* Allen Baum
+* Dan Petroski
+* Jacob Lifshay
+
+TODO: complete writeup
+
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>
- A natural place for a standard reduced accuracy extension "Zfpacc"
- would be in the reserved bits of FCSR. It could be treated very
- similarly to how dynamic frm is treated now. Currently, there are 5
- bits of fflags, 3 bits of frm and 24 Reserved bits. The L (decimal
- floating-point) extension will presumably use some, but not all of
- them. I'm unable to find any public proposals for L bit encodings
- in FCSR.
-
- For reference, frm is treated as follows: Floating-point operations
- use either a static rounding mode encoded in the instruction, or
- a dynamic rounding mode held in frm. Rounding modes are encoded
- as shown in Table 11.1. A value of 111 in the instruction’s rm
- field selects the dynamic rounding mode held in frm. If frm is set
- to an invalid value (101–111), any subsequent attempt to execute
- a floating-point operation with a dynamic rounding mode will raise
- an illegal instruction exception.
-
- Let's say that we wish to support up to 4 accuracy modes -- 2 'fam'
- bits. Default would be IEEE-compliant, encoded as 00. This means
- that all current hardware would be compliant with the default mode.
-
- the unsupported modes would cause a trap to allow emulation where
- traps are supported. emulation of unsupported modes would be required
- for unix platforms.
-
- As with frm, an implementation can choose to support any permutation
- of dynamic fam-instruction pairs. It will illegal-instruction
- trap upon executing an unsupported fam-instruction pair.
- The implementation can then emulate the accuracy mode required.
-
- there would be a mechanism for user mode code to detect which modes
- are emulated (csr? syscall?) (if the supervisor decides to make the
- emulation visible) that would allow user code to switch to faster
- software implementations if it chooses to.
-
- If the bits are in FCSR, then the switch itself would be exposed
- to user mode. User-mode would not be able to detect emulation vs
- hardware supported instructions, however (by design). That would
- require some platform-specific code.
-
- Now, which accuracy modes should be included is a question outside
- of my expertise and would require a literature review of instruction
+Zfpacc: a proposal to allow implementations to dynamically set the
+bit-accuracy of floating-point results, trading speed (reduced latency)
+*at runtime* for accuracy (higher latency). IEEE754 format is preserved:
+instruction operand and result format requirements are unmodified by
+this proposal. Only ULP (Unit in Last Place) of the instruction *result*
+is permitted to meet alternative accuracy requirements, whilst still
+retaining the instruction's requested format.
+
+This proposal is *only* suitable for adding pre-existing accuracy standards
+where it is clearly established, well in advance of applications being
+written that conform to that standard, that dealing with variations in
+accuracy across hardware implementations is the responsibility of the
+application writer. This is the case for both Vulkan and OpenCL.
+
+This proposal is *not* suitable for inclusion of "de-facto" (proprietary)
+accuracy standards (historic IBM Mainframe vs Ahmdahl incompatibility)
+where there was no prior agreement or notification to applications
+writers that variations in accuracy across hardware implementations
+would occur. In the unlikely event that they *are* ever to be included
+(n the future, rather than as a Custom Extension, then, unlike Vulkan
+and OpenCL, they must **only** be added as "bit-for-bit compatible".
+
+# Extension of FCSR
+
+Zfpacc would use some of the the reserved bits of FCSR. It would be treated
+very similarly to how dynamic frm is treated.
+
+frm is treated as follows:
+
+* Floating-point operations use either a static rounding mode encoded
+ in the instruction, or a dynamic rounding mode held in frm.
+* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec
+* A value of 111 in the instruction’s rm field selects the dynamic rounding
+ mode held in frm. If frm is set to an invalid value (101–111),
+ any subsequent attempt to execute a floating-point operation with a
+ dynamic rounding mode will raise an illegal instruction exception.
+
+If we wish to support up to 4 accuracy modes, that would require 2 'fam'
+bits. The Default would be IEEE754-compliant, encoded as 00. This means
+that all current hardware would be compliant with the default mode.
+
+Unsupported modes cause a trap to allow emulation where traps are supported.
+Emulation of unsupported modes would be required for UNIX platforms.
+As with frm, an implementation may choose to support any permutation
+of dynamic fam-instruction pairs. It will illegal-instruction trap upon
+executing an unsupported fam-instruction pair. The implementation can
+then emulate the accuracy mode required.
+
+If the bits are in FCSR, then the switch itself would be exposed to
+user mode. User-mode would not be able to detect emulation vs hardware
+supported instructions, however (by design). That would require some
+platform-specific code.
+
+Emulation of unsupported modes would be required for unix platforms.
+
+TODO:
+
+A mechanism for user mode code to detect which modes are emulated
+(csr? syscall?) (if the supervisor decides to make the emulation visible)
+that would allow user code to switch to faster software implementations
+if it chooses to.
+
+TODO:
+
+Choose which accuracy modes are required
+
+ Which accuracy modes should be included is a question outside of
+ my expertise and would require a literature review of instruction
frequency in key workloads, PPA analysis of simple and advanced
- implementations, etc. (Thanks for the insights, Mitch!)
+ implementations, etc.
- emulation of unsupported modes would be required for unix platforms.
+TODO: reduced accuracy
I don't see why Unix should be required to emulate some arbitrary
reduced accuracy ML mode. My guess would be that Unix Platform Spec
accuracy modes is guaranteed (and therefore does not need discovery
sequences), while allowing portable code to execute discovery
sequences to detect support for alternative accuracy modes.
+
+# Dynamic accuracy CSR <a name="dynamic"></a>
+
+FCSR to be modified to include accuracy bits:
+
+| 31....11 | 10..8 | 7..5 | 4....0 |
+| -------- | ------ | ---- | ------ |
+| reserved | facc | frm | fflags |
+
+The values for the field facc to include the following:
+
+| facc | mode | description |
+| ----- | ------- | ------------------- |
+| 0b000 | IEEE754 | correctly rounded |
+| 0b010 | ULP<1 | Unit Last Place < 1 |
+| 0b100 | Vulkan | Vulkan compliant |
+| 0b110 | Appx | Machine Learning
+
+(TODO: review alternative idea: ULP0.5, ULP1, ULP2, ULP4, ULP16)
+
+Notes:
+
+* facc=0 to match current RISC-V behaviour, where these bits were formerly reserved and set to zero.
+* The format of the operands and result remain the same for
+all opcodes. The only change is in the *accuracy* of the result, not
+its format.
+* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than is requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed).
+
+## Discussion
+
+maybe a solution would be to add an extra field to the fp control csr
+to allow selecting one of several accurate or fast modes:
+
+- machine-learning-mode: fast as possible
+ (maybe need additional requirements such as monotonicity for atanh?)
+- GPU-mode: accurate to within a few ULP
+ (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
+- almost-accurate-mode: accurate to <1 ULP
+ (would 0.51 or some other value be better?)
+- fully-accurate-mode: correctly rounded in all cases
+- maybe more modes?
+
+extra mode suggestions:
+
+ it might be reasonable to add a mode saying you're prepared to accept
+ worse then 0.5 ULP accuracy, perhaps with a few options: 1, 2, 4,
+ 16 or something like that.
+
+Question: should better accuracy than is requested be permitted? Example:
+Ahmdahl-370 issues.
+
+Comments:
+
+ Yes, embedded systems typically can do with 12, 16 or 32 bit
+ accuracy. Rarely does it require 64 bits. But the idea of making
+ a low power 32 bit FPU/DSP that can accommodate 64 bits is already
+ being done in other designs such as PIC etc I believe. For embedded
+ graphics 16 bit is more than adequate. In fact, Cornell had a very
+ innovative 18-bit floating point format described here (useful for
+ FPGA designs with 18-bit DSPs):
+
+ <https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>
+
+ A very interesting GPU using the 18-bit FPU is also described here:
+
+ <https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>
+
+ There are also 8 and 9-bit floating point formats that could be useful
+
+ <https://en.wikipedia.org/wiki/Minifloat>
+
+### function accuracy in standards (opencl, vulkan)
+
+[[resources]] for OpenCL and Vulkan
+
+Vulkan requires full ieee754 precision for all F/D instructions except for fdiv and fsqrt.
+
+<https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation>
+
+Source is here:
+<https://github.com/KhronosGroup/Vulkan-Docs/blob/master/appendices/spirvenv.txt#L1172>
+
+OpenCL slightly different, suggest adding as an extra entry.
+
+<https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps>
+
+Link, finds version 2.1 of opencl environment specification, table 8.4.1 however needs checking if it is the same as the above, which has "SPIRV" in it and is 2.2 not 2.1
+
+https://www.google.com/search?q=opencl+environment+specification
+
+2.1 superceded by 2.2
+<https://github.com/KhronosGroup/OpenCL-Docs/blob/master/env/numerical_compliance.asciidoc>
+
+### Compliance
+
+Dan Petroski:
+
+ It’s a bit more complicated than that. Different FP
+ representations/algorithms have different quantization ranges, so you
+ can get more or less precise depending on how large the arguments are.
+
+ For instance, machine A can compute within ULP3 from 0 to 10000, but
+ ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to
+ 6000, then ULP3 for 6000+. How do you design a compliance suite which
+ guarantees behavior across all fpaccs?
+
+and from Allen Baum:
+
+ In the example above, you'd need a ratified spec with the defined
+ ranges (possbily per range and per op) - and then implementations
+ would need to at least meet that spec (but could be more accurate)
+
+ so - not impossible, but a lot more work to write different kinds
+ of tests than standard IEEE compatible test would have.
+
+ And, by the way, if you want it to be a ratified spec, it needs a
+ compliance suite, and whoever has defined the spec is responsible
+ for writing it.,
+