X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=zfpacc_proposal.mdwn;h=8b22582a5d984dacb72bb766a487a30045959e86;hb=3fca0b5a420965cc7c5a8370941f0ab3ecb8ba5f;hp=23feec9c18c7df5965f242d08eed1da1fd9ccb74;hpb=c605bd3e2ea172a47de2bec63160771f2361ff25;p=libreriscv.git diff --git a/zfpacc_proposal.mdwn b/zfpacc_proposal.mdwn index 23feec9c1..8b22582a5 100644 --- a/zfpacc_proposal.mdwn +++ b/zfpacc_proposal.mdwn @@ -1,54 +1,90 @@ + # FP Accuracy proposal -TODO: writeup - - - A natural place for a standard reduced accuracy extension "Zfpacc" - would be in the reserved bits of FCSR. It could be treated very - similarly to how dynamic frm is treated now. Currently, there are 5 - bits of fflags, 3 bits of frm and 24 Reserved bits. The L (decimal - floating-point) extension will presumably use some, but not all of - them. I'm unable to find any public proposals for L bit encodings - in FCSR. - - For reference, frm is treated as follows: Floating-point operations - use either a static rounding mode encoded in the instruction, or - a dynamic rounding mode held in frm. Rounding modes are encoded - as shown in Table 11.1. A value of 111 in the instruction’s rm - field selects the dynamic rounding mode held in frm. If frm is set - to an invalid value (101–111), any subsequent attempt to execute - a floating-point operation with a dynamic rounding mode will raise - an illegal instruction exception. - - Let's say that we wish to support up to 4 accuracy modes -- 2 'fam' - bits. Default would be IEEE-compliant, encoded as 00. This means - that all current hardware would be compliant with the default mode. - - the unsupported modes would cause a trap to allow emulation where - traps are supported. emulation of unsupported modes would be required - for unix platforms. - - As with frm, an implementation can choose to support any permutation - of dynamic fam-instruction pairs. It will illegal-instruction - trap upon executing an unsupported fam-instruction pair. 
- The implementation can then emulate the accuracy mode required. - - there would be a mechanism for user mode code to detect which modes - are emulated (csr? syscall?) (if the supervisor decides to make the - emulation visible) that would allow user code to switch to faster - software implementations if it chooses to. - - If the bits are in FCSR, then the switch itself would be exposed - to user mode. User-mode would not be able to detect emulation vs - hardware supported instructions, however (by design). That would - require some platform-specific code. - - Now, which accuracy modes should be included is a question outside - of my expertise and would require a literature review of instruction +Credits: + +* Bruce Hoult +* Allen Baum +* Dan Petroski +* Jacob Lifshay + +TODO: complete writeup + +* +* + +Zfpacc: a proposal to allow implementations to dynamically set the +bit-accuracy of floating-point results, trading speed (reduced latency) +*at runtime* for accuracy (higher latency). IEEE754 format is preserved: +instruction operand and result format requirements are unmodified by +this proposal. Only the ULP (Unit in Last Place) accuracy of the instruction *result* +is permitted to meet alternative accuracy requirements, whilst still +retaining the instruction's requested format. + +This proposal is *only* suitable for adding pre-existing accuracy standards +where it is clearly established, well in advance of applications being +written that conform to that standard, that dealing with variations in +accuracy across hardware implementations is the responsibility of the +application writer. This is the case for both Vulkan and OpenCL. + +This proposal is *not* suitable for inclusion of "de-facto" (proprietary) +accuracy standards (historic IBM Mainframe vs Amdahl incompatibility) +where there was no prior agreement or notification to application +writers that variations in accuracy across hardware implementations +would occur. 
In the unlikely event that they *are* ever to be included +in the future, rather than as a Custom Extension, then, unlike Vulkan +and OpenCL, they must **only** be added as "bit-for-bit compatible". + +# Extension of FCSR + +Zfpacc would use some of the reserved bits of FCSR. It would be treated +very similarly to how dynamic frm is treated. + +frm is treated as follows: + +* Floating-point operations use either a static rounding mode encoded + in the instruction, or a dynamic rounding mode held in frm. +* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec. +* A value of 111 in the instruction’s rm field selects the dynamic rounding + mode held in frm. If frm is set to an invalid value (101–111), + any subsequent attempt to execute a floating-point operation with a + dynamic rounding mode will raise an illegal instruction exception. + +If we wish to support up to 4 accuracy modes, that would require 2 'fam' +bits. The default would be IEEE754-compliant, encoded as 00. This means +that all current hardware would be compliant with the default mode. + +Unsupported modes cause a trap to allow emulation where traps are supported. +Emulation of unsupported modes would be required for UNIX platforms. +As with frm, an implementation may choose to support any permutation +of dynamic fam-instruction pairs. It will raise an illegal-instruction trap upon +executing an unsupported fam-instruction pair. The implementation can +then emulate the accuracy mode required. + +If the bits are in FCSR, then the switch itself would be exposed to +user mode. User-mode would not be able to detect emulation vs hardware +supported instructions, however (by design). That would require some +platform-specific code. + +Emulation of unsupported modes would be required for UNIX platforms. + +TODO: + +A mechanism for user mode code to detect which modes are emulated +(csr? syscall?) 
(if the supervisor decides to make the emulation visible) +that would allow user code to switch to faster software implementations +if it chooses to. + +TODO: + +Choose which accuracy modes are required + + Which accuracy modes should be included is a question outside of + my expertise and would require a literature review of instruction frequency in key workloads, PPA analysis of simple and advanced - implementations, etc. (Thanks for the insights, Mitch!) + implementations, etc. - emulation of unsupported modes would be required for unix platforms. +TODO: reduced accuracy I don't see why Unix should be required to emulate some arbitrary reduced accuracy ML mode. My guess would be that Unix Platform Spec @@ -59,3 +95,122 @@ TODO: writeup accuracy modes is guaranteed (and therefore does not need discovery sequences), while allowing portable code to execute discovery sequences to detect support for alternative accuracy modes. + +# Dynamic accuracy CSR + +FCSR to be modified to include accuracy bits: + +| 31....11 | 10..8 | 7..5 | 4....0 | +| -------- | ------ | ---- | ------ | +| reserved | facc | frm | fflags | + +The values for the field facc to include the following: + +| facc | mode | description | +| ----- | ------- | ------------------- | +| 0b000 | IEEE754 | correctly rounded | +| 0b010 | ULP<1 | Unit in Last Place < 1 | +| 0b100 | Vulkan | Vulkan compliant | +| 0b110 | Appx | Machine Learning | + +(TODO: review alternative idea: ULP0.5, ULP1, ULP2, ULP4, ULP16) + +Notes: + +* facc=0 to match current RISC-V behaviour, where these bits were formerly reserved and set to zero. +* The format of the operands and result remains the same for +all opcodes. The only change is in the *accuracy* of the result, not +its format. +* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed). 
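The proposed FCSR layout above can be sketched as plain bit-field arithmetic. This is only an illustration of the table as written: the encoding is a proposal and not ratified, and the names `decode_fcsr`, `with_facc` and `FACC_MODES` are hypothetical helpers, not part of any RISC-V toolchain.

```python
# Field positions copied from the proposed FCSR table (assumption: not ratified).
FFLAGS_SHIFT, FFLAGS_MASK = 0, 0x1F  # bits 4..0
FRM_SHIFT, FRM_MASK = 5, 0x7         # bits 7..5
FACC_SHIFT, FACC_MASK = 8, 0x7       # bits 10..8

# facc encodings listed in the proposal; other values are not defined there.
FACC_MODES = {
    0b000: "IEEE754",
    0b010: "ULP<1",
    0b100: "Vulkan",
    0b110: "Appx",
}


def decode_fcsr(fcsr):
    """Split a raw FCSR value into its fflags, frm and facc fields."""
    return {
        "fflags": (fcsr >> FFLAGS_SHIFT) & FFLAGS_MASK,
        "frm": (fcsr >> FRM_SHIFT) & FRM_MASK,
        "facc": (fcsr >> FACC_SHIFT) & FACC_MASK,
    }


def with_facc(fcsr, facc):
    """Return fcsr with the facc field replaced, leaving frm and fflags alone."""
    if facc not in FACC_MODES:
        raise ValueError(f"facc value {facc:#05b} is not in the proposed table")
    cleared = fcsr & ~(FACC_MASK << FACC_SHIFT)
    return cleared | (facc << FACC_SHIFT)


if __name__ == "__main__":
    # frm=001, fflags=00001, then request the Vulkan accuracy mode.
    fcsr = with_facc(0b001_00001, 0b100)
    fields = decode_fcsr(fcsr)
    print(fields, FACC_MODES[fields["facc"]])
```

Because facc=0 is the IEEE754 mode, a value written by software that is unaware of the extension decodes as the default mode, which matches the first note above.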
+ +## Discussion + +maybe a solution would be to add an extra field to the fp control csr +to allow selecting one of several accurate or fast modes: + +- machine-learning-mode: fast as possible + (maybe need additional requirements such as monotonicity for atanh?) +- GPU-mode: accurate to within a few ULP + (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines) +- almost-accurate-mode: accurate to <1 ULP + (would 0.51 or some other value be better?) +- fully-accurate-mode: correctly rounded in all cases +- maybe more modes? + +extra mode suggestions: + + it might be reasonable to add a mode saying you're prepared to accept + worse than 0.5 ULP accuracy, perhaps with a few options: 1, 2, 4, + 16 or something like that. + +Question: should better accuracy than is requested be permitted? Example: +Amdahl-370 issues. + +Comments: + + Yes, embedded systems typically can do with 12, 16 or 32 bit + accuracy. Rarely does it require 64 bits. But the idea of making + a low power 32 bit FPU/DSP that can accommodate 64 bits is already + being done in other designs such as PIC etc I believe. For embedded + graphics 16 bit is more than adequate. In fact, Cornell had a very + innovative 18-bit floating point format described here (useful for + FPGA designs with 18-bit DSPs): + + + + A very interesting GPU using the 18-bit FPU is also described here: + + + + There are also 8 and 9-bit floating point formats that could be useful + + + +### function accuracy in standards (opencl, vulkan) + +[[resources]] for OpenCL and Vulkan + +Vulkan requires full IEEE754 precision for all F/D instructions except for fdiv and fsqrt. + + + +Source is here: + + +OpenCL is slightly different; suggest adding it as an extra entry. 
+ + +The link below finds version 2.1 of the OpenCL environment specification, table 8.4.1; however, it needs checking whether it is the same as the above, which has "SPIRV" in it and is 2.2, not 2.1. + +https://www.google.com/search?q=opencl+environment+specification + +2.1 is superseded by 2.2 + + +### Compliance + +Dan Petroski: + + It’s a bit more complicated than that. Different FP + representations/algorithms have different quantization ranges, so you + can get more or less precise depending on how large the arguments are. + + For instance, machine A can compute within ULP3 from 0 to 10000, but + ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to + 6000, then ULP3 for 6000+. How do you design a compliance suite which + guarantees behavior across all fpaccs? + +and from Allen Baum: + + In the example above, you'd need a ratified spec with the defined + ranges (possibly per range and per op) - and then implementations + would need to at least meet that spec (but could be more accurate) + + so - not impossible, but a lot more work to write different kinds + of tests than standard IEEE compatible test would have. + + And, by the way, if you want it to be a ratified spec, it needs a + compliance suite, and whoever has defined the spec is responsible + for writing it. +
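One possible reading of the per-range approach discussed above is a table of argument ranges, each with its own maximum error in ULP, against which every result is checked. The sketch below is only an illustration under several assumptions: the `SPEC` breakpoints simply reuse the "machine A" numbers from the quoted example, the helper names are hypothetical, errors are measured in binary64 ULPs via Python's `math.ulp` (Python 3.9 or later), and the reference value is taken as exact, whereas a real compliance suite would need a higher-precision reference.

```python
import math

# Hypothetical per-range spec: (low, high, maximum error in ULP), keyed on |x|.
# The breakpoints reuse the "machine A" example and are illustrative only.
SPEC = [
    (0.0, 10000.0, 3.0),
    (10000.0, math.inf, 2.0),
]


def ulp_error(approx, exact):
    """Error of approx relative to exact, in units of ulp(exact) for binary64."""
    return abs(approx - exact) / math.ulp(exact)


def max_ulp_allowed(x, spec=SPEC):
    """Look up the error bound that applies to an argument of magnitude |x|."""
    for low, high, bound in spec:
        if low <= abs(x) < high:
            return bound
    raise ValueError(f"no accuracy bound defined for x={x!r}")


def complies(x, approx, exact, spec=SPEC):
    """True if the result for argument x is at least as accurate as required."""
    return ulp_error(approx, exact) <= max_ulp_allowed(x, spec)


if __name__ == "__main__":
    exact = 1.0
    two_ulp_off = exact + 2 * math.ulp(exact)
    four_ulp_off = exact + 4 * math.ulp(exact)
    print(complies(1.0, two_ulp_off, exact))   # within the 3 ULP bound
    print(complies(1.0, four_ulp_off, exact))  # outside the 3 ULP bound
```

The comparison uses `<=`, so an implementation that is more accurate than the bound still passes, in line with the point that implementations need to at least meet the spec.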