zfpacc_proposal.mdwn

# FP Accuracy proposal

Credits:

* Bruce Hoult
* Allen Baum
* Dan Petroski
* Jacob Lifshay

TODO: complete writeup

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002400.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002412.html>

Zfpacc: a proposal to allow implementations to dynamically set the
bit-accuracy of floating-point results, trading speed (reduced latency)
*at runtime* for accuracy (higher latency).  IEEE754 format is preserved:
instruction operand and result format requirements are unmodified by
this proposal.  Only the accuracy of the instruction *result*, measured
in ULP (Unit in Last Place), is permitted to meet alternative
requirements, whilst still retaining the instruction's requested format.

This proposal is *only* suitable for adding pre-existing accuracy standards
where it is clearly established, well in advance of applications being
written that conform to that standard, that dealing with variations in
accuracy across hardware implementations is the responsibility of the
application writer.  This is the case for both Vulkan and OpenCL.

This proposal is *not* suitable for inclusion of "de-facto" (proprietary)
accuracy standards (such as the historic IBM Mainframe vs Amdahl
incompatibility) where there was no prior agreement or notification to
application writers that variations in accuracy across hardware
implementations would occur.  In the unlikely event that they *are* ever
to be included in the future, rather than as a Custom Extension, then,
unlike Vulkan and OpenCL, they must **only** be added as
"bit-for-bit compatible".

# Extension of FCSR

Zfpacc would use some of the reserved bits of FCSR.  It would be treated
very similarly to how dynamic frm is treated.

frm is treated as follows:

* Floating-point operations use either a static rounding mode encoded
  in the instruction, or a dynamic rounding mode held in frm.
* Rounding modes are encoded as shown in Table 11.1 of the RISC-V ISA Spec.
* A value of 111 in the instruction’s rm field selects the dynamic rounding
  mode held in frm. If frm is set to an invalid value (101–111),
  any subsequent attempt to execute a floating-point operation with a
  dynamic rounding mode will raise an illegal instruction exception.
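The frm rules above can be sketched in C as a small decision function. This is only an illustration of the existing rounding-mode behaviour that Zfpacc intends to mirror: the function name `resolve_rm` and the `RM_ILLEGAL` marker are hypothetical, and the encodings are the standard RISC-V ones (000 to 100 valid, 111 dynamic).

```c
#include <assert.h>
#include <stdint.h>

/* Rounding-mode encodings of the RISC-V F extension. */
enum { RM_RNE = 0, RM_RTZ = 1, RM_RDN = 2, RM_RUP = 3, RM_RMM = 4, RM_DYN = 7 };

/* Hypothetical marker meaning "raise an illegal-instruction exception". */
#define RM_ILLEGAL (-1)

/* Resolve the rounding mode used by one floating-point instruction.
 * insn_rm: 3-bit rm field of the instruction
 * frm:     3-bit frm field currently held in fcsr */
static int resolve_rm(uint8_t insn_rm, uint8_t frm)
{
    assert(insn_rm <= 7 && frm <= 7);
    /* rm = 111 selects the dynamic mode held in frm. */
    uint8_t rm = (insn_rm == RM_DYN) ? frm : insn_rm;
    /* Anything above 100 is reserved or invalid at this point. */
    return (rm <= RM_RMM) ? (int)rm : RM_ILLEGAL;
}
```

A dynamic accuracy field could be resolved in the same way, with the trap taken when the selected mode is not implemented.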

If we wish to support up to 4 accuracy modes, that would require 2 'fam'
bits.  The default would be IEEE754-compliant, encoded as 00.  This means
that all current hardware would be compliant with the default mode.

Unsupported modes cause a trap to allow emulation where traps are supported.
Emulation of unsupported modes would be required for UNIX platforms.
As with frm, an implementation may choose to support any combination
of dynamic fam-instruction pairs. It raises an illegal-instruction trap upon
executing an unsupported fam-instruction pair.  The implementation can
then emulate the accuracy mode required.
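A minimal sketch of the fam-instruction pair check, assuming the 2-bit fam field described above. The instruction classes, the `hw_modes` table and its contents are invented for illustration; a real implementation would key on actual opcodes and on whatever modes end up being defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical instruction classes, for illustration only. */
enum fp_class { FP_ADD, FP_MUL, FP_DIV, FP_SQRT, FP_CLASS_COUNT };

/* Bit n set: accuracy mode n is implemented in hardware for that class.
 * Mode 0 (IEEE754) is supported everywhere; the other bits are made up. */
static const uint8_t hw_modes[FP_CLASS_COUNT] = {
    [FP_ADD]  = 0x1,
    [FP_MUL]  = 0x1,
    [FP_DIV]  = 0x1 | 0x2,
    [FP_SQRT] = 0x1 | 0x2 | 0x4,
};

/* True if executing an instruction of class `cls` under accuracy mode
 * `fam` must raise an illegal-instruction trap so it can be emulated. */
static bool must_trap(enum fp_class cls, unsigned fam)
{
    assert(cls < FP_CLASS_COUNT && fam < 4);
    return (hw_modes[cls] & (1u << fam)) == 0;
}
```

With this table, `must_trap(FP_ADD, 1)` is true (the pair is emulated) while `must_trap(FP_SQRT, 2)` is false (the pair runs in hardware).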

If the bits are in FCSR, then the switch itself would be exposed to
user mode.  By design, however, user mode would not be able to tell
emulated instructions from hardware-supported ones.  Detecting that
would require some platform-specific code.

TODO:

A mechanism for user-mode code to detect which modes are emulated
(csr? syscall?), if the supervisor decides to make the emulation visible.
That would allow user code to switch to faster software implementations
if it chooses to.

TODO:

Choose which accuracy modes are required

    Which accuracy modes should be included is a question outside of
    my expertise and would require a literature review of instruction
    frequency in key workloads, PPA analysis of simple and advanced
    implementations, etc.

TODO: reduced accuracy

    I don't see why Unix should be required to emulate some arbitrary
    reduced accuracy ML mode.  My guess would be that Unix Platform Spec
    requires support for IEEE, whereas arbitrary ML platform requires
    support for Mode XYZ.  Of course, implementations of either platform
    would be free to support any/all modes that they find valuable.
    Compiling for a specific platform means that support for required
    accuracy modes is guaranteed (and therefore does not need discovery
    sequences), while allowing portable code to execute discovery
    sequences to detect support for alternative accuracy modes.

# Dynamic accuracy CSR <a name="dynamic"></a>

FCSR to be modified to include accuracy bits:

| 31....11 | 10..8  | 7..5 | 4....0 |
| -------- | ------ | ---- | ------ |
| reserved | facc   | frm  | fflags |
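The layout above can be sketched as C accessors. The facc position (bits 10..8) is the proposed one from the table and is not part of any ratified spec; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FCSR_FFLAGS_SHIFT 0
#define FCSR_FFLAGS_MASK  0x1Fu  /* bits 4..0 */
#define FCSR_FRM_SHIFT    5
#define FCSR_FRM_MASK     0x7u   /* bits 7..5 */
#define FCSR_FACC_SHIFT   8      /* proposed field, not ratified */
#define FCSR_FACC_MASK    0x7u   /* bits 10..8 */

/* Extract the proposed facc field from an fcsr value. */
static uint32_t fcsr_get_facc(uint32_t fcsr)
{
    return (fcsr >> FCSR_FACC_SHIFT) & FCSR_FACC_MASK;
}

/* Return fcsr with facc replaced, leaving frm and fflags untouched. */
static uint32_t fcsr_set_facc(uint32_t fcsr, uint32_t facc)
{
    assert(facc <= FCSR_FACC_MASK);
    fcsr &= ~(FCSR_FACC_MASK << FCSR_FACC_SHIFT);
    return fcsr | (facc << FCSR_FACC_SHIFT);
}
```

Because facc sits directly above frm, software that only reads or writes the frm and fflags fields keeps working, provided it preserves the upper bits.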

The facc field takes the following values:

| facc  | mode    | description         |
| ----- | ------- | ------------------- |
| 0b000 | IEEE754 | correctly rounded   |
| 0b010 | ULP<1   | Unit Last Place < 1 |
| 0b100 | Vulkan  | Vulkan compliant    |
| 0b110 | Appx    | Machine Learning    |

(TODO: review alternative idea: ULP0.5, ULP1, ULP2, ULP4, ULP16)

Notes:

* facc=0 matches current RISC-V behaviour, where these bits were formerly reserved and set to zero.
* The format of the operands and result remains the same for
  all opcodes. The only change is in the *accuracy* of the result, not
  its format.
* facc sets the *minimum* accuracy. It is acceptable to provide *more* accurate results than are requested by a given facc mode (although, clearly, the opportunity for reduced power and latency would be missed).
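One way to read "minimum accuracy" is as an upper bound on the distance between the delivered result and the correctly rounded one. The sketch below measures that distance for single precision as a count of representable steps, which is a common approximation of error in ULP, not the definition used by any particular standard. All function names are hypothetical, and NaN inputs are not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float onto a signed integer line so that adjacent
 * representable values differ by exactly 1 (+0 and -0 both map to 0). */
static int64_t float_order(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    int64_t mag = (int64_t)(u & 0x7FFFFFFFu);
    return (u & 0x80000000u) ? -mag : mag;
}

/* Distance between two floats in representable single-precision steps. */
static int64_t ulp_distance(float a, float b)
{
    assert(a == a && b == b); /* NaN is not handled in this sketch */
    int64_t d = float_order(a) - float_order(b);
    return d < 0 ? -d : d;
}

/* True if `result` is within `max_ulp` steps of the correctly rounded
 * value. A more accurate result than required always passes. */
static int within_ulp(float result, float correctly_rounded, int64_t max_ulp)
{
    return ulp_distance(result, correctly_rounded) <= max_ulp;
}
```

A check of this shape accepts a correctly rounded result under every mode, which matches the note that more accurate results than requested are acceptable.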

## Discussion

Maybe a solution would be to add an extra field to the fp control csr
to allow selecting one of several accurate or fast modes:

- machine-learning-mode: fast as possible
  (maybe need additional requirements such as monotonicity for atanh?)
- GPU-mode: accurate to within a few ULP
  (see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines)
- almost-accurate-mode: accurate to <1 ULP
  (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

Extra mode suggestions:

    it might be reasonable to add a mode saying you're prepared to accept
    worse than 0.5 ULP accuracy, perhaps with a few options: 1, 2, 4,
    16 or something like that.

Question: should better accuracy than is requested be permitted? Example:
Amdahl-370 issues.

Comments:

    Yes, embedded systems typically can do with 12, 16 or 32 bit
    accuracy. Rarely does it require 64 bits. But the idea of making
    a low power 32 bit FPU/DSP that can accommodate 64 bits is already
    being done in other designs such as PIC etc I believe. For embedded
    graphics 16 bit is more than adequate. In fact, Cornell had a very
    innovative 18-bit floating point format described here (useful for
    FPGA designs with 18-bit DSPs):

    <https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html>

    A very interesting GPU using the 18-bit FPU is also described here:

    <https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html>

    There are also 8 and 9-bit floating point formats that could be useful

    <https://en.wikipedia.org/wiki/Minifloat>

### Function accuracy in standards (OpenCL, Vulkan)

[[resources]] for OpenCL and Vulkan

Vulkan requires full IEEE754 precision for all F/D instructions except for fdiv and fsqrt.

<https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/chap40.html#spirvenv-precision-operation>

Source is here:
<https://github.com/KhronosGroup/Vulkan-Docs/blob/master/appendices/spirvenv.txt#L1172>

OpenCL is slightly different; suggest adding it as an extra entry.

<https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Env.html#relative-error-as-ulps>

The search below finds version 2.1 of the OpenCL environment specification (table 8.4.1). It needs checking against the link above, which has "SPIRV" in it and is version 2.2, not 2.1.

<https://www.google.com/search?q=opencl+environment+specification>

2.1 is superseded by 2.2:
<https://github.com/KhronosGroup/OpenCL-Docs/blob/master/env/numerical_compliance.asciidoc>

### Compliance

Dan Petroski:

    It’s a bit more complicated than that. Different FP
    representations/algorithms have different quantization ranges, so you
    can get more or less precise depending on how large the arguments are.

    For instance, machine A can compute within ULP3 from 0 to 10000, but
    ULP2 from 10000 upwards. Machine B can compute within ULP2 from 0 to
    6000, then ULP3 for 6000+. How do you design a compliance suite which
    guarantees behavior across all fpaccs?

and from Allen Baum:

    In the example above, you'd need a ratified spec with the defined
    ranges (possibly per range and per op) - and then implementations
    would need to at least meet that spec (but could be more accurate)

    so - not impossible, but a lot more work to write different kinds
    of tests than standard IEEE compatible test would have.

    And, by the way, if you want it to be a ratified spec, it needs a
    compliance suite, and whoever has defined the spec is responsible
    for writing it.
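The per-range idea in this exchange can be sketched as a table lookup. The two machine tables below restate the figures from the example above; the `spec_ulp3` table, the `1e30` upper limit and all names are hypothetical, chosen only to show how "at least meet the spec" might be checked at a given argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One row: arguments in [lo, hi) must be accurate to within max_ulp. */
struct ulp_range { double lo, hi; unsigned max_ulp; };

/* Bound that applies to argument x, or 0 if x is outside every range. */
static unsigned bound_at(const struct ulp_range *t, size_t n, double x)
{
    for (size_t i = 0; i < n; i++)
        if (x >= t[i].lo && x < t[i].hi)
            return t[i].max_ulp;
    return 0;
}

/* An implementation meets the spec at x if its own bound there is
 * at least as tight as the spec's bound. */
static bool meets_spec_at(const struct ulp_range *spec, size_t ns,
                          const struct ulp_range *impl, size_t ni, double x)
{
    unsigned s = bound_at(spec, ns, x), m = bound_at(impl, ni, x);
    assert(s != 0 && m != 0); /* x must be covered by both tables */
    return m <= s;
}

/* Machines A and B from the example, plus a hypothetical ULP3 spec. */
static const struct ulp_range machine_a[] = { {0, 10000, 3}, {10000, 1e30, 2} };
static const struct ulp_range machine_b[] = { {0, 6000, 2}, {6000, 1e30, 3} };
static const struct ulp_range spec_ulp3[] = { {0, 1e30, 3} };
```

Under these assumptions both machines meet `spec_ulp3` everywhere, while neither meets a spec derived from the other machine's table across the whole range, which is the difficulty the example points at.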