u_queue: add a futex-based implementation of fences

Fences are now 4 bytes instead of 96 bytes (on my 64-bit system).
Signaling a fence is a single atomic operation in the fast case, plus a
syscall in the slow case.
Testing if a fence is signaled is the same as before (a simple comparison),
but waiting on a fence is now no more expensive than just testing it in
the fast (already signaled) case.
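
For illustration only, here is a minimal sketch of how a 4-byte futex
fence with these properties can be built on Linux. It is not the code
in this patch: the sketch_* names and the three-state encoding
(0 = signaled, 1 = unsignaled, 2 = unsignaled with possible waiters)
are assumptions, and it uses C11 atomics and the raw futex syscall
rather than the p_atomic_xxx macros.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical fence: 0 = signaled, 1 = unsignaled,
 * 2 = unsignaled and a waiter may be sleeping in the kernel. */
struct sketch_fence {
   _Atomic uint32_t val;
};

static_assert(sizeof(struct sketch_fence) == 4, "fence should be 4 bytes");

/* Thin wrapper around the raw futex syscall (glibc has no wrapper). */
static inline long
sketch_futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
   return syscall(SYS_futex, (uint32_t *)addr, op, val, NULL, NULL, 0);
}

static inline void
sketch_fence_init(struct sketch_fence *f, int signaled)
{
   atomic_init(&f->val, signaled ? 0u : 1u);
}

/* Testing is a simple comparison against 0. */
static inline int
sketch_fence_is_signaled(struct sketch_fence *f)
{
   return atomic_load_explicit(&f->val, memory_order_acquire) == 0;
}

/* Fast case: one atomic exchange. Slow case: wake sleeping waiters. */
static inline void
sketch_fence_signal(struct sketch_fence *f)
{
   uint32_t old = atomic_exchange_explicit(&f->val, 0, memory_order_acq_rel);
   if (old == 2)
      sketch_futex(&f->val, FUTEX_WAKE_PRIVATE, INT_MAX);
}

/* Fast case: one load. Otherwise announce a waiter and sleep. */
static inline void
sketch_fence_wait(struct sketch_fence *f)
{
   uint32_t v = atomic_load_explicit(&f->val, memory_order_acquire);
   while (v != 0) {
      if (v == 1) {
         uint32_t expected = 1;
         /* Mark the fence as having waiters (1 -> 2). */
         if (!atomic_compare_exchange_strong_explicit(
                &f->val, &expected, 2,
                memory_order_acquire, memory_order_acquire) &&
             expected == 0)
            return; /* signaled in the meantime */
      }
      /* Sleeps only if the value is still 2; otherwise returns at once. */
      sketch_futex(&f->val, FUTEX_WAIT_PRIVATE, 2);
      v = atomic_load_explicit(&f->val, memory_order_acquire);
   }
}
```

In this sketch a signal with no waiters never enters the kernel, and a
wait on an already signaled fence costs a single load, the same as
sketch_fence_is_signaled().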

v2:
- style fixes
- use p_atomic_xxx macros with the right barriers

Acked-by: Marek Olšák <marek.olsak@amd.com>