# Draft proposal for improved atomic operations for the Power ISA

<https://bugs.libre-soc.org/show_bug.cgi?id=236>

## Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data-structure processing that needs on the
order of 10^5 or more SMP atomic locks per second.

### Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for a lot of `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4
add 6, 5, 3
stdcx. 6, 0, 4
bne 0, loop
lwsync
# output in r3
```
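
For comparison, here is a minimal C11 sketch of the operation this
sequence implements (using `<stdatomic.h>`; the exact instruction
sequence a compiler emits depends on the compiler, options and target):

```c
#include <stdatomic.h>

// seq_cst fetch-add on a 64-bit atomic: on current Power compilers this
// lowers to a sync/ldarx/add/stdcx./bne/lwsync loop like the one above
long fetch_add_seq_cst(_Atomic long *p, long addend) {
    return atomic_fetch_add_explicit(p, addend, memory_order_seq_cst);
}
```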

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3
bne- 0, skip
isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4
cmpd 0, 3, 5
bne 0, not_eq
stdcx. 6, 0, 4
bne 0, loop
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```
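
Again for comparison, a minimal C11 sketch of the corresponding
operation (illustrative only; real codegen varies by compiler, but is
typically a loop like the one shown above):

```c
#include <stdatomic.h>
#include <stdbool.h>

// seq_cst strong compare-exchange on a 64-bit atomic
bool cas_seq_cst(_Atomic long *p, long *expected, long desired) {
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired,
        memory_order_seq_cst, memory_order_seq_cst);
}
```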

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3
bne- skip
isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there rather than having to move a whole cache
block. Relying exclusively on load-reserve/store-conditional sequences
prevents that, because the cache block has to be brought into the
issuing core's L1 cache.

### Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

### Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

They are still missing better fences, combined operation/fence
instructions, and operations on 8/16-bit values, as well as having
some unnecessary restrictions:

They only have 32-bit and 64-bit atomic operations.

The operations it already has that I was going to propose (see the C11
mapping sketch after these lists):

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin
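
For reference, most of the first group maps directly onto C11
`<stdatomic.h>` operations; a minimal sketch follows (the
umax/smax/umin/smin operations have no direct C11 library equivalent,
so a compare-exchange loop stands in for them here):

```c
#include <stdatomic.h>
#include <stdint.h>

// Direct C11 equivalents of the proposed fetch-and-op operations
uint64_t f_add(_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_add(p, v); }
uint64_t f_xor(_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_xor(p, v); }
uint64_t f_or (_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_or (p, v); }
uint64_t f_and(_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_and(p, v); }
uint64_t xchg (_Atomic uint64_t *p, uint64_t v) { return atomic_exchange(p, v); }

// fetch_umax has no C11 equivalent: a compare-exchange loop stands in
uint64_t f_umax(_Atomic uint64_t *p, uint64_t v) {
    uint64_t old = atomic_load(p);
    while (old < v && !atomic_compare_exchange_weak(p, &old, v))
        ;  // on failure, old is updated to the current value; retry
    return old;
}
```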

The spec also basically says that the atomic memory operations are only
intended for when you want to do atomic operations on memory, but don't
want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.

imho the CPU should be able to (but not required to) predict whether to
send an atomic operation to the L2/L3/etc. cache or memory, or to
execute it directly in the L1 cache. The prediction could be based on
how often a cache block is accessed from different CPUs, e.g. by having
a small saturating counter and a last-accessing-CPU field per block:
count how many times the same CPU accessed the block in a row, execute
the operation in the L1 cache if that count exceeds some limit, and
otherwise perform it in the L2/L3/etc. cache (when the limit wasn't
reached or a different CPU tried to access the block).
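
A minimal sketch of such a predictor, assuming a hypothetical
per-cache-block counter width and threshold (the field names and the
limit of 8 are illustrative, not part of any specification):

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical predictor state attached to each cache block
typedef struct {
    uint16_t last_cpu;        // id of the last CPU that accessed the block
    uint8_t  same_cpu_count;  // saturating count of repeated accesses
} amo_predictor;

#define SAME_CPU_LIMIT 8      // illustrative threshold

// Returns true if the atomic operation should be executed in the local
// L1 cache, false if it should be forwarded to the L2/L3/etc. cache.
static bool predict_execute_in_l1(amo_predictor *p, uint16_t cpu) {
    if (p->last_cpu == cpu) {
        if (p->same_cpu_count < UINT8_MAX)
            p->same_cpu_count++;          // same CPU again: count it
    } else {
        p->last_cpu = cpu;                // different CPU: reset history
        p->same_cpu_count = 0;
    }
    return p->same_cpu_count > SAME_CPU_LIMIT;
}
```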

# TODO: add list of proposed instructions