# Draft proposal for improved atomic operations for the Power ISA

<https://bugs.libre-soc.org/show_bug.cgi?id=236>

## Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data structure processing needing on the
order of 10^5 or more SMP atomic locks per second.

### Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that the Power ISA's memory fences are
unnecessarily strong, particularly `isync`, which is used to implement
many `acquire` and stronger fences. `isync` forces the cpu to do a full
pipeline flush, which is unnecessary when all that is needed is a
memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4
add 6, 5, 3
stdcx. 6, 0, 4
bne 0, loop
lwsync
# output in r3
```
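
Each of the sequences in this section implements a single C++11 atomic
call. A minimal sketch of the source corresponding to the fetch-add
above (function and variable names are purely illustrative, not part of
any proposal):

```
#include <atomic>
#include <cstdint>

// a single seq_cst fetch-add; compilers targeting Power currently lower
// this to a sync / ldarx / add / stdcx. / lwsync loop like the one above
std::int64_t fetch_add_seq_cst(std::atomic<std::int64_t> &v,
                               std::int64_t addend) {
    return v.fetch_add(addend, std::memory_order_seq_cst);
}
```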

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3
bne- 0, skip
isync
skip:
# output in r3
```
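
Again, a sketch of the C++11 source this sequence implements (names
illustrative):

```
#include <atomic>
#include <cstdint>

// a single seq_cst load; the sync / ld / cmpw / bne- / isync sequence
// above is the usual Power lowering
std::int64_t load_seq_cst(const std::atomic<std::int64_t> &v) {
    return v.load(std::memory_order_seq_cst);
}
```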

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4
cmpd 0, 3, 5
bne 0, not_eq
stdcx. 6, 0, 4
bne 0, loop
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```
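
The C++11 equivalent is again one call (a sketch; names illustrative).
The store-occurred flag in cr0.eq corresponds to the boolean return
value:

```
#include <atomic>
#include <cstdint>

// returns true if the store occurred; on failure, `expected` is updated
// with the loaded value (r3 in the sequence above)
bool compare_exchange_seq_cst(std::atomic<std::int64_t> &v,
                              std::int64_t &expected,
                              std::int64_t replacement) {
    return v.compare_exchange_strong(expected, replacement,
                                     std::memory_order_seq_cst);
}
```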

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3
bne- skip
isync
skip:
# output in r3
```
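
The corresponding C++11 source (sketch; names illustrative):

```
#include <atomic>
#include <cstdint>

// an acquire load; `isync` in the sequence above stands in for what
// could be a much cheaper acquire fence
std::int64_t load_acquire(const std::atomic<std::int64_t> &v) {
    return v.load(std::memory_order_acquire);
}
```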

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there than having to move a whole cache block to
the requesting core. Relying exclusively on
TODO

### Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

### Power ISA's Atomic Memory Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

These are still missing better fences, combined operation/fence
instructions, and operations on 8-bit and 16-bit values (only 32-bit
and 64-bit atomic operations are provided), and they come with
unnecessary restrictions.

These operations are recognised as being part of the
OpenCAPI Specification.

The operations it already has that I was going to propose (their C++
counterparts are sketched after the lists below):

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin
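
To illustrate how the first list lines up with C++11 atomics (a sketch,
not part of the proposal): fetch_add/fetch_xor/fetch_or/fetch_and and
exchange map directly onto `std::atomic` member functions, while the
min/max operations currently have to be written as a compare-exchange
loop in C++, which is exactly the kind of code a hardware fetch_umax
would avoid:

```
#include <atomic>
#include <cstdint>

// direct C++11 counterparts of fetch_add/fetch_xor/fetch_or/fetch_and
// and exchange
std::uint64_t direct_ops(std::atomic<std::uint64_t> &v, std::uint64_t x) {
    v.fetch_add(x, std::memory_order_relaxed);
    v.fetch_xor(x, std::memory_order_relaxed);
    v.fetch_or(x, std::memory_order_relaxed);
    v.fetch_and(x, std::memory_order_relaxed);
    return v.exchange(x, std::memory_order_relaxed);
}

// fetch_umax has no direct C++11 equivalent and needs a CAS loop
std::uint64_t fetch_umax(std::atomic<std::uint64_t> &v, std::uint64_t x) {
    std::uint64_t old = v.load(std::memory_order_relaxed);
    while (old < x &&
           !v.compare_exchange_weak(old, x, std::memory_order_relaxed))
        ;  // on failure, `old` is reloaded with the current value
    return old;
}
```
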
The spec also basically says that the atomic memory operations are only
intended for when you want to do atomic operations on memory, but don't
want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.

imho the cpu should be able to (but not be required to) predict whether
to send an atomic operation to the L2/L3/etc. cache or memory, or to
execute it directly in the L1 cache. The prediction could be based on
how often that cache block is accessed from different cpus, e.g. by
keeping a small saturating counter and a last-accessing-cpu field per
cache block: count how many times the same cpu accesses the block in a
row, execute the operation in the L1 cache once that count exceeds some
limit, and otherwise do the operation in the L2/L3/etc. cache (when the
limit isn't reached or a different cpu accesses the block). A sketch of
such a predictor is given below.
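
A minimal sketch of such a predictor (all names and thresholds are
purely illustrative, not a proposed microarchitecture):

```
#include <cstdint>

// per-cache-block prediction state: a small saturating counter and a
// last-accessing-cpu field, as described above
struct AtomicPlacementPredictor {
    std::uint8_t  same_cpu_count = 0;  // saturating counter
    std::uint16_t last_cpu       = 0;  // last-accessing-cpu field

    // returns true if the atomic should be executed in the local L1
    // cache, false if it should be sent to the shared L2/L3/memory
    bool execute_in_l1(std::uint16_t cpu, std::uint8_t limit = 4) {
        if (cpu == last_cpu) {
            if (same_cpu_count < 255)
                ++same_cpu_count;      // same cpu again: extend the streak
        } else {
            last_cpu = cpu;            // different cpu: reset the streak
            same_cpu_count = 0;
        }
        return same_cpu_count >= limit;
    }
};
```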

# TODO: add list of proposed instructions