# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section

## Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data structure processing in 3D shader
binaries needing on the order of 10^5 or more SMP atomic locks per second.

### Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion, because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for a lot of `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4
add 6, 5, 3
stdcx. 6, 0, 4
bne 0, loop
lwsync
# output in r3
```

`atomic_load_seq_cst` is 5 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3
bne- 0, skip
isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4
cmpd 0, 3, 5
bne 0, not_eq
stdcx. 6, 0, 4
bne 0, loop
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3
bne- skip
isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there than having to move a whole cache block.
Relying exclusively on
TODO

### Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

### Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

They are still missing better fences, combined operation/fence
instructions, and operations on 8/16-bit values, and they have
unnecessary restrictions: there are only 32-bit and 64-bit atomic
operations.

Read operations (v3.1 Book II section 4.5.1 p1071):

| Function code | Result registers | Memory operands | Operation |
|-------|----------|-----------|----------------------------|
| 00000 | RT, RT+1 | mem(EA,s) | Fetch and Add |
| 00001 | RT, RT+1 | mem(EA,s) | Fetch and XOR |
| 00010 | RT, RT+1 | mem(EA,s) | Fetch and OR |
| 00011 | RT, RT+1 | mem(EA,s) | Fetch and AND |
| 00100 | RT, RT+1 | mem(EA,s) | Fetch and Maximum Unsigned |
| 00101 | RT, RT+1 | mem(EA,s) | Fetch and Maximum Signed |
| 00110 | RT, RT+1 | mem(EA,s) | Fetch and Minimum Unsigned |
| 00111 | RT, RT+1 | mem(EA,s) | Fetch and Minimum Signed |
| 01000 | RT, RT+1 | mem(EA,s) | Swap |
| 10000 | RT, RT+1, RT+2 | mem(EA,s) | Compare and Swap Not Equal |
| 11000 | RT | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
| 11001 | RT | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal |
| 11100 | RT | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |

Store operations:

| Function code | Source register | Memory operands | Operation |
|-------|----|-----------|------------------------|
| 00000 | RS | mem(EA,s) | Store Add |
| 00001 | RS | mem(EA,s) | Store XOR |
| 00010 | RS | mem(EA,s) | Store OR |
| 00011 | RS | mem(EA,s) | Store AND |
| 00100 | RS | mem(EA,s) | Store Maximum Unsigned |
| 00101 | RS | mem(EA,s) | Store Maximum Signed |
| 00110 | RS | mem(EA,s) | Store Minimum Unsigned |
| 00111 | RS | mem(EA,s) | Store Minimum Signed |
| 11000 | RS | mem(EA,s) | Store Twin |

These operations are recognised as being part of the
OpenCAPI Specification.

The operations it already has that I was going to propose:

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin

The spec also basically says that the atomic memory operations are only
intended for when you want to do atomic operations on memory but don't
want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.

imho the CPU should be able to (but not be required to) predict whether
to send an atomic operation to the L2/L3/etc. cache or memory, or to
execute it directly in the L1 cache. The prediction could be based on
how often a cache block is accessed by different CPUs, e.g. by keeping a
small saturating counter and a last-accessing-CPU field per block:
count how many times the same CPU accesses the block in a row, execute
the operation in the L1 cache once that count exceeds some limit, and
otherwise do the operation in the L2/L3/etc. cache (when the limit
hasn't been reached, or a different CPU accesses the block).

# TODO: add list of proposed instructions