# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]

# Motivation

Power ISA currently has some issues with its atomic operations support.
These are exacerbated by the processing of 3D data structures in 3D
shader binaries, which needs on the order of 10^5 or more atomic locks
per second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

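```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4       # load doubleword from (r4) and take a reservation
add 6, 5, 3         # r6 = addend + loaded value
stdcx. 6, 0, 4      # store r6 only if the reservation still holds
bne 0, loop         # reservation lost: retry
lwsync
# output in r3
```
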
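The sequence above corresponds to a single source-level operation. As a point of comparison, here is a minimal C++ sketch of that operation. The wrapper name `fetch_add_seq_cst` and the 64-bit width (chosen to match `ldarx`/`stdcx.`) are illustrative assumptions, not part of the proposal.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative wrapper (hypothetical name): atomically adds `addend` to
// `*target` with sequentially consistent ordering and returns the value
// that was in memory before the addition (r3 in the listing).
inline std::uint64_t fetch_add_seq_cst(std::atomic<std::uint64_t>* target,
                                       std::uint64_t addend) {
    assert(target != nullptr);
    return target->fetch_add(addend, std::memory_order_seq_cst);
}
```

The whole read-modify-write is one call at the source level, which is the granularity a single atomic instruction would ideally match.
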
`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3        # always equal; makes the branch depend on the loaded value
bne- 0, skip        # never taken
isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4       # load current value and take a reservation
cmpd 0, 3, 5        # compare it with the expected value
bne 0, not_eq       # mismatch: skip the store
stdcx. 6, 0, 4      # store the replacement if the reservation still holds
bne 0, loop         # reservation lost: retry
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

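For comparison, the same compare-and-swap can be sketched at the source level in C++. The names `CasResult` and `compare_exchange_seq_cst` are hypothetical, introduced only to mirror the two outputs of the listing (loaded value and store-occurred flag).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical result type mirroring the outputs of the assembly listing.
struct CasResult {
    std::uint64_t loaded;  // value observed in memory (r3 in the listing)
    bool stored;           // true if the replacement was written (cr0.eq)
};

inline CasResult compare_exchange_seq_cst(std::atomic<std::uint64_t>* target,
                                          std::uint64_t expected,
                                          std::uint64_t replacement) {
    assert(target != nullptr);
    // On failure compare_exchange_strong overwrites `expected` with the
    // value it found, so `expected` always ends up holding the loaded value.
    const bool stored = target->compare_exchange_strong(
        expected, replacement,
        std::memory_order_seq_cst, std::memory_order_seq_cst);
    return CasResult{expected, stored};
}
```

A failed comparison leaves memory untouched and still reports the value that was found, matching the `not_eq` path of the listing.
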
`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3        # always equal; makes the branch depend on the loaded value
bne- 0, skip        # never taken
isync
skip:
# output in r3
```

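To illustrate what an acquire load is typically used for, here is a small C++ sketch of the release/acquire "message passing" pattern. The `Mailbox` type and the `publish` and `try_consume` functions are hypothetical names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical one-slot mailbox used only for this illustration.
struct Mailbox {
    std::uint64_t payload = 0;            // ordinary (non-atomic) data
    std::atomic<std::uint64_t> ready{0};  // flag guarding the payload
};

// Producer side: write the payload, then publish it with a release store.
inline void publish(Mailbox& box, std::uint64_t value) {
    box.payload = value;
    box.ready.store(1, std::memory_order_release);
}

// Consumer side: the acquire load is the source-level form of the
// ld/cmpw/bne-/isync listing. If it observes the flag, the payload
// write made before the release store is guaranteed to be visible.
inline bool try_consume(Mailbox& box, std::uint64_t* out) {
    assert(out != nullptr);
    if (box.ready.load(std::memory_order_acquire) == 0) {
        return false;
    }
    *out = box.payload;
    return true;
}
```

In real use `publish` and `try_consume` would run on different cores. Only a memory-ordering guarantee is needed on the consumer side, which is why a full pipeline flush is more than the operation requires.
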
Single-instruction atomic operations are useful for implementations that want to send atomic operations to a shared cache, since they can be executed more efficiently there than by moving a whole cache block. Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

The existing Atomic Memory Operations still lack better fences and
combined operation/fence instructions, and they carry unnecessary
restrictions: only 32-bit and 64-bit atomic operations are provided,
so operations on 8-bit and 16-bit values are missing.

See [[discussion]] for proposed operations and thoughts TODO
remove this sentence

# TODO: add list of proposed instructions

AT-Form (TODO)

* lat RT,RA,FC,ew
* lataq RT,RA,FC,ew
* latrl RT,RA,FC,ew
* lataqrl RT,RA,FC,ew
* stat RS,RA,FC,ew
* stataq RS,RA,FC,ew
* statrl RS,RA,FC,ew
* stataqrl RS,RA,FC,ew

| 0.5 | 6.10 | 11.15 | 16.20 | 21 | 22 | 23.24 | 25.30  | 31 | name         | Form      |
| --- | ---- | ----- | ----- | -- | -- | ----- | ------ | -- | ------------ | --------- |
| PO  | RT   | RA    | FC    | lr | sc | ew    | xxxxxx | /  | lat[aq][rl]  | TODO-Form |
| PO  | RS   | RA    | FC    | lr | sc | ew    | xxxxxx | /  | stat[aq][rl] | TODO-Form |

* `ew` specifies the memory operation width: the values 0/1/2/3 select
  8/16/32/64 bits respectively.
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO.
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
* Setting both the `aq` and the `rl` bits makes the AMO
  sequentially consistent, meaning that
  it cannot be reordered with earlier or later atomic
  memory operations.
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
  and `stwat`.

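The field layout in the table can be sketched as a small C++ encoder. This is only an illustration under stated assumptions: the primary opcode `po` is not yet allocated, so the caller supplies it; bits 21 and 22 (labelled `lr` and `sc` in the table) are assumed to carry `aq` and `rl`; and bits 25 to 31 are left as zero. The function names `encode_amo` and `ew_to_bits` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs the draft lat/stat fields into a 32-bit word.
// Assumptions (not fixed by the draft): `po` is supplied by the caller,
// bits 21 and 22 carry `aq` and `rl`, and bits 25..31 are left as zero.
inline std::uint32_t encode_amo(std::uint32_t po, std::uint32_t rt_or_rs,
                                std::uint32_t ra, std::uint32_t fc,
                                bool aq, bool rl, std::uint32_t ew) {
    assert(po < 64 && rt_or_rs < 32 && ra < 32 && fc < 32 && ew < 4);
    // Power ISA numbers bits from the most significant end, so a field
    // whose last bit is b is shifted left by (31 - b).
    return (po << 26)                             // bits 0..5
         | (rt_or_rs << 21)                       // bits 6..10
         | (ra << 16)                             // bits 11..15
         | (fc << 11)                             // bits 16..20
         | (static_cast<std::uint32_t>(aq) << 10) // bit 21
         | (static_cast<std::uint32_t>(rl) << 9)  // bit 22
         | (ew << 7);                             // bits 23..24
}

// Width in bits selected by `ew`: 0/1/2/3 -> 8/16/32/64.
inline unsigned ew_to_bits(std::uint32_t ew) {
    assert(ew < 4);
    return 8u << ew;
}
```

For example, `ew_to_bits(1)` gives 16, one of the widths the existing 32-bit and 64-bit-only AMOs do not cover.
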
Read operations (Power ISA v3.1 Book II section 4.5.1, p1071):

| FC    | Registers      | Memory                  | Operation                  |
| ----- | -------------- | ----------------------- | -------------------------- |
| 00000 | RT, RT+1       | mem(EA,s)               | Fetch and Add              |
| 00001 | RT, RT+1       | mem(EA,s)               | Fetch and XOR              |
| 00010 | RT, RT+1       | mem(EA,s)               | Fetch and OR               |
| 00011 | RT, RT+1       | mem(EA,s)               | Fetch and AND              |
| 00100 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Unsigned |
| 00101 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Signed   |
| 00110 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Unsigned |
| 00111 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Signed   |
| 01000 | RT, RT+1       | mem(EA,s)               | Swap                       |
| 10000 | RT, RT+1, RT+2 | mem(EA,s)               | Compare and Swap Not Equal |
| 11000 | RT             | mem(EA,s) mem(EA+s, s)  | Fetch and Increment Bounded |
| 11001 | RT             | mem(EA,s) mem(EA+s, s)  | Fetch and Increment Equal  |
| 11100 | RT             | mem(EA-s,s) mem(EA, s)  | Fetch and Decrement Bounded |

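The single-location function codes (00000 to 01000) can be modelled in a few lines of C++. This is a simplified sketch: it assumes 64-bit operands, ignores the multi-location codes (10000 and above), and assumes the old memory value is what the instruction returns in RT. The names `amo_new_value` and `amo_fetch` are hypothetical, and the binary literals need C++14 or later.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified model (assumption: 64-bit operands only) of the value written
// back to memory for function codes 00000..01000 in the table above.
inline std::uint64_t amo_new_value(unsigned fc, std::uint64_t old_value,
                                   std::uint64_t operand) {
    const auto s_old = static_cast<std::int64_t>(old_value);
    const auto s_op = static_cast<std::int64_t>(operand);
    switch (fc) {
        case 0b00000: return old_value + operand;                        // Add
        case 0b00001: return old_value ^ operand;                        // XOR
        case 0b00010: return old_value | operand;                        // OR
        case 0b00011: return old_value & operand;                        // AND
        case 0b00100: return old_value > operand ? old_value : operand;  // Max unsigned
        case 0b00101: return s_old > s_op ? old_value : operand;         // Max signed
        case 0b00110: return old_value < operand ? old_value : operand;  // Min unsigned
        case 0b00111: return s_old < s_op ? old_value : operand;         // Min signed
        case 0b01000: return operand;                                    // Swap
        default:
            assert(false && "function code not modelled");
            return old_value;
    }
}

// Emulates "fetch and <op>" with a compare-exchange retry loop and returns
// the old memory value.
inline std::uint64_t amo_fetch(std::atomic<std::uint64_t>* target, unsigned fc,
                               std::uint64_t operand) {
    std::uint64_t old_value = target->load(std::memory_order_relaxed);
    while (!target->compare_exchange_weak(old_value,
                                          amo_new_value(fc, old_value, operand),
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed)) {
        // old_value now holds the value just observed; recompute and retry.
    }
    return old_value;
}
```

The retry loop in `amo_fetch` is the software equivalent of the `ldarx`/`stdcx.` sequences shown earlier. A single AMO instruction would replace the whole loop.
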
Store operations:

| FC    | Register | Memory    | Operation              |
| ----- | -------- | --------- | ---------------------- |
| 00000 | RS       | mem(EA,s) | Store Add              |
| 00001 | RS       | mem(EA,s) | Store XOR              |
| 00010 | RS       | mem(EA,s) | Store OR               |
| 00011 | RS       | mem(EA,s) | Store AND              |
| 00100 | RS       | mem(EA,s) | Store Maximum Unsigned |
| 00101 | RS       | mem(EA,s) | Store Maximum Signed   |
| 00110 | RS       | mem(EA,s) | Store Minimum Unsigned |
| 00111 | RS       | mem(EA,s) | Store Minimum Signed   |
| 11000 | RS       | mem(EA,s) | Store Twin             |

These operations are recognised as being part of the
OpenCAPI Specification.