# Draft proposal for improved atomic operations for the Power ISA

<http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html>

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]

# Motivation

Power ISA currently has some issues with its atomic operations support.
These are exacerbated by 3D data-structure processing in 3D shader
binaries, which needs on the order of 10^5 or more atomic locks per
second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3
```

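To illustrate why the sequence above needs a loop, the following is a minimal Python sketch of the load-reserve / store-conditional retry pattern. It is a toy model rather than a description of any real implementation: the `Memory` class, its single-reservation behaviour and the `interference` parameter are all assumptions made for this example.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


class Memory:
    """Toy single-word memory with one reservation (illustration only)."""

    def __init__(self, value=0):
        self.value = value
        self.reserved = False

    def load_reserve(self):
        # Models ldarx: read the word and establish a reservation.
        self.reserved = True
        return self.value

    def store_conditional(self, new_value):
        # Models stdcx.: the store only happens if the reservation survives.
        if not self.reserved:
            return False
        self.value = new_value & MASK64
        self.reserved = False
        return True

    def interfere(self, new_value):
        # Models a write from another core, which cancels the reservation.
        self.value = new_value & MASK64
        self.reserved = False


def fetch_add(mem, addend, interference=()):
    """Retry loop mirroring the ldarx / add / stdcx. / bne sequence.

    `interference` is a hypothetical list of values that another core
    writes between the load and the store, one per attempt.
    """
    pending = list(interference)
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve()
        if pending:
            mem.interfere(pending.pop(0))
        if mem.store_conditional(old + addend):
            return old, attempts
```

With no interference the loop body runs once. Each intervening write from another core costs one more trip round the loop, which is the overhead a single-instruction atomic would avoid.
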
`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

Single-instruction atomic operations are useful for implementations that
want to send atomic operations to a shared cache, since they can be
executed more efficiently there than by moving a whole cache block.
Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

See Power ISA v3.1 Book II section 4.5: Atomic Memory Operations.

The existing AMOs still lack better fences, combined operation/fence
instructions, and operations on 8/16-bit values. They also carry
unnecessary restrictions: only 32-bit and 64-bit atomic operations
are provided.

see [[discussion]] for proposed operations and thoughts TODO
remove this sentence


# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical
to `lwat/ldat` and `stwat/stdat`, except that they add guaranteed
acquire and release ordering semantics as well as 8-bit and
16-bit memory widths.

AT-Form (TODO)

* lat. RT,RA,FC,aq,rl,ew
* stat. RS,RA,FC,aq,rl,ew

**DRAFT** EXT031 and XO: these are close to the existing
atomic memory operations

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name| Form      |
|-- | -- | ---  | --- |--|--|---- |------|--|----|-----------|
|31 | RT | RA   | FC  |aq|rl|ew   |000101|Rc|lat | TODO-Form |
|31 | RS | RA   | FC  |aq|rl|ew   |100101|/ |stat| TODO-Form |

* `ew` specifies the memory operation width: values 0/1/2/3 select
  8/16/32/64 bits respectively
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO in this or other cores.
  (A global Write-after-Read Memory Hazard is created)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A global Read-after-Write Memory Hazard is created)
* Setting both the `aq` and the `rl` bit makes the operation
  sequentially consistent, meaning that
  it cannot be reordered with respect to earlier or later atomic
  memory operations. (Both a RaW and WaR are simultaneously created)
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
  and `stwat`

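As a rough sketch of the draft encoding and the bit meanings listed above, the Python below packs the fields into a 32-bit word using Power ISA MSB0 bit numbering (bit 0 is the most significant bit). All function names are hypothetical, and bits 21 and 22 are taken to be `aq` and `rl`. The ordering names returned by `ordering` borrow C++11 terminology, which is an assumption: the text above only states that setting both bits gives sequential consistency. The encoding itself is a draft and may change.

```python
EW_BITS = {0: 8, 1: 16, 2: 32, 3: 64}  # ew value -> operation width in bits


def field(value, first, last):
    """Place value in bits first..last of a 32-bit word (MSB0 numbering)."""
    width = last - first + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value << (31 - last)


def encode(xo, reg, ra, fc, aq, rl, ew, bit31=0):
    """Pack the draft fields: primary opcode 31 (EXT031) plus a 6-bit XO."""
    return (field(31, 0, 5) | field(reg, 6, 10) | field(ra, 11, 15)
            | field(fc, 16, 20) | field(aq, 21, 21) | field(rl, 22, 22)
            | field(ew, 23, 24) | field(xo, 25, 30) | field(bit31, 31, 31))


def encode_lat(rt, ra, fc, aq, rl, ew, rc=0):
    # XO 000101 from the draft table; bit 31 is Rc.
    return encode(0b000101, rt, ra, fc, aq, rl, ew, rc)


def encode_stat(rs, ra, fc, aq, rl, ew):
    # XO 100101 from the draft table; bit 31 is left as zero.
    return encode(0b100101, rs, ra, fc, aq, rl, ew)


def ordering(aq, rl):
    """Name the ordering selected by aq/rl (C++11-style names, assumed)."""
    if aq and rl:
        return "seq_cst"
    if aq:
        return "acquire"
    if rl:
        return "release"
    return "relaxed"
```

Under these assumptions, `hex(encode_lat(3, 4, 0, 1, 1, 3))` gives `0x7c64078a`: a 64-bit Fetch and Add (`FC`=0, `ew`=3) with both `aq` and `rl` set.
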
Read functions (Power ISA v3.1 Book II section 4.5.1, p1071):

| FC    | regs           | memory                 | description                 |
|-------|----------------|------------------------|-----------------------------|
| 00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
| 00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
| 00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
| 00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
| 00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
| 00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
| 00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
| 00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
| 01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
| 10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
| 11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
| 11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
| 11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |

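The simpler read functions in the table (00000 to 01000) can be sketched as a small Python reference model. This is an illustration under stated assumptions, not the specification: it assumes that the register RT+1 supplies the operand and that RT receives the original memory value, as with `lwat`/`ldat`. The dict-based `memory`, the `fetch_op` name and its parameters are hypothetical. The compare-and-swap and bounded functions are deliberately left out.

```python
def to_signed(value, bits):
    """Interpret an unsigned `bits`-wide value as two's complement."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def _signed_key(bits):
    # Key function so max/min compare values as signed quantities.
    return lambda x: to_signed(x, bits)


# FC -> function of (old memory value t, operand v, width in bits)
FETCH_OPS = {
    0b00000: lambda t, v, bits: t + v,                          # Fetch and Add
    0b00001: lambda t, v, bits: t ^ v,                          # Fetch and XOR
    0b00010: lambda t, v, bits: t | v,                          # Fetch and OR
    0b00011: lambda t, v, bits: t & v,                          # Fetch and AND
    0b00100: lambda t, v, bits: max(t, v),                      # Max Unsigned
    0b00101: lambda t, v, bits: max(t, v, key=_signed_key(bits)),  # Max Signed
    0b00110: lambda t, v, bits: min(t, v),                      # Min Unsigned
    0b00111: lambda t, v, bits: min(t, v, key=_signed_key(bits)),  # Min Signed
    0b01000: lambda t, v, bits: v,                              # Swap
}


def fetch_op(memory, ea, fc, operand, bits=64):
    """Apply read function `fc` to memory[ea]; return the old value (RT).

    `memory` is a dict mapping an address to one `bits`-wide value, which
    glosses over byte ordering and alignment.
    """
    mask = (1 << bits) - 1
    old = memory[ea] & mask
    memory[ea] = FETCH_OPS[fc](old, operand & mask, bits) & mask
    return old
```

At 8-bit width, for instance, a Fetch and Add of 1 to a location holding `0xFF` returns `0xFF` and leaves 0 in memory, while Fetch and Maximum Signed treats `0xFF` as -1 and so prefers an operand of 1.
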
Store functions:

| FC    | regs | memory    | description            |
|-------|------|-----------|------------------------|
| 00000 | RS   | mem(EA,s) | Store Add              |
| 00001 | RS   | mem(EA,s) | Store XOR              |
| 00010 | RS   | mem(EA,s) | Store OR               |
| 00011 | RS   | mem(EA,s) | Store AND              |
| 00100 | RS   | mem(EA,s) | Store Maximum Unsigned |
| 00101 | RS   | mem(EA,s) | Store Maximum Signed   |
| 00110 | RS   | mem(EA,s) | Store Minimum Unsigned |
| 00111 | RS   | mem(EA,s) | Store Minimum Signed   |
| 11000 | RS   | mem(EA,s) | Store Twin             |

These functions are also recognised as being part of the
OpenCAPI Specification.