dab711f6b16bf57b0eb0a3aa544c90768603e94c
[libreriscv.git] / openpower / sv / predication.mdwn
1 # TODO
2
3 <https://bugs.libre-soc.org/show_bug.cgi?id=213>
4
5 * idea 1: modify cmp (and other CR generators?) with qualifiers that
6 create single bit prefix vector into int reg
7 * idea 2: override CR SO field in vector form to be predicate bit per element
8 * idea 3: reading of predicates is from bits of int reg
9 * idea 4: SO CR field no longer overflow, contains copy of int reg
10 predicate element bit (passed through). when OE set?
11
12
13 # Requirements
14
15 * must be easily implementable in any microarchitecture including:
16 - small and large out-of-order
17 - in-order
18 - FSM (0.3 IPC or below)
19 * must not compromise or penalise any microarchitectural performance
20 * must cover up to 64 elements
21 * must still work for elwidth over-rides
22
23 ## Additional Capabilities
24
25 * two modes, "zeroing" and "non-zeroing". zeroing mode places a zero in the masked-out element results, where non-zeroing leaves the destination (result) element unmodified.
26 * predicate must be invertable via an opcode bit (to avoid the need for an instruction which inverts all bits of the predicate mask)
27
28 Implementation note: even in in-order microarchitectures it is strongly adviseable to use byte-level write-enable lines on the register file. This in combination with 8-bit SIMD element overrides allows, in "non-zeroing" mode, the predicate mask to be directly ANDed with the regfile write-enable lines to achieve the required functionality. The alternative is to perform a READ-MODIFY-MASK-WRITE cycle which is costly and compromises performance. Avoided very simply with byte-level write-enable.
29
30 # Proposals
31
32 ## Adding new predicate register file type and associated opcodes
33
34 This idea, adding new predicate manipulation opcodes,
35 violates the fundamental design principles of SV to not add
36 new vector-related instructions unless essential or compelling.
37
38 All other proposals utilise existing scalar opcodes which already happen to have bitmanipulation, arithmetic, and inter-file transfer capability (mfcr, mfspr etc).
39 They also involve adding extra bitmanip opcodes, such that by utilising those scalar registers as predicate masks SV achieves "par" with other Cray-style Vector ISAs, all without actually having to add any actual Vector opcodes.
40
41 In addition those bitmanip operations, although some of them are obscure and unusual in the scalar world, do actually have practical applicatiobe outside of a vector context.
42
43 Adding a full set special vector opcodes just for manipulating predicate masks and being able to transfer them to other regfiles (a la mfcr) is however anomalous, costly, and unnecessary.
44
45 ## CR-based predication proposal
46
47 this involves treating each CR as providing one bit of predicate. If
48 there is limited space in SVPrefix it will be a fixed bit (bit 0)
49 otherwise it may be selected (bit 0 to 3 of the CR) through a firld in the opcode.
50
51 the crucial advantage of this proposal is that the Function Units can
52 have one more register (a CR) added as their Read Dependency Hazards
53 just like all the other incoming source registers, and there is no need
54 for a special "Predicate Shadow Function Unit".
55
56 a big advantage of this is that unpredicated operations just set the
57 predicate to an immediate of all 1s and the actual ALUs require very
58 little modification.
59
60 a disadvantage is that to support the selection of 8 bit of predicate
61 from 8 CRs (via the "full" 8x CR port") would require allocating 32-bit
62 datapath to the relevant FUs. This could be reduced by adding yet another
63 type of special virtual register port or datapath that masks out the
64 required predicate bits closer to the regfile.
65
66 another disadvantage is that the CR regfile needs to be expanded from 8x 4bit CRs to a minimum of 64x or preferably 128x 4-bit CRs. Beyond that they can be transferred using vectorised mfcr and mtcrf into INT regs. this is a huge number of CR regs, each of which will need a DM column in the FU-REGs Matrix. however this cost can be mitigated through regfile cacheing, bringing FU-REGs column numbers back down to "sane".
67
68 ### Predicated SIMD HI32-LO32 FUs
69
70 an analysis of changing the element widths (for SIMD) gives the following
71 potential arrangements, for which it is assumed that 2x 32-bit FUs
72 "pair up" for single 64 bit arithmetic, HI32 and LO32 style.
73
74 * 64-bit operations. 2 FUs and their DM rows "collaborate"
75 - 2x 32-bit source registers gang together for 64 bit input
76 - 2x 32-bit output registers likewise for output
77 - 1x CR (from the LO32 FU DM side) for a predicate bit
78 * 32-bit operations. 2 FUs collaborate 2x32 SIMD style
79 - 2x 32-bit source registers go into separate input halves of the
80 SIMD ALU
81 - 2x 32-bit outputs likewise for output
82 - 2x CRs (one for HI32, one for LO32) for a predicate bit for each of
83 the 2x32bit SIMD pair
84 * 16-bit operations. 2 FUs collaborate 4x16 SIMD style
85 - 2x 2x16-bit source registers group together to provide 4x16 inputs
86 - likewise for outputs
87 - EITHER 2x 2xCRs (2 for HI32, 2 for LO32) provide 4 predicate bits
88 - OR 1x 8xCR "full" port is utilised (on LO32 FU) followed by masking
89 at the ALU behind the FU pair, extracting the required 4 predicate bits
90 * 8-bit operations. 2 FUs collaborate 8x8 SIMD style
91 - 2x 4x8-bit source registers
92 - likewise for outputs
93 - 1x 8xCR "full" port is utilised (on LO32 FU) and all 8 bits are
94 passed through to the underlying 64-bit ALU to perform 8x 8-bit
95 predicated operations
96
97 ### Predicated SIMD straight 64-bit FUs
98
99 * 64-bit operations. 1 FU, 1 64 bit operation
100 - 1x 64-bit source register
101 - 1x 64-bit output register
102 - 1x CR for a predicate bit
103 * 32-bit operations. 1 FUs 2x32 SIMD style
104 - 1x 64-bit source register dynamically splits to 2x 32-bit
105 - 1x 64-bit output likewise
106 - 2x CRs for a predicate bit for each of the 2x32bit SIMD pair
107 * 16-bit operations. 1 FUs 4x16 SIMD style
108 - 1x 4x16-bit source registers
109 - likewise for outputs
110 - 1x 8xCR "full" port is utilised followed by masking at the ALU behind
111 the FU pair, extracting the required 4 predicate bits
112 * 8-bit operations. 1 FU 8x8 SIMD style
113 - 1x 8x8-bit source registers
114 - likewise for outputs
115 - 1x 8xCR "full" port is utilised LO32 and all 8 bits used
116 to perform 8x 8-bit predicated operations
117
118 Here again the underying 64-bit ALU requires the 8x predicate bits to
119 cover the 8x8-bit SIMD operations (7 of which are dormant/unused in 64-bit
120 predicated operations but still have to be there to cover 8x8-bit SIMD).
121
122 Given that the initial idea of using the "full" (virtual) 32-bit CR read
123 port (which reads all 8 CRs CR0-CR7 simultaneously) would require a
124 32-bit broadcast bus to every predication-capable Function Unit, the bus
125 bandwidth can again be reduced by performing the selection of the masks
126 (bit 0 thru bit 3 of each CR) closer to the regfile i.e. before hitting
127 the broadcast bus.
128
129 ## One scalar int per predicate element.
130
131 Similar to RVV and similar to the one-CR-per-element concept above, the idea here is to use the LSB of any given element in a vector of predicates. This idea has quite a lot of merit to it.
132
133 Implementation-wise just like in the CR-based case a special regfile port could be added that gets the LSB of each scalar integer register and routes them through to the broadcast bus.
134
135 The disadvantages appear on closer analysis:
136
137 * Unlike the "full" CR port (which reads 8x CRs CR0-7 in one hit) trying the same trick on the scalar integer regfile, to obtain just 8 predicate bits (each being an LSB of a given 64 bit scalar int), would require a whopping 8x64bit set of reads to the INT regfile instead of a scant 1x32bit read. Resource-wise, then, this idea is expensive.
138 * With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised-mfcr) are more challenging and costly. Rather than use vectorised mfcr, complex transfers of the LSBs into a single scalar int are required.
139
140 In a "normal" Vector ISA this would be solved by adding opcodes that perform the kinds of bitmanipulation operations normally needed for predicate masks, as specialist operations *on* those masks. However for SV the rule has been set: "no unnecessary additional Vector Instructions" because it is possible to use existing PowerISA scalar bitmanip opcodes to cover the same job.
141
142 The problem is that vectors of LSBs need to be transferred *to* scalar int regs, bitmanip operations carried out, *and then transferred back*, which is exceptionally costly.
143
144 On balance this is a less favourable option than vectorising CRs
145
146 ## Scalar (single) integer as predicate, with one DM row
147
148 This idea has merit in that to perform predicate bitmanip operations the preficate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away. Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
149
150 This idea has several disadvantages.
151
152 * the single DM entry for the entire 64 bits creates a read hazard
153 that has to be resolved through the addition of a special Shadowing
154 Function Unit. Only when the entire predicate is available can the
155 die-cancel/ok be pulled on the FU elements each bit covers
156 * this situation is exacerbated if one vector creates a predicate
157 mask that is then used to mask immediately following instructions.
158 Ordinarily (i.e. without the predicate involved), Cray-style "chaining"
159 would be possible. The single DM entry for the entire predicate mask
160 prohibits this because the subsequent operations can only proceed when
161 the *entire* mask has been computed and placed in full
162 into the scalar integer register.
163 * Allocation of bits to FUs gets particularly complex for SIMD (elwidth
164 overrides) requiring shift and mask logic that is simply not needed
165 compared to "one-for-one" schemes (above)
166
167 Overall there is very little in favour of this concept.
168
169 ## Scalar (single) integer as predicate with one DM row per bit
170
171 The Dependency Matrix logic from the CR proposal favourably applies
172 equally to this proposal. However there are additional caveats that
173 weigh against it:
174
175 * Like the single scalar DM entry proposal, the integer scalar register
176 had to be covered also by a single DM entry (for when it is used *as*
177 an integer register).
178 * Unlike the same, it must also be covered by a 64-wide suite of bitlevel
179 Dependency Matrix Rows. These numbers are so massive as to cause some
180 concern.
181 * A solution is to introduce a virtual register naming scheme however
182 this also introduces huge complexity as the register cache has to be
183 capable of swapping reservations from 64 bitlevel to full 64bit scalar
184 level *and* keep the Dependency Matrices synchronised
185
186 it is enormously complex and likely to result in debugging, verification
187 and ongoing maintenance difficulties.
188
189 ## Schemes which split (a scalar) integer reg into mask "chunks"
190
191 These ideas are based on the principle that each chunk of 8 (or 16)
192 bits of a scalar integer register may be covered by its own DM column
193 in FU-REGs.
194 8 chunks of a scalar 64-bit integer register for use as a bit-level
195 predicate mask onto 64 vector elements would for example require 8
196 DM entries.
197
198 This would, for vector sizes of 8, solve the "chaining" problem reasonably
199 well even when two FUs (or two clock cycles) were required to deal with
200 4 elements at a time. The "compare" that generated the predicate would
201 be ready to go into the first "chunk" of predicate bits whilst the second
202 compare was still being issued.
203
204 It would also require a lot smaller DMs than the single-bit-per-element
205 ideas.
206
207 The problems start when trying to allocate bits of predicate to units.
208 Just like the single-DM-row per entire scalar reg case, a shadow-capable
209 Predicate Function Unit is now required (already determined to be costly)
210 except now if there are 8 chunks requiring 8 Predicate FUs *the problem
211 is now made 8x worse*.
212
213 Not only that but it is even more complex when trying to bring in virtual
214 register cacheing in order to bring down overall FU-REGs DM row count,
215 although the numbers are much lower: 8x 8-bit chunks of scalar int
216 only requires 8 DM Rows and 8 virtual subdivisions however *this is per
217 in-flight register*.
218
219 The additional complexity of the cross-over point between use as a chunked
220 predicate mask and when the same underlying register is used as an actual
221 scalar (or even vector) integer register is also carried over from the
222 bit-level DM subdivision case.
223
224 Out-of-order systems, to be effective, require several operations to
225 be "in-flight" (POWER10 has up to 1,000 in-flight instructions) and if
226 every predicated vector operation needed one 8-chunked scalar register
227 each it becomes exceedingly complex very quickly.
228
229 Even more than that, in a predicated chaining scenario, when computing
230 the mask from a vector "compare", the groupings are troublesome to
231 think through how to implement, which is itself a bad sign. It is
232 suspected that chaining will be complex or adversely affected by certain
233 combinations of element width.
234
235 (see [[masked_vector_chaining]])
236
237 Overall this idea which initially seems to save resources brings together
238 all the least favourable implementation aspects of other proposals and
239 requires and combines all of them.