# TODO

<https://bugs.libre-soc.org/show_bug.cgi?id=213>

* idea 1: modify cmp (and other CR generators?) with qualifiers that
  create a single-bit prefix vector in an int reg
* idea 2: override the CR SO field in vector form to be a predicate bit per element
* idea 3: predicates are read from bits of an int reg
* idea 4: the SO CR field is no longer overflow; it contains a copy of the
  int reg predicate element bit (passed through). when OE set?

# Requirements

* must be easily implementable in any microarchitecture, including out-of-order
* must not compromise or penalise any microarchitectural performance
* must cover up to 64 elements

# Proposals

## CR-based predication proposal

This involves treating each CR as providing one bit of predicate. If
there is limited space in SVPrefix it will be a fixed bit (bit 0),
otherwise it may be selected (bit 0 to 3 of the CR).

The crucial advantage of this proposal is that the Function Units can
have one more register (a CR) added to their Read Dependency Hazards,
just like all the other incoming source registers, and there is no need
for a special "Predicate Shadow Function Unit".

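As a rough illustration of "one CR, one predicate bit", the following Python sketch gathers a predicate from a list of 4-bit CR field values. The function name `predicate_from_crs`, the one-CR-per-element list layout and the LSB0 bit numbering are all assumptions made for the example; none of them is taken from the specification.

```python
def predicate_from_crs(cr_fields, bit_sel=0):
    """Collect one predicate bit per element from 4-bit CR field values.

    cr_fields: list of ints in 0..15, one CR field per element (assumed layout).
    bit_sel:   which bit of each CR field to use (0..3). Leaving it at 0
               models the "fixed bit" case for limited SVPrefix space.
    Bit numbering is plain LSB0 on the 4-bit value, for illustration only.
    """
    if not 0 <= bit_sel <= 3:
        raise ValueError("bit_sel must be in 0..3")
    return [(cr >> bit_sel) & 1 for cr in cr_fields]


if __name__ == "__main__":
    crs = [0b0001, 0b0010, 0b0011, 0b0000]
    print(predicate_from_crs(crs))             # fixed bit 0
    print(predicate_from_crs(crs, bit_sel=1))  # selected bit 1
```

Because each element reads its own CR, each predicate bit can be tracked as an ordinary source-register hazard, which is the point the proposal makes.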
An analysis of changing the element widths (for SIMD) gives the following
potential arrangements, for which it is assumed that 2x 32-bit FUs
"pair up" for single 64-bit arithmetic, HI32 and LO32 style.

* 64-bit operations. 2 FUs and their DM rows "collaborate"
    - 2x 32-bit source registers gang together for 64-bit input
    - 2x 32-bit output registers likewise for output
    - 1x CR (from the LO32 FU DM side) for a predicate bit
* 32-bit operations. 2 FUs collaborate 2x32 SIMD style
    - 2x 32-bit source registers go into separate input halves of the
      SIMD ALU
    - 2x 32-bit outputs likewise for output
    - 2x CRs (one for HI32, one for LO32) for a predicate bit for each of
      the 2x 32-bit SIMD pair
* 16-bit operations. 2 FUs collaborate 4x16 SIMD style
    - 2x 2x16-bit source registers group together to provide 4x16 inputs
    - likewise for outputs
    - EITHER 2x 2xCRs (2 for HI32, 2 for LO32) provide 4 predicate bits
    - OR 1x 8xCR "full" port is utilised (on LO32 FU) followed by masking
      at the ALU behind the FU pair, extracting the required 4 predicate bits
* 8-bit operations. 2 FUs collaborate 8x8 SIMD style
    - 2x 4x8-bit source registers
    - likewise for outputs
    - 1x 8xCR "full" port is utilised (on LO32 FU) and all 8 bits are
      passed through to the underlying 64-bit ALU to perform 8x 8-bit
      predicated operations

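The arrangements listed above follow one pattern: a 64-bit FU pair handles `64 / elwidth` lanes and needs that many predicate bits. The sketch below states this in Python and models the "full port followed by masking" option. The function names are hypothetical, and taking the low-order bits of the 8-bit port is an assumption, since the text does not say which bits are extracted.

```python
def lanes_per_fu_pair(elwidth):
    """Number of SIMD lanes (and so predicate bits) for one 64-bit FU pair."""
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    return 64 // elwidth


def extract_lane_predicates(full_port_bits, elwidth):
    """Mask an 8-bit 'full' CR port down to the predicate bits in use.

    Assumption: the required bits are the low-order ones of the port.
    """
    n = lanes_per_fu_pair(elwidth)
    return full_port_bits & ((1 << n) - 1)


if __name__ == "__main__":
    for width in (64, 32, 16, 8):
        print(width, lanes_per_fu_pair(width))
    print(bin(extract_lane_predicates(0b10110110, 16)))
```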
A big advantage of this proposal is that unpredicated operations just set
the predicate to an immediate of all 1s, so the actual ALUs require very
little modification.

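A minimal Python model of a predicated SIMD add on a 64-bit ALU shows why the unpredicated case needs no separate path: it is the same operation with an all-1s mask. The name `simd_add` is hypothetical, and the behaviour of a masked-out lane (keeping the old destination value) is an assumption made for this example, not something the text specifies.

```python
def simd_add(a, b, dest, elwidth, pred=None):
    """Lane-wise add of two 64-bit values under a per-lane predicate mask.

    pred bit i enables lane i. pred=None models an unpredicated operation,
    which simply supplies an immediate of all 1s.
    Assumption: a masked-out lane keeps the old destination value.
    """
    lanes = 64 // elwidth
    if pred is None:
        pred = (1 << lanes) - 1
    lane_mask = (1 << elwidth) - 1
    result = 0
    for i in range(lanes):
        shift = i * elwidth
        if (pred >> i) & 1:
            # the carry out of the lane is discarded by the mask
            lane = ((a >> shift) + (b >> shift)) & lane_mask
        else:
            lane = (dest >> shift) & lane_mask
        result |= lane << shift
    return result


if __name__ == "__main__":
    a, b = 0x0102030405060708, 0x1010101010101010
    print(hex(simd_add(a, b, 0, 8)))               # all lanes active
    print(hex(simd_add(a, b, 0, 8, pred=0b0001)))  # lane 0 only
```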
## Scalar (single) integer as predicate, with one DM row

This idea has several disadvantages.

* The single DM entry for the entire 64 bits creates a read hazard
  that has to be resolved through the addition of a special Shadowing
  Function Unit. Only when the entire predicate is available can the
  die-cancel/ok be pulled on the FU elements each bit covers.
* This situation is exacerbated if one vector creates a predicate
  mask that is then used to mask immediately following instructions.
  Ordinarily (i.e. without the predicate involved), Cray-style "chaining"
  would be possible. The single DM entry for the entire predicate mask
  prohibits this because the subsequent operations can only proceed when
  the *entire* mask has been computed.
* Allocation of bits to FUs gets particularly complex for SIMD (elwidth
  overrides), requiring shift and mask logic that is simply not needed
  in the "one-for-one" schemes (above).

Overall there is very little in favour of this concept.

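The chaining penalty can be illustrated with a toy timing model in Python. It assumes, purely for illustration, that a consumer element may start one cycle after the predicate bit it depends on becomes ready. The function name and the cycle numbers are hypothetical and say nothing about a real pipeline.

```python
def consumer_start_cycles(bit_ready, per_element):
    """Earliest start cycle of each consumer element.

    bit_ready:   cycle at which each predicate bit is produced.
    per_element: True models one hazard per predicate bit (chaining possible),
                 False models a single DM entry covering the whole mask.
    Assumption: a consumer may start one cycle after its hazard clears.
    """
    if not bit_ready:
        return []
    if per_element:
        return [t + 1 for t in bit_ready]
    whole_mask_ready = max(bit_ready) + 1
    return [whole_mask_ready] * len(bit_ready)


if __name__ == "__main__":
    ready = [0, 1, 2, 3]  # a compare producing one predicate bit per cycle
    print(consumer_start_cycles(ready, per_element=True))
    print(consumer_start_cycles(ready, per_element=False))
```

With a single DM entry every consumer element waits for the slowest predicate bit, which is the loss of chaining described in the second bullet above.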
## Scalar (single) integer as predicate with one DM row per bit

The favourable Dependency Matrix logic from the CR proposal applies
equally to this proposal. However there are additional caveats that
weigh against it:

* Like the single scalar DM entry proposal, the integer scalar register
  has to be covered also by a single DM entry (for when it is used *as*
  an integer register).
* Unlike that proposal, it must also be covered by a 64-wide suite of
  bit-level Dependency Matrix Rows. These numbers are so massive as to
  cause some concern.
* A solution is to introduce a virtual register naming scheme. However
  this also introduces huge complexity, as the register cache has to be
  capable of swapping reservations from 64 bit-level entries to full
  64-bit scalar level *and* keep the Dependency Matrices synchronised.

It is enormously complex and likely to result in debugging, verification
and ongoing maintenance difficulties.

## Schemes which split (a scalar) integer reg into mask "chunks"

These ideas are based on the principle that each chunk of 8 (or 16)
bits of a scalar integer register may be covered by its own DM row.
8 chunks of a scalar 64-bit integer register for use as a bit-level
predicate mask onto 64 vector elements would for example require 8
DM entries.

This would, for vector sizes of 8, solve the "chaining" problem reasonably
well even when two FUs (or two clock cycles) were required to deal with
4 elements at a time. The "compare" that generated the predicate would
be ready to go into the first "chunk" of predicate bits whilst the second
compare was still being issued.

It would also require much smaller DMs than the single-bit-per-element
ideas.

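A small Python sketch of the chunking idea: a 64-bit mask is split into 8-bit chunks, and a chunk only becomes usable once every bit in it has been produced. Both function names are hypothetical, and numbering chunk 0 from the least significant bits is an assumption made for the example.

```python
def split_chunks(mask, chunk_bits=8, total_bits=64):
    """Split a predicate mask into chunks; chunk 0 is the least significant."""
    chunk_mask = (1 << chunk_bits) - 1
    return [(mask >> shift) & chunk_mask
            for shift in range(0, total_bits, chunk_bits)]


def chunk_ready_cycles(bit_ready, chunk_bits=8):
    """A chunk clears its hazard only when its slowest bit is ready."""
    return [max(bit_ready[i:i + chunk_bits])
            for i in range(0, len(bit_ready), chunk_bits)]


if __name__ == "__main__":
    chunks = split_chunks(0x0123456789ABCDEF)
    print(len(chunks), [hex(c) for c in chunks])
    # 16 predicate bits, produced 8 per cycle by two successive compares
    print(chunk_ready_cycles([0] * 8 + [1] * 8))
```

In this model the first chunk is released while the second compare is still in progress, so chaining works at chunk granularity rather than per element.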
The problems start when trying to allocate bits of predicate to units.
Just like the single-DM-row per entire scalar reg case, a shadow-capable
Predicate Function Unit is now required (already determined to be costly),
except that if there are 8 chunks requiring 8 Predicate FUs *the problem
is now made 8x worse*.

Not only that, but it is even more complex when trying to bring in virtual
register caching in order to bring down the overall FU-REGs DM row count.
The numbers are much lower: 8x 8-bit chunks of a scalar int only require
8 DM Rows and 8 virtual subdivisions. However *this is per
in-flight register*.

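The row counts quoted for the integer-register schemes can be put side by side with a few lines of Python. The per-register figures come from the text (1 row, 64 bit-level rows plus 1 scalar row, and 8 chunk rows). The scheme names, the function and the in-flight register count used below are hypothetical.

```python
# DM rows needed per predicate register, as quoted in the text
ROWS_PER_REGISTER = {
    "single_row": 1,        # one DM entry for the whole 64-bit scalar
    "row_per_bit": 64 + 1,  # 64 bit-level rows plus the scalar-use row
    "chunks_of_8": 64 // 8, # 8 chunks of 8 bits each
}


def total_dm_rows(scheme, in_flight_regs):
    """Rows scale linearly with the number of in-flight predicate registers."""
    return ROWS_PER_REGISTER[scheme] * in_flight_regs


if __name__ == "__main__":
    for scheme in ROWS_PER_REGISTER:
        print(scheme, total_dm_rows(scheme, in_flight_regs=4))
```

The multiplication by the in-flight count is what makes the otherwise modest chunked figure grow quickly in an out-of-order design.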
The additional complexity of the cross-over point between use as a chunked
predicate mask and use of the same underlying register as an actual
scalar (or even vector) integer register is also carried over from the
bit-level DM subdivision case.

Out-of-order systems, to be effective, require several operations to
be "in-flight" (POWER10 has up to 1,000 in-flight instructions), and if
every predicated vector operation needed one 8-chunked scalar register
each, it becomes exceedingly complex very quickly.

Even more than that, in a predicated chaining scenario, when computing
the mask from a vector "compare", it is troublesome even to think through
how the groupings would be implemented, which is itself a bad sign. It is
suspected that chaining will be complex or adversely affected by certain
combinations of element width.

(see [[masked_vector_chaining]])

Overall this idea, which initially seems to save resources, brings together
and requires all the least favourable implementation aspects of the other
proposals.