# TODO ideas

<https://bugs.libre-soc.org/show_bug.cgi?id=527>

* idea 1: modify cmp (and other CR generators?) with qualifiers that
  create a single-bit prefix vector into an int reg
* idea 2: override the CR SO field in vector form to be a predicate bit
  per element
* idea 3: reading of predicates is from bits of an int reg
* idea 4: the SO CR field no longer holds overflow, instead containing a
  copy of the int reg predicate element bit (passed through). when OE set?

# Requirements

* must be easily implementable in any microarchitecture including:
    - small and large out-of-order
    - in-order
    - FSM (0.3 IPC or below)
    - single or multi-issue
* must not compromise or penalise any microarchitectural performance
* must cover up to 64 elements
* must still work for elwidth over-rides

## Additional Capabilities

* two modes, "zeroing" and "non-zeroing". zeroing mode places a zero in the
  masked-out element results, whereas non-zeroing leaves the destination
  (result) element unmodified.
* the predicate must be invertible via an opcode bit (to avoid the need for
  an instruction which inverts all bits of the predicate mask)

Implementation note: even in in-order microarchitectures it is strongly
advisable to use byte-level write-enable lines on the register file. This,
in combination with 8-bit SIMD element overrides, allows, in "non-zeroing"
mode, the predicate mask to very simply be directly ANDed with the regfile
write-enable lines to achieve the required functionality of leaving
masked-out elements unmodified, right down to the 8-bit element level. The
alternative is to perform a READ-MODIFY-MASK-WRITE cycle, which is costly
and compromises performance. This is avoided very simply with byte-level
write-enables.

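Below is a minimal Python sketch (illustrative only, not part of any
implementation: the function name and arguments are made up) of how a
predicate mask can be expanded into per-byte regfile write-enables for
"non-zeroing" mode, assuming a 64-bit register holding SIMD elements of
1, 2, 4 or 8 bytes. Masked-out elements simply never have their
write-enable lines raised, so no READ-MODIFY-MASK-WRITE is needed.

    # a minimal sketch, assuming byte-level write-enables and a 64-bit
    # register holding 64/elwidth SIMD elements.
    def byte_write_enables(pred_mask, elwidth_bytes, regwidth_bytes=8):
        """expand an element-level predicate into per-byte write-enables"""
        enables = 0
        for elt in range(regwidth_bytes // elwidth_bytes):
            if (pred_mask >> elt) & 1:                # element is active
                for b in range(elwidth_bytes):        # raise its byte enables
                    enables |= 1 << (elt * elwidth_bytes + b)
        return enables

    # example: 8-bit elements, predicate 0b10110101 -> enables 0b10110101
    #          16-bit elements, predicate 0b1010     -> enables 0b11001100
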
## General implications and considerations

### OE=1 and SO

XER.SO (sticky overflow) is known to cause massive slowdown in pretty much
every microarchitecture and it definitely compromises the performance of
out-of-order systems. The reason is that it introduces a READ-MODIFY-WRITE
cycle between XER.SO and CR0 (which contains a copy of the SO field after
inclusion of the overflow). The result and source registers branch off as
RaW and WaR hazards from this RMW chain.

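As an illustration of that chain, here is a rough Python model (not HDL,
and simplified to a 64-bit add with signed-overflow detection only) of what
an OE=1, Rc=1 add has to do: read the previous XER.SO, OR in the new
overflow, write XER.SO back, and only then produce CR0's SO bit, which is a
copy of the updated XER.SO.

    # rough model of the XER.SO read-modify-write chain for addo. (OE=1 Rc=1)
    def addo_rc(ra, rb, xer_so):
        result = (ra + rb) & 0xFFFFFFFFFFFFFFFF
        sign = 1 << 63
        ov = ((ra & sign) == (rb & sign)) and ((result & sign) != (ra & sign))
        new_so = xer_so | ov          # READ-MODIFY-WRITE on the sticky SO bit
        cr0_so = new_so               # CR0.SO is a copy of the updated XER.SO
        return result, new_so, cr0_so
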
This is even before predication or vectorization are added on top: these
are existing weaknesses in OpenPOWER as a scalar ISA.

Despite being a well-known weakness that compromises performance, OE=1
sees very little actual use outside of unit tests and Conformance Tests.
Consequently it makes very little sense to continue to propagate OE=1 in
the Vectorization context of SV.

### Vector Chaining

(see [[masked_vector_chaining]])

One of the design principles of SV is that the use of VL should be as close
as possible to a direct substitution of the scalar operations of the
hardware for-loop, as if those looped operations were actually in the
instruction stream (as scalar operations) rather than being issued from the
Vector loop.

The implications here are that *register dependency hazards still have to
be respected inter-element* even when (conceptually) pushed into the
instruction stream from a hardware for-loop.

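A conceptual Python model (behavioural only, with illustrative regfile and
operand names) of what a predicated SV add is intended to be equivalent to:
the scalar operation issued once per element from a hardware for-loop, each
element carrying its own register dependency hazards.

    # behavioural model only: a predicated SV add as a scalar for-loop.
    # gpr is the integer regfile (a list), rt/ra/rb are starting register
    # numbers, vl is the Vector Length, mask is the predicate.
    def sv_add(gpr, rt, ra, rb, vl, mask, zeroing=False):
        for i in range(vl):
            if (mask >> i) & 1:
                gpr[rt + i] = (gpr[ra + i] + gpr[rb + i]) & 0xFFFFFFFFFFFFFFFF
            elif zeroing:
                gpr[rt + i] = 0    # zeroing mode: masked-out result is zero
            # non-zeroing: masked-out destination element left unmodified
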
Using a multi-issue out-of-order engine as the underlying microarchitectural
basis, this is not as difficult to achieve as it first seems (the hard work
having been done by the Dependency Matrices). In addition, Vector Chaining
should also be possible for a multi-issue out-of-order engine to cope with,
as long as false (unnecessary) Dependency Hazards are not introduced between
Vectors, when the dependencies actually only exist between elements *in* the
Vector.

The concept of recognising that it is the elements within the Vector that
have Dependency Hazards, rather than the Vectors themselves, is what permits
Cray-style "chaining".

This "false/unnecessary hazard" condition either eliminates chaining
outright, compromises performance, or drives up resource utilisation in at
least two of the proposals below.

# Proposals

## Adding new predicate register file type and associated opcodes

This idea, adding new predicate manipulation opcodes, violates the
fundamental design principle of SV: not to add new vector-related
instructions unless essential or compelling.

All other proposals utilise existing scalar opcodes which already happen to
have bitmanipulation, arithmetic, and inter-file transfer capability (mfcr,
mfspr etc). They also involve adding extra scalar bitmanip opcodes, such
that by utilising scalar registers as predicate masks SV achieves "par"
with other Cray-style (variable-length) Vector ISAs, all without actually
having to add any actual Vector opcodes.

In addition those scalar 64-bit bitmanip operations, although some of them
are obscure and unusual in the scalar world, do actually have practical
applications outside of a vector context.

(Hilariously and confusingly, those very same scalar bitmanip opcodes may
themselves be SV-vectorized; however, with VL only being up to 64 elements,
it is not anticipated that SV-bitmanip would be used to generate 64-bit
predicate masks when a single 64-bit scalar operation will suffice.)

The summary is that adding a full set of special Vector opcodes just for
manipulating predicate masks, and being able to transfer them to other
regfiles (a la mfcr), is anomalous, costly, and unnecessary.

## CR-based predication proposal

This involves treating each CR as providing one bit of predicate. If there
is limited space in SVPrefix it will be a fixed bit (bit 0); otherwise it
may be selected (bit 0 to 3 of the CR) through a field in the opcode.

The crucial advantage of this proposal is that the Function Units can have
one more register (a CR) added as their Read Dependency Hazards, just like
all the other incoming source registers, and there is no need for a special
"Predicate Shadow Function Unit".

A big advantage of this is that unpredicated operations just set the
predicate to an immediate of all 1s and the actual ALUs require very little
modification.

A disadvantage is that to support the selection of 8 bits of predicate from
8 CRs (via the "full" 8x CR port) would require allocating a 32-bit datapath
to the relevant FUs. This could be reduced by adding yet another type of
special virtual register port or datapath that masks out the required
predicate bits closer to the regfile, as sketched below.

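A sketch of that masking/selection step (illustrative only: the function is
made up, and the exact packing and IBM MSB0 bit numbering of the 8x CR port
are deliberately glossed over). Given the 32-bit "full" port read of
CR0-CR7 and a 2-bit selector choosing bit 0 to 3 of each CR, only 8
predicate bits need to travel onward to the FUs.

    # illustrative only: reduce a 32-bit 8xCR "full" port read down to an
    # 8-bit predicate by selecting one bit (0..3) from each 4-bit CR field.
    # exact CR field/bit ordering (IBM MSB0 numbering) is glossed over.
    def cr_predicate(cr_port_32bit, bit_sel):
        pred = 0
        for n in range(8):                           # one predicate bit per CR
            cr_n = (cr_port_32bit >> (n * 4)) & 0xF  # extract CRn (4 bits)
            pred |= ((cr_n >> bit_sel) & 1) << n     # pick the selected bit
        return pred                                  # 8-bit predicate mask
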
Another disadvantage is that the CR regfile needs to be expanded from 8x
4-bit CRs to a minimum of 64x or preferably 128x 4-bit CRs. Beyond that,
they can be transferred using vectorized mfcr and mtcrf into INT regs.
This is a huge number of CR regs, each of which will need a DM column in
the FU-REGs Matrix; however this cost can be mitigated through regfile
cacheing, bringing FU-REGs column numbers back down to "sane".

### Predicated SIMD HI32-LO32 FUs

An analysis of changing the element widths (for SIMD) gives the following
potential arrangements, for which it is assumed that 2x 32-bit FUs "pair
up" for single 64-bit arithmetic, HI32 and LO32 style. A summary sketch
follows the list.

* 64-bit operations. 2 FUs and their DM rows "collaborate"
    - 2x 32-bit source registers gang together for 64-bit input
    - 2x 32-bit output registers likewise for output
    - 1x CR (from the LO32 FU DM side) for a predicate bit
* 32-bit operations. 2 FUs collaborate 2x32 SIMD style
    - 2x 32-bit source registers go into separate input halves of the
      SIMD ALU
    - 2x 32-bit outputs likewise for output
    - 2x CRs (one for HI32, one for LO32) for a predicate bit for each of
      the 2x32-bit SIMD pair
* 16-bit operations. 2 FUs collaborate 4x16 SIMD style
    - 2x 2x16-bit source registers group together to provide 4x16 inputs
    - likewise for outputs
    - EITHER 2x 2xCRs (2 for HI32, 2 for LO32) provide 4 predicate bits
    - OR 1x 8xCR "full" port is utilised (on LO32 FU) followed by masking
      at the ALU behind the FU pair, extracting the required 4 predicate bits
* 8-bit operations. 2 FUs collaborate 8x8 SIMD style
    - 2x 4x8-bit source registers
    - likewise for outputs
    - 1x 8xCR "full" port is utilised (on LO32 FU) and all 8 bits are
      passed through to the underlying 64-bit ALU to perform 8x 8-bit
      predicated operations

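To summarise the list above, a small illustrative helper (names and the
exact LO32/HI32 assignment are made up for illustration, not taken from any
design): the number of predicate bits a HI32-LO32 FU pair requires is
simply 64 divided by the element width, and each bit gates one sub-element
on one half of the pair.

    # illustrative summary: predicate bits needed by one HI32-LO32 FU pair,
    # and which half of the pair each predicate bit gates.
    def predicate_bits_per_pair(elwidth):
        assert elwidth in (8, 16, 32, 64)
        return 64 // elwidth     # 64-bit: 1, 32-bit: 2, 16-bit: 4, 8-bit: 8

    def predicate_bit_map(elwidth):
        mapping = {}
        for p in range(predicate_bits_per_pair(elwidth)):
            half = "LO32" if (p * elwidth) < 32 else "HI32"
            mapping[p] = half    # predicate bit p gates an element on 'half'
        return mapping
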
### Predicated SIMD straight 64-bit FUs

* 64-bit operations. 1 FU, 1x 64-bit operation
    - 1x 64-bit source register
    - 1x 64-bit output register
    - 1x CR for a predicate bit
* 32-bit operations. 1 FU, 2x32 SIMD style
    - 1x 64-bit source register dynamically splits to 2x 32-bit
    - 1x 64-bit output likewise
    - 2x CRs for a predicate bit for each of the 2x32-bit SIMD pair
* 16-bit operations. 1 FU, 4x16 SIMD style
    - 1x 4x16-bit source register
    - likewise for outputs
    - 1x 8xCR "full" port is utilised followed by masking at the ALU behind
      the FU, extracting the required 4 predicate bits
* 8-bit operations. 1 FU, 8x8 SIMD style
    - 1x 8x8-bit source register
    - likewise for outputs
    - 1x 8xCR "full" port is utilised (LO32) and all 8 bits used to
      perform 8x 8-bit predicated operations

Here again the underlying 64-bit ALU requires the 8x predicate bits to
cover the 8x8-bit SIMD operations (7 of which are dormant/unused in 64-bit
predicated operations but still have to be there to cover 8x8-bit SIMD).

Given that the initial idea of using the "full" (virtual) 32-bit CR read
port (which reads all 8 CRs, CR0-CR7, simultaneously) would require a
32-bit broadcast bus to every predication-capable Function Unit, the bus
bandwidth can again be reduced by performing the selection of the masks
(bit 0 through bit 3 of each CR) closer to the regfile, i.e. before hitting
the broadcast bus.

## One scalar int per predicate element

Similar to RVV and to the one-CR-per-element concept above, the idea here
is to use the LSB of any given element in a vector of predicates. This idea
has quite a lot of merit to it.

Implementation-wise, just as in the CR-based case, a special regfile port
could be added that gets the LSB of each scalar integer register and routes
them through to the broadcast bus, as sketched below.

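A sketch of that routing (register numbers and the helper are purely
illustrative): the predicate is formed from the LSB of each element's
scalar register. Note that every predicate bit costs a full 64-bit regfile
read, which the disadvantages below quantify.

    # illustrative only: form a predicate from the LSB of each 64-bit
    # scalar register in a (vectorised) group of registers.
    def gather_lsb_predicate(gpr, first_reg, nelements):
        pred = 0
        for i in range(nelements):
            pred |= (gpr[first_reg + i] & 1) << i   # one 64-bit read per bit
        return pred
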
The disadvantages appear on closer analysis:

* Unlike the "full" CR port (which reads 8x CRs, CR0-7, in one hit), trying
  the same trick on the scalar integer regfile, to obtain just 8 predicate
  bits (each being the LSB of a given 64-bit scalar int), would require a
  whopping 8x 64-bit set of reads of the INT regfile instead of a scant
  1x 32-bit read. Resource-wise, then, this idea is expensive.
* With predicate bits being distributed amongst 64-bit scalar registers,
  the scalar bitmanipulation operations that can be performed after
  transferring Vectors of CMP results from CRs to INTs (vectorized mfcr)
  become more challenging and costly. Rather than use vectorized mfcr,
  complex transfers of the LSBs into a single scalar int are required.

In a "normal" Vector ISA this would be solved by adding opcodes that
perform the kinds of bitmanipulation operations normally needed for
predicate masks, as specialist operations *on* those masks. However for SV
the rule has been set: "no unnecessary additional Vector Instructions",
because it is possible to use existing PowerISA scalar bitmanip opcodes to
cover the same job.

The problem is that vectors of LSBs need to be transferred *to* scalar int
regs, bitmanip operations carried out, *and then transferred back*, which
is exceptionally costly.

On balance this is a less favourable option than vectorizing CRs.

## Scalar (single) integer as predicate, with one DM row

This idea has merit in that the predicate is already in scalar INT reg
form, and consequently standard scalar INT bitmanip operations can be
applied to it straight away (see the sketch below). Vectorized mfcr can be
used to get CMP results or Vectorized Rc=1 CRs into the scalar INT, easily.

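As a sketch of why this is convenient (the Python operations below are
plain stand-ins for ordinary scalar PowerISA bitmanip such as and, or, not
and popcount): the mask is just a 64-bit scalar, so combining, inverting
and counting predicates needs no transfers at all.

    # plain scalar operations on a 64-bit predicate mask: no special
    # vector opcodes and no regfile transfers are needed.
    def combine_masks(m1, m2):
        return m1 & m2                     # e.g. "a < b AND c == d"

    def invert_mask(m, vl):
        return ~m & ((1 << vl) - 1)        # invert only the first VL bits

    def count_active(m):
        return bin(m).count("1")           # popcount: number of active elements
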
This idea has several disadvantages:

* the single DM entry for the entire 64 bits creates a read hazard that has
  to be resolved through the addition of a special Shadowing Function Unit.
  Only when the entire predicate is available can the die-cancel/ok be
  pulled on the FU elements each bit covers
* this situation is exacerbated if one vector creates a predicate mask that
  is then used to mask immediately-following instructions. Ordinarily (i.e.
  without the predicate involved), Cray-style "chaining" would be possible.
  The single DM entry for the entire predicate mask prohibits this, because
  the subsequent operations can only proceed when the *entire* mask has
  been computed and placed in full into the scalar integer register.
* allocation of bits to FUs gets particularly complex for SIMD (elwidth
  overrides), requiring shift and mask logic that is simply not needed
  compared to the "one-for-one" schemes (above)

Overall there is very little in favour of this concept.

## Scalar (single) integer as predicate with one DM row per bit

The Dependency Matrix logic from the CR proposal applies equally favourably
to this proposal. However there are additional caveats that weigh against
it:

* Like the single-scalar-DM-entry proposal, the integer scalar register has
  to be covered also by a single DM entry (for when it is used *as* an
  integer register).
* Unlike the same, it must also be covered by a 64-wide suite of bit-level
  Dependency Matrix Rows. These numbers are so massive as to cause some
  concern.
* A solution is to introduce a virtual register naming scheme; however this
  also introduces huge complexity, as the register cache has to be capable
  of swapping reservations from 64x bit-level to full 64-bit scalar level
  *and* keep the Dependency Matrices synchronised.

It is enormously complex and likely to result in debugging, verification
and ongoing maintenance difficulties.

## Schemes which split (a scalar) integer reg into mask "chunks"

These ideas are based on the principle that each chunk of 8 (or 16) bits of
a scalar integer register may be covered by its own DM column in FU-REGs.
8 chunks of a scalar 64-bit integer register, for use as a bit-level
predicate mask onto 64 vector elements, would for example require 8 DM
entries (see the sketch below).

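A sketch of the chunking principle (chunk size and the helper name are
illustrative): an FU handling elements 8*c to 8*c+7 needs only chunk c of
the mask, so only that chunk's DM column creates a hazard, not the whole
64-bit register.

    # illustrative only: each 8-bit chunk of the 64-bit predicate gets its
    # own FU-REGs DM column; an FU covering elements 8*c .. 8*c+7 depends
    # only on chunk c.
    def predicate_chunk(mask64, chunk_idx, chunk_size=8):
        return (mask64 >> (chunk_idx * chunk_size)) & ((1 << chunk_size) - 1)
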
This would, for vector sizes of 8, solve the "chaining" problem reasonably
well even when two FUs (or two clock cycles) were required to deal with 4
elements at a time. The "compare" that generated the predicate would be
ready to go into the first "chunk" of predicate bits whilst the second
compare was still being issued.

It would also require much smaller DMs than the single-bit-per-element
ideas.

The problems start when trying to allocate bits of predicate to units.
Just like the single-DM-row-per-entire-scalar-reg case, a shadow-capable
Predicate Function Unit is now required (already determined to be costly),
except that if there are 8 chunks requiring 8 Predicate FUs *the problem is
now made 8x worse*.

Not only that, but it is even more complex when trying to bring in virtual
register cacheing in order to bring down the overall FU-REGs DM row count.
Although the numbers are much lower (8x 8-bit chunks of a scalar int only
require 8 DM Rows and 8 virtual subdivisions), *this is per in-flight
register*.

The additional complexity of the cross-over point, between use as a chunked
predicate mask and when the same underlying register is used as an actual
scalar (or even vector) integer register, is also carried over from the
bit-level DM subdivision case.

Out-of-order systems, to be effective, require several operations to be
"in-flight" (POWER10 has up to 1,000 in-flight instructions), and if every
predicated vector operation needed one 8-chunked scalar register each, it
becomes exceedingly complex very quickly.

Even more than that, in a predicated chaining scenario, when computing the
mask from a vector "compare", the groupings are troublesome even to reason
about, let alone implement, which is itself a bad sign. It is suspected
that chaining will be complex or adversely affected by certain combinations
of element width.

Overall this idea, which initially seems to save resources, in fact brings
together and combines all the least favourable implementation aspects of
the other proposals.