openpower/sv/predication.mdwn

# TODO

<https://bugs.libre-soc.org/show_bug.cgi?id=213>

* idea 1: modify cmp (and other CR generators?) with qualifiers that
  create a single-bit prefix vector in an int reg
* idea 2: override the CR SO field in vector form to be a predicate bit per element
* idea 3: predicates are read from the bits of an int reg
* idea 4: the SO CR field is no longer overflow; it contains a copy of the
  int reg predicate element bit (passed through).  when OE set?

# Requirements

* must be easily implementable in any microarchitecture, including out-of-order
* must not compromise or penalise any microarchitectural performance
* must cover up to 64 elements

# Proposals

## CR-based predication proposal

This involves treating each CR as providing one bit of predicate. If
there is limited space in SVPrefix it will be a fixed bit (bit 0);
otherwise it may be selected (bit 0 to 3 of the CR).

The crucial advantage of this proposal is that the Function Units can
have one more register (a CR) added to their Read Dependency Hazards,
just like all the other incoming source registers, and there is no need
for a special "Predicate Shadow Function Unit".

An analysis of changing the element widths (for SIMD) gives the following
potential arrangements, for which it is assumed that 2x 32-bit FUs
"pair up" for single 64-bit arithmetic, HI32 and LO32 style.

* 64-bit operations.  2 FUs and their DM rows "collaborate"
  - 2x 32-bit source registers gang together for 64-bit input
  - 2x 32-bit output registers likewise for output
  - 1x CR (from the LO32 FU DM side) for a predicate bit
* 32-bit operations.  2 FUs collaborate 2x32 SIMD style
  - 2x 32-bit source registers go into separate input halves of the
    SIMD ALU
  - 2x 32-bit outputs likewise for output
  - 2x CRs (one for HI32, one for LO32) for a predicate bit for each of
    the 2x32-bit SIMD pair
* 16-bit operations. 2 FUs collaborate 4x16 SIMD style
  - 2x 2x16-bit source registers group together to provide 4x16 inputs
  - likewise for outputs
  - EITHER 2x 2xCRs (2 for HI32, 2 for LO32) provide 4 predicate bits
  - OR 1x 8xCR "full" port is utilised (on LO32 FU) followed by masking
    at the ALU behind the FU pair, extracting the required 4 predicate bits
* 8-bit operations. 2 FUs collaborate 8x8 SIMD style
  - 2x 4x8-bit source registers
  - likewise for outputs
  - 1x 8xCR "full" port is utilised (on LO32 FU) and all 8 bits are
    passed through to the underlying 64-bit ALU to perform 8x 8-bit
    predicated operations

A big advantage of this is that unpredicated operations just set the
predicate to an immediate of all 1s, and the actual ALUs require very
little modification.

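The element-width arrangements above, and the use of an all-1s predicate
for unpredicated operations, can be sketched in software as follows. This
is only an illustrative model, not a description of the hardware: the
function names are hypothetical, an add is used as a stand-in for any ALU
operation, and it is an assumption of this sketch that masked-out lanes
keep the old destination value.

```python
ALU_WIDTH = 64  # width of the ALU behind the paired HI32/LO32 FUs


def predicate_lanes(elwidth: int) -> int:
    """Number of SIMD lanes, and so predicate bits, for one 64-bit operation."""
    if elwidth not in (8, 16, 32, 64):
        raise ValueError(f"unsupported element width: {elwidth}")
    return ALU_WIDTH // elwidth


def predicated_simd_add(a: int, b: int, dest: int, mask: int, elwidth: int) -> int:
    """Lane-wise add on 64-bit words, one predicate bit per lane.

    Assumption of this sketch: a lane whose predicate bit is 0 keeps
    the corresponding lane of dest unchanged.
    """
    lane_mask = (1 << elwidth) - 1
    result = 0
    for lane in range(predicate_lanes(elwidth)):
        shift = lane * elwidth
        if (mask >> lane) & 1:
            # Masking after the add discards carries into higher lanes.
            value = ((a >> shift) + (b >> shift)) & lane_mask
        else:
            value = (dest >> shift) & lane_mask
        result |= value << shift
    return result


def unpredicated_simd_add(a: int, b: int, elwidth: int) -> int:
    """Unpredicated form: the same ALU path with an all-1s predicate."""
    all_ones = (1 << predicate_lanes(elwidth)) - 1
    return predicated_simd_add(a, b, 0, all_ones, elwidth)
```

The unpredicated case reuses the predicated path unchanged, which is the
point made above about the ALUs needing very little modification.
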
## Scalar (single) integer as predicate, with one DM row

This idea has several disadvantages.

* The single DM entry for the entire 64 bits creates a read hazard
  that has to be resolved through the addition of a special Shadowing
  Function Unit.  Only when the entire predicate is available can the
  die-cancel/ok be pulled on the FU elements each bit covers.
* This situation is exacerbated if one vector creates a predicate
  mask that is then used to mask immediately following instructions.
  Ordinarily (i.e. without the predicate involved), Cray-style "chaining"
  would be possible.  The single DM entry for the entire predicate mask
  prohibits this because the subsequent operations can only proceed when
  the *entire* mask has been computed.
* Allocation of bits to FUs gets particularly complex for SIMD (elwidth
  overrides), requiring shift and mask logic that is simply not needed
  in the "one-for-one" schemes (above).

Overall there is very little in favour of this concept.

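The effect on chaining can be illustrated with a toy timing model. This
is a sketch under stated assumptions only: the function name and the
cycle numbers are hypothetical, and a dependent element is assumed to be
able to start in the cycle in which its predicate becomes available.

```python
def dependent_start_times(mask_ready, single_dm_entry):
    """Earliest cycle at which each dependent element operation may start.

    mask_ready[i] is the (hypothetical) cycle in which predicate bit i
    is produced by the vector compare.
    """
    if single_dm_entry:
        # One DM entry covers the whole mask: every dependent element
        # waits until the last predicate bit has been computed.
        whole_mask = max(mask_ready)
        return [whole_mask] * len(mask_ready)
    # One hazard per predicate bit: each element only waits for its own
    # bit, which is what makes chaining possible.
    return list(mask_ready)


# Four elements whose predicate bits are produced one per cycle.
print(dependent_start_times([1, 2, 3, 4], single_dm_entry=True))
print(dependent_start_times([1, 2, 3, 4], single_dm_entry=False))
```
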
## Scalar (single) integer as predicate with one DM row per bit

The favourable Dependency Matrix logic from the CR proposal applies
equally to this proposal.  However there are additional caveats that
weigh against it:

* Like the single scalar DM entry proposal, the integer scalar register
  also has to be covered by a single DM entry (for when it is used *as*
  an integer register).
* Unlike that proposal, it must also be covered by a 64-wide suite of
  bit-level Dependency Matrix Rows.  These numbers are so massive as to
  cause some concern.
* A solution is to introduce a virtual register naming scheme. However,
  this also introduces huge complexity, as the register cache has to be
  capable of swapping reservations from 64 bit-level to full 64-bit scalar
  level *and* keep the Dependency Matrices synchronised.

It is enormously complex and likely to result in debugging, verification
and ongoing maintenance difficulties.

## Schemes which split (a scalar) integer reg into mask "chunks"

These ideas are based on the principle that each chunk of 8 (or 16)
bits of a scalar integer register may be covered by its own DM row.
8 chunks of a scalar 64-bit integer register for use as a bit-level
predicate mask onto 64 vector elements would, for example, require 8
DM entries.

This would, for vector sizes of 8, solve the "chaining" problem reasonably
well even when two FUs (or two clock cycles) were required to deal with
4 elements at a time.  The "compare" that generated the predicate would
be ready to go into the first "chunk" of predicate bits whilst the second
compare was still being issued.

It would also require much smaller DMs than the single-bit-per-element
ideas.

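A minimal sketch of the chunking idea follows, assuming 8-bit chunks of a
64-bit mask. The function names and the cycle numbers in the example are
hypothetical, and a chunk is assumed to become usable only once the last
of its bits has been produced.

```python
def chunk_of(element: int, chunk_bits: int = 8) -> int:
    """Index of the chunk (and so of the DM row) covering a predicate bit."""
    return element // chunk_bits


def chunk_ready_times(bit_ready, chunk_bits: int = 8):
    """Cycle in which each chunk is complete, given per-bit ready cycles."""
    return [
        max(bit_ready[i:i + chunk_bits])
        for i in range(0, len(bit_ready), chunk_bits)
    ]


# 64 predicate bits, produced 8 per cycle by successive compares
# (hypothetical timing): chunk 0 is ready while later compares are
# still being issued.
bit_ready = [1 + element // 8 for element in range(64)]
print(chunk_ready_times(bit_ready))
```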
The problems start when trying to allocate bits of predicate to units.
Just like the single-DM-row per entire scalar reg case, a shadow-capable
Predicate Function Unit is now required (already determined to be costly),
except now, if there are 8 chunks requiring 8 Predicate FUs, *the problem
is made 8x worse*.

Not only that, but it is even more complex when trying to bring in virtual
register caching in order to bring down the overall FU-REGs DM row count,
although the numbers are much lower: 8x 8-bit chunks of scalar int
only require 8 DM Rows and 8 virtual subdivisions. However, *this is per
in-flight register*.

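The row counts being compared can be made concrete with a small sketch.
The scheme labels, the function name and the in-flight count used in the
example are hypothetical; only the per-register figures (1, 64 and 8
rows) come from the discussion above, and the extra scalar-use DM entry
of the per-bit scheme is deliberately not counted.

```python
def predicate_dm_rows(scheme: str, in_flight: int = 1,
                      reg_bits: int = 64, chunk_bits: int = 8) -> int:
    """DM rows needed to track predicate masks, per scheme.

    in_flight is the number of predicate registers simultaneously in
    flight (a hypothetical parameter of this sketch).
    """
    rows_per_register = {
        "single": 1,                       # one row for the whole mask
        "per_bit": reg_bits,               # one row per predicate bit
        "chunked": reg_bits // chunk_bits, # one row per chunk
    }[scheme]
    return rows_per_register * in_flight


for scheme in ("single", "per_bit", "chunked"):
    print(scheme, predicate_dm_rows(scheme, in_flight=16))
```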
The additional complexity of the cross-over point between use as a chunked
predicate mask and when the same underlying register is used as an actual
scalar (or even vector) integer register is also carried over from the
bit-level DM subdivision case.

Out-of-order systems, to be effective, require several operations to
be "in-flight" (POWER10 has up to 1,000 in-flight instructions), and if
every predicated vector operation needed one 8-chunked scalar register
each, it becomes exceedingly complex very quickly.

Even more than that, in a predicated chaining scenario, when computing
the mask from a vector "compare", it is troublesome even to think through
how the groupings would be implemented, which is itself a bad sign.  It is
suspected that chaining will be complex or adversely affected by certain
combinations of element width.

(see [[masked_vector_chaining]])

Overall this idea, which initially seems to save resources, brings together
the least favourable implementation aspects of the other proposals and
requires all of them in combination.