openpower/sv/propagation.mdwn

[[!tag standards]]

# SV Context Propagation

[[!toc]]
   6
Context Propagation is intended for a future version of SV.  In some cases
it requires one Major opcode.

Context Propagation is a hardware compression scheme for 64-bit
prefix-suffix ISAs.  The prefix is *separated* from the suffix and, on the
reasonable assumption that the exact same prefix will need to be applied
to multiple suffixes, a bit-level FIFO indicates when a particular prefix
shall be applied to future instructions.

In this way, with the suffixes being only 32-bit and multiple 32-bit
instructions having the exact same prefix applied to them, the ISA becomes
much more compact.
  19
Put another way: [[sv/svp64]] context is 24 bits long, and Swizzle is 12.
Repeating these on every instruction is costly and not sustainable as far
as power consumption is concerned.  The same contexts are also frequently
repeated across different instructions.  The idea is therefore to add a
level of indirection that allows one context to be applied to multiple
instructions.
  26
The basic principle is to have a suite of 40 indices in a shift register
that indicate which one of seven Contexts shall be applied to upcoming
32-bit v3.0B instructions.  The Least Significant Index in the shift
register is the one that is applied.  One index value, 0b000, indicates
"no prefix applied".  Effectively this is a bit-level FIFO.
  32
A special instruction in an svp64 context takes a copy of the `RM[0..23]`
bits, alongside a 21-bit suite indicating which of up to 20 upcoming
32-bit instructions will have that `RM` applied to them, as well as an
index to associate with the `RM`.  If there are already indices set within
the shift register then the new entries are placed after the end of the
highest-indexed one.
  38
| 0.5| 6.8 | 9.10| 11.31 |  name   |
| -- | --- | --- | ----- | ------- |
| OP |     | MMM |       | ?-Form  |
| OP | idx | 000 | imm   |         |
  43
Four different types of context are available so far: svp64 RM, setvl,
Remap (with a SubVL variant) and swizzle.  Their format when stored in
SPRs is as follows:
  46
| 0..3 | 4..7   | 8..31         | name                    |
| ---- | ------ | ------------- | ----------------------- |
| 0000 | 0000   | `RM[0:23]`    | [[sv/svp64]] RM         |
| 0000 | 0001   | `setvl[0:23]` | [[sv/setvl]] VL         |
| 0001 | 0 mask | swiz1 swiz2   | swizzle                 |
| 0010 | brev   | sh0-4 ms0-5   | [Remap](sv/remap)       |
| 0011 | brev   | sh0-4 ms0-4   | [SubVL Remap](sv/remap) |
  54
Four 64-bit SPRs are used for storing Contexts, with the Shift Registers
following on directly after them.  The data is stored as follows:

* Seven 32-bit contexts are stored, each indexed from 0b001 to 0b111,
  two per 64-bit SPR and one in the 4th.
* Starting from bit 32 of the 4th SPR, the Shift Registers (bit-level
  FIFOs) are stored in batches of 40 bits.
  62
```
             0            31 32         63
    SVCTX0   context 0       context 1
    SVCTX1   context 2       context 3
    SVCTX2   context 4       context 5
    SVCTX3   context 6       FIFO0[0:31]
    SVCTX4   FIFO0[32:39]    FIFO1[0:39]   FIFO2[0:15]
    SVCTX5   FIFO2[16:39]    FIFO3[0:39]
    SVCTX6   FIFO4[0:39]     FIFO5[0:23]
    SVCTX7   FIFO5[24:39]    FIFO6[0:39]   (8 bits unused)
```
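The packing rule can be checked mechanically. The following is a minimal sketch, assuming seven 40-bit shift registers packed contiguously, with no padding, starting at bit 32 of SVCTX3; the constant and function names are illustrative only and are not part of the specification.

```python
CONTEXT_BITS = 32   # each stored context is 32 bits wide
NUM_CONTEXTS = 7    # seven contexts, hence seven shift registers
FIFO_BITS = 40      # each shift register (bit-level FIFO) is 40 bits
SPR_BITS = 64


def fifo_layout():
    """Return {spr_number: [(fifo, first_bit, last_bit), ...]}."""
    layout = {}
    # The contexts occupy SVCTX0-2 and the first half of SVCTX3, so FIFO
    # storage is assumed to begin at absolute bit offset 7 * 32 = 224.
    pos = NUM_CONTEXTS * CONTEXT_BITS
    for fifo in range(NUM_CONTEXTS):
        bit = 0
        while bit < FIFO_BITS:
            spr, offset = divmod(pos, SPR_BITS)
            # Take as many bits as fit in the remainder of this SPR.
            chunk = min(FIFO_BITS - bit, SPR_BITS - offset)
            layout.setdefault(spr, []).append((fifo, bit, bit + chunk - 1))
            bit += chunk
            pos += chunk
    return layout


for spr, pieces in sorted(fifo_layout().items()):
    text = "  ".join(f"FIFO{f}[{lo}:{hi}]" for f, lo, hi in pieces)
    print(f"SVCTX{spr}  {text}")
```

Under these assumptions the seven shift registers need 280 bits in total and end 56 bits into SVCTX7.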
  75
Whenever the LSB of any of the seven Shift Registers is nonzero, the
corresponding Contexts are looked up and merged (ORed) together.
Contexts for different purposes may not, however, be mixed: an illegal
instruction exception is raised if this occurs.
  80
The reason for merging the contexts is so that different aspects may be
applied.  For example, one `RM` context may indicate that predication
is to be applied to an instruction whilst another context may contain
the svp64 Mode.  Combining the two allows the predication aspect to be
merged and shared, making for better packing.
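The merge rule can be sketched as follows. This is an illustration only: it assumes MSB0 bit numbering (so bits 0..3 of a stored context are the most significant nibble of the 32-bit word) and assumes that a context's "purpose" is identified by the leading bits shown in the SPR format table. The names `context_kind`, `merge_contexts` and `IllegalInstruction` are hypothetical.

```python
class IllegalInstruction(Exception):
    """Stands in for the illegal instruction exception."""


KINDS = {0b0001: "swizzle", 0b0010: "remap", 0b0011: "subvl-remap"}


def context_kind(ctx):
    """Classify a 32-bit stored context by its leading bits (assumed MSB0)."""
    top = (ctx >> 28) & 0xF             # bits 0..3
    if top == 0b0000:
        second = (ctx >> 24) & 0xF      # bits 4..7 select RM or setvl
        return {0: "RM", 1: "setvl"}.get(second, "reserved")
    return KINDS.get(top, "reserved")


def merge_contexts(active):
    """OR together the active contexts, refusing to mix different kinds."""
    merged = 0
    kind = None
    for ctx in active:
        this_kind = context_kind(ctx)
        if kind is not None and this_kind != kind:
            raise IllegalInstruction(f"cannot mix {kind} with {this_kind}")
        kind = this_kind
        merged |= ctx
    return merged


# Two RM contexts, each contributing different bits, merge cleanly.
print(hex(merge_contexts([0x00000F00, 0x000000F0])))
```

Mixing, say, an `RM` context with a swizzle context raises the exception in this sketch instead of producing a merged value.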
  86
These changes occur on a precise schedule: compilers should not have
difficulties statically allocating the Context Propagation, as long as
certain conventions are followed.  In particular, the context should not
be allowed to propagate through branch targets reachable by more than one
incoming path, nor through variable-length loops.

Loops matter because, if the setup of the shift registers does not
precisely match the number of instructions, the meaning of those
instructions will change as the bits in the shift registers run out!
However if the loops are of fixed static size, with no conditional early
exit, and small enough (40 instructions maximum) then it is perfectly
reasonable to insert repeated patterns into the shift registers, enough
to cover all the loops.  Ordinarily, however, the Context Propagation
instructions should be used inside the loop, and it is the responsibility
of the compiler and assembler writer to ensure that the shift registers
reach zero before any loop jump-back point.
 103
## Pseudocode

The internal data structures need not precisely match the SPRs.  Here are
some internal data structures:

    bit sreg[7][40]      # seven 40-bit shift registers
    bit context[7][24]   # seven 24-bit contexts
    int sregoffs[7]      # where the last bits were placed in each sreg
 112
The Context Propagation instruction then inserts bits into the selected
stream:

    # number of instructions covered by the immediate
    count = 20-count_trailing_zeros(imm)
    context[idx] = new_context
    # append after the bits already queued for this stream
    start = sregoffs[idx]
    sreg[idx][start:start+count] = imm[0:count]
    sregoffs[idx] += count
 121
With each shift register being maintained independently, the new bits are
dropped in where the last ones end.  The context to apply to the next
instruction is determined as follows:

    apply_context = 0
    for i in range(7):
        if sreg[i][0]:                    # LSB set: context i applies
            apply_context |= context[i]
        sreg[i] = sreg[i] >> 1
        sregoffs[i] = max(sregoffs[i] - 1, 0)   # never below empty

Note that it is the LSB that says which context is to be applied.
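For experimentation, the pseudocode can be turned into a small runnable model. The sketch below makes several simplifying assumptions that are not part of the specification: each shift register is held as a Python integer, the per-instruction pattern is passed as a list of 0/1 flags instead of being decoded from `imm`, streams are indexed 0..6, and overflowing a 40-bit register raises an error. The class and method names are hypothetical.

```python
NUM_CONTEXTS = 7
SREG_BITS = 40


class ContextPropagation:
    def __init__(self):
        self.sreg = [0] * NUM_CONTEXTS       # shift registers, bit 0 = next
        self.context = [0] * NUM_CONTEXTS    # context value per stream
        self.sregoffs = [0] * NUM_CONTEXTS   # bits currently queued

    def insert(self, idx, new_context, pattern):
        """Queue `pattern` (first instruction first) on stream `idx`."""
        start = self.sregoffs[idx]
        if start + len(pattern) > SREG_BITS:
            # Assumed behaviour: the document does not define overflow.
            raise ValueError("shift register overflow")
        self.context[idx] = new_context
        for pos, bit in enumerate(pattern):
            if bit:
                self.sreg[idx] |= 1 << (start + pos)
        self.sregoffs[idx] = start + len(pattern)

    def next_context(self):
        """Return the merged context for the next instruction, then shift."""
        merged = 0
        for i in range(NUM_CONTEXTS):
            if self.sreg[i] & 1:
                merged |= self.context[i]
            self.sreg[i] >>= 1
            self.sregoffs[i] = max(self.sregoffs[i] - 1, 0)
        return merged


cp = ContextPropagation()
cp.insert(0, 0x0000A0, [1, 0, 1])   # applies to instructions 0 and 2
cp.insert(1, 0x00000F, [0, 1, 1])   # applies to instructions 1 and 2
print([hex(cp.next_context()) for _ in range(4)])
# ['0xa0', '0xf', '0xaf', '0x0']
```

The third instruction receives both contexts ORed together, and the fourth receives none because both shift registers have run out.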
 134
# Swizzle Propagation

Swizzle Contexts follow the same schedule, except that there is a mask
specifying to which registers the swizzle is to be applied, and there is
only a 17-bit suite to indicate the instructions to which the swizzle
applies.

The bits in the svp64 `RM` field are interpreted as a pair of 12-bit
swizzles:
 144
| 0.5| 6.8 | 9.11| 12.14 | 15.31 |  name   |
| -- | --- | --- | ----- | ----- | ------- |
| OP |     | MMM | mask  |       | ?-Form  |
| OP | idx | 001 | mask  |  imm  |         |
 149
Note however that swizzle applies only to svp64 encoded instructions, so
Swizzle Shift Registers only activate (and shift down) on svp64
instructions. *This includes Context-propagated ones!*

The mask is encoded as follows:

* bit 0 indicates that src1 is swizzled
* bit 1 indicates that src2 is swizzled
* bit 2 indicates that src3 is swizzled
 159
When the compiler creates Swizzle Contexts it is important to recall
that the Contexts will be ORed together. Thus one Context may specify
a mask whilst another specifies the swizzles: ORing different mask
Contexts with different swizzle Contexts allows more combinations than
would normally fit into seven Contexts.

More than one bit is permitted to be set in the mask: swiz1 is applied
to the first src operand specified by the mask, and swiz2 is applied to
the second.
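The assignment of swiz1 and swiz2 to operands can be illustrated with a short sketch. The mask is passed as three flags (mask bits 0, 1 and 2) to avoid committing to a bit ordering, and the swizzles are treated as opaque values. Rejecting a mask with all three bits set is an assumption made here, since only two swizzles exist; the document does not define that case. The function name is hypothetical.

```python
def assign_swizzles(mask, swiz1, swiz2):
    """Map swiz1/swiz2 onto the operands selected by the mask.

    mask: (src1, src2, src3) flags, i.e. mask bits 0, 1 and 2.
    """
    names = ("src1", "src2", "src3")
    selected = [name for name, flag in zip(names, mask) if flag]
    if len(selected) > 2:
        # Assumed behaviour: undefined in the source text.
        raise ValueError("only two swizzles are available")
    # swiz1 goes to the first selected operand, swiz2 to the second.
    return dict(zip(selected, (swiz1, swiz2)))


print(assign_swizzles((True, False, True), "swizA", "swizB"))
# {'src1': 'swizA', 'src3': 'swizB'}
```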
 169
# 2D/3D Matrix Remap

[[sv/remap]] allows up to four Vectors (all four arguments of `fma`, for
example) to be algorithmically remapped in arbitrary fashion via 1D, 2D
or 3D reshaping.  The amount of information needed to do so is however
quite large: consequently it is only practical to apply it indirectly,
via Context Propagation.
 175
Vectors may be remapped such that a Matrix multiply of any arbitrary size
is performed in one Vectorised `fma` instruction, as long as the total
number of elements is less than 64 (maximum for VL).

Additionally, in a fashion known as "Structure Packing" in NEON and RVV,
it may be used to perform "zipping" and "unzipping" of elements in a
regular fashion of any arbitrary size and depth: RGB or Audio channel
data may be split into separate contiguous lanes of registers, for
example.
 184
There are four possible Shapes.  Unlike swizzle contexts, this one
requires the external Remap Shape SPRs because the state information is
too large to fit into the Context itself.  Thus the Remap Context says
which Shapes apply to which registers.
 189
The instruction format is the same as `RM` and thus uses 21 bits of
immediate, 20 of which are dropped into the indexed Shift Register:
 192
| 0.5| 6.8 | 9.10| 11.14 | 15.31|  name       |
| -- | --- | --- | ----- | ---- | ----------- |
| OP |     | MM  |       |      | ?-Form      |
| OP | idx | 10  | brev  | imm  | Remap       |
| OP | idx | 11  | brev  | imm  | SUBVL Remap |
 198
SUBVL Remap applies the remapping even into the SUBVL Elements, for a
total of `VL*SUBVL` Elements.  **Swizzle may be applied on top as a
second phase** after SUBVL Remap.
 200
The brev field, which also applies down to SUBVL elements (not to the
whole vec2/3/4, which would be handled by swizzle reordering), is encoded
as follows:

* bit 0 indicates that dest elements are byte-reversed
* bit 1 indicates that src1 elements are byte-reversed
* bit 2 indicates that src2 elements are byte-reversed
* bit 3 indicates that src3 elements are byte-reversed
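As an illustration of what per-element byte-reversal means, here is a minimal sketch. It assumes unsigned elements whose width is a whole number of bytes, and takes brev as four flags (bits 0..3) in the order dest, src1, src2, src3. The helper names are hypothetical, and in hardware the dest reversal would apply when results are written, which this sketch does not model.

```python
def byte_reverse(value, width_bits):
    """Reverse the byte order of an unsigned element of the given width."""
    nbytes = width_bits // 8
    return int.from_bytes(value.to_bytes(nbytes, "big"), "little")


def apply_brev(brev, operands, width_bits):
    """Byte-reverse every element of each operand whose brev flag is set.

    brev: flags for (dest, src1, src2, src3), i.e. brev bits 0..3.
    operands: dict mapping operand name to a list of element values.
    """
    names = ("dest", "src1", "src2", "src3")
    return {
        name: ([byte_reverse(e, width_bits) for e in operands[name]]
               if flag else list(operands[name]))
        for name, flag in zip(names, brev)
        if name in operands
    }


vectors = {"src1": [0x11223344], "src2": [0x00000001]}
result = apply_brev((False, True, False, False), vectors, 32)
print({name: [hex(e) for e in elems] for name, elems in result.items()})
# {'src1': ['0x44332211'], 'src2': ['0x1']}
```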
 208
Again it is the 24-bit `RM` that is interpreted differently:

| 0.1 | 2.3 | 4.5 | 6.7 | 8.9 | 10.14 | 15.23 |
| --- | --- | --- | --- | --- | ----- | ----- |
| mi0 | mi1 | mi2 | mo0 | mo1 | en0-4 | rsv   |
 214
mi0-2 and mo0-1 each select one of SVSHAPE0-3 to apply to a given
register.  mi0-2 apply to RA, RB, RC respectively, as input registers, and
likewise mo0-1 apply to output registers. en0-4 indicate whether the
SVSHAPE is actively applied or not.
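Decoding the Remap interpretation of the 24-bit field can be sketched as below, using the field names from the table. The sketch assumes MSB0 bit numbering within the 24-bit value (bit 0 is the most significant bit), with each selector two bits wide and en0-4 occupying bits 10..14. The helper names `field` and `decode_remap` are hypothetical.

```python
def field(value, start, length, width=24):
    """Extract `length` bits starting at MSB0 bit `start` of a `width`-bit value."""
    shift = width - (start + length)
    return (value >> shift) & ((1 << length) - 1)


def decode_remap(rm):
    """Split a 24-bit Remap context into its selector and enable fields."""
    return {
        "mi0": field(rm, 0, 2),   # SVSHAPE selector for RA
        "mi1": field(rm, 2, 2),   # SVSHAPE selector for RB
        "mi2": field(rm, 4, 2),   # SVSHAPE selector for RC
        "mo0": field(rm, 6, 2),   # SVSHAPE selector for first output
        "mo1": field(rm, 8, 2),   # SVSHAPE selector for second output
        "en": field(rm, 10, 5),   # en0-4, one enable bit per selector
    }


# mi0 = 3, mo1 = 1, en = 0b10101, everything else zero.
rm = (3 << 22) | (1 << 14) | (0b10101 << 9)
print(decode_remap(rm))
# {'mi0': 3, 'mi1': 0, 'mi2': 0, 'mo0': 0, 'mo1': 1, 'en': 21}
```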
 219
# setvl

The setvl context occupies 22 bits, including 2 set aside for future
expansion of the SV Vector Length.  With 2 further reserved bits the
total is 24 bits, exactly the same size as the SVP64 RM.
 225
| 0.5| 6.10| 11..18 | 19..20 | 21 | 22.23 |
| -- | --- | ------ | ------ | -- | ----- |
| RT | RA  | SVi // | vs ms  | Rc | rsvd  |