openpower/sv/svp64/discussion.mdwn

# Note about naming

the original assessment for SVP from 18 months ago concluded that it should be easy for scalar (non-SV) instructions to get at the exact same scalar registers when in SVP mode.  otherwise scalar v3.0B code needs to restrict itself to a massively truncated subset of the scalar registers numbered 0-31 (only r0, r4, r8...), which hugely interferes with ABIs to such an extent that it would compromise SV.

question: has anything changed about the assessment that was done, which concluded that for scalar SVP regs they should overlap completely with scalar ISA regs?

# Notes on requirements for bit allocations

do not try to jam VL or MAXVL in.  go with the flow of 24 bits spare.

* 2: SUBVL
* 2: elwidth
* 2: twin-predication (src, dest) elwidth
* 1: select INT or CR predication
* 3: predicate selection and inversion (QTY 2 for tpred)
* 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
* 5: mode

totals: 24 bits (dest elwidth shared)

http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html

## All zeros indicates "disable SVP"

The defaults for all capabilities of SVP should be zero to indicate "no action".  SUBVL=1 encoded as 0b00.  register name prefixes, scalar=0b0, elwidth overrides DEFAULT=0b00, predication off=0b000 etc.

this way SV may be entirely disabled, leaving an "all zeros" to indicate to v3.1B 64bit prefixing that the standard OpenPOWER v3.1B encodings are in full effect (and that SV is not). As all zeros meshes with current "reserved" encodings this should work well.

## twin predication

twin predication and twin elwidth overrides are extremely important: they make it possible to override both the src and dest elwidth yet keep the underlying scalar operation intact.  for example, mr with elwidth=8, VL=8 on the src will take a byte at a time from one 64 bit reg and place it into 8x 64-bit regs, zero-extended.  more complex operations involve SUBVL and Audio/Video DSP operations, see [[av_opcodes]]
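
as a rough illustration of the mr example, here is a minimal Python sketch.  the function name `mr_src_elwidth8` is hypothetical, and the sketch assumes that element 0 is the least-significant byte of the source register, which is not stated in the text.

```python
def mr_src_elwidth8(src_reg, vl=8):
    """Model mr with src elwidth=8: one 64-bit source, `vl` 64-bit dests."""
    dest = []
    for i in range(vl):
        # element i of the source at elwidth=8
        # (ASSUMPTION: element 0 is the least-significant byte)
        byte = (src_reg >> (8 * i)) & 0xFF
        # a Python int models the zero-extended 64-bit destination register
        dest.append(byte)
    return dest


regs = mr_src_elwidth8(0x0807060504030201)
print([hex(r) for r in regs])
```

each destination register ends up holding one byte of the source, with the upper 56 bits zero.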

the 24-bit prefix for twin predication could be laid out something like:

| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 12 | 13 18 | 19 23 |
|-------|-----|-----|------|------|-------|-------|-------|
| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |

fields:

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* psrc / pdst - predicate mask selector and inversion
* vspec - 3 bits each for the src / dest scalar-vector extension
* mode: 5 bits
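
the layout above can be sketched as a small field extractor.  this is only an illustration: `decode_prefix` and `TWIN_PRED_FIELDS` are hypothetical names, and the sketch assumes the bit numbers in the table are MSB0 (bit 0 is the most significant of the 24 bits), following the usual OpenPOWER convention.

```python
# (name, first_bit, last_bit), taken from the table; MSB0 numbering is ASSUMED
TWIN_PRED_FIELDS = [
    ("subvl", 0, 1), ("sew", 2, 3), ("dew", 4, 5), ("ptyp", 6, 6),
    ("psrc", 7, 9), ("pdst", 10, 12), ("vspec", 13, 18), ("mode", 19, 23),
]


def decode_prefix(bits24, fields=TWIN_PRED_FIELDS):
    """Split a 24-bit prefix value into a dict of named fields."""
    out = {}
    for name, first, last in fields:
        width = last - first + 1
        shift = 23 - last  # MSB0: bit 23 is the least-significant bit
        out[name] = (bits24 >> shift) & ((1 << width) - 1)
    return out


# an all-zeros prefix decodes to all-zero fields, i.e. "SV disabled"
print(decode_prefix(0))
```

the field widths sum to 24, and an all-zeros prefix gives subvl=0b00 (SUBVL=1), DEFAULT elwidths and no predication, matching the "all zeros" rule.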

## twin predication, CR based.

separate src and dest predicates are a critical part of SV for provision of VEXPAND, VCOMPRESS, VSPLAT, VINSERT and many more operations.

Twin CR predication could be done in two ways:

* start from different CRs for the src and dest
* start from the same CR.

With different bits being selectable (CR[0..3]), starting from the same CR makes some sense.

# standard arith ops (single predication)

these are of the form res = op(src1, src2, ...)

| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 18 | 19  23 |
|-------|-----|-----|------|------|-------|--------|
| subvl | sew | dew | ptyp | pred | vspec | mode   |

fields:

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* pred - predicate mask selector and inversion
* vspec - 2/3 bit src / dest scalar-vector extension
* mode - 5 bit

For 2 op (dest/src1/src2) the vspec may be 3 bits per reg: total 9 bits.  for 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.

Note:

* the operation should always be done at max(srcwidth, dstwidth), unless it can
  be proven using the lower will lead to the same result
* saturation is done on the result at the **dest** elwidth

Some examples on different operation widths:

    u16 / u16 = u8
    256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong

    u8 * u8 = u16
    255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong

    u16 + u16 = u8
    256 + 2 = 2 # this is correct whether we use the larger or smaller width
                # aka hw can optimize narrowing addition
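
the three examples can be checked with a short Python sketch.  `elwidth_op` is a hypothetical helper, and it models plain truncation (wrap-around) only; saturation at the dest elwidth is not modelled here.

```python
from operator import add, floordiv, mul


def mask(width):
    """All-ones mask for an unsigned value of `width` bits."""
    return (1 << width) - 1


def elwidth_op(op, a, b, src_width, dest_width, op_width=None):
    """Compute op(a, b) at op_width, then truncate to dest_width.

    op_width defaults to max(src_width, dest_width), as the note recommends.
    """
    if op_width is None:
        op_width = max(src_width, dest_width)
    m = mask(op_width)
    result = op(a & m, b & m) & m
    return result & mask(dest_width)


# u16 / u16 = u8: the wide operation gives 128, the narrow one gives 0
print(elwidth_op(floordiv, 256, 2, 16, 8), elwidth_op(floordiv, 256, 2, 16, 8, op_width=8))
# u8 * u8 = u16: the wide operation gives 510, the narrow one gives 254
print(elwidth_op(mul, 255, 2, 8, 16), elwidth_op(mul, 255, 2, 8, 16, op_width=8))
# u16 + u16 = u8: both widths give 2
print(elwidth_op(add, 256, 2, 16, 8), elwidth_op(add, 256, 2, 16, 8, op_width=8))
```

only the addition gives the same answer at both widths, which is why narrowing addition is the case hardware can safely optimise.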


# Notes about Swizzle

Basically, there isn't enough room to fit swizzle for both src1 and src2, plus SV, even into 64 bit (actually 24 bits spare) without severely compromising on the number of bits allocated to either swizzle, or SV, or both.

therefore the strategy proposed is:

* design 16bit scalar ops
* use the 11 bit old SV prefix to create 32bit insns
* when those are embedded into v3.1B 64 prefix, the 24 bits are entirely allocated to swizzle.

with 2x12 this would mean no need to have complex encoding of swizzle.

if we really do need 2 bits spare then the complex encoder of swizzle could be deployed.  (*an analysis shows this to be very unlikely. 7^4 is around 2400 which still requires 12 bits to encode* (that's miscalculated, see Single Swizzle section below.)  it isn't miscalculated: you missed out predicate mask skip as the 7th option.)

## Single Swizzle

I expect swizzle to not be common enough to warrant 2 swizzles in a single instruction, therefore the above swizzle strategy is probably unnecessary.

Also, if a swizzle supports up to subvl=4, then 11 bits is sufficient since each swizzle element needs to be able to select 1 of 6 different values: 0, 1, x, y, z, w. 6^4 = 1296 which easily fits in 11 bits (only by dropping "predicate mask" from the list of options, which makes 7 options not 6.  see [[mv.swizzle]])

What about subvl=4 that skips one element?  src vec is 4 but one of the elements is to be left alone?  This is not 6 options, it is 7 options (including "skip" i.e. combining with a predicate mask in effect).  note that this is not the same as a vec3-with-a-skip

What could hypothetically be done is: when SUBVL=3 a different encoding is used, one that allows the "skip" to be specified.  X Y skip W for example.  this would then be interpreted, "actually the vector is vec4 but one element is skipped"

the problem with that is that now SUBVL has become critically dependent on the swizzle, worse than that the swizzle is embedded in the instruction, even worse than that it's encoded in a complex multi-gate fashion.

all of which screams, "this is going in completely the wrong direction".  keep it simple.  7 options, 3 bits, 4x3, 12 bits for swizzle, ignore some if SUBVL is 1 2 or 3.
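
the bit counts argued over above, and the simple 4x3 encoding, can be sketched in Python.  the option numbering in `SWIZZLE_OPTIONS` and the function names are hypothetical (the text does not assign codes); code 0b111 is simply left unused in this sketch.

```python
def bits_needed(options, lanes=4):
    """Bits for a dense encoding of `lanes` elements with `options` choices each."""
    combos = options ** lanes
    return (combos - 1).bit_length()


print(bits_needed(6))  # 6^4 = 1296 combinations
print(bits_needed(7))  # 7^4 = 2401 combinations

# HYPOTHETICAL numbering of the 7 options, 3 bits per element
SWIZZLE_OPTIONS = ["x", "y", "z", "w", "0", "1", "skip"]


def encode_swizzle(lanes):
    """Pack up to 4 option names into 12 bits, 3 bits per element."""
    value = 0
    for i, name in enumerate(lanes):
        value |= SWIZZLE_OPTIONS.index(name) << (3 * i)
    return value


def decode_swizzle(value, subvl=4):
    """Unpack the first `subvl` elements; the rest are ignored."""
    return [SWIZZLE_OPTIONS[(value >> (3 * i)) & 0b111] for i in range(subvl)]


enc = encode_swizzle(["x", "y", "skip", "w"])
print(decode_swizzle(enc), decode_swizzle(enc, subvl=2))
```

the dense encodings need 11 bits for 6 options and 12 bits for 7, so with "skip" included the plain 4x3 layout costs no more bits than a complex encoder would.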

# note about INT predicate

001     ALWAYS (implicit)       Operation is not masked

this means by default that 001 will always be in nonpredicated ops, which seems anomalous.  would 000 be better to indicate "no predication"?

000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"

programmerjake:
I picked 001 to indicate ALWAYS since that matches with the other semantics: the LSB bit is invert-the-mask, and you can think about the table as if it is really the table below.

this is the opposite of what feels natural.  inversion should switch *off* something.  also 000 is the canonical "this feature is off by default" number.

the constant should be an immediate of all 1s (not r0), which is the natural way to think of "predication is off".

i get the idea "r0 to be used therefore it is all zeros" but that makes 001 the "default", not 000.

| Value | Mnemonic    |
|-------|-------------|
| 000  | R0 (zero)   set to all 1s, naturally means "no predication" |
| 001  | ~R0 (~zero) |
| 010  | R3          |
| 011  | ~R3         |
| 100  | R10         |
| 101  | ~R10        |
| 110  | R30         |
| 111  | ~R30        |
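
to make the two positions concrete, here is a small Python sketch of the selector table.  `decode_int_pred`, `pred_mask` and the `r0_const` parameter are hypothetical names: `r0_const=0` models "R0 reads as zero" (so 001 is ALWAYS), while `r0_const` set to all 1s models the counter-proposal (so 000 means "no predication").  a 64-bit mask width is assumed.

```python
PRED_REGS = ["R0", "R3", "R10", "R30"]  # upper 2 bits of the selector
MASK64 = (1 << 64) - 1                  # ASSUMED 64-bit predicate mask


def decode_int_pred(sel):
    """Split the 3-bit selector: upper 2 bits pick the reg, LSB inverts."""
    return PRED_REGS[sel >> 1], bool(sel & 1)


def pred_mask(sel, regs, r0_const=0):
    """Return the effective predicate mask for selector `sel`."""
    name, invert = decode_int_pred(sel)
    # R0 is a constant here, not a real register read
    value = r0_const if name == "R0" else regs[name]
    if invert:
        value = ~value
    return value & MASK64


regs = {"R3": 0b0101, "R10": 0, "R30": 0}
print(hex(pred_mask(0b001, regs)))                   # "R0 is zero": 001 is ALWAYS
print(hex(pred_mask(0b000, regs, r0_const=MASK64)))  # counter-proposal: 000 is ALWAYS
```

either way the hardware is the same 2-bit select plus 1-bit invert; the disagreement is only over which constant R0 stands for, and hence which encoding is the all-zeros default.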


# CR Vectorisation

Some thoughts on this: the sensible (sane) number of CRs to have is 64.  A case could be made for having 128 but it is an awful lot.  Having 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much).

A practical issue stems from the fact that accessing the CR regfile on a non-aligned 8-CR boundary during Vector operations would significantly increase internal routing.  By aligning Vector Reads/Writes to 8 CRs this requires only 32 bit aligned read/writes. (programmerjake: simple solution -- rename them internally such that CR6 is the first one)

How to number them as vectors gets particularly interesting.  A case could be made for treating the 64 CRs as a square, and using CR numbering (CR0-7) to begin VL for-loop incrementing first by row and when rolling over to increment the column.  CR6 CR14 ... CR62 then CR7 CR15 ...

When the SV prefix marks them with 2 bits, one of those could be used to indicate scalar, and the other to indicate whether the 3 bit CR number is to be treated as a horizontal vector (CR incrementing straight by 1) or a vertical vector (incrementing by 8)

When there are 3 bits it would be possible to indicate whether to begin from a position offset by 4 (middle of matrix, edge of matrix).

Note: considerable care needs to be taken when putting these horiz/vertical CRs through the Dependency Matrices

Indexing algorithm illustrating how the H/V modes would work.  Note that BA is the 3 bit CR register field that normally, in scalar ISA, would reference only CR0-7 as CR[BA].

    for i in range(VL):
        y = i % 8   # position within the column (row number)
        x = i // 8  # column number, incremented on rollover
        if verticalmode:
            CRINDEX = BA + y*8 + x
        else:
            CRINDEX = BA*8 + i
        CR[CRINDEX] = ...
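
a runnable version of the indexing algorithm, useful for checking the CR sequence quoted earlier in this section.  `cr_indices` is a hypothetical helper name, and no wraparound or range check is applied because the text does not define one.

```python
def cr_indices(BA, VL, verticalmode):
    """Return the list of CR numbers touched for a given BA, VL and mode."""
    out = []
    for i in range(VL):
        y = i % 8   # row within the 8x8 square
        x = i // 8  # column, incremented on rollover
        if verticalmode:
            out.append(BA + y * 8 + x)
        else:
            out.append(BA * 8 + i)
    return out


print(cr_indices(6, 10, True))   # vertical, starting at CR6
print(cr_indices(6, 4, False))   # horizontal, starting at CR48
```

with BA=6 in vertical mode this reproduces CR6 CR14 ... CR62 then CR7 CR15.  note that larger VL values in vertical mode run past CR63 in this sketch, so a real design would need to define what happens there.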

# Should twin-predication (src=1, dest=1) have DEST SUBVL?

this is tricky: there isn't really enough space unless the reg scalar-vector extension (currently 3 bits per reg) is compacted to only 2 bits each, which would provide 2 extra bits.

so before adding this, an evaluation is needed: *is it necessary*?

what actual operations out of this list need - and work - with a separate SRC and DEST SUBVL?

* mv (the usual way that V* operations are created)
* exts* sign-extension
* rlwinm and other RS-RA shift operations
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

Evaluation:

* mv: yes.  these may need merge/split
* exts: no.  no transformation.
* rlwinm shift operations: no
* LD and ST: no
* FP ops: no
* CR ops: maybe on mvs, not on arithmetic.

therefore it makes no sense to have DEST SUBVL; instead, have special mv operations.  see [[mv.vec]]