# Links

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001498.html>

# Notes on requirements for bit allocations

do not try to jam VL or MAXVL in.  go with the flow of the 24 spare bits.

* 2: SUBVL
* 2: elwidth
* 2: twin-predication (src, dest) elwidth
* 1: select INT or CR predication
* 3: predicate selection and inversion (QTY 2 for twin predication)
* 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
* 3: saturate mode

total: 22 bits (dest elwidth shared)

<http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html>

## twin predication

twin predication and twin elwidth overrides are extremely important: they make it possible to override both the src and dest elwidth yet keep the underlying scalar operation intact.  for example, mr with elwidth=8 and VL=8 on the src will take a byte at a time from one 64-bit reg and place it into 8x 64-bit regs, zero-extended.  more complex operations involve SUBVL and Audio/Video DSP operations, see [[av_opcodes]]

something like:

| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 12 | 13 18 | 19 23 |
| ----- | --- | --- | ---- | ---- | ----- | ----- | ----- |
| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* psrc / pdst - predicate mask selector and inversion
* vspec - 3 bit src / dest scalar-vector extension
* mode - 5 bits
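
A minimal sketch of pulling the fields out of the 24-bit prefix, following the table above.  It assumes MSB0 numbering (bit 0 is the most significant of the 24 bits), which is the usual Power ISA convention but is not stated on this page.  `TWIN_PRED_FIELDS`, `extract` and `decode_twin_pred` are hypothetical names used only for illustration.

```python
# Field name -> (first bit, last bit), inclusive, copied from the table above.
TWIN_PRED_FIELDS = {
    "subvl": (0, 1),
    "sew": (2, 3),
    "dew": (4, 5),
    "ptyp": (6, 6),
    "psrc": (7, 9),
    "pdst": (10, 12),
    "vspec": (13, 18),
    "mode": (19, 23),
}
PREFIX_BITS = 24


def extract(prefix, start, end, total=PREFIX_BITS):
    """Return bits start..end (inclusive) of prefix.

    ASSUMPTION: MSB0 numbering, i.e. bit 0 is the most significant bit.
    """
    width = end - start + 1
    shift = total - 1 - end
    return (prefix >> shift) & ((1 << width) - 1)


def decode_twin_pred(prefix):
    """Split a 24-bit prefix value into the named fields of the table."""
    return {name: extract(prefix, start, end)
            for name, (start, end) in TWIN_PRED_FIELDS.items()}


# Sanity check: the field widths add up to exactly the 24 spare bits.
assert sum(e - s + 1 for s, e in TWIN_PRED_FIELDS.values()) == PREFIX_BITS
```

Keeping the layout in one table-shaped dict means a different allocation can be tried by editing only the bit ranges.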

## twin predication, CR based

separate src and dest predicates are a critical part of SV: they provide VEXPAND, VREDUCE, VSPLAT, VINSERT and many more operations.

Twin CR predication could be done in two ways:

* start from different CRs for the src and dest
* start from the same CR.

With different bits being selectable (CR[0..3]), starting from the same CR makes some sense.
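
To illustrate why separate src and dest predicates matter, here is a rough sketch of a twin-predicated move.  The loop structure (skip masked-out src elements, skip masked-out dest elements, copy, advance both) is one plausible reading and is an assumption, not something defined on this page.  Masks are modelled as plain integers with bit i covering element i, whether they originate from an INT reg or from CR bits.  `twin_pred_move` is a hypothetical name.

```python
def twin_pred_move(src, dest, src_mask, dest_mask, vl):
    """Copy src elements selected by src_mask into dest slots selected by dest_mask.

    ASSUMPTION: bit i of each mask enables element i; the sketch returns a
    new list and leaves dest elements that are not written unchanged.
    """
    out = list(dest)
    i = j = 0
    while True:
        # Skip source elements that are masked out.
        while i < vl and not (src_mask >> i) & 1:
            i += 1
        # Skip destination elements that are masked out.
        while j < vl and not (dest_mask >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        out[j] = src[i]
        i += 1
        j += 1
    return out


src = [10, 11, 12, 13]
# Sparse src mask, full dest mask: behaves like VREDUCE (compress).
reduced = twin_pred_move(src, [0, 0, 0, 0], 0b1010, 0b1111, 4)
# Full src mask, sparse dest mask: behaves like VEXPAND.
expanded = twin_pred_move(src, [0, 0, 0, 0], 0b1111, 0b1010, 4)
```

The same scalar move gives two different vector operations purely through the choice of masks.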

# standard arith ops (single predication)

these are of the form res = op(src1, src2, ...)

| 0   1 | 2 3 | 4 5 | 6    | 7  9 | 10 18 | 19  23 |
| ----- | --- | --- | ---- | ---- | ----- | ------ |
| subvl | sew | dew | ptyp | pred | vspec | mode   |

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* pred - predicate mask selector and inversion
* vspec - 2/3 bit src / dest scalar-vector extension
* mode - 5 bits

For 2 op (dest/src1/src2) the vspec may be 3 bits per reg: total 9 bits.  For 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.

Note:

* the operation should always be done at max(srcwidth, dstwidth), unless it can
  be proven that using the lower width will lead to the same result
* saturation is done on the result at the **dest** elwidth

Some examples on different operation widths:

    u16 / u16 = u8
    256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong

    u8 * u8 = u16
    255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong

    u16 + u16 = u8
    256 + 2 = 2 # this is correct whether we use the larger or smaller width
                # aka hw can optimize narrowing addition
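
The three examples above can be checked with a small model.  This is only a sketch: `wrap` and `op_at_width` are hypothetical helpers, and unsigned wrap-around (modulo 2**bits) is assumed whenever a value is narrowed.

```python
import operator


def wrap(value, bits):
    """Truncate value to an unsigned integer of the given width."""
    return value & ((1 << bits) - 1)


def op_at_width(op, a, b, op_bits, dest_bits):
    """Run op with inputs and result held at op_bits, then narrow to dest_bits."""
    result = wrap(op(wrap(a, op_bits), wrap(b, op_bits)), op_bits)
    return wrap(result, dest_bits)


# u16 / u16 = u8: only the wider operation width gives the right answer.
div_wide = op_at_width(operator.floordiv, 256, 2, 16, 8)
div_narrow = op_at_width(operator.floordiv, 256, 2, 8, 8)

# u8 * u8 = u16: again the wider width is required.
mul_wide = op_at_width(operator.mul, 255, 2, 16, 16)
mul_narrow = op_at_width(operator.mul, 255, 2, 8, 16)

# u16 + u16 = u8: both widths agree, so hw may use the narrower one.
add_wide = op_at_width(operator.add, 256, 2, 16, 8)
add_narrow = op_at_width(operator.add, 256, 2, 8, 8)
```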

# Mode

     0   1   2   3    4    description
     ------------------
     0   0   0   0    0    nothing
     0   1   inv CR bit    Rc=1: ffirst CR sel
     1   0   N   sz   dz   sat mode: N=0/1 u/s
     1   1   0   sz   dz   Rc=0: pred zero mode
     1   1   1   rsvd      reserved
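
A rough decoder for the table above, as a sketch only.  Assumptions: column 0 is the most significant of the 5 mode bits, "CR bit" is a 2-bit selector in columns 3 and 4, and patterns starting 0 0 other than all-zero are treated as unspecified because the table gives them no meaning.  `decode_mode` and the dict keys are hypothetical names.

```python
def decode_mode(mode):
    """Decode the 5-bit mode field into a small dict describing it."""
    if not 0 <= mode < 32:
        raise ValueError("mode is a 5-bit field")
    # b[0]..b[4] correspond to table columns 0..4 (ASSUMPTION: column 0 is MSB).
    b = [(mode >> (4 - i)) & 1 for i in range(5)]
    if b[0] == 0 and b[1] == 0:
        if mode == 0:
            return {"kind": "nothing"}
        return {"kind": "unspecified"}  # not covered by the table
    if b[0] == 0 and b[1] == 1:
        # Rc=1: fail-on-first, with inversion and CR bit selection.
        return {"kind": "ffirst", "inv": b[2], "cr_bit": (b[3] << 1) | b[4]}
    if b[0] == 1 and b[1] == 0:
        # Saturation mode: N=0 unsigned, N=1 signed.
        return {"kind": "sat", "signed": bool(b[2]), "sz": b[3], "dz": b[4]}
    if b[2] == 0:
        # Rc=0: predicate zeroing mode.
        return {"kind": "pred_zero", "sz": b[3], "dz": b[4]}
    return {"kind": "reserved"}
```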


# Notes about rounding, clamp and saturate

One of the issues with vector ops is that integer DSP ops, for example in Audio, must clamp or saturate rather than overflow or ignore the upper bits and become a modulo operation.  This is extremely important for Audio, as is providing an indicator as to whether saturation occurred.  see [[av_opcodes]].

If there are spare bits it would be very good to look at using some of them to specify the mode, because otherwise an SPR has to be used, which will need to be set and unset.  This can get costly.
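
A minimal sketch of the difference between saturating and modulo behaviour, including the "did it saturate" indicator.  `saturate` is a hypothetical helper; the width would be the dest elwidth and the signed/unsigned choice would come from the mode bits, but how the indicator is reported in hardware is not defined here.

```python
def saturate(value, bits, signed):
    """Clamp value to a bits-wide integer range.

    Returns (clamped value, flag), where flag is True if clamping occurred.
    """
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value


# Unsigned 8-bit: 200 + 100 clamps to 255 instead of wrapping to 44.
sat_u8 = saturate(200 + 100, 8, signed=False)
wrapped_u8 = (200 + 100) & 0xFF

# Signed 8-bit: clamps at both ends of the range.
sat_s8_pos = saturate(100 + 100, 8, signed=True)
sat_s8_neg = saturate(-100 - 100, 8, signed=True)
```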

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST, the other for arithmetic operations (actually, CR-driven).

* LD/ST ffirst treats the first LD/ST in a vector as an ordinary one.  Exceptions occur "as normal".  However for elements 1 and above, if an exception would occur, then VL is **truncated** to the previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another CR-creating operation (including cmp) produces a result.  Similar to branch, an analysis of the CR is performed and if the test succeeds, the vector operation terminates all element operations at and above the current one, and VL is truncated to the *previous* element.

The CR-based data-driven fail-on-first is new and not found in ARM SVE or RVV. It is extremely useful for reducing instruction count; however, it requires speculative execution involving modifications of VL to get high-performance implementations.
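
Both variants can be sketched as simple element loops.  This is a behavioural model only, under stated assumptions: "truncated to the previous element" is read as "the new VL equals the index of the offending element", memory is modelled as a dict of mapped addresses, and the CR analysis is reduced to a `test` callback on the result.  All names are hypothetical.

```python
def ffirst_load(memory, addrs, vl):
    """LD ffirst: element 0 faults as normal, later faults truncate VL."""
    loaded = []
    for i in range(vl):
        if addrs[i] not in memory:
            if i == 0:
                # The first element behaves like an ordinary load.
                raise MemoryError("fault on element 0 at address %d" % addrs[i])
            return loaded, i  # VL truncated: elements 0..i-1 are valid
        loaded.append(memory[addrs[i]])
    return loaded, vl


def ffirst_arith(op, src1, src2, vl, test):
    """Data-driven ffirst: stop at the first element whose result passes test."""
    results = []
    for i in range(vl):
        r = op(src1[i], src2[i])
        if test(r):
            # The current element and all above it are not performed.
            return results, i
        results.append(r)
    return results, vl


mem = {0: 10, 8: 11}
ld_result = ffirst_load(mem, [0, 8, 16, 24], 4)
sub_result = ffirst_arith(lambda a, b: a - b,
                          [5, 4, 3, 2], [1, 1, 3, 1], 4,
                          test=lambda r: r == 0)
```

A sequential loop like this hides the hard part mentioned above: a parallel implementation has to issue elements speculatively and discard those beyond the new VL.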


# Notes about Swizzle

Basically, there isn't enough room to fit both swizzle for two sources (src1/2) and SV, even into 64 bit (actually 24 bits), without severely compromising on the number of bits allocated to either swizzle, or SV, or both.

therefore the strategy proposed is:

* design 16-bit scalar ops
* use the 11-bit old SV prefix to create 32-bit insns
* when those are embedded into the v3.1B 64-bit prefix, the 24 bits are entirely allocated to swizzle.

with 2x12 bits this would mean no need to have complex encoding of swizzle.

if we really do need 2 bits spare then the complex encoder of swizzle could be deployed.

# note about INT predicate

    001     ALWAYS (implicit)       Operation is not masked

this means that by default 001 will always be in nonpredicated ops, which seems anomalous.  would 000 be better to indicate "no predication"?

000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"