# The Rules

[[!toc]]

SVP64 is designed around these fundamental and inviolate principles:

1. There are no actual Vector instructions: Scalar instructions
   are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
   just because it is prefixed (caveats below).
3. A hardware-level for-loop makes vector elements 100% synonymous
   with scalar instructions (the suffix).

That said, there are a few exceptional places where these rules get
bent, and others where the rules take some explaining.
This page tracks both.

The caveat in rule 2 covers element-width (elwidth) overrides,
which do not actually modify the meaning of the instruction:
an add remains an add, even if it is only an 8-bit add rather than
a 64-bit add. elwidth overrides *definitely* do not alter the v3.0 encoding.
Other "modifications" such as Saturation or Data-dependent Fail-First
are likewise post-augmentation or post-analysis, and do not
fundamentally change an add operation into a subtract, for example.

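The hardware for-loop of rule 3, combined with an element-width override, can be sketched roughly as below. This is a simplified model, not the specification: `sv_add` and its parameters are hypothetical names, the register file is treated as a flat list of elements, and the packing of narrow elements into 64-bit registers and the scalar/vector register marking are both ignored.

```python
def sv_add(regs, rt, ra, rb, vl, elwidth=64):
    """Illustrative model of a prefixed add: one scalar add per element.

    regs is a hypothetical flat list of element values (an assumption:
    real hardware packs narrow elements into 64-bit registers).
    """
    mask = (1 << elwidth) - 1
    for i in range(vl):  # the hardware-level for-loop
        # The operation is still an add; elwidth only narrows the result.
        regs[rt + i] = (regs[ra + i] + regs[rb + i]) & mask
    return regs


if __name__ == "__main__":
    regs = [0] * 16
    regs[4:6] = [250, 1]
    regs[8:10] = [10, 2]
    print(sv_add(list(regs), 0, 4, 8, vl=2, elwidth=8)[0:2])
    print(sv_add(list(regs), 0, 4, 8, vl=2, elwidth=64)[0:2])
```

In this model the only difference between the two calls is the width at which each result is truncated: the loop body itself never changes.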
*(An experiment was attempted to modify LD-immediate instructions
to include a third RC register, i.e. to reinterpret the normal
v3.0 32-bit instruction as a different encoding if SVP64-prefixed:
it did not go well. The resulting complexity in the decode phase
was too great.)*

# Instruction Groups

The basic principle of SVP64 is the prefix, which contains mode
as well as register augmentation and predicates.  When thinking of
instructions and Vectorising them, it is natural for arithmetic
operations (ADD, OR) to be the first to spring to mind.
Arithmetic instructions have registers, therefore augmentation
applies, end of story, right?

Except that Load and Store also deal with Memory, not just registers.
Power ISA has Condition Register Fields: how can element widths
apply there? And branches: how can you have Saturation on something
that does not return an arithmetic result? In short: there are actually
four different categories (five including those for which Vectorisation
makes no sense at all, such as `sc` or `mtmsr`).

# CR weird instructions

[[sv/int_cr_predication]] is by far the biggest violator of the SVP64
rules, for good reasons.  Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without these instructions.

Normally, element width overrides allow the element width to be specified
as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
consuming (in effect) either 1-bit or 4-bit elements, some adaptation was
required.  When this perspective is taken (that results or sources are
1 or 4 bits) the weirdness starts to make sense, because the "elements",
such as they are, are still packed sequentially.

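The idea of 4-bit and 1-bit "elements" packed sequentially can be pictured with the following sketch. It is purely illustrative and does not model any particular CR weird instruction: the function names are hypothetical, and the bit numbering (element 0 in the least significant position) is an assumption chosen for readability, whereas Power ISA itself uses MSB0 numbering.

```python
def pack_cr_fields(fields):
    """Pack 4-bit CR field values sequentially into one integer.

    Assumption: field 0 lands in the lowest nibble (LSB0 ordering).
    """
    packed = 0
    for i, field in enumerate(fields):
        packed |= (field & 0xF) << (4 * i)
    return packed


def predicate_from_cr(fields, bit):
    """Take one bit (0..3) from each 4-bit field, one result bit per element."""
    mask = 0
    for i, field in enumerate(fields):
        mask |= ((field >> bit) & 1) << i
    return mask


if __name__ == "__main__":
    cr_vector = [0b1000, 0b0010, 0b1000]
    print(hex(pack_cr_fields(cr_vector)))
    print(bin(predicate_from_cr(cr_vector, bit=3)))
```

Seen this way, a vector of CR Fields is simply a vector with a 4-bit element width, and a predicate mask is a vector with a 1-bit element width.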
From a hardware implementation perspective, however, they will need special
handling as far as Hazard Dependencies are concerned, because their
bit-level management does not conform to the normal element widths.

# mv.x

[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
ISA is anathema. In a Traditional Vector ISA however, where the
indices are isolated behind a single Vector Hazard, there is no
problem at all.  `sv.mv.x` is also fraught, precisely because it
sits on top of a Standard Scalar register paradigm, not a Vector
ISA with separate and distinct Vector registers.

To partly solve this, `sv.mv.x` has to be made relative:

```
for i in range(VL):
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```

The reason for doing so is that MAXVL or VL may be used to limit
the number of Register Hazards that need to be raised to a fixed
quantity, at Issue time.

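A runnable rendering of the relative form above makes the bounded read window visible. This is a minimal sketch that follows the pseudocode literally (including the in-place, sequential update and the clamp to `VL`); the function names and the list-based register file are assumptions made for illustration.

```python
def sv_mv_x(gpr, rt, ra, vl):
    """Literal model of the relative sv.mv.x pseudocode.

    gpr is a hypothetical list standing in for the integer register file.
    """
    for i in range(vl):
        offset = min(gpr[ra + i], vl)  # clamp the index so it stays near RT
        gpr[rt + i] = gpr[rt + offset]
    return gpr


def read_window(rt, vl):
    """Registers that can possibly be read as data, known at Issue time."""
    return range(rt, rt + vl + 1)


if __name__ == "__main__":
    gpr = list(range(100, 132))   # 32 registers holding 100..131
    gpr[16:20] = [3, 0, 99, 1]    # indices, one of them wildly out of range
    sv_mv_x(gpr, rt=8, ra=16, vl=4)
    print(gpr[8:12])
    print(list(read_window(8, 4)))
```

However large an index register happens to be, every data read falls inside `read_window`, which depends only on `RT` and `VL`. Without the clamp, any register in the file could be a source, and the hazards could not be determined until the indices had been read.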
`mv.x` itself will still have to be added as a Scalar instruction,
but the behaviour of `sv.mv.x` will have to be different from that
Scalar version.

Normally, Scalar Instructions have a good justification for being
added as Scalar instructions on their own merit. `mv.x` is the
polar opposite, and as such qualifies for a special mention in
this section.

# Branch-Conditional

[[sv/branches]] are a very special exception to the rule that there
shall be no deviation from the corresponding
Scalar instruction.  This is because of the tight
integration with looping and the application of Boolean Logic
manipulation needed for Parallel operations (predicate mask usage).
The result is an extremely important observation: `scalar identity behaviour`
is violated, in that the SV Prefixed variant of branch is **not** the same
operation as the unprefixed 32-bit scalar version.

One key difference is that LR is only updated if certain additional
conditions are met, whereas Scalar `bclrl` for example unconditionally
overwrites LR.

Well over 500 Vectorised branch instructions exist in SVP64 due to the
number of options available: close integration and interaction with
the base Scalar Branch was unavoidable in order to create Conditional
Branching suitable for parallel 3D / CUDA GPU workloads.

# Saturation

The application of Saturation as a retro-fit to a Scalar ISA is challenging.
It does help that within the SFFS Compliancy subset there are no Saturated
operations at all: they are only added in VSX.

Saturation does not inherently change the instruction itself: it does however
come with some fundamental implications when applied. For example,
a Floating-Point operation that would normally raise an exception will
no longer do so, instead setting the CR1.SO Flag.  Another quirky
example: signed operations which produce a negative result will be
clamped to zero if Unsigned Saturation is requested.

One very important aspect for implementors is that the operation in
effect has to be considered to be performed at infinite precision,
followed by saturation detection. In practice this does not actually
require infinite precision hardware! Two 8-bit integers being
added can only ever overflow into a 9-bit result.

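The "infinite precision, then saturation detection" model can be sketched as follows. This is an illustrative sketch only: `saturate` and `sat_add` are hypothetical helper names, and Python's unbounded integers stand in for the infinite-precision intermediate result.

```python
def saturate(value, width, signed):
    """Clamp an exact result to the range of a width-bit integer.

    Returns (clamped_value, saturated_flag).
    """
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value


def sat_add(a, b, width=8, signed=False):
    """Add exactly first, then detect and apply saturation."""
    exact = a + b  # unbounded Python int: the "infinite precision" step
    return saturate(exact, width, signed)


if __name__ == "__main__":
    print(sat_add(200, 100))               # unsigned overflow
    print(sat_add(100, 100, signed=True))  # signed overflow
    print(sat_add(-5, 3, signed=False))    # negative result, unsigned saturation
```

For the 8-bit unsigned case the largest exact sum is 255 + 255 = 510, which fits in 9 bits, so hardware needs only one extra bit of intermediate precision to make the same decision.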
Overall, some care and consideration is needed when applying Saturation.

# Fail-First

Fail-First (both the Load/Store and Data-Dependent variants)
is worthy of a special mention in its own right. VL is
normally forward-looking and may be part of a pre-decode phase
in a (simplified) pipelined architecture with no Read-after-Write Hazards.
Fail-First changes that, because at any point during the execution
of the element-level instructions, one of those elements may not only
terminate further continuation of the hardware for-loop but also
effect a change of VL:

```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```

This is not exactly a violation of SVP64 Rules, more of a breakage
of user expectations, particularly for LD/ST: where exceptions
would normally be expected to be raised, Fail-First provides for
avoidance of those exceptions.
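The Data-Dependent variant of the loop above can be made runnable as a rough sketch. The names `ddffirst`, `op` and `test` are hypothetical, the register file is a plain list, and storing the result of each passing element into `RT+i` is an assumption (the pseudocode above does not show the write-back). Whether the failing element itself is included in the new VL is not modelled: the sketch follows the pseudocode and sets VL to the failing index.

```python
import operator


def ddffirst(gpr, rt, ra, rb, vl, op, test):
    """Run an element-wise op, truncating VL at the first failing element.

    Returns the new VL; gpr is updated in place for passing elements.
    """
    for i in range(vl):
        result = op(gpr[ra + i], gpr[rb + i])
        if test(result):
            return i  # VL is truncated: later elements never execute
        gpr[rt + i] = result  # assumed write-back for passing elements
    return vl


if __name__ == "__main__":
    gpr = [0] * 32
    gpr[0:4] = [5, 6, 7, 8]   # RA vector
    gpr[4:8] = [1, 2, 7, 3]   # RB vector
    new_vl = ddffirst(gpr, rt=8, ra=0, rb=4, vl=4,
                      op=operator.sub, test=lambda r: r == 0)
    print(new_vl, gpr[8:12])
```

Because the new VL is only known once the failing element has been computed, later instructions that depend on VL cannot be fully planned in advance, which is the break from the forward-looking behaviour described above.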