# The Rules

[[!toc]]

SVP64 is designed around these fundamental and inviolate principles:

1. There are no actual Vector instructions: Scalar instructions
   are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
   just because it is prefixed (semantic caveats below).
3. A hardware-level for-loop makes vector elements 100% synonymous
   with scalar instructions (the suffix).

How can a Vector ISA even exist when no actual Vector instructions
are permitted to be added? It comes down to the strict abstraction.
First you add a **scalar** instruction (32-bit). Second, the
Prefixing is applied *in the abstract* to give the *appearance*
and ultimately the same effect as if an explicit Vector instruction
had also been added.

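To make rule (3) concrete, here is a minimal conceptual sketch (in the
same informal pseudocode style used elsewhere on this page, not the
formal specification pseudocode, and ignoring predication, element-width
overrides and scalar/vector register marking) of what prefixing a scalar
`add RT,RA,RB` effectively does:

```
# conceptual sketch only: the prefix turns one scalar add into a
# hardware-level for-loop of ordinary scalar adds, one per element,
# executed in strict Program Order
for i in range(VL):
    GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
```
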
There are a few exceptional places where these rules get
bent, and others where the rules take some explaining,
and this page tracks them.

The modification caveat in (2) above semantically
exempts element width overrides,
which still do not actually modify the meaning of the instruction:
an add remains an add, even if its override makes it an 8-bit add rather than
a 64-bit add. Even add-with-carry remains an add-with-carry: it is just
an *8-bit* add-with-carry.
In other words, elwidth overrides *definitely* do not alter the actual
Scalar v3.0 ISA encoding itself, and consequently, in the strictest
sense, rule (2) is still not being broken.
Likewise, other "modifications" such as saturation or Data-dependent
Fail-First are actually post-augmentation or post-analysis: they do
not fundamentally change an add operation into, for example, a subtract,
and under absolutely no circumstances do the actual
operand field bits change, nor does the number of operands change.

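As a hedged illustration of the elwidth point (a simplification, using
hypothetical helper names rather than the specification's pseudocode):
an 8-bit element width override merely narrows the data that each loop
iteration operates on; the operation itself is still an add:

```
# conceptual sketch: sv.add with an 8-bit element width override.
# still an add -- just an 8-bit one, with elements packed sequentially
# into the underlying 64-bit registers
for i in range(VL):
    a = get_polymorphic_element(RA, i, width=8)   # illustrative helper
    b = get_polymorphic_element(RB, i, width=8)   # illustrative helper
    set_polymorphic_element(RT, i, (a + b) & 0xFF, width=8)
```
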
*(In an early Draft of SVP64,
an experiment was attempted to modify LD-immediate instructions
to include a
third RC register, i.e. to reinterpret the normal
v3.0 32-bit instruction as a completely
different encoding if SVP64-prefixed. It did not go well.
The complexity that resulted
in the decode phase was too great. The lesson was learned, the
hard way: it would be infinitely preferable
to add a 32-bit Scalar Load-with-Shift
instruction *first*, which then inherently becomes Vectorised.
Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
both ARM and x86 have it, because it saves greatly on instruction count in
hot-loops.)*

# Instruction Groups

The basic principle of SVP64 is the prefix, which contains mode
as well as register augmentation and predicates. When thinking of
instructions and Vectorising them, it is natural for arithmetic
operations (ADD, OR) to be the first to spring to mind.
Arithmetic instructions have registers, therefore augmentation
applies, end of story, right?

Except that Load and Store also deal with Memory, not just registers.
Power ISA has Condition Register Fields: how can element widths
apply there? And branches: how can you have Saturation on something
that does not return an arithmetic result? In short: there are actually
four different categories (five including those for which Vectorisation
makes no sense at all, such as `sc` or `mtmsr`). The categories are:

* arithmetic/logical, including floating-point
* Load/Store
* Condition Register Field operations
* branch

**Arithmetic**

Arithmetic (known as "normal" mode) is where Scalar and Parallel
Reduction can be done, and Saturation as well, plus two new innovative
modes for Vector ISAs: data-dependent fail-first and predicate result.
Reduction and Saturation are common to see in Vector ISAs: it is just
that they are usually added as explicit instructions,
and NEC SX Aurora has even more iterative instructions. In SVP64 these
concepts are applied in the abstract general form, which takes some
getting used to: applied incorrectly to non-commutative instructions
they may produce invalid results. Ultimately it is critical to think
in terms of the "rules": everything is
Scalar instructions in strict Program Order.

**Branches**

Branch is the one and only place where the Scalar
(non-prefixed) operations differ from the Vector (element)
instructions, as explained in a separate section.
The RM bits can be used for other purposes because the Arithmetic modes
make no sense at all for a Branch.
Almost the entire SVP64 RM Field is interpreted differently from
other Modes, in order to support a wide range of parallel boolean
condition options which are expected of a Vector / GPU ISA. These
save a considerable number of instructions in tight inner loop
situations.

**CR Field Ops**

Condition Register Fields are 4 bits wide and consequently element-width
overrides make absolutely no sense whatsoever. Therefore the elwidth
override field bits can be used for other purposes when Vectorising
CR Field instructions. Moreover, Rc=1 is completely invalid for
CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
such as predicate-result make no sense, and neither does Saturation.
All of these differences, which require quite a lot of logical
reasoning and deduction, help explain why there is an entirely different
CR ops Vectorisation Category.

A particularly strange quirk of CR-based Vector Operations is that the
Scalar Power ISA CR Register is 32 bits, but actually comprises eight
CR Fields, CR0-CR7. With each CR Field being four bits (LT, GT, EQ, SO)
this makes up 32 bits, and therefore a CR operand referring to one bit
of the CR will be 5 bits in length (BA, BT).
*However*, some instructions refer
to a *CR Field* (CR0-CR7) and consequently these operands
(BF, BFA etc.) are only 3 bits.

With SVP64 extending the number of CR *Fields* to 128, the number of
CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
(8 per CR Register). Then it gets even stranger when it comes
to Vectorisation, which applies to the CR *Field* numbers. The
hardware-for-loop for Rc=1, for example, starts at CR0 for element 0,
moves to CR1 for element 1, and so on. The reason here is quite
simple: each element result has to have its own CR Field co-result.

In other words, the
element is the 4-bit CR *Field*, not the bits *of* the 32-bit
CR Register, and not the CR *Register* (of which there are now 16).
All quite logical, but a little mind-bending.

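A short sketch may help (illustrative only, with a hypothetical
`CRfield` accessor): for a Vectorised add with Rc=1, each element's
co-result goes into its own CR *Field*, starting at CR Field 0 for
element 0:

```
# conceptual sketch: Rc=1 under Vectorisation -- one CR Field co-result
# per element (SO handling simplified; see the OE=1 section below)
for i in range(VL):
    result = GPR(RA+i) + GPR(RB+i)
    GPR(RT+i) = result
    CRfield(i).LT = (result < 0)     # element i writes CR *Field* i
    CRfield(i).GT = (result > 0)
    CRfield(i).EQ = (result == 0)
```
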
**Load/Store**

LOAD/STORE is another area that has different needs: this time it is
down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
which simply make no sense in a RISC Scalar ISA: element-stride,
unit-stride, and the entire concept of a stride itself (a spacing
between elements) have no place at all in a Scalar ISA. The problems
come when trying to *retrofit* the concept of "Vector Elements" onto
a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64
RM Prefix to convey the stride mode, changing the Effective Address
computation as a result. Interestingly, and worth noting for Hardware
designers: it did turn out to be possible to perform pre-multiplication
of the D/DS Immediate by the stride amount, making it possible to avoid
actually modifying the LD/ST Pipeline itself.

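A hedged sketch of the idea (this is not the specification pseudocode:
the exact mode encodings and Effective Address formulas are defined in
the Load/Store section of the specification) for an element-strided
load, where the D immediate effectively acts as the per-element stride:

```
# conceptual sketch: element-strided sv.ld. the immediate is multiplied
# by the element index, which can be pre-computed, leaving the underlying
# scalar LD pipeline's EA calculation unchanged
for i in range(VL):
    EA = GPR(RA) + D * i     # stride applied via the (pre-multiplied) immediate
    GPR(RT+i) = MEM(EA, 8)   # plain scalar 64-bit load per element
```
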
Other areas where LD/ST went quirky: element-width overrides especially
when combined with Saturation, given that LD/ST operations have byte,
halfword, word, dword and quad variants. The interaction between these
widths as part of the actual operation, and the source and destination
elwidth overrides, was particularly obtuse and hard to derive: some care
and attention is advised, here, when reading the specification.

**Non-vectorised**

The concept of a Vectorised halt (`attn`) makes no sense. There is never
going to be a Vector of global MSRs (Machine State Register). `mtcr`
on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
Even permitting `td` and `tdi` to be Vectorised makes a strange type of
sense, because a sequence of comparisons could be Vectorised.
System Calls (`sc`), `tlbie`, and other Cache or Virtual
Memory Management
instructions make no sense to Vectorise.

However, it is really quite important not to be tempted to conclude that
just because these instructions are un-vectoriseable, the opcode space
must be free for reinterpretation and use for other purposes. This would
be a serious mistake because a future revision of the specification
might *retire* the Scalar instruction, replacing it with another.
Again this comes down to being quite strict about the rules: only Scalar
instructions get Vectorised: there are *no* actual explicit Vector
instructions.

**Summary**

Where a traditional Vector ISA effectively duplicates the entirety
of a Scalar ISA and then adds additional instructions which only
make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
considerable lengths to keep strictly to augmentation and embedding
of an entire Scalar ISA's instructions into an abstract Vectorisation
Context. That abstraction subdivides down into Categories appropriate
for the type of operation (Branch, CRs, Memory, Arithmetic),
and each Category has its own relevant but
ultimately rational quirks.

# Twin Predication

Twin Predication is an entirely new concept not present in any commercial
Vector ISA of the past forty years. To explain how normal Single-predication
is applied in a standard Vector ISA:

* Predication on the **destination** of a LOAD instruction creates something
  called "Vector Compressed Load" (VCOMPRESS).
* Predication on the **source** of a STORE instruction creates something
  called "Vector Expanded Store" (VEXPAND).
* SVP64 allows the two to be put back-to-back: one on source, one on
  destination.

The above allows a reader familiar with VCOMPRESS and VEXPAND to
conceptualise what the effect of Twin Predication is, but it actually
goes much further: in *any* twin-predicated instruction (extsw, fmv)
it is possible to apply one predicate to the source register (compressing
the source element array) and another *completely separate* predicate
to the destination register, not just on Load/Stores but on *arithmetic*
operations.

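A minimal sketch of the concept (simplified: zeroing, element-width
overrides and other details are omitted, and the mask names are
illustrative), applied to a move-style operation:

```
# conceptual sketch: twin predication. srcmask compresses the source
# element array, dstmask expands into the destination element array
srcstep = 0
dststep = 0
for i in range(VL):
    while srcstep < VL and not srcmask[srcstep]:
        srcstep += 1                    # skip masked-out source elements
    while dststep < VL and not dstmask[dststep]:
        dststep += 1                    # skip masked-out destination elements
    if srcstep >= VL or dststep >= VL:
        break
    GPR(RT+dststep) = GPR(RA+srcstep)   # compressed src to expanded dst
    srcstep += 1
    dststep += 1
```
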
No other Vector ISA in the world has this capability. All true Vector
ISAs have Predicate Masks: it is an absolutely essential characteristic.
However none of them have abstracted dual predicates out to the extent
where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
wide range of arithmetic
instructions, as well as Load/Store.

It is however important to note that not all instructions can be Twin
Predicated (2P): some remain only Single Predicated (1P), as is normally found
in other Vector ISAs. Arithmetic operations with
four registers (3-in, 1-out, VA-Form for example) are Single. The reason
is that there simply was not enough space in the 24 bits of the SVP64 Prefix.
Consequently, when using a given instruction, it is necessary to look
up in the ISA Tables whether it is 1P or 2P. Caveat emptor!

Also worth a special mention: all Load/Store operations are Twin-Predicated.
The underlying key to understanding (see the sketch below):

* one Predicate applies to the Array of Memory *Addresses*,
* the other Predicate applies to the Array of Memory *Data*.

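Using the same srcstep/dststep skipping convention as the sketch above
(again heavily simplified, with illustrative unit-style addressing),
a twin-predicated load looks conceptually like this:

```
# conceptual sketch: twin-predicated load. the source predicate governs
# which Effective Addresses are generated, the destination predicate
# governs which registers receive the loaded data
srcstep = 0
dststep = 0
while srcstep < VL and dststep < VL:
    if not srcmask[srcstep]:
        srcstep += 1
        continue
    if not dstmask[dststep]:
        dststep += 1
        continue
    EA = GPR(RA) + D + srcstep * 8   # predicate applied to the Array of Addresses
    GPR(RT+dststep) = MEM(EA, 8)     # predicate applied to the Array of Data
    srcstep += 1
    dststep += 1
```
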
# CR weird instructions

[[sv/int_cr_predication]] is by far the biggest violator of the SVP64
rules, for good reasons. Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without them.

Normally, element width overrides allow the element width to be specified
as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
consuming either 1-bit or 4-bit elements (in effect), some adaptation was
required. When this perspective is taken (that results or sources are
1 or 4 bits) the weirdness starts to make sense, because the "elements",
such as they are, are still packed sequentially.

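As a purely conceptual sketch of the 1-bit-element view (the actual
instructions, their operands and modes are defined in
[[sv/int_cr_predication]]; the `CRfield` accessor and the choice of
which bit to test are illustrative):

```
# conceptual sketch: gather one bit per CR Field into an Integer, so that
# a Vector of CR Field tests becomes an Integer predicate mask
mask = 0
for i in range(VL):
    if CRfield(i).EQ:       # illustrative: test the EQ bit of CR Field i
        mask |= 1 << i      # 1-bit "elements" packed sequentially
GPR(RT) = mask
```
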
From a hardware implementation perspective, however, they will need special
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management).

# mv.x

[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
ISA is anathematic. In a Traditional Vector ISA however, where the
indices are isolated behind a single Vector Hazard, there is no
problem at all. `sv.mv.x` is also fraught, precisely because it
sits on top of a Standard Scalar register paradigm, not a Vector
ISA, with separate and distinct Vector registers.

To help partly solve this, `sv.mv.x` has to be made relative:

```
for i in range(VL):
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```

The reason for doing so is that MAXVL or VL may be used to limit
the number of Register Hazards that need to be raised to a fixed
quantity, at Issue time.

`mv.x` itself will still have to be added as a Scalar instruction,
but the behaviour of `sv.mv.x` will have to be different from that
Scalar version.

Normally, Scalar Instructions have a good justification for being
added as Scalar instructions on their own merit. `mv.x` is the
polar opposite, and as such qualifies for a special mention in
this section.

# Branch-Conditional

[[sv/branches]] are a very special exception to the rule that there
shall be no deviation from the corresponding
Scalar instruction. This is because of the tight
integration with looping and the application of Boolean Logic
manipulation needed for Parallel operations (predicate mask usage).
This results in an extremely important observation, that `scalar identity
behaviour` is violated: the SV Prefixed variant of a branch is **not** the same
operation as the unprefixed 32-bit scalar version.

One key difference is that LR is only updated if certain additional
conditions are met, whereas Scalar `bclrl`, for example, unconditionally
overwrites LR.

Well over 500 Vectorised branch instructions exist in SVP64 due to the
number of options available: close integration and interaction with
the base Scalar Branch was unavoidable in order to create Conditional
Branching suitable for parallel 3D / CUDA GPU workloads.

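As a purely conceptual sketch (this is *not* the specification
pseudocode: it ignores CTR handling, VL-setting modes, LR updating and
the many other options), one of the simplest of the parallel Boolean
conditions referred to above is an "all elements must pass" test across
a Vector of CR Fields:

```
# conceptual sketch: an "all elements" style Vectorised branch-conditional,
# testing the equivalent bit of successive CR Fields and branching only if
# every tested element passes
taken = True
for i in range(VL):
    if not cr_bit_for_element(BI, i):   # illustrative helper: the BI-selected
        taken = False                   # bit of the i'th tested CR Field
        break
if taken:
    NIA = branch_target                 # illustrative name for the target
```
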
# Saturation

The application of Saturation as a retro-fit to a Scalar ISA is challenging.
It does help that within the SFFS Compliancy subset there are no Saturated
operations at all: they are only added in VSX.

Saturation does not inherently change the instruction itself: it does however
come with some fundamental implications, when applied. For example:
a Floating-Point operation that would normally raise an exception will
no longer do so, instead setting the CR1.SO Flag. Another quirky
example: signed operations which produce a negative result will be
truncated to zero if Unsigned Saturation is requested.

One very important aspect for implementors is that the operation in
effect has to be considered to be performed at infinite precision,
followed by saturation detection. In practice this does not actually
require infinite precision hardware! Two 8-bit integers being
added can only ever overflow into a 9-bit result.

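A minimal sketch of the implementor's view for an 8-bit saturated add
(the helper is illustrative, flag-setting on saturation is omitted, and
the inputs are assumed to already be interpreted as signed or unsigned
integers):

```
# conceptual sketch: 8-bit saturating add performed "as if" at infinite
# precision -- in practice a 9-bit intermediate result is sufficient
def sat_add_8bit(a, b, signed):
    result = a + b                 # the 9-bit intermediate
    if signed:
        lo, hi = -128, 127
    else:
        lo, hi = 0, 255            # Unsigned Saturation clamps negatives to 0
    if result < lo:
        return lo
    if result > hi:
        return hi
    return result
```
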
Overall some care and consideration needs to be applied.

# Fail-First

Fail-First (both the Load/Store and Data-Dependent variants)
is worthy of a special mention in its own right. Where VL is
normally forward-looking and may be part of a pre-decode phase
in a (simplified) pipelined architecture with no Read-after-Write Hazards,
Fail-First changes that: at any point during the execution
of the element-level instructions, one of those elements may not only
terminate further continuation of the hardware-for-looping but also
effect a change of VL:

```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```

This is not exactly a violation of SVP64 Rules, more a breakage
of user expectations, particularly for LD/ST: where exceptions
would normally be expected to be raised, Fail-First provides for
avoidance of those exceptions.

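For the Load/Store variant, a similarly simplified sketch (illustrative
addressing and helper names; in this sketch the first element is allowed
to fault as normal, which is the usual Fail-First convention):

```
# conceptual sketch: Load/Store Fail-First. instead of an exception being
# raised part-way through the Vector, VL is truncated at the first element
# that would fault, and the remaining elements simply never occur
for i in range(VL):
    EA = GPR(RA) + D + i * 8        # illustrative unit-style addressing
    if i > 0 and would_fault(EA):   # illustrative helper
        VL = i                      # truncate VL instead of raising the exception
        break
    GPR(RT+i) = MEM(EA, 8)
```
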
# OE=1

The hardware cost of Sticky Overflow in a parallel environment is immense.
The SFFS Compliancy Level is permitted optionally to support XER.SO.
Therefore the decision is made to make it mandatory **not** to
support XER.SO. However, CR.SO *is* supported such that when Rc=1
is set the CR.SO flag will contain only the overflow of
the current instruction, rather than being actually "sticky".
Hardware Out-of-Order designers will recognise and appreciate
that the Hazards are
reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.

This is sort-of a quirk and sort-of not, because support for
XER.SO is already optional at the SFFS Compliancy Level.
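
A short sketch of the consequence for Rc=1 co-results (illustrative
helper, LT/GT/EQ omitted for brevity; compare the CR Field sketch in
the Instruction Groups section):

```
# conceptual sketch: with XER.SO not supported, each element's CR Field SO
# bit reflects only that element's own overflow -- nothing "sticky" is
# carried from one element (or one instruction) to the next
for i in range(VL):
    result, ov = add_with_overflow(GPR(RA+i), GPR(RB+i))   # illustrative helper
    GPR(RT+i) = result
    CRfield(i).SO = ov      # per-element overflow only, no accumulation
```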