# The Rules

[[!toc]]

SVP64 is designed around these fundamental and inviolate principles:

1. There are no actual Vector instructions: Scalar instructions
   are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
   just because it is prefixed (semantic caveats below).
3. A hardware-level for-loop makes vector elements 100% synonymous
   with scalar instructions (the suffix).

How can a Vector ISA even exist when no actual Vector instructions
are permitted to be added? It comes down to strict abstraction.
First you add a **scalar** instruction (32-bit). Second, the
Prefixing is applied *in the abstract* to give the *appearance*,
and ultimately the same effect, as if an explicit Vector instruction
had also been added.

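To make the "hardware-level for-loop" concrete, here is a minimal Python
sketch. It is not taken from the specification: the helper names
`scalar_add` and `sv_add`, the flat list standing in for the register file,
and the assumption that all three operands are marked as Vectors are
illustrative only.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def scalar_add(regs, rt, ra, rb):
    """One ordinary scalar add on a flat register file (64-bit wraparound)."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64


def sv_add(regs, rt, ra, rb, vl):
    """Prefixed form: the *same* scalar add, issued once per element."""
    for i in range(vl):
        scalar_add(regs, rt + i, ra + i, rb + i)


regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]
sv_add(regs, rt=24, ra=8, rb=16, vl=4)
print(regs[24:28])  # [11, 22, 33, 44]
```

Nothing in `sv_add` knows how to add: it only re-issues the scalar operation
with incremented register numbers, which is the essence of rule 3.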
There are a few exceptional places where these rules get
bent, and others where the rules take some explaining,
and this page tracks them.

The semantic caveat in (2) above covers element width overrides,
which still do not actually modify the meaning of the instruction:
an add remains an add, even if its override makes it an 8-bit add rather than
a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry,
where the 9th bit becomes Carry-out, not the 65th bit.
In other words, elwidth overrides **definitely** do not fundamentally
alter the actual Scalar v3.0 ISA encoding itself.  Consequently, in
the strictest sense, rule (2) is still not being broken.

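The position of the Carry-out bit can be illustrated with a short sketch.
The function name `adde_elwidth` and its argument order are hypothetical;
only the behaviour described above (Carry-out taken from just above the
element width rather than from above bit 64) is being modelled.

```python
def adde_elwidth(a, b, carry_in, elwidth=64):
    """Add-with-carry at a given element width.

    Returns (result, carry_out). The operation is identical at every
    width; only the position of the carry-out bit moves.
    """
    mask = (1 << elwidth) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> elwidth


print(adde_elwidth(0xFF, 0x01, 0, elwidth=8))  # (0, 1): 9th bit is the carry
print(adde_elwidth(0xFF, 0x01, 0))             # (256, 0): no carry at 64-bit
```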
Other "modifications" such as saturation or Data-dependent
Fail-First are likewise actually post-augmentation or post-analysis, and do
not fundamentally change an add operation into a subtract,
for example. Under absolutely no circumstances do the actual 32-bit
Scalar v3.0 operand field bits change, nor does the number of operands.

In an early Draft of SVP64, an experiment was attempted to modify
LD-immediate instructions to include a third RC register, i.e. to
reinterpret the normal v3.0 32-bit instruction as a completely
different encoding if SVP64-prefixed. It did not go well.
The resulting complexity in the decode phase was too great.
The lesson was learned the hard way: it would be infinitely preferable
to add a 32-bit Scalar Load-with-Shift
instruction *first*, which then inherently becomes Vectorised.
Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
both ARM and x86 have it, because it saves greatly on instruction count in
hot-loops.

The other reason for not adding an SVP64-Prefixed instruction without
also having it as a Scalar un-prefixed instruction is that if the
32-bit encoding is ever allocated to a completely unrelated operation,
how can a Vectorised version of that new instruction ever be added?
The bottom line is that the fundamental RISC Principle is strictly adhered
to, even though these are Advanced 64-bit Vector instructions.

# Instruction Groups

The basic principle of SVP64 is the prefix, which contains mode
as well as register augmentation and predicates.  When thinking of
instructions and Vectorising them, it is natural for arithmetic
operations (ADD, OR) to be the first to spring to mind.
Arithmetic instructions have registers, therefore augmentation
applies, end of story, right?

Except that Load and Store also deal with Memory, not just registers.
Power ISA has Condition Register Fields: how can element widths
apply there? And branches: how can you have Saturation on something
that does not return an arithmetic result? In short: there are actually
four different categories (five including those for which Vectorisation
makes no sense at all, such as `sc` or `mtmsr`). The categories are:

* arithmetic/logical including floating-point
* Load/Store
* Condition Register Field operations
* branch

**Arithmetic**

Arithmetic (known as "normal" mode) is where Scalar and Parallel
Reduction can be done, as well as Saturation and two modes that are
new to Vector ISAs: data-dependent fail-first and predicate result.
Reduction and Saturation are common to see in Vector ISAs: it is just
that they are usually added as explicit instructions,
and NEC SX Aurora has even more iterative instructions. In SVP64 these
concepts are applied in the abstract general form, which takes some
getting used to.

Reduction, when applied carelessly to non-commutative instructions,
may produce unintended results, but ultimately
it is critical to think in terms of the "rules": everything is
Scalar instructions in strict Program Order.  Reduction on non-commutative
Scalar Operations is not *prohibited*: the strict Program Order allows
the programmer to think through what would happen and thus potentially
come up with a legitimate use.

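As an illustration of why strict Program Order keeps a non-commutative
reduction well-defined, the sketch below folds a vector with
subtract-from (`subf`, which in the Power ISA computes RB - RA). The helper
`reduce_in_program_order` is hypothetical, and how a reduction is actually
expressed in SVP64 (which operand doubles as the accumulator) is
deliberately not modelled here.

```python
def subf(ra, rb):
    """Power ISA subtract-from: the result is RB - RA."""
    return rb - ra


def reduce_in_program_order(op, accumulator, elements):
    """Apply a scalar op element by element, strictly in order."""
    for element in elements:
        accumulator = op(accumulator, element)
    return accumulator


forward = reduce_in_program_order(subf, 0, [1, 2, 3, 4])
backward = reduce_in_program_order(subf, 0, [4, 3, 2, 1])
print(forward, backward)  # 2 -2
```

The two results differ, yet each one is fully predictable from the element
order, which is all that a programmer needs in order to make deliberate use
of it.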
**Branches**

Branch is the one and only place where the Scalar
(non-prefixed) operations differ from the Vector (element)
instructions, as explained in a separate section.
The RM bits can be used for other purposes because the Arithmetic modes
make no sense at all for a Branch.
Almost the entire SVP64 RM Field is interpreted differently from other
Modes, in order to support a wide range of parallel boolean condition
options which are expected of a Vector / GPU ISA. These save a considerable
number of instructions in tight inner loop situations.

**CR Field Ops**

Condition Register Fields are 4 bits wide and consequently element-width
overrides make absolutely no sense whatsoever. Therefore the elwidth
override field bits can be used for other purposes when Vectorising
CR Field instructions.  Moreover, Rc=1 is completely invalid for
CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
such as predicate-result make no sense, and neither does Saturation.
All of these differences, which require quite a lot of logical
reasoning and deduction, help explain why there is an entirely different
CR ops Vectorisation Category.

A particularly strange quirk of CR-based Vector Operations is that the
Scalar Power ISA CR Register is 32 bits, but actually comprises eight
CR Fields, CR0-CR7. With each CR Field being four bits (LT, GT, EQ, SO)
this makes up 32 bits, and therefore a CR operand referring to one bit
of the CR will be 5 bits in length (BA, BT).
*However*, some instructions refer
to a *CR Field* (CR0-CR7) and consequently these operands
(BF, BFA etc) are only 3 bits.

(*It helps here to think of the top 3 bits of BA as referring
to a CR Field, like BFA does, and the bottom 2 bits of BA
referring to LT/GT/EQ/SO within that Field*)

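The split of a 5-bit operand into "Field" and "bit within Field" can be
written out as a small sketch. The function name `split_cr_bit_operand` is
hypothetical; the bit ordering within a Field (LT, GT, EQ, SO from most to
least significant) follows the Scalar Power ISA.

```python
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")


def split_cr_bit_operand(ba):
    """Split a 5-bit CR bit operand (BA, BT) into (field, bit_name)."""
    if not 0 <= ba < 32:
        raise ValueError("a scalar CR bit operand is 5 bits wide")
    field = ba >> 2                 # top 3 bits: CR0-CR7, like BF/BFA
    bit_name = CR_BIT_NAMES[ba & 0b11]  # bottom 2 bits: bit within the Field
    return field, bit_name


print(split_cr_bit_operand(2))   # (0, 'EQ')
print(split_cr_bit_operand(31))  # (7, 'SO')
```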
With SVP64 extending the number of CR *Fields* to 128, the number of
32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
(8 per CR Register). Then it gets even more strange when it comes
to Vectorisation, which applies to the CR Field *numbers*.  The
hardware-for-loop for Rc=1, for example, starts at CR0 for element 0,
and moves to CR1 for element 1, and so on.  The reason here is quite
simple: each element result has to have its own CR Field co-result.

In other words, the
element is the 4-bit CR *Field*, not the bits *of* the 32-bit
CR Register, and not the CR *Register* (of which there are now 16).
All quite logical, but a little mind-bending.

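The following sketch shows the "one CR Field per element" idea for Rc=1.
Everything named here is an assumption made for illustration: `co_result`
models only a signed compare against zero, the SO bit is simply left at 0,
and `sv_rc1_fields` is not a real instruction.

```python
def co_result(value):
    """4-bit co-result as (LT, GT, EQ, SO); SO is left at 0 in this sketch."""
    return (int(value < 0), int(value > 0), int(value == 0), 0)


def sv_rc1_fields(element_results, first_field=0):
    """Give each element result its own CR Field: CR0, CR1, CR2, ..."""
    cr_fields = {}
    for i, value in enumerate(element_results):
        cr_fields[first_field + i] = co_result(value)
    return cr_fields


print(sv_rc1_fields([-5, 0, 7]))
# {0: (1, 0, 0, 0), 1: (0, 0, 1, 0), 2: (0, 1, 0, 0)}
```

The loop index advances the CR Field *number*, never a bit position inside
a Field and never a whole 32-bit CR Register.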
**Load/Store**

LOAD/STORE is another area that has different needs: this time it is
down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
which simply make no sense in a RISC Scalar ISA: element-stride and
unit-stride, and indeed the entire concept of a stride itself (a spacing
between elements), have no place at all in a Scalar ISA. The problems
come when trying to *retrofit* the concept of "Vector Elements" onto
a Scalar ISA. Consequently a couple of bits (Modes) were required in the
SVP64 RM Prefix to convey the stride mode, changing the Effective Address
computation as a result. Worth noting for Hardware
designers: it did turn out to be possible to perform pre-multiplication
of the D/DS Immediate by the stride amount, making it possible to avoid
actually modifying the LD/ST Pipeline itself.

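A rough sketch of how a stride mode might change the Effective Address is
given below. The exact formulas are an assumption made for illustration
(the LD/ST specification is the authority), as are the function name
`effective_addresses` and its parameters. The point being shown is that the
per-element offset can be computed *before* the address reaches the LD/ST
pipeline, which then sees an ordinary base-plus-immediate access.

```python
def effective_addresses(base, imm, vl, width, element_stride):
    """Per-element addresses for an immediate-offset LD/ST (illustrative).

    element_stride=True : the immediate itself is the element spacing.
    element_stride=False: unit-stride, elements packed `width` bytes apart.
    """
    addresses = []
    for i in range(vl):
        if element_stride:
            offset = imm * i          # immediate pre-multiplied by element index
        else:
            offset = imm + i * width  # contiguous elements after the offset
        addresses.append(base + offset)
    return addresses


print([hex(a) for a in effective_addresses(0x1000, 16, 4, 8, True)])
# ['0x1000', '0x1010', '0x1020', '0x1030']
print([hex(a) for a in effective_addresses(0x1000, 16, 4, 8, False)])
# ['0x1010', '0x1018', '0x1020', '0x1028']
```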
Other areas where LD/ST went quirky: element-width overrides, especially
when combined with Saturation, given that LD/ST operations have byte,
halfword, word, dword and quad variants. The interaction between these
widths as part of the actual operation, and the source and destination
elwidth overrides, was particularly obscure and hard to derive: some care
and attention is advised here when reading the specification.

**Non-vectorised**

The concept of a Vectorised halt (`attn`) makes no sense. There is never
going to be a Vector of global MSRs (Machine State Register). `mtcr`,
on the other hand, is a grey area; `mtspr` is clearly Vectoriseable.
Even for `td` and `tdi` it makes a strange type of sense to permit
Vectorisation, because a sequence of comparisons could be Vectorised.
Vectorised System Calls (`sc`), `tlbie` and other Cache or Virtual
Memory Management instructions make no sense to Vectorise.

However, it is really quite important not to be tempted to conclude that
just because these instructions are un-vectoriseable, the opcode space
must be free for reinterpretation and use for other purposes. This would
be a serious mistake because a future revision of the specification
might *retire* the Scalar instruction and replace it with another.
Again this comes down to being quite strict about the rules: only Scalar
instructions get Vectorised: there are *no* actual explicit Vector
instructions.

**Summary**

Where a traditional Vector ISA effectively duplicates the entirety
of a Scalar ISA and then adds additional instructions which only
make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
considerable lengths to keep strictly to augmentation and embedding
of an entire Scalar ISA's instructions into an abstract Vectorisation
Context. That abstraction subdivides into Categories appropriate
for the type of operation (Branch, CRs, Memory, Arithmetic),
and each Category has its own relevant but
ultimately rational quirks.

# Twin Predication

Twin Predication is an entirely new concept not present in any commercial
Vector ISA of the past forty years.  To explain, first consider how normal
Single-predication is applied in a standard Vector ISA:

* Predication on the **source** of a LOAD instruction creates something
  called "Vector Compressed Load" (VCOMPRESS).
* Predication on the **destination** of a STORE instruction creates something
  called "Vector Expanded Store" (VEXPAND).
* SVP64 allows the two to be put back-to-back: one on source, one on
  destination.

The above allows a reader familiar with VCOMPRESS and VEXPAND to
conceptualise what the effect of Twin Predication is, but it actually
goes much further: in *any* twin-predicated instruction (extsw, fmv)
it is possible to apply one predicate to the source register (compressing
the source element array) and another *completely separate* predicate
to the destination register, not just on Load/Stores but on *arithmetic*
operations.

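A minimal sketch of the back-to-back effect on a simple move follows. The
function `twin_predicated_move` and its list-based "registers" are
hypothetical, and details such as zeroing of masked-out elements are
ignored: masked-out destination elements are simply left unchanged.

```python
def twin_predicated_move(src, dst, src_mask, dst_mask):
    """Copy src to dst, skipping masked-out elements on *both* sides.

    Bit i of each mask enables element i. Source and destination
    indices advance independently of each other.
    """
    vl = len(src)
    out = list(dst)
    i = j = 0
    while i < vl and j < vl:
        if not (src_mask >> i) & 1:
            i += 1          # skip a masked-out source element
            continue
        if not (dst_mask >> j) & 1:
            j += 1          # skip a masked-out destination element
            continue
        out[j] = src[i]
        i += 1
        j += 1
    return out


src = [10, 20, 30, 40]
print(twin_predicated_move(src, [0] * 4, 0b1010, 0b1111))  # [20, 40, 0, 0]
print(twin_predicated_move(src, [0] * 4, 0b1111, 0b1010))  # [0, 10, 0, 20]
print(twin_predicated_move(src, [0] * 4, 0b1100, 0b0101))  # [30, 0, 40, 0]
```

The first call behaves like a compress, the second like an expand, and the
third applies both at once.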
No other Vector ISA in the world has this capability.  All true Vector
ISAs have Predicate Masks: it is an absolutely essential characteristic.
However none of them have abstracted dual predicates out to the extent
where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
wide range of arithmetic instructions, as well as Load/Store.

It is however important to note that not all instructions can be Twin
Predicated (2P): some remain only Single Predicated (1P), as is normally found
in other Vector ISAs. Arithmetic operations with
four registers (3-in, 1-out, VA-Form for example) are Single. The reason
is that there just wasn't enough space in the 24 bits of the SVP64 Prefix.
Consequently, when using a given instruction, it is necessary to look
up in the ISA Tables whether it is 1P or 2P. Caveat emptor!

Also worth a special mention: all Load/Store operations are Twin-Predicated.
The underlying key to understanding this is:

* one Predicate applies to the Array of Memory *Addresses*,
* the other Predicate applies to the Array of Memory *Data*.

# CR weird instructions

[[sv/int_cr_predication]] is by far the biggest violator of the SVP64
rules, for good reasons.  Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without them.

Normally, element width overrides allow the element width to be specified
as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
consuming either 1 bit or 4 bit elements (in effect) some adaptation was
required.  When this perspective is taken (that results or sources are
1 or 4 bits) the weirdness starts to make sense, because the "elements",
such as they are, are still packed sequentially.

From a hardware implementation perspective however they will need special
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management).

# mv.x

[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
ISA is anathematic. In a Traditional Vector ISA however, where the
indices are isolated behind a single Vector Hazard, there is no
problem at all.  `sv.mv.x` is also fraught, precisely because it
sits on top of a Standard Scalar register paradigm, not a Vector
ISA with separate and distinct Vector registers.

To help partly solve this, `sv.mv.x` has to be made relative:

```
for i in range(VL):
    # the index is an offset from RT, bounded by VL
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```

The reason for doing so is that MAXVL or VL may be used to limit
the number of Register Hazards that need to be raised to a fixed
quantity, at Issue time.

`mv.x` itself will still have to be added as a Scalar instruction,
but the behaviour of `sv.mv.x` will have to be different from that
Scalar version.

Normally, Scalar Instructions have a good justification for being
added as Scalar instructions on their own merit. `mv.x` is the
polar opposite, and as such qualifies for a special mention in
this section.

# Branch-Conditional

[[sv/branches]] are a very special exception to the rule that there
shall be no deviation from the corresponding
Scalar instruction.  This is because of the tight
integration with looping and the application of Boolean Logic
manipulation needed for Parallel operations (predicate mask usage).
This results in an extremely important observation: `scalar identity behaviour`
is violated, in that the SV Prefixed variant of branch is **not** the same
operation as the unprefixed 32-bit scalar version.

One key difference is that LR is only updated if certain additional
conditions are met, whereas Scalar `bclrl` for example unconditionally
overwrites LR.

Well over 500 Vectorised branch instructions exist in SVP64 due to the
number of options available: close integration and interaction with
the base Scalar Branch was unavoidable in order to create Conditional
Branching suitable for parallel 3D / CUDA GPU workloads.

# Saturation

The application of Saturation as a retro-fit to a Scalar ISA is challenging.
It does help that within the SFFS Compliancy subset there are no Saturated
operations at all: they are only added in VSX.

Saturation does not inherently change the instruction itself: it does however
come with some fundamental implications when applied. For example,
a Floating-Point operation that would normally raise an exception will
no longer do so, instead setting the CR1.SO Flag.  Another quirky
example: signed operations which produce a negative result will be
clamped to zero if Unsigned Saturation is requested.

One very important aspect for implementors is that the operation in
effect has to be considered to be performed at infinite precision,
followed by saturation detection. In practice this does not actually
require infinite precision hardware! Two 8-bit integers being
added can only ever overflow into a 9-bit result.

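The "infinite precision, then detect" model maps naturally onto Python's
unbounded integers. The sketch below is illustrative only: the function name
`saturate` and its return convention (clamped value plus a flag) are
assumptions, not part of the specification.

```python
def saturate(exact, elwidth, signed):
    """Clamp an exactly-computed result to the element's range.

    Returns (value, saturated) where `saturated` says whether clamping
    actually changed the result.
    """
    if signed:
        low, high = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:
        low, high = 0, (1 << elwidth) - 1
    clamped = max(low, min(high, exact))
    return clamped, clamped != exact


print(saturate(200 + 100, 8, signed=False))  # (255, True)
print(saturate(100 + 100, 8, signed=True))   # (127, True)
print(saturate(5 - 9, 8, signed=False))      # (0, True): negative clamps to 0
print(saturate(3 + 4, 8, signed=True))       # (7, False)
```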
Overall some care and consideration needs to be applied.

# Fail-First

Fail-First (both the Load/Store and Data-Dependent variants)
is worthy of a special mention in its own right. Where VL is
normally forward-looking and may be part of a pre-decode phase
in a (simplified) pipelined architecture with no Read-after-Write Hazards,
Fail-First changes that because at any point during the execution
of the element-level instructions, one of those elements may not only
terminate further continuation of the hardware-for-looping but also
effect a change of VL:

```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```

This is not exactly a violation of SVP64 Rules, more of a breakage
of user expectations, particularly for LD/ST: where exceptions
would normally be expected to be raised, Fail-First provides for
avoidance of those exceptions.

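For the Load/Store variant, the exception-avoidance can be sketched as
below. This is a toy model under stated assumptions: `memory` is a plain
dict in which a missing key stands for a faulting address, `PageFault` and
`load_fail_first` are hypothetical names, and the choice to let element 0
fault normally follows the usual fail-first convention rather than anything
stated on this page.

```python
class PageFault(Exception):
    """Stand-in for a memory exception (hypothetical)."""


def load_fail_first(memory, addresses):
    """Return (loaded_values, new_VL) for a fail-first load."""
    loaded = []
    for i, addr in enumerate(addresses):
        if addr not in memory:
            if i == 0:
                # Assumption: the very first element still faults as normal.
                raise PageFault(hex(addr))
            break  # later elements: truncate VL instead of raising
        loaded.append(memory[addr])
    return loaded, len(loaded)


memory = {0x100: 11, 0x108: 22}
values, new_vl = load_fail_first(memory, [0x100, 0x108, 0x110, 0x118])
print(values, new_vl)  # [11, 22] 2
```

Software then inspects the truncated VL to find out how many elements were
actually processed.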
# OE=1

The hardware cost of Sticky Overflow in a parallel environment is immense.
The SFFS Compliancy Level is permitted optionally to support XER.SO.
Therefore the decision is made to make it mandatory **not** to
support XER.SO. However, CR.SO *is* supported such that when Rc=1
is set the CR.SO flag will contain only the overflow of
the current instruction, rather than being actually "sticky".
Hardware Out-of-Order designers will recognise and appreciate
that the Hazards are
reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.

This is sort-of a quirk and sort-of not, because support for
XER.SO is already optional in the SFFS Compliancy Level.

# Indexed REMAP and CR Field Predication Hazards

Normal Vector ISAs and those Packed SIMD ISAs inspired by them have
Vector "Permute" or "Shuffle" instructions. These provide a Vector of
indices whereby another Vector is reordered (permuted, shuffled) according
to the indices.  Register Hazard Management here is trivial because there
are three registers: indices source vector, elements source vector to
be shuffled, result vector.

For SVP64 which is based on top of a Scalar Register File paradigm,
combined with the hard requirement to respect full Register Hazard
Management as if element instructions were actual Scalar instructions,
the addition of a Vector permute instruction under these strict
conditions would result in a catastrophic
reduction in performance, due to having to consider Read-after-Write
and Write-after-Read Hazards *at the element level*.

A little leniency and rule-bending is therefore required.

Rather than add explicit Vector permute instructions, the "Indexing"
has been separated out into a REMAP Schedule.  When an Indexed
REMAP is requested, it is assumed (required, of software) that
subsequent instructions intending to use those indices *will not*
attempt to modify the indices. It is *Software* that must consider them
to be read-only.

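The effect of an Indexed schedule on an otherwise ordinary element loop can
be sketched as follows. This is a simplification under stated assumptions:
the real REMAP is set up separately and may be applied to different
operands, whereas the hypothetical `sv_mv_indexed` below hard-wires the
remapping onto the source operand of a move.

```python
def sv_mv_indexed(gpr, rt, ra, index_base, vl):
    """Element loop in which the source offset comes from an index vector.

    The indices live in ordinary registers starting at `index_base`.
    Software is expected to treat them as read-only while they are in use.
    """
    for i in range(vl):
        source_offset = gpr[index_base + i]
        gpr[rt + i] = gpr[ra + source_offset]


gpr = [0] * 64
gpr[8:12] = [10, 20, 30, 40]   # elements to be shuffled
gpr[16:20] = [3, 2, 1, 0]      # indices: reverse the vector
sv_mv_indexed(gpr, rt=24, ra=8, index_base=16, vl=4)
print(gpr[24:28])  # [40, 30, 20, 10]
```

Had the destination overlapped the index registers, later elements would
have read modified indices: exactly the case that Hardware is released from
detecting and that Software must avoid.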
This simple relaxation of the rules releases Hardware from having the
horrendous job of dynamically detecting Write-after-Read Hazards on a
huge range of registers.

A similar Hazard problem exists for CR Field Predicates, in Vertical-First
Mode.  Instructions could modify CR Fields currently being used as Predicate
Masks: detecting this is so horrendous for hardware resource utilisation
and hardware complexity that, again, the decision is made to relax these
constraints and for Software to take that into account.