# The Rules

[[!toc]]

SVP64 is designed around fundamental and inviolate RISC principles.
This gives a uniformity and regularity to the ISA, making implementation
straightforward, which was why RISC
as a concept became popular.

1. There are no actual Vector instructions: Scalar instructions
   are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
   just because it is prefixed (semantic caveats below).
3. A hardware-level for-loop (the prefix) makes vector elements
   100% synonymous with scalar instructions (the suffix).

How can a Vector ISA even exist when no actual Vector instructions
are permitted to be added? It comes down to the strict RISC abstraction.
First you start from a **scalar** instruction (32-bit). Second, the
Prefixing is applied *in the abstract* to give the *appearance*,
and ultimately the same effect, as if an explicit Vector instruction
had also been added. Looking at the pseudocode of any Vector ISA
(RVV, NEC SX Aurora, Cray),
they always comprise (a) a for-loop around (b) element-based operations.
It is perfectly reasonable and rational to separate (a) from (b),
then find a powerful pre-existing
Supercomputing-class ISA that qualifies for (b).

There are a few exceptional places where these rules get
bent, and others where the rules take some explaining,
and this page tracks them all.

The modification caveat in (2) above semantically
exempts element width overrides,
which still do not actually modify the meaning of the instruction:
an add remains an add, even if its override makes it an 8-bit add rather than
a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry
where the 9th bit becomes Carry-out (not the 65th bit).
In other words, elwidth overrides **definitely** do not fundamentally
alter the actual
Scalar v3.0 ISA encoding itself. Consequently, in the strictest semantic
sense, rule (2) is still not broken.
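
As an illustrative sketch only (not part of the Specification), an
elwidth override on add-with-carry simply narrows the bit position at
which Carry-out is taken:

```
def add_with_carry(a, b, carry_in, elwidth=64):
    # Sketch: the operation is still an add-with-carry; only the bit at
    # which Carry-out appears changes (bit 8 for elwidth=8, bit 64 for
    # the default 64-bit width).
    mask = (1 << elwidth) - 1
    total = (a & mask) + (b & mask) + carry_in
    result = total & mask           # result truncated to the element width
    carry_out = total >> elwidth    # the "9th bit" when elwidth=8
    return result, carry_out

# elwidth=8: 0xFF + 0x01 -> result 0x00, Carry-out 1 (the 9th bit)
print(add_with_carry(0xFF, 0x01, 0, elwidth=8))
```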

Likewise, other "modifications" such as saturation or Data-dependent
Fail-First are actually post-augmentation or post-analysis, and do
not fundamentally change an add operation into a subtract,
for example, and under absolutely no circumstances do the actual 32-bit
Scalar v3.0 operand field bits change or the number of operands change.

In an early Draft of SVP64,
an experiment was attempted: to modify LD-immediate instructions
to include a
third RC register, i.e. reinterpret the normal
v3.0 32-bit instruction as a completely
different encoding if SVP64-prefixed. It did not go well.
The complexity that resulted
in the decode phase was too great. The lesson was learned, the
hard way: it would be infinitely preferable
to add a 32-bit Scalar Load-with-Shift
instruction *first*, which then inherently becomes Vectorised.
Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
both ARM and x86 have it, because it saves greatly on instruction count in
hot-loops.

The other reason for not adding an SVP64-Prefixed instruction without
also having it as a Scalar un-prefixed instruction is that if the
32-bit encoding is ever allocated in a future revision
of the Power ISA
to a completely unrelated operation
then how can a Vectorised version of that new instruction ever be added?
The uniformity and RISC Abstraction would be irreparably damaged.
The bottom line here is that the fundamental RISC Principle is strictly adhered
to, even though these are Advanced 64-bit Vector instructions.
Advocates of the RISC Principle will appreciate the uniformity of
SVP64 and the level of systematic abstraction kept between Prefix and Suffix.

# Instruction Groups

The basic principle of SVP64 is the prefix, which contains mode
as well as register augmentation and predicates. When thinking of
instructions and Vectorising them, it is natural for arithmetic
operations (ADD, OR) to be the first to spring to mind.
Arithmetic instructions have registers, therefore augmentation
applies, end of story, right?

Except, Load and Store also deal with Memory, not just registers.
The Power ISA has Condition Register Fields: how can element widths
apply there? And branches: how can you have Saturation on something
that does not return an arithmetic result? In short: there are actually
four different categories (five including those for which Vectorisation
makes no sense at all, such as `sc` or `mtmsr`). The categories are:

* arithmetic/logical including floating-point
* Load/Store
* Condition Register Field operations
* branch

**Arithmetic**

Arithmetic (known as "normal" mode) is where Scalar and Parallel
Reduction can be done: Saturation as well, and two new innovative
modes for Vector ISAs: data-dependent fail-first and predicate result.
Reduction and Saturation are common to see in Vector ISAs: it is just
that they are usually added as explicit instructions,
and NEC SX Aurora has even more iterative instructions. In SVP64 these
concepts are applied in the abstract general form, which takes some
getting used to.

Reduction, when applied incorrectly to non-commutative
instructions, may result in invalid results, but ultimately
it is critical to think in terms of the "rules": that everything is
Scalar instructions in strict Program Order. Reduction on non-commutative
Scalar Operations is not *prohibited*: the strict Program Order allows
the programmer to think through what would happen and thus potentially
come up with legitimate uses.
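
To illustrate the principle (a sketch only, using a simple sequential
schedule rather than the Specification's actual Parallel Reduction
Schedule), a reduction is still nothing more than Scalar operations
issued in strict Program Order:

```
def reduce_add(GPR, RT, RA, VL):
    # Sketch: accumulate VL elements starting at RA using ordinary
    # scalar adds, then place the total in RT.
    acc = GPR[RA]
    for i in range(1, VL):
        acc = acc + GPR[RA + i]   # each step is a plain scalar add
    GPR[RT] = acc

regs = {0: 0, 1: 10, 2: 20, 3: 30, 4: 40}
reduce_add(regs, RT=0, RA=1, VL=4)
print(regs[0])   # 100
```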

**Branches**

Branch is the one and only place where the Scalar
(non-prefixed) operations differ from the Vector (element)
instructions (as explained in a separate section), although
a case could be made for the perspective that they are identical,
given that the defaults for the new parameters in the Scalar case make branch
identical to Power ISA v3.1 Scalar branches.

The
RM bits can be used for other purposes because the Arithmetic modes
make no sense at all for a Branch.
Almost the entire
SVP64 RM Field is interpreted differently from other Modes, in
order to support a wide range of parallel boolean condition options
which are expected of a Vector / GPU ISA. These save a considerable
number of instructions in tight inner loop situations.

**CR Field Ops**

Condition Register Fields are 4 bits wide and consequently element-width
overrides make absolutely no sense whatsoever. Therefore the elwidth
override field bits can be used for other purposes when Vectorising
CR Field instructions. Moreover, Rc=1 is completely invalid for
CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
such as predicate-result make no sense, and neither does Saturation.
All of these differences, which require quite a lot of logical
reasoning and deduction, help explain why there is an entirely different
CR ops Vectorisation Category.

A particularly strange quirk of CR-based Vector Operations is that the
Scalar Power ISA CR Register is 32 bits, but actually comprises eight
CR Fields, CR0-CR7. With each CR Field being four bits (EQ, LT, GT, SO)
this makes up 32 bits, and therefore a CR operand referring to one bit
of the CR will be 5 bits in length (BA, BT).
*However*, some instructions refer
to a *CR Field* (CR0-CR7) and consequently these operands
(BF, BFA etc.) are only 3 bits.

(*It helps here to think of the top 3 bits of BA as referring
to a CR Field, like BFA does, and the bottom 2 bits of BA
referring to
EQ/LT/GT/SO within that Field*)
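
A small sketch of that decomposition (illustrative only):

```
def decode_cr_bit_operand(BA):
    # Split a 5-bit CR-bit operand (such as BA) into a 3-bit CR Field
    # number (like BF/BFA) plus a 2-bit position within that Field.
    cr_field = BA >> 2          # which CR Field, CR0-CR7
    bit_in_field = BA & 0b11    # 0=LT, 1=GT, 2=EQ, 3=SO within the Field
    return cr_field, bit_in_field

print(decode_cr_bit_operand(0b10110))   # (5, 2): CR5, the EQ bit
```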

With SVP64 extending the number of CR *Fields* to 128, the number of
32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
(8 per CR Register). Then it gets even more strange when it comes
to Vectorisation, which applies to the CR Field *numbers*. The
hardware for-loop for Rc=1, for example, starts at CR0 for element 0,
and moves to CR1 for element 1, and so on. The reason here is quite
simple: each element result has to have its own CR Field co-result.

In other words, the
element is the 4-bit CR *Field*, not the bits *of* the 32-bit
CR Register, and not the CR *Register* (of which there are now 16).
All quite logical, but a little mind-bending.
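
As a sketch only (not the Specification's pseudocode), a Vectorised
Rc=1 operation therefore produces one CR Field co-result per element:

```
def cr_flags(result):
    # LT, GT, EQ of a signed compare against zero (SO not modelled here)
    return {"LT": result < 0, "GT": result > 0, "EQ": result == 0}

VL = 4
RA = [5, -3, 0, 7]     # source elements
RT = []                # result elements
CR = {}                # CR Fields, indexed by CR Field number
for i in range(VL):
    RT.append(RA[i] * 2)        # some element operation
    CR[i] = cr_flags(RT[i])     # element i writes its own CR Field (CR i)

print(CR[1])   # {'LT': True, 'GT': False, 'EQ': False}
```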

**Load/Store**

LOAD/STORE is another area that has different needs: this time it is
down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
which simply make no sense in a RISC Scalar ISA: element-stride and
unit-stride and the entire concept of a stride itself (a spacing
between elements) have no place at all in a Scalar ISA. The problems
come when trying to *retrofit* the concept of "Vector Elements" onto
a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64
RM Prefix to convey the stride mode, changing the Effective Address
computation as a result. Interestingly, and worth noting for Hardware
designers: it did turn out to be possible to perform pre-multiplication
of the D/DS Immediate by the stride amount, making it possible to avoid
actually modifying the LD/ST Pipeline itself.
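
An illustrative sketch only (see [[sv/ldst]] for the Specification's
exact Effective-Address pseudocode): the point being made above is that
the per-element offset can be folded ("pre-multiplied") into the
immediate, so the Scalar LD/ST pipeline still only ever sees a base
plus an offset:

```
def effective_addresses(RA, D, VL, elwidth_bytes, element_strided):
    # Sketch of unit-strided vs element-strided address generation.
    eas = []
    for i in range(VL):
        if element_strided:
            offset = D * i                   # immediate acts as the stride
        else:
            offset = D + i * elwidth_bytes   # unit stride: consecutive elements
        eas.append(RA + offset)              # pipeline computes base + offset
    return eas

print(effective_addresses(RA=0x1000, D=8, VL=4, elwidth_bytes=8,
                          element_strided=True))
# [4096, 4104, 4112, 4120]
```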

Other areas where LD/ST went quirky: element-width overrides, especially
when combined with Saturation, given that LD/ST operations have byte,
halfword, word, dword and quad variants. The interaction between these
widths as part of the actual operation, and the source and destination
elwidth overrides, was particularly obtuse and hard to derive: some care
and attention is advised, here, when reading the specification,
especially on arithmetic loads (lbarx, lharx etc.)

**Non-vectorised**

The concept of a Vectorised halt (`attn`) makes no sense. There is never
going to be a Vector of global MSRs (Machine State Register). `mtcr`
on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
It even makes a strange type of sense to permit `td` and `tdi` to be
Vectorised, because a sequence of comparisons could be Vectorised.
Vectorised System Calls (`sc`) or `tlbie` and other Cache or Virtual
Memory Management
instructions make no sense to Vectorise.

However, it is really quite important not to be tempted to conclude that
just because these instructions are un-vectoriseable, the Prefix opcode space
must be free for reinterpretation and use for other purposes. This would
be a serious mistake because a future revision of the specification
might *retire* the Scalar instruction, and, worse, replace it with another.
Again this comes down to being quite strict about the rules: only Scalar
instructions get Vectorised: there are *no* actual explicit Vector
instructions.

**Summary**

Where a traditional Vector ISA effectively duplicates the entirety
of a Scalar ISA and then adds additional instructions which only
make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
considerable lengths to keep strictly to augmentation and embedding
of an entire Scalar ISA's instructions into an abstract Vectorisation
Context. That abstraction subdivides down into Categories appropriate
for the type of operation (Branch, CRs, Memory, Arithmetic),
and each Category has its own relevant but
ultimately rational quirks.

# Abstraction between Prefix and Suffix

In the introduction paragraph, a great fuss was made emphasising that
the Prefix is kept separate from the Suffix. The whole idea there is
that a Multi-issue Decoder and subsequent pipelines would in no way have
"back-propagation" of state that can only be determined far too late.
This *has* been preserved; however, there is a hiccup.

In Power ISA 3.1 a 64-bit Prefix was introduced: EXT001.
The encoding of the prefix has 6 bits that are dedicated to letting
the hardware know what the remainder of the Prefix bits mean: how they
are formatted, even without having to examine the Suffix to which
they are applied.

SVP64 has such pressure on its 24-bit encoding that it was simply
not possible to perform the same trick used by Power ISA 3.1 Prefixing.
Therefore, rather unfortunately, it becomes necessary to perform
a *partial decoding* of the v3.0 Suffix before the 24-bit SVP64 RM
Fields may be identified. Fortunately this is straightforward, and
does not rely on any outside state; even more fortunately,
for a Multi-Issue Execution decoder, the length (32 or 64 bits) is also
easy to identify by looking for the EXT001 pattern. Once identified,
the 32/64 bits may be passed independently to multiple Decoders in
parallel.
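
A sketch of that length identification (an assumed helper for
illustration, not part of any specification): the EXT001 pattern is
simply Primary Opcode 1 in the first 32 bits.

```
def instruction_length(first_word):
    # Top 6 bits of the first 32-bit word are the Primary Opcode;
    # opcode 1 (EXT001) marks a 64-bit prefixed instruction.
    major_opcode = (first_word >> 26) & 0x3F
    return 64 if major_opcode == 0b000001 else 32

print(instruction_length(0x04000000))   # 64: an EXT001-prefixed instruction
print(instruction_length(0x7C221A14))   # 32: plain v3.0 "add r1,r2,r3"
```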

# Predication

Predication is entirely missing from the Power ISA.
Adding it would be a costly mistake because it cannot be retrofitted
to an ISA without literally duplicating all instructions. Prefixing
is about the only sane way to go.

CR Fields as predicate masks could be spread across multiple register
file entries, making them costly to read in one hit. Therefore the
possibility exists that an instruction element writing to a CR Field
could *overwrite* the Predicate mask CR Vector during the middle of
a for-loop.

Clearly this is bad, so don't do it. If there are potential issues
they can be avoided by using the crweird instructions to get CR Field
bits into an Integer GPR (r3, r10 or r30) and using that GPR as a
Predicate mask instead.

Even in Vertical-First Mode, which is a single Scalar instruction executed
with "offset" registers (in effect), the rule still applies: don't write
to the same register being used as the predicate; it's `UNDEFINED`
behaviour.

## Single Predication

So named because there is a Twin Predication concept as well, Single
Predication is also unlike other Vector ISAs because it allows zeroing
on both the source and destination. This takes some explaining.

In Vector ISAs there is a Predicate Mask; it applies to the
destination only, and there
is a choice of actions when a Predicate Mask bit
is zero:

* set the destination element to zero
* skip that element operation entirely, leaving the destination unmodified
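
A minimal sketch of those two behaviours (zeroing versus skipping),
for illustration only:

```
def predicated_op(dest, src, mask, zeroing, op=lambda x: x * 2):
    for i in range(len(dest)):
        if (mask >> i) & 1:
            dest[i] = op(src[i])    # predicate bit set: perform the operation
        elif zeroing:
            dest[i] = 0             # zeroing: masked-out element set to zero
        # else: skip entirely, leaving dest[i] unmodified

d = [9, 9, 9, 9]
predicated_op(d, [1, 2, 3, 4], mask=0b0101, zeroing=True)
print(d)   # [2, 0, 6, 0]

d = [9, 9, 9, 9]
predicated_op(d, [1, 2, 3, 4], mask=0b0101, zeroing=False)
print(d)   # [2, 9, 6, 9]
```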

The problem comes if the underlying register file SRAM has, say, 64-bit
write granularity but the Vector elements are, say, 8-bit wide.
Some Vector ISAs strongly advocate Zeroing because to leave one single
element at a small bitwidth in amongst other elements, where the register
file does not have the prerequisite access granularity, is very expensive,
requiring a Read-Modify-Write cycle to preserve the untouched elements.
Putting zero into the destination avoids that Read.

This is technically very easy to solve: use a Register File that does
in fact have the smallest element-level write-enable granularity.
If the elements are 8 bit then allow 8-bit writes!

With that technical issue solved there is nothing in the way of choosing
to support both zeroing and non-zeroing (skipping) at the ISA level:
SV chooses to further support both *on both the source and destination*.
This can result in the source and destination
element indices getting "out-of-sync", even though the Predicate Mask
is the same, because the behaviour is different when zeros in the
Predicate are encountered.

## Twin Predication

Twin Predication is an entirely new concept not present in any commercial
Vector ISA of the past forty years. To explain how normal Single-predication
is applied in a standard Vector ISA:

* Predication on the **source** of a LOAD instruction creates something
  called "Vector Compressed Load" (VCOMPRESS).
* Predication on the **destination** of a STORE instruction creates something
  called "Vector Expanded Store" (VEXPAND).
* SVP64 allows the two to be put back-to-back: one on source, one on
  destination.

The above allows a reader familiar with VCOMPRESS and VEXPAND to
conceptualise what the effect of Twin Predication is, but it actually
goes much further: in *any* twin-predicated instruction (extsw, fmv)
it is possible to apply one predicate to the source register (compressing
the source element array) and another *completely separate* predicate
to the destination register, not just on Load/Stores but on *arithmetic*
operations.
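
A conceptual sketch (not the Specification's exact pseudocode) of a
twin-predicated move with non-zeroing on both sides, showing the
back-to-back compress-then-expand effect:

```
def twin_predicated_mv(dst, src, srcmask, dstmask, VL):
    srcstep, dststep = 0, 0
    while srcstep < VL and dststep < VL:
        while srcstep < VL and not (srcmask >> srcstep) & 1:
            srcstep += 1               # skip masked-out source elements
        while dststep < VL and not (dstmask >> dststep) & 1:
            dststep += 1               # skip masked-out destination elements
        if srcstep < VL and dststep < VL:
            dst[dststep] = src[srcstep]   # the actual scalar element move
            srcstep += 1
            dststep += 1

d = [0] * 8
twin_predicated_mv(d, [1, 2, 3, 4, 5, 6, 7, 8],
                   srcmask=0b00001111, dstmask=0b11110000, VL=8)
print(d)   # [0, 0, 0, 0, 1, 2, 3, 4]
```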

No other Vector ISA in the world has this back-to-back
capability. All true Vector
ISAs have Predicate Masks: it is an absolutely essential characteristic.
However none of them have abstracted dual predicates out to the extent
where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
wide range of arithmetic
instructions, as well as Load/Store.

It is however important to note that not all instructions can be Twin
Predicated (2P): some remain only Single Predicated (1P), as is normally found
in other Vector ISAs. Arithmetic operations with
four registers (3-in, 1-out, VA-Form for example) are Single. The reason
is that there just wasn't enough space in the 24 bits of the SVP64 Prefix.
Consequently, when using a given instruction, it is necessary to look
up in the ISA Tables whether it is 1P or 2P. Caveat emptor!

Also worth a special mention: all Load/Store operations are Twin-Predicated.
The underlying key to understanding:

* one Predicate effectively applies to the Array of Memory *Addresses*,
* the other Predicate effectively applies to the Array of Memory *Data*.

# CR weird instructions

[[sv/cr_int_predication]] is by far the biggest violator of the SVP64
rules, for good reasons. Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without them.

Normally, element width overrides allow the element width to be specified
as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
consuming either 1-bit or 4-bit elements (in effect) some adaptation was
required. When this perspective is taken (that results or sources are
1 or 4 bits) the weirdness starts to make sense, because the "elements",
such as they are, are still packed sequentially.
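
A conceptual sketch only (not the exact semantics of any particular
crweird instruction): 1-bit "elements", each derived from one CR Field,
are packed sequentially into an Integer register for use as a predicate.

```
def pack_cr_bits_to_gpr(cr_fields, bit_name, VL):
    gpr = 0
    for i in range(VL):
        if cr_fields[i][bit_name]:   # test e.g. the EQ bit of CR Field i
            gpr |= 1 << i            # element i becomes bit i of the GPR
    return gpr

crs = [{"EQ": True}, {"EQ": False}, {"EQ": True}, {"EQ": True}]
print(bin(pack_cr_bits_to_gpr(crs, "EQ", VL=4)))   # 0b1101
```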

From a hardware implementation perspective, however, they will need special
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management).

# mv.x (vector permute)

[[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
terms of Register Hazard Management that its addition to any Scalar
ISA is anathema. In a Traditional Vector ISA however, where the
indices are isolated behind a single Vector Hazard, there is no
problem at all. `sv.mv.x` is also fraught, precisely because it
sits on top of a Standard Scalar register paradigm, not a Vector
ISA with separate and distinct Vector registers.

To help partly solve this, `sv.mv.x` would have had to have
been made relative:

```
for i in range(VL):
    GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
```

The reason for doing so is that MAXVL or VL may be used to limit
the number of Register Hazards that need to be raised to a fixed
quantity, at Issue time.

`mv.x` itself would still have to be added as a Scalar instruction,
but the behaviour of `sv.mv.x` would have to be different from that
Scalar version.

Normally, Scalar Instructions have a good justification for being
added as Scalar instructions on their own merit. `mv.x` is the
polar opposite, and in the end, the idea was thrown out, and Indexed
REMAP added in its place. Indexed REMAP comes with its own quirks,
solving the Hazard problem, described in a later section.

# Branch-Conditional

[[sv/branches]] are a very special exception to the rule that there
shall be no deviation from the corresponding
Scalar instruction. This is because of the tight
integration with looping and the application of Boolean Logic
manipulation needed for Parallel operations (predicate mask usage).
This results in an extremely important observation that `scalar identity
behaviour` is violated: the SV Prefixed variant of branch is **not** the same
operation as the unprefixed 32-bit scalar version.

One key difference is that LR is only updated if certain additional
conditions are met, whereas Scalar `bclrl` for example unconditionally
overwrites LR.

Another is that the Vectorised Branch-Conditional instructions are the
only ones where there are side-effects on predication when skipping
is enabled. This is so as to be able to use CTR to count down
*masked-out* elements.

Well over 500 Vectorised branch instructions exist in SVP64 due to the
number of options available: close integration and interaction with
the base Scalar Branch was unavoidable in order to create Conditional
Branching suitable for parallel 3D / CUDA GPU workloads.
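
As a highly simplified conceptual sketch (the real Vectorised branch
has far more options; see [[sv/branches]]), one of the available
behaviours is an "ALL" test: every element's CR Field must pass for
the branch to be taken.

```
def sv_branch_all(cr_fields, bit_name, VL):
    # Sketch: test the named bit of each element's CR Field in turn;
    # a single failing element means the branch is not taken.
    for i in range(VL):
        if not cr_fields[i][bit_name]:
            return False
    return True

crs = [{"EQ": True}, {"EQ": True}, {"EQ": False}]
print(sv_branch_all(crs, "EQ", VL=3))   # False
```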

# Saturation

The application of Saturation as a retro-fit to a Scalar ISA is challenging.
It does help that within the SFFS Compliancy subset there are no Saturated
operations at all: they are only added in VSX.

Saturation does not inherently change the instruction itself: it does however
come with some fundamental implications, when applied. For example:
a Floating-Point operation that would normally raise an exception will
no longer do so, instead setting the CR1.SO Flag. Another quirky
example: signed operations which produce a negative result will be
truncated to zero if Unsigned Saturation is requested.

One very important aspect for implementors is that the operation in
effect has to be considered to be performed at infinite precision,
followed by saturation detection. In practice this does not actually
require infinite precision hardware! Two 8-bit integers being
added can only ever overflow into a 9-bit result.
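
A sketch of "infinite precision followed by saturation detection" for
an 8-bit unsigned saturating add (illustration only): one extra bit of
intermediate precision is all that is actually needed.

```
def sat_add_u8(a, b):
    total = a + b             # at most a 9-bit intermediate result
    if total > 0xFF:
        return 0xFF, True     # clamp, and signal that saturation occurred
    return total, False

print(sat_add_u8(200, 100))   # (255, True)
print(sat_add_u8(20, 30))     # (50, False)
```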

Overall some care and consideration needs to be applied.

# Fail-First

Fail-First (both the Load/Store and Data-Dependent variants)
is worthy of a special mention in its own right. Where VL is
normally forward-looking and may be part of a pre-decode phase
in a (simplified) pipelined architecture with no Read-after-Write Hazards,
Fail-First changes that, because at any point during the execution
of the element-level instructions, one of those elements may not only
terminate further continuation of the hardware for-looping but also
effect a change of VL:

```
for i in range(VL):
    result = element_operation(GPR(RA+i), GPR(RB+i))
    if test(result):
        VL = i
        break
```

This is not exactly a violation of SVP64 Rules, more of a breakage
of user expectations, particularly for LD/ST where exceptions
would normally be expected to be raised; Fail-First provides for
avoidance of those exceptions.

For Hardware implementers, a standard Out-of-Order micro-architecture
allows for Cancellation of speculatively-executed elements that extended
beyond the Vector Truncation point. In-order systems will have a slightly
harder time and may choose to execute one element only at a time, reducing
performance as a result.

# OE=1

The hardware cost of Sticky Overflow in a parallel environment is immense.
The SFFS Compliancy Level is permitted optionally to support XER.SO.
Therefore the decision is made to make it mandatory **not** to
support XER.SO. However, CR.SO *is* supported such that when Rc=1
is set the CR.SO flag will contain only the overflow of
the current instruction, rather than being actually "sticky".
Hardware Out-of-Order designers will recognise and appreciate
that the Hazards are
reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.

This is sort-of a quirk and sort-of not, because the option to support
XER.SO is already optional from the SFFS Compliancy Level.

# Indexed REMAP and CR Field Predication Hazards

Normal Vector ISAs and those Packed SIMD ISAs inspired by them have
Vector "Permute" or "Shuffle" instructions. These provide a Vector of
indices whereby another Vector is reordered (permuted, shuffled) according
to the indices. Register Hazard Management here is trivial because there
are three registers: indices source vector, elements source vector to
be shuffled, result vector.

For SVP64, which is based on top of a Scalar Register File paradigm,
combined with the hard requirement to respect full Register Hazard
Management as if element instructions were actual Scalar instructions,
the addition of a Vector permute instruction under these strict
conditions would result in a catastrophic
reduction in performance, due to having to consider Read-after-Write
and Write-after-Read Hazards *at the element level*.

A little leniency and rule-bending is therefore required.

Rather than add explicit Vector permute instructions, the "Indexing"
has been separated out into a REMAP Schedule. When an Indexed
REMAP is requested, it is assumed (required, of software) that
subsequent instructions intending to use those indices *will not*
attempt to modify the indices. It is *Software* that must consider them
to be read-only.

This simple relaxation of the rules releases Hardware from having the
horrendous job of dynamically detecting Write-after-Read Hazards on a
huge range of registers.
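
A conceptual sketch only (see [[sv/remap]] for the actual Schedules):
with Indexed REMAP the per-element register offset is looked up from a
table of indices held in GPRs, which software has promised not to
modify while the Schedule is in use.

```
def indexed_remap_add(GPR, RT, RA, RB, indices, VL):
    for i in range(VL):
        # the destination element offset is permuted via the index table,
        # which Hardware may treat as read-only (no WAR Hazard checks)
        GPR[RT + indices[i]] = GPR[RA + i] + GPR[RB + i]

regs = {i: 0 for i in range(32)}
regs.update({8: 10, 9: 20, 10: 30, 11: 40})   # source vector at r8-r11
indexed_remap_add(regs, RT=16, RA=8, RB=24, indices=[3, 2, 1, 0], VL=4)
print([regs[16 + i] for i in range(4)])       # [40, 30, 20, 10]
```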

A similar Hazard problem exists for CR Field Predicates, in Vertical-First
Mode. Instructions could modify CR Fields currently being used as Predicate
Masks: detecting this is so horrendous for hardware resource utilisation
and hardware complexity that, again, the decision is made to relax these
constraints and for Software to take that into account.

# Floating-Point "Single" becomes "Half"

In several places in the Power ISA there are operations that are on
32-bit quantities in 64-bit registers. The best example is FP which
has 64-bit operations (`fadd`) and 32-bit operations (`fadds` or
FP Add "single"). Element-width overrides would seem to
be unnecessary under these circumstances.

However, it is not possible for `fadds` to fit two elements into
64 bits: that breaks the simplicity of SVP64.
Bear in mind that the FP32 bits are spread out across a 64-bit
register in FP64 format. The solution here was to consider the
"s" at the end of each instruction
to mean "half of the element's width". Thus, `sv.fadds/ew=32`
actually stores an FP16 spread out across the 32 bits of an
element, in FP32 format, where `sv.fadd/ew=32` stores a full
FP32 result into the full 32 bits.

Where this breaks down is when attempting to do half-width on
BF16 or FP16 operations: there does not exist a BF8 or an IEEE 754 FP8
format, so these should be avoided.
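
A summary sketch of the resulting widths, derived from the
"half of the element's width" rule above (illustrative only):

```
# The trailing "s" halves the effective width at any given elwidth override.
result_format = {
    ("fadd",  64): "FP64",   # full element width
    ("fadd",  32): "FP32",
    ("fadd",  16): "FP16",
    ("fadds", 64): "FP32",   # half of 64
    ("fadds", 32): "FP16",   # half of 32
    ("fadds", 16): None,     # would be FP8: no such standard format, avoid
}
print(result_format[("fadds", 32)])   # FP16
```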

# Vertical-First and Subvectors

Documented in the [[sv/setvl]] page, Vertical-First works through
instructions first and elements second, requiring an explicit
[[sv/svstep]] instruction to move to the next element
(whereas Horizontal-First
loops through elements in full first before moving on to
the next instruction). *Subvectors are considered "elements"*
in Vertical-First Mode.

It is conceptually quite easy to keep in mind that a Vertical-First
instruction does one element at a time, and when SUBVL is set,
that "element" in essence becomes a vec2/3/4.
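
A conceptual sketch (not the Specification's pseudocode) of the
Vertical-First pattern, where an explicit step moves on to the next
element (or next subvector, when SUBVL is set):

```
def vertical_first(body, VL, SUBVL=1):
    element = 0
    while element < VL:
        body(element, SUBVL)   # a batch of instructions acts on this element only
        element += 1           # the explicit svstep: advance to the next element

vertical_first(lambda e, s: print(f"element {e} (subvector of {s})"),
               VL=3, SUBVL=2)
```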

# Swizzle and Pack/Unpack

These are both so weird it's best to just read the pages in full
and pay attention: [[sv/mv.swizzle]] and [[sv/mv.vec]].
Swizzle Moves only engage with vec2/3/4, *reordering* the copying
of the sub-vector elements (including allowing repeats and skips)
based on an immediate supplied by the instruction. The fun
comes when Pack/Unpack are enabled, and it is really important
to be aware how the Arrays of vec2/3/4 become re-ordered
*and swizzled at the same time*.

Pack/Unpack applies to
[[sv/mv.vec]] as well; however, the uniform relationship and
the fact that the source and destination subvector length
must be the same (vec2/3/4) make things slightly easier to
understand. The main thing to keep in mind about Pack/Unpack
is that it engages a swap of the ordering of the VL-SUBVL
nested for-loops, in exactly the same way that Matrix REMAP
can do. When Pack or Unpack is enabled it is the SUBVL for-loop
that becomes outermost.
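
A sketch of that loop-order swap (illustration only):

```
# Normally the SUBVL for-loop is innermost; with Pack/Unpack enabled
# it becomes outermost, re-ordering which (element, subvector) pair is
# processed at each step.
VL, SUBVL = 3, 2

normal = [(i, j) for i in range(VL) for j in range(SUBVL)]   # SUBVL inner
packed = [(i, j) for j in range(SUBVL) for i in range(VL)]   # SUBVL outer

print(normal)   # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print(packed)   # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```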

# No Scalar GPR Move

Perhaps unsurprisingly the Scalar Power ISA does not have
a Scalar GPR Move instruction: instead, there are a series
of pseudo-op opportunities such as `addi RT,RA,0` or `ori RT,RA,0`
and many more.

Strictly speaking these may orthogonally be Vectorised and achieve
the same effect as a Vector Move. However these instructions
are marked as `RM-2P-1S1D` and have EXTRA3 Augmentation. In other
words it is not possible to use them in Pack/Unpack Mode.
There is however a trick: by applying a straight linear swizzle map
(X to X, Y to Y...) with [[sv/mv.swizzle]], the `RM-2P-1S1D-PU` mode
of `sv.mv.swizzle`
is available.