1 # The Rules
2
3 [[!toc]]
4
5 SVP64 is designed around fundamental and inviolate RISC principles.
6 This gives a uniformity and regularity to the ISA, making implementation
7 straightforward, which was why RISC
8 as a concept became popular.
9
10 1. There are no actual Vector instructions: Scalar instructions
11 are the sole exclusive bedrock.
2. No scalar instruction ever deviates in its encoding or meaning
just because it is prefixed (semantic caveats below).
3. A hardware-level for-loop (the prefix) makes vector elements
100% synonymous with scalar instructions (the suffix).
16 4. Exactly as with Scalar RISC ISAs, the uniformity does produce
17 "holes" in the encoding or some strange combinations.
18
19 How can a Vector ISA even exist when no actual Vector instructions
20 are permitted to be added? It comes down to the strict RISC abstraction.
21 First you start from a **scalar** instruction (32-bit). Second, the
22 Prefixing is applied *in the abstract* to give the *appearance*
23 and ultimately the same effect as if an explicit Vector instruction
24 had also been added. Looking at the pseudocode of any Vector ISA
25 (RVV, NEC SX Aurora, Cray)
26 they always comprise (a) a for-loop around (b) element-based operations.
27 It is perfectly reasonable and rational to separate (a) from (b)
28 then find a powerful pre-existing
29 Supercomputing-class ISA that qualifies for (b).
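
To make that separation concrete, here is a minimal sketch (illustrative Python-style pseudocode with made-up register numbers, not the specification's own pseudocode) of what a prefixed scalar `add` conceptually expands to:

```
# a minimal conceptual sketch: the Prefix supplies the for-loop (a);
# the unmodified Scalar operation supplies the element operation (b)
VL = 4                         # Vector Length, set up by the Prefix
GPR = [0] * 128                # illustrative register file
RT, RA, RB = 8, 16, 24         # illustrative register numbers
for i in range(VL):                             # (a) the Prefix
    GPR[RT + i] = GPR[RA + i] + GPR[RB + i]     # (b) the Suffix: a plain scalar add
```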
30
31 There are a few exceptional places where these rules get
32 bent, and others where the rules take some explaining,
33 and this page tracks them all.
34
35 The modification caveat in (2) above semantically
36 exempts element width overrides,
37 which still do not actually modify the meaning of the instruction:
38 an add remains an add, even if its override makes it an 8-bit add rather than
39 a 64-bit add. Even add-with-carry remains an add-with-carry: it's just
40 that when elwidth=8 in the Prefix it's an *8-bit* add-with-carry
41 where the 9th bit becomes Carry-out (not the 65th bit).
42 In other words, elwidth overrides **definitely** do not fundamentally
43 alter the actual
Scalar v3.0 ISA encoding itself. Consequently we are still, in
the strictest semantic sense, not breaking rule (2).
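
As a sketch of that principle (illustrative Python, not specification pseudocode): an elwidth=8 add-with-carry is still an add-with-carry, simply performed at 8-bit width with the 9th bit becoming Carry-out.

```
# illustrative sketch of elwidth=8 add-with-carry
def adde_elwidth8(a, b, carry_in):
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    return total & 0xFF, (total >> 8) & 1   # (8-bit result, Carry-out from bit 9)
```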
46
Likewise, other "modifications" such as saturation or Data-dependent
Fail-First are actually post-augmentation or post-analysis, and do
49 not fundamentally change an add operation into a subtract
50 for example, and under absolutely no circumstances do the actual 32-bit
51 Scalar v3.0 operand field bits change or the number of operands change.
52
53 In an early Draft of SVP64,
54 an experiment was attempted, to modify LD-immediate instructions
55 to include a
56 third RC register i.e. reinterpret the normal
57 v3.0 32-bit instruction as a completely
58 different encoding if SVP64-prefixed. It did not go well.
59 The complexity that resulted
60 in the decode phase was too great. The lesson was learned, the
61 hard way: it would be infinitely preferable
62 to add a 32-bit Scalar Load-with-Shift
63 instruction *first*, which then inherently becomes Vectorised.
64 Perhaps a future Power ISA spec will have this Load-with-Shift instruction:
65 both ARM and x86 have it, because it saves greatly on instruction count in
66 hot-loops.
67
68 The other reason for not adding an SVP64-Prefixed instruction without
69 also having it as a Scalar un-prefixed instruction is that if the
70 32-bit encoding is ever allocated in a future revision
71 of the Power ISA
72 to a completely unrelated operation
73 then how can a Vectorised version of that new instruction ever be added?
74 The uniformity and RISC Abstraction is irreparably damaged.
75 Bottom line here is that the fundamental RISC Principle is strictly adhered
76 to, even though these are Advanced 64-bit Vector instructions.
77 Advocates of the RISC Principle will appreciate the uniformity of
78 SVP64 and the level of systematic abstraction kept between Prefix and Suffix.
79
80 # Instruction Groups
81
82 The basic principle of SVP64 is the prefix, which contains mode
83 as well as register augmentation and predicates. When thinking of
84 instructions and Vectorising them, it is natural for arithmetic
85 operations (ADD, OR) to be the first to spring to mind.
86 Arithmetic instructions have registers, therefore augmentation
87 applies, end of story, right?
88
Except that Load and Store also deal with Memory, not just registers.
90 Power ISA has Condition Register Fields: how can element widths
91 apply there? And branches: how can you have Saturation on something
92 that does not return an arithmetic result? In short: there are actually
93 four different categories (five including those for which Vectorisation
94 makes no sense at all, such as `sc` or `mtmsr`). The categories are:
95
96 * arithmetic/logical including floating-point
97 * Load/Store
98 * Condition Register Field operations
99 * branch
100
101 **Arithmetic**
102
103 Arithmetic (known as "normal" mode) is where Scalar and Parallel
Reduction can be done: Saturation as well, and a new and innovative
mode for Vector ISAs: data-dependent fail-first.
106 Reduction and Saturation are common to see in Vector ISAs: it is just
107 that they are usually added as explicit instructions,
108 and NEC SX Aurora has even more iterative instructions. In SVP64 these
109 concepts are applied in the abstract general form, which takes some
110 getting used to.
111
Reduction may, when applied incorrectly to non-commutative
instructions, result in invalid results, but ultimately
it is critical to think in terms of the "rules": everything is
Scalar instructions in strict Program Order. Reduction on non-commutative
Scalar Operations is not *prohibited*: the strict Program Order allows
the programmer to think through what would happen and thus potentially
come up with legitimate uses.
119
120 **Branches**
121
Branch is the one and only place where the Scalar
(non-prefixed) operations differ from the Vector (element)
instructions (as explained in a separate section), although
a case could be made for the perspective that they are identical,
because the defaults for the new parameters in the Scalar case make branch
identical to Power ISA v3.1 Scalar branches.
128
129 The
130 RM bits can be used for other purposes because the Arithmetic modes
131 make no sense at all for a Branch.
132 Almost the entire
133 SVP64 RM Field is interpreted differently from other Modes, in
134 order to support a wide range of parallel boolean condition options
135 which are expected of a Vector / GPU ISA. These save a considerable
136 number of instructions in tight inner loop situations.
137
138 **CR Field Ops**
139
140 Condition Register Fields are 4-bit wide and consequently element-width
141 overrides make absolutely no sense whatsoever. Therefore the elwidth
142 override field bits can be used for other purposes when Vectorising
143 CR Field instructions. Moreover, Rc=1 is completely invalid for
144 CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
145 a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense.
146 All of these differences, which require quite a lot of logical
147 reasoning and deduction, help explain why there is an entirely different
148 CR ops Vectorisation Category.
149
150 A particularly strange quirk of CR-based Vector Operations is that the
151 Scalar Power ISA CR Register is 32-bits, but actually comprises eight
152 CR Fields, CR0-CR7. With each CR Field being four bits (EQ, LT, GT, SO)
153 this makes up 32 bits, and therefore a CR operand referring to one bit
154 of the CR will be 5 bits in length (BA, BT).
155 *However*, some instructions refer
156 to a *CR Field* (CR0-CR7) and consequently these operands
157 (BF, BFA etc) are only 3-bits.
158
159 (*It helps here to think of the top 3 bits of BA as referring
160 to a CR Field, like BFA does, and the bottom 2 bits of BA
161 referring to
162 EQ/LT/GT/SO within that Field*)
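
A small illustrative sketch of that decomposition (Python, purely for explanation):

```
# split a 5-bit CR bit-operand (BA, BB, BT) into a CR Field number
# (like BFA) plus a bit within that Field
def split_cr_operand(BA):
    cr_field = BA >> 2       # top 3 bits: which CR Field (CR0-CR7)
    cr_bit   = BA & 0b11     # bottom 2 bits: which bit within that Field
    return cr_field, cr_bit
```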
163
164 With SVP64 extending the number of CR *Fields* to 128, the number of
165 32-bit CR *Registers* extends to 16, in order to hold all 128 CR *Fields*
166 (8 per CR Register). Then, it gets even more strange, when it comes
167 to Vectorisation, which applies to the CR Field *numbers*. The
168 hardware-for-loop for Rc=1 for example starts at CR0 for element 0,
169 and moves to CR1 for element 1, and so on. The reason here is quite
170 simple: each element result has to have its own CR Field co-result.
171
172 In other words, the
173 element is the 4-bit CR *Field*, not the bits *of* the 32-bit
174 CR Register, and not the CR *Register* (of which there are now 16).
175 All quite logical, but a little mind-bending.
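
An illustrative sketch of the Rc=1 hardware-for-loop (Python, with a placeholder comparison helper; not specification pseudocode):

```
def compare_with_zero(result):
    # stands in for the 4-bit CR Field co-result
    return (result < 0, result > 0, result == 0)

VL = 4
GPR = list(range(128))         # illustrative register file contents
CR_Field = [None] * 128        # 128 CR Fields in SVP64
RT, RA, RB = 8, 16, 24
for i in range(VL):
    result = GPR[RA + i] + GPR[RB + i]
    GPR[RT + i] = result
    CR_Field[i] = compare_with_zero(result)   # element 0 -> CR0, element 1 -> CR1 ...
```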
176
177 **Load/Store**
178
179 LOAD/STORE is another area that has different needs: this time it is
180 down to limitations in Scalar LD/ST. Vector ISAs have Load/Store modes
181 which simply make no sense in a RISC Scalar ISA: element-stride and
182 unit-stride and the entire concept of a stride itself (a spacing
183 between elements) has no place at all in a Scalar ISA. The problems
184 come when trying to *retrofit* the concept of "Vector Elements" onto
185 a Scalar ISA. Consequently it required a couple of bits (Modes) in the SVP64
186 RM Prefix to convey the stride mode, changing the Effective Address
187 computation as a result. Interestingly, worth noting for Hardware
188 designers: it did turn out to be possible to perform pre-multiplication
189 of the D/DS Immediate by the stride amount, making it possible to avoid
190 actually modifying the LD/ST Pipeline itself.
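
An illustrative sketch of the idea (Python; `MEM`, the register numbers and the stride formula are simplifications, not the specification):

```
def MEM(ea, width):
    return 0                   # placeholder for a memory read

VL, D, width = 4, 8, 8         # illustrative values
GPR = [0] * 128
RT, RA = 8, 16
for i in range(VL):
    # element-strided: the D immediate, pre-multiplied by the element
    # index, is handed to an otherwise-unmodified scalar EA calculation
    EA = GPR[RA] + (D * i)
    GPR[RT + i] = MEM(EA, width)
```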
191
192 Other areas where LD/ST went quirky: element-width overrides especially
193 when combined with Saturation, given that LD/ST operations have byte,
194 halfword, word, dword and quad variants. The interaction between these
195 widths as part of the actual operation, and the source and destination
196 elwidth overrides, was particularly obtuse and hard to derive: some care
197 and attention is advised, here, when reading the specification,
198 especially on arithmetic loads (lbarx, lharx etc.)
199
200 **Non-vectorised**
201
The concept of a Vectorised halt (`attn`) makes no sense. There is never
going to be a Vector of global MSRs (Machine State Register). `mtcr`
on the other hand is a grey area: `mtspr` is clearly Vectoriseable.
Even `td` and `tdi` make a strange type of sense to permit them to be
Vectorised, because a sequence of comparisons could be Vectorised.
Vectorised System Calls (`sc`), or `tlbie` and other Cache or Virtual
Memory Management instructions: these make no sense to Vectorise.
210
211 However, it is really quite important to not be tempted to conclude that
212 just because these instructions are un-vectoriseable, the Prefix opcode space
must be free for reinterpretation and use for other purposes. This would
214 be a serious mistake because a future revision of the specification
215 might *retire* the Scalar instruction, and, worse, replace it with another.
216 Again this comes down to being quite strict about the rules: only Scalar
217 instructions get Vectorised: there are *no* actual explicit Vector
218 instructions.
219
220 **Summary**
221
222 Where a traditional Vector ISA effectively duplicates the entirety
223 of a Scalar ISA and then adds additional instructions which only
224 make sense in a Vector Context, such as Vector Shuffle, SVP64 goes to
225 considerable lengths to keep strictly to augmentation and embedding
226 of an entire Scalar ISA's instructions into an abstract Vectorisation
227 Context. That abstraction subdivides down into Categories appropriate
228 for the type of operation (Branch, CRs, Memory, Arithmetic),
229 and each Category has its own relevant but
230 ultimately rational quirks.
231
232 # Abstraction between Prefix and Suffix
233
234 In the introduction paragraph, a great fuss was made emphasising that
235 the Prefix is kept separate from the Suffix. The whole idea there is
236 that a Multi-issue Decoder and subsequent pipelines would in no way have
237 "back-propagation" of state that can only be determined far too late.
This *has* been preserved; however, there is a hiccup.
239
Examining the Power ISA v3.1, a 64-bit Prefix was introduced: EXT001.
241 The encoding of the prefix has 6 bits that are dedicated to letting
242 the hardware know what the remainder of the Prefix bits mean: how they
243 are formatted, even without having to examine the Suffix to which
244 they are applied.
245
246 SVP64 has such pressure on its 24-bit encoding that it was simply
247 not possible to perform the same trick used by Power ISA 3.1 Prefixing.
248 Therefore, rather unfortunately, it becomes necessary to perform
249 a *partial decoding* of the v3.0 Suffix before the 24-bit SVP64 RM
250 Fields may be identified. Fortunately this is straightforward, and
251 does not rely on any outside state, and even more fortunately
252 for a Multi-Issue Execution decoder, the length 32/64 is also
253 easy to identify by looking for the EXT001 pattern. Once identified
254 the 32/64 bits may be passed independently to multiple Decoders in
255 parallel.
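
A sketch of that length identification (illustrative Python; the helper name is made up):

```
# identify 32- vs 64-bit instructions for parallel distribution to
# multiple decoders, by checking for the EXT001 Major Opcode
def insn_length_bits(word32):
    major_opcode = (word32 >> 26) & 0b111111   # bits 0-5, MSB0 numbering
    return 64 if major_opcode == 0b000001 else 32
```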
256
257 # Predication
258
259 Predication is entirely missing from the Power ISA.
260 Adding it would be a costly mistake because it cannot be retrofitted
261 to an ISA without literally duplicating all instructions. Prefixing
262 is about the only sane way to go.
263
264 CR Fields as predicate masks could be spread across multiple register
265 file entries, making them costly to read in one hit. Therefore the
266 possibility exists that an instruction element writing to a CR Field
267 could *overwrite* the Predicate mask CR Vector during the middle of
268 a for-loop.
269
270 Clearly this is bad, so don't do it. If there are potential issues
271 they can be avoided by using the crweird instructions to get CR Field
bits into an Integer GPR (r3, r10 or r30) and using that GPR as a
273 Predicate mask instead.
274
275 Even in Vertical-First Mode, which is a single Scalar instruction executed
276 with "offset" registers (in effect), the rule still applies: don't write
to the same register being used as the predicate: it's `UNDEFINED`
behaviour.
279
280 ## Single Predication
281
282 So named because there is a Twin Predication concept as well, Single
283 Predication is also unlike other Vector ISAs because it allows zeroing
284 on both the source and destination. This takes some explaining.
285
In Vector ISAs there is a Predicate Mask; it applies to the
287 destination only, and there
288 is a choice of actions when a Predicate Mask bit
289 is zero:
290
291 * set the destination element to zero
292 * skip that element operation entirely, leaving the destination unmodified
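
A sketch of the two behaviours (illustrative Python, destination predication only):

```
def predicated_add(GPR, RT, RA, RB, mask, VL, zeroing):
    for i in range(VL):
        if (mask >> i) & 1:
            GPR[RT + i] = GPR[RA + i] + GPR[RB + i]
        elif zeroing:
            GPR[RT + i] = 0    # masked-out element explicitly zeroed
        # else: skip - destination element left entirely unmodified
```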
293
The problem comes if the underlying register file SRAM has, say, 64-bit
write granularity but the Vector elements are, say, 8 bits wide.
296 Some Vector ISAs strongly advocate Zeroing because to leave one single
297 element at a small bitwidth in amongst other elements where the register
298 file does not have the prerequisite access granularity is very expensive,
299 requiring a Read-Modify-Write cycle to preserve the untouched elements.
300 Putting zero into the destination avoids that Read.
301
302 This is technically very easy to solve: use a Register File that does
303 in fact have the smallest element-level write-enable granularity.
304 If the elements are 8 bit then allow 8-bit writes!
305
306 With that technical issue solved there is nothing in the way of choosing
307 to support both zeroing and non-zeroing (skipping) at the ISA level:
308 SV chooses to further support both *on both the source and destination*.
309 This can result in the source and destination
310 element indices getting "out-of-sync" even though the Predicate Mask
311 is the same because the behaviour is different when zeros in the
312 Predicate are encountered.
313
314 ## Twin Predication
315
316 Twin Predication is an entirely new concept not present in any commercial
317 Vector ISA of the past forty years. To explain how normal Single-predication
318 is applied in a standard Vector ISA:
319
320 * Predication on the **source** of a LOAD instruction creates something
321 called "Vector Compressed Load" (VCOMPRESS).
322 * Predication on the **destination** of a STORE instruction creates something
323 called "Vector Expanded Store" (VEXPAND).
324 * SVP64 allows the two to be put back-to-back: one on source, one on
325 destination.
326
327 The above allows a reader familiar with VCOMPRESS and VEXPAND to
328 conceptualise what the effect of Twin Predication is, but it actually
329 goes much further: in *any* twin-predicated instruction (extsw, fmv)
330 it is possible to apply one predicate to the source register (compressing
331 the source element array) and another *completely separate* predicate
332 to the destination register, not just on Load/Stores but on *arithmetic*
333 operations.
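
The following is an illustrative sketch (Python, non-zeroing, applied to a simple move; not the specification's pseudocode) of how the two predicates drive separate source and destination steps:

```
def twin_predicated_mv(GPR, RT, RA, srcmask, dstmask, VL):
    srcstep, dststep = 0, 0
    while srcstep < VL and dststep < VL:
        while srcstep < VL and not ((srcmask >> srcstep) & 1):
            srcstep += 1       # skip masked-out source elements (compress)
        while dststep < VL and not ((dstmask >> dststep) & 1):
            dststep += 1       # skip masked-out destination elements (expand)
        if srcstep < VL and dststep < VL:
            GPR[RT + dststep] = GPR[RA + srcstep]
            srcstep += 1
            dststep += 1
```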
334
335 No other Vector ISA in the world has this back-to-back
336 capability. All true Vector
337 ISAs have Predicate Masks: it is an absolutely essential characteristic.
338 However none of them have abstracted dual predicates out to the extent
339 where this VCOMPRESS-VEXPAND effect is applicable *in general* to a
340 wide range of arithmetic
341 instructions, as well as Load/Store.
342
343 It is however important to note that not all instructions can be Twin
344 Predicated (2P): some remain only Single Predicated (1P), as is normally found
345 in other Vector ISAs. Arithmetic operations with
346 four registers (3-in, 1-out, VA-Form for example) are Single. The reason
347 is that there just wasn't enough space in the 24-bits of the SVP64 Prefix.
348 Consequently, when using a given instruction, it is necessary to look
349 up in the ISA Tables whether it is 1P or 2P. caveat emptor!
350
351 Also worth a special mention: all Load/Store operations are Twin-Predicated.
352 The underlying key to understanding:
353
354 * one Predicate effectively applies to the Array of Memory *Addresses*,
355 * the other Predicate effectively applies to the Array of Memory *Data*.
356
357 # CR weird instructions
358
359 [[sv/cr_int_predication]] is by far the biggest violator of the SVP64
360 rules, for good reasons. Transfers between Vectors of CR Fields and Integers
for use as predicates are very awkward without them.
362
363 Normally, element width overrides allow the element width to be specified
364 as 8, 16, 32 or default (64) bit. With CR weird instructions producing or
365 consuming either 1 bit or 4 bit elements (in effect) some adaptation was
366 required. When this perspective is taken (that results or sources are
367 1 or 4 bits) the weirdness starts to make sense, because the "elements",
368 such as they are, are still packed sequentially.
369
From a hardware implementation perspective, however, they will need special
handling as far as Hazard Dependencies are concerned, due to nonconformance
(bit-level management).
373
374 # mv.x (vector permute)
375
376 [[sv/mv.x]] aka `GPR(RT) = GPR(GPR(RA))` is so horrendous in
377 terms of Register Hazard Management that its addition to any Scalar
378 ISA is anathematic. In a Traditional Vector ISA however, where the
379 indices are isolated behind a single Vector Hazard, there is no
380 problem at all. `sv.mv.x` is also fraught, precisely because it
381 sits on top of a Standard Scalar register paradigm, not a Vector
382 ISA with separate and distinct Vector registers.
383
384 To help partly solve this, `sv.mv.x` would have had to have
385 been made relative:
386
387 ```
388 for i in range(VL):
389 GPR(RT+i) = GPR(RT+MIN(GPR(RA+i), VL))
390 ```
391
392 The reason for doing so is that MAXVL or VL may be used to limit
393 the number of Register Hazards that need to be raised to a fixed
394 quantity, at Issue time.
395
396 `mv.x` itself would still have to be added as a Scalar instruction,
397 but the behaviour of `sv.mv.x` would have to be different from that
398 Scalar version.
399
400 Normally, Scalar Instructions have a good justification for being
401 added as Scalar instructions on their own merit. `mv.x` is the
402 polar opposite, and in the end, the idea was thrown out, and Indexed
403 REMAP added in its place. Indexed REMAP comes with its own quirks,
404 solving the Hazard problem, described in a later section.
405
406 # REMAP and other reordering
407
408 There are several places in Simple-V which apply some sort of reordering
409 schedule to elements. srcstep and dststep do not themselves reorder:
they continue to march in sequence (VL-1 downto 0 in the case of reverse-gear).
411
412 It is perfectly legal to apply Parallel-Reduction on top of any type
413 of REMAP, for example, and it is possible to apply Pack/Unpack on a
414 REMAP as well.
415
416 The order of application of REMAP combined with Parallel-Reduction
417 should be logically obvious: REMAP has to come first because otherwise
418 how can the Parallel-Reduction perform a tree-walk?
419
420 Pack/Unpack on the other hand is best implemented as applying first,
421 because it is applied
422 as the inversion of the for-loops which generate the steps and substeps.
423 REMAP then applies to the src/dst-step indices (never to the subvl
424 step indices: that is SWIZZLE's job).
425
426 It's all perfectly logical, just a lot going on.
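
As a purely illustrative sketch of the ordering described above (Python; `remap` stands for any REMAP Schedule expressed as a function, and only the Pack case of Pack/Unpack is shown):

```
# Pack/Unpack selects the VL/SUBVL loop nesting; REMAP applies to the
# VL-step index only, never to the SUBVL (swizzle) index
def element_schedule(VL, SUBVL, remap, pack):
    order = []
    if pack:                           # SUBVL for-loop becomes outermost
        for s in range(SUBVL):
            for i in range(VL):
                order.append((remap(i), s))
    else:
        for i in range(VL):
            for s in range(SUBVL):
                order.append((remap(i), s))
    return order
```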
427
428 # Branch-Conditional
429
430 [[sv/branches]] are a very special exception to the rule that there
431 shall be no deviation from the corresponding
Scalar instruction. This is because of the tight
433 integration with looping and the application of Boolean Logic
434 manipulation needed for Parallel operations (predicate mask usage).
435 This results in an extremely important observation that `scalar identity
436 behaviour` is violated: the SV Prefixed variant of branch is **not** the same
437 operation as the unprefixed 32-bit scalar version.
438
439 One key difference is that LR is only updated if certain additional
440 conditions are met, whereas Scalar `bclrl` for example unconditionally
441 overwrites LR.
442
443 Another is that the Vectorised Branch-Conditional instructions are the
444 only ones where there are side-effects on predication when skipping
445 is enabled. This is so as to be able to use CTR to count down
446 *masked-out* elements.
447
448 Well over 500 Vectorised branch instructions exist in SVP64 due to the
449 number of options available: close integration and interaction with
450 the base Scalar Branch was unavoidable in order to create Conditional
451 Branching suitable for parallel 3D / CUDA GPU workloads.
452
453 # Saturation
454
455 The application of Saturation as a retro-fit to a Scalar ISA is challenging.
456 It does help that within the SFFS Compliancy subset there are no Saturated
457 operations at all: they are only added in VSX.
458
459 Saturation does not inherently change the instruction itself: it does however
460 come with some fundamental implications, when applied. For example:
461 a Floating-Point operation that would normally raise an exception will
462 no longer do so, instead setting the CR1.SO Flag. Another quirky
463 example: signed operations which produce a negative result will be
464 truncated to zero if Unsigned Saturation is requested.
465
466 One very important aspect for implementors is that the operation in
467 effect has to be considered to be performed at infinite precision,
468 followed by saturation detection. In practice this does not actually
469 require infinite precision hardware! Two 8-bit integers being
470 added can only ever overflow into a 9-bit result.
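
A sketch of that observation for unsigned 8-bit saturating addition (illustrative Python):

```
# "infinite precision followed by saturation detection" needs only one
# extra bit here: two 8-bit values can overflow at most into 9 bits
def sat_add_u8(a, b):
    full = (a & 0xFF) + (b & 0xFF)
    return 0xFF if full > 0xFF else full
```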
471
472 Overall some care and consideration needs to be applied.
473
474 # Fail-First
475
476 Fail-First (both the Load/Store and Data-Dependent variants)
477 is worthy of a special mention in its own right. Where VL is
478 normally forward-looking and may be part of a pre-decode phase
479 in a (simplified) pipelined architecture with no Read-after-Write Hazards,
480 Fail-First changes that because at any point during the execution
481 of the element-level instructions, one of those elements may not only
482 terminate further continuation of the hardware-for-looping but also
483 effect a change of VL:
484
485 ```
486 for i in range(VL):
487 result = element_operation(GPR(RA+i), GPR(RB+i))
488 if test(result):
489 VL = i
490 break
491 ```
492
This is not exactly a violation of SVP64 Rules, more of a breakage
of user expectations, particularly for LD/ST where exceptions
would normally be expected to be raised: Fail-First provides for
avoidance of those exceptions.
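
An illustrative sketch of the LD/ST variant (Python; `would_fault` and `MEM` are placeholders, and the handling of element 0 shown here is an assumption borrowed from other Vector ISAs' fault-first loads, not a statement of the specification):

```
def ff_load(GPR, RT, RA, D, width, VL, would_fault, MEM):
    for i in range(VL):
        EA = GPR[RA] + D + i * width   # unit-strided, for illustration
        if i > 0 and would_fault(EA):  # element 0 assumed to fault normally
            return i                   # truncate VL; no exception raised
        GPR[RT + i] = MEM(EA, width)
    return VL                          # the (possibly truncated) VL
```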
497
498 For Hardware implementers, a standard Out-of-Order micro-architecture
499 allows for Cancellation of speculatively-executed elements that extended
500 beyond the Vector Truncation point. In-order systems will have a slightly
501 harder time and may choose to execute one element only at a time, reducing
502 performance as a result.
503
504 # OE=1
505
506 The hardware cost of Sticky Overflow in a parallel environment is immense.
507 The SFFS Compliancy Level is permitted optionally to support XER.SO.
508 Therefore the decision is made to make it mandatory **not** to
509 support XER.SO. However, CR.SO *is* supported such that when Rc=1
510 is set the CR.SO flag will contain only the overflow of
511 the current instruction, rather than being actually "sticky".
512 Hardware Out-of-Order designers will recognise and appreciate
513 that the Hazards are
514 reduced to Read-After-Write (RAW) and that the WAR Hazard is removed.
515
516 This is sort-of a quirk and sort-of not, because the option to support
517 XER.SO is already optional from the SFFS Compliancy Level.
518
519 # Indexed REMAP and CR Field Predication Hazards
520
521 Normal Vector ISAs and those Packed SIMD ISAs inspired by them have
522 Vector "Permute" or "Shuffle" instructions. These provide a Vector of
523 indices whereby another Vector is reordered (permuted, shuffled) according
to the indices. Register Hazard Management here is trivial because there
525 are three registers: indices source vector, elements source vector to
526 be shuffled, result vector.
527
528 For SVP64 which is based on top of a Scalar Register File paradigm,
529 combined with the hard requirement to respect full Register Hazard
530 Management as if element instructions were actual Scalar instructions,
531 the addition of a Vector permute instruction under these strict
532 conditions would result in a catastrophic
533 reduction in performance, due to having to consider Read-after-Write
534 and Write-after-Read Hazards *at the element level*.
535
536 A little leniency and rule-bending is therefore required.
537
538 Rather than add explicit Vector permute instructions, the "Indexing"
539 has been separated out into a REMAP Schedule. When an Indexed
540 REMAP is requested, it is assumed (required, of software) that
541 subsequent instructions intending to use those indices *will not*
542 attempt to modify the indices. It is *Software* that must consider them
543 to be read-only.
544
545 This simple relaxation of the rules releases Hardware from having the
546 horrendous job of dynamically detecting Write-after-Read Hazards on a
547 huge range of registers.
548
549 A similar Hazard problem exists for CR Field Predicates, in Vertical-First
550 Mode. Instructions could modify CR Fields currently being used as Predicate
551 Masks: detecting this is so horrendous for hardware resource utilisation
552 and hardware complexity that, again, the decision is made to relax these
553 constraints and for Software to take that into account.
554
555 # Floating-Point "Single" becomes "Half"
556
557 In several places in the Power ISA there are operations that are on
558 32-bit quantities in 64-bit registers. The best example is FP which
559 has 64-bit operations (`fadd`) and 32-bit operations (`fadds` or
FP Add "single"). Element-width overrides would seem to
be unnecessary under these circumstances.

However, it is not possible for `fadds` to fit two elements into
64 bits: that breaks the simplicity of SVP64.
565 Bear in mind that the FP32 bits are spread out across a 64
566 bit register in FP64 format. The solution here was to consider the
567 "s" at the end of each instruction
568 to mean "half of the element's width". Thus, `sv.fadds/ew=32`
569 actually stores an FP16 spread out across the 32 bits of an
element, in FP32 format, whereas `sv.fadd/ew=32` stores a full
FP32 result into the full 32 bits.
572
573 Where this breaks down is when attempting to do half-width on
574 BF16 or FP16 operations: there does not exist a BF8 or an IEEE754 FP8
575 format, so these (`sv.fadds/ew=8`) should be avoided.
576
577 # Word frequently becomes "half"
578
Again, related to "Single" becoming "half of element width": unless there
are compelling reasons not to, the same trick applies to Scalar GPR operations.
With the pseudocode being "XLEN//2", then of course if XLEN=8 the operation
becomes a 4-bit one.
583
584 Similarly byte operations which use "XLEN//8" when XLEN=8 actually become
585 single-bit operations, which is very useful with `sv.extsb/w=8`
586 for example. This instruction copies the LSB of each byte in a sequence of bytes,
587 and expands it to all 8 bits in each result byte.
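
A sketch of that effect on a single byte (illustrative Python):

```
# sv.extsb/w=8: with XLEN=8 a "byte" (XLEN//8) is a single bit, so
# sign-extension replicates each byte's LSB across all 8 result bits
def extsb_w8(byte):
    return 0xFF if (byte & 1) else 0x00
```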
588
589 # Vertical-First and Subvectors
590
Documented in the [[sv/setvl]] page, Vertical-First goes through
instructions first and elements second, and requires an explicit
[[sv/svstep]] instruction to move to the next element
(whereas Horizontal-First
loops through elements in full first before moving on to
the next instruction): *Subvectors are considered "elements"*
in Vertical-First Mode.
598
This is conceptually quite easy to keep in mind: a Vertical-First
600 instruction does one element at a time, and when SUBVL is set,
601 that "element" in essence becomes a vec2/3/4.
602
603 # Swizzle and Pack/Unpack
604
605 These are both so weird it's best to just read the pages in full
606 and pay attention: [[sv/mv.swizzle]] and [[sv/mv.vec]].
607 Swizzle Moves only engage with vec2/3/4, *reordering* the copying
608 of the sub-vector elements (including allowing repeats and skips)
609 based on an immediate supplied by the instruction. The fun
610 comes when Pack/Unpack are enabled, and it is really important
611 to be aware how the Arrays of vec2/3/4 become re-ordered
612 *and swizzled at the same time*.
613
614 Pack/Unpack started out as
615 [[sv/mv.vec]] but became its own distinct Mode over time.
616 The main thing to keep in mind about Pack/Unpack
617 is that it engages a swap of the ordering of the VL-SUBVL
618 nested for-loops, in exactly the same way that Matrix REMAP
619 can do.
620 When Pack or Unpack is enabled it is the SUBVL for-loop
621 that becomes outermost. A bit of thought shows that this is
622 a 2D "Transpose" where Dimension X is VL and Dimension Y is SUBVL.
623 However *both* source *and* destination may be independently
624 "Transposed", which makes no sense at all until the fact that
625 Swizzle can have a *different SUBVL* is taken into account.
626
627 Basically Pack/Unpack covers everything that VSX `vpkpx` and
628 other ops can do, and then some: Saturation included, for arithmetic ops.
629
630 # LD/ST with zero-immediate vs mapreduce mode
631
LD/ST operations with a zero immediate effectively mean that, on a
Vector operation, the element index used to offset the memory location is
multiplied by zero. Thus, a sequence of LD operations will load from
the exact same address, and likewise STs will store to the exact same address.
636
637 Ordinarily this would make absolutely no sense whatsoever, except
638 that Power ISA has cache-inhibited LD/STs (Power ISA v.1, Book III,
639 1.6.1, p1033), for accessing memory-mapped
640 peripherals and other crucial uses. Thus, *despite not being a mapreduce mode*,
641 zero-immediates cause multiple hits on the same element.
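
A sketch of the effect (illustrative Python; `MEM` is a placeholder):

```
# with a zero immediate every element accesses the same Effective
# Address - useful for a cache-inhibited memory-mapped peripheral FIFO
def ld_zero_imm(GPR, RT, RA, VL, width, MEM):
    for i in range(VL):
        EA = GPR[RA] + 0 * i           # element index multiplied by zero
        GPR[RT + i] = MEM(EA, width)   # every element hits the same address
```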
642
Mapreduce mode is not actually mapreduce at all: it is
a relaxation of the normal rule that, when the destination is a Scalar,
the Vector for-looping terminates on the first write to that destination:
in this mode it does not.
Instead, the developer is expected to exploit the strict Program Order and
make one of the sources the same as that Scalar destination, effectively
making that Scalar register an "Accumulator", thus creating the *appearance*
(effect) of Simple-V having a mapreduce capability, when in fact it is
more of an artefact.
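
A sketch of the accumulator idiom (illustrative Python):

```
# RT is Scalar *and* also a source: strict Program Order turns it into
# an accumulator, giving the appearance of a sum-reduction
def sum_reduce(GPR, RT, RA, VL):
    for i in range(VL):
        GPR[RT] = GPR[RT] + GPR[RA + i]
```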
651
652 LD/ST zero-immediate has similar quirky overwriting as the "mapreduce"
653 mode, but actually requires the registers to be Vectors. It is simply
654 a mathematical artefact of multiplying by zero, which happens to be
655 useful for cache-inhibited operations.
656
657 # Limited space in LD/ST Mode
658
659 As pointed out in the [[sv/ldst]] page there is limited space in only
660 5 mode bits to fully express all potential modes of operation.
661
662 * LD/ST Immediate has no individual control over src/dest zeroing,
663 whereas LD/ST Indexed does.
664 * Post-Increment is not possible with Saturation or Data-Dependent Fail-First
665
666 These are not insurmountable problems: there do exist workarounds.
667 For example it is possible to set up Matrix REMAP to perform the same
668 job as Pack/Unpack, at which point the LD/ST "Saturation" mode may
669 be used, saving on costly intermediary registers *at double the LD
width* if a Saturated MV had to be involved. For Store, on the other hand,
it is extremely likely that an arithmetic operation has already computed
a Saturated Vector of results, so it is less of a problem than Load.
673
674 Also, the LD/ST Indexed Mode can be element-strided (RB as
675 a Scalar, times
676 the element index), or, if that is not enough,
677 although potentially costly it is possible to
678 use `svstep` to compute a Vector RB sequence of
679 Indices, then activate either `sz` or `dz` as required, as a workaround
680 for LDST Immediate only having `zz`.
681
682 Simple-V is powerful but it cannot do everything! There is just not
683 enough space and so some compromises had to be made.
684
685 # sv.mtcr on entire 64-bit Condition Register
686
687 Normally, CR operations are either bit-based (where the element numbering actually
688 applies to the CR Field) or field-based in which case the elements are still
689 fields. The `sv.mtcr` and other instructions are actually full 64-bit Condition
690 *Register* operations and are therefore qualified as Normal/Arithmetic not
691 CRops.
692
This is to save on both Vector Length (a VL of 16 is sufficient) and on
complexity in the Hazard Management when context-switching CR fields, as the
entire batch of 128 CR Fields may be transferred to 8 GPRs with a VL of 16
and elwidth overriding of 32. Truncation is sufficient, dropping the top 32 bits
of the Condition Register(s), which are always zero anyway.
698
699 # Separate Scalar and Vector Condition Register files
700
As explained in the introduction [[sv/svp64]] and [[sv/cr_ops]],
the Scalar Power ISA lacks the "Conditional Execution" that has been
present in the ARM Scalar ISA for several decades. When Vectorised,
the fact that Rc=1 Vector results can immediately be fed back in as a
Predicate Mask for the following instruction can result in large latency
unless "Vector Chaining" is used in the Micro-Architecture.
707
That aside, however, is not the main problem faced by the introduction
of Simple-V to the Power ISA: it is that the existing implementations
(IBM) don't have "Conditional Execution", and to add it to their
existing designs would be too disruptive a first step.
712
713 A compromise is to wipe blank certain entries in the Register Dependency
714 Matrices by prohibiting some operations involving the two groups
715 of CR Fields: those that fall into the existing Scalar 32-bit CR
716 (fields CR0-CR7) and those that fall into the newly-introduced
717 CR Fields, CR8-CR127.
718
719 This will drive compiler writers nuts, and give assembler writers headaches,
720 but it gives IBM the opportunity to implement SVP64 without massive
721 disruption. They can add an entirely new Vector CR register file,
722 new pipelines etc safe in the knowledge that existing Scalar HDL
723 needs no modification.