[[!tag standards]]

# Appendix

* <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturation
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47> Parallel Prefix
* <https://bugs.libre-soc.org/show_bug.cgi?id=697> Reduce Modes
* <https://bugs.libre-soc.org/show_bug.cgi?id=809> OV sv.addex discussion

This is the appendix to [[sv/svp64]], providing explanations of modes
etc., leaving the main svp64 page's primary purpose as outlining the
instruction format.

Table of contents:

[[!toc]]

# Partial Implementations

It is perfectly legal to implement subsets of SVP64 as long as illegal
instruction traps are always raised on unimplemented features,
so that soft-emulation is possible,
even for future revisions of SVP64. With SVP64 being partly controlled
through contextual SPRs, a little care has to be taken.

**All** SPRs
not implemented, including reserved ones for future use, must raise an
illegal instruction trap if read or written. This allows software the
opportunity to emulate the context created by the given SPR.

**Embedded Scalar Scenario**

In this scenario an implementation does not wish to implement Vectorisation
but simply wishes to take advantage of predication or other features
of SVP64, such as instructions that might only be available if prefixed.
Such an implementation would be entirely free to do so with the proviso
that:

* any attempt to call `setvl` shall either raise an illegal instruction
  trap or be partially implemented to set SVSTATE correctly.
* if SVSTATE contains any value in any bit that is not supported
  in hardware, an illegal instruction shall be raised when an SVP64
  prefixed instruction is executed.
* if SVSTATE contains values requesting supported features at the time
  that the prefixed instruction is executed then it is executed in
  hardware as per specification, with no illegal exception trap raised.

Example, assuming that hardware implements predication but not
elwidth overrides:

    setvli r0, 4            # sets VL equal to 4
    sv.addi r5, r0, 1       # raises an 0x700 trap
    setvli r0, 1            # sets VL equal to 1
    sv.addi r5, r0, 1       # gets executed by hardware
    sv.addi/ew=8 r5, r0, 1  # raises an 0x700 trap
    sv.ori/sm=EQ r5, r0, 1  # executed by hardware

The first `sv.addi` traps because VL=4 requests Vectorisation, which
this implementation does not provide; with VL=1 the same instruction
is executed by hardware as a Scalar operation. The elwidth-override
variant traps because elwidth overrides are unimplemented, whilst the
predicated `sv.ori` is executed because predication is supported.

# XER, SO and other global flags

Vector systems are expected to be high performance. This is achieved
through parallelism, which requires that elements in the vector be
independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
Read-Write Hazards on single-bit global resources, having a significant
detrimental effect.

Consequently in SV, XER.SO behaviour is disregarded (including
in `cmp` instructions). XER.SO is not read, but XER.OV may be written,
breaking the Read-Modify-Write Hazard Chain that complicates
microarchitectural implementations.
This includes when `scalar identity behaviour` occurs. If precise
OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
instructions should be used without an SV Prefix.

TODO jacob add about OV https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

Of note here is that XER.SO and OV may already be disregarded in the
Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
SVP64 simply makes it mandatory to disregard XER.SO even for other Subsets,
but only for SVP64 Prefixed Operations.

XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
executing in precise Program Order, a little thought shows that a Vectorised
Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
Parallel Carry Lookahead.

    # assume VL=4, this results in 4 sequential ops (below)
    sv.adde r0.v, r4.v, r8.v

    # instructions that get executed in backend hardware:
    adde r0, r4, r8   # takes carry-in, produces carry-out
    adde r1, r5, r9   # takes carry from previous
    ...
    adde r3, r7, r11  # likewise

It can clearly be seen that the carry chains from one
64 bit add to the next, the end result being that a
256-bit "Big Integer Add" has been performed, and that
CA contains the 257th bit. A one-instruction 512-bit Add
may be performed by setting VL=8, and a one-instruction
1024-bit add by setting VL=16, and so on.
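
As a concrete sketch (register numbers here are arbitrary, chosen purely
for illustration), the 512-bit case is simply:

    setvli r0, 8                 # VL=8: eight 64-bit elements = 512 bits
    sv.adde r16.v, r24.v, r32.v  # one Big Integer Add; CA holds bit 513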

# v3.0B/v3.1 relevant instructions

SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
CPU ISA.

Vectorisation of the VSX Packed SIMD system makes no sense whatsoever,
the sole exceptions potentially being any operations with 128-bit
operands such as `vrlq` (Rotate Quad Word) and `xsaddqp` (Scalar
Quad-precision Add).
SV effectively *replaces* VSX, requiring far fewer instructions, and provides,
at the very minimum, predication (which VSX was designed without).
Thus all VSX Major Opcodes - all of them - are "unused" and must raise
illegal instruction exceptions in SV Prefix Mode.

Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
keep: not only are their capabilities provided by SV, the SV alternatives
may be predicated as well, making them far better suited to use in function
calls and context-switching.

Additionally, some v3.0/1 instructions simply make no sense at all in a
Vector context: `rfid` falls into this category,
as well as `sc` and `scv`. Here there is simply no point
trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
should be called instead.

Fortuitously this leaves several Major Opcodes free for use by SV
to fit alternative future instructions. In a 3D context this means
Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
operations, and others critical to an efficient, effective 3D GPU and
VPU ISA. With such instructions being included as standard in other
commercially-successful GPU ISAs it is likewise critical that a 3D
GPU/VPU based on svp64 also have such instructions.

Note however that svp64 is stand-alone and is in no way
critically dependent on the existence or provision of 3D GPU or VPU
instructions. These should be considered extensions, and their discussion
and specification is out of scope for this document.

Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
v3.1B is *not* altered by svp64 in any way.

## Major opcode map (v3.0B)

This table is taken from v3.0B.
Table 9: Primary Opcode Map (opcode bits 0:5)

```
     |  000   |  001  |  010  |  011  |  100  |  101   |  110  |  111  |
000  |        |       |  tdi  |  twi  | EXT04 |        |       | mulli | 000
001  | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
010  | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
011  | ori    | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
100  | lwz    | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
101  | lhz    | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
110  | lfs    | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
111  | lq     | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
     |  000   |  001  |  010  |  011  |  100  |  101   |  110  |  111  |
```

## Suitable for svp64-only

This is the same table containing v3.0B Primary Opcodes except those that
make no sense in a Vectorisation Context have been removed. These removed
POs can, *in the SV Vector Context only*, be assigned to alternative
(Vectorised-only) instructions, including future extensions.
EXT04 retains the scalar `madd*` operations but would have all PackedSIMD
(aka VSX) operations removed.

Note, again, to emphasise: outside of svp64 these opcodes **do not**
change. When not prefixed with svp64 these opcodes **specifically**
retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.

```
     |  000   |  001  |  010  |  011  |  100  |  101   |  110  |  111  |
000  |        |       |       |       | EXT04 |        |       | mulli | 000
001  | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
010  | bc/l/a |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
011  | ori    | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
100  | lwz    | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
101  | lhz    | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
110  | lfs    | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
111  |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
     |  000   |  001  |  010  |  011  |  100  |  101   |  110  |  111  |
```

It is important to note that having a v3.0B Scalar opcode that differs
from its SVP64 counterpart is highly undesirable: the complexity
in the decoder is greatly increased.

# EXTRA Field Mapping

The purpose of the 9-bit EXTRA field mapping is to mark individual
registers (RT, RA, BFA) as either scalar or vector, and to extend
their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
Three of the 9 bits may also be used up for a 2nd Predicate (Twin
Predication), leaving a mere 6 bits for qualifying registers. As can
be seen there is significant pressure on these (and in fact all) SVP64 bits.

In Power ISA v3.1 prefixing there are bits which describe and classify
the prefix in a fashion that is independent of the suffix. MLSS for
example. For SVP64 there is insufficient space to make the SVP64 Prefix
"self-describing", and consequently every single Scalar instruction
had to be individually analysed, by rote, to craft an EXTRA Field Mapping.
This process was semi-automated and is described in this section.
The final results, which are part of the SVP64 Specification, are here:

* [[openpower/opcode_regs_deduped]]

Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
by reading the markdown-formatted version of the Scalar pseudocode,
which is machine-readable and found in [[openpower/isatables]]. The
analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
for example is given a designation `RM-2R-1W` because it requires
two GPR reads and one GPR write.

Secondly, the total number of registers was added up (2R-1W is 3 registers)
and if less than or equal to three then that instruction could be given an
EXTRA3 designation. Four or more is given an EXTRA2 designation because
there are only 9 bits available.

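Expressed as a sketch (a simplification of what sv_analysis.py decides;
the function name is illustrative only):

    # 9 EXTRA bits: up to three registers can have 3 bits each (EXTRA3);
    # with four registers only 2 bits each are available (EXTRA2)
    def designation(gpr_reads, gpr_writes):
        return "EXTRA3" if gpr_reads + gpr_writes <= 3 else "EXTRA2"
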
Thirdly, the instruction was analysed to see if Twin or Single
Predication was suitable. As a general rule this was if there
was only a single operand and a single result (`extw` and LD/ST);
however it was found that some 2- or 3-operand instructions also
qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
in Twin Predication, some compromises were made, here. LDST is
Twin but also has 3 operands in some operations, so only EXTRA2 can be used.

Fourthly, a packing format was decided: for 2R-1W, an EXTRA3 indexing
was chosen whereby
RA is indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update)
RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
(because it is possible to do, and perceived to be useful). Rc=1
co-results (CR0, CR1) are always given the same EXTRA index as their
main result (RT, FRT).

Fifthly, in an automated process the results of the analysis
were output in CSV Format for use in machine-readable form
by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>

This process was laborious but logical, and, crucially, once a
decision is made (and ratified) it cannot be reversed.
Those qualifying future Power ISA Scalar instructions for SVP64
are **strongly** advised to utilise this same process and the same
sv_analysis.py program as a canonical method of maintaining the
relationships. Alterations to that same program which
change the Designation are **prohibited** once finalised (ratified
through the Power ISA WG Process). It would
be similar to deciding that `add` should be changed from X-Form
to D-Form.

# Single Predication

This is a standard mode normally found in Vector ISAs. Every element
in every source Vector and in the destination uses the same bit of one
single predicate mask.

In SVSTATE, for Single-predication, implementors MUST increment both
srcstep and dststep: unlike Twin-Predication the two must be equal at
all times.

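A minimal sketch of the lockstep behaviour (integer add, ignoring
zeroing, elwidth overrides and the scalar/vector distinctions covered
elsewhere):

    for i in range(VL):
        STATE.srcstep = i        # srcstep and dststep advance together
        STATE.dststep = i
        if predval & (1 << i):   # one mask gates sources and dest alike
            iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
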
# Twin Predication

This is a novel concept that allows predication to be applied to a single
source and a single dest register. The following types of traditional
Vector operations may be encoded with it, *without requiring explicit
opcodes to do so*:

* VSPLAT (a single scalar distributed across a vector)
* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))

Those patterns (and more) may be applied to:

* mv (the usual way that V\* ISA operations are created)
* exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

This is a huge list that creates extremely powerful combinations,
particularly given that one of the predicate options is `(1<<r3)`.

Additional unusual capabilities of Twin Predication include a back-to-back
version of VCOMPRESS-VEXPAND which is effectively the ability to do
sequentially ordered multiple VINSERTs. The source predicate selects a
sequentially ordered subset of elements to be inserted; the destination
predicate specifies the sequentially ordered recipient locations.
This is equivalent to
`llvm.masked.compressstore.*`
followed by
`llvm.masked.expandload.*`
with a single instruction.

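The core of Twin Predication may be sketched as follows (a greatly
simplified twin-predicated `sv.mv`, ignoring zeroing and elwidth
overrides): the source and destination steps advance independently,
each skipping its own masked-out elements:

    function op_mv(rd, rs) # twin-predicated mv, simplified
        ps = get_pred_val(FALSE, rs) # source predicate
        pd = get_pred_val(FALSE, rd) # dest predicate
        src = 0; dst = 0
        while (src < VL) && (dst < VL)
            if !(ps & 1<<src)      { src += 1; } # skip masked-out source
            else if !(pd & 1<<dst) { dst += 1; } # skip masked-out dest
            else
                ireg[rd+dst] <= ireg[rs+src]     # active src to active dst
                src += 1; dst += 1

With `ps` sparse and `pd` all-ones this gives VCOMPRESS; with `ps`
all-ones and `pd` sparse it gives VEXPAND.
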
This extreme power and flexibility comes down to the fact that SVP64
is not actually a Vector ISA: it is a loop-abstraction-concept that
is applied *in general* to Scalar operations, just like the x86
`REP` instruction (if put on steroids).

# Reduce modes

Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
Vector ISA would have explicit Reduce opcodes with defined characteristics
per operation: in SX Aurora there is even an additional scalar argument
containing the initial reduction value, and the default is either 0
or 1 depending on the specifics of the explicit opcode.
SVP64 fundamentally has to
utilise *existing* Scalar Power ISA v3.0B operations, which presents some
unique challenges.

The solution turns out to be to simply define reduction as permitting
deterministic element-based schedules to be issued using the base Scalar
operations, and to rely on the underlying microarchitecture to resolve
Register Hazards at the element level. This goes back to
the fundamental principle that SV is nothing more than a Sub-Program-Counter
sitting between Decode and Issue phases.

Microarchitectures *may* take opportunities to parallelise the reduction
but only if in doing so they preserve Program Order at the Element Level.
Opportunities where this is possible include an `OR` operation
or a MIN/MAX operation: it may be possible to parallelise the reduction,
but for Floating Point it is not permitted due to different results
being obtained if the reduction is not executed in strict Program-Sequential
Order.

In essence it becomes the programmer's responsibility to leverage the
pre-determined schedules to desired effect.

## Scalar result reduction and iteration

Scalar Reduction per se does not exist: instead it is implemented in SVP64
as a simple and natural relaxation of the usual restriction on the Vector
Looping which would terminate if the destination was marked as a Scalar.
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
even though the destination register is marked as scalar.
Thus it is up to the programmer to be aware of this, observe some
conventions, and thus end up achieving the desired outcome of scalar
reduction.

It is also important to appreciate that there is no
actual imposition or restriction on how this mode is utilised: there
will therefore be several valuable uses (including Vector Iteration
and "Reverse-Gear")
and it is up to the programmer to make best use of the
(strictly deterministic) capability
provided.

In this mode, which is suited to operations involving carry or overflow,
one register must be assigned, by convention, by the programmer to be the
"accumulator". Scalar reduction is thus categorised by:

* One of the sources is a Vector
* The destination is a scalar
* Optionally but most usefully when one source scalar register is
  also the scalar destination (which may be informally termed
  the "accumulator")
* That the source register type is the same as the destination register
  type identified as the "accumulator". Scalar reduction on `cmp`,
  `setb` or `isel` makes no sense for example because of the mixture
  between CRs and GPRs.

*Note that instructions issued in Scalar reduce mode, such as `setb`,
are neither `UNDEFINED` nor prohibited, despite them not making much
sense at first glance.
Scalar reduce is strictly defined behaviour, and the cost in
hardware terms of prohibition of seemingly non-sensical operations is
too great.
Therefore it is permitted and required to be executed successfully.
Implementors **MAY** choose to optimise such instructions in instances
where their use results in "extraneous execution", i.e. where it is clear
that the sequence of operations, comprising multiple overwrites to
a scalar destination **without** cumulative, iterative, or reductive
behaviour (no "accumulator"), may discard all but the last element
operation. Identification
of such is trivial to do for `setb` and `cmp`: the source register type is
a completely different register file from the destination.
Likewise Scalar reduction when the destination is a Vector
is as if the Reduction Mode was not requested.*

Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements of the vector starting at r10:

    # add RT, RA, RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA

However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`)
SV ordinarily
**terminates** at the first scalar operation. Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.

To perform the loop in reverse order, the `RG` (reverse gear) bit must
be set. This may be useful in situations where the results may be different
(floating-point) if executed in a different order. Given that there is
no actual prohibition on Reduce Mode being applied when the destination
is a Vector, the "Reverse Gear" bit turns out to be a way to apply Iterative
or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
for example will start at the opposite end of the Vector and push
a cumulative series of overlapping add operations into the Execution units of
the underlying hardware.

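In scalar-accumulator form the effect of `RG` is simply that the element
loop runs backwards (a sketch mirroring the earlier mapreduce example):

    # sv.add/mr/rg: as /mr, but elements are issued in reverse order
    for i in reversed(range(VL)):
        iregs[RA] += iregs[RB+i] # RT==RA, the "accumulator"
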
Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required (see [[sv/bitmanip]], bmset),
as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted, has multiple
applications including Video, cryptography and JIT compilation.

    # assume VL=4:
    # * Vector of shift-offsets contained in RC (r12.v)
    # * Vector of masks contained in RB (r8.v)
    # * Vector of values to be masked-in in RA (r4.v)
    # * Scalar destination RT (r0) to receive all mask-offset values
    sv.bmset/mr r0, r4.v, r8.v, r12.v

Due to the Deterministic Scheduling,
Subtract and Divide are still permitted to be executed in this mode,
although from an algorithmic perspective it is strongly discouraged.
It would be better to use addition followed by one final subtract,
or in the case of divide, to get better accuracy, to perform a multiply
cascade followed by a final divide.

Note that single-operand or three-operand scalar-dest reduce is perfectly
well permitted: the programmer may still declare one register, used as
both a Vector source and Scalar destination, to be utilised as
the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc.
this naturally fits well with the normal expected usage of these
operations.

If an interrupt or exception occurs in the middle of the scalar mapreduce,
the scalar destination register **MUST** be updated with the current
(intermediate) result, because this is how `Program Order` is
preserved (Vector Loops are to be considered to be just another way of
issuing instructions
in Program Order). In this way, after return from interrupt,
the scalar mapreduce may continue where it left off. This provides
"precise" exception behaviour.

Note that hardware is perfectly permitted to perform multi-issue
parallel optimisation of the scalar reduce operation: it's just that
as far as the user is concerned, all exceptions and interrupts **MUST**
be precise.

## Vector result reduce mode

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction.

Given that the tree-reduction schedule is deterministic,
Interrupts and exceptions
can therefore also be precise. The final result will be in the first
non-predicate-masked-out destination element, but due again to
the deterministic schedule programmers may find uses for the intermediate
results.

When Rc=1 a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result and its CR Field
are stored "as usual" exactly as all other SVP64 Rc=1 operations.

## Sub-Vector Horizontal Reduction

Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2, assuming RA==RT, e.g. `sv.add/mr/vec2 r4, r4, r16`:

    for i in range(0, VL):
        # RA==RT in the instruction. does not have to be
        iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
        iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)

Thus logically there is nothing special or unanticipated about
`SVM=0`: it is expected behaviour according to standard SVP64
Sub-Vector rules.

By contrast, when SVM is set and SUBVL!=1, a Horizontal
Subvector mode is enabled, which behaves very much more
like a traditional Vector Processor Reduction instruction.
Example for a vec3:

    for i in range(VL):
        result = iregs[RA+i].x
        result = op(result, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result

In this mode, when Rc=1 the Vector of CRs is as normal: each result
element creates a corresponding CR element (for the final, reduced, result).

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST
(see [[sv/ldst]]),
the other for arithmetic operations (actually, CR-driven)
([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.

* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
  ordinary one. Exceptions occur "as normal". However for elements 1
  and above, if an exception would occur, then VL is **truncated** to the
  previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
  CR-creating operation produces a result (including cmp). Similar to
  branch, an analysis of the CR is performed and if the test fails, the
  vector operation terminates and discards all element operations
  above the current one (and the current one if VLi is not set),
  and VL is truncated to either
  the *previous* element or the current one, depending on whether
  VLi (VL "inclusive") is set.

Thus the new VL comprises a contiguous vector of results,
all of which pass the testing criteria (equal to zero, less than zero).

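A simplified sketch of the CR-driven variant (unpredicated, with
`test_fails` standing in for the BO-style single-bit test against `inv`,
and glossing over the Rc/RC1/VLi interactions described below):

    for i in range(VL):
        result = iregs[RA+i] + iregs[RB+i] # or any CR-producing op
        crf = compute_cr(result)           # CR Field co-result
        if test_fails(crf):                # selected CR bit matches "inv"
            VL = i+1 if VLi else i         # truncate: inclusive if VLi set
            if VLi: iregs[RT+i] = result
            break
        iregs[RT+i] = result               # passed: store and continue
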
The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
however it requires speculative execution involving modifications of VL
to get high performance implementations. An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`. The CR is stored (and the CR.eq bit tested
against the `inv` field).
If the CR.eq bit is equal to `inv` then the Vector is truncated and
the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.

VLi is only available as an option when `Rc=0` (or for instructions
which do not have Rc). When set, the current element is always
also included in the count (the new length that VL will be set to).
This may be useful in combination with "inv" to truncate the Vector
to *exclude* elements that fail a test, or, in the case of implementations
of strncpy, to include the terminating zero.

In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
tests this may be insufficient. If that is the case, a vectorised crop
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.

One extremely important aspect of ffirst is:

* LDST ffirst may never set VL equal to zero. This is because on the first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV that VL may be set
  to zero (with the exception of via the SVSTATE SPR). When VL is set
  zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.

Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
to a nonzero value for any implementation-specific reason. For example:
it is perfectly reasonable for implementations to alter VL when ffirst
LD or ST operations are initiated on a nonaligned boundary, such that
within a loop the subsequent iteration of that loop begins subsequent
ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
workloads or balance resources.

CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.

## Data-dependent fail-first on CR operations (crand etc)

Operations that actually produce or alter a CR Field as a result
do not also in turn have an Rc=1 mode. However it makes no
sense to try to test the 4 bits of a CR Field for being equal
or not equal to zero. Moreover, the result is already in the
form that is desired: it is a CR field. Therefore,
CR-based operations have their own SVP64 Mode, described
in [[sv/cr_ops]].

There are two primary different types of CR operations:

* Those which have a 3-bit operand field (referring to a CR Field)
* Those which have a 5-bit operand (referring to a bit within the
  whole 32-bit CR)

More details can be found in [[sv/cr_ops]].

# pred-result mode

Predicate-result merges common CR testing with predication, saving on
instruction count. In essence, a Condition Register Field test
is performed, and if it fails it is considered to have been
*as if* the destination predicate bit was zero.
Arithmetic and Logical Pred-result is covered in [[sv/normal]].

Pred-result mode may not be applied to CR ops.

Although CR operations (mtcr, crand, cror) may be Vectorised
and predicated, pred-result mode applies only to operations that have
an Rc=1 mode, or for which an RC1 option makes sense.

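A simplified sketch (single-predicated, with `test_fails` again standing
in for the selected CR bit-test):

    for i in range(VL):
        if predval & (1 << i):      # ordinary predication first
            result = iregs[RA+i] + iregs[RB+i]
            crf = compute_cr(result)
            if test_fails(crf):
                continue            # as if the predicate bit was zero
            iregs[RT+i] = result    # test passed: write back
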
# CR Operations

CRs are slightly more involved than INT or FP registers due to the
possibility for indexing individual bits (crops BA/BB/BT). Again however
the access pattern needs to be understandable in relation to v3.0B / v3.1B
numbering, with a clear linear relationship and mapping existing when
SV is applied.

## CR EXTRA mapping table and algorithm

Numbering relationships for CR fields are already complex due to being
in BE format (*the relationship is not clearly explained in the v3.0B
or v3.1 specification*). However with some care and consideration
the exact same mapping used for INT and FP regfiles may be applied,
just to the upper bits, as explained below. The notation
`CR{field number}` is used to indicate access to a particular
Condition Register Field (as opposed to the notation `CR[bit]`,
which accesses one bit of the 32-bit Power ISA v3.0B
Condition Register).

`CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is
defined, in v3.0B pseudocode, as:

    CR{7-n} = CR[32+n*4:35+n*4]

For SVP64 the relationship for the sequential
numbering of elements is to the CR **fields** within
the CR Register, not to individual bits within the CR register.

In OpenPOWER v3.0/1, BT/BA/BB are all 5 bits. The top 3 bits (0:2)
select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
*in* that CR. The numbering was determined (after 4 months of
analysis and research) to be as follows:

    CR_index  = 7-(BA>>2)      # top 3 bits but BE
    bit_index = 3-(BA & 0b11)  # low 2 bits but BE
    CR_reg    = CR{CR_index}   # get the CR
    # finally get the bit from the CR.
    CR_bit    = (CR_reg & (1<<bit_index)) != 0

When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
applies, **not** the CR\_bit portion (bits 3:4):

    if extra3_mode:
        spec = EXTRA3
    else:
        spec = EXTRA2<<1 | 0b0
    if spec[0]:
        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((BA >> 2)<<6) | # hi 3 bits shifted up
               (spec[1:2]<<4) | # to make room for these
               (BA & 0b11)      # CR_bit on the end
    else:
        # scalar constructs "00 spec[1:2] BA[0:4]"
        return (spec[1:2] << 5) | BA

Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified to as follows:

    CR_index = 7-(BA>>2) # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-32 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11)  # low 2 bits but BE
    CR_reg    = CR{CR_index}   # get the CR
    # finally get the bit from the CR.
    CR_bit    = (CR_reg & (1<<bit_index)) != 0

Note here that the decoding pattern to determine CR\_bit does not change.

Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
simplify internal design. If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.

## CR fields as inputs/outputs of vector operations

CRs (or, the arithmetic operations associated with them)
may be marked as Vectorised or Scalar. When Rc=1 in arithmetic
operations that have no explicit EXTRA to cover the CR, the CR is
Vectorised if the destination is Vectorised. Likewise if the
destination is scalar then so is the CR.

When vectorised, the CR inputs/outputs are sequentially read/written
to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:

* implementations may rely on the Vector CRs being aligned to 8. This
  means that CRs may be read or written in aligned batches of 32 bits
  (8 CRs per batch), for high performance implementations.
* scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
  overwritten by vector Rc=1 operations except for very large VL
* CR-based predication, from CR32, is also not interfered with
  (except by large VL).

However when the SV result (destination) is marked as a scalar by the
EXTRA field the *standard* v3.0B behaviour applies: the accompanying
CR when Rc=1 is written to. This is CR0 for integer operations and CR1
for FP operations.

Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which
has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
v3.0B scalar operations produce a **tuple** of element results: the
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:

    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
        # now calculate CR bits
        CRs{8+i}.eq = iregs[RT+i] == 0
        CRs{8+i}.gt = iregs[RT+i] > 0
        ... etc

If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations (see [[sv/cr_int_predication]]).

Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.

Additionally,
SVP64 [[sv/branches]] may be used, even when the branch itself is to
the following instruction. The combined side-effects of CTR reduction
and VL truncation provide several benefits.

(see [[discussion]]; some alternative schemes are described there)
## Rc=1 when SUBVL!=1

Sub-vectors are effectively a form of Packed SIMD (length 2 to 4). Only 1 bit of
predicate is allocated per subvector; likewise only one CR is allocated
per subvector.

This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements. A solution
is to perform a bitwise OR or AND of the subvector tests. Given that
OE is ignored in SVP64, this field may (when available) be used to select OR or
AND behaviour.

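As an illustrative sketch (vec2, with `compute_cr` as a hypothetical
helper producing a 4-bit CR Field per sub-element):

    for i in range(VL):
        crf_x = compute_cr(iregs[RT+i].x) # one test per sub-element
        crf_y = compute_cr(iregs[RT+i].y)
        # OE (otherwise ignored by SVP64) selects bitwise OR or AND
        CRs{8+i} = (crf_x | crf_y) if OE else (crf_x & crf_y)
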
### Table of CR fields

CRn is the notation used by the OpenPower spec to refer to CR field #n,
so FP instructions with Rc=1 write to CR1 (n=1).

CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.

The 64 SV CRs are arranged similarly to the way the 128 integer registers
are arranged. TODO: a python program that auto-generates a CSV file
which can be included in a table, which is in a new page (so as not to
overwhelm this one). [[svp64/cr_names]]

# Register Profiles

**NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
<https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.

Instructions are broken down by Register Profiles as listed in the
following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
indicates that the operations with this Register Profile cannot be
Vectorised (mtspr, bc, dcbz, twi).

TODO generate table which will be here [[svp64/reg_profiles]]

# SV pseudocode illustration

## Single-predicated Instruction

Illustration of normal mode add operation: zeroing not included, elwidth
overrides not included. If there is no predicate, it is set to all 1s.

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec) { irs1 += 1; }
        if (rs2.isvec) { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL)
        {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }

This has several modes:

* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s

All of these may be predicated. Vector-Vector is straightforward.
When one of the sources is a Vector and the other a Scalar, it is clear that
each element of the Vector source should be added to the Scalar source,
each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).

The one that is not obvious is RT=vector but both RA/RB=scalar.
Here this acts as a "splat scalar result", copying the same result into
all nonpredicated result elements. If a fixed destination scalar was
intended, then an all-Scalar operation should be used.

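The splat case may be sketched as (unpredicated, for clarity):

    # RT vector, RA and RB both scalar: the same sum lands in every element
    for i in range(VL):
        iregs[RT+i] = iregs[RA] + iregs[RB]
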
See <https://bugs.libre-soc.org/show_bug.cgi?id=552>

# Assembly Annotation

Assembly code annotation is required for SV to be able to successfully
mark instructions as "prefixed".

A reasonable (prototype) starting point:

    svp64 [field=value]*

Fields:

* ew=8/16/32 - element width
* sew=8/16/32 - source element width
* vec=2/3/4 - SUBVL
* mode=mr/satu/sats/crpred
* pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne

Similar to the x86 "rex" prefix.

For actual assembler:

    sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s

Qualifiers:

* m={pred}: predicate mask mode
* sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
* vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* pr={xx}: see predicate-result mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mr.svm: see map-reduce with sub-vector mode
* crm: see map-reduce CR mode
* crm.svm: see map-reduce CR with sub-vector mode
* sz: predication with source-zeroing
* dz: predication with dest-zeroing

For modes:

* pred-result:
  - pm=lt/gt/le/ge/eq/ne/so/ns
  - RC1 mode
* fail-first
  - ff=lt/gt/le/ge/eq/ne/so/ns
  - RC1 mode
* saturation:
  - sats
  - satu
* map-reduce:
  - mr OR crm: "normal" map-reduce mode or CR-mode.
  - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled

# Proposed Parallel-reduction algorithm

**This algorithm contains a MV operation and may NOT be used. Removal
of the MV operation may be achieved by using index-redirection as was
achieved in DCT and FFT REMAP**

```
/// reference implementation of proposed SimpleV reduction semantics.
///
// reduction operation -- we still use this algorithm even
// if the reduction operation isn't associative or
// commutative.
XXX VIOLATION OF SVP64 DESIGN PRINCIPLES XXXX
/// XXX `pred` is a user-visible Vector Condition register XXXX
XXX VIOLATION OF SVP64 DESIGN PRINCIPLES XXXX
///
/// all input arrays have length `vl`
def reduce(vl, vec, pred):
    pred = copy(pred) # must not damage predicate
    step = 1;
    while step < vl
        step *= 2;
        for i in (0..vl).step_by(step)
            other = i + step / 2;
            other_pred = other < vl && pred[other];
            if pred[i] && other_pred
                vec[i] += vec[other];
            else if other_pred
                XXX VIOLATION OF SVP64 DESIGN XXX
                XXX vec[i] = vec[other]; XXX
                XXX VIOLATION OF SVP64 DESIGN XXX
            pred[i] |= other_pred;
```

The first principle in SVP64 being violated is that SVP64 is a fully-independent
Abstraction of hardware-looping in between issue and execute phases
that has no relation to the operation it issues. The above pseudocode
conditionally changes not only the type of element operation issued
(a MV in some cases) but also the number of arguments (2 for a MV).
At the very least, for Vertical-First Mode this will result in
unanticipated and unexpected behaviour (maximising "surprises" for
programmers) in
the middle of loops, that will be far too hard to explain.

The second principle being violated by the above algorithm is the expectation
that temporary storage is available for a modified predicate: there is no
such space, and predicates are read-only to reduce complexity at the
micro-architectural level.
SVP64 is founded on the principle that all operations are
"re-entrant" with respect to interrupts and exceptions: SVSTATE must
be saved and restored alongside PC and MSR, but nothing more. It is perfectly
fine to have context-switching back to the operation be somewhat slower,
through "reconstruction" of temporary internal state based on what SVSTATE
contains, but nothing more.

An alternative algorithm is therefore required that does not perform MVs,
and does not require additional state to be saved on context-switching.

```
def reduce( vl, vec, pred ):
    pred = copy(pred) # must not damage predicate
    j = 0
    vi = [] # array of lookup indices to skip nonpredicated
    for i, pbit in enumerate(pred):
        if pbit:
            vi[j] = i
            j += 1
    step = 2
    while step <= vl
        halfstep = step // 2
        for i in (0..vl).step_by(step)
            other = vi[i + halfstep]
            ir = vi[i]
            other_pred = other < vl && pred[other]
            if pred[i] && other_pred
                vec[ir] += vec[other]
            else if other_pred:
                vi[ir] = vi[other]     # index redirection, no MV
                pred[ir] |= other_pred # reconstructed on context-switch
        step *= 2
```

In this version an explicit MV is made unnecessary by instead
leaving elements *in situ*. The internal modifications to the predicate may,
due to the reduction being entirely deterministic, be "reconstructed"
on a context-switch. This may make some implementations slower.

*Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
implemented in hardware with MVs that ensure lane-crossing is minimised.
The mistake which would be catastrophic to SVP64 to make is to then
limit the Reduction Sequence for all implementors
based solely and exclusively on what one
specific internal microarchitecture does.
In SIMD ISAs the internal SIMD Architectural design is exposed and
imposed on the programmer. Cray-style Vector ISAs on the other hand
provide convenient,
compact and efficient encodings of abstract concepts.
It is the Implementor's responsibility to produce a design
that complies with the above algorithm,
utilising internal Micro-coding and other techniques to transparently
insert MV operations
if necessary or desired, to give the level of efficiency or performance
required.*

# Element-width overrides

Element-width overrides are best illustrated with a packed structure
union in the C programming language. The following should be taken
literally, and assume always a little-endian layout:

    typedef union {
        uint8_t  b[];
        uint16_t s[];
        uint32_t i[];
        uint64_t l[];
        uint8_t  actual_bytes[8];
    } el_reg_t;

    el_reg_t int_regfile[128];

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!reg.isvec):
            # not a vector: first element only, overwrites high bits
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
to fp127) are reinterpreted to be "starting points" in a byte-addressable
memory. Vectors - which become just a virtual naming construct - effectively
overlap.

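For instance (a hypothetical layout, following the union above): with
elwidth=8 and VL=20 on a Vector starting at r4, the elements simply run
through the underlying bytes:

    # element n is int_regfile viewed as a flat byte array:
    #   elements  0..7  -> the 8 bytes of r4
    #   elements  8..15 -> the 8 bytes of r5
    #   elements 16..19 -> the low 4 bytes of r6
    for n in range(20):
        el = get_polymorphed_reg(4, 8, n) # reads span r4..r6
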
It is extremely important for implementors to note that the only circumstance
where upper portions of an underlying 64-bit register are zero'd out is
when the destination is a scalar. The ideal register file has byte-level
write-enable lines, just like most SRAMs, in order to avoid READ-MODIFY-WRITE.

An example ADD operation with predication and element width overrides:

    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication
            src1 = get_polymorphed_reg(RA, srcwid, irs1)
            src2 = get_polymorphed_reg(RB, srcwid, irs2)
            result = src1 + src2 # actual add here
            set_polymorphed_reg(RT, destwid, ird, result)
            if (!RT.isvec) break
        if (RT.isvec) { ird += 1; }
        if (RA.isvec) { irs1 += 1; }
        if (RB.isvec) { irs2 += 1; }

Thus it can be clearly seen that elements are packed by their
element width, and the packing starts from the source (or destination)
specified by the instruction.

# Twin (implicit) result operations

Some operations in the Power ISA already target two 64-bit scalar
registers: `lq` for example, and LD with update.
Some mathematical algorithms are more
efficient when there are two outputs rather than one, providing
feedback loops between elements (the most well-known being add with
carry). 64-bit multiply
for example actually internally produces a 128 bit result, which clearly
cannot be stored in a single 64 bit register. Some ISAs recommend
"macro op fusion": the practice of setting a convention whereby if
two commonly used instructions (mullo, mulhi) use the same ALU but
one selects the low part of an identical operation and the other
selects the high part, then optimised micro-architectures may
"fuse" those two instructions together, using Micro-coding techniques,
internally.

The practice and convention of macro-op fusion however is not compatible
with SVP64 Horizontal-First, because Horizontal Mode may only
be applied to a single instruction at a time, and SVP64 is based on
the principle of strict Program Order even at the element
level. Thus it becomes
necessary to add explicit, more complex single instructions with
more operands than would normally be seen in the average RISC ISA
(3-in, 2-out, in some cases). If it
was not for Power ISA already having LD/ST with update as well as
Condition Codes and `lq` this would be hard to justify.

With limited space in the `EXTRA` Field, and Power ISA opcodes
being only 32 bit, 5 operands is quite an ask. `lq` however sets
a precedent: `RTp` stands for "RT pair". In other words the result
is stored in RT and RT+1. For Scalar operations, following this
precedent is perfectly reasonable. In Scalar mode,
`madded` therefore stores the two halves of the 128-bit multiply
into RT and RT+1.

What, then, of `sv.madded`? If the destination is hard-coded to
RT and RT+1 the instruction is not useful when Vectorised because
the output will be overwritten on the next element. To solve this
is easy: define the destination registers as RT and RT+MAXVL
respectively. This makes it easy for compilers to statically allocate
registers even when VL changes dynamically.

Bearing in mind that both RT and RT+MAXVL are starting points for Vectors,
and that element-width overrides still have to be taken
into consideration, the starting point for the implicit destination
is best illustrated in pseudocode:

    # demo of madded
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication
            src1 = get_polymorphed_reg(RA, srcwid, irs1)
            src2 = get_polymorphed_reg(RB, srcwid, irs2)
            src3 = get_polymorphed_reg(RC, srcwid, irs3)
            result = src1*src2 + src3
            destmask = (1<<destwid)-1
            # store two halves of result, both start from RT.
            set_polymorphed_reg(RT, destwid, ird, result&destmask)
            set_polymorphed_reg(RT, destwid, ird+MAXVL, result>>destwid)
            if (!RT.isvec) break
        if (RT.isvec) { ird += 1; }
        if (RA.isvec) { irs1 += 1; }
        if (RB.isvec) { irs2 += 1; }
        if (RC.isvec) { irs3 += 1; }

The significant part here is that the second half is stored
starting not from RT+MAXVL at all: it is the *element* index
that is offset by MAXVL, both halves actually starting from RT.
If VL is 3, MAXVL is 5, RT is 1, and dest elwidth is 32 then the elements
RT0 to RT2 are stored:

          0..31     32..63
    r0    unchanged unchanged
    r1    RT0.lo    RT1.lo
    r2    RT2.lo    unchanged
    r3    unchanged RT0.hi
    r4    RT1.hi    RT2.hi
    r5    unchanged unchanged

Note that all of the LO halves start from r1, but that the HI halves
start from half-way into r3. The reason is that with MAXVL being
5 and elwidth being 32, this is the 5th element
offset (in 32 bit quantities) counting from r1.

Additional DRAFT Scalar instructions in 3-in 2-out form
with an implicit 2nd destination:

* [[isa/svfixedarith]]
* [[isa/svfparith]]