1 # Appendix
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturation
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47> Parallel Prefix
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=697> Reduce Modes
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=809> OV sv.addex discussion
7
8 This is the appendix to [[sv/svp64]], providing explanations of modes
9 etc. leaving the main svp64 page's primary purpose as outlining the
10 instruction format.
11
12 Table of contents:
13
14 [[!toc]]
15
16 # XER, SO and other global flags
17
18 Vector systems are expected to be high performance. This is achieved
19 through parallelism, which requires that elements in the vector be
20 independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
21 Read-Write Hazards on single-bit global resources, having a significant
22 detrimental effect.
23
24 Consequently in SV, XER.SO behaviour is disregarded (including
25 in `cmp` instructions). XER.SO is not read, but XER.OV may be written,
26 breaking the Read-Modify-Write Hazard Chain that complicates
27 microarchitectural implementations.
28 This includes when `scalar identity behaviour` occurs. If precise
29 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
30 instructions should be used without an SV Prefix.
31
32 TODO jacob add about OV https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
33
34 Of note here is that XER.SO and OV may already be disregarded in the
35 Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
36 SVP64 simply makes it mandatory to disregard XER.SO even for other Subsets,
37 but only for SVP64 Prefixed Operations.
38
39 XER.CA/CA32 on the other hand is expected and required to be implemented
40 according to standard Power ISA Scalar behaviour. Interestingly, due
41 to SVP64 being in effect a hardware for-loop around Scalar instructions
42 executing in precise Program Order, a little thought shows that a Vectorised
43 Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
44 and producing, at the end, a single bit Carry out. High performance
45 implementations may exploit this observation to deploy efficient
46 Parallel Carry Lookahead.
47
48 # assume VL=4, this results in 4 sequential ops (below)
49 sv.adde r0.v, r4.v, r8.v
50
51 # instructions that get executed in backend hardware:
52 adde r0, r4, r8 # takes carry-in, produces carry-out
53 adde r1, r5, r9 # takes carry from previous
54 ...
55 adde r3, r7, r11 # likewise
56
57 It can clearly be seen that the carry chains from one
58 64 bit add to the next, the end result being that a
59 256-bit "Big Integer Add" has been performed, and that
60 CA contains the 257th bit. A one-instruction 512-bit Add
61 may be performed by setting VL=8, and a one-instruction
62 1024-bit add by setting VL=16, and so on.
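
The following minimal Python sketch (not part of the specification; the
function name and structure are purely illustrative) models the effect:
VL chained 64-bit `adde` element operations behave as a single VL*64-bit
add with one carry-in and one carry-out.

    # each list entry is one 64-bit element (least-significant first)
    def big_int_add(a, b, carry_in=0):
        mask = (1 << 64) - 1
        result, ca = [], carry_in
        for ai, bi in zip(a, b):        # the SVP64 hardware for-loop
            s = ai + bi + ca            # one scalar "adde" element op
            result.append(s & mask)
            ca = s >> 64                # carry chains into the next element
        return result, ca               # ca is the 257th bit when VL=4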
63
64 # v3.0B/v3.1 relevant instructions
65
66 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
67 CPU ISA.
68
69 Vectorisation of the VSX Packed SIMD system
70 makes no sense whatsoever: SV *replaces* VSX and provides,
71 at the very minimum, predication (which VSX was designed without).
72 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
73 illegal instruction exceptions in SV Prefix Mode.
74
75 Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
76 retain: not only is their functionality already provided by SV, the SV
77 alternatives may be predicated as well, making them far better suited
78 to use in function calls and context-switching.
79
80 Additionally, some v3.0/1 instructions simply make no sense at all in a
81 Vector context: `rfid` falls into this category,
82 as well as `sc` and `scv`. Here there is simply no point
83 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
84 should be called instead.
85
86 Fortuitously this leaves several Major Opcodes free for use by SV
87 to fit alternative future instructions. In a 3D context this means
88 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
89 operations, and others critical to an efficient, effective 3D GPU and
90 VPU ISA. With such instructions being included as standard in other
91 commercially-successful GPU ISAs it is likewise critical that a 3D
92 GPU/VPU based on svp64 also have such instructions.
93
94 Note however that svp64 is stand-alone and is in no way
95 critically dependent on the existence or provision of 3D GPU or VPU
96 instructions. These should be considered extensions, and their discussion
97 and specification is out of scope for this document.
98
99 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
100 v3.1B is *not* altered by svp64 in any way.
101
102 ## Major opcode map (v3.0B)
103
104 This table is taken from v3.0B.
105 Table 9: Primary Opcode Map (opcode bits 0:5)
106
107 ```
108 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
109 000 | | | tdi | twi | EXT04 | | | mulli | 000
110 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
111 010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm | | rlwnm | 010
112 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
113 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
114 101 | lhz | lhzu | lha | lhau | sth | sthu | lmw | stmw | 101
115 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
116 111 | lq | EXT57 | EXT58 | EXT59 | EXT60 | EXT61 | EXT62 | EXT63 | 111
117 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
118 ```
119
120 ## Suitable for svp64-only
121
122 This is the same table containing v3.0B Primary Opcodes except those that
123 make no sense in a Vectorisation Context have been removed. These removed
124 POs can, *in the SV Vector Context only*, be assigned to alternative
125 (Vectorised-only) instructions, including future extensions.
126 EXT04 retains the scalar `madd*` operations but would have all PackedSIMD
127 (aka VSX) operations removed.
128
129 Note, again, to emphasise: outside of svp64 these opcodes **do not**
130 change. When not prefixed with svp64 these opcodes **specifically**
131 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
132
133 ```
134 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
135 000 | | | | | EXT04 | | | mulli | 000
136 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
137 010 | bc/l/a | | | EXT19 | rlwimi| rlwinm | | rlwnm | 010
138 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
139 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
140 101 | lhz | lhzu | lha | lhau | sth | sthu | | | 101
141 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
142 111 | | | EXT58 | EXT59 | | EXT61 | | EXT63 | 111
143 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
144 ```
145
146 It is important to note that having a v3.0B Scalar opcode
147 whose meaning differs from its SVP64 counterpart is highly undesirable:
148 the complexity in the decoder is greatly increased.
149
150 # EXTRA Field Mapping
151
152 The purpose of the 9-bit EXTRA field mapping is to mark individual
153 registers (RT, RA, BFA) as either scalar or vector, and to extend
154 their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
155 Three of the 9 bits may also be used up for a 2nd Predicate (Twin
156 Predication) leaving a mere 6 bits for qualifying registers. As can
157 be seen there is significant pressure on these (and in fact all) SVP64 bits.
158
159 In Power ISA v3.1 prefixing there are bits which describe and classify
160 the prefix in a fashion that is independent of the suffix. MLSS for
161 example. For SVP64 there is insufficient space to make the SVP64 Prefix
162 "self-describing", and consequently every single Scalar instruction
163 had to be individually analysed, by rote, to craft an EXTRA Field Mapping.
164 This process was semi-automated and is described in this section.
165 The final results, which are part of the SVP64 Specification, are here:
166
167 * [[openpower/opcode_regs_deduped]]
168
169 Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
170 from reading the markdown formatted version of the Scalar pseudocode
171 which is machine-readable and found in [[openpower/isatables]]. The
172 analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
173 for example is given a designation `RM-2R-1W` because it requires
174 two GPR reads and one GPR write.
175
176 Secondly, the total number of registers was added up (2R-1W is 3 registers)
177 and if less than or equal to three then that instruction could be given an
178 EXTRA3 designation. Four or more is given an EXTRA2 designation because
179 there are only 9 bits available.
180
181 Thirdly, the instruction was analysed to see if Twin or Single
182 Predication was suitable. As a general rule this was if there
183 was only a single operand and a single result (`extw` and LD/ST)
184 however it was found that some 2 or 3 operand instructions also
185 qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
186 in Twin Predication, some compromises were made, here. LDST is
187 Twin but also has 3 operands in some operations, so only EXTRA2 can be used.
188
189 Fourthly, a packing format was decided: for 2R-1W, for example, an EXTRA3
190 indexing could be chosen such
191 that RA would be indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
192 and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update)
193 RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
194 (because it is possible to do, and perceived to be useful). Rc=1
195 co-results (CR0, CR1) are always given the same EXTRA index as their
196 main result (RT, FRT).
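
As an illustrative (non-normative) sketch of the packing principle for an
`RM-2R-1W` instruction such as `add RT, RA, RB`, where `extra3()` is a
hypothetical helper returning the chosen 3-bit scalar/vector spec for an
operand (the ratified tables in [[openpower/opcode_regs_deduped]] remain
authoritative):

    EXTRA = [0] * 9               # the 9-bit EXTRA field
    EXTRA[0:3] = extra3(RA)       # index 0: EXTRA bits 0-2
    EXTRA[3:6] = extra3(RB)       # index 1: EXTRA bits 3-5
    EXTRA[6:9] = extra3(RT)       # index 2: EXTRA bits 6-8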
197
198 Fifthly, in an automated process the results of the analysis
199 were output in CSV Format for use in machine-readable form
200 by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>
201
202 This process was laborious but logical, and, crucially, once a
203 decision is made (and ratified) it cannot be reversed.
204 Those qualifying future Power ISA Scalar instructions for SVP64
205 are **strongly** advised to utilise this same process and the same
206 sv_analysis.py program as a canonical method of maintaining the
207 relationships. Alterations to that same program which
208 change a Designation are **prohibited** once it is finalised (ratified
209 through the Power ISA WG Process). Doing so would
210 be similar to deciding that `add` should be changed from X-Form
211 to D-Form.
212
213 # Single Predication
214
215 This is a standard mode normally found in Vector ISAs: every element in every source Vector and in the destination uses the same bit of one single predicate mask.
216
217 In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep: unlike Twin-Predication the two must be equal at all times.
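
A minimal sketch of the stepping (zeroing and elwidth overrides omitted;
names are illustrative):

    for i in range(VL):
        SVSTATE.srcstep = i
        SVSTATE.dststep = i              # always kept equal to srcstep
        if predicate[i]:                 # one mask bit covers src and dest
            GPR[RT+i] = OP(GPR[RA+i], GPR[RB+i])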
218
219 # Twin Predication
220
221 This is a novel concept that allows predication to be applied to a single
222 source and a single dest register. The following types of traditional
223 Vector operations may be encoded with it, *without requiring explicit
224 opcodes to do so*
225
226 * VSPLAT (a single scalar distributed across a vector)
227 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
228 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
229 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
230 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
231
232 Those patterns (and more) may be applied to:
233
234 * mv (the usual way that V\* ISA operations are created)
235 * exts\* sign-extension
236 * rlwinm and other RS-RA shift operations (**note**: excluding
237 those that take RA as both a src and dest. These are not
238 1-src 1-dest, they are 2-src, 1-dest)
239 * LD and ST (treating AGEN as one source)
240 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
241 * Condition Register ops mfcr, mtcr and other similar
242
243 This is a huge list that creates extremely powerful combinations,
244 particularly given that one of the predicate options is `(1<<r3)`.
245
246 Additional unusual capabilities of Twin Predication include a back-to-back
247 version of VCOMPRESS-VEXPAND which is effectively the ability to do
248 sequentially ordered multiple VINSERTs. The source predicate selects a
249 sequentially ordered subset of elements to be inserted; the destination
250 predicate specifies the sequentially ordered recipient locations.
251 This is equivalent to
252 `llvm.masked.compressstore.*`
253 followed by
254 `llvm.masked.expandload.*`
255 with a single instruction.
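
A simplified sketch of the element stepping for a 1-source, 1-destination
twin-predicated operation such as `sv.mv` (non-zeroing form; elwidth
overrides omitted; names are illustrative):

    srcstep, dststep = 0, 0
    while srcstep < VL and dststep < VL:
        # each side independently skips its own masked-out elements
        while srcstep < VL and not srcmask[srcstep]: srcstep += 1
        while dststep < VL and not dstmask[dststep]: dststep += 1
        if srcstep == VL or dststep == VL: break
        GPR[RT+dststep] = GPR[RA+srcstep]
        srcstep += 1
        dststep += 1

With an all-ones source mask and a single-bit destination mask this gives
VINSERT; swapping the two gives VEXTRACT; general masks on both sides give
the compress/expand patterns listed above.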
256
257 This extreme power and flexibility comes down to the fact that SVP64
258 is not actually a Vector ISA: it is a loop-abstraction-concept that
259 is applied *in general* to Scalar operations, just like the x86
260 `REP` instruction (if put on steroids).
261
262 # Reduce modes
263
264 Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
265 Vector ISA would have explicit Reduce opcodes with defined characteristics
266 per operation: in SX Aurora there is even an additional scalar argument
267 containing the initial reduction value, and the default is either 0
268 or 1 depending on the specifics of the explicit opcode.
269 SVP64 fundamentally has to
270 utilise *existing* Scalar Power ISA v3.0B operations, which presents some
271 unique challenges.
272
273 The solution turns out to be to simply define reduction as permitting
274 deterministic element-based schedules to be issued using the base Scalar
275 operations, and to rely on the underlying microarchitecture to resolve
276 Register Hazards at the element level. This goes back to
277 the fundamental principle that SV is nothing more than a Sub-Program-Counter
278 sitting between Decode and Issue phases.
279
280 Microarchitectures *may* take opportunities to parallelise the reduction
281 but only if in doing so they preserve Program Order at the Element Level.
282 Opportunities where this is possible include an `OR` operation
283 or a MIN/MAX operation: it may be possible to parallelise the reduction,
284 but for Floating Point it is not permitted due to different results
285 being obtained if the reduction is not executed in strict Program-Sequential
286 Order.
287
288 In essence it becomes the programmer's responsibility to leverage the
289 pre-determined schedules to desired effect.
290
291 ## Scalar result reduction and iteration
292
293 Scalar Reduction per se does not exist: instead it is implemented in SVP64
294 as a simple and natural relaxation of the usual restriction on the Vector
295 Looping which would terminate if the destination was marked as a Scalar.
296 Scalar Reduction by contrast *keeps issuing Vector Element Operations*
297 even though the destination register is marked as scalar.
298 Thus it is up to the programmer to be aware of this, observe some
299 conventions, and thus end up achieving the desired outcome of scalar
300 reduction.
301
302 It is also important to appreciate that there is no
303 actual imposition or restriction on how this mode is utilised: there
304 will therefore be several valuable uses (including Vector Iteration
305 and "Reverse-Gear")
306 and it is up to the programmer to make best use of the
307 (strictly deterministic) capability
308 provided.
309
310 In this mode, which is suited to operations involving carry or overflow,
311 one register must be assigned, by convention by the programmer to be the
312 "accumulator". Scalar reduction is thus categorised by:
313
314 * One of the sources is a Vector
315 * the destination is a scalar
316 * optionally but most usefully when one source scalar register is
317 also the scalar destination (which may be informally termed
318 the "accumulator")
319 * That the source register type is the same as the destination register
320 type identified as the "accumulator". Scalar reduction on `cmp`,
321 `setb` or `isel` makes no sense for example because of the mixture
322 between CRs and GPRs.
323
324 *Note that instructions issued in Scalar reduce mode, such as `setb`,
325 are neither `UNDEFINED` nor prohibited, despite them not making much
326 sense at first glance.
327 Scalar reduce is strictly defined behaviour, and the cost in
328 hardware terms of prohibition of seemingly non-sensical operations is too great.
329 Therefore it is permitted and required to be executed successfully.
330 Implementors **MAY** choose to optimise such instructions in instances
331 where their use results in "extraneous execution", i.e. where it is clear
332 that the sequence of operations, comprising multiple overwrites to
333 a scalar destination **without** cumulative, iterative, or reductive
334 behaviour (no "accumulator"), may discard all but the last element
335 operation. Identification
336 of such is trivial to do for `setb` and `cmp`: the source register type is
337 a completely different register file from the destination.
338 Likewise Scalar reduction when the destination is a Vector
339 is as if the Reduction Mode was not requested.*
340
341 Typical applications include simple operations such as `ADD r3, r10.v,
342 r3` where, clearly, r3 is being used to accumulate the addition of all
343 elements of the vector starting at r10.
344
345 # add RT, RA,RB but when RT==RA
346 for i in range(VL):
347 iregs[RA] += iregs[RB+i] # RT==RA
348
349 However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`)
350 SV ordinarily
351 **terminates** at the first scalar operation. Only by marking the
352 operation as "mapreduce" will it continue to issue multiple sub-looped
353 (element) instructions in `Program Order`.
354
355 To perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This may be useful in situations where the results may be different
356 (floating-point) if executed in a different order. Given that there is
357 no actual prohibition on Reduce Mode being applied when the destination
358 is a Vector, the "Reverse Gear" bit turns out to be a way to apply Iterative
359 or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
360 for example will start at the opposite end of the Vector and push
361 a cumulative series of overlapping add operations into the Execution units of
362 the underlying hardware.
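
For the scalar-accumulator case, a sketch of the effect of setting `RG`
(building on the earlier `RT==RA` example):

    # as before, but the element loop is issued from the highest index
    # downwards, still strictly one element at a time in Program Order
    for i in reversed(range(VL)):
        iregs[RA] += iregs[RB+i]   # RT==RA, the "accumulator"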
363
364 Other examples include shift-mask operations where a Vector of inserts
365 into a single destination register is required (see [[sv/bitmanip]], bmset),
366 as a way to construct
367 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
368 Using the same register as both the source and destination, with Vectors
369 of different offsets masks and values to be inserted has multiple
370 applications including Video, cryptography and JIT compilation.
371
372 # assume VL=4:
373 # * Vector of shift-offsets contained in RC (r12.v)
374 # * Vector of masks contained in RB (r8.v)
375 # * Vector of values to be masked-in in RA (r4.v)
376 # * Scalar destination RT (r0) to receive all mask-offset values
377 sv.bmset/mr r0, r4.v, r8.v, r12.v
378
379 Due to the Deterministic Scheduling,
380 Subtract and Divide are still permitted to be executed in this mode,
381 although from an algorithmic perspective it is strongly discouraged.
382 It would be better to use addition followed by one final subtract,
383 or in the case of divide, to get better accuracy, to perform a multiply
384 cascade followed by a final divide.
385
386 Note that single-operand or three-operand scalar-dest reduce is perfectly
387 well permitted: the programmer may still declare one register, used as
388 both a Vector source and Scalar destination, to be utilised as
389 the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
390 this naturally fits well with the normal expected usage of these
391 operations.
392
393 If an interrupt or exception occurs in the middle of the scalar mapreduce,
394 the scalar destination register **MUST** be updated with the current
395 (intermediate) result, because this is how ```Program Order``` is
396 preserved (Vector Loops are to be considered to be just another way of issuing instructions
397 in Program Order). In this way, after return from interrupt,
398 the scalar mapreduce may continue where it left off. This provides
399 "precise" exception behaviour.
400
401 Note that hardware is perfectly permitted to perform multi-issue
402 parallel optimisation of the scalar reduce operation: it's just that
403 as far as the user is concerned, all exceptions and interrupts **MUST**
404 be precise.
405
406 ## Vector result reduce mode
407
408 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
409 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
410 *appearance* and *effect* of Reduction.
411
412 Given that the tree-reduction schedule is deterministic,
413 Interrupts and exceptions
414 can therefore also be precise. The final result will be in the first
415 non-predicate-masked-out destination element, but due again to
416 the deterministic schedule programmers may find uses for the intermediate
417 results.
418
419 When Rc=1 a corresponding Vector of co-resultant CRs is also
420 created. No special action is taken: the result and its CR Field
421 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
422
423 ## Sub-Vector Horizontal Reduction
424
425 Note that when SVM is clear and SUBVL!=1 the sub-elements are
426 *independent*: as a result they are mapreduced per *sub-element*.
427 Illustration with a vec2, assuming RA==RT, e.g. `sv.add/mr/vec2 r4, r4, r16`:
428
429 for i in range(0, VL):
430 # RA==RT in the instruction. does not have to be
431 iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
432 iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)
433
434 Thus logically there is nothing special or unanticipated about
435 `SVM=0`: it is expected behaviour according to standard SVP64
436 Sub-Vector rules.
437
438 By contrast, when SVM is set and SUBVL!=1, a Horizontal
439 Subvector mode is enabled, which behaves very much more
440 like a traditional Vector Processor Reduction instruction.
441 Example for a vec3:
442
443 for i in range(VL):
444 result = iregs[RA+i].x
445 result = op(result, iregs[RA+i].y)
446 result = op(result, iregs[RA+i].z)
447 iregs[RT+i] = result
448
449 In this mode, when Rc=1 the Vector of CRs is as normal: each result
450 element creates a corresponding CR element (for the final, reduced, result).
451
452 # Fail-on-first
453
454 Data-dependent fail-on-first has two distinct variants: one for LD/ST
455 (see [[sv/ldst]]),
456 the other for arithmetic operations (actually, CR-driven)
457 ([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
458 Note in each
459 case the assumption is that vector elements are required to appear to be
460 executed in sequential Program Order, element 0 being the first.
461
462 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
463 ordinary one. Exceptions occur "as normal". However for elements 1
464 and above, if an exception would occur, then VL is **truncated** to the
465 previous element.
466 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
467 CR-creating operation produces a result (including cmp). Similar to
468 branch, an analysis of the CR is performed and if the test fails, the
469 vector operation terminates and discards all element operations
470 above the current one (and the current one if VLi is not set),
471 and VL is truncated to either
472 the *previous* element or the current one, depending on whether
473 VLi (VL "inclusive") is set.
474
475 Thus the new VL comprises a contiguous vector of results,
476 all of which pass the testing criteria (equal to zero, less than zero).
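
A simplified sketch of the CR-driven truncation (single predication and
elwidth overrides omitted; helper names are illustrative):

    for i in range(VL):
        result, cr = element_op(GPR[RA+i], GPR[RB+i])  # produces a CR Field
        if test_fails(cr):      # the selected CR bit, compared against "inv"
            VL = i              # truncate to the *previous* element...
            # (with VLi set: VL = i+1 and this element is also written back)
            break
        write_back(i, result, cr)   # elements 0..VL-1 all passed the test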
477
478 The CR-based data-driven fail-on-first is new and not found in ARM
479 SVE or RVV. It is extremely useful for reducing instruction count,
480 however requires speculative execution involving modifications of VL
481 to get high performance implementations. An additional mode (RC1=1)
482 effectively turns what would otherwise be an arithmetic operation
483 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
484 against the `inv` field).
485 If the CR.eq bit is equal to `inv` then the Vector is truncated and
486 the loop ends.
487 Note that when RC1=1 the result elements are never stored, only the CRs.
488
489 VLi is only available as an option when `Rc=0` (or for instructions
490 which do not have Rc). When set, the current element is always
491 also included in the count (the new length that VL will be set to).
492 This may be useful in combination with "inv" to truncate the Vector
493 to `exclude` elements that fail a test, or, in the case of implementations
494 of strncpy, to include the terminating zero.
495
496 In CR-based data-driven fail-on-first there is only the option to select
497 and test one bit of each CR (just as with branch BO). For more complex
498 tests this may be insufficient. If that is the case, a vectorised crop
499 (crand, cror) may be used, and ffirst applied to the crop instead of to
500 the arithmetic vector.
501
502 One extremely important aspect of ffirst is:
503
504 * LDST ffirst may never set VL equal to zero. This is because on the first
505 element an exception must be raised "as normal".
506 * CR-based data-dependent ffirst on the other hand **can** set VL equal
507 to zero. This is the only means in the entirety of SV that VL may be set
508 to zero (with the exception of via the SV.STATE SPR). When VL is set
509 zero due to the first element failing the CR bit-test, all subsequent
510 vectorised operations are effectively `nops` which is
511 *precisely the desired and intended behaviour*.
512
513 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
514 to a nonzero value for any implementation-specific reason. For example:
515 it is perfectly reasonable for implementations to alter VL when ffirst
516 LD or ST operations are initiated on a nonaligned boundary, such that
517 within a loop the subsequent iteration of that loop begins subsequent
518 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
519 workloads or balance resources.
520
521 CR-based data-dependent fail-first on the other hand MUST NOT truncate VL
522 arbitrarily to a length decided by the hardware: VL MUST only be
523 truncated based explicitly on whether a test fails.
524 This is because it is a precise test on which algorithms
525 will rely.
526
527 ## Data-dependent fail-first on CR operations (crand etc)
528
529 Operations that actually produce or alter a CR Field as a result
530 do not also in turn have an Rc=1 mode. However it makes no
531 sense to try to test the 4 bits of a CR Field for being equal
532 or not equal to zero. Moreover, the result is already in the
533 form that is desired: it is a CR field. Therefore,
534 CR-based operations have their own SVP64 Mode, described
535 in [[sv/cr_ops]].
536
537 There are two primary different types of CR operations:
538
539 * Those which have a 3-bit operand field (referring to a CR Field)
540 * Those which have a 5-bit operand (referring to a bit within the
541 whole 32-bit CR)
542
543 More details can be found in [[sv/cr_ops]].
544
545 # pred-result mode
546
547 Predicate-result merges common CR testing with predication, saving on
548 instruction count. In essence, a Condition Register Field test
549 is performed, and if it fails it is considered to have been
550 *as if* the destination predicate bit was zero.
551 Arithmetic and Logical Pred-result is covered in [[sv/normal]].
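
A simplified sketch (single predication shown; helper names are
illustrative):

    for i in range(VL):
        if not predicate[i]:
            continue
        result, cr = element_op(GPR[RA+i], GPR[RB+i])
        if test_fails(cr):      # the CR Field test selected by the mode
            continue            # behave exactly as if predicate[i] were zero
        GPR[RT+i] = result      # only elements passing both checks are written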
552
553 Pred-result mode may not be applied to CR ops.
554
555 Although CR operations (mtcr, crand, cror) may be Vectorised and
556 predicated, pred-result mode applies only to operations that have
557 an Rc=1 mode, or to which adding an RC1 option makes sense.
558
559 # CR Operations
560
561 CRs are slightly more involved than INT or FP registers due to the
562 possibility for indexing individual bits (crops BA/BB/BT). Again however
563 the access pattern needs to be understandable in relation to v3.0B / v3.1B
564 numbering, with a clear linear relationship and mapping existing when
565 SV is applied.
566
567 ## CR EXTRA mapping table and algorithm
568
569 Numbering relationships for CR fields are already complex due to being
570 in BE format (*the relationship is not clearly explained in the v3.0B
571 or v3.1 specification*). However with some care and consideration
572 the exact same mapping used for INT and FP regfiles may be applied,
573 just to the upper bits, as explained below. The notation
574 `CR{field number}` is used to indicate access to a particular
575 Condition Register Field (as opposed to the notation `CR[bit]`
576 which accesses one bit of the 32 bit Power ISA v3.0B
577 Condition Register)
578
579 `CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is defined, in v3.0B pseudocode, as:
580
581 CR{7-n} = CR[32+n*4:35+n*4]
582
583 For SVP64 the relationship for the sequential
584 numbering of elements is to the CR **fields** within
585 the CR Register, not to individual bits within the CR register.
586
587 In OpenPOWER v3.0/1, BT/BA/BB are all 5 bits. The top 3 bits (0:2)
588 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
589 *in* that CR. The numbering was determined (after 4 months of
590 analysis and research) to be as follows:
591
592 CR_index = 7-(BA>>2) # top 3 bits but BE
593 bit_index = 3-(BA & 0b11) # low 2 bits but BE
594 CR_reg = CR{CR_index} # get the CR
595 # finally get the bit from the CR.
596 CR_bit = (CR_reg & (1<<bit_index)) != 0
597
598 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
599 applies, **not** the CR\_bit portion (bits 3:4):
600
601 if extra3_mode:
602 spec = EXTRA3
603 else:
604 spec = EXTRA2<<1 | 0b0
605 if spec[0]:
606 # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
607 return ((BA >> 2)<<6) | # hi 3 bits shifted up
608 (spec[1:2]<<4) | # to make room for these
609 (BA & 0b11) # CR_bit on the end
610 else:
611 # scalar constructs "00 spec[1:2] BA[0:4]"
612 return (spec[1:2] << 5) | BA
613
614 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
615 algorithm to determine CR\_reg is modified as follows:
616
617 CR_index = 7-(BA>>2) # top 3 bits but BE
618 if spec[0]:
619 # vector mode, 0-124 increments of 4
620 CR_index = (CR_index<<4) | (spec[1:2] << 2)
621 else:
622 # scalar mode, 0-32 increments of 1
623 CR_index = (spec[1:2]<<3) | CR_index
624 # same as for v3.0/v3.1 from this point onwards
625 bit_index = 3-(BA & 0b11) # low 2 bits but BE
626 CR_reg = CR{CR_index} # get the CR
627 # finally get the bit from the CR.
628 CR_bit = (CR_reg & (1<<bit_index)) != 0
629
630 Note here that the decoding pattern to determine CR\_bit does not change.
631
632 Note: high-performance implementations may read/write Vectors of CRs in
633 batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
634 simplify internal design. If instructions are issued where CR Vectors
635 do not start on a 32-bit aligned boundary, performance may be affected.
636
637 ## CR fields as inputs/outputs of vector operations
638
639 CRs (or, the arithmetic operations associated with them)
640 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
641
642 When Vectorised, the CR inputs/outputs are sequentially read/written
643 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
644 writing to CR8 (TBD evaluate) and increase sequentially from there.
645 This is so that:
646
647 * implementations may rely on the Vector CRs being aligned to 8. This
648 means that CRs may be read or written in aligned batches of 32 bits
649 (8 CRs per batch), for high performance implementations.
650 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
651 overwritten by vector Rc=1 operations except for very large VL
652 * CR-based predication, from CR32, is also not interfered with
653 (except by large VL).
654
655 However when the SV result (destination) is marked as a scalar by the
656 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
657 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
658 for FP operations.
659
660 Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which
661 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
662 v3.0B scalar operations produce a **tuple** of element results: the
663 result of the operation as one part of that element *and a corresponding
664 CR element*. Greatly simplified pseudocode:
665
666 for i in range(VL):
667 # calculate the vector result of an add
668 iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
669 # now calculate CR bits
670 CRs{8+i}.eq = iregs[RT+i] == 0
671 CRs{8+i}.gt = iregs[RT+i] > 0
672 ... etc
673
674 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
675 then a followup instruction must be performed, setting "reduce" mode on
676 the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
677 more flexibility in analysing vectors than standard Vector ISAs. Normal
678 Vector ISAs are typically restricted to "were all results nonzero" and
679 "were some results nonzero". The application of mapreduce to Vectorised
680 cr operations allows far more sophisticated analysis, particularly in
681 conjunction with the new crweird operations: see [[sv/cr_int_predication]].
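
As a sketch of the principle (using the same pseudocode conventions as the
example above, not actual assembler), a VSX-CR6-style "were all results
zero?" test becomes a mapreduce over the `eq` bits of the CR Vector
produced by the earlier Rc=1 operation:

    all_zero = 1
    for i in range(VL):
        all_zero = all_zero & CRs{8+i}.eq   # the effect of crand in mapreduce mode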
682
683 Note in particular that the use of a separate instruction in this way
684 ensures that high performance multi-issue OoO implementations do not
685 have the computation of the cumulative analysis CR as a bottleneck and
686 hindrance, regardless of the length of VL.
687
688 Additionally,
689 SVP64 [[sv/branches]] may be used, even when the branch itself is to
690 the following instruction. The combined side-effects of CTR reduction
691 and VL truncation provide several benefits.
692
693 (see [[discussion]]. some alternative schemes are described there)
694
695 ## Rc=1 when SUBVL!=1
696
697 Sub-vectors are effectively a form of Packed SIMD (length 2 to 4). Only 1 bit of
698 predicate is allocated per subvector; likewise only one CR is allocated
699 per subvector.
700
701 This leaves a conundrum as to how to apply CR computation per subvector,
702 when normally Rc=1 is exclusively applied to scalar elements. A solution
703 is to perform a bitwise OR or AND of the subvector tests. Given that
704 OE is ignored in SVP64, this field may (when available) be used to select OR or
705 AND behavior.
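
A sketch of the idea for SUBVL=2 (vec2), using the same pseudocode
conventions as elsewhere in this appendix:

    for i in range(VL):
        x_test = (iregs[RT+i].x == 0)       # per-subelement Rc=1 "eq" test
        y_test = (iregs[RT+i].y == 0)
        CRs{8+i}.eq = x_test | y_test       # OR form; the AND form uses &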
706
707 ### Table of CR fields
708
709 CRn is the notation used by the OpenPower spec to refer to CR field #n,
710 so FP instructions with Rc=1 write to CR1 (n=1).
711
712 CRs are not stored in SPRs: they are registers in their own right.
713 Therefore context-switching the full set of CRs involves a Vectorised
714 mfcr or mtcr, using VL=8 to do so. This is exactly how
715 scalar OpenPOWER context-switches CRs: it is just that there are now
716 more of them.
717
718 The 64 SV CRs are arranged similarly to the way the 128 integer registers
719 are arranged. TODO a python program that auto-generates a CSV file
720 which can be included in a table, which is in a new page (so as not to
721 overwhelm this one). [[svp64/cr_names]]
722
723 # Register Profiles
724
725 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
726 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
727
728 Instructions are broken down by Register Profiles as listed in the
729 following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
730 indicates that the operations with this Register Profile cannot be
731 Vectorised (mtspr, bc, dcbz, twi)
732
733 TODO generate table which will be here [[svp64/reg_profiles]]
734
735 # SV pseudocode illustration
736
737 ## Single-predicated Instruction
738
739 Illustration of a normal-mode add operation: zeroing not included, elwidth
740 overrides not included. If there is no predicate, it is set to all 1s:
741
742 function op_add(rd, rs1, rs2) # add not VADD!
743 int i, id=0, irs1=0, irs2=0;
744 predval = get_pred_val(FALSE, rd);
745 for (i = 0; i < VL; i++)
746 STATE.srcoffs = i # save context
747 if (predval & 1<<i) # predication uses intregs
748 ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
749 if (!int_vec[rd].isvec) break;
750 if (rd.isvec) { id += 1; }
751 if (rs1.isvec) { irs1 += 1; }
752 if (rs2.isvec) { irs2 += 1; }
753 if (id == VL or irs1 == VL or irs2 == VL)
754 {
755 # end VL hardware loop
756 STATE.srcoffs = 0; # reset
757 return;
758 }
759
760 This has several modes:
761
762 * RT.v = RA.v RB.v
763 * RT.v = RA.v RB.s (and RA.s RB.v)
764 * RT.v = RA.s RB.s
765 * RT.s = RA.v RB.v
766 * RT.s = RA.v RB.s (and RA.s RB.v)
767 * RT.s = RA.s RB.s
768
769 All of these may be predicated. Vector-Vector is straightforward.
770 When one source is a Vector and the other a Scalar, it is clear that
771 each element of the Vector source should be added to the Scalar source,
772 each result placed into the Vector (or, if the destination is a scalar,
773 only the first nonpredicated result).
774
775 The one that is not obvious is RT=vector but both RA/RB=scalar.
776 Here this acts as a "splat scalar result", copying the same result into
777 all nonpredicated result elements. If a fixed destination scalar was
778 intended, then an all-Scalar operation should be used.
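
A sketch of the splat effect, reusing the names from the pseudocode above:

    # RT.v = RA.s + RB.s: the same scalar result lands in every
    # non-masked-out destination element (sources do not advance)
    result = ireg[rs1] + ireg[rs2]
    for i in range(VL):
        if predval & (1 << i):
            ireg[rd + i] = result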
779
780 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
781
782 # Assembly Annotation
783
784 Assembly code annotation is required for SV to be able to successfully
785 mark instructions as "prefixed".
786
787 A reasonable (prototype) starting point:
788
789 svp64 [field=value]*
790
791 Fields:
792
793 * ew=8/16/32 - element width
794 * sew=8/16/32 - source element width
795 * vec=2/3/4 - SUBVL
796 * mode=reduce/satu/sats/crpred
797 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
798 * spred={reg spec}
799
800 This is similar to the x86 "REX" prefix.
801
802 For actual assembler:
803
804 sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
805
806 Qualifiers:
807
808 * m={pred}: predicate mask mode
809 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
810 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
811 * ew={N}: ew=8/16/32 - sets elwidth override
812 * sw={N}: sw=8/16/32 - sets source elwidth override
813 * ff={xx}: see fail-first mode
814 * pr={xx}: see predicate-result mode
815 * sat{x}: satu / sats - see saturation mode
816 * mr: see map-reduce mode
817 * mr.svm see map-reduce with sub-vector mode
818 * crm: see map-reduce CR mode
819 * crm.svm see map-reduce CR with sub-vector mode
820 * sz: predication with source-zeroing
821 * dz: predication with dest-zeroing
822
823 For modes:
824
825 * pred-result:
826 - pm=lt/gt/le/ge/eq/ne/so/ns OR
827 - pm=RC1 OR pm=~RC1
828 * fail-first
829 - ff=lt/gt/le/ge/eq/ne/so/ns OR
830 - ff=RC1 OR ff=~RC1
831 * saturation:
832 - sats
833 - satu
834 * map-reduce:
835 - mr OR crm: "normal" map-reduce mode or CR-mode.
836 - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
837
838 # Proposed Parallel-reduction algorithm
839
840 **This algorithm contains a MV operation and may NOT be used. Removal
841 of the MV operation may be achieved by using index-redirection as was
842 achieved in DCT and FFT REMAP**
843
844 ```
845 /// reference implementation of proposed SimpleV reduction semantics.
846 ///
847 // reduction operation -- we still use this algorithm even
848 // if the reduction operation isn't associative or
849 // commutative.
850 XXX VIOLATION OF SVP64 DESIGN PRINCIPLES XXXX
851 /// XXX `pred` is a user-visible Vector Condition register XXXX
852 XXX VIOLATION OF SVP64 DESIGN PRINCIPLES XXXX
853 ///
854 /// all input arrays have length `vl`
855 def reduce(vl, vec, pred):
856 pred = copy(pred) # must not damage predicate
857 step = 1;
858 while step < vl
859 step *= 2;
860 for i in (0..vl).step_by(step)
861 other = i + step / 2;
862 other_pred = other < vl && pred[other];
863 if pred[i] && other_pred
864 vec[i] += vec[other];
865 else if other_pred
866 XXX VIOLATION OF SVP64 DESIGN XXX
867 XXX vec[i] = vec[other]; XXX
868 XXX VIOLATION OF SVP64 DESIGN XXX
869 pred[i] |= other_pred;
870 ```
871
872 The first principle in SVP64 being violated is that SVP64 is a fully-independent
873 Abstraction of hardware-looping in between issue and execute phases
874 that has no relation to the operation it issues. The above pseudocode
875 conditionally changes not only the type of element operation issued
876 (a MV in some cases) but also the number of arguments (2 for a MV).
877 At the very least, for Vertical-First Mode this will result in unanticipated and unexpected behaviour (maximising "surprises" for programmers) in
878 the middle of loops, which would be far too hard to explain.
879
880 The second principle being violated by the above algorithm is the expectation
881 that temporary storage is available for a modified predicate: there is no
882 such space, and predicates are read-only to reduce complexity at the
883 micro-architectural level.
884 SVP64 is founded on the principle that all operations are
885 "re-entrant" with respect to interrupts and exceptions: SVSTATE must
886 be saved and restored alongside PC and MSR, but nothing more. It is perfectly
887 fine to have context-switching back to the operation be somewhat slower,
888 through "reconstruction" of temporary internal state based on what SVSTATE
889 contains, but nothing more.
890
891 An alternative algorithm is therefore required that does not perform MVs,
892 and does not require additional state to be saved on context-switching.
893
894 ```
895 def reduce( vl, vec, pred ):
896 pred = copy(pred) # must not damage predicate
897 j = 0
898 vi = [] # array of lookup indices to skip nonpredicated
899 for i, pbit in enumerate(pred):
900 if pbit:
901 vi[j] = i
902 j += 1
903 step = 2
904 while step <= vl
905 halfstep = step // 2
906 for i in (0..vl).step_by(step)
907 other = vi[i + halfstep]
908 ir = vi[i]
909 other_pred = other < vl && pred[other]
910 if pred[i] && other_pred
911 vec[ir] += vec[other]
912 else if other_pred:
913 vi[ir] = vi[other] # index redirection, no MV
914 pred[ir] |= other_pred # reconstructed on context-switch
915 step *= 2
916 ```
917
918 In this version the need for an explicit MV is made unnecessary by instead
919 leaving elements *in situ*. The internal modifications to the predicate may,
920 due to the reduction being entirely deterministic, be "reconstructed"
921 on a context-switch. This may make some implementations slower.
922
923 *Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
924 implemented in hardware with MVs that ensure lane-crossing is minimised.
925 The mistake which would be catastrophic to SVP64 to make is to then
926 limit the Reduction Sequence for all implementors
927 based solely and exclusively on what one
928 specific internal microarchitecture does.
929 In SIMD ISAs the internal SIMD Architectural design is exposed and imposed on the programmer. Cray-style Vector ISAs on the other hand provide convenient,
930 compact and efficient encodings of abstract concepts.
931 It is the Implementor's responsibility to produce a design
932 that complies with the above algorithm,
933 utilising internal Micro-coding and other techniques to transparently
934 insert MV operations
935 if necessary or desired, to give the level of efficiency or performance
936 required.*
937
938 # Element-width overrides
939
940 Element-width overrides are best illustrated with a packed structure
941 union in the c programming language. The following should be taken
942 literally, and assume always a little-endian layout:
943
944 typedef union {
945 uint8_t b[];
946 uint16_t s[];
947 uint32_t i[];
948 uint64_t l[];
949 uint8_t actual_bytes[8];
950 } el_reg_t;
951
952 el_reg_t int_regfile[128];
953
954 get_polymorphed_reg(reg, bitwidth, offset):
955 el_reg_t res;
956 res.l = 0; // TODO: going to need sign-extending / zero-extending
957 if bitwidth == 8:
958 res.b = int_regfile[reg].b[offset]
959 elif bitwidth == 16:
960 res.s = int_regfile[reg].s[offset]
961 elif bitwidth == 32:
962 res.i = int_regfile[reg].i[offset]
963 elif bitwidth == 64:
964 res.l = int_regfile[reg].l[offset]
965 return res
966
967 set_polymorphed_reg(reg, bitwidth, offset, val):
968 if (!reg.isvec):
969 # not a vector: first element only, overwrites high bits
970 int_regfile[reg].l[0] = val
971 elif bitwidth == 8:
972 int_regfile[reg].b[offset] = val
973 elif bitwidth == 16:
974 int_regfile[reg].s[offset] = val
975 elif bitwidth == 32:
976 int_regfile[reg].i[offset] = val
977 elif bitwidth == 64:
978 int_regfile[reg].l[offset] = val
979
980 In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
981 to fp127) are reinterpreted to be "starting points" in a byte-addressable
982 memory. Vectors - which become just a virtual naming construct - effectively
983 overlap.
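
For example (a sketch using the union above): with an element width of 8
and VL=20, a vector starting at r2 occupies twenty consecutive bytes of
that storage, spilling over into what would, in scalar terms, be r3 and
half of r4:

    # byte-wide elements: element j of a vector starting at r2 lives at
    # byte j of the storage that begins at register 2
    for j in range(20):                       # VL=20, ew=8
        int_regfile[2].b[j] = source_bytes[j] # bytes 0-7 r2, 8-15 r3, 16-19 r4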
984
985 It is extremely important for implementors to note that the only circumstance
986 where upper portions of an underlying 64-bit register are zero'd out is
987 when the destination is a scalar. The ideal register file has byte-level
988 write-enable lines, just like most SRAMs.
989
990 An example ADD operation with predication and element width overrides:
991
992 for (i = 0; i < VL; i++)
993 if (predval & 1<<i) # predication
994 src1 = get_polymorphed_reg(RA, srcwid, irs1)
995 src2 = get_polymorphed_reg(RB, srcwid, irs2)
996 result = src1 + src2 # actual add here
997 set_polymorphed_reg(RT, destwid, ird, result)
998 if (!RT.isvec) break
999 if (RT.isvec) { ird += 1; }
1000 if (RA.isvec) { irs1 += 1; }
1001 if (RB.isvec) { irs2 += 1; }
1002
1003 # Twin (implicit) result operations
1004
1005 Some operations in the Power ISA already target two 64-bit scalar
1006 registers: `lq` for example. Some mathematical algorithms are more
1007 efficient when there are two outputs rather than one, providing
1008 feedback loops between elements. 64-bit multiply
1009 for example actually internally produces a 128 bit result, which clearly
1010 cannot be stored in a single 64 bit register. Some ISAs recommend
1011 "macro op fusion": the practice of setting a convention whereby if
1012 two commonly used instructions (mullo, mulhi) use the same ALU but
1013 one selects the low part of an identical operation and the other
1014 selects the high part, then optimised micro-architectures may
1015 "fuse" those two instructions together, using Micro-coding techniques,
1016 internally.
1017
1018 The practice and convention of macro-op fusion however is not compatible
1019 with SVP64 Horizontal-First, because Horizontal Mode may only
1020 be applied to a single instruction at a time. Thus it becomes
1021 necessary to add explicit more complex single instructions with
1022 more operands than would normally be seen in another ISA. If it
1023 was not for Power ISA already having LD/ST with update as well as
1024 Condition Codes and `lq` this would be hard to justify.
1025
1026 With limited space in the `EXTRA` Field, and Power ISA opcodes
1027 being only 32 bit, 5 operands is quite an ask. `lq` however sets
1028 a precedent: `RTp` stands for "RT pair". In other words the result
1029 is stored in RT and RT+1. For Scalar operations, following this
1030 precedent is perfectly reasonable. In Scalar mode,
1031 `madded` therefore stores the two halves of the 128-bit multiply
1032 into RT and RT+1.
1033
1034 What, then, of `sv.madded`? If the destination is hard-coded to
1035 RT and RT+1 the instruction is not useful when Vectorised because
1036 the output will be overwritten on the next element. The solution
1037 is easy: define the destination registers as RT and RT+MAXVL
1038 respectively. This makes it easy for compilers to statically allocate
1039 registers even when VL changes dynamically.
1040
1041 Bearing in mind that both RT and RT+MAXVL are starting points for Vectors,
1042 and that element-width overrides still have to be taken
1043 into consideration, the starting point for the implicit destination
1044 is best illustrated in pseudocode:
1045
1046 # demo of madded
1047 for (i = 0; i < VL; i++)
1048 if (predval & 1<<i) # predication
1049 src1 = get_polymorphed_reg(RA, srcwid, irs1)
1050 src2 = get_polymorphed_reg(RB, srcwid, irs2)
1051 src3 = get_polymorphed_reg(RC, srcwid, irs3)
1052 result = src1*src2 + src3
1053 destmask = (1<<destwid)-1
1054 # store two halves of result, second half at element offset MAXVL
1055 set_polymorphed_reg(RT, destwid, ird, result&destmask)
1056 set_polymorphed_reg(RT, destwid, ird+MAXVL, result>>destwid)
1057 if (!RT.isvec) break
1058 if (RT.isvec) { ird += 1; }
1059 if (RA.isvec) { irs1 += 1; }
1060 if (RB.isvec) { irs2 += 1; }
1061 if (RC.isvec) { irs3 += 1; }
1062
1063 The significant part here is that the second half is stored
1064 starting not from RT+MAXVL at all: it is the *element* index
1065 that is offset by MAXVL, both starting from RT.
1066
1067 * [[isa/svfixedarith]]
1068 * [[isa/svfparith]]
1069