[[!tag standards]]

# Appendix

* <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturation
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47> Parallel Prefix
* <https://bugs.libre-soc.org/show_bug.cgi?id=697> Reduce Modes
* <https://bugs.libre-soc.org/show_bug.cgi?id=864> parallel prefix simulator
* <https://bugs.libre-soc.org/show_bug.cgi?id=809> OV sv.addex discussion

This is the appendix to [[sv/svp64]], providing explanations of modes
etc., leaving the main svp64 page's primary purpose as outlining the
instruction format.

Table of contents:

[[!toc]]

# Partial Implementations

It is perfectly legal to implement subsets of SVP64 as long as illegal
instruction traps are always raised on unimplemented features,
so that soft-emulation is possible,
even for future revisions of SVP64. With SVP64 being partly controlled
through contextual SPRs, a little care has to be taken.

**All** SPRs
not implemented, including reserved ones for future use, must raise an
illegal
instruction trap if read or written. This allows software the
opportunity to emulate the context created by the given SPR.

See [[sv/compliancy_levels]] for full details.

# XER, SO and other global flags

Vector systems are expected to be high performance. This is achieved
through parallelism, which requires that elements in the vector be
independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
Read-Write Hazards on single-bit global resources, having a significant
detrimental effect.

Consequently in SV, XER.SO behaviour is disregarded (including
in `cmp` instructions). XER.SO is not read, but XER.OV may be written,
breaking the Read-Modify-Write Hazard Chain that complicates
microarchitectural implementations.
This includes when `scalar identity behaviour` occurs. If precise
OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
instructions should be used without an SV Prefix.

TODO jacob add about OV https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

Of note here is that XER.SO and OV may already be disregarded in the
Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
SVP64 simply makes it mandatory to disregard XER.SO even for other Subsets,
but only for SVP64 Prefixed Operations.

XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
executing in precise Program Order, a little thought shows that a Vectorised
Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
Parallel Carry Lookahead.

    # assume VL=4, this results in 4 sequential ops (below)
    sv.adde r0.v, r4.v, r8.v

    # instructions that get executed in backend hardware:
    adde r0, r4, r8 # takes carry-in, produces carry-out
    adde r1, r5, r9 # takes carry from previous
    ...
    adde r3, r7, r11 # likewise

It can clearly be seen that the carry chains from one
64 bit add to the next, the end result being that a
256-bit "Big Integer Add" has been performed, and that
CA contains the 257th bit. A one-instruction 512-bit Add
may be performed by setting VL=8, and a one-instruction
1024-bit add by setting VL=16, and so on. More on
this in [[openpower/sv/biginteger]].
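
The chained behaviour may be modelled in python (a minimal sketch,
assuming VL=4, no predication and no elwidth overrides; XER.CA is
modelled as a single `ca` bit):

    # sketch: sv.adde r0.v, r4.v, r8.v with VL=4 is a 256-bit add.
    # iregs models the 64-bit GPRs; ca models XER.CA (one bit).
    mask = (1 << 64) - 1
    for i in range(4):                 # VL=4: r0-r3 = r4-r7 + r8-r11
        s = iregs[4 + i] + iregs[8 + i] + ca
        iregs[0 + i] = s & mask        # 64-bit result element
        ca = s >> 64                   # carry chains into the next element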

# v3.0B/v3.1 relevant instructions

SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
CPU ISA.

Vectorisation of the VSX Packed SIMD system makes no sense whatsoever,
the sole exceptions potentially being any operations with 128-bit
operands such as `vrlq` (Rotate Quad Word) and `xsaddqp` (Scalar
Quad-precision Add).
SV effectively *replaces* the majority of VSX, requiring far fewer
instructions, and provides, at the very minimum, predication
(which VSX was designed without).

Likewise, Load/Store Multiple make no sense to have: not only are they
provided by SV, the SV alternatives may be predicated as well, making
them far better suited to use in function
calls and context-switching.

Additionally, some v3.0/1 instructions simply make no sense at all in a
Vector context: `rfid` falls into this category,
as well as `sc` and `scv`. Here there is simply no point
trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
should be called instead.

Fortuitously this leaves several Major Opcodes free for use by SV
to fit alternative future instructions. In a 3D context this means
Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
operations, and others critical to an efficient, effective 3D GPU and
VPU ISA. With such instructions being included as standard in other
commercially-successful GPU ISAs it is likewise critical that a 3D
GPU/VPU based on svp64 also have such instructions.

Note however that svp64 is stand-alone and is in no way
critically dependent on the existence or provision of 3D GPU or VPU
instructions. These should be considered extensions, and their discussion
and specification is out of scope for this document.

Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
v3.1B is *not* altered by svp64 in any way.

## Major opcode map (v3.0B)

This table is taken from v3.0B.
Table 9: Primary Opcode Map (opcode bits 0:5)

```
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
    000 |        |        | tdi    | twi    | EXT04  |        |        | mulli | 000
    001 | subfic |        | cmpli  | cmpi   | addic  | addic. | addi   | addis | 001
    010 | bc/l/a | EXT17  | b/l/a  | EXT19  | rlwimi | rlwinm |        | rlwnm | 010
    011 | ori    | oris   | xori   | xoris  | andi.  | andis. | EXT30  | EXT31 | 011
    100 | lwz    | lwzu   | lbz    | lbzu   | stw    | stwu   | stb    | stbu  | 100
    101 | lhz    | lhzu   | lha    | lhau   | sth    | sthu   | lmw    | stmw  | 101
    110 | lfs    | lfsu   | lfd    | lfdu   | stfs   | stfsu  | stfd   | stfdu | 110
    111 | lq     | EXT57  | EXT58  | EXT59  | EXT60  | EXT61  | EXT62  | EXT63 | 111
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
```

## Suitable for svp64-only

This is the same table containing v3.0B Primary Opcodes except those that
make no sense in a Vectorisation Context have been removed. These removed
POs can, *in the SV Vector Context only*, be assigned to alternative
(Vectorised-only) instructions, including future extensions.
EXT04 retains the scalar `madd*` operations but would have all PackedSIMD
(aka VSX) operations removed.

Note, again, to emphasise: outside of svp64 these opcodes **do not**
change. When not prefixed with svp64 these opcodes **specifically**
retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.

```
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
    000 |        |        |        |        | EXT04  |        |        | mulli | 000
    001 | subfic |        | cmpli  | cmpi   | addic  | addic. | addi   | addis | 001
    010 | bc/l/a |        |        | EXT19  | rlwimi | rlwinm |        | rlwnm | 010
    011 | ori    | oris   | xori   | xoris  | andi.  | andis. | EXT30  | EXT31 | 011
    100 | lwz    | lwzu   | lbz    | lbzu   | stw    | stwu   | stb    | stbu  | 100
    101 | lhz    | lhzu   | lha    | lhau   | sth    | sthu   |        |       | 101
    110 | lfs    | lfsu   | lfd    | lfdu   | stfs   | stfsu  | stfd   | stfdu | 110
    111 |        |        | EXT58  | EXT59  |        | EXT61  |        | EXT63 | 111
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
```

It is important to note that having an SVP64 opcode whose meaning
differs from its v3.0B Scalar counterpart is highly undesirable: the
complexity
in the decoder is greatly increased.

# EXTRA Field Mapping

The purpose of the 9-bit EXTRA field mapping is to mark individual
registers (RT, RA, BFA) as either scalar or vector, and to extend
their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
Three of the 9 bits may also be used up for a 2nd Predicate (Twin
Predication) leaving a mere 6 bits for qualifying registers. As can
be seen there is significant pressure on these (and in fact all) SVP64
bits.

In Power ISA v3.1 prefixing there are bits which describe and classify
the prefix in a fashion that is independent of the suffix. MLSS for
example. For SVP64 there is insufficient space to make the SVP64 Prefix
"self-describing", and consequently every single Scalar instruction
had to be individually analysed, by rote, to craft an EXTRA Field Mapping.
This process was semi-automated and is described in this section.
The final results, which are part of the SVP64 Specification, are here:

* [[openpower/opcode_regs_deduped]]

Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
from reading the markdown formatted version of the Scalar pseudocode
which is machine-readable and found in [[openpower/isatables]]. The
analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
for example is given a designation `RM-2R-1W` because it requires
two GPR reads and one GPR write.

Secondly, the total number of registers was added up (2R-1W is 3 registers)
and if less than or equal to three then that instruction could be given an
EXTRA3 designation. Four or more is given an EXTRA2 designation because
there are only 9 bits available.

Thirdly, the instruction was analysed to see if Twin or Single
Predication was suitable. As a general rule this was if there
was only a single operand and a single result (`exts*` and LD/ST);
however it was found that some 2- or 3-operand instructions also
qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
in Twin Predication, some compromises were made here. LDST is
Twin but also has 3 operands in some operations, so only EXTRA2 can be
used.

Fourthly, a packing format was decided: for 2R-1W, for example, an
EXTRA3 indexing was chosen
where RA is indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update)
RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
(because it is possible to do, and perceived to be useful). Rc=1
co-results (CR0, CR1) are always given the same EXTRA index as their
main result (RT, FRT).

Fifthly, in an automated process the results of the analysis
were output in CSV Format for use in machine-readable form
by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>

This process was laborious but logical, and, crucially, once a
decision is made (and ratified) it cannot be reversed.
Those qualifying future Power ISA Scalar instructions for SVP64
are **strongly** advised to utilise this same process and the same
sv_analysis.py program as a canonical method of maintaining the
relationships. Alterations to that same program which
change the Designation are **prohibited** once finalised (ratified
through the Power ISA WG Process). It would
be similar to deciding that `add` should be changed from X-Form
to D-Form.

# Single Predication <a name="1p"> </a>

This is a standard mode normally found in Vector ISAs. Every element
in every source Vector and in the destination uses the same bit of one
single predicate mask.

In SVSTATE, for Single-predication, implementors MUST increment both
srcstep and dststep, but depending on whether sz and/or dz are set,
srcstep and
dststep can still potentially become different indices. Only when sz=dz
is srcstep guaranteed to equal dststep at all times.

Note that in some Mode Formats there is only one flag (zz). This indicates
that *both* sz *and* dz are set to the same value.
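
The progression of srcstep and dststep may be modelled in python (a
minimal sketch: the element operation itself and the zeroing writes are
omitted, and `mask` is a list of VL predicate bits, element 0 first):

    # sketch: srcstep/dststep schedule under single predication.
    # with zeroing clear (sz=0 / dz=0) masked-out elements are skipped;
    # with zeroing set the step walks through them (masked-out sources
    # read as zero, masked-out destinations are written with zero).
    def stepping(VL, mask, sz, dz):
        srcstep, dststep = 0, 0
        while True:
            if not sz:   # src skips masked-out elements
                while srcstep < VL and not mask[srcstep]: srcstep += 1
            if not dz:   # dst skips masked-out elements
                while dststep < VL and not mask[dststep]: dststep += 1
            if srcstep >= VL or dststep >= VL:
                break    # loop ends when either step reaches VL
            yield srcstep, dststep
            srcstep += 1
            dststep += 1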

Example 1:

* VL=4
* mask=0b1101
* sz=1, dz=0

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 1 | 2 | sz=1 but dz=0: dst skips mask[1], src does not |
| 2 | 3 | mask[src=2] and mask[dst=3] are 1 |
| end | end | loop has ended because dst reached VL-1 |

Example 2:

* VL=4
* mask=0b1101
* sz=0, dz=1

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 1 | sz=0 but dz=1: src skips mask[1], dst does not |
| 3 | 2 | mask[src=3] and mask[dst=2] are 1 |
| end | end | loop has ended because src reached VL-1 |

In both these examples it is crucial to note that despite there being
a single predicate mask, with sz and dz being different, srcstep and
dststep are being requested to react differently.

Example 3:

* VL=4
* mask=0b1101
* sz=0, dz=0

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 2 | sz=0 and dz=0: both src and dst skip mask[1] |
| 3 | 3 | mask[src=3] and mask[dst=3] are 1 |
| end | end | loop has ended because src and dst reached VL-1 |

Here, both srcstep and dststep remain in lockstep because sz=dz=0.

# EXTRA Pack/Unpack Modes

The pack/unpack concept of VSX `vpack` is abstracted out as a Sub-Vector
reordering Schedule, named `RM-2P-1S1D-PU`.
The usual RM-2P-1S1D is reduced from EXTRA3 to EXTRA2, making
room for 2 extra bits that enable either "packing" or "unpacking"
on the subvectors vec2/3/4.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For pack/unpack (again, no elwidth overrides):

    # yield the element offsets with SUBVL as the outer loop
    # (pack/unpack) or as the usual inner loop
    def index_p(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    yield i*SUBVL+j
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    yield i*SUBVL+j

    # walk through both source and dest indices simultaneously
    for src_idx, dst_idx in zip(index_p(PACK), index_p(UNPACK)):
        move_operation(RT+dst_idx, RA+src_idx)

Python's `yield` is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.
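
A worked example for VL=2 with vec3 (SUBVL=3), `PACK_en` set and
`UNPACK_en` clear (derived by hand from the pseudocode above):

    # index_p(True)  yields 0, 3, 1, 4, 2, 5   (source offsets)
    # index_p(False) yields 0, 1, 2, 3, 4, 5   (dest offsets)
    # element moves performed, in order:
    #   RT+0 <- RA+0    RT+1 <- RA+3    RT+2 <- RA+1
    #   RT+3 <- RA+4    RT+4 <- RA+2    RT+5 <- RA+5
    # i.e. the x/y/z subelements of the two vec3 source elements are
    # gathered ("packed") into RT as all x, then all y, then all z.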

Setting of both `PACK_en` and `UNPACK_en` is neither prohibited nor
`UNDEFINED` because the reordering is fully deterministic, and
additional REMAP reordering may be applied. For Matrix this would
give potentially up to 4 Dimensions of reordering.

Pack/Unpack applies to mv operations and some other single-source
single-destination operations such as Indexed LD/ST and extsw.
[[sv/mv.swizzle]] has a slightly different pseudocode algorithm
for Vertical-First Mode.

# Twin Predication <a name="2p"> </a>

This is a novel concept that allows predication to be applied to a single
source and a single dest register. The following types of traditional
Vector operations may be encoded with it, *without requiring explicit
opcodes to do so*:

* VSPLAT (a single scalar distributed across a vector)
* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))

Those patterns (and more) may be applied to:

* mv (the usual way that V\* ISA operations are created)
* exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

This is a huge list that creates extremely powerful combinations,
particularly given that one of the predicate options is `(1<<r3)`.

Additional unusual capabilities of Twin Predication include a back-to-back
version of VCOMPRESS-VEXPAND which is effectively the ability to do
sequentially ordered multiple VINSERTs. The source predicate selects a
sequentially ordered subset of elements to be inserted; the destination
predicate specifies the sequentially ordered recipient locations.
This is equivalent to
`llvm.masked.compressstore.*`
followed by
`llvm.masked.expandload.*`
with a single instruction (see the sketch below).

This extreme power and flexibility comes down to the fact that SVP64
is not actually a Vector ISA: it is a loop-abstraction-concept that
is applied *in general* to Scalar operations, just like the x86
`REP` prefix (if put on steroids).
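
A minimal python sketch of twin-predicated `sv.mv` (zeroing and elwidth
overrides omitted; `smask`/`dmask` are the source and destination
predicates as lists of bits):

    # sketch: twin-predicated move, the basis of VCOMPRESS-VEXPAND.
    # masked-out source elements are skipped over (compress);
    # masked-out destination elements are skipped over (expand).
    def twin_mv(VL, smask, dmask, RA, RT, iregs):
        srcstep, dststep = 0, 0
        while True:
            while srcstep < VL and not smask[srcstep]: srcstep += 1
            while dststep < VL and not dmask[dststep]: dststep += 1
            if srcstep >= VL or dststep >= VL:
                break
            iregs[RT + dststep] = iregs[RA + srcstep]
            srcstep += 1
            dststep += 1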

# Reduce modes

Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
Vector ISA would have explicit Reduce opcodes with defined characteristics
per operation: in SX Aurora there is even an additional scalar argument
containing the initial reduction value, and the default is either 0
or 1 depending on the specifics of the explicit opcode.
SVP64 fundamentally has to
utilise *existing* Scalar Power ISA v3.0B operations, which presents some
unique challenges.

The solution turns out to be to simply define reduction as permitting
deterministic element-based schedules to be issued using the base Scalar
operations, and to rely on the underlying microarchitecture to resolve
Register Hazards at the element level. This goes back to
the fundamental principle that SV is nothing more than a Sub-Program-Counter
sitting between Decode and Issue phases.

Microarchitectures *may* take opportunities to parallelise the reduction
but only if in doing so they preserve Program Order at the Element Level.
Opportunities where this is possible include an `OR` operation
or a MIN/MAX operation. For Floating Point however parallelisation is
not permitted, because different results would be obtained
if the reduction is not executed in strict Program-Sequential
Order.

In essence it becomes the programmer's responsibility to leverage the
pre-determined schedules to desired effect.

## Scalar result reduction and iteration

Scalar Reduction per se does not exist: instead it is implemented in SVP64
as a simple and natural relaxation of the usual restriction on the Vector
Looping which would terminate if the destination was marked as a Scalar.
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
even though the destination register is marked as scalar.
Thus it is up to the programmer to be aware of this, observe some
conventions, and thus end up achieving the desired outcome of scalar
reduction.

It is also important to appreciate that there is no
actual imposition or restriction on how this mode is utilised: there
will therefore be several valuable uses (including Vector Iteration
and "Reverse-Gear")
and it is up to the programmer to make best use of the
(strictly deterministic) capability
provided.

In this mode, which is suited to operations involving carry or overflow,
the programmer must assign one register, by convention, to be the
"accumulator". Scalar reduction is thus characterised by:

* One of the sources is a Vector
* the destination is a scalar
* optionally but most usefully when one source scalar register is
  also the scalar destination (which may be informally termed
  the "accumulator")
* That the source register type is the same as the destination register
  type identified as the "accumulator". Scalar reduction on `cmp`,
  `setb` or `isel` makes no sense for example because of the mixture
  between CRs and GPRs.

*Note that issuing instructions in Scalar reduce mode such as `setb`
is neither `UNDEFINED` nor prohibited, despite them not making much
sense at first glance.
Scalar reduce is strictly defined behaviour, and the cost in
hardware terms of prohibition of seemingly non-sensical operations is
too great.
Therefore it is permitted and required to be executed successfully.
Implementors **MAY** choose to optimise such instructions in instances
where their use results in "extraneous execution", i.e. where it is clear
that the sequence of operations, comprising multiple overwrites to
a scalar destination **without** cumulative, iterative, or reductive
behaviour (no "accumulator"), may discard all but the last element
operation. Identification
of such is trivial to do for `setb` and `cmp`: the source register type is
a completely different register file from the destination.
Likewise Scalar reduction when the destination is a Vector
is as if the Reduction Mode was not requested.*

Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements of the vector starting at r10.

    # add RT, RA, RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA

However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`)
SV ordinarily
**terminates** at the first scalar operation. Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.

To perform the loop in reverse order, the `RG` (reverse gear) bit must
be set. This may be useful in situations where the results may be
different
(floating-point) if executed in a different order. Given that there is
no actual prohibition on Reduce Mode being applied when the destination
is a Vector, the "Reverse Gear" bit turns out to be a way to apply
Iterative
or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
for example will start at the opposite end of the Vector and push
a cumulative series of overlapping add operations into the Execution
units of
the underlying hardware.
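
A sketch of that reverse-gear example in python (assuming VL=4 and no
predication):

    # sketch: sv.add/rg r3.v, r4.v, r4.v with VL=4.
    # elements are issued in reverse, each overlapping the next:
    #   add r6, r7, r7
    #   add r5, r6, r6    <- reads the r6 just written
    #   add r4, r5, r5
    #   add r3, r4, r4
    for i in reversed(range(4)):
        iregs[3 + i] = iregs[4 + i] + iregs[4 + i]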

Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required (see [[sv/bitmanip]],
bmset),
as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted, has multiple
applications including Video, cryptography and JIT compilation.

    # assume VL=4:
    # * Vector of shift-offsets contained in RC (r12.v)
    # * Vector of masks contained in RB (r8.v)
    # * Vector of values to be masked-in in RA (r4.v)
    # * Scalar destination RT (r0) to receive all mask-offset values
    sv.bmset/mr r0, r4.v, r8.v, r12.v

Due to the Deterministic Scheduling,
Subtract and Divide are still permitted to be executed in this mode,
although from an algorithmic perspective it is strongly discouraged.
It would be better to use addition followed by one final subtract,
or in the case of divide, to get better accuracy, to perform a multiply
cascade followed by a final divide.

Note that single-operand or three-operand scalar-dest reduce is perfectly
well permitted: the programmer may still declare one register, used as
both a Vector source and Scalar destination, to be utilised as
the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
this naturally fits well with the normal expected usage of these
operations.

If an interrupt or exception occurs in the middle of the scalar mapreduce,
the scalar destination register **MUST** be updated with the current
(intermediate) result, because this is how `Program Order` is
preserved (Vector Loops are to be considered to be just another way of
issuing instructions
in Program Order). In this way, after return from interrupt,
the scalar mapreduce may continue where it left off. This provides
"precise" exception behaviour.

Note that hardware is perfectly permitted to perform multi-issue
parallel optimisation of the scalar reduce operation: it's just that
as far as the user is concerned, all exceptions and interrupts **MUST**
be precise.

## Vector result reduce mode

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction.

Vector-result reduction **requires**
the destination to be a Vector, which will be used to store
intermediary results.

Given that the tree-reduction schedule is deterministic,
Interrupts and exceptions
can therefore also be precise. The final result will be in the first
non-predicate-masked-out destination element, but due again to
the deterministic schedule programmers may find uses for the intermediate
results.

When Rc=1 a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result and its CR Field
are stored "as usual" exactly as all other SVP64 Rc=1 operations.

Note that the Schedule only makes sense on top of certain instructions:
X-Form with a Register Profile of `RT,RA,RB` is fine. Like Scalar
Reduction, nothing is prohibited:
the results of execution on an unsuitable instruction may simply
not make sense. Unlike in Scalar Reduction, many 3-input instructions
(madd, fmadd) in particular do not make sense, but `ternlogi`, if used
with care, would.

**Parallel-Reduction with Predication**

To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
completely separate from the actual element-level (scalar) operations,
Move operations are **not** included in the Schedule. This means that
the Schedule leaves the final (scalar) result in the first-non-masked
element of the Vector used. With the predicate mask being dynamic
(but deterministic) this result could be anywhere.

If that result is needed to be moved to a (single) scalar register
then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
needed to get it, where the predicate is the exact same predicate used
in the prior Parallel-Reduction instruction. For *some* implementations
this may be a slow operation. It may be better to perform a pre-copy
of the values, compressing them (VREDUCE-style) into a contiguous block,
which will guarantee that the result goes into the very first element
of the destination vector.

**Usage conditions**

The simplest usage is to perform an overwrite, specifying all three
register operands the same.

    setvl VL=6
    sv.add/vr 8.v, 8.v, 8.v

The Reduction Schedule will issue the Parallel Tree Reduction spanning
registers 8 through 13, by adjusting the offsets to RT, RA and RB as
necessary (see "Parallel Reduction algorithm" in a later section).

A non-overwrite is possible as well but just as with the overwrite
version, only those destination elements necessary for storing
intermediary computations will be written to: the remaining elements
will **not** be overwritten and will **not** be zero'd.

    setvl VL=4
    sv.add/vr 0.v, 8.v, 8.v

## Sub-Vector Horizontal Reduction

Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2, assuming RA==RT,
e.g. `sv.add/mr/vec2 r4, r4, r16`:

    for i in range(0, VL):
        # RA==RT in the instruction; it does not have to be
        iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
        iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)

Thus logically there is nothing special or unanticipated about
`SVM=0`: it is expected behaviour according to standard SVP64
Sub-Vector rules.

By contrast, when SVM is set and SUBVL!=1, a Horizontal
Subvector mode is enabled, which behaves very much more
like a traditional Vector Processor Reduction instruction.

Example for a vec2:

    for i in range(VL):
        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)

Example for a vec3:

    for i in range(VL):
        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].z)

Example for a vec4:

    for i in range(VL):
        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].z)
        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].w)

In this mode, when Rc=1 the Vector of CRs is as normal: each result
element creates a corresponding CR element (for the final, reduced,
result).

Note that the destination (RT) is automatically used as an "Accumulator"
register, and consequently the Sub-Vector Loop is interruptible.
If RT is a Scalar then as usual the main VL Loop terminates at the
first predicated element (or the first element if unpredicated).

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST
(see [[sv/ldst]]),
the other for arithmetic operations (actually, CR-driven)
([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.

* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
  ordinary one. Exceptions occur "as normal". However for elements 1
  and above, if an exception would occur, then VL is **truncated** to the
  previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
  CR-creating operation produces a result (including cmp). Similar to
  branch, an analysis of the CR is performed and if the test fails, the
  vector operation terminates and discards all element operations
  above the current one (and the current one if VLi is not set),
  and VL is truncated to either
  the *previous* element or the current one, depending on whether
  VLi (VL "inclusive") is set.

Thus the new VL comprises a contiguous vector of results,
all of which pass the testing criteria (equal to zero, less than zero).

The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
however requires speculative execution involving modifications of VL
to get high performance implementations. An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`. The CR is stored (and the CR.eq bit tested
against the `inv` field).
If the CR.eq bit is equal to `inv` then the Vector is truncated and
the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.

VLi is only available as an option when `Rc=0` (or for instructions
which do not have Rc). When set, the current element is always
also included in the count (the new length that VL will be set to).
This may be useful in combination with "inv" to truncate the Vector
to *exclude* elements that fail a test, or, in the case of
implementations
of strncpy, to include the terminating zero.
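
A simplified python sketch of the CR-driven variant (predication,
zeroing and elwidth omitted; `calc_cr_field` and `test_bit` are
illustrative names standing in for CR creation and the BO-style
selection of one CR bit, not normative ones):

    # sketch: CR-based data-dependent fail-first on an sv.add
    for i in range(VL):
        result = iregs[RA + i] + iregs[RB + i]
        cr = calc_cr_field(result)       # eq/lt/gt/so from the result
        failed = (test_bit(cr) == inv)   # does element i fail the test?
        if not failed or VLi:            # failing element kept if VLi
            iregs[RT + i] = result       # (with RC1=1 only cr is stored)
        if failed:
            VL = i + 1 if VLi else i     # truncate VL: the loop ends here
            break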

In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
tests this may be insufficient. If that is the case, vectorised crops
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.

One extremely important aspect of ffirst is:

* LDST ffirst may never set VL equal to zero. This is because on the
  first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV that VL may be
  set
  to zero (with the exception of via the SVSTATE SPR). When VL is set
  zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.

Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
to a nonzero value for any implementation-specific reason. For example:
it is perfectly reasonable for implementations to alter VL when ffirst
LD or ST operations are initiated on a nonaligned boundary, such that
within a loop the subsequent iteration of that loop begins subsequent
ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
workloads or balance resources.

CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.

## Data-dependent fail-first on CR operations (crand etc)

Operations that actually produce or alter a CR Field as a result
do not also in turn have an Rc=1 mode. However it makes no
sense to try to test the 4 bits of a CR Field for being equal
or not equal to zero. Moreover, the result is already in the
form that is desired: it is a CR field. Therefore,
CR-based operations have their own SVP64 Mode, described
in [[sv/cr_ops]].

There are two primary different types of CR operations:

* Those which have a 3-bit operand field (referring to a CR Field)
* Those which have a 5-bit operand (referring to a bit within the
  whole 32-bit CR)

More details can be found in [[sv/cr_ops]].

# pred-result mode

Pred-result mode may not be applied on CR-based operations.

Although CR operations (mtcr, crand, cror) may be Vectorised and
predicated, pred-result mode applies only to operations that have
an Rc=1 mode, or for which an RC1 option makes sense.

Predicate-result merges common CR testing with predication, saving on
instruction count. In essence, a Condition Register Field test
is performed, and if it fails it is considered to have been
*as if* the destination predicate bit was zero. Given that
there are no CR-based operations that produce Rc=1 co-results,
there can be no pred-result mode for mtcr and other CR-based
instructions.

Arithmetic and Logical Pred-result, which does have Rc=1 or for which
RC1 Mode makes sense, is covered in [[sv/normal]].
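
A minimal sketch of the concept (illustrative names; zeroing and
elwidth overrides omitted):

    # sketch: predicate-result on an Rc=1 sv.add. a failing CR test
    # behaves as if the destination predicate bit had been zero.
    for i in range(VL):
        if predval & (1 << i):           # normal predication first
            result = iregs[RA + i] + iregs[RB + i]
            cr = calc_cr_field(result)
            if test_bit(cr) != inv:      # CR test passes:
                iregs[RT + i] = result   # store as normal
            # else: discard, exactly as if the predicate bit were 0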

# CR Operations

CRs are slightly more involved than INT or FP registers due to the
possibility for indexing individual bits (crops BA/BB/BT). Again however
the access pattern needs to be understandable in relation to v3.0B /
v3.1B
numbering, with a clear linear relationship and mapping existing when
SV is applied.

## CR EXTRA mapping table and algorithm <a name="cr_extra"></a>

Numbering relationships for CR fields are already complex due to being
in BE format (*the relationship is not clearly explained in the v3.0B
or v3.1 specification*). However with some care and consideration
the exact same mapping used for INT and FP regfiles may be applied,
just to the upper bits, as explained below. The notation
`CR{field number}` is used to indicate access to a particular
Condition Register Field (as opposed to the notation `CR[bit]`
which accesses one bit of the 32-bit Power ISA v3.0B
Condition Register).

`CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is
defined, in v3.0B pseudocode, as:

    CR{7-n} = CR[32+n*4:35+n*4]

For SVP64 the relationship for the sequential
numbering of elements is to the CR **fields** within
the CR Register, not to individual bits within the CR register.

In OpenPOWER v3.0/1, BT, BA and BB are all 5 bits (BF, by contrast, is
3 bits, selecting a CR Field directly). The top 3 bits (0:2)
select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
*in* that CR (LT/GT/EQ/SO). The numbering was determined (after 4 months
of
analysis and research) to be as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0

When it comes to applying SV, it is the CR\_reg number to which SV
EXTRA2/3
applies, **not** the CR\_bit portion (bits 3-4):

    if extra3_mode:
        spec = EXTRA3
    else:
        spec = EXTRA2<<1 | 0b0
    if spec[0]:
        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((BA >> 2)<<6) | # hi 3 bits shifted up
               (spec[1:2]<<4) | # to make room for these
               (BA & 0b11)      # CR_bit on the end
    else:
        # scalar constructs "00 spec[1:2] BA[0:4]"
        return (spec[1:2] << 5) | BA

Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified to as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0

Note here that the decoding pattern to determine CR\_bit does not change.
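
The remap may be sanity-checked with a short python function (a sketch;
`spec` here is the 3-bit EXTRA value with spec[0] as its most
significant bit):

    # sketch: compute the SV CR field index from BA and the EXTRA spec.
    def sv_cr_index(BA, spec):
        CR_index = 7 - (BA >> 2)               # top 3 bits, BE
        if spec & 0b100:                       # spec[0]: vector mode
            return (CR_index << 4) | ((spec & 0b11) << 2)
        return ((spec & 0b11) << 3) | CR_index # scalar mode

    # vector mode reaches CR fields 0-124 in increments of 4:
    assert sv_cr_index(0b00000, 0b100) == 112  # CR{7} -> CR112
    # scalar mode reaches CR fields 0-31 in increments of 1:
    assert sv_cr_index(0b00100, 0b011) == 30   # CR{6} -> CR30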

Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
simplify internal design. If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.

## CR fields as inputs/outputs of vector operations

CRs (or, the arithmetic operations associated with them)
may be marked as Vectorised or Scalar. When Rc=1 in arithmetic
operations
that have no explicit EXTRA to cover the CR, the CR is Vectorised if
the destination is Vectorised. Likewise if the destination is scalar
then so is the CR.

When Vectorised, the CR inputs/outputs are sequentially read/written
to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:

* implementations may rely on the Vector CRs being aligned to 8. This
  means that CRs may be read or written in aligned batches of 32 bits
  (8 CRs per batch), for high performance implementations.
* scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
  overwritten by vector Rc=1 operations except for very large VL
* CR-based predication, from CR32, is also not interfered with
  (except by large VL).

However when the SV result (destination) is marked as a scalar by the
EXTRA field the *standard* v3.0B behaviour applies: the accompanying
CR when Rc=1 is written to. This is CR0 for integer operations and CR1
for FP operations.

Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD
VSX which
has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
v3.0B scalar operations produce a **tuple** of element results: the
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:

    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
        # now calculate CR bits
        CRs{8+i}.eq = iregs[RT+i] == 0
        CRs{8+i}.gt = iregs[RT+i] > 0
        ... etc

If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides
far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations: see
[[sv/cr_int_predication]].

Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.

Additionally,
SVP64 [[sv/branches]] may be used, even when the branch itself is to
the following instruction. The combined side-effects of CTR reduction
and VL truncation provide several benefits.

(see [[discussion]]; some alternative schemes are described there.)

## Rc=1 when SUBVL!=1

Sub-vectors are effectively a form of Packed SIMD (length 2 to 4).
Only 1 bit of
predicate is allocated per subvector; likewise only one CR is allocated
per subvector.

This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements. A solution
is to perform a bitwise OR or AND of the subvector tests. Given that
OE is ignored in SVP64, this field may (when available) be used to
select OR or
AND behaviour.
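
A sketch of the idea for a vec2, under the stated assumption that OE
selects the combining operation (this is a proposal, not finalised
behaviour):

    # sketch: Rc=1 "eq" computation for one vec2 subvector element i.
    # the two per-subelement tests combine into a single CR bit.
    eq_x = (results[i].x == 0)
    eq_y = (results[i].y == 0)
    if OE:                           # AND: all subelements must pass
        CRs[i].eq = eq_x and eq_y
    else:                            # OR: any subelement passing suffices
        CRs[i].eq = eq_x or eq_y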

### Table of CR fields

CRn is the notation used by the OpenPower spec to refer to CR field #n,
so FP instructions with Rc=1 write to CR1 (n=1).

CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.

The 64 SV CRs are arranged similarly to the way the 128 integer registers
are arranged. TODO a python program that auto-generates a CSV file
which can be included in a table, which is in a new page (so as not to
overwhelm this one). [[svp64/cr_names]]

# Register Profiles

**NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
<https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.

Instructions are broken down by Register Profiles as listed in the
following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
indicates that the operations with this Register Profile cannot be
Vectorised (mtspr, bc, dcbz, twi).

TODO generate table which will be here [[svp64/reg_profiles]]

# SV pseudocode illustration

## Single-predicated Instruction

Illustration of normal mode add operation: zeroing not included, elwidth
overrides not included. If there is no predicate, it is set to all 1s.

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec) { irs1 += 1; }
        if (rs2.isvec) { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL)
        {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }

This has several modes:

* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s

All of these may be predicated. Vector-Vector is straightforward.
When one of the sources is a Vector and the other a Scalar, it is clear
that
each element of the Vector source should be added to the Scalar source,
each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).

The one that is not obvious is RT=vector but both RA/RB=scalar.
Here this acts as a "splat scalar result", copying the same result into
all nonpredicated result elements. If a fixed destination scalar was
intended, then an all-Scalar operation should be used.

See <https://bugs.libre-soc.org/show_bug.cgi?id=552>

# Assembly Annotation

Assembly code annotation is required for SV to be able to successfully
mark instructions as "prefixed".

A reasonable (prototype) starting point:

    svp64 [field=value]*

Fields:

* ew=8/16/32 - element width
* sew=8/16/32 - source element width
* vec=2/3/4 - SUBVL
* mode=mr/satu/sats/crpred
* pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne

similar to the x86 "rex" prefix.

For the actual assembler:

    sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s

Qualifiers:

* m={pred}: predicate mask mode
* sm={pred}: source-predicate mask mode (only allowed in
  Twin-predication)
* vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* pr={xx}: see predicate-result mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mr.svm: see map-reduce with sub-vector mode
* crm: see map-reduce CR mode
* crm.svm: see map-reduce CR with sub-vector mode
* sz: predication with source-zeroing
* dz: predication with dest-zeroing

For modes:

* pred-result:
  - pm=lt/gt/le/ge/eq/ne/so/ns
  - RC1 mode
* fail-first
  - ff=lt/gt/le/ge/eq/ne/so/ns
  - RC1 mode
* saturation:
  - sats
  - satu
* map-reduce:
  - mr OR crm: "normal" map-reduce mode or CR-mode.
  - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is
    enabled

# Parallel-reduction algorithm

The principle of SVP64 is that SVP64 is a fully-independent
Abstraction of hardware-looping in between issue and execute phases
that has no relation to the operation it issues.
Additional state cannot be saved on context-switching beyond that
of SVSTATE, making things slightly tricky.

Executable demo pseudocode, full version
[here](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/test_preduce.py;hb=HEAD)

```
[[!inline raw="yes" pages="openpower/sv/preduce.py" ]]
```

This algorithm works by noting when data remains in-place rather than
being reduced, and referring to that alternative position on subsequent
layers of reduction. It is re-entrant. If however interrupted and
restored, some implementations may take longer to re-establish the
context.

Its application by default is that:

* RA, FRA or BFA is the first register as the first operand
  (ci index offset in the above pseudocode)
* RB, FRB or BFB is the second (co index offset)
* RT (result) also uses ci **if RA==RT**

For more complex applications a REMAP Schedule must be used.

*Programmer's note:
if passed a predicate mask with only one bit set, this algorithm
takes no action, similar to when a predicate mask is all zero.*

*Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
implemented in hardware with MVs that ensure lane-crossing is minimised.
The mistake which would be catastrophic to SVP64 to make is to then
limit the Reduction Sequence for all implementors
based solely and exclusively on what one
specific internal microarchitecture does.
In SIMD ISAs the internal SIMD Architectural design is exposed and
imposed on the programmer. Cray-style Vector ISAs on the other hand
provide convenient,
compact and efficient encodings of abstract concepts.*
**It is the Implementor's responsibility to produce a design
that complies with the above algorithm,
utilising internal Micro-coding and other techniques to transparently
insert micro-architectural lane-crossing Move operations
if necessary or desired, to give the level of efficiency or performance
required.**

# Element-width overrides <a name="elwidth"> </a>

Element-width overrides are best illustrated with a packed structure
union in the c programming language. The following should be taken
literally, and assume always a little-endian layout:

    typedef union {
        uint8_t  b[];
        uint16_t s[];
        uint32_t i[];
        uint64_t l[];
        uint8_t  actual_bytes[8];
    } el_reg_t;

    el_reg_t int_regfile[128];

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!reg.isvec):
            # not a vector: first element only, overwrites high bits
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
to fp127) are reinterpreted to be "starting points" in a byte-addressable
memory. Vectors - which become just a virtual naming construct -
effectively
overlap.

It is extremely important for implementors to note that the only
circumstance
where upper portions of an underlying 64-bit register are zero'd out is
when the destination is a scalar. The ideal register file has byte-level
write-enable lines, just like most SRAMs, in order to avoid
READ-MODIFY-WRITE.

An example ADD operation with predication and element width overrides:

    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # predication
         src1 = get_polymorphed_reg(RA, srcwid, irs1)
         src2 = get_polymorphed_reg(RB, srcwid, irs2)
         result = src1 + src2 # actual add here
         set_polymorphed_reg(RT, destwid, id, result)
         if (!RT.isvec) break
      if (RT.isvec) { id += 1; }
      if (RA.isvec) { irs1 += 1; }
      if (RB.isvec) { irs2 += 1; }

Thus it can be clearly seen that elements are packed by their
element width, and the packing starts from the source (or destination)
specified by the instruction.
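
For example, with a destination elwidth of 8 and RT=4 marked as a Vector
(an illustrative use of the pseudocode above):

    # sketch: ew=8 packing. elements 0..7 all land in the bytes of r4;
    # element 8 continues into the first byte of r5, because Vectors
    # overlap the underlying 64-bit registers.
    set_polymorphed_reg(4, 8, 0, 0x11)   # r4, byte 0
    set_polymorphed_reg(4, 8, 7, 0x88)   # r4, byte 7
    set_polymorphed_reg(4, 8, 8, 0x99)   # r5, byte 0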

# Twin (implicit) result operations

Some operations in the Power ISA already target two 64-bit scalar
registers: `lq` for example, and LD with update.
Some mathematical algorithms are more
efficient when there are two outputs rather than one, providing
feedback loops between elements (the most well-known being add with
carry). 64-bit multiply
for example actually internally produces a 128 bit result, which clearly
cannot be stored in a single 64 bit register. Some ISAs recommend
"macro op fusion": the practice of setting a convention whereby if
two commonly used instructions (mullo, mulhi) use the same ALU but
one selects the low part of an identical operation and the other
selects the high part, then optimised micro-architectures may
"fuse" those two instructions together, using Micro-coding techniques,
internally.

The practice and convention of macro-op fusion however is not compatible
with SVP64 Horizontal-First, because Horizontal Mode may only
be applied to a single instruction at a time, and SVP64 is based on
the principle of strict Program Order even at the element
level. Thus it becomes
necessary to add explicit, more complex single instructions with
more operands than would normally be seen in the average RISC ISA
(3-in, 2-out, in some cases). If it
was not for Power ISA already having LD/ST with update as well as
Condition Codes and `lq` this would be hard to justify.

With limited space in the `EXTRA` Field, and Power ISA opcodes
being only 32 bit, 5 operands is quite an ask. `lq` however sets
a precedent: `RTp` stands for "RT pair". In other words the result
is stored in RT and RT+1. For Scalar operations, following this
precedent is perfectly reasonable. In Scalar mode,
`madded` therefore stores the two halves of the 128-bit multiply
into RT and RT+1.

What, then, of `sv.madded`? If the destination is hard-coded to
RT and RT+1 the instruction is not useful when Vectorised because
the output will be overwritten on the next element. To solve this
is easy: define the destination registers as RT and RT+MAXVL
respectively. This makes it easy for compilers to statically allocate
registers even when VL changes dynamically.

Bearing in mind that both RT and RT+MAXVL are starting points for
Vectors,
and that element-width overrides still have to be taken
into consideration, the starting point for the implicit destination
is best illustrated in pseudocode:

    # demo of madded
    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # predication
         src1 = get_polymorphed_reg(RA, srcwid, irs1)
         src2 = get_polymorphed_reg(RB, srcwid, irs2)
         src3 = get_polymorphed_reg(RC, srcwid, irs3)
         result = src1*src2 + src3
         destmask = (1<<destwid)-1
         # store two halves of result, both start from RT.
         set_polymorphed_reg(RT, destwid, id,       result&destmask)
         set_polymorphed_reg(RT, destwid, id+MAXVL, result>>destwid)
         if (!RT.isvec) break
      if (RT.isvec) { id += 1; }
      if (RA.isvec) { irs1 += 1; }
      if (RB.isvec) { irs2 += 1; }
      if (RC.isvec) { irs3 += 1; }

The significant part here is that the second half is stored
starting not from RT+MAXVL at all: it is the *element* index
that is offset by MAXVL, both halves actually starting from RT.
If VL is 3, MAXVL is 5, RT is 1, and dest elwidth is 32 then the elements
RT0 to RT2 are stored:

          0..31      32..63
    r0    unchanged  unchanged
    r1    RT0.lo     RT1.lo
    r2    RT2.lo     unchanged
    r3    unchanged  RT0.hi
    r4    RT1.hi     RT2.hi
    r5    unchanged  unchanged

Note that all of the LO halves start from r1, but that the HI halves
start from half-way into r3. The reason is that with MAXVL being
5 and elwidth being 32, this is the 5th element
offset (in 32 bit quantities) counting from r1.

*Programmer's note: accessing registers that have been placed
starting on a non-contiguous boundary (half-way along a scalar
register) can be inconvenient: REMAP can provide an offset but
it requires extra instructions to set up. A simple solution
is to ensure that MAXVL is rounded up such that the Vector
ends cleanly on a contiguous register boundary. MAXVL=6 in
the above example would achieve that.*

Additional DRAFT Scalar instructions in 3-in 2-out form
with an implicit 2nd destination:

* [[isa/svfixedarith]]
* [[isa/svfparith]]