openpower/sv/svp64/appendix.mdwn

# Appendix
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
   5
   6 This is the appendix to [[sv/svp64]], providing explanations of modes
   7 etc. leaving the main svp64 page's primary purpose as outlining the
   8 instruction format.
   9
  10 Table of contents:
  11
  12 [[!toc]]
  13
# XER, SO and other global flags
  15
  16 Vector systems are expected to be high performance.  This is achieved
  17 through parallelism, which requires that elements in the vector be
  18 independent.  XER SO and other global "accumulation" flags (CR.OV) cause
  19 Read-Write Hazards on single-bit global resources, having a significant
  20 detrimental effect.
  21
  22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
  23 in `cmp` instructions).  XER is simply neither read nor written.
  24 This includes when `scalar identity behaviour` occurs.  If precise
  25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
  26 instructions should be used without an SV Prefix.
  27
  28 An interesting side-effect of this decision is that the OE flag is now
  29 free for other uses when SV Prefixing is used.
  30
Regarding XER.CA: this does not fit either, as it was designed for a scalar
ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
Vector element.  This provides a means to perform large parallel batches
of Vectorised carry-capable additions.  crweird instructions can be used
to transfer the CRs in and out of an integer, where bit manipulation
may be performed to analyse the carry bits (including carry lookahead
propagation) before continuing with further parallel additions.
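As an illustration only, the per-element carry handling described above might be modelled as follows.  This is a minimal Python sketch, not normative: the name `vec_add_carry`, the 64-bit element width, and the use of a plain list of bits to stand in for the CR.so bit of each element are all assumptions made for the example.

```python
# Hypothetical model: one carry bit per vector element, standing in for
# the CR.so bit of that element's CR field (an assumption of this sketch).
MASK64 = (1 << 64) - 1

def vec_add_carry(a, b, carry_in):
    """Element-wise 64-bit add; returns (results, carry_out) lists."""
    results, carry_out = [], []
    for x, y, c in zip(a, b, carry_in):
        total = x + y + c
        results.append(total & MASK64)   # truncated 64-bit sum
        carry_out.append(total >> 64)    # 1 if the add overflowed 64 bits
    return results, carry_out

sums, carries = vec_add_carry([MASK64, 1], [1, 2], [0, 1])
```

Because every element carries its own carry bit, no element depends on a single shared flag, which is the property the text above relies on for parallel batches.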
  38
# v3.0B/v3.1B relevant instructions
  40
  41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
  42 CPU ISA.
  43
  44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
  45 alternative uses.  Additionally, Vectorisation of the VSX SIMD system
  46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
  47 at the very minimum, predication (which VSX was designed without).
  48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
  49 illegal instruction exceptions in SV Prefix Mode.
  50
Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
retain: not only is their functionality provided by SV, the SV
alternatives may also be predicated, making them far better suited to
use in function calls and context-switching.
  55
  56 Additionally, some v3.0/1 instructions simply make no sense at all in a
  57 Vector context: `rfid` falls into this category,
  58 as well as `sc` and `scv`.  Here there is simply no point
  59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
  60 should be called instead.
  61
  62 Fortuitously this leaves several Major Opcodes free for use by SV
  63 to fit alternative future instructions.  In a 3D context this means
  64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
  65 operations, and others critical to an efficient, effective 3D GPU and
  66 VPU ISA. With such instructions being included as standard in other
  67 commercially-successful GPU ISAs it is likewise critical that a 3D
  68 GPU/VPU based on svp64 also have such instructions.
  69
  70 Note however that svp64 is stand-alone and is in no way
  71 critically dependent on the existence or provision of 3D GPU or VPU
  72 instructions. These should be considered extensions, and their discussion
  73 and specification is out of scope for this document.
  74
  75 Note, again: this is *only* under svp64 prefixing.  Standard v3.0B /
  76 v3.1B is *not* altered by svp64 in any way.
  77
## Major opcode map (v3.0B)
  79
  80 This table is taken from v3.0B.
  81 Table 9: Primary Opcode Map (opcode bits 0:5)
  82
  83         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
  84     000 |        |       |  tdi  | twi   | EXT04 |        |       | mulli | 000
  85     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
  86     010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
  87     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
  88     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
  89     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
  90     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
  91     111 |  lq    | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
  92         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
  93
## Suitable for svp64-only
  95
  96 This is the same table containing v3.0B Primary Opcodes except those that
  97 make no sense in a Vectorisation Context have been removed.  These removed
  98 POs can, *in the SV Vector Context only*, be assigned to alternative
  99 (Vectorised-only) instructions, including future extensions.
 100
 101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
 102 change.  When not prefixed with svp64 these opcodes **specifically**
 103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
 104
 105         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
 106     000 |        |       |       |       |       |        |       | mulli | 000
 107     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
 108     010 |        |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
 109     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
 110     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
 111     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
 112     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
 113     111 |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
 114         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
 115
It is important to note that having a v3.0B Scalar opcode whose meaning
differs from that of its SVP64 counterpart is highly undesirable: the
complexity of the decoder is greatly increased.
 119
# Single Predication
 121
This is a standard mode normally found in Vector ISAs.  Every element in every source Vector and in the destination uses the same bit of one single predicate mask.
 123
 124 In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep: unlike Twin-Predication the two must be equal at all times.
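A minimal sketch of Single Predication, assuming an integer register file modelled as a Python list and a predicate held as an integer bitmask (the names `single_pred_add`, `regs` and `mask` are hypothetical).  One loop counter serves as both srcstep and dststep, which is why the two can never differ:

```python
def single_pred_add(regs, RT, RA, RB, VL, mask):
    """Element i of RA, RB and RT all use bit i of one mask."""
    for step in range(VL):          # srcstep == dststep == step
        if (mask >> step) & 1:
            regs[RT + step] = regs[RA + step] + regs[RB + step]
    return regs

regs = list(range(16))
single_pred_add(regs, RT=0, RA=4, RB=8, VL=4, mask=0b0101)
```

Elements 1 and 3 are masked out here, so only `regs[0]` and `regs[2]` are overwritten.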
 125
# Twin Predication
 127
This is a novel concept that allows predication to be applied to a single
source and a single dest register.  The following types of traditional
Vector operations may be encoded with it, *without requiring explicit
opcodes to do so*:
 132
 133 * VSPLAT (a single scalar distributed across a vector)
 134 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
 135 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
 136 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
 137 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
 138
 139 Those patterns (and more) may be applied to:
 140
 141 * mv (the usual way that V\* ISA operations are created)
 142 * exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
 146 * LD and ST (treating AGEN as one source)
 147 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
 148 * Condition Register ops mfcr, mtcr and other similar
 149
This is a huge list that creates extremely powerful combinations,
particularly given that one of the predicate options is `(1<<r3)`.
 152
 153 Additional unusual capabilities of Twin Predication include a back-to-back
 154 version of VCOMPRESS-VEXPAND which is effectively the ability to do
 155 sequentially ordered multiple VINSERTs.  The source predicate selects a
 156 sequentially ordered subset of elements to be inserted; the destination
 157 predicate specifies the sequentially ordered recipient locations.
 158 This is equivalent to
 159 `llvm.masked.compressstore.*`
 160 followed by
 161 `llvm.masked.expandload.*`
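The way separate source and destination predicates yield these patterns can be sketched in Python.  This is an illustrative model under stated assumptions, not the normative algorithm: the register file is a list, predicates are integer bitmasks, and the names `twin_pred_mv`, `src_scalar` and `dst_scalar` are hypothetical.  The source and destination steps each skip their own masked-out elements independently:

```python
def twin_pred_mv(regs, RT, RA, VL, srcmask, dstmask,
                 src_scalar=False, dst_scalar=False):
    """Copy elements, skipping masked-out source and destination slots."""
    srcstep = dststep = 0
    while srcstep < VL and dststep < VL:
        # advance each step independently past masked-out elements
        while srcstep < VL and not (srcmask >> srcstep) & 1:
            srcstep += 1
        while dststep < VL and not (dstmask >> dststep) & 1:
            dststep += 1
        if srcstep == VL or dststep == VL:
            break
        regs[RT + dststep] = regs[RA + srcstep]
        if dst_scalar:
            break                      # scalar destination: one result only
        if not src_scalar:
            srcstep += 1               # a scalar source is re-read each time
        dststep += 1
    return regs

def demo(**kwargs):
    regs = [10, 11, 12, 13, 0, 0, 0, 0]   # source regs[0:4], dest regs[4:8]
    return twin_pred_mv(regs, RT=4, RA=0, VL=4, **kwargs)[4:]

compressed = demo(srcmask=0b1010, dstmask=0b1111)                 # VCOMPRESS
expanded = demo(srcmask=0b1111, dstmask=0b1010)                   # VEXPAND
splat = demo(srcmask=0b0001, dstmask=0b1111, src_scalar=True)     # VSPLAT
```

Setting both masks to sparse values at once gives the back-to-back VCOMPRESS-VEXPAND behaviour described above, with no additional code path.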
 162
# Rounding, clamp and saturate
 164
See [[av_opcodes]].
 166
 167 To help ensure that audio quality is not compromised by overflow,
 168 "saturation" is provided, as well as a way to detect when saturation
 169 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
 170 one CR per element in the result (Note: this is different from VSX which
 171 has a single CR per block).
 172
 173 When N=0 the result is saturated to within the maximum range of an
 174 unsigned value.  For integer ops this will be 0 to 2^elwidth-1. Similar
 175 logic applies to FP operations, with the result being saturated to
 176 maximum rather than returning INF, and the minimum to +0.0
 177
 178 When N=1 the same occurs except that the result is saturated to the min
 179 or max of a signed result, and for FP to the min and max value rather
 180 than returning +/- INF.
 181
 182 When Rc=1, the CR "overflow" bit is set on the CR associated with the
 183 element, to indicate whether saturation occurred.  Note that due to
 184 the hugely detrimental effect it has on parallel processing, XER.SO is
 185 **ignored** completely and is **not** brought into play here.  The CR
 186 overflow bit is therefore simply set to zero if saturation did not occur,
 187 and to one if it did.
 188
Note also that saturation on operations that produce a carry output is
prohibited, due to the conflicting use of the CR.so bit for recording
whether saturation occurred.
 192
Post-analysis of the Vector of CRs to find out if any given element hit
saturation may be done using a mapreduced CR op (cror), or by using the
new crweird instruction, transferring the relevant CR bits to a scalar
integer and testing it for nonzero.  See [[sv/cr_int_predication]].
 197
 198 Note that the operation takes place at the maximum bitwidth (max of
 199 src and dest elwidth) and that truncation occurs to the range of the
 200 dest elwidth.
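For integer operations, the clamping described in this section can be sketched as below.  This is an illustrative Python model only: the function name `saturate` and the returned flag (standing in for the per-element CR overflow bit) are assumptions, and Python's unbounded integers stand in for "the maximum bitwidth" at which the operation takes place.

```python
def saturate(value, elwidth, N):
    """Clamp an integer result to elwidth bits.

    N=0 selects the unsigned range, N=1 the signed range.
    Returns (clamped_value, saturated_flag).
    """
    if N:
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:
        lo, hi = 0, (1 << elwidth) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value

# a vector of 8-bit unsigned results, each with its own saturation flag
results = [saturate(v, elwidth=8, N=0) for v in (100 + 100, 200 + 100, 5 - 9)]
```

Each element yields its own flag, mirroring the *vector* of CRs produced when Rc=1.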
 201
# Reduce mode
 203
 204 There are two variants here.  The first is when the destination is scalar
 205 and at least one of the sources is Vector.  The second is more complex
 206 and involves map-reduction on vectors.
 207
 208 The first defining characteristic distinguishing Scalar-dest reduce mode
 209 from Vector reduce mode is that Scalar-dest reduce issues VL element
 210 operations, whereas Vector reduce mode performs an actual map-reduce
 211 (tree reduction): typically `O(VL log VL)` actual computations.
 212
 213 The second defining characteristic of scalar-dest reduce mode is that it
 214 is, in simplistic and shallow terms *serial and sequential in nature*,
 215 whereas the Vector reduce mode is definitely inherently paralleliseable.
 216
 217 The reason why scalar-dest reduce mode is "simplistically" serial and
 218 sequential is that in certain circumstances (such as an `OR` operation
 219 or a MIN/MAX operation) it may be possible to parallelise the reduction.
 220
## Scalar result reduce mode
 222
 223 In this mode, which is suited to operations involving carry or overflow,
 224 one register must be identified by the programmer as being the "accumulator".
 225 Scalar reduction is thus categorised by:
 226
 227 * One of the sources is a Vector
 228 * the destination is a scalar
 229 * optionally but most usefully when one source register is also the destination
 230 * That the source register type is the same as the destination register
 231   type identified as the "accumulator".  scalar reduction on `cmp`,
 232   `setb` or `isel` makes no sense for example because of the mixture
 233   between CRs and GPRs.
 234
Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements in the vector starting at r10.
 238
 239      # add RT, RA,RB but when RT==RA
 240      for i in range(VL):
 241           iregs[RA] += iregs[RB+i] # RT==RA
 242
 243 However, *unless* the operation is marked as "mapreduce", SV ordinarily
 244 **terminates** at the first scalar operation.  Only by marking the
 245 operation as "mapreduce" will it continue to issue multiple sub-looped
 246 (element) instructions in `Program Order`.
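A runnable sketch of scalar-dest reduction in "mapreduce" mode, assuming a register file modelled as a Python list (the names `scalar_reduce`, `acc` and `vec` are hypothetical).  The accumulator is written back after every element, so the loop state is always consistent at an element boundary:

```python
def scalar_reduce(regs, op, acc, vec, VL):
    """Accumulate regs[vec .. vec+VL-1] into the scalar regs[acc]."""
    for i in range(VL):
        # each element operation is issued in Program Order
        regs[acc] = op(regs[acc], regs[vec + i])
    return regs[acc]

regs = [0] * 16
regs[3] = 100                    # accumulator, as in `ADD r3, r10.v, r3`
regs[10:14] = [1, 2, 3, 4]       # vector starting at r10
total = scalar_reduce(regs, lambda a, b: a + b, acc=3, vec=10, VL=4)
```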
 247
To perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set.  This is useful for leaving a cumulative suffix sum in reverse order:

    for i in reversed(range(VL)):
        # RT-1 = RA gives a suffix sum
        iregs[RT+i] = iregs[RA+i] - iregs[RB+i]
 253
Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required, as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted, has multiple
applications including Video, cryptography and JIT compilation.
 260
 261 Subtract and Divide are still permitted to be executed in this mode,
 262 although from an algorithmic perspective it is strongly discouraged.
 263 It would be better to use addition followed by one final subtract,
 264 or in the case of divide, to get better accuracy, to perform a multiply
 265 cascade followed by a final divide.
 266
 267 Note that single-operand or three-operand scalar-dest reduce is perfectly
 268 well permitted: both still meet the qualifying characteristics that one
 269 source operand can also be the destination, which allows the "accumulator"
 270 to be identified.
 271
If the "accumulator" cannot be identified (none of the sources is also
a destination) the results are **UNDEFINED**.  This spares implementations
from needing complex decoding analysis of register fields: it
is thus up to the programmer to ensure that one of the source registers
is also a destination register in order to take advantage of Scalar
Reduce Mode.
 278
 279 If an interrupt or exception occurs in the middle of the scalar mapreduce,
 280 the scalar destination register **MUST** be updated with the current
 281 (intermediate) result, because this is how ```Program Order``` is
 282 preserved (Vector Loops are to be considered to be just another way of issuing instructions
 283 in Program Order).  In this way, after return from interrupt,
 284 the scalar mapreduce may continue where it left off.  This provides
 285 "precise" exception behaviour.
 286
 287 Note that hardware is perfectly permitted to perform multi-issue
 288 parallel optimisation of the scalar reduce operation: it's just that
 289 as far as the user is concerned, all exceptions and interrupts **MUST**
 290 be precise.
 291
## Vector result reduce mode
 293
 294 Vector result reduce mode may utilise the destination vector for
 295 the purposes of storing intermediary results.  Interrupts and exceptions
 296 can therefore also be precise.  The result will be in the first
 297 non-predicate-masked-out destination element.  Note that unlike
 298 Scalar reduce mode, Vector reduce
 299 mode is *not* suited to operations which involve carry or overflow.
 300
 301 Programs **MUST NOT** rely on the contents of the intermediate results:
 302 they may change from hardware implementation to hardware implementation.
 303 Some implementations may perform an incremental update, whilst others
 304 may choose to use the available Vector space for a binary tree reduction.
If an incremental Vector is required (```x[i] = x[i-1] + y[i]```) then
a *straight* SVP64 Vector instruction can be issued, where the source and
destination registers overlap: ```sv.add 1.v, 9.v, 2.v```. Because
respecting ```Program Order``` is mandatory in SVP64, hardware
must detect this case and issue an incremental sequence of scalar
element instructions.
 311
 312 1. limited to single predicated dual src operations (add RT, RA, RB).
 313    triple source operations are prohibited (such as fma).
2. limited to operations that make sense.  divide is excluded, as is
   subtract (X - Y - Z produces different answers depending on the order)
   and asymmetric CRops (crandc, crorc). sane operations:
   multiply, min/max, add, logical bitwise OR, most other CR ops.
   operations that do not have the same source and dest register type are
   also excluded (isel, cmp). operations involving carry or overflow
   (XER.CA / OV) are also prohibited.
 321 3. the destination is a vector but the result is stored, ultimately,
 322    in the first nonzero predicated element.  all other nonzero predicated
 323    elements are undefined. *this includes the CR vector* when Rc=1
 324 4. implementations may use any ordering and any algorithm to reduce
 325    down to a single result.  However it must be equivalent to a straight
 326    application of mapreduce.  The destination vector (except masked out
 327    elements) may be used for storing any intermediate results. these may
 328    be left in the vector (undefined).
 329 5. CRM applies when Rc=1.  When CRM is zero, the CR associated with
 330    the result is regarded as a "some results met standard CR result
 331    criteria". When CRM is one, this changes to "all results met standard
 332    CR criteria".
 333 6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
 334    in order to store sufficient state to resume operation should an
 335    interrupt occur. this is also why implementations are permitted to use
 336    the destination vector to store intermediary computations
 337 7. *Predication may be applied*.  zeroing mode is not an option.  masked-out
 338    inputs are ignored; masked-out elements in the destination vector are
 339    unaltered (not used for the purposes of intermediary storage); the
 340    scalar result is placed in the first available unmasked element.
 341
 342 Pseudocode for the case where RA==RB:
 343
    result = op(iregs[RA], iregs[RA+1])
    CR = analyse(result)
    for i in range(2, VL):
        result = op(result, iregs[RA+i])
        CRnew = analyse(result)
        if Rc == 1:
            if CRM:
                CR = CR | CRnew  # bitwise OR
            else:
                CR = CR & CRnew  # bitwise AND
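The freedom given to implementations (any ordering, provided the outcome equals a straight mapreduce) can be illustrated with a small Python sketch.  Everything here is an assumption made for the example: the names `tree_reduce` and `vector_reduce`, the list-based register file, the integer predicate mask, and the choice to write nothing when every element is masked out.  Intermediate storage in the destination vector is not modelled.

```python
from functools import reduce

def tree_reduce(op, values):
    """Pairwise (binary tree) reduction: rounds of independent ops."""
    values = list(values)
    while len(values) > 1:
        pairs = [op(values[i], values[i + 1])
                 for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd element carries over unchanged
            pairs.append(values[-1])
        values = pairs
    return values[0]

def vector_reduce(regs, op, RT, RA, VL, mask, strategy=tree_reduce):
    """Reduce the unmasked source elements into one destination element."""
    active = [i for i in range(VL) if (mask >> i) & 1]
    if not active:
        return regs                  # assumption: nothing is written
    result = strategy(op, [regs[RA + i] for i in active])
    regs[RT + active[0]] = result    # first non-masked-out destination element
    return regs

regs = [0] * 8 + [5, 9, 2, 7]
vector_reduce(regs, max, RT=0, RA=8, VL=4, mask=0b1101)

# for the permitted (order-insensitive) operations both strategies agree
add = lambda a, b: a + b
same = tree_reduce(add, [1, 2, 3, 4, 5]) == reduce(add, [1, 2, 3, 4, 5])
```

Subtract and divide would not pass the `same` check in general, which is one way to see why they are excluded from this mode.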
 354
 355 TODO: case where RA!=RB which involves first a vector of 2-operand
 356 results followed by a mapreduce on the intermediates.
 357
 358 Note that when SVM is clear and SUBVL!=1 the sub-elements are
 359 *independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2:
 361
 362     result.x = op(iregs[RA].x, iregs[RA+1].x)
 363     result.y = op(iregs[RA].y, iregs[RA+1].y)
 364     for i in range(2, VL):
 365         result.x = op(result.x, iregs[RA+i].x)
 366         result.y = op(result.y, iregs[RA+i].y)
 367
 368 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
 369
 370 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
 371 subvector mode.  Example for a vec3:
 372
    for i in range(VL):
        result = iregs[RA+i].x
        result = op(result, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result
 378
 379 In this mode, when Rc=1 the Vector of CRs is as normal: each result
 380 element creates a corresponding CR element.
 381
# Fail-on-first
 383
Data-dependent fail-on-first has two distinct variants: one for LD/ST,
the other for arithmetic operations (actually, CR-driven).  Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.
 388
 389 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
 390   ordinary one.  Exceptions occur "as normal".  However for elements 1
 391   and above, if an exception would occur, then VL is **truncated** to the
 392   previous element.
 393 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
 394   CR-creating operation produces a result (including cmp).  Similar to
 395   branch, an analysis of the CR is performed and if the test fails, the
 396   vector operation terminates and discards all element operations at and
 397   above the current one, and VL is truncated to either
 398   the *previous* element or the current one, depending on whether
 399   VLi (VL "inclusive") is set.
 400
 401 Thus the new VL comprises a contiguous vector of results,
 402 all of which pass the testing criteria (equal to zero, less than zero).
 403
 404 The CR-based data-driven fail-on-first is new and not found in ARM
 405 SVE or RVV. It is extremely useful for reducing instruction count,
 406 however requires speculative execution involving modifications of VL
 407 to get high performance implementations.  An additional mode (RC1=1)
 408 effectively turns what would otherwise be an arithmetic operation
 409 into a type of `cmp`.  The CR is stored (and the CR.eq bit tested
 410 against the `inv` field).
 411 If the CR.eq bit is equal to `inv` then the Vector is truncated and
 412 the loop ends.
 413 Note that when RC1=1 the result elements are never stored, only the CRs.
 414
 415 VLi is only available as an option when `Rc=0` (or for instructions
 416 which do not have Rc). When set, the current element is always
 417 also included in the count (the new length that VL will be set to).
 418 This may be useful in combination with "inv" to truncate the Vector
 419 to `exclude` elements that fail a test, or, in the case of implementations
 420 of strncpy, to include the terminating zero.
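The effect of VLi on the truncated length can be sketched in Python.  This is a simplified model under stated assumptions: `cr_ffirst` and `test` are hypothetical names, `test(result)` stands in for the single CR bit test combined with `inv`, and the sketch assumes that with VLi set the failing element's result is kept (as the strncpy use-case suggests).

```python
def cr_ffirst(values, VL, test, VLi=False):
    """Return (results, new_VL) for a data-dependent fail-first loop."""
    results = []
    for i in range(VL):
        result = values[i]           # stands in for the element operation
        if not test(result):
            if VLi:
                # inclusive: the failing element counts towards the new VL
                results.append(result)
                return results, i + 1
            return results, i        # exclusive: truncate to previous element
        results.append(result)
    return results, VL

nonzero = lambda x: x != 0
text = [ord("h"), ord("i"), 0, ord("x")]
exclusive = cr_ffirst(text, 4, nonzero)
inclusive = cr_ffirst(text, 4, nonzero, VLi=True)
empty = cr_ffirst([0, 1], 2, nonzero)
```

The `inclusive` case retains the terminating zero, whereas `empty` shows VL being truncated all the way to zero when the very first element fails.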
 421
 422 In CR-based data-driven fail-on-first there is only the option to select
 423 and test one bit of each CR (just as with branch BO).  For more complex
 424 tests this may be insufficient.  If that is the case, a vectorised crops
 425 (crand, cror) may be used, and ffirst applied to the crop instead of to
 426 the arithmetic vector.
 427
 428 One extremely important aspect of ffirst is:
 429
* LDST ffirst may never set VL equal to zero.  This is because on the first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV by which VL may be set
  to zero (with the exception of via the SVSTATE SPR).  When VL is set
  to zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.
 438
 439 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
 440 to a nonzero value for any implementation-specific reason.  For example:
 441 it is perfectly reasonable for implementations to alter VL when ffirst
 442 LD or ST operations are initiated on a nonaligned boundary, such that
 443 within a loop the subsequent iteration of that loop begins subsequent
 444 ffirst LD/ST operations on an aligned boundary.  Likewise, to reduce
 445 workloads or balance resources.
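A sketch of the LD/ST variant, assuming memory is modelled as a Python dict in which a missing address stands for a page fault (the names `ld_ffirst`, `memory` and `addrs` are hypothetical).  Element 0 faults "as normal"; any later fault simply shortens VL.  Implementation-specific truncation, as permitted above, is not modelled:

```python
def ld_ffirst(memory, addrs, VL):
    """Load elements until one would fault; return (loaded, new_VL)."""
    loaded = []
    for i in range(VL):
        if addrs[i] not in memory:
            if i == 0:
                # the first element behaves like an ordinary load
                raise MemoryError("element 0 faults as normal")
            return loaded, i         # VL truncated to the elements that succeeded
        loaded.append(memory[addrs[i]])
    return loaded, VL

memory = {0x100: 1, 0x108: 2}
loaded, new_VL = ld_ffirst(memory, [0x100, 0x108, 0x110, 0x118], 4)
```

Since element 0 either succeeds or raises, `new_VL` can never be zero in this model.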
 446
CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.
 452
## Data-dependent fail-first on CR operations (crand etc)
 454
 455 Operations that actually produce or alter CR Field as a result
 456 do not also in turn have an Rc=1 mode.  However it makes no
 457 sense to try to test the 4 bits of a CR Field for being equal
 458 or not equal to zero. Moreover, the result is already in the
 459 form that is desired: it is a CR field.
 460
 461 There are two primary different types of CR operations:
 462
 463 * Those which have a 3-bit operand field (referring to a CR Field)
 464 * Those which have a 5-bit operand (referring to a bit within the
 465    whole 32-bit CR)
 466
Examining these two, it is observed that
the difference may be considered to be that the 5-bit variant
provides additional information about which CR Field bit
(LT, GT, EQ, SO) is to be operated on by the instruction.
 471
 472 Thus, logically, we may set the following rule:
 473
 474 * When a 5-bit CR Result field is used in an instruction, the
 475   `inv, VLi and RC1` variant of Data-Dependent Fail-First
 476   must be used. i.e. the bit of the CR field to be tested is
 477   the one that has just been modified by the operation.
* When a 3-bit CR Result field is used the `inv CRbit` variant
  must be used in order to select which CR Field bit shall
  be tested (LT, GT, EQ, SO).
 481
Examples of each type:
 483
 484 * crand, cror, crnor. These all are 5-bit (BA, BB, BT). The bit
 485   to be tested against `inv` is the one selected by `BT`
 486 * mcrf. This has only 3-bit (BF, BFA). In order to select the
 487   bit to be tested, the alternative FFirst encoding must be used.
 488
This limits sv.mcrf in that it may not use the `VLi` (VL inclusive)
Mode. This is unfortunate but unavoidable due to encoding pressure
on SVP64.
 492
# pred-result mode
 494
This mode merges common CR testing with predication, saving on instruction
count. Below is the pseudocode excluding predicate zeroing and elwidth
overrides. Note that the pseudocode for [[sv/cr_ops]] is slightly different.
 498
    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i):
            continue
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # Rc=1 always stores the CR
        if Rc == 1 or RC1:
            crregs[offs+i] = CRnew
        # now test CR, similar to branch
        if RC1 or CRnew[BO[0:1]] != BO[2]:
            continue # test failed: cancel store
        # result optionally stored but CR always is
        iregs[RT+i] = result
 513
 514 The reason for allowing the CR element to be stored is so that
 515 post-analysis of the CR Vector may be carried out.  For example:
 516 Saturation may have occurred (and been prevented from updating, by the
 517 test) but it is desirable to know *which* elements fail saturation.
 518
 519 Note that RC1 Mode basically turns all operations into `cmp`.  The
 520 calculation is performed but it is only the CR that is written. The
 521 element result is *always* discarded, never written (just like `cmp`).
 522
Note that predication is still respected.  Predicate zeroing is slightly
different: elements that fail the CR test *or* are masked out are zero'd.
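A runnable approximation of pred-result mode (without predicate zeroing or elwidth overrides) is given below as a hedged sketch.  The names `pred_result`, `cr_test` and the dict-based CR model are assumptions for illustration; the sketch behaves as if Rc=1, so the CR element is always stored, and the branch-style `BO` test is replaced by an arbitrary Python callable.

```python
def pred_result(regs, crs, RT, RA, RB, VL, mask, op, cr_test, RC1=False):
    """Write each result only if its CR passes `cr_test`."""
    for i in range(VL):
        if not (mask >> i) & 1:
            continue                       # masked out: skip entirely
        result = op(regs[RA + i], regs[RB + i])
        cr = {"lt": result < 0, "gt": result > 0, "eq": result == 0}
        crs[i] = cr                        # CR element stored (Rc=1 assumed)
        if RC1 or not cr_test(cr):
            continue                       # test failed (or RC1): no result write
        regs[RT + i] = result
    return regs, crs

regs = [0, 0, 0, 0,  5, 1, 7, 2,  3, 4, 7, 9]
crs = [None] * 4
pred_result(regs, crs, RT=0, RA=4, RB=8, VL=4, mask=0b1111,
            op=lambda a, b: a - b, cr_test=lambda cr: cr["gt"])
```

Only element 0 (5 - 3 = 2) passes the "greater than zero" test and is written; the CRs of all four elements remain available for post-analysis.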
 525
## pred-result mode on CR ops
 527
Yes, really: CR operations (mtcr, crand, cror) may be Vectorised,
predicated, and also have pred-result mode applied to them.  In this case,
the Vectorisation applies to the batch of 4 bits, i.e. it is not the
individual CR bits that are treated as the Vector, but the CRs themselves
(CR0, CR8, CR9...).
 533
Put another way: Vectorised crand uses the higher bits of BT, BA and BB
to select the CR Field: these will increment sequentially as the Vector
loop progresses, whereas the lower 2 bits (selecting one of lt, gt, eq, so)
remain the same.
 538
Thus after each Vectorised operation (crand) a test of the CR result
can in fact be performed. However the only meaningful comparison will
be "eq" or "ne", given that the result is only one bit.
 542
# CR Operations
 544
 545 CRs are slightly more involved than INT or FP registers due to the
 546 possibility for indexing individual bits (crops BA/BB/BT).  Again however
 547 the access pattern needs to be understandable in relation to v3.0B / v3.1B
 548 numbering, with a clear linear relationship and mapping existing when
 549 SV is applied.
 550
## CR EXTRA mapping table and algorithm
 552
 553 Numbering relationships for CR fields are already complex due to being
 554 in BE format (*the relationship is not clearly explained in the v3.0B
 555 or v3.1B specification*).  However with some care and consideration
 556 the exact same mapping used for INT and FP regfiles may be applied,
just to the upper bits, as explained below.  The notation
`CR{field number}` is used to indicate access to a particular
Condition Register Field (as opposed to the notation `CR[bit]`
which accesses one bit of the 32 bit Power ISA v3.0B
Condition Register).
 562
In OpenPOWER v3.0/1, BT/BA/BB are all 5 bits.  The top 3 bits (0:2)
select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
*in* that CR.  The numbering was determined (after 4 months of
analysis and research) to be as follows:
 567
 568     CR_index = 7-(BA>>2)      # top 3 bits but BE
 569     bit_index = 3-(BA & 0b11) # low 2 bits but BE
 570     CR_reg = CR{CR_index}     # get the CR
 571     # finally get the bit from the CR.
 572     CR_bit = (CR_reg & (1<<bit_index)) != 0
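The numbering above can be cross-checked with a short Python sketch.  One assumption is made explicit here: `CR{n}` is taken to mean the 4-bit field found by shifting the 32-bit CR value right by `4*n`, i.e. counted from the least-significant end.  Under that assumption the two-step lookup agrees with directly indexing bit `BA` in MSB0 order:

```python
def cr_bit(cr32, BA):
    """Return bit BA (MSB0 numbering) of a 32-bit CR value."""
    CR_index = 7 - (BA >> 2)       # top 3 bits, converted from BE
    bit_index = 3 - (BA & 0b11)    # low 2 bits, converted from BE
    CR_reg = (cr32 >> (4 * CR_index)) & 0xF   # assumed meaning of CR{CR_index}
    return (CR_reg & (1 << bit_index)) != 0

# cross-check against direct MSB0 indexing of the 32-bit register
cr32 = 0x8000_0002
direct = [((cr32 >> (31 - BA)) & 1) != 0 for BA in range(32)]
two_step = [cr_bit(cr32, BA) for BA in range(32)]
```

The agreement follows from `31 - BA == 4*(7 - (BA>>2)) + (3 - (BA & 0b11))`.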
 573
 574 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
 575 applies, **not** the CR\_bit portion (bits 3:4):
 576
    if extra3_mode:
        spec = EXTRA3
    else:
        spec = EXTRA2<<1 | 0b0
    if spec[0]:
       # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
       return (((BA >> 2)<<6) | # hi 3 bits shifted up
               (spec[1:2]<<4) | # to make room for these
               (BA & 0b11))     # CR_bit on the end
    else:
       # scalar constructs "00 spec[1:2] BA[0:4]"
       return (spec[1:2] << 5) | BA
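The same construction, written with explicit integer masks so that it runs as ordinary Python.  This is a sketch rather than a reference decoder: `sv_cr_operand` is a hypothetical name, and `spec[0]` is assumed to be the most significant bit of the 3-bit spec, following the MSB0 bit numbering used elsewhere on this page.

```python
def sv_cr_operand(BA, extra, extra3_mode):
    """Extend a 5-bit CR bit operand using an EXTRA2/EXTRA3 field.

    Returns (is_vector, extended_operand).
    """
    # EXTRA2 is widened to 3 bits with a zero LSB
    spec = extra if extra3_mode else (extra << 1)
    is_vector = bool(spec & 0b100)     # spec[0]
    spec12 = spec & 0b011              # spec[1:2]
    if is_vector:
        # "BA[0:2] spec[1:2] 00 BA[3:4]"
        return True, ((BA >> 2) << 6) | (spec12 << 4) | (BA & 0b11)
    # "00 spec[1:2] BA[0:4]"
    return False, (spec12 << 5) | BA

BA = 0b01110                                   # CR field 3, bit 2 within it
scalar = sv_cr_operand(BA, 0b001, extra3_mode=True)
vector = sv_cr_operand(BA, 0b101, extra3_mode=True)
```

In both cases the low two bits of the extended operand still select the same bit within the field; only the field number (the operand shifted right by 2) is widened.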
 589
Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified as follows:
 592
    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
 605
 606 Note here that the decoding pattern to determine CR\_bit does not change.
 607
Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15).  This is to greatly
simplify internal design.  If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.
 612
## CR fields as inputs/outputs of vector operations
 614
 615 CRs (or, the arithmetic operations associated with them)
 616 may be marked as Vectorised or Scalar.  When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised.  Likewise if the destination is scalar then so is the CR.
 617
When Vectorised, the CR inputs/outputs are sequentially read/written
to 4-bit CR fields.  Vectorised Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:
 622
 623 * implementations may rely on the Vector CRs being aligned to 8. This
 624   means that CRs may be read or written in aligned batches of 32 bits
 625   (8 CRs per batch), for high performance implementations.
 626 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
 627   overwritten by vector Rc=1 operations except for very large VL
 628 * CR-based predication, from CR32, is also not interfered with
 629   (except by large VL).
 630
 631 However when the SV result (destination) is marked as a scalar by the
 632 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
 633 CR when Rc=1 is written to.  This is CR0 for integer operations and CR1
 634 for FP operations.
 635
 636 Note that yes, the CRs are genuinely Vectorised.  Unlike in SIMD VSX which
 637 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
 638 v3.0B scalar operations produce a **tuple** of element results: the
 639 result of the operation as one part of that element *and a corresponding
 640 CR element*.  Greatly simplified pseudocode:
 641
    for i in range(VL):
         # calculate the vector result of an add
         iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
         # now calculate CR bits
         CRs{8+i}.eq = iregs[RT+i] == 0
         CRs{8+i}.gt = iregs[RT+i] > 0
         ... etc
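A runnable version of the simplified pseudocode might look as follows.
It is a sketch under assumptions: the names `sv_add_rc1` and
`to_signed64`, the dict used for each CR field, 64-bit registers, and
the CR8 starting point (marked TBD above) are all illustrative.  The SO
bit is left out because XER.SO is disregarded in SV.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret a 64-bit pattern as a two's complement integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def sv_add_rc1(iregs, crs, RT, RA, RB, VL, cr_base=8):
    """Vectorised add with Rc=1: one result and one CR field per element."""
    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT + i] = (iregs[RA + i] + iregs[RB + i]) & MASK64
        # now calculate the CR bits from the signed result
        signed = to_signed64(iregs[RT + i])
        crs[cr_base + i] = {"lt": signed < 0, "gt": signed > 0,
                            "eq": signed == 0}


iregs = [0] * 128                      # 128 integer registers
crs = [None] * 64                      # 64 SV CR fields
iregs[4:7] = [1, MASK64, 5]            # 1, -1, 5
iregs[8:11] = [2, 1, (-7) & MASK64]    # 2, 1, -7
sv_add_rc1(iregs, crs, RT=12, RA=4, RB=8, VL=3)
for i in range(3):
    print(to_signed64(iregs[12 + i]), crs[8 + i])
```

Each element gets its own CR field, so the three results here leave
three independent lt/gt/eq outcomes in CR8, CR9 and CR10.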
 646
If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so.  This provides far
more flexibility in analysing vectors than standard Vector ISAs.  Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations: see [[sv/cr_int_predication]].
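The idea of a followup reduction over a Vector of CRs can be sketched
in Python.  The function `cr_vector_reduce` and the dict-per-CR-field
representation are hypothetical, and `operator.and_`, `operator.or_`
and `operator.xor` merely stand in for crand, cror and crxor.  The fold
here is a simple linear one and does not model the order a real
reduce-mode instruction would use.

```python
from functools import reduce
import operator


def cr_vector_reduce(crs, first, vl, bit, op):
    """Fold one named bit of `vl` consecutive CR fields into one bit."""
    return reduce(op, (crs[first + i][bit] for i in range(vl)))


# eq bits of CR8..CR10 after some Vectorised Rc=1 operation
crs = {8: {"eq": True}, 9: {"eq": False}, 10: {"eq": False}}

all_zero = cr_vector_reduce(crs, 8, 3, "eq", operator.and_)  # crand-like
any_zero = cr_vector_reduce(crs, 8, 3, "eq", operator.or_)   # cror-like
odd_zero = cr_vector_reduce(crs, 8, 3, "eq", operator.xor)   # crxor-like
print(all_zero, any_zero, odd_zero)
```

The xor case ("was the number of zero results odd") is one example of
an analysis that goes beyond the usual "all" and "some" tests.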
 655
Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.
 660
 661 (see [[discussion]].  some alternative schemes are described there)
 662
 663 ## Rc=1 when SUBVL!=1
 664
Sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of
predicate is allocated per subvector; likewise only one CR is allocated
per subvector.

This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements.  A solution
is to perform a bitwise OR or AND of the subvector tests.  Given that
OE is ignored, this field may (when available) be used to select OR or
AND behavior.
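The suggested OR/AND combination can be sketched as below.  This is an
illustration only: the function name `subvector_cr_eq`, the choice of
the "eq" test, and passing Python's `any` or `all` to select OR or AND
are assumptions, not a defined encoding.

```python
def subvector_cr_eq(results, subvl, combine):
    """Produce one "eq" outcome per subvector.

    `results` is the flat list of element results, `subvl` the
    subvector length (2 to 4), and `combine` is `any` for OR
    behaviour or `all` for AND behaviour.
    """
    outcomes = []
    for start in range(0, len(results), subvl):
        tests = [r == 0 for r in results[start:start + subvl]]
        outcomes.append(combine(tests))
    return outcomes


results = [0, 1,  0, 0,  2, 3]              # three vec2 subvectors
print(subvector_cr_eq(results, 2, any))     # OR of the per-element tests
print(subvector_cr_eq(results, 2, all))     # AND of the per-element tests
```

With OR the first subvector counts as "eq" because one element is
zero; with AND only the second subvector, where both are zero, does.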
 674
 675 ### Table of CR fields
 676
 677 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
 678 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
 679
CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=64, elwidth=8 to do so.  This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.
 685
 686 The 64 SV CRs are arranged similarly to the way the 128 integer registers
 687 are arranged.  TODO a python program that auto-generates a CSV file
 688 which can be included in a table, which is in a new page (so as not to
 689 overwhelm this one). [[svp64/cr_names]]
 690
 691 # Register Profiles
 692
 693 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
 694 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
 695
Instructions are broken down by Register Profiles as listed in the
following auto-generated page: [[opcode_regs_deduped]].  "Non-SV"
indicates that the operations with this Register Profile cannot be
Vectorised (mtspr, bc, dcbz, twi).
 700
 701 TODO generate table which will be here [[svp64/reg_profiles]]
 702
# SV pseudocode illustration
 704
 705 ## Single-predicated Instruction
 706
Illustration of a normal mode add operation: zeroing is not included and
elwidth overrides are not included.  If there is no predicate, it is set
to all 1s.
 709
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec)  { irs1 += 1; }
        if (rs2.isvec)  { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }
 722
 723 This has several modes:
 724
* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s
 727
All of these may be predicated.  Vector-Vector is straightforward.
When one of the sources is a Vector and the other a Scalar, it is clear that
each element of the Vector source should be added to the Scalar source,
each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).
 733
 734 The one that is not obvious is RT=vector but both RA/RB=scalar.
 735 Here this acts as a "splat scalar result", copying the same result into
 736 all nonpredicated result elements.  If a fixed destination scalar was
 737 intended, then an all-Scalar operation should be used.
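The modes described above can be tried out with a small Python model
of the single-predicated loop.  It is a sketch under assumptions: the
`*_vec` flags stand in for the `isvec` markings, registers are a plain
list, and the `STATE.srcoffs` context save and the explicit end-of-loop
check are omitted because the `for` loop already stops at VL.

```python
def op_add(ireg, rd, rs1, rs2, VL, rd_vec, rs1_vec, rs2_vec, predval=None):
    """Model of a predicated SV add (zeroing and elwidth not modelled)."""
    if predval is None:
        predval = (1 << VL) - 1          # no predicate: all 1s
    id_ = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):
            ireg[rd + id_] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
            if not rd_vec:
                break                    # scalar destination: first result only
        if rd_vec:
            id_ += 1
        if rs1_vec:
            irs1 += 1
        if rs2_vec:
            irs2 += 1


def fresh_regs():
    """Hypothetical test fixture: r8-r11 and r16-r19 hold the sources."""
    ireg = [0] * 32
    ireg[8:12] = [1, 2, 3, 4]
    ireg[16:20] = [10, 20, 30, 40]
    return ireg


regs = fresh_regs()
op_add(regs, 24, 8, 16, 4, True, True, True)            # RT.v = RA.v RB.v
print(regs[24:28])

regs = fresh_regs()
op_add(regs, 24, 8, 16, 4, True, False, False, 0b1010)  # splat, predicated
print(regs[24:28])

regs = fresh_regs()
op_add(regs, 24, 8, 16, 4, False, True, True, 0b0100)   # RT.s = RA.v RB.v
print(regs[24:28])
```

The second call shows the "splat scalar result" behaviour: the same
sum lands in every destination element whose predicate bit is set.  The
third shows a scalar destination receiving only the first result whose
predicate bit is set.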
 738
 739 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
 740
 741 # Assembly Annotation
 742
 743 Assembly code annotation is required for SV to be able to successfully
 744 mark instructions as "prefixed".
 745
 746 A reasonable (prototype) starting point:
 747
 748     svp64 [field=value]*
 749
 750 Fields:
 751
 752 * ew=8/16/32 - element width
 753 * sew=8/16/32 - source element width
 754 * vec=2/3/4 - SUBVL
 755 * mode=reduce/satu/sats/crpred
 756 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
 757 * spred={reg spec}
 758
 759 similar to x86 "rex" prefix.
 760
 761 For actual assembler:
 762
 763     sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
 764
 765 Qualifiers:
 766
* m={pred}: predicate mask mode
* sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
* vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* pr={xx}: see predicate-result mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mr.svm: see map-reduce with sub-vector mode
* crm: see map-reduce CR mode
* crm.svm: see map-reduce CR with sub-vector mode
* sz: predication with source-zeroing
* dz: predication with dest-zeroing
 781
 782 For modes:
 783
 784 * pred-result:
 785   - pm=lt/gt/le/ge/eq/ne/so/ns OR
 786   - pm=RC1 OR pm=~RC1
 787 * fail-first
 788   - ff=lt/gt/le/ge/eq/ne/so/ns OR
 789   - ff=RC1 OR ff=~RC1
 790 * saturation:
 791   - sats
 792   - satu
 793 * map-reduce:
 794   - mr OR crm: "normal" map-reduce mode or CR-mode.
 795   - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
 796
 797 # Proposed Parallel-reduction algorithm
 798
```
# reference implementation of proposed SimpleV reduction semantics.
#
# `temp_pred` is a user-visible Vector Condition register (in this
# simplified form `pred` itself is updated in place).
#
# all input arrays have length `vl`
def reduce(vl, vec, pred):
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            other_pred = other < vl and pred[other]
            if pred[i] and other_pred:
                # reduction operation -- we still use this algorithm even
                # if the reduction operation isn't associative or
                # commutative.
                vec[i] += vec[other]
            elif other_pred:
                vec[i] = vec[other]
            pred[i] |= other_pred

# alternative version using a lookup table to skip nonpredicated elements
def reduce(vl, vec, pred):
    vi = [] # array of lookup indices to skip nonpredicated
    for i, pbit in enumerate(pred):
       if pbit:
           vi.append(i)
    step = 2
    while step <= vl:
        halfstep = step // 2
        for i in range(0, vl, step):
            other = vi[i + halfstep]
            i = vi[i]
            other_pred = other < vl and pred[other]
            if pred[i] and other_pred:
                vec[i] += vec[other]
            pred[i] |= other_pred
        step *= 2

```
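To see the order in which the first algorithm above combines elements,
the following sketch runs the same loop with a pluggable operation and
string operands.  The name `tree_reduce`, the `op` parameter, and
returning `None` when no element is predicated are illustrative
assumptions; the inputs are copied rather than modified in place.

```python
def tree_reduce(vl, vec, pred, op):
    """Run the proposed reduction schedule and return the final element 0."""
    vec = list(vec)
    pred = list(pred)
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            other_pred = other < vl and pred[other]
            if pred[i] and other_pred:
                vec[i] = op(vec[i], vec[other])   # the reduction operation
            elif other_pred:
                vec[i] = vec[other]               # move the lone operand down
            pred[i] = pred[i] or other_pred
    return vec[0] if pred[0] else None


def show(x, y):
    """Record the pairing order as a string instead of adding."""
    return f"({x}+{y})"


letters = ["a", "b", "c", "d", "e"]
print(tree_reduce(5, letters, [True] * 5, show))
print(tree_reduce(5, letters, [True, False, True, True, False], show))
```

With all elements predicated the result is `(((a+b)+(c+d))+e)`, and
with elements 1 and 4 masked out it is `(a+(c+d))`.  The pairing is
fixed by position, which is why the result is well defined even when
the operation is not associative or commutative.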