openpower/sv/svp64/appendix.mdwn

   1 # Appendix
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
   5
   6 This is the appendix to [[sv/svp64]], providing explanations of modes
   7 etc. leaving the main svp64 page's primary purpose as outlining the
   8 instruction format.
   9
  10 Table of contents:
  11
  12 [[!toc]]
  13
  14 # XER, SO and other global flags
  15
  16 Vector systems are expected to be high performance.  This is achieved
  17 through parallelism, which requires that elements in the vector be
  18 independent.  XER SO and other global "accumulation" flags (CR.OV) cause
  19 Read-Write Hazards on single-bit global resources, having a significant
  20 detrimental effect.
  21
  22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
  23 in `cmp` instructions).  XER is simply neither read nor written.
  24 This includes when `scalar identity behaviour` occurs.  If precise
  25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
  26 instructions should be used without an SV Prefix.
  27
  28 An interesting side-effect of this decision is that the OE flag is now
  29 free for other uses when SV Prefixing is used.
  30
Regarding XER.CA: this does not fit either, as it was designed for a scalar
ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
Vector element.  This provides a means to perform large parallel batches
of Vectorised carry-capable additions.  crweird instructions can be used
to transfer the CRs in and out of an integer, where bit manipulation
may be performed to analyse the carry bits (including carry lookahead
propagation) before continuing with further parallel additions.
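
The per-element carry scheme can be illustrated with a minimal sketch. It assumes 64-bit elements and models each element's CR.so bit as an entry in a plain Python list; the names `vec_add_carry` and `cr_so` are hypothetical and used for illustration only.

```python
MASK64 = (1 << 64) - 1

def vec_add_carry(ra, rb, cr_so):
    """Element-wise add: carry-in is read from, and carry-out written to,
    the per-element cr_so bit (modelled here as a list of 0/1)."""
    rt = []
    for i in range(len(ra)):
        total = ra[i] + rb[i] + cr_so[i]   # carry-in from this element's CR.so
        rt.append(total & MASK64)          # 64-bit element result
        cr_so[i] = total >> 64             # carry-out goes back into CR.so
    return rt

ra = [MASK64, 1, 5]
rb = [1, 2, 7]
cr_so = [0, 0, 1]
rt = vec_add_carry(ra, rb, cr_so)
```

No element reads another element's carry, so all additions in the batch are independent of each other; propagating carries between elements is a separate step performed on the collected CR bits.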
  38
  39 # v3.0B/v3.1B relevant instructions
  40
  41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
  42 CPU ISA.
  43
  44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
  45 alternative uses.  Additionally, Vectorisation of the VSX SIMD system
  46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
  47 at the very minimum, predication (which VSX was designed without).
  48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
  49 illegal instruction exceptions in SV Prefix Mode.
  50
Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
have: not only is their functionality already provided by SV, the SV
alternatives may be predicated as well, making them far better suited
to use in function calls and context-switching.
  55
  56 Additionally, some v3.0/1 instructions simply make no sense at all in a
  57 Vector context: `twi` and `tdi` fall into this category, as do branch
  58 operations as well as `sc` and `scv`.  Here there is simply no point
  59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
  60 should be called instead.
  61
  62 Fortuitously this leaves several Major Opcodes free for use by SV
  63 to fit alternative future instructions.  In a 3D context this means
  64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
  65 operations, and others critical to an efficient, effective 3D GPU and
  66 VPU ISA. With such instructions being included as standard in other
  67 commercially-successful GPU ISAs it is likewise critical that a 3D
  68 GPU/VPU based on svp64 also have such instructions.
  69
  70 Note however that svp64 is stand-alone and is in no way
  71 critically dependent on the existence or provision of 3D GPU or VPU
  72 instructions. These should be considered extensions, and their discussion
  73 and specification is out of scope for this document.
  74
  75 Note, again: this is *only* under svp64 prefixing.  Standard v3.0B /
  76 v3.1B is *not* altered by svp64 in any way.
  77
  78 ## Major opcode map (v3.0B)
  79
  80 This table is taken from v3.0B.
  81 Table 9: Primary Opcode Map (opcode bits 0:5)
  82
  83         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
  84     000 |        |       |  tdi  | twi   | EXT04 |        |       | mulli | 000
  85     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
  86     010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
  87     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
  88     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
  89     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
  90     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
  91     111 |  lq    | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
  92         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
  93
  94 ## Suitable for svp64
  95
  96 This is the same table containing v3.0B Primary Opcodes except those that
  97 make no sense in a Vectorisation Context have been removed.  These removed
  98 POs can, *in the SV Vector Context only*, be assigned to alternative
  99 (Vectorised-only) instructions, including future extensions.
 100
 101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
 102 change.  When not prefixed with svp64 these opcodes **specifically**
 103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
 104
 105         |  000   |   001 |  010  | 011   |  100  |    101 |  110  |  111
 106     000 |        |       |       |       |       |        |       | mulli | 000
 107     001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
 108     010 |        |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
 109     011 |  ori   | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
 110     100 |  lwz   | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
 111     101 |  lhz   | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
 112     110 |  lfs   | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
 113     111 |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
 114         |  000   |   001 |   010 |  011  |   100 |   101  | 110   |  111
 115
 116 # Single Predication
 117
This is a standard mode normally found in Vector ISAs.  Every element in every source Vector and in the destination uses the same bit of one single predicate mask.
 119
 120 Note however that in SVSTATE, implementors MUST increment both srcstep and dststep, and that the two must be equal at all times.
 121
 122 # Twin Predication
 123
 124 This is a novel concept that allows predication to be applied to a single
 125 source and a single dest register.  The following types of traditional
 126 Vector operations may be encoded with it, *without requiring explicit
 127 opcodes to do so*
 128
 129 * VSPLAT (a single scalar distributed across a vector)
 130 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
 131 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
 132 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
 133 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
 134
 135 Those patterns (and more) may be applied to:
 136
* mv (the usual way that V\* ISA operations are created)
* exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

This is a huge list that creates extremely powerful combinations,
particularly given that one of the predicate options is `(1<<r3)`.
 148
 149 Additional unusual capabilities of Twin Predication include a back-to-back
 150 version of VCOMPRESS-VEXPAND which is effectively the ability to do
 151 sequentially ordered multiple VINSERTs.  The source predicate selects a
 152 sequentially ordered subset of elements to be inserted; the destination
 153 predicate specifies the sequentially ordered recipient locations.
 154 This is equivalent to
 155 `llvm.masked.compressstore.*`
 156 followed by
 157 `llvm.masked.expandload.*`
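
As an illustration of how the patterns listed above fall out of a single twin-predicated `mv`, here is a minimal sketch. It assumes a flat Python list as the register file and integer bitmasks as predicates; `twin_pred_mv`, `srcstep` and `dststep` as local variables, and the `*_isvec` flags are hypothetical names for this sketch, not the specification's definition.

```python
def twin_pred_mv(regs, rt, rs, vl, src_mask, dst_mask,
                 rt_isvec=True, rs_isvec=True):
    """Copy elements from rs to rt.  Source and destination each skip
    their own masked-out elements independently."""
    srcstep = dststep = 0
    while True:
        # skip masked-out source elements (a scalar source never advances)
        while rs_isvec and srcstep < vl and not (src_mask >> srcstep) & 1:
            srcstep += 1
        # skip masked-out destination elements
        while dststep < vl and not (dst_mask >> dststep) & 1:
            dststep += 1
        if srcstep >= vl or dststep >= vl:
            break
        regs[rt + dststep] = regs[rs + srcstep]
        if not rt_isvec:
            break                 # scalar destination: one result only
        if rs_isvec:
            srcstep += 1
        dststep += 1
    return regs

# VCOMPRESS: sparse source predicate, all-ones destination predicate
regs = [10, 11, 12, 13] + [0] * 12
twin_pred_mv(regs, rt=8, rs=0, vl=4, src_mask=0b1010, dst_mask=0b1111)
compressed = regs[8:12]

# VEXPAND: all-ones source predicate, sparse destination predicate
regs = [10, 11, 12, 13] + [0] * 12
twin_pred_mv(regs, rt=8, rs=0, vl=4, src_mask=0b1111, dst_mask=0b1010)
expanded = regs[8:12]

# VSPLAT: scalar source, vector destination
regs = [42] + [0] * 15
twin_pred_mv(regs, rt=8, rs=0, vl=4, src_mask=0b1111, dst_mask=0b1111,
             rs_isvec=False)
splatted = regs[8:12]
```

Setting both predicates to sparse masks in the same call gives the back-to-back VCOMPRESS-VEXPAND behaviour: the source mask picks the ordered subset, the destination mask picks the ordered recipient slots.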
 158
 159
 160 # Rounding, clamp and saturate
 161
See [[av_opcodes]].
 163
 164 To help ensure that audio quality is not compromised by overflow,
 165 "saturation" is provided, as well as a way to detect when saturation
 166 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
 167 one CR per element in the result (Note: this is different from VSX which
 168 has a single CR per block).
 169
 170 When N=0 the result is saturated to within the maximum range of an
 171 unsigned value.  For integer ops this will be 0 to 2^elwidth-1. Similar
 172 logic applies to FP operations, with the result being saturated to
 173 maximum rather than returning INF, and the minimum to +0.0
 174
 175 When N=1 the same occurs except that the result is saturated to the min
 176 or max of a signed result, and for FP to the min and max value rather
 177 than returning +/- INF.
 178
 179 When Rc=1, the CR "overflow" bit is set on the CR associated with the
 180 element, to indicate whether saturation occurred.  Note that due to
 181 the hugely detrimental effect it has on parallel processing, XER.SO is
 182 **ignored** completely and is **not** brought into play here.  The CR
 183 overflow bit is therefore simply set to zero if saturation did not occur,
 184 and to one if it did.
 185
Note also that saturation on operations that produce a carry output is
prohibited, due to the conflicting use of the CR.so bit to record whether
saturation occurred.
 189
Post-analysis of the Vector of CRs to find out if any given element hit
saturation may be done using a mapreduced CR op (cror), or by using the
new crweird instruction, transferring the relevant CR bits to a scalar
integer and testing it for nonzero.  See [[sv/cr_int_predication]].
 194
 195 Note that the operation takes place at the maximum bitwidth (max of
 196 src and dest elwidth) and that truncation occurs to the range of the
 197 dest elwidth.
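
The integer case can be sketched as follows. This is a minimal illustration, assuming Python's unbounded integers stand in for "computation at the maximum bitwidth" and that the per-element CR overflow bit is modelled as a list entry; `saturate` and `vec_add_sat` are hypothetical helper names, and FP saturation is not modelled.

```python
def saturate(value, elwidth, signed):
    """Clamp value to the dest elwidth range; return (result, overflow)."""
    if signed:                      # N=1: signed range
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:                           # N=0: unsigned range
        lo, hi = 0, (1 << elwidth) - 1
    result = max(lo, min(hi, value))
    return result, int(result != value)

def vec_add_sat(ra, rb, elwidth, signed):
    """Saturating vector add producing one overflow bit per element."""
    results, cr_ov = [], []
    for a, b in zip(ra, rb):
        r, ov = saturate(a + b, elwidth, signed)
        results.append(r)
        cr_ov.append(ov)
    return results, cr_ov

unsigned_res, unsigned_ov = vec_add_sat([200, 10, 100], [100, 20, -300], 8, False)
signed_res, signed_ov = vec_add_sat([100, -100, 5], [100, -100, 5], 8, True)
any_saturated = any(signed_ov)      # the kind of test a reduced cror would give
```

The overflow bit is set purely from whether clamping changed the value of that one element, which mirrors the statement above that XER.SO plays no part.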
 198
 199 # Reduce mode
 200
 201 There are two variants here.  The first is when the destination is scalar
 202 and at least one of the sources is Vector.  The second is more complex
 203 and involves map-reduction on vectors.
 204
 205 The first defining characteristic distinguishing Scalar-dest reduce mode
 206 from Vector reduce mode is that Scalar-dest reduce issues VL element
 207 operations, whereas Vector reduce mode performs an actual map-reduce
 208 (tree reduction): typically `O(VL log VL)` actual computations.
 209
 210 The second defining characteristic of scalar-dest reduce mode is that it
 211 is, in simplistic and shallow terms *serial and sequential in nature*,
 212 whereas the Vector reduce mode is definitely inherently paralleliseable.
 213
 214 The reason why scalar-dest reduce mode is "simplistically" serial and
 215 sequential is that in certain circumstances (such as an `OR` operation
 216 or a MIN/MAX operation) it may be possible to parallelise the reduction.
 217
 218 ## Scalar result reduce mode
 219
 220 In this mode, one register is identified as being the "accumulator".
 221 Scalar reduction is thus categorised by:
 222
 223 * One of the sources is a Vector
 224 * the destination is a scalar
 225 * optionally but most usefully when one source register is also the destination
 226 * That the source register type is the same as the destination register
 227   type identified as the "accumulator".  scalar reduction on `cmp`,
 228   `setb` or `isel` is not possible for example because of the mixture
 229   between CRs and GPRs.
 230
Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements in the vector starting at r10.
 234
 235      # add RT, RA,RB but when RT==RA
 236      for i in range(VL):
 237           iregs[RA] += iregs[RB+i] # RT==RA
 238
 239 However, *unless* the operation is marked as "mapreduce", SV ordinarily
 240 **terminates** at the first scalar operation.  Only by marking the
 241 operation as "mapreduce" will it continue to issue multiple sub-looped
 242 (element) instructions in `Program Order`.
 243
 244 Other examples include shift-mask operations where a Vector of inserts
 245 into a single destination register is required, as a way to construct
 246 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
 247 Using the same register as both the source and destination, with Vectors
 248 of different offsets masks and values to be inserted has multiple
 249 applications including Video, cryptography and JIT compilation.
 250
 251 Subtract and Divide are still permitted to be executed in this mode,
 252 although from an algorithmic perspective it is strongly discouraged.
 253 It would be better to use addition followed by one final subtract,
 254 or in the case of divide, to get better accuracy, to perform a multiply
 255 cascade followed by a final divide.
 256
 257 Note that single-operand or three-operand scalar-dest reduce is perfectly
 258 well permitted: both still meet the qualifying characteristics that one
 259 source operand can also be the destination, which allows the "accumulator"
 260 to be identified.
 261
 262 ## Vector result reduce mode
 263
 264 1. limited to single predicated dual src operations (add RT, RA, RB).
 265    triple source operations are prohibited (fma).
2. limited to operations that make sense.  divide is excluded, as is
   subtract (X - Y - Z produces different answers depending on the order)
   and asymmetric CRops (crandc, crorc). sane operations:
   multiply, min/max, add, logical bitwise OR, most other CR ops.
   operations that do not have the same source and dest register type are
   also excluded (isel, cmp). operations involving carry or overflow
   (XER.CA / OV) are also prohibited.
 273 3. the destination is a vector but the result is stored, ultimately,
 274    in the first nonzero predicated element.  all other nonzero predicated
 275    elements are undefined. *this includes the CR vector* when Rc=1
 276 4. implementations may use any ordering and any algorithm to reduce
 277    down to a single result.  However it must be equivalent to a straight
 278    application of mapreduce.  The destination vector (except masked out
 279    elements) may be used for storing any intermediate results. these may
 280    be left in the vector (undefined).
 281 5. CRM applies when Rc=1.  When CRM is zero, the CR associated with
 282    the result is regarded as a "some results met standard CR result
 283    criteria". When CRM is one, this changes to "all results met standard
 284    CR criteria".
 285 6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
 286    in order to store sufficient state to resume operation should an
 287    interrupt occur. this is also why implementations are permitted to use
 288    the destination vector to store intermediary computations
 289 7. *Predication may be applied*.  zeroing mode is not an option.  masked-out
 290    inputs are ignored; masked-out elements in the destination vector are
 291    unaltered (not used for the purposes of intermediary storage); the
 292    scalar result is placed in the first available unmasked element.
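
Rules 3, 4 and 7 can be illustrated with a minimal sketch of one permitted ordering, a pairwise (tree) reduction. It assumes an associative operation, an integer bitmask predicate, and a Python list standing in for the destination vector; `tree_reduce` is a hypothetical name, and this is only one of the many orderings an implementation may choose.

```python
from functools import reduce
import operator

def tree_reduce(vec, op, mask):
    """Pairwise reduction over unmasked elements.  Returns a copy of vec
    with the result in the first unmasked element; other unmasked
    elements hold leftover intermediates, masked-out ones are untouched."""
    out = list(vec)
    idx = [i for i in range(len(vec)) if (mask >> i) & 1]
    while len(idx) > 1:
        nxt = []
        for j in range(0, len(idx) - 1, 2):
            # combine neighbouring active elements, keep result in the left one
            out[idx[j]] = op(out[idx[j]], out[idx[j + 1]])
            nxt.append(idx[j])
        if len(idx) % 2:
            nxt.append(idx[-1])     # odd element carried to the next round
        idx = nxt
    return out

vec = [1, 2, 3, 4, 5]
mask = 0b11011                      # element 2 is masked out
out = tree_reduce(vec, operator.add, mask)
sequential = reduce(operator.add, [1, 2, 4, 5])
```

For an associative operation the tree result in the first unmasked element equals the straight sequential reduction, which is the equivalence rule 4 asks for.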
 293
 294 Pseudocode for the case where RA==RB:
 295
    result = op(iregs[RA], iregs[RA+1])
    CR = analyse(result)
    for i in range(2, VL):
        result = op(result, iregs[RA+i])
        CRnew = analyse(result)
        if Rc == 1:
            if CRM:
                 CR = CR bitwise or CRnew
            else:
                 CR = CR bitwise AND CRnew
 306
 307 TODO: case where RA!=RB which involves first a vector of 2-operand
 308 results followed by a mapreduce on the intermediates.
 309
 310 Note that when SVM is clear and SUBVL!=1 the sub-elements are
 311 *independent*, i.e. they are mapreduced per *sub-element* as a result.
 312 illustration with a vec2:
 313
 314     result.x = op(iregs[RA].x, iregs[RA+1].x)
 315     result.y = op(iregs[RA].y, iregs[RA+1].y)
 316     for i in range(2, VL):
 317         result.x = op(result.x, iregs[RA+i].x)
 318         result.y = op(result.y, iregs[RA+i].y)
 319
 320 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
 321
 322 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
 323 subvector mode.  Example for a vec3:
 324
    for i in range(VL):
        result = op(iregs[RA+i].x, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result
 330
 331 In this mode, when Rc=1 the Vector of CRs is as normal: each result
 332 element creates a corresponding CR element.
 333
 334 # Fail-on-first
 335
Data-dependent fail-on-first has two distinct variants: one for LD/ST,
the other for arithmetic operations (actually, CR-driven).  Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.
 340
 341 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
 342   ordinary one.  Exceptions occur "as normal".  However for elements 1
 343   and above, if an exception would occur, then VL is **truncated** to the
 344   previous element.
 345 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
 346   CR-creating operation produces a result (including cmp).  Similar to
 347   branch, an analysis of the CR is performed and if the test fails, the
 348   vector operation terminates and discards all element operations at and
 349   above the current one, and VL is truncated to the *previous* element.
 350   Thus the new VL comprises a contiguous vector of results, all of which
 351   pass the testing criteria (equal to zero, less than zero).
 352
The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
but requires speculative execution involving modifications of VL
to get high performance implementations.  An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`.  The CR is stored (and the CR.eq bit tested).
If the CR.eq bit fails then the Vector is truncated and the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.
 361
 362 In CR-based data-driven fail-on-first there is only the option to select
 363 and test one bit of each CR (just as with branch BO).  For more complex
 364 tests this may be insufficient.  If that is the case, a vectorised crops
 365 (crand, cror) may be used, and ffirst applied to the crop instead of to
 366 the arithmetic vector.
 367
 368 One extremely important aspect of ffirst is:
 369
* LDST ffirst may never set VL equal to zero.  This is because on the first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV that VL may be set
  to zero (with the exception of via the SV.STATE SPR).  When VL is set
  to zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.
 378
 379 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
 380 to a nonzero value for any implementation-specific reason.  For example:
 381 it is perfectly reasonable for implementations to alter VL when ffirst
 382 LD or ST operations are initiated on a nonaligned boundary, such that
 383 within a loop the subsequent iteration of that loop begins subsequent
 384 ffirst LD/ST operations on an aligned boundary.  Likewise, to reduce
 385 workloads or balance resources.
 386
CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily.  This is because it is a precise test on which algorithms
will rely.
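
The CR-driven variant can be sketched in a few lines. This is a minimal illustration, assuming Python lists for the vectors and a callable `test` standing in for the single CR bit test; `ffirst_op` is a hypothetical name and the sketch ignores predication and elwidth.

```python
import operator

def ffirst_op(ra, rb, vl, op, test):
    """Apply op element by element in Program Order.  At the first element
    whose result fails `test`, discard it and truncate VL to the elements
    before it.  Returns (results, new_vl)."""
    results = []
    for i in range(vl):
        r = op(ra[i], rb[i])
        if not test(r):
            return results, i       # elements i and above are discarded
        results.append(r)
    return results, vl

nonzero = lambda r: r != 0

# third element produces zero, so VL is truncated to 2
res_a, vl_a = ffirst_op([5, 6, 7, 8], [1, 2, 7, 3], 4, operator.sub, nonzero)

# very first element fails, so VL becomes zero
res_b, vl_b = ffirst_op([3, 6], [3, 2], 2, operator.sub, nonzero)
```

With `vl_b == 0`, any following vector operation that loops over `range(VL)` does nothing, matching the "effectively nops" behaviour described above.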
 390
 391 # pred-result mode
 392
 393 This mode merges common CR testing with predication, saving on instruction
 394 count. Below is the pseudocode excluding predicate zeroing and elwidth
 395 overrides.
 396
 397     for i in range(VL):
 398         # predication test, skip all masked out elements.
 399         if predicate_masked_out(i):
 400              continue
 401         result = op(iregs[RA+i], iregs[RB+i])
 402         CRnew = analyse(result) # calculates eq/lt/gt
 403         # Rc=1 always stores the CR
 404         if Rc=1 or RC1:
 405             crregs[offs+i] = CRnew
 406         # now test CR, similar to branch
 407         if RC1 or CRnew[BO[0:1]] != BO[2]:
 408             continue # test failed: cancel store
 409         # result optionally stored but CR always is
 410         iregs[RT+i] = result
 411
 412 The reason for allowing the CR element to be stored is so that
 413 post-analysis of the CR Vector may be carried out.  For example:
 414 Saturation may have occurred (and been prevented from updating, by the
 415 test) but it is desirable to know *which* elements fail saturation.
 416
 417 Note that RC1 Mode basically turns all operations into `cmp`.  The
 418 calculation is performed but it is only the CR that is written. The
 419 element result is *always* discarded, never written (just like `cmp`).
 420
 421 Note that predication is still respected: predicate zeroing is slightly
 422 different: elements that fail the CR test *or* are masked out are zero'd.
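
A runnable sketch of the pseudocode above, extended with the predicate-zeroing rule just described, is given here. It assumes Rc=1 (so the CR is always stored), leaves RC1 mode and elwidth overrides out, models a CR as a small dict, and uses the hypothetical names `analyse_cr`, `pred_result`, `cr_bit` and `want` in place of the BO-style bit selection.

```python
import operator

def analyse_cr(result):
    """CR element for one result: eq/lt/gt tests against zero."""
    return {"eq": result == 0, "lt": result < 0, "gt": result > 0}

def pred_result(ra, rb, rt, mask, op, cr_bit, want, zeroing=False):
    """Store result[i] only if unmasked AND its CR bit equals `want`.
    With zeroing, elements that are masked out or fail the test get 0."""
    crs = [None] * len(ra)
    for i in range(len(ra)):
        if not (mask >> i) & 1:
            if zeroing:
                rt[i] = 0           # masked-out element is zeroed
            continue
        result = op(ra[i], rb[i])
        crs[i] = analyse_cr(result) # CR stored regardless of the test
        if crs[i][cr_bit] != want:
            if zeroing:
                rt[i] = 0           # failed the CR test: zeroed
            continue
        rt[i] = result
    return rt, crs

ra, rb = [5, 2, 7, 4], [1, 5, 3, 0]
rt_keep, crs = pred_result(ra, rb, [9, 9, 9, 9], 0b0111, operator.sub, "gt", True)
rt_zero, _ = pred_result(ra, rb, [9, 9, 9, 9], 0b0111, operator.sub, "gt", True,
                         zeroing=True)
```

Element 1 fails the test yet still has its CR recorded, which is what makes post-analysis of the CR Vector possible.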
 423
 424 ## pred-result mode on CR ops
 425
 426 Yes, really: CR operations (mtcr, crand, cror) may be Vectorised,
 427 predicated, and also pred-result mode applied to it.  In this case,
 428 the Vectorisation applies to the batch of 4 bits, i.e. it is not the CR
 429 individual bits that are treated as the Vector, but the CRs themselves
 430 (CR0, CR8, CR9...)
 431
 432 Thus after each Vectorised operation (crand) a test of the CR result
 433 can in fact be performed.
 434
 435 # CR Operations
 436
 437 CRs are slightly more involved than INT or FP registers due to the
 438 possibility for indexing individual bits (crops BA/BB/BT).  Again however
 439 the access pattern needs to be understandable in relation to v3.0B / v3.1B
 440 numbering, with a clear linear relationship and mapping existing when
 441 SV is applied.
 442
 443 ## CR EXTRA mapping table and algorithm
 444
 445 Numbering relationships for CR fields are already complex due to being
 446 in BE format (*the relationship is not clearly explained in the v3.0B
 447 or v3.1B specification*).  However with some care and consideration
 448 the exact same mapping used for INT and FP regfiles may be applied,
 449 just to the upper bits, as explained below.
 450
 451 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits.  The top 3 bits (0:2)
 452 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
 453 *in* that CR.  The numbering was determined (after 4 months of
 454 analysis and research) to be as follows:
 455
 456     CR_index = 7-(BA>>2)      # top 3 bits but BE
 457     bit_index = 3-(BA & 0b11) # low 2 bits but BE
 458     CR_reg = CR{CR_index}     # get the CR
 459     # finally get the bit from the CR.
 460     CR_bit = (CR_reg & (1<<bit_index)) != 0
 461
 462 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
 463 applies, **not** the CR\_bit portion (bits 3:4):
 464
 465     if extra3_mode:
 466         spec = EXTRA3
 467     else:
 468         spec = EXTRA2<<1 | 0b0
 469     if spec[0]:
 470        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
 471        return ((BA >> 2)<<6) | # hi 3 bits shifted up
 472               (spec[1:2]<<4) | # to make room for these
 473               (BA & 0b11)      # CR_bit on the end
 474     else:
 475        # scalar constructs "00 spec[1:2] BA[0:4]"
 476        return (spec[1:2] << 5) | BA
 477
Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified as follows:
 480
    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
 493
 494 Note here that the decoding pattern to determine CR\_bit does not change.
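
The decode above can be transcribed directly into runnable form. This sketch assumes `spec` is a 3-bit value in MSB0 numbering, so `spec[0]` is its top bit (the vector flag) and `spec[1:2]` its low two bits; the function name `decode_cr_operand` and its argument names are hypothetical.

```python
def decode_cr_operand(ba, extra, extra3_mode):
    """Return (CR_index, bit_index) for a 5-bit BA/BB/BT field combined
    with its EXTRA2 or EXTRA3 bits, following the pseudocode above."""
    spec = extra if extra3_mode else (extra << 1)   # EXTRA2 gains a zero LSB
    is_vec = (spec >> 2) & 1                        # spec[0]
    spec12 = spec & 0b11                            # spec[1:2]
    cr_index = 7 - (ba >> 2)                        # top 3 bits but BE
    if is_vec:
        cr_index = (cr_index << 4) | (spec12 << 2)  # 0-124 in steps of 4
    else:
        cr_index = (spec12 << 3) | cr_index         # 0-31 in steps of 1
    bit_index = 3 - (ba & 0b11)                     # low 2 bits but BE
    return cr_index, bit_index

plain = decode_cr_operand(0b00000, 0b000, True)     # no SV extension bits
scalar_hi = decode_cr_operand(0b11111, 0b011, True) # scalar, spec[1:2]=3
vector = decode_cr_operand(0b11100, 0b101, True)    # vector, spec[1:2]=1
extra2 = decode_cr_operand(0b11000, 0b10, False)    # EXTRA2 vector flag set
```

With all EXTRA bits zero the function returns exactly the v3.0B values, and `bit_index` is computed the same way in every branch.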
 495
Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15).  This is to greatly
simplify internal design.  If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.
 500
 501 ## CR fields as inputs/outputs of vector operations
 502
 503 CRs (or, the arithmetic operations associated with them)
 504 may be marked as Vectorised or Scalar.  When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised.  Likewise if the destination is scalar then so is the CR.
 505
 506 When vectorized, the CR inputs/outputs are sequentially read/written
 507 to 4-bit CR fields.  Vectorised Integer results, when Rc=1, will begin
 508 writing to CR8 (TBD evaluate) and increase sequentially from there.
 509 This is so that:
 510
 511 * implementations may rely on the Vector CRs being aligned to 8. This
 512   means that CRs may be read or written in aligned batches of 32 bits
 513   (8 CRs per batch), for high performance implementations.
 514 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
 515   overwritten by vector Rc=1 operations except for very large VL
 516 * CR-based predication, from CR32, is also not interfered with
 517   (except by large VL).
 518
 519 However when the SV result (destination) is marked as a scalar by the
 520 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
 521 CR when Rc=1 is written to.  This is CR0 for integer operations and CR1
 522 for FP operations.
 523
 524 Note that yes, the CRs are genuinely Vectorised.  Unlike in SIMD VSX which
 525 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
 526 v3.0B scalar operations produce a **tuple** of element results: the
 527 result of the operation as one part of that element *and a corresponding
 528 CR element*.  Greatly simplified pseudocode:
 529
    for i in range(VL):
         # calculate the vector result of an add
         iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
         # now calculate CR bits
         CRs{8+i}.eq = iregs[RT+i] == 0
         CRs{8+i}.gt = iregs[RT+i] > 0
         ... etc
 534
If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so.  This provides far
more flexibility in analysing vectors than standard Vector ISAs.  Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations (see [[sv/cr_int_predication]]).
 543
Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.
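
The two-step approach (a vector operation producing one CR per element, then a separate reduction over those CRs) can be sketched as below. It assumes CRs are modelled as small dicts and that the follow-up reduction is a plain left-to-right fold; `vec_add_rc1` and `cr_reduce` are hypothetical names for illustration.

```python
def vec_add_rc1(ra, rb):
    """Vector add with Rc=1: returns (results, crs), one CR per element."""
    results = [a + b for a, b in zip(ra, rb)]
    crs = [{"eq": r == 0, "gt": r > 0, "lt": r < 0} for r in results]
    return results, crs

def cr_reduce(crs, bit, op):
    """Follow-up reduction over one bit of each CR in the vector."""
    acc = crs[0][bit]
    for cr in crs[1:]:
        acc = op(acc, cr[bit])
    return acc

results, crs = vec_add_rc1([1, -2, 3], [-1, 2, 4])
all_zero = cr_reduce(crs, "eq", lambda a, b: a and b)   # crand-style
any_zero = cr_reduce(crs, "eq", lambda a, b: a or b)    # cror-style
any_positive = cr_reduce(crs, "gt", lambda a, b: a or b)
```

Because the reduction is its own instruction, the element additions never wait on a shared cumulative flag.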
 548
 549 (see [[discussion]].  some alternative schemes are described there)
 550
 551 ## Rc=1 when SUBVL!=1
 552
 553 sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of
 554 predicate is allocated per subvector; likewise only one CR is allocated
 555 per subvector.
 556
This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements.  A solution
is to perform a bitwise OR or AND of the subvector tests.  Given that
OE is ignored, this field may (when available) be used to select OR or
AND behavior.
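
A minimal sketch of that proposed solution follows. It assumes each sub-element is tested against zero in the usual eq/gt/lt manner and that a single flag chooses between OR and AND; the `use_and` parameter is a hypothetical stand-in for whichever field ends up making that selection.

```python
def subvector_cr(sub_results, use_and):
    """Combine per-sub-element tests into the single CR of one subvector."""
    tests = [{"eq": r == 0, "gt": r > 0, "lt": r < 0} for r in sub_results]
    combine = all if use_and else any        # AND or OR across sub-elements
    return {bit: combine(t[bit] for t in tests) for bit in ("eq", "gt", "lt")}

vec3 = (0, 5, 0)
cr_or = subvector_cr(vec3, use_and=False)
cr_and = subvector_cr(vec3, use_and=True)
```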
 562
 563 ### Table of CR fields
 564
 565 CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
 566 so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
 567
CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=64, elwidth=8 to do so.  This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.
 573
 574 The 64 SV CRs are arranged similarly to the way the 128 integer registers
 575 are arranged.  TODO a python program that auto-generates a CSV file
 576 which can be included in a table, which is in a new page (so as not to
 577 overwhelm this one). [[svp64/cr_names]]
 578
 579 # Register Profiles
 580
 581 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
 582 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
 583
 584 Instructions are broken down by Register Profiles as listed in the
 585 following auto-generated page: [[opcode_regs_deduped]].  "Non-SV"
 586 indicates that the operations with this Register Profile cannot be
 587 Vectorised (mtspr, bc, dcbz, twi)
 588
 589 TODO generate table which will be here [[svp64/reg_profiles]]
 590
# SV pseudocode illustration
 592
 593 ## Single-predicated Instruction
 594
Illustration of a normal mode add operation: zeroing not included, elwidth
overrides not included.  If there is no predicate, it is set to all 1s.
 597
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec)  { irs1 += 1; }
        if (rs2.isvec)  { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }
 610
 611 This has several modes:
 612
* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s
 615
All of these may be predicated.  Vector-Vector is straightforward.
When one source is a Vector and the other a Scalar, it is clear that
each element of the Vector source should be added to the Scalar source,
each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).
 621
 622 The one that is not obvious is RT=vector but both RA/RB=scalar.
 623 Here this acts as a "splat scalar result", copying the same result into
 624 all nonpredicated result elements.  If a fixed destination scalar was
 625 intended, then an all-Scalar operation should be used.
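
The modes listed above can be tried out with a runnable rendering of the `op_add` pseudocode. This is a sketch under stated assumptions: the register file is a flat Python list, the vector/scalar markings are passed as hypothetical `*_isvec` keyword arguments, and the SVSTATE context-save step is left out.

```python
def op_add(ireg, rd, rs1, rs2, vl, predval=None,
           rd_isvec=True, rs1_isvec=True, rs2_isvec=True):
    """Single-predicated add; scalar operands never advance their index."""
    if predval is None:
        predval = (1 << vl) - 1          # no predicate: all 1s
    d = s1 = s2 = 0
    for i in range(vl):
        if predval & (1 << i):
            ireg[rd + d] = ireg[rs1 + s1] + ireg[rs2 + s2]
            if not rd_isvec:
                break                    # scalar dest: first result only
        if rd_isvec:
            d += 1
        if rs1_isvec:
            s1 += 1
        if rs2_isvec:
            s2 += 1
    return ireg

# RT.v = RA.s RB.s : the scalar result is splatted across the vector
splat = op_add(list(range(16)), rd=8, rs1=1, rs2=2, vl=4,
               rs1_isvec=False, rs2_isvec=False)

# RT.v = RA.v RB.s with element 2 masked out
vec_scalar = op_add(list(range(16)), rd=8, rs1=0, rs2=4, vl=4,
                    predval=0b1011, rs2_isvec=False)

# RT.s = RA.v RB.v : only the first unmasked element produces a result
scalar_dest = op_add(list(range(16)), rd=8, rs1=0, rs2=4, vl=4,
                     predval=0b1100, rd_isvec=False)
```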
 626
 627 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
 628
 629 # Assembly Annotation
 630
 631 Assembly code annotation is required for SV to be able to successfully
 632 mark instructions as "prefixed".
 633
 634 A reasonable (prototype) starting point:
 635
 636     svp64 [field=value]*
 637
 638 Fields:
 639
 640 * ew=8/16/32 - element width
 641 * sew=8/16/32 - source element width
 642 * vec=2/3/4 - SUBVL
 643 * mode=reduce/satu/sats/crpred
 644 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
 645 * spred={reg spec}
 646
 647 similar to x86 "rex" prefix.
 648
 649 For actual assembler:
 650
 651     sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
 652
 653 Qualifiers:
 654
 655 * m={pred}: predicate mask mode
 656 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
 657 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
 658 * ew={N}: ew=8/16/32 - sets elwidth override
 659 * sw={N}: sw=8/16/32 - sets source elwidth override
 660 * ff={xx}: see fail-first mode
 661 * pr={xx}: see predicate-result mode
 662 * sat{x}: satu / sats - see saturation mode
 663 * mr: see map-reduce mode
 664 * mr.svm see map-reduce with sub-vector mode
 665 * crm: see map-reduce CR mode
 666 * crm.svm see map-reduce CR with sub-vector mode
 667 * sz: predication with source-zeroing
 668 * dz: predication with dest-zeroing
 669
 670 For modes:
 671
* pred-result:
  - pr=lt/gt/le/ge/eq/ne/so/ns OR
  - pr=RC1 OR pr=~RC1
 675 * fail-first
 676   - ff=lt/gt/le/ge/eq/ne/so/ns OR
 677   - ff=RC1 OR ff=~RC1
 678 * saturation:
 679   - sats
 680   - satu
 681 * map-reduce:
 682   - mr OR crm: "normal" map-reduce mode or CR-mode.
 683   - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
 684