openpower/sv/remap.mdwn

# REMAP <a name="remap" />
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=143> matrix multiply
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=867> add svindex
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=885> svindex in simulator
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=911> offset svshape option
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=864> parallel reduction
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=930> DCT/FFT "strides"
  10 * see [[sv/remap/appendix]] for examples and usage
  11 * see [[sv/propagation]] for a future way to apply REMAP
  12 * [[remap/discussion]]
  13 <!-- show -->
  14
  15 REMAP is an advanced form of Vector "Structure Packing" that provides
  16 hardware-level support for commonly-used *nested* loop patterns that would
  17 otherwise require full inline loop unrolling.  For more general reordering
  18 an Indexed REMAP mode is available (a RISC-paradigm
  19 abstracted analog to `xxperm`).
  20
  21 REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped"
  22 (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset"
  23 to permit arbitrary access to elements, independently on each
  24 Vector src or dest register. Up to four separate independent REMAPs may be applied
  25 to the registers of any instruction.
  26
  27 A normal Vector Add:
  28
```
   for i in range(VL):
     GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
```
  33
  34 A Hardware-assisted REMAP Vector Add:
  35
```
   for i in range(VL):
     GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
```
  40
Aside from Indexed REMAP, this is entirely Hardware-accelerated reordering and
is consequently not costly in terms of register access for the Indices. It will however
place a burden on Multi-Issue systems, but no more than if the equivalent
Scalar instructions were explicitly loop-unrolled without SVP64, and
some advanced implementations may even find the Deterministic nature of
the Scheduling to be easier on resources.
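
The REMAP Vector Add above can be modelled in a few lines of Python. This is a minimal sketch only: `gpr` is a plain list standing in for the register file, and the three `remap` callables are hypothetical stand-ins for whatever Schedule each SVSHAPE would produce.

```python
def remap_vector_add(gpr, rt, ra, rb, vl, remap1, remap2, remap3):
    """Element-by-element add with an independent remap per operand."""
    for i in range(vl):
        gpr[rt + remap1(i)] = gpr[ra + remap2(i)] + gpr[rb + remap3(i)]


def identity(i):
    """No remapping: the usual linear 0..VL-1 order."""
    return i


# Illustrative register file: a flat list of 32 integers.
gpr = [0] * 32
gpr[8:12] = [1, 2, 3, 4]          # vector at RA=8
gpr[16:20] = [10, 20, 30, 40]     # vector at RB=16
vl = 4

# Reverse the element order of RA only; RT and RB stay linear.
remap_vector_add(gpr, 0, 8, 16, vl, identity, lambda i: vl - 1 - i, identity)
result = gpr[0:4]
```

Because each operand takes its own remap function, the same loop body covers every combination of reshaped sources and destinations without unrolling.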
  48
*Hardware note: in its general form, REMAP is quite expensive to set up, and on some
implementations may introduce latency, so should realistically be used
only where it is worthwhile.  However, given that up to 127 operations
can be Deterministically issued from a single instruction, REMAP should
not be dismissed on the grounds of possible latency alone.  Commonly-used
patterns such as Matrix Multiply, DCT and FFT have helper instruction
options which make REMAP easier to use.*
  57
  58 There are five types of REMAP:
  59
  60 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
  61   Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
  62   Matrix Multiply (a future variant may introduce Inner Product).
  63 * **FFT/DCT**, with full triple-loop in-place support: limited to
  64   Power-2 RADIX
  65 * **Indexing**, for any general-purpose reordering, also includes
  66   limited 2D reshaping as well as Element "offsetting".
  67 * **Parallel Reduction**, for scheduling a sequence of operations
  68   in a Deterministic fashion, in a way that may be parallelised,
  69   to reduce a Vector down to a single value.
  70 * **Parallel Prefix Sum**, implemented as a work-efficient Schedule,
  71   has several key Computer Science uses. Again Prefix Sum is 100%
  72   Deterministic.
  73
  74 Best implemented on top of a Multi-Issue Out-of-Order Micro-architecture,
  75 REMAP Schedules are 100% Deterministic **including Indexing** and are
  76 designed to be incorporated in between the Decode and Issue phases,
  77 directly into Register Hazard Management.
  78
  79 As long as the SVSHAPE SPRs
  80 are not written to directly, Hardware may treat REMAP as 100%
  81 Deterministic: all REMAP Management instructions take static
  82 operands (no dynamic register operands)
  83 with the exception of Indexed Mode, and even then
  84 Architectural State is permitted to assume that the Indices
  85 are cacheable from the point at which the `svindex` instruction
  86 is executed.
  87
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].
  90
  91 *Future specification note: future versions of the REMAP Management instructions
  92 will extend to EXT1xx Prefixed variants. This will overcome some of the limitations
  93 present in the 32-bit variants of the REMAP Management instructions that at
  94 present require direct writing to SVSHAPE0-3 SPRs.  Additional
  95 REMAP Modes may also be introduced at that time.*
  96
## Determining Register Hazards (hphint)
  98
  99 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
 100 to be able to statically determine the extent of Vectors in order to
 101 allocate pre-emptive Hazard protection.  The next task is to eliminate
 102 masked-out elements using predicate bits, freeing up the associated
 103 Hazards.
 104
For non-REMAP situations `VL` is sufficient to ascertain early
Hazard coverage, and with SVSTATE being a high priority cached
quantity at the same level as MSR and PC this is not a problem.
 108
 109 The problems come when REMAP is enabled.  Indexed REMAP must instead
 110 use `MAXVL` as the earliest (simplest)
 111 batch-level Hazard Reservation indicator (after taking element-width
 112 overriding on the Index source into consideration),
 113 but Matrix, FFT and Parallel Reduction must all use completely different
 114 schemes.  The reason is that VL is used to step through the total
 115 number of *operations*, not the number of registers.
 116 The "Saving Grace" is that all of the REMAP Schedules are 100% Deterministic.
 117
 118 Advance-notice Parallel computation and subsequent cacheing
 119 of all of these complex Deterministic REMAP Schedules is
 120 *strongly recommended*, thus allowing clear and precise multi-issue
 121 batched Hazard coverage to be deployed, *even for Indexed Mode*.
 122 This is only possible for Indexed due to the strict guidelines
 123 given to Programmers.
 124
In short, there exist solutions to the problem of Hazard Management,
with varying degrees of refinement possible at correspondingly
increasing levels of complexity in hardware.
 128
A reminder: when Rc=1 each result register (element) has an associated
co-result CR Field (one per result element).  Thus when determining
the Write-Hazards for result registers as described above, the
Write-Hazards for the associated co-result CR Fields must not be
forgotten, *including* when Predication is used.
 134
 135 **Horizontal-Parallelism Hint**
 136
 137 To help further in reducing Hazards,
 138 `SVSTATE.hphint` is an indicator to hardware of how many elements are 100%
 139 fully independent.  Hardware is permitted to assume that groups of elements
 140 up to `hphint` in size need not have Register (or Memory) Hazards created
 141 between them, including when `hphint > VL`, which greatly aids simplification of
 142 Multi-Issue implementations.
 143
 144 If care is not taken in setting `hphint` correctly it may wreak havoc.
 145 For example Matrix Outer Product relies on the innermost loop computations
 146 being independent.  If `hphint` is set to greater than the Outer Product
 147 depth then data corruption is guaranteed to occur.
 148
Likewise on FFTs it is assumed that each layer of the RADIX2 triple-loop
is independent, but that there are strict *inter-layer* Register Hazards.
Therefore if `hphint` is set to greater than the RADIX2 width of the FFT,
data corruption is guaranteed.
 153
 154 Thus the key message is that setting `hphint` requires in-depth knowledge
 155 of the REMAP Algorithm Schedules, given in the Appendix.
 156
## REMAP area of SVSTATE SPR
 158
 159 The following bits of the SVSTATE SPR are used for REMAP:
 160
```
    |32:33|34:35|36:37|38:39|40:41| 42:46 | 62     |
    | --  | --  | --  | --  | --  | ----- | ------ |
    |mi0  |mi1  |mi2  |mo0  |mo1  | SVme  | RMpst  |
```
 166
 167 mi0-2 and mo0-1 each select SVSHAPE0-3 to apply to a given register.
 168 mi0-2 apply to RA, RB, RC respectively, as input registers, and
 169 likewise mo0-1 apply to output registers (RT/FRT, RS/FRS) respectively.
 170 SVme is 5 bits (one for each of mi0-2/mo0-1) and indicates whether the
 171 SVSHAPE is actively applied or not, and if so, to which registers.
 172
 173 * bit 4 of SVme indicates if mi0 is applied to source RA / FRA / BA / BFA / RT / FRT
 174 * bit 3 of SVme indicates if mi1 is applied to source RB / FRB / BB
 175 * bit 2 of SVme indicates if mi2 is applied to source RC / FRC / BC
 176 * bit 1 of SVme indicates if mo0 is applied to result RT / FRT / BT / BF
 177 * bit 0 of SVme indicates if mo1 is applied to result Effective Address / FRS / RS
 178   (LD/ST-with-update has an implicit 2nd write register, RA)
 179
 180 The "persistence" bit if set will result in all Active REMAPs being applied
 181 indefinitely.
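
As an illustration, the REMAP fields can be extracted from a 64-bit SVSTATE value as sketched below. The sketch assumes Power ISA MSB0 numbering (bit 0 is the most significant bit of the 64-bit SPR); the helper names `msb0_field` and `decode_remap_area` are illustrative and not part of the specification.

```python
def msb0_field(value, start, end, width=64):
    """Extract bits start..end (inclusive, MSB0 numbering) from value."""
    nbits = end - start + 1
    shift = width - 1 - end
    return (value >> shift) & ((1 << nbits) - 1)


def decode_remap_area(svstate):
    """Return the REMAP-related fields of SVSTATE as a dict."""
    return {
        "mi0": msb0_field(svstate, 32, 33),
        "mi1": msb0_field(svstate, 34, 35),
        "mi2": msb0_field(svstate, 36, 37),
        "mo0": msb0_field(svstate, 38, 39),
        "mo1": msb0_field(svstate, 40, 41),
        "SVme": msb0_field(svstate, 42, 46),
        "RMpst": msb0_field(svstate, 62, 62),
    }


# Example value: mi1=2, SVme=0b11111, persistence set.
example = (2 << (63 - 35)) | (0b11111 << (63 - 46)) | (1 << (63 - 62))
fields = decode_remap_area(example)
```

The individual bits of `SVme` are deliberately left undecoded here, since that would require a further assumption about bit ordering within the field.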
 182
 183 -----------
 184
 185 \newpage{}
 186
# svremap instruction <a name="svremap"> </a>
 188
 189 SVRM-Form:
 190
|0     |6     |11  |13   |15   |17   |19   |21    | 22:25 |26:31  |
| --   | --   | -- | --  | --  | --  | --  | --   | ----  | ----- |
| PO   | SVme |mi0 | mi1 | mi2 | mo0 | mo1 | pst  | rsvd  | XO    |
 194
 195 * svremap SVme,mi0,mi1,mi2,mo0,mo1,pst
 196
 197 Pseudo-code:
 198
```
    # registers RA RB RC RT EA/FRS SVSHAPE0-3 indices
    SVSTATE[32:33] <- mi0
    SVSTATE[34:35] <- mi1
    SVSTATE[36:37] <- mi2
    SVSTATE[38:39] <- mo0
    SVSTATE[40:41] <- mo1
    # enable bit for RA RB RC RT EA/FRS
    SVSTATE[42:46] <- SVme
    # persistence bit (applies to more than one instruction)
    SVSTATE[62] <- pst
```
 211
 212 Special Registers Altered:
 213
```
    SVSTATE
```
 217
`svremap` determines the relationship between registers and SVSHAPE SPRs.
The bitmask `SVme` determines which registers have a REMAP applied, and mi0-mo1
determine which shape is applied to an activated register.  The `pst` bit, if
cleared, indicates that the REMAP operation shall only apply to the immediately-following
instruction.  If set then REMAP remains permanently enabled until such time as it is
explicitly disabled, either by `setvl` setting a new MAXVL, or with another
`svremap` instruction. `svindex` and `svshape2` are also capable of setting or
clearing persistence, as well as partially covering a subset of the capability of
`svremap` to set register-to-SVSHAPE relationships.
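
The pseudo-code above can be modelled as a pure function on a 64-bit integer. This is a sketch under the assumption of MSB0 bit numbering; `set_msb0_field` is a hypothetical helper, and the Python function `svremap` merely mirrors the operand order of the instruction.

```python
def set_msb0_field(value, start, end, field, width=64):
    """Return value with bits start..end (MSB0, inclusive) replaced by field."""
    nbits = end - start + 1
    shift = width - 1 - end
    mask = ((1 << nbits) - 1) << shift
    return (value & ~mask) | ((field << shift) & mask)


def svremap(svstate, SVme, mi0, mi1, mi2, mo0, mo1, pst):
    """Model of the svremap pseudo-code: returns the updated SVSTATE."""
    for start, end, field in (
        (32, 33, mi0), (34, 35, mi1), (36, 37, mi2),
        (38, 39, mo0), (40, 41, mo1),
        (42, 46, SVme),   # per-register enable bits
        (62, 62, pst),    # persistence
    ):
        svstate = set_msb0_field(svstate, start, end, field)
    return svstate


# Enable all five REMAPs persistently, spreading SVSHAPE0-3 over the operands.
new_state = svremap(0, 0b11111, 0, 1, 2, 3, 0, 1)
```

All other SVSTATE bits pass through untouched, which matches the pseudo-code only ever assigning to bits 32 to 46 and bit 62.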
 227
 228 Programmer's Note: applying non-persistent `svremap` to an instruction that has
 229 no REMAP enabled or is a Scalar operation will obviously have no effect but
 230 the bits 32 to 46 will at least have been set in SVSTATE. This may prove useful
 231 when using `svindex` or `svshape2`.
 232
 233 Hardware Architectural Note: when persistence is not set it is critically important
 234 to treat the `svremap` and the following SVP64 instruction as an indivisible fused operation.
 235 *No state* is stored in the SVSTATE SPR in order to allow continuation should an
 236 Interrupt occur between the two instructions. Thus, Interrupts must be prohibited
 237 from occurring or other workaround deployed.  When persistence is set this issue
 238 is moot.
 239
 240 It is critical to note that if persistence is clear then `svremap` is the *only* way
 241 to activate REMAP on any given (following) instruction.  If persistence is set however then
 242 **all** SVP64 instructions go through REMAP as long as `SVme` is non-zero.
 243
 244 -------------
 245
 246 \newpage{}
 247
# SHAPE Remapping SPRs
 249
There are four "shape" SPRs, SVSHAPE0-3, each 32 bits wide,
all sharing the same format.

When an SVSHAPE SPR is set entirely to zeros, remapping is
disabled: the register's elements are a linear (1D) vector.
 255
|0:5   |6:11  | 12:17   | 18:20   | 21:23   |24:27 |28:29  |30:31| Mode  |
|----- |----- | ------- | ------- | ------  |------|------ |---- | ----- |
|xdimsz|ydimsz| zdimsz  | permute | invxyz  |offset|skip   |mode |Matrix |
|xdimsz|ydimsz|SVGPR    | 11/     |sk1/invxy|offset|elwidth|0b00 |Indexed|
|xdimsz|mode  | zdimsz  | submode2| invxyz  |offset|submode|0b01 |DCT/FFT|
| rsvd |rsvd  |xdimsz   | rsvd    | invxyz  |offset|submode|0b10 |Red/Sum|
|      |      |         |         |         |      |       |0b11 |rsvd   |
 263
 264 `mode` sets different behaviours (straight matrix multiply, FFT, DCT).
 265
 266 * **mode=0b00** sets straight Matrix Mode
 267 * **mode=0b00** with permute=0b110 or 0b111 sets Indexed Mode
 268 * **mode=0b01** sets "FFT/DCT" mode and activates submodes
 269 * **mode=0b10** sets "Parallel Reduction or Prefix-Sum" Schedules.
 270
 271 *Architectural Resource Allocation note: the four SVSHAPE SPRs are best
 272 allocated sequentially and contiguously in order that `sv.mtspr` may
 273 be used. This is safe to do as long as `SVSTATE.SVme=0`*
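
A minimal sketch of how the Mode might be identified from a raw 32-bit SVSHAPE value, following the table and bullet list above. It assumes MSB0 numbering over 32 bits; `shape_field` and `classify_svshape` are illustrative names only.

```python
def shape_field(shape, start, end):
    """Extract bits start..end (inclusive, MSB0 over 32 bits)."""
    nbits = end - start + 1
    return (shape >> (31 - end)) & ((1 << nbits) - 1)


def classify_svshape(shape):
    """Name the REMAP Mode selected by a 32-bit SVSHAPE value."""
    if shape == 0:
        return "disabled"            # linear (1D) vector
    mode = shape_field(shape, 30, 31)
    permute = shape_field(shape, 18, 20)
    if mode == 0b00:
        # permute values 6 and 7 are repurposed to select Indexed Mode
        return "Indexed" if permute in (0b110, 0b111) else "Matrix"
    if mode == 0b01:
        return "DCT/FFT"
    if mode == 0b10:
        return "Red/Sum"
    return "reserved"


# xdimsz=2, ydimsz=1 (a 3x2 array), permute=0, mode=0b00
matrix_3x2 = (2 << (31 - 5)) | (1 << (31 - 11))
indexed = 0b110 << (31 - 20)
```

The check on `permute` has to come before Matrix Mode is assumed, because both share `mode=0b00`.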
 274
## Parallel Reduction / Prefix-Sum Mode
 276
 277 Creates the Schedules for Parallel Tree Reduction and Prefix-Sum
 278
 279 * **submode=0b00** selects the left operand index for Reduction
 280 * **submode=0b01** selects the right operand index for Reduction
 281 * **submode=0b10** selects the left operand index for Prefix-Sum
 282 * **submode=0b11** selects the right operand index for Prefix-Sum
 283
* When bit 0 of `invxyz` is set, the order of the indices
  in the inner for-loop is reversed. This has the side-effect
  of placing the final reduced result in the last-predicated element.
  It also has the indirect side-effect of swapping the source
  registers: Left-operand index numbers will always exceed
  Right-operand indices.
  When clear, the reduced result will be in the first-predicated
  element, and Left-operand indices will always be *less* than
  Right-operand ones.
* When bit 1 of `invxyz` is set, the order of the outer loop
  step is inverted: stepping begins at the nearest power-of-two
  to half of the vector length and reduces by half each time.
  When clear the step will begin at 2 and double on each
  inner loop.
 298
 299 **Parallel Prefix Sum**
 300
This is a work-efficient Parallel Schedule that for example produces Triangular
 302 or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
 303 to Parallel Reduction.  Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
 304 end-result, implementations may only implement Mapreduce in serial form (or give
 305 the appearance to Programmers of the same). The Parallel Prefix Schedule is
 306 *required* to be implemented in such a way that its Deterministic Schedule may be
 307 parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
 308 may be used with non-commutative operations.
 309 The Schedule Algorithm may be found in the [[sv/remap/appendix]]
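
To illustrate what a work-efficient Prefix Schedule looks like, the sketch below uses a generic textbook up-sweep/down-sweep (Brent-Kung style) scan. It is **not** the normative Schedule, which is defined in the appendix: the function names and the exact ordering of operations are assumptions made purely for illustration.

```python
def prefix_sum_schedule(n):
    """Yield (dest, left) pairs meaning vec[dest] = op(vec[left], vec[dest])."""
    step = 1
    while step < n:                      # up-sweep: build partial results
        for i in range(2 * step - 1, n, 2 * step):
            yield i, i - step
        step *= 2
    step //= 2
    while step >= 1:                     # down-sweep: fill in the gaps
        for i in range(3 * step - 1, n, 2 * step):
            yield i, i - step
        step //= 2


def prefix_scan(vec, op):
    """Apply the schedule in place; operand order is kept left-to-right."""
    for dest, left in prefix_sum_schedule(len(vec)):
        vec[dest] = op(vec[left], vec[dest])
    return vec


triangular = prefix_scan([1, 2, 3, 4, 5, 6], lambda a, b: a + b)
factorial = prefix_scan([1, 2, 3, 4, 5, 6], lambda a, b: a * b)
joined = prefix_scan(list("abcdef"), lambda a, b: a + b)
```

The string example uses a non-commutative operation (concatenation): because the left operand always comes from the lower index, the element order is preserved.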
 310
 311 **Parallel Reduction**
 312
 313 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
 314 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
 315 *appearance* and *effect* of Reduction. Parallel Reduction is not limited
 316 to Power-of-two but is limited as usual by the total number of
 317 element operations (127) as well as available register file size.
 318
 319 In Horizontal-First Mode, Vector-result reduction **requires**
 320 the destination to be a Vector, which will be used to store
 321 intermediary results, in order to achieve a correct final
 322 result.
 323
 324 Given that the tree-reduction schedule is deterministic,
 325 Interrupts and exceptions
 326 can therefore also be precise.  The final result will be in the first
 327 non-predicate-masked-out destination element, but due again to
 328 the deterministic schedule programmers may find uses for the intermediate
 329 results, even for non-commutative Defined Word operations.
 330 Additionally, because the intermediate results are always written out
 331 it is possible to service Precise Interrupts without affecting latency
 332 (a common limitation of Vector ISAs implementing explicit
 333 Parallel Reduction instructions, because their Architectural State cannot
 334 hold the partial results).
 335
 336 When Rc=1 a corresponding Vector of co-resultant CRs is also
 337 created.  No special action is taken: the result *and its CR Field*
 338 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
 339
 340 Note that the Schedule only makes sense on top of certain instructions:
 341 X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
 342 and the destination are all the same type.  Like Scalar
 343 Reduction, nothing is prohibited:
 344 the results of execution on an unsuitable instruction may simply
 345 not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
 346 may be used, and whilst it is down to the Programmer to walk through the
 347 process the Programmer can be confident that the Parallel-Reduction is
 348 guaranteed 100% Deterministic.
 349
 350 Critical to note regarding use of Parallel-Reduction REMAP is that,
 351 exactly as with all REMAP Modes, the `svshape` instruction *requests*
 352 a certain Vector Length (number of elements to reduce) and then
 353 sets VL and MAXVL at the number of **operations** needed to be
 354 carried out.  Thus, equally as importantly, like Matrix REMAP
 355 the total number of operations
 356 is restricted to 127.  Any Parallel-Reduction requiring more operations
 357 will need to be done manually in batches (hierarchical
 358 recursive Reduction).
 359
Also important to note is that the Deterministic Schedule is arranged
so that some implementations *may* parallelise it (as long as doing so
respects Program Order and Register Hazards).  Performance (speed)
of any given implementation is neither strictly defined nor guaranteed.  As with
 365 the Vulkan(tm) Specification, strict compliance is paramount whilst
 366 performance is at the discretion of Implementors.
 367
 368 **Parallel-Reduction with Predication**
 369
 370 To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
 371 completely separate from the actual element-level (scalar) operations,
 372 Move operations are **not** included in the Schedule.  This means that
 373 the Schedule leaves the final (scalar) result in the first-non-masked
 374 element of the Vector used.  With the predicate mask being dynamic
 375 (but deterministic) at a superficial glance it seems this result
 376 could be anywhere.
 377
 378 If that result is needed to be moved to a (single) scalar register
 379 then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
 380 needed to get it, where the predicate is the exact same predicate used
 381 in the prior Parallel-Reduction instruction.
 382
 383 * If there was only a single
 384   bit in the predicate then the result will not have moved or been altered
 385   from the source vector prior to the Reduction
 386 * If there was more than one bit the result will be in the
 387   first element with a predicate bit set.
 388
 389 In either case the result is in the element with the first bit set in
 390 the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
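
Locating the result element therefore reduces to finding the first set bit of the predicate. A small sketch, assuming (for illustration only) that element 0 corresponds to the least-significant bit of an integer predicate mask; `reduction_result_index` is a hypothetical helper name.

```python
def reduction_result_index(predicate, vl):
    """Index of the first enabled element, or None if all are masked out.

    Assumption: element 0 corresponds to the least-significant predicate bit.
    """
    active = predicate & ((1 << vl) - 1)     # ignore bits beyond the vector
    if active == 0:
        return None
    # active & -active isolates the lowest set bit
    return (active & -active).bit_length() - 1
```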
 391
 392 Programmer's Note: For *some* hardware implementations
 393 the vector-to-scalar copy may be a slow operation, as may the Predicated
 394 Parallel Reduction itself.
 395 It may be better to perform a pre-copy
 396 of the values, compressing them (VREDUCE-style) into a contiguous block,
 397 which will guarantee that the result goes into the very first element
 398 of the destination vector, in which case clearly no follow-up
 399 predicated vector-to-scalar MV operation is needed. A VREDUCE effect
 400 is achieved by setting just a source predicate mask on Twin-Predicated
 401 operations.
 402
 403 **Usage conditions**
 404
 405 The simplest usage is to perform an overwrite, specifying all three
 406 register operands the same.
 407
```
    svshape parallelreduce, 6
    sv.add *8, *8, *8
```
 412
 413 The Reduction Schedule will issue the Parallel Tree Reduction spanning
 414 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
 415 necessary (see "Parallel Reduction algorithm" in a later section).
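
The shape of such a Schedule for six elements can be sketched as below. This is a simplified, unpredicated illustration using the default (non-inverted) ordering described earlier, where the step begins at 2 and doubles; the normative algorithm is the one given in the appendix, and `reduction_schedule` is an illustrative name.

```python
def reduction_schedule(n):
    """Yield (left, right) pairs: vec[left] = op(vec[left], vec[right])."""
    step = 2
    while step < 2 * n:
        for left in range(0, n, step):
            right = left + step // 2
            if right < n:                # no partner: element carries over
                yield left, right
        step *= 2


# Model of "sv.add *8, *8, *8" over six elements held in registers 8..13.
regs = [0] * 32
regs[8:14] = [1, 2, 3, 4, 5, 6]
base = 8
ops = list(reduction_schedule(6))
for left, right in ops:
    regs[base + left] = regs[base + left] + regs[base + right]
```

Six elements need only five operations, which is why VL is set to the number of *operations* rather than the number of elements. Registers 10 and 12 are left holding intermediate results, as the overwrite form implies.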
 416
 417 A non-overwrite is possible as well but just as with the overwrite
 418 version, only those destination elements necessary for storing
 419 intermediary computations will be written to: the remaining elements
 420 will **not** be overwritten and will **not** be zero'd.
 421
```
    svshape parallelreduce, 6
    sv.add *0, *8, *8
```
 426
 427 However it is critical to note that if the source and destination are
 428 not the same then the trick of using a follow-up vector-scalar MV will
 429 not work.
 430
 431 **Sub-Vector Horizontal Reduction**
 432
 433 To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
 434 which will turn the Schedule around such that issuing of the Scalar
 435 Defined Words is done with SUBVL looping as the inner loop not the
 436 outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
 437
 438 *Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
 439 will clearly result in data corruption.  It may be best to perform
 440 a Pack/Unpack Transposing copy of the data first*
 441
## FFT/DCT mode
 443
 444 submode2=0 is for FFT. For FFT submode the following schedules may be
 445 selected:
 446
* **submode=0b00** selects the ``j`` offset of the innermost for-loop
  of Cooley-Tukey
* **submode=0b10** selects the ``j+halfsize`` offset of the innermost for-loop
  of Cooley-Tukey
* **submode=0b11** selects the ``k`` of exptable (which coefficient)
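
The three quantities named above come from the familiar radix-2 in-place triple loop. The sketch below is a textbook form of that loop, offered as an illustration rather than as the reference implementation; `tablestep` and the function name are assumptions.

```python
def fft_butterfly_indices(n):
    """Yield (j, j + halfsize, k) for each butterfly of an n-point FFT."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    size = 2
    while size <= n:                     # outer loop: one pass per layer
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):      # middle loop: one block per group
            k = 0
            for j in range(i, i + halfsize):   # inner loop: the butterflies
                yield j, j + halfsize, k
                k += tablestep
        size *= 2


schedule = list(fft_butterfly_indices(8))
```

Each tuple corresponds to one element operation: the first entry is what submode 0b00 would supply, the second what submode 0b10 would supply, and the third what submode 0b11 would supply as the coefficient index.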
 452
 453 When submode2 is 1 or 2, for DCT inner butterfly submode the following
 454 schedules may be selected.  When submode2 is 1, additional bit-reversing
 455 is also performed.
 456
* **submode=0b00** selects the ``j`` offset of the innermost for-loop,
  in-place
* **submode=0b01** selects the ``j+halfsize`` offset of the innermost for-loop,
  in reverse-order, in-place
* **submode=0b10** selects the ``ci`` count of the innermost for-loop,
  useful for calculating the cosine coefficient
* **submode=0b11** selects the ``size`` offset of the outermost for-loop,
  useful for the cosine coefficient ``cos((ci + 0.5) * pi / size)``
 465
When submode2 is 3 or 4, for DCT outer butterfly submode the following
schedules may be selected.  When submode2 is 3, additional bit-reversing
is also performed.

* **submode=0b00** selects the ``j`` offset of the innermost for-loop
* **submode=0b01** selects the ``j+1`` offset of the innermost for-loop
 472
 473 `zdimsz` is used as an in-place "Stride", particularly useful for
 474 column-based in-place DCT/FFT.
 475
## Matrix Mode
 477
 478 In Matrix Mode, skip allows dimensions to be skipped from being included
 479 in the resultant output index.  This allows sequences to be repeated:
 480 ```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this results in
 481 modulo ```0 1 2 0 1 2 ...```
 482
 483 * **skip=0b00** indicates no dimensions to be skipped
 484 * **skip=0b01** sets "skip 1st dimension"
 485 * **skip=0b10** sets "skip 2nd dimension"
 486 * **skip=0b11** sets "skip 3rd dimension"
 487
 488 invxyz will invert the start index of each of x, y or z. If invxyz[0] is
 489 zero then x-dimensional counting begins from 0 and increments, otherwise
 490 it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.
 491
 492 offset will have the effect of offsetting the result by ```offset``` elements:
 493
```
    for i in 0..VL-1:
        GPR(RT + remap(i) + SVSHAPE.offset) = ....
```
 498
 499 This appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced.  Also
 500 bear in mind that unlike a static compiler SVSHAPE.offset may
 501 be set dynamically at runtime.
 502
xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates
that the array dimensionality for that dimension is 1. Any dimension
not intended to be used must have its value set to 0 (dimensionality
of 1).  A value of xdimsz=2 would indicate that in the first dimension
there are 3 elements in the array.  For example, to create a 2D array
X,Y of dimensionality X=3 and Y=2, set xdimsz=2, ydimsz=1 and zdimsz=0.
 509
 510 The format of the array is therefore as follows:
 511
```
    array[xdimsz+1][ydimsz+1][zdimsz+1]
```
 515
 516 However whilst illustrative of the dimensionality, that does not take the
 517 "permute" setting into account.  "permute" may be any one of six values
 518 (0-5, with values of 6 and 7 indicating "Indexed" Mode).  The table
 519 below shows how the permutation dimensionality order works:
 520
| permute | order | array format             |
| ------- | ----- | ------------------------ |
| 000     | 0,1,2 | (xdim+1)(ydim+1)(zdim+1) |
| 001     | 0,2,1 | (xdim+1)(zdim+1)(ydim+1) |
| 010     | 1,0,2 | (ydim+1)(xdim+1)(zdim+1) |
| 011     | 1,2,0 | (ydim+1)(zdim+1)(xdim+1) |
| 100     | 2,0,1 | (zdim+1)(xdim+1)(ydim+1) |
| 101     | 2,1,0 | (zdim+1)(ydim+1)(xdim+1) |
| 110     | 0,1   | Indexed (xdim+1)(ydim+1) |
| 111     | 1,0   | Indexed (ydim+1)(xdim+1) |
 531
 532 In other words, the "permute" option changes the order in which
 533 nested for-loops over the array would be done.  See executable
 534 python reference code for further details.
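
A simplified sketch of the idea follows. It ignores `skip`, `invxyz` and `offset`, and it makes two assumptions that the reference code should be consulted to confirm: that elements are laid out with x varying fastest, and that the first entry of the "order" column names the innermost loop. `PERMUTE_ORDER` and `permuted_indices` are illustrative names.

```python
PERMUTE_ORDER = {
    0b000: (0, 1, 2), 0b001: (0, 2, 1), 0b010: (1, 0, 2),
    0b011: (1, 2, 0), 0b100: (2, 0, 1), 0b101: (2, 1, 0),
}


def permuted_indices(dims, permute):
    """List linear element indices in permuted nested-loop order.

    dims is (x, y, z) given as actual sizes (dimsz + 1).
    Assumption: x varies fastest in the underlying linear layout, and
    order[0] names the innermost loop.
    """
    strides = (1, dims[0], dims[0] * dims[1])
    inner, middle, outer = PERMUTE_ORDER[permute]
    result = []
    for c in range(dims[outer]):
        for b in range(dims[middle]):
            for a in range(dims[inner]):
                result.append(a * strides[inner]
                              + b * strides[middle]
                              + c * strides[outer])
    return result


linear = permuted_indices((3, 2, 1), 0b000)
transposed = permuted_indices((3, 2, 1), 0b010)
```

Under these assumptions, swapping the two innermost loops of a 3x2 array yields the element order of its transpose without moving any data.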
 535
 536 *Note: permute=0b110 and permute=0b111 enable Indexed REMAP Mode,
 537 described below*
 538
 539 With all these options it is possible to support in-place transpose,
 540 in-place rotate, Matrix Multiply and Convolutions, without being
 541 limited to Power-of-Two dimension sizes.
 542
 543 **Limitations and caveats**
 544
 545 Limitations of Matrix REMAP are that the Vector Length (VL) is currently
 546 restricted to 127: up to 127 FMAs (or other operation)
 547 may be performed in total.
 548 Also given that it is in-registers only at present some care has to be
 549 taken on regfile resource utilisation. However it is perfectly possible
 550 to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
 551 the usual 6-level "Tiled" large Matrix Multiply, without the usual
 552 difficulties associated with SIMD.
 553
 554 Also the `svshape` instruction only provides access to *part* of the
 555 Matrix REMAP capability. Rotation and mirroring need to be done by
 556 programming the SVSHAPE SPRs directly, which can take a lot more
 557 instructions. Future versions of SVP64 will include EXT1xx prefixed
 558 variants (`psvshape`) which provide more comprehensive capacity and
 559 mitigate the need to write direct to the SVSHAPE SPRs.
 560
 561 Additionally there is not yet a way to set Matrix sizes from registers
 562 with `svshape`: this was an intentional decision to simplify Hardware, that
 563 may be corrected in a future version of SVP64. The limitation may presently
 564 be overcome by direct programming of the SVSHAPE SPRs.
 565
 566 *Hardware Architectural note: with the Scheduling applying as a Phase between
 567 Decode and Issue in a Deterministic fashion the Register Hazards may be
 568 easily computed and a standard Out-of-Order Micro-Architecture exploited to good
 569 effect.  Even an In-Order system may observe that for large Outer Product
 570 Schedules there will be no stalls, but if the Matrices are particularly
 571 small size an In-Order system would have to stall, just as it would if
 572 the operations were loop-unrolled without Simple-V. Thus: regardless
 573 of the Micro-Architecture the Hardware Engineer should first consider
 574 how best to process the exact same equivalent loop-unrolled instruction
 575 stream. Once solved Matrix REMAP will fit naturally.*
 576
## Indexed Mode
 578
 579 Indexed Mode activates reading of the element indices from the GPR
 580 and includes optional limited 2D reordering.
 581 In its simplest form (without elwidth overrides or other modes):
 582
```
    def index_remap(i):
        return GPR((SVSHAPE.SVGPR<<1)+i) + SVSHAPE.offset

    for i in 0..VL-1:
        element_result = ....
        GPR(RT + index_remap(i)) = element_result
```
 591
 592 With element-width overrides included, and using the pseudocode
 593 from the SVP64 [[sv/svp64/appendix#elwidth]] elwidth section
 594 this becomes:
 595
```
    def index_remap(i):
        svreg = SVSHAPE.SVGPR << 1
        srcwid = elwid_to_bitwidth(SVSHAPE.elwid)
        offs = SVSHAPE.offset
        return get_polymorphed_reg(svreg, srcwid, i) + offs

    for i in 0..VL-1:
        element_result = ....
        rt_idx = index_remap(i)
        set_polymorphed_reg(RT, destwid, rt_idx, element_result)
```
 608
 609 Matrix-style reordering still applies to the indices, except limited
 610 to up to 2 Dimensions (X,Y). Ordering is therefore limited to (X,Y) or
 611 (Y,X) for in-place Transposition.
 612 Only one dimension may optionally be skipped. Inversion of either
 613 X or Y or both is possible (2D mirroring). Pseudocode for Indexed Mode (including elwidth
 614 overrides) may be written in terms of Matrix Mode, specifically
 615 purposed to ensure that the 3rd dimension (Z) has no effect:
 616
```
    def index_remap(ISHAPE, i):
        MSHAPE.skip   = 0b0 || ISHAPE.sk1
        MSHAPE.invxyz = 0b0 || ISHAPE.invxy
        MSHAPE.xdimsz = ISHAPE.xdimsz
        MSHAPE.ydimsz = ISHAPE.ydimsz
        MSHAPE.zdimsz = 0 # disabled
        if ISHAPE.permute = 0b110 # 0,1
           MSHAPE.permute = 0b000 # 0,1,2
        if ISHAPE.permute = 0b111 # 1,0
           MSHAPE.permute = 0b010 # 1,0,2
        el_idx = remap_matrix(MSHAPE, i)
        svreg = ISHAPE.SVGPR << 1
        srcwid = elwid_to_bitwidth(ISHAPE.elwid)
        offs = ISHAPE.offset
        return get_polymorphed_reg(svreg, srcwid, el_idx) + offs
```
 634
 635 The most important observation above is that the Matrix-style
 636 remapping occurs first and the Index lookup second.  Thus it
 637 becomes possible to perform in-place Transpose of Indices which
 638 may have been costly to set up or costly to duplicate
 639 (waste register file space). In other words: it is fine for two or more
 640 SVSHAPEs to simultaneously use the same
 641 Indices (use the same GPRs), even if one SVSHAPE has different
 642 2D dimensions and ordering from the others.
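
The two-step ordering (Matrix-style reordering first, Index lookup second) can be sketched as follows. This is a simplified model without element-width overrides, skipping or inversion; `gpr` is a plain list standing in for the register file, `order_2d` is a hypothetical stand-in for the 2D subset of Matrix remapping, and one Index per GPR is assumed.

```python
def order_2d(xdim, ydim, transposed, i):
    """Step 1: Matrix-style (2D only) reordering of the step counter."""
    if not transposed:
        return i                      # (X,Y) ordering: linear
    x, y = divmod(i, ydim)            # (Y,X) ordering: y varies fastest
    return x + y * xdim


def index_remap(gpr, svgpr, offset, el_idx):
    """Step 2: look the reordered position up in the Index registers."""
    return gpr[(svgpr << 1) + el_idx] + offset


gpr = [0] * 32
gpr[8:14] = [5, 4, 3, 2, 1, 0]        # Indices held in GPRs 8..13 (SVGPR=4)

# Two "shapes" sharing the same Index registers with different 2D ordering.
linear = [index_remap(gpr, 4, 0, order_2d(3, 2, False, i)) for i in range(6)]
swapped = [index_remap(gpr, 4, 0, order_2d(3, 2, True, i)) for i in range(6)]
```

Both results are produced from the same six Index registers, which is the register-file saving described above.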
 643
 644 **Caveats and Limitations**
 645
 646 The purpose of Indexing is to provide a generalised version of
 647 Vector ISA "Permute" instructions, such as VSX `vperm`.  The
 648 Indexing is abstracted out and may be applied to much more
 649 than an element move/copy, and is not limited for example
 650 to the number of bytes that can fit into a VSX register.
 651 Indexing may be applied to LD/ST (even on Indexed LD/ST
 652 instructions such as `sv.lbzx`), arithmetic operations,
 653 extsw: there is no artificial limit.
 654
The only major caveat is that the registers to be used as
Indices must not be modified by any instruction after Indexed Mode
is established, and neither must MAXVL be altered. Additionally,
no Index value held in those registers may exceed MAXVL-1.
 659
 660 Failure to observe
 661 these conditions results in `UNDEFINED` behaviour.
 662 These conditions allow a Read-After-Write (RAW) Hazard to be created on
 663 the entire range of Indices to be subsequently used, but a corresponding
 664 Write-After-Read Hazard by any instruction that modifies the Indices
 665 **does not have to be created**. Given the large number of registers
 666 involved in Indexing this is a huge resource saving and reduction
 667 in micro-architectural complexity. MAXVL is likewise
 668 included in the RAW Hazards because it is involved in calculating
 669 how many registers are to be considered Indices.
 670
 671 With these Hazard Mitigations in place, high-performance implementations
 672 may read-cache the Indices at the point where a given `svindex` instruction
 673 is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
 674 background GPR register file reads whilst other instructions are being
 675 issued and executed.
 676
 677 Indexed REMAP **does not prevent conflicts** (overlapping
 678 destinations), which on a superficial analysis may be perceived to be a
 679 problem, until it is recalled that, firstly, Simple-V is designed specifically
 680 to require Program Order to be respected, and that Matrix, DCT and FFT
 681 all *already* critically depend on overlapping Reads/Writes: Matrix
 682 uses overlapping registers as accumulators.  Thus the Register Hazard
 683 Management needed by Indexed REMAP *has* to be in place anyway.
 684
 685 *Programmer's Note: `hphint` may be used to help hardware identify
 686 parallelism opportunities but it is critical to remember that the
 687 groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
 688
 689 The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
 690 clearly that of the additional reading of the GPRs to be used as Indices,
 691 plus the setup cost associated with creating those same Indices.
 692 If any Deterministic REMAP can cover the required task, clearly it
is advisable to use it instead.
 694
 695 *Programmer's note: some algorithms may require skipping of Indices exceeding
 696 VL-1, not MAXVL-1. This may be achieved programmatically by performing
 697 an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
 698 and RB contains the value of VL returned from `setvl`. The resultant
 699 CR Fields may then be used as Predicate Masks to exclude those operations
 700 with an Index exceeding VL-1.*
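
A rough Python model of this technique is given below. It only mimics the effect of comparing each Index against VL and using the result as a mask. `index_mask` and `masked_gather` are hypothetical helper names, and no claim is made about the CR Field bit layout or about which CR bit would be selected as the predicate.

```python
def index_mask(indices, vl):
    """True where the Index lies within 0..VL-1 (models a 'less than VL' test)."""
    return [idx < vl for idx in indices]


def masked_gather(src, indices, vl):
    """Indexed lookup that skips elements whose Index exceeds VL-1.

    Skipped elements are shown as None purely for illustration.
    """
    mask = index_mask(indices, vl)
    return [src[idx] if ok else None for idx, ok in zip(indices, mask)]


src = [10, 11, 12, 13, 14, 15, 16, 17]
result = masked_gather(src, [0, 6, 2, 7, 1], vl=4)
```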
 701
 702 -------------
 703
 704 \newpage{}
 705
 706 # svshape instruction  <a name="svshape"> </a>
 707
 708 SVM-Form
 709
 710     svshape SVxd,SVyd,SVzd,SVRM,vf
 711
 712 | 0:5|6:10  |11:15  |16:20  | 21:24  | 25 | 26:31 |  name    |
 713 | -- | --   | ---   | ----- | ------ | -- | ------| -------- |
 714 |PO  | SVxd | SVyd  | SVzd  | SVRM   | vf | XO    | svshape  |
 715
 716 See [[sv/remap/appendix]] for `svshape` pseudocode
 717
 718 Special Registers Altered:
 719
 720 ```
 721     SVSTATE, SVSHAPE0-3
 722 ```
 723
 724 `svshape` is a convenience instruction that reduces instruction
 725 count for common usage patterns, particularly Matrix, DCT and FFT. It sets up
 726 (overwrites) all required SVSHAPE SPRs and also modifies SVSTATE
 727 including VL and MAXVL. Using `svshape` therefore does not also
 728 require `setvl`.
 729
 730 Fields:
 731
 732 * **SVxd** - SV REMAP "xdim" (X-dimension)
 733 * **SVyd** - SV REMAP "ydim" (Y-dimension, sometimes used for sub-mode selection)
 734 * **SVzd** - SV REMAP "zdim" (Z-dimension)
* **SVRM** - SV REMAP Mode (0b0000 for Matrix, 0b0001 for FFT etc.)
 736 * **vf** - sets "Vertical-First" mode
 737 * **XO** - standard 6-bit XO field
 738
*Note: SVxd, SVyd and SVzd are all stored "off-by-one".  In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*
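
As a small illustration of the "off-by-one" storage, the following Python sketch converts between the assembler value and the 5-bit field. The helper names `encode_dim` and `decode_dim` are hypothetical and are not part of any assembler.

```python
def encode_dim(value):
    """Assembler value 1..32 -> 5-bit field 0b00000..0b11111."""
    if not 1 <= value <= 32:
        raise ValueError("dimension must be in the range 1..32")
    return value - 1


def decode_dim(field):
    """5-bit field 0b00000..0b11111 -> dimension 1..32."""
    return (field & 0b11111) + 1
```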
 741
 742 There are 12 REMAP Modes (2 Modes are RESERVED for `svshape2`, 2 Modes
 743 are RESERVED)
 744
 745 | SVRM   | Remap Mode description |
 746 | --     | --              |
 747 | 0b0000 | Matrix 1/2/3D    |
 748 | 0b0001 | FFT Butterfly   |
 749 | 0b0010 | reserved |
 750 | 0b0011 | DCT Outer butterfly  |
 751 | 0b0100 | DCT Inner butterfly, on-the-fly (Vertical-First Mode) |
 752 | 0b0101 | DCT COS table index generation |
 753 | 0b0110 | DCT half-swap   |
 754 | 0b0111 | Parallel Reduction and Prefix Sum |
 755 | 0b1000 | reserved for svshape2 |
 756 | 0b1001 | reserved for svshape2 |
 757 | 0b1010 | reserved |
 758 | 0b1011 | iDCT Outer butterfly  |
 759 | 0b1100 | iDCT Inner butterfly, on-the-fly (Vertical-First Mode) |
 760 | 0b1101 | iDCT COS table index generation |
 761 | 0b1110 | iDCT half-swap   |
 762 | 0b1111 | FFT half-swap   |
 763
Examples showing how all of these Modes operate exist in the online
 765 [SVP64 unit tests](https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/decoder/isa;hb=HEAD).  Explaining
 766 these Modes further in detail is beyond the scope of this document.
 767
 768 In Indexed Mode, there are only 5 bits available to specify the GPR
 769 to use, out of 128 GPRs (7 bit numbering).  Therefore, only the top
 770 5 bits are given in the `SVxd` field: the bottom two implicit bits
 771 will be zero (`SVxd || 0b00`).
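
The GPR numbering above amounts to a shift by two bits, sketched here in Python. The function names are hypothetical, and the sketch covers only the `SVxd || 0b00` concatenation described in this paragraph.

```python
def indexed_gpr(svxd):
    """5-bit SVxd field -> 7-bit GPR number (SVxd || 0b00)."""
    if not 0 <= svxd < 32:
        raise ValueError("SVxd is a 5-bit field")
    return svxd << 2


def svxd_for_gpr(gpr):
    """GPR number -> SVxd field; only multiples of 4 are reachable."""
    if not 0 <= gpr < 128 or gpr % 4:
        raise ValueError("GPR must be a multiple of 4 in the range 0..124")
    return gpr >> 2
```

Only every fourth GPR can therefore be named as the start of the Indices with `svshape`.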
 772
 773 `svshape` has *limited applicability* due to being a 32-bit instruction.
 774 The full capability of SVSHAPE SPRs may be accessed by directly writing
 775 to SVSHAPE0-3 with `mtspr`. Circumstances include Matrices with dimensions
 776 larger than 32, and in-place Transpose.  Potentially a future v3.1 Prefixed
 777 instruction, `psvshape`, may extend the capability here.
 778
 779 Programmer's Note: Parallel Reduction Mode is selected by setting `SVRM=7,SVyd=1`.
 780 Prefix Sum Mode is selected by setting `SVRM=7,SVyd=3`:
 781
 782 ```
 783     # Vector length of 8.
 784     svshape 8, 3, 1, 0x7, 0
 785     # activate SVSHAPE0 (prefix-sum lhs) for RA
 786     # activate SVSHAPE1 (prefix-sum rhs) for RT and RB
 787     svremap 7, 0, 1, 0, 1, 0, 0
 788     sv.add *10, *10, *10
 789 ```
 790
 791 *Architectural Resource Allocation note: the SVRM field is carefully
 792 crafted to allocate two Modes, corresponding to bits 21-23 within the
 793 instruction being set to the value `0b100`, to `svshape2` (not
 794 `svshape`). These two Modes are
 795 considered "RESERVED" within the context of `svshape` but it is
 796 absolutely critical to allocate the exact same pattern in XO for
 797 both instructions in bits 26-31.*
 798
 799 -------------
 800
 801 \newpage{}
 802
 803
 804 # svindex instruction  <a name="svindex"> </a>
 805
 806 SVI-Form
 807
 808 | 0:5|6:10 |11:15  |16:20 | 21:25       | 26:31 |  Form    |
 809 | -- | --  | ---   | ---- | ----------- | ------| -------- |
 810 | PO | SVG | rmm   | SVd  | ew/yx/mm/sk | XO    | SVI-Form |
 811
 812 * svindex SVG,rmm,SVd,ew,SVyx,mm,sk
 813
 814 See [[sv/remap/appendix]] for `svindex` pseudocode
 815
 816 Special Registers Altered:
 817
 818 ```
 819     SVSTATE, SVSHAPE0-3
 820 ```
 821
 822 `svindex` is a convenience instruction that reduces instruction count
 823 for Indexed REMAP Mode. It sets up (overwrites) all required SVSHAPE
 824 SPRs and **unlike** `svshape` can modify the REMAP area of the SVSTATE
 825 SPR as well, including setting persistence.  The relevant SPRs *may*
 826 be directly programmed with `mtspr` however it is laborious to do so:
`svindex` saves instructions covering much of Indexed REMAP capability.
 828
 829 Fields:
 830
 831 * **SVd** - SV REMAP x/y dim
 832 * **rmm** - REMAP mask: sets remap mi0-2/mo0-1 and SVSHAPEs,
 833   controlled by mm
 834 * **ew** - sets element width override on the Indices
 835 * **SVG** - GPR SVG<<2 to be used for Indexing
 836 * **yx** - 2D reordering to be used if yx=1
 837 * **mm** - mask mode. determines how `rmm` is interpreted.
 838 * **sk** - Dimension skipping enabled
 839
*Note: SVd, like SVxd, SVyd and SVzd of `svshape`, is stored
"off-by-one".  In the assembler
mnemonic the values `1-32` are stored in binary as `0b00000..0b11111`*.
 843
 844 *Note: when `yx=1,sk=0` the second dimension is calculated as
 845 `CEIL(MAXVL/SVd)`*.
 846
 847 When `mm=0`:
 848
 849 * `rmm`, like REMAP.SVme, has bit 0
 850   correspond to mi0, bit 1 to mi1, bit 2 to mi2,
  bit 3 to mo0 and bit 4 to mo1
* all SVSHAPEs and the REMAP parts of SVSTATE are first reset (initialised to zero)
 853 * for each bit set in the 5-bit `rmm`, in order, the first
 854   as-yet-unset SVSHAPE will be updated
 855   with the other operands in the instruction, and the REMAP
 856   SPR set.
 857 * If all 5 bits of `rmm` are set then both mi0 and mo1 use SVSHAPE0.
 858 * SVSTATE persistence bit is cleared
 859 * No other alterations to SVSTATE are carried out
 860
 861 Example 1: if rmm=0b00110 then SVSHAPE0 and SVSHAPE1 are set up,
 862 and the REMAP SPR set so that mi1 uses SVSHAPE0 and mi2
uses SVSHAPE1.  REMAP.SVme is also set to 0b00110, REMAP.mi1=0
 864 (SVSHAPE0) and REMAP.mi2=1 (SVSHAPE1)
 865
 866 Example 2: if rmm=0b10001 then again SVSHAPE0 and SVSHAPE1
 867 are set up, but the REMAP SPR is set so that mi0 uses SVSHAPE0
 868 and mo1 uses SVSHAPE1. REMAP.SVme=0b10001, REMAP.mi0=0, REMAP.mo1=1
 869
 870 Rough algorithmic form:
 871
 872 ```
 873     marray = [mi0, mi1, mi2, mo0, mo1]
 874     idx = 0
 875     for bit = 0 to 4:
 876         if not rmm[bit]: continue
 877         setup(SVSHAPE[idx])
 878         SVSTATE{marray[bit]} = idx
 879         idx = (idx+1) modulo 4
 880 ```
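
The rough algorithm can be checked against the two examples with a small runnable Python sketch. It records only which SVSHAPE each REMAP field is pointed at; the SVSHAPE setup itself is not modelled. Bit 0 is taken to be the least significant bit of `rmm`, which is what the worked examples imply and is an assumption of this sketch.

```python
MARRAY = ["mi0", "mi1", "mi2", "mo0", "mo1"]


def svindex_mm0(rmm):
    """Return {remap_field: SVSHAPE number} for a 5-bit rmm when mm=0."""
    assignment = {}
    idx = 0
    for bit in range(5):
        if not (rmm >> bit) & 1:
            continue
        # SVSHAPE[idx] would be set up here from the other operands.
        assignment[MARRAY[bit]] = idx
        idx = (idx + 1) % 4   # only four SVSHAPEs exist, so wrap around
    return assignment
```

With all five bits set the wrap-around assigns SVSHAPE0 to both mi0 and mo1, matching the rule stated in the list above.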
 881
 882 When `mm=1`:
 883
 884 * bits 0-2 (MSB0 numbering) of `rmm` indicate an index selecting mi0-mo1
 885 * bits 3-4 (MSB0 numbering) of `rmm` indicate which SVSHAPE 0-3 shall
 886   be updated
 887 * only the selected SVSHAPE is overwritten
 888 * only the relevant bits in the REMAP area of SVSTATE are updated
 889 * REMAP persistence bit is set.
 890
Example 1: if `rmm`=0b01110 then bits 0-2 (MSB0) are 0b011 and
bits 3-4 are 0b10. Thus mo0 is selected and SVSHAPE2 is
to be updated. REMAP.SVme[3] will be set high and REMAP.mo0
set to 2 (SVSHAPE2).
 895
Example 2: if `rmm`=0b10011 then bits 0-2 (MSB0) are 0b100 and
bits 3-4 are 0b11. Thus mo1 is selected and SVSHAPE3 is
to be updated. REMAP.SVme[4] will be set high and REMAP.mo1
set to 3 (SVSHAPE3).
 900
 901 Rough algorithmic form:
 902
 903 ```
 904     marray = [mi0, mi1, mi2, mo0, mo1]
 905     bit = rmm[0:2]
 906     idx = rmm[3:4]
 907     setup(SVSHAPE[idx])
 908     SVSTATE{marray[bit]} = idx
 909     SVSTATE.pst = 1
 910 ```
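
A runnable Python sketch of the `mm=1` decode follows, intended only to make the MSB0 bit-slicing explicit. The function name is hypothetical. Raising an error for selector values 5-7 is an assumption of this sketch: the text above does not state how those values are treated.

```python
MARRAY = ["mi0", "mi1", "mi2", "mo0", "mo1"]


def svindex_mm1(rmm):
    """Return (remap_field, SVSHAPE number) for a 5-bit rmm when mm=1."""
    sel = (rmm >> 2) & 0b111   # MSB0 bits 0-2: the three most significant bits
    idx = rmm & 0b11           # MSB0 bits 3-4: the two least significant bits
    if sel >= len(MARRAY):
        # Assumption: values 5-7 do not name any of mi0..mo1.
        raise ValueError("selector does not name one of mi0..mo1")
    return MARRAY[sel], idx
```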
 911
 912 In essence, `mm=0` is intended for use to set as much of the
 913 REMAP State SPRs as practical with a single instruction,
 914 whilst `mm=1` is intended to be a little more refined.
 915
 916 **Usage guidelines**
 917
 918 * **Disable 2D mapping**: to only perform Indexing without
 reordering use `SVd=1,sk=0,yx=0` (or set SVd to a value greater than
 or equal to VL)
 921 * **Modulo 1D mapping**: to perform Indexing cycling through the
 922  first N Indices use `SVd=N,sk=0,yx=0` where `VL>N`. There is
 923  no requirement to set VL equal to a multiple of N.
 924 * **Modulo 2D transposed**: `SVd=M,sk=0,yx=1`, sets
 925  `xdim=M,ydim=CEIL(MAXVL/M)`.
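
Two of these guidelines can be sketched in Python. `transposed_dims` reproduces only the `CEIL(MAXVL/M)` calculation, and `modulo_1d` shows the cycling order through the first N Indices that the guideline describes. Both names are hypothetical, and neither models the full REMAP schedule.

```python
def transposed_dims(m, maxvl):
    """(xdim, ydim) for SVd=M, sk=0, yx=1."""
    return m, -(-maxvl // m)   # integer ceiling division


def modulo_1d(vl, n):
    """Which of the first N Indices each of the VL steps uses."""
    return [i % n for i in range(vl)]
```

For example, `modulo_1d(7, 3)` shows that VL need not be a multiple of N: the final pass through the Indices is simply cut short.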
 926
Beyond these mappings it becomes necessary to write directly to
the SVSTATE and SVSHAPE SPRs.
 929
 930 -------------
 931
 932 \newpage{}
 933
 934
 935 # svshape2 (offset-priority) <a name="svshape2"> </a>
 936
 937 SVM2-Form
 938
 939 | 0:5|6:9 |10|11:15  |16:20  | 21:24  | 25 | 26:31 |  Form      |
 940 | -- |----|--| ---   | ----- | ------ | -- | ------| --------   |
 941 | PO |offs|yx| rmm   | SVd   | 100/mm | sk | XO    | SVM2-Form  |
 942
 943 * svshape2 offs,yx,rmm,SVd,sk,mm
 944
 945 See [[sv/remap/appendix]] for `svshape2` pseudocode
 946
 947 Special Registers Altered:
 948
 949 ```
 950     SVSTATE, SVSHAPE0-3
 951 ```
 952
 953 `svshape2` is an additional convenience instruction that prioritises
 954 setting `SVSHAPE.offset`. Its primary purpose is for use when
 955 element-width overrides are used. It has identical capabilities to `svindex`
 956 in terms of both options (skip, etc.) and ability to activate REMAP
 957 (rmm, mask mode) but unlike `svindex` it does not set GPR REMAP:
 958 only a 1D or 2D `svshape`, and
 959 unlike `svshape` it can set an arbitrary `SVSHAPE.offset` immediate.
 960
 961 One of the limitations of Simple-V is that Vector elements start on the boundary
 962 of the Scalar regfile, which is fine when element-width overrides are not
 963 needed. If the starting point of a Vector with smaller elwidths must begin
 964 in the middle of a register, normally there would be no way to do so except
 965 through costly LD/ST.  `SVSHAPE.offset` caters for this scenario and `svshape2`
 966 makes it easier to access.
 967
 968 **Operand Fields**:
 969
 970 * **offs** (4 bits) - unsigned offset
 971 * **yx** (1 bit) - swap XY to YX
 972 * **SVd** dimension size
 973 * **rmm** REMAP mask
 974 * **mm** mask mode
 975 * **sk** (1 bit) skips 1st dimension if set
 976
 977 Dimensions are calculated exactly as `svindex`. `rmm` and
 978 `mm` are as per `svindex`.
 979
 980 *Programmer's Note: offsets for `svshape2` may be specified in the range
 981 0-15. Given that the principle of Simple-V is to fit on top of
 982 byte-addressable register files and that GPR and FPR are 64-bit (8 bytes)
 983 it should be clear that the offset may, when `elwidth=8`, begin an
 984 element-level operation starting element zero at any arbitrary byte.
 985 On cursory examination attempting to go beyond the range 0-7 seems
 986 unnecessary given that the **next GPR or FPR** is an
 987 alias for an offset in the range 8-15.  Thus by simply increasing
 988 the starting Vector point of the operation to the next register it
can be seen that the offset of 0-7 would be sufficient.  Unfortunately,
however, when an operation is EXTRA2-encoded it is **not possible**
to increase the GPR/FPR register number by one, because EXTRA2-encoding
of GPR/FPR Vector numbers is restricted to even numbering.
 993 For CR Fields the EXTRA2 encoding is even more sparse.
 994 The additional offset range (8-15) helps overcome these limitations.*
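
The aliasing described in the note can be illustrated with a short Python sketch for the `elwidth=8` case only. `split_offset` is a hypothetical helper that splits an offset into whole registers and a byte position within the register.

```python
BYTES_PER_REG = 8  # GPR and FPR are 64-bit


def split_offset(offs):
    """Split an elwidth=8 offset (0..15) into (extra registers, byte in register)."""
    if not 0 <= offs <= 15:
        raise ValueError("svshape2 offsets are in the range 0-15")
    return divmod(offs, BYTES_PER_REG)


# An even (EXTRA2-encodable) base register with an offset in the 8-15 range
# reaches a byte in the following odd-numbered register.
base = 8
extra, byte = split_offset(11)
start_reg = base + extra
```

Here offset 11 on register 8 begins at byte 3 of register 9, a starting point that an even-only register number combined with an offset of 0-7 could not express.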
 995
 996 *Hardware Implementor's note: with the offsets only being immediates
 997 and with register numbering being entirely immediate as well it is
 998 possible to correctly compute Register Hazards without requiring
 999 reading the contents of any SPRs.  If however there are
1000 instructions that have directly written to the SVSTATE or SVSHAPE
1001 SPRs and those instructions are still in-flight then this position
1002 is clearly **invalid**. This is why Programmers are strongly
1003 discouraged from directly writing to these SPRs.*
1004
1005 *Architectural Resource Allocation note: this instruction shares
1006 the space of `svshape`. Therefore it is critical that the two
1007 instructions, `svshape` and `svshape2` have the exact same XO
1008 in bits 26 thru 31.  It is also critical that for `svshape2`,
bit 21 of the instruction is a 1, bit 22 is a 0, and bit 23 is a 0.*
1010
1011 [[!tag standards]]
1012
1013 -------------
1014
1015 \newpage{}
1016