openpower/sv/overview.mdwn

   1 # SV Overview
   2
   3 **SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
   4
   5 This document provides an overview and introduction as to why SV (a
   6 [[!wikipedia Cray]]-style Vector augmentation to [[!wikipedia OpenPOWER]]) exists, and how it works.
   7
   8 **Sponsored by NLnet under the Privacy and Enhanced Trust Programme**
   9
  10 Links:
  11
  12 * This page: [http://libre-soc.org/openpower/sv/overview](http://libre-soc.org/openpower/sv/overview)
  13 * [FOSDEM2021 SimpleV for OpenPOWER](https://fosdem.org/2021/schedule/event/the_libresoc_project_simple_v_vectorisation/)
  14 * FOSDEM2021 presentation <https://www.youtube.com/watch?v=FS6tbfyb2VA>
  15 * [[discussion]] and
  16   [bugreport](https://bugs.libre-soc.org/show_bug.cgi?id=556)
  17   feel free to add comments, questions.
  18 * [[SV|sv]]
  19 * [[sv/svp64]]
  20 * [x86 REP instruction](https://c9x.me/x86/html/file_module_x86_id_279.html):
  21   a useful way to quickly understand that the core of the SV concept
  22   is not new.
  23 * [Article about register tagging](http://science.lpnu.ua/sites/default/files/journal-paper/2019/jul/17084/volum3number1text-9-16_1.pdf) showing
  24   that tagging is not a new idea either. Register tags
  25   are also used in the Mill Architecture.
  26
  27
  28 [[!toc]]
  29
  30 # Introduction: SIMD and Cray Vectors
  31
  32 SIMD, the primary method for easy parallelism of the
  33 past 30 years in Computer Architectures, is
  34 [known to be harmful](https://www.sigarch.org/simd-instructions-considered-harmful/).
  35 SIMD provides a seductive simplicity that is easy to implement in
  36 hardware.  With each doubling in width it promises increases in raw
  37 performance without the complexity of either multi-issue or out-of-order
  38 execution.
  39
  40 Unfortunately, even with predication added, SIMD only becomes more and
  41 more problematic with each power of two SIMD width increase introduced
  42 through an ISA revision.  The opcode proliferation, at O(N^6), inexorably
  43 spirals out of control in the ISA, detrimentally impacting the hardware,
  44 the software, the compilers and the testing and compliance.  Here are
  45 the typical dimensions that result in such massive proliferation:
  46
  47 * Operation (add, mul)
  48 * bitwidth (8, 16, 32, 64, 128)
  49 * Conversion between bitwidths (FP16-FP32-64)
  50 * Signed/unsigned
  51 * HI/LO swizzle (Audio L/R channels)
  52    - HI/LO selection on src 1
  53    - selection on src 2
  54    - selection on dest
  55    - Example: AndesSTAR Audio DSP
  56 * Saturation (Clamping at max range)
  57
These are typically multiplied together to produce explicit opcodes numbering
in the thousands on, for example, the ARC Video/DSP cores.
  60
Cray-style variable-length Vectors on the other hand result in
stunningly elegant and small loops, exceptionally high data throughput
per instruction (by one *or more* orders of magnitude compared to SIMD), with
no alarmingly high setup and cleanup code, where at the hardware level
the microarchitecture may execute from one element right the way through
to tens of thousands at a time, yet the executable remains exactly the
same and the ISA remains clear, true to the RISC paradigm, and clean.
Unlike in SIMD, power-of-two limitations are not involved in the ISA
or in the assembly code.
  70
SimpleV takes the Cray-style Vector principle and applies it in the
abstract to a Scalar ISA, in the same way that x86 used to do with its
"REP" instruction.  In the process, "context" is applied, allowing amongst
other things a register file size increase using "tagging" (similar to
how x86 originally extended registers from 32 to 64 bit).
  76
  77 ## SV
  78
  79 The fundamentals are (just like x86 "REP"):
  80
  81 * The Program Counter (PC) gains a "Sub Counter" context (Sub-PC)
  82 * Vectorisation pauses the PC and runs a Sub-PC loop from 0 to VL-1
  83   (where VL is Vector Length)
  84 * The [[Program Order]] of "Sub-PC" instructions must be preserved,
  85   just as is expected of instructions ordered by the PC.
  86 * Some registers may be "tagged" as Vectors
* During the loop, "Vector"-tagged registers are incremented by
  one with each iteration, executing the *same instruction*
  but with *different registers*
  90 * Once the loop is completed *only then* is the Program Counter
  91   allowed to move to the next instruction.
  92
  93 Hardware (and simulator) implementors are free and clear to implement this
  94 as literally a for-loop, sitting in between instruction decode and issue.
  95 Higher performance systems may deploy SIMD backends, multi-issue and
  96 out-of-order execution, although it is strongly recommended to add
  97 predication capability directly into SIMD backend units.
  98
  99 In OpenPOWER ISA v3.0B pseudo-code form, an ADD operation, assuming both
 100 source and destination have been "tagged" as Vectors, is simply:
 101
 102     for i = 0 to VL-1:
 103          GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
 104
 105 At its heart, SimpleV really is this simple.  On top of this fundamental
 106 basis further refinements can be added which build up towards an extremely
 107 powerful Vector augmentation system, with very little in the way of
 108 additional opcodes required: simply external "context".
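
As a rough illustration of the loop above, the following Python sketch models the register file as a plain list. The names `sv_add`, `gpr` and `MASK64` are hypothetical, chosen for this example only, and the 128-entry size is taken from the extended regfile discussed later on this page.

```python
# Hypothetical model: a flat list stands in for the integer register file.
MASK64 = (1 << 64) - 1

def sv_add(gpr, RT, RA, RB, VL):
    """Element-wise add with all three registers tagged as Vectors."""
    for i in range(VL):                      # the Sub-PC loop: 0 to VL-1
        gpr[RT + i] = (gpr[RA + i] + gpr[RB + i]) & MASK64

gpr = [0] * 128
gpr[8:12] = [1, 2, 3, 4]                     # vector starting at r8
gpr[16:20] = [10, 20, 30, 40]                # vector starting at r16
sv_add(gpr, RT=24, RA=8, RB=16, VL=4)
print(gpr[24:28])                            # [11, 22, 33, 44]
```

One call to `sv_add` stands in for what would otherwise be four separate scalar `add` instructions.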
 109
 110 x86 was originally only 80 instructions: prior to AVX512 over 1,300
 111 additional instructions have been added, almost all of them SIMD.
 112
RISC-V RVV as of version 0.9 is over 188 instructions (more than the
rest of RV64G combined: 80 for RV64G and 27 for C). Over 95% of that
functionality is added to OpenPOWER v3.0B, by SimpleV augmentation,
with around 5 to 8 instructions.
 117
 118 Even in OpenPOWER v3.0B, the Scalar Integer ISA is around 150
 119 instructions, with IEEE754 FP adding approximately 80 more. VSX, being
 120 based on SIMD design principles, adds somewhere in the region of 600 more.
 121 SimpleV again provides over 95% of VSX functionality, simply by augmenting
 122 the *Scalar* OpenPOWER ISA, and in the process providing features such
 123 as predication, which VSX is entirely missing.
 124
AVX512, SVE2, VSX, RVV: all of these systems have to provide different
types of register files, with Scalar and Vector being the minimum. AVX512
even provides a mini mask regfile, followed by explicit instructions
that handle operations on each of them *and map between all of them*.
SV not only uses the existing scalar regfiles (including CRs): because
operations already exist within OpenPOWER to cover interactions
between the scalar regfiles (`mfcr`, `fcvt`), there is very little that
needs to be added.
 133
 134 In fairness to both VSX and RVV, there are things that are not provided
 135 by SimpleV:
 136
 137 * 128 bit or above arithmetic and other operations
 138   (VSX Rijndael and SHA primitives; VSX shuffle and bitpermute operations)
 139 * register files above 128 entries
 140 * Vector lengths over 64
 141 * Unit-strided LD/ST and other comprehensive memory operations
 142   (struct-based LD/ST from RVV for example)
 143 * 32-bit instruction lengths. [[svp64]] had to be added as 64 bit.
 144
 145 These limitations, which stem inherently from the adaptation process of
 146 starting from a Scalar ISA, are not insurmountable. Over time, they may
 147 well be addressed in future revisions of SV.
 148
 149 The rest of this document builds on the above simple loop to add:
 150
 151 * Vector-Scalar, Scalar-Vector and Scalar-Scalar operation
 152  (of all register files: Integer, FP *and CRs*)
 153 * Traditional Vector operations (VSPLAT, VINSERT, VCOMPRESS etc)
 154 * Predication masks (essential for parallel if/else constructs)
 155 * 8, 16 and 32 bit integer operations, and both FP16 and BF16.
 156 * Compacted operations into registers (normally only provided by SIMD)
 157 * Fail-on-first (introduced in ARM SVE2)
 158 * A new concept: Data-dependent fail-first
 159 * Condition-Register based *post-result* predication (also new)
 160 * A completely new concept: "Twin Predication"
 161 * vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
 162
All of this is *without modifying the OpenPOWER v3.0B ISA*, except to add
"wrapping context", similar to how v3.1B 64-bit Prefixes work.
 165
 166 # Adding Scalar / Vector
 167
The first augmentation to the simple loop is to add the option for each
source and destination to be either scalar or vector.  As an FSM
this is where our "simple" loop gets its first complexity.
 171
 172     function op_add(RT, RA, RB) # add not VADD!
 173       int id=0, irs1=0, irs2=0;
 174       for i = 0 to VL-1:
 175         ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
 176         if (!RT.isvec) break;
 177         if (RT.isvec)  { id += 1; }
 178         if (RA.isvec)  { irs1 += 1; }
 179         if (RB.isvec)  { irs2 += 1; }
 180
 181 This could have been written out as eight separate cases: one each for
 182 when each of `RA`, `RB` or `RT` is scalar or vector.  Those eight cases,
 183 when optimally combined, result in the pseudocode above.
 184
 185 With some walkthroughs it is clear that the loop exits immediately
 186 after the first scalar destination result is written, and that when the
 187 destination is a Vector the loop proceeds to fill up the register file,
 188 sequentially, starting at `RT` and ending at `RT+VL-1`. The two source
 189 registers will, independently, either remain pointing at `RB` or `RA`
 190 respectively, or, if marked as Vectors, will march incrementally in
 191 lockstep, producing element results along the way, as the destination
 192 also progresses through elements.
 193
 194 In this way all the eight permutations of Scalar and Vector behaviour
 195 are covered, although without predication the scalar-destination ones are
 196 reduced in usefulness.  It does however clearly illustrate the principle.
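
The walkthrough above can be tried out with a small Python model. This is only a sketch: the `Reg` dataclass (a register number plus an `isvec` tag) and the list-based `gpr` are assumptions made for illustration, not part of the specification.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    num: int        # starting register number
    isvec: bool     # the "tag": True for Vector, False for Scalar

def op_add(gpr, RT, RA, RB, VL):
    id_ = irs1 = irs2 = 0
    for _ in range(VL):
        gpr[RT.num + id_] = gpr[RA.num + irs1] + gpr[RB.num + irs2]
        if not RT.isvec:
            break                   # scalar destination: one result only
        id_ += 1
        if RA.isvec:
            irs1 += 1
        if RB.isvec:
            irs2 += 1

gpr = [0] * 128
gpr[8:12] = [1, 2, 3, 4]
gpr[16] = 100

# Vector = Vector + Scalar: fills r24..r27
op_add(gpr, Reg(24, True), Reg(8, True), Reg(16, False), VL=4)
vec_result = gpr[24:28]             # [101, 102, 103, 104]

# Scalar = Vector + Scalar: the loop exits after the first element
op_add(gpr, Reg(32, False), Reg(8, True), Reg(16, False), VL=4)
scalar_result = gpr[32:34]          # [101, 0]
```

The same function covers all eight scalar/vector combinations; only the tags on the operands change.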
 197
 198 Note in particular: there is no separate Scalar add instruction and
 199 separate Vector instruction and separate Scalar-Vector instruction, *and
 200 there is no separate Vector register file*: it's all the same instruction,
 201 on the standard register file, just with a loop.  Scalar happens to set
 202 that loop size to one.
 203
 204 The important insight from the above is that, strictly speaking, Simple-V
 205 is not really a Vectorisation scheme at all: it is more of a hardware
 206 ISA "Compression scheme", allowing as it does for what would normally
 207 require multiple sequential instructions to be replaced with just one.
 208 This is where the rule that Program Order must be preserved in Sub-PC
 209 execution derives from.  However in other ways, which will emerge below,
 210 the "tagging" concept presents an opportunity to include features
 211 definitely not common outside of Vector ISAs, and in that regard it's
 212 definitely a class of Vectorisation.
 213
 214 ## Register "tagging"
 215
As an aside: in [[sv/svp64]] the field which allows SV both to extend
the range beyond r0-r31 and to determine whether a register is scalar or
vector is encoded in two to three bits, depending on the instruction.

The reason for using so few bits is that there are up to *four*
registers to mark in this way (`fma`, `isel`), which starts to be of
concern when there are only 24 available bits to specify the entire SV
Vectorisation Context.  In fact, for a small subset of instructions it
is just not possible to tag every single register.  Under these rare
circumstances a tag has to be shared between two registers.
 226
 227 Below is the pseudocode which expresses the relationship which is usually
 228 applied to *every* register:
 229
 230     if extra3_mode:
 231         spec = EXTRA3 # bit 2 s/v, 0-1 extends range
 232     else:
 233         spec = EXTRA2 << 1 # same as EXTRA3, shifted
 234     if spec[2]: # vector
 235          RA.isvec = True
 236          return (RA << 2) | spec[0:1]
 237     else:         # scalar
 238          RA.isvec = False
 239          return (spec[0:1] << 5) | RA
 240
 241 Here we can see that the scalar registers are extended in the top bits,
 242 whilst vectors are shifted up by 2 bits, and then extended in the LSBs.
 243 Condition Registers have a slightly different scheme, along the same
 244 principle, which takes into account the fact that each CR may be bit-level
 245 addressed by Condition Register operations.
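
The following Python sketch is a literal transcription of the tagging pseudocode above, for integer registers only. It makes several assumptions that the pseudocode does not state: that bit numbering is LSB0 (so `spec[2]` is the top bit of the 3-bit field and `spec[0:1]` the bottom two), and that `RA` is the 5-bit register field from the instruction. The function name `decode_extra` is hypothetical.

```python
def decode_extra(RA, extra, extra3_mode):
    """Return (isvec, full_register_number), following the pseudocode above.

    Assumes LSB0 bit numbering and a 5-bit RA field (illustrative only).
    """
    spec = extra if extra3_mode else (extra << 1)
    isvec = bool((spec >> 2) & 1)       # spec[2]: scalar/vector tag
    ext = spec & 0b11                   # spec[0:1]: range extension bits
    if isvec:
        return True, (RA << 2) | ext    # vectors: extended in the LSBs
    return False, (ext << 5) | RA       # scalars: extended in the MSBs

print(decode_extra(3, 0b101, True))     # (True, 13)
print(decode_extra(3, 0b001, True))     # (False, 35)
print(decode_extra(3, 0b10, False))     # (True, 12)
```

Under these assumptions both schemes reach register 127: a vector with `RA=31` and extension `0b11`, or a scalar with `RA=31` and extension `0b11`.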
 246
 247 Readers familiar with OpenPOWER will know of Rc=1 operations that create
 248 an associated post-result "test", placing this test into an implicit
 249 Condition Register.  The original researchers who created the POWER ISA
 250 chose CR0 for Integer, and CR1 for Floating Point.  These *also become
 251 Vectorised* - implicitly - if the associated destination register is
 252 also Vectorised.  This allows for some very interesting savings on
 253 instruction count due to the very same CR Vectors being predication masks.
 254
 255 # Adding single predication
 256
The next step is to add a single predicate mask.  This is where it gets
interesting.  A predicate mask is a bitvector, each bit specifying, in
order, whether the element operation is to be skipped ("masked out")
or allowed. If no predicate is specified, the mask is treated as all 1s,
so that every element operation is allowed.
 262
    function op_add(RT, RA, RB) # add not VADD!
      int id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, RT); # predicate for the destination
      for i = 0 to VL-1:
        if (predval & 1<<i) # predication bit test
           ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
           if (!RT.isvec) break;
        if (RT.isvec)  { id += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }
 273
 274 The key modification is to skip the creation and storage of the result
 275 if the relevant predicate mask bit is clear, but *not the progression
 276 through the registers*.
 277
A particularly interesting case is if the destination is scalar, and the
first few bits of the predicate are zero.  The loop proceeds to increment
the Vector *source* registers until the first nonzero predicate bit is
found, whereupon a single result is computed, and *then* the loop exits.
This therefore uses the predicate to perform Vector source indexing.
This case was not possible without the predicate mask.
 284
 285 If all three registers are marked as Vector then the "traditional"
 286 predicated Vector behaviour is provided.  Yet, just as before, all other
 287 options are still provided, right the way back to the pure-scalar case,
 288 as if this were a straight OpenPOWER v3.0B non-augmented instruction.
 289
 290 Single Predication therefore provides several modes traditionally seen
 291 in Vector ISAs:
 292
 293 * VINSERT: the predicate may be set as a single bit, the sources are
 294   scalar and the destination a vector.
 295 * VSPLAT (result broadcasting) is provided by making the sources scalar
 296   and the destination a vector, and having no predicate set or having
 297   multiple bits set.
 298 * VSELECT is provided by setting up (at least one of) the sources as a
 299   vector, using a single bit in the predicate, and the destination as
 300   a scalar.
 301
 302 All of this capability and coverage without even adding one single actual
 303 Vector opcode, let alone 180, 600 or 1,300!
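
To make the three modes listed above concrete, here is a hedged Python sketch of the predicated loop. The `Reg` dataclass, the list-based `gpr`, and the use of a register holding zero as the second operand (so that the add behaves like a move) are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    num: int
    isvec: bool

def op_add(gpr, RT, RA, RB, VL, predval=None):
    if predval is None:
        predval = (1 << VL) - 1          # no predicate: all elements enabled
    id_ = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):           # predication bit test
            gpr[RT.num + id_] = gpr[RA.num + irs1] + gpr[RB.num + irs2]
            if not RT.isvec:
                break
        if RT.isvec:
            id_ += 1
        if RA.isvec:
            irs1 += 1
        if RB.isvec:
            irs2 += 1

gpr = [0] * 128
gpr[8] = 7                               # scalar value to move around
zero = Reg(9, False)                     # r9 holds 0, so add acts as a move

# VSPLAT: scalar sources, vector destination, no predicate
op_add(gpr, Reg(16, True), Reg(8, False), zero, VL=4)
splat = gpr[16:20]                       # [7, 7, 7, 7]

# VINSERT: scalar sources, vector destination, single predicate bit
gpr[24:28] = [1, 2, 3, 4]
op_add(gpr, Reg(24, True), Reg(8, False), zero, VL=4, predval=0b0100)
inserted = gpr[24:28]                    # [1, 2, 7, 4]

# VSELECT: vector source, scalar destination, single predicate bit
gpr[32:36] = [50, 60, 70, 80]
op_add(gpr, Reg(40, False), Reg(32, True), zero, VL=4, predval=0b0100)
selected = gpr[40]                       # 70
```

All three behaviours come from the same `op_add`; only the tags and the predicate differ.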
 304
 305 # Predicate "zeroing" mode
 306
Sometimes with predication it is acceptable to leave the masked-out
elements alone (not modify the result); at other times it is better to
zero them.  Zeroing can be combined with bit-wise ORing to build
up vectors from multiple predicate patterns: achieving the same combination
with nonzeroing involves more mv operations and predicate mask operations.
Our pseudocode therefore ends up as follows, to take the enhancement
into account:
 314
    function op_add(RT, RA, RB) # add not VADD!
      int id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, RT); # predicate for the destination
      for i = 0 to VL-1:
        if (predval & 1<<i) # predication bit test
           ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
           if (!RT.isvec) break;
        else if zeroing:   # predicate failed
           ireg[RT+id] <= 0 # set element to zero
        if (RT.isvec)  { id += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }
 327
Many Vector systems have either zeroing or nonzeroing, but not
both.  This is because they usually have separate Vector
 330 register files. However SV sits on top of standard register files and
 331 consequently there are advantages to both, so both are provided.
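
The difference between the two modes can be seen in a short Python sketch. For brevity it handles only the all-Vector case, and the name `pred_add` and the list-based `gpr` are hypothetical.

```python
def pred_add(gpr, RT, RA, RB, VL, predval, zeroing):
    """All-vector predicated add; masked-out elements are zeroed or left alone."""
    for i in range(VL):
        if predval & (1 << i):
            gpr[RT + i] = gpr[RA + i] + gpr[RB + i]
        elif zeroing:
            gpr[RT + i] = 0              # predicate failed: zero the element

gpr = [0] * 128
gpr[8:12] = [1, 2, 3, 4]
gpr[16:20] = [10, 10, 10, 10]
gpr[24:28] = [99, 99, 99, 99]            # stale destination contents
gpr[32:36] = [99, 99, 99, 99]

pred_add(gpr, 24, 8, 16, VL=4, predval=0b0101, zeroing=False)
pred_add(gpr, 32, 8, 16, VL=4, predval=0b0101, zeroing=True)
nonzeroing_result = gpr[24:28]           # [11, 99, 13, 99]
zeroing_result = gpr[32:36]              # [11, 0, 13, 0]
```

With nonzeroing the stale values survive in the masked-out positions; with zeroing the result can be ORed directly with another zeroed result that used the inverse mask.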
 332
 333 # Element Width overrides <a name="elwidths"></a>
 334
 335 All good Vector ISAs have the usual bitwidths for operations: 8/16/32/64
 336 bit integer operations, and IEEE754 FP32 and 64.  Often also included
 337 is FP16 and more recently BF16.  The *really* good Vector ISAs have
 338 variable-width vectors right down to bitlevel, and as high as 1024 bit
 339 arithmetic per element, as well as IEEE754 FP128.
 340
 341 SV has an "override" system that *changes* the bitwidth of operations
 342 that were intended by the original scalar ISA designers to have (for
 343 example) 64 bit operations (only).  The override widths are 8, 16 and
 344 32 for integer, and FP16 and FP32 for IEEE754 (with BF16 to be added in
 345 the future).
 346
 347 This presents a particularly intriguing conundrum given that the OpenPOWER
 348 Scalar ISA was never designed with for example 8 bit operations in mind,
 349 let alone Vectors of 8 bit.
 350
The solution comes in terms of rethinking the definition of a Register
File.  The typical regfile may be considered to be a multi-ported SRAM
block, 64 bits wide and usually 32 entries deep, to give 32 64-bit
registers.  In C this would be:
 355
 356     typedef uint64_t reg_t;
 357     reg_t int_regfile[32]; // standard scalar 32x 64bit
 358
Conceptually, to get our variable element width vectors,
we may think of the regfile as instead being the following C-based data
structure, where all types uint16_t etc. are in little-endian order:
 362
 363     #pragma(packed)
 364     typedef union {
 365         uint8_t  actual_bytes[8];
 366         uint8_t  b[0]; // array of type uint8_t
 367         uint16_t s[0]; // array of LE ordered uint16_t
 368         uint32_t i[0];
 369         uint64_t l[0]; // default OpenPOWER ISA uses this
 370     } reg_t;
 371
 372     reg_t int_regfile[128]; // SV extends to 128 regs
 373
This means that Vector elements start from the location specified by a
64 bit "register", but that from that location onwards the elements *overlap
subsequent registers*.
 377
 378 Here is another way to view the same concept, bearing in mind that it
 379 is assumed a LE memory order:
 380
 381     uint8_t reg_sram[8*128];
 382     uint8_t *actual_bytes = &reg_sram[RA*8];
 383     if elwidth == 8:
 384         uint8_t *b = (uint8_t*)actual_bytes;
 385         b[idx] = result;
 386     if elwidth == 16:
 387         uint16_t *s = (uint16_t*)actual_bytes;
 388         s[idx] = result;
 389     if elwidth == 32:
 390         uint32_t *i = (uint32_t*)actual_bytes;
 391         i[idx] = result;
 392     if elwidth == default:
 393         uint64_t *l = (uint64_t*)actual_bytes;
 394         l[idx] = result;
 395
 396 Starting with all zeros, setting `actual_bytes[3]` in any given `reg_t`
 397 to 0x01 would mean that:
 398
* b[0..2] = 0x00 and b[3] = 0x01
* s[0] = 0x0000 and s[1] = 0x0100
* i[0] = 0x01000000
* l[0] = 0x0000000001000000
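
The overlapping little-endian views can be checked with Python's standard `struct` module. This is a sketch of a single 64-bit register held in a `bytearray`; the variable names mirror the union members but are otherwise just for illustration.

```python
import struct

reg = bytearray(8)                       # one 64-bit register, all zeros
reg[3] = 0x01                            # actual_bytes[3] = 0x01

b = list(reg)                            # 8-bit view
s = list(struct.unpack("<4H", reg))      # 16-bit little-endian view
i = list(struct.unpack("<2I", reg))      # 32-bit little-endian view
l = list(struct.unpack("<Q", reg))       # 64-bit little-endian view

print([hex(x) for x in s])               # ['0x0', '0x100', '0x0', '0x0']
print(hex(i[0]), hex(l[0]))              # 0x1000000 0x1000000
```

Byte 3 is the high byte of `s[1]` and the most significant byte of `i[0]`, which is why the single set bit appears at bit 8 of the 16-bit view and at bit 24 of the 32-bit and 64-bit views.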
 403
 404 In tabular form, starting an elwidth=8 loop from r0 and extending for
 405 16 elements would begin at r0 and extend over the entirety of r1:
 406
 407        | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
 408        | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
 409     r0 | b[0]  | b[1]  | b[2]  | b[3]  | b[4]  | b[5]  | b[6]  | b[7]  |
 410     r1 | b[8]  | b[9]  | b[10] | b[11] | b[12] | b[13] | b[14] | b[15] |
 411
 412 Starting an elwidth=16 loop from r0 and extending for
 413 7 elements would begin at r0 and extend partly over r1.  Note that
 414 b0 indicates the low byte (lowest 8 bits) of each 16-bit word, and
 415 b1 represents the top byte:
 416
 417        | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
 418        | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
 419     r0 | s[0].b0  b1   | s[1].b0  b1   | s[2].b0  b1   |  s[3].b0  b1  |
 420     r1 | s[4].b0  b1   | s[5].b0  b1   | s[6].b0  b1   |  unmodified   |
 421
 422 Likewise for elwidth=32, and a loop extending for 3 elements.  b0 through
 423 b3 represent the bytes (numbered lowest for LSB and highest for MSB) within
 424 each element word:
 425
 426        | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
 427        | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
 428     r0 | w[0].b0  b1      b2      b3   | w[1].b0  b1      b2      b3   |
 429     r1 | w[2].b0  b1      b2      b3   | unmodified    unmodified      |
 430
 431 64-bit (default) elements access the full registers.  In each case the
 432 register number (`RT`, `RA`) indicates the *starting* point for the storage
 433 and retrieval of the elements.
 434
 435 Our simple loop, instead of accessing the array of regfile entries
 436 with a computed index `iregs[RT+i]`, would access the appropriate element
 437 of the appropriate width, such as `iregs[RT].s[i]` in order to access
 438 16 bit elements starting from RT.  Thus we have a series of overlapping
 439 conceptual arrays that each start at what is traditionally thought of as
 440 "a register".  It then helps if we have a couple of routines:
 441
    get_polymorphed_reg(reg, bitwidth, offset):
        reg_t res = 0;
        if (!reg.isvec): # scalar
            offset = 0
        # read into the result (res), not back into reg
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == default: # 64
            res.l = int_regfile[reg].l[offset]
        return res
 455
 456     set_polymorphed_reg(reg, bitwidth, offset, val):
 457         if (!reg.isvec): # scalar
 458             offset = 0
 459         if bitwidth == 8:
 460             int_regfile[reg].b[offset] = val
 461         elif bitwidth == 16:
 462             int_regfile[reg].s[offset] = val
 463         elif bitwidth == 32:
 464             int_regfile[reg].i[offset] = val
 465         elif bitwidth == default: # 64
 466             int_regfile[reg].l[offset] = val
 467
 468 These basically provide a convenient parameterised way to access the
 469 register file, at an arbitrary vector element offset and an arbitrary
 470 element width.  Our first simple loop thus becomes:
 471
 472     for i = 0 to VL-1:
 473        src1 = get_polymorphed_reg(RA, srcwid, i)
 474        src2 = get_polymorphed_reg(RB, srcwid, i)
 475        result = src1 + src2 # actual add here
 476        set_polymorphed_reg(RT, destwid, i, result)
 477
 478 With this loop, if elwidth=16 and VL=3 the first 48 bits of the target
 479 register will contain three 16 bit addition results, and the upper 16
 480 bits will be *unaltered*.
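
The claim about the unaltered upper 16 bits can be demonstrated with a byte-addressable Python model of the regfile. This is a minimal sketch: `REGFILE`, `get_elem`, `set_elem` and `sv_add` are hypothetical names, scalar (non-vector) operands are not handled, and both sources and the destination share one element width.

```python
REGFILE = bytearray(8 * 128)     # byte-level model of 128 x 64-bit registers

def get_elem(reg, bitwidth, offset):
    nbytes = bitwidth // 8
    start = reg * 8 + offset * nbytes
    return int.from_bytes(REGFILE[start:start + nbytes], "little")

def set_elem(reg, bitwidth, offset, val):
    nbytes = bitwidth // 8
    start = reg * 8 + offset * nbytes
    val &= (1 << bitwidth) - 1               # truncate to the element width
    REGFILE[start:start + nbytes] = val.to_bytes(nbytes, "little")

def sv_add(RT, RA, RB, VL, elwidth):
    for i in range(VL):
        result = get_elem(RA, elwidth, i) + get_elem(RB, elwidth, i)
        set_elem(RT, elwidth, i, result)

REGFILE[3 * 8:4 * 8] = b"\xff" * 8           # r3 starts out as all ones
for idx, (a, b) in enumerate([(1, 10), (2, 20), (3, 30)]):
    set_elem(1, 16, idx, a)                  # r1 = three 16-bit elements
    set_elem(2, 16, idx, b)                  # r2 = three 16-bit elements

sv_add(RT=3, RA=1, RB=2, VL=3, elwidth=16)
r3 = get_elem(3, 64, 0)
print(hex(r3))                               # 0xffff00210016000b
```

The three results (0x000b, 0x0016, 0x0021) occupy the low 48 bits of r3, while the top 16 bits keep their previous 0xffff contents.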
 481
Note that things such as zero/sign-extension (and predication) have
been left out to illustrate the elwidth concept. Also note that it turns
out to be important to perform the operation internally at effectively an
*infinite* bitwidth, such that any truncation, rounding errors or
other artefacts may all be ironed out.  This matters particularly
when applying Saturation for Audio DSP workloads, especially for multiply
and IEEE754 FP rounding.  "Infinite" here is conceptual only: in reality,
the application of the different truncations and width-extensions sets a
fixed deterministic practical limit on the internal precision needed, on a
per-operation basis.
 487
 488 Other than that, element width overrides, which can be applied to *either*
 489 source or destination or both, are pretty straightforward, conceptually.
 490 The details, for hardware engineers, involve byte-level write-enable
 491 lines, which is exactly what is used on SRAMs anyway.  Compiler writers
 492 have to alter Register Allocation Tables to byte-level granularity.
 493
One critical thing to note: upper parts of the underlying 64 bit
register are *not zero'd out* by a write involving a non-aligned Vector
Length. An 8 bit operation with VL=7 will *not* overwrite the 8th byte
of the destination.  The only situation where a full overwrite occurs
is on "default" behaviour.  It is extremely important to consider the
register file as a byte-level store, not a 64-bit-level store.
 500
 501 ## Why a LE regfile?
 502
The concept of having a regfile where the byte ordering of the underlying
SRAM matters seems utter nonsense.  Surely, a hardware implementation gets to
choose the order, right? It's memory only where LE/BE matters, right? The
bytes come in, all registers are 64 bit and it's just wiring, right?
 507
 508 Ordinarily this would be 100% correct, in both a scalar ISA and in a Cray
 509 style Vector one.  The assumption in that last question was, however, "all
 510 registers are 64 bit".  SV allows SIMD-style packing of vectors into the
 511 64 bit registers, where one instruction and the next may interpret that
 512 very same register as containing elements of completely different widths.
 513
Consequently it becomes critically important to decide a byte-order.
That decision was - arbitrarily - LE mode.  Actually it wasn't arbitrary
at all: it was such hell to implement BE supported interpretations of CRs
and LD/ST in LibreSOC, based on a terse spec that provides insufficient
clarity and assumes significant working knowledge of OpenPOWER, with
arbitrary insertions of 7-index here and 3-bitindex there, that the decision
to pick LE was extremely easy.
 521
Without such a decision, if two words are packed as elements into a 64
bit register, what does this mean? Should they be inverted so that the
lower indexed element goes into the HI or the LO word? Should the 8
bytes of each register be inverted? Should the bytes in each element
be inverted? Should the element indexing loop order be broken into
discontiguous chunks such as 32107654 rather than 01234567, and if so
at what granularity of discontinuity? These are all equally valid and
legitimate interpretations of what constitutes "BE" and they all cause
merry mayhem.
 531
 532 The decision was therefore made: the c typedef union is the canonical
 533 definition, and its members are defined as being in LE order. From there,
 534 implementations may choose whatever internal HDL wire order they like
 535 as long as the results produced conform to the elwidth pseudocode.
 536
 537 *Note: it turns out that both x86 SIMD and NEON SIMD follow this convention, namely that both are implicitly LE, even though their ISA Manuals may not explicitly spell this out*
 538
 539 * <https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Application-Level-Memory-Model/Endian-support/Endianness-in-Advanced-SIMD?lang=en>
 540 * <https://stackoverflow.com/questions/24045102/how-does-endianness-work-with-simd-registers>
 541 * <https://llvm.org/docs/BigEndianNEON.html>
 542
 543
 544 ## Source and Destination overrides
 545
 546 A minor fly in the ointment: what happens if the source and destination
 547 are over-ridden to different widths?  For example, FP16 arithmetic is
 548 not accurate enough and may introduce rounding errors when up-converted
 549 to FP32 output.  The rule is therefore set:
 550
 551     The operation MUST take place effectively at infinite precision:
 552     actual precision determined by the operation and the operand widths
 553
 554 In pseudocode this is:
 555
    for i = 0 to VL-1:
       src1 = get_polymorphed_reg(RA, srcwid, i)
       src2 = get_polymorphed_reg(RB, srcwid, i)
       opwidth = max(srcwid, destwid) # usually
       result = op_add(src1, src2, opwidth) # at max width
       set_polymorphed_reg(RT, destwid, i, result)
 562
In reality the source and destination widths determine the actual required
precision in a given ALU.  The reason for setting "effectively" infinite
precision is illustrated, for example, by Saturated-multiply: if the
internal precision were insufficient it would not be possible to correctly
determine whether the maximum clip range had been exceeded.
 566
Thus it will turn out that under some conditions the combination of the
extension of the source registers followed by truncation of the result
gets rid of bits that didn't matter, and the operation might as well have
taken place at the narrower width and could save resources that way.
Examples include Logical OR, where the source extension would place
zeros in the upper bits and the truncation of the result then throws
those zeros away.
 574
 575 Counterexamples include the previously mentioned FP16 arithmetic,
 576 where for operations such as division of large numbers by very small
 577 ones it should be clear that internal accuracy will play a major role
 578 in influencing the result.  Hence the rule that the calculation takes
 579 place at the maximum bitwidth, and truncation follows afterwards.
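
A small Python sketch shows the effect of differing source and destination widths on an integer add. It follows the pseudocode's `opwidth = max(srcwid, destwid)` literally, with zero-extension on read and truncation on write; the byte-level `REGFILE` model and all function names are assumptions made for this example.

```python
REGFILE = bytearray(8 * 128)     # byte-level model of the register file

def get_elem(reg, width, idx):
    n = width // 8
    start = reg * 8 + idx * n
    return int.from_bytes(REGFILE[start:start + n], "little")

def set_elem(reg, width, idx, val):
    n = width // 8
    start = reg * 8 + idx * n
    REGFILE[start:start + n] = (val & ((1 << width) - 1)).to_bytes(n, "little")

def sv_add_widths(RT, RA, RB, VL, srcwid, destwid):
    opwidth = max(srcwid, destwid)
    for i in range(VL):
        src1 = get_elem(RA, srcwid, i)                  # zero-extended read
        src2 = get_elem(RB, srcwid, i)
        result = (src1 + src2) & ((1 << opwidth) - 1)   # op at max width
        set_elem(RT, destwid, i, result)                # truncate on write

for idx, (a, b) in enumerate([(200, 100), (250, 10)]):
    set_elem(1, 8, idx, a)       # r1: two 8-bit elements
    set_elem(2, 8, idx, b)       # r2: two 8-bit elements

sv_add_widths(RT=3, RA=1, RB=2, VL=2, srcwid=8, destwid=16)
wide = [get_elem(3, 16, k) for k in range(2)]     # [300, 260]

sv_add_widths(RT=4, RA=1, RB=2, VL=2, srcwid=8, destwid=8)
narrow = [get_elem(4, 8, k) for k in range(2)]    # [44, 4]
```

With a 16-bit destination the carry out of the 8-bit sources is kept; with an 8-bit destination it is lost to truncation.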
 580
 581 ## Signed arithmetic
 582
 583 What happens when the operation involves signed arithmetic?  Here the
 584 implementor has to use common sense, and make sure behaviour is accurately
 585 documented.  If the result of the unmodified operation is sign-extended
 586 because one of the inputs is signed, then the input source operands must
 587 be first read at their overridden bitwidth and *then* sign-extended:
 588
    for i = 0 to VL-1:
       src1 = get_polymorphed_reg(RA, srcwid, i)
       src2 = get_polymorphed_reg(RB, srcwid, i)
       opwidth = max(srcwid, destwid)
       # srces known to be less than result width
       src1 = sign_extend(src1, srcwid, opwidth)
       src2 = sign_extend(src2, srcwid, opwidth)
       result = op_signed(src1, src2, opwidth) # at max width
       set_polymorphed_reg(RT, destwid, i, result)
 598
 599 The key here is that the cues are taken from the underlying operation.
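
As a sketch of what "read at the overridden bitwidth, *then* sign-extend" means in practice, the Python below works on raw bit patterns. The helper names `sign_extend` and `signed_add_elem` are illustrative, and two's complement representation is assumed.

```python
def sign_extend(val, fromwid, towid):
    """Treat the low `fromwid` bits of val as two's complement and
    return the same value as a `towid`-bit pattern."""
    val &= (1 << fromwid) - 1
    if val & (1 << (fromwid - 1)):       # sign bit set
        val -= 1 << fromwid              # now a negative Python int
    return val & ((1 << towid) - 1)

def signed_add_elem(a, b, srcwid, destwid):
    opwidth = max(srcwid, destwid)
    a = sign_extend(a, srcwid, opwidth)
    b = sign_extend(b, srcwid, opwidth)
    result = (a + b) & ((1 << opwidth) - 1)   # add at the wider width
    return result & ((1 << destwid) - 1)      # truncate to destination

# 0x80 is -128 and 0xFF is -1 as 8-bit values: the sum is -129
print(hex(signed_add_elem(0x80, 0xFF, 8, 16)))   # 0xff7f
# 0xFE is -2, 0x05 is +5: the sum is +3
print(hex(signed_add_elem(0xFE, 0x05, 8, 16)))   # 0x3
```

Had the 8-bit sources been zero-extended instead, the first sum would have been 0x017f rather than the 16-bit pattern for -129.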
 600
 601 ## Saturation
 602
Audio DSPs need to be able to clip sound when the "volume" is adjusted:
if the result is too loud and the signal wraps, distortion occurs.  The
solution is to clip (saturate) the audio and allow this to be detected.
In practical terms this is a post-result analysis; however, it needs to
take place at the largest bitwidth, i.e. before a result is element width
truncated.  Only then can the arithmetic saturation condition be detected:
 609
    for i = 0 to VL-1:
       src1 = get_polymorphed_reg(RA, srcwid, i)
       src2 = get_polymorphed_reg(RB, srcwid, i)
       opwidth = max(srcwid, destwid)
       # unsigned add
       result = op_add(src1, src2, opwidth) # at max width
       # now saturate (unsigned)
       sat = min(result, (1<<destwid)-1)
       set_polymorphed_reg(RT, destwid, i, sat)
       # set sat overflow
       if Rc=1:
          CR[i].ov = (sat != result)
 622
 623 So the actual computation took place at the larger width, but was
 624 post-analysed as an unsigned operation.  If however "signed" saturation
 625 is requested then the actual arithmetic operation has to be carefully
 626 analysed to see what that actually means.
 627
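The unsigned case can be sketched in a few lines of Python.  This is a
minimal model of one element only, assuming 16-bit sources and 8-bit
destination elements; the function name `sat_add_unsigned` and the
returned overflow flag (standing in for `CR[i].ov`) are illustrative
assumptions.

```python
def sat_add_unsigned(src1, src2, srcwid, destwid):
    """Unsigned add at the wider width, then clamp to destwid bits."""
    opwidth = max(srcwid, destwid)
    # the add itself wraps at the operation width, as in hardware
    result = (src1 + src2) & ((1 << opwidth) - 1)
    # clamp to the largest value the destination element can hold
    sat = min(result, (1 << destwid) - 1)
    overflow = sat != result  # what Rc=1 would record in CR[i].ov
    return sat, overflow

# 16-bit sources, 8-bit destination elements
pairs = [(100, 50), (200, 100), (255, 0), (0x1234, 1)]
results = [sat_add_unsigned(a, b, 16, 8) for a, b in pairs]
print(results)
```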
In terms of FP arithmetic, which by definition has a sign bit (so
always takes place as a signed operation anyway), the request to saturate
to signed min/max is pretty clear.  However for integer arithmetic such
as shift (plain shift, not arithmetic shift), or logical operations
such as XOR, which were never designed with the assumption that their
inputs be considered as signed numbers, common sense has to kick in,
and follow what CR0 does.

CR0 for Logical operations still applies: the test is still applied to
produce CR.eq, CR.lt and CR.gt analysis.  Following this lead we may
do the same thing: although the input operands of an OR or XOR can
in no way be thought of as "signed" we may at least consider the result
to be signed, and thus apply min/max range detection -128 to +127 when
truncating down to 8 bit for example.

    for i = 0 to VL-1:
       src1 = get_polymorphed_reg(RA, srcwid, i)
       src2 = get_polymorphed_reg(RB, srcwid, i)
       opwidth = max(srcwid, destwid)
       # logical op, signed has no meaning
       result = op_xor(src1, src2, opwidth)
       # now saturate (signed): clamp to the upper bound first...
       sat = min(result, (1<<(destwid-1))-1)
       # ...then clamp the already-clamped value to the lower bound
       sat = max(sat, -(1<<(destwid-1)))
       set_polymorphed_reg(RT, destwid, i, sat)

Overall here the rule is: apply common sense then document the behaviour
really clearly, for each and every operation.

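To make the "treat the result as signed" idea concrete, here is a small
Python sketch of one element of a saturating XOR.  It assumes that the
operation-width result is reinterpreted as two's complement before
clamping; `to_signed` and `sat_xor_signed` are hypothetical helper names
chosen for this example.

```python
def to_signed(value, width):
    """Reinterpret a width-bit unsigned value as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << (width - 1)) else value

def sat_xor_signed(src1, src2, srcwid, destwid):
    opwidth = max(srcwid, destwid)
    # XOR itself is sign-agnostic; only the *result* is treated as signed
    result = to_signed(src1 ^ src2, opwidth)
    hi = (1 << (destwid - 1)) - 1
    lo = -(1 << (destwid - 1))
    return max(lo, min(result, hi))

# 16-bit sources saturating into 8-bit destination elements
small = sat_xor_signed(0x0005, 0x0003, 16, 8)
too_big = sat_xor_signed(0x0100, 0x0001, 16, 8)
too_small = sat_xor_signed(0xFF00, 0x0001, 16, 8)
print(small, too_big, too_small)
```

The three cases show a result that already fits (6), one clamped to the
upper bound (+127) and one clamped to the lower bound (-128).
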
# Quick recap so far

The above functionality pretty much covers around 85% of Vector ISA needs.
Predication is provided so that parallel if/then/else constructs can
be performed: critical given that sequential if/then statements and
branches simply do not translate successfully to Vector workloads.
VSPLAT capability is provided, which accounts for approximately 20% of
all GPU workload operations.  Also covered, with elwidth overriding, are
the smaller arithmetic operations that caused ISAs developed from the
late 80s onwards to get themselves into a tizzy when adding "Multimedia"
acceleration aka "SIMD" instructions.

Experienced Vector ISA readers will however have noted that VCOMPRESS
and VEXPAND are missing, as is Vector "reduce" (mapreduce) capability
and VGATHER and VSCATTER.  Compress and Expand are covered by Twin
Predication; also still to be covered are fail-on-first, CR-based result
predication, and Subvectors and Swizzle.

## SUBVL <a name="subvl"></a>

Adding in support for SUBVL is a matter of adding in an extra inner
for-loop, where register src and dest are still incremented inside the
inner part.  Predication is still taken from the VL index, however it
is applied to the whole subvector:

    function op_add(RT, RA, RB) # add not VADD!
      int id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, RT);
      for i = 0 to VL-1:
        if (predval & 1<<i) # predication uses intregs
          for (s = 0; s < SUBVL; s++)
            sd = id*SUBVL + s
            srs1 = irs1*SUBVL + s
            srs2 = irs2*SUBVL + s
            ireg[RT+sd] <= ireg[RA+srs1] + ireg[RB+srs2];
          if (!RT.isvec) break;
        if (RT.isvec)  { id += 1; }
        if (RA.isvec)  { irs1 += 1; }
        if (RB.isvec)  { irs2 += 1; }

The primary reason for this is that Shader Compilers treat vec2/3/4 as
"single units".  Recognising this in hardware is just sensible.

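The loop above translates fairly directly into runnable Python.  The
sketch below is only a model under stated assumptions: the register file
is a plain list, the predicate is passed in as an integer bitmask rather
than fetched with `get_pred_val`, and the `isvec` tuple is a hypothetical
stand-in for the per-register vector/scalar marking.

```python
def sv_add_subvl(ireg, RT, RA, RB, VL, SUBVL, predval,
                 isvec=(True, True, True)):
    """Element-wise add over VL subvectors of SUBVL elements each.

    One predicate bit (indexed by the VL loop) gates a whole subvector.
    """
    rt_vec, ra_vec, rb_vec = isvec
    id_ = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):
            for s in range(SUBVL):
                sd = id_ * SUBVL + s
                srs1 = irs1 * SUBVL + s
                srs2 = irs2 * SUBVL + s
                ireg[RT + sd] = ireg[RA + srs1] + ireg[RB + srs2]
            if not rt_vec:
                break  # scalar destination: first active result only
        if rt_vec:
            id_ += 1
        if ra_vec:
            irs1 += 1
        if rb_vec:
            irs2 += 1
    return ireg

# 32 registers; two sources of VL=2 vec3 subvectors at r8 and r16
regs = [0] * 32
regs[8:14] = [1, 2, 3, 4, 5, 6]
regs[16:22] = [10, 20, 30, 40, 50, 60]
# predicate 0b10: only the second vec3 is active
sv_add_subvl(regs, RT=0, RA=8, RB=16, VL=2, SUBVL=3, predval=0b10)
print(regs[0:6])
```

A single predicate bit enables or disables all three elements of a vec3:
the first subvector of the destination is left untouched, the second is
written in full.
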
# Swizzle <a name="swizzle"></a>

Swizzle is particularly important for 3D work.  It allows in-place
reordering of XYZW, ARGB etc. and access of sub-portions of the same in
arbitrary order *without* requiring time-consuming scalar mv instructions
(scalar due to the convoluted offsets).

Swizzling does not just do permutations: it allows arbitrary selection
and multiple copying of vec2/3/4 elements, such as XXXZ as the source
operand, which will take 3 copies of the vec4 first element (vec4[0]),
placing them at positions vec4[0], vec4[1] and vec4[2], whilst the "Z"
element (vec4[2]) is copied into vec4[3].

With somewhere between 10% and 30% of operations in 3D Shaders involving
swizzle this is a huge saving: it reduces the pressure on register files
that would otherwise come from the significant numbers of mv operations
needed to get vector elements to "line up".

In SV given the percentage of operations that also involve initialisation
to 0.0 or 1.0 into subvector elements the decision was made to include
those:

    swizzle = get_swizzle_immed() # 12 bits
    for (s = 0; s < SUBVL; s++)
        remap = (swizzle >> 3*s) & 0b111
        if remap < 4:
           sm = id*SUBVL + remap
           ireg[RT+s] <= ireg[RA+sm]
        elif remap == 4:
           ireg[RT+s] <= 0.0
        elif remap == 5:
           ireg[RT+s] <= 1.0

Note that a value of 6 or 7 will leave the target subvector element
untouched. This is equivalent to a predicate mask which is built-in,
in immediate form, into the [[sv/mv.swizzle]] operation.  mv.swizzle is
rare in that it is one of the few instructions needed to be added that
are never going to be part of a Scalar ISA.  Even in High Performance
Compute workloads it is unusual: it is only because SV is targeted at
3D and Video that it is being considered.

Some 3D GPU ISAs also allow for two-operand subvector swizzles.  These are
sufficiently unusual, and the immediate opcode space required so large
(12 bits per vec4 source), that the decision was made in SV to add only
mv.swizzle.

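The swizzle decode can be tried out with a short Python sketch operating
on a single vec4.  The selector constants (`X`, `Y`, `Z`, `W`, `ZERO`,
`ONE`, `KEEP`) and the `encode_swizzle` helper are assumptions invented
for this example; only the 3-bits-per-element layout and the meaning of
values 0 to 7 are taken from the pseudocode above.

```python
X, Y, Z, W, ZERO, ONE, KEEP = 0, 1, 2, 3, 4, 5, 6

def encode_swizzle(*selectors):
    """Pack up to four 3-bit selectors, element 0 in the lowest bits."""
    imm = 0
    for s, sel in enumerate(selectors):
        imm |= sel << (3 * s)
    return imm

def mv_swizzle(dest, src, swizzle):
    """Apply a 12-bit swizzle immediate to one subvector."""
    out = list(dest)
    for s in range(len(dest)):
        remap = (swizzle >> (3 * s)) & 0b111
        if remap < 4:
            out[s] = src[remap]      # copy the selected source element
        elif remap == ZERO:
            out[s] = 0.0
        elif remap == ONE:
            out[s] = 1.0
        # 6 and 7: leave the destination element untouched
    return out

src = [0.25, 0.5, 0.75, 2.0]            # X, Y, Z, W
old = [9.0, 9.0, 9.0, 9.0]
xxxz = mv_swizzle(old, src, encode_swizzle(X, X, X, Z))
mixed = mv_swizzle(old, src, encode_swizzle(W, ZERO, ONE, KEEP))
print(xxxz, mixed)
```

The first call reproduces the XXXZ example from the text; the second
shows a constant 0.0, a constant 1.0 and an untouched element in the
same operation.
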
# Twin Predication

Twin Predication is cool.  Essentially it is a back-to-back
VCOMPRESS-VEXPAND (a multiple sequentially ordered VINSERT).  The compress
part is covered by the source predicate and the expand part by the
destination predicate.  Of course, if the destination predicate is all
1s then the operation degenerates *to* VCOMPRESS, and if the source
predicate is all 1s it degenerates to VEXPAND.

    function op(RT, RS):
      ps = get_pred_val(FALSE, RS); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip masked-out elements, without running past VL
        if (RS.isvec) while (i < VL && !(ps & 1<<i)) i++;
        if (RT.isvec) while (j < VL && !(pd & 1<<j)) j++;
        if (i == VL || j == VL) break;
        reg[RT+j] = SCALAR_OPERATION_ON(reg[RS+i])
        if (RS.isvec) i++;
        if (RT.isvec) j++; else break

Here's the interesting part: given the fact that SV is a "context"
extension, the above pattern can be applied to a lot more than just MV,
which is normally all that VCOMPRESS and VEXPAND do in traditional
Vector ISAs: move registers.  Twin Predication can be applied to `extsw`
or `fcvt`, LD/ST operations and even `rlwinm` and other operations
taking a single source and immediate(s) such as `addi`.  All of these
are termed single-source, single-destination.

LDST Address-generation, or AGEN, is a special case of single source,
because elwidth overriding does not make sense to apply to the computation
of the 64 bit address itself, but it *does* make sense to apply elwidth
overrides to the data being accessed *at* that memory address.

It also turns out that by using a single bit set in the source or
destination, *all* the sequentially ordered standard patterns of Vector
ISAs are provided: VSPLAT, VSELECT, VINSERT, VCOMPRESS, VEXPAND.

The only one missing from the list here, because it is non-sequential,
is VGATHER (and VSCATTER): moving registers by specifying a vector of
register indices (`regs[rd] = regs[regs[rs]]` in a loop).  This one is
tricky because it typically does not exist in standard scalar ISAs.
If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
VGATHER/VSCATTER.

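How one loop yields several of the classic patterns is easiest to see by
running it.  The following Python sketch models a twin-predicated
register move; it assumes predicates are passed as integer bitmasks and
uses hypothetical `rs_vec`/`rt_vec` flags in place of the `isvec`
marking.  It is an illustration of the idea, not the SV reference model.

```python
def twin_pred_mv(regs, RT, RS, VL, ps, pd, rs_vec=True, rt_vec=True):
    """Twin-predicated register move: ps masks the source, pd the dest."""
    i = j = 0
    while i < VL and j < VL:
        if rs_vec:
            while i < VL and not ((ps >> i) & 1):
                i += 1
        if rt_vec:
            while j < VL and not ((pd >> j) & 1):
                j += 1
        if i == VL or j == VL:
            break
        regs[RT + j] = regs[RS + i]
        if rs_vec:
            i += 1
        if rt_vec:
            j += 1
        else:
            break
    return regs

ALL = 0b1111
# VCOMPRESS: sparse source mask, all-1s destination mask
compress = twin_pred_mv([0] * 4 + [10, 11, 12, 13], 0, 4, 4, 0b1010, ALL)
# VEXPAND: all-1s source mask, sparse destination mask
expand = twin_pred_mv([0] * 4 + [10, 11, 12, 13], 0, 4, 4, ALL, 0b1010)
# splat-like: scalar source repeated across a vector destination
splat = twin_pred_mv([0] * 4 + [10, 11, 12, 13], 0, 4, 4, ALL, ALL,
                     rs_vec=False)
print(compress[:4], expand[:4], splat[:4])
```

The same function packs the selected elements together, spreads
consecutive elements out to the selected slots, or repeats one scalar,
purely depending on the two masks and the scalar/vector marking.
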
# CR predicate result analysis

OpenPOWER has Condition Registers.  These store an analysis of the result
of an operation to test it for being greater than, less than or equal to
zero.  What if a test could be done, similar to branch BO testing, which
hooked into the predication system?

    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i): continue # skip
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # Rc=1 always stores the CR
        if RC1 or Rc=1: crregs[offs+i] = CRnew
        if RC1: continue # RC1 mode skips result store
        # now test CR, similar to branch
        if CRnew[BO[0:1]] == BO[2]:
            # result optionally stored but CR always is
            iregs[RT+i] = result

Note that whilst the Vector of CRs is always written to the CR regfile,
only those result elements that pass the BO test get written to the
integer regfile (when RC1 mode is not set).  In RC1 mode the CR is always
stored, but the result never is. This effectively turns every arithmetic
operation into a type of `cmp` instruction.

Here for example if FP overflow occurred, and the CR testing was carried
out for that, all valid results would be stored but invalid ones would
not, and in addition the Vector of CRs would contain the indicators of
which ones failed.  With the invalid results being simply not written
this could save resources (save on register file writes).

Because the predicate mask is effectively ANDed with the post-result
analysis as a secondary type of predication, savings are also expected
in those operations where the post-result analysis, if not included in
SV, would need a second predicate calculation followed by a predicate
mask AND operation.

Note, hilariously, that Vectorised Condition Register Operations (crand,
cror) may also have post-result analysis applied to them.  With Vectors
of CRs being utilised *for* predication, possibilities for compact and
elegant code begin to emerge from this innocuous-looking addition to SV.

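A simplified Python model may help show the two effects side by side
(results filtered, CRs recorded).  It assumes Rc=1 behaviour, so a CR is
kept for every active element, and it replaces the BO field with a
hypothetical `test_bit`/`want` pair; the CR is modelled as a dictionary
of `lt`, `gt` and `eq` flags.

```python
def analyse(result):
    """Return CR-style flags for a signed comparison against zero."""
    return {"lt": result < 0, "gt": result > 0, "eq": result == 0}

def pred_result_op(op, src_a, src_b, dest, mask, test_bit, want,
                   rc1=False):
    """Apply op per element; store only results whose CR bit == want."""
    crs = [None] * len(src_a)
    for i in range(len(src_a)):
        if not ((mask >> i) & 1):
            continue                      # masked out by the predicate
        result = op(src_a[i], src_b[i])
        crs[i] = analyse(result)          # the CR is stored either way
        if rc1:
            continue                      # RC1 mode: CR only, no result
        if crs[i][test_bit] == want:
            dest[i] = result
    return dest, crs

# subtract, keeping only strictly positive differences
a, b = [5, 1, 7, 3], [2, 4, 7, 1]
dest, crs = pred_result_op(lambda x, y: x - y, a, b, [0] * 4,
                           mask=0b1111, test_bit="gt", want=True)
passed = [cr["gt"] for cr in crs]
print(dest, passed)
```

Elements 1 and 2 fail the test, so their results are never written, yet
the Vector of CRs still records exactly which elements failed and could
be used as a predicate for a later operation.
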
# Exception-based Fail-on-first

One of the major issues with Vectorised LD/ST operations is when a
batch of LDs crosses a page-fault boundary.  With considerable resources
being taken up with in-flight data, a large Vector LD being cancelled
or unable to roll back is either a detriment to performance or can cause
data corruption.

What if, then, rather than cancel an entire Vector LD because the last
operation would cause a page fault, the Vector were instead truncated to
the last successful element?

This is called "fail-on-first".  Here is strncpy, illustrated from RVV:

    strncpy:
        c.mv a3, a0             # Copy dst
    loop:
        setvli x0, a2, vint8    # Vectors of bytes.
        vlbff.v v1, (a1)        # Get src bytes
        vseq.vi v0, v1, 0       # Flag zero bytes
        vmfirst a4, v0          # Zero found?
        vmsif.v v0, v0          # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t    # Write out bytes
        c.bgez a4, exit         # Done
        csrr t1, vl             # Get number of bytes fetched
        c.add a1, a1, t1        # Bump src pointer
        c.sub a2, a2, t1        # Decrement count.
        c.add a3, a3, t1        # Bump dst pointer
        c.bnez a2, loop         # Anymore?
    exit:
        c.ret

Vector Length VL is truncated inherently at the first page faulting
byte-level LD.  Otherwise, with more powerful hardware the number of
elements LOADed from memory could be dozens to hundreds or greater
(memory bandwidth permitting).

With VL truncated the analysis looking for the zero byte and the
subsequent STORE (a straight ST, not a ffirst ST) can proceed, safe in the
knowledge that every byte loaded in the Vector is valid.  Implementors are
even permitted to "adapt" VL, truncating it early so that, for example,
subsequent iterations of loops will have LD/STs on aligned boundaries.

SIMD strncpy hand-written assembly routines are, to be blunt about it,
a total nightmare.  240 instructions is not uncommon, and the worst
thing about them is that they are unable to cope with detection of a
page fault condition.

Note: see <https://bugs.libre-soc.org/show_bug.cgi?id=561>

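The same control flow can be mimicked in Python to show where VL gets
truncated.  This is a toy model under several assumptions: memory is a
`bytes` object, everything at or beyond `mapped_limit` is treated as an
unmapped page, and the names `ffirst_load` and `strncpy_ffirst` are
invented for the example.  A fault on the very first element is still
raised as a real exception, as fail-on-first requires.

```python
def ffirst_load(memory, mapped_limit, addr, vl):
    """Load up to vl bytes; truncate VL at the first faulting address.

    A fault on the very first element is a genuine exception.
    """
    if addr >= mapped_limit:
        raise MemoryError("page fault on first element")
    vl = min(vl, mapped_limit - addr)   # fail-on-first truncation
    return memory[addr:addr + vl], vl

def strncpy_ffirst(memory, mapped_limit, src, n, maxvl=8):
    out = bytearray()
    while n > 0:
        data, vl = ffirst_load(memory, mapped_limit, src, min(n, maxvl))
        zero = data.find(0)             # first zero byte, or -1
        if zero >= 0:
            out += data[:zero + 1]      # up to and including the zero
            break
        out += data
        src += vl
        n -= vl
    return bytes(out)

# 13 mapped bytes: the second 8-element load is truncated to VL=5
memory = bytes(b"hello, world\0")
copied = strncpy_ffirst(memory, mapped_limit=len(memory), src=0, n=64)
second_batch_vl = ffirst_load(memory, len(memory), 8, 8)[1]
print(copied, second_batch_vl)
```

The caller asked for up to 64 bytes, yet no load ever touches the
unmapped region: the second batch simply comes back shorter, and the
zero-byte search runs only over bytes that were validly loaded.
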
# Data-dependent fail-first

This is a minor variant on the CR-based predicate-result mode.  Where
pred-result continues with independent element testing (any of which may
be parallelised), data-dependent fail-first *stops* at the first failure:

    if Rc=0: BO = inv<<2 | 0b00 # test CR.eq bit z/nz
    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i): continue # skip
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # now test CR, similar to branch
        if CRnew[BO[0:1]] != BO[2]:
            VL = i # truncate: only successes allowed
            break
        # test passed: store result (and CR?)
        if not RC1: iregs[RT+i] = result
        if RC1 or Rc=1: crregs[offs+i] = CRnew

This is particularly useful, again, for FP operations that might overflow,
where it is desirable to end the loop early, but also desirable to
complete at least those operations that were okay (passed the test)
without also having to slow down execution by adding extra instructions
that tested for the possibility of that failure, in advance of doing
the actual calculation.

The only minor downside here though is the change to VL, which in some
implementations may cause pipeline stalls.  This was one of the reasons
why CR-based pred-result analysis was added, because that at least is
entirely parallelisable.

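The contrast with pred-result is the early exit and the new VL.  A
minimal Python sketch, assuming no predicate mask and using a
hypothetical `passes` callback in place of the BO test on the element's
CR field:

```python
def ddffirst_op(op, src_a, src_b, dest, vl, passes):
    """Run op element by element, truncating VL at the first failure.

    `passes` stands in for the BO test against the element's CR field.
    """
    for i in range(vl):
        result = op(src_a[i], src_b[i])
        if not passes(result):
            return i                    # new VL: only successes allowed
        dest[i] = result
    return vl                           # no failure: VL is unchanged

# multiply whose results must fit in 8 bits (unsigned)
a, b = [10, 20, 30, 40], [5, 6, 9, 2]
dest = [0] * 4
new_vl = ddffirst_op(lambda x, y: x * y, a, b, dest, 4,
                     passes=lambda r: r < 256)
print(new_vl, dest)
```

Element 2 overflows, so VL becomes 2.  Element 3 would have passed, but
it is never attempted: that sequential dependency is what distinguishes
this mode from the parallelisable pred-result mode.
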
# Vertical-First Mode

This is a relatively new addition to SVP64 under development as of
July 2021.  Where Horizontal-First is the standard Cray-style for-loop,
Vertical-First typically executes just the **one** scalar element
in each Vectorised operation. That element is selected by srcstep
and dststep *neither of which is changed as a side-effect of execution*.
To create loops, a new instruction `svstep` must be called,
explicitly, with Rc=1.  This is illustrated in pseudocode here, with
a branch/loop:

```
loop:
  sv.addi r0.v, r8.v, 5 # GPR(0+dststep) = GPR(8+srcstep) + 5
  sv.addi r0.v, r8, 5   # GPR(0+dststep) = GPR(8        ) + 5
  sv.addi r0, r8.v, 5   # GPR(0        ) = GPR(8+srcstep) + 5
  svstep.               # srcstep++, dststep++, CR0.eq = srcstep==VL
  bne loop              # repeat until srcstep reaches VL
```

Three examples are illustrated of different types of Scalar-Vector
operations. Note that in its simplest form **only one** element is
executed per instruction **not** multiple elements per instruction.
(The more advanced version of Vertical-First mode may execute multiple
elements per instruction, however the number executed **must** remain
a fixed quantity.)

Now that such explicit loops can increment inexorably towards VL,
of course we now need a way to test if srcstep or dststep have reached
VL. This is achieved in one of two ways: [[sv/svstep]] has an Rc=1 mode
where CR0 will be updated if VL is reached. A standard v3.0B Branch
Conditional may rely on that.  Alternatively, the number of elements
may be transferred into CTR, as is standard practice in Power ISA.
Here, SVP64 [[sv/branches]] have a mode which allows CTR to be decremented
by the number of vertical elements executed.

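The stepping behaviour can be modelled in Python as a rough sketch.  The
`VFState` class and the `sv_addi` function are hypothetical stand-ins for
the hardware state and the prefixed instruction; the model assumes the
simplest form described above (one element per instruction) and does not
attempt to model what happens to the steps after VL is reached.

```python
class VFState:
    """Hypothetical model of the Vertical-First stepping state."""
    def __init__(self, vl):
        self.vl = vl
        self.srcstep = 0
        self.dststep = 0

    def svstep(self):
        """Advance both steps; return True (CR0.eq) once VL is reached."""
        self.srcstep += 1
        self.dststep += 1
        return self.srcstep == self.vl

def sv_addi(gpr, st, rt, ra, imm, rt_vec, ra_vec):
    """Execute exactly one element; the steps themselves are not changed."""
    src = ra + (st.srcstep if ra_vec else 0)
    dst = rt + (st.dststep if rt_vec else 0)
    gpr[dst] = gpr[src] + imm

gpr = list(range(16))            # GPR(n) initially holds n
st = VFState(vl=4)
while True:
    sv_addi(gpr, st, 0, 8, 5, rt_vec=True, ra_vec=True)   # r0.v, r8.v
    if st.svstep():              # CR0.eq set: srcstep == VL
        break
print(gpr[0:4])
```

Each pass through the loop body touches one element only; it is the
explicit `svstep` call, not the arithmetic instruction, that moves on to
the next element.
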
# Instruction format

Whilst this overview shows the internals, it does not go into detail
on the actual instruction format itself.  There are a couple of reasons
for this: firstly, it's under development, and secondly, it needs to be
proposed to the OpenPOWER Foundation ISA WG for consideration and review.

That said: draft pages for [[sv/setvl]] and [[sv/svp64]] are written up.
The `setvl` instruction is pretty much as would be expected from a
Cray style VL instruction: the only differences being that, firstly,
the MAXVL (Maximum Vector Length) has to be specified, because that
determines - precisely - how many of the *scalar* registers are to be
used for a given Vector.  Secondly: within the limit of MAXVL, VL is
required to be set to the requested value. By contrast, RVV systems
permit the hardware to set arbitrary values of VL.

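The consequence for software is that loop batching becomes fully
predictable.  A minimal Python sketch, assuming only the rule stated
above (VL equals the requested length, capped by MAXVL); the function
names are illustrative and say nothing about the draft `setvl` encoding:

```python
def setvl(requested, maxvl):
    """VL is exactly the requested length, capped only by MAXVL."""
    return min(requested, maxvl)

def strip_mine(n, maxvl):
    """Yield (start, vl) batches covering n elements."""
    start = 0
    while start < n:
        vl = setvl(n - start, maxvl)
        yield start, vl
        start += vl

batches = list(strip_mine(19, maxvl=8))
print(batches)
```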
The other key question is of course: what's the actual instruction format,
and what's in it? Bearing in mind that this requires OPF review, the
current draft is at the [[sv/svp64]] page, and includes space for all the
different modes, the predicates, element width overrides, SUBVL and the
register extensions, in 24 bits.  This just about fits into an OpenPOWER
v3.1B 64 bit Prefix by borrowing some of the Reserved Encoding space.
The v3.1B suffix - containing as it does a 32 bit OpenPOWER instruction -
aligns perfectly with SV.

Further reading is at the main [[SV|sv]] page.

# Conclusion

Starting from a scalar ISA - OpenPOWER v3.0B - it was shown above that,
with conceptual sub-loops, a Scalar ISA can be turned into a Vector one,
by embedding Scalar instructions - unmodified - into a Vector "context"
using "Prefixing".  With careful thought, this technique reaches 90%
parity with good Vector ISAs, increasing to 95% with the addition of a
mere handful of additional context-vectoriseable scalar instructions
([[sv/mv.x]] amongst them).

What is particularly cool about the SV concept is that custom extensions
and research need not be concerned about inventing new Vector instructions
and how to get them to interact with the Scalar ISA: they are effectively
one and the same.  Any new instruction added at the Scalar level is
inherently and automatically Vectorised, following some simple rules.
