1 # SV Overview
2
3 **SV is in DRAFT STATUS**. SV has not yet been submitted to the OpenPOWER Foundation ISA WG for review.
4
5 This document provides an overview and introduction as to why SV (a
6 [[!wikipedia Cray]]-style Vector augmentation to [[!wikipedia OpenPOWER]]) exists, and how it works.
7
8 Links:
9
10 * This page: [http://libre-soc.org/openpower/sv/overview](http://libre-soc.org/openpower/sv/overview)
11 * [[discussion]] and
12 [bugreport](https://bugs.libre-soc.org/show_bug.cgi?id=556)
13 feel free to add comments and questions.
14 * [[SV|sv]]
15 * [[sv/svp64]]
16
17 Contents:
18
19 [[!toc]]
20
21 # Introduction: SIMD and Cray Vectors
22
23 SIMD, the primary method for easy parallelism of the
24 past 30 years in Computer Architectures, is
25 [known to be harmful](https://www.sigarch.org/simd-instructions-considered-harmful/).
26 SIMD provides a seductive simplicity that is easy to implement in
27 hardware. With each doubling in width it promises increases in raw
28 performance without the complexity of either multi-issue or out-of-order
29 execution.
30
31 Unfortunately, even with predication added, SIMD only becomes more and
32 more problematic with each power of two SIMD width increase introduced
33 through an ISA revision. The opcode proliferation, at O(N^6), inexorably
34 spirals out of control in the ISA, detrimentally impacting the hardware,
35 the software, the compilers and the testing and compliance. Here are
36 the typical dimensions that result in such massive proliferation:
37
38 * Operation (add, mul)
39 * bitwidth (8, 16, 32, 64, 128)
40 * Conversion between bitwidths (FP16-FP32-64)
41 * Signed/unsigned
42 * HI/LO swizzle (Audio L/R channels)
43 - HI/LO selection on src 1
44 - selection on src 2
45 - selection on dest
46 - Example: AndesSTAR Audio DSP
47 * Saturation (Clamping at max range)
48
49 These typically are multiplied up to produce explicit opcodes numbering
50 in the thousands on, for example, the ARC Video/DSP cores.
51
52 Cray-style variable-length Vectors, on the other hand, result in
53 stunningly elegant and small loops, exceptionally high data throughput
54 per instruction (one *or more* orders of magnitude higher than SIMD), and
55 none of the alarmingly high setup and cleanup code. At the hardware level
56 the microarchitecture may execute anywhere from one element right the way
57 through to tens of thousands at a time, yet the executable remains exactly
58 the same and the ISA remains clear, true to the RISC paradigm, and clean.
59 Unlike in SIMD, powers-of-two limitations are not involved in the ISA
60 or in the assembly code.
61
62 SimpleV takes the Cray style Vector principle and applies it in the
63 abstract to a Scalar ISA, in the process allowing register file size
64 increases using "tagging" (similar to how x86 originally extended
65 registers from 32 to 64 bit).
66
67 ## SV
68
69 The fundamentals are:
70
71 * The Program Counter (PC) gains a "Sub Counter" context (Sub-PC)
72 * Vectorisation pauses the PC and runs a Sub-PC loop from 0 to VL-1
73 (where VL is Vector Length)
74 * The [[Program Order]] of "Sub-PC" instructions must be preserved,
75 just as is expected of instructions ordered by the PC.
76 * Some registers may be "tagged" as Vectors
77 * During the loop, "Vector"-tagged registers are incremented by
78 one with each iteration, executing the *same instruction*
79 but with *different registers*
80 * Once the loop is completed *only then* is the Program Counter
81 allowed to move to the next instruction.
82
83 Hardware (and simulator) implementors are free and clear to implement this
84 as literally a for-loop, sitting in between instruction decode and issue.
85 Higher performance systems may deploy SIMD backends, multi-issue and
86 out-of-order execution, although it is strongly recommended to add
87 predication capability directly into SIMD backend units.
88
89 In OpenPOWER ISA v3.0B pseudo-code form, an ADD operation, assuming both
90 source and destination have been "tagged" as Vectors, is simply:
91
92 for i = 0 to VL-1:
93 GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
94
95 At its heart, SimpleV really is this simple. On top of this fundamental
96 basis further refinements can be added which build up towards an extremely
97 powerful Vector augmentation system, with very little in the way of
98 additional opcodes required: simply external "context".
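
To make the loop concrete, here is a minimal Python sketch (purely
illustrative, not part of any specification) that models the GPR file as a
plain list and applies the pseudo-code above; `VL` and the decision that the
registers are "tagged" as Vectors are assumed to come from the external
context described later:

    # minimal Python model of the SV element loop (illustrative only)
    # gpr is a flat list; RT, RA, RB are starting register numbers
    def sv_add(gpr, RT, RA, RB, VL):
        for i in range(VL):
            gpr[RT + i] = (gpr[RA + i] + gpr[RB + i]) & 0xFFFF_FFFF_FFFF_FFFF

    gpr = list(range(32))          # r0..r31 hold 0..31 for demonstration
    sv_add(gpr, 8, 16, 24, VL=4)   # r8..r11 = r16..r19 + r24..r27
    print(gpr[8:12])               # [40, 42, 44, 46]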
99
100 x86 was originally only 80 instructions: prior to AVX512 over 1,300
101 additional instructions have been added, almost all of them SIMD.
102
103 RISC-V RVV as of version 0.9 is over 188 instructions (more than the
104 rest of RV64G combined: 80 for RV64G and 27 for C). Over 95% of that
105 functionality is added to OpenPOWER v3.0B by SimpleV augmentation,
106 with around 5 to 8 instructions.
107
108 Even in OpenPOWER v3.0B, the Scalar Integer ISA is around 150
109 instructions, with IEEE754 FP adding approximately 80 more. VSX, being
110 based on SIMD design principles, adds somewhere in the region of 600 more.
111 SimpleV again provides over 95% of VSX functionality, simply by augmenting
112 the *Scalar* OpenPOWER ISA, and in the process providing features such
113 as predication, which VSX is entirely missing.
114
115 AVX512, SVE2, VSX, RVV, all of these systems have to provide different
116 types of register files: Scalar and Vector is the minimum. AVX512
117 even provides a mini mask regfile, followed by explicit instructions
118 that handle operations on each of them *and map between all of them*.
119 SV, by contrast, uses the existing scalar regfiles (including CRs),
120 and because operations already exist within OpenPOWER to cover interactions
121 between the scalar regfiles (`mfcr`, `fcvt`) there is very little that
122 needs to be added.
123
124 In fairness to both VSX and RVV, there are things that are not provided
125 by SimpleV:
126
127 * 128 bit or above arithmetic and other operations
128 (VSX Rijndael and SHA primitives; VSX shuffle and bitpermute operations)
129 * register files above 128 entries
130 * Vector lengths over 64
131 * Unit-strided LD/ST and other comprehensive memory operations
132 (struct-based LD/ST from RVV for example)
133 * 32-bit instruction lengths. [[svp64]] had to be added as 64 bit.
134
135 These limitations, which stem inherently from the adaptation process of
136 starting from a Scalar ISA, are not insurmountable. Over time, they may
137 well be addressed in future revisions of SV.
138
139 The rest of this document builds on the above simple loop to add:
140
141 * Vector-Scalar, Scalar-Vector and Scalar-Scalar operation
142 (of all register files: Integer, FP *and CRs*)
143 * Traditional Vector operations (VSPLAT, VINSERT, VCOMPRESS etc)
144 * Predication masks (essential for parallel if/else constructs)
145 * 8, 16 and 32 bit integer operations, and both FP16 and BF16.
146 * Compacted operations into registers (normally only provided by SIMD)
147 * Fail-on-first (introduced in ARM SVE2)
148 * A new concept: Data-dependent fail-first
149 * Condition-Register based *post-result* predication (also new)
150 * A completely new concept: "Twin Predication"
151 * vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
152
153 All of this is *without modifying the OpenPOWER v3.0B ISA*, except to add
154 "wrapping context", similar to how v3.1B 64-bit Prefixes work.
155
156 # Adding Scalar / Vector
157
158 The first augmentation to the simple loop is to add the option for all
159 sources and destinations to each be either scalar or vector. As an FSM
160 this is where our "simple" loop gets its first complexity.
161
162 function op_add(RT, RA, RB) # add not VADD!
163 int id=0, irs1=0, irs2=0;
164 for i = 0 to VL-1:
165 ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
166 if (!RT.isvec) break;
167 if (RT.isvec) { id += 1; }
168 if (RA.isvec) { irs1 += 1; }
169 if (RB.isvec) { irs2 += 1; }
170
171 This could have been written out as eight separate cases: one each for
172 when each of `RA`, `RB` or `RT` is scalar or vector. Those eight cases,
173 when optimally combined, result in the pseudocode above.
174
175 With some walkthroughs it is clear that the loop exits immediately
176 after the first scalar destination result is written, and that when the
177 destination is a Vector the loop proceeds to fill up the register file,
178 sequentially, starting at `RT` and ending at `RT+VL-1`. The two source
179 registers will, independently, either remain pointing at `RB` or `RA`
180 respectively, or, if marked as Vectors, will march incrementally in
181 lockstep, producing element results along the way, as the destination
182 also progresses through elements.
183
184 In this way all the eight permutations of Scalar and Vector behaviour
185 are covered, although without predication the scalar-destination ones are
186 reduced in usefulness. It does however clearly illustrate the principle.
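
The same FSM can be written as a runnable Python sketch (illustrative only;
the "tags" are modelled here as a simple set of vectorised register numbers
rather than the real encoding), which makes the permutations easy to
experiment with:

    def sv_op_add(gpr, RT, RA, RB, VL, vec_tags):
        # vec_tags: set of register numbers that have been "tagged" as Vectors
        id = irs1 = irs2 = 0
        for i in range(VL):
            gpr[RT + id] = gpr[RA + irs1] + gpr[RB + irs2]
            if RT not in vec_tags: break    # scalar destination: one result, exit
            id += 1
            if RA in vec_tags: irs1 += 1
            if RB in vec_tags: irs2 += 1

    gpr = [0] * 32
    gpr[1] = 5                       # scalar RA
    gpr[8:12] = [10, 20, 30, 40]     # vector RB, 4 elements
    # scalar + vector -> vector: r16..r19 = r1 + r8..r11
    sv_op_add(gpr, 16, 1, 8, VL=4, vec_tags={16, 8})
    print(gpr[16:20])                # [15, 25, 35, 45]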
187
188 Note in particular: there is no separate Scalar add instruction and
189 separate Vector instruction and separate Scalar-Vector instruction, *and
190 there is no separate Vector register file*: it's all the same instruction,
191 on the standard register file, just with a loop. Scalar happens to set
192 that loop size to one.
193
194 The important insight from the above is that, strictly speaking, Simple-V
195 is not really a Vectorisation scheme at all: it is more of a hardware
196 ISA "Compression scheme", allowing as it does for what would normally
197 require multiple sequential instructions to be replaced with just one.
198 This is where the rule that Program Order must be preserved in Sub-PC
199 execution derives from. However in other ways, which will emerge below,
200 the "tagging" concept presents an opportunity to include features
201 not commonly found outside of Vector ISAs, and in that regard it is
202 definitely a class of Vectorisation.
203
204 ## Register "tagging"
205
206 As an aside: in [[sv/svp64]] the information which allows SV to both extend
207 the range beyond r0-r31 and to determine whether a register is scalar or vector
208 is encoded in two to three bits, depending on the instruction.
209
210 The reason for using so few bits is that there are up to *four*
211 registers to mark in this way (`fma`, `isel`), which starts to be of
212 concern when there are only 24 available bits to specify the entire SV
213 Vectorisation Context. In fact, for a small subset of instructions it
214 is just not possible to tag every single register. Under these rare
215 circumstances a tag has to be shared between two registers.
216
217 Below is the pseudocode which expresses the relationship which is usually
218 applied to *every* register:
219
220 if extra3_mode:
221 spec = EXTRA3 # bit 2 s/v, 0-1 extends range
222 else:
223 spec = EXTRA2 << 1 # same as EXTRA3, shifted
224 if spec[2]: # vector
225 RA.isvec = True
226 return (RA << 2) | spec[0:1]
227 else: # scalar
228 RA.isvec = False
229 return (spec[0:1] << 5) | RA
230
231 Here we can see that the scalar registers are extended in the top bits,
232 whilst vectors are shifted up by 2 bits, and then extended in the LSBs.
233 Condition Registers have a slightly different scheme, along the same
234 principle, which takes into account the fact that each CR may be bit-level
235 addressed by Condition Register operations.
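
As an illustration only (the bit numbering and exact field packing here are
assumptions made for readability; [[sv/svp64]] is the authoritative
definition), the pseudocode above can be expressed as a small Python helper
returning the extended register number and the scalar/vector flag:

    def decode_reg(RA, extra, extra3_mode):
        # RA: the 5-bit register field from the scalar instruction
        # extra: the 2- or 3-bit EXTRA field from the SV prefix
        spec = extra if extra3_mode else (extra << 1)
        if spec & 0b100:                          # s/v bit set: vector
            return (RA << 2) | (spec & 0b11), True
        else:                                     # scalar: extend range in the MSBs
            return ((spec & 0b11) << 5) | RA, False

    print(decode_reg(3, 0b101, True))   # (13, True): vector starting at r13
    print(decode_reg(3, 0b001, True))   # (35, False): scalar r35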
236
237 Readers familiar with OpenPOWER will know of Rc=1 operations that create
238 an associated post-result "test", placing this test into an implicit
239 Condition Register. The original researchers who created the POWER ISA
240 chose CR0 for Integer, and CR1 for Floating Point. These *also become
241 Vectorised* - implicitly - if the associated destination register is
242 also Vectorised. This allows for some very interesting savings on
243 instruction count due to the very same CR Vectors being predication masks.
244
245 # Adding single predication
246
247 The next step is to add a single predicate mask. This is where it gets
248 interesting. Predicate masks are a bitvector, each bit specifying, in
249 order, whether the element operation is to be skipped ("masked out")
250 or allowed. If no predicate is supplied, the mask is set to all 1s,
251 which behaves exactly as if there were no predication at all.
252
253 function op_add(RT, RA, RB) # add not VADD!
254 int id=0, irs1=0, irs2=0;
255 predval = get_pred_val(FALSE, RT);
256 for i = 0 to VL-1:
257 if (predval & 1<<i) # predication bit test
258 ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
259 if (!RT.isvec) break;
260 if (RT.isvec) { id += 1; }
261 if (RA.isvec) { irs1 += 1; }
262 if (RB.isvec) { irs2 += 1; }
263
264 The key modification is to skip the creation and storage of the result
265 if the relevant predicate mask bit is clear, but *not the progression
266 through the registers*.
267
268 A particularly interesting case is if the destination is scalar, and the
269 first few bits of the predicate are zero. The loop proceeds to increment
270 the Vector *source* registers until the first nonzero predicate bit is
271 found, whereupon a single result is computed, and *then* the loop exits.
272 This therefore uses the predicate to perform Vector source indexing.
273 This case was not possible without the predicate mask.
274
275 If all three registers are marked as Vector then the "traditional"
276 predicated Vector behaviour is provided. Yet, just as before, all other
277 options are still provided, right the way back to the pure-scalar case,
278 as if this were a straight OpenPOWER v3.0B non-augmented instruction.
279
280 Single Predication therefore provides several modes traditionally seen
281 in Vector ISAs:
282
283 * VINSERT: the predicate may be set as a single bit, the sources are
284 scalar and the destination a vector.
285 * VSPLAT (result broadcasting) is provided by making the sources scalar
286 and the destination a vector, and having no predicate set or having
287 multiple bits set.
288 * VSELECT is provided by setting up (at least one of) the sources as a
289 vector, using a single bit in the predicate, and the destination as
290 a scalar.
291
292 All of this capability and coverage without even adding one single actual
293 Vector opcode, let alone 180, 600 or 1,300!
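
Reusing the earlier toy model with a predicate added (a Python sketch only,
not normative), the VSPLAT and VINSERT cases look like this:

    def sv_op_add(gpr, RT, RA, RB, VL, vec_tags, pred):
        id = irs1 = irs2 = 0
        for i in range(VL):
            if pred & (1 << i):                  # predication bit test
                gpr[RT + id] = gpr[RA + irs1] + gpr[RB + irs2]
                if RT not in vec_tags: break
            if RT in vec_tags: id += 1           # registers progress regardless
            if RA in vec_tags: irs1 += 1
            if RB in vec_tags: irs2 += 1

    gpr = [0] * 32
    gpr[1] = 100                                 # scalar source
    gpr[2] = 0                                   # scalar zero, giving "mv"-like use
    # VSPLAT: scalar sources, vector destination, all-ones predicate
    sv_op_add(gpr, 8, 1, 2, VL=4, vec_tags={8}, pred=0b1111)
    print(gpr[8:12])                             # [100, 100, 100, 100]
    # VINSERT: scalar sources, vector destination, a single predicate bit
    gpr[8:12] = [0, 0, 0, 0]
    sv_op_add(gpr, 8, 1, 2, VL=4, vec_tags={8}, pred=0b0100)
    print(gpr[8:12])                             # [0, 0, 100, 0]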
294
295 # Predicate "zeroing" mode
296
297 Sometimes with predication it is ok to leave the masked-out element
298 alone (not modify the result); at other times it is better to zero the
299 masked-out elements. Zeroing can be combined with bit-wise ORing to build
300 up vectors from multiple predicate patterns: the same combining with
301 nonzeroing involves more mv operations and predicate mask operations.
302 Our pseudocode therefore ends up as follows, to take the enhancement
303 into account:
304
305 function op_add(RT, RA, RB) # add not VADD!
306 int id=0, irs1=0, irs2=0;
307 predval = get_pred_val(FALSE, RT);
308 for i = 0 to VL-1:
309 if (predval & 1<<i) # predication bit test
310 ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
311 if (!RT.isvec) break;
312 else if zeroing: # predicate failed
313 ireg[RT+id] = 0 # set element to zero
314 if (RT.isvec) { id += 1; }
315 if (RA.isvec) { irs1 += 1; }
316 if (RB.isvec) { irs2 += 1; }
317
318 Many Vector systems either have zeroing or they have nonzeroing; they
319 do not have both. This is because they usually have separate Vector
320 register files. However SV sits on top of standard register files and
321 consequently there are advantages to both, so both are provided.
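
The claim above about merging predicate patterns can be seen directly in a
short Python sketch (illustrative only): two zeroing-predicated operations
write into two destinations, and a plain OR then merges them without any
extra predicate-mask manipulation:

    def sv_add_zeroing(gpr, RT, RA, RB, VL, pred):
        # all three registers assumed to be Vectors; zeroing mode active
        for i in range(VL):
            if pred & (1 << i):
                gpr[RT + i] = gpr[RA + i] + gpr[RB + i]
            else:
                gpr[RT + i] = 0          # masked-out element is zeroed

    gpr = [0] * 64
    gpr[8:12]  = [1, 2, 3, 4]
    gpr[12:16] = [10, 20, 30, 40]
    sv_add_zeroing(gpr, 16, 8, 12, VL=4, pred=0b0011)   # elements 0-1 only
    sv_add_zeroing(gpr, 20, 8, 12, VL=4, pred=0b1100)   # elements 2-3 only
    merged = [gpr[16 + i] | gpr[20 + i] for i in range(4)]
    print(merged)                                        # [11, 22, 33, 44]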
322
323 # Element Width overrides <a name="elwidths"></a>
324
325 All good Vector ISAs have the usual bitwidths for operations: 8/16/32/64
326 bit integer operations, and IEEE754 FP32 and 64. Often also included
327 is FP16 and more recently BF16. The *really* good Vector ISAs have
328 variable-width vectors right down to bit level, and as high as 1024 bit
329 arithmetic per element, as well as IEEE754 FP128.
330
331 SV has an "override" system that *changes* the bitwidth of operations
332 that were intended by the original scalar ISA designers to have (for
333 example) 64 bit operations (only). The override widths are 8, 16 and
334 32 for integer, and FP16 and FP32 for IEEE754 (with BF16 to be added in
335 the future).
336
337 This presents a particularly intriguing conundrum, given that the OpenPOWER
338 Scalar ISA was never designed with (for example) 8 bit operations in mind,
339 let alone Vectors of 8 bit elements.
340
341 The solution comes in terms of rethinking the definition of a Register
342 File. The typical regfile may be considered to be a multi-ported SRAM
343 block, 64 bits wide and usually 32 entries deep, to give 32 64 bit
344 registers. In c this would be:
345
346 typedef uint64_t reg_t;
347 reg_t int_regfile[32]; // standard scalar 32x 64bit
348
349 Conceptually, to get our variable element width vectors,
350 we may think of the regfile as instead being the following c-based data
351 structure, where all types uint16_t etc. are in little-endian order:
352
353 #pragma(packed)
354 typedef union {
355 uint8_t actual_bytes[8];
356 uint8_t b[0]; // array of type uint8_t
357 uint16_t s[0]; // array of LE ordered uint16_t
358 uint32_t i[0];
359 uint64_t l[0]; // default OpenPOWER ISA uses this
360 } reg_t;
361
362 reg_t int_regfile[128]; // SV extends to 128 regs
363
364 This means that Vector elements start from the location specified by a
365 64 bit "register", but that from that location onwards the elements
366 *overlap subsequent registers*.
367
368 Here is another way to view the same concept, bearing in mind that
369 an LE memory order is assumed:
370
371 uint8_t reg_sram[8*128];
372 uint8_t *actual_bytes = &reg_sram[RA*8];
373 if elwidth == 8:
374 uint8_t *b = (uint8_t*)actual_bytes;
375 b[idx] = result;
376 if elwidth == 16:
377 uint16_t *s = (uint16_t*)actual_bytes;
378 s[idx] = result;
379 if elwidth == 32:
380 uint32_t *i = (uint32_t*)actual_bytes;
381 i[idx] = result;
382 if elwidth == default:
383 uint64_t *l = (uint64_t*)actual_bytes;
384 l[idx] = result;
385
386 Starting with all zeros, setting `actual_bytes[3]` in any given `reg_t`
387 to 0x01 would mean that:
388
389 * b[0..2] = 0x00 and b[3] = 0x01
390 * s[0] = 0x0000 and s[1] = 0x0100
391 * i[0] = 0x01000000
392 * l[0] = 0x0000000001000000
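
These relationships can be cross-checked with Python's `struct` module
(a sketch only, using explicit little-endian formats to mirror the union
above):

    import struct

    actual_bytes = bytearray(8)
    actual_bytes[3] = 0x01
    b = list(actual_bytes)                          # uint8_t view
    s = struct.unpack("<4H", actual_bytes)          # uint16_t view (LE)
    i = struct.unpack("<2I", actual_bytes)          # uint32_t view (LE)
    l = struct.unpack("<Q",  actual_bytes)          # uint64_t view (LE)
    print(b)                                        # [0, 0, 0, 1, 0, 0, 0, 0]
    print(["0x%04x" % x for x in s])   # ['0x0000', '0x0100', '0x0000', '0x0000']
    print(["0x%08x" % x for x in i])   # ['0x01000000', '0x00000000']
    print(["0x%016x" % x for x in l])  # ['0x0000000001000000']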
393
394 In tabular form, starting an elwidth=8 loop from r0 and extending for
395 16 elements would begin at r0 and extend over the entirety of r1:
396
397 | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
398 | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
399 r0 | b[0] | b[1] | b[2] | b[3] | b[4] | b[5] | b[6] | b[7] |
400 r1 | b[8] | b[9] | b[10] | b[11] | b[12] | b[13] | b[14] | b[15] |
401
402 Starting an elwidth=16 loop from r0 and extending for
403 7 elements would begin at r0 and extend partly over r1. Note that
404 b0 indicates the low byte (lowest 8 bits) of each 16-bit word, and
405 b1 represents the top byte:
406
407 | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
408 | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
409 r0 | s[0].b0 b1 | s[1].b0 b1 | s[2].b0 b1 | s[3].b0 b1 |
410 r1 | s[4].b0 b1 | s[5].b0 b1 | s[6].b0 b1 | unmodified |
411
412 Likewise for elwidth=32, and a loop extending for 3 elements. b0 through
413 b3 represent the bytes (numbered lowest for LSB and highest for MSB) within
414 each element word:
415
416 | byte0 | byte1 | byte2 | byte3 | byte4 | byte5 | byte6 | byte7 |
417 | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
418 r0 | w[0].b0 b1 b2 b3 | w[1].b0 b1 b2 b3 |
419 r1 | w[2].b0 b1 b2 b3 | unmodified unmodified |
420
421 64-bit (default) elements access the full registers. In each case the
422 register number (`RT`, `RA`) indicates the *starting* point for the storage
423 and retrieval of the elements.
424
425 Our simple loop, instead of accessing the array of regfile entries
426 with a computed index `iregs[RT+i]`, would access the appropriate element
427 of the appropriate width, such as `iregs[RT].s[i]` in order to access
428 16 bit elements starting from RT. Thus we have a series of overlapping
429 conceptual arrays that each start at what is traditionally thought of as
430 "a register". It then helps if we have a couple of routines:
431
432 get_polymorphed_reg(reg, bitwidth, offset):
433 reg_t res = 0;
434 if (!reg.isvec): # scalar
435 offset = 0
436 if bitwidth == 8:
437 res.b = int_regfile[reg].b[offset]
438 elif bitwidth == 16:
439 res.s = int_regfile[reg].s[offset]
440 elif bitwidth == 32:
441 res.i = int_regfile[reg].i[offset]
442 elif bitwidth == default: # 64
443 res.l = int_regfile[reg].l[offset]
444 return res
445
446 set_polymorphed_reg(reg, bitwidth, offset, val):
447 if (!reg.isvec): # scalar
448 offset = 0
449 if bitwidth == 8:
450 int_regfile[reg].b[offset] = val
451 elif bitwidth == 16:
452 int_regfile[reg].s[offset] = val
453 elif bitwidth == 32:
454 int_regfile[reg].i[offset] = val
455 elif bitwidth == default: # 64
456 int_regfile[reg].l[offset] = val
457
458 These basically provide a convenient parameterised way to access the
459 register file, at an arbitrary vector element offset and an arbitrary
460 element width. Our first simple loop thus becomes:
461
462 for i = 0 to VL-1:
463 src1 = get_polymorphed_reg(RA, srcwid, i)
464 src2 = get_polymorphed_reg(RB, srcwid, i)
465 result = src1 + src2 # actual add here
466 set_polymorphed_reg(rd, destwid, i, result)
467
468 With this loop, if elwidth=16 and VL=3 the first 48 bits of the target
469 register will contain three 16 bit addition results, and the upper 16
470 bits will be *unaltered*.
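
For readers who prefer something executable, here is a Python sketch (not
the reference implementation) of the two routines above, using a flat
bytearray as the register SRAM; it reproduces the elwidth=16, VL=3 example
and shows the top 16 bits of the destination being left alone:

    import struct

    FMT = {8: "<B", 16: "<H", 32: "<I", 64: "<Q"}
    regfile = bytearray(8 * 128)            # 128 registers, byte-addressable

    def get_polymorphed_reg(reg, bitwidth, offset, isvec=True):
        if not isvec: offset = 0            # scalar: always element 0
        addr = reg * 8 + offset * (bitwidth // 8)
        return struct.unpack_from(FMT[bitwidth], regfile, addr)[0]

    def set_polymorphed_reg(reg, bitwidth, offset, val, isvec=True):
        if not isvec: offset = 0
        addr = reg * 8 + offset * (bitwidth // 8)
        struct.pack_into(FMT[bitwidth], regfile, addr, val)

    set_polymorphed_reg(1, 64, 0, 0xFFFF_0000_0000_0000)  # make top 16 bits visible
    for i in range(3):                       # three 16-bit source elements
        set_polymorphed_reg(2, 16, i, 0x1000 + i)
        set_polymorphed_reg(3, 16, i, 0x0234)
    for i in range(3):                       # elwidth=16, VL=3 addition
        result = get_polymorphed_reg(2, 16, i) + get_polymorphed_reg(3, 16, i)
        set_polymorphed_reg(1, 16, i, result & 0xFFFF)
    print(hex(get_polymorphed_reg(1, 64, 0)))   # 0xffff123612351234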
471
472 Note that things such as zero/sign-extension (and predication) have
473 been left out to illustrate the elwidth concept. Also note that it turns
474 out to be important to perform the operation internally at effectively an *infinite* bitwidth, such that any truncation, rounding errors or
475 other artefacts may all be ironed out. This matters particularly
476 when applying Saturation for Audio DSP workloads, and for multiply and IEEE754 FP rounding. "Infinite" is conceptual only: in reality, the application of the different truncations and width-extensions sets a fixed, deterministic, practical limit on the internal precision needed, on a per-operation basis.
477
478 Other than that, element width overrides, which can be applied to *either*
479 source or destination or both, are pretty straightforward, conceptually.
480 The details, for hardware engineers, involve byte-level write-enable
481 lines, which is exactly what is used on SRAMs anyway. Compiler writers
482 have to alter Register Allocation Tables to byte-level granularity.
483
484 One critical thing to note: upper parts of the underlying 64 bit
485 register are *not zero'd out* by a write involving a non-aligned Vector
486 Length. An 8 bit operation with VL=7 will *not* overwrite the 8th byte
487 of the destination. The only situation where a full overwrite occurs
488 is with "default" (64 bit) behaviour. For this reason it is extremely important
489 to consider the register file as a byte-level store, not a 64-bit-level store.
490
491 ## Why a LE regfile?
492
493 The concept of having a regfile where the byte ordering of the underlying
494 SRAM matters seems utter nonsense. Surely, a hardware implementation gets to
495 choose the order, right? It's only memory where LE/BE matters, right? The
496 bytes come in, all registers are 64 bit and it's just wiring, right?
497
498 Ordinarily this would be 100% correct, in both a scalar ISA and in a Cray
499 style Vector one. The assumption in that last question was, however, "all
500 registers are 64 bit". SV allows SIMD-style packing of vectors into the
501 64 bit registers, where one instruction and the next may interpret that
502 very same register as containing elements of completely different widths.
503
504 Consequently it becomes critically important to decide a byte-order.
505 That decision was - arbitrarily - LE mode. Actually it wasn't arbitrary
506 at all: implementing BE interpretations of CRs and LD/ST in LibreSOC,
507 based on a terse spec that provides insufficient clarity, assumes
508 significant working knowledge of OpenPOWER, and has arbitrary insertions
509 of 7-index here and 3-bit index there, was such hell that the decision
510 to pick LE was extremely easy.
511
512 Without such a decision, if two words are packed as elements into a 64
513 bit register, what does this mean? Should they be inverted so that the
514 lower indexed element goes into the HI or the LO word? Should the 8
515 bytes of each register be inverted? Should the bytes in each element
516 be inverted? Should the element indexing loop order be broken into
517 discontiguous chunks such as 32107654 rather than 01234567, and if so
518 at what granularity of discontinuity? These are all equally valid and
519 legitimate interpretations of what constitutes "BE" and they all cause
520 merry mayhem.
521
522 The decision was therefore made: the c typedef union is the canonical
523 definition, and its members are defined as being in LE order. From there,
524 implementations may choose whatever internal HDL wire order they like
525 as long as the results produced conform to the elwidth pseudocode.
526
527 *Note: it turns out that both x86 SIMD and NEON SIMD follow this convention, namely that both are implicitly LE, even though their ISA Manuals may not explicitly spell this out*
528
529 * <https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Application-Level-Memory-Model/Endian-support/Endianness-in-Advanced-SIMD?lang=en>
530 * <https://stackoverflow.com/questions/24045102/how-does-endianness-work-with-simd-registers>
531 * <https://llvm.org/docs/BigEndianNEON.html>
532
533
534 ## Source and Destination overrides
535
536 A minor fly in the ointment: what happens if the source and destination
537 are overridden to different widths? For example, FP16 arithmetic is
538 not accurate enough and may introduce rounding errors when up-converted
539 to FP32 output. The rule is therefore set:
540
541 The operation MUST take place effectively at infinite precision:
542 actual precision determined by the operation and the operand widths
543
544 In pseudocode this is:
545
546 for i = 0 to VL-1:
547 src1 = get_polymorphed_reg(RA, srcwid, i)
548 src2 = get_polymorphed_reg(RB, srcwid, i)
549 opwidth = max(srcwid, destwid) # usually
550 result = op_add(src1, src2, opwidth) # at max width
551 set_polymorphed_reg(rd, destwid, i, result)
552
553 In reality the source and destination widths determine the actual required
554 precision in a given ALU. The reason for specifying "effectively" infinite precision
555 is illustrated by, for example, Saturated-multiply, where if the internal precision were insufficient it would not be possible to correctly determine whether the maximum clip range had been exceeded.
556
557 Thus it turns out that under some conditions the combination of the
558 extension of the source registers followed by truncation of the result
559 gets rid of bits that never mattered, and the operation might as well have
560 taken place at the narrower width, saving resources in the process.
561 An example is Logical OR, where the source extension would place
562 zeros in the upper bits and the truncation of the result simply throws
563 those zeros away.
564
565 Counterexamples include the previously mentioned FP16 arithmetic,
566 where for operations such as division of large numbers by very small
567 ones it should be clear that internal accuracy will play a major role
568 in influencing the result. Hence the rule that the calculation takes
569 place at the maximum bitwidth, and truncation follows afterwards.
570
571 ## Signed arithmetic
572
573 What happens when the operation involves signed arithmetic? Here the
574 implementor has to use common sense, and make sure behaviour is accurately
575 documented. If the result of the unmodified operation is sign-extended
576 because one of the inputs is signed, then the input source operands must
577 be first read at their overridden bitwidth and *then* sign-extended:
578
579 for i = 0 to VL-1:
580 src1 = get_polymorphed_reg(RA, srcwid, i)
581 src2 = get_polymorphed_reg(RB, srcwid, i)
582 opwidth = max(srcwid, destwid)
583 # sources known to be less than result width
584 src1 = sign_extend(src1, srcwid, opwidth)
585 src2 = sign_extend(src2, srcwid, opwidth)
586 result = op_signed(src1, src2, opwidth) # at max width
587 set_polymorphed_reg(rd, destwid, i, result)
588
589 The key here is that the cues are taken from the underlying operation.
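
A sign_extend helper is trivial but worth spelling out, since getting it
wrong silently corrupts results (Python sketch, widths given in bits):

    def sign_extend(val, srcwid, opwidth):
        # interpret the low srcwid bits of val as signed,
        # return the equivalent opwidth-bit value
        val &= (1 << srcwid) - 1
        if val & (1 << (srcwid - 1)):
            val -= (1 << srcwid)
        return val & ((1 << opwidth) - 1)

    print(hex(sign_extend(0xFE, 8, 16)))   # 0xfffe  (-2 preserved at 16 bit)
    print(hex(sign_extend(0x7E, 8, 16)))   # 0x7e    (+126 unchanged)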
590
591 ## Saturation
592
593 Audio DSPs need to be able to clip sound when the "volume" is adjusted:
594 if it is too loud and the signal wraps, distortion occurs. The
595 solution is to clip (saturate) the audio and allow this to be detected.
596 In practical terms this is a post-result analysis; however, it needs to
597 take place at the largest bitwidth, i.e. before a result is element-width
598 truncated. Only then can the arithmetic saturation condition be detected:
599
600 for i = 0 to VL-1:
601 src1 = get_polymorphed_reg(RA, srcwid, i)
602 src2 = get_polymorphed_reg(RB, srcwid, i)
603 opwidth = max(srcwid, destwid)
604 # unsigned add
605 result = op_add(src1, src2, opwidth) # at max width
606 # now saturate (unsigned)
607 sat = min(result, (1<<destwid)-1) # clamp to unsigned max
608 set_polymorphed_reg(rd, destwid, i, sat)
609 # set sat overflow
610 if Rc=1:
611 CR.ov = (sat != result)
612
613 So the actual computation took place at the larger width, but was
614 post-analysed as an unsigned operation. If however "signed" saturation
615 is requested then the actual arithmetic operation has to be carefully
616 analysed to see what that actually means.
617
618 In terms of FP arithmetic, which by definition has a sign bit (so
619 always takes place as a signed operation anyway), the request to saturate
620 to signed min/max is pretty clear. However for integer arithmetic such
621 as shift (plain shift, not arithmetic shift), or logical operations
622 such as XOR, which were never designed with the assumption that their
623 inputs be considered as signed numbers, common sense has to kick in,
624 and follow what CR0 does.
625
626 CR0 for Logical operations still applies: the test is still applied to
627 produce CR.eq, CR.lt and CR.gt analysis. Following this lead we may
628 do the same thing: although the inputs to an OR or XOR can
629 in no way be thought of as "signed" we may at least consider the result
630 to be signed, and thus apply min/max range detection of -128 to +127 when
631 truncating down to 8 bit, for example.
632
633 for i = 0 to VL-1:
634 src1 = get_polymorphed_reg(RA, srcwid, i)
635 src2 = get_polymorphed_reg(RB, srcwid, i)
636 opwidth = max(srcwid, destwid)
637 # logical op, signed has no meaning
638 result = op_xor(src1, src2, opwidth)
639 # now saturate (signed)
640 sat = min(result, (1<<(destwid-1))-1) # clamp to signed maximum
641 sat = max(sat, -(1<<(destwid-1))) # clamp to signed minimum
642 set_polymorphed_reg(rd, destwid, i, sat)
643
644 Overall here the rule is: apply common sense then document the behaviour
645 really clearly, for each and every operation.
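
As a cross-check of the two fragments above, here is a Python sketch of the
unsigned and signed clamping steps in isolation (illustrative only; the real
specification also deals with setting the overflow/CR indication):

    def saturate_unsigned(result, destwid):
        maxval = (1 << destwid) - 1
        return min(max(result, 0), maxval)

    def saturate_signed(result, destwid):
        maxval = (1 << (destwid - 1)) - 1
        minval = -(1 << (destwid - 1))
        return min(max(result, minval), maxval)

    print(saturate_unsigned(0x1F0, 8))   # 255: clipped to uint8 maximum
    print(saturate_signed(200, 8))       # 127: clipped to int8 maximum
    print(saturate_signed(-200, 8))      # -128: clipped to int8 minimum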
646
647 # Quick recap so far
648
649 The above functionality pretty much covers around 85% of Vector ISA needs.
650 Predication is provided so that parallel if/then/else constructs can
651 be performed: critical given that sequential if/then statements and
652 branches simply do not translate successfully to Vector workloads.
653 VSPLAT capability is provided which is approximately 20% of all GPU
654 workload operations. Also covered, with elwidth overriding, are the
655 smaller arithmetic operations that caused ISAs developed from the
656 late 80s onwards to get themselves into a tiz when adding "Multimedia"
657 acceleration aka "SIMD" instructions.
658
659 Experienced Vector ISA readers will however have noted that VCOMPRESS
660 and VEXPAND are missing, as is Vector "reduce" (mapreduce) capability
661 and VGATHER and VSCATTER. Compress and Expand are covered by Twin
662 Predication; yet to be covered are fail-on-first, CR-based result
663 predication, and Subvectors and Swizzle.
664
665 ## SUBVL <a name="subvl"></a>
666
667 Adding support for SUBVL is a matter of adding an extra inner
668 for-loop, where register src and dest are still incremented inside the
669 inner part. Predication is still taken from the VL index; however, it
670 is applied to the whole subvector:
671
672 function op_add(RT, RA, RB) # add not VADD!
673 int id=0, irs1=0, irs2=0;
674 predval = get_pred_val(FALSE, RT);
675 for i = 0 to VL-1:
676 if (predval & 1<<i) # predication uses intregs
677 for (s = 0; s < SUBVL; s++)
678 sd = id*SUBVL + s
679 srs1 = irs1*SUBVL + s
680 srs2 = irs2*SUBVL + s
681 ireg[RT+sd] <= ireg[RA+srs1] + ireg[RB+srs2];
682 if (!RT.isvec) break;
683 if (RT.isvec) { id += 1; }
684 if (RA.isvec) { irs1 += 1; }
685 if (RB.isvec) { irs2 += 1; }
686
687 The primary reason for this is that Shader Compilers treat vec2/3/4 as
688 "single units". Recognising this in hardware is just sensible.
689
690 # Swizzle <a name="swizzle"></a>
691
692 Swizzle is particularly important for 3D work. It allows in-place
693 reordering of XYZW, ARGB etc. and access of sub-portions of the same in
694 arbitrary order *without* requiring time-consuming scalar mv instructions
695 (scalar due to the convoluted offsets).
696
697 Swizzling does not just do permutations: it allows arbitrary selection and multiple copying of
698 vec2/3/4 elements, such as XXXZ as the source operand, which takes
699 3 copies of the vec4 first element (vec4[0]), placing them at positions vec4[0],
700 vec4[1] and vec4[2], whilst the "Z" element (vec4[2]) is copied into vec4[3].
701
702 With somewhere between 10% and 30% of operations in 3D Shaders involving
703 swizzle this is a huge saving, and it reduces the pressure on register files
704 that would otherwise come from using significant numbers of mv operations
705 to get vector elements to "line up".
706
707 In SV, given the percentage of operations that also involve initialisation
708 of subvector elements to 0.0 or 1.0, the decision was made to include
709 those constants as well:
710
711 swizzle = get_swizzle_immed() # 12 bits
712 for (s = 0; s < SUBVL; s++)
713 remap = (swizzle >> 3*s) & 0b111
714 if remap < 4:
715 sm = id*SUBVL + remap
716 ireg[rd+s] <= ireg[RA+sm]
717 elif remap == 4:
718 ireg[rd+s] <= 0.0
719 elif remap == 5:
720 ireg[rd+s] <= 1.0
721
722 Note that a value of 6 (and 7) will leave the target subvector element
723 untouched. This is equivalent to a predicate mask which is built-in,
724 in immediate form, into the [[sv/mv.swizzle]] operation. mv.swizzle is
725 rare in that it is one of the few instructions that need to be added which
726 are never going to be part of a Scalar ISA. Even in High Performance
727 Compute workloads it is unusual: it is only because SV is targeted at
728 3D and Video that it is being considered.
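
To make the XXXZ example above concrete, here is a Python sketch of the
selection (the 3-bits-per-lane packing follows the pseudocode; treat it as
an illustration, not as the final immediate encoding):

    def apply_swizzle(src_vec4, swizzle_bits):
        # src_vec4: four elements; swizzle_bits: 4 lanes x 3 bits, lane 0 in the LSBs
        out = list(src_vec4)             # remap values 6/7 leave the lane untouched
        for s in range(4):
            remap = (swizzle_bits >> (3 * s)) & 0b111
            if remap < 4:
                out[s] = src_vec4[remap]
            elif remap == 4:
                out[s] = 0.0
            elif remap == 5:
                out[s] = 1.0
        return out

    XYZW = [1.5, 2.5, 3.5, 4.5]
    # "XXXZ": lanes 0, 1 and 2 select X (index 0), lane 3 selects Z (index 2)
    XXXZ = (0 << 0) | (0 << 3) | (0 << 6) | (2 << 9)
    print(apply_swizzle(XYZW, XXXZ))     # [1.5, 1.5, 1.5, 3.5]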
729
730 Some 3D GPU ISAs also allow for two-operand subvector swizzles. These are
731 sufficiently unusual, and the immediate opcode space required so large
732 (12 bits per vec4 source),
733 that in SV the tradeoff was decided in favour of adding only mv.swizzle.
734
735 # Twin Predication
736
737 Twin Predication is cool. Essentially it is a back-to-back
738 VCOMPRESS-VEXPAND (a multiple sequentially ordered VINSERT). The compress
739 part is covered by the source predicate and the expand part by the
740 destination predicate. Of course, if either of those is all 1s then
741 the operation degenerates *to* VCOMPRESS or VEXPAND, respectively.
742
743 function op(RT, RS):
744 ps = get_pred_val(FALSE, RS); # predication on src
745 pd = get_pred_val(FALSE, RT); # ... AND on dest
746 for (int i = 0, int j = 0; i < VL && j < VL;):
747 if (RS.isvec) while (!(ps & 1<<i)) i++;
748 if (RT.isvec) while (!(pd & 1<<j)) j++;
749 reg[RT+j] = SCALAR_OPERATION_ON(reg[RS+i])
750 if (RS.isvec) i++;
751 if (RT.isvec) j++; else break
752
753 Here's the interesting part: given the fact that SV is a "context"
754 extension, the above pattern can be applied to a lot more than just MV,
755 which is normally only what VCOMPRESS and VEXPAND do in traditional
756 Vector ISAs: move registers. Twin Predication can be applied to `extsw`
757 or `fcvt`, LD/ST operations and even `rlwinm` and other operations
758 taking a single source and immediate(s) such as `addi`. All of these
759 are termed single-source, single-destination.
760
761 LDST Address-generation, or AGEN, is a special case of single source,
762 because elwidth overriding does not make sense to apply to the computation
763 of the 64 bit address itself, but it *does* make sense to apply elwidth
764 overrides to the data being accessed *at* that memory address.
765
766 It also turns out that by using a single bit set in the source or
767 destination, *all* the sequential ordered standard patterns of Vector
768 ISAs are provided: VSPLAT, VSELECT, VINSERT, VCOMPRESS, VEXPAND.
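
A Python sketch of a Twin-Predicated `mv` between two Vectors (illustrative
only; bounds checks are added here that the terse pseudocode above leaves
implicit) shows the degenerate VCOMPRESS and VEXPAND cases directly:

    def twin_pred_mv(reg, RT, RS, VL, ps, pd):
        i = j = 0
        while i < VL and j < VL:
            while i < VL and not (ps & (1 << i)): i += 1   # skip masked-out src
            while j < VL and not (pd & (1 << j)): j += 1   # skip masked-out dest
            if i >= VL or j >= VL: break
            reg[RT + j] = reg[RS + i]    # the "scalar operation" is a plain mv
            i += 1
            j += 1

    reg = [0] * 32
    reg[8:12] = [11, 22, 33, 44]
    # VCOMPRESS: sparse source predicate, all-ones destination predicate
    twin_pred_mv(reg, 16, 8, VL=4, ps=0b1010, pd=0b1111)
    print(reg[16:20])                    # [22, 44, 0, 0]
    # VEXPAND: all-ones source predicate, sparse destination predicate
    twin_pred_mv(reg, 20, 8, VL=4, ps=0b1111, pd=0b0101)
    print(reg[20:24])                    # [11, 0, 22, 0]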
769
770 The only one missing from the list here, because it is non-sequential,
771 is VGATHER (and VSCATTER): moving registers by specifying a vector of
772 register indices (`regs[rd] = regs[regs[rs]]` in a loop). This one is
773 tricky because it typically does not exist in standard scalar ISAs.
774 If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
775 VGATHER/VSCATTER.
776
777 # CR predicate result analysis
778
779 OpenPOWER has Condition Registers. These store an analysis of the result
780 of an operation to test it for being greater than, less than, or equal to zero.
781 What if a test could be done, similar to branch BO testing, which hooked
782 into the predication system?
783
784 for i in range(VL):
785 # predication test, skip all masked out elements.
786 if predicate_masked_out(i): continue # skip
787 result = op(iregs[RA+i], iregs[RB+i])
788 CRnew = analyse(result) # calculates eq/lt/gt
789 # Rc=1 always stores the CR
790 if RC1 or Rc=1: crregs[offs+i] = CRnew
791 if RC1: continue # RC1 mode skips result store
792 # now test CR, similar to branch
793 if CRnew[BO[0:1]] == BO[2]:
794 # result optionally stored but CR always is
795 iregs[RT+i] = result
796
797 Note that whilst the Vector of CRs is always written to the CR regfile,
798 only those result elements that pass the BO test get written to the
799 integer regfile (when RC1 mode is not set). In RC1 mode the CR is always
800 stored, but the result never is. This effectively turns every arithmetic
801 operation into a type of `cmp` instruction.
802
803 Here, for example, if FP overflow occurred and the CR testing was carried
804 out for that, all valid results would be stored but invalid ones would
805 not; in addition the Vector of CRs would contain the indicators of
806 which ones failed. With the invalid results simply not being written,
807 this could save resources (save on register file writes).
808
809 Also, because the predicate mask is effectively
810 ANDed with the post-result analysis as a secondary type of predication,
811 savings are expected in some types of operations where
812 the post-result analysis, if not included in SV, would need a second
813 predicate calculation followed by a predicate mask AND operation.
814
815 Note, hilariously, that Vectorised Condition Register Operations (crand,
816 cror) may also have post-result analysis applied to them. With Vectors
817 of CRs being utilised *for* predication, possibilities for compact and
818 elegant code begin to emerge from this innocuous-looking addition to SV.
819
820 # Exception-based Fail-on-first
821
822 One of the major issues with Vectorised LD/ST operations is when a
823 batch of LDs crosses a page-fault boundary. With considerable resources
824 being taken up with in-flight data, a large Vector LD being cancelled
825 or unable to roll back is either a detriment to performance or can cause
826 data corruption.
827
828 What if, then, rather than cancelling an entire Vector LD because a later
829 element would cause a page fault, the Vector were instead truncated to the
830 last successful element?
831
832 This is called "fail-on-first". Here is strncpy, illustrated from RVV:
833
834 strncpy:
835 c.mv a3, a0 # Copy dst
836 loop:
837 setvli x0, a2, vint8 # Vectors of bytes.
838 vlbff.v v1, (a1) # Get src bytes
839 vseq.vi v0, v1, 0 # Flag zero bytes
840 vmfirst a4, v0 # Zero found?
841 vmsif.v v0, v0 # Set mask up to and including zero byte.
842 vsb.v v1, (a3), v0.t # Write out bytes
843 c.bgez a4, exit # Done
844 csrr t1, vl # Get number of bytes fetched
845 c.add a1, a1, t1 # Bump src pointer
846 c.sub a2, a2, t1 # Decrement count.
847 c.add a3, a3, t1 # Bump dst pointer
848 c.bnez a2, loop # Anymore?
849 exit:
850 c.ret
851
852 Vector Length VL is truncated inherently at the first page faulting
853 byte-level LD. Otherwise, with more powerful hardware the number of
854 elements LOADed from memory could be dozens to hundreds or greater
855 (memory bandwidth permitting).
856
857 With VL truncated the analysis looking for the zero byte and the
858 subsequent STORE (a straight ST, not a ffirst ST) can proceed, safe in the
859 knowledge that every byte loaded in the Vector is valid. Implementors are
860 even permitted to "adapt" VL, truncating it early so that, for example,
861 subsequent iterations of loops will have LD/STs on aligned boundaries.
862
863 SIMD strncpy hand-written assembly routines are, to be blunt about it,
864 a total nightmare. 240 instructions is not uncommon, and the worst
865 thing about them is that they are unable to cope with detection of a
866 page fault condition.
867
868 Note: see <https://bugs.libre-soc.org/show_bug.cgi?id=561>
869
870 # Data-dependent fail-first
871
872 This is a minor variant on the CR-based predicate-result mode. Where
873 pred-result continues with independent element testing (any of which may
874 be parallelised), data-dependent fail-first *stops* at the first failure:
875
876 if Rc=0: BO = inv<<2 | 0b00 # test CR.eq bit z/nz
877 for i in range(VL):
878 # predication test, skip all masked out elements.
879 if predicate_masked_out(i): continue # skip
880 result = op(iregs[RA+i], iregs[RB+i])
881 CRnew = analyse(result) # calculates eq/lt/gt
882 # now test CR, similar to branch
883 if CRnew[BO[0:1]] != BO[2]:
884 VL = i # truncate: only successes allowed
885 break
886 # test passed: store result (and CR?)
887 if not RC1: iregs[RT+i] = result
888 if RC1 or Rc=1: crregs[offs+i] = CRnew
889
890 This is particularly useful, again, for FP operations that might overflow,
891 where it is desirable to end the loop early, but also desirable to
892 complete at least those operations that were okay (passed the test)
893 without also having to slow down execution by adding extra instructions
894 that tested for the possibility of that failure, in advance of doing
895 the actual calculation.
896
897 The only minor downside here though is the change to VL, which in some
898 implementations may cause pipeline stalls. This was one of the reasons
899 why CR-based pred-result analysis was added, because that at least is
900 entirely paralleliseable.
901
902 # Instruction format
903
904 Whilst this overview shows the internals, it does not go into detail
905 on the actual instruction format itself. There are a couple of reasons
906 for this: firstly, it's under development, and secondly, it needs to be
907 proposed to the OpenPOWER Foundation ISA WG for consideration and review.
908
909 That said: draft pages for [[sv/setvl]] and [[sv/svp64]] are written up.
910 The `setvl` instruction is pretty much as would be expected from a
911 Cray style VL instruction: the only differences being that, firstly,
912 the MAXVL (Maximum Vector Length) has to be specified, because that
913 determines - precisely - how many of the *scalar* registers are to be
914 used for a given Vector. Secondly: within the limit of MAXVL, VL is
915 required to be set to the requested value. By contrast, RVV systems
916 permit the hardware to set arbitrary values of VL.
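
In other words (a conceptual Python sketch only; the real instruction also
writes the result to a register and carries additional fields):

    def setvl(requested_vl, MAXVL):
        # SV: VL is deterministic - the requested value, capped only by MAXVL.
        # (RVV, by contrast, allows hardware to choose any VL <= the request.)
        assert MAXVL >= 1
        return min(requested_vl, MAXVL)

    print(setvl(10, 8))   # 8
    print(setvl(4, 8))    # 4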
917
918 The other key question is of course: what's the actual instruction format,
919 and what's in it? Bearing in mind that this requires OPF review, the
920 current draft is at the [[sv/svp64]] page, and includes space for all the
921 different modes, the predicates, element width overrides, SUBVL and the
922 register extensions, in 24 bits. This just about fits into an OpenPOWER
923 v3.1B 64 bit Prefix by borrowing some of the Reserved Encoding space.
924 The v3.1B suffix - containing as it does a 32 bit OpenPOWER instruction -
925 aligns perfectly with SV.
926
927 Further reading is at the main [[SV|sv]] page.
928
929 # Conclusion
930
931 Starting from a scalar ISA - OpenPOWER v3.0B - it was shown above that,
932 with conceptual sub-loops, a Scalar ISA can be turned into a Vector one,
933 by embedding Scalar instructions - unmodified - into a Vector "context"
934 using "Prefixing". With careful thought, this technique reaches 90%
935 parity with good Vector ISAs, increasing to 95% with the addition of a
936 mere handful of additional context-vectoriseable scalar instructions
937 ([[sv/mv.x]] amongst them).
938
939 What is particularly cool about the SV concept is that custom extensions
940 and research need not be concerned about inventing new Vector instructions
941 and how to get them to interact with the Scalar ISA: they are effectively
942 one and the same. Any new instruction added at the Scalar level is
943 inherently and automatically Vectorised, following some simple rules.
944