# SV Overview

This document provides a crash-course overview of why SV exists and how it works.

[SIMD is known to be harmful](https://www.sigarch.org/simd-instructions-considered-harmful/):
a seductive simplicity that is easy to implement in hardware. Even with predication added, SIMD only becomes more and more problematic with each power-of-two width increase introduced through an ISA revision.

Cray-style variable-length Vectors, on the other hand, result in stunningly elegant and small loops, with none of the alarmingly high setup and cleanup code, where at the hardware level the microarchitecture may execute from one element right the way through to tens of thousands at a time, yet the executable remains exactly the same. Unlike SIMD, power-of-two limitations are involved in neither the hardware nor the assembly code.

SimpleV takes the Cray-style Vector principle and applies it to a Scalar ISA, in the process allowing register file size increases using "tagging" (similar to how x86 originally extended registers from 32 to 64 bit).
The fundamentals are:

* The Program Counter gains a "Sub Counter" context.
* Vectorisation pauses the PC and runs a loop from 0 to VL-1
  (where VL is Vector Length).
* During the loop the instruction at the PC is executed *multiple*
  times.
* Some registers may be "tagged" as Vectors.
* During the loop, "Vector"-tagged registers are incremented by
  one with each iteration.
* Once the loop is completed, *only then* is the Program Counter
  allowed to move to the next instruction.

In OpenPOWER ISA v3.0B pseudo-code form, an ADD operation, assuming both sources and the destination have been "tagged" as Vectors, is simply:

    for i = 0 to VL-1:
        GPR(RT+i) = GPR(RA+i) + GPR(RB+i)

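For example (register numbers and VL purely illustrative): with VL set to 4, RT=8, RA=16 and RB=24, that single ADD behaves as four element-level adds:

    GPR(8)  = GPR(16) + GPR(24)
    GPR(9)  = GPR(17) + GPR(25)
    GPR(10) = GPR(18) + GPR(26)
    GPR(11) = GPR(19) + GPR(27)
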
At its heart, SimpleV really is this simple. On top of this fundamental basis, further refinements can be added which build up towards an extremely powerful Vector augmentation system, with very little in the way of additional opcodes required: simply external "context".

RISC-V RVV as of version 0.9 is over 180 instructions (more than the rest of RV64G combined). Over 95% of that functionality is added to OpenPOWER v3.0B, by SimpleV augmentation, with around 5 to 8 instructions.

Even in OpenPOWER v3.0B, the Scalar Integer ISA is around 150 instructions, with IEEE754 FP adding approximately 80 more. VSX, being based on SIMD design principles, adds somewhere in the region of 600 more. SimpleV again provides over 95% of VSX functionality, simply by augmenting the *Scalar* OpenPOWER ISA, and in the process providing features such as predication, which VSX is entirely missing.

The rest of this document builds on the above simple loop to add:

* Vector-Scalar, Scalar-Vector and Scalar-Scalar operation
* Traditional Vector operations (VSPLAT, VINSERT, VCOMPRESS etc.)
* Predication masks (essential for parallel if/else constructs)
* 8, 16 and 32 bit integer operations, and both FP16 and BF16
* Fail-on-first (introduced in ARM SVE2)
* A new concept: Data-dependent fail-first
* Condition-Register based *post-result* predication (also new)
* A completely new concept: "Twin Predication"

All of this is *without modifying the OpenPOWER v3.0B ISA*, except to add "wrapping context", similar to how v3.1B 64-bit Prefixes work.

In fairness to both VSX and RVV, there are things that are not provided by SimpleV:

* 128 bit or above arithmetic and other operations
  (VSX Rijndael and SHA primitives; VSX shuffle operations)
* register files above 128 entries
* Vector lengths over 64
* Unit-strided LD/ST and other comprehensive memory operations
  (struct-based LD/ST from RVV, for example)
* 32-bit instruction lengths. [[svp64]] had to be added as 64 bit.

These are not insurmountable limitations: over time, they may well be added in future revisions of SV.

# Adding Scalar / Vector

The first augmentation to the simple loop is to add the option for each source and destination to be either scalar or vector. As an FSM this is where our "simple" loop gains its first complexity.

    function op_add(rd, rs1, rs2) # add not VADD!
        int id=0, irs1=0, irs2=0;
        for i = 0 to VL-1:
            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
            if (!rd.isvec) break;
            if (rd.isvec) { id += 1; }
            if (rs1.isvec) { irs1 += 1; }
            if (rs2.isvec) { irs2 += 1; }

With some walkthroughs it becomes clear that the loop exits immediately after the first scalar destination result is written, and that when the destination is a Vector the loop proceeds to fill up the register file, sequentially, starting at `rd` and ending at `rd+VL-1`. The two source registers will, independently, either remain pointing at `rs1` or `rs2` respectively, or, if marked as Vectors, will march incrementally in lockstep, producing element results along the way, as the destination also progresses through elements.

In this way all eight permutations of Scalar and Vector behaviour are covered, although without predication the scalar-destination ones are reduced in usefulness. It does however clearly illustrate the principle.

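As one concrete permutation (register numbers purely illustrative): leaving `rs1` and `rs2` as Scalars while tagging `rd` as a Vector repeatedly writes the same scalar sum into successive destination registers, i.e. a VSPLAT-style broadcast:

    # rd=8 (Vector), rs1=16 (Scalar), rs2=24 (Scalar), VL=4
    ireg[8]  <= ireg[16] + ireg[24]
    ireg[9]  <= ireg[16] + ireg[24]
    ireg[10] <= ireg[16] + ireg[24]
    ireg[11] <= ireg[16] + ireg[24]
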
Note in particular: there is no separate Scalar add instruction, no separate Vector instruction, and no separate Scalar-Vector instruction, *and there is no separate Vector register file*: it is all the same instruction, on the standard register file, just with a loop. Scalar operation simply happens to set that loop size to one.

# Adding single predication

The next step is to add a single predicate mask. This is where it gets interesting. A predicate mask is a bitvector, each bit specifying, in order, whether the element operation is to be skipped ("masked out") or allowed. If no predicate is given, the mask is implicitly set to all 1s, so every element operation proceeds.

    function op_add(rd, rs1, rs2) # add not VADD!
        int id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        for i = 0 to VL-1:
            if (predval & 1<<i) # predication bit test
                ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                if (!rd.isvec) break;
            if (rd.isvec) { id += 1; }
            if (rs1.isvec) { irs1 += 1; }
            if (rs2.isvec) { irs2 += 1; }

The key modification is to skip the creation and storage of the result if the relevant predicate mask bit is clear, but *not the progression through the registers*.

A particularly interesting case is if the destination is scalar and the first few bits of the predicate are zero. The loop proceeds to increment the Vector *source* registers until the first nonzero predicate bit is found, whereupon a single result is computed, and *then* the loop exits. This therefore uses the predicate to perform Vector source indexing. This case was not possible without the predicate mask.

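A minimal sketch of that case (predicate value and register numbers purely illustrative): with a Scalar destination, Vector sources, VL=4 and a predicate of 0b0100, elements 0 and 1 are skipped while the source indices still advance, so the one result that is written comes from element 2:

    # rd=8 (Scalar), rs1=16 (Vector), rs2=24 (Vector), predval=0b0100
    ireg[8] <= ireg[18] + ireg[26]  # element 2 selected; loop then exits
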
If all three registers are marked as Vector then the "traditional" predicated Vector behaviour is provided. Yet, just as before, all other options are still provided, right the way back to the pure-scalar case, as if this were a straight OpenPOWER v3.0B non-augmented instruction.

# Predicate "zeroing" mode

Sometimes with predication it is OK to leave the masked-out element alone (not modify the result); sometimes it is better to zero the masked-out elements. Zeroing can be combined with bit-wise ORing to build up vectors from multiple predicate patterns. Our pseudocode therefore ends up as follows, to take that into account:

    function op_add(rd, rs1, rs2) # add not VADD!
        int id=0, irs1=0, irs2=0;
        predval = get_pred_val(FALSE, rd);
        for i = 0 to VL-1:
            if (predval & 1<<i) # predication bit test
                ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
                if (!rd.isvec) break;
            else if zeroing:
                ireg[rd+id] = 0
            if (rd.isvec) { id += 1; }
            if (rs1.isvec) { irs1 += 1; }
            if (rs2.isvec) { irs2 += 1; }

Many Vector systems provide either zeroing or non-zeroing, not both, because they usually have separate Vector register files. However SV sits on top of the standard register files and consequently there are advantages to both, so both are provided.

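A minimal sketch of the ORing idea mentioned above (the predicate values, and the names result_a, result_b and merged, are purely illustrative): two operations run with complementary predicates and zeroing enabled leave zeros in each other's masked-out element positions, so a plain bitwise OR merges the two partial result vectors:

    # result_a produced with predicate 0b0011, zeroing: elements 2,3 are zero
    # result_b produced with predicate 0b1100, zeroing: elements 0,1 are zero
    for i = 0 to VL-1:
        merged[i] = result_a[i] | result_b[i]
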
# Element Width overrides

All good Vector ISAs have the usual bitwidths for operations: 8/16/32/64 bit integer operations, and IEEE754 FP32 and FP64. Often also included are FP16 and, more recently, BF16. The *really* good Vector ISAs have variable-width vectors right down to bit level, and as high as 1024 bit arithmetic, as well as IEEE754 FP128.

SV has an "override" system that *changes* the bitwidth of operations that were intended by the original scalar ISA designers to be (for example) 64 bit operations. The override widths are 8, 16 and 32 for integer, and FP16 and FP32 for IEEE754 (with BF16 to be added in the future).

This presents a particularly intriguing conundrum given that the OpenPOWER Scalar ISA was never designed with, for example, 8 bit operations in mind, let alone Vectors of 8 bit elements.

The solution comes in terms of rethinking the definition of a Register File. The typical regfile may be considered to be a multi-ported SRAM block, 64 bits wide and usually 32 entries deep, to give 32 64-bit registers. Conceptually, to get our variable element width vectors, we may think of the regfile as being the following C-based data structure:

    typedef union {
        uint8_t  actual_bytes[8];
        uint8_t  b[0]; // array of type uint8_t
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0]; // default OpenPOWER ISA uses this
    } reg_t;

    reg_t int_regfile[128]; // SV extends to 128 regs

Then, our simple loop, instead of accessing the array of 64 bit registers with a computed index, would access the appropriate element of the appropriate type. Thus we have a series of overlapping conceptual arrays that each start at what is traditionally thought of as "a register". It then helps if we have a couple of routines:

    get_polymorphed_reg(reg, bitwidth, offset):
        reg_t res = 0;
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == default: # 64
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!reg.isvec): # scalar
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == default: # 64
            int_regfile[reg].l[offset] = val

These basically provide a convenient parameterised way to access the register file, at an arbitrary vector element offset and an arbitrary element width. Our first simple loop thus becomes:

    for i = 0 to VL-1:
        src1 = get_polymorphed_reg(rs1, srcwid, i)
        src2 = get_polymorphed_reg(rs2, srcwid, i)
        result = src1 + src2 # actual add here
        set_polymorphed_reg(rd, destwid, i, result)

Note that things such as zero/sign-extension (and predication) have been left out to illustrate the elwidth concept. Also note that it turns out to be important to perform the operation at the maximum bitwidth - `max(srcwid, destwid)` - such that any truncation, rounding errors or other artefacts may all be ironed out. This is particularly relevant when applying Saturation for Audio DSP workloads.

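A minimal sketch of why the maximum width matters, assuming a hypothetical unsigned saturating add with a 16 bit source width and an 8 bit destination override: performing the arithmetic at 16 bits and clamping on the way down to 8 bits preserves the saturation that an 8-bit-only add would have lost to wrap-around:

    # hypothetical unsigned saturating add: srcwid=16, destwid=8
    src1 = get_polymorphed_reg(rs1, 16, i)  # e.g. 200
    src2 = get_polymorphed_reg(rs2, 16, i)  # e.g. 100
    result = src1 + src2                    # 300, computed at 16 bits
    result = min(result, 255)               # clamp to the 8 bit maximum
    set_polymorphed_reg(rd, 8, i, result)   # stores 255, not 300 mod 256 = 44
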
Other than that, element width overrides, which can be applied to *either* source or destination or both, are pretty straightforward, conceptually. The details, for hardware engineers, involve byte-level write-enable lines, which is exactly what is used on SRAMs anyway. Compiler writers have to alter Register Allocation Tables to byte-level granularity.

One critical thing to note: upper parts of the underlying 64 bit register are *not zeroed out* by a write involving a non-aligned Vector Length. An 8 bit operation with VL=7 will *not* overwrite the 8th byte of the destination. It is extremely important to consider the register file as a byte-level store, not a 64-bit-level store.
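
A minimal sketch of that effect, using the conceptual routines above (the `results` array is purely illustrative): an 8 bit operation with VL=7 writes only bytes 0 to 6 of the underlying destination storage, leaving byte 7 holding whatever value it had before:

    # 8 bit elements, VL=7: bytes 0..6 of the destination are written
    for i = 0 to 6:
        set_polymorphed_reg(rd, 8, i, results[i])
    # int_regfile[rd].b[7] is left untouched (not zeroed)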