# SIMD / Simple-V Extension Proposal

This proposal exists so as to be able to satisfy several disparate
requirements: power-conscious, area-conscious, and performance-conscious
designs all pull an ISA and its implementation in different conflicting
directions, as do the specific intended uses for any given implementation.
Also, the existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are:

* Clearly independent in their origins (AndeStar V3 and Cray respectively)
* Both contain partial duplication of pre-existing RISC-V instructions
(an undesirable characteristic)
* Both have independent and disparate methods for introducing parallelism
at the instruction level.
* Both require that their respective parallelism paradigm be implemented
along-side their respective functionality *or not at all*.
* Both independently have methods for introducing parallelism that
could, if separated, benefit
*other areas of RISC-V, not just DSP or Floating-point respectively*.

Therefore it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
implementors with the option to choose exactly where they wish to offer
performance improvements and where they wish to optimise for power
and/or area. If that can be offered even on a per-operation basis that
would provide even more flexibility.

**TODO**: reword this to better suit this document:

Having looked at both P and V as they stand, they're _both_ very much
"separate engines" that, despite both their respective merits and
extremely powerful features, don't really cleanly fit into the RV design
ethos (or the flexible extensibility) and, as such, are both in danger
of not being widely adopted. I'm inclined towards recommending:

* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
* splitting out the polymorphism, esoteric data types (GF, complex
numbers) and unusual operations of V to create a single-issue "Esoteric
Floating-Point" extension
* splitting out the loop-aspects, vector aspects and data-width aspects
of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
apply across *all* Extensions, whether those be DSP, M, Base, V, P -
everything.

# Analysis and discussion of Vector vs SIMD

There are four (arguably five) combined areas between the two proposals
that help with parallelism without over-burdening the ISA with a huge
proliferation of instructions:

* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops
* Masks / tagging (selecting/preventing certain indexed elements from execution)

The pros and cons of each are discussed and analysed below.

## Fixed vs variable parallelism length

David Patterson and Andrew Waterman's analysis of SIMD and Vector
ISAs comes out clearly in favour of (effectively) variable-length SIMD.
As SIMD is of fixed width (typically 4, 8 or, in extreme cases, 16 or 32
simultaneous operations), the setup, teardown and corner-cases of SIMD
are extremely burdensome except for applications whose requirements
*specifically* match the *precise and exact* depth of the SIMD engine.
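
The difference is easiest to see in code. The following C sketch is purely
illustrative: SIMD_WIDTH, MVL and set_vector_length() are assumptions,
standing in for whatever a real SIMD engine or a vector-length instruction
(such as V's setvl) would actually provide. The fixed-width version needs a
separate tail-cleanup loop and breaks if the engine width changes, whereas
the variable-length version handles any n with a single loop:

    #include <stddef.h>
    #include <stdint.h>

    #define SIMD_WIDTH 4   /* assumption: a 4-wide fixed SIMD engine */
    #define MVL 8          /* assumption: hardware maximum vector length */

    /* Stand-in for a "set vector length" instruction: the hardware grants
     * up to MVL elements per iteration.  Purely a software model. */
    static size_t set_vector_length(size_t requested)
    {
        return requested < MVL ? requested : MVL;
    }

    /* Fixed-width SIMD: remainder ("tail") elements need a separate scalar
     * cleanup loop, and changing the SIMD width means rewriting the code. */
    void add_simd(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
    {
        size_t i = 0;
        for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)
            for (size_t lane = 0; lane < SIMD_WIDTH; lane++) /* one SIMD op */
                dst[i + lane] = a[i + lane] + b[i + lane];
        for (; i < n; i++)                                   /* tail cleanup */
            dst[i] = a[i] + b[i];
    }

    /* Variable-length vector: the same loop handles any n, with no setup,
     * teardown or corner cases, because the hardware reports how many
     * elements it will process each time around. */
    void add_vector(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
    {
        for (size_t i = 0; i < n; ) {
            size_t vl = set_vector_length(n - i);
            for (size_t e = 0; e < vl; e++)   /* conceptually one vector op */
                dst[i + e] = a[i + e] + b[i + e];
            i += vl;
        }
    }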

Thus, SIMD, no matter what width is chosen, is never going to be acceptable
for general-purpose computation, and in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.

That basically leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
reducing the number of instructions required for any given task, and thus
reducing power consumption for the same.

## Implicit vs fixed instruction bit-width

SIMD again has a severe disadvantage here, over Vector: a huge proliferation
of specialist instructions that target 8-bit, 16-bit, 32-bit and 64-bit data,
and which then need operations *for each and between each* of those widths.
It gets very messy, very quickly.

The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
involving that register are *implicitly* of that particular bit-width until
otherwise changed or reset.
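
A minimal sketch of the mechanism, assuming (hypothetically) a decoder that
keeps a per-register element-width tag set by an earlier instruction. The
names op_set_width and op_add and the tag layout are illustrative only, not
the V-extension's actual CSRs or encoding: the point is that the width is
set once, and subsequent operations on the register inherit it instead of
needing per-width opcode variants.

    #include <stdint.h>

    /* Illustrative per-register tag: element width in bits, set once by a
     * (hypothetical) "set width" instruction and then consulted implicitly
     * by every subsequent operation that names the register. */
    typedef struct {
        uint8_t elwidth;   /* 8, 16, 32 or 64; 0 means "not yet set" */
    } reg_tag_t;

    static reg_tag_t reg_tags[32];   /* one tag per architectural register */

    /* Hypothetical "set width" instruction: tags rd with an element width. */
    void op_set_width(unsigned rd, uint8_t bits)
    {
        reg_tags[rd].elwidth = bits;
    }

    /* A subsequent ADD needs no 8/16/32/64-bit opcode variants: the decoder
     * routes it to the right-sized ALU based on the existing tag. */
    uint64_t op_add(unsigned rd, uint64_t a, uint64_t b)
    {
        uint8_t  w    = reg_tags[rd].elwidth ? reg_tags[rd].elwidth : 64;
        uint64_t mask = (w >= 64) ? ~0ULL : ((1ULL << w) - 1);
        return (a + b) & mask;   /* result truncated to the tagged width */
    }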

This has some extremely useful properties, without being particularly
burdensome to implementations, given that instruction decode already has
to direct the operation to a correctly-sized ALU engine, anyway.

Not least: in places where an ISA was previously constrained (for
whatever reason, including limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.

## Implicit and explicit type-conversion

The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
deal with over-population of instructions, such that type-casting between
integers (and floating-point) of various sizes is automatically inferred
due to "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer", an implicit
type-conversion will take place *without* requiring that type-conversion
to be explicitly done with its own separate instruction.

However, implicit type-conversion is not only quite burdensome to
implement (explosion of inferred type-to-type conversions) but also is
never really going to be complete. It gets even worse when bit-widths
also have to be taken into consideration.
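
The scale of that explosion is easy to estimate. Assuming, purely for
illustration, three base types (signed integer, unsigned integer and
floating-point) at four element widths each, the number of distinct
source-to-destination conversions that would have to be inferred and
implemented is already well over a hundred:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative assumption: 3 base types (int, uint, float) at
         * 4 element widths each gives 12 distinct (type, width) pairs. */
        const int types  = 3;
        const int widths = 4;
        const int kinds  = types * widths;

        /* Every ordered pair of distinct kinds is a conversion the decoder
         * would have to infer and the datapath would have to implement. */
        printf("implicit conversions to infer: %d\n", kinds * (kinds - 1));
        /* prints: implicit conversions to infer: 132 */
        return 0;
    }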

Overall, type-conversion is generally best left to explicit
type-conversion instructions, or, in specific well-defined use-cases,
made part of an actual instruction (DSP or FP).

## Zero-overhead loops vs explicit loops

The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
contains an extremely interesting feature: zero-overhead loops. This
proposal would basically allow an inner loop of instructions to be
repeated a fixed number of times, without explicit branch instructions.

Its specific advantage over explicit loops is that the pipeline in a
DSP can potentially be kept completely full *even in an in-order
implementation*. Normally, it requires a superscalar architecture and
out-of-order execution capabilities to "pre-process" instructions in order
to keep ALU pipelines 100% occupied.
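
A rough model of where the overhead goes (the instruction counts below are
assumptions chosen for illustration, not measurements of any real DSP): an
explicit loop re-issues counter-update and branch instructions on every
iteration, whereas a zero-overhead loop issues one setup instruction and
then only the loop body, leaving the fetch stage to handle the repetition.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative comparison for an inner loop of "body" useful
         * instructions repeated n times. */
        const long n    = 1000;  /* iterations */
        const long body = 4;     /* useful instructions per iteration */

        /* Explicit loop: body + counter decrement + compare/branch
         * on every iteration. */
        long explicit_total = n * (body + 2);

        /* Zero-overhead loop: one setup instruction, then only the body;
         * the repetition is handled by the fetch stage, not by branches. */
        long zol_total = 1 + n * body;

        printf("explicit loop:      %ld instructions issued\n", explicit_total);
        printf("zero-overhead loop: %ld instructions issued\n", zol_total);
        /* 6000 vs 4001: the difference is entirely loop-control overhead,
         * and every removed branch is also one less chance of a pipeline
         * bubble in an in-order core. */
        return 0;
    }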

This very simple proposal offers a way to increase pipeline activity in the
one key area which really matters: the inner loop.

## Mask and Tagging

*TODO: research masks as they can be superb and extremely powerful.
If B-Extension is implemented and provides Bit-Gather-Scatter it
becomes really cool and easy to switch out certain indexed values
from an array of data, but actually BGS **on its own** might be
sufficient. Bottom line, this is complex, and needs a proper analysis.
The other sections are pretty straightforward.*

## Conclusions

The above sections outlined the four (arguably five) different areas in
which parallel instruction execution has closely and loosely inter-related
implications for the ISA and for implementors. The pluses and minuses
came out as follows:

* Fixed vs variable parallelism: **variable**
* Implicit (indirect) vs fixed (integral) instruction bit-width: **indirect**
* Implicit vs explicit type-conversion: **explicit**
* Implicit vs explicit inner loops: **implicit**
* Tag or no-tag: **TODO**

In particular: variable-length vectors came out on top because of the
high setup, teardown and corner-case costs associated with the fixed width
of SIMD. Implicit bit-width helps to extend the ISA to escape from
former limitations and restrictions (in a backwards-compatible fashion),
and implicit (zero-overhead) loops provide a means to keep pipelines
potentially 100% occupied *without* requiring a superscalar or out-of-order
architecture.

Constructing a SIMD/Simple-Vector proposal based around even just these four
(five?) requirements would therefore seem to be a logical thing to do.

# Instruction Format

**TODO** *basically borrow from both P and V, which should be quite simple
to do, with the exception of Tag/no-tag, which needs a bit more
thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
gather-scatter, and, if implemented, could actually be a really useful
way to span 8-bit up to 64-bit groups of data, where BGS as it stands
and as described by Clifford operates on **bits** of up to 16 in width.
Lots to look at and investigate!*

# References

* SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>
* Link to first proposal <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/GuukrSjgBH8>
* Recommendation by Jacob Bachmeyer to make zero-overhead loop an
"implicit program-counter" <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/vYVi95gF2Mo/SHz6a4_lAgAJ>
* Re-continuing P-Extension proposal <https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/IkLkQn3HvXQ/SEMyC9IlAgAJ>
* First Draft P-SIMD (DSP) proposal <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo>
* B-Extension discussion <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/zi_7B15kj6s>