# SIMD / Simple-V Extension Proposal

This proposal exists to satisfy several disparate requirements at once:
those of area-conscious designs and those of performance-conscious designs.
In addition, the existing P (SIMD) proposal and the V (Vector) proposal,
whilst each extremely powerful in its own right and clearly desirable,
also have the following characteristics:

* They are clearly independent in their origins (AndeStar v3 and Cray
  respectively).
* Both duplicate pre-existing RISC-V instructions.
* Each has its own, disparate method for introducing parallelism
  at the instruction level.
* Both require that their respective parallelism paradigm be implemented
  alongside their respective functionality *or not at all*.
* Each independently has a method for introducing parallelism that could,
  if separated, benefit *other areas of RISC-V, not just DSP and Floating-point*.

It therefore makes a great deal of sense to have a means
of introducing instruction parallelism in a flexible way that gives
implementors the option to choose exactly where they wish to offer
performance improvements and where they wish to optimise for power
and area. Offering that choice on a per-operation basis
would provide even more flexibility.

# Analysis and discussion of Vector vs SIMD

Between them, the two proposals cover five areas that help with
parallelism without over-burdening the ISA with a huge proliferation of
instructions:

* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops
* Masks / tagging (selecting/preventing certain indexed elements from execution)

The pros and cons of each are discussed and analysed below.

## Fixed vs variable parallelism length

David Patterson and Andrew Waterman's analysis of SIMD and Vector
ISAs comes out clearly in favour of (effectively) variable-length
SIMD. Because SIMD has a fixed width, typically 4, 8 or in extreme cases
16 or 32 simultaneous operations, its setup, teardown and corner-cases
are extremely burdensome except for applications whose requirements
*specifically* match the *precise and exact* width of the SIMD engine.

Thus SIMD, no matter what width is chosen, is never going to be acceptable
for general-purpose computation and, in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.

That leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
reducing the number of instructions required for any given task, and thus
reducing power consumption for that task.
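
The corner-case burden described above can be sketched in a few lines of
Python. This is only an illustrative model, not code from either proposal:
`simd_add`, `vector_add`, the `SIMD_WIDTH` of 4 and the `MAX_VL` of 8 are
all hypothetical names and values chosen for the example.

```python
SIMD_WIDTH = 4   # hypothetical fixed SIMD width
MAX_VL = 8       # hypothetical maximum hardware vector length


def simd_add(a, b):
    """Fixed-width SIMD model: a main loop plus a separate scalar tail loop."""
    n = len(a)
    out = [0] * n
    main = n - (n % SIMD_WIDTH)
    for i in range(0, main, SIMD_WIDTH):
        # one SIMD operation always processes exactly SIMD_WIDTH elements
        out[i:i + SIMD_WIDTH] = [
            x + y for x, y in zip(a[i:i + SIMD_WIDTH], b[i:i + SIMD_WIDTH])
        ]
    for i in range(main, n):
        # corner case: leftover elements need their own scalar code path
        out[i] = a[i] + b[i]
    return out


def vector_add(a, b):
    """Variable-length vector model: one loop, with the length chosen per pass."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(MAX_VL, n - i)  # models a "set vector length" request
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out
```

Both functions return the same result for any input length, but only the
fixed-width version needs a second loop for the elements that do not fill
a whole SIMD operation.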

## Implicit vs fixed instruction bit-width

SIMD again has a severe disadvantage here compared to Vector: a huge
proliferation of specialist instructions that target 8-bit, 16-bit, 32-bit
and 64-bit data, which then need operations *for each and between each*
width. It gets very messy, very quickly.

The V-Extension, on the other hand, proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
involving that register are *implicitly* of that particular bit-width until
otherwise changed or reset.

This has some extremely useful properties without being particularly
burdensome to implementations, given that instruction decode already has
to direct the operation to a correctly-sized ALU engine anyway.

Not least: in places where an ISA was previously constrained (for
whatever reason, including limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.
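
As a rough sketch of the idea, a register file can carry a width tag per
register so that a single `add` operation serves every width. The class
`RegFile`, its method names, and the rule that the *destination* register's
tag decides where the result wraps are assumptions made for this
illustration; they are not taken from the V-Extension draft.

```python
class RegFile:
    """Toy register file with a per-register bit-width tag (hypothetical)."""

    def __init__(self, nregs=32, default_width=64):
        self.value = [0] * nregs
        self.width = [default_width] * nregs  # bit-width tag for each register

    def set_width(self, reg, bits):
        # models the special "set bit-width" instruction
        if bits not in (8, 16, 32, 64):
            raise ValueError("unsupported width")
        self.width[reg] = bits

    def add(self, rd, rs1, rs2):
        # one "add" opcode: the result wraps at rd's currently tagged width
        # (assumption: the destination tag selects the operation width)
        mask = (1 << self.width[rd]) - 1
        self.value[rd] = (self.value[rs1] + self.value[rs2]) & mask


rf = RegFile()
rf.value[1], rf.value[2] = 200, 100
rf.add(3, 1, 2)        # 64-bit add: 300
wide = rf.value[3]
rf.set_width(3, 8)     # retag register 3 as 8-bit
rf.add(3, 1, 2)        # same opcode, now wraps modulo 256: 44
narrow = rf.value[3]
```

The opcode never changes between the two `add` calls; only the tag does,
which is what allows existing instructions to be reused without alteration.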

## Implicit and explicit type-conversion

The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
deal with over-population of instructions: type-casting between
integers and floating-point values of various sizes is automatically inferred
from "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer", an implicit
type-conversion will take place *without* requiring that type-conversion
to be explicitly done with its own separate instruction.

However, implicit type-conversion is not only quite burdensome to
implement (an explosion of inferred type-to-type conversions) but is also
never really going to be complete. It gets even worse when bit-widths
also have to be taken into consideration.

Overall, type-conversion is generally best left to explicit
type-conversion instructions or, in definite specific use-cases, left to
be part of an actual instruction (DSP or FP).
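
The "explosion" can be made concrete by counting ordered
source-to-destination pairs. The type list below is an illustrative
assumption and is not the set of tags defined by the Draft 2.3 V-extension;
`TYPES` and `implicit_conversions` are hypothetical names.

```python
from itertools import permutations

# Hypothetical set of tagged element types: (kind, bit-width).
TYPES = [("int", w) for w in (8, 16, 32, 64)] + \
        [("float", w) for w in (16, 32, 64)]


def implicit_conversions(types=TYPES):
    """Every ordered (source, destination) pair of distinct types."""
    return list(permutations(types, 2))


print(len(implicit_conversions()))  # 7 types give 7 * 6 = 42 pairs
```

Under these assumptions, adding a single further type raises the count from
42 to 56, and each pair is a conversion path that an implementation of
implicit conversion would have to infer and handle.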

## Zero-overhead loops vs explicit loops

The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
contains an extremely interesting feature: zero-overhead loops. This
proposal would basically allow an inner loop of instructions to be
repeated a fixed number of times without executing loop-control
instructions on each pass.

Its specific advantage over explicit loops is that the pipeline in a
DSP can potentially be kept completely full *even in an in-order
implementation*. Normally, it requires a superscalar architecture and
out-of-order execution capabilities to "pre-process" instructions in order
to keep ALU pipelines 100% occupied.

This very simple proposal offers a way to increase pipeline activity in the
one key area which really matters: the inner loop.
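
A back-of-the-envelope model shows where the saving comes from. The
figures are assumptions made for illustration, not numbers from the P-SIMD
draft: an explicit loop is taken to spend two extra instructions per pass
(a counter update and a conditional branch), and a zero-overhead loop is
taken to need one setup instruction in total.

```python
def explicit_loop_count(body_len, iterations, overhead_per_iter=2):
    """Instructions issued when every pass also runs loop-control code."""
    return iterations * (body_len + overhead_per_iter)


def zero_overhead_loop_count(body_len, iterations, setup=1):
    """Instructions issued when hardware repeats the body after one setup."""
    return setup + iterations * body_len


# A 4-instruction inner loop body run 100 times:
print(explicit_loop_count(4, 100))       # 600
print(zero_overhead_loop_count(4, 100))  # 401
```

With these assumed costs, a third of the instruction slots in the explicit
loop do no useful work, and the fraction grows as the loop body gets shorter.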

## Mask and Tagging

*TODO: research masks as they can be superb and extremely powerful.
If B-Extension is implemented and provides Bit-Gather-Scatter it
becomes really cool and easy to switch out certain indexed values
from an array of data, but actually BGS **on its own** might be
sufficient. Bottom line, this is complex, and needs a proper analysis.
The other sections are pretty straightforward.*
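
Pending that analysis, the basic idea of a mask (selecting or preventing
certain indexed elements from execution) can be sketched as follows.
`masked_add` is a hypothetical helper, and leaving masked-off destination
elements unchanged is an assumption made for this example; zeroing them
would be an equally plausible policy.

```python
def masked_add(a, b, mask, dest):
    """Add element-wise where mask[i] is true; otherwise keep dest[i]."""
    return [x + y if m else d for x, y, m, d in zip(a, b, mask, dest)]


result = masked_add(
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [True, False, True, False],
    [0, 0, 0, 0],
)
print(result)  # [11, 0, 33, 0]
```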

## Conclusions

The sections above outlined five different areas in which parallel instruction
execution has closely and loosely inter-related implications for the ISA and
for implementors. The preferred option in each case came out as
follows:

* Fixed vs variable parallelism: <b>variable</b>
* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
* Implicit vs explicit type-conversion: <b>explicit</b>
* Implicit vs explicit inner loops: <b>implicit</b>
* Tag or no-tag: <b>TODO</b>

In particular: variable-length vectors came out on top because of the
high setup, teardown and corner-case costs associated with the fixed width
of SIMD. Implicit bit-width helps to extend the ISA to escape from
former limitations and restrictions (in a backwards-compatible fashion),
and implicit (zero-overhead) loops provide a means to keep pipelines
potentially 100% occupied *without* requiring a super-scalar or out-of-order
architecture.

Constructing a SIMD/Simple-Vector proposal based around even only these four
requirements (five, once masks and tagging have been analysed) would
therefore seem to be a logical thing to do.

# Instruction Format

**TODO** *basically borrow from both P and V, which should be quite simple
to do, with the exception of Tag/no-tag, which needs a bit more
thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
gather-scatterer, and, if implemented, could actually be a really useful
way to span 8-bit up to 64-bit groups of data, where BGS as it stands
and described by Clifford does **bits** of up to 16 width. Lots to
look at and investigate!*

# References

* SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>
* Link to first proposal <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/GuukrSjgBH8>
* Recommendation by Jacob Bachmeyer to make zero-overhead loop an
  "implicit program-counter" <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/vYVi95gF2Mo/SHz6a4_lAgAJ>
* Re-continuing P-Extension proposal <https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/IkLkQn3HvXQ/SEMyC9IlAgAJ>
* First Draft P-SIMD (DSP) proposal <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo>
* B-Extension discussion <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/zi_7B15kj6s>