# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

# Requirements Analysis

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual lengths
on a per-mask basis and to "upcast" back to a SimdSignal, in much the
same way that C++ virtual base class casting works when RTTI
(Run Time Type Information) is available.

By deriving from `ast.Shape` both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning,
but without massively disruptive code duplication or intrusive explicit
coding, as outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 there is one mantissa of 52 bits (excluding
  guard, round and sticky bits)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 7 bits.

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

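As a rough illustration of the extra information involved, the exponent
widths listed above can be tabulated per partition context. This is a
plain-Python sketch only, not the SimdShape API: the names
`EXPONENT_LAYOUT` and `compact_width` are invented here for illustration.

```python
# Hypothetical table (not the real SimdShape API): for each partition
# context, the number of elements and the exponent width of each element,
# copied from the "Exponent" list above.
EXPONENT_LAYOUT = {
    "1xFP64": (1, 11),
    "2xFP32": (2, 8),
    "4xFP16": (4, 5),
    "4xBF16": (4, 8),
}


def compact_width(count, el_width):
    """Bits needed if the elements were packed with no padding at all."""
    return count * el_width


compact = {ctx: compact_width(n, w) for ctx, (n, w) in EXPONENT_LAYOUT.items()}
widest = max(compact.values())

for ctx, bits in compact.items():
    print(f"{ctx}: {bits} bits")
print("widest context:", widest, "bits")
```

Even in this small case no two contexts need the same total number of bits,
which is why width and signedness alone are not enough to describe the Signal.
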
A minor wrinkle which emerges from deep analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared under *some* circumstances, and the sub-partitions need to
fit onto power-of-two boundaries, in order to allow
straight wire-connections rather than allowing the SimdSignal to be
arbitrary-sized (compact). Although on shallow inspection this
would seem to result in large unused sub-partitions
(padding partitions), the gates for these can in fact be eliminated
with a "blanking" mask, created from static analysis of the SimdShape
context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values are to be truncated to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-of-two boundary:

          |31      24|23      16|15       8|7        0|
    32bit |                   1.11                    |
    16bit |        2.11         |        1.11         |
    8bit  |   4.5    |   3.5    |   2.5    |   1.5    |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also include column
guidelines to delineate the start and end points:

           |        31..24         |    23..16     |         15..8         |     7..0      |
           | 31:29 | 28:27 | 26:24 | 23:21 | 20:16 | 15:13 | 12:11 | 10:8  |  7:5  |  4:0  |
    32bit  |   x   |   x   |   x   |   x   |   x   |   x   |   x   |         10..0         |
    16bit  |   x   |   x   |        26..16         |   x   |   x   |         10..0         |
    8bit   |   x   |    28..24     |   x   |20..16 |   x   |     12..8     |   x   | 4..0  |
    unused |   x   |       |       |       |       |   x   |       |       |       |       |

Thus, we deduce, we *actually* need breakpoints at *nine* positions,
and unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

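The static analysis described above can be sketched in a few lines of plain
Python. This is an illustration of the example only (32-bit overall width;
11-bit elements in the 32 and 16-bit cases, 5-bit elements in the 8-bit case)
and not the Libre-SOC implementation: `LAYOUTS`, `element_ranges` and
`used_mask` are names made up for this sketch.

```python
TOTAL_WIDTH = 32
# mode name -> (number of partitions, bits actually used per element)
LAYOUTS = {"32bit": (1, 11), "16bit": (2, 11), "8bit": (4, 5)}


def element_ranges(total, count, el_width):
    """(start, end) bit ranges, end exclusive, one per element."""
    lane = total // count  # each element starts on a power-of-two boundary
    return [(i * lane, i * lane + el_width) for i in range(count)]


def used_mask(ranges):
    """Integer with a 1 in every bit position covered by some element."""
    mask = 0
    for start, end in ranges:
        mask |= ((1 << (end - start)) - 1) << start
    return mask


ranges = {name: element_ranges(TOTAL_WIDTH, n, w)
          for name, (n, w) in LAYOUTS.items()}

# a bit belongs to the blanking mask only if *no* mode ever uses it
used_by_any = 0
for r in ranges.values():
    used_by_any |= used_mask(r)
blank_mask = ~used_by_any & ((1 << TOTAL_WIDTH) - 1)

# breakpoints: every element edge strictly inside the signal
breakpoints = sorted({edge for r in ranges.values() for pair in r
                      for edge in pair if 0 < edge < TOTAL_WIDTH})

print(hex(blank_mask))   # 0xe000e000, i.e. bits 31:29 and 15:13
print(breakpoints)       # [5, 8, 11, 13, 16, 21, 24, 27, 29]
```

The sketch uses exclusive end positions, so an element occupying bits 4..0
contributes a breakpoint at 5 rather than 4. The count is still nine, and the
two blank regions match the two "x" marks on the "unused" row.
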
Also, in order to save gates: in the example above there are only three
cases (32 bit, 16 bit, 8 bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partition points (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases are required, where in fact
there are only three.

These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable to have: it should be
possible to perform rudimentary arithmetic operations
*on a SimdShape* which preserve or adapt the Partition context,
where the arithmetic operations occur on `Shape.width`.

    >>> XLEN = SimdShape(64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

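One way such shape arithmetic could be arranged is sketched below. This is
a stand-in written for illustration, not the real `SimdShape`: the class
name `SketchShape`, its `part_counts` field and its `element_widths` method
are all assumptions, and it does not derive from `ast.Shape`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchShape:
    """Hypothetical stand-in for SimdShape: overall width, signedness,
    and the number of elements for each elwidth setting."""
    width: int
    signed: bool = False
    part_counts: tuple = (1, 2, 4, 8)

    def __floordiv__(self, divisor):
        # arithmetic acts on the overall width; signedness and the
        # partition context are carried over unchanged
        return replace(self, width=self.width // divisor)

    def element_widths(self):
        # element width for each partition count, derived from the width
        return tuple(self.width // n for n in self.part_counts)


XLEN = SketchShape(64, signed=True)
x2 = XLEN // 2
print(x2.width)             # 32
print(x2.signed)            # True
print(x2.element_widths())  # (32, 16, 8, 4)
```

Returning a new object rather than modifying `XLEN` in place keeps the
original shape usable for other Signals declared from it.
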
With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace
    from nmigen import Const, Module, Signal
    # SimdShape and SimdSignal are assumed to be in scope already

    # scalar context (a bare object() cannot take attributes)
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal  # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})  # pseudocode
    sdc.SigKls = SimdSignal  # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code switching context at compile time
    m = Module()
    with ctx:  # pseudocode: ctx is assumed to also act as a context manager
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement emerges from attempting to use
SimdSignal, one that affects the way that SimdShape works. The register files
are 64 bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However when it comes to intermediary processing (partial computations)
those intermediary Signals can and will be required to be a certain
fixed width *regardless* of, and having nothing to do with, the 64-bit
fixed width of the register file source or destination.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x == y)
        with m.If(b):
            ...

This code is obvious for Scalar behaviour. For SIMD, however, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`, whilst
the *elements* of `b` are 1 bit each, there correspondingly
needs to be **eight** such element bits in order to store the results
of up to eight parallel comparisons of 8-bit SIMD values. Exactly how
that comparison works is described in [[dynamic_simd/eq]].

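The need for one result bit per element can be seen in a software model of
the comparison. The function below is a behavioural sketch on Python
integers, written for this page; it is not the hardware algorithm described
in [[dynamic_simd/eq]], and the name `simd_eq` and the compact packing of
the result bits are assumptions.

```python
def simd_eq(x, y, count, total_width=64):
    """Compare x and y element by element; bit i of the result is set
    when element i of x equals element i of y."""
    el_width = total_width // count
    el_mask = (1 << el_width) - 1
    result = 0
    for i in range(count):
        xe = (x >> (i * el_width)) & el_mask
        ye = (y >> (i * el_width)) & el_mask
        if xe == ye:
            result |= 1 << i
    return result


x = 0x1122334455667788
y = 0x1122004455007788  # differs from x in bytes 2 and 5 (counting from 0)
for count in (1, 2, 4, 8):
    print(f"{count}x{64 // count}: {simd_eq(x, y, count):0{count}b}")
# 1x64: 0
# 2x32: 00
# 4x16: 1001
# 8x8: 11011011
```

In the 8x8 case eight independent answers are produced, so a Signal declared
with 1-bit elements still has to be eight bits wide overall.
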
Another example would be a simple test of the first *nibble* of
the data.

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4])  # bits 0 to 3: the slice end is exclusive
        ...

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when the context
is set to QTY 8 of 8-bit elements, then y will only need to store QTY 8 of
4-bit quantities, i.e. only a maximum of 32 bits total.

If y were declared as 64 bit this would indicate that the actual
elements were at least 8 bits long, and if y was then used as a
shift input it might produce the wrong calculation because the
actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required but *two*:

* at the element level
* at the width of the entire SIMD signal

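The difference between the two methods can be shown with simple arithmetic.
The helper names below (`from_fixed_total`, `from_fixed_element`) are
invented for this sketch, and the partition counts are the 1x, 2x, 4x and 8x
subdivisions used throughout this page.

```python
PART_COUNTS = (1, 2, 4, 8)  # 1x, 2x, 4x and 8x, as in the text


def from_fixed_total(total_width):
    """Overall width is fixed; element widths are derived from it."""
    return {n: total_width // n for n in PART_COUNTS}


def from_fixed_element(el_width):
    """Element width is fixed; overall width grows with element count."""
    return {n: n * el_width for n in PART_COUNTS}


print(from_fixed_total(64))    # {1: 64, 2: 32, 4: 16, 8: 8}
print(from_fixed_element(4))   # {1: 4, 2: 8, 4: 16, 8: 32}
print(max(from_fixed_element(1).values()))  # 8
```

The first call corresponds to the register-file case (64 bits regardless),
the second to the nibble example (at most 32 bits), and the third to the
boolean example (at most 8 bits).
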
With this background and context in mind the requirements can be determined.

# Requirements

SimdShape needs:

* to derive from nmigen `ast.Shape` in order to provide the overall
  width and whether it is signed or unsigned. However the
  overall width is not necessarily hard-set but may be calculated.
* to provide a means to specify the number of partitions in each of
  an arbitrarily-named set. For convenience, and by convention
  from SVP64, this set is called "elwidths".
* to support a range of sub-signal divisions (element widths)
  and for there to be an option to either set each element width
  explicitly or to allow each width to be computed from the
  overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability
  that automatically computes a new SimdShape, adjusting width
  and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

* allows a switch-case set to be created
* the switch statement is the elwidth parameter
* the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)
228 * the case statements are the PartitionPoints
229 * identifies which partitions are "blank" (padding)