# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

# Requirements Analysis

The dynamic partitioned SimdSignal class is based on the logical extension
of the full capabilities of the nmigen language behavioural constructs to
a parallel dimension, with zero changes in that behaviour as a result of
that parallelism.

Logically, therefore, even the concept of ast.Shape should be extended
solely to express and define the extent of the parallelism, and SimdShape
should in no way attempt to change the expected behaviour of the Shape
class from which it derives.

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual lengths
on a per-mask basis and to "upcast"
back to a SimdSignal, in exactly the same way that C++ virtual base
class upcasting works when RTTI (Run Time Type Information) is enabled.

By deriving from `ast.Shape` both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

Also, by fitting on top of existing nmigen concepts, and by defining
`SimdShape.width` to be equal to and synonymous with `Shape.width`,
downcasting becomes possible and practical. *(An alternative proposal
to redefine "width" to be in terms of the multiple options, i.e.
context-dependent on the partition setting, is unworkable because it
prevents downcasting to e.g. `Signal`)*

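A brief sketch of that downcast (illustrative only: `vec_el_counts` and
`fixed_width` are the constructor parameters described in the API section
below, and `elwid` is assumed to be the elwidth selector Signal). Because
a SimdShape *is* a Shape of the overall width, it can be handed straight
to a plain scalar Signal constructor:

    from nmigen import Signal

    # a SimdShape behaves as a plain Shape of its overall width...
    xlen_shape = SimdShape(elwid, vec_el_counts=vec_el_counts, fixed_width=64)

    # ...so a scalar (non-SIMD) Signal can be constructed directly from it,
    # exactly as if Shape(64, signed=False) had been passed in
    scalar = Signal(xlen_shape)
    assert len(scalar) == 64
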
The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 54 bits (excluding guard,
  round and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 5 bits

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

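To give a flavour of how that information might be expressed, below is a
sketch (not a definitive API: it simply reuses the `vec_op_widths` and
`vec_el_counts` parameter names from the API section further down, and
assumes `elwid` is the 2-bit elwidth selector Signal) declaring the
per-context mantissa widths listed above:

    # per-elwidth mantissa element widths and element counts (a sketch)
    mantissa_shape = SimdShape(elwid,
            vec_op_widths={0b00: 54,   # 1xFP64: one 54-bit mantissa
                           0b01: 23,   # 2xFP32: two 23-bit mantissas
                           0b10: 10,   # 4xFP16: four 10-bit mantissas
                           0b11: 5},   # 4xBF16: four 5-bit mantissas
            vec_el_counts={0b00: 1, 0b01: 2, 0b10: 4, 0b11: 4},
            signed=False)

An exponent SimdShape would be declared in exactly the same way, using
the widths from the second list.
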
A minor wrinkle which emerges from deep analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared under *some* circumstances, with
the sub-partitions fitted onto power-of-two boundaries, in order to allow
straight wire-connections rather than allowing the SimdSignal to be
arbitrary-sized (compact). Although on shallow inspection this
would initially seem to imply large unused
sub-partitions (padding partitions), those gates can in fact be eliminated
with a "blanking" mask, created from static analysis of the SimdShape
context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values are to be truncated to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-of-two boundary:

          |31|  |  |24|   16|15|  |  8|7    0 |
    32bit |  |  |  |            1.11          |
    16bit |  |   2.11     |  |  |    1.11     |
    8bit  |  | 4.5 | 3.5  |  | 2.5  |  | 1.5  |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also including column
guidelines to delineate the start and end points:

           |31|  |  |24|   16|15|  |  8|7    0 |
           |31|28|26|24|  |20|16|15|12|10|8|  |4   0 |
    32bit  | x| x| x|  | x| x| x|    10 .... 0       |
    16bit  | x| x|26 ...  16   | x| x|    10 .... 0  |
    8bit   | x|28 .. 24| 20.16 | x|12 .. 8|x| 4 .. 0 |
    unused    x                   x

Thus, we deduce, we *actually* need breakpoints at *nine* positions,
and that unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

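The static analysis itself is straightforward enough that it can be
sketched in a few lines of plain Python. The snippet below is purely
illustrative (it is not the actual layout() code from
layout_experiment.py): it reproduces the example above and finds the bit
positions that no case ever uses, i.e. the "blanking" positions:

    # (element width after truncation, element count, per-element stride)
    cases = {
        "32bit": (11, 1, 32),
        "16bit": (11, 2, 16),
        "8bit":  ( 5, 4,  8),
    }

    used_per_case = []
    for elwidth, count, stride in cases.values():
        used = set()
        for i in range(count):
            start = i * stride   # each element starts on a power-of-2 boundary
            used.update(range(start, start + elwidth))
        used_per_case.append(used)

    # bits used by *no* case at all: these form the "blanking" mask
    blank = set(range(32)) - set.union(*used_per_case)
    print(sorted(blank))         # [13, 14, 15, 29, 30, 31]

Those positions correspond to the two "x" columns marked "unused" in the
table above.
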
Also, in order to save gates: in the example above there are only three
cases (32 bit, 16 bit, 8 bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partitions (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases would be required, where in fact
there are only three.

These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable
to have, which is that it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`.

    >>> XLEN = SimdShape(fixed_width=64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace

    # scalar context (SimpleNamespace so that attributes may be set on it)
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal                            # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})  # pseudocode: elwidth partition counts
    sdc.SigKls = SimdSignal                        # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code, switching context at compile time
    # (ctx is assumed to also act as a scope context manager, cf. SimdScope below)
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal, which affects the way that SimdShape works. The register files
are 64 bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However when it comes to intermediary processing (partial computations)
those intermediary Signals can and will be required to be a certain
fixed width *regardless*, having nothing to do with the register
file source or destination 64-bit fixed width.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x == y)
        with m.If(b):
            ....

This code is obvious for Scalar behaviour, but for SIMD, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`, whilst
the *elements* are 1 bit (in order to make a total of QTY 8
comparisons of 8 parallel SIMD 8-bit values), there correspondingly
needs to be **eight** such element bits in order to store the results
of up to eight 8-bit comparisons. Exactly how that comparison works
is described in [[dynamic_simd/eq]].

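A quick back-of-the-envelope check of that storage requirement (plain
Python, purely illustrative; the elwidth keys follow the `0b00`-`0b11`
convention used in the examples further down):

    vec_el_counts = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}  # 1x64, 2x32, 4x16, 8x8
    el_width = 1                      # one boolean result per element
    per_case = {ew: count * el_width for ew, count in vec_el_counts.items()}
    print(per_case)                   # {0: 1, 1: 2, 2: 4, 3: 8}
    print(max(per_case.values()))     # 8 bits needed overall to hold 'b'
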
Another example would be a simple test of the first *nibble* of
the data.

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4])    # first nibble (4 bits) of each element
        ....

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when y is set
to be QTY 8 of 8-bit elements, then y will only need to store QTY 8 of
4-bit quantities, i.e. only a maximum of 32 bits total.

If y was declared as 64 bit this would indicate that the actual
elements were at least 8 bits long, and if that was then used as a
shift input it might produce the wrong calculation because the
actual shift amount was only supposed to be 4 bits.

Thus not one but *two* methods of setting widths are required,
as sketched below:

* at the element level
* at the width of the entire SIMD signal

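The following sketch shows what those two methods might look like side by
side. It is illustrative only: the parameter names `fixed_width`,
`vec_el_counts` and `vec_op_widths` are the ones used in the API section
below, and `elwid` is assumed to be the 2-bit elwidth selector Signal:

    el_counts = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}   # 1x64, 2x32, 4x16, 8x8

    # (1) width set for the entire SIMD signal: the per-elwidth element
    #     widths (64, 32, 16, 8) are computed from the element counts
    x_shape = SimdShape(elwid, vec_el_counts=el_counts, fixed_width=64)

    # (2) width set at the element level: every element is a 4-bit nibble,
    #     and the overall width is computed from the element counts instead
    y_shape = SimdShape(elwid, vec_el_counts=el_counts,
                        vec_op_widths={ew: 4 for ew in el_counts})

With the second form, `y` from the nibble example needs at most 8x4=32
bits rather than the full 64.
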
With this background and context in mind, the requirements can be
determined.

# Requirements

SimdShape needs:

* to derive from nmigen ast.Shape in order to provide the overall
  width and whether it is signed or unsigned. However the
  overall width is not necessarily hard-set but may be calculated
* to provide a means to specify the number of partitions in each of
  an arbitrarily-named set. For convenience, and by convention
  from SVP64, this set is called "elwidths".
* to support a range of sub-signal divisions (element widths),
  and for there to be an option to either set each element width
  explicitly or to allow each width to be computed from the
  overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability
  that automatically computes a new SimdShape, adjusting width
  and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

* allows a switch-case set to be created
* the switch statement is the elwidth parameter
* the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)

# SimdShape API

SimdShape needs:

* a constructor taking the following arguments:
  - (mandatory) an elwidth Signal
  - (optional) an integer vector width or a dictionary of vector widths
    (the keys to be the "elwidth")
  - (mandatory) a dictionary of "partition counts":
    the keys to again be the "elwidth" and the values
    to be the number of Vector Elements at that elwidth
  - (optional) a "fixed width" which, if given, shall
    auto-compute the dictionary of Vector Widths
  - (optional) a "signed" boolean argument which defaults
    to False
* to derive from Shape, where the (above) constructor passes it
  the following arguments:
  - the signed argument. This is simply passed in, unmodified.
  - a width argument. This will be **either** the fixed_width
    parameter from the constructor (if given) **or** it will
    be the **calculated** value sufficient to store all partitions.
* a suite of operators (`__add__`, etc.) that shall take simple
  integer arguments and perform the computations on *every*
  one of the dictionary of Vector widths (examples below)
* a "recalculate" function (currently known as layout() in
  layout_experiment.py) which creates information required
  by PartitionedSignal.
* a function which computes and returns a suite of PartitionPoints
  as well as an "Adapter" instance, for use by PartitionedSignal

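Pulling the list above together, the constructor could plausibly look
something like the following sketch. The parameter names are illustrative
only, and `calculate_width` is a hypothetical helper standing in for the
real calculation (layout() in layout_experiment.py is the reference):

    from nmigen.hdl.ast import Shape

    class SimdShape(Shape):
        def __init__(self, elwid,          # mandatory: the elwidth Signal
                     vec_el_counts,        # mandatory: element count per elwidth key
                     vec_op_widths=None,   # optional: element width per elwidth key
                     fixed_width=None,     # optional: overall width (auto-computes
                                           #           the element widths if given)
                     signed=False):        # defaults to False
            # overall Shape.width is either the given fixed_width or the
            # calculated minimum sufficient to store all partitions
            # (calculate_width is a hypothetical placeholder, not real code)
            width = fixed_width if fixed_width is not None else \
                    calculate_width(vec_el_counts, vec_op_widths)
            super().__init__(width, signed)
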
Examples of the operator usage:

    x = SimdShape(vec_op_widths={0b00: 64, 0b01: 32, 0b10: 16})
    y = x + 5
    print(y.vec_op_widths)
    {0b00: 69, 0b01: 37, 0b10: 21}

In other words, when requesting 5 to be added to x, every single
one of the Vector Element widths had 5 added to it. If the
partition counts were 2x for 0b00 and 4x for 0b01 then this
would create 2x 69-bit and 4x 37-bit Vector Elements.

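A sketch of how such an operator could behave (not the actual
implementation; the attribute names mirror the constructor sketch above):
the integer is applied to every per-elwidth element width and a new
SimdShape is returned with the partition counts left untouched:

    def __add__(self, other):
        # 'other' is a simple integer: add it to every element width
        new_widths = {elwid: w + other
                      for elwid, w in self.vec_op_widths.items()}
        return SimdShape(self.elwid, self.vec_el_counts,
                         vec_op_widths=new_widths, signed=self.signed)
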
# Adapter API

The Adapter API performs a specific job of letting SimdSignal
know the relationship between the supported "configuration"
options that a SimdSignal must have, and the actual PartitionPoints
bits that must be set or cleared *in order* to have the SimdSignal
cut itself into the required sub-sections. This information
comes *from* SimdShape but the adapter is not part *of* SimdShape
because there can be more than one type of Adapter Mode, depending
on SimdShape input parameters.

    class PartType: # TODO decide name
        def __init__(self, psig):
            self.psig = psig
        def get_mask(self):
            ...
        def get_switch(self):
            ...
        def get_cases(self):
            ...
        @property
        def blanklanes(self):
            ...

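As a rough illustration of how SimdSignal might consume that interface
(hypothetical wiring; the real code lives inside
SimdSignal/PartitionedSignal): `get_switch()` supplies the elwidth Signal
for a Switch statement, `get_cases()` supplies the per-elwidth
PartitionPoints settings, and `blanklanes` reports the padding partitions:

    # ppoints is assumed to be a Signal holding the partition-point bits
    adapter = PartType(simd_signal)
    with m.Switch(adapter.get_switch()):        # switch on the elwidth
        for elwid, pp_value in adapter.get_cases().items():
            with m.Case(elwid):                 # one case per supported elwidth
                m.d.comb += ppoints.eq(pp_value)
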
# SimdShape arithmetic operators

Rudimentary arithmetic operations are required in order to perform
tricks such as:

    m = Module()
    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(shape)
        b = s.Signal(shape * 2)
        o = s.Signal(shape * 3)
        m.d.comb += o.eq(Cat(a, b))

Here `shape * 2` and `shape * 3` scale the element widths at every
elwidth, so that the concatenation of `a` and `b` fits exactly into
`o` in all cases.