# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;hb=HEAD)
* [SimdShape source](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/partsig.py;hb=HEAD)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

# Requirements Analysis

The dynamically-partitioned SimdSignal class is based on the logical
extension of the full capabilities of the nmigen language's behavioural
constructs into a parallel dimension, with zero change in that behaviour
as a result of the parallelism.

Logically, therefore, even the concept of ast.Shape should be extended
solely to express and define the extent of the parallelism, and
SimdShape should in no way attempt to change the expected behaviour
of the Shape class from which it derives.

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual
lengths on a per-mask basis and to "upcast" back to a SimdSignal, in
exactly the same way that C++ virtual base class upcasting works when
RTTI (Run Time Type Information) is available.

By deriving from `ast.Shape`, both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

Also, by fitting on top of existing nmigen concepts, and by defining
`SimdShape.width` to be equal to and synonymous with `Shape.width`,
downcasting becomes possible and practical. *(An alternative proposal
to redefine "width" to be in terms of the multiple options, i.e.
context-dependent on the partition setting, is unworkable as it
prevents downcasting to e.g. `Signal`)*

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning,
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 54 bits (excluding guard,
  round and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are four mantissas of 5 bits.

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

A minor wrinkle which emerges from deep analysis is that under *some*
circumstances the overall available width (`Shape.width`) does in fact
need to be explicitly declared, and the sub-partitions need to fit onto
power-of-two boundaries, in order to allow straight wire-connections
rather than allowing the SimdSignal to be arbitrary-sized (compact).
Although on shallow inspection this would initially seem to imply
large unused sub-partitions (padding partitions), those gates can in
fact be eliminated with a "blanking" mask, created from static analysis
of the SimdShape context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-2 boundary:

           |31|  |  |24|    16|15|  |  8|7     0 |
    32bit  |  |  |  |  |      |  |  |   | 1.11   |
    16bit  |  |  | 2.11       |  |  |   | 1.11   |
    8bit   |  | 4.5 | 3.5     |  | 2.5  | | 1.5  |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also include column
guidelines to delineate the start and end points:

           |31|  |  |24|       16|15|  |   8|7      0 |
           |31|28|26|24|  |20|16|15|12|10|8 |  |4   0 |
    32bit  | x| x| x|  | x| x| x|10 ....              0 |
    16bit  | x| x|26 ... 16    | x| x|10 ....         0 |
    8bit   | x|28 .. 24| 20.16 | x|12 .. 8|x|4..      0 |
    unused          x                   x

Thus, we deduce, we *actually* need breakpoints at *nine* positions,
and the unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

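The breakpoint and blanking-mask deduction can be sketched in plain
Python. This is an illustrative model only (not the actual
layout_experiment.py code); the allocation table is transcribed from the
example above, and exclusive end-bit numbering is used here, whereas the
diagram labels the inclusive high bits:

```python
# Per-case allocations from the example above: {start_bit: sub_signal_width}.
# 32/16-bit values truncated to 11 bits, 8-bit values to 5 bits, each
# sub-signal starting on a power-of-two boundary within a 32-bit signal.
cases = {
    32: {0: 11},                      # 1x 11-bit
    16: {0: 11, 16: 11},              # 2x 11-bit
    8:  {0: 5, 8: 5, 16: 5, 24: 5},   # 4x 5-bit
}
WIDTH = 32

# breakpoints: every start/end position used by any case (excluding bit 0)
breakpoints = set()
for alloc in cases.values():
    for start, width in alloc.items():
        breakpoints |= {start, start + width}
breakpoints.discard(0)
print(sorted(breakpoints))  # nine positions, as deduced above

# blanking mask: bits unused in *every* case need no computational gates
used = set()
for alloc in cases.values():
    for start, width in alloc.items():
        used |= set(range(start, start + width))
blanking = sorted(set(range(WIDTH)) - used)
print(blanking)  # the two 100%-unused "x" column groups
```

Running this reports nine breakpoints and two unused bit-spans, matching
the "unused" row of the diagram.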
Also, in order to save gates: in the example above there are only three
cases (32-bit, 16-bit, 8-bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partitions (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases were required, when in fact
there are only three.

These facts also need to be communicated to both the SimdSignal
as well as the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable
to have, which is that it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`:

    >>> XLEN = SimdShape(fixed_width=64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    # scalar context (a plain class instance is used here because
    # a bare object() cannot have attributes set on it)
    class Ctx: pass
    scalarctx = scl = Ctx()
    scl.XLEN = 64
    scl.SigKls = Signal        # standard nmigen Signal
    # SIMD context
    simdctx = sdc = Ctx()
    sdc.XLEN = SimdShape(...)  # partitions: 1x64, 2x32, 4x16, 8x8
    sdc.SigKls = SimdSignal    # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code, switching context at compile time
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal, and it affects the way that SimdShape works. The register
files are 64-bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However, when it comes to intermediary processing (partial computations),
those intermediary Signals can and will be required to be a certain
fixed width *regardless*, having nothing to do with the 64-bit fixed
width of the register file source or destination.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x == y)
        with m.If(b):
            ....

This code is obvious for Scalar behaviour, but for SIMD, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`, then whilst
the *elements* are each 1 bit (in order to make a total of QTY 8
comparisons of 8 parallel SIMD 8-bit values), there correspondingly
need to be **eight** such element bits in order to store up to
eight 8-bit comparisons. Exactly how that comparison works
is described in [[dynamic_simd/eq]].

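As a plain-Python illustration (a model only, not SimdSignal itself, and
`simd_eq` is an invented name) of why the 8x8 case needs eight result
bits rather than one:

```python
# Model of an elementwise SIMD equality: QTY 8 of 8-bit comparisons
# produce QTY 8 of 1-bit results, packed into one 8-bit quantity.
def simd_eq(xs, ys):
    bits = 0
    for i, (a, b) in enumerate(zip(xs, ys)):
        bits |= int(a == b) << i
    return bits

result = simd_eq([1, 2, 3, 4, 5, 6, 7, 8],
                 [1, 0, 3, 0, 5, 0, 7, 0])
print(bin(result))  # 0b1010101 (elements 0, 2, 4, 6 matched)
```

One result bit per element is exactly what the SimdSignal boolean must
store, hence the 8-bit quantity.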
Another example would be a simple test of the first *nibble* of
the data:

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4])  # slice end is exclusive: bits 0-3
        ....

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when y is set
to be QTY 8 of 8-bit elements, then y will only need to store QTY 8
of 4-bit quantities, i.e. only a maximum of 32 bits in total.

If y were declared as 64-bit, this would indicate that the actual
elements were at least 8 bits long, and if that were then used as a
shift input it might produce the wrong calculation, because the
actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required, but *two*:

* at the element level
* at the width of the entire SIMD signal

With this background and context in mind, the requirements can be
determined.

# Requirements

SimdShape needs:

* to derive from nmigen ast.Shape in order to provide the overall
  width and whether it is signed or unsigned. However the
  overall width is not necessarily hard-set but may be calculated
* to provide a means to specify the number of partitions in each of
  an arbitrarily-named set. For convenience, and by convention
  from SVP64, this set is called "elwidths".
* to support a range of sub-signal divisions (element widths),
  and for there to be an option either to set each element width
  explicitly or to allow each width to be computed from the
  overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability
  that automatically computes a new SimdShape, adjusting width
  and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

* allows a switch-case set to be created
* the switch statement is the elwidth parameter
* the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)

# SimdShape API

SimdShape needs:

* a constructor taking the following arguments:
  - (mandatory, from scope) an elwidth Signal
  - (optional) an integer vector width, or a dictionary of vector widths
    (the keys to be the "elwidth")
  - (mandatory, from scope) a dictionary of "partition counts":
    the keys to again be the "elwidth" and the values
    to be the number of Vector Elements at that elwidth
  - (optional) a "fixed width" which, if given, shall
    auto-compute the dictionary of Vector Widths
  - (mandatory) a "signed" boolean argument, which defaults
    to False
* to derive from Shape, where the (above) constructor passes it
  the following arguments:
  - the signed argument. This is simply passed in, unmodified.
  - a width argument. This will be **either** the fixed_width
    parameter from the constructor (if given) **or** it will
    be the **calculated** value sufficient to store all partitions.
* a suite of operators (`__add__`, etc.) that shall take simple
  integer arguments and perform the computations on *every*
  one of the dictionary of Vector widths (examples below)
* a "recalculate" function (currently known as layout() in
  layout_experiment.py) which creates information required
  by PartitionedSignal.
* a function which computes and returns a suite of PartitionPoints
  as well as an "Adapter" instance, for use by PartitionedSignal

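The fixed-width auto-computation described above can be sketched as
follows. This is a hypothetical model (the real constructor lives in
partsig.py, and `compute_vec_widths` is an invented name):

```python
# Given a fixed overall width and per-elwidth partition counts, derive
# the dictionary of Vector Element widths: width // count per elwidth.
def compute_vec_widths(fixed_width, part_counts):
    return {elwid: fixed_width // count
            for elwid, count in part_counts.items()}

print(compute_vec_widths(64, {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}))
# {0: 64, 1: 32, 2: 16, 3: 8}
```

This is the 1x64, 2x32, 4x16, 8x8 subdivision used throughout this page.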
Examples of the operator usage:

    x = SimdShape(vec_op_widths={0b00: 64, 0b01: 32, 0b10: 16})
    y = x + 5
    print(y.vec_op_widths)
    {0b00: 69, 0b01: 37, 0b10: 21}

In other words, when requesting 5 to be added to x, every single
one of the Vector Element widths had 5 added to it. If the
partition counts were 2x for 0b00 and 4x for 0b01 then this
would create 2x 69-bit and 4x 37-bit Vector Elements.

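A minimal model of how such an operator could be implemented
(illustrative only; `SimdShapeModel` is a stand-in name, not the
partsig.py class):

```python
# Stand-in for SimdShape's integer arithmetic: apply the operation to
# every Vector Element width in the vec_op_widths dictionary.
class SimdShapeModel:
    def __init__(self, vec_op_widths):
        self.vec_op_widths = dict(vec_op_widths)

    def __add__(self, other):
        assert isinstance(other, int)
        return SimdShapeModel({elwid: w + other
                               for elwid, w in self.vec_op_widths.items()})

x = SimdShapeModel({0b00: 64, 0b01: 32, 0b10: 16})
y = x + 5
print(y.vec_op_widths)  # {0: 69, 1: 37, 2: 21}
```

Note that a new shape is returned rather than the original being
modified in place, matching how nmigen Shapes are value-like objects.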
# Adapter API

The Adapter API performs a specific job of letting SimdSignal
know the relationship between the supported "configuration"
options that a SimdSignal must have, and the actual PartitionPoints
bits that must be set or cleared *in order* to have the SimdSignal
cut itself into the required sub-sections. This information
comes *from* SimdShape, but the adapter is not part *of* SimdShape
because there can be more than one type of Adapter Mode, depending
on SimdShape input parameters.

    class PartType:  # TODO decide name
        def __init__(self, psig):
            self.psig = psig
        def get_mask(self):
            ...
        def get_switch(self):
            ...
        def get_cases(self):
            ...
        @property
        def blanklanes(self):
            ...

# SimdShape arithmetic operators

Rudimentary arithmetic operations are required in order to perform
tricks such as:

    m = Module()
    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(shape)
        b = s.Signal(shape * 2)
        o = s.Signal(shape * 3)
        m.d.comb += o.eq(Cat(a, b))

as well as:

    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(shape)
        b = s.Signal(shape * 2)
        o2 = s.Signal(a.shape + b.shape)

and:

    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(16)  # element width set to 16
        b = s.Signal(shape * 2)
        o2 = s.Signal(a.shape + b.shape)

From these examples we deduce what the arithmetic operators
have to cope with:

* RHS of simple integers
* RHS of another SimdShape

In both cases there are further subdivisions, because SimdShapes
can be created with either fixed_width priority or elwidths priority:

* fixed_width priority (vec_op_widths=None)
* elwidths priority (fixed_width=None)
* equal (no) priority (both are given)

## RHS simple integers

Although this is the simpler case, there are still three options:

* add/mul/sub etc. of an integer at fixed_width priority (vec_op_widths=None)
* add/mul/sub etc. of an integer at elwidths priority (fixed_width=None)
* add/mul/sub etc. of an integer at equal (no) priority (both are given)

The expected behaviour at fixed_width priority is that the arithmetic
operation should take place on the fixed width: adding 8 to a 64-bit
fixed-width SimdSignal should result in the overall fixed width being
extended to 72-bit, and for all partitions the new element-width within
each partition is newly-computed to be the maximum possible permitted
amount. Example:

* assume fixed_width=64 and partition_counts {1,2,4,8}.
* therefore the computed element sizes are: 1x64, 2x32, 4x16, 8x8
* assume that an add operation has requested to add 8 to the fixed width
* the new fixed_width is 72
* the newly-computed element sizes shall be: 1x72, 2x36, 4x18, 8x9

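The steps above can be sketched in plain Python (a hypothetical helper,
not the partsig.py implementation):

```python
# Fixed-width priority: the integer is added to the overall width, and
# the per-partition element sizes are recomputed as the maximum that
# fits (new width divided by the partition count).
def add_fixed_width(fixed_width, part_counts, n):
    new_fw = fixed_width + n
    return new_fw, {count: new_fw // count for count in part_counts}

print(add_fixed_width(64, (1, 2, 4, 8), 8))
# (72, {1: 72, 2: 36, 4: 18, 8: 9})
```

Note that for partition counts that do not divide the new width evenly,
integer division silently drops bits, which is why layout() needs to
check partition alignment (see the Summary below).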
The expected behaviour at element-width priority, on the other hand, is
that the arithmetic operation takes place on *each of the element-widths*.

Example:

* assume element_widths={16,16,10,12} and partition_counts {1,2,4,8}
* therefore the computed element sizes are: 1x16, 2x16, 4x10, 8x12
* therefore the computed overall width is the maximum of these amounts
  (8x12), which is a 96-bit wide Signal.
* assume that a subtract operation has requested to subtract 5 from the
  element_widths.
* the newly-computed element_widths becomes {16-5,16-5,10-5,12-5}, which
  is {11,11,5,7}
* therefore the computed element sizes are: 1x11, 2x11, 4x5, 8x7
* therefore the newly-computed overall width is the maximum of these
  amounts (8x7), which is a 56-bit quantity.

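The element-width-priority steps can likewise be sketched (again a
hypothetical helper name):

```python
# Elwidth priority: the integer op applies to each element width, and
# the overall width is recomputed as the maximum of width * count.
def sub_elwidths(elwidths, part_counts, n):
    new_widths = [w - n for w in elwidths]
    overall = max(w * c for w, c in zip(new_widths, part_counts))
    return new_widths, overall

print(sub_elwidths([16, 16, 10, 12], [1, 2, 4, 8], 5))
# ([11, 11, 5, 7], 56)
```

The 56-bit result comes from the 8x7 case dominating the other three.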
It is important to note that some arithmetic operations will not work
correctly if both fixed_width and element-widths are specified. Addition:

* assume fixed_width=64, element_widths {8,8,8,8} and
  partition_counts {1,2,4,8}.
* adding 8 to each element width creates a 16-bit element width for all
  partitions
* this creates a maximum expected width of 8x16, or 128 bits
* but the fixed_width of 64 plus 8 is only 72.
* therefore we must prohibit this operation when **both** fixed and
  elwidth are specified

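A sketch of the consistency check behind that prohibition (hypothetical
helper, illustrating the contradiction rather than prescribing the
actual implementation):

```python
# Dual-priority addition is ambiguous: the recomputed element widths
# imply an overall width that contradicts the adjusted fixed width.
def add_both(fixed_width, elwidths, part_counts, n):
    new_widths = [w + n for w in elwidths]
    implied = max(w * c for w, c in zip(new_widths, part_counts))
    if implied != fixed_width + n:
        raise ValueError("ambiguous: implied width %d != fixed width %d"
                         % (implied, fixed_width + n))
    return fixed_width + n, new_widths

try:
    add_both(64, [8, 8, 8, 8], [1, 2, 4, 8], 8)
except ValueError as e:
    print(e)  # implied width 128 != fixed width 72
```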
However for multiply it is fine, due to a property of multiplication:

* assume fixed_width=64, element_widths {8,8,8,8} and
  partition_counts {1,2,4,8}.
* multiply by an integer value of 2
* the new fixed_width is 2x64 = 128
* each element width is **also** multiplied by 2, to create {16,16,16,16}
* this creates partitions 1x16, 2x16, 4x16, 8x16
* the maximum is (not by coincidence) exactly 128 bits wide
* this matches perfectly the newly-calculated fixed_width

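The same consistency check passes for multiply, assuming (as in the
example above) that the element widths were originally derived from the
fixed width (hypothetical helper name):

```python
# Dual-priority multiply stays consistent: scaling the fixed width and
# every element width by the same factor implies the same overall width.
def mul_both(fixed_width, elwidths, part_counts, n):
    new_fw = fixed_width * n
    new_widths = [w * n for w in elwidths]
    implied = max(w * c for w, c in zip(new_widths, part_counts))
    assert implied == new_fw, "multiply of derived elwidths is never ambiguous"
    return new_fw, new_widths

print(mul_both(64, [8, 8, 8, 8], [1, 2, 4, 8], 2))
# (128, [16, 16, 16, 16])
```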
Given that left-shift is a type of multiply, a left-shift arithmetic
operator with an integer amount (applied equally to all element widths
and to the fixed_width) is also expected to work.

Divide, on the other hand (and right-shift), will only work (when
both elwidth and fixed-width are set) if there is
no rounding (no bits lost due to the division / right-shift).

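That no-bits-lost condition can be expressed as a simple divisibility
check (hypothetical helper name):

```python
# Dual-priority divide / right-shift is only safe when no bits are lost,
# i.e. the divisor divides the fixed width and every element width exactly.
def simd_div_ok(fixed_width, elwidths, n):
    return fixed_width % n == 0 and all(w % n == 0 for w in elwidths)

print(simd_div_ok(64, [8, 8, 8, 8], 2))   # True
print(simd_div_ok(64, [8, 8, 10, 8], 4))  # False: 10 % 4 != 0
```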
Summary:

* **Fixed-width Priority**
  - all operations (add, mul, sub, shift, div)
    expected to work (caveat: layout() should check partition alignment)
* **Element-width Priority**
  - all operations expected to work (no caveats)
* **Fixed-and-Elwidth (no priority)**
  - multiply and left-shift always expected to work
  - divide and right-shift expected to work (caveat: no bits
    lost due to rounding)
  - add and subtract **not** expected to work (ambiguous):
    an exception must be raised

## Arithmetic of two SimdShapes

With some thought it becomes clear that operations not involving
elwidth priority should simply calculate a new fixed
width based on straight arithmetic on the LHS and RHS fixed widths.
The partition counts remain the same (coming from the scope
context), therefore the result may also be a fixed_width priority
result using the exact same partition counts.

However - and bearing in mind that for fixed_width priority the
elwidths are *computed* from the fixed width and the partition counts -
the moment that elwidths (vec_op_widths) are involved,
the priority switches to the elwidths, even if one of those elwidths was
calculated initially from a fixed_width and partition counts.
In this case, the result will be an elwidths-priority SimdShape,
where the layout() function is already capable of computing the
required overall width based on the (newly-computed) vec_el_widths.

Note that, interestingly, when fixed_width priority SimdShapes are
used on both LHS and RHS, it is effectively an identical case to
when the RHS is an integer, because the fixed_width of the RHS is
itself an integer.

For certain operators (mul, shift) a special case of this also occurs
in instances where all elwidths in all partitions on either LHS or RHS
are identical: element widths {3,3,3,3}, for example. Under such
special circumstances, multiply would function correctly even for
dual-priority, due to uniform scaling.

Summary:

* **Both LHS and RHS Fixed-width Priority**
  - exactly the same as for when the RHS is an Integer, given that the
    RHS fixed width is, in fact, an integer.
  - the return result is always a fixed-width priority SimdShape
* **Either LHS or RHS Elwidth Priority** (but not both)
  - all operators are always expected to work because, again, one of
    RHS or LHS is an integer, and that integer is used as the
    input to the arithmetic.
  - reverse-operators (rmul etc.) take care of the RHS.
  - the return result is however always an elwidth-priority SimdShape
* **Both LHS and RHS Elwidth Priority**
  - for mul and shift-left, as long as one of LHS or RHS has identical
    elwidths, these are fine due to uniform scaling.
    The return result may be a dual-priority result.
  - for add and sub, an exception must be raised.
  - divide and shift-right are expected to work on condition
    that, again, all elwidths have the exact same value, and, again,
    that no LSBs are lost.