add description of SimdSlice and why it needs padding
[libreriscv.git] / 3d_gpu / architecture / dynamic_simd.mdwn
1 # Dynamic SIMD Library
2
3 Funded by NLnet EU Grants
4
5 Links
6
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
8 (this document)
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
12 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
13 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=713> Partition-context-sensitive length adjustment
15
16 # Proposal Summary
17
18 The proposal is two-fold, and targetted at the two distinct types
19 of nmigen language constructs:
20
21 * Type 1 low-level AST. implemented as nmigen.hdl.ast classes:
22 Value, Cat, Repl, Mux, Switch etc.
23 * Type 2 high-level DSL. Implemented as Module in nmigen.hdl.dsl
24
25 The Type 1 AST low-level proposed modifications mirror and extend the
26 existing long-established python `operator` module interoperability, which nmigen
27 *already leverages* by providing `Value.__add__` and other operator
28 overrides.
29
30 * To extend nmigen "Type 1 (ast.*)" low-level language constructs
31 in the Value class
32 with `Value.__Cat__`, `Value.__Switch__`, `Value.__Repl__` etc.
33 * To rename existing old
34 - `ast.Cat` to new `ast._InternalCat`,
35 - `ast.Repl` to `ast._InternalRepl`
36 - etc.
37 * In an identical conceptual fashion, just
38 as python `operator.add(x,y)` redirects to `x.__add__(y)`,
39 to add a new `ast.Cat(x..)` which redirects to `x.__Cat__(...)` etc.
40 * To add `Value.__Cat__` etc which call the now-renamed `ast._InternalCat`
41 * To allow all of the new underscore variants to be overridden without
42 limitation, restriction, or restraint, via UserValue (and ValueCastable)
43
44 The second set of changes is targetted at Type 2 dsl.Module,
45 to complete the 98% abstraction from Type 1 to a 100% level
46
47 * To add a new parameter to `dsl.Module`'s constructor,
48 `_astTypeFn`, which is the AST class type-casting function
49 to be used for "casting" of m.If and m.Elif condition tests,
50 and for m.Switch's value.
51 * To replace the two (2) calls to `Value.cast()` in `dsl.Module`
52 with `self._astTypefn()`
53
54 No further modifications beyond that are strictly necessary.
55
56 With `dsl.Module` already being extremely close to 100%
57 abstracted with respect to Type 1 AST constructs, Liskov
58 Substitution Principle in combination with UserValue (later,
59 ValueCastable) combine to provide extremely powerful and
60 flexible augmentation and capabilities in nmigen, far beyond
61 its original intended design parameters.
62
63 The benefits due to the abstraction extend 100% cleanly to
64 other use-cases (Partitioned Dynamic SIMD is just the first). Other nmigen
65 developers may leverage contextual overriding of the AST
66 constructs in full OO fashion, yet the high-level dsl.Module
67 language concepts remain true to their intended characteristics
68 and behaviour and need neither duplication nor alteration.
69
70 Typical use-cases are, just as is the driving force behind PartitionedSignal,
71 the sharing at the gate level of arithmetic primitives that would
72 otherwise require costly duplication in a final product, or be too
73 complex or costly to develop without such abstraction at the higher
74 Type 2 (dsl.Module) level.
75
76 Examples include an ALU that may dynamically
77 at runtime switch between Complex (imaginary) arithmetic
78 and Real (scalar) arithmetic, whilst *through the same
79 datapath* automatically sharing the arithmetic logic
80 gates between the two abstract concepts. Given that
81 64 bit multipliers are 12,000 to 15,000 gates the savings
82 can be enormous.
83
84 PartitionedSignal is the first concrete third-party example
85 that inspired the completion of the abstraction of dsl.Module
86 from ast.*
87
88 # Rationale / Introduction
89
90 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
91 of nmigen's Signal. It is expected to work transparently as if it was
92 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
93 no requirement on the part of the developer to even know that it is
94 performing parallel dynamically-partitioned operations.
95 All standard nmigen language constructs are both expected and required
96 to behave exactly as they do for scalar Signals.
97
98 nmigen 32-bit Signal:
99
100 a : .... .... .... .... (32 bits, illustrated as 4x8)
101
102 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
103 sections, by 3 partition bits:
104
105 partition: P P P (3 bits)
106 a : .... .... .... .... (32 bits, illustrated as 4x8)
107 exp-a : ....P....P....P.... (32+3 bits, P=0 if no partition)
108
109 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
110 into all 8 possible combinations of the 3 Partition bits:
111
112 exp-a : ....0....0....0.... 1x 32-bit
113 exp-a : ....0....0....1.... 1x 24-bit plus 1x 8-bit
114 exp-a : ....0....1....0.... 2x 16-bit
115 ...
116 ...
117 exp-a : ....1....1....0.... 2x 8-bit, 1x 16-bit
118 exp-a : ....1....1....1.... 4x 8-bit
119
120 A simple example, a "min" function:
121
122 # declare x, y and out as 16-bit scalar Signals
123 x = Signal(16)
124 y = Signal(16)
125 out = Signal(16)
126
127 # compare x against y and set output accordingly
128 with m.If(x < y):
129 comb += out.eq(x)
130 with m.Else():
131 comb += out.eq(y)
132
133 This is very straightforward and obvious, that the 16-bit output is the
134 lesser of the x and y inputs. We require the exact same obviousness
135 and under no circumstances any change of any kind to any nmigen language
136 construct:
137
138 # a mask of length 3 indicates a desire to partition Signals at
139 # 3 points into 4 equally-spaced SIMD "partitions".
140 mask = Signal(3)
141
142 with mask:
143 # x y and out are all 16-bit so are subdivided at:
144 # | mask[0] mask[1] mask[3] |
145 # | 0-3 | 4-7 | 8-11 | 12-15 |
146 x = PartitionedSignal(16) # identical except for mask
147 y = PartitionedSignal(16) # identical except for mask
148 out = PartitionedSignal(16) # identical except for mask
149
150 # all code here is required to be absolutely identical to the
151 # scalar case, and identical in nmigen language behaviour in
152 # every way. no changes to the nmigen language or its use
153 # are permitted
154
155 with m.If(x < y): # exactly
156 comb += out.eq(x) # the
157 with m.Else(): # same
158 comb += out.eq(y) # code
159
160 The purpose of PartitionedSignal is therefore to provide full 100%
161 transparent SIMD run-time dynamic behaviour as far as end-usage is
162 concerned.
163
164 The alternative is absolutely awful and completely unacceptable
165 for both maintenance cost and development cost. Even the most
166 basic HDL is literally an order of magnitude larger:
167
168 # declare x, y and out as 16-bit scalar Signals
169 x = Signal(16)
170 y = Signal(16)
171 out = Signal(16)
172
173 # start an absolutely awful unmaintainable duplication of
174 # SIMD behaviour.
175 with m.If(mask == 0b111): # 1x 16-bit
176 # compare x against y and set output accordingly
177 with m.If(x < y):
178 comb += out.eq(x)
179 with m.Else():
180 comb += out.eq(y)
181 with m.ElIf(mask == 0b101): # 2x 8-bit
182 for i in range(2):
183 xh = x[i*8:(i+1)*8]
184 yh = y[i*8:(i+1)*8]
185 outh = out[i*8:(i+1)*8]
186 # compare halves of x against halves y and set
187 # halves of output accordingly
188 with m.If(xh < yh):
189 comb += outh.eq(xh)
190 with m.Else():
191 comb += outh.eq(yh)
192 with m.ElIf(mask == 0b000): # 4x 4-bit
193 ....
194 with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
195 ....
196 with m.ElIf(....)
197 ....
198 with m.ElIf(....)
199 ....
200 with m.ElIf(....)
201 ....
202
203 # Overview
204
205 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
206
207 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
208 These "breaks" may be set at runtime at any time.
209
210 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs. Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
211
212 # Type 1 (AST) nmigen constructs
213
214 nmigen's `ast.Value` already provides operator overloads for arithmetic
215 operations such that natural standard python syntax `(a+b)` may be used.
216 This proposal extends that concept naturally to allow further contextual
217 overrides of the remaining nmigen Type 1 AST constructs: Cat, Repl,
218 Mux, Switch, Part etc.
219
220 Instead of these being defined exclusively as global functions which cannot comprehend Object-Oriented
221 context, Cat, Switch and Mux etc are redefined to behave as analogues
222 of the python `operator` module. `nmigen.hdl.ast.Mux(sel,val2,val1)`
223 therefore calls `sel.__Mux__(val2,val1)` and so on.
224
225 As an example of a fully-functioning use of this enhancement,
226 pages below describe the basic features of each and track the relevant bugreports. These features here are the building blocks which lie behind
227 PartitionedSignal, which in turn provides "Type 1 (ast.*)" nmigen language
228 constructs in the form described above.
229
230 * [[dynamic_simd/shape]] a derivative of ast.Shape with partition info
231 * [[dynamic_simd/assign]] nmigen eq (ast.Assign)
232 * [[dynamic_simd/cat]] nmigen ast.Cat
233 * [[dynamic_simd/repl]] nmigen ast.Repl
234 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
235 * [[dynamic_simd/gt]] aka `__gt__` in python operator terms
236 * [[dynamic_simd/add]]
237 * [[dynamic_simd/mul]]
238 * [[dynamic_simd/shift]]
239 * [[dynamic_simd/logicops]] Horizontal reduction: some all xor bool
240 * [[dynamic_simd/slice]] nmigen ast.Slice
241
242 # Integration with nmigen: "Type 2" (dsl.Module)
243
244 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
245
246 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
247
248 if partitions == 1x64
249 with m.If(x > y):
250 do something
251 elif partitions == 2x32:
252 with m.If(x[0:31] > y[0:31]):
253 do something on 1st half
254 elif ...
255 elif ...
256 # many more lines of repeated laborious hand written
257 # SIMD nonsense all exactly the same except for the
258 # for loop and sizes.
259
260 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
261
262 A much more intelligent approach is needed. What we actually want is:
263
264 with m.If(x > y): # do a partitioned compare here
265 do something dynamic here
266
267 where *behind the scenes* the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
268
269 This means one of two things:
270
271 1. that nmigen needs to "understand" the partitioning, in a Type 2
272 construct (m.If, m.Else and m.Switch), at the bare minimum.
273 2. that nmigen's "Type 2" language constructs be sufficiently abstract
274 that they *pass through* SIMD behaviour entirely to a lower level,
275 *completely intact*.
276
277 Mathod (1) is an alarmingly large amount of work with severe to
278 catastrophic implications in multiple areas.
279
280 Method (2) by complete contrast allows the Liskov Substitution Principle to
281 be put to surprisingly elegant effect. If that was possible to
282 achieve then **almost no modifications to nmigen would
283 be required** because dsl.Module is *already* 99% abstracted in terms
284 of the lower-level Type 1 (ast.*) constructs.
285
286 Analysis of the internals of nmigen shows that m.If, m.Else, m.FSM and m.Switch are all redirected to ast.py `Switch`. Within that ast.Switch
287 function only ast.Mux and other Type 1 (AST) "global" functions
288 similar to python operator are used. The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
289
290 <https://github.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
291
292 A deeper analysis shows that dsl.Module uses explicit Value.cast on its
293 If, Elif, and Switch clauses. Overriding that and allowing a cast to
294 a PartitionedSignal.cast (or, PartitionedBool.cast) would be sufficient
295 to make Type 2 (dsl.Module) nmigen language constructs 100% abstracted from the Type 1 (ast) lower-level ones.
296
297 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement. With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
298
299 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions. Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
300
301 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column. Take the following example:
302
303 mask = Signal(3) # creates four partitions
304 a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
305 b = PartitionedSignal(mask, 4) # likewise
306 c = PartitionedSignal(mask, 32)
307 d = PartitionedSignal(mask, 32)
308 o = PartitionedSignal(mask, 32)
309
310 with m.If(a):
311 comb += o.eq(c)
312 with m.Elif(b):
313 comb += o.eq(d)
314
315 If these were ordinary Signals, they would be translated to a Switch where:
316
317 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
318 * cases would be (quantity 2) "1-" and "-1" in order to match
319 against the first binary test bit of Cat(a, b) and the second,
320 respectively.
321 * the first case would be "1-" to activate `o.eq(c)
322 * the second case would be "-1" to activate o.eq(d)
323
324 A parallel variant may thus perform a for-loop, creating four
325 **independent** Switches:
326
327 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
328 * take the output of each case result `o[0].eq[c[0])` and
329 so on
330 * create the first independent Switch
331 * take a[1] and b[1] etc.
332
333 There are several ways in which the parts of each case, when
334 activated, can be split up: temporary Signals, analysing
335 the AST, or using PartitionedMux.
336
337 # Alternative implementation concepts
338
339 Several alternative ideas have been proposed. They are listed here for
340 completeness. The worst offender (use code-duplication) has already been
341 raised.
342
343 The fundamental common characteristic of all of these ideas, despite
344 their disparate nature, is that absolutely every single one of them
345 assumes that there is explicit action (or hard work) required to
346 be taken. None of them - at all - leverage the fact that nmigen
347 has two different inter-related abstraction levels.
348
349 * **Explicit Code**. For every ALU, write code (in nmigen 0.3
350 at the time of writing) which is inherently
351 parallel, rather than duplicated with a series of Switch/Case or
352 an if-elif-elif chain based on what the partition mask is set to.
353 * **Use a low-level class**. Rather than tie in to nmigen, write a
354 class that performs all the equivalent (explicit coded) parallel
355 operations, as close to nmigen Type 1 (AST) constructs as possible.
356 Alternatives to Cat, called SIMDCat, and forego use of Type 2
357 (dsl.Module) nmigen language constructs entirely.
358 * **Wrapper classes**. A dsl.Module wrapper class which duplicates the
359 entirety of the dsl.Module functionality with "redirection" functions
360 and a dsl.Module instance as a member variable. In the rare instances
361 where the function is different it is expected to provide a replacement
362 rather than call the dsl.Module variant.
363 * **Replacement for dsl.Module**. This idea is to implement a full total
364 replacement for dsl.Module, called dsl.SIMDModule (or other).
365 * **Monkey-patching dsl.Module**. This idea intrusively modifies dsl.Module
366 with external functions.
367 * **Compilers / Macros**. On the basis that if this was VHDL or Verilog
368 one technique for creating SIMD variants of the required code would be
369 to use macro substitution or crude compilers to autogenerate the dynamic
370 SIMD from VHDL / Verilog templates, why not do exactly the same thing.
371 Design a SIMD HDL language, write python in that, and have it output
372 python source code with nmigen HDL.
373
374 All of these ideas, unfortunately, are extremely costly in many different ways:
375
376 1. Any overrides of any kind give a catastrophically fatal impression
377 that the high-level language behaviour might in some tiny way be different,
378 purely because of the very *existence* of an override class.
379 This in turn, if prevalent and commonly used, would quickly be
380 seen as a rather nasty indirect unspoken implication that nmigen
381 itself was badly designed, so much so that its users were forced
382 to engage in "unauthorised overrides" in order to achieve what they
383 want and need.
384 2. Proponents of such override mechanisms (and there have been many)
385 fundamentally misunderstand the distinction between Type 1 (AST) low-level
386 and Type 2 (dsl.Module) high-level nmigen language constructs, and
387 miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST
388 (Type 2) constructs.
389 3. Wrapper classes are a maintenance nightmare. Unless included in the
390 library itself, any update to the library being wrapped is a huge risk
391 of breaking the wrapper. Long-term this is completely unsustainable.
392 Likewise monkey-patching.
393 4. Wrapper classes introduce huge performance degradation. Every function
394 requires one additional function call. Every object instantiated requires
395 both the wrapper class to be instantiated and the object being wrapped,
396 every single time. A single wrapper class is never enough: the entire
397 class hierarchy, everything that is ever instantiated by a "wrapped"
398 class instance, also requires wrapping. This quickly
399 gets completely out of hand and diverts developer time into a
400 nightmare time-sink that has actively harmful consequences *even
401 once completed*.
402 Memory requirements double; performance degrades.
403 Both are unacceptable.
404 5. "Explicit coding" is actually more costly even than for-loop
405 code duplication. It is "The Right Thing (tm)" in order to achieve
406 optimal gate-level design but requires hyper-aware and hyper-intelligent
407 developers, whom, if ever lost, take knowledge of the internal workings
408 of ultra-complex code with them.
409 6. "Low-level classes" have in fact already been implemented: over a
410 dozen modules utilised and combined behind PartitionedSignal exist
411 and are already in use. PartitionedMux was one of the very first SIMD-aware
412 "Type 1" AST operators written. The problem is that it's nowhere near
413 enough. Conversion of 100,000 lines of readable nmigen HDL using
414 Type 2 (m.If/Elif) high-level constructs to **not** use those
415 constructs and return to AST Mux, Cat, and so on is a completely
416 unreasonable expectation.
417 7. "Replacements for dsl.Module" aside from coming with the catastrophic
418 implication that they provide alternative behaviour from dsl.Module
419 are a heavy maintenance burden. Not only that but there is the risk
420 that, unless actually included *in* nmigen, the underlying AST modules
421 that they use might change, or be deprecated, or have bugs fixed which
422 break the replacement.
423 8. "Compilers". It is very well known that any python program that outputs
424 another python program as an ASCII output may then immediately read that
425 ASCII output from within the very program that created it, `eval` it,
426 and then execute it. That being the case, you might as well skip the
427 output to ASCII and the eval part, and create classes that
428 dynamically and directly instantiate the desired target behaviour.
429 This saves months of effort but is a common mistake, frequently made.
430 9. "Lost In Translation". ("Out of sight, out of mind" being translated
431 to Russian then English by a Diplomat's translator famously returned
432 the phrase, "Invisible lunatic").
433 Any compiler risks losing information in between translations,
434 making debugging harder, and finding mistakes particularly difficult.
435 With nmigen already being a Language Translator, adding an extra layer
436 seems extremely unwise. And, worse, again relies on nmigen remaining
437 stable, indefinitely.
438
439 Bottom line is that all the alternatives are really quite harmful, costly,
440 and unmaintainable, and in some cases actively damage nmigen's reputation
441 as a stable, useful and powerful HDL.
442