add svstep, start on ls008
[libreriscv.git] / 3d_gpu / architecture / dynamic_simd.mdwn
1 # Dynamic SIMD Library
2
3 Funded by NLnet EU Grants
4
5 Links
6
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
8 (this document)
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
12 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
13 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=713> Partition-context-sensitive length adjustment
15 * <https://bugs.libre-soc.org/show_bug.cgi?id=1024> continuation and completion of Type II
16 (DSL) astraction, for completing PartitionedSIMD.
17
18 # Proposal Summary
19
20 The proposal is two-fold, and targetted at the two distinct types
21 of nmigen language constructs:
22
23 * Type 1 low-level AST. implemented as nmigen.hdl.ast classes:
24 Value, Cat, Repl, Mux, Switch etc.
25 * Type 2 high-level DSL. Implemented as Module in nmigen.hdl.dsl
26
27 The Type 1 AST low-level proposed modifications mirror and extend the
28 existing long-established python `operator` module interoperability, which nmigen
29 *already leverages* by providing `Value.__add__` and other operator
30 overrides.
31
32 * To extend nmigen "Type 1 (ast.*)" low-level language constructs
33 in the Value class
34 with `Value.__Cat__`, `Value.__Switch__`, `Value.__Repl__` etc.
35 * To rename existing old
36 - `ast.Cat` to new `ast._InternalCat`,
37 - `ast.Repl` to `ast._InternalRepl`
38 - etc.
39 * In an identical conceptual fashion, just
40 as python `operator.add(x,y)` redirects to `x.__add__(y)`,
41 to add a new `ast.Cat(x..)` which redirects to `x.__Cat__(...)` etc.
42 * To add `Value.__Cat__` etc which call the now-renamed `ast._InternalCat`
43 * To allow all of the new underscore variants to be overridden without
44 limitation, restriction, or restraint, via UserValue (and ValueCastable)
45
46 The second set of changes is targetted at Type 2 dsl.Module,
47 to complete the 98% abstraction from Type 1 to a 100% level
48
49 * To add a new parameter to `dsl.Module`'s constructor,
50 `_astTypeFn`, which is the AST class type-casting function
51 to be used for "casting" of m.If and m.Elif condition tests,
52 and for m.Switch's value.
53 * To replace the two (2) calls to `Value.cast()` in `dsl.Module`
54 with `self._astTypefn()`
55
56 No further modifications beyond that are strictly necessary.
57
58 With `dsl.Module` already being extremely close to 100%
59 abstracted with respect to Type 1 AST constructs, Liskov
60 Substitution Principle in combination with UserValue (later,
61 ValueCastable) combine to provide extremely powerful and
62 flexible augmentation and capabilities in nmigen, far beyond
63 its original intended design parameters.
64
65 The benefits due to the abstraction extend 100% cleanly to
66 other use-cases (Partitioned Dynamic SIMD is just the first). Other nmigen
67 developers may leverage contextual overriding of the AST
68 constructs in full OO fashion, yet the high-level dsl.Module
69 language concepts remain true to their intended characteristics
70 and behaviour and need neither duplication nor alteration.
71
72 Typical use-cases are, just as is the driving force behind PartitionedSignal,
73 the sharing at the gate level of arithmetic primitives that would
74 otherwise require costly duplication in a final product, or be too
75 complex or costly to develop without such abstraction at the higher
76 Type 2 (dsl.Module) level.
77
78 Examples include an ALU that may dynamically
79 at runtime switch between Complex (imaginary) arithmetic
80 and Real (scalar) arithmetic, whilst *through the same
81 datapath* automatically sharing the arithmetic logic
82 gates between the two abstract concepts. Given that
83 64 bit multipliers are 12,000 to 15,000 gates the savings
84 can be enormous.
85
86 PartitionedSignal is the first concrete third-party example
87 that inspired the completion of the abstraction of dsl.Module
88 from ast.*
89
90 # Rationale / Introduction
91
92 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
93 of nmigen's Signal. It is expected to work transparently as if it was
94 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
95 no requirement on the part of the developer to even know that it is
96 performing parallel dynamically-partitioned operations.
97 All standard nmigen language constructs are both expected and required
98 to behave exactly as they do for scalar Signals.
99
100 nmigen 32-bit Signal:
101
102 a : .... .... .... .... (32 bits, illustrated as 4x8)
103
104 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
105 sections, by 3 partition bits:
106
107 partition: P P P (3 bits)
108 a : .... .... .... .... (32 bits, illustrated as 4x8)
109 exp-a : ....P....P....P.... (32+3 bits, P=0 if no partition)
110
111 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
112 into all 8 possible combinations of the 3 Partition bits:
113
114 exp-a : ....0....0....0.... 1x 32-bit
115 exp-a : ....0....0....1.... 1x 24-bit plus 1x 8-bit
116 exp-a : ....0....1....0.... 2x 16-bit
117 ...
118 ...
119 exp-a : ....1....1....0.... 2x 8-bit, 1x 16-bit
120 exp-a : ....1....1....1.... 4x 8-bit
121
122 A simple example, a "min" function:
123
124 # declare x, y and out as 16-bit scalar Signals
125 x = Signal(16)
126 y = Signal(16)
127 out = Signal(16)
128
129 # compare x against y and set output accordingly
130 with m.If(x < y):
131 comb += out.eq(x)
132 with m.Else():
133 comb += out.eq(y)
134
135 This is very straightforward and obvious, that the 16-bit output is the
136 lesser of the x and y inputs. We require the exact same obviousness
137 and under no circumstances any change of any kind to any nmigen language
138 construct:
139
140 # a mask of length 3 indicates a desire to partition Signals at
141 # 3 points into 4 equally-spaced SIMD "partitions".
142 mask = Signal(3)
143
144 with mask:
145 # x y and out are all 16-bit so are subdivided at:
146 # | mask[0] mask[1] mask[3] |
147 # | 0-3 | 4-7 | 8-11 | 12-15 |
148 x = PartitionedSignal(16) # identical except for mask
149 y = PartitionedSignal(16) # identical except for mask
150 out = PartitionedSignal(16) # identical except for mask
151
152 # all code here is required to be absolutely identical to the
153 # scalar case, and identical in nmigen language behaviour in
154 # every way. no changes to the nmigen language or its use
155 # are permitted
156
157 with m.If(x < y): # exactly
158 comb += out.eq(x) # the
159 with m.Else(): # same
160 comb += out.eq(y) # code
161
162 The purpose of PartitionedSignal is therefore to provide full 100%
163 transparent SIMD run-time dynamic behaviour as far as end-usage is
164 concerned.
165
166 The alternative is absolutely awful and completely unacceptable
167 for both maintenance cost and development cost. Even the most
168 basic HDL is literally an order of magnitude larger:
169
170 # declare x, y and out as 16-bit scalar Signals
171 x = Signal(16)
172 y = Signal(16)
173 out = Signal(16)
174
175 # start an absolutely awful unmaintainable duplication of
176 # SIMD behaviour.
177 with m.If(mask == 0b111): # 1x 16-bit
178 # compare x against y and set output accordingly
179 with m.If(x < y):
180 comb += out.eq(x)
181 with m.Else():
182 comb += out.eq(y)
183 with m.ElIf(mask == 0b101): # 2x 8-bit
184 for i in range(2):
185 xh = x[i*8:(i+1)*8]
186 yh = y[i*8:(i+1)*8]
187 outh = out[i*8:(i+1)*8]
188 # compare halves of x against halves y and set
189 # halves of output accordingly
190 with m.If(xh < yh):
191 comb += outh.eq(xh)
192 with m.Else():
193 comb += outh.eq(yh)
194 with m.ElIf(mask == 0b000): # 4x 4-bit
195 ....
196 with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
197 ....
198 with m.ElIf(....)
199 ....
200 with m.ElIf(....)
201 ....
202 with m.ElIf(....)
203 ....
204
205 # Overview
206
207 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
208
209 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
210 These "breaks" may be set at runtime at any time.
211
212 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs. Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
213
214 # Type 1 (AST) nmigen constructs
215
216 nmigen's `ast.Value` already provides operator overloads for arithmetic
217 operations such that natural standard python syntax `(a+b)` may be used.
218 This proposal extends that concept naturally to allow further contextual
219 overrides of the remaining nmigen Type 1 AST constructs: Cat, Repl,
220 Mux, Switch, Part etc.
221
222 Instead of these being defined exclusively as global functions which cannot comprehend Object-Oriented
223 context, Cat, Switch and Mux etc are redefined to behave as analogues
224 of the python `operator` module. `nmigen.hdl.ast.Mux(sel,val2,val1)`
225 therefore calls `sel.__Mux__(val2,val1)` and so on.
226
227 As an example of a fully-functioning use of this enhancement,
228 pages below describe the basic features of each and track the relevant bugreports. These features here are the building blocks which lie behind
229 PartitionedSignal, which in turn provides "Type 1 (ast.*)" nmigen language
230 constructs in the form described above.
231
232 * [[dynamic_simd/simdscope]] a Context Manager which allows signals
233 to look like nmigen Signal
234 * [[dynamic_simd/shape]] a derivative of ast.Shape with partition info
235 * [[dynamic_simd/assign]] nmigen eq (ast.Assign)
236 * [[dynamic_simd/cat]] nmigen ast.Cat
237 * [[dynamic_simd/repl]] nmigen ast.Repl
238 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
239 * [[dynamic_simd/gt]] aka `__gt__` in python operator terms
240 * [[dynamic_simd/add]]
241 * [[dynamic_simd/mul]]
242 * [[dynamic_simd/shift]]
243 * [[dynamic_simd/logicops]] Horizontal reduction: some all xor bool
244 * [[dynamic_simd/slice]] nmigen ast.Slice
245
246 # Integration with nmigen: "Type 2" (dsl.Module)
247
248 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
249
250 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
251
252 if partitions == 1x64
253 with m.If(x > y):
254 do something
255 elif partitions == 2x32:
256 with m.If(x[0:31] > y[0:31]):
257 do something on 1st half
258 elif ...
259 elif ...
260 # many more lines of repeated laborious hand written
261 # SIMD nonsense all exactly the same except for the
262 # for loop and sizes.
263
264 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
265
266 A much more intelligent approach is needed. What we actually want is:
267
268 with m.If(x > y): # do a partitioned compare here
269 do something dynamic here
270
271 where *behind the scenes* the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
272
273 This means one of two things:
274
275 1. that nmigen needs to "understand" the partitioning, in a Type 2
276 construct (m.If, m.Else and m.Switch), at the bare minimum.
277 2. that nmigen's "Type 2" language constructs be sufficiently abstract
278 that they *pass through* SIMD behaviour entirely to a lower level,
279 *completely intact*.
280
281 Mathod (1) is an alarmingly large amount of work with severe to
282 catastrophic implications in multiple areas.
283
284 Method (2) by complete contrast allows the Liskov Substitution Principle to
285 be put to surprisingly elegant effect. If that was possible to
286 achieve then **almost no modifications to nmigen would
287 be required** because dsl.Module is *already* 99% abstracted in terms
288 of the lower-level Type 1 (ast.*) constructs.
289
290 Analysis of the internals of nmigen shows that m.If, m.Else, m.FSM and m.Switch are all redirected to ast.py `Switch`. Within that ast.Switch
291 function only ast.Mux and other Type 1 (AST) "global" functions
292 similar to python operator are used. The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
293
294 <https://gitlab.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
295
296 A deeper analysis shows that dsl.Module uses explicit Value.cast on its
297 If, Elif, and Switch clauses. Overriding that and allowing a cast to
298 a PartitionedSignal.cast (or, PartitionedBool.cast) would be sufficient
299 to make Type 2 (dsl.Module) nmigen language constructs 100% abstracted from the Type 1 (ast) lower-level ones.
300
301 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement. With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
302
303 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions. Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
304
305 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column. Take the following example:
306
307 mask = Signal(3) # creates four partitions
308 a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
309 b = PartitionedSignal(mask, 4) # likewise
310 c = PartitionedSignal(mask, 32)
311 d = PartitionedSignal(mask, 32)
312 o = PartitionedSignal(mask, 32)
313
314 with m.If(a):
315 comb += o.eq(c)
316 with m.Elif(b):
317 comb += o.eq(d)
318
319 If these were ordinary Signals, they would be translated to a Switch where:
320
321 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
322 * cases would be (quantity 2) "1-" and "-1" in order to match
323 against the first binary test bit of Cat(a, b) and the second,
324 respectively.
325 * the first case would be "1-" to activate `o.eq(c)
326 * the second case would be "-1" to activate o.eq(d)
327
328 A parallel variant may thus perform a for-loop, creating four
329 **independent** Switches:
330
331 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
332 * take the output of each case result `o[0].eq[c[0])` and
333 so on
334 * create the first independent Switch
335 * take a[1] and b[1] etc.
336
337 There are several ways in which the parts of each case, when
338 activated, can be split up: temporary Signals, analysing
339 the AST, or using PartitionedMux.
340
341 # Alternative implementation concepts
342
343 Several alternative ideas have been proposed. They are listed here for
344 completeness. The worst offender (use code-duplication) has already been
345 raised.
346
347 The fundamental common characteristic of all of these ideas, despite
348 their disparate nature, is that absolutely every single one of them
349 assumes that there is explicit action (or hard work) required to
350 be taken. None of them - at all - leverage the fact that nmigen
351 has two different inter-related abstraction levels.
352
353 * **Explicit Code**. For every ALU, write code (in nmigen 0.3
354 at the time of writing) which is inherently
355 parallel, rather than duplicated with a series of Switch/Case or
356 an if-elif-elif chain based on what the partition mask is set to.
357 * **Use a low-level class**. Rather than tie in to nmigen, write a
358 class that performs all the equivalent (explicit coded) parallel
359 operations, as close to nmigen Type 1 (AST) constructs as possible.
360 Alternatives to Cat, called SIMDCat, and forego use of Type 2
361 (dsl.Module) nmigen language constructs entirely.
362 * **Wrapper classes**. A dsl.Module wrapper class which duplicates the
363 entirety of the dsl.Module functionality with "redirection" functions
364 and a dsl.Module instance as a member variable. In the rare instances
365 where the function is different it is expected to provide a replacement
366 rather than call the dsl.Module variant.
367 * **Replacement for dsl.Module**. This idea is to implement a full total
368 replacement for dsl.Module, called dsl.SIMDModule (or other).
369 * **Monkey-patching dsl.Module**. This idea intrusively modifies dsl.Module
370 with external functions.
371 * **Compilers / Macros**. On the basis that if this was VHDL or Verilog
372 one technique for creating SIMD variants of the required code would be
373 to use macro substitution or crude compilers to autogenerate the dynamic
374 SIMD from VHDL / Verilog templates, why not do exactly the same thing.
375 Design a SIMD HDL language, write python in that, and have it output
376 python source code with nmigen HDL.
377
378 All of these ideas, unfortunately, are extremely costly in many different ways:
379
380 1. Any overrides of any kind give a catastrophically fatal impression
381 that the high-level language behaviour might in some tiny way be different,
382 purely because of the very *existence* of an override class.
383 This in turn, if prevalent and commonly used, would quickly be
384 seen as a rather nasty indirect unspoken implication that nmigen
385 itself was badly designed, so much so that its users were forced
386 to engage in "unauthorised overrides" in order to achieve what they
387 want and need.
388 2. Proponents of such override mechanisms (and there have been many)
389 fundamentally misunderstand the distinction between Type 1 (AST) low-level
390 and Type 2 (dsl.Module) high-level nmigen language constructs, and
391 miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST
392 (Type 2) constructs.
393 3. Wrapper classes are a maintenance nightmare. Unless included in the
394 library itself, any update to the library being wrapped is a huge risk
395 of breaking the wrapper. Long-term this is completely unsustainable.
396 Likewise monkey-patching.
397 4. Wrapper classes introduce huge performance degradation. Every function
398 requires one additional function call. Every object instantiated requires
399 both the wrapper class to be instantiated and the object being wrapped,
400 every single time. A single wrapper class is never enough: the entire
401 class hierarchy, everything that is ever instantiated by a "wrapped"
402 class instance, also requires wrapping. This quickly
403 gets completely out of hand and diverts developer time into a
404 nightmare time-sink that has actively harmful consequences *even
405 once completed*.
406 Memory requirements double; performance degrades.
407 Both are unacceptable.
408 5. "Explicit coding" is actually more costly even than for-loop
409 code duplication. It is "The Right Thing (tm)" in order to achieve
410 optimal gate-level design but requires hyper-aware and hyper-intelligent
411 developers, whom, if ever lost, take knowledge of the internal workings
412 of ultra-complex code with them.
413 6. "Low-level classes" have in fact already been implemented: over a
414 dozen modules utilised and combined behind PartitionedSignal exist
415 and are already in use. PartitionedMux was one of the very first SIMD-aware
416 "Type 1" AST operators written. The problem is that it's nowhere near
417 enough. Conversion of 100,000 lines of readable nmigen HDL using
418 Type 2 (m.If/Elif) high-level constructs to **not** use those
419 constructs and return to AST Mux, Cat, and so on is a completely
420 unreasonable expectation.
421 7. "Replacements for dsl.Module" aside from coming with the catastrophic
422 implication that they provide alternative behaviour from dsl.Module
423 are a heavy maintenance burden. Not only that but there is the risk
424 that, unless actually included *in* nmigen, the underlying AST modules
425 that they use might change, or be deprecated, or have bugs fixed which
426 break the replacement.
427 8. "Compilers". It is very well known that any python program that outputs
428 another python program as an ASCII output may then immediately read that
429 ASCII output from within the very program that created it, `eval` it,
430 and then execute it. That being the case, you might as well skip the
431 output to ASCII and the eval part, and create classes that
432 dynamically and directly instantiate the desired target behaviour.
433 This saves months of effort but is a common mistake, frequently made.
434 9. "Lost In Translation". ("Out of sight, out of mind" being translated
435 to Russian then English by a Diplomat's translator famously returned
436 the phrase, "Invisible lunatic").
437 Any compiler risks losing information in between translations,
438 making debugging harder, and finding mistakes particularly difficult.
439 With nmigen already being a Language Translator, adding an extra layer
440 seems extremely unwise. And, worse, again relies on nmigen remaining
441 stable, indefinitely.
442
443 Bottom line is that all the alternatives are really quite harmful, costly,
444 and unmaintainable, and in some cases actively damage nmigen's reputation
445 as a stable, useful and powerful HDL.
446