3d_gpu/architecture/dynamic_simd.mdwn

   1 # Dynamic SIMD Library
   2
   3 Funded by NLnet EU Grants
   4
   5 Links
   6
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
   8   (this document)
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
  12 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
  13 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
  14 * <https://bugs.libre-soc.org/show_bug.cgi?id=713> Partition-context-sensitive length adjustment
  15 * <https://bugs.libre-soc.org/show_bug.cgi?id=1024> continuation and completion of Type II
  16   (DSL) astraction, for completing PartitionedSIMD.
  17
  18 # Proposal Summary
  19
  20 The proposal is two-fold, and targetted at the two distinct types
  21 of nmigen language constructs:
  22
  23 * Type 1 low-level AST.  implemented as nmigen.hdl.ast classes:
  24   Value, Cat, Repl, Mux, Switch etc.
  25 * Type 2 high-level DSL. Implemented as Module in nmigen.hdl.dsl
  26
  27 The Type 1 AST low-level proposed modifications mirror and extend the
  28 existing long-established python `operator` module interoperability, which nmigen
  29 *already leverages* by providing `Value.__add__` and other operator
  30 overrides.
  31
  32 * To extend nmigen "Type 1 (ast.*)" low-level language constructs
  33   in the Value class
  34   with `Value.__Cat__`, `Value.__Switch__`, `Value.__Repl__` etc.
  35 * To rename existing old
  36   - `ast.Cat` to new `ast._InternalCat`,
  37   - `ast.Repl` to `ast._InternalRepl`
  38   - etc.
  39 * In an identical conceptual fashion, just
  40   as python `operator.add(x,y)` redirects to `x.__add__(y)`,
  41   to add a new `ast.Cat(x..)` which redirects to `x.__Cat__(...)` etc.
  42 * To add `Value.__Cat__` etc which call the now-renamed `ast._InternalCat`
  43 * To allow all of the new underscore variants to be overridden without
  44   limitation, restriction, or restraint, via UserValue (and ValueCastable)
  45
  46 The second set of changes is targetted at Type 2 dsl.Module,
  47 to complete the 98% abstraction from Type 1 to a 100% level
  48
  49 * To add a new parameter to `dsl.Module`'s constructor,
  50   `_astTypeFn`, which is the AST class type-casting function
  51   to be used for "casting" of m.If and m.Elif condition tests,
  52   and for m.Switch's value.
  53 * To replace the two (2) calls to `Value.cast()` in `dsl.Module`
  54   with `self._astTypefn()`
  55
  56 No further modifications beyond that are strictly necessary.
  57
  58 With `dsl.Module` already being extremely close to 100%
  59 abstracted with respect to Type 1 AST constructs, Liskov
  60 Substitution Principle in combination with UserValue (later,
  61 ValueCastable) combine to provide extremely powerful and
  62 flexible augmentation and capabilities in nmigen, far beyond
  63 its original intended design parameters.
  64
  65 The benefits due to the abstraction extend 100% cleanly to
  66 other use-cases (Partitioned Dynamic SIMD is just the first). Other nmigen
  67 developers may leverage contextual overriding of the AST
  68 constructs in full OO fashion, yet the high-level dsl.Module
  69 language concepts remain true to their intended characteristics
  70 and behaviour and need neither duplication nor alteration.
  71
  72 Typical use-cases are, just as is the driving force behind PartitionedSignal,
  73 the sharing at the gate level of arithmetic primitives that would
  74 otherwise require costly duplication in a final product, or be too
  75 complex or costly to develop without such abstraction at the higher
  76 Type 2 (dsl.Module) level.
  77
  78 Examples include an ALU that may dynamically
  79 at runtime switch between Complex (imaginary) arithmetic
  80 and Real (scalar) arithmetic, whilst *through the same
  81 datapath* automatically sharing the arithmetic logic
  82 gates between the two abstract concepts. Given that
  83 64 bit multipliers are 12,000 to 15,000 gates the savings
  84 can be enormous.
  85
  86 PartitionedSignal is the first concrete third-party example
  87 that inspired the completion of the abstraction of dsl.Module
  88 from ast.*
  89
  90 # Rationale / Introduction
  91
  92 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
  93 of nmigen's Signal.  It is expected to work transparently as if it was
  94 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
  95 no requirement on the part of the developer to even know that it is
  96 performing parallel dynamically-partitioned operations.
  97 All standard nmigen language constructs are both expected and required
  98 to behave exactly as they do for scalar Signals.
  99
 100 nmigen 32-bit Signal:
 101
 102     a        : .... .... .... .... (32 bits, illustrated as 4x8)
 103
 104 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
 105 sections, by 3 partition bits:
 106
 107     partition:     P    P    P     (3 bits)
 108     a        : .... .... .... .... (32 bits, illustrated as 4x8)
 109     exp-a    : ....P....P....P.... (32+3 bits, P=0 if no partition)
 110
 111 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
 112 into all 8 possible combinations of the 3 Partition bits:
 113
 114     exp-a    : ....0....0....0.... 1x 32-bit
 115     exp-a    : ....0....0....1.... 1x 24-bit plus 1x 8-bit
 116     exp-a    : ....0....1....0.... 2x 16-bit
 117     ...
 118     ...
 119     exp-a    : ....1....1....0.... 2x 8-bit, 1x 16-bit
 120     exp-a    : ....1....1....1.... 4x 8-bit
 121
 122 A simple example, a "min" function:
 123
 124     # declare x, y and out as 16-bit scalar Signals
 125     x = Signal(16)
 126     y = Signal(16)
 127     out = Signal(16)
 128
 129     # compare x against y and set output accordingly
 130     with m.If(x < y):
 131         comb += out.eq(x)
 132     with m.Else():
 133         comb += out.eq(y)
 134
 135 This is very straightforward and obvious, that the 16-bit output is the
 136 lesser of the x and y inputs.  We require the exact same obviousness
 137 and under no circumstances any change of any kind to any nmigen language
 138 construct:
 139
 140     # a mask of length 3 indicates a desire to partition Signals at
 141     # 3 points into 4 equally-spaced SIMD "partitions".
 142     mask = Signal(3)
 143
 144     with mask:
 145         # x y and out are all 16-bit so are subdivided at:
 146         # |      mask[0]     mask[1]     mask[3]         |
 147         # |  0-3    |    4-7    |   8-11    |   12-15    |
 148         x = PartitionedSignal(16)    # identical except for mask
 149         y = PartitionedSignal(16)    # identical except for mask
 150         out = PartitionedSignal(16)  # identical except for mask
 151
 152     # all code here is required to be absolutely identical to the
 153     # scalar case, and identical in nmigen language behaviour in
 154     # every way.  no changes to the nmigen language or its use
 155     # are permitted
 156
 157     with m.If(x < y):       # exactly
 158         comb += out.eq(x)   # the
 159     with m.Else():          # same
 160         comb += out.eq(y)   # code
 161
 162 The purpose of PartitionedSignal is therefore to provide full 100%
 163 transparent SIMD run-time dynamic behaviour as far as end-usage is
 164 concerned.
 165
 166 The alternative is absolutely awful and completely unacceptable
 167 for both maintenance cost and development cost.  Even the most
 168 basic HDL is literally an order of magnitude larger:
 169
 170     # declare x, y and out as 16-bit scalar Signals
 171     x = Signal(16)
 172     y = Signal(16)
 173     out = Signal(16)
 174
 175     # start an absolutely awful unmaintainable duplication of
 176     # SIMD behaviour.
 177     with m.If(mask == 0b111): # 1x 16-bit
 178        # compare x against y and set output accordingly
 179        with m.If(x < y):
 180            comb += out.eq(x)
 181        with m.Else():
 182            comb += out.eq(y)
 183     with m.ElIf(mask == 0b101): # 2x 8-bit
 184        for i in range(2):
 185            xh = x[i*8:(i+1)*8]
 186            yh = y[i*8:(i+1)*8]
 187            outh = out[i*8:(i+1)*8]
 188            # compare halves of x against halves y and set
 189            # halves of output accordingly
 190            with m.If(xh < yh):
 191                comb += outh.eq(xh)
 192            with m.Else():
 193                comb += outh.eq(yh)
 194     with m.ElIf(mask == 0b000): # 4x 4-bit
 195         ....
 196     with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
 197         ....
 198     with m.ElIf(....)
 199         ....
 200     with m.ElIf(....)
 201         ....
 202     with m.ElIf(....)
 203         ....
 204
 205 # Overview
 206
 207 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
 208
 209 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
 210 These "breaks" may be set at runtime at any time.
 211
 212 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs.  Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
 213
 214 # Type 1 (AST) nmigen constructs
 215
 216 nmigen's `ast.Value` already provides operator overloads for arithmetic
 217 operations such that natural standard python syntax `(a+b)` may be used.
 218 This proposal extends that concept naturally to allow further contextual
 219 overrides of the remaining nmigen Type 1 AST constructs: Cat, Repl,
 220 Mux, Switch, Part etc.
 221
 222 Instead of these being defined exclusively as global functions which cannot comprehend Object-Oriented
 223 context, Cat, Switch and Mux etc are redefined to behave as analogues
 224 of the python `operator` module. `nmigen.hdl.ast.Mux(sel,val2,val1)`
 225 therefore calls `sel.__Mux__(val2,val1)` and so on.
 226
 227 As an example of a fully-functioning use of this enhancement,
 228 pages below describe the basic features of each and track the relevant bugreports.  These features here are the building blocks which lie behind
 229 PartitionedSignal, which in turn provides "Type 1 (ast.*)" nmigen language
 230 constructs in the form described above.
 231
 232 * [[dynamic_simd/simdscope]] a Context Manager which allows signals
 233   to look like nmigen Signal
 234 * [[dynamic_simd/shape]] a derivative of ast.Shape with partition info
 235 * [[dynamic_simd/assign]] nmigen eq (ast.Assign)
 236 * [[dynamic_simd/cat]] nmigen ast.Cat
 237 * [[dynamic_simd/repl]] nmigen ast.Repl
 238 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
 239 * [[dynamic_simd/gt]] aka `__gt__` in python operator terms
 240 * [[dynamic_simd/add]]
 241 * [[dynamic_simd/mul]]
 242 * [[dynamic_simd/shift]]
 243 * [[dynamic_simd/logicops]] Horizontal reduction: some all xor bool
 244 * [[dynamic_simd/slice]] nmigen ast.Slice
 245
 246 # Integration with nmigen: "Type 2" (dsl.Module)
 247
 248 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
 249
 250 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
 251
 252     if partitions == 1x64
 253          with m.If(x > y):
 254               do something
 255     elif partitions == 2x32:
 256          with m.If(x[0:31] > y[0:31]):
 257               do something on 1st half
 258          elif ...
 259     elif ...
 260     # many more lines of repeated laborious hand written
 261     # SIMD nonsense all exactly the same except for the
 262     # for loop and sizes.
 263
 264 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
 265
 266 A much more intelligent approach is needed. What we actually want is:
 267
 268     with m.If(x > y): # do a partitioned compare here
 269          do something dynamic here
 270
 271 where *behind the scenes* the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
 272
 273 This means one of two things:
 274
 275 1. that nmigen needs to "understand" the partitioning, in a Type 2
 276   construct (m.If, m.Else and m.Switch), at the bare minimum.
 277 2. that nmigen's "Type 2" language constructs be sufficiently abstract
 278   that they *pass through* SIMD behaviour entirely to a lower level,
 279   *completely intact*.
 280
 281 Mathod (1) is an alarmingly large amount of work with severe to
 282 catastrophic implications in multiple areas.
 283
 284 Method (2) by complete contrast allows the Liskov Substitution Principle to
 285 be put to surprisingly elegant effect. If that was possible to
 286 achieve then **almost no modifications to nmigen would
 287 be required** because dsl.Module is *already* 99% abstracted in terms
 288 of the lower-level Type 1 (ast.*) constructs.
 289
 290 Analysis of the internals of nmigen shows that m.If, m.Else, m.FSM and m.Switch are all redirected to ast.py `Switch`.  Within that ast.Switch
 291 function only ast.Mux and other Type 1 (AST) "global" functions
 292 similar to python operator are used.  The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
 293
 294 <https://gitlab.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
 295
 296 A deeper analysis shows that dsl.Module uses explicit Value.cast on its
 297 If, Elif, and Switch clauses. Overriding that and allowing a cast to
 298 a PartitionedSignal.cast (or, PartitionedBool.cast) would be sufficient
 299 to make Type 2 (dsl.Module) nmigen language constructs 100% abstracted from the Type 1 (ast) lower-level ones.
 300
 301 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement.  With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
 302
 303 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions.  Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
 304
 305 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column.  Take the following example:
 306
 307      mask = Signal(3) # creates four partitions
 308      a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
 309      b = PartitionedSignal(mask, 4) # likewise
 310      c = PartitionedSignal(mask, 32)
 311      d = PartitionedSignal(mask, 32)
 312      o = PartitionedSignal(mask, 32)
 313
 314      with m.If(a):
 315          comb += o.eq(c)
 316      with m.Elif(b):
 317          comb += o.eq(d)
 318
 319 If these were ordinary Signals, they would be translated to a Switch where:
 320
 321 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
 322 * cases would be (quantity 2) "1-" and "-1" in order to match
 323   against the first binary test bit of Cat(a, b) and the second,
 324   respectively.
 325 * the first case would be "1-" to activate `o.eq(c)
 326 * the second case would be "-1" to activate o.eq(d)
 327
 328 A parallel variant may thus perform a for-loop, creating four
 329 **independent** Switches:
 330
 331 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
 332 * take the output of each case result `o[0].eq[c[0])` and
 333   so on
 334 * create the first independent Switch
 335 * take a[1] and b[1] etc.
 336
 337 There are several ways in which the parts of each case, when
 338 activated, can be split up: temporary Signals, analysing
 339 the AST, or using PartitionedMux.
 340
 341 # Alternative implementation concepts
 342
 343 Several alternative ideas have been proposed. They are listed here for
 344 completeness.  The worst offender (use code-duplication) has already been
 345 raised.
 346
 347 The fundamental common characteristic of all of these ideas, despite
 348 their disparate nature, is that absolutely every single one of them
 349 assumes that there is explicit action (or hard work) required to
 350 be taken. None of them - at all - leverage the fact that nmigen
 351 has two different inter-related abstraction levels.
 352
 353 * **Explicit Code**. For every ALU, write code (in nmigen 0.3
 354   at the time of writing) which is inherently
 355   parallel, rather than duplicated with a series of Switch/Case or
 356   an if-elif-elif chain based on what the partition mask is set to.
 357 * **Use a low-level class**.  Rather than tie in to nmigen, write a
 358   class that performs all the equivalent (explicit coded) parallel
 359   operations, as close to nmigen Type 1 (AST) constructs as possible.
 360   Alternatives to Cat, called SIMDCat, and forego use of Type 2
 361   (dsl.Module) nmigen language constructs entirely.
 362 * **Wrapper classes**. A dsl.Module wrapper class which duplicates the
 363   entirety of the dsl.Module functionality with "redirection" functions
 364   and a dsl.Module instance as a member variable.  In the rare instances
 365   where the function is different it is expected to provide a replacement
 366   rather than call the dsl.Module variant.
 367 * **Replacement for dsl.Module**. This idea is to implement a full total
 368   replacement for dsl.Module, called dsl.SIMDModule (or other).
 369 * **Monkey-patching dsl.Module**. This idea intrusively modifies dsl.Module
 370   with external functions.
 371 * **Compilers / Macros**. On the basis that if this was VHDL or Verilog
 372   one technique for creating SIMD variants of the required code would be
 373   to use macro substitution or crude compilers to autogenerate the dynamic
 374   SIMD from VHDL / Verilog templates, why not do exactly the same thing.
 375   Design a SIMD HDL language, write python in that, and have it output
 376   python source code with nmigen HDL.
 377
 378 All of these ideas, unfortunately, are extremely costly in many different ways:
 379
 380 1. Any overrides of any kind give a catastrophically fatal impression
 381   that the high-level language behaviour might in some tiny way be different,
 382   purely because of the very *existence* of an override class.
 383   This in turn, if prevalent and commonly used, would quickly be
 384   seen as a rather nasty indirect unspoken implication that nmigen
 385   itself was badly designed, so much so that its users were forced
 386   to engage in "unauthorised overrides" in order to achieve what they
 387   want and need.
 388 2. Proponents of such override mechanisms (and there have been many)
 389   fundamentally misunderstand the distinction between Type 1 (AST) low-level
 390   and Type 2 (dsl.Module) high-level nmigen language constructs, and
 391   miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST
 392   (Type 2) constructs.
 393 3. Wrapper classes are a maintenance nightmare. Unless included in the
 394   library itself, any update to the library being wrapped is a huge risk
 395   of breaking the wrapper.  Long-term this is completely unsustainable.
 396   Likewise monkey-patching.
 397 4. Wrapper classes introduce huge performance degradation. Every function
 398   requires one additional function call. Every object instantiated requires
 399   both the wrapper class to be instantiated and the object being wrapped,
 400   every single time.  A single wrapper class is never enough: the entire
 401   class hierarchy, everything that is ever instantiated by a "wrapped"
 402   class instance, also requires wrapping.  This quickly
 403   gets completely out of hand and diverts developer time into a
 404   nightmare time-sink that has actively harmful consequences *even
 405   once completed*.
 406   Memory requirements double; performance degrades.
 407   Both are unacceptable.
 408 5. "Explicit coding" is actually more costly even than for-loop
 409   code duplication.  It is "The Right Thing (tm)" in order to achieve
 410   optimal gate-level design but requires hyper-aware and hyper-intelligent
 411   developers, whom, if ever lost, take knowledge of the internal workings
 412   of ultra-complex code with them.
 413 6. "Low-level classes" have in fact already been implemented: over a
 414   dozen modules utilised and combined behind PartitionedSignal exist
 415   and are already in use. PartitionedMux was one of the very first SIMD-aware
 416   "Type 1" AST operators written.  The problem is that it's nowhere near
 417   enough.  Conversion of 100,000 lines of readable nmigen HDL using
 418   Type 2 (m.If/Elif) high-level constructs to **not** use those
 419   constructs and return to AST Mux, Cat, and so on is a completely
 420   unreasonable expectation.
 421 7. "Replacements for dsl.Module" aside from coming with the catastrophic
 422   implication that they provide alternative behaviour from dsl.Module
 423   are a heavy maintenance burden.  Not only that but there is the risk
 424   that, unless actually included *in* nmigen, the underlying AST modules
 425   that they use might change, or be deprecated, or have bugs fixed which
 426   break the replacement.
 427 8. "Compilers". It is very well known that any python program that outputs
 428   another python program as an ASCII output may then immediately read that
 429   ASCII output from within the very program that created it, `eval` it,
 430   and then execute it.  That being the case, you might as well skip the
 431   output to ASCII and the eval part, and create classes that
 432   dynamically and directly instantiate the desired target behaviour.
 433   This saves months of effort but is a common mistake, frequently made.
 434 9. "Lost In Translation". ("Out of sight, out of mind" being translated
 435   to Russian then English by a Diplomat's translator famously returned
 436   the phrase, "Invisible lunatic").
 437   Any compiler risks losing information in between translations,
 438   making debugging harder, and finding mistakes particularly difficult.
 439   With nmigen already being a Language Translator, adding an extra layer
 440   seems extremely unwise.  And, worse, again relies on nmigen remaining
 441   stable, indefinitely.
 442
 443 Bottom line is that all the alternatives are really quite harmful, costly,
 444 and unmaintainable, and in some cases actively damage nmigen's reputation
 445 as a stable, useful and powerful HDL.
 446