3d_gpu/architecture/dynamic_simd.mdwn

   1 # Dynamic SIMD Library
   2
   3 Funded by NLnet EU Grants
   4
   5 Links
   6
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
   8   (this document)
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
  12 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
  13 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
  14 * <https://bugs.libre-soc.org/show_bug.cgi?id=713> Partition-context-sensitive length adjustment
  15
  16 # Proposal Summary
  17
  18 The proposal is two-fold, and targetted at the two distinct types
  19 of nmigen language constructs:
  20
  21 * Type 1 low-level AST.  implemented as nmigen.hdl.ast classes:
  22   Value, Cat, Repl, Mux, Switch etc.
  23 * Type 2 high-level DSL. Implemented as Module in nmigen.hdl.dsl
  24
  25 The Type 1 AST low-level proposed modifications mirror and extend the
  26 existing long-established python `operator` module interoperability, which nmigen
  27 *already leverages* by providing `Value.__add__` and other operator
  28 overrides.
  29
  30 * To extend nmigen "Type 1 (ast.*)" low-level language constructs
  31   in the Value class
  32   with `Value.__Cat__`, `Value.__Switch__`, `Value.__Repl__` etc.
  33 * To rename existing old
  34   - `ast.Cat` to new `ast._InternalCat`,
  35   - `ast.Repl` to `ast._InternalRepl`
  36   - etc.
  37 * In an identical conceptual fashion, just
  38   as python `operator.add(x,y)` redirects to `x.__add__(y)`,
  39   to add a new `ast.Cat(x..)` which redirects to `x.__Cat__(...)` etc.
  40 * To add `Value.__Cat__` etc which call the now-renamed `ast._InternalCat`
  41 * To allow all of the new underscore variants to be overridden without
  42   limitation, restriction, or restraint, via UserValue (and ValueCastable)
  43
  44 The second set of changes is targetted at Type 2 dsl.Module,
  45 to complete the 98% abstraction from Type 1 to a 100% level
  46
  47 * To add a new parameter to `dsl.Module`'s constructor,
  48   `_astTypeFn`, which is the AST class type-casting function
  49   to be used for "casting" of m.If and m.Elif condition tests,
  50   and for m.Switch's value.
  51 * To replace the two (2) calls to `Value.cast()` in `dsl.Module`
  52   with `self._astTypefn()`
  53
  54 No further modifications beyond that are strictly necessary.
  55
  56 With `dsl.Module` already being extremely close to 100%
  57 abstracted with respect to Type 1 AST constructs, Liskov
  58 Substitution Principle in combination with UserValue (later,
  59 ValueCastable) combine to provide extremely powerful and
  60 flexible augmentation and capabilities in nmigen, far beyond
  61 its original intended design parameters.
  62
  63 The benefits due to the abstraction extend 100% cleanly to
  64 other use-cases (Partitioned Dynamic SIMD is just the first). Other nmigen
  65 developers may leverage contextual overriding of the AST
  66 constructs in full OO fashion, yet the high-level dsl.Module
  67 language concepts remain true to their intended characteristics
  68 and behaviour and need neither duplication nor alteration.
  69
  70 Typical use-cases are, just as is the driving force behind PartitionedSignal,
  71 the sharing at the gate level of arithmetic primitives that would
  72 otherwise require costly duplication in a final product, or be too
  73 complex or costly to develop without such abstraction at the higher
  74 Type 2 (dsl.Module) level.
  75
  76 Examples include an ALU that may dynamically
  77 at runtime switch between Complex (imaginary) arithmetic
  78 and Real (scalar) arithmetic, whilst *through the same
  79 datapath* automatically sharing the arithmetic logic
  80 gates between the two abstract concepts. Given that
  81 64 bit multipliers are 12,000 to 15,000 gates the savings
  82 can be enormous.
  83
  84 PartitionedSignal is the first concrete third-party example
  85 that inspired the completion of the abstraction of dsl.Module
  86 from ast.*
  87
  88 # Rationale / Introduction
  89
  90 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
  91 of nmigen's Signal.  It is expected to work transparently as if it was
  92 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
  93 no requirement on the part of the developer to even know that it is
  94 performing parallel dynamically-partitioned operations.
  95 All standard nmigen language constructs are both expected and required
  96 to behave exactly as they do for scalar Signals.
  97
  98 nmigen 32-bit Signal:
  99
 100     a        : .... .... .... .... (32 bits, illustrated as 4x8)
 101
 102 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
 103 sections, by 3 partition bits:
 104
 105     partition:     P    P    P     (3 bits)
 106     a        : .... .... .... .... (32 bits, illustrated as 4x8)
 107     exp-a    : ....P....P....P.... (32+3 bits, P=0 if no partition)
 108
 109 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
 110 into all 8 possible combinations of the 3 Partition bits:
 111
 112     exp-a    : ....0....0....0.... 1x 32-bit
 113     exp-a    : ....0....0....1.... 1x 24-bit plus 1x 8-bit
 114     exp-a    : ....0....1....0.... 2x 16-bit
 115     ...
 116     ...
 117     exp-a    : ....1....1....0.... 2x 8-bit, 1x 16-bit
 118     exp-a    : ....1....1....1.... 4x 8-bit
 119
 120 A simple example, a "min" function:
 121
 122     # declare x, y and out as 16-bit scalar Signals
 123     x = Signal(16)
 124     y = Signal(16)
 125     out = Signal(16)
 126
 127     # compare x against y and set output accordingly
 128     with m.If(x < y):
 129         comb += out.eq(x)
 130     with m.Else():
 131         comb += out.eq(y)
 132
 133 This is very straightforward and obvious, that the 16-bit output is the
 134 lesser of the x and y inputs.  We require the exact same obviousness
 135 and under no circumstances any change of any kind to any nmigen language
 136 construct:
 137
 138     # a mask of length 3 indicates a desire to partition Signals at
 139     # 3 points into 4 equally-spaced SIMD "partitions".
 140     mask = Signal(3)
 141
 142     with mask:
 143         # x y and out are all 16-bit so are subdivided at:
 144         # |      mask[0]     mask[1]     mask[3]         |
 145         # |  0-3    |    4-7    |   8-11    |   12-15    |
 146         x = PartitionedSignal(16)    # identical except for mask
 147         y = PartitionedSignal(16)    # identical except for mask
 148         out = PartitionedSignal(16)  # identical except for mask
 149
 150     # all code here is required to be absolutely identical to the
 151     # scalar case, and identical in nmigen language behaviour in
 152     # every way.  no changes to the nmigen language or its use
 153     # are permitted
 154
 155     with m.If(x < y):       # exactly
 156         comb += out.eq(x)   # the
 157     with m.Else():          # same
 158         comb += out.eq(y)   # code
 159
 160 The purpose of PartitionedSignal is therefore to provide full 100%
 161 transparent SIMD run-time dynamic behaviour as far as end-usage is
 162 concerned.
 163
 164 The alternative is absolutely awful and completely unacceptable
 165 for both maintenance cost and development cost.  Even the most
 166 basic HDL is literally an order of magnitude larger:
 167
 168     # declare x, y and out as 16-bit scalar Signals
 169     x = Signal(16)
 170     y = Signal(16)
 171     out = Signal(16)
 172
 173     # start an absolutely awful unmaintainable duplication of
 174     # SIMD behaviour.
 175     with m.If(mask == 0b111): # 1x 16-bit
 176        # compare x against y and set output accordingly
 177        with m.If(x < y):
 178            comb += out.eq(x)
 179        with m.Else():
 180            comb += out.eq(y)
 181     with m.ElIf(mask == 0b101): # 2x 8-bit
 182        for i in range(2):
 183            xh = x[i*8:(i+1)*8]
 184            yh = y[i*8:(i+1)*8]
 185            outh = out[i*8:(i+1)*8]
 186            # compare halves of x against halves y and set
 187            # halves of output accordingly
 188            with m.If(xh < yh):
 189                comb += outh.eq(xh)
 190            with m.Else():
 191                comb += outh.eq(yh)
 192     with m.ElIf(mask == 0b000): # 4x 4-bit
 193         ....
 194     with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
 195         ....
 196     with m.ElIf(....)
 197         ....
 198     with m.ElIf(....)
 199         ....
 200     with m.ElIf(....)
 201         ....
 202
 203 # Overview
 204
 205 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
 206
 207 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
 208 These "breaks" may be set at runtime at any time.
 209
 210 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs.  Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
 211
 212 # Type 1 (AST) nmigen constructs
 213
 214 nmigen's `ast.Value` already provides operator overloads for arithmetic
 215 operations such that natural standard python syntax `(a+b)` may be used.
 216 This proposal extends that concept naturally to allow further contextual
 217 overrides of the remaining nmigen Type 1 AST constructs: Cat, Repl,
 218 Mux, Switch, Part etc.
 219
 220 Instead of these being defined exclusively as global functions which cannot comprehend Object-Oriented
 221 context, Cat, Switch and Mux etc are redefined to behave as analogues
 222 of the python `operator` module. `nmigen.hdl.ast.Mux(sel,val2,val1)`
 223 therefore calls `sel.__Mux__(val2,val1)` and so on.
 224
 225 As an example of a fully-functioning use of this enhancement,
 226 pages below describe the basic features of each and track the relevant bugreports.  These features here are the building blocks which lie behind
 227 PartitionedSignal, which in turn provides "Type 1 (ast.*)" nmigen language
 228 constructs in the form described above.
 229
 230 * [[dynamic_simd/shape]] a derivative of ast.Shape with partition info
 231 * [[dynamic_simd/assign]] nmigen eq (ast.Assign)
 232 * [[dynamic_simd/cat]] nmigen ast.Cat
 233 * [[dynamic_simd/repl]] nmigen ast.Repl
 234 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
 235 * [[dynamic_simd/gt]] aka `__gt__` in python operator terms
 236 * [[dynamic_simd/add]]
 237 * [[dynamic_simd/mul]]
 238 * [[dynamic_simd/shift]]
 239 * [[dynamic_simd/logicops]] Horizontal reduction: some all xor bool
 240 * [[dynamic_simd/slice]] nmigen ast.Slice
 241
 242 # Integration with nmigen: "Type 2" (dsl.Module)
 243
 244 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
 245
 246 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
 247
 248     if partitions == 1x64
 249          with m.If(x > y):
 250               do something
 251     elif partitions == 2x32:
 252          with m.If(x[0:31] > y[0:31]):
 253               do something on 1st half
 254          elif ...
 255     elif ...
 256     # many more lines of repeated laborious hand written
 257     # SIMD nonsense all exactly the same except for the
 258     # for loop and sizes.
 259
 260 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
 261
 262 A much more intelligent approach is needed. What we actually want is:
 263
 264     with m.If(x > y): # do a partitioned compare here
 265          do something dynamic here
 266
 267 where *behind the scenes* the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
 268
 269 This means one of two things:
 270
 271 1. that nmigen needs to "understand" the partitioning, in a Type 2
 272   construct (m.If, m.Else and m.Switch), at the bare minimum.
 273 2. that nmigen's "Type 2" language constructs be sufficiently abstract
 274   that they *pass through* SIMD behaviour entirely to a lower level,
 275   *completely intact*.
 276
 277 Mathod (1) is an alarmingly large amount of work with severe to
 278 catastrophic implications in multiple areas.
 279
 280 Method (2) by complete contrast allows the Liskov Substitution Principle to
 281 be put to surprisingly elegant effect. If that was possible to
 282 achieve then **almost no modifications to nmigen would
 283 be required** because dsl.Module is *already* 99% abstracted in terms
 284 of the lower-level Type 1 (ast.*) constructs.
 285
 286 Analysis of the internals of nmigen shows that m.If, m.Else, m.FSM and m.Switch are all redirected to ast.py `Switch`.  Within that ast.Switch
 287 function only ast.Mux and other Type 1 (AST) "global" functions
 288 similar to python operator are used.  The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
 289
 290 <https://github.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
 291
 292 A deeper analysis shows that dsl.Module uses explicit Value.cast on its
 293 If, Elif, and Switch clauses. Overriding that and allowing a cast to
 294 a PartitionedSignal.cast (or, PartitionedBool.cast) would be sufficient
 295 to make Type 2 (dsl.Module) nmigen language constructs 100% abstracted from the Type 1 (ast) lower-level ones.
 296
 297 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement.  With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
 298
 299 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions.  Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
 300
 301 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column.  Take the following example:
 302
 303      mask = Signal(3) # creates four partitions
 304      a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
 305      b = PartitionedSignal(mask, 4) # likewise
 306      c = PartitionedSignal(mask, 32)
 307      d = PartitionedSignal(mask, 32)
 308      o = PartitionedSignal(mask, 32)
 309
 310      with m.If(a):
 311          comb += o.eq(c)
 312      with m.Elif(b):
 313          comb += o.eq(d)
 314
 315 If these were ordinary Signals, they would be translated to a Switch where:
 316
 317 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
 318 * cases would be (quantity 2) "1-" and "-1" in order to match
 319   against the first binary test bit of Cat(a, b) and the second,
 320   respectively.
 321 * the first case would be "1-" to activate `o.eq(c)
 322 * the second case would be "-1" to activate o.eq(d)
 323
 324 A parallel variant may thus perform a for-loop, creating four
 325 **independent** Switches:
 326
 327 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
 328 * take the output of each case result `o[0].eq[c[0])` and
 329   so on
 330 * create the first independent Switch
 331 * take a[1] and b[1] etc.
 332
 333 There are several ways in which the parts of each case, when
 334 activated, can be split up: temporary Signals, analysing
 335 the AST, or using PartitionedMux.
 336
 337 # Alternative implementation concepts
 338
 339 Several alternative ideas have been proposed. They are listed here for
 340 completeness.  The worst offender (use code-duplication) has already been
 341 raised.
 342
 343 The fundamental common characteristic of all of these ideas, despite
 344 their disparate nature, is that absolutely every single one of them
 345 assumes that there is explicit action (or hard work) required to
 346 be taken. None of them - at all - leverage the fact that nmigen
 347 has two different inter-related abstraction levels.
 348
 349 * **Explicit Code**. For every ALU, write code (in nmigen 0.3
 350   at the time of writing) which is inherently
 351   parallel, rather than duplicated with a series of Switch/Case or
 352   an if-elif-elif chain based on what the partition mask is set to.
 353 * **Use a low-level class**.  Rather than tie in to nmigen, write a
 354   class that performs all the equivalent (explicit coded) parallel
 355   operations, as close to nmigen Type 1 (AST) constructs as possible.
 356   Alternatives to Cat, called SIMDCat, and forego use of Type 2
 357   (dsl.Module) nmigen language constructs entirely.
 358 * **Wrapper classes**. A dsl.Module wrapper class which duplicates the
 359   entirety of the dsl.Module functionality with "redirection" functions
 360   and a dsl.Module instance as a member variable.  In the rare instances
 361   where the function is different it is expected to provide a replacement
 362   rather than call the dsl.Module variant.
 363 * **Replacement for dsl.Module**. This idea is to implement a full total
 364   replacement for dsl.Module, called dsl.SIMDModule (or other).
 365 * **Monkey-patching dsl.Module**. This idea intrusively modifies dsl.Module
 366   with external functions.
 367 * **Compilers / Macros**. On the basis that if this was VHDL or Verilog
 368   one technique for creating SIMD variants of the required code would be
 369   to use macro substitution or crude compilers to autogenerate the dynamic
 370   SIMD from VHDL / Verilog templates, why not do exactly the same thing.
 371   Design a SIMD HDL language, write python in that, and have it output
 372   python source code with nmigen HDL.
 373
 374 All of these ideas, unfortunately, are extremely costly in many different ways:
 375
 376 1. Any overrides of any kind give a catastrophically fatal impression
 377   that the high-level language behaviour might in some tiny way be different,
 378   purely because of the very *existence* of an override class.
 379   This in turn, if prevalent and commonly used, would quickly be
 380   seen as a rather nasty indirect unspoken implication that nmigen
 381   itself was badly designed, so much so that its users were forced
 382   to engage in "unauthorised overrides" in order to achieve what they
 383   want and need.
 384 2. Proponents of such override mechanisms (and there have been many)
 385   fundamentally misunderstand the distinction between Type 1 (AST) low-level
 386   and Type 2 (dsl.Module) high-level nmigen language constructs, and
 387   miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST
 388   (Type 2) constructs.
 389 3. Wrapper classes are a maintenance nightmare. Unless included in the
 390   library itself, any update to the library being wrapped is a huge risk
 391   of breaking the wrapper.  Long-term this is completely unsustainable.
 392   Likewise monkey-patching.
 393 4. Wrapper classes introduce huge performance degradation. Every function
 394   requires one additional function call. Every object instantiated requires
 395   both the wrapper class to be instantiated and the object being wrapped,
 396   every single time.  A single wrapper class is never enough: the entire
 397   class hierarchy, everything that is ever instantiated by a "wrapped"
 398   class instance, also requires wrapping.  This quickly
 399   gets completely out of hand and diverts developer time into a
 400   nightmare time-sink that has actively harmful consequences *even
 401   once completed*.
 402   Memory requirements double; performance degrades.
 403   Both are unacceptable.
 404 5. "Explicit coding" is actually more costly even than for-loop
 405   code duplication.  It is "The Right Thing (tm)" in order to achieve
 406   optimal gate-level design but requires hyper-aware and hyper-intelligent
 407   developers, whom, if ever lost, take knowledge of the internal workings
 408   of ultra-complex code with them.
 409 6. "Low-level classes" have in fact already been implemented: over a
 410   dozen modules utilised and combined behind PartitionedSignal exist
 411   and are already in use. PartitionedMux was one of the very first SIMD-aware
 412   "Type 1" AST operators written.  The problem is that it's nowhere near
 413   enough.  Conversion of 100,000 lines of readable nmigen HDL using
 414   Type 2 (m.If/Elif) high-level constructs to **not** use those
 415   constructs and return to AST Mux, Cat, and so on is a completely
 416   unreasonable expectation.
 417 7. "Replacements for dsl.Module" aside from coming with the catastrophic
 418   implication that they provide alternative behaviour from dsl.Module
 419   are a heavy maintenance burden.  Not only that but there is the risk
 420   that, unless actually included *in* nmigen, the underlying AST modules
 421   that they use might change, or be deprecated, or have bugs fixed which
 422   break the replacement.
 423 8. "Compilers". It is very well known that any python program that outputs
 424   another python program as an ASCII output may then immediately read that
 425   ASCII output from within the very program that created it, `eval` it,
 426   and then execute it.  That being the case, you might as well skip the
 427   output to ASCII and the eval part, and create classes that
 428   dynamically and directly instantiate the desired target behaviour.
 429   This saves months of effort but is a common mistake, frequently made.
 430 9. "Lost In Translation". ("Out of sight, out of mind" being translated
 431   to Russian then English by a Diplomat's translator famously returned
 432   the phrase, "Invisible lunatic").
 433   Any compiler risks losing information in between translations,
 434   making debugging harder, and finding mistakes particularly difficult.
 435   With nmigen already being a Language Translator, adding an extra layer
 436   seems extremely unwise.  And, worse, again relies on nmigen remaining
 437   stable, indefinitely.
 438
 439 Bottom line is that all the alternatives are really quite harmful, costly,
 440 and unmaintainable, and in some cases actively damage nmigen's reputation
 441 as a stable, useful and powerful HDL.
 442