3d_gpu/architecture/dynamic_simd.mdwn

   1 # Links
   2
   3 * <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
   4   (this document)
   5 * <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
   6 * <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
  10 * <https://bugs.libre-soc.org/show_bug.cgi?id=713> Partition-context-sensitive length adjustment
  11
  12 # Proposal Summary
  13
  14 The proposal is two-fold, and targetted at the two distinct types
  15 of nmigen language constructs:
  16
  17 * Type 1 low-level AST.  implemented as nmigen.hdl.ast classes:
  18   Value, Cat, Repl, Mux, Switch etc.
  19 * Type 2 high-level DSL. Implemented as Module in nmigen.hdl.dsl
  20
  21 The Type 1 AST low-level proposed modifications are mirrored on the
  22 existing long-established python `operator` module, which nmigen
  23 *already leverages* by providing `Value.__add__` and other operator
  24 overrides.
  25
  26 * To extend nmigen "Type 1 (ast.*)" low-level language constructs
  27   with `Value.__Cat__`, `Value.__Switch__`, `Value.__Repl__` etc.
  28 * To rename existing old `ast.Cat` etc to new `ast._InternalCat`
  29 * In an identical conceptual fashion, just as
  30   as python `operator.add(x,y)` redirects to `x.__add__(y)`,
  31   to add a new `ast.Cat(x..)` which redirects to`x.__Cat__(...)` etc.
  32 * To add `Value.__Cat__` etc which call the now-renamed `ast._InternalCat`
  33 * To allow all of the new underscore variants to be overridden without
  34   limitation, restriction, or restraint, via UserValue (and ValueCastable)
  35
  36 The second set of changes is targetted at Type 2 dsl.Module,
  37 to complete the 98% abstraction from Type 1 to a 100% level
  38
  39 * To add a new parameter to `dsl.Module`'s constructor,
  40   `_astType`, which is the AST class type
  41   to be used for "casting" of m.If and m.Elif condition tests,
  42   and for m.Switch's value.
  43 * To replace the two (2) calls to `Value.cast()` in `dsl.Module`
  44   with `self._astType.cast()`
  45
  46 No further modifications beyond that are strictly necessary.
  47
  48 With `dsl.Module` already being extremely close to 100%
  49 abstracted with respect to Type 1 AST constructs, Liskov
  50 Substitution Principle in combination with UserValue (later,
  51 ValueCastable) combine to provide extremely powerful and
  52 flexible augmentation and capabilities in nmigen, far beyond
  53 its original intended design parameters.
  54
  55 The benefits due to the abstraction extend far beyond this
  56 particular use-case (Partitioned Dynamic SIMD)
  57
  58 # Rationale / Introduction
  59
  60 The Dynamic Partitioned SIMD Signal is effectively a parallelisation
  61 of nmigen's Signal.  It is expected to work transparently as if it was
  62 a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
  63 no requirement on the part of the developer to even know that it is
  64 performing parallel dynamically-partitioned operations.
  65 All standard nmigen language constructs are both expected and required
  66 to behave exactly as they do for scalar Signals.
  67
  68 nmigen 32-bit Signal:
  69
  70     a        : .... .... .... .... (32 bits, illustrated as 4x8)
  71
  72 Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
  73 sections, by 3 partition bits:
  74
  75     partition:     P    P    P     (3 bits)
  76     a        : .... .... .... .... (32 bits, illustrated as 4x8)
  77     exp-a    : ....P....P....P.... (32+3 bits, P=0 if no partition)
  78
  79 Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
  80 into all 8 possible combinations of the 3 Partition bits:
  81
  82     exp-a    : ....0....0....0.... 1x 32-bit
  83     exp-a    : ....0....0....1.... 1x 24-bit plus 1x 8-bit
  84     exp-a    : ....0....1....0.... 2x 16-bit
  85     ...
  86     ...
  87     exp-a    : ....1....1....0.... 2x 8-bit, 1x 16-bit
  88     exp-a    : ....1....1....1.... 4x 8-bit
  89
  90 A simple example, a "min" function:
  91
  92     # declare x, y and out as 16-bit scalar Signals
  93     x = Signal(16)
  94     y = Signal(16)
  95     out = Signal(16)
  96
  97     # compare x against y and set output accordingly
  98     with m.If(x < y):
  99         comb += out.eq(x)
 100     with m.Else():
 101         comb += out.eq(y)
 102
 103 This is very straightforward and obvious, that the 16-bit output is the
 104 lesser of the x and y inputs.  We require the exact same obviousness
 105 and under no circumstances any change of any kind to any nmigen language
 106 construct:
 107
 108     # a mask of length 3 indicates a desire to partition Signals at
 109     # 3 points into 4 equally-spaced SIMD "partitions".
 110     mask = Signal(3)
 111
 112     with mask:
 113         # x y and out are all 16-bit so are subdivided at:
 114         # |      mask[0]     mask[1]     mask[3]         |
 115         # |  0-3    |    4-7    |   8-11    |   12-15    |
 116         x = PartitionedSignal(16)    # identical except for mask
 117         y = PartitionedSignal(16)    # identical except for mask
 118         out = PartitionedSignal(16)  # identical except for mask
 119
 120     # all code here is required to be absolutely identical to the
 121     # scalar case, and identical in nmigen language behaviour in
 122     # every way.  no changes to the nmigen language or its use
 123     # are permitted
 124
 125     with m.If(x < y):       # exactly
 126         comb += out.eq(x)   # the
 127     with m.Else():          # same
 128         comb += out.eq(y)   # code
 129
 130 The purpose of PartitionedSignal is therefore to provide full 100%
 131 transparent SIMD run-time dynamic behaviour as far as end-usage is
 132 concerned.
 133
 134 The alternative is absolutely awful and completely unacceptable
 135 for both maintenance cost and development cost.  Even the most
 136 basic HDL is literally an order of magnitude larger:
 137
 138     # declare x, y and out as 16-bit scalar Signals
 139     x = Signal(16)
 140     y = Signal(16)
 141     out = Signal(16)
 142
 143     # start an absolutely awful unmaintainable duplication of
 144     # SIMD behaviour.
 145     with m.If(mask == 0b111): # 1x 16-bit
 146        # compare x against y and set output accordingly
 147        with m.If(x < y):
 148            comb += out.eq(x)
 149        with m.Else():
 150            comb += out.eq(y)
 151     with m.ElIf(mask == 0b101): # 2x 8-bit
 152        for i in range(2):
 153            xh = x[i*8:(i+1)*8]
 154            yh = y[i*8:(i+1)*8]
 155            outh = out[i*8:(i+1)*8]
 156            # compare halves of x against halves y and set
 157            # halves of output accordingly
 158            with m.If(xh < yh):
 159                comb += outh.eq(xh)
 160            with m.Else():
 161                comb += outh.eq(yh)
 162     with m.ElIf(mask == 0b000): # 4x 4-bit
 163         ....
 164     with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
 165         ....
 166     with m.ElIf(....)
 167         ....
 168     with m.ElIf(....)
 169         ....
 170     with m.ElIf(....)
 171         ....
 172
 173 # Overview
 174
 175 To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.
 176
 177 Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.
 178
 179 Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs.  Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.
 180
 181 Pages below describe the basic features of each and track the relevant bugreports.  These features here are the building blocks which lie behind
 182 PartitionedSignal, which in turn provides "Type 1 (ast.*)" nmigen language
 183 constructs.
 184
 185 * [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
 186 * [[dynamic_simd/assign]] nmigen eq (assignment)
 187 * [[dynamic_simd/gt]] aka `__gt__` in python operator terms
 188 * [[dynamic_simd/add]]
 189 * [[dynamic_simd/cat]] - limited capability
 190 * [[dynamic_simd/mul]]
 191 * [[dynamic_simd/shift]]
 192 * [[dynamic_simd/logicops]] some all xor bool
 193
 194 # Integration with nmigen: "Type 2" (dsl.Module)
 195
 196 Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements.
 197
 198 With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:
 199
 200     if partitions == 1x64
 201          with m.If(x > y):
 202               do something
 203     elif partitions == 2x32:
 204          with m.If(x[0:31] > y[0:31]):
 205               do something on 1st half
 206          elif ...
 207     elif ...
 208     # many more lines of repeated laborious hand written
 209     # SIMD nonsense all exactly the same except for the
 210     # for loop and sizes.
 211
 212 Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.
 213
 214 A much more intelligent approach is needed. What we actually want is:
 215
 216     with m.If(x > y): # do a partitioned compare here
 217          do something dynamic here
 218
 219 where *behind the scenes* the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.
 220
 221 This means one of two things:
 222
 223 1. that nmigen needs to "understand" the partitioning, in a Type 2
 224   construct (m.If, m.Else and m.Switch), at the bare minimum.
 225 2. that nmigen's "Type 2" language constructs be sufficiently abstract
 226   that they *pass through* SIMD behaviour entirely to a lower level,
 227   *completely intact*.
 228
 229 Method (2) allows the Liskov Substitution Principle to be
 230 put to surprisingly good effect. If that was possible to
 231 achieve then **almost no modifications to nmigen would
 232 be required**.
 233
 234 Analysis of the internals of nmigen shows that m.If, m.Else, m.FSM and m.Switch are all redirected to ast.py `Switch`.  Within that function Mux and other "global" functions (similar to python operator functions).  The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.
 235
 236 <https://github.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>
 237
 238 A deeper analysis shows that dsl.Module uses explicit Value.cast on its
 239 If, Elif, and Switch clauses. Overriding that and allowing a cast to
 240 a PartitionedSignal.cast (or, PartitionedBool.cast) would be sufficient
 241 to make Type 2 (dsl.Module) nmigen language constructs 100% abstracted from the Type 1 (ast) lower-level ones.
 242
 243 m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement.  With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.
 244
 245 For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions.  Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.
 246
 247 Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column.  Take the following example:
 248
 249      mask = Signal(3) # creates four partitions
 250      a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
 251      b = PartitionedSignal(mask, 4) # likewise
 252      c = PartitionedSignal(mask, 32)
 253      d = PartitionedSignal(mask, 32)
 254      o = PartitionedSignal(mask, 32)
 255
 256      with m.If(a):
 257          comb += o.eq(c)
 258      with m.Elif(b):
 259          comb += o.eq(d)
 260
 261 If these were ordinary Signals, they would be translated to a Switch where:
 262
 263 * if_tests would be Cat(a, b) i.e. a 2 bit quantity
 264 * cases would be (quantity 2) "1-" and "-1" in order to match
 265   against the first binary test bit of Cat(a, b) and the second,
 266   respectively.
 267 * the first case would be "1-" to activate `o.eq(c)
 268 * the second case would be "-1" to activate o.eq(d)
 269
 270 A parallel variant may thus perform a for-loop, creating four
 271 **independent** Switches:
 272
 273 * take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
 274 * take the output of each case result `o[0].eq[c[0])` and
 275   so on
 276 * create the first independent Switch
 277 * take a[1] and b[1] etc.
 278
 279 There are several ways in which the parts of each case, when
 280 activated, can be split up: temporary Signals, analysing
 281 the AST, or using PartitionedMux.
 282
 283 # Alternative implementation concepts
 284
 285 Several alternative ideas have been proposed. They are listed here for
 286 completeness.  The worst offender (use code-duplication) has already been
 287 raised.
 288
 289 The fundamental common characteristic of all of these ideas, despite
 290 their disparate nature, is that absolutely every single one of them
 291 assumes that there is explicit action (or hard work) required to
 292 be taken. None of them - at all - leverage the fact that nmigen
 293 has two different inter-related abstraction levels.
 294
 295 * **Explicit Code**. For every ALU, write code (in nmigen 0.3
 296   at the time of writing) which is inherently
 297   parallel, rather than duplicated with a series of Switch/Case or
 298   an if-elif-elif chain based on what the partition mask is set to.
 299 * **Use a low-level class**.  Rather than tie in to nmigen, write a
 300   class that performs all the equivalent (explicit coded) parallel
 301   operations, as close to nmigen Type 1 (AST) constructs as possible.
 302   Alternatives to Cat, called SIMDCat, and forego use of Type 2
 303   (dsl.Module) nmigen language constructs entirely.
 304 * **Wrapper classes**. A dsl.Module wrapper class which duplicates the
 305   entirety of the dsl.Module functionality with "redirection" functions
 306   and a dsl.Module instance as a member variable.  In the rare instances
 307   where the function is different it is expected to provide a replacement
 308   rather than call the dsl.Module variant.
 309 * **Replacement for dsl.Module**. This idea is to implement a full total
 310   replacement for dsl.Module, called dsl.SIMDModule (or other).
 311 * **Monkey-patching dsl.Module**. This idea intrusively modifies dsl.Module
 312   with external functions.
 313 * **Compilers / Macros**. On the basis that if this was VHDL or Verilog
 314   one technique for creating SIMD variants of the required code would be
 315   to use macro substitution or crude compilers to autogenerate the dynamic
 316   SIMD from VHDL / Verilog templates, why not do exactly the same thing.
 317   Design a SIMD HDL language, write python in that, and have it output
 318   nmigen HDL.
 319
 320 All of these ideas, unfortunately, are extremely costly in many different ways:
 321
 322 1. Any overrides of any kind give a catastrophically fatal impression
 323   that the high-level language behaviour might in some tiny way be different,
 324   purely because of the very *existence* of an override class.
 325   This in turn, if prevalent and commonly used, would quickly be
 326   seen as a rather nasty indirect unspoken implication that nmigen
 327   itself was badly designed, so much so that its users were forced
 328   to engage in "unauthorised overrides" in order to achieve what they
 329   want.
 330 2. Proponents of such override mechanisms (and there have been many)
 331   fundamentally misunderstand the distinction between Type 1 (AST) low-level
 332   and Type 2 (dsl.Module) high-level nmigen language constructs, and
 333   miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST
 334   (Type 2) constructs.
 335 3. Wrapper classes are a maintenance nightmare. Unless included in the
 336   library itself, any update to the library being wrapped is a huge risk
 337   of breaking the wrapper.  Long-term this is completely unsustainable.
 338   Likewise monkey-patching.
 339 4. Wrapper classes introduce huge performance degradation. Every function
 340   requires one additional function call. Every object instantiated requires
 341   both the wrapper class to be instantiated and the object being wrapped,
 342   every single time.  A single wrapper class is never enough: the entire
 343   class hierarchy, everything that is ever instantiated by a "wrapped"
 344   class instance, also requires wrapping.  This quickly
 345   gets completely out of hand and diverts developer time into a
 346   nightmare time-sink that has actively harmful consequences *even
 347   once completed*.
 348   Memory requirements double; performance degrades.
 349   Both are unacceptable.
 350 5. "Explicit coding" is actually more costly even than for-loop
 351   code duplication.  It is "The Right Thing (tm)" in order to achieve
 352   optimal gate-level design but requires hyper-aware and hyper-intelligent
 353   developers, whom, if ever lost, take knowledge of the internal workings
 354   of ultra-complex code with them.
 355 6. "Low-level classes" have in fact already been implemented: over a
 356   dozen modules utilised and combined behind PartitionedSignal exist
 357   and are already in use. PartitionedMux was one of the very first SIMD-aware
 358   "Type 1" AST operators written.  The problem is that it's nowhere near
 359   enough.  Conversion of 100,000 lines of readable nmigen HDL using
 360   Type 2 (m.If/Elif) high-level constructs to **not** use those
 361   constructs and return to AST Mux, Cat, and so on is a completely
 362   unreasonable expectation.
 363 7. "Replacements for dsl.Module" aside from coming with the catastrophic
 364   implication that they provide alternative behaviour from dsl.Module
 365   are a heavy maintenance burden.  Not only that but there is the risk
 366   that, unless actually included *in* nmigen, the underlying AST modules
 367   that they use might change, or be deprecated, or have bugs fixed which
 368   break the replacement.
 369 8. "Compilers". It is very well known that any python program that outputs
 370   another python program as an ASCII output may then immediately read that
 371   ASCII output from within the very program that created it, `eval` it,
 372   and then execute it.  That being the case, you might as well skip the
 373   output to ASCII and the eval part, and create classes that
 374   dynamically and directly instantiate the desired target behaviour.
 375   This saves months of effort but is a common mistake, frequently made.
 376 9. "Lost In Translation". ("Out of sight, out of mind" being translated
 377   to Russian then English by a Diplomat's translator famously returned
 378   the phrase, "Invisible lunatic").
 379   Any compiler risks losing information in between translations,
 380   making debugging harder, and finding mistakes particularly difficult.
 381   With nmigen already being a Language Translator, adding an extra layer
 382   seems extremely unwise.  And, worse, again relies on nmigen remaining
 383   stable, indefinitely.
 384
 385 Bottom line is that all the alternatives are really quite harmful, costly,
 386 and unmaintainable, and in some cases actively damage nmigen's reputation
 387 as a stable, useful and powerful HDL.
 388