# Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=594> RFC for nmigen integration
  (this document)
* <https://bugs.libre-soc.org/show_bug.cgi?id=115> top level SIMD
* <https://bugs.libre-soc.org/show_bug.cgi?id=458> m.If/Switch 
* <https://bugs.libre-soc.org/show_bug.cgi?id=707> Limited Cat
* <https://bugs.libre-soc.org/show_bug.cgi?id=565> Formal proof of PartitionedSignal
* <https://bugs.libre-soc.org/show_bug.cgi?id=596> Formal proof of PartitionedSignal nmigen interaction
* <https://bugs.libre-soc.org/show_bug.cgi?id=713> Partition-context-sensitive length adjustment

# Proposal Summary

The proposal is two-fold, and targetted at the two distinct types
of nmigen language constructs:

* Type 1 low-level AST.  implemented as nmigen.hdl.ast classes:
  Value, Cat, Repl, Mux, Switch etc.
* Type 2 high-level DSL. Implemented as Module in nmigen.hdl.dsl

The Type 1 AST low-level proposed modifications are mirrored on the
existing long-established python `operator` module, which nmigen
*already leverages* by providing `Value.__add__` and other operator
overrides.

* To extend nmigen "Type 1 (ast.*)" low-level language constructs
  with `Value.__Cat__`, `Value.__Switch__`, `Value.__Repl__` etc.
* To rename existing old `ast.Cat` etc to new `ast._InternalCat`
* In an identical conceptual fashion, just as
  as python `operator.add(x,y)` redirects to `x.__add__(y)`,
  to add a new `ast.Cat(x..)` which redirects to`x.__Cat__(...)` etc.
* To add `Value.__Cat__` etc which call the now-renamed `ast._InternalCat`
* To allow all of the new underscore variants to be overridden without
  limitation, restriction, or restraint, via UserValue (and ValueCastable)

The second set of changes is targetted at Type 2 dsl.Module,
to complete the 98% abstraction from Type 1 to a 100% level

* To add a new parameter to `dsl.Module`'s constructor,
  `_astType`, which is the AST class type
  to be used for "casting" of m.If and m.Elif condition tests,
  and for m.Switch's value.
* To replace the two (2) calls to `Value.cast()` in `dsl.Module`
  with `self._astType.cast()`

No further modifications beyond that are strictly necessary.

With `dsl.Module` already being extremely close to 100%
abstracted with respect to Type 1 AST constructs, Liskov
Substitution Principle in combination with UserValue (later,
ValueCastable) combine to provide extremely powerful and
flexible augmentation and capabilities in nmigen, far beyond
its original intended design parameters.

The benefits due to the abstraction extend far beyond this
particular use-case (Partitioned Dynamic SIMD)

# Rationale / Introduction

The Dynamic Partitioned SIMD Signal is effectively a parallelisation
of nmigen's Signal.  It is expected to work transparently as if it was
a nmigen Signal, in every way, as a full peer of a nmigen Signal, with
no requirement on the part of the developer to even know that it is
performing parallel dynamically-partitioned operations.
All standard nmigen language constructs are both expected and required
to behave exactly as they do for scalar Signals.

nmigen 32-bit Signal:

    a        : .... .... .... .... (32 bits, illustrated as 4x8)

Dynamically-partitioned 32-bit Signal subdivided into four 8-bit
sections, by 3 partition bits:

    partition:     P    P    P     (3 bits)
    a        : .... .... .... .... (32 bits, illustrated as 4x8)
    exp-a    : ....P....P....P.... (32+3 bits, P=0 if no partition)

Each partitioned section shall act as an independent Signal where the **partitioning is dynamic at runtime** and may subdivide the above example
into all 8 possible combinations of the 3 Partition bits:

    exp-a    : ....0....0....0.... 1x 32-bit
    exp-a    : ....0....0....1.... 1x 24-bit plus 1x 8-bit
    exp-a    : ....0....1....0.... 2x 16-bit
    ...
    ...
    exp-a    : ....1....1....0.... 2x 8-bit, 1x 16-bit
    exp-a    : ....1....1....1.... 4x 8-bit

A simple example, a "min" function:

    # declare x, y and out as 16-bit scalar Signals
    x = Signal(16)
    y = Signal(16)
    out = Signal(16)

    # compare x against y and set output accordingly
    with m.If(x < y):
        comb += out.eq(x)
    with m.Else():
        comb += out.eq(y)

This is very straightforward and obvious, that the 16-bit output is the
lesser of the x and y inputs.  We require the exact same obviousness
and under no circumstances any change of any kind to any nmigen language
construct:

    # a mask of length 3 indicates a desire to partition Signals at
    # 3 points into 4 equally-spaced SIMD "partitions".
    mask = Signal(3)

    with mask:
        # x y and out are all 16-bit so are subdivided at:
        # |      mask[0]     mask[1]     mask[3]         |
        # |  0-3    |    4-7    |   8-11    |   12-15    |
        x = PartitionedSignal(16)    # identical except for mask
        y = PartitionedSignal(16)    # identical except for mask
        out = PartitionedSignal(16)  # identical except for mask

    # all code here is required to be absolutely identical to the
    # scalar case, and identical in nmigen language behaviour in
    # every way.  no changes to the nmigen language or its use
    # are permitted

    with m.If(x < y):       # exactly
        comb += out.eq(x)   # the
    with m.Else():          # same
        comb += out.eq(y)   # code

The purpose of PartitionedSignal is therefore to provide full 100%
transparent SIMD run-time dynamic behaviour as far as end-usage is
concerned.

The alternative is absolutely awful and completely unacceptable
for both maintenance cost and development cost.  Even the most
basic HDL is literally an order of magnitude larger:

    # declare x, y and out as 16-bit scalar Signals
    x = Signal(16)
    y = Signal(16)
    out = Signal(16)

    # start an absolutely awful unmaintainable duplication of
    # SIMD behaviour.
    with m.If(mask == 0b111): # 1x 16-bit
       # compare x against y and set output accordingly
       with m.If(x < y):
           comb += out.eq(x)
       with m.Else():
           comb += out.eq(y)
    with m.ElIf(mask == 0b101): # 2x 8-bit
       for i in range(2):
           xh = x[i*8:(i+1)*8]
           yh = y[i*8:(i+1)*8]
           outh = out[i*8:(i+1)*8]
           # compare halves of x against halves y and set
           # halves of output accordingly
           with m.If(xh < yh):
               comb += outh.eq(xh)
           with m.Else():
               comb += outh.eq(yh)
    with m.ElIf(mask == 0b000): # 4x 4-bit
        ....
    with m.ElIf(mask == 0b100): # 1x 8-bit followed by 2x 4-bit
        ....
    with m.ElIf(....)
        ....
    with m.ElIf(....)
        ....
    with m.ElIf(....)
        ....

# Overview

To save hugely on gate count the normal practice of having separate scalar ALUs and separate SIMD ALUs is not followed.

Instead a suite of "partition points" identical in fashion to the Aspex Microelectronics ASP (Array-String-Architecture) architecture is deployed.

Basic principle: when all partition gates are open the ALU is subdivided into isolated and independent 8 bit SIMD ALUs.  Whenever any one gate is opened, the relevant 8 bit "part-results" are chained together in a downstream cascade to create 16 bit, 32 bit, 64 bit and 128 bit compound results.

Pages below describe the basic features of each and track the relevant bugreports.  These features here are the building blocks which lie behind
PartitionedSignal, which in turn provides "Type 1 (ast.*)" nmigen language
constructs.

* [[dynamic_simd/eq]] aka `__eq__` not to be confused with nmigen eq
* [[dynamic_simd/assign]] nmigen eq (assignment)
* [[dynamic_simd/gt]] aka `__gt__` in python operator terms
* [[dynamic_simd/add]]
* [[dynamic_simd/cat]] - limited capability
* [[dynamic_simd/mul]]
* [[dynamic_simd/shift]]
* [[dynamic_simd/logicops]] some all xor bool

# Integration with nmigen: "Type 2" (dsl.Module)

Dynamic partitioning of signals is not enough on its own. Normal nmigen programs involve conditional decisions, that means if statements and switch statements. 

With the PartitionedSignal class, basic operations such as `x + y` are functional, producing results 1x64 bit, or 2x32 or 4x16 or 8x8 or anywhere in between, but what about control and decisions? Here is the "normal" way in which SIMD decisions are performed:

    if partitions == 1x64
         with m.If(x > y):
              do something
    elif partitions == 2x32:
         with m.If(x[0:31] > y[0:31]):
              do something on 1st half
         elif ...
    elif ...
    # many more lines of repeated laborious hand written
    # SIMD nonsense all exactly the same except for the
    # for loop and sizes.

Clearly this is a total unmaintainable nightmare of worthless crud which, if continued throughout a large project with 40,000 lines of code when written without SIMD, would completely destroy all chances of that project being successful by turning 40,000 lines into 400,000 lines of unreadable spaghetti.

A much more intelligent approach is needed. What we actually want is:

    with m.If(x > y): # do a partitioned compare here
         do something dynamic here

where *behind the scenes* the above laborious for-loops (conceptually) are created, hidden, looking to all intents and purposes that this is exactly like any other nmigen Signal.

This means one of two things:

1. that nmigen needs to "understand" the partitioning, in a Type 2
  construct (m.If, m.Else and m.Switch), at the bare minimum.
2. that nmigen's "Type 2" language constructs be sufficiently abstract
  that they *pass through* SIMD behaviour entirely to a lower level,
  *completely intact*.

Method (2) allows the Liskov Substitution Principle to be
put to surprisingly good effect. If that was possible to
achieve then **almost no modifications to nmigen would
be required**.

Analysis of the internals of nmigen shows that m.If, m.Else, m.FSM and m.Switch are all redirected to ast.py `Switch`.  Within that function Mux and other "global" functions (similar to python operator functions).  The hypothesis is therefore proposed that if `Value.mux` is added in an identical way to how `operator.add` calls `__add__` this may turn out to be all that (or most of what) is needed.

<https://github.com/nmigen/nmigen/blob/59ef6e6a1c4e389a41148554f2dd492328820ecd/nmigen/hdl/dsl.py#L447>

A deeper analysis shows that dsl.Module uses explicit Value.cast on its
If, Elif, and Switch clauses. Overriding that and allowing a cast to
a PartitionedSignal.cast (or, PartitionedBool.cast) would be sufficient
to make Type 2 (dsl.Module) nmigen language constructs 100% abstracted from the Type 1 (ast) lower-level ones.

m.If and m.Else work by constructing a series of Switch cases, each case test being one of "--1---" or "-----1-" where the binary tests themselves are concatenated together as the "Switch" statement.  With switch statements being order-dependent, the first match will succeed which will stop subsequent "Else" or "Elif" statements from being executed.

For a parallel variant each partition column may be assumed to be independent. A mask of 3 bits subdivides Signals down into four separate partitions.  Therefore what was previously a single-bit binary test is, just like for Partitioned Mux, actually four separate and distinct partition-column-specific single-bit binary tests.

Therefore, a Parallel Switch statement is as simple as taking the relevant column of each Switch case and creating one independent Switch per Partition column.  Take the following example:

     mask = Signal(3) # creates four partitions
     a = PartitionedSignal(mask, 4) # creates a 4-bit partitioned signal
     b = PartitionedSignal(mask, 4) # likewise
     c = PartitionedSignal(mask, 32)
     d = PartitionedSignal(mask, 32)
     o = PartitionedSignal(mask, 32)

     with m.If(a):
         comb += o.eq(c)
     with m.Elif(b):
         comb += o.eq(d)

If these were ordinary Signals, they would be translated to a Switch where:

* if_tests would be Cat(a, b) i.e. a 2 bit quantity
* cases would be (quantity 2) "1-" and "-1" in order to match
  against the first binary test bit of Cat(a, b) and the second,
  respectively.
* the first case would be "1-" to activate `o.eq(c)
* the second case would be "-1" to activate o.eq(d)

A parallel variant may thus perform a for-loop, creating four
**independent** Switches:

* take a[0] and b[0] and Cat them together `Cat(a[0], b[0])`
* take the output of each case result `o[0].eq[c[0])` and
  so on
* create the first independent Switch
* take a[1] and b[1] etc.

There are several ways in which the parts of each case, when
activated, can be split up: temporary Signals, analysing
the AST, or using PartitionedMux.

# Alternative implementation concepts

Several alternative ideas have been proposed. They are listed here for
completeness.  The worst offender (use code-duplication) has already been
raised.

The fundamental common characteristic of all of these ideas, despite
their disparate nature, is that absolutely every single one of them
assumes that there is explicit action (or hard work) required to
be taken. None of them - at all - leverage the fact that nmigen
has two different inter-related abstraction levels.

* **Explicit Code**. For every ALU, write code (in nmigen 0.3
  at the time of writing) which is inherently
  parallel, rather than duplicated with a series of Switch/Case or
  an if-elif-elif chain based on what the partition mask is set to.
* **Use a low-level class**.  Rather than tie in to nmigen, write a
  class that performs all the equivalent (explicit coded) parallel
  operations, as close to nmigen Type 1 (AST) constructs as possible.
  Alternatives to Cat, called SIMDCat, and forego use of Type 2
  (dsl.Module) nmigen language constructs entirely.
* **Wrapper classes**. A dsl.Module wrapper class which duplicates the
  entirety of the dsl.Module functionality with "redirection" functions
  and a dsl.Module instance as a member variable.  In the rare instances
  where the function is different it is expected to provide a replacement
  rather than call the dsl.Module variant.
* **Replacement for dsl.Module**. This idea is to implement a full total
  replacement for dsl.Module, called dsl.SIMDModule (or other).
* **Monkey-patching dsl.Module**. This idea intrusively modifies dsl.Module
  with external functions.
* **Compilers / Macros**. On the basis that if this was VHDL or Verilog
  one technique for creating SIMD variants of the required code would be
  to use macro substitution or crude compilers to autogenerate the dynamic
  SIMD from VHDL / Verilog templates, why not do exactly the same thing.
  Design a SIMD HDL language, write python in that, and have it output
  nmigen HDL.

All of these ideas, unfortunately, are extremely costly in many different ways:

1. Any overrides of any kind give a catastrophically fatal impression
  that the high-level language behaviour might in some tiny way be different,
  purely because of the very *existence* of an override class.
  This in turn, if prevalent and commonly used, would quickly be
  seen as a rather nasty indirect unspoken implication that nmigen
  itself was badly designed, so much so that its users were forced
  to engage in "unauthorised overrides" in order to achieve what they
  want.
2. Proponents of such override mechanisms (and there have been many)
  fundamentally misunderstand the distinction between Type 1 (AST) low-level
  and Type 2 (dsl.Module) high-level nmigen language constructs, and
  miss the fact that dsl.Module (Type 2) is 100% implemented *in* AST
  (Type 2) constructs.
3. Wrapper classes are a maintenance nightmare. Unless included in the
  library itself, any update to the library being wrapped is a huge risk
  of breaking the wrapper.  Long-term this is completely unsustainable.
  Likewise monkey-patching.
4. Wrapper classes introduce huge performance degradation. Every function
  requires one additional function call. Every object instantiated requires
  both the wrapper class to be instantiated and the object being wrapped,
  every single time.  A single wrapper class is never enough: the entire
  class hierarchy, everything that is ever instantiated by a "wrapped"
  class instance, also requires wrapping.  This quickly
  gets completely out of hand and diverts developer time into a
  nightmare time-sink that has actively harmful consequences *even
  once completed*.
  Memory requirements double; performance degrades.
  Both are unacceptable.
5. "Explicit coding" is actually more costly even than for-loop
  code duplication.  It is "The Right Thing (tm)" in order to achieve
  optimal gate-level design but requires hyper-aware and hyper-intelligent
  developers, whom, if ever lost, take knowledge of the internal workings
  of ultra-complex code with them.
6. "Low-level classes" have in fact already been implemented: over a
  dozen modules utilised and combined behind PartitionedSignal exist
  and are already in use. PartitionedMux was one of the very first SIMD-aware
  "Type 1" AST operators written.  The problem is that it's nowhere near
  enough.  Conversion of 100,000 lines of readable nmigen HDL using
  Type 2 (m.If/Elif) high-level constructs to **not** use those
  constructs and return to AST Mux, Cat, and so on is a completely
  unreasonable expectation.
7. "Replacements for dsl.Module" aside from coming with the catastrophic
  implication that they provide alternative behaviour from dsl.Module
  are a heavy maintenance burden.  Not only that but there is the risk
  that, unless actually included *in* nmigen, the underlying AST modules
  that they use might change, or be deprecated, or have bugs fixed which
  break the replacement.
8. "Compilers". It is very well known that any python program that outputs
  another python program as an ASCII output may then immediately read that
  ASCII output from within the very program that created it, `eval` it,
  and then execute it.  That being the case, you might as well skip the
  output to ASCII and the eval part, and create classes that
  dynamically and directly instantiate the desired target behaviour.
  This saves months of effort but is a common mistake, frequently made.
9. "Lost In Translation". ("Out of sight, out of mind" being translated
  to Russian then English by a Diplomat's translator famously returned
  the phrase, "Invisible lunatic").
  Any compiler risks losing information in between translations,
  making debugging harder, and finding mistakes particularly difficult.
  With nmigen already being a Language Translator, adding an extra layer
  seems extremely unwise.  And, worse, again relies on nmigen remaining
  stable, indefinitely.

Bottom line is that all the alternatives are really quite harmful, costly,
and unmaintainable, and in some cases actively damage nmigen's reputation
as a stable, useful and powerful HDL.