# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)

# Requirements Analysis

A logical extension of the nmigen `ast.Shape` concept, `SimdShape` provides
sufficient context to define overrides for individual lengths on a per-mask
basis, as well as sufficient information to "upcast" back to a SimdSignal,
in exactly the same way that C++ virtual base class upcasting works when
RTTI (Run Time Type Information) is enabled.

By deriving from `ast.Shape`, both `width` and `signed` are provided already,
leaving the `SimdShape` class with the responsibility to additionally define
lengths for each mask basis.

This is best illustrated with an example. The Libre-SOC IEEE754 ALUs need to
be converted to SIMD Partitioning, but without the massive disruptive
code-duplication or intrusive explicit coding outlined in the worst of the
techniques documented in [[dynamic_simd]]. This in turn implies that Signals
need to be declared for both mantissa and exponent that **change width to
non-power-of-two sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 54 bits (excluding guard,
  round and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 5 bits

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal information
(width, sign) in order to create the partitions that allow standard nmigen
operations to **transparently** and naturally take place at **all** of these
non-uniform widths, as if they were in fact scalar Signals *at* those widths.

A minor wrinkle which emerges from deep analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly declared
under *some* circumstances, and the sub-partitions made to fit onto
power-of-two boundaries, in order to allow straight wire-connections rather
than have the SimdSignal be arbitrary-sized (compact). Although on shallow
inspection this would initially seem to imply large unused sub-partitions
(padding partitions), those gates can in fact be eliminated with a
"blanking" mask, created from static analysis of the SimdShape context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values to 5 bits

From these we can write out the allocations, bearing in mind that in each
partition the sub-signal must start on a power-of-2 boundary:

            |31|  |  |24|      16|15|     | 8|7        0|
    32bit   |  |  |  |  |        |  |     |   1.11      |
    16bit   |  |      2.11       |  |     |   1.11      |
    8bit    |  |  4.5   |  3.5   |  |  2.5   |      1.5 |

Next we identify the start and end points, and note that "x" marks unused
(padding) portions. We begin by marking the power-of-two boundaries
(0-7 .. 24-31) and also including column guidelines to delineate the start
and end points:

            |31|  |  |24|      16|15|     | 8|7        0|
            |31|28|26|24|  |20|16|15|12|10| 8|  | 4    0|
    32bit   | x| x| x|     | x| x| x|   10 ....        0|
    16bit   | x| x|26 ...      16| x| x|10 ....        0|
    8bit    | x|28 .. 24|  20..16| x| 12 .. 8| x| 4 .. 0|
    unused  | x|                 | x|                   |

Thus, we deduce, we *actually* need breakpoints at *nine* positions, and the
unused portions common to **all** cases can be deduced and marked "x" by
looking at the columns above them.
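Purely as an illustration (this is not the code in the linked
layout_experiment.py), the nine breakpoint positions and the 100%-unused
bits can be derived mechanically from the three layouts above, assuming
11-bit elements for the 32/16-bit cases and 5-bit elements for the 8-bit
case, each element starting on a power-of-two boundary within the 32-bit
signal:

    # illustrative derivation only -- not the algorithm in layout_experiment.py
    def layout(num_elements, elwidth, overall=32):
        """return (used-bit set, breakpoint set) for one case, where each
           element starts on a power-of-2 boundary (overall/num_elements)"""
        stride = overall // num_elements
        used, breaks = set(), set()
        for i in range(num_elements):
            start = i * stride
            used.update(range(start, start + elwidth))
            breaks.add(start)                # element start position
            breaks.add(start + elwidth - 1)  # element top (inclusive) bit
        breaks.discard(0)                    # bit zero is not a breakpoint
        return used, breaks

    cases = [layout(1, 11),   # "32bit": 1 x 11-bit
             layout(2, 11),   # "16bit": 2 x 11-bit
             layout(4, 5)]    # "8bit":  4 x 5-bit

    breakpoints = set().union(*(b for _, b in cases))
    blank = set(range(32)) - set().union(*(u for u, _ in cases))
    print(sorted(breakpoints))  # [4, 8, 10, 12, 16, 20, 24, 26, 28]: nine positions
    print(sorted(blank))        # [13, 14, 15, 29, 30, 31]: the "x" (blanking) bits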
These 100% unused "x"s therefore define the "blanking" mask, and in these
sub-portions it is unnecessary to allocate computational gates.

Also in order to save gates, in the example above there are only three cases
(32-bit, 16-bit, 8-bit), and therefore only three sets of logic are required
to construct the larger overall computational result from the "smaller
chunks". At first glance, with there being nine actual partition points
(28, 26, 24, 20, 16, 12, 10, 8, 4), it would appear that 2^9 (512!) cases
were required, when in fact there are only three.

These facts also need to be communicated to both the SimdSignal as well as
the submodules implementing its core functionality: the add operation and
other arithmetic behaviour, as well as [[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged from technical
discussions as desirable to have: it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve or adapt
the Partition context, where the arithmetic operations occur on
`Shape.width`.

    >>> XLEN = SimdShape(64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    # scalar context
    scalarctx = scl = SimpleNamespace()  # types.SimpleNamespace, so attributes can be set
    scl.XLEN = 64
    scl.SigKls = Signal          # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})
    sdc.SigKls = SimdSignal      # advanced SIMD Signal
    sdc.elwidth = Signal(2)      # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code switching context at compile time
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal that affects the way SimdShape works. The register files are
64 bit, and are subdivided according to what Wikipedia terms "SIMD Within
A Register" (SWAR). Therefore, the SIMD ALUs *have* to both accept and
output 64-bit signals at that explicit width, with subdivisions for 1x64,
2x32, 4x16 and 8x8 SIMD capability.

However, when it comes to intermediary processing (partial computations),
those intermediary Signals can and will be required to be a certain fixed
width *regardless*, having nothing to do with the register file's 64-bit
source or destination width. The simplest example here would be a boolean
(1-bit) Signal for Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x == y)
        with m.If(b):
            ...

This code is obvious for Scalar behaviour, but for SIMD, because the
elwidths are declared as `1x64, 2x32, 4x16, 8x8`, whilst the *elements* are
1 bit each (one per comparison of the up to eight parallel SIMD 8-bit
values), there correspondingly need to be **eight** such element bits in
order to store the results of up to eight comparisons. Exactly how that
comparison works is described in [[dynamic_simd/eq]].

Another example would be a simple test of the first *nibble* of the data:

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4])   # first nibble (4 bits)
        ...

Here, we do not necessarily want to declare y to be 64-bit: we want only the
first 4 bits of each element, after all, and when y is set to be QTY 8 of
8-bit elements, then y will only need to store QTY 8 of 4-bit quantities,
i.e. only a maximum of 32 bits total. If y was declared as 64 bit, this
would indicate that the actual elements were at least 8 bits long, and if
that was then used as a shift input it might produce the wrong calculation,
because the actual shift amount was only supposed to be 4 bits.
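As a purely illustrative calculation (no actual SimdShape API is used here),
the storage that y genuinely requires can be tallied per elwidth
configuration; the peak is 32 bits, half the register width:

    # illustrative only: tally the storage needed for 4-bit-per-element y
    # across the 1x64 / 2x32 / 4x16 / 8x8 configurations described above
    element_counts = {"1x64": 1, "2x32": 2, "4x16": 4, "8x8": 8}
    element_width = 4            # "first nibble" of each element
    totals = {name: count * element_width
              for name, count in element_counts.items()}
    print(totals)                # {'1x64': 4, '2x32': 8, '4x16': 16, '8x8': 32}
    print(max(totals.values()))  # 32 -- y never needs the full 64 bits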
Thus not one method of setting widths is required but *two*:

* at the element level
* at the width of the entire SIMD signal

With this background and context in mind, the requirements can be
determined.

# Requirements

SimdShape needs:

* to derive from nmigen ast.Shape in order to provide the overall width and
  whether it is signed or unsigned. However the overall width is not
  necessarily hard-set: it may instead be calculated.
* to provide a means to specify the number of partitions in each of an
  arbitrarily-named set. For convenience, and by convention from SVP64,
  this set is called "elwidths".
* to support a range of sub-signal divisions (element widths), with the
  option either to set each element width explicitly or to have each width
  computed from the overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability that automatically
  computes a new SimdShape, adjusting width and element widths accordingly.

Interfacing to SimdSignal requires an adapter that (sketched below):

* allows a switch-case set to be created
* the switch statement is the elwidth parameter
* the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)
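To make the adapter requirement concrete, here is a minimal sketch under
assumed names (`ElwidthCaseAdapter`, `ppoints`, `blanking` and the elwidth
encodings are all invented for illustration, and this is not the actual
SimdSignal or PartitionPoints API): the switch is driven by elwidth, each
case activates one pre-computed set of partition cuts, and the
statically-derived blanking mask is exposed as a constant.

    # a minimal sketch, NOT the actual SimdSignal adapter: class and signal
    # names here are assumptions for illustration only
    from nmigen import Const, Elaboratable, Module, Signal

    class ElwidthCaseAdapter(Elaboratable):
        def __init__(self, cases, blanking_mask, width):
            """cases:         {elwidth value: set of active cut positions}
               blanking_mask: bits that are padding in *every* case
               width:         overall SimdSignal width"""
            self.elwidth = Signal(range(len(cases)))    # the "switch" input
            self.cases = cases
            # every cut position used by any case, in a fixed order
            self.positions = sorted({p for cuts in cases.values()
                                       for p in cuts})
            self.ppoints = Signal(len(self.positions))  # one enable per cut
            # 100% padding sub-portions are known statically: no gates needed
            self.blanking = Const(blanking_mask, width)

        def elaborate(self, platform):
            m = Module()
            with m.Switch(self.elwidth):            # switch: the elwidth
                for ew, cuts in self.cases.items():
                    with m.Case(ew):                # case: one set of PartitionPoints
                        for i, pos in enumerate(self.positions):
                            m.d.comb += self.ppoints[i].eq(int(pos in cuts))
            return m

    # usage with the worked example above (elwidth encodings are assumed):
    adapter = ElwidthCaseAdapter(
        cases={0b00: {10},                         # "32bit": 1 x 11-bit
               0b01: {10, 16, 26},                 # "16bit": 2 x 11-bit
               0b10: {4, 8, 12, 16, 20, 24, 28}},  # "8bit":  4 x 5-bit
        blanking_mask=0b1110_0000_0000_0000_1110_0000_0000_0000,
        width=32)

The per-case cut sets here are the per-case breakpoints from the worked
example, and the blanking mask has bits 13-15 and 29-31 set, matching the
two fully-unused "x" columns deduced earlier.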