# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides enough context both to define overrides for individual lengths
on a per-mask basis and to "upcast" back to a SimdSignal, in much the
same way that C++ virtual base class casting works when RTTI (Run Time
Type Information) is available.

By deriving from `ast.Shape`, both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility of
additionally defining lengths on a per-mask basis. This is best
illustrated with an example.

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning,
but without massively disruptive code-duplication or the intrusive explicit
coding outlined in the worst of the techniques documented in
[[dynamic_simd]].  This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 there is one mantissa of 52 bits (excluding
  guard, round and sticky bits)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 7 bits

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

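As a minimal sketch of the kind of per-context length information described here, the field widths can be written as a plain Python table keyed by Partition Mask Context. The names `FIELD_WIDTHS` and `total_bits` are hypothetical and not part of any Libre-SOC API. The mantissa entries use the stored fraction widths of the standard formats, which is an assumption about what "mantissa" means in this context.

```python
# Hypothetical table: context -> (lane count, mantissa bits, exponent bits).
# Mantissa values are the stored fraction bits of the standard formats
# (an assumption); exponent values are the standard exponent field widths.
FIELD_WIDTHS = {
    "1xFP64": (1, 52, 11),
    "2xFP32": (2, 23, 8),
    "4xFP16": (4, 10, 5),
    "4xBF16": (4, 7, 8),
}


def total_bits(context, field):
    """Bits needed to hold every lane's field side by side (compact packing)."""
    lanes, mantissa, exponent = FIELD_WIDTHS[context]
    width = {"mantissa": mantissa, "exponent": exponent}[field]
    return lanes * width


for ctx in FIELD_WIDTHS:
    print(ctx, total_bits(ctx, "mantissa"), total_bits(ctx, "exponent"))
```
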
`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

A minor wrinkle which emerges from deeper analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared under *some* circumstances, and the sub-partitions need to
fit onto power-of-two boundaries, in order to allow straight
wire-connections rather than allowing the SimdSignal to be
arbitrary-sized (compact).  On shallow inspection this would seem to
imply large unused sub-partitions (padding partitions), but the gates
for these can in fact be eliminated with a "blanking" mask, created
from static analysis of the SimdShape context.

Example:

* all 32-bit and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values are to be truncated to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-of-two boundary.
In the diagram, `N.W` denotes element number N, W bits wide:

          |31|  |  |24|     16|15|  |   8|7     0 |
    32bit |           |          |  | 1.11        |
    16bit |     | 2.11        |  |  | 1.11        |
    8bit  |  |  4.5   | 3.5   |  | 2.5   | | 1.5  |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also include column
guidelines to delineate the start and end points:

          |31|  |  |24|     16|15|  |   8|7     0 |
          |31|28|26|24| |20|16|15|12|10|8| |4   0 |
    32bit | x| x| x|  |      x| x| x|10 ....    0 |
    16bit | x| x|26    ... 16 | x| x|10 ....    0 |
    8bit  | x|28 .. 24|  20.16| x|12 .. 8|x|4.. 0 |
    unused  x                   x

Thus we deduce that we *actually* need breakpoints at *nine* positions,
and that the unused portions common to **all** cases can be identified
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

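The deduction above can be reproduced mechanically. The following is a minimal sketch, assuming a 32-bit overall width and the three cases from the example (one 11-bit element, two 11-bit elements, four 5-bit elements), with each element starting at the bottom of its power-of-two lane. The function names are hypothetical and this is not the layout code from the linked experiment. Breakpoints are reported as element start bits (other than bit 0) and element last bits, matching the column guidelines in the diagram.

```python
TOTAL_WIDTH = 32
# case name -> (number of lanes, used bits per element), from the example
CASES = {"32bit": (1, 11), "16bit": (2, 11), "8bit": (4, 5)}


def element_ranges(lanes, elem_width, total_width=TOTAL_WIDTH):
    """(first_bit, last_bit) of each element; lane i starts at i * lane_width."""
    lane_width = total_width // lanes
    return [(i * lane_width, i * lane_width + elem_width - 1)
            for i in range(lanes)]


def breakpoints(cases):
    """Every element start (other than bit 0) and every element's last bit."""
    points = set()
    for lanes, elem_width in cases.values():
        for first, last in element_ranges(lanes, elem_width):
            points.update({first, last})
    points.discard(0)
    return sorted(points)


def blanking_bits(cases, total_width=TOTAL_WIDTH):
    """Bit positions that no case ever uses: candidates for the blanking mask."""
    used = set()
    for lanes, elem_width in cases.values():
        for first, last in element_ranges(lanes, elem_width):
            used.update(range(first, last + 1))
    return sorted(set(range(total_width)) - used)


print(breakpoints(CASES))    # [4, 8, 10, 12, 16, 20, 24, 26, 28]
print(blanking_bits(CASES))  # [13, 14, 15, 29, 30, 31]
```
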
Also, in order to save gates: in the example above there are only three
cases (32 bit, 16 bit, 8 bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partition points (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases are required, where in fact
there are only three.

These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition, a "convenience" emerged from technical discussions
as desirable to have: it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`.

    >>> XLEN = SimdShape(64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

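One way such arithmetic could behave is sketched below with a self-contained stand-in class. `ToySimdShape` and its `lane_counts` attribute are hypothetical names chosen for illustration. The class does not derive from `ast.Shape` and is not the real `SimdShape`; it only shows the idea that an operator produces a new shape whose `width` changes while signedness and partition context carry over.

```python
class ToySimdShape:
    """Stand-in for SimdShape: a width, a signedness and a partition context."""

    def __init__(self, width, signed=False, lane_counts=(1, 2, 4, 8)):
        self.width = width
        self.signed = signed
        self.lane_counts = tuple(lane_counts)   # the partition context

    def _with_width(self, width):
        # Arithmetic changes only the width; signedness and context carry over.
        return ToySimdShape(width, self.signed, self.lane_counts)

    def __floordiv__(self, divisor):
        return self._with_width(self.width // divisor)

    def __mul__(self, factor):
        return self._with_width(self.width * factor)

    def __add__(self, extra):
        return self._with_width(self.width + extra)


XLEN = ToySimdShape(64, signed=True)
x2 = XLEN // 2
print(x2.width, x2.signed, x2.lane_counts)   # 32 True (1, 2, 4, 8)
```
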
With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace
    from nmigen import Const, Module, Signal

    # scalar context (a bare object() cannot hold attributes)
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal         # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    # pseudocode: element layouts 1x64, 2x32, 4x16 and 8x8
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})
    sdc.SigKls = SimdSignal     # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code switching context at compile time
    m = Module()
    with ctx:   # pseudocode: ctx is assumed to act as a context manager
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
    m.d.comb += x.eq(Const(3))

An interesting practical requirement emerges from attempting to use
SimdSignal, one that affects the way SimdShape works.  The register files
are 64 bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR).  Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However, when it comes to intermediary processing (partial computations),
the intermediary Signals can and will be required to have a certain
fixed width *regardless* of, and unrelated to, the fixed 64-bit
width of the register file source or destination.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
    m.d.comb += b.eq(x > y)
    with m.If(b):
        ...

This code is obvious for Scalar behaviour. For SIMD, however, the
elwidths are declared as `1x64, 2x32, 4x16, 8x8`, so whilst each
*element* of `b` is 1 bit, 8x8 mode performs a total of eight
comparisons of eight parallel 8-bit values. There correspondingly
needs to be **eight** such element bits in `b` in order to store
up to eight 8-bit comparison results.

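The need for one result bit per element can be seen in a plain-Python model of the comparison. This is a sketch on ordinary integers, not nmigen and not how SimdSignal implements `>`. The helper names are hypothetical, and unsigned lanes packed lowest-lane-first into a 64-bit value are assumed.

```python
def lanes(value, lane_count, total_width=64):
    """Split an integer into lane_count equal unsigned lanes, lowest lane first."""
    lane_width = total_width // lane_count
    mask = (1 << lane_width) - 1
    return [(value >> (i * lane_width)) & mask for i in range(lane_count)]


def simd_greater_than(x, y, lane_count, total_width=64):
    """Unsigned per-lane x > y; bit i of the result belongs to lane i."""
    result = 0
    pairs = zip(lanes(x, lane_count, total_width),
                lanes(y, lane_count, total_width))
    for i, (a, b) in enumerate(pairs):
        result |= int(a > b) << i
    return result


x = 0x0102030405060708
y = 0x0801030305070609
for lane_count in (1, 2, 4, 8):
    # The result occupies lane_count bits: 1 for 1x64, up to 8 for 8x8.
    print(lane_count, format(simd_greater_than(x, y, lane_count), "08b"))
```
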
Another example would be a simple test of the first *nibble* of
the data:

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
    m.d.comb += y.eq(x[0:4])    # bits 0..3: a Python slice excludes its end index
    ...

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all. When the context is
set to 8 elements of 8 bits each (8x8), y only needs to store 8 elements
of 4 bits each, i.e. a maximum of 32 bits in total.

If y were declared as 64 bit, this would indicate that the actual
elements were at least 8 bits long, and if y were then used as a
shift input it might produce the wrong calculation, because the
actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required but *two*:

* at the element level
* at the width of the entire SIMD signal
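The two methods can be contrasted with a short sketch. The function names `from_total_width` and `from_element_width` are hypothetical and are not the SimdShape constructor interface. They only tabulate, for each of the 1x64, 2x32, 4x16 and 8x8 layouts, what fixing one of the two widths implies for the other.

```python
LANE_COUNTS = (1, 2, 4, 8)   # 1x64, 2x32, 4x16, 8x8


def from_total_width(total_width, lane_counts=LANE_COUNTS):
    """Fix the overall width; the element width then depends on the lane count."""
    return {n: {"element": total_width // n, "total": total_width}
            for n in lane_counts}


def from_element_width(element_width, lane_counts=LANE_COUNTS):
    """Fix the element width; the overall width then depends on the lane count."""
    return {n: {"element": element_width, "total": element_width * n}
            for n in lane_counts}


print(from_total_width(64)[8])     # {'element': 8, 'total': 64}
print(from_element_width(4)[8])    # {'element': 4, 'total': 32}
print(from_element_width(1)[8])    # {'element': 1, 'total': 8}
```
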