# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

# Requirements Analysis

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual
lengths on a per-mask basis and to "upcast" back to a SimdSignal, in
exactly the same way that C++ virtual base class upcasting works when
RTTI (Run Time Type Information) is available.

By deriving from `ast.Shape` both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
[[dynamic_simd]].  This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 52 bits (excluding guard,
  round and sticky bits)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 7 bits.

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

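The widths listed above can be tabulated to make the non-power-of-two
requirement concrete. The following is a minimal sketch in plain Python,
using the stored fraction widths of the IEEE754 binary formats (52, 23, 10)
and of bfloat16 (7). The names `FP_CONTEXTS`, `total_bits` and
`is_power_of_two` are hypothetical and not part of any Libre-SOC API.

```python
# Hypothetical table: context name -> (lanes, mantissa bits, exponent bits).
# Mantissa widths are stored fraction bits, excluding guard/round/sticky.
FP_CONTEXTS = {
    "1xFP64": (1, 52, 11),
    "2xFP32": (2, 23, 8),
    "4xFP16": (4, 10, 5),
    "4xBF16": (4, 7, 8),
}

def total_bits(lanes, width):
    """Bits needed to hold `lanes` elements of `width` bits, packed compactly."""
    return lanes * width

def is_power_of_two(n):
    """True when n is an exact power of two."""
    return n > 0 and (n & (n - 1)) == 0

for name, (lanes, mant, exp) in FP_CONTEXTS.items():
    print(f"{name}: mantissa {lanes}x{mant} = {total_bits(lanes, mant)} bits, "
          f"exponent {lanes}x{exp} = {total_bits(lanes, exp)} bits, "
          f"mantissa width is power of two: {is_power_of_two(mant)}")
```

None of the mantissa widths is a power of two, and only the 8-bit exponents
are, which is why a single fixed subdivision of the Signal is not enough.
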
`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

A minor wrinkle which emerges from deep analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared under *some* circumstances, and the sub-partitions need to
fit onto power-of-two boundaries, in order to allow straight
wire-connections rather than allowing the SimdSignal to be
arbitrary-sized (compact).  On shallow inspection this would seem to
imply large unused sub-partitions (padding partitions), but the gates
for these can in fact be eliminated with a "blanking" mask, created
from static analysis of the SimdShape context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bit
* all 8-bit values to 5-bit

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-2 boundary:

          |31|  |  |24|     16|15|  |   8|7     0 |
    32bit |           |          |  | 1.11        |
    16bit |     | 2.11        |  |  | 1.11        |
    8bit  |  |  4.5   | 3.5   |  | 2.5   | | 1.5  |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also including column
guidelines to delineate the start and endpoints:

          |31|  |  |24|     16|15|  |   8|7     0 |
          |31|28|26|24| |20|16|15|12|10|8| |4   0 |
    32bit | x| x| x|  |      x| x| x|10 ....    0 |
    16bit | x| x|26    ... 16 | x| x|10 ....    0 |
    8bit  | x|28 .. 24|  20.16| x|12 .. 8|x|4.. 0 |
    unused  x                   x

Thus we deduce that we *actually* need breakpoints at *nine* positions,
and that unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

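The static analysis described above can be sketched in a few lines of plain
Python. This is an illustration only, under the assumption that each mode is
described by a lane count and an element width; the names `MODES`,
`used_bits`, `breakpoints` and `blanking_mask` are hypothetical. Note that the
sketch records the first bit *past* each element rather than its top bit, so
the positions differ by one from the column guides in the table, but the count
is the same.

```python
# Hypothetical description of the example: mode -> (lanes, element width).
# Each lane starts on a boundary of (total width // lanes).
TOTAL_WIDTH = 32
MODES = {"32bit": (1, 11), "16bit": (2, 11), "8bit": (4, 5)}

def used_bits(lanes, elwidth, total=TOTAL_WIDTH):
    """Set of bit positions occupied by real (non-padding) element bits."""
    stride = total // lanes
    return {lane * stride + bit
            for lane in range(lanes) for bit in range(elwidth)}

def breakpoints(modes, total=TOTAL_WIDTH):
    """Bit positions where some element starts or stops in at least one mode."""
    points = set()
    for lanes, elwidth in modes.values():
        stride = total // lanes
        for lane in range(lanes):
            points.add(lane * stride)            # element start
            points.add(lane * stride + elwidth)  # first bit past the element
    return sorted(points - {0, total})

def blanking_mask(modes, total=TOTAL_WIDTH):
    """Bits that are padding in every mode, so need no computational gates."""
    used = set().union(*(used_bits(l, w, total) for l, w in modes.values()))
    return sorted(set(range(total)) - used)

print(breakpoints(MODES))    # [5, 8, 11, 13, 16, 21, 24, 27, 29]
print(blanking_mask(MODES))  # [13, 14, 15, 29, 30, 31]
```

The nine breakpoints and the two blank regions (bits 13-15 and 29-31)
correspond to the "x" columns of the `unused` row in the table.
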
There is a further gate saving: the example above has only three
cases (32 bit, 16 bit, 8 bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partition points (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases were required, where in fact
there are only three.

These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition, a "convenience" emerged from technical discussions as
desirable to have: it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`.

    >>> XLEN = SimdShape(64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

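One way such arithmetic might be arranged is sketched below. `ToyShape` is a
purely illustrative stand-in and is **not** the real `SimdShape`: its
constructor arguments, the `lanes` mapping and the `element_widths` helper are
all assumptions made for this example.

```python
class ToyShape:
    """Illustrative stand-in, NOT the real SimdShape: it records an overall
    width, signedness, and a hypothetical {elwidth id: lane count} mapping."""

    def __init__(self, width, signed=False, lanes=None):
        self.width = width
        self.signed = signed
        self.lanes = dict(lanes or {})

    def _derive(self, width):
        # Arithmetic acts on the overall width; sign and lane counts carry over.
        return ToyShape(width, self.signed, self.lanes)

    def __floordiv__(self, divisor):
        return self._derive(self.width // divisor)

    def __mul__(self, factor):
        return self._derive(self.width * factor)

    def __add__(self, extra):
        return self._derive(self.width + extra)

    def element_widths(self):
        """Per-elwidth element size when the width is split evenly."""
        return {key: self.width // count for key, count in self.lanes.items()}


XLEN = ToyShape(64, signed=True, lanes={0: 1, 1: 2, 2: 4, 3: 8})
x2 = XLEN // 2
print(x2.width, x2.signed)       # 32 True
print(x2.element_widths())       # {0: 32, 1: 16, 2: 8, 3: 4}
```

Returning a new object from every operator keeps the original shape
unchanged, so one `XLEN` can safely be reused to derive many Signals.
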
With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace
    from nmigen import Signal, Module, Const
    # SimdShape and SimdSignal come from the dynamic SIMD library
    # (import omitted here)

    # scalar context (a real context also needs __enter__/__exit__
    # to be usable in the "with" statement below)
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal         # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})  # pseudocode partition set
    sdc.SigKls = SimdSignal     # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code switching context at compile time
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
    m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal, that affects the way that SimdShape works.  The register files
are 64 bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR).  Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However when it comes to intermediary processing (partial computations)
those intermediary Signals can and will be required to be a certain
fixed width *regardless* of, and having nothing to do with, the register
file source or destination 64 bit fixed width.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
    m.d.comb += b.eq(x == y)
    with m.If(b):
        ...

This code is obvious for Scalar behaviour.  For SIMD, however, the
elwidths are declared as `1x64, 2x32, 4x16, 8x8`, so whilst each
*element* of `b` is 1 bit, there correspondingly need to be **eight**
such element bits in order to store the results of up to eight
parallel comparisons of 8-bit SIMD values.  Exactly how that
comparison works is described in [[dynamic_simd/eq]].

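The need for one result bit per element can be seen in a plain-integer model
of the comparison. This is only a behavioural sketch: the function `simd_eq`
and its packing of results into the low bits are assumptions for
illustration, and do not reproduce the hardware described in
[[dynamic_simd/eq]].

```python
def simd_eq(x, y, lanes, total=64):
    """Compare x and y lane by lane.

    Bit i of the result is 1 when lane i of x equals lane i of y.
    Plain-integer model only, not the partitioned hardware comparator.
    """
    elwidth = total // lanes
    mask = (1 << elwidth) - 1
    result = 0
    for lane in range(lanes):
        shift = lane * elwidth
        if ((x >> shift) & mask) == ((y >> shift) & mask):
            result |= 1 << lane
    return result


x = 0x1122334455667788
y = 0x1122334455660088      # differs from x only in byte 1
for lanes in (1, 2, 4, 8):
    print(f"{lanes} lane(s): {simd_eq(x, y, lanes):0{lanes}b}")
```

With eight lanes the result needs eight bits (here `11111101`), whereas the
single 64-bit lane needs only one, which is why `b` cannot simply be a 1-bit
Signal in the SIMD context.
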
Another example would be a simple test of the first *nibble* of
the data.

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
    m.d.comb += y.eq(x[0:4])    # bits 0-3: the slice end is exclusive
    ...

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when the elwidth
selects QTY 8 of 8-bit elements, then y will only need to store QTY 8 of
4-bit quantities, i.e. only a maximum of 32 bits total.

If y was declared as 64 bit this would indicate that the actual
elements were at least 8 bit long, and if that was then used as a
shift input it might produce the wrong calculation because the
actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required but *two*:

* at the element level
* at the width of the entire SIMD signal

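The difference between the two methods can be shown with simple arithmetic.
The sketch below is illustrative only; `LANES`, `widths_from_total` and
`totals_from_element` are hypothetical names, not part of the SimdShape API.

```python
# Lane count for each elwidth case used in this page.
LANES = {"1x64": 1, "2x32": 2, "4x16": 4, "8x8": 8}

def widths_from_total(total, lanes=LANES):
    """Fixed overall width: each element width is derived from it."""
    return {name: total // count for name, count in lanes.items()}

def totals_from_element(elwidth, lanes=LANES):
    """Fixed element width: the overall width needed is derived per case."""
    return {name: elwidth * count for name, count in lanes.items()}

print(widths_from_total(64))     # {'1x64': 64, '2x32': 32, '4x16': 16, '8x8': 8}
print(totals_from_element(4))    # {'1x64': 4, '2x32': 8, '4x16': 16, '8x8': 32}
print(max(totals_from_element(4).values()))   # 32
```

The first call matches the 64-bit register-file Signals, the second the
nibble example, where at most 32 bits are ever needed. The boolean example is
the same calculation with an element width of 1, giving at most 8 bits.
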
With this background and context in mind the requirements can be determined.

# Requirements

SimdShape needs:

* to derive from nmigen ast.Shape in order to provide the overall
  width and whether it is signed or unsigned.  However the
  overall width is not necessarily hard-set but may be calculated.
* to provide a means to specify the number of partitions in each of
  an arbitrarily-named set.  For convenience and by convention
  from SVP64 this set is called "elwidths".
* to support a range of sub-signal divisions (element widths),
  with an option either to set each element width
  explicitly or to allow each width to be computed from the
  overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability
  that automatically computes a new SimdShape, adjusting width
  and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

* allows a switch-case set to be created, in which
  the switch statement is the elwidth parameter and
  the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)
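
A software model of such an adapter, reusing the 32-bit example from the
Requirements Analysis, might look as follows. It is a sketch under stated
assumptions: the elwidth encoding (0, 1, 2), the names `CASES`, `lane_edges`,
`ALL_POINTS`, `case_table` and `select`, and the choice of `True` to mean "an
element boundary sits at this point" are all hypothetical.

```python
TOTAL_WIDTH = 32
# Hypothetical elwidth encoding -> (lanes, element width), from the example.
CASES = {0: (1, 11), 1: (2, 11), 2: (4, 5)}

def lane_edges(lanes, elwidth, total=TOTAL_WIDTH):
    """Bit positions where an element starts or ends in one case."""
    stride = total // lanes
    edges = set()
    for lane in range(lanes):
        edges.update((lane * stride, lane * stride + elwidth))
    return edges - {0, total}

# Every partition point needed by at least one case.
ALL_POINTS = sorted(set().union(*(lane_edges(*c) for c in CASES.values())))

def case_table(cases=CASES, points=ALL_POINTS):
    """elwidth value -> tuple of booleans, one per partition point."""
    return {sel: tuple(p in lane_edges(*c) for p in points)
            for sel, c in cases.items()}

def select(elwidth, table):
    """Software stand-in for the hardware switch on elwidth."""
    return table[elwidth]

table = case_table()
print(len(ALL_POINTS), len(table))   # 9 3
print(select(1, table))
```

Only three rows exist in the table even though there are nine partition
points, which is the saving over the 2^9 combinations noted earlier.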