# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

# Requirements Analysis

The dynamic partitioned SimdSignal class is based on the logical extension
of the full capabilities of the nmigen language behavioural constructs to
a parallel dimension, with zero changes in that behaviour as a result of
that parallelism.

Logically, therefore, even the concept of ast.Shape should be extended
solely to express and define the extent of the parallelism, and SimdShape
should in no way attempt to change the expected behaviour of the Shape
class from which it derives.

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual lengths
on a per-mask basis and to "upcast"
back to a SimdSignal, in exactly the same way that C++ virtual base
class upcasting works when RTTI (Run Time Type Information) is enabled.

By deriving from `ast.Shape` both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

Also, by fitting on top of existing nmigen concepts, and by defining
`SimdShape.width` to be equal to and synonymous with `Shape.width`,
downcasting becomes possible and practical. *(An alternative proposal
to redefine "width" to be in terms of the multiple options, i.e.
context-dependent on the partition setting, is unworkable because it
prevents downcasting to e.g. `Signal`)*

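A brief sketch of that downcast (illustrative only: `vec_el_counts` and
`fixed_width` are the constructor parameters described in the API section
below, and `elwid` is assumed to be the elwidth selector Signal). Because
a SimdShape *is* a Shape of the overall width, it can be handed straight
to a plain scalar Signal constructor:

    from nmigen import Signal

    # a SimdShape behaves as a plain Shape of its overall width...
    xlen_shape = SimdShape(elwid, vec_el_counts=vec_el_counts, fixed_width=64)

    # ...so a scalar (non-SIMD) Signal can be constructed directly from it,
    # exactly as if Shape(64, signed=False) had been passed in
    scalar = Signal(xlen_shape)
    assert len(scalar) == 64
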
The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 54 bits (excluding guard,
  round and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 5 bits

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

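To give a flavour of how that information might be expressed, below is a
sketch (not a definitive API: it simply reuses the `vec_op_widths` and
`vec_el_counts` parameter names from the API section further down, and
assumes `elwid` is the 2-bit elwidth selector Signal) declaring the
per-context mantissa widths listed above:

    # per-elwidth mantissa element widths and element counts (a sketch)
    mantissa_shape = SimdShape(elwid,
            vec_op_widths={0b00: 54,   # 1xFP64: one 54-bit mantissa
                           0b01: 23,   # 2xFP32: two 23-bit mantissas
                           0b10: 10,   # 4xFP16: four 10-bit mantissas
                           0b11: 5},   # 4xBF16: four 5-bit mantissas
            vec_el_counts={0b00: 1, 0b01: 2, 0b10: 4, 0b11: 4},
            signed=False)

An exponent SimdShape would be declared in exactly the same way, using
the widths from the second list.
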
A minor wrinkle which emerges from deep analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared under *some* circumstances, with
the sub-partitions fitted onto power-of-two boundaries, in order to allow
straight wire-connections rather than allowing the SimdSignal to be
arbitrary-sized (compact). Although on shallow inspection this
would initially seem to imply large unused
sub-partitions (padding partitions), those gates can in fact be eliminated
with a "blanking" mask, created from static analysis of the SimdShape
context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values are to be truncated to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-of-two boundary:

          |31|  |  |24|   16|15|  |  8|7    0 |
    32bit |  |  |  |            1.11          |
    16bit |  |   2.11     |  |  |    1.11     |
    8bit  |  | 4.5 | 3.5  |  | 2.5  |  | 1.5  |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also including column
guidelines to delineate the start and end points:

           |31|  |  |24|   16|15|  |  8|7    0 |
           |31|28|26|24|  |20|16|15|12|10|8|  |4   0 |
    32bit  | x| x| x|  | x| x| x|    10 .... 0       |
    16bit  | x| x|26 ...  16   | x| x|    10 .... 0  |
    8bit   | x|28 .. 24| 20.16 | x|12 .. 8|x| 4 .. 0 |
    unused    x                   x

Thus, we deduce, we *actually* need breakpoints at *nine* positions,
and that unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

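The static analysis itself is straightforward enough that it can be
sketched in a few lines of plain Python. The snippet below is purely
illustrative (it is not the actual layout() code from
layout_experiment.py): it reproduces the example above and finds the bit
positions that no case ever uses, i.e. the "blanking" positions:

    # (element width after truncation, element count, per-element stride)
    cases = {
        "32bit": (11, 1, 32),
        "16bit": (11, 2, 16),
        "8bit":  ( 5, 4,  8),
    }

    used_per_case = []
    for elwidth, count, stride in cases.values():
        used = set()
        for i in range(count):
            start = i * stride   # each element starts on a power-of-2 boundary
            used.update(range(start, start + elwidth))
        used_per_case.append(used)

    # bits used by *no* case at all: these form the "blanking" mask
    blank = set(range(32)) - set.union(*used_per_case)
    print(sorted(blank))         # [13, 14, 15, 29, 30, 31]

Those positions correspond to the two "x" columns marked "unused" in the
table above.
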
Also, in order to save gates: in the example above there are only three
cases (32 bit, 16 bit, 8 bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partitions (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases would be required, where in fact
there are only three.

These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable
to have, which is that it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`.

    >>> XLEN = SimdShape(fixed_width=64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace

    # scalar context (SimpleNamespace so that attributes may be set on it)
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal                            # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})  # pseudocode: elwidth partition counts
    sdc.SigKls = SimdSignal                        # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code, switching context at compile time
    # (ctx is assumed to also act as a scope context manager, cf. SimdScope below)
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal, which affects the way that SimdShape works. The register files
are 64 bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However when it comes to intermediary processing (partial computations)
those intermediary Signals can and will be required to be a certain
fixed width *regardless*, having nothing to do with the register
file source or destination 64-bit fixed width.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x == y)
        with m.If(b):
            ....

This code is obvious for Scalar behaviour, but for SIMD, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`, whilst
the *elements* are 1 bit (in order to make a total of QTY 8
comparisons of 8 parallel SIMD 8-bit values), there correspondingly
needs to be **eight** such element bits in order to store the results
of up to eight 8-bit comparisons. Exactly how that comparison works
is described in [[dynamic_simd/eq]].

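A quick back-of-the-envelope check of that storage requirement (plain
Python, purely illustrative; the elwidth keys follow the `0b00`-`0b11`
convention used in the examples further down):

    vec_el_counts = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}  # 1x64, 2x32, 4x16, 8x8
    el_width = 1                      # one boolean result per element
    per_case = {ew: count * el_width for ew, count in vec_el_counts.items()}
    print(per_case)                   # {0: 1, 1: 2, 2: 4, 3: 8}
    print(max(per_case.values()))     # 8 bits needed overall to hold 'b'
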
Another example would be a simple test of the first *nibble* of
the data.

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4])    # first nibble (4 bits) of each element
        ....

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when y is set
to be QTY 8 of 8-bit elements, then y will only need to store QTY 8 of
4-bit quantities, i.e. only a maximum of 32 bits total.

If y was declared as 64 bit this would indicate that the actual
elements were at least 8 bits long, and if that was then used as a
shift input it might produce the wrong calculation because the
actual shift amount was only supposed to be 4 bits.

Thus not one but *two* methods of setting widths are required,
as sketched below:

* at the element level
* at the width of the entire SIMD signal

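The following sketch shows what those two methods might look like side by
side. It is illustrative only: the parameter names `fixed_width`,
`vec_el_counts` and `vec_op_widths` are the ones used in the API section
below, and `elwid` is assumed to be the 2-bit elwidth selector Signal:

    el_counts = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}   # 1x64, 2x32, 4x16, 8x8

    # (1) width set for the entire SIMD signal: the per-elwidth element
    #     widths (64, 32, 16, 8) are computed from the element counts
    x_shape = SimdShape(elwid, vec_el_counts=el_counts, fixed_width=64)

    # (2) width set at the element level: every element is a 4-bit nibble,
    #     and the overall width is computed from the element counts instead
    y_shape = SimdShape(elwid, vec_el_counts=el_counts,
                        vec_op_widths={ew: 4 for ew in el_counts})

With the second form, `y` from the nibble example needs at most 8x4=32
bits rather than the full 64.
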
With this background and context in mind, the requirements can be
determined.

# Requirements

SimdShape needs:

* to derive from nmigen ast.Shape in order to provide the overall
  width and whether it is signed or unsigned. However the
  overall width is not necessarily hard-set but may be calculated
* to provide a means to specify the number of partitions in each of
  an arbitrarily-named set. For convenience, and by convention
  from SVP64, this set is called "elwidths".
* to support a range of sub-signal divisions (element widths),
  and for there to be an option to either set each element width
  explicitly or to allow each width to be computed from the
  overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability
  that automatically computes a new SimdShape, adjusting width
  and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

* allows a switch-case set to be created
* the switch statement is the elwidth parameter
* the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)

# SimdShape API

SimdShape needs:

* a constructor taking the following arguments:
  - (mandatory) an elwidth Signal
  - (optional) an integer vector width or a dictionary of vector widths
    (the keys to be the "elwidth")
  - (mandatory) a dictionary of "partition counts":
    the keys to again be the "elwidth" and the values
    to be the number of Vector Elements at that elwidth
  - (optional) a "fixed width" which, if given, shall
    auto-compute the dictionary of Vector Widths
  - (optional) a "signed" boolean argument which defaults
    to False
* to derive from Shape, where the (above) constructor passes it
  the following arguments:
  - the signed argument. This is simply passed in, unmodified.
  - a width argument. This will be **either** the fixed_width
    parameter from the constructor (if given) **or** it will
    be the **calculated** value sufficient to store all partitions.
* a suite of operators (`__add__`, etc.) that shall take simple
  integer arguments and perform the computations on *every*
  one of the dictionary of Vector widths (examples below)
* a "recalculate" function (currently known as layout() in
  layout_experiment.py) which creates information required
  by PartitionedSignal.
* a function which computes and returns a suite of PartitionPoints
  as well as an "Adapter" instance, for use by PartitionedSignal

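Pulling the list above together, the constructor could plausibly look
something like the following sketch. The parameter names are illustrative
only, and `calculate_width` is a hypothetical helper standing in for the
real calculation (layout() in layout_experiment.py is the reference):

    from nmigen.hdl.ast import Shape

    class SimdShape(Shape):
        def __init__(self, elwid,          # mandatory: the elwidth Signal
                     vec_el_counts,        # mandatory: element count per elwidth key
                     vec_op_widths=None,   # optional: element width per elwidth key
                     fixed_width=None,     # optional: overall width (auto-computes
                                           #           the element widths if given)
                     signed=False):        # defaults to False
            # overall Shape.width is either the given fixed_width or the
            # calculated minimum sufficient to store all partitions
            # (calculate_width is a hypothetical placeholder, not real code)
            width = fixed_width if fixed_width is not None else \
                    calculate_width(vec_el_counts, vec_op_widths)
            super().__init__(width, signed)
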
Examples of the operator usage:

    x = SimdShape(vec_op_widths={0b00: 64, 0b01: 32, 0b10: 16})
    y = x + 5
    print(y.vec_op_widths)
    {0b00: 69, 0b01: 37, 0b10: 21}

In other words, when requesting 5 to be added to x, every single
one of the Vector Element widths had 5 added to it. If the
partition counts were 2x for 0b00 and 4x for 0b01 then this
would create 2x 69-bit and 4x 37-bit Vector Elements.

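A sketch of how such an operator could behave (not the actual
implementation; the attribute names mirror the constructor sketch above):
the integer is applied to every per-elwidth element width and a new
SimdShape is returned with the partition counts left untouched:

    def __add__(self, other):
        # 'other' is a simple integer: add it to every element width
        new_widths = {elwid: w + other
                      for elwid, w in self.vec_op_widths.items()}
        return SimdShape(self.elwid, self.vec_el_counts,
                         vec_op_widths=new_widths, signed=self.signed)
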
# Adapter API

The Adapter API performs a specific job of letting SimdSignal
know the relationship between the supported "configuration"
options that a SimdSignal must have, and the actual PartitionPoints
bits that must be set or cleared *in order* to have the SimdSignal
cut itself into the required sub-sections. This information
comes *from* SimdShape but the adapter is not part *of* SimdShape
because there can be more than one type of Adapter Mode, depending
on SimdShape input parameters.

    class PartType: # TODO decide name
        def __init__(self, psig):
            self.psig = psig
        def get_mask(self):
            ...
        def get_switch(self):
            ...
        def get_cases(self):
            ...
        @property
        def blanklanes(self):
            ...

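As a rough illustration of how SimdSignal might consume that interface
(hypothetical wiring; the real code lives inside
SimdSignal/PartitionedSignal): `get_switch()` supplies the elwidth Signal
for a Switch statement, `get_cases()` supplies the per-elwidth
PartitionPoints settings, and `blanklanes` reports the padding partitions:

    # ppoints is assumed to be a Signal holding the partition-point bits
    adapter = PartType(simd_signal)
    with m.Switch(adapter.get_switch()):        # switch on the elwidth
        for elwid, pp_value in adapter.get_cases().items():
            with m.Case(elwid):                 # one case per supported elwidth
                m.d.comb += ppoints.eq(pp_value)
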
# SimdShape arithmetic operators

Rudimentary arithmetic operations are required in order to perform
tricks such as:

    m = Module()
    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(shape)
        b = s.Signal(shape * 2)
        o = s.Signal(shape * 3)
        m.d.comb += o.eq(Cat(a, b))

Here `shape * 2` and `shape * 3` scale the element widths at every
elwidth, so that the concatenation of `a` and `b` fits exactly into
`o` in all cases.