# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;hb=HEAD)
* [SimdShape source](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/partsig.py;hb=HEAD)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

# Requirements Analysis

The dynamically-partitioned SimdSignal class is based on the logical
extension of the full capabilities of the nmigen language's behavioural
constructs into a parallel dimension, with zero change in that behaviour
as a result of the parallelism.

Logically, therefore, even the concept of ast.Shape should be extended
solely to express and define the extent of the parallelism, and
SimdShape should in no way attempt to change the expected behaviour
of the Shape class from which it derives.

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual
lengths on a per-mask basis and to "upcast" back to a SimdSignal, in
exactly the same way that C++ virtual base class upcasting works when
RTTI (Run Time Type Information) is available.

By deriving from `ast.Shape`, both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility to
additionally define lengths for each mask basis. This is best illustrated
with an example.

Also, by fitting on top of existing nmigen concepts, and by defining
`SimdShape.width` to be equal to and synonymous with `Shape.width`,
downcasting becomes possible and practical. *(An alternative proposal
to redefine "width" to be in terms of the multiple options, i.e.
context-dependent on the partition setting, is unworkable as it
prevents downcasting to e.g. `Signal`)*

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning,
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 54 bits (excluding guard,
  round and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are four mantissas of 5 bits.

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

A minor wrinkle which emerges from deep analysis is that under *some*
circumstances the overall available width (`Shape.width`) does in fact
need to be explicitly declared, and the sub-partitions need to fit onto
power-of-two boundaries, in order to allow straight wire-connections
rather than allowing the SimdSignal to be arbitrary-sized (compact).
Although on shallow inspection this would initially seem to imply
large unused sub-partitions (padding partitions), those gates can in
fact be eliminated with a "blanking" mask, created from static analysis
of the SimdShape context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-2 boundary:

           |31|  |  |24|    16|15|  |  8|7     0 |
    32bit  |  |  |  |  |      |  |  |   | 1.11   |
    16bit  |  |  | 2.11       |  |  |   | 1.11   |
    8bit   |  | 4.5 | 3.5     |  | 2.5  | | 1.5  |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also include column
guidelines to delineate the start and end points:

           |31|  |  |24|       16|15|  |   8|7      0 |
           |31|28|26|24|  |20|16|15|12|10|8 |  |4   0 |
    32bit  | x| x| x|  | x| x| x|10 ....              0 |
    16bit  | x| x|26 ... 16    | x| x|10 ....         0 |
    8bit   | x|28 .. 24| 20.16 | x|12 .. 8|x|4..      0 |
    unused          x                   x

Thus, we deduce, we *actually* need breakpoints at *nine* positions,
and the unused portions common to **all** cases can be deduced
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

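The breakpoint and blanking-mask deduction can be sketched in plain
Python. This is an illustrative model only (not the actual
layout_experiment.py code); the allocation table is transcribed from the
example above, and exclusive end-bit numbering is used here, whereas the
diagram labels the inclusive high bits:

```python
# Per-case allocations from the example above: {start_bit: sub_signal_width}.
# 32/16-bit values truncated to 11 bits, 8-bit values to 5 bits, each
# sub-signal starting on a power-of-two boundary within a 32-bit signal.
cases = {
    32: {0: 11},                      # 1x 11-bit
    16: {0: 11, 16: 11},              # 2x 11-bit
    8:  {0: 5, 8: 5, 16: 5, 24: 5},   # 4x 5-bit
}
WIDTH = 32

# breakpoints: every start/end position used by any case (excluding bit 0)
breakpoints = set()
for alloc in cases.values():
    for start, width in alloc.items():
        breakpoints |= {start, start + width}
breakpoints.discard(0)
print(sorted(breakpoints))  # nine positions, as deduced above

# blanking mask: bits unused in *every* case need no computational gates
used = set()
for alloc in cases.values():
    for start, width in alloc.items():
        used |= set(range(start, start + width))
blanking = sorted(set(range(WIDTH)) - used)
print(blanking)  # the two 100%-unused "x" column groups
```

Running this reports nine breakpoints and two unused bit-spans, matching
the "unused" row of the diagram.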
Also, in order to save gates: in the example above there are only three
cases (32-bit, 16-bit, 8-bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partitions (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases were required, when in fact
there are only three.

These facts also need to be communicated to both the SimdSignal
as well as the submodules implementing its core functionality:
the add operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable
to have, which is that it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`:

    >>> XLEN = SimdShape(fixed_width=64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    # scalar context (a plain class instance is used here because
    # a bare object() cannot have attributes set on it)
    class Ctx: pass
    scalarctx = scl = Ctx()
    scl.XLEN = 64
    scl.SigKls = Signal        # standard nmigen Signal
    # SIMD context
    simdctx = sdc = Ctx()
    sdc.XLEN = SimdShape(...)  # partitions: 1x64, 2x32, 4x16, 8x8
    sdc.SigKls = SimdSignal    # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code, switching context at compile time
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal, and it affects the way that SimdShape works. The register
files are 64-bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However, when it comes to intermediary processing (partial computations),
those intermediary Signals can and will be required to be a certain
fixed width *regardless*, having nothing to do with the 64-bit fixed
width of the register file source or destination.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x == y)
        with m.If(b):
            ....

This code is obvious for Scalar behaviour, but for SIMD, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`, then whilst
the *elements* are each 1 bit (in order to make a total of QTY 8
comparisons of 8 parallel SIMD 8-bit values), there correspondingly
need to be **eight** such element bits in order to store up to
eight 8-bit comparisons. Exactly how that comparison works
is described in [[dynamic_simd/eq]].

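As a plain-Python illustration (a model only, not SimdSignal itself, and
`simd_eq` is an invented name) of why the 8x8 case needs eight result
bits rather than one:

```python
# Model of an elementwise SIMD equality: QTY 8 of 8-bit comparisons
# produce QTY 8 of 1-bit results, packed into one 8-bit quantity.
def simd_eq(xs, ys):
    bits = 0
    for i, (a, b) in enumerate(zip(xs, ys)):
        bits |= int(a == b) << i
    return bits

result = simd_eq([1, 2, 3, 4, 5, 6, 7, 8],
                 [1, 0, 3, 0, 5, 0, 7, 0])
print(bin(result))  # 0b1010101 (elements 0, 2, 4, 6 matched)
```

One result bit per element is exactly what the SimdSignal boolean must
store, hence the 8-bit quantity.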
Another example would be a simple test of the first *nibble* of
the data:

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4])  # slice end is exclusive: bits 0-3
        ....

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when y is set
to be QTY 8 of 8-bit elements, then y will only need to store QTY 8
of 4-bit quantities, i.e. only a maximum of 32 bits in total.

If y were declared as 64-bit, this would indicate that the actual
elements were at least 8 bits long, and if that were then used as a
shift input it might produce the wrong calculation, because the
actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required, but *two*:

* at the element level
* at the width of the entire SIMD signal

With this background and context in mind, the requirements can be
determined.

# Requirements

SimdShape needs:

* to derive from nmigen ast.Shape in order to provide the overall
  width and whether it is signed or unsigned. However the
  overall width is not necessarily hard-set but may be calculated
* to provide a means to specify the number of partitions in each of
  an arbitrarily-named set. For convenience, and by convention
  from SVP64, this set is called "elwidths".
* to support a range of sub-signal divisions (element widths),
  and for there to be an option either to set each element width
  explicitly or to allow each width to be computed from the
  overall width and the number of partitions.
* to provide rudimentary arithmetic operator capability
  that automatically computes a new SimdShape, adjusting width
  and element widths accordingly.

Interfacing to SimdSignal requires an adapter that:

* allows a switch-case set to be created
* the switch statement is the elwidth parameter
* the case statements are the PartitionPoints
* identifies which partitions are "blank" (padding)

# SimdShape API

SimdShape needs:

* a constructor taking the following arguments:
  - (mandatory, from scope) an elwidth Signal
  - (optional) an integer vector width, or a dictionary of vector widths
    (the keys to be the "elwidth")
  - (mandatory, from scope) a dictionary of "partition counts":
    the keys to again be the "elwidth" and the values
    to be the number of Vector Elements at that elwidth
  - (optional) a "fixed width" which, if given, shall
    auto-compute the dictionary of Vector Widths
  - (mandatory) a "signed" boolean argument, which defaults
    to False
* to derive from Shape, where the (above) constructor passes it
  the following arguments:
  - the signed argument. This is simply passed in, unmodified.
  - a width argument. This will be **either** the fixed_width
    parameter from the constructor (if given) **or** it will
    be the **calculated** value sufficient to store all partitions.
* a suite of operators (`__add__`, etc.) that shall take simple
  integer arguments and perform the computations on *every*
  one of the dictionary of Vector widths (examples below)
* a "recalculate" function (currently known as layout() in
  layout_experiment.py) which creates information required
  by PartitionedSignal.
* a function which computes and returns a suite of PartitionPoints
  as well as an "Adapter" instance, for use by PartitionedSignal

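The fixed-width auto-computation described above can be sketched as
follows. This is a hypothetical model (the real constructor lives in
partsig.py, and `compute_vec_widths` is an invented name):

```python
# Given a fixed overall width and per-elwidth partition counts, derive
# the dictionary of Vector Element widths: width // count per elwidth.
def compute_vec_widths(fixed_width, part_counts):
    return {elwid: fixed_width // count
            for elwid, count in part_counts.items()}

print(compute_vec_widths(64, {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}))
# {0: 64, 1: 32, 2: 16, 3: 8}
```

This is the 1x64, 2x32, 4x16, 8x8 subdivision used throughout this page.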
Examples of the operator usage:

    x = SimdShape(vec_op_widths={0b00: 64, 0b01: 32, 0b10: 16})
    y = x + 5
    print(y.vec_op_widths)
    {0b00: 69, 0b01: 37, 0b10: 21}

In other words, when requesting 5 to be added to x, every single
one of the Vector Element widths had 5 added to it. If the
partition counts were 2x for 0b00 and 4x for 0b01 then this
would create 2x 69-bit and 4x 37-bit Vector Elements.

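A minimal model of how such an operator could be implemented
(illustrative only; `SimdShapeModel` is a stand-in name, not the
partsig.py class):

```python
# Stand-in for SimdShape's integer arithmetic: apply the operation to
# every Vector Element width in the vec_op_widths dictionary.
class SimdShapeModel:
    def __init__(self, vec_op_widths):
        self.vec_op_widths = dict(vec_op_widths)

    def __add__(self, other):
        assert isinstance(other, int)
        return SimdShapeModel({elwid: w + other
                               for elwid, w in self.vec_op_widths.items()})

x = SimdShapeModel({0b00: 64, 0b01: 32, 0b10: 16})
y = x + 5
print(y.vec_op_widths)  # {0: 69, 1: 37, 2: 21}
```

Note that a new shape is returned rather than the original being
modified in place, matching how nmigen Shapes are value-like objects.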
# Adapter API

The Adapter API performs a specific job of letting SimdSignal
know the relationship between the supported "configuration"
options that a SimdSignal must have, and the actual PartitionPoints
bits that must be set or cleared *in order* to have the SimdSignal
cut itself into the required sub-sections. This information
comes *from* SimdShape, but the adapter is not part *of* SimdShape
because there can be more than one type of Adapter Mode, depending
on SimdShape input parameters.

    class PartType:  # TODO decide name
        def __init__(self, psig):
            self.psig = psig
        def get_mask(self):
            ...
        def get_switch(self):
            ...
        def get_cases(self):
            ...
        @property
        def blanklanes(self):
            ...

# SimdShape arithmetic operators

Rudimentary arithmetic operations are required in order to perform
tricks such as:

    m = Module()
    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(shape)
        b = s.Signal(shape * 2)
        o = s.Signal(shape * 3)
        m.d.comb += o.eq(Cat(a, b))

as well as:

    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(shape)
        b = s.Signal(shape * 2)
        o2 = s.Signal(a.shape + b.shape)

and:

    with SimdScope(m, elwid, vec_el_counts) as s:
        shape = SimdShape(s, fixed_width=width)
        a = s.Signal(16)  # element width set to 16
        b = s.Signal(shape * 2)
        o2 = s.Signal(a.shape + b.shape)

From these examples we deduce what the arithmetic operators
have to cope with:

* RHS of simple integers
* RHS of another SimdShape

In both cases there are further subdivisions, because SimdShapes
can be created with either fixed_width priority or elwidths priority:

* fixed_width priority (vec_op_widths=None)
* elwidths priority (fixed_width=None)
* equal (no) priority (both are given)

## RHS simple integers

Although this is the simpler case, there are still three options:

* add/mul/sub etc. of an integer at fixed_width priority (vec_op_widths=None)
* add/mul/sub etc. of an integer at elwidths priority (fixed_width=None)
* add/mul/sub etc. of an integer at equal (no) priority (both are given)

The expected behaviour at fixed_width priority is that the arithmetic
operation should take place on the fixed width: adding 8 to a 64-bit
fixed-width SimdSignal should result in the overall fixed width being
extended to 72-bit, and for all partitions the new element-width within
each partition is newly-computed to be the maximum possible permitted
amount. Example:

* assume fixed_width=64 and partition_counts {1,2,4,8}.
* therefore the computed element sizes are: 1x64, 2x32, 4x16, 8x8
* assume that an add operation has requested to add 8 to the fixed width
* the new fixed_width is 72
* the newly-computed element sizes shall be: 1x72, 2x36, 4x18, 8x9

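The steps above can be sketched in plain Python (a hypothetical helper,
not the partsig.py implementation):

```python
# Fixed-width priority: the integer is added to the overall width, and
# the per-partition element sizes are recomputed as the maximum that
# fits (new width divided by the partition count).
def add_fixed_width(fixed_width, part_counts, n):
    new_fw = fixed_width + n
    return new_fw, {count: new_fw // count for count in part_counts}

print(add_fixed_width(64, (1, 2, 4, 8), 8))
# (72, {1: 72, 2: 36, 4: 18, 8: 9})
```

Note that for partition counts that do not divide the new width evenly,
integer division silently drops bits, which is why layout() needs to
check partition alignment (see the Summary below).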
The expected behaviour at element-width priority, on the other hand, is
that the arithmetic operation takes place on *each of the element-widths*.

Example:

* assume element_widths={16,16,10,12} and partition_counts {1,2,4,8}
* therefore the computed element sizes are: 1x16, 2x16, 4x10, 8x12
* therefore the computed overall width is the maximum of these amounts
  (8x12), which is a 96-bit wide Signal.
* assume that a subtract operation has requested to subtract 5 from the
  element_widths.
* the newly-computed element_widths becomes {16-5,16-5,10-5,12-5}, which
  is {11,11,5,7}
* therefore the computed element sizes are: 1x11, 2x11, 4x5, 8x7
* therefore the newly-computed overall width is the maximum of these
  amounts (8x7), which is a 56-bit quantity.

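The element-width-priority steps can likewise be sketched (again a
hypothetical helper name):

```python
# Elwidth priority: the integer op applies to each element width, and
# the overall width is recomputed as the maximum of width * count.
def sub_elwidths(elwidths, part_counts, n):
    new_widths = [w - n for w in elwidths]
    overall = max(w * c for w, c in zip(new_widths, part_counts))
    return new_widths, overall

print(sub_elwidths([16, 16, 10, 12], [1, 2, 4, 8], 5))
# ([11, 11, 5, 7], 56)
```

The 56-bit result comes from the 8x7 case dominating the other three.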
It is important to note that some arithmetic operations will not work
correctly if both fixed_width and element-widths are specified. Addition:

* assume fixed_width=64, element_widths {8,8,8,8} and
  partition_counts {1,2,4,8}.
* adding 8 to each element width creates a 16-bit element width for all
  partitions
* this creates a maximum expected width of 8x16, or 128 bits
* but the fixed_width of 64 plus 8 is only 72.
* therefore we must prohibit this operation when **both** fixed and
  elwidth are specified

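A sketch of the consistency check behind that prohibition (hypothetical
helper, illustrating the contradiction rather than prescribing the
actual implementation):

```python
# Dual-priority addition is ambiguous: the recomputed element widths
# imply an overall width that contradicts the adjusted fixed width.
def add_both(fixed_width, elwidths, part_counts, n):
    new_widths = [w + n for w in elwidths]
    implied = max(w * c for w, c in zip(new_widths, part_counts))
    if implied != fixed_width + n:
        raise ValueError("ambiguous: implied width %d != fixed width %d"
                         % (implied, fixed_width + n))
    return fixed_width + n, new_widths

try:
    add_both(64, [8, 8, 8, 8], [1, 2, 4, 8], 8)
except ValueError as e:
    print(e)  # implied width 128 != fixed width 72
```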
However for multiply it is fine, due to a property of multiplication:

* assume fixed_width=64, element_widths {8,8,8,8} and
  partition_counts {1,2,4,8}.
* multiply by an integer value of 2
* the new fixed_width is 2x64 = 128
* each element width is **also** multiplied by 2, to create {16,16,16,16}
* this creates partitions 1x16, 2x16, 4x16, 8x16
* the maximum is (not by coincidence) exactly 128 bits wide
* this matches perfectly the newly-calculated fixed_width

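The same consistency check passes for multiply, assuming (as in the
example above) that the element widths were originally derived from the
fixed width (hypothetical helper name):

```python
# Dual-priority multiply stays consistent: scaling the fixed width and
# every element width by the same factor implies the same overall width.
def mul_both(fixed_width, elwidths, part_counts, n):
    new_fw = fixed_width * n
    new_widths = [w * n for w in elwidths]
    implied = max(w * c for w, c in zip(new_widths, part_counts))
    assert implied == new_fw, "multiply of derived elwidths is never ambiguous"
    return new_fw, new_widths

print(mul_both(64, [8, 8, 8, 8], [1, 2, 4, 8], 2))
# (128, [16, 16, 16, 16])
```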
Given that left-shift is a type of multiply, a left-shift arithmetic
operator with an integer amount (applied equally to all element widths
and to the fixed_width) is also expected to work.

Divide, on the other hand (and right-shift), will only work (when
both elwidth and fixed-width are set) if there is
no rounding (no bits lost due to the division / right-shift).

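That no-bits-lost condition can be expressed as a simple divisibility
check (hypothetical helper name):

```python
# Dual-priority divide / right-shift is only safe when no bits are lost,
# i.e. the divisor divides the fixed width and every element width exactly.
def simd_div_ok(fixed_width, elwidths, n):
    return fixed_width % n == 0 and all(w % n == 0 for w in elwidths)

print(simd_div_ok(64, [8, 8, 8, 8], 2))   # True
print(simd_div_ok(64, [8, 8, 10, 8], 4))  # False: 10 % 4 != 0
```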
Summary:

* **Fixed-width Priority**
  - all operations (add, mul, sub, shift, div)
    expected to work (caveat: layout() should check partition alignment)
* **Element-width Priority**
  - all operations expected to work (no caveats)
* **Fixed-and-Elwidth (no priority)**
  - multiply and left-shift always expected to work
  - divide and right-shift expected to work (caveat: no bits
    lost due to rounding)
  - add and subtract **not** expected to work (ambiguous):
    an exception must be raised

## Arithmetic of two SimdShapes

With some thought it becomes clear that operations not involving
elwidth priority should simply calculate a new fixed
width based on straight arithmetic on the LHS and RHS fixed widths.
The partition counts remain the same (coming from the scope
context), therefore the result may also be a fixed_width priority
result using the exact same partition counts.

However - and bearing in mind that for fixed_width priority the
elwidths are *computed* from the fixed width and the partition counts -
the moment that elwidths (vec_op_widths) are involved,
the priority switches to the elwidths, even if one of those elwidths was
calculated initially from a fixed_width and partition counts.
In this case, the result will be an elwidths-priority SimdShape,
where the layout() function is already capable of computing the
required overall width based on the (newly-computed) vec_el_widths.

Note that, interestingly, when fixed_width priority SimdShapes are
used on both LHS and RHS, it is effectively an identical case to
when the RHS is an integer, because the fixed_width of the RHS is
itself an integer.

For certain operators (mul, shift) a special case of this also occurs
in instances where all elwidths in all partitions on either LHS or RHS
are identical: element widths {3,3,3,3}, for example. Under such
special circumstances, multiply would function correctly even for
dual-priority, due to uniform scaling.

Summary:

* **Both LHS and RHS Fixed-width Priority**
  - exactly the same as for when the RHS is an Integer, given that the
    RHS fixed width is, in fact, an integer.
  - the return result is always a fixed-width priority SimdShape
* **Either LHS or RHS Elwidth Priority** (but not both)
  - all operators are always expected to work because, again, one of
    RHS or LHS is an integer, and that integer is used as the
    input to the arithmetic.
  - reverse-operators (rmul etc.) take care of the RHS.
  - the return result is however always an elwidth-priority SimdShape
* **Both LHS and RHS Elwidth Priority**
  - for mul and shift-left, as long as one of LHS or RHS has identical
    elwidths, these are fine due to uniform scaling.
    The return result may be a dual-priority result.
  - for add and sub, an exception must be raised.
  - divide and shift-right are expected to work on condition
    that, again, all elwidths have the exact same value, and, again,
    that no LSBs are lost.