# SimdShape

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual lengths
on a per-mask basis and to "upcast" back to a SimdSignal, in exactly
the same way that C++ virtual base class upcasting works when RTTI
(Run Time Type Information) is enabled.

By deriving from `ast.Shape`, both `width` and `signed` are already
provided, leaving the `SimdShape` class with the responsibility of
additionally defining lengths for each mask basis. This is best
illustrated with an example.

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning,
but without the massive, disruptive code-duplication or intrusive explicit
coding outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 54 bits (excluding guard,
  round and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 5 bits

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

`SimdShape` needs this information in addition to the normal
information (width, sign) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

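As a rough illustration of the idea (an assumption for this page only,
not the actual Libre-SOC `SimdShape` API): a class deriving from nmigen's
`ast.Shape` gains `width` and `signed` for free, and only has to record
one element length per Partition Mask Context, shown here using the
exponent widths from the list above.

    from nmigen.hdl.ast import Shape

    class SimdShapeSketch(Shape):
        """Illustrative sketch only: the real SimdShape lives in the
        Libre-SOC codebase and its constructor differs."""
        def __init__(self, width=1, signed=False, lengths=None):
            super().__init__(width, signed)   # Shape provides width and signed
            # one element length per Partition Mask Context (names made up)
            self.lengths = lengths or {}

    # the exponent example: overall width 64, one length per context
    exp = SimdShapeSketch(width=64, signed=False,
                          lengths={"1xFP64": 11, "2xFP32": 8,
                                   "4xFP16": 5, "4xBF16": 8})
    print(exp.width, exp.signed, exp.lengths["2xFP32"])   # 64 False 8
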
A minor wrinkle which emerges from deeper analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared, and the sub-partitions need to fit onto power-of-two
boundaries, in order to allow straight wire-connections rather than
allowing the SimdSignal to be arbitrarily-sized (compact). Although on
shallow inspection this would initially seem to imply large unused
sub-partitions (padding partitions), these gates can in fact be
eliminated with a "blanking" mask, created from static analysis of the
SimdShape context.

Example:

* all 32 and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-of-two boundary,
and that "x" marks unused (padding) portions:

          |31|        |       |    16|15|       | 8|7       0|
    32bit | x|    x   |   x   |   x  | x|   x   | x|10 .... 0|
    16bit | x|    x   |26 ....     16| x|   x   |10 ....... 0|
    8bit  | x|28 .. 24|   x   |20..16| x|12 .. 8| x|4 ..... 0|

Thus we deduce that we *actually* need breakpoints at these positions,
and that the unused portions common to **all** cases can be deduced
and marked "x":

    | |28|26|24| |20|16| |12|10|8| |4 |
     x                  x

These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

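As a rough sketch of that static analysis (illustrative only, not the
actual Libre-SOC code; the layout data is simply transcribed from the
allocation table above), the blanking mask is the complement of the
union of every bit used in any supported case:

    # used sub-signals per case as (start, length), read off the table above
    cases = {
        32: [(0, 11)],
        16: [(0, 11), (16, 11)],
        8:  [(0, 5), (8, 5), (16, 5), (24, 5)],
    }

    WIDTH = 32
    used = 0
    for ranges in cases.values():
        for start, length in ranges:
            used |= ((1 << length) - 1) << start

    # bits never used by any case need no computational gates at all
    blanking_mask = ~used & ((1 << WIDTH) - 1)
    print(f"{blanking_mask:032b}")   # ones at bits 31-29 and 15-13
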
Also, in order to save gates: in the example above there are only three
cases (32-bit, 16-bit, 8-bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there being
nine actual partition breakpoints (28, 26, 24, 20, 16, 12, 10, 8, 4),
it would appear that 2^9 (512!) cases were required, when in fact
there are only three.

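A small sketch (again illustrative only, with the per-case break
positions read off the layouts above) makes that explicit: each
supported element width activates a fixed, statically-known subset of
the nine break positions, so only three distinct partition patterns
ever occur:

    # the nine candidate break positions from the example above
    breakpoints = (4, 8, 10, 12, 16, 20, 24, 26, 28)

    # break positions actually exercised by each of the three cases
    active = {
        32: {10},
        16: {10, 16, 26},
        8:  {4, 8, 12, 16, 20, 24, 28},
    }

    for elwidth, brks in active.items():
        pattern = "".join("1" if b in brks else "0" for b in breakpoints)
        print(f"{elwidth:2}-bit: {pattern}")
    # only these three patterns occur, never 2**9 == 512 arbitrary combinations
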
These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality: the add
operation and other arithmetic behaviour, as well as
[[dynamic_simd/cat]] and others.

In addition to that, there is a "convenience" that emerged
from technical discussions as desirable
to have, which is that it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`:

    >>> XLEN = SimdShape(64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

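A hypothetical sketch of how such an operator might behave (this is not
the Libre-SOC implementation; it simply extends the illustrative
`SimdShapeSketch` class from earlier): the arithmetic is applied to
`Shape.width`, the per-context lengths are scaled to match, and a new
shape is returned so that the Partition context is preserved.

    from nmigen.hdl.ast import Shape

    class SimdShapeSketch(Shape):
        """Illustrative stand-in for SimdShape: shape arithmetic only."""
        def __init__(self, width=1, signed=False, lengths=None):
            super().__init__(width, signed)
            self.lengths = lengths or {}     # per-context element lengths

        def __floordiv__(self, divisor):
            # operate on Shape.width, scale the lengths, keep signedness
            return SimdShapeSketch(self.width // divisor, self.signed,
                                   {k: v // divisor
                                    for k, v in self.lengths.items()})

    XLEN = SimdShapeSketch(64, signed=True, lengths={1: 64, 2: 32, 4: 16})
    x2 = XLEN // 2
    print(x2.width, x2.signed)   # 32 True
    print(x2.lengths)            # {1: 32, 2: 16, 4: 8}
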
With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace   # object() cannot hold attributes,
                                         # so use SimpleNamespace in this sketch
    from nmigen import Module, Signal, Const
    # SimdShape and SimdSignal are the Libre-SOC classes described here
    # (import path omitted); compiletime_switch is set by the build config

    # scalar context
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal                  # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    sdc.XLEN = SimdShape(64, ...)
    sdc.SigKls = SimdSignal              # advanced SIMD Signal
    sdc.elwidth = Signal(2)
    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code switching context at compile time
    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        ...
        m.d.comb += x.eq(Const(3))
130