# SimdShape

Links:

* [layout experiment](https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part/layout_experiment.py;h=2a31a57dbcb4cb075ec14b4799e521fca6aa509b;hb=0407d90ccaf7e0e42f40918c3fa5dc1d89cf0155)
* <https://bugs.libre-soc.org/show_bug.cgi?id=713>

A logical extension of the nmigen `ast.Shape` concept, `SimdShape`
provides sufficient context both to define overrides for individual lengths
on a per-mask basis and to "upcast" back to a SimdSignal, in exactly
the same way that C++ virtual base class upcasting works when RTTI
(Run Time Type Information) is available.

By deriving from `ast.Shape`, both `width` and `signed` are provided
already, leaving the `SimdShape` class with the responsibility of
additionally defining lengths for each mask basis. This is best illustrated
with an example.

The Libre-SOC IEEE754 ALUs need to be converted to SIMD Partitioning
but without massive disruptive code-duplication or intrusive explicit
coding as outlined in the worst of the techniques documented in
[[dynamic_simd]]. This in turn implies that Signals need to be declared
for both mantissa and exponent that **change width to non-power-of-two
sizes** depending on Partition Mask Context.

Mantissa:

* when the context is 1xFP64 the mantissa is 52 bits (excluding guard,
  rounding and sticky)
* when the context is 2xFP32 there are **two** mantissas of 23 bits
* when the context is 4xFP16 there are **four** mantissas of 10 bits
* when the context is 4xBF16 there are **four** mantissas of 7 bits.

Exponent:

* 1xFP64: 11 bits, one exponent
* 2xFP32: 8 bits, two exponents
* 4xFP16: 5 bits, four exponents
* 4xBF16: 8 bits, four exponents

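
As a rough illustration of why these widths are awkward, the exponent
figures above can be tabulated and multiplied out. This is a plain-Python
sketch, not Libre-SOC code: `EXPONENT_LAYOUTS` and `packed_exponent_bits`
are hypothetical names introduced here purely for illustration.

```python
# Hypothetical lookup table: context name -> (number of lanes, exponent bits per lane).
# The figures are copied from the exponent list above.
EXPONENT_LAYOUTS = {
    "1xFP64": (1, 11),
    "2xFP32": (2, 8),
    "4xFP16": (4, 5),
    "4xBF16": (4, 8),
}


def packed_exponent_bits(context):
    """Total bits needed if the exponents of one context were packed back-to-back."""
    lanes, bits = EXPONENT_LAYOUTS[context]
    return lanes * bits


for name in EXPONENT_LAYOUTS:
    print(name, packed_exponent_bits(name))
```

The totals (11, 16, 20 and 32 bits) differ per context and the element
widths are not all powers of two, which is the situation that a single
Signal declaration has to cope with.
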
`SimdShape` needs this information in addition to the normal
information (`width`, `signed`) in order to create the partitions
that allow standard nmigen operations to **transparently**
and naturally take place at **all** of these non-uniform
widths, as if they were in fact scalar Signals *at* those
widths.

A minor wrinkle which emerges from deeper analysis is that the overall
available width (`Shape.width`) does in fact need to be explicitly
declared under *some* circumstances, and the sub-partitions need
to fit onto power-of-two boundaries, in order to allow
straight wire-connections rather than allowing the SimdSignal to be
arbitrary-sized (compact). Although on shallow inspection this
would seem to result in large unused sub-partitions (padding
partitions), the gates for these can in fact be eliminated
with a "blanking" mask, created from static analysis of the SimdShape
context.

Example:

* all 32-bit and 16-bit values are actually to be truncated to 11 bits
* all 8-bit values are to be truncated to 5 bits

From these we can write out the allocations, bearing in mind that
in each partition the sub-signal must start on a power-of-two boundary:

           |31      24|23      16|15       8|7        0|
    32bit  |          |          |        1.11         |
    16bit  |        2.11         |        1.11         |
    8bit   |   4.5    |   3.5    |   2.5    |   1.5    |

Next we identify the start and end points, and note
that "x" marks unused (padding) portions. We begin by marking
the power-of-two boundaries (0-7 .. 24-31) and also include column
guidelines to delineate the start and end points:

           |31             24|23       16|15              8| 7        0|
           |31 29|28 27|26 24|23 21|20 16|15 13|12 11|10  8| 7  5| 4  0|
    32bit  |  x  |  x  |  x  |  x  |  x  |  x  |  x  |10 ...........  0|
    16bit  |  x  |  x  |26 ........... 16|  x  |  x  |10 ...........  0|
    8bit   |  x  |28 ..... 24|  x  |20 16|  x  |12 .....  8|  x  | 4  0|
    unused |  x  |     |     |     |     |  x  |     |     |     |     |

Thus we deduce that we *actually* need breakpoints at *nine* positions,
and that unused portions common to **all** cases can be identified
and marked "x" by looking at the columns above them.
These 100% unused "x"s therefore define the "blanking" mask, and in
these sub-portions it is unnecessary to allocate computational gates.

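
The static analysis described above can be sketched in plain Python.
This is only an illustration under stated assumptions, not the
layout experiment code: `CASES`, `used_bits`, `breakpoints` and
`blanking_mask` are hypothetical names, and each case is written as a
list of `(start_bit, width)` pairs taken from the example. Boundaries are
reported as the index of the first bit *above* each boundary, so some
values are one higher than the top-bit labels used in the text
(e.g. 29 rather than 28).

```python
WIDTH = 32

# One entry per case: (start_bit, width) of every element in that case.
CASES = {
    "32bit": [(0, 11)],
    "16bit": [(0, 11), (16, 11)],
    "8bit": [(0, 5), (8, 5), (16, 5), (24, 5)],
}


def used_bits(allocs):
    """Set of bit positions occupied by the given element allocations."""
    bits = set()
    for start, width in allocs:
        bits.update(range(start, start + width))
    return bits


def breakpoints(cases, total_width):
    """Sorted interior bit positions where any element starts or ends."""
    points = set()
    for allocs in cases.values():
        for start, width in allocs:
            points.add(start)
            points.add(start + width)
    points.discard(0)
    points.discard(total_width)
    return sorted(points)


def blanking_mask(cases, total_width):
    """Integer with a 1 in every bit position that no case ever uses."""
    used = set()
    for allocs in cases.values():
        used |= used_bits(allocs)
    mask = 0
    for bit in range(total_width):
        if bit not in used:
            mask |= 1 << bit
    return mask


print(breakpoints(CASES, WIDTH))         # nine interior boundaries
print(hex(blanking_mask(CASES, WIDTH)))  # 0xe000e000: bits 13-15 and 29-31
```

The two blanked ranges correspond to the two columns that are marked "x"
in every case.
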
Also, in order to save gates: in the example above there are only three
cases (32-bit, 16-bit, 8-bit), therefore only three sets of logic
are required to construct the larger overall computational result
from the "smaller chunks". At first glance, with there
being 9 actual partition points (28, 26, 24, 20, 16, 12, 10, 8, 4), it
would appear that 2^9 (512!) cases were required, where in fact
there are only three.

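
A small plain-Python sketch makes the counting argument concrete. It
assumes the same `(start_bit, width)` description of each case as the
example, lists the nine boundaries as "first bit above the boundary"
indices, and uses the hypothetical names `POINTS` and
`partition_pattern`; none of this is taken from the actual
implementation.

```python
# (start_bit, width) of every element, per case, from the example.
CASES = {
    "32bit": [(0, 11)],
    "16bit": [(0, 11), (16, 11)],
    "8bit": [(0, 5), (8, 5), (16, 5), (24, 5)],
}

# The nine interior boundaries implied by those allocations.
POINTS = [5, 8, 11, 13, 16, 21, 24, 27, 29]


def partition_pattern(allocs, points):
    """For each boundary, True if an element of this case starts or ends there."""
    edges = set()
    for start, width in allocs:
        edges.add(start)
        edges.add(start + width)
    return tuple(p in edges for p in points)


patterns = {name: partition_pattern(allocs, POINTS)
            for name, allocs in CASES.items()}

# Only three of the 2**9 conceivable boundary patterns ever occur.
print(len(set(patterns.values())), "of", 2 ** len(POINTS))
```

Logic therefore only needs to be selected per case, rather than decoded
from every combination of the nine boundary bits.
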
These facts also need to be communicated both to the SimdSignal
and to the submodules implementing its core functionality:
add and other arithmetic operations, as well as
[[dynamic_simd/cat]] and others.

In addition, a "convenience" emerged from technical discussions
as desirable to have: it should be possible to perform
rudimentary arithmetic operations *on a SimdShape* which preserve
or adapt the Partition context, where the arithmetic operations
occur on `Shape.width`.

    >>> XLEN = SimdShape(64, signed=True, ...)
    >>> x2 = XLEN // 2
    >>> print(x2.width)
    32
    >>> print(x2.signed)
    True

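
One way such arithmetic could behave is sketched below with a toy class.
`MiniShape` is a hypothetical stand-in written for this page, not the
real `SimdShape` and not derived from `ast.Shape`; it assumes the
partition context can be reduced to a tuple of permitted lane counts,
which is carried over unchanged while the arithmetic acts on `width`.

```python
class MiniShape:
    """Toy stand-in for SimdShape: overall width, signedness, lane counts."""

    def __init__(self, width, signed=False, lanes=(1, 2, 4, 8)):
        self.width = width
        self.signed = signed
        self.lanes = tuple(lanes)  # permitted partition counts (the "context")

    def _with_width(self, width):
        # Preserve signedness and partition context, change only the width.
        return MiniShape(width, self.signed, self.lanes)

    def __floordiv__(self, divisor):
        return self._with_width(self.width // divisor)

    def __mul__(self, factor):
        return self._with_width(self.width * factor)

    def elwidths(self):
        """Element width for each permitted lane count."""
        return {n: self.width // n for n in self.lanes}


XLEN = MiniShape(64, signed=True)
x2 = XLEN // 2
print(x2.width, x2.signed)  # 32 True
print(x2.elwidths())        # {1: 32, 2: 16, 4: 8, 8: 4}
```

Returning a new object from each operator, rather than a bare integer,
is what lets the result be passed straight back into a Signal
constructor.
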
With this capability it becomes possible to use the Liskov Substitution
Principle in dynamically compiling code that switches between scalar and
SIMD transparently:

    from types import SimpleNamespace

    # scalar context (a bare object() cannot have attributes assigned)
    scalarctx = scl = SimpleNamespace()
    scl.XLEN = 64
    scl.SigKls = Signal # standard nmigen Signal
    # SIMD context
    simdctx = sdc = SimpleNamespace()
    # pseudocode: one element width per partitioning
    sdc.XLEN = SimdShape({1x64, 2x32, 4x16, 8x8})
    sdc.SigKls = SimdSignal # advanced SIMD Signal
    sdc.elwidth = Signal(2)

    # select one
    if compiletime_switch == 'SIMD':
        ctx = simdctx
    else:
        ctx = scalarctx

    # exact same code switching context at compile time
    m = Module()
    with ctx: # pseudocode: assumes ctx is also usable as a context manager
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN // 2)
        ...
        m.d.comb += x.eq(Const(3))

An interesting practical requirement transpires from attempting to use
SimdSignal, which affects the way that SimdShape works. The register files
are 64-bit, and are subdivided according to what Wikipedia terms
"SIMD Within A Register" (SWAR). Therefore, the SIMD ALUs *have* to
both accept and output 64-bit signals at that explicit width, with
subdivisions for 1x64, 2x32, 4x16 and 8x8 SIMD capability.

However, when it comes to intermediary processing (partial computations),
those intermediary Signals can and will be required to be a certain
fixed width *regardless* of, and having nothing to do with, the fixed
64-bit width of the register file source or destination.

The simplest example here would be a boolean (1 bit) Signal for
Scalar (but an 8-bit quantity for SIMD):

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(ctx.XLEN)
        b = ctx.SigKls(1)
        m.d.comb += b.eq(x > y)
        with m.If(b):
            ....

This code is obvious for Scalar behaviour, but for SIMD, because
the elwidths are declared as `1x64, 2x32, 4x16, 8x8`, whilst
the *elements* of `b` are 1 bit each, there correspondingly
needs to be **eight** such element bits in order to store the
results of up to eight parallel comparisons of 8-bit SIMD values.

Another example would be a simple test of the first *nibble* of
the data:

    m = Module()
    with ctx:
        x = ctx.SigKls(ctx.XLEN)
        y = ctx.SigKls(4)
        m.d.comb += y.eq(x[0:4]) # bits 0..3: slice end is exclusive
        ....

Here, we do not necessarily want to declare y to be 64-bit: we want
only the first 4 bits of each element, after all, and when the
partitioning is set to QTY 8 of 8-bit elements, y will only need to
store QTY 8 of 4-bit quantities, i.e. only a maximum of 32 bits total.

If y were declared as 64-bit this would indicate that the actual
elements were at least 8 bits long, and if that were then used as a
shift input it might produce the wrong calculation, because the
actual shift amount was only supposed to be 4 bits.

Thus not one method of setting widths is required but *two*:

* at the element level
* at the width of the entire SIMD signal
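
The difference between the two methods can be sketched in plain Python.
The names `LANES`, `from_fixed_width` and `from_fixed_elwidth` are
hypothetical and chosen only for this illustration; they are not part of
any SimdShape API, and the sketch assumes the `1x64, 2x32, 4x16, 8x8`
partitioning used in the examples above.

```python
# Lane count for each partitioning used in the examples above.
LANES = {"1x64": 1, "2x32": 2, "4x16": 4, "8x8": 8}


def from_fixed_width(total_width):
    """Overall width is fixed (e.g. by the register file):
    the element width then varies with the partitioning."""
    return {name: total_width // n for name, n in LANES.items()}


def from_fixed_elwidth(elwidth):
    """Element width is fixed: the signal must be wide enough
    to hold that many bits for the largest lane count."""
    return elwidth * max(LANES.values())


print(from_fixed_width(64))   # {'1x64': 64, '2x32': 32, '4x16': 16, '8x8': 8}
print(from_fixed_elwidth(1))  # 8  (the boolean comparison result)
print(from_fixed_elwidth(4))  # 32 (the nibble example)
```

The first function corresponds to signals tied to the 64-bit register
file, the second to intermediary signals such as `b` and `y` in the
examples above.
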