(no commit message)
[libreriscv.git] / 3d_gpu / architecture / dynamic_simd / shift.mdwn
1 # Dynamic Partitioned Shift
2
3 This is almost as complex as multiplication, except there is a trick that can be deployed. In the partitioned multiplier, it is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds, using PartitionedAdd to "automatically" break them down into segments.
4
5 Partitioned Shifting will also require to have an NxN matrix, however it is slightly different. first, define the following:
6
7 a0 = a[7..0], a1 = a[15..8], ....
8 b0 = b[0+4..0], b1 = b[8+4..8], b2 = b[16+4..16], b3 = b[24+4..24]
9
10 QUESTION: should b1 be limited to min(b[8+4..8], 24), b2 be similarly limited to 15, and b3 be limited/min'd to 8?
11
12 then, we compute the following matrix, with the first column output being the full width (32 bit), the second being only 24 bit, the third only 16 bit and finally the top part (comprising the most significant byte of a and b as input) being only 8 bit
13
14 | a0 << b0 | a1 << b0 | a2 << b0 | a3 << b0
15 | | a1 << b1 | a2 << b1 | a3 << b1
16 | | | a2 << b2 | a3 << b2
17 | | | | a3 << b3
18
19 Where multiply would perform a cascading-add across those partial results,
20 shift is different in that we *know* (assume) that for each shift-amount
21 (operand b), within each partition the topmost bits are **zero**.
22
23 This because, in the typical 64-bit shift, the operation is actually:
24
25 result[63..0] = a[63..0] << b[5..0]
26
27 **NOT** b[63..0], i.e. the amount to shift a 64-bit number by by is in the
28 *lower* six bits of b. Likewise, for a 32-bit number, this is 5 bits.
29
30 Therefore, in principle, it should be possible to simply use Muxes on the
31 partial-result matrix, ORing them together. Assuming (again) a 32-bit
32 input and a 4-way partition:
33
34 out0 = p00[7..0]
35 out1 = pmask[0] ? p01[7..0] : p00[15..8]
36
37 Table based on partitions:
38
39 [[!table data="""
40 p2p1p0 | o0 | o1 | o2 | o3
41 ++++++ | ++++++++ | ++++++++ | ++++++++ | ++
42 0 0 0 | a0b0 | a1b0 | a2b0 | a3b0
43 0 0 1 | a0b0 | a1b1 | a2b1 | a3b1
44 0 1 0 | a0b0 | a1b0 | a2b2 | a3b2
45 0 1 1 | a0b0 | a1b1 | a2b2 | a3b2
46 1 0 0 | a0b0 | a1b0 | a2b0 | a3b3
47 1 0 1 | a0b0 | a1b1 | a2b1 | a3b3
48 1 1 0 | a0b0 | a1b0 | a2b2 | a3b3
49 1 1 1 | a0b0 | a1b1 | a2b2 | a3b3
50 """]]
51
52 For o0 the output is simple: a0b0 for all partition permutations.
53
54 o0 = a0b0[7:0]
55
56 The output for o1 looks something like this:
57
58 p2p1p0 : o1
59 0 0 0 : a0b0[15:8] | a1b0[7:0]
60 0 0 1 | a0b0[15:8] | a1b1[7:0]
61 0 1 0 | a0b0[15:8] | a1b0[7:0]
62 0 1 1 | a0b0[15:8] | a1b1[7:0]
63 1 0 0 | a0b0[15:8] | a1b0[7:0]
64 1 0 1 | a0b0[15:8] | a1b1[7:0]
65 1 1 0 | a0b0[15:8] | a1b0[7:0]
66 1 1 1 | a0b0[15:8] | a1b1[7:0]
67
68 if True: o1 = a0b0[15:8]
69 if ~p0: o1 |= a1b0[7:0]
70 if p0: o1 |= a1b1[7:0]
71
72 For o2:
73
74 p2p1p0 : o2
75 0 0 0 | a0b0[23:16] | a1b0[15:8] | a2b0[7:0]
76 0 0 1 | a0b0[23:16] | a1b1[15:8] | a2b1[7:0]
77 0 1 0 | a0b0[23:16] | a1b0[15:8] | a2b2[7:0]
78 0 1 1 | a0b0[23:16] | a1b1[15:8] | a2b2[7:0]
79 1 0 0 | a0b0[23:16] | a1b0[15:8] | a2b0[7:0]
80 1 0 1 | a0b0[23:16] | a1b1[15:8] | a2b1[7:0]
81 1 1 0 | a0b0[23:16] | a1b0[15:8] | a2b2[7:0]
82 1 1 1 | a0b0[23:16] | a1b1[15:8] | a2b2[7:0]
83
84 therefore:
85
86 if True: o2 = a0b0[23:16]
87 if ~p0: o2 |= a1b0[15:0]
88 if p0: o2 |= a1b1[15:0]
89 if ~p0&~p1: o2 |= a2b0[7:0]
90 if p0&~p1: o2 |= a2b1[7:0]
91 if ~p1: o2 |= a2b2[7:0]
92
93 For o3:
94
95 p2p1p0 | o3
96 0 0 0 | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b0[7:0]
97 0 0 1 | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b1[7:0]
98 0 1 0 | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b2[7:0]
99 0 1 1 | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b2[7:0]
100 1 0 0 | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b3[7:0]
101 1 0 1 | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b3[7:0]
102 1 1 0 | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b3[7:0]
103 1 1 1 | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b3[7:0]
104
105 therefore:
106
107 if True: o3 = a0b0[31:24]
108 if ~p0: o3 |= a1b0[23:16]
109 if p0: o3 |= a1b1[23:16]
110 if ~p0& p1: o3 |= a2b1[15:8]
111 if p0& p1: o3 |= a2b0[15:8]
112 if p1: o3 |= a2b2[15:8]
113 if ~p0&~p1&~p2: o3 |= a3b0[7:0]
114 if p0&~p1&~p2: o3 |= a3b1[7:0]
115 if p1&~p2: o3 |= a3b2[7:0]
116 if p2: o3 |= a3b3[7:0]
117
118 Code that prints the above:
119
120 def decoderange(col, k):
121 xs = (col-k)*8
122 return "%d:%d" % (xs+7, xs)
123
124 def decodelist(l):
125 res = []
126 for (idx, sgn) in l:
127 sgn = " " if sgn else "~"
128 res.append("%sp%d" % (sgn, idx))
129 return '&'.join(res)
130
131 N = 4
132 M = 4
133
134 for col in range(M):
135 r = decoderange(col, 0)
136 print "if True: o%d = a0b0[%s]" % (col, r)
137 for i in range(1,col+1):
138 l = []
139 for j in range(i):
140 l.append([j, False])
141 k = 0
142 s = decodelist(l)
143 r = decoderange(col, i)
144 print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
145 k += 1
146 while l:
147 l[0][1] = True
148 s = decodelist(l)
149 r = decoderange(col, i)
150 print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
151 k += 1
152 del l[0]
153 print
154
155 ## Note
156
157 One important part to remember is that the upper bits of the shift amount need to specifically *ignored* rather than assuming that they are zeros, this is because, for shift instructions: `a << b` actually is `a << (b % bit_len(a))`
158
159 # Static Partitioned Shift
160
161 Static shift is pretty straightforward: the input is the entire number
162 shifted, however clearly due to partitioning some of the bits need to
163 be masked out (to zero, or to 1 if it is an arithmetic shift right).
164 This can therefore use the class currently known as "MultiShiftRMerge"
165 or similar, which can merge in a 1 into the shifted data. Luckily
166 the amount to be 0'd/1'd is also statically computable: it is just that
167 the location *where* is dynamic, based on the partition location(s).
168
169 # Bugreports
170
171 * <http://bugs.libre-riscv.org/show_bug.cgi?id=173>