(no commit message)
[libreriscv.git] / 3d_gpu / architecture / dynamic_simd / shift.mdwn
1 # Dynamic Partitioned Shift
2
3 This is almost as complex as multiplication, except there is a trick that can be deployed. In the partitioned multiplier, it is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds, using PartitionedAdd to "automatically" break them down into segments.
4
5 Partitioned Shifting will also require to have an NxN matrix, however it is slightly different. first, define the following:
6
7 a0 = a[7..0], a1 = a[15..8], ....
8 b0 = b[0+4..0], b1 = b[8+4..8], b2 = b[16+4..16], b3 = b[24+4..24]
9
10 QUESTION: should b1 be limited to min(b[8+4..8], 24), b2 be similarly limited to 15, and b3 be limited/min'd to 8?
11
12 then, we compute the following matrix, with the first column output being the full width (32 bit), the second being only 24 bit, the third only 16 bit and finally the top part (comprising the most significant byte of a and b as input) being only 8 bit
13
14 | a0 << b0 | a1 << b0 | a2 << b0 | a3 << b0
15 | | a1 << b1 | a2 << b1 | a3 << b1
16 | | | a2 << b2 | a3 << b2
17 | | | | a3 << b3
18
19 Where multiply would perform a cascading-add across those partial results,
20 shift is different in that we *know* (assume) that for each shift-amount
21 (operand b), within each partition the topmost bits are **zero**.
22
23 This because, in the typical 64-bit shift, the operation is actually:
24
25 result[63..0] = a[63..0] << b[5..0]
26
27 **NOT** b[63..0], i.e. the amount to shift a 64-bit number by by is in the
28 *lower* six bits of b. Likewise, for a 32-bit number, this is 5 bits.
29
30 Therefore, in principle, it should be possible to simply use Muxes on the
31 partial-result matrix, ORing them together. Assuming (again) a 32-bit
32 input and a 4-way partition:
33
34 out0 = p00[7..0]
35 out1 = pmask[0] ? p01[7..0] : p00[15..8]
36
37 Table based on partitions:
38
39 [[!table data="""
40 p2p1p0 | o0 | o1 | o2 | o3
41 ++++++ | ++++++++ | ++++++++ | ++++++++ | ++
42 0 0 0 | a0b0 | a1b0 | a2b0 | a3b0
43 0 0 1 | a0b0 | a1b1 | a2b1 | a3b1
44 0 1 0 | a0b0 | a1b0 | a2b2 | a3b2
45 0 1 1 | a0b0 | a1b1 | a2b2 | a3b2
46 1 0 0 | a0b0 | a1b0 | a2b0 | a3b3
47 1 0 1 | a0b0 | a1b1 | a2b1 | a3b3
48 1 1 0 | a0b0 | a1b0 | a2b2 | a3b3
49 1 1 1 | a0b0 | a1b1 | a2b2 | a3b3
50 """]]
51
52 Therefore, the actual output for o1 looks something like this:
53
54 p2p1p0 : o1
55 0 0 0 : a0b0[15:8] | a1b0[7:0]
56 0 0 1 | a0b0[15:8] | a1b1[7:0]
57 0 1 0 | a0b0[15:8] | a1b0[7:0]
58 0 1 1 | a0b0[15:8] | a1b1[7:0]
59 1 0 0 | a0b0[15:8] | a1b0[7:0]
60 1 0 1 | a0b0[15:8] | a1b1[7:0]
61 1 1 0 | a0b0[15:8] | a1b0[7:0]
62 1 1 1 | a0b0[15:8] | a1b1[7:0]
63
64 For o2:
65
66 p2p1p0 : o2
67 0 0 0 | a0b0[23:16] | a1b0[15:8] | a2b0
68 0 0 1 | a0b0[23:16] | a1b1[15:8] | a2b1
69 0 1 0 | a0b0[23:16] | a1b0[15:8] | a2b2
70 0 1 1 | a0b0[23:16] | a1b1[15:8] | a2b2
71 1 0 0 | a0b0[23:16] | a1b0[15:8] | a2b0
72 1 0 1 | a0b0[23:16] | a1b1[15:8] | a2b1
73 1 1 0 | a0b0[23:16] | a1b0[15:8] | a2b2
74 1 1 1 | a0b0[23:16] | a1b1[15:8] | a2b2
75
76 For o3:
77
78 p2p1p0 | o3
79 0 0 0 | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b0
80 0 0 1 | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b1
81 0 1 0 | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b2
82 0 1 1 | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b2
83 1 0 0 | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b3
84 1 0 1 | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b3
85 1 1 0 | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b3
86 1 1 1 | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b3
87
88 Where for o0 the output is simple a0b0 for all partition permutations.
89
90 ## Note
91
92 One important part to remember is that the upper bits of the shift amount need to specifically *ignored* rather than assuming that they are zeros, this is because, for shift instructions: `a << b` actually is `a << (b % bit_len(a))`
93
94 # Static Partitioned Shift
95
96 Static shift is pretty straightforward: the input is the entire number
97 shifted, however clearly due to partitioning some of the bits need to
98 be masked out (to zero, or to 1 if it is an arithmetic shift right).
99 This can therefore use the class currently known as "MultiShiftRMerge"
100 or similar, which can merge in a 1 into the shifted data. Luckily
101 the amount to be 0'd/1'd is also statically computable: it is just that
102 the location *where* is dynamic, based on the partition location(s).