(no commit message)
[libreriscv.git] / 3d_gpu / architecture / dynamic_simd / shift.mdwn
1 # Dynamic Partitioned Shift
2
3 This is almost as complex as multiplication, except there is a trick that can be deployed. In the partitioned multiplier, it is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds, using PartitionedAdd to "automatically" break them down into segments.
4
5 Partitioned Shifting will also require to have an NxN matrix, however it is slightly different. first, define the following:
6
7 a0 = a[7..0], a1 = a[15..8], ....
8 b0 = b[7..0], b1 = b[15..8], ....
9
10 then, we compute the following matrix, with the first column output being the full width (32 bit), the second being only 24 bit, the third only 16 bit and finally the top part (comprising the most significant byte of a and b as input) being only 8 bit
11
12 | a0 << b0 | a1 << b0 | a2 << b0 | a3 << b0
13 | a0 << b1 | a1 << b1 | a2 << b1 | a3 << b1
14 | a0 << b2 | a1 << b2 | a2 << b2 | a3 << b2
15 | a0 << b3 | a1 << b3 | a2 << b3 | a3 << b3
16
17 Where multiply would perform a cascading-add across those partial results,
18 shift is different in that we *know* (assume) that for each shift-amount
19 (operand b), within each partition the topmost bits are **zero**.
20
21 This because, in the typical 64-bit shift, the operation is actually:
22
23 result[63..0] = a[63..0] << b[5..0]
24
25 **NOT** b[63..0], i.e. the amount to shift a 64-bit number by by is in the
26 *lower* six bits of b. Likewise, for a 32-bit number, this is 5 bits.
27
28 Therefore, in principle, it should be possible to simply use Muxes on the
29 partial-result matrix, ORing them together. Assuming (again) a 32-bit
30 input and a 4-way partition:
31
32 out0 = p00[7..0]
33 out1 = pmask[0] ? p01[7..0] : p00[15..8]
34
35 Table based on partitions:
36
37 [[!table data="""
38 p2p1p0 | o0 | o1 | o2 | o3
39 ++++++ | ++++++++ | ++++++++ | ++++++++ | ++
40 0 0 0 | a0b0 | a1b0 | a2b0 | a3b0
41 0 0 1 | eq0 | &(eq1-3) | 0 | 0
42 0 1 0 | &(eq0-1) | 0 | &(eq2-3) | 0
43 0 1 1 | eq0 | eq1 | &(eq2-3) | 0
44 1 0 0 | &(eq0-2) | 0 | 0 | eq3
45 1 0 1 | eq0 | &(eq1-2) | 0 | eq3
46 1 1 0 | &(eq0-1) | 0 | eq2 | eq3
47 1 1 1 | eq0 | eq1 | eq2 | eq3
48 """]]
49
50 # Static Partitioned Shift
51
52 Static shift is pretty straightforward: the input is the entire number
53 shifted, however clearly due to partitioning some of the bits need to
54 be masked out (to zero, or to 1 if it is an arithmetic shift right).
55 This can therefore use the class currently known as "MultiShiftRMerge"
56 or similar, which can merge in a 1 into the shifted data. Luckily
57 the amount to be 0'd/1'd is also statically computable: it is just that
58 the location *where* is dynamic, based on the partition location(s).