3d_gpu/architecture/dynamic_simd/shift.mdwn

   1 # Dynamic Partitioned Shift
   2
   3 This is almost as complex as multiplication, except there is a trick that can be deployed.  In the partitioned multiplier, it is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds, using PartitionedAdd to "automatically" break them down into segments.
   4
   5 Partitioned Shifting will also require to have an NxN matrix, however it is slightly different.  first, define the following:
   6
   7     a0 = a[7..0], a1 = a[15..8], ....
   8     b0 = b[7..0], b1 = b[15..8], ....
   9
  10 then, we compute the following matrix, with the first column output being the full width (32 bit), the second being only 24 bit, the third only 16 bit and finally the top part (comprising the most significant byte of a and b as input) being only 8 bit
  11
  12     | a0 << b0 | a1 << b0 | a2 << b0 | a3 << b0
  13     | a0 << b1 | a1 << b1 | a2 << b1 | a3 << b1
  14     | a0 << b2 | a1 << b2 | a2 << b2 | a3 << b2
  15     | a0 << b3 | a1 << b3 | a2 << b3 | a3 << b3
  16
  17 Where multiply would perform a cascading-add across those partial results,
  18 shift is different in that we *know* (assume) that for each shift-amount
  19 (operand b), within each partition the topmost bits are **zero**.
  20
  21 This because, in the typical 64-bit shift, the operation is actually:
  22
  23     result[63..0] = a[63..0] << b[5..0]
  24
  25 **NOT** b[63..0], i.e. the amount to shift a 64-bit number by by is in the
  26 *lower* six bits of b.  Likewise, for a 32-bit number, this is 5 bits.
  27
  28 Therefore, in principle, it should be possible to simply use Muxes on the
  29 partial-result matrix, ORing them together.  Assuming (again) a 32-bit
  30 input and a 4-way partition:
  31
  32     out0 = p00[7..0]
  33     out1 = pmask[0] ? p01[7..0] : p00[15..8]
  34
  35 Table based on partitions:
  36
  37 [[!table  data="""
  38 p2p1p0 | o0        | o1       | o2       | o3
  39 ++++++ | ++++++++  | ++++++++ | ++++++++ | ++
  40 0 0 0  | a0b0      | a1b0     | a2b0     | a3b0
  41 0 0 1  | eq0       | &(eq1-3) | 0        | 0
  42 0 1 0  | &(eq0-1)  | 0        | &(eq2-3) | 0
  43 0 1 1  | eq0       | eq1      | &(eq2-3) | 0
  44 1 0 0  | &(eq0-2)  | 0        | 0        | eq3
  45 1 0 1  | eq0       | &(eq1-2) | 0        | eq3
  46 1 1 0  | &(eq0-1)  | 0        | eq2      | eq3
  47 1 1 1  | eq0       | eq1      | eq2      | eq3
  48 """]]
  49
  50 # Static Partitioned Shift
  51
  52 Static shift is pretty straightforward: the input is the entire number
  53 shifted, however clearly due to partitioning some of the bits need to
  54 be masked out (to zero, or to 1 if it is an arithmetic shift right).
  55 This can therefore use the class currently known as "MultiShiftRMerge"
  56 or similar, which can merge in a 1 into the shifted data.  Luckily
  57 the amount to be 0'd/1'd is also statically computable: it is just that
  58 the location *where* is dynamic, based on the partition location(s).