3d_gpu/architecture/dynamic_simd/shift.mdwn

   1 # Dynamic Partitioned Shift
   2
   3 This is almost as complex as multiplication, except there is a trick that can be deployed.  In the partitioned multiplier, it is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds, using PartitionedAdd to "automatically" break them down into segments.
   4
   5 Partitioned Shifting will also require to have an NxN matrix, however it is slightly different.  first, define the following:
   6
   7     a0 = a[7..0], a1 = a[15..8], ....
   8     b0 = b[0+4..0], b1 = b[8+4..8], b2 = b[16+4..16], b3 = b[24+4..24]
   9
  10 QUESTION: should b1 be limited to min(b[8+4..8], 24), b2 be similarly limited to 15, and b3 be limited/min'd to 8?
  11
  12 then, we compute the following matrix, with the first column output being the full width (32 bit), the second being only 24 bit, the third only 16 bit and finally the top part (comprising the most significant byte of a and b as input) being only 8 bit
  13
  14     | a0 << b0 | a1 << b0 | a2 << b0 | a3 << b0
  15     |          | a1 << b1 | a2 << b1 | a3 << b1
  16     |          |          | a2 << b2 | a3 << b2
  17     |          |          |          | a3 << b3
  18
  19 Where multiply would perform a cascading-add across those partial results,
  20 shift is different in that we *know* (assume) that for each shift-amount
  21 (operand b), within each partition the topmost bits are **zero**.
  22
  23 This because, in the typical 64-bit shift, the operation is actually:
  24
  25     result[63..0] = a[63..0] << b[5..0]
  26
  27 **NOT** b[63..0], i.e. the amount to shift a 64-bit number by by is in the
  28 *lower* six bits of b.  Likewise, for a 32-bit number, this is 5 bits.
  29
  30 Therefore, in principle, it should be possible to simply use Muxes on the
  31 partial-result matrix, ORing them together.  Assuming (again) a 32-bit
  32 input and a 4-way partition:
  33
  34     out0 = p00[7..0]
  35     out1 = pmask[0] ? p01[7..0] : p00[15..8]
  36
  37 Table based on partitions:
  38
  39 [[!table  data="""
  40 p2p1p0 | o0        | o1       | o2       | o3
  41 ++++++ | ++++++++  | ++++++++ | ++++++++ | ++
  42 0 0 0  | a0b0      | a1b0     | a2b0     | a3b0
  43 0 0 1  | a0b0      | a1b1     | a2b1     | a3b1
  44 0 1 0  | a0b0      | a1b0     | a2b2     | a3b2
  45 0 1 1  | a0b0      | a1b1     | a2b2     | a3b2
  46 1 0 0  | a0b0      | a1b0     | a2b0     | a3b3
  47 1 0 1  | a0b0      | a1b1     | a2b1     | a3b3
  48 1 1 0  | a0b0      | a1b0     | a2b2     | a3b3
  49 1 1 1  | a0b0      | a1b1     | a2b2     | a3b3
  50 """]]
  51
  52 For o0 the output is simple: a0b0 for all partition permutations.
  53
  54     o0 = a0b0[7:0]
  55
  56 The output for o1 looks something like this:
  57
  58     p2p1p0 : o1
  59     0 0 0  : a0b0[15:8] | a1b0[7:0]
  60     0 0 1  | a0b0[15:8] | a1b1[7:0]
  61     0 1 0  | a0b0[15:8] | a1b0[7:0]
  62     0 1 1  | a0b0[15:8] | a1b1[7:0]
  63     1 0 0  | a0b0[15:8] | a1b0[7:0]
  64     1 0 1  | a0b0[15:8] | a1b1[7:0]
  65     1 1 0  | a0b0[15:8] | a1b0[7:0]
  66     1 1 1  | a0b0[15:8] | a1b1[7:0]
  67
  68     if True:           o1  = a0b0[15:8]
  69     if ~p0:            o1 |= a1b0[7:0]
  70     if  p0:            o1 |= a1b1[7:0]
  71
  72 For o2:
  73
  74     p2p1p0 : o2
  75     0 0 0  | a0b0[23:16] | a1b0[15:8] | a2b0[7:0]
  76     0 0 1  | a0b0[23:16] | a1b1[15:8] | a2b1[7:0]
  77     0 1 0  | a0b0[23:16] | a1b0[15:8] | a2b2[7:0]
  78     0 1 1  | a0b0[23:16] | a1b1[15:8] | a2b2[7:0]
  79     1 0 0  | a0b0[23:16] | a1b0[15:8] | a2b0[7:0]
  80     1 0 1  | a0b0[23:16] | a1b1[15:8] | a2b1[7:0]
  81     1 1 0  | a0b0[23:16] | a1b0[15:8] | a2b2[7:0]
  82     1 1 1  | a0b0[23:16] | a1b1[15:8] | a2b2[7:0]
  83
  84     therefore:
  85
  86     if True:          o2 =  a0b0[23:16]
  87     if ~p0:           o2 |= a1b0[15:0]
  88     if  p0:           o2 |= a1b1[15:0]
  89     if ~p0&~p1:       o2 |= a2b0[7:0]
  90     if  p0&~p1:       o2 |= a2b1[7:0]
  91     if     ~p1:       o2 |= a2b2[7:0]
  92
  93 For o3:
  94
  95     p2p1p0 | o3
  96     0 0 0  | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b0[7:0]
  97     0 0 1  | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b1[7:0]
  98     0 1 0  | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b2[7:0]
  99     0 1 1  | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b2[7:0]
 100     1 0 0  | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b3[7:0]
 101     1 0 1  | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b3[7:0]
 102     1 1 0  | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b3[7:0]
 103     1 1 1  | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b3[7:0]
 104
 105     therefore:
 106
 107     if True:          o3 =  a0b0[31:24]
 108     if ~p0:           o3 |= a1b0[23:16]
 109     if  p0:           o3 |= a1b1[23:16]
 110     if ~p0& p1:       o3 |= a2b1[15:8]
 111     if  p0& p1:       o3 |= a2b0[15:8]
 112     if      p1:       o3 |= a2b2[15:8]
 113     if ~p0&~p1&~p2:   o3 |= a3b0[7:0]
 114     if  p0&~p1&~p2:   o3 |= a3b1[7:0]
 115     if      p1&~p2:   o3 |= a3b2[7:0]
 116     if          p2:   o3 |= a3b3[7:0]
 117
 118 Code that prints the above:
 119
 120     def decoderange(col, k):
 121         xs = (col-k)*8
 122         return "%d:%d" % (xs+7, xs)
 123
 124     def decodelist(l):
 125         res = []
 126         for (idx, sgn) in l:
 127             sgn = " " if sgn else "~"
 128             res.append("%sp%d" % (sgn, idx))
 129         return '&'.join(res)
 130
 131     N = 4
 132     M = 4
 133
 134     for col in range(M):
 135         r = decoderange(col, 0)
 136         print "if True: o%d = a0b0[%s]" % (col, r)
 137         for i in range(1,col+1):
 138             l = []
 139             for j in range(i):
 140                 l.append([j, False])
 141             k = 0
 142             s = decodelist(l)
 143             r = decoderange(col, i)
 144             print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
 145             k += 1
 146             while l:
 147                 l[0][1] = True
 148                 s = decodelist(l)
 149                 r = decoderange(col, i)
 150                 print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
 151                 k += 1
 152                 del l[0]
 153         print
 154
 155 ## Note
 156
 157 One important part to remember is that the upper bits of the shift amount need to specifically *ignored* rather than assuming that they are zeros, this is because, for shift instructions: `a << b` actually is `a << (b % bit_len(a))`
 158
 159 # Static Partitioned Shift
 160
 161 Static shift is pretty straightforward: the input is the entire number
 162 shifted, however clearly due to partitioning some of the bits need to
 163 be masked out (to zero, or to 1 if it is an arithmetic shift right).
 164 This can therefore use the class currently known as "MultiShiftRMerge"
 165 or similar, which can merge in a 1 into the shifted data.  Luckily
 166 the amount to be 0'd/1'd is also statically computable: it is just that
 167 the location *where* is dynamic, based on the partition location(s).
 168
 169 # Bugreports
 170
 171 * <http://bugs.libre-riscv.org/show_bug.cgi?id=173>