When the partitions are all closed (4x SIMD), each partition of B is
2 bits wide, so only the *first two* bits of A are copied into
*each* of the four 2-bit partitions of B.
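
A minimal sketch of the all-closed case follows. It assumes B is 8 bits
wide (four 2-bit partitions) and that "first two bits" means the two
least-significant bits of A, as in an ordinary truncating assignment.
`broadcast_closed` is a hypothetical helper name used only for
illustration.

```python
def broadcast_closed(a, part_width=2, parts=4):
    """Copy the low part_width bits of scalar a into every partition of B.

    Assumption: "first two bits" is read as the least-significant bits.
    """
    low = a & ((1 << part_width) - 1)   # truncate A to one partition's width
    b = 0
    for i in range(parts):
        b |= low << (i * part_width)    # place the same bits in partition i
    return b


# A = 0b10110110: the low two bits are 0b10, repeated four times.
print(f"{broadcast_closed(0b10110110):08b}")  # 10101010
```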
+
+When A is narrower than the output B, sign or zero extension is
+required. Here we assume A is 8 bits and B is 16 bits. This is
+similar to the parallel case, except that A is repeated (broadcast)
+across all of B. In the table below, bits in square brackets are
+extension bits (sign extension is shown).
+
+
+| partition | o3 | o2 | o1 | o0 |
+| --------- | -- | -- | -- | -- |
+| 000 | [A7A7A7A7] | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 |
+| 001 | [A7A7A7A7] | [A7A7]A7A6 | A5A4A3A2 | [A1A1]A1A0 |
+| 010 | [A7A7A7A7] | A7A6A5A4 | [A3A3A3A3] | A3A2A1A0 |
+| 011 | [A7A7A7A7] | A7A6A5A4 | [A3A3]A3A2 | [A1A1]A1A0 |
+| 100 | [A7A7]A7A6 | [A5A5A5A5] | [A5A5]A5A4 | A3A2A1A0 |
+| 101 | [A7A7]A7A6 | [A5A5A5A5] | A5A4A3A2 | [A1A1]A1A0 |
+| 110 | [A7A7]A7A6 | [A5A5]A5A4 | [A3A3A3A3] | A3A2A1A0 |
+| 111 | [A7A7]A7A6 | [A5A5]A5A4 | [A3A3]A3A2 | [A1A1]A1A0 |
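
The rows as tabulated can be regenerated mechanically. The sketch below
is one reading inferred from the table itself rather than from the text:
A is cut at the same partition points as B (2 bits of A per 4-bit lane of
B), and each piece is sign-extended to fill its partition. Bit 0 of the
partition mask is assumed to be the break between o0 and o1. The names
`groups`, `extend_symbolic` and `format_row` are hypothetical.

```python
A_LANE = 2   # bits of A per lane (8-bit A, four lanes)
B_LANE = 4   # bits of B per lane (16-bit B, four lanes)
LANES = 4


def groups(mask):
    """Yield (first_lane, lane_count) for each partition selected by mask."""
    start = 0
    for lane in range(LANES):
        is_last = lane == LANES - 1
        if is_last or (mask >> lane) & 1:   # a set bit closes the partition
            yield start, lane - start + 1
            start = lane + 1


def extend_symbolic(mask):
    """For each bit of B (LSB first) return (index into A, is_extension)."""
    out = []
    for first, count in groups(mask):
        lo = first * A_LANE
        width_a = count * A_LANE
        width_b = count * B_LANE
        for bit in range(width_b):
            if bit < width_a:
                out.append((lo + bit, False))          # copied bit of A
            else:
                out.append((lo + width_a - 1, True))   # repeated sign bit
    return out


def format_row(mask):
    """Return the o3..o0 cells for one mask, in the table's notation."""
    bits = extend_symbolic(mask)
    cells = []
    for lane in reversed(range(LANES)):
        lane_bits = bits[lane * B_LANE:(lane + 1) * B_LANE][::-1]  # MSB first
        ext = "".join(f"A{i}" for i, is_ext in lane_bits if is_ext)
        real = "".join(f"A{i}" for i, is_ext in lane_bits if not is_ext)
        cells.append((f"[{ext}]" if ext else "") + real)
    return cells


for mask in range(8):
    print(f"| {mask:03b} | " + " | ".join(format_row(mask)) + " |")
```

Zero extension would use the same grouping, with the bracketed positions
filled with 0 instead of the top bit of each piece.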