| partition | o3 | o2 | o1 | o0 |
| --------- | -- | -- | -- | -- |
| 000 | [A7A7A7A7] | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 |
-| 001 | [A5A5A5A5] | [A5A5]A5A4 | A3A2A1A0 | [A1A1]A1A0 |
-| 010 | [A3A3A3A3] | A3A2A1A0 | [A3A3A3A3] | A3A2A1A0 |
-| 011 | [A3A3A3A3] | A3A2A1A0 | [A1A1]A1A0 | [A1A1]A1A0 |
-| 100 | [A1A1]A1A0 | [A5A5A5A5] | [A5A5]A5A4 | A3A2A1A0 |
-| 101 | [A1A1]A1A0 | [A3A3A3A3] | A3A2A1A0 | [A1A1]A1A0 |
-| 110 | [A1A1]A1A0 | [A1A1]A1A0 | [A3A3A3A3] | A3A2A1A0 |
-| 111 | [A1A1]A1A0 | [A1A1]A1A0 | [A1A1]A1A0 | [A1A1]A1A0 |
+| 001 | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 | A3A2A1A0 |
+| 010 | A7A6A5A4 | A3A2A1A0 | A7A6A5A4 | A3A2A1A0 |
+| 011 | A7A6A5A4 | A3A2A1A0 | A3A2A1A0 | A3A2A1A0 |
+| 100 | A3A2A1A0 | [A7A7A7A7] | A7A6A5A4 | A3A2A1A0 |
+| 101 | A3A2A1A0 | A7A6A5A4 | A3A2A1A0 | A3A2A1A0 |
+| 110 | A3A2A1A0 | A3A2A1A0 | A7A6A5A4 | A3A2A1A0 |
+| 111 | A3A2A1A0 | A3A2A1A0 | A3A2A1A0 | A3A2A1A0 |
+
+Note that when the entire partition set is open (1x 16-bit output),
+all of A is copied out and either zero- or sign-extended into the
+top half of the output. At the other extreme are the 4x 4-bit output
+partitions, each of which holds a copy of A truncated to its lowest
+4 bits (A3..A0).
+
+Unlike the parallel case, A is not itself partitioned, so as much of it
+as will fit is copied into each output partition. In some cases, such as
+`1x 4-bit, 1x 12-bit`, the 8-bit scalar source needs sign- or
+zero-extending to fill the wider (12-bit) partition.
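
The table rows can be reproduced with a small behavioural model. The following is only a sketch, not the project's implementation: the names `partition_spans` and `broadcast_scalar` are hypothetical, and it assumes (as the table suggests) that the least significant bit of the partition mask is the break between o0 and o1, and that a `signed` flag selects sign extension over zero extension.

```python
def partition_spans(pmask, lanes=4):
    """Return (start_lane, lane_count) for each partition, lowest first.

    Assumption: bit i of pmask set means a break between lane i and lane i+1.
    """
    spans = []
    start = 0
    for i in range(lanes):
        is_last = i == lanes - 1
        if is_last or (pmask >> i) & 1:
            spans.append((start, i - start + 1))
            start = i + 1
    return spans


def broadcast_scalar(a, pmask, signed=False, a_width=8, lanes=4, lane_width=4):
    """Copy scalar `a` into every output partition, truncating or extending."""
    a &= (1 << a_width) - 1
    out = 0
    for start, count in partition_spans(pmask, lanes):
        width = count * lane_width
        val = a
        if width > a_width:
            # Partition is wider than A: sign-extend if requested,
            # otherwise the upper bits stay zero (zero extension).
            if signed and (a >> (a_width - 1)) & 1:
                val |= ((1 << width) - 1) & ~((1 << a_width) - 1)
        else:
            # Partition is no wider than A: keep only the low bits.
            val &= (1 << width) - 1
        out |= val << (start * lane_width)
    return out


if __name__ == "__main__":
    for pmask in range(8):
        result = broadcast_scalar(0xA5, pmask, signed=True)
        print(f"{pmask:03b} -> {result:04x}")
```

With `a = 0xA5`, partition `000` gives `0xffa5` when signed (`0x00a5` unsigned), while partition `111` gives `0x5555`: four copies of the low nibble.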