For multiply, divide and shift it is worthwhile to use
one scalar register, in effect, as a full 64-bit carry chain.
+The limitations of this approach are therefore clear:
+not only must Vertical-First Mode be used, but so must the
+predication-with-zeroing trick. Worse, an entire temporary vector
+is required, which wastes register space.
+A better approach is a single scalar instruction
+that performs the long-shift in-place.
+
The basic principle of the 3-in 2-out `dsrd` is:
```