creating it is to provide a manageable way to turn a pre-existing design
into a parallel one, in a step-by-step incremental fashion, allowing
the implementor to focus on adding hardware only where it is needed.
+The primary target is mobile-class 3D GPUs and VPUs, with secondary
+goals of reducing executable size and context-switch latency.
Critically: **No new instructions are added**. The parallelism (if any
is implemented) is implicitly added by tagging *standard* scalar registers