into a parallel one, in a step-by-step incremental fashion, without adding any new opcodes, thus allowing
the implementor to focus on adding hardware where it is needed and necessary.
The primary target is for mobile-class 3D GPUs and VPUs, with secondary
-goals being to reduce executable size and reduce context-switch latency.
+goals being to reduce executable size (by extending the effectiveness of RV opcodes, RVC in particular) and reduce context-switch latency.
Critically: **No new instructions are added**. The parallelism (if any
is implemented) is implicitly added by tagging *standard* scalar registers
* To over-ride the implicit or explicit bitwidth that the operation would
normally give the register.
+Note: clearly, if an RVC operation uses a 3 bit spec'd register (x8-x15) and the Register table contains entries that only refer to registers x1-x7 or x16-x31, such operations will *never* activate the VL hardware loop!
+
+If however the (16 bit) Register table does contain such an entry (x8-x15, or x2 in the case of LWSP), that src or dest reg may be redirected anywhere in the *full* 128 register range. Thus, RVC becomes far more powerful and has many more opportunities to reduce code size than in Standard RV32/RV64 executables.
+
16 bit format:
| RegCAM | | 15 | (14..8) | 7 | (6..5) | (4..0) |