parallel ALUs) is only equal to one ("virtual" parallelism), or is
greater than one, should not be underestimated.
+# Appendix
+
+# Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4. They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+
# References
* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>