whitespace cleanup

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index c41191bc357a04fed64ac6472072fec6e645a86b..5d4e67ab4afb3c193f6383a9b805dc2e1f0aee01 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -2,6 +2,22 @@
  
  [[!toc ]]
  
+# Summary
+
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FILO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit *if*
+Simple-V was added on top.
+
+# Introduction
+
  This proposal exists so as to be able to satisfy several disparate
  requirements: power-conscious, area-conscious, and performance-conscious
  designs all pull an ISA and its implementation in different conflicting
@@ -11,7 +27,7 @@ Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
  whilst each extremely powerful in their own right and clearly desirable,
  are also:
  
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
    so need work to adapt to the RISC-V ethos and paradigm
  * Are sufficiently large so as to make adoption (and exploration for
    analysis and review purposes) prohibitively expensive
@@ -249,14 +265,21 @@ contains an extremely interesting feature: zero-overhead loops.  This
  proposal would basically allow an inner loop of instructions to be
  repeated indefinitely, a fixed number of times.
  
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
  implementation*.  Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
+
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
  
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
+However when looking at much more comprehensive schemes
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, tending to suggest that ZOLC may be
+better off being proposed as an entirely separate Extension.
  
  ## Mask and Tagging (Predication)
  
@@ -399,19 +422,24 @@ follows:
  * Fixed vs variable parallelism: <b>variable</b>
  * Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
  * Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit</b>
-* Tag or no-tag: <b>Complex and needs further thought</b>
+* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
+* Tag or no-tag: <b>Complex but highly beneficial</b>
+
+In particular:
  
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD.  Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
+* variable-length vectors came out on top because of the high setup, teardown
+  and corner-cases associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+  former limitations and restrictions (in a backwards-compatible fashion),
+  whilst also leaving implementors free to simmplify implementations
+  by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+  potentially 100% occupied in a single-issue in-order implementation
+  i.e. *without* requiring a super-scalar or out-of-order architecture,
+  but doing a proper, full job (ZOLC) is an entirely different matter.
  
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
+Constructing a SIMD/Simple-Vector proposal based around four of these five
+requirements would therefore seem to be a logical thing to do.
  
  # Instruction Format
  
@@ -682,6 +710,84 @@ existing RV Load/Store:
   offset[11:5]  |   src    |   base    |  width   |  offset[4:0]  |   STORE |
  """]]
  
+The RV32 instruction opcodes as follows:
+
+[[!table  data="""
+31 28  27 | 26 25 | 24  20 |19  15  |14| 13 12 | 11   7 | 6     0 | op  |
+imm[4:0]  | 00    | 00000  |    rs1 | 1| m     | vd     | 0000111 | VLD |
+imm[4:0]  | 01    |   rs2  |    rs1 | 1| m     | vd     | 0000111 | VLDS|
+imm[4:0]  | 11    |   vs2  |    rs1 | 1| m     | vd     | 0000111 | VLDX|
+vs3       | 00    | 00000  |    rs1 |1 | m     |imm[4:0]| 0100111 |VST  |
+vs3       | 01    | rs2    |    rs1 |1 | m     |imm[4:0]| 0100111 |VSTS |
+vs3       | 11    | vs2    |    rs1 |1 | m     |imm[4:0]| 0100111 |VSTX |
+"""]]
+
+Conversion on LOAD as follows:
+
+* rd or rs1 are CSR-vectorised indicating "Vector Mode"
+* rd equivalent to vd
+* rs1 equivalent to rs1
+* imm[4:0] from RV format (11..7]) is same
+* imm[9:5] from RV format (29..25] is rs2 (rs2=00000 for VLD)
+* imm[11:10] from RV format (31..30] is opcode (VLD, VLDS, VLDX)
+* width from RV format (14..12) is same (width and zero/sign extend)
+
+[[!table  data="""
+31 30 | 29 25 | 24    20 | 19 15 | 14  12 | 11      7 | 6    0 |
+imm[11:0]              ||| rs1   | funct3 | rd        | opcode |
+2     | 5     | 5        | 5     | 3      | 5         | 7      |
+00    | 00000 | imm[4:0] | base  | width  | dest      | LOAD   |
+01    | rs2   | imm[4:0] | base  | width  | dest      | LOAD.S |
+11    | rs2   | imm[4:0] | base  | width  | dest      | LOAD.X |
+"""]]
+
+Similar conversion on STORE as follows:
+
+[[!table  data="""
+31 30 | 29  25 | 24   20 | 19 15 | 14  12 | 11      7 | 6    0 |
+imm[11:0]              ||| rs1   | funct3 | rd        | opcode |
+2     | 5      | 5       | 5     | 3      | 5         | 7      |
+00    | 00000  | src     | base  | width  | offs[4:0] | LOAD   |
+01    | rs3    | src     | base  | width  | offs[4:0] | LOAD.S |
+11    | rs3    | src     | base  | width  | offs[4:0] | LOAD.X |
+"""]]
+
+Notes:
+
+* Predication CSR-marking register is not explicitly shown in instruction
+* In both LOAD and STORE, it is possible now to rs2 (or rs3) as a vector.
+* That in turn means that Indexed Load need not have an explicit opcode
+* That in turn means that bit 30 may indicate "stride" and bit 31 is free
+
+Revised LOAD:
+
+[[!table  data="""
+31 | 30 | 29  25      |  24   20 | 19    15 | 14    12 | 11   7 | 6     0 |
+   imm[11:0]                  ||||    rs1   |  funct3  |   rd   |  opcode |
+ 1 | 1  |  5          |    5     |     5    |    3     |    5   |     7   |
+ ? | s  |  rs2        | imm[4:0] |   base   |  width   |  dest  |  LOAD   |
+"""]]
+
+Where in turn the pseudo-code may now combine the two:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+        {
+          if CSRvectorised[rs2])
+             offs = vreg[rs2][i]
+          else
+             offs = i*(seglen+1)*stride;
+          vreg[vd+j][i] = mem[sreg[base] + offs + j*stride];
+        }
+
+Notes:
+
+* j is multiplied by stride, not elsize, including in the rs2 vectorised case.
+* There may be more sophisticated variants involving the 31st bit, however
+  it would be nice to reserve that bit for post-increment of address registers
  
  ## 17.19 Vector Register Gather
  
@@ -1018,6 +1124,229 @@ translates effectively to:
  * Throw an exception.  Whether that actually results in spawning threads
    as part of the trap-handling remains to be seen.
  
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+## [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes multiple independent registers) meaning
+  that SRAM or 2R1W can be used for entire regfile (potentially).
+* minus: a more complex instruction set where the parallelism is much
+  more explicitly directly specified in the instruction and
+* minus: if you *don't* have an explicit instruction (opcode) and you
+  need one, the only place it can be added is... in the vector unit and
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+  not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+  but with the down-side that they're an all-or-nothing part of the Extension.
+  No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+  parallelisation can be carried out, followed by further parallel Lane-based
+  work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+  is to drop data into memory and immediately back in again (like MMX).
+
+## Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware.  It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity of having to use register renames, OoO, VLIW,
+  register file cacheing, all of which has been done before but is a
+  pain
+* plus: transparent re-use of existing opcodes as-is just indirectly
+  saying "this register's now a vector" which
+* plus: means that future instructions also get to be inherently
+  parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+  a CSR register to indicate vector length, a separate one to indicate
+  that it is a predicate register and so on) means a little more setup
+  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+  approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+  operations not suited to parallelisation may be carried out interleaved
+  between parallelised instructions *without* requiring data to be dropped
+  down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+  files means that huge parallel workloads would use up considerable
+  chunks of the register file.  However in the case of RV64 and 32-bit
+  operations, that effectively means 64 slots are available for parallel
+  operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+  be added, yet the instruction opcodes remain unchanged (and still appear
+  to be parallel).  consistent "API" regardless of actual internal parallelism:
+  even an in-order single-issue implementation with a single ALU would still
+  appear to have parallel vectoristion.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area).  At worse it
+  would be "no worse" than existing register renaming, OoO, VLIW and register
+  file cacheing schemes.
+
+## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+  streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+  (similar to Alt-RVP) may be used as an implementation detail,
+  using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
+  really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+  do not gain parallelism, resulting in prolific duplication of functionality
+  inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+  using the standard integer or FP regfile) an entire vector must be
+  transferred out to memory, into standard regfiles, then back to memory,
+  then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is.  May
+  be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+  vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+  implementation time and die area, meaning that adoption is likely only
+  to be in high-performance specialist supercomputing (where it will
+  be absolutely superb).
+
+## Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance.  Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+  at once.  Parallelism is inherent at the ALU, making the addition of
+  SIMD-style parallelism an easy decision that has zero significant impact
+  on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+  therefore result in superb throughput, easily achieved even with a very
+  simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+  increase instruction count on what would otherwise be a "simple loop",
+  should the number of elements in an array not happen to exactly match
+  the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+  are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+  are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+  dimension and parallelism (width): an at least O(N^2) and quite probably
+  O(N^3) ISA proliferation that often results in several thousand
+  separate instructions.  all requiring separate and distinct corner-case
+  algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+  8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+  For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+  four separate and distinct instructions: one for (r1:low r2:high),
+  one for (r1:high r2:low), one for (r1:high r2:high) and one for
+  (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+  between operand and result bit-widths.  In combination with high/low
+  proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+  that allow control over individual elements within the SIMD block.
+
+# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD.  In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be"?
+
+## [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+  a SIMD architecture where the ALU becomes responsible for the parallelism,
+  Alt-RVP ALUs would likewise be so responsible... with *additional*
+  (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+  at least one dimension are avoided (architectural upgrades introducing
+  128-bit then 256-bit then 512-bit variants of the exact same 64-bit
+  SIMD block)
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+  of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+  be able to subdivide the bits of each register lane (columns) down into
+  arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+  16-bit or even 8-bit, effectively dividing the registerfile into
+  Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further.  If inter-lane
+  "swapping" instructions were then introduced, some of the disadvantages
+  of SIMD could be mitigated.
+
+## RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+  parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+  DSPs with a focus on Multimedia (Audio, Video and Image processing),
+  RVV's primary focus appears to be on Supercomputing: optimisation of
+  mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel) 
+  into a SIMD instruction requires an equivalent to be added to the
+  RVV Extension, if one does not exist.  Given the specialist nature of
+  some SIMD instructions (8-bit or 16-bit saturated or halving add),
+  this possibility seems extremely unlikely to occur, even if the
+  implementation overhead of RVV were acceptable (compared to
+  normal SIMD/DSP-style single-issue in-order simplicity).
+
+## Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+  topologically transplant every single instruction from RVV (as
+  designed) into Simple-V equivalents, with *zero loss of functionality
+   or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+  Extension which contained the basic primitives (non-parallelised
+  8, 16 or 32-bit SIMD operations) inherently *become* parallel,
+  automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+  to have special SIMD-parallel opcodes added need no longer have *any*
+  of the length-dependent variants (2of 32-bit ADDs in a 64-bit register,
+  4of 32-bit ADDs in a 128-bit register) because Simple-V takes the
+  *standard* RV opcodes (present and future) and automatically parallelises
+  them.
+* By inheriting the RVV feature of arbitrary vector-length, then just as
+  with RVV the corner-cases and ISA proliferation of SIMD is avoided.
+* Whilst not entirely finalised, registers are expected to be
+  capable of being subdivided down to an implementor-chosen bitwidth
+  in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+  and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+  choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+  ALUs that perform twin 8-bit operations as they see fit, or anything
+  else including no subdivisions at all.
+* Even though implementors have that choice even to have full 64-bit
+  (with RV64) SIMD, they *must* provide predication that transparently
+  switches off appropriate units on the last loop, thus neatly fitting
+  underlying SIMD ALU implementations *into* the arbitrary vector-length
+  RVV paradigm, keeping the uniform consistent API that is a key strategic
+  feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+  can be done by applying *Parallelised* Bit-manipulation operations
+  followed by parallelised *straight* versions of element-to-element
+  arithmetic operations, even if the bit-manipulation operations require
+  changing the bitwidth of the "vectors" to do so.  Predication can
+  be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD - massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit, is avoided with Simple-V, through
+  vector-style parallelism being dropped on top of 8-bit or 16-bit
+  operations, all the while keeping a consistent ISA-level "API" irrespective
+  of implementor design choices (or indeed actual implementations).
+
  # Impementing V on top of Simple-V
  
  * Number of Offset CSRs extends from 2
@@ -1032,9 +1361,84 @@ translates effectively to:
    (caveat: anything not specified drops through to software-emulation / traps)
  * TODO
  
-# Analysis of CSR decoding on latency
-
-<a name="csr_decoding_analysis"></a>
+# Register reordering <a name="register_reordering"></a>
+
+## Register File 
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0 | (32..0) |
+| r1 | (32..0) |
+| r2 | (32..0) |
+| r3 | (32..0) |
+| r4 | (32..0) |
+| r5 | (32..0) |
+| r6 | (32..0) |
+| r7 | (32..0) |
+
+## Vectorised CSR
+
+May not be an actual CSR: may be generated from Vector Length CSR:
+single-bit is less burdensome on instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |  
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
+
+## Vector Length CSR
+
+| Reg Num | (3..0) |
+| ------- | ---- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
+
+## Virtual Register Reordering:
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0 | (32..0) | (32..0) |
+| r2 | (32..0) |
+| r3 | (32..0) |
+| r4 | (32..0) | (32..0) | (32..0) |
+| r7 | (32..0) |
+
+## Example Instruction translation: <a name="example_translation"></a>
+
+Instructions "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FILO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
+
+## Insights 
+
+SIMD register file splitting still to consider.  For RV64, benefits of doubling
+(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
+size of the floating point register file to 64 (128 in the case of HP)
+seem pretty clear and worth the complexity.
+
+64 virtual 32-bit F.P. registers and given that 32-bit FP operations are
+done on 64-bit registers it's not so conceptually difficult.  May even
+be achieved by *actually* splitting the regfile into 64 virtual 32-bit
+registers such that a 64-bit FP scalar operation is dropped into (r0.H
+r0.L) tuples.  Implementation therefore hidden through register renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated quicker (or in a more compact fashion that is less heavy
+on caches).  Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+# Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
  
  It could indeed have been logically deduced (or expected), that there
  would be additional decode latency in this proposal, because if
@@ -1122,6 +1526,26 @@ pluses:
    parallel ALUs) is only equal to one ("virtual" parallelism), or is
    greater than one, should not be underestimated.
  
+# Appendix
+
+# Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4.  They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+
  
  # References
  
@@ -1143,3 +1567,7 @@ pluses:
  * Videocore-IV <https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-3d-Graphics-Pipeline>
  * Discussion proposing CSRs that change ISA definition
    <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/InzQ1wr_3Ak>
+* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
+* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
+* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>