(no commit message)

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 46c20774a9e72301be7fd05f1822aa07f61191c7..1f479badad0209ee49ebd85864e2376ee9613fb4 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -135,7 +135,7 @@ integer (and floating point) of various sizes is automatically inferred
  due to "type tagging" that is set with a special instruction.  A register
  will be *specifically* marked as "16-bit Floating-Point" and, if added
  to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
  to be explicitly done with its own separate instruction.
  
  However, implicit type-conversion is not only quite burdensome to
@@ -268,13 +268,12 @@ requirements would therefore seem to be a logical thing to do.
  
  # Instruction Format
  
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought.  V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width.  Lots to
-look at and investigate*
+Simple-V does not itself define *any* compare operations, nor *any*
+arithmetic, floating-point or memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on the CSR configurations for both vector length and
+bitwidth.  This includes Compressed instructions.
  
  * For analysis of RVV see [[v_comparative_analysis]] which begins to
    outline topologically-equivalent mappings of instructions
@@ -376,9 +375,11 @@ Notes:
    comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
    src1 and src2).
  
-# LOAD / STORE Instructions
+## LOAD / STORE Instructions
  
-For full analysis of adaptation of RVV LOAD/STORE see [[v_comparative_analysis]]
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction.
  
  Revised LOAD:
  
@@ -389,9 +390,16 @@ imm[11:0]               |||| rs1   | funct3  | rd   | opcode |
  ?  | s  |  rs2  | imm[4:0] | base  | width   | dest | LOAD   |
  """]]
  
+The same adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the same instruction format.  Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
  Notes:
  
  * LOAD remains functionally (topologically) identical to RVV LOAD
+  (for both integer and floating-point variants).
  * Predication CSR-marking register is not explicitly shown in instruction, it's
    implicit based on the CSR predicate state for the rd (destination) register
  * rs2, the source, may *also be marked as a vector*, which implicitly
@@ -420,8 +428,55 @@ Pseudo-code (excludes CSR SIMD bitwidth):
            vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
          }
  
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
+
  A similar instruction exists for STORE, with identical topological
-translation of all features.
+translation of all features.  **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are of the same format, except that for STORE
+bits 2-4 are a src register instead of dest:
+
+[[!table  data="""
+15  13 | 12       10 | 9    7 | 6         5 | 4  2 | 1  0 |
+funct3 | imm         | rs1'   | imm         | rd'  | op   |
+3      | 3           | 3      | 2           | 3    | 2    |
+C.LW   | offset[5:3] | base   | offset[2|6] | dest | C0   |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled.  In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
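The offset-bit trade-off described above can be illustrated with a minimal sketch. It assumes the five offset bits have already been gathered into one integer with the most significant bit first; the function name `split_compressed_offset` is hypothetical, and neither the real bit scrambling of the C.LW immediate nor any offset scaling is modelled.

```python
def split_compressed_offset(field5):
    """Split a 5-bit offset field into (stride_enabled, 4-bit offset).

    Assumption: field5 holds the five offset bits, top bit first.
    """
    if not 0 <= field5 < 32:
        raise ValueError("offset field must fit in 5 bits")
    stride_enabled = bool(field5 >> 4)  # top bit re-purposed as the "stride" flag
    offset4 = field5 & 0xF              # remaining four bits stay as the offset
    return stride_enabled, offset4
```

Under this scheme the reachable offset range halves, which is the cost the TODO above asks to be assessed.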
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions.  For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
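The load-multiple equivalence can be sketched for the unit-stride case only. This is an illustration under stated assumptions, not the specified behaviour: the function name `vector_load_unit`, the flat `regs` list, the `bytes` memory model and little-endian byte order are all assumptions, and predication is ignored.

```python
def vector_load_unit(regs, mem, rd, base, vl, width=4):
    """Load vl consecutive width-byte words into regs[rd .. rd+vl-1].

    Assumptions: unit stride, little-endian memory, no predication.
    """
    addr = regs[base]  # read the base once, in case base lies within rd..rd+vl-1
    for j in range(vl):
        start = addr + j * width
        regs[rd + j] = int.from_bytes(mem[start:start + width], "little")
    return regs
```

With `vl` registers filled from consecutive memory words, the effect matches what a load-multiple instruction does on other processors.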
+Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments on
+page 76, "For virtual memory systems some data accesses could be resident
+in physical memory and some not".  The interesting question then arises:
+how does RVV deal with the exact same scenario?
  
  # Note on implementation of parallelism
  
@@ -470,6 +525,7 @@ precedent in the setting of MISA to enable / disable extensions).
  * Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
  * Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
  * Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
  
  Notes:
  
@@ -929,7 +985,7 @@ This section has been moved to its own page [[p_comparative_analysis]]
               irs2 += 1;
      }
  
-## Retro-fitting Predication into branch-explicit ISA
+## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
  
  One of the goals of this parallelism proposal is to avoid instruction
  duplication.  However, with the base ISA having been designed explicitly
@@ -1067,6 +1123,7 @@ This is interpreted as follows:
  * Given that the context is RV32, ELEN=32.
  * With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
  * Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length of 5, therefore the last "element" is skipped
  
  So when using an operation that uses r2 as a source (or destination)
  the operation is carried out as follows:
@@ -1090,6 +1147,38 @@ operations carried out 32-bits at a time is perfectly acceptable, as is
  Regardless of the internal parallelism choice, *predication must
  still be respected*, making Simple-V in effect the "consistent public API".
  
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default  |
+| 001 | 8        |
+| 010 | 16       |
+| 011 | 32       |
+| 100 | 64       |
+| 101 | 128      |
+| 110 | rsvd     |
+| 111 | rsvd     |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0):
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)  # integer-div rounded down
+        elidx = i % simdmult          # element index within the register
+        low = elidx * bytesperreg * 8 # bit offset of the element
+        return rnum + regidx,         # actual real register
+               low,                   # low bit
+               low + bytesperreg * 8 - 1 # high bit
+
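A runnable sketch of the vector-length and `regoffs` pseudocode above follows. It assumes that the low/high values are meant as the inclusive bit range of the element inside the register, so the element width in bits is used rather than the raw vew code. `BYTESTABLE` and `simd_params` are hypothetical names, and elements wider than the register (for example 64-bit on RV32) are rejected because the text above does not cover that case.

```python
# Element size in bytes for each vew encoding; 0 means "default" (XLEN),
# and the reserved encodings 6 and 7 are deliberately left out.
BYTESTABLE = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}

def simd_params(vew, xlen=32):
    """Return (bytes per element, SIMD elements per register)."""
    reg_bytes = xlen // 8
    elem_bytes = reg_bytes if vew == 0 else BYTESTABLE[vew]
    if elem_bytes > reg_bytes:
        raise ValueError("element wider than register: not covered by this sketch")
    return elem_bytes, reg_bytes // elem_bytes

def regoffs(rnum, i, vew, xlen=32):
    """Map element index i to (real register, low bit, high bit)."""
    elem_bytes, simdmult = simd_params(vew, xlen)
    regidx = i // simdmult            # which register past rnum
    elidx = i % simdmult              # which element inside that register
    low = elidx * elem_bytes * 8      # first bit of the element
    return rnum + regidx, low, low + elem_bytes * 8 - 1

# RV32 with 16-bit elements; a register vector length of 3 is assumed here.
elem_bytes, simdmult = simd_params(vew=0b010)
vlen = 3 * simdmult                   # six elements available in total
```

With a vsetl length of 5, only element indices 0 to 4 would be processed, so `regoffs(2, 4, 0b010)` addresses the last active element.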
  ### Example Instruction translation: <a name="example_translation"></a>
  
  Instructions "ADD r2 r4 r4" would result in three instructions being