X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension%2Fappendix.mdwn;h=c29044cfea6b9772be22c43d9b8dc3d968f819ee;hb=1f429eeba125e65ba4649045196d043a4acac31d;hp=6e883137c349c42a1b9c891cda4d8e13faaf89d4;hpb=9aba9521933abc4cd2bc394790f64aa461c0c83b;p=libreriscv.git
diff --git a/simple_v_extension/appendix.mdwn b/simple_v_extension/appendix.mdwn
index 6e883137c..c29044cfe 100644
--- a/simple_v_extension/appendix.mdwn
+++ b/simple_v_extension/appendix.mdwn
@@ -1,13 +1,17 @@
-# Simple-V (Parallelism Extension Proposal) Appendix
+[[!oldstandards]]
+
+# Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)
+
+**OBSOLETE**
* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
-* Last edited: 25 jun 2019
+* Last edited: 30 jun 2019
* main spec [[specification]]
[[!toc ]]
-# Fail-on-first modes
+# Fail-on-first modes
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
@@ -15,47 +19,67 @@ that is zero", however with traps, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (when VL=1).
+Note that implementors are required to choose one mode or the other,
+mutually exclusively: an instruction is **not** permitted to fail on a
+trap *and* fail a conditional test at the same time. This advice applies
+to custom opcode writers as well as future extension writers.
+
## Fail-on-first traps
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element instruction require a trap,
instead it and subsequent indexed elements are ignored (or cancelled in
-out-of-order designs), and VL is set to the *last* instruction that did
-not take the trap.
+out-of-order designs), and VL is set to the *last* in-sequence element
+that did not take the trap.
-Note that predicated-out elements (where the predicate mask bit is zero)
-are clearly excluded (i.e. the trap will not occur). However, note that
-the loop still had to test the predicate bit: thus on return,
+Note that predicated-out elements (where the predicate mask bit is
+zero) are clearly excluded (i.e. the trap will not occur). However,
+note that the loop still had to test the predicate bit: thus on return,
VL is set to include elements that did not take the trap *and* includes
the elements that were predicated (masked) out (not tested up to the
point where the trap occurred).
+Unlike conditional tests, "fail-on-first trap" instruction behaviour is
+unaltered by setting zero or non-zero predication mode.
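
The VL-truncation rule above (masked-out elements never trap but are
still counted, and the first element traps normally) can be sketched as
a behavioural model. This is an illustrative sketch only, not part of
the specification; the function name and exception-as-trap convention
are invented for the example:

```python
def ffirst_trap_vl(elements, predicate, vl, op):
    """Model of fail-on-first trap VL truncation (illustrative sketch).

    op(element) raises an exception to model a trap.  Predicated-out
    elements are skipped (no trap possible) but still counted: on a
    trap at element i, VL becomes i, so elements 0..i-1 survive,
    including any masked-out elements before the trap point.
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue              # masked out: never traps, still counted
        try:
            op(elements[i])
        except Exception:
            if i == 0:
                raise             # first element throws the trap as normal
            return i              # VL = elements completed before the trap
    return vl                     # no trap: VL unchanged
```

Note how masking out the offending element (second assertion in a test)
means no trap occurs at all and VL stays at its full value.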
+
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
-will cause a trap as normal (as if ffirst is not set); subsequently,
-the trap must not occur in the *sub-group* of elements. SUBVL will **NOT**
-be modified.
+will cause a trap as normal (as if ffirst is not set); subsequently, the
+trap must not occur in the *sub-group* of elements. SUBVL will **NOT**
+be modified. Trap handlers must analyse (x)eSTATE (the subvl offset
+indices) to determine the element that caused the trap.
Given that predication bits apply to SUBVL groups, the same rules apply
-to predicated-out (masked-out) sub-groups in calculating the value that VL
-is set to.
+to predicated-out (masked-out) sub-groups in calculating the value that
+VL is set to.
## Fail-on-first conditional tests
-ffirst stops sequential element conditional testing on the first element result
-being zero. VL is set to the number of elements that were processed before
-the fail-condition was encountered.
-
-Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
-will cause the processing to end, and, even if there were elements within
-the *sub-group* that passed the test, that sub-group is still (entirely)
-excluded from the count (from setting VL). i.e. VL is set to the total
-number of *sub-groups* that had no fail-condition up until execution was
-stopped.
+ffirst stops sequential (or sequentially-appearing in the case of
+out-of-order designs) element conditional testing on the first element
+result being zero (or other "fail" condition). VL is set to the number
+of elements that were (sequentially) processed before the fail-condition
+was encountered.
+
+Unlike trap fail-on-first, fail-on-first conditional testing behaviour
+responds to changes in the zero or non-zero predication mode. Whilst
+in non-zeroing mode, masked-out elements are simply not tested (and
+thus considered "never to fail"), in zeroing mode, masked-out elements
+may be viewed as *always* (unconditionally) failing. This effectively
+turns VL into something akin to a software-controlled loop.
+
+Note that just as with traps, if SUBVL!=1, the first failing test in the
+*sub-group* will cause the processing to end, and, even if there were
+elements within the *sub-group* that passed the test, that sub-group is
+still (entirely) excluded from the count (from setting VL). i.e. VL is
+set to the total number of *sub-groups* that had no fail-condition up
+until execution was stopped. However, again: SUBVL must not be modified:
+traps must analyse (x)eSTATE (subvl offset indices) to determine the
+element that caused the trap.
Note again that, just as with traps, predicated-out (masked-out) elements
-are included in the count leading up to the fail-condition, even though they
-were not tested.
+are included in the (sequential) count leading up to the fail-condition,
+even though they were not tested.
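
The interaction between conditional fail-on-first and the zeroing /
non-zeroing predication modes described above can be sketched as
follows. Again this is an illustrative model with invented names, not
normative pseudocode; "zero result" stands in for the fail condition:

```python
def ffirst_conditional_vl(results, predicate, vl, zeroing):
    """Model of fail-on-first conditional-test VL truncation (sketch).

    A result of zero is the "fail" condition.  In non-zeroing mode,
    masked-out elements are not tested (treated as "never fail"); in
    zeroing mode they are treated as unconditionally failing.
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            if zeroing:
                return i          # masked-out element "fails" immediately
            continue              # non-zeroing: skipped, counts as a pass
        if results[i] == 0:
            return i              # fail: VL = elements processed before it
    return vl
```

The zeroing case shows why the text describes this as "something akin to
a software-controlled loop": clearing a predicate bit forcibly bounds VL.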
# Instructions
@@ -105,23 +129,10 @@ attention must be paid.
Example pseudo-code for an integer ADD operation (including scalar
operations). Floating-point uses the FP Register Table.
- function op_add(rd, rs1, rs2) # add not VADD!
-  int i, id=0, irs1=0, irs2=0;
-  predval = get_pred_val(FALSE, rd);
-  rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
-  rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
-  rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
-  for (i = 0; i < VL; i++)
- xSTATE.srcoffs = i # save context
- if (predval & 1<<i)
Branch operations use standard RV opcodes that are reinterpreted to
@@ -226,7 +281,8 @@ to zero if **zeroing** is enabled.
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BTGU may be synthesised by inverting
-src1 and src2.
+src1 and src2, however note that in doing so, the predicate table
+setup must also be correspondingly adjusted.
In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":
@@ -254,8 +310,14 @@ complex), this becomes:
ps = get_pred_val(I/F==INT, rs1);
rd = get_pred_val(I/F==INT, rs2); # this may not exist
+ ffirst_mode, zeroing = get_pred_flags(rs1)
+ if exists(rd):
+ pred_inversion, pred_zeroing = get_pred_flags(rs2)
+ else:
+ pred_inversion, pred_zeroing = False, False
+
if not exists(rd) or zeroing:
- result = 0
+ result = (1<
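
The shape of the predicated-compare pseudocode above can be modelled as
follows. This is a hedged sketch: the function name, argument layout and
the handling of the result mask are assumptions for illustration (the
original line initialising `result` is truncated in the source), not the
specification's definition:

```python
def predicated_cmp(src1, src2, pred, vl, inversion, cmp):
    """Sketch of a predicated compare writing a per-element bitmask.

    cmp(a, b) is the element test (e.g. equality); `inversion` flips
    each active element's outcome, which is how BLE/BGT-style variants
    can be synthesised from their complements.  Masked-out bits are
    left at zero here (modelling the zeroing case).
    """
    result = 0
    for i in range(vl):
        if (pred >> i) & 1:
            bit = cmp(src1[i], src2[i])
            if inversion:
                bit = not bit
            if bit:
                result |= 1 << i
        # masked-out: result bit stays zero (zeroing behaviour)
    return result
```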
This section contains examples of vectorised LOAD operations, showing
how the two stage process works (three if zero/sign-extension is included).
@@ -1039,13 +1117,12 @@ This is:
* from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
* RV64, where XLEN=64 is assumed.
-First, the memory table, which, due to the
-element width being 16 and the operation being LD (64), the 64-bits
-loaded from memory are subdivided into groups of **four** elements.
-And, with VL being 7 (deliberately to illustrate that this is reasonable
-and possible), the first four are sourced from the offset addresses pointed
-to by x5, and the next three from the ofset addresses pointed to by
-the next contiguous register, x6:
+First, the memory table. Due to the element width being 16 and the
+operation being LD (64), the 64 bits loaded from memory are subdivided
+into groups of **four** elements. And, with VL being 7 (deliberately
+to illustrate that this is reasonable and possible), the first four are
+sourced from the offset addresses pointed to by x5, and the next three
+from the offset addresses pointed to by the next contiguous register, x6:
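
The element-to-register mapping just described (four 16-bit elements
packed per 64-bit register, VL=7 spilling from x5 into x6) can be
computed mechanically. A small sketch, with invented helper names,
assuming x5 is the base of the contiguous register group:

```python
XLEN = 64
ELWIDTH = 16
ELEMS_PER_REG = XLEN // ELWIDTH   # four 16-bit elements per register

def element_location(i, base_reg=5):
    """Return (register number, 16-bit slot) for vector element i,
    packing ELEMS_PER_REG elements per register from base_reg up."""
    return base_reg + i // ELEMS_PER_REG, i % ELEMS_PER_REG

# VL=7: elements 0..3 land in x5, elements 4..6 in x6
```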
[[!table data="""
addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
@@ -1294,9 +1371,9 @@ rs1 equals the bitwidth of rs2, no sign-extending will occur. It is
only where the bitwidth of either rs1 or rs2 are different, will the
lesser-width operand be sign-extended.
-Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
-where for add they are both zero-extended. This holds true for all arithmetic
-operations ending with "W".
+Effectively however, both rs1 and rs2 are being sign-extended (or
+truncated), where for add they are both zero-extended. This holds true
+for all arithmetic operations ending with "W".
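
The "W" behaviour, at the scalar level, is the familiar RV64 rule:
perform the arithmetic in 32 bits, then sign-extend the 32-bit result to
XLEN. A minimal model of ADDW (names invented for the sketch):

```python
def signext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    mask = (1 << bits) - 1
    value &= mask
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def addw(rs1, rs2, xlen=64):
    """Model of RV64 ADDW: 32-bit add, result sign-extended to XLEN."""
    result = (rs1 + rs2) & 0xFFFFFFFF
    return signext(result, 32) & ((1 << xlen) - 1)
```

The overflow case (0x7FFFFFFF + 1) shows the sign-extension filling the
upper 32 bits with ones, which is the behaviour the element-width rules
above generalise.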
### addiw
@@ -1383,7 +1460,7 @@ circumstances it is perfectly fine to simply have the lanes
"inactive" for predicated elements, even though it results in
less than 100% ALU utilisation.
-## Twin-predication (based on source and destination register)
+## Twin-predication (based on source and destination register)
Twin-predication is not that much different, except that that
the source is independently zero-predicated from the destination.
@@ -1511,92 +1588,121 @@ of total length 128 bit given that XLEN is now 128.
TODO evaluate strncpy and strlen
-## strncpy
-
-RVV version: >
-
- strncpy:
- mv a3, a0 # Copy dst
- loop:
- setvli x0, a2, vint8 # Vectors of bytes.
- vlbff.v v1, (a1) # Get src bytes
- vseq.vi v0, v1, 0 # Flag zero bytes
- vmfirst a4, v0 # Zero found?
- vmsif.v v0, v0 # Set mask up to and including zero byte. Ppplio
- vsb.v v1, (a3), v0.t # Write out bytes
- bgez a4, exit # Done
- csrr t1, vl # Get number of bytes fetched
- add a1, a1, t1 # Bump src pointer
- sub a2, a2, t1 # Decrement count.
- add a3, a3, t1 # Bump dst pointer
- bnez a2, loop # Anymore?
-
- exit:
- ret
+## strncpy
+
+RVV version:
+
+ strncpy:
+ c.mv a3, a0 # Copy dst
+ loop:
+ setvli x0, a2, vint8 # Vectors of bytes.
+ vlbff.v v1, (a1) # Get src bytes
+ vseq.vi v0, v1, 0 # Flag zero bytes
+ vmfirst a4, v0 # Zero found?
+ vmsif.v v0, v0 # Set mask up to and including zero byte.
+ vsb.v v1, (a3), v0.t # Write out bytes
+ c.bgez a4, exit # Done
+ csrr t1, vl # Get number of bytes fetched
+ c.add a1, a1, t1 # Bump src pointer
+ c.sub a2, a2, t1 # Decrement count.
+ c.add a3, a3, t1 # Bump dst pointer
+ c.bnez a2, loop # Anymore?
+
+ exit:
+ c.ret
SV version (WIP):
strncpy:
- mv a3, a0
- SETMVLI 8 # set max vector to 8
- RegCSR[a3] = 8bit, a3, scalar
- RegCSR[a1] = 8bit, a1, scalar
- RegCSR[t0] = 8bit, t0, vector
- PredTb[t0] = ffirst, x0, inv
+ c.mv a3, a0
+ VBLK.RegCSR[t0] = 8bit, t0, vector
+ VBLK.PredTb[t0] = ffirst, x0, inv
loop:
- SETVLI a2, t4 # t4 and VL now 1..8
- ldb t0, (a1) # t0 fail first mode
- bne t0, x0, allnonzero # still ff
- # VL points to last nonzero
- GETVL t4 # from bne tests
- addi t4, t4, 1 # include zero
- SETVL t4 # set exactly to t4
- stb t0, (a3) # store incl zero
- ret # end subroutine
+ VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8)
+ c.ldb t0, (a1) # t0 fail first mode
+ c.bne t0, x0, allnonzero # still ff
+ # VL (t4) points to last nonzero
+ c.addi t4, t4, 1 # include zero
+ c.stb t0, (a3) # store incl zero
+ c.ret # end subroutine
allnonzero:
- stb t0, (a3) # VL legal range
- GETVL t4 # from bne tests
- add a1, a1, t4 # Bump src pointer
- sub a2, a2, t4 # Decrement count.
- add a3, a3, t4 # Bump dst pointer
- bnez a2, loop # Anymore?
+ c.stb t0, (a3) # VL legal range
+ c.add a1, a1, t4 # Bump src pointer
+ c.sub a2, a2, t4 # Decrement count.
+ c.add a3, a3, t4 # Bump dst pointer
+ c.bnez a2, loop # Anymore?
exit:
- ret
+ c.ret
Notes:
-* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0.
-* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead.
-* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC.
-* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection
-* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
-* with the exception of the GETVL, all Vector Context may be done in VBLOCK form.
-* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0
+* Setting MVL to 8 is just an example. If enough registers are spare it
+ may be set to XLEN which will require a bank of 8 scalar registers for
+ a1, a3 and t0.
+* obviously if that is done, t0 is not separated by 8 full registers, and
+ would overwrite t1 thru t7. x80 would work well, as an example, instead.
+* with the exception of the GETVL (a pseudo code alias for csrr), every
+ single instruction above may use RVC.
+* RVC C.BNEZ can be used because rs1' may be extended to the full 128
+ registers through redirection
+* RVC C.LW and C.SW may be used because the W format may be overridden by
+ the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
+* with the exception of the GETVL, all Vector Context may be done in
+ VBLOCK form.
+* setting predication to x0 (zero) and invert on t0 is a trick to enable
+ just ffirst on t0
* ldb and bne are both using t0, both in ffirst mode
-* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0
-* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0
-* however as t0 is in ffirst mode, the first fail wil ALSO stop the compares, and reduce VL as well
+* t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
+ vectorised, no (un)sign-extension or truncation" mode.
+* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
+ into t0 (could contain zeros).
+* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
+ scalar x0
+* however as t0 is in ffirst mode, the first fail will ALSO stop the
+ compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
-* if it did not, we can safely increment VL by 1 (using a4) to include the zero.
+* if it did not, we can safely increment VL by 1 (using t4) to include
+  the zero.
* SETVL sets *exactly* the requested amount into VL.
-* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not.
+* the SETVL just after allnonzero label is needed in case the ldb ffirst
+ activates but the bne allzeros does not.
* this would cause the stb to copy up to the end of the legal memory
-* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location.
+* of course, on the next loop the ldb would throw a trap, as a1 now
+ points to the first illegal mem location.
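
The overall behaviour the notes above describe (load up to MVL bytes in
fail-first mode, stop at and include the first zero byte, otherwise bump
the pointers and loop) can be captured as a behavioural sketch. This is
an illustrative model with an invented function name, not the SV code
itself; the early-exit guard models the ldb fault shortening VL:

```python
def strncpy_model(src, n, mvl=8):
    """Behavioural sketch of the SV strncpy loop: process up to MVL
    bytes per iteration, copying through the first zero byte."""
    dst = bytearray()
    i = 0
    while n > 0:
        vl = min(n, mvl, len(src) - i)   # ldb ffirst may shorten VL
        if vl == 0:
            break                        # ran off the end of legal memory
        chunk = src[i:i + vl]
        if 0 in chunk:                   # bne ffirst: stop at zero byte
            z = chunk.index(0)
            dst += chunk[:z + 1]         # store including the zero
            return bytes(dst)
        dst += chunk                     # allnonzero path: store VL bytes
        i += vl
        n -= vl
    return bytes(dst)
```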
## strcpy
RVV version:
- mv a3, a0 # Save start
- loop:
+ mv a3, a0 # Save start
+ loop:
setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len
vldbff.v v1, (a3) # Get bytes
csrr a1, vl # Get bytes actually read e.g. if fault
- vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
+ vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
add a3, a3, a1 # Bump pointer
vmfirst a2, v0 # Find first set bit in mask, returns -1 if none
bltz a2, loop # Not found?
add a0, a0, a1 # Sum start + bump
add a3, a3, a2 # Add index of zero byte
sub a0, a3, a0 # Subtract start address+bump
- ret
+ ret
+
+## DAXPY
+
+[[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
+
+Notes:
+
+* Setting MVL to 4 is just an example. With enough space between the
+ FP regs, MVL may be set to larger values
+* VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
+ taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
+ overhead for use of VBLOCK: 48 bits (3 16-bit words).
+* All instructions except fmadd may use Compressed variants. Total
+ number of 16-bit instruction words: 11.
+* Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
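
For reference, the scalar operation that both the SV and RVV DAXPY
versions vectorise is simply:

```python
def daxpy(a, x, y):
    """Reference DAXPY: y[i] = a * x[i] + y[i], elementwise over the
    two vectors (the loop the fmadd-based vector code implements)."""
    return [a * xi + yi for xi, yi in zip(x, y)]
```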
+
+## BigInt add
+
+[[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]