X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension%2Fappendix.mdwn;h=c29044cfea6b9772be22c43d9b8dc3d968f819ee;hb=1f429eeba125e65ba4649045196d043a4acac31d;hp=6e883137c349c42a1b9c891cda4d8e13faaf89d4;hpb=9aba9521933abc4cd2bc394790f64aa461c0c83b;p=libreriscv.git

diff --git a/simple_v_extension/appendix.mdwn b/simple_v_extension/appendix.mdwn
index 6e883137c..c29044cfe 100644
--- a/simple_v_extension/appendix.mdwn
+++ b/simple_v_extension/appendix.mdwn
@@ -1,13 +1,17 @@
-# Simple-V (Parallelism Extension Proposal) Appendix
+[[!oldstandards]]
+
+# Simple-V (Parallelism Extension Proposal) Appendix (OBSOLETE)
+
+**OBSOLETE**
 
 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
 * Status: DRAFTv0.6
-* Last edited: 25 jun 2019
+* Last edited: 30 jun 2019
 * main spec [[specification]]
 
 [[!toc ]]
 
-# Fail-on-first modes
+# Fail-on-first modes 
 
 Fail-on-first data dependency has different behaviour for traps than
 for conditional testing. "Conditional" is taken to mean "anything
@@ -15,47 +19,67 @@ that is zero", however with traps, the first element has to be given
 the opportunity to throw the exact same trap that would be thrown
 if this were a scalar operation (when VL=1).
 
+Note that implementors are required to choose one mode or the other,
+mutually exclusively: an instruction is **not** permitted to fail on a
+trap *and* fail a conditional test at the same time. This advice applies
+to custom opcode writers as well as to future extension writers.
+
 ## Fail-on-first traps
 
 Except for the first element, ffirst stops sequential element processing
 when a trap occurs. The first element is treated normally (as if ffirst
 is clear). Should any subsequent element instruction require a trap,
 instead it and subsequent indexed elements are ignored (or cancelled in
-out-of-order designs), and VL is set to the *last* instruction that did
-not take the trap.
+out-of-order designs), and VL is set to the *last* in-sequence instruction
+that did not take the trap.
 
-Note that predicated-out elements (where the predicate mask bit is zero)
-are clearly excluded (i.e. the trap will not occur). However, note that
-the loop still had to test the predicate bit: thus on return,
+Note that predicated-out elements (where the predicate mask bit is
+zero) are clearly excluded (i.e. the trap will not occur). However,
+note that the loop still had to test the predicate bit: thus on return,
 VL is set to include elements that did not take the trap *and* includes
 the elements that were predicated (masked) out (not tested up to the
 point where the trap occurred).
 
+Unlike conditional tests, "fail-on-first trap" instruction behaviour is
+unaltered by setting zero or non-zero predication mode.
+
 If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
-will cause a trap as normal (as if ffirst is not set); subsequently,
-the trap must not occur in the *sub-group* of elements. SUBVL will **NOT**
-be modified.
+will cause a trap as normal (as if ffirst is not set); subsequently, the
+trap must not occur in the *sub-group* of elements. SUBVL will **NOT**
+be modified. Traps must analyse (x)eSTATE (subvl offset indices) to
+determine the element that caused the trap.
 
 Given that predication bits apply to SUBVL groups, the same rules apply
-to predicated-out (masked-out) sub-groups in calculating the value that VL
-is set to.
+to predicated-out (masked-out) sub-groups in calculating the value that
+VL is set to.
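The trap rules can be summarised with a short sketch of the element loop
(illustrative only, not normative pseudocode from the spec), assuming
SUBVL=1 and treating `element_would_trap(i)` and `do_element(i)` as
stand-ins for the per-element operation:

    class Trap(Exception):
        pass

    def ffirst_trap_loop(VL, predval, element_would_trap, do_element):
        # returns the new VL after fail-on-first trap handling
        for i in range(VL):
            if not (predval & (1 << i)):
                continue          # masked-out: cannot trap, but still counted
            if element_would_trap(i):
                if i == 0:
                    raise Trap()  # first element traps exactly as a scalar op
                return i          # truncate VL: elements 0..i-1 remain, which
                                  # includes any masked-out elements skipped above
            do_element(i)
        return VL                 # no trap: VL is unchanged

Note how masked-out elements ahead of the trapping element stay inside
the returned VL, matching the rules above.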
 ## Fail-on-first conditional tests
 
-ffirst stops sequential element conditional testing on the first element result
-being zero. VL is set to the number of elements that were processed before
-the fail-condition was encountered.
-
-Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
-will cause the processing to end, and, even if there were elements within
-the *sub-group* that passed the test, that sub-group is still (entirely)
-excluded from the count (from setting VL). i.e. VL is set to the total
-number of *sub-groups* that had no fail-condition up until execution was
-stopped.
+ffirst stops sequential (or sequentially-appearing in the case of
+out-of-order designs) element conditional testing on the first element
+result being zero (or other "fail" condition). VL is set to the number
+of elements that were (sequentially) processed before the fail-condition
+was encountered.
+
+Unlike trap fail-on-first, fail-on-first conditional testing behaviour
+responds to changes in the zero or non-zero predication mode. Whilst
+in non-zeroing mode, masked-out elements are simply not tested (and
+thus considered "never to fail"), in zeroing mode, masked-out elements
+may be viewed as *always* (unconditionally) failing. This effectively
+turns VL into something akin to a software-controlled loop.
+
+Note that just as with traps, if SUBVL!=1, the first trap in the
+*sub-group* will cause the processing to end, and, even if there were
+elements within the *sub-group* that passed the test, that sub-group is
+still (entirely) excluded from the count (from setting VL). i.e. VL is
+set to the total number of *sub-groups* that had no fail-condition up
+until execution was stopped. However, again: SUBVL must not be modified:
+traps must analyse (x)eSTATE (subvl offset indices) to determine the
+element that caused the trap.
 
 Note again that, just as with traps, predicated-out (masked-out) elements
-are included in the count leading up to the fail-condition, even though they
-were not tested.
+are included in the (sequential) count leading up to the fail-condition,
+even though they were not tested.
 
 # Instructions
 
@@ -105,23 +129,10 @@ attention must be paid.
 Example pseudo-code for an integer ADD operation (including scalar
 operations). Floating-point uses the FP Register Table.
 
-    function op_add(rd, rs1, rs2) # add not VADD!
-      int i, id=0, irs1=0, irs2=0;
-      predval = get_pred_val(FALSE, rd);
-      rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
-      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
-      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
-      for (i = 0; i < VL; i++)
-        xSTATE.srcoffs = i # save context
-        if (predval & 1<
 Branch operations use standard RV opcodes that are reinterpreted to
@@ -226,7 +281,8 @@ to zero if **zeroing** is enabled.
 
 Note that just as with the standard (scalar, non-predicated) branch
 operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
-src1 and src2.
+src1 and src2, however note that in doing so, the predicate table
+setup must also be correspondingly adjusted.
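Returning to the fail-on-first conditional rules above, the zeroing
versus non-zeroing behaviour can be sketched in the same style (again
illustrative only, not normative pseudocode from the spec; SUBVL=1 is
assumed, and `test_element(i)` stands in for the per-element result):

    def ffirst_cond_loop(VL, predval, zeroing, test_element):
        # returns the new VL after fail-on-first conditional testing
        for i in range(VL):
            masked_out = not (predval & (1 << i))
            if masked_out and not zeroing:
                continue              # non-zeroing: not tested, "never fails"
            # zeroing: a masked-out element behaves as an unconditional fail
            result = 0 if masked_out else test_element(i)
            if result == 0:           # the "fail" condition
                return i              # VL = number of elements before the fail
        return VL                     # no fail encountered: VL is unchanged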
 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
 for predicated compare operations of function "cmp":
@@ -254,8 +310,14 @@ complex), this becomes:
 
     ps = get_pred_val(I/F==INT, rs1);
     rd = get_pred_val(I/F==INT, rs2); # this may not exist
+    ffirst_mode, zeroing = get_pred_flags(rs1)
+    if exists(rd):
+       pred_inversion, pred_zeroing = get_pred_flags(rs2)
+    else
+       pred_inversion, pred_zeroing = False, False
+
     if not exists(rd) or zeroing:
-        result = 0
+        result = (1<
 This section contains examples of vectorised LOAD operations, showing
 how the two stage process works (three if zero/sign-extension is included).
@@ -1039,13 +1117,12 @@ This is:
 
 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
 * RV64, where XLEN=64 is assumed.
 
-First, the memory table, which, due to the
-element width being 16 and the operation being LD (64), the 64-bits
-loaded from memory are subdivided into groups of **four** elements.
-And, with VL being 7 (deliberately to illustrate that this is reasonable
-and possible), the first four are sourced from the offset addresses pointed
-to by x5, and the next three from the ofset addresses pointed to by
-the next contiguous register, x6:
+First, the memory table. Due to the element width being 16 and the
+operation being LD (64), the 64 bits loaded from memory are subdivided
+into groups of **four** elements. And, with VL being 7 (deliberately
+to illustrate that this is reasonable and possible), the first four are
+sourced from the offset addresses pointed to by x5, and the next three
+from the offset addresses pointed to by the next contiguous register, x6:
 
 [[!table data="""
 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
@@ -1294,9 +1371,9 @@ rs1 equals the bitwidth of rs2, no sign-extending will occur.  It is only
 where the bitwidth of either rs1 or rs2 is different, will the lesser-width
 operand be sign-extended.
 
-Effectively however, both rs1 and rs2 are being sign-extended (or truncated),
-where for add they are both zero-extended.  This holds true for all arithmetic
-operations ending with "W".
+Effectively however, both rs1 and rs2 are being sign-extended (or
+truncated), whereas for add they are both zero-extended. This holds true
+for all arithmetic operations ending with "W".
 
 ### addiw
 
@@ -1383,7 +1460,7 @@ circumstances it is perfectly fine to simply have the lanes "inactive"
 for predicated elements, even though it results in less than 100% ALU
 utilisation.
 
-## Twin-predication (based on source and destination register)
+## Twin-predication (based on source and destination register) 
 
 Twin-predication is not that much different, except that the source
 is independently zero-predicated from the destination.
@@ -1511,92 +1588,121 @@ of total length 128 bit given that XLEN is now 128.
 
 TODO evaluate strncpy and strlen
 
-## strncpy
-
-RVV version:
-
- strncpy:
- mv a3, a0 # Copy dst
- loop:
- setvli x0, a2, vint8 # Vectors of bytes.
- vlbff.v v1, (a1) # Get src bytes
- vseq.vi v0, v1, 0 # Flag zero bytes
- vmfirst a4, v0 # Zero found?
- vmsif.v v0, v0 # Set mask up to and including zero byte. Ppplio
- vsb.v v1, (a3), v0.t # Write out bytes
- bgez a4, exit # Done
- csrr t1, vl # Get number of bytes fetched
- add a1, a1, t1 # Bump src pointer
- sub a2, a2, t1 # Decrement count.
- add a3, a3, t1 # Bump dst pointer
- bnez a2, loop # Anymore?
-
- exit:
- ret
+## strncpy 
+
+RVV version: 
+
+ strncpy: 
+ c.mv a3, a0 # Copy dst 
+ loop: 
+ setvli x0, a2, vint8 # Vectors of bytes.
+ vlbff.v v1, (a1) # Get src bytes + vseq.vi v0, v1, 0 # Flag zero bytes + vmfirst a4, v0 # Zero found? + vmsif.v v0, v0 # Set mask up to and including zero byte. + vsb.v v1, (a3), v0.t # Write out bytes + c.bgez a4, exit # Done + csrr t1, vl # Get number of bytes fetched + c.add a1, a1, t1 # Bump src pointer + c.sub a2, a2, t1 # Decrement count. + c.add a3, a3, t1 # Bump dst pointer + c.bnez a2, loop # Anymore? + + exit: + c.ret SV version (WIP): strncpy: - mv a3, a0 - SETMVLI 8 # set max vector to 8 - RegCSR[a3] = 8bit, a3, scalar - RegCSR[a1] = 8bit, a1, scalar - RegCSR[t0] = 8bit, t0, vector - PredTb[t0] = ffirst, x0, inv + c.mv a3, a0 + VBLK.RegCSR[t0] = 8bit, t0, vector + VBLK.PredTb[t0] = ffirst, x0, inv loop: - SETVLI a2, t4 # t4 and VL now 1..8 - ldb t0, (a1) # t0 fail first mode - bne t0, x0, allnonzero # still ff - # VL points to last nonzero - GETVL t4 # from bne tests - addi t4, t4, 1 # include zero - SETVL t4 # set exactly to t4 - stb t0, (a3) # store incl zero - ret # end subroutine + VBLK.SETVLI a2, t4, 8 # t4 and VL now 1..8 (MVL=8) + c.ldb t0, (a1) # t0 fail first mode + c.bne t0, x0, allnonzero # still ff + # VL (t4) points to last nonzero + c.addi t4, t4, 1 # include zero + c.stb t0, (a3) # store incl zero + c.ret # end subroutine allnonzero: - stb t0, (a3) # VL legal range - GETVL t4 # from bne tests - add a1, a1, t4 # Bump src pointer - sub a2, a2, t4 # Decrement count. - add a3, a3, t4 # Bump dst pointer - bnez a2, loop # Anymore? + c.stb t0, (a3) # VL legal range + c.add a1, a1, t4 # Bump src pointer + c.sub a2, a2, t4 # Decrement count. + c.add a3, a3, t4 # Bump dst pointer + c.bnez a2, loop # Anymore? exit: - ret + c.ret Notes: -* Setting MVL to 8 is just an example. If enough registers are spare it may be set to XLEN which will require a bank of 8 scalar registers for a1, a3 and t0. -* obviously if that is done, t0 is not separated by 8 full registers, and would overwrite t1 thru t7. x80 would work well, as an example, instead. -* with the exception of the GETVL (a pseudo code alias for csrr), every single instruction above may use RVC. -* RVC C.BNEZ can be used because rs1' may be extended to the full 128 registers through redirection -* RVC C.LW and C.SW may be used because the W format may be overridden by the 8 bit format. All of t0, a3 and a1 are overridden to make that work. -* with the exception of the GETVL, all Vector Context may be done in VBLOCK form. -* setting predication to x0 (zero) and invert on t0 is a trick to enable just ffirst on t0 +* Setting MVL to 8 is just an example. If enough registers are spare it + may be set to XLEN which will require a bank of 8 scalar registers for + a1, a3 and t0. +* obviously if that is done, t0 is not separated by 8 full registers, and + would overwrite t1 thru t7. x80 would work well, as an example, instead. +* with the exception of the GETVL (a pseudo code alias for csrr), every + single instruction above may use RVC. +* RVC C.BNEZ can be used because rs1' may be extended to the full 128 + registers through redirection +* RVC C.LW and C.SW may be used because the W format may be overridden by + the 8 bit format. All of t0, a3 and a1 are overridden to make that work. +* with the exception of the GETVL, all Vector Context may be done in + VBLOCK form. 
+* setting predication to x0 (zero) and invert on t0 is a trick to enable + just ffirst on t0 * ldb and bne are both using t0, both in ffirst mode -* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff into t0 -* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against scalar x0 -* however as t0 is in ffirst mode, the first fail wil ALSO stop the compares, and reduce VL as well +* t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride, + vectorised, no (un)sign-extension or truncation" mode. +* ldb will end on illegal mem, reduce VL, but copied all sorts of stuff + into t0 (could contain zeros). +* bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against + scalar x0 +* however as t0 is in ffirst mode, the first fail will ALSO stop the + compares, and reduce VL as well * the branch only goes to allnonzero if all tests succeed -* if it did not, we can safely increment VL by 1 (using a4) to include the zero. +* if it did not, we can safely increment VL by 1 (using a4) to include + the zero. * SETVL sets *exactly* the requested amount into VL. -* the SETVL just after allnonzero label is needed in case the ldb ffirst activates but the bne allzeros does not. +* the SETVL just after allnonzero label is needed in case the ldb ffirst + activates but the bne allzeros does not. * this would cause the stb to copy up to the end of the legal memory -* of course, on the next loop the ldb would throw a trap, as a1 now points to the first illegal mem location. +* of course, on the next loop the ldb would throw a trap, as a1 now + points to the first illegal mem location. ## strcpy RVV version: - mv a3, a0 # Save start - loop: + mv a3, a0 # Save start + loop: setvli a1, x0, vint8 # byte vec, x0 (Zero reg) => use max hardware len vldbff.v v1, (a3) # Get bytes csrr a1, vl # Get bytes actually read e.g. if fault - vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0 + vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0 add a3, a3, a1 # Bump pointer vmfirst a2, v0 # Find first set bit in mask, returns -1 if none bltz a2, loop # Not found? add a0, a0, a1 # Sum start + bump add a3, a3, a2 # Add index of zero byte sub a0, a3, a0 # Subtract start address+bump - ret + ret + +## DAXPY + +[[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]] + +Notes: + +* Setting MVL to 4 is just an example. With enough space between the + FP regs, MVL may be set to larger values +* VBLOCK header takes 16 bits, 8-bit mode may be used on the registers, + taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total + overhead for use of VBLOCK: 48 bits (3 16-bit words). +* All instructions except fmadd may use Compressed variants. Total + number of 16-bit instruction words: 11. +* Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words. + +## BigInt add + +[[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]
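For reference when reading the inlined example, the underlying scalar
operation is a plain add-with-carry across XLEN-wide limbs. A sketch
(assuming XLEN=64 and least-significant-limb-first operands; this is
not taken from the bigadd_example page itself):

    def bigint_add(a, b, xlen=64):
        # a, b: equal-length lists of xlen-wide limbs, least significant first
        mask = (1 << xlen) - 1
        out, carry = [], 0
        for x, y in zip(a, b):
            s = x + y + carry
            out.append(s & mask)   # low xlen bits form the result limb
            carry = s >> xlen      # 0 or 1, propagated into the next limb
        return out, carry

The carry chain between limbs is the serial dependency that any
vectorised version has to preserve.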