From aebf14f46e2e4867d804a3e9253316e6de2cb3f5 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Tue, 25 Jun 2019 12:46:57 +0100
Subject: [PATCH] move instructions to appendix

---
 simple_v_extension/appendix.mdwn      | 540 ++++++++++++++++++++++++++
 simple_v_extension/specification.mdwn | 538 +-------------------------
 2 files changed, 541 insertions(+), 537 deletions(-)

diff --git a/simple_v_extension/appendix.mdwn b/simple_v_extension/appendix.mdwn
index b17320105..33d02b56a 100644
--- a/simple_v_extension/appendix.mdwn
+++ b/simple_v_extension/appendix.mdwn
@@ -7,6 +7,546 @@
 
 [[!toc ]]
 
+# Instructions
+
+Despite SV being a 98% complete and accurate topological remap of RVV
+concepts and functionality, no new instructions are needed.
+Compared to RVV: *all* RVV instructions can be re-mapped, although xBitManip
+becomes a critical dependency for efficient manipulation of predication
+masks (as a bit-field). Despite none of RVV's explicit opcodes being
+added, with the exception of CLIP and VSELECT.X
+*all instructions from RVV Base are topologically re-mapped and retain their
+complete functionality, intact*. Note that if RV64G ever had
+MV.X and FCLIP added, the full functionality of RVV-Base would
+be obtained in SV.
+
+Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
+equivalents, so are left out of Simple-V. VSELECT could be included if
+there existed a MV.X instruction in RV (MV.X is a hypothetical
+non-immediate variant of MV that would allow another register to
+specify which register was to be copied). Note that if any of these three
+instructions are added to any given RV extension, their functionality
+will be inherently parallelised.
+
+With some exceptions, where it does not make sense or is simply too
+challenging, all RV-Base instructions are parallelised:
+
+* CSR instructions are the fundamental core basis of SV, so although a
+  case could be made for fast-polling of a CSR into multiple registers,
+  or for copying multiple contiguously-addressed CSRs into contiguous
+  registers, and so on, extreme care would need to be taken if they
+  were parallelised. Additionally, CSR reads are done using x0, and it
+  is *really* inadvisable to tag x0.
+* LUI, C.J, C.JR, WFI and AUIPC are not suitable for parallelising, so are
+  left as scalar.
+* LR/SC could hypothetically be parallelised; however, their purpose is
+  single (complex) atomic memory operations, where each LR must be followed
+  up by a matching SC. A sequence of parallel LR instructions followed
+  by a sequence of parallel SC instructions is therefore guaranteed
+  not to be useful. Not least: the guarantees of a Multi-LR/SC
+  would be impossible to provide if emulated in a trap.
+* EBREAK, NOP, FENCE and others do not use registers, so are not inherently
+  parallelisable anyway.
+
+All other operations using registers are automatically parallelised.
+This includes AMOMAX, AMOSWAP and so on, where particular care and
+attention must be paid.
+
+Example pseudo-code for an integer ADD operation (covering scalar
+operations as well). Floating-point uses the FP Register Table.
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+        xSTATE.srcoffs = i # save context
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+           if (!int_vec[rd ].isvector) break;
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+Adding in support for SUBVL is a matter of adding an extra inner
+for-loop, where register src and dest are still incremented inside the
+inner part. Note that the predication is still taken from the VL index.
+
+So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
+indexed by "(i)".
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      predval = get_pred_val(FALSE, rd);
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+        xSTATE.srcoffs = i # save context
+        for (s = 0; s < SUBVL; s++)
+          xSTATE.ssvoffs = s # save context
+          if (predval & 1<<i) # predication is indexed by i, not s
+             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+             if (!int_vec[rd ].isvector) break;
+          if (int_vec[rd ].isvector)  { id += 1; }
+          if (int_vec[rs1].isvector)  { irs1 += 1; }
+          if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+Branch operations use standard RV opcodes that are reinterpreted to
+be "predicate variants" in the instance where either of the two src
+registers is marked as a vector (active=1, vector=1).
+
+Note that the predication register to use (if one is enabled) is taken from
+the *first* src register, and that this is used, just as with predicated
+arithmetic operations, to mask whether the comparison operations take
+place or not. The target (destination) predication register
+to use (if one is enabled) is taken from the *second* src register.
+
+If either src1 or src2 is a scalar (whether by there being no
+CSR register entry or by the CSR entry specifically marking
+the register as "scalar"), the comparison goes ahead as vector-scalar
+or scalar-vector.
+
+In instances where no vectorisation is detected on either src register,
+the operation is treated as an absolutely standard scalar branch operation.
+Where vectorisation is present on either or both src registers, the
+branch may still go ahead if and only if *all* tests succeed (i.e. excluding
+those tests that are predicated out).
+
+Note that when zero-predication is enabled (from source rs1),
+a cleared bit in the predicate indicates that the result
+of the compare is set to "false", i.e. that the corresponding
+destination bit (or result) is set to zero. Contrast this with
+when zeroing is not set: bits in the destination predicate are
+only *set*; they are **not** cleared. This is important to appreciate,
+as there may be an expectation that, going into the hardware-loop,
+the destination predicate register is set to zero:
+this is **not** the case. The destination predicate is only set
+to zero if **zeroing** is enabled.
+
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
+src1 and src2.
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+                           s2 ? vreg[rs2][i] : sreg[rs2]);
+
+There is no MV instruction in RV; however, there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
+
+If either the source or the destination register is marked as a vector,
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation. The actual instruction's format does not change:
+
+[[!table data="""
+15 12 | 11 7 | 6 2 | 1 0 |
+funct4 | rd | rs | op |
+4 | 5 | 5 | 2 |
+C.MV | dest | src | C0 |
+"""]]
+
+A simplified version of the pseudocode for this operation is as follows:
+
+    function op_mv(rd, rs) # MV not VMV!
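+      # rd and rs are remapped through the register CSR table below; ps and
+      # pd are the independent source and destination predicate masks, and
+      # i and j step through source and destination elements separately,
+      # skipping over any masked-out element on either side.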
+      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+        ireg[rd+j] <= ireg[rs+i];
+        if (!int_csr[rs].isvec && !int_csr[rd].isvec) break; # scalar MV
+        if (int_csr[rs].isvec) i++;
+        if (int_csr[rd].isvec) j++;
+
+An earlier draft of SV modified the behaviour of LOAD/STORE (modified
+the interpretation of the instruction fields). This
+actually undermined the fundamental principle of SV, namely that there
+be no modifications to the scalar behaviour (except where absolutely
+necessary), in order to simplify the task of an implementor considering
+converting a pre-existing scalar design to support parallelism.
+
+So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
+does not change in SV; however, just as with C.MV, it is important to note
+that dual-predication is possible.
+
+In vectorised architectures there are usually at least two different modes
+for LOAD/STORE:
+
+* Read (or write for STORE) from sequential locations, where one
+  register specifies the address, and that address is incremented
+  by a fixed amount. This is usually known as "Unit Stride" mode.
+* Read (or write) from multiple indirected addresses, where the
+  vector elements each specify separate and distinct addresses.
+
+To support these different addressing modes, the CSR Register "isvector"
+bit is used. So, for a LOAD, when the src register is set to
+scalar, the load addresses are sequentially incremented by the src register
+element width, and when the src register is set to "vector", the
+elements are treated as indirection addresses. Simplified
+pseudo-code would look like this:
+
+    function op_ld(rd, rs) # LD not VLD!
+      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
+      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+        if (int_csr[rs].isvec)
+           # "vector" src: each element is a separate indirection address
+           srcbase = ireg[rsv+i];
+        else
+           # scalar src: unit stride, sequential from the one base address
+           srcbase = ireg[rsv] + j * elwidth; # offset in bytes
+        ireg[rdv+j] <= mem[srcbase + imm_offs];
+        if (!int_csr[rs].isvec && !int_csr[rd].isvec) break; # scalar LD
+        if (int_csr[rs].isvec) i++;
+        if (int_csr[rd].isvec) j++;
+
+C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
+where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
+It is therefore possible to use predicated C.LWSP to efficiently
+pop registers off the stack (by predicating x2 as the source), cherry-picking
+which registers to load into (by predicating the destination). Likewise
+for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
+
+The two modes ("unit stride" and multi-indirection) are still supported,
+as with standard LD/ST. Essentially, the only difference is that the
+use of x2 is hard-coded into the instruction.
+
+**Note**: it is still possible to redirect x2 to an alternative target
+register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
+general-purpose LOAD/STORE operations.
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
+where the same rules and the same pseudo-code apply as for
+non-compressed LOAD/STORE. Again: setting scalar or vector mode
+on the src for LOAD and dest for STORE switches mode from "Unit Stride"
+to "Multi-indirection", respectively.
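+
+To tie the twin-predicated operations above together (C.MV, LOAD/STORE and
+C.LWSP/C.SWSP all follow the same element-stepping pattern), the following
+is a minimal, non-normative Python sketch of the op_mv loop. The names used
+(twin_pred_mv, regfile, vl, ps, pd, src_is_vec, dst_is_vec) are purely
+illustrative and are not part of the specification:
+
+    def twin_pred_mv(regfile, rs, rd, vl, ps, pd, src_is_vec, dst_is_vec):
+        """Copy elements rs -> rd, independently skipping masked-out
+        elements on the source and destination sides (twin predication)."""
+        i = j = 0                      # source / destination element offsets
+        while i < vl and j < vl:
+            if src_is_vec:             # skip src elements whose bit is clear
+                while i < vl and not (ps >> i) & 1:
+                    i += 1
+            if dst_is_vec:             # skip dest elements whose bit is clear
+                while j < vl and not (pd >> j) & 1:
+                    j += 1
+            if i >= vl or j >= vl:
+                break
+            regfile[rd + j] = regfile[rs + i]
+            if not src_is_vec and not dst_is_vec:
+                break                  # plain scalar C.MV copies once
+            if src_is_vec:
+                i += 1
+            if dst_is_vec:
+                j += 1
+
+    # example: gather elements 0 and 2 of x10..x13 into x20..x21
+    regs = list(range(32))             # toy register file: x[n] == n
+    twin_pred_mv(regs, rs=10, rd=20, vl=4, ps=0b0101, pd=0b1111,
+                 src_is_vec=True, dst_is_vec=True)
+    assert regs[20:22] == [10, 12]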
+ # Element bitwidth polymorphism Element bitwidth is best covered as its own special section, as it diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn index 27d14b0e3..dfc406208 100644 --- a/simple_v_extension/specification.mdwn +++ b/simple_v_extension/specification.mdwn @@ -974,543 +974,7 @@ to the **one** instruction. # Instructions -Despite being a 98% complete and accurate topological remap of RVV -concepts and functionality, no new instructions are needed. -Compared to RVV: *All* RVV instructions can be re-mapped, however xBitManip -becomes a critical dependency for efficient manipulation of predication -masks (as a bit-field). Despite the removal of all operations, -with the exception of CLIP and VSELECT.X -*all instructions from RVV Base are topologically re-mapped and retain their -complete functionality, intact*. Note that if RV64G ever had -a MV.X added as well as FCLIP, the full functionality of RVV-Base would -be obtained in SV. - -Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard -equivalents, so are left out of Simple-V. VSELECT could be included if -there existed a MV.X instruction in RV (MV.X is a hypothetical -non-immediate variant of MV that would allow another register to -specify which register was to be copied). Note that if any of these three -instructions are added to any given RV extension, their functionality -will be inherently parallelised. - -With some exceptions, where it does not make sense or is simply too -challenging, all RV-Base instructions are parallelised: - -* CSR instructions, whilst a case could be made for fast-polling of - a CSR into multiple registers, or for being able to copy multiple - contiguously addressed CSRs into contiguous registers, and so on, - are the fundamental core basis of SV. If parallelised, extreme - care would need to be taken. Additionally, CSR reads are done - using x0, and it is *really* inadviseable to tag x0. -* LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are - left as scalar. -* LR/SC could hypothetically be parallelised however their purpose is - single (complex) atomic memory operations where the LR must be followed - up by a matching SC. A sequence of parallel LR instructions followed - by a sequence of parallel SC instructions therefore is guaranteed to - not be useful. Not least: the guarantees of a Multi-LR/SC - would be impossible to provide if emulated in a trap. -* EBREAK, NOP, FENCE and others do not use registers so are not inherently - paralleliseable anyway. - -All other operations using registers are automatically parallelised. -This includes AMOMAX, AMOSWAP and so on, where particular care and -attention must be paid. - -Example pseudo-code for an integer ADD operation (including scalar -operations). Floating-point uses the FP Register Table. - - function op_add(rd, rs1, rs2) # add not VADD! -  int i, id=0, irs1=0, irs2=0; -  predval = get_pred_val(FALSE, rd); -  rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd; -  rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1; -  rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2; -  for (i = 0; i < VL; i++) - xSTATE.srcoffs = i # save context - if (predval & 1< - -Adding in support for SUBVL is a matter of adding in an extra inner -for-loop, where register src and dest are still incremented inside the -inner part. Not that the predication is still taken from the VL index. 
- -So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are -indexed by "(i)" - - function op_add(rd, rs1, rs2) # add not VADD! -  int i, id=0, irs1=0, irs2=0; -  predval = get_pred_val(FALSE, rd); -  rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd; -  rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1; -  rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2; -  for (i = 0; i < VL; i++) - xSTATE.srcoffs = i # save context - for (s = 0; s < SUBVL; s++) - xSTATE.ssvoffs = s # save context - if (predval & 1< - -Branch operations use standard RV opcodes that are reinterpreted to -be "predicate variants" in the instance where either of the two src -registers are marked as vectors (active=1, vector=1). - -Note that the predication register to use (if one is enabled) is taken from -the *first* src register, and that this is used, just as with predicated -arithmetic operations, to mask whether the comparison operations take -place or not. The target (destination) predication register -to use (if one is enabled) is taken from the *second* src register. - -If either of src1 or src2 are scalars (whether by there being no -CSR register entry or whether by the CSR entry specifically marking -the register as "scalar") the comparison goes ahead as vector-scalar -or scalar-vector. - -In instances where no vectorisation is detected on either src registers -the operation is treated as an absolutely standard scalar branch operation. -Where vectorisation is present on either or both src registers, the -branch may stil go ahead if any only if *all* tests succeed (i.e. excluding -those tests that are predicated out). - -Note that when zero-predication is enabled (from source rs1), -a cleared bit in the predicate indicates that the result -of the compare is set to "false", i.e. that the corresponding -destination bit (or result)) be set to zero. Contrast this with -when zeroing is not set: bits in the destination predicate are -only *set*; they are **not** cleared. This is important to appreciate, -as there may be an expectation that, going into the hardware-loop, -the destination predicate is always expected to be set to zero: -this is **not** the case. The destination predicate is only set -to zero if **zeroing** is enabled. - -Note that just as with the standard (scalar, non-predicated) branch -operations, BLE, BGT, BLEU and BTGU may be synthesised by inverting -src1 and src2. - -In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given -for predicated compare operations of function "cmp": - - for (int i=0; i - -There is no MV instruction in RV however there is a C.MV instruction. -It is used for copying integer-to-integer registers (vectorised FMV -is used for copying floating-point). - -If either the source or the destination register are marked as vectors -C.MV is reinterpreted to be a vectorised (multi-register) predicated -move operation. The actual instruction's format does not change: - -[[!table data=""" -15 12 | 11 7 | 6 2 | 1 0 | -funct4 | rd | rs | op | -4 | 5 | 5 | 2 | -C.MV | dest | src | C0 | -"""]] - -A simplified version of the pseudocode for this operation is as follows: - - function op_mv(rd, rs) # MV not VMV! -  rd = int_csr[rd].active ? int_csr[rd].regidx : rd; -  rs = int_csr[rs].active ? int_csr[rs].regidx : rs; -  ps = get_pred_val(FALSE, rs); # predication on src -  pd = get_pred_val(FALSE, rd); # ... 
AND on dest -  for (int i = 0, int j = 0; i < VL && j < VL;): - if (int_csr[rs].isvec) while (!(ps & 1< - -An earlier draft of SV modified the behaviour of LOAD/STORE (modified -the interpretation of the instruction fields). This -actually undermined the fundamental principle of SV, namely that there -be no modifications to the scalar behaviour (except where absolutely -necessary), in order to simplify an implementor's task if considering -converting a pre-existing scalar design to support parallelism. - -So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality -do not change in SV, however just as with C.MV it is important to note -that dual-predication is possible. - -In vectorised architectures there are usually at least two different modes -for LOAD/STORE: - -* Read (or write for STORE) from sequential locations, where one - register specifies the address, and the one address is incremented - by a fixed amount. This is usually known as "Unit Stride" mode. -* Read (or write) from multiple indirected addresses, where the - vector elements each specify separate and distinct addresses. - -To support these different addressing modes, the CSR Register "isvector" -bit is used. So, for a LOAD, when the src register is set to -scalar, the LOADs are sequentially incremented by the src register -element width, and when the src register is set to "vector", the -elements are treated as indirection addresses. Simplified -pseudo-code would look like this: - - function op_ld(rd, rs) # LD not VLD! -  rdv = int_csr[rd].active ? int_csr[rd].regidx : rd; -  rsv = int_csr[rs].active ? int_csr[rs].regidx : rs; -  ps = get_pred_val(FALSE, rs); # predication on src -  pd = get_pred_val(FALSE, rd); # ... AND on dest -  for (int i = 0, int j = 0; i < VL && j < VL;): - if (int_csr[rs].isvec) while (!(ps & 1< - -C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated, -where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register. -It is therefore possible to use predicated C.LWSP to efficiently -pop registers off the stack (by predicating x2 as the source), cherry-picking -which registers to store to (by predicating the destination). Likewise -for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved. - -The two modes ("unit stride" and multi-indirection) are still supported, -as with standard LD/ST. Essentially, the only difference is that the -use of x2 is hard-coded into the instruction. - -**Note**: it is still possible to redirect x2 to an alternative target -register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as -general-purpose LOAD/STORE operations. - -## Compressed LOAD / STORE Instructions - -Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE, -where the same rules apply and the same pseudo-code apply as for -non-compressed LOAD/STORE. Again: setting scalar or vector mode -on the src for LOAD and dest for STORE switches mode from "Unit Stride" -to "Multi-indirection", respectively. +See [[appendix]] # Exceptions -- 2.30.2