X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsvp64%2Fappendix.mdwn;h=57eeef79173f714468fdbab567aaf2f3a81a7b3f;hb=b56081a94c7a1eab5ba51298230b89657b9b13eb;hp=2c8ffbaf9bd3332dc2885d5a725fc1ed992f3050;hpb=95b6e6255690190e3370cf728b2c1039d611db9a;p=libreriscv.git

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 2c8ffbaf9..57eeef791 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -1,8 +1,11 @@
+[[!tag standards]]
+
 # Appendix

-*
-*
-*
+* Saturation
+* Parallel Prefix
+* Reduce Modes
+* OV sv.addex discussion

This is the appendix to [[sv/svp64]], providing explanations of modes
etc., leaving the main svp64 page's primary purpose as outlining the
@@ -12,39 +15,106 @@ Table of contents:

[[!toc]]

+# Partial Implementations
+
+It is perfectly legal to implement subsets of SVP64 as long as illegal
+instruction traps are always raised on unimplemented features,
+so that soft-emulation is possible,
+even for future revisions of SVP64. With SVP64 being partly controlled
+through contextual SPRs, a little care has to be taken.
+
+**All** unimplemented SPRs, including those reserved for future use,
+must raise an illegal instruction trap if read or written. This allows
+software the opportunity to emulate the context created by the given SPR.
+
+**Embedded Scalar Scenario**
+
+In this scenario an implementation does not wish to implement the Vectorisation
+but simply wishes to take advantage of predication or other features
+of SVP64, such as instructions that might only be available if prefixed.
+Such an implementation would be entirely free to do so with the proviso
+that:
+
+* any attempt to call `setvl` shall either raise an illegal instruction
+  trap or be partially implemented to set SVSTATE correctly.
+* if SVSTATE contains any value in any bit that is not supported
+  in hardware, an illegal instruction trap shall be raised when an SVP64
+  prefixed instruction is executed.
+* if SVSTATE contains values requesting supported features at the time
+  that the prefixed instruction is executed then it is executed in
+  hardware as per specification, with no illegal exception trap raised.
+
+For example, assuming that hardware implements predication but not
+elwidth overrides:
+
+    setvli r0, 4            # sets VL equal to 4
+    sv.addi r5, r0, 1       # raises a 0x700 trap
+    setvli r0, 1            # sets VL equal to 1
+    sv.addi r5, r0, 1       # gets executed by hardware
+    sv.addi/ew=8 r5, r0, 1  # raises a 0x700 trap
+    sv.ori/sm=EQ r5, r0, 1  # executed by hardware
+
+The first `sv.addi` raises a trap because VL=4 requests Vectorisation,
+which this implementation does not provide, giving software the
+opportunity to emulate it. Once VL is set back to 1 the very same
+instruction falls within the supported subset and is executed directly
+by hardware, as is the predicated `sv.ori`.
+
 # XER, SO and other global flags

Vector systems are expected to be high performance.  This is achieved
through parallelism, which requires that elements in the vector be
-independent. XER SO and other global "accumulation" flags (CR.OV) cause
+independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
Read-Write Hazards on single-bit global resources, having a significant
detrimental effect.

-Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
-in `cmp` instructions).  XER is simply neither read nor written.
+Consequently in SV, XER.SO behaviour is disregarded (including
+in `cmp` instructions).  XER.SO is not read, but XER.OV may be written,
+breaking the Read-Modify-Write Hazard Chain that complicates
+microarchitectural implementations.
This includes when `scalar identity behaviour` occurs.  If precise
OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
instructions should be used without an SV Prefix.
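To make the hazard concrete, here is a deliberately simplified sketch
(plain Python, illustrative only: `add_with_overflow`, the register list
and the carry-out stand-in for overflow are all invented for this
illustration, not taken from the specification) of a Vectorised add that
accumulates into a single global SO-style bit:

    # conceptual model of the Read-Write Hazard: every element reads AND
    # writes the same global bit, so element i+1 cannot begin until
    # element i has completed.
    def add_with_overflow(a, b, bits=64):
        full = a + b
        return full & ((1 << bits) - 1), int(full >> bits != 0)

    def vector_add_accumulating_so(regs, RT, RA, RB, VL, xer_so):
        for i in range(VL):
            result, ov = add_with_overflow(regs[RA + i], regs[RB + i])
            regs[RT + i] = result
            xer_so |= ov   # the global "accumulation" dependency
        return xer_so

SVP64 Prefixed operations break that serial chain simply by never reading
XER.SO, which is what allows the element operations to proceed in parallel.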
-An interesting side-effect of this decision is that the OE flag is now
-free for other uses when SV Prefixing is used.
-
-Regarding XER.CA: this does not fit either: it was designed for a scalar
-ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
-Vector element. This provides a means to perform large parallel batches
-of Vectorised carry-capable additions. crweird instructions can be used
-to transfer the CRs in and out of an integer, where bitmanipulation
-may be performed to analyse the carry bits (including carry lookahead
-propagation) before continuing with further parallel additions.
+TODO jacob add about OV https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
+
+Of note here is that XER.SO and OV may already be disregarded in the
+Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
+SVP64 simply makes it mandatory to disregard XER.SO even for other Subsets,
+but only for SVP64 Prefixed Operations.
+
+XER.CA/CA32 on the other hand is expected and required to be implemented
+according to standard Power ISA Scalar behaviour. Interestingly, due
+to SVP64 being in effect a hardware for-loop around Scalar instructions
+executing in precise Program Order, a little thought shows that a Vectorised
+Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
+and producing, at the end, a single bit Carry out. High performance
+implementations may exploit this observation to deploy efficient
+Parallel Carry Lookahead.
+
+    # assume VL=4, this results in 4 sequential ops (below)
+    sv.adde r0.v, r4.v, r8.v
+
+    # instructions that get executed in backend hardware:
+    adde r0, r4, r8 # takes carry-in, produces carry-out
+    adde r1, r5, r9 # takes carry from previous
+    ...
+    adde r3, r7, r11 # likewise
+
+It can clearly be seen that the carry chains from one
+64-bit add to the next, the end result being that a
+256-bit "Big Integer Add" has been performed, and that
+CA contains the 257th bit. A one-instruction 512-bit Add
+may be performed by setting VL=8, and a one-instruction
+1024-bit add by setting VL=16, and so on.

# v3.0B/v3.1 relevant instructions

SV is primarily designed for use as an efficient hybrid 3D GPU / VPU / CPU ISA.

-As mentioned above, OE=1 is not applicable in SV, freeing this bit for
-alternative uses. Additionally, Vectorisation of the VSX SIMD system
-likewise makes no sense whatsoever. SV *replaces* VSX and provides,
+Vectorisation of the VSX Packed SIMD system makes no sense whatsoever,
+the sole exceptions potentially being any operations with 128-bit
+operands such as `vrlq` (Rotate Quad Word) and `xsaddqp` (Scalar
+Quad-precision Add).
+SV effectively *replaces* VSX requiring far fewer instructions, and provides,
at the very minimum, predication (which VSX was designed without).
Thus all VSX Major Opcodes - all of them - are "unused" and must raise
illegal instruction exceptions in SV Prefix Mode.
@@ -81,16 +151,18 @@ v3.1B is *not* altered by svp64 in any way.

This table is taken from v3.0B.
Table 9: Primary Opcode Map (opcode bits 0:5)

-   | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
- 000 | | | tdi | twi | EXT04 | | | mulli | 000
- 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
- 010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm | | rlwnm | 010
- 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
- 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
- 101 | lhz | lhzu | lha | lhau | sth | sthu | lmw | stmw | 101
- 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
- 111 | lq | EXT57 | EXT58 | EXT59 | EXT60 | EXT61 | EXT62 | EXT63 | 111
-   | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
+```
+    | 000    | 001   | 010   | 011   | 100   | 101    | 110   | 111
+000 |        |       | tdi   | twi   | EXT04 |        |       | mulli | 000
+001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
+010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
+011 | ori    | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
+100 | lwz    | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
+101 | lhz    | lhzu  | lha   | lhau  | sth   | sthu   | lmw   | stmw  | 101
+110 | lfs    | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
+111 | lq     | EXT57 | EXT58 | EXT59 | EXT60 | EXT61  | EXT62 | EXT63 | 111
+    | 000    | 001   | 010   | 011   | 100   | 101    | 110   | 111
+```

## Suitable for svp64-only

This is the same table containing v3.0B Primary Opcodes except those that
make no sense in a Vectorisation Context have been removed.  These removed
POs can, *in the SV Vector Context only*, be assigned to alternative
(Vectorised-only) instructions, including future extensions.
+EXT04 retains the scalar `madd*` operations but would have all PackedSIMD
+(aka VSX) operations removed.

Note, again, to emphasise: outside of svp64 these opcodes **do not** change.
When not prefixed with svp64 these opcodes **specifically** retain their
v3.0B / v3.1B OpenPOWER Standard compliant meaning.

-   | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
- 000 | | | | | | | | mulli | 000
- 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
- 010 | | | | EXT19 | rlwimi| rlwinm | | rlwnm | 010
- 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
- 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
- 101 | lhz | lhzu | lha | lhau | sth | sthu | | | 101
- 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
- 111 | | | EXT58 | EXT59 | | EXT61 | | EXT63 | 111
-   | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
+```
+    | 000    | 001   | 010   | 011   | 100   | 101    | 110   | 111
+000 |        |       |       |       | EXT04 |        |       | mulli | 000
+001 | subfic |       | cmpli | cmpi  | addic | addic. | addi  | addis | 001
+010 | bc/l/a |       |       | EXT19 | rlwimi| rlwinm |       | rlwnm | 010
+011 | ori    | oris  | xori  | xoris | andi. | andis. | EXT30 | EXT31 | 011
+100 | lwz    | lwzu  | lbz   | lbzu  | stw   | stwu   | stb   | stbu  | 100
+101 | lhz    | lhzu  | lha   | lhau  | sth   | sthu   |       |       | 101
+110 | lfs    | lfsu  | lfd   | lfdu  | stfs  | stfsu  | stfd  | stfdu | 110
+111 |        |       | EXT58 | EXT59 |       | EXT61  |       | EXT63 | 111
+    | 000    | 001   | 010   | 011   | 100   | 101    | 110   | 111
+```

It is important to note that having a v3.0B Scalar opcode carry a
different meaning from its SVP64-prefixed counterpart is highly
undesirable: the complexity
@@ -121,11 +197,11 @@
in the decoder is greatly increased.

# EXTRA Field Mapping

The purpose of the 9-bit EXTRA field mapping is to mark individual
-registers (RT, RA, BFA) as either scalat or vector, and to extend
+registers (RT, RA, BFA) as either scalar or vector, and to extend
their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
Three of the 9 bits may also be used up for a 2nd Predicate
(Twin Predication) leaving a mere 6 bits for qualifying registers.  As can
-be seen there is significant pressure on these (and all) SVP64 bits.
+be seen there is significant pressure on these (and in fact all) SVP64 bits.
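To give a feel for what that pressure means in decode terms, the following
is an illustrative sketch only (the helper name `decode_reg` and the exact
bit placement are assumptions made for this example; the authoritative
allocation is the EXTRA2/EXTRA3 tables in [[sv/svp64]]).  A 3-bit portion
of EXTRA - one scalar/vector flag plus two extension bits - widens a 5-bit
v3.0B register field into the 0..127 SVP64 range:

    # illustrative only: the bit placement here is an assumption, not the spec
    def decode_reg(field5, extra3):
        isvec = (extra3 >> 2) & 1      # scalar/vector marker
        ext   = extra3 & 0b11          # two register-number extension bits
        if isvec:
            # vector: the 5-bit field provides the upper bits of 0..127
            return (field5 << 2) | ext, True
        # scalar: the extension bits become the upper bits, so that
        # ext=0 leaves the original 0..31 numbering untouched
        return (ext << 5) | field5, False

Even in this simplified form it is clear why six such bits, shared across
up to three register fields, is a very tight budget.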
In Power ISA v3.1 prefixing there are bits which describe and classify
the prefix in a fashion that is independent of the suffix. MLSS for
example. For SVP64 however the type of prefix is declared entirely by
decoding the suffix.

@@ -296,7 +372,7 @@ one register must be assigned, by convention by the programmer to be the
are neither `UNDEFINED` nor prohibited, despite them not making much
sense at first glance.
Scalar reduce is strictly defined behaviour, and the cost in
-hardware terms of prohibition of seemingly non-sensical operations is too great.
+hardware terms of prohibition of seemingly non-sensical operations is too great. Therefore it is permitted and required to be executed successfully.
Implementors **MAY** choose to optimise such instructions in instances
where their use results in "extraneous execution", i.e. where it is clear
that the sequence of operations, comprising multiple overwrites to
@@ -305,11 +381,13 @@ a scalar destination **without** cumulative, iterative, or reductive
behaviour (no "accumulator"), may discard all but the last element
operation.  Identification of such is trivial to do for
`setb` and `cmp`: the source register type is
-a completely different register file from the destination*
+a completely different register file from the destination.
+Likewise Scalar reduction when the destination is a Vector
+is as if the Reduction Mode was not requested.*

Typical applications include simple operations such as `ADD r3, r10.v, r3`
where, clearly, r3 is being used to accumulate the addition of all
-elements is the vector starting at r10.
+elements of the vector starting at r10.

    # add RT, RA,RB but when RT==RA
    for i in range(VL):
@@ -331,12 +409,20 @@ a cumulative series of overlapping add operations into the Execution
units of the underlying hardware.

Other examples include shift-mask operations where a Vector of inserts
-into a single destination register is required, as a way to construct
+into a single destination register is required (see [[sv/bitmanip]], bmset),
+as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted has multiple
applications including Video, cryptography and JIT compilation.

+    # assume VL=4:
+    # * Vector of shift-offsets contained in RC (r12.v)
+    # * Vector of masks contained in RB (r8.v)
+    # * Vector of values to be masked-in in RA (r4.v)
+    # * Scalar destination RT (r0) to receive all mask-offset values
+    sv.bmset/mr r0, r4.v, r8.v, r12.v
+
Due to the Deterministic Scheduling,
Subtract and Divide are still permitted to be executed in this mode,
although from an algorithmic perspective it is strongly discouraged.
@@ -511,13 +597,11 @@ is performed, and if it fails it is considered to have been
*as if* the destination predicate bit was zero.  Arithmetic and
Logical Pred-result is covered in [[sv/normal]]

-## pred-result mode on CR ops
+Pred-result mode may not be applied to CR ops.

-CR operations (mtcr, crand, cror) may be Vectorised,
-predicated, and also pred-result mode applied to it.
-Vectorisation applies to 4-bit CR Fields which are treated as
-elements, not the individual bits of the 32-bit CR.
-CR ops and how to identify them is described in [[sv/cr_ops]]
+Although CR operations (mtcr, crand, cror) may be Vectorised and
+predicated, pred-result mode only applies to operations that have
+an Rc=1 mode, or for which it makes sense to add an RC1 option.

# CR Operations

@@ -669,12 +753,12 @@ AND behavior.
### Table of CR fields

-CR[i] is the notation used by the OpenPower spec to refer to CR field #i,
-so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.
+CRn is the notation used by the OpenPower spec to refer to CR field #n,
+so FP instructions with Rc=1 write to CR1 (n=1).

CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
-mfcr or mtcr, using VL=64, elwidth=8 to do so.  This is exactly as how
+mfcr or mtcr, using VL=8 to do so.  This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.

@@ -703,22 +787,31 @@ illustration of normal mode add operation: zeroing not included, elwidth
overrides not included.  if there is no predicate, it is set to all 1s

 function op_add(rd, rs1, rs2) # add not VADD!
-  int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd);
+  int i, id=0, irs1=0, irs2=0;
+  predval = get_pred_val(FALSE, rd);
   for (i = 0; i < VL; i++)
-    STATE.srcoffs = i # save context
     if (predval & 1<<i) # predication uses intregs
        ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
        if (!rd.isvec) break;
     if (rd.isvec)  { id += 1; }
     if (rs1.isvec) { irs1 += 1; }
     if (rs2.isvec) { irs2 += 1; }

+The first principle in SVP64 being violated is that SVP64 is a fully-independent
+Abstraction of hardware-looping in between issue and execute phases
+that has no relation to the operation it issues. The above pseudocode
+conditionally changes not only the type of element operation issued
+(a MV in some cases) but also the number of arguments (2 for a MV).
+At the very least, for Vertical-First Mode this will result in
+unanticipated and unexpected behaviour (maximise "surprises" for programmers)
+in the middle of loops, which will be far too hard to explain.
+
+The second principle being violated by the above algorithm is the expectation
+that temporary storage is available for a modified predicate: there is no
+such space, and predicates are read-only to reduce complexity at the
+micro-architectural level.
+SVP64 is founded on the principle that all operations are
+"re-entrant" with respect to interrupts and exceptions: SVSTATE must
+be saved and restored alongside PC and MSR, but nothing more. It is perfectly
+fine to have context-switching back to the operation be somewhat slower,
+through "reconstruction" of temporary internal state based on what SVSTATE
+contains, but nothing more.
+
+An alternative algorithm is therefore required that does not perform MVs,
+and does not require additional state to be saved on context-switching.

```
-def reduce( vl, vec, pred, pred,):
+def reduce( vl, vec, pred ):
+    pred = copy(pred) # must not damage predicate
     j = 0
     vi = [] # array of lookup indices to skip nonpredicated
     for i, pbit in enumerate(pred):
@@ -831,11 +951,192 @@ def reduce( vl, vec, pred, pred,):
     halfstep = step // 2
     for i in (0..vl).step_by(step)
       other = vi[i + halfstep]
-      i = vi[i]
+      ir = vi[i]
       other_pred = other < vl && pred[other]
       if pred[i] && other_pred
-        vec[i] += vec[other]
-        pred[i] |= other_pred
+        vec[ir] += vec[other]
+      else if other_pred:
+        vi[ir] = vi[other] # index redirection, no MV
+      pred[ir] |= other_pred # reconstructed on context-switch
     step *= 2
-  ```
+```
+
+In this version the need for an explicit MV is made unnecessary by instead
+leaving elements *in situ*. The internal modifications to the predicate may,
+due to the reduction being entirely deterministic, be "reconstructed"
+on a context-switch. This may make some implementations slower.
+
+*Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
+implemented in hardware with MVs that ensure lane-crossing is minimised.
+The mistake which would be catastrophic for SVP64 to make is to
+limit the Reduction Sequence for all implementors,
+based solely and exclusively on what one
+specific internal microarchitecture does.
+In SIMD ISAs the internal SIMD Architectural design is exposed and
+imposed on the programmer. Cray-style Vector ISAs on the other hand
+provide convenient,
+compact and efficient encodings of abstract concepts.
+It is the Implementor's responsibility to produce a design
+that complies with the above algorithm,
+utilising internal Micro-coding and other techniques to transparently
+insert MV operations
+if necessary or desired, to give the level of efficiency or performance
+required.*
+
+# Element-width overrides
+
+Element-width overrides are best illustrated with a packed structure
+union in the C programming language. The following should be taken
+literally, and always assume a little-endian layout:
+
+    typedef union {
+        uint8_t  b[];
+        uint16_t s[];
+        uint32_t i[];
+        uint64_t l[];
+        uint8_t  actual_bytes[8];
+    } el_reg_t;
+
+    el_reg_t int_regfile[128];
+
+    get_polymorphed_reg(reg, bitwidth, offset):
+        el_reg_t res;
+        res.l = 0; // TODO: going to need sign-extending / zero-extending
+        if bitwidth == 8:
+            res.b = int_regfile[reg].b[offset]
+        elif bitwidth == 16:
+            res.s = int_regfile[reg].s[offset]
+        elif bitwidth == 32:
+            res.i = int_regfile[reg].i[offset]
+        elif bitwidth == 64:
+            res.l = int_regfile[reg].l[offset]
+        return res
+
+    set_polymorphed_reg(reg, bitwidth, offset, val):
+        if (!reg.isvec):
+            # not a vector: first element only, overwrites high bits
+            int_regfile[reg].l[0] = val
+        elif bitwidth == 8:
+            int_regfile[reg].b[offset] = val
+        elif bitwidth == 16:
+            int_regfile[reg].s[offset] = val
+        elif bitwidth == 32:
+            int_regfile[reg].i[offset] = val
+        elif bitwidth == 64:
+            int_regfile[reg].l[offset] = val
+
+In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
+to fp127) are reinterpreted to be "starting points" in a byte-addressable
+memory. Vectors - which become just a virtual naming construct - effectively
+overlap.
+
+It is extremely important for implementors to note that the only circumstance
+where upper portions of an underlying 64-bit register are zero'd out is
+when the destination is a scalar. The ideal register file has byte-level
+write-enable lines, just like most SRAMs, in order to avoid READ-MODIFY-WRITE.
+
+An example ADD operation with predication and element width overrides:
+
+    for (i = 0; i < VL; i++)
+      if (predval & 1<<i) # predication
+         src1 = get_polymorphed_reg(RA, srcwid, irs1)
+         src2 = get_polymorphed_reg(RB, srcwid, irs2)
+         result = src1 + src2 # actual add here
+         set_polymorphed_reg(RT, destwid, id, result)
+         if (!RT.isvec) break
+      if (RT.isvec)  { id += 1; }
+      if (RA.isvec)  { irs1 += 1; }
+      if (RB.isvec)  { irs2 += 1; }
+
+The same loop form covers the DRAFT 3-in 2-out operations listed at the
+end of this section, which produce an implicit second (upper-half)
+destination: the upper half of each double-width result is stored at
+element index offset MAXVL:
+
+    for (i = 0; i < VL; i++)
+      if (predval & 1<<i) # predication
+         src1 = get_polymorphed_reg(RA, srcwid, irs1)
+         src2 = get_polymorphed_reg(RB, srcwid, irs2)
+         src3 = get_polymorphed_reg(RC, srcwid, irs3)
+         result = OP(src1, src2, src3) # double-width result
+         set_polymorphed_reg(RT, destwid, id, result) # low half
+         set_polymorphed_reg(RT, destwid, id+MAXVL, result>>destwid) # high half
+         if (!RT.isvec) break
+      if (RT.isvec)  { id += 1; }
+      if (RA.isvec)  { irs1 += 1; }
+      if (RB.isvec)  { irs2 += 1; }
+      if (RC.isvec)  { irs3 += 1; }
+
+The significant part here is that the second half is stored
+starting not from RT+MAXVL at all: it is the *element* index
+that is offset by MAXVL, both halves actually starting from RT.
+If VL is 3, MAXVL is 5, RT is 1, and dest elwidth is 32 then the elements
+RT0 to RT2 are stored:
+
+          0..31      32..63
+    r0    unchanged  unchanged
+    r1    RT0.lo     RT1.lo
+    r2    RT2.lo     unchanged
+    r3    unchanged  RT0.hi
+    r4    RT1.hi     RT2.hi
+    r5    unchanged  unchanged
+
+Note that all of the LO halves start from r1, but that the HI halves
+start from half-way into r3. The reason is that with MAXVL being
+5 and elwidth being 32, this is the 5th element
+offset (in 32-bit quantities) counting from r1 (a small worked check of
+this layout is sketched below, after the links).
+
+Additional DRAFT Scalar instructions in 3-in 2-out form
+with an implicit 2nd destination:
+
+* [[isa/svfixedarith]]
+* [[isa/svfparith]]
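As a cross-check of the layout table above, here is a small stand-alone
sketch (plain Python, not part of the specification; the variable names
are chosen only for this example) which computes where each half of each
element lands, using the same numbers (RT=1, MAXVL=5, VL=3, destination
elwidth 32):

    # low half of element e goes to element offset e, high half to
    # element offset e+MAXVL, counted in 32-bit units starting at r1
    RT, MAXVL, VL, elwidth = 1, 5, 3, 32
    slots_per_reg = 64 // elwidth
    for e in range(VL):
        for half, off in (("lo", e), ("hi", e + MAXVL)):
            reg  = RT + off // slots_per_reg
            slot = off % slots_per_reg
            print(f"RT{e}.{half} -> r{reg} bits {slot*elwidth}..{(slot+1)*elwidth-1}")

The output places RT0.lo and RT1.lo in r1, RT2.lo in the low half of r2,
and the first HI half (RT0.hi) half-way into r3, agreeing with the table.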