[[!toc]]
+# Partial Implementations
+
+It is perfectly legal to implement subsets of SVP64 as long as illegal
+instruction traps are always raised on unimplemented features,
+so that soft-emulation is possible,
+even for future revisions of SVP64. With SVP64 being partly controlled
+through contextual SPRs, a little care has to be taken.
+
+**All** SPRs that are not implemented, including those reserved for future
+use, must raise an illegal instruction trap if read or written. This gives
+software the opportunity to emulate the context created by the given SPR.
+
+**Embedded Scalar Scenario**
+
+In this scenario an implementation does not wish to implement Vectorisation
+but simply wishes to take advantage of predication or other features
+of SVP64, such as instructions that might only be available if prefixed.
+Such an implementation would be entirely free to do so with the proviso
+that:
+
+* any attempts to call `setvl` shall either raise an illegal instruction
+ trap or be partially implemented to set SVSTATE correctly.
+* if SVSTATE contains any value in any bit that is not supported
+ in hardware, an illegal instruction shall be raised when an SVP64
+ prefixed instruction is executed.
+* if SVSTATE contains values requesting supported features at the time
+ that the prefixed instruction is executed then it is executed in
+ hardware as per specification, with no illegal instruction trap raised.
+
+Example, assuming that hardware implements predication but not
+elwidth overrides:
+
+ setvli r0, 4 # sets VL equal to 4
+ sv.addi r5, r0, 1 # raises an 0x700 trap
+ setvli r0, 1 # sets VL equal to 1
+ sv.addi r5, r0, 1 # gets executed by hardware
+ sv.addi/ew=8 r5, r0, 1 # raises an 0x700 trap
+ sv.ori/sm=EQ r5, r0, 1 # executed by hardware
+
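The decision that such an implementation makes before executing a prefixed instruction can be sketched roughly as follows. This is a minimal Python model and is not taken from the specification: the names `HwCaps`, `Request` and `must_trap` are hypothetical, and the model assumes (as in the example above) hardware that supports predication and VL=1 only, with no elwidth overrides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HwCaps:
    """Hypothetical description of what the hardware implements."""
    max_vl: int = 1                  # scalar-only: any VL above 1 is unsupported
    predication: bool = True
    elwidth_overrides: bool = False


@dataclass(frozen=True)
class Request:
    """Hypothetical summary of what SVSTATE and the prefix ask for."""
    vl: int = 1
    elwidth: Optional[int] = None    # None means "no override" in this model
    predicated: bool = False


def must_trap(hw: HwCaps, req: Request) -> bool:
    """Return True if an illegal instruction trap must be raised."""
    if req.vl > hw.max_vl:
        return True
    if req.elwidth is not None and not hw.elwidth_overrides:
        return True
    if req.predicated and not hw.predication:
        return True
    return False


hw = HwCaps()
results = [
    must_trap(hw, Request(vl=4)),                    # VL=4: unsupported
    must_trap(hw, Request(vl=1)),                    # VL=1: supported
    must_trap(hw, Request(vl=1, elwidth=8)),         # elwidth override: unsupported
    must_trap(hw, Request(vl=1, predicated=True)),   # predication: supported
]
```

The four entries of `results` correspond, in order, to the four `sv.` instructions in the example: unsupported requests trap so that software can emulate them, and supported ones run in hardware.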
+The first
+
# XER, SO and other global flags
Vector systems are expected to be high performance. This is achieved
SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
CPU ISA.
-Vectorisation of the VSX Packed SIMD system
-likewise makes no sense whatsoever. SV *replaces* VSX and provides,
+Vectorisation of the VSX Packed SIMD system makes no sense whatsoever,
+the sole exceptions potentially being any operations with 128-bit
+operands such as `vrlq` (Rotate Quad Word) and `xsaddqp` (Scalar
+Quad-precision Add).
+SV effectively *replaces* VSX, requiring far fewer instructions, and provides,
at the very minimum, predication (which VSX was designed without).
Thus all VSX Major Opcodes - all of them - are "unused" and must raise
illegal instruction exceptions in SV Prefix Mode.
* ew=8/16/32 - element width
* sew=8/16/32 - source element width
* vec=2/3/4 - SUBVL
-* mode=reduce/satu/sats/crpred
+* mode=mr/satu/sats/crpred
* pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
-* spred={reg spec}
similar to x86 "rex" prefix.
For modes:
* pred-result:
- - pm=lt/gt/le/ge/eq/ne/so/ns OR
- - pm=RC1 OR pm=~RC1
+ - pm=lt/gt/le/ge/eq/ne/so/ns
+ - RC1 mode
* fail-first
- - ff=lt/gt/le/ge/eq/ne/so/ns OR
- - ff=RC1 OR ff=~RC1
+ - ff=lt/gt/le/ge/eq/ne/so/ns
+ - RC1 mode
* saturation:
- sats
- satu
It is extremely important for implementors to note that the only circumstance
where upper portions of an underlying 64-bit register are zero'd out is
when the destination is a scalar. The ideal register file has byte-level
-write-enable lines, just like most SRAMs.
+write-enable lines, just like most SRAMs, in order to avoid READ-MODIFY-WRITE.
An example ADD operation with predication and element width overrides:
if (RA.isvec) { irs1 += 1; }
if (RB.isvec) { irs2 += 1; }
+Thus it can be clearly seen that elements are packed by their
+element width, and the packing starts from the source (or destination)
+specified by the instruction.
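
As a rough illustration of that packing, the register and bit position of a given element can be computed as below. This is a sketch under stated assumptions rather than the specification's own pseudocode: the helper name `element_location` is hypothetical, registers are assumed to be 64 bits wide, and the bit offset is simply counted in packing order from the start of each register.

```python
from typing import Tuple


def element_location(base_reg: int, elwidth: int, index: int) -> Tuple[int, int]:
    """Return (register number, starting bit) of element `index`.

    Elements are packed back to back by element width, starting from
    the register named in the instruction (`base_reg`).
    """
    if elwidth not in (8, 16, 32, 64):
        raise ValueError("elwidth must be 8, 16, 32 or 64")
    per_reg = 64 // elwidth              # elements that fit in one register
    reg = base_reg + index // per_reg    # which register the element lands in
    bit = (index % per_reg) * elwidth    # where it starts within that register
    return reg, bit


first = element_location(5, 8, 0)    # first 8-bit element, starting at r5
tenth = element_location(5, 8, 9)    # tenth 8-bit element spills into r6
wide = element_location(5, 64, 3)    # 64-bit elements use one register each
```

With 8-bit elements, eight of them fit in each register, so the tenth element (index 9) is the second element of the register after the starting one.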
+
# Twin (implicit) result operations
Some operations in the Power ISA already target two 64-bit scalar
-registers: `lq` for example. Some mathematical algorithms are more
+registers: `lq` for example, and LD with update.
+Some mathematical algorithms are more
efficient when there are two outputs rather than one, providing
-feedback loops between elements. 64-bit multiply
+feedback loops between elements (the most well-known being add with
+carry). 64-bit multiply
for example actually internally produces a 128 bit result, which clearly
cannot be stored in a single 64 bit register. Some ISAs recommend
"macro op fusion": the practice of setting a convention whereby if
The practice and convention of macro-op fusion however is not compatible
with SVP64 Horizontal-First, because Horizontal Mode may only
-be applied to a single instruction at a time. Thus it becomes
+be applied to a single instruction at a time, and SVP64 is based on
+the principle of strict Program Order even at the element
+level. Thus it becomes
necessary to add explicit, more complex single instructions with
-more operands than would normally be seen in another ISA. If it
+more operands than would normally be seen in the average RISC ISA
+(3-in, 2-out, in some cases). If it
was not for Power ISA already having LD/ST with update as well as
Condition Codes and `lq` this would be hard to justify.
into consideration, the starting point for the implicit destination
is best illustrated in pseudocode:
- # demo of madded
- for (i = 0; i < VL; i++)
+ # demo of madded
+ for (i = 0; i < VL; i++)
if (predval & 1<<i) # predication
src1 = get_polymorphed_reg(RA, srcwid, irs1)
src2 = get_polymorphed_reg(RB, srcwid, irs2)
src3 = get_polymorphed_reg(RC, srcwid, irs3)
result = src1*src2 + src3
destmask = (1<<destwid)-1
- # store two halves of result
+ # store two halves of result, both start from RT.
set_polymorphed_reg(RT, destwid, ird , result&destmask)
set_polymorphed_reg(RT, destwid, ird+MAXVL, result>>destwid)
if (!RT.isvec) break
The significant part here is that the second half is stored
starting not from RT+MAXVL at all: it is the *element* index
-that is offset by MAXVL, both starting from RT.
+that is offset by MAXVL, both halves actually starting from RT.
+If VL is 3, MAXVL is 5, RT is 1, and dest elwidth is 32 then the elements
+RT0 to RT2 are stored:
+
+ 0..31 32..63
+ r0 unchanged unchanged
+ r1 RT0.lo RT1.lo
+ r2 RT2.lo unchanged
+ r3 unchanged RT0.hi
+ r4 RT1.hi RT2.hi
+ r5 unchanged unchanged
+
+Note that all of the LO halves start from r1, but that the HI halves
+start from half-way into r3. The reason is that with MAXVL being
+5 and elwidth being 32, this is the 5th element
+offset (in 32 bit quantities) counting from r1.
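
The table can be reproduced with a short calculation. The following Python sketch is illustrative only: the function name `madded_layout` is hypothetical, and it assumes 64-bit registers, RT marked as a vector, and every element enabled by the predicate.

```python
from typing import Dict, Tuple


def madded_layout(rt: int, vl: int, maxvl: int,
                  elwidth: int) -> Dict[str, Tuple[int, int]]:
    """Map each result half to (register number, starting bit).

    LO halves use element indices 0..VL-1 and HI halves use element
    indices MAXVL..MAXVL+VL-1, both counted from RT.
    """
    per_reg = 64 // elwidth              # elements packed per 64-bit register

    def locate(index: int) -> Tuple[int, int]:
        return rt + index // per_reg, (index % per_reg) * elwidth

    layout = {}
    for i in range(vl):
        layout[f"RT{i}.lo"] = locate(i)            # element index i
        layout[f"RT{i}.hi"] = locate(i + maxvl)    # element index i + MAXVL
    return layout


layout = madded_layout(rt=1, vl=3, maxvl=5, elwidth=32)
for name in sorted(layout):
    reg, bit = layout[name]
    print(f"{name}: r{reg} bits {bit}..{bit + 31}")
```

For the parameters above, `RT0.hi` is element index 5, which lands in the second 32-bit slot of r3: this is the half-way point described in the note.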
+
+Additional DRAFT Scalar instructions in 3-in 2-out form
+with an implicit 2nd destination:
* [[isa/svfixedarith]]
* [[isa/svfparith]]