implementations may exploit this observation to deploy efficient
Parallel Carry Lookahead.
+```
# assume VL=4, this results in 4 sequential ops (below)
sv.adde r0.v, r4.v, r8.v
adde r1, r5, r9 # takes carry from previous
...
adde r3, r7, r11 # likewise
+```
It can clearly be seen that the carry chains from one
64 bit add to the next, the end result being that a
[[openpower/opcode_regs_deduped]]
* Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
-from reading the markdown formatted version of the Scalar pseudocode
-which is machine-readable and found in [[openpower/isatables]]. The
-analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
-for example is given a designation `RM-2R-1W` because it requires
-two GPR reads and one GPR write.
-* Secondly, the total number of registers was added up (2R-1W is 3 registers)
-and if less than or equal to three then that instruction could be given an
-EXTRA3 designation. Four or more is given an EXTRA2 designation because
-there are only 9 bits available.
+ from reading the markdown formatted version of the Scalar pseudocode which
+ is machine-readable and found in [[openpower/isatables]]. The analysis
+ gives, by instruction, a "Register Profile". `add RT, RA, RB` for
+ example is given a designation `RM-2R-1W` because it requires two GPR
+ reads and one GPR write.
+* Secondly, the total number of registers was added up (2R-1W is 3
+ registers) and if less than or equal to three then that instruction
+ could be given an EXTRA3 designation. Four or more is given an EXTRA2
+ designation because there are only 9 bits available.
* Thirdly, the instruction was analysed to see if Twin or Single
-Predication was suitable. As a general rule this was if there
-was only a single operand and a single result (`extw` and LD/ST)
-however it was found that some 2 or 3 operand instructions also
-qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
-in Twin Predication, some compromises were made, here. LDST is
-Twin but also has 3 operands in some operations, so only EXTRA2 can be used.
+ Predication was suitable. As a general rule this was if there
+ was only a single operand and a single result (`extw` and LD/ST)
+ however it was found that some 2 or 3 operand instructions also
+ qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
+ in Twin Predication, some compromises were made, here. LDST is
+ Twin but also has 3 operands in some operations, so only EXTRA2 can be used.
* Fourthly, a packing format was decided: for 2R-1W an EXTRA3 indexing
-could have been decided
-that RA would be indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
-and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update)
-RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
-(because it is possible to do, and perceived to be useful). Rc=1
-co-results (CR0, CR1) are always given the same EXTRA index as their
-main result (RT, FRT).
-* Fifthly, in an automated process the results of the analysis
-were outputted in CSV Format for use in machine-readable form
-by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>
-
-This process was laborious but logical, and, crucially, once a
-decision is made (and ratified) cannot be reversed.
-Qualifying future Power ISA Scalar instructions for SVP64
-is **strongly** advised to utilise this same process and the same
-sv_analysis.py program as a canonical method of maintaining the
-relationships. Alterations to that same program which
-change the Designation is **prohibited** once finalised (ratified
-through the Power ISA WG Process). It would
-be similar to deciding that `add` should be changed from X-Form
+ could have been decided that RA would be indexed 0 (EXTRA bits 0-2), RB
+ indexed 1 (EXTRA bits 3-5) and RT indexed 2 (EXTRA bits 6-8). In some
+ cases (LD/ST with update) RA-as-a-source is given a **different** EXTRA
+ index from RA-as-a-result (because it is possible to do, and perceived
+ to be useful). Rc=1 co-results (CR0, CR1) are always given the same
+ EXTRA index as their main result (RT, FRT).
+* Fifthly, in an automated process the results of the analysis were
+ output in CSV Format for use in machine-readable form by sv_analysis.py
+ <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>
+
+This process was laborious but logical, and, crucially, once a decision
+is made (and ratified) it cannot be reversed. When qualifying future
+Power ISA Scalar instructions for SVP64 it is **strongly** advised to
+utilise this same process and the same sv_analysis.py program as a
+canonical method of maintaining the relationships. Alterations to that
+same program which change the Designation are **prohibited** once
+finalised (ratified through the Power ISA WG Process). It would be
+similar to deciding that `add` should be changed from X-Form
to D-Form.
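
As a rough illustration of the second step above, the choice between EXTRA3 and EXTRA2 can be sketched in Python. The names `count_registers` and `extra_designation` and the profile-string parsing are hypothetical and are not taken from sv_analysis.py; the Twin Predication compromise (which sacrifices 3 of the 9 bits) is deliberately not modelled.

```python
import re

EXTRA_BITS = 9  # total EXTRA bits available, per the text above


def count_registers(profile):
    """Sum the reads and writes in a profile string such as 'RM-2R-1W'."""
    reads = re.search(r"(\d+)R", profile)
    writes = re.search(r"(\d+)W", profile)
    n_reads = int(reads.group(1)) if reads else 0
    n_writes = int(writes.group(1)) if writes else 0
    return n_reads + n_writes


def extra_designation(profile):
    """Three or fewer registers get 3 bits each, otherwise 2 bits each."""
    n = count_registers(profile)
    bits_per_reg = 3 if n <= EXTRA_BITS // 3 else 2
    return f"EXTRA{bits_per_reg}"
```

Under these assumptions `extra_designation("RM-2R-1W")` selects EXTRA3, whereas a four-register profile falls back to EXTRA2.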
## Single Predication <a name="1p"> </a>
-This is a standard mode normally found in Vector ISAs. every element in every source Vector and in the destination uses the same bit of one single predicate mask.
+This is a standard mode normally found in Vector ISAs. Every element
+in every source Vector and in the destination uses the same bit of one
+single predicate mask.
-In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep, but depending on whether sz and/or dz are set, srcstep and
-dststep can still potentially become different indices. Only when sz=dz
-is srcstep guaranteed to equal dststep at all times.
+In SVSTATE, for Single-predication, implementors MUST increment both
+srcstep and dststep, but depending on whether sz and/or dz are set,
+srcstep and dststep can still potentially become different indices.
+Only when sz=dz is srcstep guaranteed to equal dststep at all times.
Note that in some Mode Formats there is only one flag (zz). This indicates
that *both* sz *and* dz are set to the same value.
This extreme power and flexibility comes down to the fact that SVP64
is not actually a Vector ISA: it is a loop-abstraction-concept that
-is applied *in general* to Scalar operations, just like the x86
-`REP` instruction (if put on steroids).
+is applied *in general* to Scalar operations, just like the x86 `REP`
+instruction (if put on steroids).
## Pack/Unpack
The pack/unpack concept of VSX `vpack` is abstracted out as Sub-Vector
-reordering.
-Two bits in the `SVSHAPE` [[sv/spr]]
-enable either "packing" or "unpacking"
-on the subvectors vec2/3/4.
+reordering. Two bits in the `SVSHAPE` [[sv/spr]] enable either "packing"
+or "unpacking" on the subvectors vec2/3/4.
-First, illustrating a
-"normal" SVP64 operation with `SUBVL!=1:` (assuming no elwidth overrides),
-note that the VL loop is outer and the SUBVL loop inner:
+First, illustrating a "normal" SVP64 operation with `SUBVL!=1` (assuming
+no elwidth overrides), note that the VL loop is outer and the SUBVL
+loop inner:
+```
def index():
for i in range(VL):
for j in range(SUBVL):
for idx in index():
operation_on(RA+idx)
+```
For pack/unpack (again, no elwidth overrides), note that now there is the
option to swap the SUBVL and VL loop orders.
In effect the Pack/Unpack performs a Transpose of the subvector elements.
Illustrated this time with a GPR mv operation:
+```
# yield an outer-SUBVL or inner VL loop with SUBVL
def index_p(outer):
if outer:
# walk through both source and dest indices simultaneously
for src_idx, dst_idx in zip(index_p(PACK), index_p(UNPACK)):
move_operation(RT+dst_idx, RA+src_idx)
+```
"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
packed together, Sub-elements 1 are packed together, as
are Sub-elements 2.
+```
srcstep=0 srcstep=1
0 1 2 3 4 5
dststep=0 dststep=1 dststep=2
0 3 1 4 2 5
+```
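
The reordering in the diagram can be reproduced with a small self-contained Python sketch, assuming VL=2 and SUBVL=3 as drawn. The generator name `index` and its `subvl_outer` flag are illustrative only: the sketch shows the effect of swapping the loop order, not which of `PACK` or `UNPACK` selects it for source or destination.

```python
def index(vl, subvl, subvl_outer=False):
    """Yield flat element offsets; subvl_outer swaps the loop order."""
    if subvl_outer:
        # SUBVL loop outer, VL loop inner: gathers sub-element j of every element
        for j in range(subvl):
            for i in range(vl):
                yield i * subvl + j
    else:
        # "normal" order: VL loop outer, SUBVL loop inner
        for i in range(vl):
            for j in range(subvl):
                yield i * subvl + j


VL, SUBVL = 2, 3
normal = list(index(VL, SUBVL))                      # 0 1 2 3 4 5
swapped = list(index(VL, SUBVL, subvl_outer=True))   # 0 3 1 4 2 5
```

Zipping the two sequences pairs each source offset with its transposed destination offset, which is the Transpose described above.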
-Setting of both `PACK` and `UNPACK` is neither prohibited nor
-`UNDEFINED` because the reordering is fully deterministic, and
-additional REMAP reordering may be applied. Combined with
-Matrix REMAP this would
-give potentially up to 4 Dimensions of reordering.
+Setting of both `PACK` and `UNPACK` is neither prohibited nor `UNDEFINED`
+because the reordering is fully deterministic, and additional REMAP
+reordering may be applied. Combined with Matrix REMAP this would give
+potentially up to 4 Dimensions of reordering.
-Pack/Unpack has quirky interactions on
-[[sv/mv.swizzle]] because it can set a different subvector length for
-destination, and has a slightly different pseudocode algorithm
-for Vertical-First Mode.
+Pack/Unpack has quirky interactions on [[sv/mv.swizzle]] because it can
+set a different subvector length for destination, and has a slightly
+different pseudocode algorithm for Vertical-First Mode.
Pack/Unpack is enabled (set up) through [[sv/svstep]].
## Reduce modes
-Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
-Vector ISA would have explicit Reduce opcodes with defined characteristics
-per operation: in SX Aurora there is even an additional scalar argument
-containing the initial reduction value, and the default is either 0
-or 1 depending on the specifics of the explicit opcode.
-SVP64 fundamentally has to
-utilise *existing* Scalar Power ISA v3.0B operations, which presents some
-unique challenges.
+Reduction in SVP64 is deterministic and somewhat of a misnomer.
+A normal Vector ISA would have explicit Reduce opcodes with defined
+characteristics per operation: in SX Aurora there is even an additional
+scalar argument containing the initial reduction value, and the default
+is either 0 or 1 depending on the specifics of the explicit opcode.
+SVP64 fundamentally has to utilise *existing* Scalar Power ISA v3.0B
+operations, which presents some unique challenges.
The solution turns out to be to simply define reduction as permitting
deterministic element-based schedules to be issued using the base Scalar
operations, and to rely on the underlying microarchitecture to resolve
-Register Hazards at the element level. This goes back to
-the fundamental principle that SV is nothing more than a Sub-Program-Counter
-sitting between Decode and Issue phases.
-
-For Scalar Reduction,
-Microarchitectures *may* take opportunities to parallelise the reduction
-but only if in doing so they preserve strict Program Order at the Element Level.
-Opportunities where this is possible include an `OR` operation
-or a MIN/MAX operation: it may be possible to parallelise the reduction,
-but for Floating Point it is not permitted due to different results
-being obtained if the reduction is not executed in strict Program-Sequential
-Order.
+Register Hazards at the element level. This goes back to the fundamental
+principle that SV is nothing more than a Sub-Program-Counter sitting
+between Decode and Issue phases.
+
+For Scalar Reduction, Microarchitectures *may* take opportunities to
+parallelise the reduction but only if in doing so they preserve strict
+Program Order at the Element Level. Opportunities where this is possible
+include an `OR` operation or a MIN/MAX operation: it may be possible to
+parallelise the reduction, but for Floating Point it is not permitted
+due to different results being obtained if the reduction is not executed
+in strict Program-Sequential Order.
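
The reason Floating Point is singled out can be seen in a short Python sketch (plain IEEE 754 doubles, nothing specific to SVP64): re-associating the additions changes the result, whereas a bitwise OR gives the same answer under any grouping.

```python
from functools import reduce
import operator

values = [0.1, 0.2, 0.3]

# strict sequential (Program Order) reduction: (0.1 + 0.2) + 0.3
in_order = reduce(operator.add, values)

# a re-associated schedule: 0.1 + (0.2 + 0.3)
reassociated = values[0] + (values[1] + values[2])

# integer OR is associative, so either grouping is safe
masks = [0b0001, 0b0100, 0b1000]
or_in_order = reduce(operator.or_, masks)
or_reassociated = masks[0] | (masks[1] | masks[2])
```

Here `in_order` and `reassociated` differ in the last bit, while the two OR results are identical.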
In essence it becomes the programmer's responsibility to leverage the
pre-determined schedules to desired effect.
as a simple and natural relaxation of the usual restriction on the Vector
Looping which would terminate if the destination was marked as a Scalar.
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
-even though the destination register is marked as scalar.
-Thus it is up to the programmer to be aware of this, observe some
-conventions, and thus end up achieving the desired outcome of scalar
-reduction.
-
-It is also important to appreciate that there is no
-actual imposition or restriction on how this mode is utilised: there
-will therefore be several valuable uses (including Vector Iteration
-and "Reverse-Gear")
-and it is up to the programmer to make best use of the
-(strictly deterministic) capability
-provided.
+even though the destination register is marked as scalar. Thus it is
+up to the programmer to be aware of this, observe some conventions,
+and thus end up achieving the desired outcome of scalar reduction.
+
+It is also important to appreciate that there is no actual imposition or
+restriction on how this mode is utilised: there will therefore be several
+valuable uses (including Vector Iteration and "Reverse-Gear") and it is
+up to the programmer to make best use of the (strictly deterministic)
+capability provided.
In this mode, which is suited to operations involving carry or overflow,
one register must be assigned, by convention by the programmer to be the
*Note that instructions such as `setb` issued in Scalar reduce mode
are neither `UNDEFINED` nor prohibited, despite them not making much
-sense at first glance.
-Scalar reduce is strictly defined behaviour, and the cost in
-hardware terms of prohibition of seemingly non-sensical operations is too great.
-Therefore it is permitted and required to be executed successfully.
-Implementors **MAY** choose to optimise such instructions in instances
-where their use results in "extraneous execution", i.e. where it is clear
-that the sequence of operations, comprising multiple overwrites to
-a scalar destination **without** cumulative, iterative, or reductive
-behaviour (no "accumulator"), may discard all but the last element
-operation. Identification
-of such is trivial to do for `setb` and `cmp`: the source register type is
-a completely different register file from the destination.
-Likewise Scalar reduction when the destination is a Vector
-is as if the Reduction Mode was not requested. However it would clearly
-be unacceptable to perform such optimisations on cache-inhibited LD/ST,
-so some considerable care needs to be taken.*
+sense at first glance. Scalar reduce is strictly defined behaviour,
+and the cost in hardware terms of prohibition of seemingly non-sensical
+operations is too great. Therefore it is permitted and required to
+be executed successfully. Implementors **MAY** choose to optimise
+such instructions in instances where their use results in "extraneous
+execution", i.e. where it is clear that the sequence of operations,
+comprising multiple overwrites to a scalar destination **without**
+cumulative, iterative, or reductive behaviour (no "accumulator"), may
+discard all but the last element operation. Identification of such
+is trivial to do for `setb` and `cmp`: the source register type is a
+completely different register file from the destination. Likewise Scalar
+reduction when the destination is a Vector is as if the Reduction Mode
+was not requested. However it would clearly be unacceptable to perform
+such optimisations on cache-inhibited LD/ST, so some considerable care
+needs to be taken.*
Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements of the vector starting at r10.
+```
# add RT, RA,RB but when RT==RA
for i in range(VL):
iregs[RA] += iregs[RB+i] # RT==RA
+```
However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`)
-SV ordinarily
-**terminates** at the first scalar operation. Only by marking the
-operation as "mapreduce" will it continue to issue multiple sub-looped
-(element) instructions in `Program Order`.
-
-To perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This may be useful in situations where the results may be different
-(floating-point) if executed in a different order. Given that there is
-no actual prohibition on Reduce Mode being applied when the destination
-is a Vector, the "Reverse Gear" bit turns out to be a way to apply Iterative
-or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
-for example will start at the opposite end of the Vector and push
-a cumulative series of overlapping add operations into the Execution units of
-the underlying hardware.
+SV ordinarily **terminates** at the first scalar operation. Only by
+marking the operation as "mapreduce" will it continue to issue multiple
+sub-looped (element) instructions in `Program Order`.
+
+To perform the loop in reverse order, the ```RG``` (reverse gear) bit
+must be set. This may be useful in situations where the results may be
+different (floating-point) if executed in a different order. Given that
+there is no actual prohibition on Reduce Mode being applied when the
+destination is a Vector, the "Reverse Gear" bit turns out to be a way to
+apply Iterative or Cumulative Vector operations in reverse. `sv.add/rg
+r3.v, r4.v, r4.v` for example will start at the opposite end of the
+Vector and push a cumulative series of overlapping add operations into
+the Execution units of the underlying hardware.
Other examples include shift-mask operations where a Vector of inserts
-into a single destination register is required (see [[sv/bitmanip]], bmset),
-as a way to construct
-a value quickly from multiple arbitrary bit-ranges and bit-offsets.
-Using the same register as both the source and destination, with Vectors
-of different offsets masks and values to be inserted has multiple
-applications including Video, cryptography and JIT compilation.
+into a single destination register is required (see [[sv/bitmanip]],
+bmset), as a way to construct a value quickly from multiple arbitrary
+bit-ranges and bit-offsets. Using the same register as both the source
+and destination, with Vectors of different offsets, masks and values to
+be inserted, has multiple applications including Video, cryptography and
+JIT compilation.
+```
# assume VL=4:
# * Vector of shift-offsets contained in RC (r12.v)
# * Vector of masks contained in RB (r8.v)
# * Vector of values to be masked-in in RA (r4.v)
# * Scalar destination RT (r0) to receive all mask-offset values
sv.bmset/mr r0, r4.v, r8.v, r12.v
+```
-Due to the Deterministic Scheduling,
-Subtract and Divide are still permitted to be executed in this mode,
-although from an algorithmic perspective it is strongly discouraged.
-It would be better to use addition followed by one final subtract,
-or in the case of divide, to get better accuracy, to perform a multiply
-cascade followed by a final divide.
+Due to the Deterministic Scheduling, Subtract and Divide are still
+permitted to be executed in this mode, although from an algorithmic
+perspective it is strongly discouraged. It would be better to use
+addition followed by one final subtract, or in the case of divide, to get
+better accuracy, to perform a multiply cascade followed by a final divide.
Note that single-operand or three-operand scalar-dest reduce is perfectly
-well permitted: the programmer may still declare one register, used as
-both a Vector source and Scalar destination, to be utilised as
-the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
-this naturally fits well with the normal expected usage of these
-operations.
+well permitted: the programmer may still declare one register, used
+as both a Vector source and Scalar destination, to be utilised as the
+"accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc this
+naturally fits well with the normal expected usage of these operations.
If an interrupt or exception occurs in the middle of the scalar mapreduce,
the scalar destination register **MUST** be updated with the current
(intermediate) result, because this is how ```Program Order``` is
-preserved (Vector Loops are to be considered to be just another way of issuing instructions
-in Program Order). In this way, after return from interrupt,
-the scalar mapreduce may continue where it left off. This provides
-"precise" exception behaviour.
-
-Note that hardware is perfectly permitted to perform multi-issue
-parallel optimisation of the scalar reduce operation: it's just that
-as far as the user is concerned, all exceptions and interrupts **MUST**
-be precise.
+preserved (Vector Loops are to be considered to be just another way
+of issuing instructions in Program Order). In this way, after return
+from interrupt, the scalar mapreduce may continue where it left off.
+This provides "precise" exception behaviour.
+Note that hardware is perfectly permitted to perform multi-issue parallel
+optimisation of the scalar reduce operation: it's just that as far as
+the user is concerned, all exceptions and interrupts **MUST** be precise.
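
A minimal Python sketch of the resumable behaviour follows. It assumes a simple model in which the element step (named `srcstep` after the SVSTATE field) is saved across the interrupt; the `Interrupt` class, the `mapreduce_add` helper and the dictionary register file are hypothetical illustrations, not the specification's pseudocode.

```python
class Interrupt(Exception):
    """Stand-in for an interrupt or exception arriving mid-loop."""


def mapreduce_add(regs, rt, rb, vl, state, interrupt_at=None):
    """Accumulate regs[rb+i] into regs[rt], resuming from state['srcstep']."""
    while state["srcstep"] < vl:
        i = state["srcstep"]
        if i == interrupt_at:
            # regs[rt] already holds the intermediate result at this point
            raise Interrupt(i)
        regs[rt] += regs[rb + i]
        state["srcstep"] = i + 1
    state["srcstep"] = 0  # loop complete: reset for the next instruction


regs = {3: 0, 10: 1, 11: 2, 12: 3, 13: 4}
state = {"srcstep": 0}
try:
    mapreduce_add(regs, rt=3, rb=10, vl=4, state=state, interrupt_at=2)
except Interrupt:
    partial = regs[3]  # intermediate sum of elements 0 and 1
mapreduce_add(regs, rt=3, rb=10, vl=4, state=state)  # resume where it left off
```

Because the accumulator is written after every element, the second call only needs the saved step to finish the sum.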
## Fail-on-first <a name="fail-first"> </a>
-Data-dependent fail-on-first has two distinct variants: one for LD/ST
-(see [[sv/ldst]],
-the other for arithmetic operations (actually, CR-driven)
-[[sv/normal]] and CR operations [[sv/cr_ops]].
-Note in each
-case the assumption is that vector elements are required appear to be
-executed in sequential Program Order, element 0 being the first.
-
-* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
- ordinary one. Exceptions occur "as normal". However for elements 1
- and above, if an exception would occur, then VL is **truncated** to the
- previous element.
+Data-dependent fail-on-first has two distinct variants: one for LD/ST (see
+[[sv/ldst]]), the other for arithmetic operations (actually, CR-driven)
+[[sv/normal]] and CR operations [[sv/cr_ops]]. Note in each case the
+assumption is that vector elements are required to appear to be executed
+in sequential Program Order, element 0 being the first.
+
+* LD/ST ffirst (not to be confused with *Data-Dependent* LD/ST ffirst)
+ treats the first LD/ST in a vector (element 0) as an ordinary one.
+ Exceptions occur "as normal". However for elements 1 and above, if an
+ exception would occur, then VL is **truncated** to the previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
CR-creating operation produces a result (including cmp). Similar to
- branch, an analysis of the CR is performed and if the test fails, the
- vector operation terminates and discards all element operations
- above the current one (and the current one if VLi is not set),
- and VL is truncated to either
- the *previous* element or the current one, depending on whether
- VLi (VL "inclusive") is set.
-
-Thus the new VL comprises a contiguous vector of results,
-all of which pass the testing criteria (equal to zero, less than zero).
-
-The CR-based data-driven fail-on-first is new and not found in ARM
-SVE or RVV. At the same time it is also "old" because it is a generalisation
-of the Z80
-[Block compare](https://rvbelzen.tripod.com/z80prgtemp/z80prg04.htm)
+ branch, an analysis of the CR is performed and if the test fails,
+ the vector operation terminates and discards all element operations
+ above the current one (and the current one if VLi is not set), and
+ VL is truncated to either the *previous* element or the current one,
+ depending on whether VLi (VL "inclusive") is set.
+
+Thus the new VL comprises a contiguous vector of results, all of which
+pass the testing criteria (equal to zero, less than zero).
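
The truncation rule can be sketched in Python under simplifying assumptions: each value stands for an element result whose CR Field is tested, `test` stands in for the CR bit test, and `vli` models the VLi option. The function name `ffirst` is hypothetical.

```python
def ffirst(values, test, vli=False):
    """Return (results, new_vl) after data-dependent fail-first truncation."""
    vl = len(values)
    results = []
    for i, value in enumerate(values):
        if not test(value):
            if vli:
                results.append(value)  # failing element is kept when VLi is set
                vl = i + 1
            else:
                vl = i                 # failing element is discarded
            break
        results.append(value)
    return results, vl


nonzero = lambda v: v != 0
exclusive = ffirst([5, 7, 0, 9], nonzero)            # ([5, 7], 2)
inclusive = ffirst([5, 7, 0, 9], nonzero, vli=True)  # ([5, 7, 0], 3)
```

The inclusive form corresponds to the strncpy-style case where the terminating zero must be part of the result.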
+
+The CR-based data-driven fail-on-first is new and not
+found in ARM SVE or RVV. At the same time it is also
+"old" because it is a generalisation of the Z80 [Block
+compare](https://rvbelzen.tripod.com/z80prgtemp/z80prg04.htm)
instructions, especially
-[CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir)
-which is based on CP (compare) as the ultimate "element" (suffix)
-operation to which the repeat (prefix) is applied.
-It is extremely useful for reducing instruction count,
-however requires speculative execution involving modifications of VL
-to get high performance implementations. An additional mode (RC1=1)
-effectively turns what would otherwise be an arithmetic operation
-into a type of `cmp`. The CR is stored (and the CR.eq bit tested
-against the `inv` field).
-If the CR.eq bit is equal to `inv` then the Vector is truncated and
-the loop ends.
-Note that when RC1=1 the result elements are never stored, only the CRs.
+[CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir) which is
+based on CP (compare) as the ultimate "element" (suffix) operation
+to which the repeat (prefix) is applied. It is extremely useful for
+reducing instruction count, however requires speculative execution
+involving modifications of VL to get high performance implementations.
+An additional mode (RC1=1) effectively turns what would otherwise be an
+arithmetic operation into a type of `cmp`. The CR is stored (and the
+CR.eq bit tested against the `inv` field). If the CR.eq bit is equal to
+`inv` then the Vector is truncated and the loop ends. Note that when
+RC1=1 the result elements are never stored, only the CRs.
VLi is only available as an option when `Rc=0` (or for instructions
-which do not have Rc). When set, the current element is always
-also included in the count (the new length that VL will be set to).
-This may be useful in combination with "inv" to truncate the Vector
-to *exclude* elements that fail a test, or, in the case of implementations
-of strncpy, to include the terminating zero.
+which do not have Rc). When set, the current element is always also
+included in the count (the new length that VL will be set to). This may
+be useful in combination with "inv" to truncate the Vector to *exclude*
+elements that fail a test, or, in the case of implementations of strncpy,
+to include the terminating zero.
In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
workloads or balance resources.
CR-based data-dependent fail-first, on the other hand, MUST NOT truncate VL
-arbitrarily to a length decided by the hardware: VL MUST only be
-truncated based explicitly on whether a test fails.
-This because it is a precise test on which algorithms
-will rely.
+arbitrarily to a length decided by the hardware: VL MUST only be truncated
+based explicitly on whether a test fails. This is because it is a precise
+test on which algorithms will rely.
-*Note: there is no reverse-direction for Data-dependent Fail-First.
-REMAP will need to be activated to invert the ordering of element
-traversal.*
+*Note: there is no reverse-direction for Data-dependent Fail-First. REMAP
+will need to be activated to invert the ordering of element traversal.*
### Data-dependent fail-first on CR operations (crand etc)
-Operations that actually produce or alter CR Field as a result
-do not also in turn have an Rc=1 mode. However it makes no
-sense to try to test the 4 bits of a CR Field for being equal
-or not equal to zero. Moreover, the result is already in the
-form that is desired: it is a CR field. Therefore,
-CR-based operations have their own SVP64 Mode, described
-in [[sv/cr_ops]]
+Operations that actually produce or alter a CR Field as a result do not
+also in turn have an Rc=1 mode. However it makes no sense to try to test
+the 4 bits of a CR Field for being equal or not equal to zero. Moreover,
+the result is already in the form that is desired: it is a CR field.
+Therefore, CR-based operations have their own SVP64 Mode, described in
+[[sv/cr_ops]].
There are two primary different types of CR operations:
Pred-result mode may not be applied on CR-based operations.
-Although CR operations (mtcr, crand, cror) may be Vectorised,
-predicated, pred-result mode applies to operations that have
-an Rc=1 mode, or make sense to add an RC1 option.
+Although CR operations (mtcr, crand, cror) may be Vectorised, predicated,
+pred-result mode applies to operations that have an Rc=1 mode, or make
+sense to add an RC1 option.
-Predicate-result merges common CR testing with predication, saving on
-instruction count. In essence, a Condition Register Field test
-is performed, and if it fails it is considered to have been
-*as if* the destination predicate bit was zero. Given that
-there are no CR-based operations that produce Rc=1 co-results,
-there can be no pred-result mode for mtcr and other CR-based instructions
+Predicate-result merges common CR testing with predication, saving
+on instruction count. In essence, a Condition Register Field test is
+performed, and if it fails it is considered to have been *as if* the
+destination predicate bit was zero. Given that there are no CR-based
+operations that produce Rc=1 co-results, there can be no pred-result
+mode for mtcr and other CR-based instructions.
Arithmetic and Logical Pred-result, which does have Rc=1 or for which
RC1 Mode makes sense, is covered in [[sv/normal]].
Numbering relationships for CR fields are already complex due to being
in BE format (*the relationship is not clearly explained in the v3.0B
-or v3.1 specification*). However with some care and consideration
-the exact same mapping used for INT and FP regfiles may be applied,
-just to the upper bits, as explained below. Firstly and most
-importantly a new notation
-`CR{field number}` is used to indicate access to a particular
-Condition Register Field (as opposed to the notation `CR[bit]`
-which accesses one bit of the 32 bit Power ISA v3.0B
-Condition Register).
+or v3.1 specification*). However with some care and consideration the
+exact same mapping used for INT and FP regfiles may be applied, just to
+the upper bits, as explained below. Firstly and most importantly a new
+notation `CR{field number}` is used to indicate access to a particular
+Condition Register Field (as opposed to the notation `CR[bit]` which
+accesses one bit of the 32 bit Power ISA v3.0B Condition Register).
`CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is defined, in v3.0B pseudocode, as:
+```
CR{n} = CR[32+n*4:35+n*4]
+```
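
For readers more used to LSB0 bit numbering, the same field extraction can be sketched in Python. The helper name `cr_field` is hypothetical; it assumes the CR is held as a 32-bit integer, so that MSB0 bits 32 to 35 of the 64-bit numbering are its most significant nibble.

```python
def cr_field(cr, n):
    """Return 4-bit field n of a 32-bit CR value (field 0 is most significant)."""
    assert 0 <= n < 8
    return (cr >> (28 - 4 * n)) & 0xF


cr = 0x12345678
first = cr_field(cr, 0)  # CR0, the top nibble
last = cr_field(cr, 7)   # CR7, the bottom nibble
```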
-For SVP64 the relationship for the sequential
-numbering of elements is to the CR **fields** within
-the CR Register, not to individual bits within the CR register.
+For SVP64 the relationship for the sequential numbering of elements is to
+the CR **fields** within the CR Register, not to individual bits within
+the CR register.
The `CR{n}` notation is designed to give *linear sequential
numbering* in the Vector domain on a straight sequential Vector Loop.
In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2)
-select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
-*in* that CR (EQ/LT/GT/SO). The numbering was determined (after 4 months of
+select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits *in*
+that CR (EQ/LT/GT/SO). The numbering was determined (after 4 months of
analysis and research) to be as follows:
+```
CR_index = (BA>>2) # top 3 bits
bit_index = (BA & 0b11) # low 2 bits
CR_reg = CR{CR_index} # get the CR
# finally get the bit from the CR.
CR_bit = (CR_reg & (1<<bit_index)) != 0
+```
When it comes to applying SV, it is the *CR Field* number `CR_reg`
to which SV EXTRA2/3
applies, **not** the `CR_bit` portion (bits 3-4):
+```
if extra3_mode:
spec = EXTRA3
else:
else:
# scalar constructs "00 spec[1:2] BA[0:4]"
return (spec[1:2] << 5) | BA
+```
Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified as follows:
+```
CR_index = (BA>>2) # top 3 bits
if spec[0]:
# vector mode, 0-124 increments of 4
CR_reg = CR{CR_index} # get the CR
# finally get the bit from the CR.
CR_bit = (CR_reg & (1<<bit_index)) != 0
+```
Note here that the decoding pattern to determine CR\_bit does not change.
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:
+```
for i in range(VL):
# calculate the vector result of an add
iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
CRs{8+i}.eq = iregs[RT+i] == 0
CRs{8+i}.gt = iregs[RT+i] > 0
... etc
+```
If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
Illustration of normal mode add operation: zeroing not included, elwidth
overrides not included. If there is no predicate, it is set to all 1s.
+```
function op_add(rd, rs1, rs2) # add not VADD!
int i, id=0, irs1=0, irs2=0;
predval = get_pred_val(FALSE, rd);
STATE.srcoffs = 0; # reset
return;
}
+```
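
Since the listing above is abbreviated, the following Python sketch shows one way the normal-mode loop could be written out in full. It is not the reference pseudocode. The per-operand `*_isvec` flags, the early exit when the destination is scalar, and the 64-bit wrap are assumptions made for this example; zeroing, elwidth overrides and `STATE.srcoffs` handling are omitted.

```python
MASK64 = (1 << 64) - 1


def op_add(iregs, RT, RA, RB, VL, predval=None,
           rt_isvec=True, ra_isvec=True, rb_isvec=True):
    """Predicated element-by-element add (normal mode, no zeroing)."""
    if predval is None:
        predval = (1 << VL) - 1   # no predicate: set to all 1s
    id_ = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):    # predication
            iregs[RT + id_] = (iregs[RA + irs1] + iregs[RB + irs2]) & MASK64
            if not rt_isvec:
                break             # assumption: scalar dest ends the loop
        # scalar operands stay put, vector operands advance
        if rt_isvec:
            id_ += 1
        if ra_isvec:
            irs1 += 1
        if rb_isvec:
            irs2 += 1
    return iregs


if __name__ == "__main__":
    regs = [0] * 32
    regs[4:8] = [1, 2, 3, 4]
    regs[8:12] = [10, 20, 30, 40]
    regs[16:20] = [99, 99, 99, 99]
    op_add(regs, RT=16, RA=4, RB=8, VL=4, predval=0b1011)
    print(regs[16:20])   # [11, 22, 99, 44]
```

With `predval=0b1011` the third element is masked out, so its destination keeps the previous value (non-zeroing behaviour).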
This has several modes:
A reasonable (prototype) starting point:
+```
svp64 [field=value]*
+```
Fields:
For actual assembler:
+```
sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
+```
Qualifiers:
For more complex applications a REMAP Schedule must be used
-*Programmers's note:
-if passed a predicate mask with only one bit set, this algorithm
-takes no action, similar to when a predicate mask is all zero.*
+*Programmer's note: if passed a predicate mask with only one bit set,
+this algorithm takes no action, similar to when a predicate mask is
+all zero.*
*Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
implemented in hardware with MVs that ensure lane-crossing is minimised.
-The mistake which would be catastrophic to SVP64 to make is to then
-limit the Reduction Sequence for all implementors
-based solely and exclusively on what one
-specific internal microarchitecture does.
-In SIMD ISAs the internal SIMD Architectural design is exposed and imposed on the programmer. Cray-style Vector ISAs on the other hand provide convenient,
-compact and efficient encodings of abstract concepts.*
-**It is the Implementor's responsibility to produce a design
-that complies with the above algorithm,
-utilising internal Micro-coding and other techniques to transparently
-insert micro-architectural lane-crossing Move operations
+The mistake which would be catastrophic to SVP64 to make is to then limit
+the Reduction Sequence for all implementors based solely and exclusively
+on what one specific internal microarchitecture does. In SIMD ISAs
+the internal SIMD Architectural design is exposed and imposed on the
+programmer. Cray-style Vector ISAs on the other hand provide convenient,
+compact and efficient encodings of abstract concepts.* **It is the
+Implementor's responsibility to produce a design that complies with the
+above algorithm, utilising internal Micro-coding and other techniques to
+transparently insert micro-architectural lane-crossing Move operations
if necessary or desired, to give the level of efficiency or performance
required.**
union in the C programming language. The following should be taken
literally, and assume always a little-endian layout:
+```
#pragma pack
typedef union {
uint8_t b[];
} elreg_t;
elreg_t int_regfile[128];
+```
Accessing (getting and setting) registers, given a value and a register
(in `elreg_t` form), is shown below. All arithmetic, numbering and
pseudo-Memory format is little-endian and LSB0-numbered:
+```
elreg_t& get_polymorphed_reg(elreg_t const& reg, bitwidth, offset):
elreg_t res; // result
res.l = 0; // TODO: going to need sign-extending / zero-extending
int_regfile[reg].i[offset] = val
elif bitwidth == 64:
int_regfile[reg].l[offset] = val
+```
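
The same idea can be tried out in Python by modelling the register file as one flat little-endian byte array. This is a minimal sketch, assuming 128 registers of 8 bytes each and unsigned elements. `set_polymorphed_reg` and the explicit `regfile` argument are names introduced for the example, and sign-extension (the TODO above) is not handled.

```python
REG_BYTES = 8
NUM_REGS = 128


def new_regfile():
    """Flat byte-addressable model of 128 x 64-bit registers."""
    return bytearray(NUM_REGS * REG_BYTES)


def _span(reg, bitwidth, offset):
    assert bitwidth in (8, 16, 32, 64)
    nbytes = bitwidth // 8
    start = reg * REG_BYTES + offset * nbytes
    assert 0 <= start and start + nbytes <= NUM_REGS * REG_BYTES
    return start, nbytes


def get_polymorphed_reg(regfile, reg, bitwidth, offset):
    """Read element `offset` of width `bitwidth`, counting from `reg`."""
    start, nbytes = _span(reg, bitwidth, offset)
    return int.from_bytes(regfile[start:start + nbytes], "little")


def set_polymorphed_reg(regfile, reg, bitwidth, offset, val):
    """Write element `offset` of width `bitwidth`, counting from `reg`."""
    start, nbytes = _span(reg, bitwidth, offset)
    val &= (1 << bitwidth) - 1   # truncate to the element width
    regfile[start:start + nbytes] = val.to_bytes(nbytes, "little")


if __name__ == "__main__":
    rf = new_regfile()
    # 32-bit element 3 counting from r1 lands in the upper half of r2
    set_polymorphed_reg(rf, 1, 32, 3, 0xDEADBEEF)
    print(hex(get_polymorphed_reg(rf, 2, 64, 0)))   # 0xdeadbeef00000000
```

An element offset large enough to run past the end of one register simply continues into the next, which is the behaviour the union-with-open-arrays notation is meant to convey.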
In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
to fp127) are reinterpreted to be "starting points" in a byte-addressable
An example ADD operation with predication and element width overrides:
+```
for (i = 0; i < VL; i++)
if (predval & 1<<i) # predication
src1 = get_polymorphed_reg(RA, srcwid, irs1)
if (RT.isvec) { id += 1; }
if (RA.isvec) { irs1 += 1; }
if (RB.isvec) { irs2 += 1; }
+```
Thus it can be clearly seen that elements are packed by their
element width, and the packing starts from the source (or destination)
## Twin (implicit) result operations
Some operations in the Power ISA already target two 64-bit scalar
-registers: `lq` for example, and LD with update.
-Some mathematical algorithms are more
-efficient when there are two outputs rather than one, providing
-feedback loops between elements (the most well-known being add with
-carry). 64-bit multiply
-for example actually internally produces a 128 bit result, which clearly
-cannot be stored in a single 64 bit register. Some ISAs recommend
-"macro op fusion": the practice of setting a convention whereby if
-two commonly used instructions (mullo, mulhi) use the same ALU but
-one selects the low part of an identical operation and the other
-selects the high part, then optimised micro-architectures may
+registers: `lq` for example, and LD with update. Some mathematical
+algorithms are more efficient when there are two outputs rather than one,
+providing feedback loops between elements (the most well-known being add
+with carry). 64-bit multiply for example actually internally produces
+a 128 bit result, which clearly cannot be stored in a single 64 bit
+register. Some ISAs recommend "macro op fusion": the practice of setting
+a convention whereby if two commonly used instructions (mullo, mulhi) use
+the same ALU but one selects the low part of an identical operation and
+the other selects the high part, then optimised micro-architectures may
"fuse" those two instructions together, using Micro-coding techniques,
internally.
The practice and convention of macro-op fusion however is not compatible
-with SVP64 Horizontal-First, because Horizontal Mode may only
-be applied to a single instruction at a time, and SVP64 is based on
-the principle of strict Program Order even at the element
-level. Thus it becomes
-necessary to add explicit more complex single instructions with
-more operands than would normally be seen in the average RISC ISA
-(3-in, 2-out, in some cases). If it
-was not for Power ISA already having LD/ST with update as well as
-Condition Codes and `lq` this would be hard to justify.
-
-With limited space in the `EXTRA` Field, and Power ISA opcodes
-being only 32 bit, 5 operands is quite an ask. `lq` however sets
-a precedent: `RTp` stands for "RT pair". In other words the result
-is stored in RT and RT+1. For Scalar operations, following this
-precedent is perfectly reasonable. In Scalar mode,
-`maddedu` therefore stores the two halves of the 128-bit multiply
-into RT and RT+1.
-
-What, then, of `sv.maddedu`? If the destination is hard-coded to
-RT and RT+1 the instruction is not useful when Vectorised because
-the output will be overwritten on the next element. To solve this
-is easy: define the destination registers as RT and RT+MAXVL
-respectively. This makes it easy for compilers to statically allocate
-registers even when VL changes dynamically.
+with SVP64 Horizontal-First, because Horizontal Mode may only be applied
+to a single instruction at a time, and SVP64 is based on the principle of
+strict Program Order even at the element level. Thus it becomes necessary
+to add explicit, more complex single instructions with more operands than
+would normally be seen in the average RISC ISA (3-in, 2-out, in some
+cases). If it was not for Power ISA already having LD/ST with update as
+well as Condition Codes and `lq` this would be hard to justify.
+
+With limited space in the `EXTRA` Field, and Power ISA opcodes being only
+32 bit, 5 operands is quite an ask. `lq` however sets a precedent: `RTp`
+stands for "RT pair". In other words the result is stored in RT and RT+1.
+For Scalar operations, following this precedent is perfectly reasonable.
+In Scalar mode, `maddedu` therefore stores the two halves of the 128-bit
+multiply into RT and RT+1.
+
+What, then, of `sv.maddedu`? If the destination is hard-coded to RT and
+RT+1 the instruction is not useful when Vectorised because the output
+will be overwritten on the next element. To solve this is easy: define
+the destination registers as RT and RT+MAXVL respectively. This makes
+it easy for compilers to statically allocate registers even when VL
+changes dynamically.
Bear in mind that both RT and RT+MAXVL are starting points for Vectors,
and bear in mind that element-width overrides still have to be taken
-into consideration, the starting point for the implicit destination
-is best illustrated in pseudocode:
+into consideration, the starting point for the implicit destination is
+best illustrated in pseudocode:
+```
# demo of maddedu
for (i = 0; i < VL; i++)
if (predval & 1<<i) # predication
if (RA.isvec) { irs1 += 1; }
if (RB.isvec) { irs2 += 1; }
if (RC.isvec) { irs3 += 1; }
+```
The significant part here is that the second half is stored
starting not from RT+MAXVL at all: it is the *element* index
If VL is 3, MAXVL is 5, RT is 1, and dest elwidth is 32 then the elements
RT0 to RT2 are stored:
+```
LSB0: 63:32 31:0
MSB0: 0:31 32:63
r0 unchanged unchanged
r3 RT0.hi unchanged
r4 RT2.hi RT1.hi
r5 unchanged unchanged
+```
Note that all of the LO halves start from r1, but that the HI halves
-start from half-way into r3. The reason is that with MAXVL bring
-5 and elwidth being 32, this is the 5th element
-offset (in 32 bit quantities) counting from r1.
-
-*Programmer's note: accessing registers that have been placed
-starting on a non-contiguous boundary (half-way along a scalar
-register) can be inconvenient: REMAP can provide an offset but
-it requires extra instructions to set up. A simple solution
-is to ensure that MAXVL is rounded up such that the Vector
-ends cleanly on a contiguous register boundary. MAXVL=6 in
-the above example would achieve that*
-
-Additional DRAFT Scalar instructions in 3-in 2-out form
-with an implicit 2nd destination:
+start from half-way into r3. The reason is that with MAXVL being 5 and
+elwidth being 32, this is the 5th element offset (in 32 bit quantities)
+counting from r1.
+
+*Programmer's note: accessing registers that have been placed starting
+on a non-contiguous boundary (half-way along a scalar register) can
+be inconvenient: REMAP can provide an offset but it requires extra
+instructions to set up. A simple solution is to ensure that MAXVL is
+rounded up such that the Vector ends cleanly on a contiguous register
+boundary. MAXVL=6 in the above example would achieve that.*
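
The register placement in the VL=3, MAXVL=5, RT=1, elwidth=32 example can be checked with a small calculation. The sketch below assumes that the LO element `i` sits at element offset `i` from RT and the HI element `i` at element offset `MAXVL+i` from RT, with elements packed little-endian into 64-bit registers. The function names are hypothetical.

```python
def element_location(rt, elem_offset, elwidth, regwidth=64):
    """Return (register number, LSB0 bit offset) of a packed element."""
    bit = elem_offset * elwidth
    return rt + bit // regwidth, bit % regwidth


def maddedu_dest_layout(rt, vl, maxvl, elwidth):
    """Locations of the LO and HI result elements (assumed layout)."""
    lo = [element_location(rt, i, elwidth) for i in range(vl)]
    hi = [element_location(rt, maxvl + i, elwidth) for i in range(vl)]
    return lo, hi


if __name__ == "__main__":
    lo, hi = maddedu_dest_layout(rt=1, vl=3, maxvl=5, elwidth=32)
    print(lo)   # [(1, 0), (1, 32), (2, 0)]
    print(hi)   # [(3, 32), (4, 0), (4, 32)]
    # rounding MAXVL up to 6 moves the HI vector to a register boundary
    print(maddedu_dest_layout(rt=1, vl=3, maxvl=6, elwidth=32)[1])
    # [(4, 0), (4, 32), (5, 0)]
```

With MAXVL=5 the first HI element lands at bit 32 of r3, matching the table. With MAXVL=6 it lands at bit 0 of r4.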
+
+Additional DRAFT Scalar instructions in 3-in 2-out form with an implicit
+2nd destination:
* [[isa/svfixedarith]]
* [[isa/svfparith]]