the purposes of storing intermediary calculations. As these intermediary
results are Deterministically computed they may be useful.
Additionally, because the intermediate results are always written out
-it is possible to serve Precise Interrupts without affecting latency
+it is possible to service Precise Interrupts without affecting latency
(a common limitation of Vector ISAs).
# Basic principle
CR Fields may then be used as Predicate Masks to exclude those operations
with an Index exceeding VL-1.*
+## Parallel Reduction
+
+Vector Reduce Mode issues a deterministic tree-reduction schedule to the
+underlying micro-architecture. As with Scalar Reduction, the "Scalar Base"
+(Power ISA v3.0B) operation is leveraged, unmodified, to give the
+*appearance* and *effect* of Reduction.
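+
+The following Python-style sketch illustrates the *shape* of such a
+schedule (the function name and the exact pairing order are illustrative
+assumptions only: the normative Schedule is defined under "Parallel
+Reduction algorithm" in a later section):
+
+    # illustrative sketch only: yields (dest, src) element-index pairs
+    # for a binary-tree reduction over VL elements; the final result
+    # accumulates into element 0
+    def tree_reduce_schedule(vl):
+        step = 1
+        while step < vl:
+            step *= 2
+            for i in range(0, vl, step):
+                other = i + step // 2
+                if other < vl:
+                    # element[i] = op(element[i], element[other])
+                    yield (i, other)
+
+Each yielded pair is issued as an ordinary scalar operation, which is
+why the intermediate results appear in the destination Vector and why
+Precise Interrupts remain possible.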
+
+In Horizontal-First Mode, Vector-result reduction **requires**
+the destination to be a Vector, which will be used to store
+intermediary results.
+
+Given that the tree-reduction schedule is deterministic, Interrupts and
+exceptions can also be precise. The final result will be in the first
+non-predicate-masked-out destination element but, again because the
+schedule is deterministic, programmers may find uses for the
+intermediate results.
+
+When Rc=1 a corresponding Vector of co-resultant CRs is also
+created. No special action is taken: the result and its CR Field
+are stored "as usual", exactly as for all other SVP64 Rc=1 operations.
+
+Note that the Schedule only makes sense on top of certain instructions:
+X-Form with a Register Profile of `RT,RA,RB` is fine. As with Scalar
+Reduction, nothing is prohibited: the results of execution on an
+unsuitable instruction may simply not make sense. Unlike Scalar
+Reduction, however, many 3-input instructions (madd, fmadd) do not
+make sense here, but `ternlogi`, if used with care, would.
+
+**Parallel-Reduction with Predication**
+
+To avoid breaking the strict RISC paradigm, keeping the Issue-Schedule
+completely separate from the actual element-level (scalar) operations,
+Move operations are **not** included in the Schedule. This means that
+the Schedule leaves the final (scalar) result in the first non-masked
+element of the Vector used. With the predicate mask being dynamic
+(but deterministic), this result could be anywhere.
+
+If that result needs to be moved to a (single) scalar register
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction is
+needed to get it, where the predicate is the exact same predicate used
+in the prior Parallel-Reduction instruction.
+
+* If there was only a single bit set in the predicate then the result
+  will not have moved or been altered: it remains where it was in the
+  source vector prior to the Reduction.
+* If there was more than one bit set then the result will be in the
+  first element with a predicate bit set.
+
+In either case the result is in the element corresponding to the first
+set bit in the predicate mask, as the example below illustrates.
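+
+A hedged sketch of the idiom (the register numbers, VL, and the choice
+of `r3` as the predicate mask register are illustrative assumptions, and
+Parallel Reduction is assumed to already be engaged, as in the other
+examples in this section):
+
+    setvl VL=8
+    sv.add/m=r3 *16, *16, *16    # Predicated Parallel Reduction
+    sv.mv/sm=r3 4, *16           # copy the result element into scalar r4
+
+The same mask (`r3`) is used for both instructions, so the
+source-predicated `sv.mv` reads exactly the element in which the
+Reduction left its result.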
+
+For *some* implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
+of the values, compressing them (VREDUCE-style) into a contiguous block,
+which will guarantee that the result goes into the very first element
+of the destination vector, in which case clearly no follow-up
+vector-to-scalar MV operation is needed.
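+
+A hedged sketch of that alternative (the register numbers, the use of
+`r3` as the mask, and the method of adjusting VL are illustrative
+assumptions only):
+
+    # source-mask-only twin-predicated copy packs the active elements
+    # of r16..r23 contiguously into r24 onwards (VCOMPRESS-style)
+    sv.mv/sm=r3 *24, *16
+    # VL must then be set to the number of active elements (for example
+    # by taking a popcount of the mask and using setvl from a register)
+    # before an unpredicated Parallel Reduction leaves the result in r24
+    sv.add *24, *24, *24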
+
+**Usage conditions**
+
+The simplest usage is to perform an overwrite, specifying all three
+register operands the same.
+
+ setvl VL=6
+ sv.add *8, *8, *8
+
+The Reduction Schedule will issue the Parallel Tree Reduction spanning
+registers 8 through 13, by adjusting the offsets to RT, RA and RB as
+necessary (see "Parallel Reduction algorithm" in a later section).
+
+A non-overwrite is possible as well but, just as with the overwrite
+version, only those destination elements necessary for storing
+intermediary computations will be written to: the remaining elements
+will **not** be overwritten and will **not** be zeroed.
+
+ setvl VL=4
+ sv.add *0, *8, *8
+
+However, it is critical to note that if the source and destination are
+not the same then the trick of using a follow-up vector-to-scalar MV
+will not work.
+
+## Sub-Vector Horizontal Reduction
+
+Note that when SVM is clear and SUBVL!=1, a Parallel Reduction is
+performed across all of the *first* Subvector elements, followed by a
+separate, independent Parallel Reduction across all of the *second*
+Subvector elements, and so on.
+
+By contrast, when SVM is set and SUBVL!=1, a Horizontal
+Subvector mode is enabled, applying the Parallel Reduction
+Algorithm to the Subvector Elements. The Parallel Reduction
+is independently applied VL times, to each group of Subvector
+elements. Bear in mind that predication is never applied down
+into individual Subvector elements, but will be applied
+to select whether the *entire* Parallel Reduction on each
+group is performed or not.
+
+ for (i = 0; i < VL; i++)
+ if (predval & 1<<i) # predication
+ subvecparallelreduction(...)
+
+Note that as this is a Parallel Reduction, for best results
+it should be an overwrite operation, where the result for
+the Horizontal Reduction of each Subvector will be in the
+first Subvector element.
+Also note that use of Rc=1 is `UNDEFINED` behaviour.
+
+In essence what is happening here is that Structure Packing is being
+combined with Parallel Reduction. If the elements are laid out as a 2D
+matrix with each Subvector occupying a row, and Parallel Reduction is
+applied per row, then when `SVM` is **clear** the Matrix is first
+transposed (like Pack/Unpack) before the Parallel Reduction is still
+applied to each **row**.
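+
+A small Python-style sketch of the grouping (the function name is an
+illustrative assumption), using a flat element index of `i*SUBVL + j`
+for element `j` of Subvector `i`:
+
+    def reduction_groups(VL, SUBVL, SVM):
+        if SVM:  # Horizontal Subvector mode: one reduction per Subvector
+            return [[i*SUBVL + j for j in range(SUBVL)] for i in range(VL)]
+        else:    # SVM clear: one reduction per Subvector "column"
+            return [[i*SUBVL + j for i in range(VL)] for j in range(SUBVL)]
+
+    # VL=4, SUBVL=3, SVM set:   [[0,1,2], [3,4,5], [6,7,8], [9,10,11]]
+    # VL=4, SUBVL=3, SVM clear: [[0,3,6,9], [1,4,7,10], [2,5,8,11]]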
+
# REMAP area of SVSTATE
The following bits of the SVSTATE SPR are used for REMAP: