are stored "as usual" exactly as all other SVP64 Rc=1 operations.
Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine. Like Scalar
+X-Form with a Register Profile of `RT,RA,RB` is fine because the two sources
+and the destination are all the same type. Like Scalar
Reduction, nothing is prohibited:
the results of execution on an unsuitable instruction may simply
-not make sense. Many 3-input instructions (madd, fmadd) unlike Scalar
-Reduction in particular do not make sense, but `ternlogi`, if used
-with care, would.
+not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
+may be used.
Critical to note regarding use of Parallel-Reduction REMAP is that,
exactly as with Matrix Mode, the `svshape` instruction *requests*
a certain Vector Length (number of elements to reduce) and then
sets VL and MAXVL at the number of **operations** needed to be
-carried out. Thus, equally as importantly, the total number of operations
+carried out. Thus, equally importantly, as with Matrix REMAP,
+the total number of operations
is restricted to 127. Any Parallel-Reduction requiring more operations
will need to be done manually in batches.
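
The distinction between the number of elements and the number of operations can be illustrated with a minimal sketch. This is a generic in-place pairwise tree reduction, written under the assumption of no predication; it is not the normative SVP64 Schedule. The names `tree_reduce_schedule` and `apply_schedule` are hypothetical and chosen purely for illustration.

```python
from operator import add


def tree_reduce_schedule(n):
    """Yield (dest, src) element-index pairs for a generic in-place
    tree reduction over n elements (illustrative, not normative)."""
    step = 1
    while step < n:
        half = step
        step *= 2
        # Each pass combines element i with the element `half` positions on.
        for i in range(0, n, step):
            other = i + half
            if other < n:
                yield i, other


def apply_schedule(vec, schedule, op):
    """Apply the schedule in order: vec[dest] = op(vec[dest], vec[src])."""
    vec = list(vec)
    for dst, src in schedule:
        vec[dst] = op(vec[dst], vec[src])
    return vec


schedule = list(tree_reduce_schedule(6))
result = apply_schedule([1, 2, 3, 4, 5, 6], schedule, add)
print(len(schedule), schedule)
print(result)
```

In this sketch, reducing 6 elements takes 5 operations, and the final result lands in element 0. Elements that are never a destination keep their original values, so they are neither overwritten nor zeroed.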
Also important to note is that the Deterministic Schedule is arranged
-so that some implementations *may* parallelise it, as long as doing so
-respects Program Order and Register Hazards. Performance (speed)
+so that some implementations *may* parallelise it (as long as doing so
+respects Program Order and Register Hazards). Performance (speed)
of any given
implementation is neither strictly defined nor guaranteed. As with
the Vulkan(tm) Specification, strict compliance is paramount whilst
-performance is left to Implementors.
+performance is at the discretion of Implementors.
**Parallel-Reduction with Predication**
The simplest usage is to perform an overwrite, specifying all three
register operands the same.
- setvl VL=6
+ svshape parallelreduce, 6
sv.add *8, *8, *8
The Reduction Schedule will issue the Parallel Tree Reduction spanning
intermediary computations will be written to: the remaining elements
will **not** be overwritten and will **not** be zero'd.
- setvl VL=4
+ svshape parallelreduce, 6
sv.add *0, *8, *8
However it is critical to note that if the source and destination are
the Matrix is transposed (like Pack/Unpack)
before still applying the Parallel Reduction to the **row**.
+# Determining Register Hazards
+
+For high-performance (Multi-Issue, Out-of-Order) systems it is critical
+to be able to statically determine the extent of Vectors in order to
+allocate pre-emptive Hazard protection. The next task is to eliminate
+masked-out elements using predicate bits, freeing up the associated
+Hazards.
+
+For non-REMAP situations `VL` is sufficient to ascertain early
+Hazard coverage, and with SVSTATE being a high priority cached
+quantity at the same level as MSR and PC this is not a problem.
+
+The problems come when REMAP is enabled. Indexed REMAP must instead
+use `MAXVL` as the earliest (simplest)
+batch-level Hazard Reservation indicator,
+but Matrix, FFT and Parallel Reduction must all use completely different
+schemes. The reason is that VL is used to step through the total
+number of *operations*, not the number of registers. The "Saving Grace"
+is that all of the REMAP Schedules are Deterministic.
+
+Advance-notice Parallel computation and subsequent caching
+of all of these complex Deterministic REMAP Schedules is
+*strongly recommended*, thus allowing clear and precise multi-issue
+batched Hazard coverage to be deployed, *even for Indexed Mode*.
+This is only possible for Indexed due to the strict guidelines
+given to Programmers.
+
+In short, there exist solutions to the problem of Hazard Management,
+with varying degrees of refinement possible at correspondingly
+increasing levels of complexity in hardware.
+
# REMAP area of SVSTATE
The following bits of the SVSTATE SPR are used for REMAP: