present require direct writing to SVSHAPE0-3 SPRs. Additional
REMAP Modes may also be introduced at that time.*
-There are four types of REMAP:
+There are five types of REMAP:
* **Matrix**, also known as 2D and 3D reshaping, can perform in-place
Matrix transpose and rotate (a conceptual sketch follows this list). The
Shapes are set up for an "Outer Product"
are cacheable from the point at which the `svindex` instruction
is executed.
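
To make the 2D reshaping concrete, here is a conceptual sketch of the
element-offset sequence walked by a transpose of a row-major `ydim` x `xdim`
Matrix (the function name is invented purely for illustration; this is not
the normative Shape pseudocode):

```python
def transpose_offsets(xdim, ydim):
    """Walk a row-major ydim x xdim matrix in transposed
    (column-major) order, yielding linear element offsets."""
    for x in range(xdim):          # new row    = old column
        for y in range(ydim):      # new column = old row
            yield y * xdim + x     # offset of element (y, x)
```

Feeding such an offset sequence into the element loop, in place of the
default 0,1,2,... ordering, is the essence of Matrix REMAP.
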
-Parallel Reduction is unusual in that it requires a full vector array
-of results (not a scalar) and uses the rest of the result Vector for
-the purposes of storing intermediary calculations. As these intermediary
-results are Deterministically computed they may be useful.
-Additionally, because the intermediate results are always written out
-it is possible to service Precise Interrupts without affecting latency
-(a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions, because their Architectural State cannot
-hold the partial results).
-
-*Hardware Architectural note: with the Scheduling applying as a Phase between
-Decode and Issue in a Deterministic fashion the Register Hazards may be
-easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-effect. Even an In-Order system may observe that for large Outer Product
-Schedules there will be no stalls, but if the Matrices are particularly
-small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V. Thus: regardless
-of the Micro-Architecture the Hardware Engineer should first consider
-how best to process the exact same equivalent loop-unrolled instruction
-stream.*
-
## Horizontal-Parallelism Hint
`SVSTATE.hphint` is an indicator to hardware of how many elements are 100%
fully independent of each other.
variants (`psvshape`) which provide more comprehensive capacity and
mitigate the need to write direct to the SVSHAPE SPRs.
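
As a conceptual model only (the function names are invented, and the exact
hazard-freedom guarantee is assumed here purely for illustration), hardware
could exploit the hint by issuing aligned batches of `hphint` elements fully
in parallel:

```python
def issue_remapped_elements(vl, hphint, issue_batch):
    """Issue VL elements in aligned batches of hphint, relying on the
    hint that elements within a batch are mutually independent
    (no overlapping register Hazards) and may execute in parallel."""
    i = 0
    while i < vl:
        batch = list(range(i, min(i + hphint, vl)))
        issue_batch(batch)   # safe to dispatch the whole batch at once
        i += hphint

# example: ten elements, hinted batches of four
issue_remapped_elements(10, 4, print)   # [0..3], [4..7], [8, 9]
```
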
+*Hardware Architectural note: with the Scheduling applied as a Phase between
+Decode and Issue in a Deterministic fashion, the Register Hazards may be
+easily computed and a standard Out-of-Order Micro-Architecture exploited to
+good effect. Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are of particularly
+small size an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus: regardless
+of the Micro-Architecture, the Hardware Engineer should first consider
+how best to process the equivalent loop-unrolled instruction
+stream.*
+
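To illustrate what processing "the equivalent loop-unrolled instruction
stream" means in practice, here is a sketch of a tiny Outer-Product Schedule
written out explicitly (the names and the loop-nesting order are illustrative,
not normative):

```python
def outer_product_schedule(xd, yd, zd):
    """Yield (i, j, k) triples of a triple-loop Matrix-multiply
    schedule: the element operations generated from one
    vectorised multiply-add under an Outer-Product Shape."""
    for k in range(zd):
        for j in range(yd):
            for i in range(xd):
                yield i, j, k

# the loop-unrolled stream for a 2x2x2 multiply: each triple is one
# scalar "C[i][j] += A[i][k] * B[k][j]" operation, and only operations
# sharing the same (i, j) accumulator have a Register Hazard
for i, j, k in outer_product_schedule(2, 2, 2):
    print(f"madd C[{i}][{j}] += A[{i}][{k}] * B[{k}][{j}]")
```

With this ordering any given accumulator `C[i][j]` is revisited only every
`xd*yd` operations, which is why large Shapes keep even an In-Order pipeline
busy whilst very small Matrices force it to stall.
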
### FFT/DCT Triple Loop
DCT and FFT are some of the most widely used algorithms in Computer Science.
Due to the deterministic schedule, programmers may find uses for the
intermediate results, even for non-commutative Defined Word operations.
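
For reference, the "triple loop" referred to in the heading above is, in the
radix-2 in-place FFT case, the classic three nested loops over stages, blocks
and butterflies. A conceptual sketch (invented names, not the normative REMAP
pseudocode):

```python
def fft_butterfly_pairs(n):
    """Yield the (lower, upper) element-pair indices of an in-place
    radix-2 FFT: outer loop over stages, middle loop over blocks,
    inner loop over butterflies within a block."""
    half = 1
    while half < n:                         # stage loop
        step = half * 2
        for start in range(0, n, step):     # block loop
            for k in range(half):           # butterfly loop
                yield start + k, start + k + half
        half = step
```

Schedules of this general form are what the FFT/DCT REMAP generates, allowing
a single vectorised butterfly operation to perform the transform in-place.
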
+Parallel Reduction is unusual in that it requires a full vector array
+of results (not a scalar) and uses the rest of the result Vector for
+storing intermediary calculations. As these intermediary results are
+Deterministically computed, they may be useful in their own right.
+Additionally, because the intermediate results are always written out,
+it is possible to service Precise Interrupts without affecting latency
+(a common limitation of Vector ISAs implementing explicit
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
+
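A minimal model of such a schedule (a generic binary-tree reduction, not the
normative SVP64 Schedule; the names are invented for illustration) shows how
the intermediary results remain architecturally visible in the result Vector:

```python
def tree_reduce_in_place(vec, op):
    """Deterministic binary-tree reduction: every partial result is
    written back into the vector itself, so a Precise Interrupt may be
    serviced at any point and the partial state simply re-read on
    resumption."""
    vl = len(vec)
    step = 1
    while step < vl:
        for i in range(0, vl, step * 2):
            j = i + step
            if j < vl:
                vec[i] = op(vec[i], vec[j])   # partial stays in vec[i]
        step *= 2
    return vec[0]                             # final result

# example: sum-reduction; vec afterwards still holds the partial sums
vec = [1, 2, 3, 4, 5]
total = tree_reduce_in_place(vec, lambda a, b: a + b)   # 15
```
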
When Rc=1, a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result *and its CR Field*
are stored "as usual", exactly as for all other SVP64 Rc=1 operations.
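
As a sketch of what each element of that co-resultant CR Vector contains
(the usual signed comparison of the element result against zero, plus SO;
the helper name and packing below are purely illustrative):

```python
def crfield_from_result(result, so=0, xlen=64):
    """Return the 4-bit CR Field (LT, GT, EQ, SO) that an Rc=1 operation
    produces for one element result, interpreted as a signed XLEN-bit
    integer. LT is the most significant bit (MSB0 ordering)."""
    if result & (1 << (xlen - 1)):            # sign-extend
        result -= 1 << xlen
    lt = int(result < 0)
    gt = int(result > 0)
    eq = int(result == 0)
    return (lt << 3) | (gt << 2) | (eq << 1) | so

# one CR Field per element result, stored alongside the result Vector
results = [5, 0, (1 << 64) - 3]               # last value is -3 in 2's complement
cr_fields = [crfield_from_result(r) for r in results]   # [0b0100, 0b0010, 0b1000]
```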