From 76faba51f84c5a891bc74f860e2090108b31b1e7 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Fri, 19 May 2023 18:19:16 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 44 ++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index ef9d4be37..0d5c6eacc 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -60,7 +60,7 @@ present in the 32-bit variants of the REMAP Management instructions that at
 present require direct writing to SVSHAPE0-3 SPRs.  Additional
 REMAP Modes may also be introduced at that time.*
 
-There are four types of REMAP:
+There are five types of REMAP:
 
 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
   Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
@@ -90,27 +90,6 @@ Architectural State is permitted to assume that the Indices
 are cacheable from the point at which the `svindex` instruction
 is executed.
 
-Parallel Reduction is unusual in that it requires a full vector array
-of results (not a scalar) and uses the rest of the result Vector for
-the purposes of storing intermediary calculations.  As these intermediary
-results are Deterministically computed they may be useful.
-Additionally, because the intermediate results are always written out
-it is possible to service Precise Interrupts without affecting latency
-(a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions, because their Architectural State cannot
-hold the partial results).
-
-*Hardware Architectural note: with the Scheduling applying as a Phase between
-Decode and Issue in a Deterministic fashion the Register Hazards may be
-easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-effect.  Even an In-Order system may observe that for large Outer Product
-Schedules there will be no stalls, but if the Matrices are particularly
-small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V. Thus: regardless
-of the Micro-Architecture the Hardware Engineer should first consider
-how best to process the exact same equivalent loop-unrolled instruction
-stream.*
-
 ## Horizontal-Parallelism Hint
 
 `SVSTATE.hphint` is an indicator to hardware of how many elements are 100%
@@ -178,6 +157,17 @@ instructions. Future versions of SVP64 will include EXT1xx prefixed
 variants (`psvshape`) which provide more comprehensive capacity and
 mitigate the need to write direct to the SVSHAPE SPRs.
 
+*Hardware Architectural note: with the Scheduling applied as a Phase between
+Decode and Issue in a Deterministic fashion, the Register Hazards may be
+easily computed and a standard Out-of-Order Micro-Architecture exploited to good
+effect.  Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are particularly
+small an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus: regardless
+of the Micro-Architecture the Hardware Engineer should first consider
+how best to process the exact same equivalent loop-unrolled instruction
+stream.*
+
 ### FFT/DCT Triple Loop
 
 DCT and FFT are some of the most astonishingly used algorithms in
@@ -287,6 +277,16 @@ non-predicate-masked-out destination element, but due again to
 the deterministic schedule programmers may find uses for the intermediate
 results, even for non-commutative Defined Word operations.
 
+Parallel Reduction is unusual in that it requires a full vector array
+of results (not a scalar) and uses the rest of the result Vector for
+the purposes of storing intermediary calculations.  As these intermediary
+results are Deterministically computed, they may be useful.
+Additionally, because the intermediate results are always written out,
+it is possible to service Precise Interrupts without affecting latency
+(a common limitation of Vector ISAs implementing explicit
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
+
 When Rc=1 a corresponding Vector of co-resultant CRs is also
 created.  No special action is taken: the result *and its CR Field*
 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
-- 
2.30.2