(no commit message)

author lkcl <lkcl@web>

Fri, 19 May 2023 17:19:16 +0000 (18:19 +0100)

committer IkiWiki <ikiwiki.info>

Fri, 19 May 2023 17:19:16 +0000 (18:19 +0100)
diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn

index ef9d4be37602c3d7297b8213032e92ec695988d1..0d5c6eacc74a9618f4c08589487052dc247a0551 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -60,7 +60,7 @@ present in the 32-bit variants of the REMAP Management instructions that at
  present require direct writing to SVSHAPE0-3 SPRs.  Additional
  REMAP Modes may also be introduced at that time.*
  
-There are four types of REMAP:
+There are five types of REMAP:
  
  * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
    Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
@@ -90,27 +90,6 @@ Architectural State is permitted to assume that the Indices
  are cacheable from the point at which the `svindex` instruction
  is executed.
  
-Parallel Reduction is unusual in that it requires a full vector array
-of results (not a scalar) and uses the rest of the result Vector for
-the purposes of storing intermediary calculations.  As these intermediary
-results are Deterministically computed they may be useful.
-Additionally, because the intermediate results are always written out
-it is possible to service Precise Interrupts without affecting latency
-(a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions, because their Architectural State cannot
-hold the partial results).
-
-*Hardware Architectural note: with the Scheduling applying as a Phase between
-Decode and Issue in a Deterministic fashion the Register Hazards may be
-easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-effect.  Even an In-Order system may observe that for large Outer Product
-Schedules there will be no stalls, but if the Matrices are particularly
-small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V. Thus: regardless
-of the Micro-Architecture the Hardware Engineer should first consider
-how best to process the exact same equivalent loop-unrolled instruction
-stream.*
-
  ## Horizontal-Parallelism Hint
  
  `SVSTATE.hphint` is an indicator to hardware of how many elements are 100%
@@ -178,6 +157,17 @@ instructions. Future versions of SVP64 will include EXT1xx prefixed
  variants (`psvshape`) which provide more comprehensive capacity and
  mitigate the need to write direct to the SVSHAPE SPRs.
  
+*Hardware Architectural note: with the Scheduling applying as a Phase between
+Decode and Issue in a Deterministic fashion the Register Hazards may be
+easily computed and a standard Out-of-Order Micro-Architecture exploited to good
+effect.  Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are particularly
+small size an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus: regardless
+of the Micro-Architecture the Hardware Engineer should first consider
+how best to process the exact same equivalent loop-unrolled instruction
+stream.*
+
  ### FFT/DCT Triple Loop
  
  DCT and FFT are some of the most astonishingly used algorithms in
@@ -287,6 +277,16 @@ non-predicate-masked-out destination element, but due again to
  the deterministic schedule programmers may find uses for the intermediate
  results, even for non-commutative Defined Word operations.
  
+Parallel Reduction is unusual in that it requires a full vector array
+of results (not a scalar) and uses the rest of the result Vector for
+the purposes of storing intermediary calculations.  As these intermediary
+results are Deterministically computed they may be useful.
+Additionally, because the intermediate results are always written out
+it is possible to service Precise Interrupts without affecting latency
+(a common limitation of Vector ISAs implementing explicit
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
+
  When Rc=1 a corresponding Vector of co-resultant CRs is also
  created.  No special action is taken: the result *and its CR Field*
  are stored "as usual" exactly as all other SVP64 Rc=1 operations.
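
Reviewer note on the relocated Parallel Reduction paragraph: the claim that intermediate results live in the result Vector, and that this lets an interrupt be serviced mid-reduction, can be illustrated with a small sketch. This is a generic pairwise tree reduction written for illustration only; it is not the normative REMAP schedule, and the names `reduction_schedule` and `apply_steps` are hypothetical. It assumes no predicate masking.

```python
from operator import add


def reduction_schedule(vl):
    """Yield (dst, src) index pairs for a pairwise tree reduction.

    Illustrative only: a generic deterministic schedule, assumed here
    for the sketch, not the schedule defined by the specification.
    """
    step = 1
    while step < vl:
        step *= 2
        for dst in range(0, vl, step):
            src = dst + step // 2
            if src < vl:
                yield dst, src


def apply_steps(vec, op, steps):
    """Apply each step in place: partial results stay in the vector."""
    for dst, src in steps:
        vec[dst] = op(vec[dst], vec[src])
    return vec


steps = list(reduction_schedule(7))

# Uninterrupted run: the final result lands in element 0, and the
# other elements hold the intermediate partial sums.
whole = apply_steps([1, 2, 3, 4, 5, 6, 7], add, steps)

# "Interrupted" run: stop after three steps, then resume later.
# No extra state is needed beyond the vector and the step index.
split = [1, 2, 3, 4, 5, 6, 7]
apply_steps(split, add, steps[:3])
apply_steps(split, add, steps[3:])
```

Because the schedule is a fixed list of index pairs and every partial result is written back to the vector, resuming needs only the position in the schedule, which is the property the paragraph attributes to keeping intermediates in Architectural State.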