From 76faba51f84c5a891bc74f860e2090108b31b1e7 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 19 May 2023 18:19:16 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 44 ++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index ef9d4be37..0d5c6eacc 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -60,7 +60,7 @@ present in the 32-bit variants of the REMAP Management instructions that at
 present require direct writing to SVSHAPE0-3 SPRs. Additional
 REMAP Modes may also be introduced at that time.*
 
-There are four types of REMAP:
+There are five types of REMAP:
 
 * **Matrix**, also known as 2D and 3D reshaping, can perform in-place
   Matrix transpose and rotate. The Shapes are set up for an "Outer Product"
@@ -90,27 +90,6 @@ Architectural State is permitted to assume that the Indices
 are cacheable from the point at which the `svindex` instruction
 is executed.
 
-Parallel Reduction is unusual in that it requires a full vector array
-of results (not a scalar) and uses the rest of the result Vector for
-the purposes of storing intermediary calculations. As these intermediary
-results are Deterministically computed they may be useful.
-Additionally, because the intermediate results are always written out
-it is possible to service Precise Interrupts without affecting latency
-(a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions, because their Architectural State cannot
-hold the partial results).
-
-*Hardware Architectural note: with the Scheduling applying as a Phase between
-Decode and Issue in a Deterministic fashion the Register Hazards may be
-easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-effect. Even an In-Order system may observe that for large Outer Product
-Schedules there will be no stalls, but if the Matrices are particularly
-small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V. Thus: regardless
-of the Micro-Architecture the Hardware Engineer should first consider
-how best to process the exact same equivalent loop-unrolled instruction
-stream.*
-
 ## Horizontal-Parallelism Hint
 
 `SVSTATE.hphint` is an indicator to hardware of how many elements are 100%
@@ -178,6 +157,17 @@ instructions. Future versions of SVP64 will include EXT1xx prefixed
 variants (`psvshape`) which provide more comprehensive capacity and
 mitigate the need to write direct to the SVSHAPE SPRs.
 
+*Hardware Architectural note: with the Scheduling applying as a Phase between
+Decode and Issue in a Deterministic fashion the Register Hazards may be
+easily computed and a standard Out-of-Order Micro-Architecture exploited to good
+effect. Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are particularly
+small size an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus: regardless
+of the Micro-Architecture the Hardware Engineer should first consider
+how best to process the exact same equivalent loop-unrolled instruction
+stream.*
+
 ### FFT/DCT Triple Loop
 
 DCT and FFT are some of the most astonishingly used algorithms in
@@ -287,6 +277,16 @@ non-predicate-masked-out destination element, but due again to the
 deterministic schedule programmers may find uses for the intermediate
 results, even for non-commutative Defined Word operations.
 
+Parallel Reduction is unusual in that it requires a full vector array
+of results (not a scalar) and uses the rest of the result Vector for
+the purposes of storing intermediary calculations. As these intermediary
+results are Deterministically computed they may be useful.
+Additionally, because the intermediate results are always written out
+it is possible to service Precise Interrupts without affecting latency
+(a common limitation of Vector ISAs implementing explicit
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
+
 When Rc=1 a corresponding Vector of co-resultant CRs is also created.
 No special action is taken: the result *and its CR Field* are stored
 "as usual" exactly as all other SVP64 Rc=1 operations.
-- 
2.30.2