From: lkcl <lkcl@web>
Date: Wed, 9 Jun 2021 16:47:58 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: DRAFT_SVP64_0_1~782
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=c6d1b2609282545b297ecaa1686cf1fc443034ba;p=libreriscv.git

---

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index ed9cdf7d5..c40d3ef96 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -217,7 +217,8 @@ or a MIN/MAX operation) it may be possible to parallelise the reduction.
 
 ## Scalar result reduce mode
 
-In this mode, one register is identified as being the "accumulator".
+In this mode, which is suited to operations involving carry or overflow,
+one register is identified as being the "accumulator".
 Scalar reduction is thus categorised by:
 
 * One of the sources is a Vector
@@ -264,12 +265,36 @@ the scalar destination register **MUST** be updated with the current
 (intermediate) result, because this is how ```Program Order``` is
 preserved (Vector Loops are to be considered to be just another instruction
 being executed in Program Order).  In this way, after return from interrupt,
-the scalar mapreduce may continue where it left off.
+the scalar mapreduce may continue where it left off.  This provides
+"precise" exception behaviour.
+
+Note that hardware is perfectly permitted to perform multi-issue
+parallel optimisation of the scalar reduce operation: it's just that
+as far as the user is concerned, all exceptions and interrupts **MUST**
+be precise.
 
 ## Vector result reduce mode
 
+Vector result reduce mode may utilise the destination vector for
+the purposes of storing intermediary results.  Interrupts and exceptions
+can therefore also be precise.  The result will be in the first
+destination element that is not masked out by the predicate.  Note that
+unlike Scalar reduce mode, Vector reduce
+mode is *not* suited to operations which involve carry or overflow.
+
+Programs **MUST NOT** rely on the contents of the intermediate results:
+they may change from hardware implementation to hardware implementation.
+Some implementations may perform an incremental update, whilst others
+may choose to use the available Vector space for a binary tree reduction.
+If an incremental Vector is required (```x[i] = x[i-1] + y[i]```) then
+a *straight* SVP64 Vector instruction can be issued, where the source and
+destination registers overlap: ```sv.add 1.v, 9.v, 2.v```. Because
+respecting ```Program Order``` is mandatory in SVP64, hardware **MUST**
+detect this case and issue an incremental sequence of scalar
+element instructions.
+
 1. limited to single predicated dual src operations (add RT, RA, RB).
-   triple source operations are prohibited (fma).
+   triple source operations are prohibited (such as fma).
 2. limited to operations that make sense.  divide is excluded, as is
    subtract (X - Y - Z produces different answers depending on the order)
    and asymmetric CRops (crandc, crorc). sane  operations:
@@ -298,17 +323,6 @@ the scalar mapreduce may continue where it left off.
    unaltered (not used for the purposes of intermediary storage); the
    scalar result is placed in the first available unmasked element.
 
-Note: Programs **MUST NOT** rely on the contents of the intermediate results:
-they may change from hardware implementation to hardware implementation.
-Some implementations may perform an incremental update, whilst others
-may choose to use the available Vector space for a binary tree reduction.
-If an incremental Vector is required (```x[i] = x[i-1] + y[i]```) then
-a *straight* SVP64 Vector instruction can be issued, where the source and
-destination registers overlap: ```sv.add 1.v, 9.v, 2.v```. Due to
-respecting ```Program Order``` being mandatory in SVP64, hardware should
-and must detect this case and issue an incremental sequence of scalar
-element instructions.
-
 Pseudocode for the case where RA==RB:
 
     result = op(iregs[RA], iregs[RA+1])