From 992155539a8434dbe62c91f42cf608d3263fd9cf Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Mon, 26 Oct 2020 01:58:56 +0000
Subject: [PATCH]

---
 openpower/openpower/sv/predication.mdwn | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/openpower/openpower/sv/predication.mdwn b/openpower/openpower/sv/predication.mdwn
index 819735379..4ecc61ca6 100644
--- a/openpower/openpower/sv/predication.mdwn
+++ b/openpower/openpower/sv/predication.mdwn
@@ -49,7 +49,7 @@ a big advantage of this is that unpredicated operations just set the predicate t
 This idea has several disadvantages.
 
 * the single DM entry for the entire 64 bits creates a read hazard that has to be resolved through the addition of a special Shadowing Function Unit.  Only when the entire predicate is available can the die-cancel/ok be pulled on the FU elements each bit covers
-* this situation is exacerbated if one vector creates a predicate mask that is then used to mask immediately following instructions.  Ordinarily, Cray-styke "chaining" would be possible.  The single DM entry for the entire predicate mask prohibits this.
+* this situation is exacerbated if one vector operation creates a predicate mask that is then used to mask the immediately following instructions.  Ordinarily (i.e. without the predicate involved), Cray-style "chaining" would be possible.  The single DM entry for the entire predicate mask prohibits this, because the subsequent operations can only proceed once the *entire* mask has been computed.
 * Allocation of bits to FUs gets particularly complex for SIMD (elwidth overrides) requiring shift and mask logic that is simply not needed compared to "one-for-one" schemes (above)
 
 Overall there is very little in favour of this concept.
@@ -78,4 +78,6 @@ Not only that but it is even more complex when trying to bring in virtual regist
 
 Out-of-order systems, to be effective, require several operations to be "in-flight" (POWER10 has up to 1,000 in-flight instructions) and if every predicated vector operation needed one 8-chunked scalar register each it becomes exceedingly complex very quickly.
 
+Worse still, when the mask is computed from a vector "compare", it is hard even to think through how the groupings would be implemented, which is itself a bad sign.  It is suspected that chaining will be complex, or adversely affected by certain combinations of element width.
+
 Overall this idea which initially seems to save resources brings together all the least favourable aspects of other proposals and combines all of them!
-- 
2.30.2