This idea has several disadvantages.
* the single DM entry for the entire 64 bits creates a read hazard that has to be resolved through the addition of a special Shadowing Function Unit. Only when the entire predicate is available can the die-cancel/ok be pulled on the FU elements each bit covers
-* this situation is exacerbated if one vector creates a predicate mask that is then used to mask immediately following instructions. Ordinarily, Cray-styke "chaining" would be possible. The single DM entry for the entire predicate mask prohibits this.
+* this situation is exacerbated if one vector creates a predicate mask that is then used to mask immediately following instructions. Ordinarily (i.e. without the predicate involved), Cray-style "chaining" would be possible. The single DM entry for the entire predicate mask prohibits this because the subsequent operations can only proceed when the *entire* mask has been computed.
* Allocation of bits to FUs gets particularly complex for SIMD (elwidth overrides), requiring shift and mask logic that the "one-for-one" schemes (above) simply do not need.
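
The loss of chaining described in the list above can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not a description of any actual implementation: the function name `finish_cycles` is hypothetical, the compare is assumed to produce one mask bit per cycle, and the dependent masked operation is assumed to run on a single pipeline that accepts one element per cycle with a one-cycle latency.

```python
def finish_cycles(n_elements, single_entry):
    """Return the cycle in which each dependent element operation finishes.

    Toy model only. Assumptions (not from the original text):
    - the vector compare writes mask bit i at the end of cycle i + 1
    - the dependent pipeline issues one element per cycle, latency 1
    """
    bit_ready = [i + 1 for i in range(n_elements)]
    # With a single DM entry, nothing may issue until the last bit exists.
    gate = max(bit_ready)
    finish = []
    next_free = 0  # first cycle in which the dependent pipeline is free
    for i in range(n_elements):
        # One-for-one tracking lets element i go as soon as bit i is ready.
        ready = gate if single_entry else bit_ready[i]
        start = max(ready, next_free)
        next_free = start + 1
        finish.append(start + 1)
    return finish


chained = finish_cycles(8, single_entry=False)
blocked = finish_cycles(8, single_entry=True)
print(chained)  # [2, 3, 4, 5, 6, 7, 8, 9]
print(blocked)  # [9, 10, 11, 12, 13, 14, 15, 16]
```

In this model the per-bit scheme overlaps the dependent operation with the compare, whereas the single-entry scheme serialises the two, so the last element finishes at cycle 16 instead of cycle 9.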
Overall there is very little in favour of this concept.
Out-of-order systems, to be effective, require several operations to be "in-flight" (POWER10 has up to 1,000 in-flight instructions). If every predicated vector operation needs its own 8-chunked scalar register, the design becomes exceedingly complex very quickly.
+Worse still, when the mask is computed from a vector "compare", it is hard even to reason about how the groupings would be implemented, which is itself a bad sign. It is suspected that chaining will be complex, or adversely affected, for certain combinations of element width.
+
Overall this idea, which initially seems to save resources, combines the least favourable aspects of all the other proposals!