-Implementation note: even in in-order microarchitectures it is strongly adviseable to use byte-level write-enable lines on the register file. This in combination with 8b-bit SIMD element overrides allows, in "non-zeroing" mode, the predicate mask to be directly ANDed with the regfile write-enable lines to achieve the required functionality. The alternative is to perform a READ-MODIFY-MASK-WRITE cycle which is costly and compromises performance. Avoided very simply with byte-level write-enable.
+Implementation note: even in in-order microarchitectures it is strongly adviseable to use byte-level write-enable lines on the register file. This in combination with 8-bit SIMD element overrides allows, in "non-zeroing" mode, the predicate mask to be directly ANDed with the regfile write-enable lines to achieve the required functionality. The alternative is to perform a READ-MODIFY-MASK-WRITE cycle which is costly and compromises performance. Avoided very simply with byte-level write-enable.
+
+## General implications and considerations
+
+### OE=1 and SO
+
+XER.SO (sticky overflow) is known to cause massive slowdown in pretty much every microarchitecture and it definitely compromises the performance of out-of-order systems. The reason is that it introduces a READ-MODIFY-WRITE cycle between XER.SO and CR0 (which contains a copy of the SO field after inclusion of the overflow). The result and source registers branch off as RaW and WaR hazards from this RMW chain.
+
+This is even before predication or vectorisation were to be added on top, i.e. these are existing weaknesses in OpenPOWER as a scalar ISA.
+
+As well-known weaknesses that compromise performance, very little use of OE=1 is actually made, outside of unit tests and Conformance Tests. Consequently it makes very little sense to continue to propagate OE=1 in the Vectorisation context of SV.
+
+### Vector Chaining
+
+(see [[masked_vector_chaining]])
+
+One of the design principles of SV is that the use of VL should be as closrly equivalent to a direct substitution of the scalar operations of the hardware for-loop as possible, as if those looped operations were actually in the instruction stream rather than being issued from the Vector loop.
+
+The implications here are that *register dependency hazards still have to be respected inter-element*.
+
+Using a multi-issue out-of-order engine as the underlying microarchitectural basis this is not as difficult to achieve as it first seems (the hard work habing been done by the Dependency Matrices). In addition, Vector Chaining should also be possible for a multi-issue out-of-order rngine to cope with, as long as false (unnecessary) Dependency Hazards are not introduced in between Vectors, where the dependencies actually only exist between elements *in* the Vector.
+
+The concept of recognising that it is the elements within the Vector that have Dependency Hazards rather than the Vectors themselves is what permits Cray-style "chaining".
+
+This "false/unnecessary hazard" condition eliminates and/or compromises the performance or drives up resource utilisation in at least two of the proposals below.