(no commit message)

[libreriscv.git] / openpower / sv / predication.mdwn
diff --git a/openpower/sv/predication.mdwn b/openpower/sv/predication.mdwn

index c4bc0897fa551622fd819b1d34dcb18688e8a459..31eecfafc595f26dce6107910eb047e333257ca9 100644 (file)
--- a/openpower/sv/predication.mdwn
+++ b/openpower/sv/predication.mdwn
@@ -1,6 +1,7 @@
  # TODO ideas
  
-<https://bugs.libre-soc.org/show_bug.cgi?id=213>
+<https://bugs.libre-soc.org/show_bug.cgi?id=527>
+
  
  * idea 1: modify cmp (and other CR generators?) with qualifiers that
    create single bit prefix vector into int reg
@@ -16,6 +17,7 @@
    - small and large out-of-order 
    - in-order
    - FSM (0.3 IPC or below)
+  - single or multi-issue
  * must not compromise or penalise any microarchitectural performance
  * must cover up to 64 elements
  * must still work for elwidth over-rides
@@ -25,7 +27,7 @@
  * two modes, "zeroing" and "non-zeroing". zeroing mode places a zero in the masked-out element results, whereas non-zeroing leaves the destination (result) element unmodified.
  * predicate must be invertible via an opcode bit (to avoid the need for an instruction which inverts all bits of the predicate mask)
  
-Implementation note: even in in-order microarchitectures it is strongly adviseable to use byte-level write-enable lines on the register file.  This in combination with 8-bit SIMD element overrides allows, in "non-zeroing" mode, the predicate mask can very simply be directly ANDed with the regfile write-enable lines to achieve the required functionality of leaving masked-out elements unmodified.  The alternative is to perform a READ-MODIFY-MASK-WRITE cycle which is costly and compromises performance.  Avoided very simply with byte-level write-enable.
+Implementation note: even in in-order microarchitectures it is strongly advisable to use byte-level write-enable lines on the register file.  In combination with 8-bit SIMD element overrides, this allows the predicate mask, in "non-zeroing" mode, to be directly ANDed with the regfile write-enable lines, which achieves the required functionality of leaving masked-out elements unmodified, right down to the 8-bit element level.  The alternative is to perform a READ-MODIFY-MASK-WRITE cycle, which is costly and compromises performance.  Byte-level write-enable avoids that cycle very simply.
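
The implementation note above can be illustrated with a small behavioural model. This is a minimal sketch, not taken from the Libre-SOC code: the function name `regfile_write`, the `LANES` constant, and the assumption of one predicate bit per 8-bit element are all hypothetical, chosen only to show how the predicate mask can gate byte-level write-enable lines in "non-zeroing" mode, and how "zeroing" mode differs.

```python
LANES = 8  # assumed: eight byte lanes in one 64-bit register


def regfile_write(old, result, pred_mask, zeroing=False):
    """Model a single 64-bit register write with byte-level write-enable.

    old       -- current 64-bit register contents
    result    -- 64-bit result produced by the (8-bit element) SIMD ALU
    pred_mask -- one predicate bit per byte lane (bit 0 = lane 0)
    """
    if zeroing:
        # zeroing mode: every lane is written, masked-out lanes carry zero
        write_enable = (1 << LANES) - 1
        data = 0
        for lane in range(LANES):
            if (pred_mask >> lane) & 1:
                data |= result & (0xFF << (8 * lane))
    else:
        # non-zeroing mode: predicate is ANDed with the write-enable lines
        write_enable = ((1 << LANES) - 1) & pred_mask
        data = result

    # the register file only updates lanes whose write-enable is high,
    # so no read-modify-write of the old value is needed by the pipeline
    new = old
    for lane in range(LANES):
        if (write_enable >> lane) & 1:
            lane_bits = 0xFF << (8 * lane)
            new = (new & ~lane_bits) | (data & lane_bits)
    return new
```

With `old = 0x1111111111111111`, `result = 0xAABBCCDDEEFF0102` and `pred_mask = 0b00000101`, the non-zeroing call leaves every lane except 0 and 2 untouched, while `zeroing=True` clears the masked-out lanes instead.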
  
  ## General implications and considerations
  
@@ -41,7 +43,7 @@ As well-known weaknesses that compromise performance, very little use of OE=1 is
  
  (see [[masked_vector_chaining]])
  
-One of the design principles of SV is that the use of VL should be as closrly equivalent to a direct substitution of the scalar operations of the hardware for-loop as possible, as if those looped operations were actually in the instruction stream (as scalar operations) rather than being issued from the Vector loop.
+One of the design principles of SV is that the use of VL should be as closely equivalent to a direct substitution of the scalar operations of the hardware for-loop as possible, as if those looped operations were actually in the instruction stream (as scalar operations) rather than being issued from the Vector loop.
  
  The implications here are that *register dependency hazards still have to be respected inter-element* even when (conceptually) pushed into the instruction stream from a hardware for-loop.
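
To make the inter-element hazard point concrete, here is a rough sketch under stated assumptions. The names `sv_add` and `parallel_add` are hypothetical, as is the flat Python list standing in for the integer register file. `sv_add` models the hardware for-loop as a direct substitution of scalar adds, so a later element sees the result written by an earlier one. `parallel_add` is a deliberately incorrect model that reads all sources up front, ignoring those hazards.

```python
MASK64 = (1 << 64) - 1


def sv_add(regs, rt, ra, rb, vl, pred=MASK64):
    """Hardware for-loop modelled as VL scalar adds issued in order.

    Each iteration behaves as if a scalar add were in the instruction
    stream, so element i+1 observes any register written by element i.
    """
    for i in range(vl):
        if (pred >> i) & 1:  # non-zeroing: masked-out elements untouched
            regs[rt + i] = (regs[ra + i] + regs[rb + i]) & MASK64
    return regs


def parallel_add(regs, rt, ra, rb, vl):
    """Incorrect model: reads every source before writing any result."""
    snapshot = list(regs)
    for i in range(vl):
        regs[rt + i] = (snapshot[ra + i] + snapshot[rb + i]) & MASK64
    return regs
```

When the destination overlaps the source by one register (`rt = ra + 1`), the two models diverge: starting from `[1, 2, 3, 4, 0, 0, 0, 0]` with `rt=1, ra=0, rb=4, vl=3`, `sv_add` propagates the value in register 0 down the chain, whereas `parallel_add` simply shifts the old values along.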
  
@@ -64,9 +66,9 @@ They also involve adding extra scalar bitmanip opcodes, such that by utilising
  
  In addition, those scalar 64-bit bitmanip operations, although some of them are obscure and unusual in the scalar world, do actually have practical applications outside of a vector context.
  
-(Hilariously and confusingly those very same scalar bitmanip opcodes may themselves be SV-vectorised however with VL only being up to 64 elements it is not anticipated that SV-bitmanip would be used to generate up to 64 bit predicate masks!).
+(Hilariously and confusingly, those very same scalar bitmanip opcodes may themselves be SV-vectorised. However, with VL only being up to 64 elements, it is not anticipated that SV-bitmanip would be used to generate predicate masks of up to 64 bits when a single 64-bit scalar operation will suffice.)
  
-Adding a full set special vector opcodes just for manipulating predicate masks and being able to transfer them to other regfiles (a la mfcr) is however anomalous, costly, and unnecessary.
+The summary is that adding a full set of special vector opcodes just for manipulating predicate masks and being able to transfer them to other regfiles (a la mfcr) is anomalous, costly, and unnecessary.
  
  ## CR-based predication proposal