(no commit message)

diff --git a/openpower/sv/predication.mdwn b/openpower/sv/predication.mdwn

index c5bd20a12bf0ab6322d224319f9af301bcf74b65..6af4c9cadc40cec84beb9b2ac1f645d0d0f836c3 100644 (file)
--- a/openpower/sv/predication.mdwn
+++ b/openpower/sv/predication.mdwn
@@ -1,4 +1,4 @@
-# TODO
+# TODO ideas
  
  <https://bugs.libre-soc.org/show_bug.cgi?id=213>
  
@@ -12,7 +12,10 @@
  
  # Requirements
  
-* must be easily implementable in any  microarchitecture including out-of-order 
+* must be easily implementable in any  microarchitecture including:
+  - small and large out-of-order 
+  - in-order
+  - FSM (0.3 IPC or below)
  * must not compromise or penalise any microarchitectural performance
  * must cover up to 64 elements
  * must still work for elwidth over-rides
@@ -22,7 +25,31 @@
  * two modes, "zeroing" and "non-zeroing". zeroing mode places a zero in the masked-out element results, where non-zeroing leaves the destination (result) element unmodified.
  * predicate must be invertible via an opcode bit (to avoid the need for an instruction which inverts all bits of the predicate mask)
  
-Implementation note: even in in-order microarchitectures it is strongly adviseable to use byte-level write-enable lines on the register file.  This in combination with 8b-bit SIMD element overrides allows, in "non-zeroing" mode, the predicate mask to be directly ANDed with the regfile write-enable lines to achieve the required functionality.  The alternative is to perform a READ-MODIFY-MASK-WRITE cycle which is costly and compromises performance.  Avoided very simply with byte-level write-enable.
+Implementation note: even in in-order microarchitectures it is strongly advisable to use byte-level write-enable lines on the register file.  This, in combination with 8-bit SIMD element overrides, allows the predicate mask, in "non-zeroing" mode, to be directly ANDed with the regfile write-enable lines to achieve the required functionality.  The alternative is to perform a READ-MODIFY-MASK-WRITE cycle, which is costly and compromises performance; byte-level write-enable avoids it very simply.
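A minimal behavioural sketch of the two modes and the invert bit described above, written in Python purely for illustration. The function name `predicated_add` and its parameters are hypothetical and not part of SV; bit `i` of `mask` is assumed to govern element `i`.

```python
def predicated_add(dest, src_a, src_b, mask, zeroing=False, invert=False):
    """Element-wise add under a predicate mask (illustrative model only).

    Bit i of `mask` governs element i (an assumption of this sketch).
    """
    result = list(dest)
    for i in range(len(dest)):
        bit = (mask >> i) & 1
        if invert:
            # Opcode-level inversion: no separate mask-inverting instruction.
            bit ^= 1
        if bit:
            result[i] = src_a[i] + src_b[i]
        elif zeroing:
            # Zeroing mode: masked-out elements are written with zero.
            result[i] = 0
        # Non-zeroing mode: the element is left untouched, which is what
        # holding the regfile write-enable low would achieve in hardware.
    return result
```

In non-zeroing mode the masked-out elements keep their old values without ever being read, which is the behaviour that ANDing the predicate with the write-enable lines gives for free.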
+
+## General implications and considerations
+
+### OE=1 and SO
+
+XER.SO (sticky overflow) is known to cause massive slowdown in pretty much every microarchitecture and it definitely compromises the performance of out-of-order systems.  The reason is that it introduces a READ-MODIFY-WRITE cycle between XER.SO and CR0 (which contains a copy of the SO field after inclusion of the overflow). The result and source registers branch off as RaW and WaR hazards from this RMW chain.
+
+This is even before predication or vectorisation were to be added on top, i.e. these are existing weaknesses in OpenPOWER as a scalar ISA.
+
+Because these are well-known weaknesses that compromise performance, very little use of OE=1 is actually made outside of unit tests and Conformance Tests.  Consequently it makes very little sense to continue to propagate OE=1 in the Vectorisation context of SV.
+
+### Vector Chaining
+
+(see [[masked_vector_chaining]])
+
+One of the design principles of SV is that the use of VL should be as closely equivalent to a direct substitution of the scalar operations of the hardware for-loop as possible, as if those looped operations were actually in the instruction stream rather than being issued from the Vector loop.
+
+The implications here are that *register dependency hazards still have to be respected inter-element*.
+
+Using a multi-issue out-of-order engine as the underlying microarchitectural basis this is not as difficult to achieve as it first seems (the hard work having been done by the Dependency Matrices).  In addition, Vector Chaining should also be possible for a multi-issue out-of-order engine to cope with, as long as false (unnecessary) Dependency Hazards are not introduced in between Vectors, where the dependencies actually only exist between elements *in* the Vector.
+
+The concept of recognising that it is the elements within the Vector that have Dependency Hazards rather than the Vectors themselves is what permits Cray-style "chaining". 
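The element-level view of hazards can be sketched as follows. This is an illustrative Python model only: the helper names are hypothetical, and it assumes that element `i` of a vector operation uses register number `base + i`, which is a simplification that ignores elwidth overrides.

```python
def expand_vector_op(rt, ra, rb, vl):
    """Expand one vector op into VL scalar ops as (dest, src1, src2) tuples.

    Assumes element i uses register base + i (simplified model).
    """
    return [(rt + i, ra + i, rb + i) for i in range(vl)]


def raw_hazards(ops):
    """Return (producer, consumer) index pairs for read-after-write hazards."""
    hazards = []
    for j, (_, src1, src2) in enumerate(ops):
        for i in range(j):
            if ops[i][0] in (src1, src2):
                hazards.append((i, j))
    return hazards
```

With `expand_vector_op(9, 8, 16, 3)` the destination of each element is a source of the next, so `raw_hazards` reports a chain between neighbouring elements. With a non-overlapping destination such as `expand_vector_op(32, 8, 16, 3)` it reports none, and the elements are free to issue independently.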
+
+This "false/unnecessary hazard" condition rules out, compromises the performance of, or drives up the resource utilisation of at least two of the proposals below.
  
  # Proposals
  
@@ -43,7 +70,7 @@ Adding a full set special vector opcodes just for manipulating predicate masks a
  
  this involves treating each CR as providing one bit of predicate. If
  there is limited space in SVPrefix it will be a fixed bit (bit 0)
-otherwise it may be selected (bit 0 to 3 of the CR)
+otherwise it may be selected (bit 0 to 3 of the CR) through a field in the opcode.
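As a rough illustration of one-CR-per-element predication, the sketch below collects one selected bit from each 4-bit CR into a predicate mask. The function name `cr_predicate_mask` is hypothetical, and the bit numbering used here (plain LSB0 within each 4-bit value) is an assumption of the sketch rather than the Power ISA's own CR field numbering.

```python
def cr_predicate_mask(cr_fields, bit_select=0):
    """Build a predicate mask from a vector of 4-bit CR values.

    cr_fields[i] supplies the predicate bit for element i.
    bit_select (0..3) picks which bit of each CR is used; LSB0 numbering
    within the 4-bit value is an assumption of this sketch.
    """
    if not 0 <= bit_select <= 3:
        raise ValueError("bit_select must be in the range 0..3")
    mask = 0
    for element, cr in enumerate(cr_fields):
        mask |= ((cr >> bit_select) & 1) << element
    return mask
```

The fixed-bit case in the text corresponds to always calling this with `bit_select=0`.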
  
  the crucial advantage of this proposal is that the Function Units can
  have one more register (a CR) added as their Read Dependency Hazards
@@ -60,18 +87,7 @@ datapath to the relevant FUs.  This could be reduced by adding yet another
  type of special virtual register port or datapath that masks out the
  required predicate bits closer to the regfile.
  
-## One scalar int per predicate element.
-
-Similar to RVV and similar to the one-CR-per-element concept above, the idea here is to use the LSB of any given element in a vector of predicates.  This idea has quite a lot of merit to it.
-
-Implementation-wise just like in the CR-based case a special regfile port could be added that gets the LSB of each scalar integer register and routes them through to the broadcast bus.
-
-The disadvantages appear on closer analysis:
-
-* Unlike the "full" CR port (which reads 8x CRs CR0-7 in one hit) trying the same trick on the scalar integer regfile, to obtain 8 predicate bits, would require a whopping 8x64bit set of reads to the INT regfile instead of a scant 1x32bit read.  Resource-wise, then, this idea is expensive.
-* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised-mfcr) are more challenging and costly.  Rather than use vectorised mfcr, complex transfers of the LSBs into a single scalar int are required.
-
-On balance this is a less favourable option than vectorising CRs
+another disadvantage is that the CR regfile needs to be expanded from 8x 4-bit CRs to a minimum of 64x or preferably 128x 4-bit CRs.  Beyond that they can be transferred using vectorised mfcr and mtcrf into INT regs.  this is a huge number of CR regs, each of which will need a DM column in the FU-REGs Matrix.  however this cost can be mitigated through regfile caching, bringing FU-REGs column numbers back down to "sane".
  
  ### Predicated SIMD HI32-LO32 FUs
  
@@ -134,8 +150,27 @@ bandwidth can again be reduced by performing the selection of the masks
  (bit 0 thru bit 3 of each CR) closer to the regfile i.e. before hitting
  the broadcast bus.
  
+## One scalar int per predicate element.
+
+Similar to RVV and similar to the one-CR-per-element concept above, the idea here is to use the LSB of any given element in a vector of predicates.  This idea has quite a lot of merit to it.
+
+Implementation-wise just like in the CR-based case a special regfile port could be added that gets the LSB of each scalar integer register and routes them through to the broadcast bus.
+
+The disadvantages appear on closer analysis:
+
+* Unlike the "full" CR port (which reads 8x CRs CR0-7 in one hit) trying the same trick on the scalar integer regfile, to obtain just 8 predicate bits (each being an LSB of a given 64 bit scalar int), would require a whopping 8x64bit set of reads to the INT regfile instead of a scant 1x32bit read.  Resource-wise, then, this idea is expensive.
+* With predicate bits being distributed out amongst 64 bit scalar registers, scalar bitmanipulation operations that can be performed after transferring Vectors of CMP operations from CRs to INTs (vectorised-mfcr) are more challenging and costly.  Rather than use vectorised mfcr, complex transfers of the LSBs into a single scalar int are required.
+
+In a "normal" Vector ISA this would be solved by adding opcodes that perform the kinds of bitmanipulation operations normally needed for predicate masks, as specialist operations *on* those masks.  However for SV the rule has been set: "no unnecessary additional Vector Instructions" because it is possible to use existing PowerISA scalar bitmanip opcodes to cover the same job.
+
+The problem is that vectors of LSBs need to be transferred *to* scalar int regs, bitmanip operations carried out, *and then transferred back*, which is exceptionally costly.
+
+On balance this is a less favourable option than vectorising CRs
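The read-bandwidth argument above can be made concrete with a small Python sketch. Both helper names are hypothetical; the model simply gathers the LSB of each scalar integer into one mask and counts how many regfile bits have to be read to do so.

```python
def gather_lsb_predicate(int_regs):
    """Gather the LSB of each scalar integer register into one predicate mask."""
    mask = 0
    for element, value in enumerate(int_regs):
        mask |= (value & 1) << element
    return mask


def regfile_bits_read(num_regs, reg_width_bits):
    """Total number of regfile bits read to fetch num_regs whole registers."""
    return num_regs * reg_width_bits
```

Fetching 8 predicate bits this way costs `regfile_bits_read(8, 64)` = 512 bits of INT regfile reads, against `regfile_bits_read(1, 32)` = 32 bits for the single read that covers CR0-7.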
+
  ## Scalar (single) integer as predicate, with one DM row
  
+This idea has merit in that to perform predicate bitmanip operations the predicate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away.  Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
+
  This idea has several disadvantages.
  
  * the single DM entry for the entire 64 bits creates a read hazard
@@ -178,7 +213,8 @@ and ongoing maintenance difficulties.
  ## Schemes which split (a scalar) integer reg into mask "chunks"
  
  These ideas are based on the principle that each chunk of 8 (or 16)
-bits of a scalar integer register may be covered by its own DM row.
+bits of a scalar integer register may be covered by its own DM column
+  in FU-REGs.
  8 chunks of a scalar 64-bit integer register for use as a bit-level
  predicate mask onto 64 vector elements would for example require 8
  DM entries.
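A minimal sketch of the chunking principle, in Python for illustration only. The function name `split_mask_into_chunks` is hypothetical; each returned chunk stands for the group of predicate bits that would be tracked by its own DM entry.

```python
def split_mask_into_chunks(mask, chunk_bits=8, total_bits=64):
    """Split a scalar integer predicate mask into fixed-width chunks.

    Chunk 0 holds the least significant bits. Each chunk models the
    predicate bits covered by one DM entry.
    """
    chunk_mask = (1 << chunk_bits) - 1
    return [(mask >> shift) & chunk_mask
            for shift in range(0, total_bits, chunk_bits)]
```

With the default 8-bit chunks a 64-bit mask yields 8 chunks (8 DM entries); passing `chunk_bits=16` yields 4.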
@@ -220,8 +256,6 @@ think through how to implement, which is itself a bad sign.  It is
  suspected that chaining will be complex or adversely affected by certain
  combinations of element width.
  
-(see [[masked_vector_chaining]])
-
  Overall this idea which initially seems to save resources brings together
  all the least favourable implementation aspects of other proposals and
  requires and combines all of them.