(no commit message)

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn

index 3e626caaaeadc375b0ec1f66c2d3ee18e3a46e62..bad606195af4b8d3cfba31a94e8e7d9aa398c053 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -118,6 +118,69 @@ It is important to note that having a different v3.0B Scalar opcode
  that is different from an SVP64 one is highly undesirable: the complexity
  in the decoder is greatly increased.
  
+# EXTRA Field Mapping
+
+The purpose of the 9-bit EXTRA field mapping is to mark individual
+registers (RT, RA, BFA) as either scalar or vector, and to extend
+their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
+Three of the 9 bits may also be used up for a 2nd Predicate (Twin
+Predication) leaving a mere 6 bits for qualifying registers. As can
+be seen there is significant pressure on these (and all) SVP64 bits.
+
+In Power ISA v3.1 prefixing there are bits which describe and classify
+the prefix in a fashion that is independent of the suffix (MLSS, for
+example).  For SVP64 there is insufficient space to make the SVP64 Prefix
+"self-describing", and consequently every single Scalar instruction
+had to be individually analysed, by rote, to craft an EXTRA Field Mapping.
+This process was semi-automated and is described in this section.
+The final results, which are part of the SVP64 Specification, are here:
+
+* [[openpower/opcode_regs_deduped]]
+
+Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
+from reading the markdown formatted version of the Scalar pseudocode
+which is machine-readable and found in [[openpower/isatables]].  The
+analysis gives, by instruction, a "Register Profile".  `add RT, RA, RB`
+for example is given a designation `RM-2R-1W` because it requires
+two GPR reads and one GPR write.
+
+Secondly, the total number of registers was added up (2R-1W is 3 registers)
+and if less than or equal to three then that instruction could be given an
+EXTRA3 designation.  Four or more is given an EXTRA2 designation because
+there are only 9 bits available.
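
The counting rule above can be sketched as follows. This is a minimal illustration, not the logic of sv_analysis.py: the function names and the regular-expression parsing of the profile string are assumptions made for this example, and the Twin Predication step described next (which can force EXTRA2) is ignored. The arithmetic behind the rule is that three registers at 3 bits each use all 9 bits, whereas four registers at 2 bits each use 8.

```python
import re

EXTRA_BITS = 9  # size of the EXTRA field, as stated in the text


def count_registers(profile: str) -> int:
    """Total registers in a profile such as 'RM-2R-1W' (2 reads + 1 write).

    Hypothetical parser: only counts '<n>R' and '<n>W' components.
    """
    return sum(int(n) for n in re.findall(r"(\d+)[RW]", profile))


def extra_designation(profile: str) -> str:
    """EXTRA3 (3 bits per register) if at most three registers, else EXTRA2."""
    return "EXTRA3" if count_registers(profile) <= 3 else "EXTRA2"


print(extra_designation("RM-2R-1W"))
print(extra_designation("RM-3R-1W"))
```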
+
+Thirdly, the instruction was analysed to see if Twin or Single
+Predication was suitable.  As a general rule Twin Predication was
+suitable if there was only a single operand and a single result
+(`extw` and LD/ST); however, it was found that some 2 or 3 operand
+instructions also qualify. Given that 3 of the 9 bits of EXTRA had to be
+sacrificed for use in Twin Predication, some compromises were made here.
+LDST is Twin but also has 3 operands in some operations, so only EXTRA2 can be used.
+
+Fourthly, a packing format was decided: for 2R-1W, for example, an EXTRA3
+indexing could be chosen such
+that RA would be indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
+and RT indexed 2 (EXTRA bits 6-8).  In some cases (LD/ST with update)
+RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
+(because it is possible to do, and perceived to be useful). Rc=1
+co-results (CR0, CR1) are always given the same EXTRA index as their
+main result (RT, FRT).
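
As a rough sketch of the 2R-1W packing described above: the helper name `extra3_bit_range` and the dictionary layout are hypothetical, chosen only to show how an EXTRA index translates into a 3-bit range. They are not taken from sv_analysis.py.

```python
EXTRA3_WIDTH = 3  # bits per register under an EXTRA3 designation


def extra3_bit_range(index: int) -> tuple:
    """Inclusive (first, last) EXTRA bit numbers for a given EXTRA3 index."""
    first = index * EXTRA3_WIDTH
    return (first, first + EXTRA3_WIDTH - 1)


# Illustrative packing for a 2R-1W instruction such as `add RT, RA, RB`
packing = {"RA": 0, "RB": 1, "RT": 2}
bit_ranges = {reg: extra3_bit_range(idx) for reg, idx in packing.items()}

for reg, (first, last) in bit_ranges.items():
    print(f"{reg}: EXTRA bits {first}-{last}")
```

With this layout an Rc=1 co-result such as CR0 would simply reuse the index of RT, following the rule stated in the text.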
+
+Fifthly, in an automated process the results of the analysis
+were output in CSV Format for use in machine-readable form
+by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>
+
+This process was laborious but logical, and, crucially, once a
+decision is made (and ratified) it cannot be reversed.
+Those qualifying future Power ISA Scalar instructions for SVP64
+are **strongly** advised to utilise this same process and the same
+sv_analysis.py program as a canonical method of maintaining the
+relationships.  Alterations to that same program which
+change the Designation are **prohibited** once finalised (ratified
+through the Power ISA WG Process). It would
+be similar to deciding that `add` should be changed from X-Form
+to D-Form.
+
  # Single Predication
  
  This is a standard mode normally found in Vector ISAs.  Every element in every source Vector and in the destination uses the same bit of one single predicate mask.
@@ -160,6 +223,12 @@ This is equivalent to
  `llvm.masked.compressstore.*`
  followed by
  `llvm.masked.expandload.*`
+with a single instruction.
+
+This extreme power and flexibility comes down to the fact that SVP64
+is not actually a Vector ISA: it is a loop-abstraction-concept that
+is applied *in general* to Scalar operations, just like the x86
+`REP` instruction (if put on steroids).
  
  # Reduce modes
  
@@ -184,8 +253,8 @@ but only if in doing so they preserve Program Order at the Element Level.
  Opportunities where this is possible include an `OR` operation
  or a MIN/MAX operation: it may be possible to parallelise the reduction,
  but for Floating Point it is not permitted due to different results
-being obtained if the reduction is not executed in strict sequential
-order.
+being obtained if the reduction is not executed in strict Program-Sequential
+Order.
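
The Floating Point restriction can be illustrated with ordinary IEEE 754 double arithmetic. This is a generic sketch, not SVP64 code: the two helper functions are hypothetical and simply add the same three values in two different orders, then do the same for MAX.

```python
values = [0.1, 0.2, 0.3]


def sequential_sum(vec):
    """Add elements strictly in order, element 0 first: (v0 + v1) + v2."""
    acc = vec[0]
    for x in vec[1:]:
        acc = acc + x
    return acc


def reordered_sum(vec):
    """Add the same elements in a different grouping: v0 + (v1 + v2)."""
    acc = vec[-1]
    for x in reversed(vec[:-1]):
        acc = x + acc
    return acc


# Floating-point addition is not associative, so the grouping matters.
print(sequential_sum(values) == reordered_sum(values))

# MAX does not depend on grouping, so reordering is harmless here.
print(max(max(values[0], values[1]), values[2])
      == max(values[0], max(values[1], values[2])))
```

For these inputs the two sums differ in the last bit, which is why only a strictly sequential schedule gives a reproducible Floating Point result.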
  
  In essence it becomes the programmer's responsibility to leverage the
  pre-determined schedules to desired effect.
@@ -197,8 +266,9 @@ as a simple and natural relaxation of the usual restriction on the Vector
  Looping which would terminate if the destination was marked as a Scalar.
  Scalar Reduction by contrast *keeps issuing Vector Element Operations*
  even though the destination register is marked as scalar.
-Thus it is up to the programmer to be aware of this and observe some
-conventions.
+It is thus up to the programmer to be aware of this and to observe some
+conventions in order to achieve the desired outcome of scalar
+reduction.
  
  It is also important to appreciate that there is no
  actual imposition or restriction on how this mode is utilised: there
@@ -267,6 +337,7 @@ Using the same register as both the source and destination, with Vectors
  of different offsets masks and values to be inserted has multiple
  applications including Video, cryptography and JIT compilation.
  
+Due to the Deterministic Scheduling,
  Subtract and Divide are still permitted to be executed in this mode,
  although from an algorithmic perspective it is strongly discouraged.
  It would be better to use addition followed by one final subtract,
@@ -356,8 +427,9 @@ executed in sequential Program Order, element 0 being the first.
  * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
    CR-creating operation produces a result (including cmp).  Similar to
    branch, an analysis of the CR is performed and if the test fails, the
-  vector operation terminates and discards all element operations at and
-  above the current one, and VL is truncated to either
+  vector operation terminates and discards all element operations
+  above the current one (and the current one if VLi is not set),
+  and VL is truncated to either
    the *previous* element or the current one, depending on whether
    VLi (VL "inclusive") is set.
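
The VL truncation rule for data-driven fail-on-first might be sketched as below. This is an assumption-laden illustration rather than the specification's pseudocode: `fail_first_vl`, `test_passes` and the list of per-element CR results are hypothetical names, elements are numbered from 0, and the sketch models only the resulting VL, not the discarding of element results.

```python
def fail_first_vl(vl, cr_results, test_passes, vli):
    """Return the truncated VL after a data-driven fail-first scan.

    Elements are tested in sequential order, element 0 first.  At the first
    failing element i, VL becomes i (current element excluded) or, when VLi
    is set, i + 1 (current element included).
    """
    for i in range(vl):
        if not test_passes(cr_results[i]):
            return i + 1 if vli else i
    return vl  # no element failed: VL is unchanged


def eq_clear(eq_bit):
    """Hypothetical test: passes while the element's CR 'eq' bit is clear."""
    return not eq_bit


cr_eq = [False, False, True, False]  # illustrative per-element CR eq bits
print(fail_first_vl(4, cr_eq, eq_clear, vli=False))
print(fail_first_vl(4, cr_eq, eq_clear, vli=True))
```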
  
@@ -459,7 +531,7 @@ SV is applied.
  
  Numbering relationships for CR fields are already complex due to being
  in BE format (*the relationship is not clearly explained in the v3.0B
-or v3.1B specification*).  However with some care and consideration
+or v3.1 specification*).  However with some care and consideration
  the exact same mapping used for INT and FP regfiles may be applied,
  just to the upper bits, as explained below.  The notation
  `CR{field number}` is used to indicate access to a particular
@@ -467,6 +539,14 @@ Condition Register Field (as opposed to the notation `CR[bit]`
  which accesses one bit of the 32 bit Power ISA v3.0B
  Condition Register)
  
+`CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is defined, in v3.0B pseudocode, as:
+
+     CR{7-n} = CR[32+n*4:35+n*4]
+
+For SVP64 the relationship for the sequential
+numbering of elements is to the CR **fields** within
+the CR Register, not to individual bits within the CR register.
+
  In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits.  The top 3 bits (0:2)
  select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
  *in* that CR.  The numbering was determined (after 4 months of
@@ -711,6 +791,10 @@ For modes:
  
  # Proposed Parallel-reduction algorithm
  
+**This algorithm contains a MV operation and may NOT be used.  Removal
+of the MV operation may be achieved by using index-redirection as was
+achieved in DCT and FFT REMAP**
+
  ```
  /// reference implementation of proposed SimpleV reduction semantics.
  ///
@@ -720,7 +804,7 @@ For modes:
  /// `temp_pred` is a user-visible Vector Condition register 
  ///
  /// all input arrays have length `vl`
-def reduce(  vl,  vec, pred, pred,):
+def reduce(vl, vec, pred):
      step = 1;
      while step < vl
          step *= 2;
@@ -730,9 +814,17 @@ def reduce(  vl,  vec, pred, pred,):
              if pred[i] && other_pred
                  vec[i] += vec[other];
              else if other_pred
-                vec[i] = vec[other];
+                XXX VIOLATION OF SVP64 DESIGN XXX
+                XXX vec[i] = vec[other];      XXX
+                XXX VIOLATION OF SVP64 DESIGN XXX
              pred[i] |= other_pred;
+```
  
+we'd want to use something based on the above pseudo-code
+rather than the below pseudo-code -- reasoning here:
+<https://bugs.libre-soc.org/show_bug.cgi?id=697#c11>
+
+```
  def reduce(  vl,  vec, pred, pred,):
      j = 0
      vi = [] # array of lookup indices to skip nonpredicated