openpower/sv/normal.mdwn

# Appendix

* <https://bugs.libre-soc.org/show_bug.cgi?id=574>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Table of contents:

[[!toc]]


# Rounding, clamp and saturate

See [[av_opcodes]].

To help ensure that audio quality is not compromised by overflow,
"saturation" is provided, as well as a way to detect when saturation
occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
one CR per element in the result (Note: this is different from VSX which
has a single CR per block).

When N=0 the result is saturated to within the maximum range of an
unsigned value.  For integer ops this will be 0 to 2^elwidth-1. Similar
logic applies to FP operations, with the result being saturated to
maximum rather than returning INF, and the minimum to +0.0.

When N=1 the same occurs except that the result is saturated to the min
or max of a signed result, and for FP to the min and max value rather
than returning +/- INF.

When Rc=1, the CR "overflow" bit is set on the CR associated with the
element, to indicate whether saturation occurred.  Note that due to
the hugely detrimental effect it has on parallel processing, XER.SO is
**ignored** completely and is **not** brought into play here.  The CR
overflow bit is therefore simply set to zero if saturation did not occur,
and to one if it did.

Note also that saturation on operations that produce a carry output is
prohibited due to the conflicting use of the CR.so bit for storing whether
saturation occurred.

Post-analysis of the Vector of CRs to find out if any given element hit
saturation may be done using a mapreduced CR op (cror), or by using the
new crweird instruction, transferring the relevant CR bits to a scalar
integer and testing it for nonzero.  See [[sv/cr_int_predication]].

Note that the operation takes place at the maximum bitwidth (max of
src and dest elwidth) and that truncation occurs to the range of the
dest elwidth.
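
The saturation rules above can be sketched in Python as follows. This is only an illustrative model for integer elements, not the specification's pseudocode: the function names `saturate` and `sat_add_vector` are hypothetical, Python's unbounded integers stand in for "the maximum bitwidth", and the per-element boolean flag stands in for the CR "overflow" bit.

```python
def saturate(value, elwidth, signed):
    """Clamp an integer to the range of an elwidth-bit element.

    Returns (clamped_value, saturated).  The flag models the per-element
    CR "overflow" bit; it is an illustrative stand-in, not a real CR.
    """
    if signed:                       # N=1: signed range
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:                            # N=0: unsigned range
        lo, hi = 0, (1 << elwidth) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value


def sat_add_vector(a, b, elwidth, signed):
    """Element-wise saturating add: returns results and one flag per element."""
    pairs = [saturate(x + y, elwidth, signed) for x, y in zip(a, b)]
    results = [v for v, _ in pairs]
    flags = [f for _, f in pairs]
    return results, flags


results, flags = sat_add_vector([100, 200, 250], [50, 50, 10],
                                elwidth=8, signed=False)
# Analogous to OR-reducing the CR overflow bits into one scalar test.
any_saturated = any(flags)
```

Here only the third element exceeds the unsigned 8-bit range, so only its flag is set, and `any_saturated` plays the role of the scalar test performed after a `cror` mapreduce.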

# Reduce mode

Reduction in SVP64 is deterministic and somewhat of a misnomer.  A normal
Vector ISA would have explicit Reduce opcodes with defined characteristics
per operation: in SX Aurora there is even an additional scalar argument
containing the initial reduction value. SVP64 fundamentally has to
utilise *existing* Scalar Power ISA v3.0B operations, which presents some
unique challenges.

The solution turns out to be to simply define reduction as permitting
deterministic element-based schedules to be issued using the base Scalar
operations, and to rely on the underlying microarchitecture to resolve
Register Hazards at the element level.  This goes back to
the fundamental principle that SV is nothing more than a Sub-Program-Counter
sitting between Decode and Issue phases.

Microarchitectures *may* take opportunities to parallelise the reduction
but only if in doing so they preserve Program Order at the Element Level.
Opportunities where this is possible include an `OR` operation
or a MIN/MAX operation.  For Floating Point, however, parallelising the
reduction is not permitted, because different results are
obtained if the reduction is not executed in strict sequential
order.
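
A small Python illustration of why ordering matters for Floating Point but not for `OR` or MIN/MAX. The values are arbitrary examples chosen for this sketch; Python floats are IEEE 754 double precision.

```python
from functools import reduce
import operator

values = [0.1, 0.2, 0.3]

# Strict sequential order, element 0 first: (0.1 + 0.2) + 0.3
sequential = reduce(operator.add, values)

# A reordered schedule: 0.1 + (0.2 + 0.3)
reordered = values[0] + (values[1] + values[2])

# Rounding happens at every step, so the two schedules disagree.
fp_order_matters = sequential != reordered

# Bitwise OR and MAX are associative and commutative: any schedule agrees.
ints = [0b0001, 0b0100, 0b0110, 0b1000]
or_sequential = reduce(operator.or_, ints)
or_tree = (ints[0] | ints[1]) | (ints[2] | ints[3])
max_sequential = reduce(max, ints)
max_tree = max(max(ints[0], ints[1]), max(ints[2], ints[3]))
```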

## Scalar result reduce mode

In this mode, which is suited to operations involving carry or overflow,
one register must be identified by the programmer as being the "accumulator".
Scalar reduction is thus categorised by:

* One of the sources is a Vector
* the destination is a scalar
* optionally, but most usefully, one source register is also the destination
* the source register type is the same as the destination register
  type identified as the "accumulator".  Scalar reduction on `cmp`,
  `setb` or `isel` makes no sense, for example, because of the mixture
  between CRs and GPRs.

Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements in the vector starting at r10.

    # add RT, RA,RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA

However, *unless* the operation is marked as "mapreduce", SV ordinarily
**terminates** at the first scalar operation.  Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.

To perform the loop in reverse order, the ```RG``` (reverse gear) bit must
be set.  This is useful for leaving a cumulative suffix sum in reverse order:

    for i in reversed(range(VL)):
        # RT-1 = RA gives a suffix sum
        iregs[RT+i] = iregs[RA+i] - iregs[RB+i]

Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required, as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted, has multiple
applications including Video, cryptography and JIT compilation.

Subtract and Divide are still permitted to be executed in this mode,
although from an algorithmic perspective it is strongly discouraged.
It would be better to use addition followed by one final subtract,
or in the case of divide, to get better accuracy, to perform a multiply
cascade followed by a final divide.

Note that single-operand or three-operand scalar-dest reduce is perfectly
well permitted: both still meet the qualifying characteristics that one
source operand can also be the destination, which allows the "accumulator"
to be identified.

If the "accumulator" cannot be identified (none of the sources is also
a destination) the results are **UNDEFINED**.  This permits implementations
to not have to have complex decoding analysis of register fields: it
is thus up to the programmer to ensure that one of the source registers
is also a destination register in order to take advantage of Scalar
Reduce Mode.

If an interrupt or exception occurs in the middle of the scalar mapreduce,
the scalar destination register **MUST** be updated with the current
(intermediate) result, because this is how ```Program Order``` is
preserved (Vector Loops are to be considered to be just another way of
issuing instructions in Program Order).  In this way, after return from
interrupt, the scalar mapreduce may continue where it left off.  This
provides "precise" exception behaviour.
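
A minimal Python sketch of this resumable behaviour, under stated assumptions: the `Interrupt` class, the `state["step"]` loop position and the `interrupt_at` parameter are all hypothetical names invented for illustration, and a plain list models the integer register file.

```python
class Interrupt(Exception):
    """Hypothetical stand-in for an interrupt arriving mid-loop."""


def scalar_mapreduce_add(iregs, acc, rb, vl, state, interrupt_at=None):
    """Accumulate iregs[rb+i] into iregs[acc], one element at a time.

    state["step"] is the index of the next element to issue (an assumed
    name for the saved loop position).  The accumulator register always
    holds the intermediate result, so the loop can simply be re-entered.
    """
    while state["step"] < vl:
        i = state["step"]
        if interrupt_at is not None and i == interrupt_at:
            raise Interrupt()
        iregs[acc] += iregs[rb + i]   # intermediate result is visible in acc
        state["step"] = i + 1
    state["step"] = 0                 # loop complete: reset the position


iregs = [0] * 32
iregs[10:14] = [1, 2, 3, 4]           # four elements starting at r10
state = {"step": 0}

try:
    scalar_mapreduce_add(iregs, acc=3, rb=10, vl=4, state=state,
                         interrupt_at=2)
except Interrupt:
    partial = iregs[3]                # elements 0 and 1 already accumulated

# "Return from interrupt": re-issue, continuing from the saved position.
scalar_mapreduce_add(iregs, acc=3, rb=10, vl=4, state=state)
total = iregs[3]
```

Because the accumulator and the loop position together capture all progress, re-issuing the same operation after the interrupt neither repeats nor skips an element.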

Note that hardware is perfectly permitted to perform multi-issue
parallel optimisation of the scalar reduce operation: it's just that
as far as the user is concerned, all exceptions and interrupts **MUST**
be precise.

## Vector result reduce mode

Vector result reduce mode may utilise the destination vector for
the purposes of storing intermediary results.  Interrupts and exceptions
can therefore also be precise.  The result will be in the first
non-predicate-masked-out destination element.  Note that unlike
Scalar reduce mode, Vector reduce
mode is *not* suited to operations which involve carry or overflow.

Programs **MUST NOT** rely on the contents of the intermediate results:
they may change from hardware implementation to hardware implementation.
Some implementations may perform an incremental update, whilst others
may choose to use the available Vector space for a binary tree reduction.
If an incremental Vector is required (```x[i] = x[i-1] + y[i]```) then
a *straight* SVP64 Vector instruction can be issued, where the source and
destination registers overlap: ```sv.add 1.v, 9.v, 2.v```. Because
respecting ```Program Order``` is mandatory in SVP64, hardware
must detect this case and issue an incremental sequence of scalar
element instructions.
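
To illustrate how overlapping source and destination vectors produce an incremental result when elements are issued strictly in Program Order, here is a minimal Python sketch. The register numbers (RT=9, RA=8, RB=16) are hypothetical values picked for this illustration and are not the operands of the instruction quoted above; `sv_add_elementwise` is likewise an invented name.

```python
def sv_add_elementwise(iregs, rt, ra, rb, vl):
    """Issue add RT+i, RA+i, RB+i strictly in element (Program) order."""
    for i in range(vl):
        iregs[rt + i] = iregs[ra + i] + iregs[rb + i]


iregs = [0] * 32
iregs[8] = 0                      # x[-1]: initial value, just below the destination
iregs[16:20] = [1, 2, 3, 4]       # y[0..3]

# RA = RT - 1, so element i reads the result that element i-1 just wrote.
sv_add_elementwise(iregs, rt=9, ra=8, rb=16, vl=4)
running_sum = iregs[9:13]
```

A parallel implementation that read all sources before writing any result would produce a different answer, which is why the overlap has to be detected.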

1. limited to single predicated dual src operations (add RT, RA, RB).
   triple source operations are prohibited (such as fma).
2. limited to operations that make sense.  divide is excluded, as is
   subtract (X - Y - Z produces different answers depending on the order)
   and asymmetric CRops (crandc, crorc). sane operations:
   multiply, min/max, add, logical bitwise OR, most other CR ops.
   operations that do not have the same source and dest register type are
   also excluded (isel, cmp). operations involving carry or overflow
   (XER.CA / OV) are also prohibited.
3. the destination is a vector but the result is stored, ultimately,
   in the first nonzero predicated element.  all other nonzero predicated
   elements are undefined. *this includes the CR vector* when Rc=1.
4. implementations may use any ordering and any algorithm to reduce
   down to a single result.  However it must be equivalent to a straight
   application of mapreduce.  The destination vector (except masked out
   elements) may be used for storing any intermediate results. these may
   be left in the vector (undefined).
5. CRM applies when Rc=1.  When CRM is zero, the CR associated with
   the result is regarded as a "some results met standard CR result
   criteria". When CRM is one, this changes to "all results met standard
   CR criteria".
6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
   in order to store sufficient state to resume operation should an
   interrupt occur. this is also why implementations are permitted to use
   the destination vector to store intermediary computations.
7. *Predication may be applied*.  zeroing mode is not an option.  masked-out
   inputs are ignored; masked-out elements in the destination vector are
   unaltered (not used for the purposes of intermediary storage); the
   scalar result is placed in the first available unmasked element.
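
Rules 4 and 7 above can be sketched in Python as follows. This is a hedged illustration only: `vector_reduce` and `vector_reduce_tree` are hypothetical names, the predicate is modelled as a list of 0/1 flags, and the binary-tree schedule is just one example of an alternative ordering that an implementation might choose.

```python
def vector_reduce(op, src, mask):
    """Reference (sequential) predicated reduction.

    Masked-out inputs are ignored.  Returns (index, value): the index of
    the first unmasked element, where the result is placed, and the result.
    Returns None if every element is masked out.
    """
    active = [i for i, m in enumerate(mask) if m]
    if not active:
        return None
    result = src[active[0]]
    for i in active[1:]:
        result = op(result, src[i])
    return active[0], result


def vector_reduce_tree(op, src, mask):
    """One alternative schedule: pairwise (binary tree) reduction."""
    active = [i for i, m in enumerate(mask) if m]
    if not active:
        return None
    level = [src[i] for i in active]
    while len(level) > 1:
        pairs = [op(level[j], level[j + 1])
                 for j in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            pairs.append(level[-1])   # odd element carries to the next level
        level = pairs
    return active[0], level[0]


src = [7, 3, 9, 4, 12, 5]
mask = [0, 1, 1, 0, 1, 1]
seq = vector_reduce(max, src, mask)
tree = vector_reduce_tree(max, src, mask)
```

For an operation such as `max` or integer add both schedules give the same answer, which is the equivalence that rule 4 requires.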

Pseudocode for the case where RA==RB:

    result = op(iregs[RA], iregs[RA+1])
    CR = analyse(result)
    for i in range(2, VL):
        result = op(result, iregs[RA+i])
        CRnew = analyse(result)
        if Rc == 1:
            if CRM:
                CR = CR | CRnew  # bitwise OR
            else:
                CR = CR & CRnew  # bitwise AND

TODO: case where RA!=RB which involves first a vector of 2-operand
results followed by a mapreduce on the intermediates.

Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2:

    result.x = op(iregs[RA].x, iregs[RA+1].x)
    result.y = op(iregs[RA].y, iregs[RA+1].y)
    for i in range(2, VL):
        result.x = op(result.x, iregs[RA+i].x)
        result.y = op(result.y, iregs[RA+i].y)

Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.

When SVM is set and SUBVL!=1, another variant is enabled: horizontal
subvector mode.  Example for a vec3:

    for i in range(VL):
        result = op(iregs[RA+i].x, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result

In this mode, when Rc=1 the Vector of CRs is as normal: each result
element creates a corresponding CR element.

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST,
the other for arithmetic operations (actually, CR-driven).  Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.

* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
  ordinary one.  Exceptions occur "as normal".  However for elements 1
  and above, if an exception would occur, then VL is **truncated** to the
  previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or when another
  CR-creating operation (including cmp) produces a result.  Similar to
  branch, an analysis of the CR is performed and if the test fails, the
  vector operation terminates and discards all element operations at and
  above the current one, and VL is truncated to either
  the *previous* element or the current one, depending on whether
  VLi (VL "inclusive") is set.

Thus the new VL comprises a contiguous vector of results,
all of which pass the testing criteria (equal to zero, less than zero).
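
The CR-driven truncation described above can be sketched in Python. This is a simplified model under stated assumptions: `ffirst_truncate` is a hypothetical name, the `test` callable stands in for the single CR-bit test, and the `vli` flag models the VLi option.

```python
def ffirst_truncate(results, test, vl, vli=False):
    """Return the new VL after CR-driven (data-dependent) fail-first.

    test(result) models the single CR-bit test; vli models the VLi
    ("inclusive") option, which counts the failing element as well.
    """
    for i in range(vl):
        if not test(results[i]):
            return i + 1 if vli else i
    return vl                     # no element failed: VL is unchanged


def is_nonzero(r):
    return r != 0


data = [5, 3, 8, 0, 7, 2]

new_vl = ffirst_truncate(data, is_nonzero, vl=6)              # stops before the zero
new_vl_inclusive = ffirst_truncate(data, is_nonzero, vl=6,
                                   vli=True)                  # includes the zero
vl_zero = ffirst_truncate([0, 1, 2], is_nonzero, vl=3)        # element 0 fails
```

With `vli=False` the elements below the new VL all pass the test; with `vli=True` the failing element is counted as well, as in the strncpy-style case of keeping the terminating zero.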

The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
but it requires speculative execution involving modifications of VL
to get high performance implementations.  An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`.  The CR is stored (and the CR.eq bit tested
against the `inv` field).
If the CR.eq bit is equal to `inv` then the Vector is truncated and
the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.

VLi is only available as an option when `Rc=0` (or for instructions
which do not have Rc). When set, the current element is always
also included in the count (the new length that VL will be set to).
This may be useful in combination with "inv" to truncate the Vector
to *exclude* elements that fail a test, or, in the case of implementations
of strncpy, to include the terminating zero.

In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO).  For more complex
tests this may be insufficient.  If that is the case, a vectorised crop
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.

One extremely important aspect of ffirst is:

* LDST ffirst may never set VL equal to zero.  This is because on the first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV that VL may be set
  to zero (with the exception of via the SV.STATE SPR).  When VL is set
  to zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.
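
The LD/ST variant, and the reason it can never produce VL=0, can be modelled with a short Python sketch. Everything here is an assumption made for illustration: memory is a dict of mapped addresses, `MemoryFault` stands in for whatever exception a real load would raise, and elements are one address apart.

```python
class MemoryFault(Exception):
    """Hypothetical stand-in for a page fault or other load exception."""


def ld_ffirst(memory, base, vl):
    """Fail-first vector load from a dict-modelled memory.

    Element 0 faults "as normal" (the exception propagates).  A fault on
    any later element truncates VL to the elements already loaded.
    Returns (loaded_values, new_vl).
    """
    loaded = []
    for i in range(vl):
        addr = base + i
        if addr not in memory:        # unmapped address: the load would fault
            if i == 0:
                raise MemoryFault(addr)
            return loaded, i          # truncate: i >= 1, so VL is never zero
        loaded.append(memory[addr])
    return loaded, vl


memory = {addr: addr * 10 for addr in range(100, 104)}   # only 100..103 mapped
values, new_vl = ld_ffirst(memory, base=101, vl=8)
```

The only way out of the loop with no elements loaded is the exception on element 0, so the truncated VL returned is always at least one.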

Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
to a nonzero value for any implementation-specific reason.  For example:
it is perfectly reasonable for implementations to alter VL when ffirst
LD or ST operations are initiated on a nonaligned boundary, such that
within a loop the subsequent iteration of that loop begins subsequent
ffirst LD/ST operations on an aligned boundary.  Likewise, VL may be
truncated to reduce workloads or balance resources.

CR-based data-dependent ffirst on the other hand **MUST NOT** truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.

## Data-dependent fail-first on CR operations (crand etc)

Operations that actually produce or alter a CR Field as a result
do not also in turn have an Rc=1 mode.  However it makes no
sense to try to test the 4 bits of a CR Field for being equal
or not equal to zero. Moreover, the result is already in the
form that is desired: it is a CR field.  Therefore,
CR-based operations have their own SVP64 Mode, described
in [[sv/cr_ops]].

There are two primary types of CR operations:

* Those which have a 3-bit operand field (referring to a CR Field)
* Those which have a 5-bit operand (referring to a bit within the
  whole 32-bit CR)

More details can be found in [[sv/cr_ops]].
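
As a brief illustration of how the two operand forms relate, the sketch below splits a 5-bit CR bit number into a 3-bit CR Field number and a bit within that field, using the standard Power ISA ordering of bits within a field (LT, GT, EQ, SO). The helper name `split_cr_bit` is hypothetical.

```python
CR_BIT_NAMES = ("lt", "gt", "eq", "so")


def split_cr_bit(operand5):
    """Split a 5-bit CR bit number (0..31) into (CR field, bit within field)."""
    if not 0 <= operand5 < 32:
        raise ValueError("5-bit operand expected")
    return operand5 >> 2, CR_BIT_NAMES[operand5 & 3]


field, bit = split_cr_bit(14)   # bit 14 of the 32-bit CR
```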

# pred-result mode

This mode merges common CR testing with predication, saving on instruction
count. Below is the pseudocode excluding predicate zeroing and elwidth
overrides. Note that the pseudocode for [[sv/cr_ops]] is slightly different.

    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i):
            continue
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # Rc=1 always stores the CR
        if Rc == 1 or RC1:
            crregs[offs+i] = CRnew
        # now test CR, similar to branch
        if RC1 or CRnew[BO[0:1]] != BO[2]:
            continue # test failed: cancel store
        # result optionally stored but CR always is
        iregs[RT+i] = result

The reason for allowing the CR element to be stored is so that
post-analysis of the CR Vector may be carried out.  For example:
Saturation may have occurred (and been prevented from updating, by the
test) but it is desirable to know *which* elements fail saturation.

Note that RC1 Mode basically turns all operations into `cmp`.  The
calculation is performed but it is only the CR that is written. The
element result is *always* discarded, never written (just like `cmp`).

Note that predication is still respected.  Predicate zeroing is slightly
different: elements that fail the CR test *or* are masked out are zero'd.
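
The interaction between the predicate mask, the CR test and zeroing can be made concrete with a runnable Python sketch. It is a simplified model, not the specification: it assumes Rc=1 (so the test outcome is always recorded), leaves out RC1 and elwidth overrides, fixes the operation to an add, and uses hypothetical names (`pred_result_add`, `is_positive`) with a boolean per element standing in for the CR.

```python
def pred_result_add(ra, rb, rt, mask, test, zeroing=False):
    """Sketch of pred-result mode for an element-wise add (Rc=1 assumed).

    ra, rb : source vectors; rt : destination vector (updated in place)
    mask   : predicate bits, one per element
    test   : function(result) -> bool, modelling the CR bit test
    Returns per-element test outcomes (None where masked out), standing
    in for the stored Vector of CRs.
    """
    crs = [None] * len(rt)
    for i in range(len(rt)):
        if not mask[i]:
            if zeroing:
                rt[i] = 0                 # masked out: zeroed
            continue
        result = ra[i] + rb[i]
        passed = test(result)
        crs[i] = passed                   # the CR element is always recorded
        if not passed:
            if zeroing:
                rt[i] = 0                 # failed the test: zeroed
            continue                      # test failed: cancel store
        rt[i] = result
    return crs


def is_positive(r):
    return r > 0


ra, rb, mask = [1, -5, 3, 4], [1, 1, -3, 4], [1, 1, 1, 0]

rt = [99] * 4
crs = pred_result_add(ra, rb, rt, mask, is_positive)

rt_zeroed = [99] * 4
pred_result_add(ra, rb, rt_zeroed, mask, is_positive, zeroing=True)
```

Without zeroing, elements that fail the test or are masked out keep their old contents; with zeroing, both kinds are cleared, while the recorded outcomes still show which elements actually failed.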

## pred-result mode on CR ops

CR operations (mtcr, crand, cror) may be Vectorised and
predicated, and pred-result mode may also be applied to them.
Vectorisation applies to 4-bit CR Fields which are treated as
elements, not the individual bits of the 32-bit CR.
CR ops and how to identify them are described in [[sv/cr_ops]].

 363 CR ops and how to identify them is described in [[sv/cr_ops]]
 364