* <https://bugs.libre-soc.org/show_bug.cgi?id=574>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=697>
This is the appendix to [[sv/svp64]], providing explanations of modes
etc. leaving the main svp64 page's primary purpose as outlining the
may be performed to analyse the carry bits (including carry lookahead
propagation) before continuing with further parallel additions.
-# v3.0B/v3.1B relevant instructions
+# v3.0B/v3.1 relevant instructions
SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
CPU ISA.
Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
Vector ISA would have explicit Reduce opcodes with defined characteristics
per operation: in SX Aurora there is even an additional scalar argument
-containing the initial reduction value. SVP64 fundamentally has to
+containing the initial reduction value, and the default is either 0
+or 1 depending on the specifics of the explicit opcode.
+SVP64 fundamentally has to
utilise *existing* Scalar Power ISA v3.0B operations, which presents some
unique challenges.
being obtained if the reduction is not executed in strict sequential
order.
-## Scalar result reduce mode
+In essence it becomes the programmer's responsibility to leverage the
+pre-determined schedules to desired effect.
+
+## Scalar result reduction and iteration
Scalar Reduction per se does not exist: instead it is implemented in SVP64
as a simple and natural relaxation of the usual restriction on the Vector
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
even though the destination register is marked as scalar.
Thus it is up to the programmer to be aware of this and observe some
-conventions. It is also important to appreciate that there is no
+conventions.
+
+It is also important to appreciate that there is no
actual imposition or restriction on how this mode is utilised: there
-will therefore be several valuable uses (including Vector Iteration)
-and it is up to the programmer to make best use of the capability
+will therefore be several valuable uses (including Vector Iteration
+and "Reverse-Gear")
+and it is up to the programmer to make best use of the
+(strictly deterministic) capability
provided.
In this mode, which is suited to operations involving carry or overflow,
-one register must be identified by the programmer as being the "accumulator".
-Scalar reduction is thus categorised by:
+one register must be assigned by the programmer, by convention, to be the
+"accumulator". Scalar reduction is thus categorised by:
* One of the sources is a Vector
* the destination is a scalar
-* optionally but most usefully when one source register is also the destination
+* optionally but most usefully when one source scalar register is
+ also the scalar destination (which may be informally termed
+ the "accumulator")
* That the source register type is the same as the destination register
- type identified as the "accumulator". scalar reduction on `cmp`,
+ type identified as the "accumulator". Scalar reduction on `cmp`,
`setb` or `isel` makes no sense for example because of the mixture
between CRs and GPRs.
where their use results in "extraneous execution", i.e. where it is clear
that the sequence of operations, comprising multiple overwrites to
a scalar destination **without** cumulative, iterative, or reductive
-behaviour, may discard all but the last element operation. Identification
+behaviour (no "accumulator"), may discard all but the last element
+operation. Identification
of such is trivial to do for `setb` and `cmp`: the source register type is
a completely different register file from the destination*
for i in range(VL):
iregs[RA] += iregs[RB+i] # RT==RA
-However, *unless* the operation is marked as "mapreduce", SV ordinarily
+However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`),
+SV ordinarily
**terminates** at the first scalar operation. Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.
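
The difference between the default behaviour and "mapreduce" can be sketched in Python. This is only an illustrative model, not the reference pseudocode: the function name `sv_add_scalar_dest`, the `mapreduce` flag and the register numbers are assumptions made for the example, with `iregs` standing in for the integer register file as in the loop above.

```python
def sv_add_scalar_dest(iregs, RT, RA, RB, VL, mapreduce=False):
    """Hypothetical model of an add with scalar RT/RA and vector RB."""
    for i in range(VL):
        # One element operation: the scalar destination is overwritten.
        iregs[RT] = iregs[RA] + iregs[RB + i]
        if not mapreduce:
            # Without mapreduce, stop at the first scalar operation.
            break
    return iregs


def demo(mapreduce):
    # Illustrative register contents: r16..r19 hold the source vector.
    iregs = [0] * 32
    iregs[16:20] = [1, 2, 3, 4]
    # RT == RA == r4 acts as the "accumulator".
    sv_add_scalar_dest(iregs, RT=4, RA=4, RB=16, VL=4, mapreduce=mapreduce)
    return iregs[4]
```

Under these assumptions `demo(True)` accumulates all four elements into the scalar, whereas `demo(False)` stops after the first element operation.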
-To.perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This is useful for leaving a cumulative suffix sum in reverse order:
-
- for i in (VL-1 downto 0):
- # RT-1 = RA gives a suffix sum
- iregs[RT+i] = iregs[RA+i] - iregs[RB+i]
+To perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This may be useful in situations where the results may be different
+(floating-point) if executed in a different order. Given that there is
+no actual prohibition on Reduce Mode being applied when the destination
+is a Vector, the "Reverse Gear" bit turns out to be a way to apply Iterative
+or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
+for example will start at the opposite end of the Vector and push
+a cumulative series of overlapping add operations into the Execution units of
+the underlying hardware.
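
A rough Python model may help show why the direction matters when source and destination Vectors overlap. It assumes nothing beyond element operations being issued one at a time in order; the function name `sv_add_vector`, the `reverse_gear` flag and the register contents are hypothetical and chosen only for illustration.

```python
def sv_add_vector(iregs, RT, RA, RB, VL, reverse_gear=False):
    """Hypothetical model: element-by-element add, forwards or backwards."""
    order = range(VL - 1, -1, -1) if reverse_gear else range(VL)
    for i in order:
        # Each element operation reads the register file as left by
        # the previous element operation.
        iregs[RT + i] = iregs[RA + i] + iregs[RB + i]
    return iregs


def demo_rg(reverse_gear):
    # Illustrative contents: r4..r7 = 1, 2, 3, 4 and r3 = 0.
    iregs = [0] * 32
    iregs[4:8] = [1, 2, 3, 4]
    # Models the r3.v, r4.v, r4.v operand pattern: RT overlaps RA and RB.
    sv_add_vector(iregs, RT=3, RA=4, RB=4, VL=4, reverse_gear=reverse_gear)
    return iregs[3:8]
```

In this model the forward direction reads each source element before it is overwritten, so the elements are simply doubled. In the reverse direction each element operation consumes the result of the previous one, which is the cumulative behaviour described above.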
Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required, as a way to construct
Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
-illustration with a vec2:
+illustration with a vec2, assuming RA==RT, e.g. `sv.add/mr/vec2 r4, r4, r16`:
- result.x = op(iregs[RA].x, iregs[RA+1].x)
- result.y = op(iregs[RA].y, iregs[RA+1].y)
- for i in range(2, VL):
- result.x = op(result.x, iregs[RA+i].x)
- result.y = op(result.y, iregs[RA+i].y)
+ for i in range(0, VL):
+ # RA==RT in the instruction. does not have to be
+ iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
+ iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)
-Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
+Thus logically there is nothing special or unanticipated about
+`SVM=0`: it is expected behaviour according to standard SVP64
+Sub-Vector rules.
-When SVM is set and SUBVL!=1, another variant is enabled: horizontal
-subvector mode. Example for a vec3:
+By contrast, when SVM is set and SUBVL!=1, a Horizontal
+Subvector mode is enabled, which behaves very much more
+like a traditional Vector Processor Reduction instruction.
+Example for a vec3:
for i in range(VL):
result = iregs[RA+i].x
# Fail-on-first
-Data-dependent fail-on-first has two distinct variants: one for LD/ST,
-the other for arithmetic operations (actually, CR-driven). Note in each
+Data-dependent fail-on-first has two distinct variants: one for LD/ST
+(see [[sv/ldst]]),
+the other for arithmetic operations (actually, CR-driven)
+([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
+Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.
# pred-result mode
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the paeudocode for [[sv/cr_ops]] is slightly different.
-
- for i in range(VL):
- # predication test, skip all masked out elements.
- if predicate_masked_out(i):
- continue
- result = op(iregs[RA+i], iregs[RB+i])
- CRnew = analyse(result) # calculates eq/lt/gt
- # Rc=1 always stores the CR
- if Rc=1 or RC1:
- crregs[offs+i] = CRnew
- # now test CR, similar to branch
- if RC1 or CRnew[BO[0:1]] != BO[2]:
- continue # test failed: cancel store
- # result optionally stored but CR always is
- iregs[RT+i] = result
-
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out. For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`. The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
+Predicate-result merges common CR testing with predication, saving on
+instruction count. In essence, a Condition Register Field test
+is performed, and if it fails it is considered to have been
+*as if* the destination predicate bit was zero.
+Arithmetic and Logical Pred-result is covered in [[sv/normal]]
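
The "as if the destination predicate bit was zero" behaviour can be sketched as follows. This is a simplified model only: it omits predicate zeroing, elwidth overrides and the storing of the CR Field itself, and the names `pred_result_add`, `predicate` and `test` are assumptions introduced for the example rather than anything defined by the specification.

```python
def pred_result_add(iregs, RT, RA, RB, VL, predicate, test):
    """Hypothetical model of a pred-result add (non-zeroing)."""
    for i in range(VL):
        # Ordinary predication: skip masked-out elements.
        if not (predicate >> i) & 1:
            continue
        result = iregs[RA + i] + iregs[RB + i]
        # Condition Register Field style analysis of the result.
        cr = {"eq": result == 0, "gt": result > 0, "lt": result < 0}
        # A failed test is treated like a zero predicate bit: no write.
        if not test(cr):
            continue
        iregs[RT + i] = result
    return iregs


def demo_pred(predicate):
    iregs = [0] * 32
    iregs[8:12] = [1, -1, 5, -7]    # source vector A (illustrative)
    iregs[12:16] = [1, 1, -5, 2]    # source vector B (illustrative)
    iregs[16:20] = [99] * 4         # destination, pre-filled to show skips
    pred_result_add(iregs, RT=16, RA=8, RB=12, VL=4,
                    predicate=predicate, test=lambda cr: cr["gt"])
    return iregs[16:20]
```

Here only elements whose result is greater than zero are written; elements that fail the test, or that are masked out by the predicate, leave the destination untouched.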
## pred-result mode on CR ops
CR element*. Greatly simplified pseudocode:
for i in range(VL):
- # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
- + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
- == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
+ # calculate the vector result of an add
+ iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
+ # now calculate CR bits
+ CRs{8+i}.eq = iregs[RT+i] == 0
+ CRs{8+i}.gt = iregs[RT+i] > 0
+ ... etc
If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on