In essence it becomes the programmer's responsibility to leverage the
pre-determined schedules to desired effect.
-## Scalar result reduce mode
+## Scalar result reduction and iteration
Scalar Reduction per se does not exist; instead it is implemented in SVP64
as a simple and natural relaxation of the usual restriction on the Vector
provided.
In this mode, which is suited to operations involving carry or overflow,
-one register must be identified by the programmer as being the "accumulator".
-Scalar reduction is thus categorised by:
+one register must, by convention, be assigned by the programmer to be the
+"accumulator". Scalar reduction is thus categorised by:
* One of the sources is a Vector
* The destination is a scalar
* Typically, one scalar source register is
  also the scalar destination (which may be informally termed
  the "accumulator")
* The source register type is the same as the destination register
- type identified as the "accumulator". scalar reduction on `cmp`,
+ type identified as the "accumulator". Scalar reduction on `cmp`,
`setb` or `isel`, for example, makes no sense because of the mixture
of CRs and GPRs.
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA
-However, *unless* the operation is marked as "mapreduce", SV ordinarily
+However, *unless* the operation is marked as "mapreduce" (e.g. `sv.add/mr`),
+SV ordinarily
**terminates** at the first scalar operation. Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.
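The difference between ordinary termination and mapreduce can be sketched in Python. This is a minimal, hypothetical model: the function `sv_add` and its `rt_vec`/`ra_vec`/`rb_vec` flags are illustrative names, not part of SVP64. A flat list stands in for the GPRs, and predication, elwidth overrides and SUBVL are ignored.

```python
def sv_add(iregs, RT, RA, RB, VL, rt_vec, ra_vec, rb_vec, mapreduce=False):
    """Hypothetical model of an SVP64 add loop over a flat GPR list.

    rt_vec/ra_vec/rb_vec mark each operand as Vector (True) or
    Scalar (False). Predication, elwidth and SUBVL are not modelled.
    """
    for i in range(VL):
        rt = RT + i if rt_vec else RT
        ra = RA + i if ra_vec else RA
        rb = RB + i if rb_vec else RB
        iregs[rt] = iregs[ra] + iregs[rb]
        # a scalar destination normally ends the loop after one element;
        # only "mapreduce" (e.g. sv.add/mr) keeps issuing elements
        if not rt_vec and not mapreduce:
            break
    return iregs

regs = [0] * 32
regs[4] = 1
regs[16:20] = [10, 20, 30, 40]
# r4 scalar accumulator (RT==RA), r16 a Vector source, VL=4
print(sv_add(list(regs), 4, 4, 16, 4, False, False, True, mapreduce=True)[4])  # 101
# without mapreduce the loop stops after element 0
print(sv_add(list(regs), 4, 4, 16, 4, False, False, True)[4])  # 11
```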
Note that when SVM is clear and SUBVL!=1, the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
-illustration with a vec2:
+Illustration with a vec2, assuming RA==RT, e.g. `sv.add/mr/vec2 r4, r4, r16`:
- result.x = op(iregs[RA].x, iregs[RA+1].x)
- result.y = op(iregs[RA].y, iregs[RA+1].y)
- for i in range(2, VL):
- result.x = op(result.x, iregs[RA+i].x)
- result.y = op(result.y, iregs[RA+i].y)
+    for i in range(0, VL):
+        # RA==RT in this instruction (it does not have to be)
+        iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
+        iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)
-Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
+Thus logically there is nothing special or unanticipated about
+`SVM=0`: it is expected behaviour according to standard SVP64
+Sub-Vector rules.
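The SVM=0 vec2 loop can likewise be modelled. In this hypothetical sketch each register is a Python list of SUBVL sub-elements (x, y, ...). That glosses over how sub-elements are actually packed into registers. `mapreduce_subvl` is an illustrative name.

```python
import operator

def mapreduce_subvl(iregs, RT, RB, VL, op):
    """SVM=0 mapreduce with SUBVL>1 (hypothetical model).

    Each register holds a list of SUBVL sub-elements, and every
    sub-element lane is reduced into RT independently (vertically).
    """
    for i in range(VL):
        iregs[RT] = [op(acc, src) for acc, src in zip(iregs[RT], iregs[RB + i])]
    return iregs

regs = {4: [1, 2], 16: [10, 20], 17: [30, 40], 18: [50, 60]}
# sv.add/mr/vec2 r4, r4, r16 with VL=3: x and y accumulate separately
print(mapreduce_subvl(regs, 4, 16, 3, operator.add)[4])  # [91, 122]
```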
-When SVM is set and SUBVL!=1, another variant is enabled: horizontal
-subvector mode. Example for a vec3:
+By contrast, when SVM is set and SUBVL!=1, a Horizontal
+Subvector mode is enabled, which behaves much more
+like a traditional Vector Processor Reduction instruction.
+Example for a vec3:
    for i in range(VL):
        result = iregs[RA+i].x
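Only the start of the horizontal pseudocode is shown here. The idea can be sketched under two assumptions: the remaining sub-elements (y, then z) are folded into `result` in order, and the scalar result is written to element i of RT. `horizontal_subvl` and the list-of-sub-elements register model are hypothetical.

```python
import operator
from functools import reduce

def horizontal_subvl(iregs, RT, RA, VL, op):
    """SVM=1 horizontal subvector reduction (hypothetical model).

    Each source element's sub-elements (x, y, z) are folded together
    in order, and the single result is written to element i of RT.
    """
    for i in range(VL):
        iregs[RT + i] = reduce(op, iregs[RA + i])
    return iregs

regs = {8: [1, 2, 3], 9: [4, 5, 6]}
out = horizontal_subvl(regs, 20, 8, 2, operator.add)
print([out[20], out[21]])  # [6, 15]
```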
# Fail-on-first
-Data-dependent fail-on-first has two distinct variants: one for LD/ST,
-the other for arithmetic operations (actually, CR-driven). Note in each
+Data-dependent fail-on-first has two distinct variants: one for LD/ST
+(see [[sv/ldst]]),
+the other for arithmetic operations (actually, CR-driven)
+([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
+Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.
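The sequential Program Order assumption can be illustrated with a hypothetical sketch of the arithmetic (CR-driven) variant. It assumes two things: VL is truncated to the index of the first failing element, and that element's result is not written. The exact rules, and the LD/ST variant, are in the pages linked above. `ffirst_add` and its `fails` test are illustrative only.

```python
def ffirst_add(iregs, RT, RA, RB, VL, fails):
    """CR-driven data-dependent fail-on-first (hypothetical model).

    Elements run strictly in Program Order from element 0. The first
    element whose result fails the test truncates VL, and no later
    element is executed. In this sketch the failing element's result
    is simply not written; the real mode details may differ.
    """
    for i in range(VL):
        result = iregs[RA + i] + iregs[RB + i]
        if fails(result):
            return i  # truncated VL
        iregs[RT + i] = result
    return VL

regs = [0] * 32
regs[0:4] = [1, 2, -2, 4]   # RA vector
regs[4:8] = [1, 1, 2, 1]    # RB vector
vl = ffirst_add(regs, 20, 0, 4, 4, lambda r: r == 0)
print(vl, regs[20:24])  # 2 [2, 3, 0, 0]
```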
# pred-result mode
-This mode merges common CR testing with predication, saving on instruction
-count. Below is the pseudocode excluding predicate zeroing and elwidth
-overrides. Note that the paeudocode for [[sv/cr_ops]] is slightly different.
-
- for i in range(VL):
- # predication test, skip all masked out elements.
- if predicate_masked_out(i):
- continue
- result = op(iregs[RA+i], iregs[RB+i])
- CRnew = analyse(result) # calculates eq/lt/gt
- # Rc=1 always stores the CR
- if Rc=1 or RC1:
- crregs[offs+i] = CRnew
- # now test CR, similar to branch
- if RC1 or CRnew[BO[0:1]] != BO[2]:
- continue # test failed: cancel store
- # result optionally stored but CR always is
- iregs[RT+i] = result
-
-The reason for allowing the CR element to be stored is so that
-post-analysis of the CR Vector may be carried out. For example:
-Saturation may have occurred (and been prevented from updating, by the
-test) but it is desirable to know *which* elements fail saturation.
-
-Note that RC1 Mode basically turns all operations into `cmp`. The
-calculation is performed but it is only the CR that is written. The
-element result is *always* discarded, never written (just like `cmp`).
-
-Note that predication is still respected: predicate zeroing is slightly
-different: elements that fail the CR test *or* are masked out are zero'd.
+Predicate-result merges common CR testing with predication, saving on
+instruction count. In essence, a Condition Register Field test
+is performed, and if it fails, the element is treated *as if*
+the destination predicate bit were zero.
+Arithmetic and Logical Pred-result is covered in [[sv/normal]].
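Here is a minimal sketch of pred-result on an add. It assumes Rc=1, so the CR Field is always written while the result is only written when the CR test passes. The names (`pred_result_add`, `bit`, `expect`) are hypothetical. Zeroing, elwidth overrides and CR Field offsets are left out.

```python
def pred_result_add(iregs, crs, RT, RA, RB, VL, pred, bit, expect):
    """Hypothetical model of pred-result on a Vector add with Rc=1.

    The CR Field for each active element is always written; the
    result is written only if crs[i][bit] == expect. A failed test
    behaves as if the destination predicate bit were zero.
    """
    for i in range(VL):
        if not pred[i]:
            continue  # masked out by the ordinary predicate
        result = iregs[RA + i] + iregs[RB + i]
        cr = {"lt": result < 0, "gt": result > 0, "eq": result == 0}
        crs[i] = cr  # CR always stored, for post-analysis
        if cr[bit] != expect:
            continue  # test failed: result store cancelled
        iregs[RT + i] = result
    return iregs, crs

regs = [0] * 32
regs[0:4] = [1, -2, 0, 5]   # RA vector
regs[4:8] = [1, 1, 0, -1]   # RB vector
crs = [None] * 4
pred_result_add(regs, crs, 20, 0, 4, 4, [True] * 4, "gt", True)
print(regs[20:24])  # [2, 0, 0, 4]
```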
## pred-result mode on CR ops
CR element*. Greatly simplified pseudocode:
    for i in range(VL):
- # calculate the vector result of an add iregs[RT+i] = iregs[RA+i]
- + iregs[RB+i] # now calculate CR bits CRs{8+i}.eq = iregs[RT+i]
- == 0 CRs{8+i}.gt = iregs[RT+i] > 0 ... etc
+        # calculate the vector result of an add
+        iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
+        # now calculate CR bits
+        CRs{8+i}.eq = iregs[RT+i] == 0
+        CRs{8+i}.gt = iregs[RT+i] > 0
+        ... etc
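The simplified pseudocode translates almost directly into Python. In this hypothetical sketch, CR Fields are modelled as dicts of `lt`/`gt`/`eq` flags keyed by CR number, starting at CR8 as in `CRs{8+i}`. SO is omitted. `add_with_crs` is an illustrative name.

```python
def add_with_crs(iregs, crs, RT, RA, RB, VL, cr_base=8):
    """Vector add that also produces one CR Field per element (sketch).

    crs maps CR Field number -> {"lt", "gt", "eq"}; SO is not modelled.
    """
    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT + i] = iregs[RA + i] + iregs[RB + i]
        res = iregs[RT + i]
        # now calculate the CR bits for this element
        crs[cr_base + i] = {"lt": res < 0, "gt": res > 0, "eq": res == 0}
    return iregs, crs

regs = [0] * 32
regs[0:3] = [1, -1, 0]   # RA vector
regs[3:6] = [2, 1, -5]   # RB vector
crs = {}
add_with_crs(regs, crs, 10, 0, 3, 3)
print([crs[8 + i]["eq"] for i in range(3)])  # [False, True, False]
```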
If a "cumulated" CR-based analysis of results is desired (à la VSX CR6)
then a follow-up instruction must be performed, setting "reduce" mode on