through an ISA revision. The opcode proliferation, at O(N^6), inexorably
spirals out of control in the ISA, detrimentally impacting the hardware,
the software, the compilers and the testing and compliance. Here are
-the typical dimensions that result in such massive proliferation:
+the typical dimensions that result in such massive proliferation,
+based on mass-volume DSPs and Micro-Processors:
* Operation (add, mul)
* bitwidth (8, 16, 32, 64, 128)
increase using "tagging" (similar to how x86 originally extended
registers from 32 to 64 bit).
+![Single-Issue concept](/openpower/svp64-primer/img/power_pipelines.svg){ width=40% height=20% }
+
## SV
The fundamentals are (just like x86 "REP"):
* Once the loop is completed *only then* is the Program Counter
allowed to move to the next instruction.
-![image](/svp64-primer/img/power_pipelines.svg)
+![Multi-Issue with Predicated SIMD back-end ALUs](/openpower/svp64-primer/img/sv_multi_issue.svg){ width=40% height=40% }
Hardware (and simulator) implementors are free and clear to implement this
as literally a for-loop, sitting in between instruction decode and issue.
out-of-order execution, although it is strongly recommended to add
predication capability directly into SIMD backend units.
-In Power ISA v3.0B pseudo-code form, an ADD operation, assuming both
-source and destination have been "tagged" as Vectors, is simply:
+A typical Cray-style Scalable Vector ISA (where a SIMD one has a fixed
+non-negotiable static parameter instead of a runtime-dynamic VL)
+performs its arithmetic as:
+
+ for i = 0 to VL-1:
+ VPR(RT)[i] = VPR(RA)[i] + VPR(RB)[i]
+
+In Power ISA v3.0B pseudo-code form, an ADD operation in Simple-V,
+assuming both source and destination have been "tagged" as Vectors,
+is simply:
for i = 0 to VL-1:
GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
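
The element loop above can be modelled in a few lines of Python. This is only an illustrative sketch, not the reference simulator: the flat `gpr` list, the `sv_add` name and the 64-bit wrap-around mask are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit register width

def sv_add(gpr, rt, ra, rb, vl):
    """Apply a scalar add VL times, stepping through consecutive registers."""
    for i in range(vl):
        # each element is simply the next register in the scalar regfile
        gpr[rt + i] = (gpr[ra + i] + gpr[rb + i]) & MASK64

gpr = list(range(32))                  # GPR(n) initially holds the value n
sv_add(gpr, rt=16, ra=0, rb=8, vl=4)   # RT, RA and RB all "tagged" as Vectors
print(gpr[16:20])                      # [8, 10, 12, 14]
```

No separate Vector register file appears in the sketch: the "Vector" is nothing more than a run of scalar registers starting at RT, RA and RB.
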
* Compacted operations into registers (normally only provided by SIMD)
* Fail-on-first (introduced in ARM SVE)
* A new concept: Data-dependent fail-first
-* Condition-Register based *post-result* predication (also new)
* A completely new concept: "Twin Predication"
* vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
function op_add(RT, RA, RB) # add not VADD!
int id=0, irs1=0, irs2=0;
- predval = get_pred_val(FALSE, rd);
+ predval = get_pred_val(FALSE, RT); # dest mask
for i = 0 to VL-1:
if (predval & 1<<i) # predication bit test
ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
function op_add(RT, RA, RB) # add not VADD!
int id=0, irs1=0, irs2=0;
- predval = get_pred_val(FALSE, rd);
+ predval = get_pred_val(FALSE, RT); # dest pred
for i = 0 to VL-1:
if (predval & 1<<i) # predication bit test
ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
"register" but that from that location onwards the elements *overlap
subsequent registers*.
-![image](/svp64-primer/img/svp64_regs.svg){ width=40% }
+![image](/openpower/svp64-primer/img/svp64_regs.svg){ width=40% height=40% }
Here is another way to view the same concept, bearing in mind that
LE (little-endian) memory order is assumed:
If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
VGATHER/VSCATTER.
-# CR predicate result analysis
-
-Power ISA has Condition Registers. These store an analysis of the result
-of an operation to test it for being greater, less than or equal to zero.
-What if a test could be done, similar to branch BO testing, which hooked
-into the predication system?
-
- for i in range(VL):
- # predication test, skip all masked out elements.
- if predicate_masked_out(i): continue # skip
- result = op(iregs[RA+i], iregs[RB+i])
- CRnew = analyse(result) # calculates eq/lt/gt
- # Rc=1 always stores the CR
- if RC1 or Rc=1: crregs[offs+i] = CRnew
- if RC1: continue # RC1 mode skips result store
- # now test CR, similar to branch
- if CRnew[BO[0:1]] == BO[2]:
- # result optionally stored but CR always is
- iregs[RT+i] = result
-
-Note that whilst the Vector of CRs is always written to the CR regfile,
-only those result elements that pass the BO test get written to the
-integer regfile (when RC1 mode is not set). In RC1 mode the CR is always
-stored, but the result never is. This effectively turns every arithmetic
-operation into a type of `cmp` instruction.
-
-Here for example if FP overflow occurred, and the CR testing was carried
-out for that, all valid results would be stored but invalid ones would
-not, but in addition the Vector of CRs would contain the indicators of
-which ones failed. With the invalid results being simply not written
-this could save resources (save on register file writes).
-
-Also expected is, due to the fact that the predicate mask is effectively
-ANDed with the post-result analysis as a secondary type of predication,
-that there would be savings to be had in some types of operations where
-the post-result analysis, if not included in SV, would need a second
-predicate calculation followed by a predicate mask AND operation.
-
-Note, hilariously, that Vectorised Condition Register Operations (crand,
-cror) may also have post-result analysis applied to them. With Vectors
-of CRs being utilised *for* predication, possibilities for compact and
-elegant code begin to emerge from this innocuous-looking addition to SV.
-
# Exception-based Fail-on-first
One of the major issues with Vectorised LD/ST operations is when a
# Data-dependent fail-first
-This is a minor variant on the CR-based predicate-result mode. Where
-pred-result continues with independent element testing (any of which may
-be parallelised), data-dependent fail-first *stops* at the first failure:
+Data-dependent fail-first *stops* at the first failure:
if Rc=0: BO = inv<<2 | 0b00 # test CR.eq bit z/nz
for i in range(VL):
CRnew = analyse(result) # calculates eq/lt/gt
# now test CR, similar to branch
if CRnew[BO[0:1]] != BO[2]:
- VL = i # truncate: only successes allowed
+ VL = i+VLi # truncate: only successes allowed
break
# test passed: store result (and CR?)
if not RC1: iregs[RT+i] = result
the actual calculation.
The only minor downside here though is the change to VL, which in some
-implementations may cause pipeline stalls. This was one of the reasons
-why CR-based pred-result analysis was added, because that at least is
-entirely paralleliseable.
+implementations may cause pipeline stalls.
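
The truncation of VL can be illustrated with a small Python model. It is a hedged sketch rather than the specification: `analyse`, `ddff_sub`, the dictionary-based CR and the `cr_bit`/`want` parameters are hypothetical stand-ins for the BO-style test, and `vli` mirrors the `VLi` term in the pseudocode. As in the pseudocode, the failing element's result is not stored.

```python
def analyse(result):
    """Compare a result against zero, giving CR-like lt/gt/eq flags."""
    return {"lt": result < 0, "gt": result > 0, "eq": result == 0}

def ddff_sub(iregs, rt, ra, rb, vl, cr_bit="eq", want=False, vli=0):
    """Element-wise subtract that stops at the first failed CR test.

    Returns the (possibly truncated) vector length.
    """
    for i in range(vl):
        result = iregs[ra + i] - iregs[rb + i]
        cr = analyse(result)
        if cr[cr_bit] != want:
            return i + vli         # truncate: only successes allowed
        iregs[rt + i] = result     # test passed: store the result
    return vl                      # no failure: VL is unchanged

iregs = [0] * 32
iregs[0:4] = [5, 3, 7, 9]
iregs[8:12] = [1, 1, 7, 2]
new_vl = ddff_sub(iregs, rt=16, ra=0, rb=8, vl=4)  # fail on a zero result
print(new_vl, iregs[16:20])                        # 2 [4, 2, 0, 0]
```

The third subtraction gives zero, so the test on the `eq` flag fails at element 2 and VL becomes 2. The early return shows why the elements cannot all be committed independently: each store depends on every earlier element having passed.
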
# Vertical-First Mode
+![image](/openpower/sv/sv_horizontal_vs_vertical.svg){ width=40% height=40% }
+
This is a relatively new addition to SVP64 under development as of
July 2021. Where Horizontal-First is the standard Cray-style for-loop,
Vertical-First typically executes just the **one** scalar element
beq loop
```
-![image](/openpower/sv/sv_horizontal_vs_vertical.svg)
-
Three examples are illustrated of different types of Scalar-Vector
operations. Note that in its simplest form **only one** element is
executed per instruction **not** multiple elements per instruction.