code\footnote[1]{with the proviso that the Programmer must
be mindful of both the starting point and what they set MAXVL to.
Hardware will helpfully remind them of any Register File overruns
-by happily throwing an Illegal Instructionp}.
-
-On top of these very basic but
-already-profound\footnote[2]{with hardware and ISA Architectural
-requirements that deal with the increased Dependency
-Hazard Management, too detailed to list in full in
-this document, the most important being that the total number of
-registers be a fixed \textbf{and mandatory} Standards-defined quantity}
-beginnings, Predication and Conditional-Exit
-can be added. Predication is found in every GPU ISA, and Conditional-Exit
-is a 50-year invention dating back to Zilog Z80 CPIR and LDIR.
+by happily throwing an Illegal Instruction}.
\begin{verbatim}
for i in range(VL):
break
\end{verbatim}
+On top of these very basic but
+already-profound\footnote[2]{caveats: with hardware and ISA Architectural
+ requirements that deal with the increased Dependency
+ Hazard Management, too detailed to list in full in
+ this document, the most important being that the total number of
+ registers be a fixed \textbf{and mandatory} Standards-defined quantity}
+beginnings, Predication and Conditional-Exit
+can be added. Predication is found in every GPU ISA, and Conditional-Exit
+is a 50-year-old invention dating back to Zilog Z80 CPIR and LDIR.
+
Additionally the concept may be introduced from ARM SVE and RISC-V
RVV "Fault-First" on Load and Store, where if an Exception would occur
then the Hardware informs the programmer that the Vector operation
is truncated:
\begin{verbatim}
- for i in range(VL):
- if predicate.bit[i] clear:
- continue
- EffectiveAddress = GPR(RA+i) + Immediate
+ for i in range(VL):
+ if predicate.bit[i] clear:
+ continue
+ EffectiveAddress = GPR(RA+i) + Immediate
if Exception@(EffectiveAddress):
- if i == 0: RAISE Exception
- else:
- VL = i
- break
- GPR(RT+i) = Mem@(EffectiveAddress)
+ if i == 0: RAISE Exception
+ else: VL = i; break # truncate
+ GPR(RT+i) = Mem@(EffectiveAddress)
\end{verbatim}
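The Fault-First loop above can be emulated in plain Python as a rough sketch. Everything here is an assumption made purely for illustration: the names \texttt{fault\_first\_load} and \texttt{MemFault}, registers modelled as a list, and memory modelled as a dict in which a missing key stands in for an address that would raise an Exception.

```python
class MemFault(Exception):
    """Hypothetical stand-in for a hardware memory exception."""


def fault_first_load(gpr, mem, rt, ra, imm, vl, predicate):
    """Emulate the Fault-First load loop shown in the pseudocode.

    gpr       -- list modelling the register file (assumption)
    mem       -- dict modelling memory; a missing key models a fault (assumption)
    predicate -- integer bitmask, bit i enables element i
    Returns the (possibly truncated) vector length.
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # predicate bit clear: skip this element
        ea = gpr[ra + i] + imm  # EffectiveAddress
        if ea not in mem:  # models Exception@(EffectiveAddress)
            if i == 0:
                raise MemFault(ea)  # element 0 faults: raise as normal
            return i  # later element faults: truncate, VL = i
        gpr[rt + i] = mem[ea]
    return vl


# Four addresses requested, only the first two are mapped.
regs = [0] * 16
regs[0:4] = [100, 104, 108, 112]
memory = {100: 1, 104: 2}
new_vl = fault_first_load(regs, memory, rt=8, ra=0, imm=0, vl=4, predicate=0b1111)
```

In this sketch the caller is expected to inspect the returned length and re-issue the operation for the remaining elements, which mirrors the idea that the Hardware informs the programmer of the truncation rather than trapping.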
The important facet of both these "Conditional Truncation" constructs
the primary being that in a SIMD (parallel) context, strncpy
operates in bytes where SIMD operates in power-of-two multiples
only. PackedSIMD is the worst offender: PredicatedSIMD is marginally
-better\footnote[3]{caveat: if extended properly, as was
-done successfully, with huge beneficial effect, in ARM SVE}.
+better\footnote[3]{caveat: if designed properly, as was
+done successfully in ARM SVE}.
If SIMD Load and Store has to start on an Aligned Memory location,
which is a common limitation, things get even worse.
The operations that were supposed to speed