From: Luke Kenneth Casson Leighton
Date: Wed, 17 May 2023 19:01:55 +0000 (+0000)
Subject: update to OpenSearch2023 paper
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=a6da490bb899a68ecce686e36d53a185d4d170ed;p=libreriscv.git

update to OpenSearch2023 paper
---

diff --git a/conferences/opensearch2023/opensearch2023.tex b/conferences/opensearch2023/opensearch2023.tex
index 2486b6b0e..da25dfc7e 100644
--- a/conferences/opensearch2023/opensearch2023.tex
+++ b/conferences/opensearch2023/opensearch2023.tex
@@ -102,7 +102,7 @@ of Scalar Loop Construct.
 This is what SIMD and normal Vector ISAs look like:
 
 \begin{verbatim}
-    for i in range(SIMDlength):
+    for i in range(SIMDlength):
         VR(RT)[i] = VR(RA)[i] + VR(RB)[i]
 \end{verbatim}
 
@@ -119,11 +119,11 @@ is a 50-year invention dating back to Zilog Z80 CPIR and LDIR.
 
 \begin{verbatim}
     for i in range(VL):
-        if predicate.bit[i] clear:
+        if predicate.bit[i] clear: # skip?
             continue
         GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
-        if CCTest(GPR(RT+i)) is failure:
-            VL = i
+        if CCTest(GPR(RT+i)) fails: # end?
+            VL = i # truncate the Vector
             break
 \end{verbatim}
 
@@ -159,11 +159,57 @@ usual hassle with SIMD - often compensated for with hard-coded
 dedicated "Memory copy" or "String copy" instructions that cannot
 be leveraged for any other purpose, goes away.
 
+\section{strncpy}
+
+strncpy presents some unique challenges for an ISA and hardware,
+the primary being that in a SIMD (parallel) context, strncpy
+operates in bytes where SIMD operates in power-of-two multiples
+only. PackedSIMD is the worst offender: PredicatedSIMD is better.
+If SIMD Load and Store has to start on an Aligned Memory location
+things get even worse. The operations that were supposed to speed
+up algorithms have to have "preamble" and "postamble" to take care
+of the corner-cases.
+
+Worse, a naive SIMD ISA cannot have Conditional inter-relationships.
+64-byte or 128-byte-wide LOADs either succeed in full or they fail
+in full.
+If the strncpy subroutine happens to copy from the last
+few bytes in memory, SIMD LOADs are the worst thing to use.
+We need a way to Conditionally terminate the LOAD and inform the
+Programmer, and this is where Load-Fault-First comes into play.
+
+However even this is not enough: once LOADed it is necessary to
+first spot the NUL character, and once identified to then begin
+copying NUL characters from that point onwards.
+
+\begin{verbatim}
+    for (i = 0; i < n && src[i] != '\0'; i++)
+        dest[i] = src[i];
+    for ( ; i < n; i++)
+        dest[i] = '\0';
+\end{verbatim}
+
+Performing such a conditional NUL-character search in a SIMD ISA
+is typically extremely convoluted. A usual approach would be
+to perform a Parallel compare against NUL (easy enough) followed
+by an instruction that then searches sequentially for the first
+fail, followed by another instruction that explicitly truncates
+the Vector Length, followed finally by the actual STORE.
+
+\textit{All of the sequential-search-and-truncate} is part of
+the Data-Dependent Fail-First Mode that is a first-order construct
+in SVP64. When applied to the \textbf{sv.cmpi} instruction,
+which produces a Vector of Condition Codes (as opposed to just
+one for the Scalar \textbf{cmpi} instruction),
+the search for the NUL character truncates the Vector Length
+at the required point, such that the next instruction (STORE)
+is already set up to copy up to and including the NUL
+(if one was indeed found).
+
 \begin{verbatim}
 mtspr 9, 3 # move r3 to CTR
 addi 0,0,0 # initialise r0 to zero
 # chr-copy loop starts here:
- # for (i = 0; i < n && src[i] != '\0'; i++)
+ # for (i=0; i