\documentclass[slidestop]{beamer}
\usepackage{beamerthemesplit}
\title{Data-Dependent-Fail-First}
\author{Luke Kenneth Casson Leighton and Shriya Sharma}
\huge{Libre-SOC Simple-V Specification \\
\Large{Data-Dependent Fail-First}\\
\large{Funded by NLnet NGI-ASSURE \\
EU grant agreement No 957073}\\
\begin{frame}[fragile]
\frametitle{Simple-V CMPI in a nutshell}
function op\_cmpi(BA, RA, SI) # cmpi not vector-cmpi!
  (assuming you know power-isa)
  int id=0, ira=0;
  for (i = 0; i < VL; i++)
    CR[BA+id] <= compare(ireg[RA+ira], SI);
    if (reg\_is\_vectorised[BA] ) \{ id += 1; \}
    if (reg\_is\_vectorised[RA]) \{ ira += 1; \}
\item Above is oversimplified: predication etc. left out
\item Scalar-scalar and scalar-vector and vector-vector now all in one
\item OoO may choose to push CMPIs into instr. queue (v. busy!)
\frame{\frametitle{Load/Store Fault-First}
\item Problem: vector load and store can cause a page fault
\item Solution: a protocol that allows optional load/store
\item instruction \textit{requests} a number of elements
\item instruction \textit{informs} the number actually loaded
\item first element load/store is not optional (cannot fail)
\item ARM SVE: https://arxiv.org/pdf/1803.06185.pdf
\item more: wikipedia Vector processor page: Fault/Fail First
\item Load/Store is Memory to/from Register, what about
      register-to-register?
\item Register-to-register: ``Data-Dependent Fail-First.''
\item Z80 LDIR: Mem-Register, CPIR: Register-Register
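The request/inform protocol above can be illustrated with a small software model. This is only a sketch under stated assumptions: memory is a Python dictionary, the set \texttt{mapped} is a hypothetical stand-in for page-table state, and the function name is invented for illustration. It is not how any particular ISA specifies the behaviour.

```python
def load_fault_first(memory, base, requested, mapped):
    """Toy model of a fault-first vector load.

    memory    -- dict of address -> value (hypothetical memory model)
    mapped    -- set of addresses that can be read without faulting
    requested -- number of elements the instruction asks for
    Returns (values, actual_vl): the elements loaded and how many there were.
    """
    values = []
    for i in range(requested):
        addr = base + i
        if addr not in mapped:
            if i == 0:
                # The first element is not optional: fault as a normal load would.
                raise MemoryError(f"page fault at {addr:#x}")
            # Later elements are optional: stop early instead of trapping.
            break
        values.append(memory[addr])
    return values, len(values)
```

Here a request for 8 elements that runs off the end of a mapped region simply returns fewer elements, and the caller continues from the reported count.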
\begin{frame}[fragile]
\frametitle{Data-Dependent-Fail-First in a nutshell}
function op\_cmpi(BA, RA, SI) # cmpi not vector-cmpi!
  int id=0, ira=0;
  for (i = 0; i < VL; i++)
    CR[BA+id] <= compare(ireg[RA+ira], SI);
    if test (CR[BA+id]) == FAIL: \{ VL = i + 1; break \}
    if (reg\_is\_vectorised[BA] ) \{ id += 1; \}
    if (reg\_is\_vectorised[RA]) \{ ira += 1; \}
\item Parallelism still perfectly possible
      (``hold'' writing results until sequential post-analysis
      carried out. Best done with OoO)
\item VL truncation can be inclusive or exclusive
      (include or exclude a NULL pointer or a
      string-end character, or overflow result)
\item \textit{Truncation can be to zero Vector Length}
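The loop and the inclusive/exclusive truncation can be tried out in a small Python model. It is a sketch, not the specification: a CR field is reduced to one of the strings lt, gt or eq, the destination is always treated as a vector, and the \texttt{fails} callback is a hypothetical stand-in for the fail-first test selected in the instruction.

```python
def compare(a, b):
    """Model a CR field as one of 'lt', 'gt', 'eq'."""
    return "lt" if a < b else "gt" if a > b else "eq"


def sv_cmpi_ddff(ireg, ra, si, vl, fails, ra_vectorised=True, inclusive=True):
    """Element-by-element compare-immediate with Data-Dependent Fail-First.

    fails(cr) decides whether an element's result triggers truncation.
    Returns (crs, new_vl): the CR fields written and the truncated VL.
    """
    crs = []
    ira = 0
    for i in range(vl):
        cr = compare(ireg[ra + ira], si)
        if fails(cr):
            if inclusive:
                crs.append(cr)       # keep the failing element
                return crs, i + 1
            return crs, i            # drop the failing element
        crs.append(cr)
        if ra_vectorised:
            ira += 1
    return crs, vl
```

With a failing first element and exclusive mode the returned Vector Length is zero, which is the ``truncation to zero'' case in the last bullet.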
\frame{\frametitle{Power ISA v3.1 vstribr}
\lstinputlisting[language={}]{vstribr.txt}
\item ironically this hard-coded instruction is
      identical to general-purpose Simple-V DD-FFirst...
\frame{\frametitle{Pospopcount}
\item Positional popcount counts, for each bit position, how many
      values in an array of inputs have that bit set to 1.
\item Notoriously difficult to do in SIMD assembler: typically 550 lines
\item https://github.com/clausecker/pospop - Full writeup: \\
      https://libre-soc.org/openpower/sv/cookbook/pospopcnt
\lstinputlisting[language={}]{pospopcount.c}
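As a cross-check of the definition, a plain reference version is easy to write in Python. This is a minimal sketch for 8-bit inputs, written independently of the listing above; the function name and the \texttt{width} parameter are illustrative choices.

```python
def pospopcount(values, width=8):
    """Return counts[b] = number of input values with bit b set."""
    counts = [0] * width
    for v in values:
        for bit in range(width):
            counts[bit] += (v >> bit) & 1
    return counts
```

For the three inputs 0x01, 0x03 and 0x81, bit 0 is set three times, bit 1 once and bit 7 once.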
\frame{\frametitle{Pospopcount}
\includegraphics[width=0.5\textwidth]{pospopcount.png}
\item The challenge is to perform an appropriate transpose of the data (the CPU can only work on registers, horizontally),
      in blocks that suit the processor and the ISA capacity.
\frame{\frametitle{Pospopcount}
\includegraphics[width=0.6\textwidth]{array_popcnt.png}
\item The draft gbbd instruction implements the transpose (shown above),
      preparing the data to use standard popcount.
      (gbbd is based on Power ISA vgbbd, v3.1 p445)
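Why a transpose turns the problem into ordinary popcounts can be seen in a short Python model. It treats eight bytes as an 8x8 bit matrix; the bit ordering used here is an assumption made for illustration and is not claimed to match the exact ordering of gbbd or vgbbd.

```python
def transpose8x8(block):
    """Transpose 8 bytes viewed as an 8x8 bit matrix.

    Bit i of out[b] is bit b of block[i] (ordering chosen for illustration).
    """
    out = [0] * 8
    for i, byte in enumerate(block):
        for b in range(8):
            out[b] |= ((byte >> b) & 1) << i
    return out


def pospopcount_via_transpose(values):
    """Positional popcount of 8-bit values, eight inputs at a time."""
    counts = [0] * 8
    for start in range(0, len(values), 8):
        block = list(values[start:start + 8])
        block += [0] * (8 - len(block))          # pad the final short block
        for b, row in enumerate(transpose8x8(block)):
            counts[b] += bin(row).count("1")     # standard popcount per row
    return counts
```

After the transpose each row holds one bit position from eight inputs, so a single standard popcount per row produces eight inputs' worth of one positional total.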
\frame{\frametitle{pospopcount assembler}
\lstinputlisting[language={}]{pospopcount.s}
\frame{\frametitle{strncpy}
\lstinputlisting[language={}]{strncpy.c}
\item two simple-looking for-loops,
      data-dependent in the first.
\item sv.cmpi stops at the first zero, /vli includes the zero
\item note the post-increment Load/Store: saves
\item a Vector of CRs is produced which then gets tested
      by the sv.bc/all instruction, counting down CTR
\item Power ISA added hard-coded data-dependent capacity
      into vstribr, whereas in SVP64 it is generic (applies
\item even the null-ing part is not straightforward as
      it could be mis-aligned compared to the VSX width.
\item end-result: assembler-optimised strncpy on Power
      ISA v3.0 is a whopping 240 instructions. SVP64 is 10
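The two-loop structure can be modelled in Python to show where the Vector Length truncation comes in. This is a rough sketch, not a translation of the assembler: it assumes \texttt{src} is NUL-terminated or at least \texttt{n} bytes long, works in chunks of a hypothetical \texttt{maxvl} bytes, and uses inclusive truncation so that the terminating zero is copied too.

```python
def strncpy_ddff(dst, src, n, maxvl=4):
    """Copy at most n bytes from src into dst (a bytearray), then zero-fill."""
    i = 0
    found_nul = False
    # First loop: data-dependent, stops at the first zero byte.
    while i < n and not found_nul:
        vl = min(maxvl, n - i)
        chunk = src[i:i + vl]                 # stands in for the vector load
        for k, byte in enumerate(chunk):
            if byte == 0:
                vl = k + 1                    # inclusive: keep the zero
                found_nul = True
                break
        dst[i:i + vl] = chunk[:vl]            # stands in for the vector store
        i += vl
    # Second loop: pad the remainder of the destination with zeros.
    while i < n:
        dst[i] = 0
        i += 1
    return dst
```

Copying the six bytes of a NUL-terminated ``hello'' into an 8-byte buffer takes two chunks of the first loop, the second one truncated to two elements, followed by two padding stores.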
\frame{\frametitle{strncpy assembler}
\lstinputlisting[language={}]{strncpy.s}
\frame{\frametitle{sv.lbz/ff=RC1/vli *16, 1(10)}
\includegraphics[width=0.6\textwidth]{lbz_ff_vli.png}
\item r10 points to memory address 0x001007
\item sv.lbz (Power ISA load byte immediate) multiplies immediate
      offset by element step index, to get Effective Address (EA)
\item LD/ST has no Rc=1 so Data-Dependent Fail-First specified
      as ``ff=RC1''. Not LD/ST Fault First! vli: VL inclusive
\item Test done after each load. Fails at Memory contents
      0x001009. Inclusive Mode: VL is truncated to 5 (FIVE) not 4
\frame{\frametitle{linked-list walking}
\frame{\frametitle{sv.ld/ff=RC1/vli *17, 8(*16)}
\includegraphics[width=1.0\textwidth]{linked_list_dd.png}
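One way to read the instruction in the frame title is as pointer-chasing: each element loads from the address held in the previous element's destination register. The Python model below follows that reading and should be taken as an assumption-laden sketch. It assumes the next-pointer lives at byte offset 8 in each node, that the fail-first test triggers on a loaded value of zero (NULL), and that memory is a dictionary keyed by address.

```python
def walk_linked_list(mem, head, maxvl=8, next_offset=8):
    """Follow next-pointers until NULL or maxvl loads, whichever comes first.

    regs[0] stands in for r16; regs[i + 1] receives the value loaded from
    regs[i] + next_offset, mirroring overlapping source/destination vectors.
    Returns (regs, vl) with inclusive truncation (the NULL is kept).
    """
    regs = [head]
    vl = maxvl
    for i in range(maxvl):
        nxt = mem[regs[i] + next_offset]
        regs.append(nxt)
        if nxt == 0:
            vl = i + 1          # inclusive: count the load that returned NULL
            break
    return regs[:vl + 1], vl
```

For a three-node list the returned Vector Length is 3 and the final register holds the NULL that ended the walk.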
\frame{\frametitle{maxloc}
\lstinputlisting[language={}]{maxloc.py}
\item FORTRAN MAXLOC: find the index of the largest number.
      Notoriously difficult to implement optimally for SIMD
\item algorithms include \textit{depth-first} recursive
      descent (!) mapreduce-style, offsetting the
      locally-computed largest index (plus value) which
      are then tested in upper level(s)
\item SVP64: note below the sv.cmp (first while-loop),
      sv.minmax. (second while-loop) and the sv.crnand which
      by Predicate masking is a 3-in 1-out CR op,
      not the usual 2-in 1-out
\item There is however quite a bit of ``housekeeping''.
      https://libre-soc.org/openpower/sv/cookbook/fortran\_maxloc
\frame{\frametitle{maxloc assembler}
\lstinputlisting[language={}]{maxloc.s}
\frame{\frametitle{Summary}
\item SIMD fundamentally assumes element independence.
\item No provision in SIMD ISAs or Architectures for
      inter-element inter-dependence, let alone sequential
\item Simple-V adds features such as Data-Dependent
      Fail-First as \textit{general concepts},
      exploiting Condition Registers (Vectorised)
\item Hardware Parallelism is \textit{still possible}
      by exploiting the standard capabilities of
      Speculative Execution: produce results, hold
      off writing, post-analyse and cancel the results
      that should not be written. Uses \textit{existing}
      standard OoO Micro-architecture
\item Huge simplification of algorithms, huge ``compactification''
      just like Zilog Z80 and Intel 8086, yet still parallel
\item compact deep-expressive assembler brings CISC
      capability but RISC-RISC (Prefix-Suffix). SIMD remains
      at the \textit{back-end in hardware} where it belongs.
      Not exposed to the programmer.
{\Huge The end\vspace{12pt}\\
Thank you\vspace{12pt}
\item Discussion: http://lists.libre-soc.org
\item OFTC.net IRC \#libre-soc
\item http://libre-soc.org/
\item https://nlnet.nl/project/Libre-SOC-OpenPOWER-ISA
\item https://bugs.libre-soc.org/show\_bug.cgi?id=676
\item https://bugs.libre-soc.org/show\_bug.cgi?id=1244
\item https://libre-soc.org/openpower/sv/cookbook/fortran\_maxloc
\item https://libre-soc.org/nlnet/\#faq