<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
[ ]>

<chapter id="manual.ext.profile_mode" xreflabel="Profile Mode">
<?dbhtml filename="profile_mode.html"?>

<chapterinfo>
<keywordset>
<keyword>C++</keyword>
<keyword>library</keyword>
<keyword>profile</keyword>
</keywordset>
</chapterinfo>

<title>Profile Mode</title>

<sect1 id="manual.ext.profile_mode.intro" xreflabel="Intro">
<title>Intro</title>
<para>
<emphasis>Goal: </emphasis>Give performance improvement advice based on
recognition of suboptimal usage patterns of the standard library.
</para>

<para>
<emphasis>Method: </emphasis>Wrap the standard library code. Insert
calls to an instrumentation library to record the internal state of
various components at interesting entry/exit points to/from the standard
library. Process the trace, recognize suboptimal patterns, and give advice.
For details, see the
<ulink url="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
CGO 2009</ulink>.
</para>
<para>
<emphasis>Strengths: </emphasis>
<itemizedlist>
<listitem><para>
Non-intrusive solution. The application code does not require any
modification.
</para></listitem>
<listitem><para> The advice is call context sensitive, so it can
pinpoint interesting dynamic performance behavior precisely.
</para></listitem>
<listitem><para>
The overhead model is pay-per-use. When you turn off a diagnostic class
at compile time, its overhead disappears.
</para></listitem>
</itemizedlist>
</para>
<para>
<emphasis>Drawbacks: </emphasis>
<itemizedlist>
<listitem><para>
You must recompile the application code with custom options.
</para></listitem>
<listitem><para>You must run the application on representative input.
The advice is input dependent.
</para></listitem>
<listitem><para>
The execution time will increase, in some cases by large factors.
</para></listitem>
</itemizedlist>
</para>


<sect2 id="manual.ext.profile_mode.using" xreflabel="Using">
<title>Using the Profile Mode</title>

<para>
This is the anticipated common workflow for program <code>foo.cc</code>:
<programlisting>
$ cat foo.cc
#include &lt;vector&gt;
int main() {
  std::vector&lt;int&gt; v;
  for (int k = 0; k &lt; 1024; ++k) v.insert(v.begin(), k);
}

$ g++ -D_GLIBCXX_PROFILE foo.cc
$ ./a.out
$ cat libstdcxx-profile.txt
vector-to-list: improvement = 5: call stack = 0x804842c ...
    : advice = change std::vector to std::list
vector-size: improvement = 3: call stack = 0x804842c ...
    : advice = change initial container size from 0 to 1024
</programlisting>
</para>
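<para>
Applying both pieces of advice to <code>foo.cc</code> could look as
follows. This is an illustrative sketch, not output of the tool; the
counts simply mirror the example above.
</para>
<programlisting>
#include &lt;list&gt;
#include &lt;vector&gt;

int main() {
  // "vector-to-list": std::list inserts at the front in constant time,
  // while vector::insert(begin, ...) shifts all existing elements.
  std::list&lt;int&gt; l;
  for (int k = 0; k &lt; 1024; ++k) l.insert(l.begin(), k);

  // "vector-size": reserving the final size up front avoids the
  // repeated reallocation caused by growing from capacity 0.
  std::vector&lt;int&gt; v;
  v.reserve(1024);
  for (int k = 0; k &lt; 1024; ++k) v.push_back(k);
}
</programlisting>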

<para>
Anatomy of a warning:
<itemizedlist>
<listitem>
<para>
Warning id. This is a short descriptive string for the class
that this warning belongs to, e.g., "vector-to-list".
</para>
</listitem>
<listitem>
<para>
Estimated improvement. This is an approximation of the benefit expected
from implementing the change suggested by the warning. It is given on
a log10 scale. Negative values mean that the alternative would actually
do worse than the current choice.
In the example above, 5 comes from the fact that the overhead of
inserting at the beginning of a vector vs. a list is around
1024 * 1024 / 2, which is on the order of 10^5. The improvement from
setting the initial size to 1024 is in the range of 10^3, since the
overhead of dynamic resizing is linear in this case.
</para>
</listitem>
<listitem>
<para>
Call stack. Currently, the addresses are printed without
symbol name or code location attribution.
Users are expected to postprocess the output using, for instance, addr2line.
</para>
</listitem>
<listitem>
<para>
The warning message. For some warnings, this is static text, e.g.,
"change vector to list". For other warnings, such as the one above,
the message contains numeric advice, e.g., the suggested initial size
of the vector.
</para>
</listitem>
</itemizedlist>
</para>
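<para>
The log10 grades above can be reproduced with a short computation.
This is an illustrative sketch of the arithmetic only, not the
library's actual estimation code.
</para>
<programlisting>
#include &lt;cassert&gt;
#include &lt;cmath&gt;

int main() {
  // Inserting k = 1024 elements at the front of a vector shifts about
  // 0 + 1 + ... + 1023 = 1024 * 1023 / 2 elements in total.
  const double shifted = 1024.0 * 1023.0 / 2.0;   // roughly 5.2 * 10^5
  // A list performs the same insertions in about 1024 constant-time
  // steps, so the advantage is ~shifted, reported on a log10 scale.
  const int improvement = static_cast&lt;int&gt;(std::log10(shifted));
  assert(improvement == 5);   // matches "improvement = 5" in the report
}
</programlisting>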

<para>Three files are generated. <code>libstdcxx-profile.txt</code>
contains human readable advice. <code>libstdcxx-profile.raw</code>
contains implementation specific data about each diagnostic.
Its format is not documented, but it is sufficient to generate
all the advice given in <code>libstdcxx-profile.txt</code>. The advantage
of keeping this raw format is that traces from multiple executions can
be aggregated simply by concatenating the raw traces. We intend to
offer an external utility program that can issue advice from a trace.
<code>libstdcxx-profile.conf.out</code> lists the actual diagnostic
parameters used. To alter parameters, edit this file and rename it to
<code>libstdcxx-profile.conf</code>.
</para>

<para>Advice is given regardless of whether the transformation is valid.
For instance, we advise changing a map to an unordered_map even if the
application semantics require that data be ordered.
We believe such warnings can help users understand the performance
behavior of their application better, which can lead to changes
at a higher abstraction level.
</para>

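<para>
For example, the following (illustrative) code relies on the ordering
that <code>std::map</code> guarantees; applying the suggested
<code>unordered_map</code> substitution would change its observable
behavior.
</para>
<programlisting>
#include &lt;iostream&gt;
#include &lt;map&gt;

int main() {
  std::map&lt;int, char&gt; m = {{3, 'c'}, {1, 'a'}, {2, 'b'}};
  // A std::map iterates in ascending key order, printing "abc".
  // std::unordered_map makes no ordering guarantee, so the advice
  // only applies if the application does not depend on this order.
  for (const auto&amp; p : m)
    std::cout &lt;&lt; p.second;
  std::cout &lt;&lt; '\n';
}
</programlisting>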
</sect2>

<sect2 id="manual.ext.profile_mode.tuning" xreflabel="Tuning">
<title>Tuning the Profile Mode</title>

<para>Compile time switches and environment variables (see also file
profiler.h). Unless specified otherwise, they can be set at compile time
using <code>-D&lt;name&gt;</code> or by setting variable &lt;name&gt;
in the environment where the program is run, before starting execution.
<itemizedlist>
<listitem><para>
<code>_GLIBCXX_PROFILE_NO_&lt;diagnostic&gt;</code>:
disable specific diagnostics.
See section Diagnostics for possible values.
(Environment variables not supported.)
</para></listitem>
<listitem><para>
<code>_GLIBCXX_PROFILE_TRACE_PATH_ROOT</code>: set an alternative root
path for the output files.
</para></listitem>
<listitem><para><code>_GLIBCXX_PROFILE_MAX_WARN_COUNT</code>: set it to
the maximum number of warnings desired. The default value is
10.</para></listitem>
<listitem><para>
<code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>: if set to 0,
the advice will
be collected and reported for the program as a whole, and not for each
call context.
This could also be used in continuous regression tests, where you
just need to know whether there is a regression or not.
The default value is 32.
</para></listitem>
<listitem><para>
<code>_GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC</code>:
set a limit on how much memory to use for the accounting tables for each
diagnostic type. When this limit is reached, new events are ignored
until the memory usage drops below the limit. Generally, this means
that newly created containers will not be instrumented until some
live containers are deleted. The default is 128 MB.
</para></listitem>
<listitem><para>
<code>_GLIBCXX_PROFILE_NO_THREADS</code>:
Make the library not use threads. If thread local storage (TLS) is not
available, you will get a preprocessor error asking you to set
<code>-D_GLIBCXX_PROFILE_NO_THREADS</code> if your program is
single-threaded. Multithreaded execution without TLS is not supported.
(Environment variable not supported.)
</para></listitem>
<listitem><para>
<code>_GLIBCXX_HAVE_EXECINFO_H</code>:
This name should be defined automatically at library configuration time.
If your library was configured without <code>execinfo.h</code>, but
you have it in your include path, you can define it explicitly. Without
it, advice is collected for the program as a whole, and not for each
call context.
(Environment variable not supported.)
</para></listitem>
</itemizedlist>
</para>

</sect2>

</sect1>


<sect1 id="manual.ext.profile_mode.design" xreflabel="Design">
<title>Design</title>

<para>
</para>
<table frame='all'>
<title>Code Location</title>
<tgroup cols='2' align='left' colsep='1' rowsep='1'>
<colspec colname='c1'></colspec>
<colspec colname='c2'></colspec>

<thead>
<row>
<entry>Code Location</entry>
<entry>Use</entry>
</row>
</thead>
<tbody>
<row>
<entry><code>libstdc++-v3/include/std/*</code></entry>
<entry>Preprocessor code to redirect to profile extension headers.</entry>
</row>
<row>
<entry><code>libstdc++-v3/include/profile/*</code></entry>
<entry>Profile extension public headers (map, vector, ...).</entry>
</row>
<row>
<entry><code>libstdc++-v3/include/profile/impl/*</code></entry>
<entry>Profile extension internals. Implementation files are
only included from <code>impl/profiler.h</code>, which is the only
file included from the public headers.</entry>
</row>
</tbody>
</tgroup>
</table>

<para>
</para>


<sect2 id="manual.ext.profile_mode.design.wrapper"
 xreflabel="Wrapper">
<title>Wrapper Model</title>
<para>
In order to get our instrumented library version included instead of the
release one,
we use the same wrapper model as the debug mode.
We subclass entities from the release version. Wherever
<code>_GLIBCXX_PROFILE</code> is defined, the release namespace is
<code>std::__norm</code>, whereas the profile namespace is
<code>std::__profile</code>. Using plain <code>std</code> translates
into <code>std::__profile</code>.
</para>
<para>
Whenever possible, we try to wrap at the public interface level, e.g.,
in <code>unordered_set</code> rather than in <code>hashtable</code>,
in order not to depend on the implementation.
</para>
<para>
Mixing object files built with and without the profile mode must
not affect the program execution. However, there is no guarantee of
diagnostic accuracy if even a single object file is not built with
<code>-D_GLIBCXX_PROFILE</code>.
Currently, mixing the profile mode with the debug and parallel extensions
is not allowed. Mixing them at compile time will result in preprocessor
errors. Mixing them at link time is undefined.
</para>
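<para>
The wrapper model can be sketched as follows. The namespace is renamed
here, since reopening <code>std</code> is reserved to the implementation
(the real wrappers live in namespace <code>std</code>), and the member
shown is illustrative.
</para>
<programlisting>
#include &lt;cassert&gt;

namespace lib {
  namespace __norm {                  // release implementation
    struct vector { int size() const { return 0; } };
  }
  namespace __profile {               // instrumented wrapper
    struct vector : __norm::vector {
      int size() const {
        // ... call an instrumentation hook here, then forward ...
        return __norm::vector::size();
      }
    };
  }
#ifdef _GLIBCXX_PROFILE
  using __profile::vector;            // plain lib::vector is instrumented
#else
  using __norm::vector;
#endif
}

int main() {
  lib::vector v;                      // resolves per _GLIBCXX_PROFILE
  assert(v.size() == 0);
}
</programlisting>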
</sect2>


<sect2 id="manual.ext.profile_mode.design.instrumentation"
 xreflabel="Instrumentation">
<title>Instrumentation</title>
<para>
Instead of instrumenting every public entry and exit point,
we chose to add instrumentation on demand, as needed
by individual diagnostics.
The main reason is that some diagnostics require us to extract bits of
internal state that are particular only to that diagnostic.
We plan to formalize this later, after we learn more about the requirements
of several diagnostics.
</para>
<para>
All the instrumentation points can be switched on and off using
<code>-D[_NO]_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code> options.
With all the instrumentation calls off, there should be negligible
overhead over the release version. This property is needed to support
diagnostics based on timing of internal operations. For such diagnostics,
we anticipate turning most of the instrumentation off in order to prevent
profiling overhead from polluting time measurements, and thus diagnostics.
</para>
<para>
All the instrumentation on/off compile time switches live in
<code>include/profile/profiler.h</code>.
</para>
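<para>
The on/off mechanism can be sketched with the preprocessor. The
diagnostic "magic" and all names below are hypothetical, modeled on the
library's conventions rather than taken from it.
</para>
<programlisting>
#include &lt;cassert&gt;

static int __magic_events = 0;
static void __trace_magic_event(int) { ++__magic_events; }

#if defined(_GLIBCXX_PROFILE_MAGIC) &amp;&amp; !defined(_GLIBCXX_PROFILE_NO_MAGIC)
# define __profcxx_magic_event(x) __trace_magic_event(x)
#else
# define __profcxx_magic_event(x)   // expands to nothing: zero overhead
#endif

int main() {
  (void)&amp;__trace_magic_event;       // silence unused warning when off
  __profcxx_magic_event(42);        // a no-op unless the switch is on
#if defined(_GLIBCXX_PROFILE_MAGIC) &amp;&amp; !defined(_GLIBCXX_PROFILE_NO_MAGIC)
  assert(__magic_events == 1);
#else
  assert(__magic_events == 0);
#endif
}
</programlisting>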
</sect2>


<sect2 id="manual.ext.profile_mode.design.rtlib"
 xreflabel="Run Time Behavior">
<title>Run Time Behavior</title>
<para>
For practical reasons, the instrumentation library processes the trace
partially
rather than dumping it to disk in raw form. Each event is processed when
it occurs. A cost is usually attached to it, and it is aggregated into
the database of a specific diagnostic class. The cost model
is based largely on the standard performance guarantees, but in some
cases we use knowledge about GCC's standard library implementation.
</para>
<para>
Information is indexed by (1) call stack and (2) instance id or address
to be able to understand and summarize precise creation-use-destruction
dynamic chains. Although the analysis is sensitive to dynamic instances,
the reports are only sensitive to call context. Whenever a dynamic instance
is destroyed, we accumulate its effect into the corresponding entry for the
call stack of its constructor location.
</para>
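<para>
The aggregation scheme just described can be sketched as follows.
The types and names are hypothetical; the real tables differ.
</para>
<programlisting>
#include &lt;cassert&gt;
#include &lt;map&gt;
#include &lt;vector&gt;

// Per-instance records are folded, at destruction time, into a table
// keyed by the call stack of the instance's constructor.
using stack_t = std::vector&lt;void*&gt;;     // raw return addresses

struct instance_info { stack_t ctor_stack; long cost = 0; };

std::map&lt;stack_t, long&gt; per_stack_cost; // the report is per call context

void on_destruction(const instance_info&amp; info) {
  per_stack_cost[info.ctor_stack] += info.cost; // accumulate into context
}

int main() {
  stack_t ctx = {reinterpret_cast&lt;void*&gt;(0x804842c)};
  instance_info a{ctx, 10}, b{ctx, 32};  // two instances, same context
  on_destruction(a);
  on_destruction(b);
  assert(per_stack_cost[ctx] == 42);     // summarized by call stack
}
</programlisting>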

<para>
For details, see the
<ulink url="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
CGO 2009</ulink>.
</para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.analysis"
 xreflabel="Analysis and Diagnostics">
<title>Analysis and Diagnostics</title>
<para>
Final analysis takes place offline, and it is based entirely on the
generated trace and debugging info in the application binary.
See section Diagnostics for a list of analysis types that we plan to support.
</para>
<para>
The input to the analysis is a table indexed by profile type and call stack.
The data type for each entry depends on the profile type.
</para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.cost-model"
 xreflabel="Cost Model">
<title>Cost Model</title>
<para>
While it is likely that cost models will become complex as we get into
more sophisticated analysis, we will try to follow a simple set of rules
at the beginning.
</para>
<itemizedlist>
<listitem><para><emphasis>Relative benefit estimation:</emphasis>
The idea is to estimate or measure the cost of all operations
in the original scenario versus the scenario we advise switching to.
For instance, when advising to change a vector to a list, an occurrence
of the <code>insert</code> method will generally count as a benefit.
Its magnitude depends on (1) the number of elements that get shifted
and (2) whether it triggers a reallocation.
</para></listitem>
<listitem><para><emphasis>Synthetic measurements:</emphasis>
We will measure the relative difference between similar operations on
different containers. We plan to write a battery of small tests that
compare the times of the executions of similar methods on different
containers. The idea is to run these tests on the target machine.
If this training phase is very quick, we may decide to perform it at
library initialization time. The results can be cached on disk and reused
across runs.
</para></listitem>
<listitem><para><emphasis>Timers:</emphasis>
We plan to use timers for operations of larger granularity, such as sort.
For instance, we can switch between different sort methods on the fly
and report the one that performs best for each call context.
</para></listitem>
<listitem><para><emphasis>Show stoppers:</emphasis>
We may decide that the presence of an operation nullifies the advice.
For instance, when considering switching from <code>set</code> to
<code>unordered_set</code>, if we detect use of operator <code>++</code>,
we will simply not issue the advice, since this could signal that the use
case requires a sorted container.</para></listitem>
</itemizedlist>
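<para>
The relative benefit rule can be sketched numerically. The cost
functions below are a deliberate simplification (reallocation is
ignored), not the library's model.
</para>
<programlisting>
#include &lt;cassert&gt;

// Elements shifted by vector::insert at position pos in a vector of
// the given size, versus the constant work done by list::insert.
long vector_insert_cost(long size, long pos) { return size - pos; }
long list_insert_cost() { return 1; }

int main() {
  long vec = 0, lst = 0;
  for (long k = 0; k &lt; 1024; ++k) {  // always insert at the front
    vec += vector_insert_cost(k, 0);
    lst += list_insert_cost();
  }
  assert(vec == 1024L * 1023 / 2);   // quadratic total for the vector
  assert(lst == 1024);               // linear total for the list
}
</programlisting>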

</sect2>


<sect2 id="manual.ext.profile_mode.design.reports"
 xreflabel="Reports">
<title>Reports</title>
<para>
There are two types of reports. First, if we recognize a pattern for which
we have a substitute that is likely to give better performance, we print
the advice and estimated performance gain. The advice is usually associated
with a code position and possibly a call stack.
</para>
<para>
Second, we report performance characteristics for which we do not have
a clear solution for improvement. For instance, we can point the user to
the top 10 <code>multimap</code> locations
which have the worst data locality in actual traversals.
Although this does not offer a solution,
it helps the user focus on the key problems and ignore the uninteresting
ones.
</para>
</sect2>


<sect2 id="manual.ext.profile_mode.design.testing"
 xreflabel="Testing">
<title>Testing</title>
<para>
First, we want to make sure we preserve the behavior of the release mode.
You can just type <code>"make check-profile"</code>, which
builds and runs the whole test suite in profile mode.
</para>
<para>
Second, we want to test the correctness of each diagnostic.
We created a <code>profile</code> directory in the test suite.
Each diagnostic must come with at least two tests, one for false positives
and one for false negatives.
</para>
</sect2>

</sect1>

<sect1 id="manual.ext.profile_mode.api"
 xreflabel="API">
<title>Extensions for Custom Containers</title>

<para>
Many large projects use their own data structures instead of the ones in the
standard library. If these data structures are similar in functionality
to the standard library containers, they can be instrumented with the same
hooks that are used to instrument the standard library.
The instrumentation API is exposed in file
<code>profiler.h</code> (look for "Instrumentation hooks").
</para>

</sect1>


<sect1 id="manual.ext.profile_mode.cost_model"
 xreflabel="Cost Model">
<title>Empirical Cost Model</title>

<para>
Currently, the cost model uses formulas with predefined relative weights
for alternative containers or container implementations. For instance,
iterating through a vector is X times faster than iterating through a list.
</para>
<para>
(Under development.)
We are working on customizing this to a particular machine by providing
an automated way to compute the actual relative weights for operations
on the given machine.
</para>
<para>
(Under development.)
We plan to provide a performance parameter database format that can be
filled in either by hand or by an automated training mechanism.
The analysis module will then use this database instead of the built-in
generic parameters.
</para>

</sect1>


<sect1 id="manual.ext.profile_mode.implementation"
 xreflabel="Implementation">
<title>Implementation Issues</title>


<sect2 id="manual.ext.profile_mode.implementation.stack"
 xreflabel="Stack Traces">
<title>Stack Traces</title>
<para>
Accurate stack traces are needed during profiling since we group events by
call context and dynamic instance. Without accurate traces, diagnostics
may be hard to interpret. For instance, when giving advice to the user
it is imperative to reference application code, not library code.
</para>
<para>
Currently we are using the libc <code>backtrace</code> routine to get
stack traces.
<code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code> can be set
to 0 if you are willing to give up call context information, or to a small
positive value to reduce run time overhead.
</para>
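<para>
A minimal sketch of taking a bounded-depth trace with the libc routine
(assumes glibc's <code>execinfo.h</code>; the depth value is
illustrative):
</para>
<programlisting>
#include &lt;cassert&gt;
#include &lt;execinfo.h&gt;   // glibc backtrace(); not available everywhere

int main() {
  void* stack[32];
  int depth = backtrace(stack, 32); // fills up to 32 return addresses
  assert(depth &gt; 0);             // at least main's frame is present
  // The raw addresses in 'stack' are what the profile reports print;
  // addr2line can map them back to file:line afterwards.
}
</programlisting>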
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.symbols"
 xreflabel="Symbolization">
<title>Symbolization of Instruction Addresses</title>
<para>
The profiling and analysis phases use only instruction addresses.
An external utility such as addr2line is needed to postprocess the result.
We do not plan to add symbolization support in the profile extension.
This would require access to symbol tables, debug information tables,
external programs or libraries and other system dependent information.
</para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.concurrency"
 xreflabel="Concurrency">
<title>Concurrency</title>
<para>
Our current model is simplistic, but precise.
We cannot afford to approximate because some of our diagnostics require
precise matching of operations to container instance and call context.
During profiling, we keep a single information table per diagnostic.
There is a single lock per information table.
</para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.stdlib-in-proflib"
 xreflabel="Using the Standard Library in the Runtime Library">
<title>Using the Standard Library in the Instrumentation Implementation</title>
<para>
As much as we would like to avoid uses of libstdc++ within our
instrumentation library, containers such as unordered_map are very
appealing. We plan to use them as long as they are named properly
to avoid ambiguity.
</para>
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.malloc-hooks"
 xreflabel="Malloc Hooks">
<title>Malloc Hooks</title>
<para>
User applications/libraries can provide malloc hooks.
When the implementation of the malloc hooks uses libstdc++, there can
be an infinite cycle between the profile mode instrumentation and the
malloc hook code.
</para>
<para>
We protect against reentrance to the profile mode instrumentation code,
which should avoid this problem in most cases.
The protection mechanism is thread safe and exception safe.
This mechanism does not prevent reentrance to the malloc hook itself,
which could still result in deadlock, if, for instance, the malloc hook
uses non-recursive locks.
XXX: A definitive solution to this problem would be for the profile
extension to use a custom allocator internally, and perhaps not to use
libstdc++.
</para>
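<para>
A thread-safe, exception-safe reentrance guard can be sketched as
follows. This is an illustration of the idea only; the guard in the
instrumentation library differs.
</para>
<programlisting>
#include &lt;cassert&gt;

thread_local bool __inside_profiler = false;

struct reentrance_guard {
  bool entered;
  reentrance_guard() : entered(!__inside_profiler) {
    if (entered) __inside_profiler = true;
  }
  ~reentrance_guard() {                 // runs even if an exception unwinds
    if (entered) __inside_profiler = false;
  }
};

static int events = 0;

void profile_event() {
  reentrance_guard g;
  if (!g.entered) return;   // reentered, e.g. via a malloc hook: bail out
  ++events;
  profile_event();          // simulated reentrant call, suppressed above
}

int main() {
  profile_event();
  assert(events == 1);      // the nested call was ignored
}
</programlisting>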
</sect2>


<sect2 id="manual.ext.profile_mode.implementation.construction-destruction"
 xreflabel="Construction and Destruction of Global Objects">
<title>Construction and Destruction of Global Objects</title>
<para>
The profiling library state is initialized at the first call to a profiling
method. This allows us to record the construction of all global objects.
However, we cannot do the same at destruction time. The trace is written
by a function registered by <code>atexit</code>, thus invoked by
<code>exit</code>.
</para>
</sect2>

</sect1>


<sect1 id="manual.ext.profile_mode.developer"
 xreflabel="Developer Information">
<title>Developer Information</title>

<sect2 id="manual.ext.profile_mode.developer.bigpic"
 xreflabel="Big Picture">
<title>Big Picture</title>

<para>The profile mode headers are included with
<code>-D_GLIBCXX_PROFILE</code> through preprocessor directives in
<code>include/std/*</code>.
</para>

<para>Instrumented implementations are provided in
<code>include/profile/*</code>. All instrumentation hooks are macros
defined in <code>include/profile/profiler.h</code>.
</para>

<para>All the implementation of the instrumentation hooks is in
<code>include/profile/impl/*</code>. Although all the code gets included,
and is thus publicly visible, only a small number of functions are called
from outside this directory. All calls to hook implementations must be
done through macros defined in <code>profiler.h</code>. The macro
must ensure (1) that the call is guarded against reentrance and
(2) that the call can be turned off at compile time using a
<code>-D_GLIBCXX_PROFILE_...</code> compiler option.
</para>

</sect2>

<sect2 id="manual.ext.profile_mode.developer.howto"
 xreflabel="How To Add A Diagnostic">
<title>How To Add A Diagnostic</title>

<para>Let's say the diagnostic name is "magic".
</para>

<para>If you need to instrument a header not already under
<code>include/profile/*</code>, first edit the corresponding header
under <code>include/std/</code> and add a preprocessor directive such
as the one in <code>include/std/vector</code>:
<programlisting>
#ifdef _GLIBCXX_PROFILE
# include &lt;profile/vector&gt;
#endif
</programlisting>
</para>

<para>If the file you need to instrument is not yet under
<code>include/profile/</code>, make a copy of the one in
<code>include/debug</code>, or the main implementation.
You'll need to include the main implementation and inherit the classes
you want to instrument. Then define the methods you want to instrument,
define the instrumentation hooks and add calls to them.
Look at <code>include/profile/vector</code> for an example.
</para>

<para>Add macros for the instrumentation hooks in
<code>include/profile/impl/profiler.h</code>.
Hook names must start with <code>__profcxx_</code>.
Make sure they expand
to no code with <code>-D_NO_GLIBCXX_PROFILE_MAGIC</code>.
Make sure all calls to any method in namespace <code>__gnu_profile</code>
are protected against reentrance using macro
<code>_GLIBCXX_PROFILE_REENTRANCE_GUARD</code>.
All names of methods in namespace <code>__gnu_profile</code> called from
<code>profiler.h</code> must start with <code>__trace_magic_</code>.
</para>

<para>Add the implementation of the diagnostic.
<itemizedlist>
<listitem><para>
Create a new file <code>include/profile/impl/profiler_magic.h</code>.
</para></listitem>
<listitem><para>
Define class <code>__magic_info: public __object_info_base</code>.
This is the representation of a line in the object table.
The <code>__merge</code> method is used to aggregate information
across all dynamic instances created at the same call context.
The <code>__magnitude</code> method must return the estimation of the
benefit as a number of small operations, e.g., number of words copied.
The <code>__write</code> method is used to produce the raw trace.
The <code>__advice</code> method is used to produce the advice string.
</para></listitem>
<listitem><para>
Define class <code>__magic_stack_info: public __magic_info</code>.
This defines the content of a line in the stack table.
</para></listitem>
<listitem><para>
Define class <code>__trace_magic: public __trace_base&lt;__magic_info,
__magic_stack_info&gt;</code>.
It defines the content of the trace associated with this diagnostic.
</para></listitem>
</itemizedlist>
</para>
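<para>
A skeleton of the classes described above might look as follows. The
method signatures and the stub base class are illustrative; the real
<code>__object_info_base</code> and <code>__trace_base</code> live under
<code>include/profile/impl/</code>.
</para>
<programlisting>
#include &lt;cstdio&gt;
#include &lt;string&gt;

struct __object_info_base { /* call stack, validity flag, ... */ };

struct __magic_info : public __object_info_base {
  long __count = 0;                      // one line in the object table
  void __merge(const __magic_info&amp; __o) { __count += __o.__count; }
  long __magnitude() const { return __count; }  // benefit in small ops
  void __write(FILE* __f) const { std::fprintf(__f, "%ld", __count); }
  std::string __advice() const { return "do the magic thing"; }
};

struct __magic_stack_info : public __magic_info {
  // one line in the stack table
};

// The trace type would then derive from the (real) trace base:
//   class __trace_magic
//       : public __trace_base&lt;__magic_info, __magic_stack_info&gt; { };

int main() {
  __magic_info a, b;
  a.__count = 2;
  b.__count = 3;
  a.__merge(b);                          // aggregate across instances
  return a.__magnitude() == 5 ? 0 : 1;
}
</programlisting>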

<para>Add initialization and reporting calls in
<code>include/profile/impl/profiler_trace.h</code>. Use
<code>__trace_vector_to_list</code> as an example.
</para>

<para>Add documentation in file <code>doc/xml/manual/profile_mode.xml</code>.
</para>
</sect2>
</sect1>

<sect1 id="manual.ext.profile_mode.diagnostics">
<title>Diagnostics</title>

<para>
The table below presents all the diagnostics we intend to implement.
Each diagnostic has a corresponding compile time switch
<code>-D_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
Groups of related diagnostics can be turned on with a single switch.
For instance, <code>-D_GLIBCXX_PROFILE_LOCALITY</code> is equivalent to
<code>-D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
-D_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
</para>

<para>
The benefit, cost, expected frequency and accuracy of each diagnostic
was given a grade from 1 to 10, where 10 is highest.
A high benefit means that, if the diagnostic is accurate, the expected
performance improvement is high.
A high cost means that turning this diagnostic on leads to high slowdown.
A high frequency means that we expect this to occur relatively often.
A high accuracy means that the diagnostic is unlikely to be wrong.
These grades are not perfect. They are just meant to guide users with
specific needs or time budgets.
</para>
<table frame='all'>
<title>Diagnostics</title>
<tgroup cols='7' align='left' colsep='1' rowsep='1'>
<colspec colname='c1'></colspec>
<colspec colname='c2'></colspec>
<colspec colname='c3'></colspec>
<colspec colname='c4'></colspec>
<colspec colname='c5'></colspec>
<colspec colname='c6'></colspec>
<colspec colname='c7'></colspec>

<thead>
<row>
<entry>Group</entry>
<entry>Flag</entry>
<entry>Benefit</entry>
<entry>Cost</entry>
<entry>Freq.</entry>
<entry>Accuracy</entry>
<entry>Implemented</entry>
</row>
</thead>
<tbody>
<row>
<entry><ulink url="#manual.ext.profile_mode.analysis.containers">
CONTAINERS</ulink></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_too_small">
HASHTABLE_TOO_SMALL</ulink></entry>
<entry>10</entry>
<entry>1</entry>
<entry></entry>
<entry>10</entry>
<entry>yes</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_too_large">
HASHTABLE_TOO_LARGE</ulink></entry>
<entry>5</entry>
<entry>1</entry>
<entry></entry>
<entry>10</entry>
<entry>yes</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.inefficient_hash">
INEFFICIENT_HASH</ulink></entry>
<entry>7</entry>
<entry>3</entry>
<entry></entry>
<entry>10</entry>
<entry>yes</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.vector_too_small">
VECTOR_TOO_SMALL</ulink></entry>
<entry>8</entry>
<entry>1</entry>
<entry></entry>
<entry>10</entry>
<entry>yes</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.vector_too_large">
VECTOR_TOO_LARGE</ulink></entry>
<entry>5</entry>
<entry>1</entry>
<entry></entry>
<entry>10</entry>
<entry>yes</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.vector_to_hashtable">
VECTOR_TO_HASHTABLE</ulink></entry>
<entry>7</entry>
<entry>7</entry>
<entry></entry>
<entry>10</entry>
<entry>no</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.hashtable_to_vector">
HASHTABLE_TO_VECTOR</ulink></entry>
<entry>7</entry>
<entry>7</entry>
<entry></entry>
<entry>10</entry>
<entry>no</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.vector_to_list">
VECTOR_TO_LIST</ulink></entry>
<entry>8</entry>
<entry>5</entry>
<entry></entry>
<entry>10</entry>
<entry>yes</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.list_to_vector">
LIST_TO_VECTOR</ulink></entry>
<entry>10</entry>
<entry>5</entry>
<entry></entry>
<entry>10</entry>
<entry>no</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.assoc_ord_to_unord">
ORDERED_TO_UNORDERED</ulink></entry>
<entry>10</entry>
<entry>5</entry>
<entry></entry>
<entry>10</entry>
<entry>only map/unordered_map</entry>
</row>
<row>
<entry><ulink url="#manual.ext.profile_mode.analysis.algorithms">
ALGORITHMS</ulink></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.algorithms.sort">
SORT</ulink></entry>
<entry>7</entry>
<entry>8</entry>
<entry></entry>
<entry>7</entry>
<entry>no</entry>
</row>
<row>
<entry><ulink url="#manual.ext.profile_mode.analysis.locality">
LOCALITY</ulink></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.locality.sw_prefetch">
SOFTWARE_PREFETCH</ulink></entry>
<entry>8</entry>
<entry>8</entry>
<entry></entry>
<entry>5</entry>
<entry>no</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.locality.linked">
RBTREE_LOCALITY</ulink></entry>
<entry>4</entry>
<entry>8</entry>
<entry></entry>
<entry>5</entry>
<entry>no</entry>
</row>
<row>
<entry></entry>
<entry><ulink url="#manual.ext.profile_mode.analysis.mthread.false_share">
FALSE_SHARING</ulink></entry>
<entry>8</entry>
<entry>10</entry>
<entry></entry>
<entry>10</entry>
<entry>no</entry>
</row>
</tbody>
</tgroup>
</table>
883
<sect2 id="manual.ext.profile_mode.analysis.template"
xreflabel="Template">
<title>Diagnostic Template</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> What problem will it diagnose?
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
What is the fundamental reason why this is a problem?</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis>
Percentage reduction in execution time. When the reduction is more than
a constant factor, describe the reduction rate formula.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
What would the advice look like?</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis>
Which libstdc++ components need to be instrumented?</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
How do we decide when to issue the advice?</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
How do we measure benefits? Math goes here.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
program code
...
advice sample
</programlisting>
</para></listitem>
</itemizedlist>
</sect2>



<sect2 id="manual.ext.profile_mode.analysis.containers"
xreflabel="Containers">
<title>Containers</title>

<para>
<emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_CONTAINERS</code>.
</para>

<sect3 id="manual.ext.profile_mode.analysis.hashtable_too_small"
xreflabel="Hashtable Too Small">
<title>Hashtable Too Small</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect hashtables with many
rehash operations, a small construction size and a large destruction size.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> Rehashing is very
expensive: the table contents must be read, the chains within each bucket
followed, the hash function reevaluated and each element placed at a new
location, in a different order.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 36%.
Code similar to the example below.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
Set initial size to N at construction site S.
</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis>
<code>unordered_set, unordered_map</code> constructor, destructor, rehash.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>unordered_[multi]set|map</code>,
record the initial size and the call context of the constructor.
Record the size increase, if any, after each relevant operation such as
insert. Record the estimated rehash cost.</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Number of individual rehash operations * cost per rehash.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 unordered_set&lt;int&gt; us;
2 for (int k = 0; k &lt; 1000000; ++k) {
3 us.insert(k);
4 }

foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>


<sect3 id="manual.ext.profile_mode.analysis.hashtable_too_large"
xreflabel="Hashtable Too Large">
<title>Hashtable Too Large</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect hashtables which are
never filled up because fewer elements than reserved are ever
inserted.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
is good in itself and may also improve memory reference performance through
fewer cache and TLB misses.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
Set initial size to N at construction site S.
</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis>
<code>unordered_set, unordered_map</code> constructor, destructor, rehash.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>unordered_[multi]set|map</code>,
record the initial size and the call context of the constructor, and
correlate it with its size at destruction time.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Number of iteration operations + memory saved.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;unordered_set&lt;int&gt;&gt; v(100000, unordered_set&lt;int&gt;(100));
2 for (int k = 0; k &lt; 100000; ++k) {
3 for (int j = 0; j &lt; 10; ++j) {
4 v[k].insert(k + j);
5 }
6 }

foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
bytes of memory and M iteration steps.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.inefficient_hash"
xreflabel="Inefficient Hash">
<title>Inefficient Hash</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_INEFFICIENT_HASH</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect hashtables with a polarized
distribution.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> A non-uniform
distribution may lead to long chains, thus possibly increasing complexity
by a factor up to the number of elements.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
to container size.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Change the hash function
for the container built at site S. Distribution score = N. Access score = S.
Longest chain = C, in bucket B.
</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis>
<code>unordered_set, unordered_map</code> constructor, destructor, [],
insert, iterator.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
Count the exact number of link traversals.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Total number of links traversed.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
class dumb_hash {
public:
size_t operator() (int i) const { return 0; }
};
...
unordered_set&lt;int, dumb_hash&gt; hs;
...
for (int i = 0; i &lt; COUNT; ++i) {
hs.find(i);
}
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_too_small"
xreflabel="Vector Too Small">
<title>Vector Too Small</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_VECTOR_TOO_SMALL</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect vectors with many
resize operations, a small construction size and a large destruction size.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> Resizing can be expensive.
Copying large amounts of data takes time. Resizing many small vectors may
have allocation overhead and affect locality.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
Set initial size to N at construction site S.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>vector</code>,
record the initial size and the call context of the constructor.
Record the size increase, if any, after each relevant operation such as
<code>push_back</code>. Record the estimated resize cost.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Total number of words copied * time to copy a word.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v;
2 for (int k = 0; k &lt; 1000000; ++k) {
3 v.push_back(k);
4 }

foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
copying 4000000 bytes and 20 memory allocations and deallocations.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_too_large"
xreflabel="Vector Too Large">
<title>Vector Too Large</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_VECTOR_TOO_LARGE</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect vectors which are
never filled up because fewer elements than reserved are ever
inserted.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
is good in itself and may also improve memory reference performance through
fewer cache and TLB misses.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
Set initial size to N at construction site S.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>vector</code>,
record the initial size and the call context of the constructor, and
correlate it with its size at destruction time.</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Total amount of memory saved.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;vector&lt;int&gt;&gt; v(100000, vector&lt;int&gt;(100));
2 for (int k = 0; k &lt; 100000; ++k) {
3 for (int j = 0; j &lt; 10; ++j) {
4 v[k].push_back(k + j);
5 }
6 }

foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
bytes of memory and may reduce the number of cache and TLB misses.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_to_hashtable"
xreflabel="Vector to Hashtable">
<title>Vector to Hashtable</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect uses of
<code>vector</code> that can be substituted with <code>unordered_set</code>
to reduce execution time.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
Linear search in a vector is very expensive, whereas searching in a hashtable
is very quick.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
to container size.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Replace
<code>vector</code> with <code>unordered_set</code> at site S.
</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>
operations and access methods.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>vector</code>,
record the call context of the constructor. Issue the advice only if the
only methods called on this <code>vector</code> are <code>push_back</code>,
<code>insert</code> and <code>find</code>.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) -
cost(unordered_set::insert) - cost(unordered_set::find).
</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v;
...
2 for (int i = 0; i &lt; 1000; ++i) {
3 find(v.begin(), v.end(), i);
4 }

foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
comparisons.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.hashtable_to_vector"
xreflabel="Hashtable to Vector">
<title>Hashtable to Vector</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect uses of
<code>unordered_set</code> that can be substituted with <code>vector</code>
to reduce execution time.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
A hashtable iterator is slower than a vector iterator.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 95%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Replace
<code>unordered_set</code> with <code>vector</code> at site S.
</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>unordered_set</code>
operations and access methods.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>unordered_set</code>,
record the call context of the constructor. Issue the advice only if the
number of <code>find</code>, <code>insert</code> and <code>[]</code>
operations on this <code>unordered_set</code> is small relative to the
number of elements, and methods <code>begin</code> or <code>end</code>
are invoked (suggesting iteration).</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Number of indirections saved by iterating over a contiguous vector instead
of following hashtable links.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 unordered_set&lt;int&gt; us;
...
2 int s = 0;
3 for (unordered_set&lt;int&gt;::iterator it = us.begin(); it != us.end(); ++it) {
4 s += *it;
5 }

foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
indirections and may achieve better data locality.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.vector_to_list"
xreflabel="Vector to List">
<title>Vector to List</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_VECTOR_TO_LIST</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect cases where
<code>vector</code> could be substituted with <code>list</code> for
better performance.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
Inserting in the middle of a vector is expensive compared to inserting in a
list.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up to
container size.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Replace vector with list
at site S.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>
operations and access methods.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
For each dynamic instance of <code>vector</code>,
record the call context of the constructor. Record the overhead of each
<code>insert</code> operation based on current size and insert position.
Report instances with high insertion overhead.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
(Sum(cost(vector::method)) - Sum(cost(list::method)), for
method in [push_back, insert, erase])
+ (Cost(iterate vector) - Cost(iterate list))</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v;
2 for (int i = 0; i &lt; 10000; ++i) {
3 v.insert(v.begin(), i);
4 }

foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
operations.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.list_to_vector"
xreflabel="List to Vector">
<title>List to Vector</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_LIST_TO_VECTOR</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect cases where
<code>list</code> could be substituted with <code>vector</code> for
better performance.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
Iterating through a vector is faster than through a list.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 64%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Replace list with vector
at site S.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>list</code>
operations and access methods.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
Issue the advice if there are no <code>insert</code> operations.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
(Sum(cost(list::method)) - Sum(cost(vector::method)), for
method in [push_back, insert, erase])
+ (Cost(iterate list) - Cost(iterate vector))</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 list&lt;int&gt; l;
...
2 int sum = 0;
3 for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
4 sum += *it;
5 }

foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
memory references.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.list_to_slist"
xreflabel="List to Forward List">
<title>List to Forward List (Slist)</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_LIST_TO_SLIST</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect cases where
<code>list</code> could be substituted with <code>forward_list</code> for
better performance.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
The memory footprint of a forward_list is smaller than that of a list.
This has beneficial effects on the memory subsystem, e.g., fewer cache misses.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 40%.
Note that the reduction is only noticeable if the list node is in fact
larger than the forward_list node. For memory allocators with size classes,
you will only notice an effect when the two node sizes belong to
different allocator size classes.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Replace list with
forward_list at site S.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> <code>list</code>
operations and iteration methods.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
Issue the advice if there are no backward traversals
or insertions before a given node.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Always true.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 list&lt;int&gt; l;
...
2 int sum = 0;
3 for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
4 sum += *it;
5 }

foo.cc:1: advice: Change "list" to "forward_list".
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.assoc_ord_to_unord"
xreflabel="Ordered to Unordered Associative Container">
<title>Ordered to Unordered Associative Container</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_ORDERED_TO_UNORDERED</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect cases where ordered
associative containers can be replaced with unordered ones.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
Insert and search are quicker in a hashtable than in
a red-black tree.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 52%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
Replace set with unordered_set at site S.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis>
<code>set</code>, <code>multiset</code>, <code>map</code>,
<code>multimap</code> methods.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
Issue the advice only if we are not using operator <code>++</code> on any
iterator on a particular <code>[multi]set|map</code>.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
(Sum(cost(rbtree::method)) - Sum(cost(hashtable::method)), for
method in [insert, erase, find])
+ (Cost(iterate rbtree) - Cost(iterate hashtable))</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 set&lt;int&gt; s;
2 for (int i = 0; i &lt; 100000; ++i) {
3 s.insert(i);
4 }
5 int sum = 0;
6 for (int i = 0; i &lt; 100000; ++i) {
7 sum += *s.find(i);
8 }
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>



<sect2 id="manual.ext.profile_mode.analysis.algorithms"
xreflabel="Algorithms">
<title>Algorithms</title>

<para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_ALGORITHMS</code>.
</para>

<sect3 id="manual.ext.profile_mode.analysis.algorithms.sort"
xreflabel="Sorting">
<title>Sort Algorithm Performance</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_SORT</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Give a measure of sort algorithm
performance based on the actual input. For instance, advise Radix Sort over
Quick Sort for a particular call context.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
See the papers:
<ulink url="http://portal.acm.org/citation.cfm?doid=1065944.1065981">
A framework for adaptive algorithm selection in STAPL</ulink> and
<ulink url="http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=4228227">
Optimizing Sorting with Machine Learning Algorithms</ulink>.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 60%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Change the sort algorithm
at site S from X Sort to Y Sort.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> The <code>sort</code>
algorithm.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
Issue the advice if the cost model tells us that another sort algorithm
would do better on this input. Requires us to know what algorithm we
are using in our sort implementation in release mode.</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Runtime(algo) for algo in [radix, quick, merge, ...]</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
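// Hypothetical illustration only: this diagnostic is not implemented, so
// the input sketch and the advice line below are assumptions in the style
// of the other examples, not actual profile mode output.
1 vector&lt;int&gt; v;
... // v is filled with many values drawn from a small range
2 sort(v.begin(), v.end());

foo.cc:2: advice: The input at this call site may be better suited to
Radix Sort than to Quick Sort.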
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.locality"
xreflabel="Data Locality">
<title>Data Locality</title>

<para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_LOCALITY</code>.
</para>

<sect3 id="manual.ext.profile_mode.analysis.locality.sw_prefetch"
xreflabel="Need Software Prefetch">
<title>Need Software Prefetch</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_SOFTWARE_PREFETCH</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Discover sequences of indirect
memory accesses that are not regular, thus cannot be predicted by
hardware prefetchers.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis>
Indirect references are hard to predict and are very expensive when they
miss in caches.</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 25%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Insert a prefetch
instruction.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> Vector iterator and
access operator [].
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
First, get the cache line size and page size from the system.
Then record iterator dereference sequences for which the value is a pointer.
For each sequence within a container, issue a warning if successive pointer
addresses are not within cache lines and do not form a linear pattern
(otherwise they may be prefetched by hardware).
If they also step across page boundaries, make the warning stronger.
</para>
<para>The same analysis applies to containers other than vector.
However, we cannot give the same advice for linked structures, such as list,
as there is no random access to the n-th element. The user may still be
able to benefit from this information, for instance by employing frays
(user-level lightweight threads) to hide the latency of chasing pointers.
</para>
<para>
This analysis is a little oversimplified. A better cost model could be
created by understanding the capability of the hardware prefetcher.
This model could be trained automatically by running a set of synthetic
cases.
</para>
</listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Total distance between pointer values of successive elements in vectors
of pointers.</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 int zero = 0;
2 vector&lt;int*&gt; v(10000000, &amp;zero);
3 for (int k = 0; k &lt; 10000000; ++k) {
4 v[random() % 10000000] = new int(k);
5 }
6 for (int j = 0; j &lt; 10000000; ++j) {
7 count += (*v[j] == 0 ? 0 : 1);
8 }

foo.cc:7: advice: Insert prefetch instruction.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.locality.linked"
xreflabel="Linked Structure Locality">
<title>Linked Structure Locality</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Give a measure of the locality of
objects stored in linked structures (lists, red-black trees and hashtables)
with respect to their actual traversal patterns.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> Allocation can be tuned
to a specific traversal pattern, to result in better data locality.
See the paper:
<ulink url="http://www.springerlink.com/content/8085744l00x72662/">
Custom Memory Allocation for Free</ulink>.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 30%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis>
High scatter score N for container built at site S.
Consider changing the allocation sequence or choosing a structure conscious
allocator.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> Methods of all
containers using linked structures.</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
First, get the cache line size and page size from the system.
Then, for each traversal method such as <code>find</code>, record the number
of successive elements that are on a different cache line or page. Give
advice only if the ratio between this number and the total number of node
hops is above a threshold.</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Sum(same_cache_line(this, previous))</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 set&lt;int&gt; s;
2 for (int i = 0; i &lt; 10000000; ++i) {
3 s.insert(i);
4 }
5 set&lt;int&gt; s1, s2;
6 for (int i = 0; i &lt; 10000000; ++i) {
7 s1.insert(i);
8 s2.insert(i);
9 }
...
// Fast, better locality.
10 for (set&lt;int&gt;::iterator it = s.begin(); it != s.end(); ++it) {
11 sum += *it;
12 }
// Slow, elements are further apart.
13 for (set&lt;int&gt;::iterator it = s1.begin(); it != s1.end(); ++it) {
14 sum += *it;
15 }

foo.cc:5: advice: High scatter score NNN for set built here. Consider changing
the allocation sequence or switching to a structure conscious allocator.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.mthread"
xreflabel="Multithreaded Data Access">
<title>Multithreaded Data Access</title>

<para>
The diagnostics in this group are not meant to be implemented in the short
term. They require compiler support to know when container elements are
written to. Instrumentation can only tell us when elements are referenced.
</para>

<para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_MULTITHREADED</code>.
</para>

1615 <sect3 id="manual.ext.profile_mode.analysis.mthread.ddtest"
1616 xreflabel="Dependence Violations at Container Level">
1617 <title>Data Dependence Violations at Container Level</title>
1618 <itemizedlist>
1619 <listitem><para><emphasis>Switch:</emphasis>
1620 <code>_GLIBCXX_PROFILE_DDTEST</code>.
1621 </para></listitem>
1622 <listitem><para><emphasis>Goal:</emphasis> Detect container elements
1623 that are referenced from multiple threads in the parallel region or
1624 across parallel regions.
1625 </para></listitem>
1626 <listitem><para><emphasis>Fundamentals:</emphasis>
1627 Sharing data between threads requires communication and perhaps locking,
1628 which may be expensive.
1629 </para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> ?%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Change the data
distribution or the parallel algorithm.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> Container access methods
and iterators.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
Keep a shadow for each container.  Record iterator dereferences and
container member accesses.  Issue advice for elements referenced by
multiple threads.
See paper: <ulink url="http://portal.acm.org/citation.cfm?id=207110.207148">
The LRPD test: speculative run-time parallelization of loops with
privatization and reduction parallelization</ulink>.
</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Number of accesses to elements referenced from multiple threads.
</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

<sect3 id="manual.ext.profile_mode.analysis.mthread.false_share"
       xreflabel="False Sharing">
<title>False Sharing</title>
<itemizedlist>
<listitem><para><emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_FALSE_SHARING</code>.
</para></listitem>
<listitem><para><emphasis>Goal:</emphasis> Detect elements in the
same container that share a cache line, are written by at least one
thread, and are accessed by different threads.
</para></listitem>
<listitem><para><emphasis>Fundamentals:</emphasis> Under these conditions,
cache protocols require communication to invalidate lines, which may be
expensive.
</para></listitem>
<listitem><para><emphasis>Sample runtime reduction:</emphasis> 68%.
</para></listitem>
<listitem><para><emphasis>Recommendation:</emphasis> Reorganize the container
or use padding to avoid false sharing.</para></listitem>
<listitem><para><emphasis>To instrument:</emphasis> Container access methods
and iterators.
</para></listitem>
<listitem><para><emphasis>Analysis:</emphasis>
First, get the cache line size.
For each shared container, record all the associated iterator dereferences
and member access methods together with the thread id.  Compare the address
lists across threads to detect references from two different threads to the
same cache line.  Issue a warning only if the ratio to total references is
significant.  Do the same for iterator dereference values if they are
pointers.</para></listitem>
<listitem><para><emphasis>Cost model:</emphasis>
Number of accesses to the same cache line from different threads.
</para></listitem>
<listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector&lt;int&gt; v(2, 0);
2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3 for (i = 0; i &lt; SIZE; ++i) {
4   v[i % 2] += i;
5 }

OMP_NUM_THREADS=2 ./a.out
foo.cc:1: advice: Change container structure or padding to avoid false
sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
</programlisting>
</para></listitem>
</itemizedlist>
</sect3>

</sect2>


<sect2 id="manual.ext.profile_mode.analysis.statistics"
       xreflabel="Statistics">
<title>Statistics</title>

<para>
<emphasis>Switch:</emphasis>
<code>_GLIBCXX_PROFILE_STATISTICS</code>.
</para>

<para>
In some cases the cost model may not tell us anything, because the costs
appear to offset the benefits.  Consider the choice between a vector and
a list.  When there are both inserts and iteration, automatic advice
may not be issued.  However, the programmer may still be able to make use
of this information in a different way.
</para>
<para>
This diagnostic does not issue any advice, but it prints statistics for
each container construction site.  The statistics contain the cost
of each operation actually performed on the container.
</para>

</sect2>


</sect1>


<bibliography id="profile_mode.biblio">
<title>Bibliography</title>

  <biblioentry>
    <title>
      Perflint: A Context Sensitive Performance Advisor for C++ Programs
    </title>

    <author>
      <firstname>Lixia</firstname>
      <surname>Liu</surname>
    </author>
    <author>
      <firstname>Silvius</firstname>
      <surname>Rus</surname>
    </author>

    <copyright>
      <year>2009</year>
      <holder></holder>
    </copyright>

    <publisher>
      <publishername>
        Proceedings of the 2009 International Symposium on Code Generation
        and Optimization
      </publishername>
    </publisher>
  </biblioentry>
</bibliography>


</chapter>