<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title>OProfile manual</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.69.1" />
</head>
<body>
<div class="book" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h1 class="title"><a id="oprofile-guide"></a>OProfile manual</h1>
</div>
<div>
<div class="authorgroup">
<div class="author">
<h3 class="author"><span class="firstname">John</span> <span class="surname">Levon</span></h3>
<div class="affiliation">
<div class="address">
<p>
<code class="email">&lt;<a href="mailto:levon@movementarian.org">levon@movementarian.org</a>&gt;</code>
</p>
</div>
</div>
</div>
</div>
</div>
<div>
<p class="copyright">Copyright © 2000-2004 Victoria University of Manchester, John Levon and others</p>
</div>
</div>
<hr />
</div>
<div class="toc">
<p>
<b>Table of Contents</b>
</p>
<dl>
<dt>
<span class="chapter">
<a href="#introduction">1. Introduction</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect1">
<a href="#applications">1. Applications of OProfile</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#jitsupport">1.1. Support for dynamically compiled (JIT) code</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#requirements">2. System requirements</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#resources">3. Internet resources</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#install">4. Installation</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#uninstall">5. Uninstalling OProfile</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="chapter">
<a href="#overview">2. Overview</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect1">
<a href="#getting-started">1. Getting started</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#tools-overview">2. Tools summary</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="chapter">
<a href="#controlling">3. Controlling the profiler</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect1">
<a href="#controlling-daemon">1. Using <span><strong class="command">opcontrol</strong></span></a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opcontrolexamples">1.1. Examples</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#eventspec">1.2. Specifying performance counter events</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#setup-jit">2. Setting up the JIT profiling feature</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#setup-jit-jvm">2.1. JVM instrumentation</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#oprofile-gui">3. Using <span><strong class="command">oprof_start</strong></span></a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#detailed-parameters">4. Configuration details</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#hardware-counters">4.1. Hardware performance counters</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#rtc">4.2. OProfile in RTC mode</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#timer">4.3. OProfile in timer interrupt mode</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#p4">4.4. Pentium 4 support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#ia64">4.5. Intel Itanium 2 support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#ppc64">4.6. PowerPC64 support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#cell-be">4.7. Cell Broadband Engine support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#amd-ibs-support">4.8. AMD64 (x86_64) Instruction-Based Sampling (IBS) support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#misuse">4.9. Dangerous counter settings</a>
</span>
</dt>
</dl>
</dd>
</dl>
</dd>
<dt>
<span class="chapter">
<a href="#results">4. Obtaining results</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect1">
<a href="#profile-spec">1. Profile specifications</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#profile-spec-examples">1.1. Examples</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#profile-spec-details">1.2. Profile specification parameters</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#locating-and-managing-binary-images">1.3. Locating and managing binary images</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#no-results">1.4. What to do when you don't get any results</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#opreport">2. Image summaries and symbol summaries (<span><strong class="command">opreport</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opreport-merging">2.1. Merging separate profiles</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-comparison">2.2. Side-by-side multiple results</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-callgraph">2.3. Callgraph output</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-diff">2.4. Differential profiles with <span><strong class="command">opreport</strong></span></a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-anon">2.5. Anonymous executable mappings</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-xml">2.6. XML formatted output</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-options">2.7. Options for <span><strong class="command">opreport</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#opannotate">3. Outputting annotated source (<span><strong class="command">opannotate</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opannotate-finding-source">3.1. Locating source files</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opannotate-details">3.2. Usage of <span><strong class="command">opannotate</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#getting-jit-reports">4. OProfile results with JIT samples</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#opgprof">5. <span><strong class="command">gprof</strong></span>-compatible output (<span><strong class="command">opgprof</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opgprof-details">5.1. Usage of <span><strong class="command">opgprof</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#oparchive">6. Archiving measurements (<span><strong class="command">oparchive</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#oparchive-details">6.1. Usage of <span><strong class="command">oparchive</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#opimport">7. Converting sample database files (<span><strong class="command">opimport</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opimport-details">7.1. Usage of <span><strong class="command">opimport</strong></span></a>
</span>
</dt>
</dl>
</dd>
</dl>
</dd>
<dt>
<span class="chapter">
<a href="#interpreting">5. Interpreting profiling results</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect1">
<a href="#irq-latency">1. Profiling interrupt latency</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#kernel-profiling">2. Kernel profiling</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#irq-masking">2.1. Interrupt masking</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#idle">2.2. Idle time</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#kernel-modules">2.3. Profiling kernel modules</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#interpreting-callgraph">3. Interpreting call-graph profiles</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#debug-info">4. Inaccuracies in annotated source</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#effect-of-optimizations">4.1. Side effects of optimizations</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#prologues">4.2. Prologues and epilogues</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#inlined-function">4.3. Inlined functions</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#wrong-linenr-info">4.4. Inaccuracy in line number information</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#symbol-without-debug-info">5. Assembly functions</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#overlapping-symbols">6. Overlapping symbols in JITed code</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#hidden-cost">7. Other discrepancies</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="chapter">
<a href="#ack">6. Acknowledgments</a>
</span>
</dt>
</dl>
</div>
<div class="chapter" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="introduction"></a>Chapter 1. Introduction</h2>
</div>
</div>
</div>
<div class="toc">
<p>
<b>Table of Contents</b>
</p>
<dl>
<dt>
<span class="sect1">
<a href="#applications">1. Applications of OProfile</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#jitsupport">1.1. Support for dynamically compiled (JIT) code</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#requirements">2. System requirements</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#resources">3. Internet resources</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#install">4. Installation</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#uninstall">5. Uninstalling OProfile</a>
</span>
</dt>
</dl>
</div>
<p>
This manual applies to OProfile version 0.9.6.
OProfile is a profiling system for Linux 2.2/2.4/2.6 systems on a number of architectures. It is capable of profiling
all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries
to binaries. It runs transparently in the background, collecting information at a low overhead. These
features make it ideal for profiling entire systems to find bottlenecks under real-world conditions.
</p>
<p>
Many CPUs provide "performance counters", hardware registers that can count "events" such as
cache misses or CPU cycles. OProfile builds profiles of code from these events:
each time a certain (configurable) number of events has occurred, the program counter (PC) value is recorded.
These samples are aggregated into profiles for each binary image.</p>
<p>
Some hardware setups do not allow OProfile to use performance counters: in these cases, no
events are available, and OProfile operates in timer/RTC mode, as described in later chapters.
</p>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="applications"></a>1. Applications of OProfile</h2>
</div>
</div>
</div>
<p>
OProfile is useful in a number of situations. You might want to use OProfile when you:
</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>need low overhead</p>
</li>
<li>
<p>cannot use highly intrusive profiling methods</p>
</li>
<li>
<p>need to profile interrupt handlers</p>
</li>
<li>
<p>need to profile an application and its shared libraries</p>
</li>
<li>
<p>need to profile dynamically compiled code of supported virtual machines (see <a href="#jitsupport" title="1.1. Support for dynamically compiled (JIT) code">Section 1.1, &#8220;Support for dynamically compiled (JIT) code&#8221;</a>)</p>
</li>
<li>
<p>need to capture the performance behaviour of an entire system</p>
</li>
<li>
<p>want to examine hardware effects such as cache misses</p>
</li>
<li>
<p>want detailed source annotation</p>
</li>
<li>
<p>want instruction-level profiles</p>
</li>
<li>
<p>want call-graph profiles</p>
</li>
</ul>
</div>
<p>
OProfile is not a panacea. OProfile might not be a complete solution when you:
</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>require call-graph profiles on platforms other than 2.6/x86</p>
</li>
<li>
<p>don't have root permissions</p>
</li>
<li>
<p>require 100% instruction-accurate profiles</p>
</li>
<li>
<p>need function call counts or an interstitial profiling API</p>
</li>
<li>
<p>cannot tolerate any disturbance to the system whatsoever</p>
</li>
<li>
<p>need to profile interpreted or dynamically compiled code of non-supported virtual machines</p>
</li>
</ul>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="jitsupport"></a>1.1. Support for dynamically compiled (JIT) code</h3>
</div>
</div>
</div>
<p>
Older versions of OProfile were not capable of attributing samples to symbols from dynamically
compiled code, i.e. "just-in-time (JIT) code". Typical JIT compilers load the JIT code into
anonymous memory regions. OProfile reported the samples from such code, but the attribution
provided was simply:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">"anon: &lt;tgid&gt;&lt;address range&gt;" </pre>
</td>
</tr>
</table>
<p>
Due to this limitation, it wasn't possible to profile applications executed by virtual machines (VMs)
like the Java Virtual Machine. OProfile now contains an infrastructure to support JITed code.
A development library is provided to allow developers
to add support for any VM that produces dynamically compiled code (see the <span class="emphasis"><em>OProfile JIT agent
developer guide</em></span>).
In addition, built-in support is included for the following:</p>
<div class="itemizedlist">
<ul type="disc">
<li>JVMTI agent library for Java (1.5 and higher)</li>
<li>JVMPI agent library for Java (1.5 and lower)</li>
</ul>
</div>
<p>
For information on how to use OProfile's JIT support, see <a href="#setup-jit" title="2. Setting up the JIT profiling feature">Section 2, &#8220;Setting up the JIT profiling feature&#8221;</a>.
</p>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="requirements"></a>2. System requirements</h2>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">Linux kernel 2.2/2.4/2.6</span>
</dt>
<dd>
<p>
OProfile uses a kernel module that can be compiled for
kernel 2.2.11 or later and for 2.4 kernels. Kernel 2.4.10 or above is required if you use the
boot-time kernel option <code class="option">nosmp</code>. 2.6 kernels are supported with the in-kernel
OProfile driver. Note that only 32-bit x86 and IA64 are supported on 2.2/2.4 kernels.
</p>
<p>
2.6 kernels are strongly recommended. Under 2.4, OProfile may cause system crashes if power
management is used, or the BIOS does not correctly deal with local APICs.
</p>
<p>
PPC64 processors (Power4/Power5/PPC970, etc.) require a recent (&gt; 2.6.5) kernel with the line
<code class="constant">#define PV_970</code> present in <code class="filename">include/asm-ppc64/processor.h</code>.
</p>
<p>
Profiling the Cell Broadband Engine PowerPC Processing Element (PPE) requires a kernel version
of 2.6.18 or more recent.
Profiling the Cell Broadband Engine Synergistic Processing Element (SPE) requires a kernel version
of 2.6.22 or more recent. Additionally, full support of SPE profiling requires a BFD library
from binutils code dated January 2007 or later. To ensure the proper BFD support exists, run
the <code class="code">configure</code> utility with <code class="code">--with-target=cell-be</code>.
Profiling the Cell Broadband Engine using SPU events requires a kernel version of 2.6.29-rc1
or more recent.
</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3>Attempting to profile SPEs with kernel versions older than 2.6.22 may cause the
system to crash.</div>
<p>
Instruction-Based Sampling (IBS) profiling on AMD Family 10h processors requires
kernel version 2.6.28-rc2 or later.
</p>
</dd>
<dt>
<span class="term">modutils 2.4.6 or above</span>
</dt>
<dd>
<p>
modutils 2.4.6 or higher is recommended, although earlier versions work well in almost all
cases.
</p>
</dd>
<dt>
<span class="term">Supported architecture</span>
</dt>
<dd>
<p>
For Intel IA32, a CPU with either a P6 generation or Pentium 4 core is
required. In marketing terms this translates to anything
between an Intel Pentium Pro (not Pentium Classics) and
a Pentium 4 / Xeon, including all Celerons. The AMD
Athlon, Opteron, Phenom, and Turion CPUs are also supported. Other IA32
CPU types only support the RTC mode of OProfile; please
see later in this manual for details. Hyper-threaded Pentium 4 processors
are not supported on 2.4 kernels. For 2.4 kernels, the Intel
IA-64 CPUs are also supported. For 2.6 kernels, there is additionally
support for Alpha processors, MIPS, ARM, x86-64, sparc64, ppc64, AVR32, and,
in timer mode, PA-RISC and s390.
</p>
</dd>
<dt>
<span class="term">Uniprocessor or SMP</span>
</dt>
<dd>
<p>
SMP machines are fully supported.
</p>
</dd>
<dt>
<span class="term">Required libraries</span>
</dt>
<dd>
<p>
These libraries are required: <code class="filename">popt</code>, <code class="filename">bfd</code>,
<code class="filename">liberty</code> (Debian users: libiberty is provided in the binutils-dev package), <code class="filename">dl</code>,
plus the standard C++ libraries.
</p>
</dd>
<dt>
<span class="term">Required user account</span>
</dt>
<dd>
<p>
For secure processing of sample data from JIT virtual machines (e.g., Java),
the special user account "oprofile" must exist on the system. The 'configure'
and 'make install' operations will print warning messages if this
account is not found. If you intend to profile JITed code, you must create
a group account named 'oprofile' and then create the 'oprofile' user account,
setting the default group to 'oprofile'. A runtime error message is printed to
the oprofile daemon log when processing JIT samples if this special user
account cannot be found.
</p>
</dd>
<dt>
<span class="term">OProfile GUI</span>
</dt>
<dd>
<p>
The use of the GUI to start the profiler requires the <code class="filename">Qt 2</code> library. <code class="filename">Qt 3</code> should
also work.
</p>
</dd>
<dt>
<span class="term">
<span class="acronym">ELF</span>
</span>
</dt>
<dd>
<p>
Probably not too strenuous a requirement, but older <span class="acronym">A.OUT</span> binaries/libraries are not supported.
</p>
</dd>
<dt>
<span class="term">K&amp;R coding style</span>
</dt>
<dd>
<p>
OK, so it's not really a requirement, but I wish it was...
</p>
</dd>
</dl>
</div>
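<p>
The account setup described under "Required user account" can be sketched as follows. This is a minimal
example that assumes the <span><strong class="command">groupadd</strong></span> and
<span><strong class="command">useradd</strong></span> utilities found on most Linux distributions; run it as root and
adapt it to your system's account policy:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# Create the 'oprofile' group first, then the 'oprofile' user
# with 'oprofile' as its default group.
groupadd oprofile
useradd -g oprofile oprofile
</pre>
</td>
</tr>
</table>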
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="resources"></a>3. Internet resources</h2>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">Web page</span>
</dt>
<dd>
<p>
There is a web page (which you may be reading now) at
<a href="http://oprofile.sf.net/">http://oprofile.sf.net/</a>.
</p>
</dd>
<dt>
<span class="term">Download</span>
</dt>
<dd>
<p>
You can download a source tarball or get anonymous CVS access at the SourceForge page,
<a href="http://sf.net/projects/oprofile/">http://sf.net/projects/oprofile/</a>.
</p>
</dd>
<dt>
<span class="term">Mailing list</span>
</dt>
<dd>
<p>
There is a low-traffic OProfile-specific mailing list, details at
<a href="http://sf.net/mail/?group_id=16191">http://sf.net/mail/?group_id=16191</a>.
</p>
</dd>
<dt>
<span class="term">Bug tracker</span>
</dt>
<dd>
<p>
There is a bug tracker for OProfile at SourceForge,
<a href="http://sf.net/tracker/?group_id=16191&amp;atid=116191">http://sf.net/tracker/?group_id=16191&amp;atid=116191</a>.
</p>
</dd>
<dt>
<span class="term">IRC channel</span>
</dt>
<dd>
<p>
Several OProfile developers and users sometimes hang out on channel <span><strong class="command">#oprofile</strong></span>
on the <a href="http://oftc.net">OFTC</a> network.
</p>
</dd>
</dl>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="install"></a>4. Installation</h2>
</div>
</div>
</div>
<p>
First you need to build and install OProfile. Running <span><strong class="command">./configure</strong></span>, <span><strong class="command">make</strong></span>, and <span><strong class="command">make install</strong></span>
is often all you need, but note these arguments to <span><strong class="command">./configure</strong></span>:
</p>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--with-linux</code>
</span>
</dt>
<dd>
<p>
Use this option to specify the location of the kernel source tree you wish
to compile against. The kernel module is built against this source and
will only work with a running kernel built from the same source with
exactly the same options, so it is important to specify this option whenever the
default location does not match your running kernel.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--with-java</code>
</span>
</dt>
<dd>
<p>
Use this option if you need to profile Java applications. Also, see
<a href="#requirements" title="2. System requirements">Section 2, &#8220;System requirements&#8221;</a>, "Required user account". This option
is used to specify the location of the Java Development Kit (JDK)
source tree you wish to use. This is necessary to get the interface description
of the JVMPI (or JVMTI) interface to compile the JIT support code successfully.
</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
The Java Runtime Environment (JRE) does not include the development
files that are required to compile the JIT support code, so the full
JDK must be installed in order to use this option.
</p>
</div>
<p>
By default, the OProfile JIT support libraries will be installed in
<code class="filename">&lt;oprof_install_dir&gt;/lib/oprofile</code>. To build
and install OProfile and the JIT support libraries as 64-bit, you can
do something like the following:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# CFLAGS="-m64" CXXFLAGS="-m64" ./configure \
--with-kernel-support --with-java={my_jdk_installdir} \
--libdir=/usr/local/lib64
</pre>
</td>
</tr>
</table>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
If you encounter errors building 64-bit, you should
install libtool 1.5.26 or later since that release of
libtool fixes known problems for certain platforms.
If you install libtool into a non-standard location,
you'll need to edit the invocation of 'aclocal' in
OProfile's autogen.sh as follows (assume an install
location of /usr/local):
</p>
<p>
<code class="code">aclocal -I m4 -I /usr/local/share/aclocal</code>
</p>
</div>
</dd>
<dt>
<span class="term">
<code class="option">--with-kernel-support</code>
</span>
</dt>
<dd>
<p>
Use this option with 2.6 or later kernels to indicate that the
kernel provides the OProfile device driver.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--with-qt-dir/includes/libraries</code>
</span>
</dt>
<dd>
<p>
Specify the location of Qt headers and libraries. It defaults to searching in
<code class="constant">$QTDIR</code> if these are not specified.
</p>
</dd>
<dt>
<a id="disable-werror"></a>
<span class="term">
<code class="option">--disable-werror</code>
</span>
</dt>
<dd>
<p>
Development versions of OProfile build by
default with <code class="option">-Werror</code>. This option turns
<code class="option">-Werror</code> off.
</p>
</dd>
<dt>
<a id="disable-optimization"></a>
<span class="term">
<code class="option">--disable-optimization</code>
</span>
</dt>
<dd>
<p>
Disable the <code class="option">-O2</code> compiler flag
(useful if you discover an OProfile bug and want to give a useful
back-trace etc.)
</p>
</dd>
</dl>
</div>
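<p>
As a rough sketch, a build on a 2.6 kernel that provides the in-kernel OProfile driver might use
the options above as follows. It assumes that no Java or Qt support is needed and that the last step is run
as root; adjust the <span><strong class="command">./configure</strong></span> arguments to your own setup:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# Configure for a kernel that already provides the OProfile driver.
./configure --with-kernel-support
# Build, then install (installation normally requires root).
make
make install
</pre>
</td>
</tr>
</table>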
<p>
To build the module for 2.4 kernels, you need a configured kernel source tree for the running
kernel. Because every distribution ships different kernels, the running kernel is unlikely to match
the configured source you installed. The safest approach is to compile your own kernel, boot it,
and then compile OProfile against it. If you compile OProfile for a 2.2 kernel, you must be root to
compile the module. If you are using 2.6 kernels or higher, you do not need kernel source, as long as the
OProfile driver is enabled.
</p>
<p>
On a uniprocessor machine, it is also recommended that you enable local APIC / IO_APIC support in
your kernel (this is enabled automatically for SMP kernels). With many BIOSes, a uniprocessor
kernel of version &gt;= 2.6.9 needs more than local APIC support compiled in: you must also turn the
local APIC on explicitly at boot time by passing the "lapic" option to the kernel.
</p>
<p>
On machines with power management, such as laptops, power management
must be turned off when using OProfile with 2.4 kernels, because the power management software
in the BIOS cannot handle the non-maskable interrupts (NMIs) that
OProfile uses for data collection. With 2.6 kernels, you should not need to disable
power management. If you use the NMI watchdog, be aware that
the watchdog is disabled when profiling starts and is not re-enabled until the
OProfile module is removed (or, in 2.6, when OProfile is not running).
</p>
<p>
Please note that you must save or have available the <code class="filename">vmlinux</code> file
generated during a kernel compile, as OProfile needs it (you can use
<code class="option">--no-vmlinux</code>, but this will prevent kernel profiling).
</p>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="uninstall"></a>5. Uninstalling OProfile</h2>
</div>
</div>
</div>
<p>
You must have the source tree available to uninstall OProfile; a <span><strong class="command">make uninstall</strong></span> will
remove all installed files except your configuration file in the directory <code class="filename">~/.oprofile</code>.
</p>
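<p>
A minimal sketch of the uninstall procedure follows. The source tree path is a hypothetical placeholder, and
the final step is optional, since it deletes the saved configuration that
<span><strong class="command">make uninstall</strong></span> leaves behind:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# Hypothetical location of the OProfile source tree used for the install.
cd /path/to/oprofile-source
make uninstall
# Optional: also remove the saved configuration.
rm -rf ~/.oprofile
</pre>
</td>
</tr>
</table>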
</div>
</div>
<div class="chapter" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="overview"></a>Chapter 2. Overview</h2>
</div>
</div>
</div>
<div class="toc">
<p>
<b>Table of Contents</b>
</p>
<dl>
<dt>
<span class="sect1">
<a href="#getting-started">1. Getting started</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#tools-overview">2. Tools summary</a>
</span>
</dt>
</dl>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="getting-started"></a>1. Getting started</h2>
</div>
</div>
</div>
<p>
Before you can use OProfile, you must set it up. The minimum setup required for this
is to tell OProfile where the <code class="filename">vmlinux</code> file corresponding to the
running kernel is, for example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opcontrol --vmlinux=/boot/vmlinux-`uname -r`</pre>
</td>
</tr>
</table>
<p>
If you don't want to profile the kernel itself,
you can tell OProfile you don't have a <code class="filename">vmlinux</code> file:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opcontrol --no-vmlinux</pre>
</td>
</tr>
</table>
<p>
You are now ready to start the daemon (<span><strong class="command">oprofiled</strong></span>), which collects
the profile data:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opcontrol --start</pre>
</td>
</tr>
</table>
<p>
When you want to stop profiling, you can do so with:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opcontrol --shutdown</pre>
</td>
</tr>
</table>
<p>
Note that unlike <span><strong class="command">gprof</strong></span>, no instrumentation (<code class="option">-pg</code>
and <code class="option">-a</code> options to <span><strong class="command">gcc</strong></span>)
is necessary.
</p>
<p>
Periodically (or on <span><strong class="command">opcontrol --shutdown</strong></span> or <span><strong class="command">opcontrol --dump</strong></span>)
the profile data is written out into the $SESSION_DIR/samples directory (by default at <code class="filename">/var/lib/oprofile/samples</code>).
These profile files cover shared libraries, applications, the kernel (vmlinux), and kernel modules.
You can clear the profile data (at any time) with <span><strong class="command">opcontrol --reset</strong></span>.
</p>
<p>
To place these sample database files in a specific directory instead of the default location (<code class="filename">/var/lib/oprofile</code>), use the <code class="option">--session-dir=dir</code> option. You must also pass <code class="option">--session-dir</code> on subsequent invocations so that the tools continue to use this directory. (In the future, it should be possible to specify this in an environment variable.) For example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opcontrol --no-vmlinux --session-dir=/home/me/tmpsession</pre>
</td>
</tr>
</table>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opcontrol --start --session-dir=/home/me/tmpsession</pre>
</td>
</tr>
</table>
<p>
You can get summaries of this data in a number of ways at any time. To get a summary of
data across the entire system for all of these profiles, you can do:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opreport [--session-dir=dir]</pre>
</td>
</tr>
</table>
<p>
Or, to get a more detailed summary for a particular image, you can do something like:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opreport -l /boot/vmlinux-`uname -r`</pre>
</td>
</tr>
</table>
<p>
There are also a number of other ways of presenting the data, as described later in this manual.
Note that OProfile will choose a default profiling setup for you. However, there are a number
of options you can pass to <span><strong class="command">opcontrol</strong></span> if you need to change something,
also detailed later.
</p>
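<p>
Putting the steps of this section together, a complete profiling session might look like the following sketch.
It uses only the commands introduced above; <code class="filename">./my_workload</code> is a hypothetical
stand-in for whatever you want to profile, and the <span><strong class="command">opcontrol</strong></span> commands are
assumed to be run as root:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# One-time setup: point OProfile at the running kernel's vmlinux image.
opcontrol --vmlinux=/boot/vmlinux-`uname -r`
# Discard samples left over from earlier runs.
opcontrol --reset
# Start the daemon and begin collecting samples.
opcontrol --start
# Run the workload to be profiled (hypothetical program).
./my_workload
# Flush the collected samples to the samples directory.
opcontrol --dump
# Print a system-wide summary.
opreport
# Stop profiling and shut down the daemon.
opcontrol --shutdown
</pre>
</td>
</tr>
</table>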
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="tools-overview"></a>2. Tools summary</h2>
</div>
</div>
</div>
<p>
This section gives a brief description of the available OProfile utilities and their purpose.
</p>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="filename">ophelp</code>
</span>
</dt>
<dd>
<p>
This utility lists the available events and short descriptions.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">opcontrol</code>
</span>
</dt>
<dd>
<p>
Used for controlling the OProfile data collection, discussed in <a href="#controlling" title="Chapter 3. Controlling the profiler">Chapter 3, <i>Controlling the profiler</i></a>.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">agent libraries</code>
</span>
</dt>
<dd>
<p>
Used by virtual machines (like the Java VM) to record information about JITed code being profiled. See <a href="#setup-jit" title="2. Setting up the JIT profiling feature">Section 2, &#8220;Setting up the JIT profiling feature&#8221;</a>.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">opreport</code>
</span>
</dt>
<dd>
<p>
This is the main tool for retrieving useful profile data, described in
<a href="#opreport" title="2. Image summaries and symbol summaries (opreport)">Section 2, &#8220;Image summaries and symbol summaries (<span><strong class="command">opreport</strong></span>)&#8221;</a>.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">opannotate</code>
</span>
</dt>
<dd>
<p>
This utility can be used to produce annotated source, assembly or mixed source/assembly.
Source level annotation is available only if the application was compiled with
debugging symbols. See <a href="#opannotate" title="3. Outputting annotated source (opannotate)">Section 3, &#8220;Outputting annotated source (<span><strong class="command">opannotate</strong></span>)&#8221;</a>.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">opgprof</code>
</span>
</dt>
<dd>
<p>
This utility can output gprof-style data files for a binary, for use with
<span><strong class="command">gprof -p</strong></span>. See <a href="#opgprof" title="5. gprof-compatible output (opgprof)">Section 5, &#8220;<span><strong class="command">gprof</strong></span>-compatible output (<span><strong class="command">opgprof</strong></span>)&#8221;</a>.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">oparchive</code>
</span>
</dt>
<dd>
<p>
This utility can be used to collect executables, debuginfo,
and sample files and copy the files into an archive.
The archive is self-contained and can be moved to another
machine for further analysis.
See <a href="#oparchive" title="6. Archiving measurements (oparchive)">Section 6, &#8220;Archiving measurements (<span><strong class="command">oparchive</strong></span>)&#8221;</a>.
</p>
</dd>
<dt>
<span class="term">
<code class="filename">opimport</code>
</span>
</dt>
<dd>
<p>
This utility converts sample database files from a foreign binary format (ABI) to
the native format. This is useful only when moving sample files between hosts,
for analysis on platforms other than the one used for collection.
See <a href="#opimport" title="7. Converting sample database files (opimport)">Section 7, &#8220;Converting sample database files (<span><strong class="command">opimport</strong></span>)&#8221;</a>.
</p>
</dd>
</dl>
</div>
</div>
</div>
<div class="chapter" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="controlling"></a>Chapter 3. Controlling the profiler</h2>
</div>
</div>
</div>
<div class="toc">
<p>
<b>Table of Contents</b>
</p>
<dl>
<dt>
<span class="sect1">
<a href="#controlling-daemon">1. Using <span><strong class="command">opcontrol</strong></span></a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opcontrolexamples">1.1. Examples</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#eventspec">1.2. Specifying performance counter events</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#setup-jit">2. Setting up the JIT profiling feature</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#setup-jit-jvm">2.1. JVM instrumentation</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#oprofile-gui">3. Using <span><strong class="command">oprof_start</strong></span></a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#detailed-parameters">4. Configuration details</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#hardware-counters">4.1. Hardware performance counters</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#rtc">4.2. OProfile in RTC mode</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#timer">4.3. OProfile in timer interrupt mode</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#p4">4.4. Pentium 4 support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#ia64">4.5. Intel Itanium 2 support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#ppc64">4.6. PowerPC64 support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#cell-be">4.7. Cell Broadband Engine support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#amd-ibs-support">4.8. AMD64 (x86_64) Instruction-Based Sampling (IBS) support</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#misuse">4.9. Dangerous counter settings</a>
</span>
</dt>
</dl>
</dd>
</dl>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="controlling-daemon"></a>1. Using <span><strong class="command">opcontrol</strong></span></h2>
</div>
</div>
</div>
<p>
In this section we describe the configuration and control of the profiling system
with opcontrol in more depth.
The <span><strong class="command">opcontrol</strong></span> script has a default setup, but you
can alter this with the options given below. In particular,
if your hardware supports performance counters, you can configure them.
There are a number of counters (for example, counter 0 and counter 1
on the Pentium III). Each of these counters can be programmed with
an event to count, such as cache misses or MMX operations. The event
chosen for each counter is reflected in the profile data collected
by OProfile: functions and binaries at the top of the profiles reflect
that most of the chosen events happened within that code.
</p>
<p>
Additionally, each counter has a "count" value: this corresponds to how
detailed the profile is. The lower the value, the more frequently profile
samples are taken. A counter can choose to sample only kernel code, user-space code,
or both (both is the default). Finally, some events have a "unit mask"
- this is a value that further restricts the types of event that are counted.
The event types and unit masks for your CPU are listed by <span><strong class="command">opcontrol
--list-events</strong></span>.
</p>
<p>
The <span><strong class="command">opcontrol</strong></span> script provides the following actions:
</p>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--init</code>
</span>
</dt>
<dd>
<p>
Loads the OProfile module if required and makes the OProfile driver
interface available.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--setup</code>
</span>
</dt>
<dd>
<p>
Followed by a list of arguments for profiling setup. The list of arguments
is saved in <code class="filename">/root/.oprofile/daemonrc</code>.
Giving this option is not necessary; you can just directly pass one
of the setup options, e.g. <span><strong class="command">opcontrol --no-vmlinux</strong></span>.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--status</code>
</span>
</dt>
<dd>
<p>
Show configuration information.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--start-daemon</code>
</span>
</dt>
<dd>
<p>
Start the oprofile daemon without starting actual profiling. The profiling
can then be started using <code class="option">--start</code>. This is useful for avoiding
measuring the cost of daemon startup, as <code class="option">--start</code> is a simple
write to a file in oprofilefs. Not available in 2.2/2.4 kernels.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--start</code>
</span>
</dt>
<dd>
<p>
Start data collection with either arguments provided by <code class="option">--setup</code>
or information saved in <code class="filename">/root/.oprofile/daemonrc</code>. Additionally
specifying <code class="option">--verbose</code> makes the daemon generate lots of debug data
whilst it is running.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--dump</code>
</span>
</dt>
<dd>
<p>
Force a flush of the collected profiling data to the daemon.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--stop</code>
</span>
</dt>
<dd>
<p>
Stop data collection (this separate step is not possible with 2.2 or 2.4 kernels).
</p>
</dd>
<dt>
<span class="term">
<code class="option">--shutdown</code>
</span>
</dt>
<dd>
<p>
Stop data collection and kill the daemon.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--reset</code>
</span>
</dt>
<dd>
<p>
Clears out data from current session, but leaves saved sessions.
</p>
</dd>
<dt>
<span class="term"><code class="option">--save=</code>session_name</span>
</dt>
<dd>
<p>
Save data from current session to session_name.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--deinit</code>
</span>
</dt>
<dd>
<p>
Shuts down the daemon, then unloads the OProfile module and oprofilefs.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--list-events</code>
</span>
</dt>
<dd>
<p>
List event types and unit masks.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--help</code>
</span>
</dt>
<dd>
<p>
Generate usage messages.
</p>
</dd>
</dl>
</div>
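<p>
As a rough sketch of how these actions fit together, a complete profiling session might look
like the following. The workload name <code class="filename">my_workload</code> is a placeholder
and not part of OProfile, and the bare <span><strong class="command">opreport</strong></span> call stands in
for whatever analysis you run afterwards. On 2.2/2.4 kernels the separate
<code class="option">--stop</code> step is not available, so you would go straight to
<code class="option">--shutdown</code>.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --init
# opcontrol --no-vmlinux
# opcontrol --start
# my_workload
# opcontrol --dump
# opcontrol --stop
# opreport
# opcontrol --shutdown
</pre>
</td>
</tr>
</table>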
<p>
There are a number of possible settings, of which only
<code class="option">--vmlinux</code> (or <code class="option">--no-vmlinux</code>)
is required. These settings are stored in <code class="filename">~/.oprofile/daemonrc</code>.
</p>
<div class="variablelist">
<dl>
<dt>
<span class="term"><code class="option">--buffer-size=</code>num</span>
</dt>
<dd>
<p>
Number of samples in the kernel buffer. When using a 2.6 kernel, the
buffer watershed needs to be adjusted when changing this value.
</p>
</dd>
<dt>
<span class="term"><code class="option">--buffer-watershed=</code>num</span>
</dt>
<dd>
<p>
Set the kernel buffer watershed to num samples (2.6 only). When only
buffer-size - buffer-watershed free entries remain in the kernel buffer, the data is
flushed to the daemon. The most useful values are in the range [0.25 - 0.5] * buffer-size.
</p>
</dd>
<dt>
<span class="term"><code class="option">--cpu-buffer-size=</code>num</span>
</dt>
<dd>
<p>
Number of samples in the kernel per-CPU buffer (2.6 only). If you
profile at a high rate, increasing this value can help if the log
file shows an excessive count of samples lost due to CPU buffer overflow.
</p>
</dd>
<dt>
<span class="term"><code class="option">--event=</code>[eventspec]</span>
</dt>
<dd>
<p>
Use the given performance counter event to profile.
See <a href="#eventspec" title="1.2. Specifying performance counter events">Section 1.2, &#8220;Specifying performance counter events&#8221;</a> below.
</p>
</dd>
<dt>
<span class="term"><code class="option">--session-dir=</code>dir_path</span>
</dt>
<dd>
<p>
Create or use the sample database in directory <code class="filename">dir_path</code> instead of
the default location (/var/lib/oprofile).
</p>
</dd>
<dt>
<span class="term"><code class="option">--separate=</code>[none,lib,kernel,thread,cpu,all]</span>
</dt>
<dd>
<p>
By default, every profile is stored in a single file. Thus, for example,
samples in the C library are all credited to the <code class="filename">/lib/libc.o</code>
profile. However, you can choose to create separate sample files by specifying
one of the options below.
</p>
<div class="informaltable">
<table border="1">
<colgroup>
<col />
<col />
</colgroup>
<tbody>
<tr>
<td>
<code class="option">none</code>
</td>
<td>No profile separation (default)</td>
</tr>
<tr>
<td>
<code class="option">lib</code>
</td>
<td>Create per-application profiles for libraries</td>
</tr>
<tr>
<td>
<code class="option">kernel</code>
</td>
<td>Create per-application profiles for the kernel and kernel modules</td>
</tr>
<tr>
<td>
<code class="option">thread</code>
</td>
<td>Create profiles for each thread and each task</td>
</tr>
<tr>
<td>
<code class="option">cpu</code>
</td>
<td>Create profiles for each CPU</td>
</tr>
<tr>
<td>
<code class="option">all</code>
</td>
<td>All of the above options</td>
</tr>
</tbody>
</table>
</div>
<p>
Note that <code class="option">--separate=kernel</code> also turns on <code class="option">--separate=lib</code>.
When using <code class="option">--separate=kernel</code>, samples in hardware interrupts, soft-irqs, or other
asynchronous kernel contexts are credited to the task currently running. This means you will see
seemingly nonsense profiles such as <code class="filename">/bin/bash</code> showing samples for the PPP modules,
etc.
</p>
<p>
On 2.2/2.4 kernels, only kernel threads already started when profiling begins are correctly profiled;
samples from newly started kernel threads are credited to the vmlinux (kernel) profile.
</p>
<p>
Using <code class="option">--separate=thread</code> creates a lot
of sample files if you leave OProfile running for a while; it's most
useful when used for short sessions, or when using image filtering.
</p>
</dd>
<dt>
<span class="term"><code class="option">--callgraph=</code>#depth</span>
</dt>
<dd>
<p>
Enable call-graph sample collection with a maximum depth. Use 0 to disable
callgraph profiling. NOTE: Callgraph support is available on a limited
number of platforms at this time; for example:
</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>x86 with recent 2.6 kernel</p>
</li>
<li>
<p>ARM with recent 2.6 kernel</p>
</li>
<li>
<p>PowerPC with 2.6.17 kernel</p>
</li>
</ul>
</div>
</dd>
<dt>
<span class="term"><code class="option">--image=</code>image,[images]|"all"</span>
</dt>
<dd>
<p>
Image filtering. If you specify one or more absolute
paths to binaries, OProfile will only produce profile results for those
binary images. This is useful for restricting the sometimes voluminous
output you may get otherwise, especially with
<code class="option">--separate=thread</code>. Note that if you are using
<code class="option">--separate=lib</code> or
<code class="option">--separate=kernel</code>, then if you specify an
application binary, the shared libraries and kernel code
<span class="emphasis"><em>are</em></span> included. Specify the value
"all" to profile everything (the default).
</p>
</dd>
<dt>
<span class="term"><code class="option">--vmlinux=</code>file</span>
</dt>
<dd>
<p>
Path to the vmlinux kernel image.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--no-vmlinux</code>
</span>
</dt>
<dd>
<p>
Use this when you don't have a kernel vmlinux file, and you don't want
to profile the kernel. This still counts the total number of kernel samples,
but can't give symbol-based results for the kernel or any modules.
</p>
</dd>
</dl>
</div>
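<p>
As an illustration of combining these settings, the sketch below enlarges the kernel buffer and
sets the watershed to a quarter of it (the low end of the range suggested above), then restricts
profiling to a single application with per-application library profiles. The buffer size of 65536
and the path <code class="filename">/usr/local/bin/my_app</code> are arbitrary values chosen for this
example; <span><strong class="command">opcontrol --status</strong></span> is used at the end to check what was recorded.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --buffer-size=65536 --buffer-watershed=16384
# opcontrol --separate=lib --image=/usr/local/bin/my_app
# opcontrol --vmlinux=/boot/2.6.0/vmlinux
# opcontrol --status
</pre>
</td>
</tr>
</table>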
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opcontrolexamples"></a>1.1. Examples</h3>
</div>
</div>
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="examplesperfctr"></a>1.1.1. Intel performance counter setup</h4>
</div>
</div>
</div>
<p>
Here, we have a Pentium III running at 800MHz, and we want to look at where data memory
references are happening most, and also get results for CPU time.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --event=CPU_CLK_UNHALTED:400000 --event=DATA_MEM_REFS:10000
# opcontrol --vmlinux=/boot/2.6.0/vmlinux
# opcontrol --start
</pre>
</td>
</tr>
</table>
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="examplesrtc"></a>1.1.2. RTC mode</h4>
</div>
</div>
</div>
<p>
Here, we have an Intel laptop without support for performance counters, running a 2.4 kernel.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# ophelp -r
CPU with RTC device
# opcontrol --vmlinux=/boot/2.4.13/vmlinux --event=RTC_INTERRUPTS:1024
# opcontrol --start
</pre>
</td>
</tr>
</table>
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="examplesstartdaemon"></a>1.1.3. Starting the daemon separately</h4>
</div>
</div>
</div>
<p>
If we're running 2.6 kernels, we can use <code class="option">--start-daemon</code> to avoid
the profiler startup affecting results.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --vmlinux=/boot/2.6.0/vmlinux
# opcontrol --start-daemon
# my_favourite_benchmark --init
# opcontrol --start ; my_favourite_benchmark --run ; opcontrol --stop
</pre>
</td>
</tr>
</table>
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="exampleseparate"></a>1.1.4. Separate profiles for libraries and the kernel</h4>
</div>
</div>
</div>
<p>
Here, we want to see a profile of the OProfile daemon itself, including when
it was running inside the kernel driver, and its use of shared libraries.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --separate=kernel --vmlinux=/boot/2.6.0/vmlinux
# opcontrol --start
# my_favourite_stress_test --run
# opreport -l -p /lib/modules/2.6.0/kernel /usr/local/bin/oprofiled
</pre>
</td>
</tr>
</table>
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="examplessessions"></a>1.1.5. Profiling sessions</h4>
</div>
</div>
</div>
<p>
It can often be useful to split up profiling data into several different
time periods. For example, you may want to collect data on an application's
startup separately from the normal runtime data. You can use the simple
command <span><strong class="command">opcontrol --save</strong></span> to do this. For example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --save=blah
</pre>
</td>
</tr>
</table>
<p>
will create a sub-directory in <code class="filename">$SESSION_DIR/samples</code> containing the samples
up to that point (the current session's sample files are moved into this
directory). You can then pass this session name as a parameter to the post-profiling
analysis tools, to only get data up to the point you named the
session. If you do not want to save a session, you can do
<span><strong class="command">rm -rf $SESSION_DIR/samples/sessionname</strong></span> or, for the
current session, <span><strong class="command">opcontrol --reset</strong></span>.
</p>
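<p>
A sketch of the startup/runtime split described above might look like this. The benchmark
commands are the same placeholders used in the earlier examples, and the session names
<code class="option">startup</code> and <code class="option">runtime</code> are arbitrary. The
<code class="option">session:</code> form passed to <span><strong class="command">opreport</strong></span> is our
understanding of the profile specification syntax; check the profile specification section of
this manual for the exact form your version accepts. The explicit <code class="option">--dump</code>
before each save is a precaution and may be redundant.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --start
# my_favourite_benchmark --init
# opcontrol --dump
# opcontrol --save=startup
# my_favourite_benchmark --run
# opcontrol --dump
# opcontrol --save=runtime
# opreport session:startup
# opreport session:runtime
</pre>
</td>
</tr>
</table>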
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="eventspec"></a>1.2. Specifying performance counter events</h3>
</div>
</div>
</div>
<p>
The <code class="option">--event</code> option to <span><strong class="command">opcontrol</strong></span>
takes a specification that indicates how the details of each
hardware performance counter should be set up. If you want to
revert to OProfile's default setting (<code class="option">--event</code>
is strictly optional), use <code class="option">--event=default</code>. Use of this
option overrides all previous event selections.
</p>
<p>
You can pass multiple event specifications. OProfile will allocate
hardware counters as necessary. Note that some combinations are not
allowed by the CPU; running <span><strong class="command">opcontrol --list-events</strong></span> gives the details
of each event. The event specification is a colon-separated string
of the form <code class="option"><span class="emphasis"><em>name</em></span>:<span class="emphasis"><em>count</em></span>:<span class="emphasis"><em>unitmask</em></span>:<span class="emphasis"><em>kernel</em></span>:<span class="emphasis"><em>user</em></span></code> as described in this table:
</p>
<div class="informaltable">
<table border="1">
<colgroup>
<col />
<col />
</colgroup>
<tbody>
<tr>
<td>
<code class="option">name</code>
</td>
<td>The symbolic event name, e.g. <code class="constant">CPU_CLK_UNHALTED</code></td>
</tr>
<tr>
<td>
<code class="option">count</code>
</td>
<td>The counter reset value, e.g. 100000</td>
</tr>
<tr>
<td>
<code class="option">unitmask</code>
</td>
<td>The unit mask, as given in the events list, e.g. 0x0f</td>
</tr>
<tr>
<td>
<code class="option">kernel</code>
</td>
<td>Whether to profile kernel code</td>
</tr>
<tr>
<td>
<code class="option">user</code>
</td>
<td>Whether to profile userspace code</td>
</tr>
</tbody>
</table>
</div>
<p>
The last three values are optional; if you omit them (e.g. <code class="option">--event=DATA_MEM_REFS:30000</code>),
they will be set to the default values (a unit mask of 0, and profiling both kernel and
userspace code). Note that some events require a unit mask.
</p>
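<p>
For illustration, the two specifications below spell out the optional fields. The first counts
<code class="constant">DATA_MEM_REFS</code> in userspace code only (kernel field 0, user field 1).
The second writes the defaults explicitly, so it should behave the same as
<code class="option">--event=CPU_CLK_UNHALTED:100000</code>. The event names are the Pentium III ones
used elsewhere in this manual, so check <span><strong class="command">opcontrol --list-events</strong></span>
for your CPU. Each line is an independent alternative, since a new event selection replaces
earlier ones.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opcontrol --event=DATA_MEM_REFS:30000:0:0:1
# opcontrol --event=CPU_CLK_UNHALTED:100000:0:1:1
</pre>
</td>
</tr>
</table>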
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
For the PowerPC platforms, all events specified must be in the same group; i.e., the group number
appended to the event name (e.g. <code class="constant">&lt;<span class="emphasis"><em>some-event-name</em></span>&gt;_GRP9</code>) must be the same.
</p>
</div>
<p>
If OProfile is using RTC mode, and you want to alter the default counter value,
you can use something like <code class="option">--event=RTC_INTERRUPTS:2048</code>. Note the last
three values here are ignored.
If OProfile is using timer-interrupt mode, there is no configuration possible.
</p>
<p>
The table below lists the events selected by default
(<code class="option">--event=default</code>) for the various computer architectures:
</p>
<div class="informaltable">
<table border="1">
<colgroup>
<col />
<col />
<col />
</colgroup>
<tbody>
<tr>
<td>Processor</td>
<td>cpu_type</td>
<td>Default event</td>
</tr>
<tr>
<td>Alpha EV4</td>
<td>alpha/ev4</td>
<td>CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>Alpha EV5</td>
<td>alpha/ev5</td>
<td>CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>Alpha PCA56</td>
<td>alpha/pca56</td>
<td>CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>Alpha EV6</td>
<td>alpha/ev6</td>
<td>CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>Alpha EV67</td>
<td>alpha/ev67</td>
<td>CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>ARM/XScale PMU1</td>
<td>arm/xscale1</td>
<td>CPU_CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>ARM/XScale PMU2</td>
<td>arm/xscale2</td>
<td>CPU_CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>ARM/MPCore</td>
<td>arm/mpcore</td>
<td>CPU_CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>AVR32</td>
<td>avr32</td>
<td>CPU_CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>Athlon</td>
<td>i386/athlon</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Pentium Pro</td>
<td>i386/ppro</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Pentium II</td>
<td>i386/pii</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Pentium III</td>
<td>i386/piii</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Pentium M (P6 core)</td>
<td>i386/p6_mobile</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Pentium 4 (non-HT)</td>
<td>i386/p4</td>
<td>GLOBAL_POWER_EVENTS:100000:1:1:1</td>
</tr>
<tr>
<td>Pentium 4 (HT)</td>
<td>i386/p4-ht</td>
<td>GLOBAL_POWER_EVENTS:100000:1:1:1</td>
</tr>
<tr>
<td>Hammer</td>
<td>x86-64/hammer</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Family10h</td>
<td>x86-64/family10</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Family11h</td>
<td>x86-64/family11h</td>
<td>CPU_CLK_UNHALTED:100000:0:1:1</td>
</tr>
<tr>
<td>Itanium</td>
<td>ia64/itanium</td>
<td>CPU_CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>Itanium 2</td>
<td>ia64/itanium2</td>
<td>CPU_CYCLES:100000:0:1:1</td>
</tr>
<tr>
<td>TIMER_INT</td>
<td>timer</td>
<td>None selectable</td>
</tr>
<tr>
<td>IBM iseries</td>
<td>PowerPC 4/5/970</td>
<td>CYCLES:10000:0:1:1</td>
</tr>
<tr>
<td>IBM pseries</td>
<td>PowerPC 4/5/970/Cell</td>
<td>CYCLES:10000:0:1:1</td>
</tr>
<tr>
<td>IBM s390</td>
<td>timer</td>
<td>None selectable</td>
</tr>
<tr>
<td>IBM s390x</td>
<td>timer</td>
<td>None selectable</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="setup-jit"></a>2. Setting up the JIT profiling feature</h2>
</div>
</div>
</div>
<p>
To gather information about JITed code from a virtual machine,
the virtual machine needs to be instrumented with an agent library. We use the
agent libraries for Java in the following example. To use the
Java profiling feature, you must build OProfile with the "--with-java" option
(<a href="#install" title="4. Installation">Section 4, &#8220;Installation&#8221;</a>).
</p>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="setup-jit-jvm"></a>2.1. JVM instrumentation</h3>
</div>
</div>
</div>
<p>
Add this to the startup parameters of the JVM (for JVMTI):
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen"><code xmlns="http://www.w3.org/1999/xhtml" class="option">-agentpath:&lt;libdir&gt;/libjvmti_oprofile.so[=&lt;options&gt;]</code> </pre>
</td>
</tr>
</table>
<p>
or
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen"><code xmlns="http://www.w3.org/1999/xhtml" class="option">-agentlib:jvmti_oprofile[=&lt;options&gt;]</code> </pre>
</td>
</tr>
</table>
<p>
The JVMPI agent implementation is enabled with the command line option
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen"><code xmlns="http://www.w3.org/1999/xhtml" class="option">-Xrunjvmpi_oprofile[:&lt;options&gt;]</code> </pre>
</td>
</tr>
</table>
<p>
Currently, there is just one option available: <code class="option">debug</code>. For JVMPI,
the convention for specifying an option is <code class="option">option_name=[yes|no]</code>.
For JVMTI, the option specification is simply the option name, implying
"yes"; no option specified implies "no".
</p>
<p>
The agent library (installed in <code class="filename">&lt;oprof_install_dir&gt;/lib/oprofile</code>)
needs to be in the library search path (e.g. add the library directory
to <code class="constant">LD_LIBRARY_PATH</code>). If the command line of
the JVM is not directly accessible (for example, because it is buried within shell
scripts or a launcher program), it may be possible to set an environment variable
to add the instrumentation instead.
For Sun JVMs this is <code class="constant">JAVA_TOOL_OPTIONS</code>. Please check
your JVM documentation for
further information on the agent startup options.
</p>
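<p>
Putting these pieces together, a JVMTI setup through the environment might look like the sketch
below. It assumes OProfile was installed under <code class="filename">/usr/local</code> (so the agent
library is in <code class="filename">/usr/local/lib/oprofile</code>), a Sun JVM that honours
<code class="constant">JAVA_TOOL_OPTIONS</code>, and a hypothetical application
<code class="filename">my_app.jar</code>. Adjust all three for your system.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# export LD_LIBRARY_PATH=/usr/local/lib/oprofile:$LD_LIBRARY_PATH
# export JAVA_TOOL_OPTIONS=-agentlib:jvmti_oprofile
# opcontrol --start
# java -jar my_app.jar
# opcontrol --dump
</pre>
</td>
</tr>
</table>
<p>
To turn on the agent's <code class="option">debug</code> option with JVMTI, the value would become
<code class="option">-agentlib:jvmti_oprofile=debug</code>.
</p>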
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="oprofile-gui"></a>3. Using <span><strong class="command">oprof_start</strong></span></h2>
</div>
</div>
</div>
<p>
The <span><strong class="command">oprof_start</strong></span> application provides a convenient way to start the profiler.
Note that <span><strong class="command">oprof_start</strong></span> is just a wrapper around the <span><strong class="command">opcontrol</strong></span> script,
so it does not provide more services than the script itself.
</p>
<p>
After <span><strong class="command">oprof_start</strong></span> is started you can select the event type for each counter;
the sampling rate and other related parameters are explained in <a href="#controlling-daemon" title="1. Using opcontrol">Section 1, &#8220;Using <span><strong class="command">opcontrol</strong></span>&#8221;</a>.
The "Configuration" section allows you to set general parameters such as the buffer size, kernel filename
etc. The counter setup interface should be self-explanatory; <a href="#hardware-counters" title="4.1. Hardware performance counters">Section 4.1, &#8220;Hardware performance counters&#8221;</a> and related
links contain information on using unit masks.
</p>
<p>
A status line shows the current status of the profiler: how long it has been running, the average
number of interrupts received per second, and the total number of interrupts received, over all processors.
Note that quitting <span><strong class="command">oprof_start</strong></span> does not stop the profiler.
</p>
<p>
Your configuration is saved in the same file as <span><strong class="command">opcontrol</strong></span> uses; that is,
<code class="filename">~/.oprofile/daemonrc</code>.
</p>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="detailed-parameters"></a>4. Configuration details</h2>
</div>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="hardware-counters"></a>4.1. Hardware performance counters</h3>
</div>
</div>
</div>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
Your CPU type may not include the requisite support for hardware performance counters, in which case
you must use OProfile in RTC mode in 2.4 (see <a href="#rtc" title="4.2. OProfile in RTC mode">Section 4.2, &#8220;OProfile in RTC mode&#8221;</a>), or timer mode in 2.6 (see <a href="#timer" title="4.3. OProfile in timer interrupt mode">Section 4.3, &#8220;OProfile in timer interrupt mode&#8221;</a>).
You do not really need to read this section unless you are interested in using
events other than the default event chosen by OProfile.
</p>
</div>
<p>
The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available
from <a href="http://developer.intel.com/">http://developer.intel.com/</a>.
The AMD Athlon/Opteron/Phenom/Turion implementation is detailed in <a href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf">
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf</a>.
For PowerPC64 processors in IBM iSeries, pSeries, and blade server systems, processor documentation
is available at <a href="http://www-01.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC/">
http://www-01.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC</a>. (For example, the
specific publication containing information on the performance monitor unit for the PowerPC970 is
"IBM PowerPC 970FX RISC Microprocessor User's Manual.")
These processors are capable of delivering an interrupt when a counter overflows.
This is the basic mechanism on which OProfile is based. The delivery mode is <span class="acronym">NMI</span>,
so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called,
the current <span class="acronym">PC</span> value and the current task are recorded into the profiling structure.
This allows the overflow event to be attached to a specific assembly instruction in a binary image.
The daemon receives this data from the kernel, and writes it to the sample files.
</p>
<p>
If we use an event such as <code class="constant">CPU_CLK_UNHALTED</code> or <code class="constant">INST_RETIRED</code>
(<code class="constant">GLOBAL_POWER_EVENTS</code> or <code class="constant">INSTR_RETIRED</code>, respectively, on the Pentium 4), we can
use the overflow counts as an estimate of actual time spent in each part of code. Alternatively we can profile interesting
data such as the cache behaviour of routines with the other available counters.
</p>
<p>
However there are several caveats. First, there are those issues listed in the Intel manual. There is a delay
between the counter overflow and the interrupt delivery that can skew results on a small scale - this means
you cannot rely on the profiles at the instruction level as being perfectly accurate.
If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction doesn't mean
that the instruction is responsible for that event. However, it implies that the counter overflowed in the dynamic
vicinity of that instruction, to within a few instructions. Further details on this problem can be found in
<a href="#interpreting" title="Chapter 5. Interpreting profiling results">Chapter 5, <i>Interpreting profiling results</i></a> and also in the Digital paper "ProfileMe: A Hardware Performance Counter".
</p>
<p>
Each counter has several configuration parameters.
First, there is the unit mask: this simply further specifies what to count.
Second, there is the counter value, discussed below. Third, there is a parameter controlling whether to increment counts
whilst in kernel or user space. You can configure these separately for each counter.
</p>
<p>
The counter value sets how many events are counted between samples: after each overflow event,
the counter is re-initialized such that another overflow will occur after that many events have
been counted. Thus, higher values mean less-detailed profiling, and lower values mean more detail,
but higher overhead.
Picking a good value for this
parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event
you have chosen.
Specifying too large a value will mean not enough interrupts are generated
to give a realistic profile (though this problem can be ameliorated by profiling for <span class="emphasis"><em>longer</em></span>).
Specifying too small a value can lead to higher performance overhead.
</p>
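<p>
As a rough worked example, take the 800MHz Pentium III from the examples in
<a href="#opcontrolexamples" title="1.1. Examples">Section 1.1, &#8220;Examples&#8221;</a>. This is only an approximation:
it assumes the CPU is never halted, so that <code class="constant">CPU_CLK_UNHALTED</code>
advances once per clock cycle. Under that assumption, the sampling rate per processor is simply the
clock rate divided by the counter value:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
800000000 cycles/second / 400000 cycles/sample = 2000 samples/second
800000000 cycles/second / 100000 cycles/sample = 8000 samples/second
</pre>
</td>
</tr>
</table>
<p>
Dividing the counter value by four multiplies the interrupt rate by four, and with it both the level of detail
and the overhead. For events that occur less regularly than clock cycles, such as cache misses, the
rate depends on the workload and cannot be predicted this way.
</p>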
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="rtc"></a>4.2. OProfile in RTC mode</h3>
</div>
</div>
</div>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
This section applies to 2.2/2.4 kernels only.
</p>
</div>
<p>
Some CPU types do not provide the needed hardware support to use the hardware performance counters. This includes
some laptops, classic Pentiums, and other CPU types not yet supported by OProfile (such as Cyrix).
On these machines, OProfile falls
back to using the real-time clock interrupt to collect samples. This interrupt is also used by the <span><strong class="command">rtc</strong></span>
module, so OProfile cannot use it while the rtc module is loaded or while rtc support is compiled into the kernel.
</p>
<p>
RTC mode is less capable than the hardware counters mode; in particular, it is unable to profile sections of
the kernel where interrupts are disabled. There is just one available event, "RTC interrupts", and its value
corresponds to the number of interrupts generated per second (that is, a higher number means a better profiling
resolution, and higher overhead). The current implementation of the real-time clock supports only power-of-two
sampling rates from 2 to 4096 per second. Other values within this range are rounded to the nearest power of
two.
</p>
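<p>
The rounding rule can be sketched as follows. This is an illustration only: it assumes "nearest" is measured on a
linear scale and that exact ties round up, and the kernel driver may resolve ties differently.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Sketch of the RTC rate constraint: powers of two from 2 to 4096 per second.
# Assumption: "nearest" is measured on a linear scale and exact ties round up;
# the kernel driver may resolve ties differently.

RTC_RATES = [2 ** n for n in range(1, 13)]  # 2, 4, ..., 4096

def nearest_rtc_rate(requested):
    """Return the supported RTC rate closest to the integer `requested`."""
    if requested not in range(2, 4097):
        raise ValueError("RTC rate must be an integer between 2 and 4096")
    # Sort key: distance first, then prefer the larger rate on a tie.
    return min(RTC_RATES, key=lambda rate: (abs(rate - requested), -rate))

for requested in (256, 300, 1000, 3000):
    print(requested, "->", nearest_rtc_rate(requested))
```
</pre>
</td>
</tr>
</table>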
<p>
You can force use of the RTC interrupt with the <code class="option">force_rtc=1</code> module parameter.
</p>
<p>
Setting the value from the GUI should be straightforward. On the command line, you need to specify the
event to <span><strong class="command">opcontrol</strong></span>, e.g.:
</p>
<p>
<span>
<strong class="command">opcontrol --event=RTC_INTERRUPTS:256</strong>
</span>
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="timer"></a>4.3. OProfile in timer interrupt mode</h3>
</div>
</div>
</div>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
This section applies to 2.6 kernels and above only.
</p>
</div>
<p>
In 2.6 kernels on CPUs without OProfile support for the hardware performance counters, the driver
falls back to using the timer interrupt for profiling. Like the RTC mode in 2.4 kernels, this is not able to
profile code that has interrupts disabled. Note that there are no configuration parameters for
setting this, unlike the RTC and hardware performance counter setup.
</p>
<p>
You can force use of the timer interrupt by using the <code class="option">timer=1</code> module
parameter (or <code class="option">oprofile.timer=1</code> on the boot command line if OProfile is
built-in).
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="p4"></a>4.4. Pentium 4 support</h3>
</div>
</div>
</div>
<p>
The Pentium 4 / Xeon performance counters are organized around 3 types of model specific registers (MSRs): 45 event
selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a
particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their
operation. Unfortunately the relationship between these registers is quite complex; they cannot all be used with one
another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of
one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar
to those in the P6 or Athlon/Opteron/Phenom/Turion families of CPU.
</p>
<p>
There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store
(DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described
above. Performance monitoring hardware on Pentium 4 / Xeon processors with Hyper-Threading enabled (multiple logical
processors on a single die) is not supported in 2.4 kernels (you can still use OProfile if you disable Hyper-Threading,
though).
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="ia64"></a>4.5. Intel Itanium 2 support</h3>
</div>
</div>
</div>
<p>
The Itanium 2 performance monitoring unit (PMU) organizes the counters as four
pairs of performance event monitoring registers. Each pair is composed of a
Performance Monitoring Configuration (PMC) register and Performance Monitoring
Data (PMD) register. The PMC selects the performance event being monitored and
the PMD determines the sampling interval. The IA64 Performance Monitoring Unit
(PMU) triggers sampling with maskable interrupts. Thus, samples will not occur
in sections of the IA64 kernel where interrupts are disabled.
</p>
<p>
None of the advanced features of the Itanium 2 performance monitoring unit
such as opcode matching, address range matching, or precise event sampling are
supported by this version of OProfile. The Itanium 2 support only maps OProfile's
existing interrupt-based model to the PMU hardware.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="ppc64"></a>4.6. PowerPC64 support</h3>
</div>
</div>
</div>
<p>
The performance monitoring unit (PMU) for the IBM PowerPC 64-bit processors
consists of between 4 and 8 counters (depending on the model), plus three
special purpose registers used for programming the counters -- MMCR0, MMCR1,
and MMCRA. Advanced features such as instruction matching and thresholding are
not supported by this version of OProfile.
</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
Later versions of the IBM POWER5+ processor (beginning with revision 3.0)
run the performance monitor unit in POWER6 mode, effectively removing OProfile's
access to counters 5 and 6. These two counters are dedicated to counting
instructions completed and cycles, respectively. In POWER6 mode, however, the
counters do not generate an interrupt on overflow and so are unusable by
OProfile. Kernel versions 2.6.23 and higher will recognize this mode
and export "ppc64/power5++" as the cpu_type to the oprofilefs pseudo filesystem.
OProfile userspace responds to this cpu_type by removing these counters from
the list of potential events to count. Without this kernel support, attempts
to profile using an event from one of these counters will yield incorrect
results -- typically, zero (or near zero) samples in the generated report.
</p>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="cell-be"></a>4.7. Cell Broadband Engine support</h3>
</div>
</div>
</div>
<p>
The Cell Broadband Engine (CBE) processor core consists of a PowerPC Processing
Element (PPE) and 8 Synergistic Processing Elements (SPEs). PPEs and SPEs each
consist of a processing unit (PPU and SPU, respectively) and other hardware
components, such as memory controllers.
</p>
<p>
A PPU has two hardware threads (also known as "virtual CPUs"). The performance monitor
unit of the CBE collects event information on one hardware thread at a time.
Therefore, when profiling PPE events,
OProfile collects the profile based on the selected events by time slicing the
performance counter hardware between the two threads. The user must ensure the
collection interval is long enough so that the time spent collecting data for
each PPU is sufficient to obtain a good profile.
</p>
<p>
To profile an SPU application, the user should specify the SPU_CYCLES event.
When starting OProfile with SPU_CYCLES, the opcontrol script enforces certain
separation parameters (separate=cpu,lib) to ensure that sufficient information
is collected in the sample data in order to generate a complete report. The
--merge=cpu option can be used to obtain a more readable report if analyzing
the performance of each separate SPU is not necessary.
</p>
<p>
Profiling with an SPU event (events 4100 through 4163) is not compatible with any other
event. Furthermore, only one SPU event can be specified at a time. The hardware only
supports profiling on one SPU per node at a time. The OProfile kernel code time slices
between the eight SPUs to collect data on all SPUs.
</p>
<p>
SPU profile reports have some unique characteristics compared to reports for
standard architectures:
</p>
<div class="itemizedlist">
<ul type="disc">
<li>Typically no "app name" column. This is really standard OProfile behavior
when the report contains samples for just a single application, which is
commonly the case when profiling SPUs.</li>
<li>"CPU" equates to "SPU"</li>
<li>Specifying '--long-filenames' on the opreport command does not always result
in long filenames. This happens when the SPU application code is embedded in
the PPE executable or shared library. The embedded SPU ELF data contains only the
short filename (i.e., no path information) for the SPU binary file that was used as
the source for embedding. Only the short filename is used because
the original SPU binary file may not exist or be accessible at runtime. The performance
analyst must have sufficient knowledge of the application to be able to correlate the
SPU binary image names found in the report to the application's source files.
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3>
Compile the application with -g and generate the OProfile report
with -g to facilitate finding the right source file(s) on which to focus.
</div></li>
</ul>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="amd-ibs-support"></a>4.8. AMD64 (x86_64) Instruction-Based Sampling (IBS) support</h3>
</div>
</div>
</div>
<p>
Instruction-Based Sampling (IBS) is a new performance measurement technique
available on AMD Family 10h processors. Traditional performance counter
sampling is not precise enough to isolate performance issues to individual
instructions. IBS, however, precisely identifies instructions which are not
making the best use of the processor pipeline and memory hierarchy.
For more information, please refer to the "Instruction-Based Sampling:
A New Performance Analysis Technique for AMD Family 10h Processors" (
<a href="http://developer.amd.com/assets/AMD_IBS_paper_EN.pdf">
http://developer.amd.com/assets/AMD_IBS_paper_EN.pdf</a>).
There are two IBS profile types, described in the following sections.
</p>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="ibs-fetch"></a>4.8.1. IBS Fetch</h4>
</div>
</div>
</div>
<p>
IBS fetch sampling is a statistical sampling method which counts completed
fetch operations. When the number of completed fetch operations reaches the
maximum fetch count (the sampling period), IBS tags the fetch operation and
monitors that operation until it either completes or aborts. When a tagged
fetch completes or aborts, a sampling interrupt is generated and an IBS fetch
sample is taken. An IBS fetch sample contains a timestamp, the identifier of
the interrupted process, the virtual fetch address, and several event flags
and values that describe what happened during the fetch operation.
</p>
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="ibs-op"></a>4.8.2. IBS Op</h4>
</div>
</div>
</div>
<p>
IBS op sampling selects, tags, and monitors macro-ops as issued from AMD64
instructions. Two options are available for selecting ops for sampling:
</p>
<div class="itemizedlist">
<ul type="disc">
<li>
Cycles-based selection counts CPU clock cycles. The op is tagged and monitored
when the count reaches a threshold (the sampling period) and a valid op is
available.
</li>
<li>
Dispatched op-based selection counts dispatched macro-ops.
When the count reaches a threshold, the next valid op is tagged and monitored.
</li>
</ul>
</div>
<p>
In both cases, an IBS sample is generated only if the tagged op retires.
Thus, IBS op event information does not measure speculative execution activity.
The execution stages of the pipeline monitor the tagged macro-op. When the
tagged macro-op retires, a sampling interrupt is generated and an IBS op
sample is taken. An IBS op sample contains a timestamp, the identifier of
the interrupted process, the virtual address of the AMD64 instruction from
which the op was issued, and several event flags and values that describe
what happened when the macro-op executed.
</p>
</div>
<p>
IBS profiling is enabled by specifying IBS performance events
through the <code class="option">--event=</code> option. These events are listed in the output of
<code class="function">opcontrol --list-events</code>.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
opcontrol --event=IBS_FETCH_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;
opcontrol --event=IBS_OP_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;
Note: All IBS fetch events must have the same event count and unit mask,
as must all IBS op events.
</pre>
</td>
</tr>
</table>
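<p>
The constraint in the note above can be checked before starting a session. The following is a minimal sketch, not part
of OProfile: the event names are placeholders standing in for real names from
<code class="function">opcontrol --list-events</code>, and the parser insists on all five colon-separated fields.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Check the rule that all IBS fetch events share one count and unit mask,
# and likewise all IBS op events. Event names below are placeholders.

def parse_event(spec):
    """Split 'NAME:count:um:kernel:user' into its five fields."""
    name, count, um, kernel, user = spec.split(":")
    return {"name": name, "count": int(count), "um": int(um, 0),
            "kernel": int(kernel), "user": int(user)}

def check_ibs_events(specs):
    """Raise ValueError if events in one IBS group disagree on count or um."""
    seen = {}  # group prefix: (count, um) of the first event in that group
    for event in map(parse_event, specs):
        for prefix in ("IBS_FETCH_", "IBS_OP_"):
            if event["name"].startswith(prefix):
                settings = (event["count"], event["um"])
                if seen.setdefault(prefix, settings) != settings:
                    raise ValueError(f"{event['name']}: count and unit mask "
                                     f"must match other {prefix}* events")

# Two fetch events with matching settings pass the check silently.
check_ibs_events(["IBS_FETCH_A:250000:0:1:1", "IBS_FETCH_B:250000:0:1:1"])
```
</pre>
</td>
</tr>
</table>
<p>
Fetch and op events are tracked as separate groups, so a fetch event and an op event may use different counts.
</p>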
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="misuse"></a>4.9. Dangerous counter settings</h3>
</div>
</div>
</div>
<p>
OProfile is a low-level profiler which allows continuous profiling at low overhead.
If too low a count reset value is set for a counter, the system can become so overloaded with counter
interrupts that it appears to have frozen. Whilst some validation is done, it
is not foolproof.
</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
This can happen as follows: when the profiler count
reaches zero, an NMI handler is called which stores the sample values in an internal buffer, then resets the counter
to its original value. If the count is very low, another NMI can become pending before the NMI handler has
completed. Due to the priority of the NMI, the local APIC delivers the pending interrupt immediately after
completion of the previous interrupt handler, and control never returns to other parts of the system.
In this way the system seems to be frozen.
</p>
</div>
<p>If this happens, it will be impossible to bring the system back to a workable state.
There is no way to provide real security against this happening, other than making sure to use a reasonable value
for the counter reset. For example, setting the <code class="constant">CPU_CLK_UNHALTED</code> event with a ridiculously low reset count (e.g. 500)
is likely to freeze the system.
</p>
<p>
In short: <span><strong class="command">Don't try a foolish sample count value</strong></span>. Unfortunately the definition of a foolish value
is really dependent on the event type - if ever in doubt, e-mail </p>
<div class="address">
<p><code class="email">&lt;<a href="mailto:oprofile-list@lists.sf.net">oprofile-list@lists.sf.net</a>&gt;</code>.</p>
</div>
</div>
</div>
</div>
<div class="chapter" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="results"></a>Chapter 4. Obtaining results</h2>
</div>
</div>
</div>
<div class="toc">
<p>
<b>Table of Contents</b>
</p>
<dl>
<dt>
<span class="sect1">
<a href="#profile-spec">1. Profile specifications</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#profile-spec-examples">1.1. Examples</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#profile-spec-details">1.2. Profile specification parameters</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#locating-and-managing-binary-images">1.3. Locating and managing binary images</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#no-results">1.4. What to do when you don't get any results</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#opreport">2. Image summaries and symbol summaries (<span><strong class="command">opreport</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opreport-merging">2.1. Merging separate profiles</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-comparison">2.2. Side-by-side multiple results</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-callgraph">2.3. Callgraph output</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-diff">2.4. Differential profiles with <span><strong class="command">opreport</strong></span></a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-anon">2.5. Anonymous executable mappings</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-xml">2.6. XML formatted output</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opreport-options">2.7. Options for <span><strong class="command">opreport</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#opannotate">3. Outputting annotated source (<span><strong class="command">opannotate</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opannotate-finding-source">3.1. Locating source files</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#opannotate-details">3.2. Usage of <span><strong class="command">opannotate</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#getting-jit-reports">4. OProfile results with JIT samples</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#opgprof">5. <span><strong class="command">gprof</strong></span>-compatible output (<span><strong class="command">opgprof</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opgprof-details">5.1. Usage of <span><strong class="command">opgprof</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#oparchive">6. Archiving measurements (<span><strong class="command">oparchive</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#oparchive-details">6.1. Usage of <span><strong class="command">oparchive</strong></span></a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#opimport">7. Converting sample database files (<span><strong class="command">opimport</strong></span>)</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#opimport-details">7.1. Usage of <span><strong class="command">opimport</strong></span></a>
</span>
</dt>
</dl>
</dd>
</dl>
</div>
<p>
OK, so the profiler has been running, but it's not much use unless we can get some data out. Fairly often,
OProfile does a little <span class="emphasis"><em>too</em></span> good a job of keeping overhead low, and no data reaches
the profiler. This can happen on lightly-loaded machines. Remember you can force a dump at any time with:
</p>
<p>
<span>
<strong class="command">opcontrol --dump</strong>
</span>
</p>
<p>Remember to do this before complaining that there is no profiling data!
Now that we've got some data, it has to be processed. That's the job of <span><strong class="command">opreport</strong></span>,
<span><strong class="command">opannotate</strong></span>, or <span><strong class="command">opgprof</strong></span>.
</p>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="profile-spec"></a>1. Profile specifications</h2>
</div>
</div>
</div>
<p>
All of the analysis tools take a <span class="emphasis"><em>profile specification</em></span>.
This is a set of definitions that describe which actual profiles should be
examined. The simplest profile specification is empty: this will match all
the available profile files for the current session (this is what happens
when you do <span><strong class="command">opreport</strong></span>).
</p>
<p>
Specification parameters are of the form <code class="option">name:value[,value]</code>.
For example, if I wanted to get a combined symbol summary for
<code class="filename">/bin/myprog</code> and <code class="filename">/bin/myprog2</code>,
I could do <span><strong class="command">opreport -l image:/bin/myprog,/bin/myprog2</strong></span>.
As a special case, you don't actually need to specify the <code class="option">image:</code>
part here: anything left on the command line is assumed to be an
<code class="option">image:</code> name. Similarly, if no <code class="option">session:</code>
is specified, then <code class="option">session:current</code> is assumed ("current"
is a special name for the current, or most recent, profiling session).
</p>
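<p>
To make these rules concrete, here is a small sketch of how command-line words could be turned into a profile
specification. It follows the rules described in this section but is not the tools' own parser; the function names
are illustrative, and the treatment of a backslash before a comma anticipates the escaping example shown later.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Sketch of how a profile specification could be broken into parameters,
# following the rules described above. It is not opreport's own parser.

def split_values(text):
    """Split on commas; a backslash before a comma keeps the comma literal."""
    values, current, i = [], "", 0
    while i != len(text):
        if text.startswith("\\,", i):
            current += ","
            i += 2
        elif text[i] == ",":
            values.append(current)
            current = ""
            i += 1
        else:
            current += text[i]
            i += 1
    values.append(current)
    return values

TAGS = ("archive", "session", "session-exclude", "image", "image-exclude",
        "lib-image", "lib-image-exclude", "event", "count", "unit-mask",
        "cpu", "tgid", "tid")

def parse_spec(args):
    """Map each parameter name to its list of values."""
    spec = {}
    for arg in args:
        name, sep, value = arg.partition(":")
        if not sep or name not in TAGS:
            # Anything without a known tag is treated as an image name.
            name, value = "image", arg
        spec.setdefault(name, []).extend(split_values(value))
    # With no session given, the current session is assumed.
    spec.setdefault("session", ["current"])
    return spec

print(parse_spec(["image:/bin/myprog,/bin/myprog2"]))
print(parse_spec(["/bin/myprog", "event:DATA_MEM_REFS"]))
```
</pre>
</td>
</tr>
</table>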
<p>
In addition to the comma-separated list shown above, some of the
specification parameters can take <span><strong class="command">glob</strong></span>-style
values. For example, if I want to see image summaries for all
binaries profiled in <code class="filename">/usr/bin/</code>, I could do
<span><strong class="command">opreport image:/usr/bin/\*</strong></span>. Note the necessity
to escape the special character from the shell.
</p>
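<p>
The effect of a glob-style value can be illustrated with Python's <code class="function">fnmatch</code> module.
This is only an approximation: Python's matching rules stand in for OProfile's own matcher, and here "*" also matches
"/" characters. The image names are taken from the example report later in this chapter.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
from fnmatch import fnmatchcase

# Illustration of glob-style matching against profiled image names.
# Assumption: Python's fnmatch rules stand in for OProfile's own matcher;
# in particular, "*" here also matches "/" characters.

profiled_images = [
    "/usr/bin/as",
    "/usr/local/bin/madplay",
    "/bin/bash",
    "/usr/X11R6/bin/XFree86",
]

def match_images(pattern, images):
    """Return the image names that match a glob-style pattern."""
    return [image for image in images if fnmatchcase(image, pattern)]

print(match_images("/usr/bin/*", profiled_images))
print(match_images("*/bin/bash", profiled_images))
```
</pre>
</td>
</tr>
</table>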
<p>
For <span><strong class="command">opreport</strong></span>, profile specifications can be used to
define two profiles, giving differential output. This is done by
enclosing each of the two specifications within curly braces, as shown
in the examples below. Any specifications outside of curly braces are
shared across both.
</p>
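<p>
The sharing rule can be sketched as a function that expands one command line into two complete specifications. This is
an illustration under stated assumptions, not the tool's implementation: it assumes the shell passes "{" and "}" as
separate words and that braces are never nested.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Sketch: split differential arguments into two profile specifications.
# Assumptions: the shell passes "{" and "}" to the tool as separate words,
# and braces are never nested.

def split_differential(args):
    """Return (first, second) argument lists; shared arguments go to both."""
    shared, groups, current = [], [], None
    for arg in args:
        if arg == "{":
            if current is not None:
                raise ValueError("nested braces are not handled")
            current = []
        elif arg == "}":
            if current is None:
                raise ValueError("unmatched closing brace")
            groups.append(current)
            current = None
        elif current is not None:
            current.append(arg)
        else:
            shared.append(arg)
    if current is not None or len(groups) != 2:
        raise ValueError("expected exactly two brace-enclosed specifications")
    return shared + groups[0], shared + groups[1]

first, second = split_differential(
    ["/bin/bash", "{", "archive:./orig", "}", "{", "archive:./new", "}"])
print(first)
print(second)
```
</pre>
</td>
</tr>
</table>
<p>
An empty pair of braces leaves only the shared arguments, which is how a profile from an archive can be compared with
the current session.
</p>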
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="profile-spec-examples"></a>1.1. Examples</h3>
</div>
</div>
</div>
<p>
Image summaries for all profiles with <code class="constant">DATA_MEM_REFS</code>
samples in the saved session called "stresstest":
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opreport session:stresstest event:DATA_MEM_REFS
</pre>
</td>
</tr>
</table>
<p>
Symbol summary for the application called "test_sym53c8xx,9xx". Note the
escaping is necessary as <code class="option">image:</code> takes a comma-separated list;
the argument is quoted so that the shell passes the backslash through unchanged.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opreport -l './test/test_sym53c8xx\,9xx'
</pre>
</td>
</tr>
</table>
<p>
Image summaries for all binaries in the <code class="filename">test</code> directory,
excepting <code class="filename">boring-test</code>:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opreport image:./test/\* image-exclude:./test/boring-test
</pre>
</td>
</tr>
</table>
<p>
Differential profile of a binary stored in two archives:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opreport -l /bin/bash { archive:./orig } { archive:./new }
</pre>
</td>
</tr>
</table>
<p>
Differential profile of an archived binary with the current session:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opreport -l /bin/bash { archive:./orig } { }
</pre>
</td>
</tr>
</table>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="profile-spec-details"></a>1.2. Profile specification parameters</h3>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">archive:</code>
<span class="emphasis">
<em>archivepath</em>
</span>
</span>
</dt>
<dd>
<p>
A path to an archive made with <span><strong class="command">oparchive</strong></span>.
Absence of this tag, unlike others, means "the current system",
equivalent to specifying "archive:".
</p>
</dd>
<dt>
<span class="term">
<code class="option">session:</code>
<span class="emphasis">
<em>sessionlist</em>
</span>
</span>
</dt>
<dd>
<p>
A comma-separated list of session names to resolve in. Absence of this
tag, unlike others, means "the current session", equivalent to
specifying "session:current".
</p>
</dd>
<dt>
<span class="term">
<code class="option">session-exclude:</code>
<span class="emphasis">
<em>sessionlist</em>
</span>
</span>
</dt>
<dd>
<p>
A comma-separated list of sessions to exclude.
</p>
</dd>
<dt>
<span class="term">
<code class="option">image:</code>
<span class="emphasis">
<em>imagelist</em>
</span>
</span>
</dt>
<dd>
<p>
A comma-separated list of image names to resolve. Each entry may be a relative
path, a <span><strong class="command">glob</strong></span>-style name, or a full path, e.g.</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">opreport 'image:/usr/bin/oprofiled,*op*,./opreport'</pre>
</td>
</tr>
</table>
</dd>
<dt>
<span class="term">
<code class="option">image-exclude:</code>
<span class="emphasis">
<em>imagelist</em>
</span>
</span>
</dt>
<dd>
<p>
Same as <code class="option">image:</code>, but the matching images are excluded.
</p>
</dd>
<dt>
<span class="term">
<code class="option">lib-image:</code>
<span class="emphasis">
<em>imagelist</em>
</span>
</span>
</dt>
<dd>
<p>
Same as <code class="option">image:</code>, but only for images that are for
a particular primary binary image (namely, an application). This only
makes sense to use if you're using <code class="option">--separate</code>.
This includes kernel modules and the kernel when using
<code class="option">--separate=kernel</code>.
</p>
</dd>
<dt>
<span class="term">
<code class="option">lib-image-exclude:</code>
<span class="emphasis">
<em>imagelist</em>
</span>
</span>
</dt>
<dd>
<p>
Same as <code class="option">lib-image:</code>, but the matching images
are excluded.
</p>
</dd>
<dt>
<span class="term">
<code class="option">event:</code>
<span class="emphasis">
<em>eventlist</em>
</span>
</span>
</dt>
<dd>
<p>
The symbolic event name to match on, e.g. <code class="option">event:DATA_MEM_REFS</code>.
You can pass a list of events for side-by-side comparison with <span><strong class="command">opreport</strong></span>.
When using the timer interrupt, the event is always "TIMER".
</p>
</dd>
<dt>
<span class="term">
<code class="option">count:</code>
<span class="emphasis">
<em>eventcountlist</em>
</span>
</span>
</dt>
<dd>
<p>
The event count to match on, e.g. <code class="option">event:DATA_MEM_REFS count:30000</code>.
Note that this value refers to the setting used for <span><strong class="command">opcontrol</strong></span>
only, and has nothing to do with the sample counts in the profile data
itself.
You can pass a list of event counts for side-by-side comparison with <span><strong class="command">opreport</strong></span>.
When using the timer interrupt, the count is always 0 (indicating it cannot be set).
</p>
</dd>
<dt>
<span class="term">
<code class="option">unit-mask:</code>
<span class="emphasis">
<em>masklist</em>
</span>
</span>
</dt>
<dd>
<p>
The unit mask value of the event to match on, e.g. <code class="option">unit-mask:1</code>.
You can pass a list of unit masks for side-by-side comparison with <span><strong class="command">opreport</strong></span>.
</p>
</dd>
<dt>
<span class="term">
<code class="option">cpu:</code>
<span class="emphasis">
<em>cpulist</em>
</span>
</span>
</dt>
<dd>
<p>
Only consider profiles for the given numbered CPU (starting from zero).
This is only useful when using CPU profile separation.
</p>
</dd>
<dt>
<span class="term">
<code class="option">tgid:</code>
<span class="emphasis">
<em>pidlist</em>
</span>
</span>
</dt>
<dd>
<p>
Only consider profiles for the given task groups. Unless some program
is using threads, the task group ID of a process is the same
as its process ID. This option corresponds to the POSIX
notion of a thread group.
This is only useful when using per-process profile separation.
</p>
</dd>
<dt>
<span class="term">
<code class="option">tid:</code>
<span class="emphasis">
<em>tidlist</em>
</span>
</span>
</dt>
<dd>
<p>
Only consider profiles for the given threads. When using
recent thread libraries, all threads in a process share the
same task group ID, but have different thread IDs. You can
use this option in combination with <code class="option">tgid:</code> to
restrict the results to particular threads within a process.
This is only useful when using per-process profile separation.
</p>
</dd>
</dl>
</div>
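<p>
The relationship between <code class="option">tgid:</code> and <code class="option">tid:</code> values can be observed
from any threaded program. The sketch below is an illustration that is independent of OProfile; it assumes Python 3.8
or later for <code class="function">threading.get_native_id</code>, and the helper name is illustrative.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
import os
import threading

# Show that threads of one process share a process ID (the task group ID)
# while each has its own kernel thread ID. Requires Python 3.8 or later.

def collect_ids(num_threads=3):
    """Return a list of (tgid, tid) pairs, one per worker thread."""
    ids = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_threads)

    def worker():
        pair = (os.getpid(), threading.get_native_id())
        with lock:
            ids.append(pair)
        barrier.wait()  # keep every thread alive until all have reported

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return ids

for tgid, tid in collect_ids():
    print(f"tgid:{tgid} tid:{tid}")
```
</pre>
</td>
</tr>
</table>
<p>
Every line printed shows the same task group ID with a different thread ID, which is why the two parameters
can be combined to select individual threads within one process.
</p>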
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="locating-and-managing-binary-images"></a>1.3. Locating and managing binary images</h3>
</div>
</div>
</div>
<p>
Each session's sample files can be found in the $SESSION_DIR/samples/ directory (default: <code class="filename">/var/lib/oprofile/samples/</code>).
These are used, along with the binary image files, to produce human-readable data.
In some circumstances (kernel modules in an initrd, or modules on 2.6 kernels), OProfile
will not be able to find the binary images. All the tools have an <code class="option">--image-path</code>
option to which you can pass a comma-separated list of alternate paths to search. For example,
I can let OProfile find my 2.6 modules by using <span><strong class="command">--image-path /lib/modules/2.6.0/kernel/</strong></span>.
It is your responsibility to ensure that the correct images are found when using this
option.
</p>
<p>
Note that if a binary image changes after the sample file was created, you won't be able to get useful
symbol-based data out. This situation is detected for you. If you replace a binary, you should
make sure to save the old binary if you need to do comparative profiles.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="no-results"></a>1.4. What to do when you don't get any results</h3>
</div>
</div>
</div>
<p>
When attempting to get output, you may see the error:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
error: no sample files found: profile specification too strict ?
</pre>
</td>
</tr>
</table>
<p>
What this is saying is that the profile specification you passed in,
when matched against the available sample files, resulted in no matches.
There are a number of reasons this might happen:
</p>
<div class="variablelist">
<dl>
<dt>
<span class="term">spelling</span>
</dt>
<dd>
<p>
You specified a binary name, but spelt it wrongly. Check your spelling!
</p>
</dd>
<dt>
<span class="term">profiler wasn't running</span>
</dt>
<dd>
<p>
Make very sure that OProfile was actually up and running when you ran
the binary.
</p>
</dd>
<dt>
<span class="term">binary didn't run long enough</span>
</dt>
<dd>
<p>
Remember OProfile is a statistical profiler - you're not guaranteed to
get samples for short-running programs. You can help this by using a
lower count for the performance counter, so there are a lot more samples
taken per second.
</p>
</dd>
<dt>
<span class="term">binary spent most of its time in libraries</span>
</dt>
<dd>
<p>
Similarly, if the binary spends little time in the main binary image
itself, with most of it spent in shared libraries it uses, you might
not see any samples for the binary image itself. You can check this
by using <span><strong class="command">opcontrol --separate=lib</strong></span> before the
profiling session, so <span><strong class="command">opreport</strong></span> and friends show
the library profiles on a per-application basis.
</p>
</dd>
<dt>
<span class="term">specification was really too strict</span>
</dt>
<dd>
<p>
For example, you specified something like <code class="option">tgid:3433</code>,
but no task with that group ID ever ran the code.
</p>
</dd>
<dt>
<span class="term">binary didn't generate any events</span>
</dt>
<dd>
<p>
If you're using a particular event counter, for example counting MMX
operations, the code might simply have not generated any events in the
first place. Verify the code you're profiling does what you expect it
to.
</p>
</dd>
<dt>
<span class="term">you didn't specify kernel module name correctly</span>
</dt>
<dd>
<p>
If you're using 2.6 kernels, and trying to get reports for a kernel
module, make sure to use the <code class="option">-p</code> option, and specify the
module name <span class="emphasis"><em>with</em></span> the <code class="filename">.ko</code>
extension. Also check whether the module was loaded from an initrd, in which case its binary image may not be where OProfile looks for it.
</p>
</dd>
</dl>
</div>
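<p>
For the "binary didn't run long enough" case, a quick estimate shows whether a run can be expected to yield a useful
number of samples. The sketch below assumes a steady event rate and uses illustrative function names; any counter value
it suggests is still subject to the warning about dangerously low settings.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Estimate the sample yield of a short run, and the counter value needed
# to reach a target number of samples. Assumes a steady event rate.

def expected_samples(events_per_second, count, seconds):
    """Average number of samples for a run of the given length."""
    return events_per_second * seconds / count

def largest_count_for(events_per_second, seconds, wanted_samples):
    """Largest counter value expected to yield `wanted_samples` samples."""
    return int(events_per_second * seconds / wanted_samples)

# Hypothetical program: 50 ms of CPU time on an 863.195 MHz CPU, counting
# CPU_CLK_UNHALTED at about one event per cycle.
clock_hz = 863.195e6
print(expected_samples(clock_hz, 1000000, 0.05))
print(largest_count_for(clock_hz, 0.05, 1000))
```
</pre>
</td>
</tr>
</table>
<p>
With a counter value of 1,000,000 this hypothetical run yields only a few dozen samples on average, too few for a
meaningful symbol summary. Profiling a longer run is usually safer than lowering the counter value aggressively.
</p>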
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="opreport"></a>2. Image summaries and symbol summaries (<span><strong class="command">opreport</strong></span>)</h2>
</div>
</div>
</div>
<p>
The <span><strong class="command">opreport</strong></span> utility is the primary utility you will use for
getting formatted data out of OProfile. It produces two types of data: image summaries
and symbol summaries. An image summary lists the number of samples for individual
binary images such as libraries or applications. Symbol summaries provide per-symbol
profile data. In the following example, we're getting an image summary for the whole
system:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opreport --long-filenames
CPU: PIII, speed 863.195 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 23150
905898 59.7415 /usr/lib/gcc-lib/i386-redhat-linux/3.2/cc1plus
214320 14.1338 /boot/2.6.0/vmlinux
103450 6.8222 /lib/i686/libc-2.3.2.so
60160 3.9674 /usr/local/bin/madplay
31769 2.0951 /usr/local/oprofile-pp/bin/oprofiled
26550 1.7509 /usr/lib/libartsflow.so.1.0.0
23906 1.5765 /usr/bin/as
18770 1.2378 /oprofile
15528 1.0240 /usr/lib/qt-3.0.5/lib/libqt-mt.so.3.0.5
11979 0.7900 /usr/X11R6/bin/XFree86
11328 0.7471 /bin/bash
...
</pre>
</td>
</tr>
</table>
<p>
If we had specified <code class="option">--symbols</code> in the previous command, we would have
gotten a symbol summary of all the images across the entire system. We can restrict this to only
part of the system profile; for example,
below is a symbol summary of the OProfile daemon. Note that as we used
<span><strong class="command">opcontrol --separate=kernel</strong></span>, symbols from images that <span><strong class="command">oprofiled</strong></span>
has used are also shown.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opreport -l `which oprofiled` 2&gt;/dev/null | more
CPU: PIII, speed 863.195 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 23150
vma samples % image name symbol name
0804be10 14971 28.1993 oprofiled odb_insert
0804afdc 7144 13.4564 oprofiled pop_buffer_value
c01daea0 6113 11.5144 vmlinux __copy_to_user_ll
0804b060 2816 5.3042 oprofiled opd_put_sample
0804b4a0 2147 4.0441 oprofiled opd_process_samples
0804acf4 1855 3.4941 oprofiled opd_put_image_sample
0804ad84 1766 3.3264 oprofiled opd_find_image
0804a5ec 1084 2.0418 oprofiled opd_find_module
0804ba5c 741 1.3957 oprofiled odb_hash_add_node
...
</pre>
</td>
</tr>
</table>
<p>
These are the two basic forms of report you are most likely to use regularly, but <span><strong class="command">opreport</strong></span>
can do a lot more than that, as described below.
</p>
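<p>
If you post-process image summaries in a script, each data row can be split into a sample
count, a percentage and an image path. The following is a minimal sketch, assuming the plain-text
layout shown in the image summary example in this section. The helper name
<code class="function">parse_image_summary</code> is hypothetical and not part of OProfile.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
def parse_image_summary(text):
    """Return (samples, percent, image) tuples for each data row.

    Header lines and the trailing "..." are skipped because their first
    field is not a plain integer.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0].isdigit():
            rows.append((int(fields[0]), float(fields[1]), fields[2]))
    return rows


# Rows copied from the image summary example in this section.
report = """CPU: PIII, speed 863.195 MHz (estimated)
905898 59.7415 /usr/lib/gcc-lib/i386-redhat-linux/3.2/cc1plus
214320 14.1338 /boot/2.6.0/vmlinux
103450 6.8222 /lib/i686/libc-2.3.2.so
...
"""

for samples, percent, image in parse_image_summary(report):
    if percent > 10.0:
        print(image, samples)
```
</pre>
</td>
</tr>
</table>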
<div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="opreport-merging"></a>2.1. Merging separate profiles</h3></div></div></div>
<p>
If you have used one of the <code class="option">--separate=</code> options
whilst profiling, there can be several separate profiles for
a single binary image within a session. Normally the output
will keep these images separated (so, for example, the image summary
output shows library image summaries on a per-application basis,
when using <code class="option">--separate=lib</code>).
Sometimes it can be useful to merge these results back together
before getting results. The <code class="option">--merge</code> option allows
you to do that.
</p>
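<p>
Conceptually, merging adds the per-application counts for an image back into a single total.
The sketch below only models that idea; it is not how <span><strong class="command">opreport</strong></span>
is implemented, and the applications, libraries and sample counts are hypothetical.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
from collections import defaultdict


def merge_by_library(separated):
    """Sum per-application library sample counts into one total per library."""
    merged = defaultdict(int)
    for (application, library), samples in separated.items():
        merged[library] += samples
    return dict(merged)


# Hypothetical counts, kept apart per application as with --separate=lib.
separated = {
    ("/bin/bash", "/lib/libc.so"): 700,
    ("/usr/bin/as", "/lib/libc.so"): 300,
    ("/usr/bin/as", "/usr/lib/libbfd.so"): 50,
}

print(merge_by_library(separated))
```
</pre>
</td>
</tr>
</table>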
</div>
<div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="opreport-comparison"></a>2.2. Side-by-side multiple results</h3></div></div></div>
<p>
If you have used multiple events when profiling, by default you get
side-by-side results of each event's sample values from <span><strong class="command">opreport</strong></span>.
You can restrict which events are listed by using the
<code class="option">event:</code> profile specification.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opreport-callgraph"></a>2.3. Callgraph output</h3>
</div>
</div>
</div>
<p>
This section provides details on how to use the OProfile callgraph feature.
</p>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="op-cg1"></a>2.3.1. Callgraph details</h4>
</div>
</div>
</div>
<p>
When using the <code class="option">opcontrol --callgraph</code> option, you can see what
functions are calling other functions in the output. Consider the
following program:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
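#define _GNU_SOURCE /* strfry() is a GNU extension declared in string.h */
#include &lt;string.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdio.h&gt;

#define SIZE 500000

/* qsort() hands us pointers to the array elements, i.e. pointers to char *. */
static int compare(const void *s1, const void *s2)
{
	return strcmp(*(char * const *)s1, *(char * const *)s2);
}

static void repeat(void)
{
	int i;
	char *strings[SIZE];
	char str[] = "abcdefghijklmnopqrstuvwxyz";

	for (i = 0; i &lt; SIZE; ++i) {
		strings[i] = strdup(str);
		strfry(strings[i]); /* randomly shuffle the copy in place */
	}

	qsort(strings, SIZE, sizeof(char *), compare);

	/* release the copies so the endless loop in main() does not leak */
	for (i = 0; i &lt; SIZE; ++i)
		free(strings[i]);
}

int main()
{
	while (1)
		repeat();
}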
</pre>
</td>
</tr>
</table>
<p>
When running with the call-graph option, OProfile will
record the function stack every time it takes a sample.
<span><strong class="command">opreport --callgraph</strong></span> outputs an entry for each
function, where each entry looks similar to:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
samples  %        image name               symbol name
  197       0.1548  cg                       main
  127036   99.8452  cg                       repeat
84590    42.5084  libc-2.3.2.so            strfry
  84590    66.4838  libc-2.3.2.so            strfry [self]
  39169    30.7850  libc-2.3.2.so            random_r
  3475      2.7312  libc-2.3.2.so            __i686.get_pc_thunk.bx
-------------------------------------------------------------------------------
</pre>
</td>
</tr>
</table>
<p>
Here the non-indented line is the function we're focussing upon
(<code class="function">strfry()</code>). This
line is the same as you'd get from a normal <span><strong class="command">opreport</strong></span>
output.
</p>
<p>
Above the non-indented line we find the functions that called this
function (for example, <code class="function">repeat()</code> calls
<code class="function">strfry()</code>). The samples and percentage values here
refer to the number of times we took a sample where this call was found
in the stack; the percentage is relative to all other callers of the
function we're focussing on. Note that these values are
<span class="emphasis"><em>not</em></span> call counts; they only reflect the call stack
every time a sample is taken; that is, if a call is found in the stack
at the time of a sample, it is recorded in this count.
</p>
<p>
Below the line are functions that are called by
<code class="function">strfry()</code> (called <span class="emphasis"><em>callees</em></span>).
It's clear here that <code class="function">strfry()</code> calls
<code class="function">random_r()</code>. We also see a special entry with a
"[self]" marker. This records the normal samples for the function, but
the percentage becomes relative to all callees. This allows you to
compare time spent in the function itself compared to functions it
calls. Note that if a function calls itself, then it will appear in the
list of callees of itself, but without the "[self]" marker; so recursive
calls are still clearly separable.
</p>
<p>
You may have noticed that the output lists <code class="function">main()</code>
as calling <code class="function">strfry()</code>, but it's clear from the source
that this doesn't actually happen. See <a href="#interpreting-callgraph" title="3. Interpreting call-graph profiles">Section 3, &#8220;Interpreting call-graph profiles&#8221;</a> for an explanation.
</p>
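<p>
The caller and callee percentages in the <code class="function">strfry()</code> entry can be
reproduced by hand: each group is normalised against its own total, with the "[self]" line
counted among the callees. The sketch below uses the sample counts from that entry; the helper
name <code class="function">relative_percentages</code> is illustrative only.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
def relative_percentages(entries):
    """Return (name, samples, percent) with percent relative to the group total."""
    total = sum(samples for _, samples in entries)
    return [(name, samples, 100.0 * samples / total) for name, samples in entries]


# Sample counts taken from the strfry() callgraph entry in this section.
callers = [("main", 197), ("repeat", 127036)]
callees = [
    ("strfry [self]", 84590),
    ("random_r", 39169),
    ("__i686.get_pc_thunk.bx", 3475),
]

for group in (callers, callees):
    for name, samples, percent in relative_percentages(group):
        print("%8d %8.4f %s" % (samples, percent, name))
```
</pre>
</td>
</tr>
</table>
<p>
Note that the two groups have different totals, so a caller percentage and a callee percentage
cannot be compared directly.
</p>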
</div>
<div class="sect3" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a id="cg-with-jitsupport"></a>2.3.2. Callgraph and JIT support</h4>
</div>
</div>
</div>
<p>
Callgraph output where anonymously mapped code is in the callstack can sometimes be misleading.
For all such code, the samples for the anonymously mapped code are stored in a samples subdirectory
named <code class="filename">{anon:anon}/&lt;tgid&gt;.&lt;begin_addr&gt;.&lt;end_addr&gt;</code>.
As stated earlier, if this anonymously mapped code is JITed code from a supported VM like Java,
OProfile creates an ELF file to provide a (somewhat) permanent backing file for the code.
However, when viewing callgraph output, any anonymously mapped code in the callstack
will be attributed to <code class="filename">anon (tgid:&lt;tgid&gt; range:&lt;begin_addr&gt;-&lt;end_addr&gt;)</code>,
even if a <code class="filename">.jo</code> ELF file had been created for it. See the example below.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
-------------------------------------------------------------------------------
  1         2.2727  libj9ute23.so            java.bin                 traceV
  2         4.5455  libj9ute23.so            java.bin                 utsTraceV
  4         9.0909  libj9trc23.so            java.bin                 fillInUTInterfaces
  37       84.0909  libj9trc23.so            java.bin                 twGetSequenceCounter
8         0.0154  libj9prt23.so            java.bin                 j9time_hires_clock
  27       61.3636  anon (tgid:10014 range:0x100000-0x103000) java.bin                 (no symbols)
  9        20.4545  libc-2.4.so              java.bin                 gettimeofday
  8        18.1818  libj9prt23.so            java.bin                 j9time_hires_clock [self]
-------------------------------------------------------------------------------
</pre>
</td>
</tr>
</table>
<p>
The output shows that "anon (tgid:10014 range:0x100000-0x103000)" was a callee of
<code class="code">j9time_hires_clock</code>, even though the ELF file <code class="filename">10014.jo</code> was
created for this profile run. Unfortunately, there is currently no way to correlate
that anonymous callgraph entry with its corresponding <code class="filename">.jo</code> file.
</p>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opreport-diff"></a>2.4. Differential profiles with <span><strong class="command">opreport</strong></span></h3>
</div>
</div>
</div>
<p>
Often, we'd like to be able to compare two profiles. For example, when
analysing the performance of an application, we'd like to make code
changes and examine the effect of the change. This is supported in
<span><strong class="command">opreport</strong></span> by giving a profile specification that
identifies two different profiles. The general form is:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opreport &lt;shared-spec&gt; { &lt;first-profile&gt; } { &lt;second-profile&gt; }
</pre>
</td>
</tr>
</table>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
We lost our Dragon book down the back of the sofa, so you have to be
careful to have spaces around those braces, or things will get
hopelessly confused. We can only apologise.
</p>
</div>
<p>
For each of the two profiles, the shared section is prepended to the
sub-specification given in the curly braces, and the combined specification
is then analysed. The usual parameters work both within the
shared section, and in the sub-specification within the curly braces.
</p>
<p>
A typical way to use this feature is with archives created with
<span><strong class="command">oparchive</strong></span>. Let's look at an example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ ./a
$ oparchive -o orig ./a
$ opcontrol --reset
# edit and recompile a
$ ./a
# now compare the current profile of a with the archived profile
$ opreport -xl ./a { archive:./orig } { }
CPU: PIII, speed 863.233 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a
unit mask of 0x00 (No unit mask) count 100000
samples % diff % symbol name
92435 48.5366 +0.4999 a
54226 --- --- c
49222 25.8459 +++ d
48787 25.6175 -2.2e-01 b
</pre>
</td>
</tr>
</table>
<p>
Note that we specified an empty second profile in the curly braces, as
we wanted to use the current session; alternatively, we could
have specified another archive, or a tgid etc. We specified the binary
<span><strong class="command">a</strong></span> in the shared section, so we matched that in both
the profiles we're diffing.
</p>
<p>
As in the normal output, the results are sorted by the number of
samples, and the percentage field represents the relative percentage of
the symbol's samples in the second profile.
</p>
<p>
Notice the new column in the output. This value represents the
percentage change of the relative percent between the first and the
second profile: roughly, "how much more important this symbol is".
Looking at the symbol <code class="function">a()</code>, we can see that it took
roughly the same amount of the total profile in both the first and the
second profile. The function <code class="function">c()</code> was not in the new
profile, so has been marked with <code class="function">---</code>. Note that the
sample value is the number of samples in the first profile; since we're
displaying results for the second profile, we don't list a percentage
value for it, as it would be meaningless. <code class="function">d()</code> is
new in the second profile, and consequently marked with
<code class="function">+++</code>.
</p>
<p>
When comparing profiles between different binaries, it should be clear
that functions can change in terms of VMA and size. To avoid this
problem, <span><strong class="command">opreport</strong></span> considers a symbol to be the same
if the symbol name, image name, and owning application name all match;
any other factors are ignored. Note that the check for application name
means that trying to compare library profiles between two different
applications will not work as you might expect: each symbol will be
considered different.
</p>
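<p>
The matching rule can be pictured as keying every symbol on the triple (symbol name, image name,
application name). The following is a simplified sketch of that rule, not
<span><strong class="command">opreport</strong></span>'s implementation. The first-profile counts for
<code class="function">a()</code> and <code class="function">b()</code>, and the
<code class="function">memcpy</code> entries, are hypothetical.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
def classify(first, second):
    """Map each (symbol, image, application) key to '+++', '---' or 'both'."""
    result = {}
    for key in set(first) | set(second):
        if key not in first:
            result[key] = "+++"   # only present in the second profile
        elif key not in second:
            result[key] = "---"   # only present in the first profile
        else:
            result[key] = "both"
    return result


# Keys are (symbol name, image name, application name).
first = {
    ("a", "a", "a"): 90000,                 # hypothetical count
    ("b", "a", "a"): 48000,                 # hypothetical count
    ("c", "a", "a"): 54226,
    ("memcpy", "libc.so", "app1"): 10,      # hypothetical library symbol
}
second = {
    ("a", "a", "a"): 92435,
    ("d", "a", "a"): 49222,
    ("b", "a", "a"): 48787,
    ("memcpy", "libc.so", "app2"): 12,      # same symbol, different application
}

for key, mark in sorted(classify(first, second).items()):
    print(mark, key)
```
</pre>
</td>
</tr>
</table>
<p>
The two <code class="function">memcpy</code> entries end up as one removed and one new symbol,
which is why comparing library profiles across different applications does not line up.
</p>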
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opreport-anon"></a>2.5. Anonymous executable mappings</h3>
</div>
</div>
</div>
<p>
Many applications, typically ones involving dynamic compilation into
machine code (just-in-time, or "JIT", compilation), have executable mappings that
are not backed by an ELF file. <span><strong class="command">opreport</strong></span> has basic support for showing the
samples taken in these regions; for example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opreport /usr/bin/mono -l
CPU: ppc64 POWER5, speed 1654.34 MHz (estimated)
Counted CYCLES events (Processor Cycles using continuous sampling) with a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
47 58.7500 mono (no symbols)
14 17.5000 anon (tgid:3189 range:0xf72aa000-0xf72fa000) (no symbols)
9 11.2500 anon (tgid:3189 range:0xf6cca000-0xf6dd9000) (no symbols)
. . . .
</pre>
</td>
</tr>
</table>
<p>
Note that, since such mappings are dependent upon individual invocations of
a binary, these mappings are always listed as a dependent image,
even when using <code class="option">--separate=none</code>.
Equally, the results are not affected by the <code class="option">--merge</code>
option.
</p>
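<p>
The image name of an anonymous mapping encodes the owning thread group ID and the address
range of the mapping. Here is a small sketch that pulls those fields apart, assuming the
<code class="filename">anon (tgid:N range:0xBEGIN-0xEND)</code> layout shown in the report above.
The helper name <code class="function">parse_anon_image</code> is hypothetical.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
import re

# Pattern for image names such as: anon (tgid:3189 range:0xf72aa000-0xf72fa000)
ANON_IMAGE = re.compile(r"anon \(tgid:(\d+) range:0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\)")


def parse_anon_image(image_name):
    """Return (tgid, begin_addr, end_addr), or None if the name does not match."""
    match = ANON_IMAGE.search(image_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16), int(match.group(3), 16)


tgid, begin, end = parse_anon_image("anon (tgid:3189 range:0xf72aa000-0xf72fa000)")
print("tgid", tgid, "mapping size", (end - begin) // 1024, "KiB")
```
</pre>
</td>
</tr>
</table>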
<p>
As shown in the opreport output above, OProfile is unable to attribute the samples to any
symbol(s) because there is no ELF file for this code.
Enhanced support for JITed code is now available for some virtual machines;
e.g., the Java Virtual Machine. For details about OProfile output for
JITed code, see <a href="#getting-jit-reports" title="4. OProfile results with JIT samples">Section 4, &#8220;OProfile results with JIT samples&#8221;</a>.
</p>
<p>For more information about JIT support in OProfile, see <a href="#jitsupport" title="1.1. Support for dynamically compiled (JIT) code">Section 1.1, &#8220;Support for dynamically compiled (JIT) code&#8221;</a>.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opreport-xml"></a>2.6. XML formatted output</h3>
</div>
</div>
</div>
<p>
The <code class="option">--xml</code> option can be used to generate XML instead of the usual
text format. This allows opreport to eliminate some of the constraints
dictated by the two-dimensional text format. For example, it is possible
to separate the sample data across multiple events, CPUs and threads. The XML
schema implemented by opreport is found in doc/opreport.xsd. It contains
more detailed comments about the structure of the XML generated by opreport.
</p>
<p>
Since XML is consumed by a client program rather than a user, its structure
is fairly static. In particular, the --sort option is incompatible with the
--xml option. Percentages are not displayed in the XML, so the options related
to percentages have no effect. Full pathnames are always displayed in
the XML, so --long-filenames is not necessary. The --details option causes
all of the individual sample data to be included in the XML, as well as the
instruction byte stream for each symbol (for doing disassembly), and can result
in very large XML files.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opreport-options"></a>2.7. Options for <span><strong class="command">opreport</strong></span></h3>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--accumulated / -a</code>
</span>
</dt>
<dd>
<p>
Accumulate sample and percentage counts in the symbol list.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--callgraph / -c</code>
</span>
</dt>
<dd>
<p>
Show callgraph information.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--debug-info / -g</code>
</span>
</dt>
<dd>
<p>
Show source file and line for each symbol.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--demangle / -D none|normal|smart</code>
</span>
</dt>
<dd>
<p>
none: no demangling. normal: use the default demangler (this is the default).
smart: use pattern-matching to make C++ symbol demangling more readable.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--details / -d</code>
</span>
</dt>
<dd>
<p>
Show per-instruction details for all selected symbols. Note that, for
binaries without symbol information, the VMA values shown are raw file
offsets for the image binary.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--exclude-dependent / -x</code>
</span>
</dt>
<dd>
<p>
Do not include application-specific images for libraries, kernel modules
and the kernel. This option only makes sense if the profile session
used --separate.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--exclude-symbols / -e [symbols]</code>
</span>
</dt>
<dd>
<p>
Exclude all the symbols in the given comma-separated list.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--global-percent / -%</code>
</span>
</dt>
<dd>
<p>
Make all percentages relative to the whole profile.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--help / -? / --usage</code>
</span>
</dt>
<dd>
<p>
Show help message.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--image-path / -p [paths]</code>
</span>
</dt>
<dd>
<p>
Comma-separated list of additional paths to search for binaries.
This is needed to find modules in kernels 2.6 and upwards.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--root / -R [path]</code>
</span>
</dt>
<dd>
<p>
A path to a filesystem to search for additional binaries.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--include-symbols / -i [symbols]</code>
</span>
</dt>
<dd>
<p>
Only include symbols in the given comma-separated list.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--long-filenames / -f</code>
</span>
</dt>
<dd>
<p>
Output full paths instead of basenames.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--merge / -m [lib,cpu,tid,tgid,unitmask,all]</code>
</span>
</dt>
<dd>
<p>
Merge any profiles separated in a --separate session.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--no-header</code>
</span>
</dt>
<dd>
<p>
Don't output a header detailing profiling parameters.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--output-file / -o [file]</code>
</span>
</dt>
<dd>
<p>
Output to the given file instead of stdout.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--reverse-sort / -r</code>
</span>
</dt>
<dd>
<p>
Reverse the sort from the default.
</p>
</dd>
<dt>
<span class="term"><code class="option">--session-dir=</code>dir_path</span>
</dt>
<dd>
<p>
Use sample database out of directory <code class="filename">dir_path</code>
instead of the default location (/var/lib/oprofile).
</p>
</dd>
<dt>
<span class="term">
<code class="option">--show-address / -w</code>
</span>
</dt>
<dd>
<p>
Show the VMA address of each symbol (off by default).
</p>
</dd>
<dt>
<span class="term">
<code class="option">--sort / -s [vma,sample,symbol,debug,image]</code>
</span>
</dt>
<dd>
<p>
Sort the list of symbols by, respectively, symbol address,
number of samples, symbol name, debug filename and line number,
binary image filename.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--symbols / -l</code>
</span>
</dt>
<dd>
<p>
List per-symbol information instead of a binary image summary.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--threshold / -t [percentage]</code>
</span>
</dt>
<dd>
<p>
Only output data for symbols that have more than the given percentage
of total samples.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--verbose / -V [options]</code>
</span>
</dt>
<dd>
<p>
Give verbose debugging output.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--version / -v</code>
</span>
</dt>
<dd>
<p>
Show version.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--xml / -X</code>
</span>
</dt>
<dd>
<p>
Generate XML output.
</p>
</dd>
</dl>
</div>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="opannotate"></a>3. Outputting annotated source (<span><strong class="command">opannotate</strong></span>)</h2>
</div>
</div>
</div>
<p>
The <span><strong class="command">opannotate</strong></span> utility generates annotated source files or assembly listings, optionally
mixed with source.
If you want to see the source file, the profiled application needs to have debug information, and the source
must be available through this debug information. For GCC, you must use the <code class="option">-g</code> option
when you are compiling.
If the binary doesn't contain sufficient debug information, you can still
use <span><strong class="command">opannotate <code class="option">--assembly</code></strong></span> to get annotated assembly.
</p>
<p>
Note that for the reason explained in <a href="#hardware-counters" title="4.1. Hardware performance counters">Section 4.1, &#8220;Hardware performance counters&#8221;</a> the results can be
inaccurate. The debug information itself can add other problems; for example, the line number for a symbol can be
incorrect. Assembly instructions can be re-ordered and moved by the compiler, and this can lead to
crediting source lines with samples not really "owned" by this line. Also see
<a href="#interpreting" title="Chapter 5. Interpreting profiling results">Chapter 5, <i>Interpreting profiling results</i></a>.
</p>
<p>
You can output the annotation to one single file, containing all the source found, using the
<code class="option">--source</code> option. You can use this in conjunction with <code class="option">--assembly</code>
to get combined source/assembly output.
</p>
<p>
You can also output a directory of annotated source files that maintains the structure of
the original sources. Each line in the annotated source is prepended with the samples
for that line. Additionally, each symbol is annotated giving details for the symbol
as a whole. An example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opannotate --source --output-dir=annotated /usr/local/oprofile-pp/bin/oprofiled
$ ls annotated/home/moz/src/oprofile-pp/daemon/
opd_cookie.h opd_image.c opd_kernel.c opd_sample_files.c oprofiled.c
</pre>
</td>
</tr>
</table>
<p>
Line numbers are maintained in the source files, but each file has
a footer appended describing the profiling details. The actual annotation
looks something like this:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
...
:static uint64_t pop_buffer_value(struct transient * trans)
11510 1.9661 :{ /* pop_buffer_value total: 89901 15.3566 */
: uint64_t val;
:
10227 1.7469 : if (!trans-&gt;remaining) {
: fprintf(stderr, "BUG: popping empty buffer !\n");
: exit(EXIT_FAILURE);
: }
:
: val = get_buffer_value(trans-&gt;buffer, 0);
2281 0.3896 : trans-&gt;remaining--;
2296 0.3922 : trans-&gt;buffer += kernel_pointer_size;
: return val;
10454 1.7857 :}
...
</pre>
</td>
</tr>
</table>
<p>
The first number on each line is the number of samples, whilst the second is
the relative percentage of total samples.
</p>
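<p>
Because every annotated line carries its counts before the first colon, the annotation is easy
to post-process. The sketch below assumes the line layout shown in the excerpt above (optional
sample count and percentage, a colon, then the original source text). The helper name
<code class="function">parse_annotated_line</code> is hypothetical.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
def parse_annotated_line(line):
    """Split an annotated line into (samples, percent, source).

    Lines without counts yield zero samples; lines without a colon
    (such as "...") yield None.
    """
    prefix, separator, source = line.partition(":")
    if not separator:
        return None
    fields = prefix.split()
    if len(fields) == 2:
        return int(fields[0]), float(fields[1]), source
    return 0, 0.0, source


# Lines copied from the pop_buffer_value() excerpt in this section.
annotated = [
    "               :static uint64_t pop_buffer_value(struct transient * trans)",
    " 11510  1.9661 :{ /* pop_buffer_value total:  89901 15.3566 */",
    " 10227  1.7469 :        if (!trans->remaining) {",
    "  2281  0.3896 :        trans->remaining--;",
    " 10454  1.7857 :}",
]

parsed = [parse_annotated_line(line) for line in annotated]
hottest = max(parsed, key=lambda entry: entry[0])
print(hottest[0], hottest[2].strip())
```
</pre>
</td>
</tr>
</table>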
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opannotate-finding-source"></a>3.1. Locating source files</h3>
</div>
</div>
</div>
<p>
Of course, <span><strong class="command">opannotate</strong></span> needs to be able to locate the source files
for the binary image(s) in order to produce output. Some binary images have debug
information where the given source file paths are relative, not absolute. You can
specify search paths to look for these files (similar to <span><strong class="command">gdb</strong></span>'s
<code class="option">dir</code> command) with the <code class="option">--search-dirs</code> option.
</p>
<p>
Sometimes you may have a binary image which gives absolute paths for the source files,
but you have the actual sources elsewhere (commonly, you've installed an SRPM for
a binary on your system and you want annotation from an existing profile). You can
use the <code class="option">--base-dirs</code> option to redirect OProfile to look somewhere
else for source files. For example, imagine we have a binary generated from a source
file that is given in the debug information as <code class="filename">/tmp/build/libfoo/foo.c</code>,
and you have the source tree matching that binary installed in <code class="filename">/home/user/libfoo/</code>.
You can redirect OProfile to find <code class="filename">foo.c</code> correctly like this:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opannotate --source --base-dirs=/tmp/build/libfoo/ --search-dirs=/home/user/libfoo/ --output-dir=annotated/ /lib/libfoo.so
</pre>
</td>
</tr>
</table>
<p>
You can specify multiple (comma-separated) paths to both options.
</p>
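<p>
The interplay of the two options can be sketched as follows: a matching base directory prefix
is stripped from the path recorded in the debug information, and the remainder is looked up
under each search directory. This is a simplified model under those assumptions, not
<span><strong class="command">opannotate</strong></span>'s actual lookup code; the function name
<code class="function">candidate_paths</code> and the handling of unmatched absolute paths are
illustrative choices.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
import posixpath


def candidate_paths(debug_path, base_dirs, search_dirs):
    """List the locations where a source file would be looked for."""
    relative = None
    for base in base_dirs:
        prefix = base.rstrip("/") + "/"
        if debug_path.startswith(prefix):
            # Strip the base directory prefix from the recorded path.
            relative = debug_path[len(prefix):]
            break
    if relative is None:
        if posixpath.isabs(debug_path):
            # Assumption: an absolute path with no matching prefix is used as-is.
            return [debug_path]
        relative = debug_path
    return [posixpath.join(directory, relative) for directory in search_dirs]


print(candidate_paths("/tmp/build/libfoo/foo.c",
                      ["/tmp/build/libfoo/"],
                      ["/home/user/libfoo/"]))
```
</pre>
</td>
</tr>
</table>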
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opannotate-details"></a>3.2. Usage of <span><strong class="command">opannotate</strong></span></h3>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--assembly / -a</code>
</span>
</dt>
<dd>
<p>
Output annotated assembly. If this is combined with --source, then mixed
source / assembly annotations are output.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--base-dirs / -b [paths]</code>
</span>
</dt>
<dd>
<p>
Comma-separated list of path prefixes. This can be used to point OProfile to a
different location for source files when the debug information specifies an
absolute path on your system for the source that does not exist. The prefix
is stripped from the debug source file paths, then searched in the search dirs
specified by <code class="option">--search-dirs</code>.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--demangle / -D none|normal|smart</code>
</span>
</dt>
<dd>
<p>
none: no demangling. normal: use the default demangler (this is the default).
smart: use pattern-matching to make C++ symbol demangling more readable.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--exclude-dependent / -x</code>
</span>
</dt>
<dd>
<p>
Do not include application-specific images for libraries, kernel modules
and the kernel. This option only makes sense if the profile session
used --separate.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--exclude-file [files]</code>
</span>
</dt>
<dd>
<p>
Exclude all files in the given comma-separated list of glob patterns.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--exclude-symbols / -e [symbols]</code>
</span>
</dt>
<dd>
<p>
Exclude all the symbols in the given comma-separated list.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--help / -? / --usage</code>
</span>
</dt>
<dd>
<p>
Show help message.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--image-path / -p [paths]</code>
</span>
</dt>
<dd>
<p>
Comma-separated list of additional paths to search for binaries.
This is needed to find modules in kernels 2.6 and upwards.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--root / -R [path]</code>
</span>
</dt>
<dd>
<p>
A path to a filesystem to search for additional binaries.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--include-file [files]</code>
</span>
</dt>
<dd>
<p>
Only include files in the given comma-separated list of glob patterns.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--include-symbols / -i [symbols]</code>
</span>
</dt>
<dd>
<p>
Only include symbols in the given comma-separated list.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--objdump-params [params]</code>
</span>
</dt>
<dd>
<p>
Pass the given parameters as extra values when calling objdump.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--output-dir / -o [dir]</code>
</span>
</dt>
<dd>
<p>
Output directory. This makes opannotate output one annotated file for each
source file. This option can't be used in conjunction with --assembly.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--search-dirs / -d [paths]</code>
</span>
</dt>
<dd>
<p>
Comma-separated list of paths to search for source files. This is useful to find
source files when the debug information only contains relative paths.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--source / -s</code>
</span>
</dt>
<dd>
<p>
Output annotated source. This requires debugging information to be available
for the binaries.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--threshold / -t [percentage]</code>
</span>
</dt>
<dd>
<p>
Only output data for symbols that have more than the given percentage
of total samples.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--verbose / -V [options]</code>
</span>
</dt>
<dd>
<p>
Give verbose debugging output.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--version / -v</code>
</span>
</dt>
<dd>
<p>
Show version.
</p>
</dd>
</dl>
</div>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="getting-jit-reports"></a>4. OProfile results with JIT samples</h2>
</div>
</div>
</div>
<p>
After profiling a Java (or other supported VM) application, the command
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen"><span xmlns="http://www.w3.org/1999/xhtml"><strong class="command">"opcontrol --dump"</strong></span> </pre>
</td>
</tr>
</table>
<p>
flushes the sample buffers and creates ELF binaries from the
intermediate files that were written by the agent library.
The ELF binaries are named <code class="filename">&lt;tgid&gt;.jo</code>.
With the symbol information stored in these ELF files, it is
possible to map samples to the appropriate symbols.
</p>
<p>
The usual analysis tools (<span><strong class="command">opreport</strong></span> and/or
<span><strong class="command">opannotate</strong></span>) can now be used
to get symbols and assembly code for the instrumented VM processes.
</p>
<p>
Below is an example of a profile report of a Java application that has been
instrumented with the provided agent library.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opreport -l /usr/lib/jvm/jre-1.5.0-ibm/bin/java
CPU: Core Solo / Duo, speed 2167 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name symbol name
186020 50.0523 no-vmlinux no-vmlinux (no symbols)
34333 9.2380 7635.jo java void test.f1()
19022 5.1182 libc-2.5.so libc-2.5.so _IO_file_xsputn@@GLIBC_2.1
18762 5.0483 libc-2.5.so libc-2.5.so vfprintf
16408 4.4149 7635.jo java void test$HelloThread.run()
16250 4.3724 7635.jo java void test$test_1.f2(int)
15303 4.1176 7635.jo java void test.f2(int, int)
13252 3.5657 7635.jo java void test.f2(int)
5165 1.3897 7635.jo java void test.f4()
955 0.2570 7635.jo java void test$HelloThread.run()~
</pre>
</td>
</tr>
</table>
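<p>
Because JIT samples are attributed to the <code class="filename">&lt;tgid&gt;.jo</code> image,
a report like the one above can be summed per image to see how much of the profile
falls in JITed code. The following is only a minimal Python sketch, not part of OProfile:
the helper name <code class="function">samples_per_image</code> is hypothetical, and it assumes
rows with the five whitespace-separated columns shown in the example (samples, percentage,
image, application, symbol).
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
from collections import defaultdict

# A few rows copied from the report above: samples, %, image, app, symbol.
REPORT = """\
186020 50.0523 no-vmlinux no-vmlinux (no symbols)
34333 9.2380 7635.jo java void test.f1()
19022 5.1182 libc-2.5.so libc-2.5.so _IO_file_xsputn@@GLIBC_2.1
16408 4.4149 7635.jo java void test$HelloThread.run()
"""


def samples_per_image(report):
    """Sum the sample counts of each image (hypothetical helper)."""
    totals = defaultdict(int)
    for line in report.splitlines():
        # The symbol name may contain spaces, so split at most four times.
        samples, _percent, image, _app, _symbol = line.split(None, 4)
        totals[image] += int(samples)
    return dict(totals)


print(samples_per_image(REPORT))
```
</pre>
</td>
</tr>
</table>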
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
Depending on the JVM that is used, certain options of opreport and opannotate
do NOT work since they rely on debug information (e.g. source code line number)
that is not always available. The Sun JVM does provide the necessary debug
information via the JVMTI[PI] interface,
but other JVMs do not.
</p>
</div>
<p>
As the opreport output shows, the JIT support agent for Java
generates symbols that include the class and method signature.
A symbol with the suffix ~&lt;n&gt; (e.g.
<code class="code">void test$HelloThread.run()~1</code>) is
the &lt;n&gt;th occurrence of an identical name. This happens if a method is re-JITed.
A symbol with the suffix %&lt;n&gt; means that the address space of this symbol
was reused during the sample session (see <a href="#overlapping-symbols" title="6. Overlapping symbols in JITed code">Section 6, &#8220;Overlapping symbols in JITed code&#8221;</a>).
The value &lt;n&gt; is the percentage of time that this symbol/code was present in
relation to the total lifetime of all other overlapping symbols. A symbol of the form
<code class="code">&lt;return_val&gt; &lt;class_name&gt;$&lt;method_sig&gt;</code> denotes an
inner class.
</p>
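<p>
The naming rules above can be applied mechanically when post-processing a report.
The following Python sketch is an illustration only: the helper
<code class="function">split_jit_symbol</code> is hypothetical, and it assumes that the
occurrence suffix uses the ASCII tilde seen in the sample report and that both suffixes
consist of decimal digits.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
import re

# Base name, optional "~n" occurrence suffix, optional "%n" lifetime suffix.
SUFFIX = re.compile(r"^(.*?)(?:~(\d+))?(?:%(\d+))?$")


def split_jit_symbol(symbol):
    """Split a JIT symbol name into its documented parts (hypothetical helper)."""
    base, occurrence, lifetime = SUFFIX.match(symbol).groups()
    return {
        "name": base,
        # n-th occurrence of an identical name (method was re-JITed)
        "occurrence": int(occurrence) if occurrence else None,
        # share of the overlapping lifetime, in percent
        "lifetime_percent": int(lifetime) if lifetime else None,
        # a "$" between class name and method signature marks an inner class
        "inner_class": "$" in base,
    }


print(split_jit_symbol("void test$HelloThread.run()~1"))
```
</pre>
</td>
</tr>
</table>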
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="opgprof"></a>5. <span><strong class="command">gprof</strong></span>-compatible output (<span><strong class="command">opgprof</strong></span>)</h2>
</div>
</div>
</div>
<p>
If you're familiar with the output produced by <span><strong class="command">GNU gprof</strong></span>,
you may find <span><strong class="command">opgprof</strong></span> useful. It takes a single binary
as an argument, and produces a <code class="filename">gmon.out</code> file for use
with <span><strong class="command">gprof -p</strong></span>. If call-graph profiling is enabled,
then this is also included.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opgprof `which oprofiled` # generates gmon.out file
$ gprof -p `which oprofiled` | head
Flat profile:
Each sample counts as 1 samples.
% cumulative self self total
time samples samples calls T1/call T1/call name
33.13 206237.00 206237.00 odb_insert
22.67 347386.00 141149.00 pop_buffer_value
9.56 406881.00 59495.00 opd_put_sample
7.34 452599.00 45718.00 opd_find_image
7.19 497327.00 44728.00 opd_process_samples
</pre>
</td>
</tr>
</table>
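<p>
When reading the flat profile, note that the cumulative column is simply the running
sum of the self column. A small Python sketch, using the sample counts copied from
the output above, makes the relationship explicit:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
from itertools import accumulate

# Self samples per symbol, copied from the flat profile above.
SELF_SAMPLES = [206237, 141149, 59495, 45718, 44728]

# Each cumulative entry is the sum of all self entries up to that row.
cumulative = list(accumulate(SELF_SAMPLES))
print(cumulative)
```
</pre>
</td>
</tr>
</table>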
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opgprof-details"></a>5.1. Usage of <span><strong class="command">opgprof</strong></span></h3>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--help / -? / --usage</code>
</span>
</dt>
<dd>
<p>
Show help message.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--image-path / -p [paths]</code>
</span>
</dt>
<dd>
<p>
Comma-separated list of additional paths to search for binaries.
This is needed to find modules in kernels 2.6 and upwards.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--root / -R [path]</code>
</span>
</dt>
<dd>
<p>
A path to a filesystem to search for additional binaries.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--output-filename / -o [file]</code>
</span>
</dt>
<dd>
<p>
Output to the given file instead of the default, <code class="filename">gmon.out</code>.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--threshold / -t [percentage]</code>
</span>
</dt>
<dd>
<p>
Only output data for symbols that have more than the given percentage
of total samples.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--verbose / -V [options]</code>
</span>
</dt>
<dd>
<p>
Give verbose debugging output.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--version / -v</code>
</span>
</dt>
<dd>
<p>
Show version.
</p>
</dd>
</dl>
</div>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="oparchive"></a>6. Archiving measurements (<span><strong class="command">oparchive</strong></span>)</h2>
</div>
</div>
</div>
<p>
The <span><strong class="command">oparchive</strong></span> utility generates a directory populated
with executable, debug, and oprofile sample files. This directory can be
moved to another machine via <span><strong class="command">tar</strong></span> and analyzed without
further use of the data collection machine.
</p>
<p>
The following command would collect the sample files, the executables
associated with the sample files, and the debuginfo files associated
with the executables and copy them into
<code class="filename">/tmp/current_data</code>:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# oparchive -o /tmp/current_data
</pre>
</td>
</tr>
</table>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="oparchive-details"></a>6.1. Usage of <span><strong class="command">oparchive</strong></span></h3>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--help / -? / --usage</code>
</span>
</dt>
<dd>
<p>
Show help message.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--exclude-dependent / -x</code>
</span>
</dt>
<dd>
<p>
Do not include application-specific images for libraries, kernel modules
and the kernel. This option only makes sense if the profile session
used --separate.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--image-path / -p [paths]</code>
</span>
</dt>
<dd>
<p>
Comma-separated list of additional paths to search for binaries.
This is needed to find modules in kernels 2.6 and upwards.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--root / -R [path]</code>
</span>
</dt>
<dd>
<p>
A path to a filesystem to search for additional binaries.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--output-directory / -o [directory]</code>
</span>
</dt>
<dd>
<p>
Output to the given directory. There is no default. This must be specified.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--list-files / -l</code>
</span>
</dt>
<dd>
<p>
Only list the files that would be archived; do not copy them.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--verbose / -V [options]</code>
</span>
</dt>
<dd>
<p>
Give verbose debugging output.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--version / -v</code>
</span>
</dt>
<dd>
<p>
Show version.
</p>
</dd>
</dl>
</div>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="opimport"></a>7. Converting sample database files (<span><strong class="command">opimport</strong></span>)</h2>
</div>
</div>
</div>
<p>
This utility converts sample database files from a foreign binary format (abi) to
the native format. This is useful only when moving sample files between hosts,
for analysis on platforms other than the one used for collection. The abi format
of the file to be imported is described in a text file located in <code class="filename">$SESSION_DIR/abi</code>.
</p>
<p>
The following command converts the input sample file to the
output sample file, using the given abi file as a binary description
of the input file and the current platform abi as a binary description
of the output file.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
# opimport -a /var/lib/oprofile/abi -o /tmp/current/.../GLOBAL_POWER_EVENTS.200000.1.all.all.all /var/lib/.../mprime/GLOBAL_POWER_EVENTS.200000.1.all.all.all
</pre>
</td>
</tr>
</table>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="opimport-details"></a>7.1. Usage of <span><strong class="command">opimport</strong></span></h3>
</div>
</div>
</div>
<div class="variablelist">
<dl>
<dt>
<span class="term">
<code class="option">--help / -? / --usage</code>
</span>
</dt>
<dd>
<p>
Show help message.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--abi / -a [filename]</code>
</span>
</dt>
<dd>
<p>
Input abi file description location.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--force / -f</code>
</span>
</dt>
<dd>
<p>
Force conversion even if the input and output abi are identical.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--output / -o [filename]</code>
</span>
</dt>
<dd>
<p>
Specify the output filename. If the output file already exists, it is
not overwritten; the converted data are accumulated into it. Sample filenames carry
information used by the post-profile tools and must be kept identical: in other words, the pathname
from the first path component containing a '{' must be kept as it is in the
output filename.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--verbose / -V</code>
</span>
</dt>
<dd>
<p>
Give verbose debugging output.
</p>
</dd>
<dt>
<span class="term">
<code class="option">--version / -v</code>
</span>
</dt>
<dd>
<p>
Show version.
</p>
</dd>
</dl>
</div>
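<p>
The naming rule for <code class="option">--output</code> can be sketched in a few lines of Python.
This is an illustration, not something shipped with OProfile: the helper
<code class="function">output_sample_path</code> and the example paths are hypothetical.
The sketch keeps everything from the first path component containing a '{' and
places it under a new output root.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
def output_sample_path(input_path, output_root):
    """Build an output name that preserves the informative part of the path."""
    parts = input_path.split("/")
    for index, part in enumerate(parts):
        if "{" in part:
            # Keep this component and everything after it unchanged.
            return "/".join([output_root.rstrip("/")] + parts[index:])
    raise ValueError("no path component containing '{' in " + input_path)


# Hypothetical sample file path, for illustration only.
SOURCE = ("/var/lib/oprofile/samples/current/{root}/usr/bin/mprime/"
          "{dep}/{root}/usr/bin/mprime/GLOBAL_POWER_EVENTS.200000.1.all.all.all")
print(output_sample_path(SOURCE, "/tmp/current"))
```
</pre>
</td>
</tr>
</table>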
</div>
</div>
</div>
<div class="chapter" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="interpreting"></a>Chapter 5. Interpreting profiling results</h2>
</div>
</div>
</div>
<div class="toc">
<p>
<b>Table of Contents</b>
</p>
<dl>
<dt>
<span class="sect1">
<a href="#irq-latency">1. Profiling interrupt latency</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#kernel-profiling">2. Kernel profiling</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#irq-masking">2.1. Interrupt masking</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#idle">2.2. Idle time</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#kernel-modules">2.3. Profiling kernel modules</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#interpreting-callgraph">3. Interpreting call-graph profiles</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#debug-info">4. Inaccuracies in annotated source</a>
</span>
</dt>
<dd>
<dl>
<dt>
<span class="sect2">
<a href="#effect-of-optimizations">4.1. Side effects of optimizations</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#prologues">4.2. Prologues and epilogues</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#inlined-function">4.3. Inlined functions</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="#wrong-linenr-info">4.4. Inaccuracy in line number information</a>
</span>
</dt>
</dl>
</dd>
<dt>
<span class="sect1">
<a href="#symbol-without-debug-info">5. Assembly functions</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#overlapping-symbols">6. Overlapping symbols in JITed code</a>
</span>
</dt>
<dt>
<span class="sect1">
<a href="#hidden-cost">7. Other discrepancies</a>
</span>
</dt>
</dl>
</div>
<p>
The standard caveats of profiling apply in interpreting the results from OProfile:
profile realistic situations, profile different scenarios, profile
for as long a time as possible, avoid system-specific artifacts, and don't trust
the profile data too much. Also bear in mind the comments on the performance
counters above - you <span class="emphasis"><em>cannot</em></span> rely on totally accurate
instruction-level profiling. However, for almost all circumstances the data
can be useful. Ideally a utility such as Intel's VTUNE would be available to
allow careful instruction-level analysis; go hassle Intel for this, not me ;)
</p>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="irq-latency"></a>1. Profiling interrupt latency</h2>
</div>
</div>
</div>
<p>
This is an example of how the latency of delivery of profiling interrupts
can impact the reliability of the profiling data. This is pretty much a
worst-case-scenario example: these problems are fairly rare.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
double fun(double a, double b, double c)
{
double result = 0;
for (int i = 0 ; i &lt; 10000; ++i) {
result += a;
result *= b;
result /= c;
}
return result;
}
</pre>
</td>
</tr>
</table>
<p>
Here the last instruction of the loop is very costly, and you would expect the result
to reflect that. However (showing only the instructions inside the loop):
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opannotate -a -t 10 ./a.out
88 15.38% : 8048337: fadd %st(3),%st
48 8.391% : 8048339: fmul %st(2),%st
68 11.88% : 804833b: fdiv %st(1),%st
368 64.33% : 804833d: inc %eax
: 804833e: cmp $0x270f,%eax
: 8048343: jle 8048337
</pre>
</td>
</tr>
</table>
<p>
The problem comes from the x86 hardware: when the counter overflows the IRQ
is asserted, but the hardware has features that can delay the NMI interrupt.
x86 hardware is synchronous (i.e. it cannot interrupt during an instruction),
there is a latency when the IRQ is asserted, and the multiple
execution units and the out-of-order model of modern x86 CPUs also cause
problems. This is the same function, with source annotation:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
$ opannotate -s -t 10 ./a.out
:double fun(double a, double b, double c)
:{ /* _Z3funddd total: 572 100.0% */
: double result = 0;
368 64.33% : for (int i = 0 ; i &lt; 10000; ++i) {
88 15.38% : result += a;
48 8.391% : result *= b;
68 11.88% : result /= c;
: }
: return result;
:}
</pre>
</td>
</tr>
</table>
<p>
The conclusion: don't trust samples coming at the end of a loop,
particularly if the last instruction generated by the compiler is costly. This
case can also occur for branches. Always bear in mind that a sample
can be delayed by a few cycles from its real position. That's a hardware
problem and OProfile can do nothing about it.
</p>
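<p>
The effect can be illustrated with a deliberately simple toy model in Python. It is not a
description of real hardware: the cycle costs are invented, and the model assumes only
that an overflow occurring while an instruction executes is reported at the following
instruction, since the interrupt cannot be taken mid-instruction.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Hypothetical per-instruction costs in cycles; these are not measurements.
LOOP = [("fadd", 3), ("fmul", 5), ("fdiv", 30), ("inc", 1), ("cmp", 1), ("jle", 1)]


def attribute_to_next(loop):
    """Charge every cycle of an instruction to the instruction that follows it."""
    hits = {}
    for index, (_name, cost) in enumerate(loop):
        # The loop wraps around: cycles spent in jle are charged to fadd.
        victim = loop[(index + 1) % len(loop)][0]
        hits[victim] = hits.get(victim, 0) + cost
    return hits


print(attribute_to_next(LOOP))
```
</pre>
</td>
</tr>
</table>
<p>
In this model the cheap <code class="code">inc</code> receives all the cycles spent in the costly
<code class="code">fdiv</code>, which mirrors the shape of the annotated output above.
</p>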
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="kernel-profiling"></a>2. Kernel profiling</h2>
</div>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="irq-masking"></a>2.1. Interrupt masking</h3>
</div>
</div>
</div>
<p>
OProfile uses non-maskable interrupts (NMI) on the P6 generation, Pentium 4,
Athlon, Opteron, Phenom, and Turion processors. These interrupts can occur even in sections of the
Linux kernel where interrupts are disabled, allowing collection of samples in virtually
all executable code. The RTC, timer interrupt mode, and Itanium 2 collection mechanisms
use maskable interrupts. Thus, the RTC and Itanium 2 data collection mechanisms have "sample
shadows", or blind spots: regions where no samples will be collected. Typically, the samples
will be attributed to the code immediately after the interrupts are re-enabled.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="idle"></a>2.2. Idle time</h3>
</div>
</div>
</div>
<p>
Your kernel is likely to support halting the processor when a CPU is idle. As
the typical hardware events like <code class="constant">CPU_CLK_UNHALTED</code> do not
count when the CPU is halted, the kernel profile will not reflect the actual
amount of time spent idle. You can change this behaviour by booting with
the <code class="option">idle=poll</code> option, which uses a different idle routine. This
will appear as <code class="function">poll_idle()</code> in your kernel profile.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="kernel-modules"></a>2.3. Profiling kernel modules</h3>
</div>
</div>
</div>
<p>
OProfile profiles kernel modules by default. However, there are a couple of problems
you may have when trying to get results. First, you may have booted via an initrd;
this means that the actual path for the module binaries cannot be determined automatically.
To get around this, you can use the <code class="option">-p</code> option to the profiling tools
to specify where to look for the kernel modules.
</p>
<p>
In 2.6 kernels, the information on where kernel module binaries are located has been removed.
This means OProfile needs guiding with the <code class="option">-p</code> option to find your
modules. Normally, you can just use your standard module top-level directory for this.
Note that due to this problem, OProfile cannot check that the modification times match;
it is your responsibility to make sure you do not modify a binary after a profile
has been created.
</p>
<p>
If you have run <span><strong class="command">insmod</strong></span> or <span><strong class="command">modprobe</strong></span> to insert a module
in a particular directory, it is important that you specify this directory with the
<code class="option">-p</code> option first, so that it overrides an older module binary that might
exist in other directories you've specified with <code class="option">-p</code>. It is up to you
to make sure that these values are correct: 2.6 kernels simply do not provide enough
information for OProfile to work this out itself.
</p>
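<p>
Why the order of the <code class="option">-p</code> paths matters can be sketched as follows.
This Python fragment is an illustration under stated assumptions, not OProfile's actual
lookup code: it assumes each directory is searched recursively and that the first match
in the order given wins. The helper name <code class="function">find_module</code> is hypothetical.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
import os


def find_module(module_file, image_path):
    """Return the first file named module_file found under the listed directories."""
    # image_path mirrors the comma-separated value given to -p.
    for directory in image_path.split(","):
        for root, _dirs, files in os.walk(directory):
            if module_file in files:
                # Earlier directories shadow later ones.
                return os.path.join(root, module_file)
    return None
```
</pre>
</td>
</tr>
</table>
<p>
Under these assumptions, listing the directory of a freshly inserted module first
makes it shadow an older copy of the same module in the standard module tree.
</p>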
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="interpreting-callgraph"></a>3. Interpreting call-graph profiles</h2>
</div>
</div>
</div>
<p>
Sometimes the results from call-graph profiles may be different from what
you expect to see. The first thing to check is whether the target
binaries were compiled with frame pointers enabled (if the binary was
compiled using <span><strong class="command">gcc</strong></span>'s
<code class="option">-fomit-frame-pointer</code> option, you will not get
meaningful results). Note that as of this writing, the GCC developers
plan to disable frame pointers by default. The Linux kernel is built
without frame pointers by default; there is a configuration option you
can use to turn them on under the "Kernel Hacking" menu.
</p>
<p>
Often you may see a caller of a function that does not actually directly
call the function you're looking at (e.g. if <code class="function">a()</code>
calls <code class="function">b()</code>, which in turn calls
<code class="function">c()</code>, you may see an entry for
<code class="function">a()-&gt;c()</code>). What's actually occurring is that we
are taking samples at the very start (or the very end) of
<code class="function">c()</code>; at these few instructions, we haven't yet
created the new function's frame, so it appears as if
<code class="function">a()</code> is calling directly into
<code class="function">c()</code>. Be careful not to be misled by these
entries.
</p>
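<p>
The missing caller can be reproduced with a toy frame-pointer walk in Python. The data
layout is invented for illustration: each frame records the function its return address
points into and the caller's frame. While <code class="function">c()</code> is still in its
prologue, the frame pointer register refers to the frame of <code class="function">b()</code>,
so the walk starts one frame too late.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# frame id: (function the return address points into, caller's frame id)
FRAMES = {"a": ("main", None), "b": ("a", "a"), "c": ("b", "b")}


def backtrace(pc_function, frame_pointer, frames):
    """Walk the chain of saved frame pointers, starting from the sampled PC."""
    trace = [pc_function]
    while frame_pointer is not None:
        return_function, frame_pointer = frames[frame_pointer]
        trace.append(return_function)
    return trace


# c() has set up its own frame: the full chain is visible.
print(backtrace("c", "c", FRAMES))
# Sample taken in the prologue of c(): the frame pointer is still that of b().
print(backtrace("c", "b", FRAMES))
```
</pre>
</td>
</tr>
</table>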
<p>
Like the rest of OProfile, call-graph profiling uses a statistical
approach; this means that sometimes a backtrace sample is truncated, or
even partially wrong. Bear this in mind when examining results.
</p>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="debug-info"></a>4. Inaccuracies in annotated source</h2>
</div>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="effect-of-optimizations"></a>4.1. Side effects of optimizations</h3>
</div>
</div>
</div>
<p>
The compiler can introduce some pitfalls in the annotated source output.
The optimizer can move pieces of code in such a manner that two lines of code
are interleaved (instruction scheduling). The debug info generated by the compiler
can also show strange behavior. This is especially true for complex expressions, e.g. inside
an if statement:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
if (a &amp;&amp; ..
b &amp;&amp; ..
c &amp;&amp;)
</pre>
</td>
</tr>
</table>
<p>
Here the problem comes from the position of the line number. The available debug
info does not give enough detail for the if condition, so all samples are
accumulated at the position of the closing parenthesis of the expression. Using
<span><strong class="command">opannotate <code class="option">-a</code></strong></span> can help to show the real
samples at an assembly level.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="prologues"></a>4.2. Prologues and epilogues</h3>
</div>
</div>
</div>
<p>
The compiler generally needs to generate "glue" code across function calls, dependent
on the particular function call conventions used. Additionally other things
need to happen, like stack pointer adjustment for the local variables; this
code is known as the function prologue. Similar code is needed at function return,
and is known as the function epilogue. This will show up in annotations as
samples at the very start and end of a function, where there is no apparent
executable code in the source.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="inlined-function"></a>4.3. Inlined functions</h3>
</div>
</div>
</div>
<p>
You may see that a function is credited with a certain number of samples, but
the annotated lines in the listing do not add up to that total. To pick a real example:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
:internal_sk_buff_alloc_security(struct sk_buff *skb)
353 2.342% :{ /* internal_sk_buff_alloc_security total: 1882 12.48% */
:
: sk_buff_security_t *sksec;
15 0.0995% : int rc = 0;
:
10 0.06633% : sksec = skb-&gt;lsm_security;
468 3.104% : if (sksec &amp;&amp; sksec-&gt;magic == DSI_MAGIC) {
: goto out;
: }
:
: sksec = (sk_buff_security_t *) get_sk_buff_memory(skb);
3 0.0199% : if (!sksec) {
38 0.2521% : rc = -ENOMEM;
: goto out;
10 0.06633% : }
: memset(sksec, 0, sizeof (sk_buff_security_t));
44 0.2919% : sksec-&gt;magic = DSI_MAGIC;
32 0.2123% : sksec-&gt;skb = skb;
45 0.2985% : sksec-&gt;sid = DSI_SID_NORMAL;
31 0.2056% : skb-&gt;lsm_security = sksec;
:
: out:
:
146 0.9685% : return rc;
:
98 0.6501% :}
</pre>
</td>
</tr>
</table>
<p>
Here, the function is credited with 1,882 samples, but the annotations
below do not account for this. This is usually because of inline functions -
the compiler marks such code with debug entries for the inline function
definition, and this is where <span><strong class="command">opannotate</strong></span> annotates
such samples. In the case above, <code class="function">memset</code> is the most
likely candidate for this problem. Examining the mixed source/assembly
output can help identify such results.
</p>
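<p>
The size of the gap is easy to check by summing the per-line counts from the listing
above. A short Python sketch of the arithmetic:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
# Total credited to internal_sk_buff_alloc_security in the listing above.
FUNCTION_TOTAL = 1882
# Per-line sample counts, copied from the same listing, top to bottom.
LINE_SAMPLES = [353, 15, 10, 468, 3, 38, 10, 44, 32, 45, 31, 146, 98]

annotated = sum(LINE_SAMPLES)
unaccounted = FUNCTION_TOTAL - annotated
print(annotated, unaccounted)
```
</pre>
</td>
</tr>
</table>
<p>
The annotated lines account for 1293 samples, leaving 589 (roughly 31% of the
function total) that are annotated elsewhere.
</p>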
<p>
This problem is more visible when there is no source file available. In the
following example the total for the file (109 samples) is plainly less than the
sum of the per-symbol totals (125 samples). The difference is accounted
for by inline functions.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
/*
* Total samples for file : "arch/i386/kernel/process.c"
*
* 109 2.4616
*/
/* default_idle total: 84 1.8970 */
/* cpu_idle total: 21 0.4743 */
/* flush_thread total: 1 0.0226 */
/* prepare_to_copy total: 1 0.0226 */
/* __switch_to total: 18 0.4065 */
</pre>
</td>
</tr>
</table>
<p>
The missing samples are not lost; they are credited to the source
location where the inlined function is defined. Samples for an inlined function are
gathered from all of its call sites and merged in one place in the annotated
source file, so there is no way to see which call site the
samples for an inlined function come from.
</p>
<p>
When running <span><strong class="command">opannotate</strong></span>, you may get a warning
"some functions compiled without debug information may have incorrect source line attributions".
In some rare cases, OProfile is not able to verify that the derived source line
is correct (when some parts of the binary image are compiled without debugging
information). Be wary of results if this warning appears.
</p>
<p>
Furthermore, for some languages the compiler can implicitly generate functions,
such as default copy constructors. Such functions are labelled by the compiler
as having a line number of 0, which means the source annotation can be confusing.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="wrong-linenr-info"></a>4.4. Inaccuracy in line number information</h3>
</div>
</div>
</div>
<p>
Depending on your compiler, you may run into the following problem:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
struct big_object { int a[500]; };
int main()
{
big_object a, b;
for (int i = 0 ; i != 1000 * 1000; ++i)
b = a;
return 0;
}
</pre>
</td>
</tr>
</table>
<p>
Compiled with <span><strong class="command">gcc</strong></span> 3.0.4 the annotated source is clearly inaccurate:
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
:int main()
:{ /* main total: 7871 100% */
: big_object a, b;
: for (int i = 0 ; i != 1000 * 1000; ++i)
: b = a;
7871 100% : return 0;
:}
</pre>
</td>
</tr>
</table>
<p>
The problem here is distinct from the IRQ latency problem: the debug line number
information is not precise enough. Again, looking at the output of <span><strong class="command">opannotate -as</strong></span> can help.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
:int main()
:{
: big_object a, b;
: for (int i = 0 ; i != 1000 * 1000; ++i)
: 80484c0: push %ebp
: 80484c1: mov %esp,%ebp
: 80484c3: sub $0xfac,%esp
: 80484c9: push %edi
: 80484ca: push %esi
: 80484cb: push %ebx
: b = a;
: 80484cc: lea 0xfffff060(%ebp),%edx
: 80484d2: lea 0xfffff830(%ebp),%eax
: 80484d8: mov $0xf423f,%ebx
: 80484dd: lea 0x0(%esi),%esi
: return 0;
3 0.03811% : 80484e0: mov %edx,%edi
: 80484e2: mov %eax,%esi
1 0.0127% : 80484e4: cld
8 0.1016% : 80484e5: mov $0x1f4,%ecx
7850 99.73% : 80484ea: repz movsl %ds:(%esi),%es:(%edi)
9 0.1143% : 80484ec: dec %ebx
: 80484ed: jns 80484e0
: 80484ef: xor %eax,%eax
: 80484f1: pop %ebx
: 80484f2: pop %esi
: 80484f3: pop %edi
: 80484f4: leave
: 80484f5: ret
</pre>
</td>
</tr>
</table>
<p>
So here it's clear that the copy is correctly credited with nearly all of the samples, but the
line number information is misplaced. <span><strong class="command">objdump -dS</strong></span> exposes the
same problem. Note that maintaining accurate debug information for compilers when optimizing is difficult, so this problem is not surprising.
The problem of debug information
accuracy is also dependent on the binutils version used; some BFD library versions
contain a work-around for known problems of <span><strong class="command">gcc</strong></span>, some others do not. This is unfortunate but we must live with that,
since profiling is pointless when you disable optimisation (which would give better debugging entries).
</p>
</div>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="symbol-without-debug-info"></a>5. Assembly functions</h2>
</div>
</div>
</div>
<p>
Often the assembler cannot generate debug information automatically.
This means that you cannot get a source report unless
you manually define the necessary debug information; read your assembler documentation for how you might
do that. The only
debugging info currently needed by OProfile is the line-number/filename-VMA association. When profiling assembly
without debugging info you can always get a report for symbols, and optionally for VMA, through <span><strong class="command">opreport -l</strong></span>
or <span><strong class="command">opreport -d</strong></span>, but this works only for symbols with the right attributes.
For <span><strong class="command">gas</strong></span> you can get this by
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
.globl foo
.type foo,@function
</pre>
</td>
</tr>
</table>
<p>
whilst for <span><strong class="command">nasm</strong></span> you must use
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
GLOBAL foo:function ; [1]
</pre>
</td>
</tr>
</table>
<p>
Note that OProfile does not need the global attribute, only the function attribute.
</p>
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="overlapping-symbols"></a>6. Overlapping symbols in JITed code</h2>
</div>
</div>
</div>
<p>
Some virtual machines (e.g., Java) may re-JIT a method, so that space previously
allocated for a piece of compiled code is reused. This means that, at one distinct
code address, multiple symbols/methods may be present during the run time of the application.
</p>
<p>
Since OProfile samples are buffered and don't have timing information, there is no way
to correlate samples with the (possibly) varying address ranges in which the code for a symbol
may reside.
An alternative would be flushing the OProfile sampling buffer when we get an unload event,
but this could result in high overhead.
</p>
<p>
To moderate the problem of overlapping symbols, OProfile tries to select the symbol that was
present at this address range most of the time. Additionally, other overlapping symbols
are truncated in the overlapping area.
This gives reasonable results, because in reality, address reuse typically takes place
during phase changes of the application -- in particular, during application startup.
Thus, for optimum profiling results, start the sampling session after application startup
and burn in.
</p>
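<p>
The selection and truncation described above can be sketched as follows. This Python
fragment is a simplified illustration, not the code OProfile uses: the helper
<code class="function">resolve_overlap</code> and the record layout are hypothetical, all given
symbols are assumed to overlap one another, and a symbol that sticks out on both sides
of the winner keeps only its leading part.
</p>
<table xmlns="" border="0" style="background: #E0E0E0;" width="90%">
<tr>
<td>
<pre class="screen">
```python
def resolve_overlap(symbols):
    """Keep the longest-lived symbol whole and truncate the others around it.

    Each symbol is a dict with name, start, end (exclusive) and lifetime.
    Returns (name, start, end, lifetime share in percent) tuples.
    """
    total = sum(s["lifetime"] for s in symbols)
    winner = max(symbols, key=lambda s: s["lifetime"])
    resolved = []
    for s in symbols:
        share = round(100 * s["lifetime"] / total)
        if s is winner:
            resolved.append((s["name"], s["start"], s["end"], share))
        elif winner["start"] > s["start"]:
            # Keep the part that lies before the winner.
            end = min(s["end"], winner["start"])
            resolved.append((s["name"], s["start"], end, share))
        elif s["end"] > winner["end"]:
            # Keep the part that lies after the winner.
            start = max(s["start"], winner["end"])
            resolved.append((s["name"], start, s["end"], share))
        # Otherwise the symbol is completely covered and is dropped.
    return resolved


# Invented address ranges and lifetimes, for illustration only.
EXAMPLE = [
    {"name": "void test.f1()", "start": 0x1000, "end": 0x1400, "lifetime": 90},
    {"name": "void test.f4()", "start": 0x1200, "end": 0x1800, "lifetime": 10},
]
print(resolve_overlap(EXAMPLE))
```
</pre>
</td>
</tr>
</table>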
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="hidden-cost"></a>7. Other discrepancies</h2>
</div>
</div>
</div>
<p>
Another cause of apparent problems is the hidden cost of instructions. A very
common example is two memory reads: one from L1 cache and the other from memory:
the second memory read is likely to have more samples.
There are many other causes of hidden cost of instructions. A non-exhaustive
list: mis-predicted branch, TLB cache miss, partial register stall,
partial register dependencies, memory mismatch stall, re-executed µops. If you want to write
programs at the assembly level, be sure to take a look at the Intel and
AMD documentation at <a href="http://developer.intel.com/">http://developer.intel.com/</a>
and <a href="http://developer.amd.com/devguides.jsp/">http://developer.amd.com/devguides.jsp</a>.
</p>
</div>
</div>
<div class="chapter" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a id="ack"></a>Chapter 6. Acknowledgments</h2>
</div>
</div>
</div>
<p>
Thanks to (in no particular order) : Arjan van de Ven, Rik van Riel, Juan Quintela, Philippe Elie,
Phillipp Rumpf, Tigran Aivazian, Alex Brown, Alisdair Rawsthorne, Bob Montgomery, Ray Bryant, H.J. Lu,
Jeff Esper, Will Cohen, Graydon Hoare, Cliff Woolley, Alex Tsariounov, Al Stone, Jason Yeh,
Randolph Chung, Anton Blanchard, Richard Henderson, Andries Brouwer, Bryan Rittmeyer,
Maynard P. Johnson,
Richard Reich (rreich@rdrtech.com), Zwane Mwaikambo, Dave Jones, Charles Filtness; and finally Pulp, for "Intro".
</p>
</div>
</div>
</body>
</html>