| <?xml version="1.0" encoding='ISO-8859-1'?> |
| <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"> |
| |
| <book id="oprofile-internals"> |
| <bookinfo> |
| <title>OProfile Internals</title> |
| |
| <authorgroup> |
| <author> |
| <firstname>John</firstname> |
| <surname>Levon</surname> |
| <affiliation> |
| <address><email>levon@movementarian.org</email></address> |
| </affiliation> |
| </author> |
| </authorgroup> |
| |
| <copyright> |
| <year>2003</year> |
| <holder>John Levon</holder> |
| </copyright> |
| </bookinfo> |
| |
| <toc></toc> |
| |
| <chapter id="introduction"> |
| <title>Introduction</title> |
| |
| <para> |
| This document is current for OProfile version <oprofileversion />. |
| It provides some details on the internal workings of OProfile for the |
| interested hacker, and assumes a strong knowledge of C, a working knowledge of |
| C++, and some knowledge of kernel internals and CPU hardware. |
| </para> |
| <note> |
| <para> |
| Only the "new" implementation associated with kernel 2.6 and above is covered here. 2.4 |
| uses a very different kernel module implementation and daemon to produce the sample files. |
| </para> |
| </note> |
| |
| <sect1 id="overview"> |
| <title>Overview</title> |
| <para> |
| OProfile is a statistical continuous profiler. In other words, profiles are generated by |
| regularly sampling the current registers on each CPU (from an interrupt handler, the |
| saved PC value at the time of interrupt is stored), and converting that runtime PC |
| value into something meaningful to the programmer. |
| </para> |
| <para> |
| OProfile achieves this by taking the stream of sampled PC values, along with the detail |
| of which task was running at the time of the interrupt, and converting it into a file offset |
| against a particular binary file. Because applications <function>mmap()</function> |
| the code they run (be it <filename>/bin/bash</filename>, <filename>/lib/libfoo.so</filename> |
| or whatever), it's possible to find the relevant binary file and offset by walking |
| the task's list of mapped memory areas. Each PC value is thus converted into a tuple |
| of binary-image,offset. This is something that the userspace tools can use directly |
| to reconstruct where the code came from, including the particular assembly instructions, |
| symbol, and source line (via the binary's debug information if present). |
| </para> |
| <para> |
| Regularly sampling the PC value like this approximates what actually was executed and |
| how often - more often than not, this statistical approximation is good enough to |
| reflect reality. In common operation, the time between each sample interrupt is regulated |
| by a fixed number of clock cycles. This implies that the results will reflect where |
| the CPU is spending the most time; this is obviously a very useful information source |
| for performance analysis. |
| </para> |
| <para> |
| Sometimes though, an application programmer needs different kinds of information: for example, |
| "which of the source routines cause the most cache misses ?". The rise in importance of |
| such metrics in recent years has led many CPU manufacturers to provide hardware performance |
| counters capable of measuring these events at the hardware level. Typically, these counters |
| increment once per event, and generate an interrupt on reaching some pre-defined |
| number of events. OProfile can use these interrupts to generate samples: then, the |
| profile results are a statistical approximation of which code caused how many of the |
| given event. |
| </para> |
| <para> |
| Consider a simplified system that only executes two functions A and B. A |
| takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at |
| 100 cycles a second, and we've set the performance counter to create an |
| interrupt after a set number of "events" (in this case an event is one |
| clock cycle). It should be clear that the chance of the interrupt |
| occurring in function A is 1/100, and 99/100 for function B. Thus, we |
| statistically approximate the actual relative performance features of |
| the two functions over time. This same analysis works for other types of |
| events, providing that the interrupt is tied to the number of events |
| occurring (that is, after N events, an interrupt is generated). |
| </para> |
| <para> |
| There is typically more than one of these counters, so it's possible to set up profiling |
| for several different event types. Using these counters gives us a powerful, low-overhead |
| way of gaining performance metrics. If OProfile, or the CPU, does not support performance |
| counters, then a simpler method is used: the kernel timer interrupt feeds samples |
| into OProfile itself. |
| </para> |
| <para> |
| The rest of this document concerns itself with how we get from receiving samples at |
| interrupt time to producing user-readable profile information. |
| </para> |
| </sect1> |
| |
| <sect1 id="components"> |
| <title>Components of the OProfile system</title> |
| |
| <sect2 id="arch-specific-components"> |
| <title>Architecture-specific components</title> |
| <para> |
| If OProfile supports the hardware performance counters found on |
| a particular architecture, the code for setting up and managing |
| these counters can be found in the kernel source |
| tree in the relevant <filename>arch/<emphasis>arch</emphasis>/oprofile/</filename> |
| directory. The architecture-specific implementation works by |
| filling in the <varname>oprofile_operations</varname> structure at init time. This |
| provides a set of operations such as <function>setup()</function>, |
| <function>start()</function>, <function>stop()</function>, etc. |
| that manage the hardware-specific details of fiddling with the |
| performance counter registers. |
| </para> |
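| <para> |
| As a rough illustration, an architecture init routine simply fills in the |
| operations structure with its callbacks, as in the sketch below. The field |
| names approximate the 2.6-era <filename>include/linux/oprofile.h</filename> and |
| may differ slightly between kernel versions; the <function>my_arch_*()</function> |
| functions are placeholders. |
| </para> |
| <screen> |
| /* Hedged sketch: how an architecture driver might register itself with |
|  * the OProfile core.  Field names approximate include/linux/oprofile.h |
|  * and may vary by kernel version; my_arch_*() are placeholders. */ |
| static int my_arch_create_files(struct super_block *sb, struct dentry *root); |
| static int my_arch_setup(void); |
| static int my_arch_start(void); |
| static void my_arch_stop(void); |
| static void my_arch_shutdown(void); |
| |
| int oprofile_arch_init(struct oprofile_operations *ops) |
| { |
|         ops-&gt;create_files = my_arch_create_files; /* adds oprofilefs entries */ |
|         ops-&gt;setup        = my_arch_setup;        /* program the counter registers */ |
|         ops-&gt;start        = my_arch_start;        /* enable the counters */ |
|         ops-&gt;stop         = my_arch_stop;         /* disable the counters */ |
|         ops-&gt;shutdown     = my_arch_shutdown;     /* release NMI/APIC resources */ |
|         ops-&gt;cpu_type     = "i386/ppro";          /* reported to userspace */ |
|         return 0; |
| } |
| </screen> |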
| <para> |
| The other important facility available to the architecture code is |
| <function>oprofile_add_sample()</function>. This is where a particular sample |
| taken at interrupt time is fed into the generic OProfile driver code. |
| </para> |
| </sect2> |
| |
| <sect2 id="filesystem"> |
| <title>oprofilefs</title> |
| <para> |
| OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from |
| userspace at <filename>/dev/oprofile</filename>. This consists of small |
| files for reporting and receiving configuration from userspace, as well |
| as the actual character device that the OProfile userspace receives samples |
| from. At <function>setup()</function> time, the architecture-specific code may |
| add further configuration files related to the details of the performance |
| counters. For example, on x86, one numbered directory for each hardware |
| performance counter is added, with files in each for the event type, |
| reset value, etc. |
| </para> |
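| <para> |
| For illustration, the fragment below sketches how such a per-counter directory |
| might be populated from a <function>create_files()</function> callback; it is |
| loosely modelled on the i386 NMI driver. The <function>oprofilefs_*</function> |
| helper names are quoted from memory of |
| <filename>drivers/oprofile/oprofilefs.c</filename> and should be checked against |
| the kernel version at hand; <varname>counter_config</varname> and |
| <varname>NUM_COUNTERS</varname> are assumed to come from the model-specific code. |
| </para> |
| <screen> |
| /* Hedged sketch, loosely modelled on the i386 driver: one numbered |
|  * directory per hardware counter, each containing small configuration |
|  * files backed directly by fields of the counter_config array. */ |
| static int my_create_files(struct super_block *sb, struct dentry *root) |
| { |
|         int i; |
|         for (i = 0; i &lt; NUM_COUNTERS; ++i) { |
|                 struct dentry *dir; |
|                 char buf[4]; |
|                 snprintf(buf, sizeof(buf), "%d", i); |
|                 dir = oprofilefs_mkdir(sb, root, buf); |
|                 oprofilefs_create_ulong(sb, dir, "enabled",   &amp;counter_config[i].enabled); |
|                 oprofilefs_create_ulong(sb, dir, "event",     &amp;counter_config[i].event); |
|                 oprofilefs_create_ulong(sb, dir, "count",     &amp;counter_config[i].count); |
|                 oprofilefs_create_ulong(sb, dir, "unit_mask", &amp;counter_config[i].unit_mask); |
|                 oprofilefs_create_ulong(sb, dir, "kernel",    &amp;counter_config[i].kernel); |
|                 oprofilefs_create_ulong(sb, dir, "user",      &amp;counter_config[i].user); |
|         } |
|         return 0; |
| } |
| </screen> |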
| <para> |
| The filesystem also contains a <filename>stats</filename> directory with |
| a number of useful counters for various OProfile events. |
| </para> |
| </sect2> |
| |
| <sect2 id="driver"> |
| <title>Generic kernel driver</title> |
| <para> |
| This lives in <filename>drivers/oprofile/</filename>, and forms the core of |
| how OProfile works in the kernel. Its job is to take samples delivered |
| from the architecture-specific code (via <function>oprofile_add_sample()</function>), |
| and buffer this data, in a transformed form as described later, until releasing |
| the data to the userspace daemon via the <filename>/dev/oprofile/buffer</filename> |
| character device. |
| </para> |
| </sect2> |
| |
| <sect2 id="daemon"> |
| <title>The OProfile daemon</title> |
| <para> |
| The OProfile userspace daemon's job is to take the raw data provided by the |
| kernel and write it to the disk. It takes the single data stream from the |
| kernel and logs sample data against a number of sample files (found in |
| <filename>$SESSION_DIR/samples/current/</filename>, by default located at |
| <filename>/var/lib/oprofile/samples/current/</filename>). For the benefit |
| of the "separate" functionality, the names/paths of these sample files |
| are mangled to reflect where the samples were from: this can include |
| thread IDs, the binary file path, the event type used, and more. |
| </para> |
| <para> |
| After this final step from interrupt to disk file, the data is now |
| persistent (that is, changes in the running of the system do not invalidate |
| stored data). So the post-profiling tools can run on this data at any |
| time (assuming the original binary files are still available and unchanged, |
| naturally). |
| </para> |
| </sect2> |
| |
| <sect2 id="post-profiling"> |
| <title>Post-profiling tools</title> |
| <para> |
| So far, we've collected data, but we've yet to present it in a useful form |
| to the user. This is the job of the post-profiling tools. In general form, |
| they collate a subset of the available sample files, load and process each one |
| correlated against the relevant binary file, and finally produce user-readable |
| information. |
| </para> |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |
| |
| <chapter id="performance-counters"> |
| <title>Performance counter management</title> |
| |
| <sect1 id="performance-counters-ui"> |
| <title>Providing a user interface</title> |
| |
| <para> |
| The performance counter registers need programming in order to set the |
| type of event to count, etc. OProfile uses a standard model across all |
| CPUs for defining these events as follows: |
| </para> |
| <informaltable frame="all"> |
| <tgroup cols='2'> |
| <tbody> |
| <row><entry><option>event</option></entry><entry>The event type (e.g. <constant>DATA_MEM_REFS</constant>)</entry></row> |
| <row><entry><option>unit mask</option></entry><entry>The sub-events to count (more detailed specification)</entry></row> |
| <row><entry><option>counter</option></entry><entry>The hardware counter(s) that can count this event</entry></row> |
| <row><entry><option>count</option></entry><entry>The reset value (how many events before an interrupt)</entry></row> |
| <row><entry><option>kernel</option></entry><entry>Whether the counter should increment when in kernel space</entry></row> |
| <row><entry><option>user</option></entry><entry>Whether the counter should increment when in user space</entry></row> |
| </tbody> |
| </tgroup> |
| </informaltable> |
| <para> |
| The term "unit mask" is borrowed from the Intel architectures, and can |
| further specify exactly when a counter is incremented (for example, |
| cache-related events can be restricted to particular state transitions |
| of the cache lines). |
| </para> |
| <para> |
| All of the available hardware events and their details are specified in |
| the textual files in the <filename>events</filename> directory. The |
| syntax of these files should be fairly obvious. The user specifies the |
| names and configuration details of the chosen counters via |
| <command>opcontrol</command>. These are then written to the kernel |
| module (in numerical form) via <filename>/dev/oprofile/N/</filename> |
| where N is the physical hardware counter (some events can only be used |
| on specific counters; OProfile hides these details from the user when |
| possible). On IA64, the perfmon-based interface behaves somewhat |
| differently, as described later. |
| </para> |
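| <para> |
| To make this concrete, here is a purely illustrative sketch of what the |
| userspace side effectively does for one counter: write the numeric |
| configuration into the per-counter oprofilefs files. In reality |
| <command>opcontrol</command> is a shell script and uses simple redirection; |
| the helper and the example values below are hypothetical. |
| </para> |
| <screen> |
| /* Illustrative only: write one numeric value into an oprofilefs |
|  * configuration file such as /dev/oprofile/0/event. */ |
| #include &lt;stdio.h&gt; |
| |
| static int write_counter_value(int counter, const char *name, unsigned long value) |
| { |
|         char path[128]; |
|         FILE *fp; |
| |
|         snprintf(path, sizeof(path), "/dev/oprofile/%d/%s", counter, name); |
|         fp = fopen(path, "w"); |
|         if (!fp) |
|                 return -1; |
|         fprintf(fp, "%lu\n", value); |
|         return fclose(fp); |
| } |
| |
| /* e.g. (hypothetical values): |
|  *   write_counter_value(0, "event", 0x43); |
|  *   write_counter_value(0, "count", 100000); |
|  *   write_counter_value(0, "enabled", 1); |
|  */ |
| </screen> |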
| |
| </sect1> |
| |
| <sect1 id="performance-counters-programming"> |
| <title>Programming the performance counter registers</title> |
| |
| <para> |
| We have described how the user interface fills in the desired |
| configuration of the counters and transmits the information to the |
| kernel. It is the job of the <function>->setup()</function> method |
| to actually program the performance counter registers. Clearly, the |
| details of how this is done are architecture-specific; they are also |
| model-specific on many architectures. For example, i386 provides methods |
| for each model type that program the counter registers correctly |
| (see the <filename>op_model_*</filename> files in |
| <filename>arch/i386/oprofile</filename> for the details). The method |
| reads the values stored in the virtual oprofilefs files and programs |
| the registers appropriately, ready for starting the actual profiling |
| session. |
| </para> |
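| <para> |
| The overall shape of such a <function>setup()</function> routine is roughly as |
| follows. This is a deliberately simplified sketch: |
| <function>write_event_select()</function> and <function>write_counter()</function> |
| are placeholders standing in for the model-specific register accessors used in |
| the real <filename>op_model_*</filename> files. |
| </para> |
| <screen> |
| /* Highly simplified sketch of a model-specific setup routine. */ |
| static int my_setup_ctrs(void) |
| { |
|         int i; |
|         for (i = 0; i &lt; NUM_COUNTERS; ++i) { |
|                 if (!counter_config[i].enabled) |
|                         continue; |
|                 /* select the event, unit mask and kernel/user bits ... */ |
|                 write_event_select(i, counter_config[i].event, |
|                                    counter_config[i].unit_mask, |
|                                    counter_config[i].kernel, |
|                                    counter_config[i].user); |
|                 /* ... and load the counter with -count, so that it overflows |
|                  * (and raises an interrupt) after "count" further events */ |
|                 write_counter(i, -(long)counter_config[i].count); |
|         } |
|         return 0; |
| } |
| </screen> |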
| <para> |
| The architecture-specific drivers make sure to save the old register |
| settings before doing OProfile setup. They are restored when OProfile |
| shuts down. This is useful, for example, on i386, where the NMI watchdog |
| uses the same performance counter registers as OProfile; they cannot |
| run concurrently, but OProfile makes sure to restore the setup it found |
| before it was running. |
| </para> |
| <para> |
| In addition to programming the counter registers themselves, other setup |
| is often necessary. For example, on i386, the local APIC needs |
| programming in order to make the counter's overflow interrupt appear as |
| an NMI (non-maskable interrupt). This allows sampling (and therefore |
| profiling) of regions where "normal" interrupts are masked, enabling |
| more reliable profiles. |
| </para> |
| |
| <sect2 id="performance-counters-start"> |
| <title>Starting and stopping the counters</title> |
| <para> |
| Initiating a profiling session is done via writing an ASCII '1' |
| to the file <filename>/dev/oprofile/enable</filename>. This sets up the |
| core, and calls into the architecture-specific driver to actually |
| enable each configured counter. Again, the details of how this is |
| done are model-specific (for example, the Athlon models can disable |
| or enable on a per-counter basis, unlike the PPro models). |
| </para> |
| </sect2> |
| |
| <sect2> |
| <title>IA64 and perfmon</title> |
| <para> |
| The IA64 architecture provides a different interface from the other |
| architectures, using the existing perfmon driver. Register programming |
| is handled entirely in user-space (see |
| <filename>daemon/opd_perfmon.c</filename> for the details). A process |
| is forked for each CPU, which creates a perfmon context and sets the |
| counter registers appropriately via the |
| <function>sys_perfmonctl</function> interface. In addition, the actual |
| initiation and termination of the profiling session is handled via the |
| same interface using <constant>PFM_START</constant> and |
| <constant>PFM_STOP</constant>. On IA64, then, there are no oprofilefs |
| files for the performance counters, as the kernel driver does not |
| program the registers itself. |
| </para> |
| <para> |
| Instead, the perfmon driver for OProfile simply registers with the |
| OProfile core with an OProfile-specific UUID. During a profiling |
| session, the perfmon core calls into the OProfile perfmon driver and |
| samples are registered with the OProfile core itself as usual (with |
| <function>oprofile_add_sample()</function>). |
| </para> |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |
| |
| <chapter id="collecting-samples"> |
| <title>Collecting and processing samples</title> |
| |
| <sect1 id="receiving-interrupts"> |
| <title>Receiving interrupts</title> |
| <para> |
| Naturally, how the overflow interrupts are received is specific |
| to the hardware architecture, unless we are in "timer" mode, where the |
| logging routine is called directly from the standard kernel timer |
| interrupt handler. |
| </para> |
| <para> |
| On the i386 architecture, the local APIC is programmed such that when a |
| counter overflows (that is, it receives an event that causes an integer |
| overflow of the register value to zero), an NMI is generated. This calls |
| into the general handler <function>do_nmi()</function>; because OProfile |
| has registered itself as capable of handling NMI interrupts, this will |
| call into the OProfile driver code in |
| <filename>arch/i386/oprofile</filename>. Here, the saved PC value (the |
| CPU saves the register set at the time of the interrupt on the stack, |
| available for inspection) is extracted, and the counters are examined to |
| find out which one generated the interrupt. Also determined is whether |
| the system was inside kernel or user space at the time of the interrupt. |
| These three pieces of information are then forwarded onto the OProfile |
| core via <function>oprofile_add_sample()</function>. Finally, the |
| counter values are reset to the chosen count value, to ensure another |
| interrupt happens after another N events have occurred. Other |
| architectures behave in a similar manner. |
| </para> |
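| <para> |
| Schematically, the per-model overflow check looks like the sketch below. The |
| exact <function>oprofile_add_sample()</function> signature has varied between |
| kernel versions, and <function>counter_overflowed()</function> and |
| <function>write_counter()</function> are placeholders for the model-specific |
| register accessors. |
| </para> |
| <screen> |
| /* Schematic overflow handler, in the style of the "check_ctrs" routines |
|  * in the op_model_* files, called from the NMI path. */ |
| static int my_check_ctrs(struct pt_regs * const regs) |
| { |
|         int i; |
|         for (i = 0; i &lt; NUM_COUNTERS; ++i) { |
|                 if (!counter_overflowed(i)) |
|                         continue; |
|                 /* hand the saved PC (in regs) and the counter number on |
|                  * to the generic driver code */ |
|                 oprofile_add_sample(regs, i); |
|                 /* re-arm: reload with -count so another interrupt arrives |
|                  * after the next "count" events */ |
|                 write_counter(i, -(long)counter_config[i].count); |
|         } |
|         return 1; /* the NMI was ours */ |
| } |
| </screen> |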
| </sect1> |
| |
| <sect1 id="core-structure"> |
| <title>Core data structures</title> |
| <para> |
| Before considering what happens when we log a sample, we shall digress |
| for a moment and look at the general structure of the data collection |
| system. |
| </para> |
| <para> |
| OProfile maintains a small buffer for storing the logged samples for |
| each CPU on the system. Only this buffer is altered when we actually log |
| a sample (remember, we may still be in an NMI context, so no locking is |
| possible). The buffer is managed by a two-handed system; the "head" |
| iterator dictates where the next sample data should be placed in the |
| buffer. Of course, overflow of the buffer is possible, in which case |
| the sample is discarded. |
| </para> |
| <para> |
| It is critical to remember that at this point, the PC value is an |
| absolute value, and is therefore only meaningful in the context of which |
| task it was logged against. Thus, these per-CPU buffers also maintain |
| details of which task each logged sample is for, as described in the |
| next section. In addition, we store whether the sample was in kernel |
| space or user space (on some architectures and configurations, the address |
| space is not sub-divided neatly at a specific PC value, so we must store |
| this information). |
| </para> |
| <para> |
| As well as these small per-CPU buffers, we have a considerably larger |
| single buffer. This holds the data that is eventually copied out into |
| the OProfile daemon. On certain system events, the per-CPU buffers are |
| processed and entered (in mutated form) into the main buffer, known in |
| the source as the "event buffer". The "tail" iterator indicates the |
| point from which the CPU buffer may be read, up to the position of the "head" |
| iterator. This provides an entirely lock-free method for extracting data |
| from the CPU buffers. This process is described in detail later in this chapter. |
| </para> |
| <figure><title>The OProfile buffers</title> |
| <graphic fileref="buffers.png" /> |
| </figure> |
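| <para> |
| In outline, each per-CPU buffer is a single-producer, single-consumer ring. |
| The sketch below shows its approximate shape; the field names follow |
| <filename>drivers/oprofile/cpu_buffer.h</filename> from memory and may differ |
| between kernel versions. |
| </para> |
| <screen> |
| /* Approximate shape of the per-CPU buffer. */ |
| struct op_sample { |
|         unsigned long eip;      /* PC value, or a reserved marker value */ |
|         unsigned long event;    /* counter number, or marker payload */ |
| }; |
| |
| struct oprofile_cpu_buffer { |
|         unsigned long head_pos;             /* advanced only at interrupt time */ |
|         unsigned long tail_pos;             /* advanced only by the sync code */ |
|         unsigned long buffer_size; |
|         struct task_struct *last_task;      /* for detecting task switches */ |
|         int last_is_kernel;                 /* for detecting kernel/user switches */ |
|         struct op_sample *buffer; |
|         unsigned long sample_lost_overflow; /* dropped because the ring was full */ |
| }; |
| </screen> |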
| </sect1> |
| |
| <sect1 id="logging-sample"> |
| <title>Logging a sample</title> |
| <para> |
| As mentioned, the sample is logged into the buffer specific to the |
| current CPU. The CPU buffer is a simple array of pairs of unsigned long |
| values; for a sample, they hold the PC value and the counter for the |
| sample. (The counter number is later used to translate back into the |
| event type that the counter was programmed to count.) |
| </para> |
| <para> |
| In addition to logging the sample itself, we also log task switches. |
| This is simply done by storing the address of the last task to log a |
| sample on that CPU in a data structure, and writing a task switch entry |
| into the buffer if the new value of <function>current()</function> has |
| changed. Note that later we will directly de-reference this pointer; |
| this imposes certain restrictions on when and how the CPU buffers need |
| to be processed. |
| </para> |
| <para> |
| Finally, as mentioned, we log whether we have changed between kernel and |
| userspace using a similar method. Both of these variables |
| (<varname>last_task</varname> and <varname>last_is_kernel</varname>) are |
| reset when the CPU buffer is read. |
| </para> |
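| <para> |
| Putting the last two paragraphs together, the logging path looks roughly like |
| the following sketch. The exact encoding of context-switch entries differs |
| between kernel versions, so <function>add_switch_entry()</function> and |
| <function>add_sample()</function> are schematic helpers here. |
| </para> |
| <screen> |
| /* Schematic version of the CPU-buffer logging path (cf. cpu_buffer.c). */ |
| static void log_sample(struct oprofile_cpu_buffer *cpu_buf, unsigned long pc, |
|                        int is_kernel, unsigned long event) |
| { |
|         struct task_struct *task = current; |
| |
|         if (is_kernel != cpu_buf-&gt;last_is_kernel) { |
|                 cpu_buf-&gt;last_is_kernel = is_kernel; |
|                 add_switch_entry(cpu_buf, is_kernel); |
|         } |
|         if (task != cpu_buf-&gt;last_task) { |
|                 cpu_buf-&gt;last_task = task; |
|                 /* note: only the pointer is stored; the task structure is |
|                  * dereferenced later, at buffer sync time */ |
|                 add_switch_entry(cpu_buf, (unsigned long)task); |
|         } |
|         add_sample(cpu_buf, pc, event); /* discarded if the ring is full */ |
| } |
| </screen> |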
| </sect1> |
| |
| <sect1 id="logging-stack"> |
| <title>Logging stack traces</title> |
| <para> |
| OProfile can also provide statistical samples of call chains (on x86). To |
| do this, at sample time, the frame pointer chain is traversed, recording |
| the return address for each stack frame. This will only work if the code |
| was compiled with frame pointers, but we're careful to abort the |
| traversal if the frame pointer appears bad. We store the set of return |
| addresses straight into the CPU buffer. Note that, since this traversal |
| is keyed off the standard sample interrupt, the number of times a |
| function appears in a stack trace is not an indicator of how many times |
| the call site was executed: rather, it's related to the number of |
| samples we took where that call site was involved. Thus, the results for |
| stack traces are not necessarily proportional to the call counts: |
| typical programs will have many <function>main()</function> samples. |
| </para> |
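| <para> |
| The traversal itself is a simple walk of saved frame pointers, as sketched |
| below (cf. <filename>arch/i386/oprofile/backtrace.c</filename>). The validity |
| check is reduced to a placeholder, <function>frame_looks_valid()</function>; |
| the real code is careful to verify each frame pointer before dereferencing it, |
| and the call into the generic driver code is shown here as |
| <function>oprofile_add_trace()</function>. |
| </para> |
| <screen> |
| /* Sketch of the x86 frame-pointer walk used for call-chain samples. */ |
| struct frame_head { |
|         struct frame_head *ebp;   /* saved frame pointer of the caller */ |
|         unsigned long ret;        /* return address in the caller */ |
| }; |
| |
| static void sample_backtrace(struct frame_head *head, unsigned int depth) |
| { |
|         while (depth-- &amp;&amp; frame_looks_valid(head)) { |
|                 oprofile_add_trace(head-&gt;ret);  /* store into the CPU buffer */ |
|                 if (head-&gt;ebp &lt;= head)          /* must move up the stack */ |
|                         break; |
|                 head = head-&gt;ebp; |
|         } |
| } |
| </screen> |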
| </sect1> |
| |
| <sect1 id="synchronising-buffers"> |
| <title>Synchronising the CPU buffers to the event buffer</title> |
| <!-- FIXME: update when percpu patch goes in --> |
| <para> |
| At some point, we have to process the data in each CPU buffer and enter |
| it into the main (event) buffer. The file |
| <filename>buffer_sync.c</filename> contains the relevant code. We |
| periodically (currently every <constant>HZ</constant>/4 jiffies) start |
| the synchronisation process. In addition, we process the buffers on |
| certain events, such as an application calling |
| <function>munmap()</function>. This is particularly important for |
| <function>exit()</function> - because the CPU buffers contain pointers |
| to the task structure, if we don't process all the buffers before the |
| task is actually destroyed and the task structure freed, then we could |
| end up trying to dereference a bogus pointer in one of the CPU buffers. |
| </para> |
| <para> |
| We also add a notification when a kernel module is loaded; this is so |
| that user-space can re-read <filename>/proc/modules</filename> to |
| determine the load addresses of kernel module text sections. Without |
| this notification, samples for a newly-loaded module could get lost or |
| be attributed to the wrong module. |
| </para> |
| <para> |
| The synchronisation itself works in the following manner: first, mutual |
| exclusion on the event buffer is taken. Remember, we do not need to do |
| that for each CPU buffer, as we only read from the tail iterator |
| (interrupts might be arriving at the same buffer, but they will write to |
| the position of the head iterator, leaving previously written entries |
| intact). Then, we process each CPU buffer in turn. A CPU switch |
| notification is added to the buffer first (for |
| <option>--separate=cpu</option> support). Then the processing of the |
| actual data starts. |
| </para> |
| <para> |
| As mentioned, the CPU buffer consists of task switch entries and the |
| actual samples. When the routine <function>sync_buffer()</function> sees |
| a task switch, the process ID and process group ID are recorded into the |
| event buffer, along with a dcookie (see below) identifying the |
| application binary (e.g. <filename>/bin/bash</filename>). The |
| <varname>mmap_sem</varname> for the task is then taken, to allow safe |
| iteration across the task's list of mapped areas. Each sample is then |
| processed as described in the next section. |
| </para> |
| <para> |
| After a buffer has been read, the tail iterator is updated to reflect |
| how much of the buffer was processed. Note that when we determined how |
| much data there was to read in the CPU buffer, we also called |
| <function>cpu_buffer_reset()</function> to reset |
| <varname>last_task</varname> and <varname>last_is_kernel</varname>, as |
| we've already mentioned. During the processing, more samples may have |
| been arriving in the CPU buffer; this is OK because we are careful to |
| only update the tail iterator to how much we actually read - on the next |
| buffer synchronisation, we will start again from that point. |
| </para> |
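| <para> |
| A simplified sketch of the per-CPU part of this process is shown below. |
| The helpers <function>is_switch_entry()</function>, |
| <function>update_context()</function> and |
| <function>add_to_event_buffer()</function> are placeholders standing in for |
| the corresponding logic in <filename>buffer_sync.c</filename>. |
| </para> |
| <screen> |
| /* Simplified shape of synchronising one CPU buffer into the event buffer. |
|  * Only the tail is advanced here; interrupts keep writing at the head. */ |
| static void sync_one_cpu_buffer(struct oprofile_cpu_buffer *cpu_buf) |
| { |
|         unsigned long end = cpu_buf-&gt;head_pos;   /* snapshot the head */ |
|         unsigned long i; |
| |
|         cpu_buffer_reset(cpu_buf);               /* clears last_task / last_is_kernel */ |
| |
|         for (i = cpu_buf-&gt;tail_pos; i != end; i = (i + 1) % cpu_buf-&gt;buffer_size) { |
|                 struct op_sample *s = &amp;cpu_buf-&gt;buffer[i]; |
|                 if (is_switch_entry(s)) |
|                         update_context(s);       /* new task, or kernel/user change */ |
|                 else |
|                         add_to_event_buffer(s);  /* PC converted to (cookie, offset) */ |
|         } |
|         cpu_buf-&gt;tail_pos = end;                 /* everything up to here is consumed */ |
| } |
| </screen> |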
| </sect1> |
| |
| <sect1 id="dentry-cookies"> |
| <title>Identifying binary images</title> |
| <para> |
| In order to produce useful profiles, we need to be able to associate a |
| particular PC value sample with an actual ELF binary on the disk. This |
| leaves us with the problem of how to export this information to |
| user-space. We create unique IDs that identify a particular directory |
| entry (dentry), and write those IDs into the event buffer. Later on, |
| the user-space daemon can call the <function>lookup_dcookie()</function> |
| system call, which looks up the ID and fills in the full path of |
| the binary image in the buffer user-space passes in. These IDs are |
| maintained by the code in <filename>fs/dcookies.c</filename>; the |
| cache lasts for as long as the daemon has the event buffer open. |
| </para> |
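| <para> |
| On the daemon side, the lookup is a thin syscall wrapper, roughly as sketched |
| below (cf. <filename>daemon/opd_cookie.c</filename>). Note that on some 32-bit |
| ABIs the 64-bit cookie has to be passed as two separate syscall arguments, a |
| detail omitted here. |
| </para> |
| <screen> |
| /* Hedged userspace sketch: resolve a dcookie into a full path name. */ |
| #include &lt;sys/syscall.h&gt; |
| #include &lt;unistd.h&gt; |
| |
| static int resolve_dcookie(unsigned long long cookie, char *buf, size_t size) |
| { |
|         /* fills buf with the binary image's path if the cookie is known */ |
|         return syscall(SYS_lookup_dcookie, cookie, buf, size); |
| } |
| </screen> |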
| </sect1> |
| |
| <sect1 id="finding-dentry"> |
| <title>Finding a sample's binary image and offset</title> |
| <para> |
| We haven't yet described how we process the absolute PC value into |
| something usable by the user-space daemon. When we find a sample entered |
| into the CPU buffer, we traverse the list of mappings for the task |
| (remember, we will have seen a task switch earlier, so we know which |
| task's lists to look at). When a mapping is found that contains the PC |
| value, we look up the mapped file's dentry in the dcookie cache. This |
| gives the dcookie ID that will uniquely identify the mapped file. Then |
| we alter the absolute value such that it is an offset from the start of |
| the file being mapped (the mapping need not start at the start of the |
| actual file, so we have to consider the offset value of the mapping). We |
| store this dcookie ID into the event buffer; this identifies which |
| binary the samples following it are against. |
| In this manner, we have converted a PC value, which has transitory |
| meaning only, into a static offset value for later processing by the |
| daemon. |
| </para> |
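| <para> |
| The conversion is essentially a walk of the task's list of VMAs, as sketched |
| below; <function>get_dcookie_for()</function> is a placeholder for the dcookie |
| lookup described in the previous section, and anonymous mappings (which have |
| no backing file) are simply skipped. |
| </para> |
| <screen> |
| /* Sketch of converting an absolute PC into a (cookie, offset) pair. */ |
| for (vma = mm-&gt;mmap; vma; vma = vma-&gt;vm_next) { |
|         if (pc &lt; vma-&gt;vm_start || pc &gt;= vma-&gt;vm_end || !vma-&gt;vm_file) |
|                 continue; |
|         cookie = get_dcookie_for(vma-&gt;vm_file); |
|         /* the mapping need not start at file offset zero, so account |
|          * for the mapping's page offset into the file */ |
|         offset = (vma-&gt;vm_pgoff &lt;&lt; PAGE_SHIFT) + pc - vma-&gt;vm_start; |
|         break; |
| } |
| </screen> |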
| <para> |
| We also attempt to avoid the relatively expensive lookup of the dentry |
| cookie value by storing the cookie value directly into the dentry |
| itself; then we can simply derive the cookie value immediately when we |
| find the correct mapping. |
| </para> |
| </sect1> |
| |
| </chapter> |
| |
| <chapter id="sample-files"> |
| <title>Generating sample files</title> |
| |
| <sect1 id="processing-buffer"> |
| <title>Processing the buffer</title> |
| |
| <para> |
| Now we can move onto user-space in our description of how raw interrupt |
| samples are processed into useful information. As we described in |
| previous sections, the kernel OProfile driver creates a large buffer of |
| sample data consisting of offset values, interspersed with |
| notification of changes in context. These context changes indicate how |
| following samples should be attributed, and include task switches, CPU |
| changes, and which dcookie the sample value is against. By processing |
| this buffer entry-by-entry, we can determine where the samples should |
| be attributed to. This is particularly important when using the |
| <option>--separate</option> option. |
| </para> |
| <para> |
| The file <filename>daemon/opd_trans.c</filename> contains the basic routine |
| for the buffer processing. The <varname>struct transient</varname> |
| structure is used to hold changes in context. Its members are modified |
| as we process each entry; it is passed into the routines in |
| <filename>daemon/opd_sfile.c</filename> for actually logging the sample |
| to a particular sample file (which will be held in |
| <filename>$SESSION_DIR/samples/current</filename>). |
| </para> |
| <para> |
| The buffer format is designed for conciseness, as high sampling rates |
| can easily generate a lot of data. Thus, context changes are prefixed |
| by an escape code, identified by <function>is_escape_code()</function>. |
| If an escape code is found, the next entry in the buffer identifies |
| what type of context change is being read. These are handed off to |
| various handlers (see the <varname>handlers</varname> array), which |
| modify the transient structure as appropriate. If it's not an escape |
| code, then it must be a PC offset value, and the very next entry will |
| be the numeric hardware counter. These values are read and recorded |
| in the transient structure; we then do a lookup to find the correct |
| sample file, and log the sample, as described in the next section. |
| </para> |
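| <para> |
| The processing loop therefore has roughly the following shape. The field and |
| helper names approximate the daemon sources (<function>pop_buffer_value()</function>, |
| the <varname>handlers</varname> array and the sample-file logging call) and may |
| differ slightly between releases. |
| </para> |
| <screen> |
| /* Simplified shape of the daemon's buffer walk (cf. daemon/opd_trans.c). */ |
| while (trans.remaining) { |
|         unsigned long code = pop_buffer_value(&amp;trans); |
| |
|         if (is_escape_code(code)) { |
|                 /* the next value says which kind of context change follows */ |
|                 code = pop_buffer_value(&amp;trans); |
|                 handlers[code](&amp;trans);      /* e.g. task switch, CPU switch, cookie */ |
|         } else { |
|                 /* a PC offset; the next value is the hardware counter number */ |
|                 trans.pc = code; |
|                 trans.event = pop_buffer_value(&amp;trans); |
|                 sfile_log_sample(&amp;trans);    /* find the sample file and log it */ |
|         } |
| } |
| </screen> |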
| |
| <sect2 id="handling-kernel-samples"> |
| <title>Handling kernel samples</title> |
| |
| <para> |
| Samples from kernel code require a little special handling. Because |
| the binary text which the sample is against does not correspond to |
| any file that the kernel directly knows about, the OProfile driver |
| stores the absolute PC value in the buffer, instead of the file offset. |
| Of course, we need an offset against some particular binary. To handle |
| this, we keep a list of loaded modules by parsing |
| <filename>/proc/modules</filename> as needed. When a module is loaded, |
| a notification is placed in the OProfile buffer, and this triggers a |
| re-read. We store the module name, and the loading address and size. |
| This is also done for the main kernel image, as specified by the user. |
| The absolute PC value is matched against each address range, and |
| modified into an offset when the matching module is found. See |
| <filename>daemon/opd_kernel.c</filename> for the details. |
| </para> |
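| <para> |
| The matching itself is a simple range check over the known kernel images, as |
| in the sketch below; <varname>kernel_images</varname> and |
| <varname>nr_kernel_images</varname> are hypothetical names for the list built |
| from <filename>/proc/modules</filename> plus the vmlinux range supplied by the |
| user. |
| </para> |
| <screen> |
| /* Sketch of attributing a kernel-space PC to a kernel image. */ |
| struct kernel_image { const char *name; unsigned long start, end; }; |
| |
| static struct kernel_image *find_kernel_image(unsigned long pc) |
| { |
|         size_t i; |
|         for (i = 0; i &lt; nr_kernel_images; ++i) { |
|                 if (pc &gt;= kernel_images[i].start &amp;&amp; pc &lt; kernel_images[i].end) |
|                         return &amp;kernel_images[i]; |
|         } |
|         return NULL;  /* no match: the sample cannot be attributed */ |
| } |
| /* the value actually stored is then pc - image-&gt;start */ |
| </screen> |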
| |
| </sect2> |
| |
| |
| </sect1> |
| |
| <sect1 id="sample-file-generation"> |
| <title>Locating and creating sample files</title> |
| |
| <para> |
| We have a sample value and its satellite data stored in a |
| <varname>struct transient</varname>, and we must locate an |
| actual sample file to store the sample in, using the context |
| information in the transient structure as a key. The transient data to |
| sample file lookup is handled in |
| <filename>daemon/opd_sfile.c</filename>. A hash is taken of the |
| transient values that are relevant (depending upon the setting of |
| <option>--separate</option>, some values might be irrelevant), and the |
| hash value is used to lookup the list of currently open sample files. |
| Of course, the sample file might not be found, in which case we need |
| to create and open it. |
| </para> |
| <para> |
| OProfile uses a rather complex scheme for naming sample files, in order |
| to make selecting relevant sample files easier for the post-profiling |
| utilities. The exact details of the scheme are given in |
| <filename>oprofile-tests/pp_interface</filename>, but for now it will |
| suffice to remember that the filename will include only relevant |
| information for the current settings, taken from the transient data. A |
| fully-specified filename looks something like: |
| </para> |
| <computeroutput> |
| /var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0 |
| </computeroutput> |
| <para> |
| It should be clear that this identifies such information as the |
| application binary, the dependent (library) binary, the hardware event, |
| and the process and thread ID. Typically, not all this information is |
| needed, in which case some values may be replaced with the token |
| <filename>all</filename>. |
| </para> |
| <para> |
| The code that generates this filename and opens the file is found in |
| <filename>daemon/opd_mangling.c</filename>. You may have realised that |
| at this point, we do not have the binary image file names, only the |
| dcookie values. In order to determine a file name, a dcookie value is |
| looked up in the dcookie cache. This is to be found in |
| <filename>daemon/opd_cookie.c</filename>. Since dcookies are both |
| persistent and unique during a sampling session, we can cache the |
| values. If the value is not found in the cache, then we ask the kernel |
| to do the lookup from value to file name for us by calling |
| <function>lookup_dcookie()</function>. This looks up the value in a |
| kernel-side cache (see <filename>fs/dcookies.c</filename>) and returns |
| the fully-qualified file name to userspace. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="sample-file-writing"> |
| <title>Writing data to a sample file</title> |
| |
| <para> |
| Each specific sample file is a hashed collection, where the key is |
| the PC offset from the transient data, and the value is the number of |
| samples recorded against that offset. The files are |
| <function>mmap()</function>ed into the daemon's memory space. The code |
| to actually log the write against the sample file can be found in |
| <filename>libdb/</filename>. |
| </para> |
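| <para> |
| Conceptually, then, logging one sample is just "increment the count stored |
| under this offset". The toy fragment below illustrates that model with an |
| ordinary in-memory hash table; the real sample files use the |
| <function>mmap()</function>-based format implemented in |
| <filename>libdb/</filename>, not this code. |
| </para> |
| <screen> |
| /* Self-contained toy illustrating the sample-file model: a hash table |
|  * mapping a PC offset (the key) to a sample count (the value). */ |
| #include &lt;stdlib.h&gt; |
| |
| struct toy_node { unsigned long key; unsigned long count; struct toy_node *next; }; |
| static struct toy_node *toy_table[1024]; |
| |
| static void toy_log_sample(unsigned long offset) |
| { |
|         struct toy_node *n, **head = &amp;toy_table[offset % 1024]; |
| |
|         for (n = *head; n; n = n-&gt;next) { |
|                 if (n-&gt;key == offset) { |
|                         ++n-&gt;count;               /* existing key: bump the count */ |
|                         return; |
|                 } |
|         } |
|         n = calloc(1, sizeof(*n));                /* new key: insert with count 1 */ |
|         if (!n) |
|                 return; |
|         n-&gt;key = offset; |
|         n-&gt;count = 1; |
|         n-&gt;next = *head; |
|         *head = n; |
| } |
| </screen> |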
| <para> |
| For recording stack traces, we have a more complicated sample filename |
| mangling scheme that allows us to identify cross-binary calls. We use |
| the same sample file format, where the key is a 64-bit value composed |
| from the from,to pair of offsets. |
| </para> |
| |
| </sect1> |
| |
| </chapter> |
| |
| <chapter id="output"> |
| <title>Generating useful output</title> |
| |
| <para> |
| All of the tools used to generate human-readable output have to take |
| roughly the same steps to collect the data for processing. First, the |
| profile specification given by the user has to be parsed. Next, a list |
| of sample files matching the specification has to be obtained. Using this |
| list, we need to locate the binary file for each sample file, and then |
| use them to extract meaningful data, before a final collation and |
| presentation to the user. |
| </para> |
| |
| <sect1 id="profile-specification"> |
| <title>Handling the profile specification</title> |
| |
| <para> |
| The profile specification presented by the user is parsed in |
| the function <function>profile_spec::create()</function>. This |
| creates an object representing the specification. Then we |
| use <function>profile_spec::generate_file_list()</function> |
| to search for all sample files and match them against the |
| <varname>profile_spec</varname>. |
| </para> |
| |
| <para> |
| To enable this matching process to work, the attributes of |
| each sample file are encoded in its filename. This is a low-tech |
| approach to matching specifications against candidate sample |
| files, but it works reasonably well. Typical sample file names |
| look like these: |
| </para> |
| <screen> |
| /var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all |
| /var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all |
| /var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0 |
| /var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all |
| </screen> |
| <para> |
| This looks unnecessarily complex, but it's actually fairly simple. First |
| we have the session directory of the sample, by default |
| <filename>/var/lib/oprofile/samples/current</filename>. This location |
| can be changed with the <option>--session-dir</option> command-line option. |
| This session could equally well be inside an archive from <command>oparchive</command>. |
| Next we have one of the tokens <filename>{root}</filename> or |
| <filename>{kern}</filename>. <filename>{root}</filename> indicates |
| that the binary is found on a file system, and we will encode its path |
| in the next section (e.g. <filename>/bin/ls</filename>). |
| <filename>{kern}</filename> indicates a kernel module - on 2.6 kernels |
| the path information is not available from the kernel, so we have to |
| special-case kernel modules like this; we encode merely the name of the |
| module as loaded. |
| </para> |
| <para> |
| Next there is a <filename>{dep}</filename> token, indicating another |
| token/path which identifies the dependent binary image. This is used even for |
| the "primary" binary (i.e. the one that was |
| <function>execve()</function>d), as it simplifies processing. Finally, |
| if this sample file is a normal flat profile, the actual file is next in |
| the path. If it's a call-graph sample file, we need one further |
| specification, to allow us to identify cross-binary arcs in the call |
| graph. |
| </para> |
| <para> |
| The actual sample file name is dot-separated, where the fields are, in |
| order: event name, event count, unit mask, task group ID, task ID, and |
| CPU number. |
| </para> |
| <para> |
| This sample file can be reliably parsed (with |
| <function>parse_filename()</function>) into a |
| <varname>filename_spec</varname>. Finally, we can check whether to |
| include the sample file in the final results by comparing this |
| <varname>filename_spec</varname> against the |
| <varname>profile_spec</varname> the user specified (for the interested, |
| see <function>valid_candidate()</function> and |
| <function>profile_spec::match</function>). Then comes the really |
| complicated bit... |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="sample-file-collating"> |
| <title>Collating the candidate sample files</title> |
| |
| <para> |
| At this point we have a duplicate-free list of sample files we need |
| to process. But first we need to do some further arrangement: we |
| need to classify each sample file, and we may also need to "invert" |
| the profiles. |
| </para> |
| |
| <sect2 id="sample-file-classifying"> |
| <title>Classifying sample files</title> |
| |
| <para> |
| It's possible for utilities like <command>opreport</command> to show |
| data in columnar format: for example, we might want to show the results |
| of two threads within a process side-by-side. To do this, we need |
| to classify each sample file into classes - the classes correspond |
| with each <command>opreport</command> column. The function that handles |
| this is <function>arrange_profiles()</function>. Each sample file |
| is added to a particular class. If the sample file is the first in |
| its class, a template is generated from the sample file. Each template |
| describes a particular class (thus, in our example above, each template |
| will have a different thread ID, and this uniquely identifies each |
| class). |
| </para> |
| |
| <para> |
| Each class has a list of "profile sets" matching that class's template. |
| A profile set is either a profile of the primary binary image, or any of |
| its dependent images. After all sample files have been listed in one of |
| the profile sets belonging to the classes, we have to name each class and |
| perform error-checking. This is done by |
| <function>identify_classes()</function>; each class is checked to ensure |
| that its "axis" is the same as all the others. This is needed because |
| <command>opreport</command> can't produce results in 3D format: we can |
| only differ in one aspect, such as thread ID or event name. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="sample-file-inverting"> |
| <title>Creating inverted profile lists</title> |
| |
| <para> |
| Remember that if we're using certain profile separation options, such as |
| "--separate=lib", a single binary could be a dependent image to many |
| different binaries. For example, the C library image would be a |
| dependent image for most programs that have been profiled. As it |
| happens, this can cause severe performance problems: without some |
| re-arrangement, these dependent binary images would be opened each |
| time we need to process sample files for each program. |
| </para> |
| |
| <para> |
| The solution is to "invert" the profiles via |
| <function>invert_profiles()</function>. We create a new data structure |
| where the dependent binary is first, and the primary binary images using |
| that dependent binary are listed as sub-images. This helps our |
| performance problem, as now we only need to open each dependent image |
| once, when we process the list of inverted profiles. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="generating-profile-data"> |
| <title>Generating profile data</title> |
| |
| <para> |
| Things don't get any simpler at this point, unfortunately. By now, |
| we've collected and classified the sample files into the set of inverted |
| profiles, as described in the previous section. Now we need to process |
| each inverted profile and make something of the data. The entry point |
| for this is <function>populate_for_image()</function>. |
| </para> |
| |
| <sect2 id="bfd"> |
| <title>Processing the binary image</title> |
| <para> |
| The first thing we do with an inverted profile is attempt to open the |
| binary image (remember each inverted profile set is only for one binary |
| image, but may have many sample files to process). The |
| <varname>op_bfd</varname> class provides an abstracted interface to |
| this; internally it uses <filename>libbfd</filename>. The main purpose |
| of this class is to process the symbols for the binary image; this is |
| also where symbol filtering happens. This is actually quite tricky, but |
| should be clear from the source. |
| </para> |
| </sect2> |
| |
| <sect2 id="processing-sample-files"> |
| <title>Processing the sample files</title> |
| <para> |
| The class <varname>profile_container</varname> is a hold-all that |
| contains all the processed results. It is a container of |
| <varname>profile_t</varname> objects. The |
| <function>add_sample_files()</function> method uses |
| <filename>libdb</filename> to open the given sample file and add the |
| key/value types to the <varname>profile_t</varname>. Once this has been |
| done, <function>profile_container::add()</function> is passed the |
| <varname>profile_t</varname> plus the <varname>op_bfd</varname> for |
| processing. |
| </para> |
| <para> |
| <function>profile_container::add()</function> walks through the symbols |
| collected in the <varname>op_bfd</varname>. |
| <function>op_bfd::get_symbol_range()</function> gives us the start and |
| end of the symbol as an offset from the start of the binary image, |
| then we interrogate the <varname>profile_t</varname> for the relevant samples |
| for that offset range. We create a <varname>symbol_entry</varname> |
| object for this symbol and fill it in. If needed, here we also collect |
| debug information from the <varname>op_bfd</varname>, and possibly |
| record the detailed sample information (as used by <command>opreport |
| -d</command> and <command>opannotate</command>). |
| Finally the <varname>symbol_entry</varname> is added to |
| a private container of <varname>profile_container</varname> - this |
| <varname>symbol_container</varname> holds all such processed symbols. |
| </para> |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="generating-output"> |
| <title>Generating output</title> |
| |
| <para> |
| After the processing described in the previous section, we've now got |
| full details of what we need to output stored in the |
| <varname>profile_container</varname> on a symbol-by-symbol basis. To |
| produce output, we need to replay that data and format it suitably. |
| </para> |
| <para> |
| <command>opreport</command> first asks the |
| <varname>profile_container</varname> for a |
| <varname>symbol_collection</varname> (this is also where thresholding |
| happens). |
| This is sorted, then an |
| <varname>opreport_formatter</varname> is initialised. |
| This object initialises a set of field formatters as requested. Then |
| <function>opreport_formatter::output()</function> is called. This |
| iterates through the (sorted) <varname>symbol_collection</varname>; |
| for each entry, the selected fields (as set by the |
| <varname>format_flags</varname> options) are output by calling the |
| field formatters, with the <varname>symbol_entry</varname> passed in. |
| </para> |
| |
| </sect1> |
| |
| </chapter> |
| |
| <chapter id="ext"> |
| <title>Extended Feature Interface</title> |
| |
| <sect1 id="ext-intro"> |
| <title>Introduction</title> |
| |
| <para> |
| The Extended Feature Interface is a standard callback interface |
| designed to allow extension to the OProfile daemon's sample processing. |
| Each feature defines a set of callback handlers which can be enabled or |
| disabled through the OProfile daemon's command-line option. |
| This interface can be used to implement support for architecture-specific |
| features or features not commonly used by general OProfile users. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="ext-name-and-handlers"> |
| <title>Feature Name and Handlers</title> |
| |
| <para> |
| Each extended feature has an entry in the <varname>ext_feature_table</varname> |
| in <filename>opd_extended.cpp</filename>. Each entry contains a feature name, |
| and a corresponding set of handlers. Feature name is a unique string, which is |
| used to identify a feature in the table. Each feature provides a set |
| of handlers, which will be executed by the OProfile daemon from pre-determined |
| locations to perform certain tasks. At runtime, the OProfile daemon calls a feature |
| handler wrapper from one of the predetermined locations to check whether |
| an extended feature is enabled, and whether a particular handler exists. |
| Only the handlers of the enabled feature will be executed. |
| </para> |
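| <para> |
| The registration data itself is a small table of name/handler pairs, roughly |
| as sketched below. The structure and field names approximate the |
| <filename>opd_extended</filename> sources and may differ slightly; |
| <varname>ibs_handlers</varname> is shown as an example entry. |
| </para> |
| <screen> |
| /* Approximate shape of the extended-feature registration. */ |
| struct opd_ext_handlers { |
|         int (*ext_init)(char const *args);         /* required */ |
|         int (*ext_print_stats)(void); |
|         struct opd_ext_sfile_handlers *ext_sfile;  /* sample-file hooks */ |
| }; |
| |
| struct opd_ext_feature { |
|         char const *feature;                       /* e.g. "ibs" */ |
|         struct opd_ext_handlers *handlers; |
| }; |
| |
| static struct opd_ext_feature ext_feature_table[] = { |
|         { "ibs", &amp;ibs_handlers }, |
|         { NULL,  NULL } |
| }; |
| </screen> |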
| |
| </sect1> |
| |
| <sect1 id="ext-enable"> |
| <title>Enabling Features</title> |
| |
| <para> |
| Each feature is enabled using the OProfile daemon (oprofiled) command-line |
| option "--ext-feature=<extended-feature-name>:[args]". The |
| "extended-feature-name" is used to determine the feature to be enabled. |
| The optional "args" is passed into the feature-specific initialization handler |
| (<function>ext_init</function>). Currently, only one extended feature can be |
| enabled at a time. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="ext-types-of-handlers"> |
| <title>Type of Handlers</title> |
| |
| <para> |
| Each feature is responsible for providing its own set of handlers. |
| Types of handler are: |
| </para> |
| |
| <sect2 id="ext_init"> |
| <title>ext_init Handler</title> |
| |
| <para> |
| "ext_init" handles initialization of an extended feature. It takes |
| "args" parameter which is passed in through the "oprofiled --ext-feature=< |
| extended-feature-name>:[args]". This handler is executed in the function |
| <function>opd_options()</function> in the file <filename>daemon/oprofiled.c |
| </filename>. |
| </para> |
| |
| <note> |
| <para> |
| The ext_init handler is required for all features. |
| </para> |
| </note> |
| |
| </sect2> |
| |
| <sect2 id="ext_print_stats"> |
| <title>ext_print_stats Handler</title> |
| |
| <para> |
| "ext_print_stats" handles the extended feature statistics report. It adds |
| a new section in the OProfile daemon statistics report, which is normally |
| written to the file |
| <filename>/var/lib/oprofile/samples/oprofiled.log</filename>. |
| This handler is executed in the function <function>opd_print_stats()</function> |
| in the file <filename>daemon/opd_stats.c</filename>. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="ext_sfile_handlers"> |
| <title>ext_sfile Handler</title> |
| |
| <para> |
| "ext_sfile" contains a set of handlers related to operations on the extended |
| sample files (sample files for events related to the extended feature). |
| These operations include <function>create_sfile()</function>, |
| <function>sfile_dup()</function>, <function>close_sfile()</function>, |
| <function>sync_sfile()</function>, and <function>get_file()</function> |
| as defined in <filename>daemon/opd_sfile.c</filename>. |
| An additional field, <varname>odb_t * ext_file</varname>, is added to the |
| <varname>struct sfile</varname> for storing extended sample file |
| information. |
| |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="ext-implementation"> |
| <title>Extended Feature Reference Implementation</title> |
| |
| <sect2 id="ext-ibs"> |
| <title>Instruction-Based Sampling (IBS)</title> |
| |
| <para> |
| An example of extended feature implementation can be seen by |
| examining the AMD Instruction-Based Sampling support. |
| </para> |
| |
| <sect3 id="ibs-init"> |
| <title>IBS Initialization</title> |
| |
| <para> |
| Instruction-Based Sampling (IBS) is a new performance measurement technique |
| available on AMD Family 10h processors. Enabling IBS profiling is done simply |
| by specifying IBS performance events through the "--event=" option. |
| </para> |
| |
| <screen> |
| opcontrol --event=IBS_FETCH_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt; |
| opcontrol --event=IBS_OP_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt; |
| |
| Note: * Count and unitmask for all IBS fetch events must be the same, |
|         as must those for IBS op events. |
| </screen> |
| |
| <para> |
| IBS performance events are listed by <command>opcontrol --list-events</command>. |
| When users specify these events, <command>opcontrol</command> verifies them using |
| <command>ophelp</command>, which checks for the <varname>ext:ibs_fetch</varname> or |
| <varname>ext:ibs_op</varname> tag in the |
| <filename>events/x86-64/family10/events</filename> file. |
| Then, it configures the driver interface (<filename>/dev/oprofile/ibs_fetch/</filename>... and |
| <filename>/dev/oprofile/ibs_op/</filename>...) and starts the OProfile daemon as follows. |
| </para> |
| |
| <screen> |
| oprofiled \ |
| --ext-feature=ibs:\ |
| fetch:&lt;IBS_FETCH_EVENT1&gt;,&lt;IBS_FETCH_EVENT2&gt;,...,:&lt;IBS fetch count&gt;:&lt;IBS fetch um&gt;|\ |
| op:&lt;IBS_OP_EVENT1&gt;,&lt;IBS_OP_EVENT2&gt;,...,:&lt;IBS op count&gt;:&lt;IBS op um&gt; |
| </screen> |
| |
| <para> |
| Here, the OProfile daemon parses the <option>--ext-feature</option> |
| option and checks the feature name ("ibs") before calling |
| the initialization function to handle the string |
| containing IBS events, counts, and unitmasks. |
| Then, it stores each event in the IBS virtual-counter table |
| (<varname>struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]</varname>) and |
| stores the event index in the IBS Virtual Counter Index (VCI) map |
| (<varname>ibs_vci_map[OP_MAX_IBS_COUNTERS]</varname>) with IBS event value |
| as the map key. |
| </para> |
| </sect3> |
| |
| <sect3 id="ibs-data-processing"> |
| <title>IBS Data Processing</title> |
| |
| <para> |
| During a profile session, the OProfile daemon identifies IBS samples in the |
| event buffer using the <varname>"IBS_FETCH_CODE"</varname> or |
| <varname>"IBS_OP_CODE"</varname>. These codes trigger the handlers |
| <function>code_ibs_fetch_sample()</function> or |
| <function>code_ibs_op_sample()</function> listed in the |
| <varname>handler_t handlers[]</varname> vector in |
| <filename>daemon/opd_trans.c</filename>. These handlers are responsible for |
| processing IBS samples and translating them into IBS performance events. |
| </para> |
| |
| <para> |
| Unlike traditional performance events, multiple IBS performance events can |
| be derived from each IBS sample. For each event that the user specifies, |
| a combination of bits from Model-Specific Registers (MSR) are checked |
| against the bitmask defining the event. If the condition is met, the event |
| will then be recorded. The derivation logic is in the files |
| <filename>daemon/opd_ibs_macro.h</filename> and |
| <filename>daemon/opd_ibs_trans.[h,c]</filename>. |
| </para> |
| |
| </sect3> |
| |
| <sect3 id="ibs-sample-file"> |
| <title>IBS Sample File</title> |
| |
| <para> |
| Traditionally, sample file information <varname>(odb_t)</varname> is stored |
| in the <varname>struct sfile::odb_t file[OP_MAX_COUNTER]</varname>. |
| Currently, <varname>OP_MAX_COUNTER</varname> is 8 on non-alpha, and 20 on |
| alpha-based systems. The event index (the counter number on which the event |
| is configured) is used to access the corresponding entry in the array. |
| Unlike traditional performance events, IBS does not use the actual |
| counter registers (i.e. <filename>/dev/oprofile/0,1,2,3</filename>). |
| Also, the number of performance events generated by IBS could be larger than |
| <varname>OP_MAX_COUNTER</varname> (currently up to 13 IBS-fetch and 46 IBS-op |
| events). Therefore, IBS requires a special data structure and sfile |
| handlers (<varname>struct opd_ext_sfile_handlers</varname>) for managing |
| IBS sample files. IBS sample file information is stored in memory |
| allocated by the handler <function>ibs_sfile_create()</function>, which can |
| be accessed through <varname>struct sfile::odb_t * ext_files</varname>. |
| </para> |
| |
| </sect3> |
| |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |
| |
| <glossary id="glossary"> |
| <title>Glossary of OProfile source concepts and types</title> |
| |
| <glossentry><glossterm>application image</glossterm> |
| <glossdef><para> |
| The primary binary image used by an application. This is derived |
| from the kernel and corresponds to the binary started upon running |
| an application: for example, <filename>/bin/bash</filename>. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>binary image</glossterm> |
| <glossdef><para> |
| An ELF file containing executable code: this includes kernel modules, |
| the kernel itself (a.k.a. <filename>vmlinux</filename>), shared libraries, |
| and application binaries. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>dcookie</glossterm> |
| <glossdef><para> |
| Short for "dentry cookie". A unique ID that can be looked up to provide |
| the full path name of a binary image. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>dependent image</glossterm> |
| <glossdef><para> |
| A binary image that is dependent upon an application, used with |
| per-application separation. Most commonly, shared libraries. For example, |
| if <filename>/bin/bash</filename> is running and we take |
| some samples inside the C library itself due to <command>bash</command> |
| calling library code, then the image <filename>/lib/libc.so</filename> |
| would be dependent upon <filename>/bin/bash</filename>. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>merging</glossterm> |
| <glossdef><para> |
| This refers to the ability to merge several distinct sample files |
| into one set of data at runtime, in the post-profiling tools. For example, |
| per-thread sample files can be merged into one set of data, because |
| they are compatible (i.e. the aggregation of the data is meaningful), |
| but it's not possible to merge sample files for two different events, |
| because there would be no useful meaning to the results. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>profile class</glossterm> |
| <glossdef><para> |
| A collection of profile data that has been collected under the same |
| class template. For example, if we're using <command>opreport</command> |
| to show results after profiling with two performance counters enabled, |
| profiling <constant>DATA_MEM_REFS</constant> and <constant>CPU_CLK_UNHALTED</constant>, |
| there would be two profile classes, one for each event. Or if we're on |
| an SMP system and doing per-cpu profiling, and we request |
| <command>opreport</command> to show results for each CPU side-by-side, |
| there would be a profile class for each CPU. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>profile specification</glossterm> |
| <glossdef><para> |
| The parameters the user passes to the post-profiling tools that limit |
| what sample files are used. This specification is matched against |
| the available sample files to generate a selection of profile data. |
| </para></glossdef></glossentry> |
| |
| <glossentry><glossterm>profile template</glossterm> |
| <glossdef><para> |
| The parameters that define what goes in a particular profile class. |
| This includes a symbolic name (e.g. "cpu:1") and the code-usable |
| equivalent. |
| </para></glossdef></glossentry> |
| |
| </glossary> |
| |
| </book> |