Title: Method and apparatus for accurate profiling of computer programs
Abstract: An object code expansion profiler equips a program for execution profiling by preprocessing the object code files of the program so as to add profiling monitoring code to the beginning of all or substantially all functions. The preprocessing includes, for each function, the steps of grouping the function's instructions into basic blocks, counting the number of cycles required to execute the instructions of the basic block, and inserting special monitoring code with the basic block. The special monitoring code is executed each time the basic block is executed, and updates the profiling information to reflect the number of cycles required to execute the basic block. Special handling is provided for profiling calls to the Operating System (OS). The resultant profiling information is converted into a call graph image most useful for human users. For each arc in the graph connecting a calling-function/parent-node to a called-function/child node, the displayed arc image has a width logarithmically proportional to the self+descendants time for the called function.
Patent Number: 6,934,935 Issued on 08/23/2005 to Bennett,   et al.
| Inventors:
|
Bennett; James (Menlo Park, CA);
Anderson; Mark (Palo Alto, CA);
Na; Choon Piaw (Saratoga, CA);
Hastings; Reed (La Honda, CA)
|
| Assignee:
|
International Business Machines Corporation (Armonk, NY)
|
| Appl. No.:
|
557737 |
| Filed:
|
April 25, 2000 |
| Current U.S. Class: |
717/127; 717/150; 717/151; 717/156; 717/158; 717/130; 718/100; 718/102 |
| Intern'l Class: |
G06F 009/44; G06F 009//46 |
| Field of Search: |
717/127,130,151,156,158,136,150
718/100,102
|
References Cited [Referenced By]
U.S. Patent Documents
| 4845615 | Jul., 1989 | Blasciak.
| |
| 4937740 | Jun., 1990 | Agarwal et al.
| |
| 5047919 | Sep., 1991 | Sterling et al.
| |
| 5142679 | Aug., 1992 | Owaki et al.
| |
| 5164969 | Nov., 1992 | Alley et al.
| |
| 5193180 | Mar., 1993 | Hastings.
| |
| 5212794 | May., 1993 | Pettis et al.
| |
| 5247651 | Sep., 1993 | Clarisse.
| |
| 5265254 | Nov., 1993 | Blasciak et al.
| |
| 5313616 | May., 1994 | Cline et al.
| |
| 5333304 | Jul., 1994 | Christensen et al.
| |
| 5335344 | Aug., 1994 | Hastings.
| |
| 5355484 | Oct., 1994 | Record et al.
| |
| 5355487 | Oct., 1994 | Keller et al.
| |
| 5359533 | Oct., 1994 | Ricka et al.
| |
| 5450586 | Sep., 1995 | Kuzara et al.
| |
| 5465258 | Nov., 1995 | Adams.
| |
| 5539907 | Jul., 1996 | Srivastava et al.
| |
| 5732273 | Mar., 1998 | Srivastava et al.
| |
| 5740443 | Apr., 1998 | Carini.
| |
| 5828883 | Oct., 1998 | Hall.
| |
| 5963740 | Oct., 1999 | Srivastava et al.
| |
| 6049666 | Apr., 2000 | Bennett et al.
| |
| 6126329 | Oct., 2000 | Bennett et al.
| |
Other References
"Effective Simulation of Multiprogramming", W.P. Dawkins et al., ACM SIGMETRICS
Performance Evaluation Review, Proceedings of the conference on Measurement and
modeling of computer systems, Apr. 1990, vol. 18 Issue 1.
"Profile Guided Code Positioning", Karl Pettis and Robert C. Hansen, Hewlett-Packard
Company, California Language Laboratory, Proceedings of the ACM SIGPLAN'90 Conference
on Programming Language Design and implementation, Jun. 1990.
Graham et al., GPROF; "A Call Graph Execution Profiler," p. 120-126 (1-16) 1982.
Hilfingen. "A Memory Allocation Profiler for C and Lisp Programs," p. 223-47, 1988.
Ponder et al., "Inaccuracies in Program Profilers," p. 459-467, 1987.
Graham et al., "An Execution Profiler for Modular Programs," p. 671-685, Mar.
10, 1983.
Smith, Michael D., "Tracing With Pixie," Apr. 4, 1991, Stanford University Technical
Report No. CSL-TR-91-497, p. 1-29.
"Pixie", UNIX man p. 1-2.
Larus, James R. et al., "Re-Writing Executable Files to Measure Program Behavior,"
Mar. 25, 1992, Univ. of Wisconsin, Computer Science Dept.
Ball, Thomas, "Optimally Profiling and Tracing Programs," Sep. 6, 1991, Univ.
of Wisconsin, Computer Science Department.
|
Primary Examiner: Banankhah; Majid
Parent Case Text
CROSS REFERENCE TO RELATED APPLICATIONS
This is a continuation of application Ser. No. 08/871,247, filed Jun. 9, 1997,
now U.S. Pat. No. 6,126,329 which is a continuation of application Ser. No. 08/074,428,
filed, Jun. 8, 1993, now abandoned which are both hereby incorporated by reference.
Claims
1. A computer implemented method for profiling execution of a computer program
having functions, the method comprising the steps of:
a) augmenting the program by performing the following augmenting steps for substantially
all functions of the program:
i) selecting a function;
ii) adding function entry profiling code to the selected function;
iii) adding function exit profiling code to the selected function; and
b) executing the augmented program and collecting profiling data for substantially
all executed functions of the program.
2. The method of claim 1, wherein the program augmenting step comprises performing
said augmenting steps for all functions of said program, and wherein the executing
step comprises collecting profiling data for all executed program functions.
3. The method of claim 1, wherein the step of adding function entry profiling
code comprises adding function entry parallel stack simulation code for maintaining
a parallel stack indicating call chains, and wherein the step of adding function
exit profiling code comprises adding function exit parallel stack simulation code
for maintaining the parallel stack.
4. The method of claim 1, wherein the step of adding function exit profiling
code comprises adding code to determine a self+descendants time for a call to the
selected function and to incorporate the self+descendants time into profiling information
indexed by an identifier for the selected function.
5. The method of claim 4, wherein the step of adding function exit profiling
code comprises adding code to incorporate the self+descendants time into profiling
information indexed by a partial call chain for the call to the selected function.
6. The method of claim 5, wherein the partial call chain identifies the selected function.
7. The method of claim 6, wherein the partial call chain also identifies an immediate
caller of the selected function.
8. The method of claim 1, wherein the step of adding function exit profiling
code comprises adding code to determine timing statistics for a call to the selected
function and to incorporate the timing statistics into profiling information indexed
by an identifier for the selected function and by a partial call chain for the
call to the selected function.
9. The method of claim 1, wherein the program has relocatable object files containing
substantially all of the functions of the program, and wherein the step of augmenting
the program comprises adding profiling code to substantially all relocatable object
files of the program.
10. The method of claim 9, wherein said relocatable object files comprise instructions,
and wherein the step of augmenting the program further comprises for each function
the steps of:
parsing the function instructions into blocks; and
adding block profiling code to each block.
11. The method of claim 9, wherein said relocatable object files comprise instructions,
and wherein the step of augmenting the program further comprises for each function
the steps of:
parsing the function instructions into basic blocks, wherein a basic block consists
of a contiguous sequence of at least one instruction of which the first instruction
is a program control transfer destination, the last instruction is a program control
transfer instruction, and any instructions between the first and last instructions
are neither program control transfer destinations nor program control transfer
instructions; and
for each basic block, performing the steps of selecting a basic block;
analyzing the selected basic block to determine a fixed number of clock cycles
for the selected basic block; and
adding basic block profiling code so as to be executed whenever the basic block
is executed, said basic block profiling code operating to add the determined fixed
number of clock cycles to a clock cycle accumulator for the function.
12. The method of claim 11, wherein the added basic block profiling code operates
to add the determined fixed number of clock cycles to a clock cycle accumulator
for the selected basic block of the function.
13. The method of claim 11, wherein the step of analyzing the selected basic
block comprises determining whether the selected basic block contains a variably
determinate length operating system call, and wherein the step of adding basic
block profiling code comprises, if the selected basic block contains a variably
determinate length operating system call, the steps of
adding simulation maintenance code to simulate constraints that determine execution
time for the variably determinate length operating system call;
adding simulation evaluation code to evaluate the constraints and to determine
a number of clock cycles required to execute the variably determinate length operating
system call; and
adding simulation profiling code to add the determined number of clock cycles
to a clock cycle accumulator for the function.
14. The method of claim 13, wherein the added simulation profiling code operates
to add the determined fixed number of clock cycles to a clock cycle accumulator
for the selected basic block of the function.
15. The method of claim 11, wherein the step of analyzing the selected basic
block comprises determining whether the selected basic block contains an operating
system (OS) call, and wherein the step of adding basic block profiling code comprises,
for at least one selected basic block containing an OS call, the OS call timing
profiling steps of
adding timing start code to determine a start time of day immediately before
executing the OS call;
adding timing end code to determine an end time of day immediately after executing
the OS call;
adding OS call timing calculation code to determine from the start time and end
time a number of clock cycles required to execute the OS call; and
adding OS call timing recording code to add the determined number of clock cycles
to a clock cycle accumulator for the function.
16. The method of claim 15, wherein said OS call timing profiling steps are performed
for substantially only basic blocks containing indeterminate length OS calls.
17. The method of claim 15, wherein the added OS call timing recording code operates
to add the determined fixed number of clock cycles to a clock cycle accumulator
for the selected basic block of the function.
18. A computer system configured as a new machine by a computer program, comprising
means for performing the steps as recited in claim 1.
19. A computer system configured as a new machine by a computer program, comprising
means for performing the steps as recited in claim 11.
20. A computer implemented method for profiling execution of a computer program
having a function, the method comprising the steps of:
a) augmenting the program by adding function profiling code to the function,
said function profiling code, for each particular call to the function when executed,
serving to determine an execution time for the particular call to the function
and to incorporate said execution time into profiling data for the function, said
function profiling code including at least function entry profiling code;
b) executing the augmented program and collecting said profiling data for the
function; and
c) presenting said profiling data to a user.
21. The method of claim 20, wherein the function has a varying execution time
over a plurality of calls when said program is executed, wherein said profiling
data comprises a global minimum execution time for the function and a global maximum
execution time for the function over all calls to the function, and wherein said
step of augmenting the program by adding function profiling code comprises adding
function profiling code to determine the global minimum execution time for the
function and the global maximum execution time for the function.
22. The method of claim 21, wherein substantially all calls to the function can
be identified by one of a set of partial call chains, and wherein said step of
augmenting the program by adding function profiling code comprises adding function
profiling code to determine profiling data indexed by each partial call chain in
said set of partial call chains.
23. The method of claim 22, wherein each said partial call chain identifies a
caller of said function, wherein said profiling data comprises for each said caller
a caller-wise minimum execution time for the function and a caller-wise maximum
execution time for the function over all calls to the function originating said
caller, and wherein said step of augmenting the program by adding function profiling
code comprises adding function profiling code to determine for each said caller
the caller-wise minimum execution time for the function and the caller-wise maximum
execution time for the function.
24. A computer system configured as a new machine by a computer program, comprising
means for performing the steps as recited in claim 20.
25. A computer implemented method for profiling execution of a computer program
having functions, the method comprising the steps of:
a) augmenting the program by performing the following augmenting steps for substantially
all functions of the program:
i) selecting a function;
ii) adding function profiling code to the function, the function profiling code
including;
function entry profiling code added to the beginning of the selected function,
the function entry profiling code including function entry parallel stack simulation
code for maintaining a parallel stack indicating call chains,
function exit profiling code added to the end of the selected function, the function
exit profiling code including function exit parallel stack simulation code for
maintaining the parallel stack, and
function call profiling code for recording dynamic call information obtained
from the parallel stack;
b) executing the augmented program and collecting profiling data for substantially
all executed functions of the program;
c) processing the recorded dynamic call information to create a dynamic call
graph; and
d) presenting the dynamic call graph to a user.
26. A computer system configured as a new machine by a computer program, comprising
means for performing the steps as recited in claim 25.
27. A computer implemented method for profiling execution of a computer program
having functions, the functions having instructions, the method comprising the
steps of:
a) augmenting the program by adding profiling code to the program, including
at least function entry profiling code;
b) executing the augmented program and collecting profiling data for functions
of the program, said profiling data including self+descendants times for the functions;
c) constructing a call graph for the program; and
d) displaying to a user a representation of the call graph including arcs connecting
nodes, wherein each arc links a parent function to a child function and has a width
corresponding to the self+descendants time for the child function.
28. A computer system configured as a new machine by a computer program, comprising
means for performing the steps as recited in claim 27.
29. A computer system configured as a new machine by a computer program, comprising
means for performing the steps of:
a) augmenting the program by adding function profiling code to the program;
b) executing the augmented program and collecting profiling data for functions
of the program, said profiling data including self+descendants times for the functions;
c) constructing a call graph showing function calls for the program;
d) providing a pruning time filter value; and
e) displaying to a user a representation of the call graph including arcs connecting
nodes, wherein each arc links a parent function to a child function, wherein a
set of arcs is automatically pruned from said displayed call graph representation,
wherein each of said set of said automatically pruned arcs connects to a child
function having a self+descendants time less than said pruning time filter value.
Description
BACKGROUND OF THE INVENTION
The present invention relates generally to a method and apparatus for profiling
computer programs. In particular, one aspect of the present invention relates to
a method for inserting additional instructions and data into existing relocatable
object files of a computer program to collect very detailed and accurate profiling
information. Another aspect of the present invention relates to a method for converting
profiling information into image data for communication to a user/developer.
Profiling is a technique commonly used by software developers to gain information
about the operation of their code. This information can then be used to improve
and optimize the code. Profiling is done by processing the program under development
with a profiler. The profiler adds code to the executable file so that it records
various types of statistical data as the program runs. The type of data recorded
varies with the profiler used.
One traditional and widely used profiler under UNIX is known as prof. Object
code files are supplied to prof, which links the object files and adds monitoring
routines to be executed at the beginning and end of the program. The initial routines
set up a program sampler triggered by timing interrupts. The program sampler, which
is invoked with a period of 1/100 of a second, records the value of the program
counter register. The routines added to the end of the program take the saved data
and create an output file showing the time apparently spent in the various routines;
this output can be organized as a histogram.
This information can be useful to a programmer by indicating which routines
consume the most execution time and are therefore the best places to focus on improvement.
prof suffers from a number of limitations: it requires Operating System ("OS")
support for its interrupts; it shows where the execution is at the sampling times,
but does not show "why" (does not identify the call chain that led to current function
being called); and it relies on timed interrupts with a 1/100 second sampling period.
The sampling period causes sampling errors because with today's fast processing
speeds a great many on instructions can be executed in 1/100 of a second; many
routines have total execution times shorter than this. The execution of short routines
can go entirely unnoticed by prof, and sampling errors can cause significant inaccuracies
in the measured execution times even for longer routines. The 1/100 sampling period
could be theoretically shortened, but this has been found to add excessive overhead,
so prof implementations generally do not allow "tuning" of the sampling rate. Furthermore,
because the timing interrupts that trigger the sampling mechanism are not generated
during OS function calls, OS function calls appear "free" under prof, whereas in
actuality they (and therefore the program functions calling them) might be responsible
for a primary portion of the total execution time.
Another commonly used profiler, known as gprof, relies on recompilation of
source code to add more monitoring features than are present in prof. In addition
to the interrupt sampling performed by prof, gprof adds code to the beginning of
each function to record which function called it. A significant problem arises,
however, if the program employs library code for which source files are unavailable.
gprof is then unable to process the functions defined in these files, and their
callers cannot then be recorded. When the execution sequence passes into one of
these unmonitored functions, the call trace is severed. If the unmonitored function
initiates calls that lead back into a monitored function, this calling of the monitored
function will be disconnected from the remainder of the call trace (such situations
are termed "spontaneous function calls").
A simple program, in which start calls main, main calls function A, function A
calls functions B and C, and function B calls function D, might result in gprof
output statistics similar to those shown in Table I. The time column shows in seconds
first the self time for the function, then the self+descendants time, which is
the sum of the self time for the function of concern plus the self times of all
the descendants of the function of concern. Typical gprof output tables would typically
also include indications of how many times each function was called by each of
its caller functions.
| TABLE I |
| |
| routine |
callers |
time |
| |
| start |
|
0.1/101.1 |
| main |
(start) |
1.0/101.0 |
| A |
(main) |
10.0/100.0 |
| B |
(A) |
20.0/60.0 |
| C |
(A) |
30.0/30.0 |
| D |
(B) |
40.0/40.0 |
| |
Typically, a programmer might then manually convert this table into a
graph, such as shown in FIG. 1A, but this process is very complicated and painstaking
for larger programs. To improve the ease of use of gprof output, some postprocessors
are now available that can process the source code for a program to produce a static
call graph and then overlay the gprof statistics on top of the static call graph.
(In prior art systems the gprof statistics have been shown either by histograms
next to each graph node, or by color coding of graph nodes.) The need for source
code, however, is a serious drawback to such postprocessing, and significantly
limits its usefulness.
Further limitations to the usefulness of gprof include those stemming from
the use of timing interrupts as discussed above with reference to prof (sampling
errors and inability to track OS calls), and poor handling of multiply called functions.
For instance, suppose that in the above example, function D was called by both
functions B and C, in which case the call graph would be as illustrated in FIG.
1B. The self time for function D would then have to be split up for allocation
between the self+descendants times for functions B and C.
To make this allocation gprof relies upon an assumption of an average case distribution.
That is, gprof assumes that all calls to a particular function require the same
amount of time for completion. The self time for function D would then be allocated
to the self+descendants times for functions B and C proportionally to the number
of times that each of them called function D. However, because the arguments passed
to a function can drastically effect the amount of time required to complete the
function call, the average case distribution assumption employed by gprof can result
in significant inaccuracies in reported times.
For example, if functions B and C both called function D 40 times, and the calls
originated by function B consumed three fourths of function D's time (with the
calls originated by function C consuming the remaining one fourth), the correct
profiling statistics would be as shown in Table II. Because of its average case
distribution assumption, however, gprof would allocate function D's time equally
between functions B and C, and would prepare inaccurate profiling statistics as
shown in Table III.
| TABLE II |
| |
| routine |
callers |
time |
| |
| start |
|
0.1/101.1 |
| main |
(start) |
1.0/101.0 |
| A |
(main) |
10.0/100.0 |
| B |
(A) |
20.0/50.0 |
| C |
(A) |
30.0/40.0 |
| D |
(B) |
40.0/40.0 |
| |
| TABLE III |
| |
| routine |
callers |
time |
| |
| start |
|
0.1/101.1 |
| main |
(start) |
1.0/101.0 |
| A |
(main) |
10.0/100.0 |
| B |
(A) |
20.0/40.0 |
| C |
(A) |
30.0/50.0 |
| D |
(B) |
40.0/40.0 |
| |
Like prof, gprof concentrates on zero order profiling statistics. That is, timing
statistics for a particular function are recorded with that function as the only
reference point. gprof does not record any first order or other higher order timing
distribution statistics, that is, timing statistics regarding a particular first
function having been called by a particular second function. The only rudimentary
first order statistics of any kind recorded by gprof are the number of times a
called function was called by each of its caller functions. Because gprof cannot
record first order timing statistics, however, gprof is forced to use the average
case distribution assumption to calculate self+descendants time in many situations,
as described above.
An improved profiler that remedies the deficiencies of prior art profiling techniques
is clearly desirable.
SUMMARY OF THE INVENTION
According to the present invention an improved method and apparatus are
provided for performing execution profiling. A first aspect of the invention is
directed to a method for equipping a program for execution profiling that comprises
the step of preprocessing the object code files of the program so as to add profiling
monitoring code to the beginning of all or substantially all functions, without
needing source code, so that profiling information is available for the entire
execution of the program.
According to one facet of this first aspect of the invention, timing information
is collected by forming tallies of the number of cycles required to execute the
profiled instructions. The preprocessing includes, for each function, the steps
of grouping the function's instructions into basic blocks, counting the number
of cycles required to execute the instructions of the basic block, and inserting
special monitoring code with the basic block. (Later, when the target program is
run, the special monitoring code is executed each time the basic block is executed,
and updates the profiling information to reflect the number of cycles required
to execute the basic block. In some embodiments the initial profiling code for
a function also records at least a partial call chain, indicating where the function
was called from, to enable compiling second order or higher order timing statistics.)
According to another facet of this first aspect of the invention, special
handling is provided for profiling calls to the Operating System (OS). OS calls
can be categorized into three groups: (1) calls that may require an indeterminate
amount of time ("indeterminate length"); (2) calls that substantially always require
a certain, fixed amount of time ("fixed determinate length"); and (3) calls that
substantially always require an amount of time determined by identifiable program
status constraints, but which therefore vary according to program status ("variably
determinate length"). The preprocessing step includes inserting code before and
after at least indeterminate length OS calls to start a timing procedure before
the OS call, end the timing procedure after the OS call, determine the elapsed
time and then incorporate this into the profiling information. In some embodiments,
the time required for fixed determinate length OS calls is determined during preprocessing
and then tracked during execution in the same manner as for basic blocks, described
above. In preferred embodiments, the preprocessing step includes adding simulation
code to profile variably determinate length OS calls. The simulation code models
or tracks the program status constraints that determine how long the OS call will
require. For each variably determinate length OS call, code is added that evaluates
the simulated program status constraints, determines from this the time required
for the OS call, and updates the profiling information accordingly.
According to yet another facet of this first aspect of the invention, the
preprocessing step includes adding terminal profiling code at the end of each function
of the program. The terminal profiling code for a called function passes profiling
information, including the self-plus-descendants time for the called function,
back to the caller, so the profiling code for the caller function can accurately
maintain self-plus-descendants time without the need for inaccurate distribution
assumptions such as the average case distribution assumed by the prior art.
A second basic aspect of the present invention is directed to a method of converting
profiling timing information into useful image data for human users. A first facet
of this aspect of the invention is directed to using the detailed profiling information
gathered as discussed above to automatically create a dynamic call graph, rather
than having to process source code and simply producing a static call graph.
A second facet of this aspect of the invention results from discovery of the
fact
that human users have a superior intuitive understanding of much of the basic timing
statistics, especially the self+descendants time, as a "river of time," in which
the total amount of time used by the program is allocated to, or "passes through"
the main function, after which it splits into main "tributaries" for the primary
routines called from the main function, with these main tributaries splitting into
smaller tributaries for the secondary routines called by the primary routines,
etc. According to this facet of the invention, therefore, visual presentations
of profiling data are constructed so as to correspond to the human user's "river
of time" intuitive understanding of the data. A method of converting profiling
information into a call graph according to this facet of the invention comprises,
for each arc in the graph connecting a calling-function/parent-node to a called-function/child
node, the steps of converting the self+descendants time for the called function
into a width measure, and displaying the call graph with each arc having a display
width corresponding to the width measure determined for that arc. In a preferred
embodiment, the display width of an arc is computed to be logarithmically proportional
to the self+descendants time for the called function.
A further understanding of the nature and advantages of the invention may be
realized
by reference to the remaining portions of the specification and the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a call graph with profiling information for a simple program.
FIG. 2 is a block diagram showing a relocatable object file being expanded by
a profiling expansion means according to an aspect of the invention into a new
relocatable object file.
FIG. 3 is a flowchart illustrating a general procedure for implementing the
profiling scheme of the present invention by modifying the object files for an
executable program.
FIG. 4 is a flowchart illustrating the general process of adding to a function
the profiling code of the preferred embodiment.
FIG. 5 illustrates a flowchart showing details of the process of adding time
profiling code to each basic block, according to a preferred embodiment of the
present invention.
FIGS. 6A-6E illustrate the operation of various types of profiling code during
execution of the target application.
FIG. 7 shows a view selection window of a particular embodiment of the present
invention that allows a user to select various view windows for displaying data.
FIG. 8 shows a reports display window of a particular embodiment of the present invention.
FIG. 9 shows a coverage display window of a particular embodiment of the present invention.
FIG. 10 shows a function detail display window of a particular embodiment of
the present invention.
FIGS. 11A-D illustrate function detail windows for functions A, B, C, and D
(of the program described with reference to FIG. 1B and Table II), respectively.
FIGS. 12A-C show call graph windows of a particular embodiment of the present invention.
FIG. 13 shows a flow chart illustrating the process of automatically time filtering
functions for display according to an aspect of the present invention.
FIGS. 14A-C show examples of a call graph displayed according to this aspect
of the preferred embodiment of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The preferred embodiment of the present invention employs numerous novel techniques
for providing enhanced profiling capabilities to a user. These techniques improve
all three major prongs of the profiling process: preprocessing an application to
add profiling code, executing the augmented application to collect profiling information,
and then presenting the collected information to the user.
I. Augmenting the Application and Collecting Profiling Data
One aspect of the present invention is directed to a method for equipping a program
for execution profiling by preprocessing the object code files of the program so
as to add profiling monitoring code to all or substantially all functions of the
program. The detailed analysis and preprocessing made possible by object code insertion
improves both the profiling coverage of the program and the profiling accuracy.
Object code insertion also enables the use of a technique according to another
aspect of the invention, which is the maintenance of profiling timing information
according to the number of CPU cycles required to perform the instructions of the
application routines. The execution cycles for most instructions are determined
during preprocessing; for other instructions, profiling code is added to dynamically
determine their timing during execution.
A. Profiling by Object Code Insertion
In the preferred embodiment, the insertion of additional code is performed in
the manner taught by U.S. Pat. No. 5,193,180, which is herein incorporated by reference
in its entirety, for all purposes, and with special reference to its teachings
regarding the implementation of monitoring schemes. The particular embodiment described
herebelow is intended for a SUN SPARC station, so the description is particular
in some respects to that system.
FIG. 2 illustrates a preexisting object code file
1 ("oldfile.o") being
augmented by profiling expansion means
5 to form a new object code file
1′ ("newfile.o"). In the preferred embodiment, profiling expansion
means
5 is a general purpose computer having a memory and operating under
the control of a computer program. In the this embodiment, all object files of
the target application program are processed by a profiling program that uses the
above expansion means to insert profiling monitoring code to each function of the
program. All of the object code files for an executable program are processed,
and instructions are added to implement the profiling scheme described below when
the application program is executed. Because source code is not needed, all of
the files and functions of the program can be monitored. This provides complete
profiling coverage of the program, rather than only partial coverage where source
code was unavailable (as with the prior art).
The general procedure of implementing a profiling scheme to have full coverage
for an executable program, by modifying all of the object code files for the executable
program, linking the modified program and then running it, is illustrated in FIG.
3. A first object file or library for the executable program is selected in block
100. If the file is determined to be a simple object file rather than a
library, in block
110, then the object file is processed in block
120
to implement the profiling, by the expansion process described in the above referenced
U.S. Pat. No. 5,193,180. If the file is determined to be a library in block
110,
then each object file that contributes to the library is processed in block
130,
in the same manner that a simple object file is processed by block
120.
Then, in block
140, the library is rebuilt from the modified object files.
After the object file or library has been processed, block
150 determines
if any unprocessed files remain for the executable file. If so, block
160
selects an unprocessed file and then the steps of blocks
110-
150
are repeated. Once it is determined in block
150 that all files for the
original executable program have been processed, all necessary linkage is performed
in block
170. In the preferred embodiment this linkage includes linkage
to a special library of profiling routines. These profiling routines include a
program start routine and a program exit routine that are linked to execute at
the start and finish, respectively, of the program. Also included in the profiling
library are function entry and exit profiling routines, and various other timing
routines described below. After linkage, the program is executed in block
180;
during this execution, the desired profiling is performed.
B. Preprocessing for Cycle-Counting Profiling
In the preferred embodiment, profiling expansion means
5 is provided with
detailed information regarding the computer system on which the profiled program
is to be run. This information includes the type of CPU employed by the computer
system, the clock speed or clock cycle period of the CPU, the cache architecture,
the presence of coprocessors, and any other information necessary to determine
the number of clock cycles required to execute the instructions in the instruction
set for that CPU.
Profiling initialization and profiling termination routines are added so
as to be executed at the beginning and end, respectively, of the program. Also,
every function in every object file for the application program is processed to
have added to it the profiling code of the present invention. The general process
of adding to a function the profiling code of the preferred embodiment is illustrated
by the flowchart of FIG.
4. Step
230 of FIG. 4, the step of adding
basic block profiling code, is illustrated in further detail in FIG.
5.
The operation of the profiling code is explained primarily with reference to FIG.
6.
In the process of FIG. 4, first, at step
200, the instructions of the
function
are parsed into basic blocks. A basic block is a contiguous sequence of instructions
that has no branches into or out of its body except as follows: a basic block begins
with an instruction that is a program control transfer destination-either the destination
of a branch (including the default or fall-through destination of a conditional
branch) or the destination of a call instruction, and ends with a program control
transfer instruction-either a branch or exit instruction. If the computer architecture
entails delay slots following branches, such delay slots are considered part of
the basic block of the branch statement they follow. The smallest possible basic
block is two instructions: the destination of a branch/call instruction and that
is itself a branch/exit instruction, followed by its delay slot. In some embodiments
a call instruction, which is akin to a branch followed by a return, is treated
as a basic block terminator, although in the preferred embodiment call statements
do not terminate basic blocks.
The usefulness of parsing the code into basic blocks is that each basic block
is always executed as a unit: if the first instruction is executed then the entire
block is executed; if the first instruction is not, then the remainder of the block
is not. After parsing the function code into basic blocks, each of the basic blocks
is analyzed at step
210, as described further below. Next, at step
220,
function entry profiling code, basic block profiling code, and function exit profiling
code are added to the function as determined according to the above analysis. In
the preferred embodiment the function entry and exit profiling codes take the form
of calls to special routines in a profiler library, which are added to the beginning
of the function and to any basic blocks ending in a function exit instruction,
respectively. It should be noted that the parsing, analysis, and code insertion
steps can be intermixed.
The process of profiling code to each basic block is illustrated in further detail
in FIG.
5. At step
300, basic block entry code is added to the start
of the basic block. (Later, during execution of the application program, this basic
block entry code will signal another entry into the basic block, by incrementing
an executions accumulator, which tallies how many times the basic block was executed).
In the preferred embodiment step
300 is only performed if source code is
available for the function and if the source code has been compiled for debugging,
because this particular profiling information is only used in the show-coverage
display of source code, described below. The profiling code determines whether
source code is available by looking for debugging symbols provided by the compiler.
Next, at step
310, for all of the instructions requiring a fixed number
of clock cycles to execute, these cycles are tallied, including the delay slot
for the terminating branch. This step includes an architecture-specific analysis
of the instructions to determine the required number of cycles for execution. In
the preferred embodiment, this analysis includes identifying stalls, such as caused
by loads and moves, and accounting for the amount of time required, including the
stall. (On many systems, the instruction following a load or move will be executed
when the load has been initiated, without waiting for it to be completed. An exception
occurs when the instruction following the load requires the use of the destination
address or register of the load instruction, in which case a stall occurs until
the load has been completed.)
After the fixed instruction cycles for the basic block have been tallied, fixed
cycle profiling code is added to the basic block at step
320. (Later, during
execution of the application program, each time the basic block is executed the
fixed cycle profiling code will increment the self time accumulator for the current
function by the determined number of cycles.) In the preferred embodiment, fixed
length Operating System (OS) calls, discussed further below, are treated in steps
310 and
320 just like other fixed execution length instructions and
charged to the self time of the current function. In alternative embodiments, however,
they are tracked as independent functions and their execution time is not tallied
in step
310. Instead, step
320 includes adding code to update self
time accumulators for the OS call and to charge the execution time for the OS call
to the self+descendants accumulator for the present function. Similarly, in the
preferred embodiment other OS calls, discussed below, are charged against the self
time of the current function, but could instead be tracked as independent functions.
After step
320, at step
330 any desired simulation profiling
code is added to the basic block. This is generally added only to basic blocks
containing any variably determinate length instructions, but the completeness of
the simulation may require that simulation code be added even to some basic blocks
that do not contain any variably determinate length instructions. Finally, at step
340, if the basic block includes an indeterminate length instruction, the
indeterminate length instruction is wrapped in timing profiling code. Variably
determinate length instructions and indeterminate length instructions are described
in further detail below, in the discussion regarding OS calls.
By performing execution timing according to information obtained during preprocessing,
profilers according to the present invention essentially create virtual cycle counters.
That is, they generally do not count the actual number of cycles taken by the program
during the profiling run, but rather, most cycles are counted according to what
the preprocessing determines the unaltered program would take in the absence of
additional profiling code.
C. Execution of Cycle-Counting Profiling Code
FIGS. 6A-6E illustrate the operation of various types of profiling code during
execution of the target application at step
180 of FIG.
3. The two
main data structures relied on by the profiling code are individual function accumulators,
and a simulated parallel call stack.
In some embodiments, the function accumulators are statically linked into the
application code, substantially adding to the symbol table. In the preferred embodiment,
however, function accumulators are located dynamically, as discussed below. Each
function accumulator has a set of primary profiling registers for the overall function,
and an array of registers for maintaining information on the usage distribution
for basic blocks (including the number of times "hit" and the total number of cycles
consumed). The primary registers of the function accumulator track 1) the number
of calls to the function; 2) the minimum and maximum self times for the function,
over all calls; 3) the sum of the self times for the function, over all calls;
4) the sum of the self+descendants times for the function, over all calls; and
5) the head of a linked list of caller records for all callers of the function.
The min and max self times are particularly helpful when profiling functions that
undergo substantial initialization the first time they are called. The primary
registers optionally also track the number of register window spills (see below)
and the aggregate amount of time spent in them, and the number of and aggregate
time for OS calls.
Each of the caller records contains an identification of the caller, a cumulative
counter for the number of times the current function has been called by the particular
caller function, and a cumulative counter for the total number of self+descendants
cycles from the current function that are attributable to the particular caller
function. Note that the caller identification can simply be a pointer to the caller's
accumulator. In some embodiments the caller records also track min and max self
times for the current function when called by that particular caller function.
These caller records serve to track profiling data by a call chain one call deep
(the caller function); more extensive caller records can be maintained and indexed
by longer call chains so as to provide higher order timing statistics.
The parallel stack serves both as a source for local counters that are used to
store the timing statistics for the each extant instantiation of each function,
and also as a call chain structure to allow for the proper attribution of self+descendants
time. Each frame on the parallel stack identifies the corresponding function, contains
or points to the local variables tracking the that particular function call, and,
so as to properly profile "tail call" optimized code, identifies the exit point
for the function. This latter tracking can often cause the simulated parallel stack
to differ substantially from the actual stack.
Tail call optimization is a technique allowable when function A calls function
B, and function B as its last instruction calls function C and returns this value
to function A. Under a standard compiler, the call to function C would result in
another frame being pushed on the stack, and the return from function C would be
followed by an immediate return with the same value from function B to function
A. In order to minimize the depth of the stack, in call chain optimization the
call to function C would result in function B essentially giving its position on
the stack over to function C, so that at the return from function C, the immediately
preceding function on the stack, to which function C would return, would be function
A (bypassing function B). If the parallel stack were also maintained in this manner
the correct profiling information could not be attributed to function B. So, according
to an aspect of the present invention, the frame for function B would remain on
the parallel stack, but the frame for function C would point to function A as its
exit point.
In some embodiments, frames on the parallel call stack simply identify each calling
function by name, or by some other identifier for the function as a whole. In other
embodiments, however, frames on the parallel stack frame also identify call site.
In these embodiments, if function A has two distinct call instructions to function
B, each of these calls would be distinguished on the parallel stack frame, and
would be distinguished in the tracked profiling data as well. For example, the
execution times for function B would get rolled up to function A in the usual manner,
but presentations of the allocation of function B's times to its various callers
would have one allocation to the first call site in function A and a separate allocation
to the second call site in function A.
When presenting information about a call site for which a linker symbol is unavailable,
such as in the above case with the two call sites within function A, the call site
will be labeled as an offset from the closest previous location for which a symbol
is available. For example, the first of the above mentioned call sites in function
A might be identified as "A+20", and the second call site might be identified as "A+65".
1) The Function Entry Profiling Routine
FIG. 6A illustrates the operation of the function entry profiling routine. First,
at step
400 the location of the function accumulator is looked up in a global
table. The first time the function is run, the accumulator location table entry
is initialized by executing a set of instructions created during the preprocessing
step. These instructions are
| |
|
| |
sethi |
V1, %g3 |
| |
jmpl |
[%g4 + FNOFFSET], %g2 |
| |
or |
V2, %g3 |
| |
|
During the preprocessing step the position of the accumulator relative to
the function offset was fixed and the values V
1 and V
2 were determined
so that the first and third instructions would cause g
3 to contain the location
of the accumulator relative to the jump
1 statement. The second statement
produces an object code having a constant value, and is thus used as a flag that
can be searched for either to discover the accumulator locating instructions, or
to determine the existence of the function. This latter aspect can be used for
counting and reporting the total number of functions in the program, whether executed
or not, and whether or not they are assigned symbols in the symbol table.
The second step of the function entry code is to initialize a new frame on the
parallel stack at step
401, and initialize the local counters, which include
self time and self+descendants time. Next, at step
402, the caller record
is updated. This step includes the substeps of first determining the caller from
the parallel call stack, and then examining the caller linked list from the current
function accumulator to see if there is already an entry. If not, the an entry
is created. Then, the number of times hit counter for the caller record is updated.
Next, at step
402, the return point for the current function is recorded.
Finally, at step
403, the global number of times hit counter for the current
function is updated.
2) The Function Exit Profiling Routine
The operation of the function exit profiling code is illustrated in FIG.
6B.
When executed, this routine first updates the current function accumulator according
to the local counts, as indicated by step
410. Next, at step
411,
the caller record and function accumulator for the call function indicated at the
end of the stack are updated according to the current local counts. This updating
includes updating self+descendants time for the caller function. Following step
411, the parallel stack is rolled back to the next stack frame at step
412,
and at step
413 the new stack end frame is compared to the recorded exit
point. If not equal, steps
411-
413 are repeated. The self+descendants
times are thus "rolled up" from the lower level functions to the higher. So, the
profiling code for the caller function can accurately maintain self-plus-descendants
time without the need for inaccurate distribution assumptions such as the average
case distribution assumed by the prior art.
3) Basic Block Timing
FIG. 6C illustrates the steps performed by the fixed-cycle profiling code. These
steps are step
420 of adding the number of fixed cycles for the basic block
(previously determined during preprocessing) to the self time and to the self+descendants
time for the current function, and step