A Survey of Data Breakpoint and Reverse Execution
Lap Chung Lam
lclam@cs.sunysb.edu
ABSTRACT
A debugger is an essential tool for software development because debugging accounts for 80% of the overall software development time. However, relatively less research is done on debuggers as compared to other aspects of programming tools. Software developers mostly rely on control breakpoints and single-stepping as the only methods of debugging. Under many circumstances, it is very difficult, time-consuming, and frustrating to use control breakpoints and single-stepping to track down the bugs that result from the corruption of program data. In this survey we study two attractive but hard to implement debugging techniques, data breakpoint and reverse execution, which promise to make the debugging process simpler and easier. Though data breakpoint and reverse execution can greatly facilitate the process of debugging computer programs, both methods pose serious implementation challenges which prevent their practical usage in software debugging. For the past ten years, some researches have tried to overcome these challenges of supporting data breakpoint and reverse execution. We studies and summarizes the past research on these issues, and also propose our own ideas of supporting data breakpoints and reverse execution. This survey focuses on single-process and single-thread interactive source-level debuggers such as GDB.
1 Introduction
Software debugging is a critical aspect of the software development cycle; it is the process by which software developers attempt to remove software defects called bugs from a computer program being developed. A software debugger is the indispensable tool that allows software developers to look into their programs closely to find and remove bugs. Without software debuggers, the programmers' lives would be a nightmare. The importance of software debuggers can not be overstated because it is not uncommon for the debugging phase of a software development project to take 60-70% of the overall development time. In fact, debugging is responsible for 80% of all software project overruns [KO]. Back to 15 years ago, though the computer programs were much simpler and smaller in code size than the programs today, one study of experienced programmers in 1985 showed that 78% of the programmers had bugs that took longer than one day to find, and 34% had bugs that took more than a month to find [BR85].
In the early computer age, software developers inserted print instructions into their programs to print out variable values for detecting bugs. Although print-statements are a very helpful technique for software debugging, they require a considerable amount of programmer time and effort. What application developers need is a tool, which can help them to find bugs more easily without spending much time and efforts. Thus software debuggers were born.
The purpose of debuggers is to allow software developers or maintainers to see what is going on "inside" another program, called the debuggee, while it executes or what the debuggee was doing at the moment it crashed. The basic functions of a debugger are starting and stopping the execution of the debuggee, printing the variable values, modifying the debuggee's state, and tracing the debuggee's stacks when the program is stopped.
Although debuggers are critical tools for software development, there were very few studies on them as compared to other aspects of computer science. They are very difficult tools to build because they depend heavily on the support of operating system and hardware [RO96]. For the past ten years, many aspects of computer science have been developed rapidly. The structures and algorithms of computer programs have become more and more complicated and the program size has become larger and larger. However, The most significant revolution of debugging technologies is the evolution from the text-base user interface to the multi-window menu-driven graphical user interface. Modern debuggers also incorporate an efficient representation of the program's data in memory. However, the underlining basic technologies for debugging still heavily rely on post-mortem dumping and control breakpoint.
We study two powerful but hard-to-implement debugging techniques, data breakpoint and reverse execution, which promise to make the debugging process simpler and easier. Though data breakpoint and reverse execution can greatly improve the efficiency of debugging computer programs, both pose serious implementation and performance problems, which prevent their practical use in software debugging. For the past ten years, some researches have tried to address these problems. The goal of this survey is to study and summarize the past researches on these two problems. This survey focuses on single-process and single-thread source level debuggers such as GDB.
The remainder of this section is going to explain how programmers debug their programs, and why we need data breakpoint and reverse execution. Section 2 explains the basic software architecture of a debugger. Section 3 describes the problems associated with implementing data breakpoint, the current implementation methods, their advantages, disadvantages, and performance. Section 4 presents the problems associated with reverse execution, the current implementation methods, their advantages, disadvantages, and performance. Section 5 proposes our own idea to improve existing debugger implementation to make debuggers more practical and usable. Finally, section 6 concludes the survey.
1.1 How do programmers debug?
The basic popular debugging method is control breakpoint that allows a programmer to specify a location in a program where he/she can stop the execution of the program so that he/she can examine the variables and the call stack of the program. When a running program crashes or exhibits incorrect behavior, immediately, the first question comes to the programmer's mind is: which part of my program causes this trouble? and when did this happen? This requires the programmer to think backward in time and look backward in the program execution to locate the defects. The only information at the moment the program crashed or exhibited incorrect behavior is the program source code, input data, and output data. The only thing the programmer can do is to look at the program source code, analyzes the input data and the output data, and then guesses where in the program causes the trouble, places a control breakpoint there, and re-executes the program from the beginning. When the program stops at the control breakpoint, the programmer examines the state of the program and tries to identify the bug. If the programmer's guess is wrong, then the programmer has to guess again, set a new control breakpoint and re-execute the program. The programmer repeats this process again and again until successfully pinpointing the location of the bug. The other method is single-stepping. When the debuggee stops at a breakpoint, the user can single-step the debuggee instruction by instruction or statement by statement and examine the debuggee's state after each step.
1.2 Why do we need data breakpoint?
Sometimes it is very difficult, time-consuming, and frustrating to track down the bugs that result from the corruption of program data by using the iterative process of guessing, setting control breakpoints, and reexecution. The bugs are caused by unintended modification of data object in memory. This results from the careless use of pointers. The difficulty of finding those bugs stems from the fact that the symptom occurs at a subsequent read of an illegally modified location, which may be arbitrarily far away from the location where the illegal modification occurs. Data breakpoints or watchpoints can be used to reduce the complexity of finding such kinds of bugs. Data breakpoint stops the execution of a program when certain data object is modified, or modified to certain value (conditional data breakpoint), so that the users can more easily identify the point where programming error first arises.
To illustrate how easy to identify the bugs caused by careless use of pointers, consider the program in listing 1. If we compile and run this program, the program will crash at line 2000. The true culprit is at line 1000 but the problem is not noticed until a long time after the erroneous modification was made. Assume we did not notice the warning message when we compiled the program and assume the program was not written by us (i.e., we do not know about the source code so well). Under such situation, it is difficult for us to guess where the problem is. The only obvious information we can get is the value of the variable cptr at the time when the program crashed. Without data breakpoint, we can only place a control breakpoint between line 10 and line 2000, and examine the value of cptr when the program stops at the control breakpoint. We can repeat this process may times until we reach line 1000. This could be a very long debugging process. Assume the value of the variable cptr is 64 at the time the program crashed. By using data breakpoint, we can tell the debugger to stop our program when ctpr==64. Immediately, we can identify that the problem is at line 1000.
Listing 1: A program with incorrect usage of pointer
![]()
. . .
line 10: char c;
line 11: char *cptr;
. . .
line 1000: cptr = c;
. . .
line 2000: printf("%c\n", *cptr);
. . .
![]()
Frequently programmers need to read and understand the program source codes written by other programmers. If the program source code size is huge, sometimes it is difficult or impossible to track down when, where, and how an important variable is initialized or used. It would be convenient that if the programmers can run the program under a debugger, and tell the debugger to stop the program if the variable of interest is referenced. This requires the debugger support the feature of efficient data breakpoints.
Though many current debuggers like GDB also include the data breakpoint functionality, the current implementation of data breakpoint is too slow to use. The GDB document mentions that data breakpoints may be implemented in software or hardware, depending on the systems. GDB does software watchpointing by single-stepping the program and testing the variables' value each time. This approach leads to hundreds of times slow-down compared to the normal execution[Sp99]. To make data breakpoints useful debugging primitive, data breakpoints have to be fast. In session 3 we study current implementations of data breakpoints.
1.3 Why do we need reverse execution?
Repeating execution from the beginning sometimes is not productive or impossible for many reasons. One reason is that some programs run for a long lime. Another reason is that two executions of the same program may not be identical because the outside environment for these two runs are different. Ideally, a debugger should allow the programmer to "scroll" forward and backward through an execution in the same way one can scroll through a text file [PL89]. If a debugger provides such a reverse execution feature, then the users do not need to re-execute their program if they need to go back to a particular point in time.
It is quite common that a programmer single-steps over a statement or single-step into a function by accident when he/she is debugging programs using single-stepping. Without reverse execution, the programmer has to restart the program from the beginning. The reverse execution with single-stepping backward allows the program to just go back to the previous statement or to step out the function. Unfortunately, many current implementations of reverse execution do not support single-stepping backward.
1.4 Why do we talk about data breakpoint and reverse execution in the same context?
The purpose of this survey is to study existing researches on data breakpoints and reverse execution. There are several reasons why we study the both techniques together. The main reason is that the implementations of both techniques rely on virtual memory system support or code instrumentation (code patching). Though most researches on reverse execution focus on parallel programs, this survey only focuses on the implementation technologies for single-process and single-thread debuggers.
Both data breakpoints and reverse execution need to deal with the contents of the program address space. In the case of data breakpoints, every time a memory location is modified, the debugger has to check if a data breakpoint is involved or the associated condition is met. In the case of reverse execution, whenever a memory location is read or write-referenced, or a page has been modified, some kind of logging is required so that the address space contents in the past can be reconstructed latter during the reverse execution. The CPU does not notify the debugger about these memory references. General implementations of both reverse execution and data breakpoints exploit the virtual memory mechanism to monitor the memory references. Those implementations turn off the read or write access permission of the data pages. When those pages are referenced, a trap is generated and the debugger is notified. More advanced implementations instrument the program to monitor read or write operations. Because the underling implementation of these two features are so similar, it is possible that two features can be implemented in a single framework.
2 Basic structure and implementation of a debugger
The fundamental tasks a debugger must perform are:
set control breakpoints or data breakpoints
start the debuggee
let users examine the state of the debuggee, and modify its memory contents.
single-step the debuggee
continue the suspended debuggee
terminate the debuggee
There are two kinds of software debugger. The first one is same-process debugger, and the second one is separate-process debugger. As their names suggest, a same-process debugger is one that executes in the same process context as the debuggee, and a separate-process debugger is one that executes as a separate process. Both approaches have advantages and limitations as far as debugging support is concerned. This section first focuses on the implementation issue of a separate-process debugger, and then gives a brief description on advantages and disadvantages of same-process debugger.
2.1 Separate-process debugger
For a separate-process debugger, the debugger and the debuggee are two separate user level processes. The debugger process is responsible for controlling the execution of the debuggee process, examining and changing its core image. Since the debugger is responsible for reading and writing the debuggee process' memory, starting and stopping the debuggee as requested by the user, this requires the support from the kernel. Almost every operating system provides some debugging support in the form of special system calls that are called debugging API. In this section, we are going to explain the basic algorithm and structure of a separate-process debugger, and their implementations.
All debuggers are implemented as a single big loop. They stay inside the loop waiting for some debugging events to happen, and then respond to the events. They repeat this process until they are terminated. The debugging events come from both the user and the debuggee. The events from the user are commands such as setting breakpoints, single-stepping, reading and writing debuggee memory space, and resuming the suspended debuggee. Before the debuggee runs or when the debuggee stops, the debugger simply waits and prompts the user for commands. When the debuggee is running, the debugger waits for the notification events from the kernel in the form of signals. If something happens, such as the debuggee performs a illegal operation or executes a breakpoint instruction, the kernel stops the debuggee and notifies the debugger. When this happens, the debugger has to examine the debuggee's state, inform the user what has just happened, and then waits for the commands from the user. The basic debugging algorithm works as follows:
the debugger fork()s a child process.
the child process executes the debuggee's executable binary and suspends the debuggee at the first instruction.
the debugger waits for the commands from the user. At this step, the users can set breakpoints and example the state of the debuggee.
the debugger executes the commands from the user. If the command is run, single-stepping, or continue, the debugger continues the execution of the debuggee.
the debugger waits for notification events from the kernel.
the debugger handles the notification events from the kernel and goes back to step 3.
Almost all debugging APIs for implementing debuggers, simple or complex, provide debuggers a way to control the debuggee, such as starting and stopping the debuggee, reading and writing its memory, and modify its state. On some UNIX systems, ptrace() system call is used to support software debugging, and on some new versions of UNIX systems, the /proc files system is used. Win32 debugging API is used by the Microsoft Windows OS.
2.1.1 induction to the ptrace() debugging API
Ptrace() is a system call which provides a means by which a debugger process may control the execution of a debuggee process, by examining and changing the debuggee's core image. Its function prototype takes the following form:
int ptrace(int request, int pid, int addr, int data);
The value of the request argument determines the precise action of the system call. Tables 1 shows all possible values of the request argument and the actions associated with them. The pid argument is the process id of the debuggee. The addr argument is a 32 bit address of the debuggee. If ptrace() is used to write the debuggee address space or registers, the data argument contains the data the caller wants to write to the addr location.
Table 1 some interesting ptrace request values and their actions
|
Ptrace Request Value |
Action |
|---|---|
|
PTRACE_TRACEME |
After the debugger fork()s a child process and before the child process exec()s the debuggee, the child process has to call ptrace(PTRACE_TRACEME, 0, 0, 0) to inform the kernel that it is going to be traced by its parent. The parent should be expected to trace the child. Then the child calls exec() to execute the debuggee, and suspends after the exec() system call. |
|
PTRACE_PEEKTEXT
|
The debugger can call ptrace() with this request to read a single word from 'addr' in the debuggee's code address space. The 'data' parameter is ignored. |
|
PTRACE_PEEKDATA |
The debugger reads a single word at location 'addr' in the debuggee's data address space. The 'data' parameter is ignored. |
|
PTRACE_POKETEXT
|
The debugger writes the 'data' to the location 'addr' in the debuggee's code address space. |
|
PTRACE-POKEDATA |
The debugger writes the 'data' to the location 'addr' in the debuggee's data address space. |
|
PTRACE_PEEKUSR |
Read the value of the debuggee's general register's value specified by 'addr'. |
|
PTRACE_POKEUSR |
Write 'data' to the general register specified by 'addr' from the debuggee. |
|
PTRACE_CONT |
The debuggee's execution is continued from wherever it stopped previously. The program counter can be reset at the same time optionally. |
|
PTRACE_KILL |
Send the child a SIGKILL to terminate it. |
|
PTRACE_SINGLESTEP |
Set the trap flag for single stepping. The debuggee will execute one instruction and stop. |
|
PTRACE_ATTACH |
Attach to the process specified in pid. |
|
PTRACE_DETACH |
Detach a process that was previously attached. |
Listing 2 shows a basic implementation of a debugger using ptrace(). In order to debug a debuggee, the debugger fork()s a child to execute the debuggee program. Before the child executes the debuggee program, the child has to inform the kernel that it will be debugged by its parent with the ptrace() call using the PTRACE_TRACEME request at the line number 10. After the child successfully executes the debuggee program, the kernel suspends the debuggee at the first instruction and lets the parent decide what to do with the debuggee and when to run the debuggee. The reason to suspend the debuggee before it runs is that the parent can set breakpoints on the debuggee. After the parent fork()s the child, it enters a loop to prompt the users for commands and handle the commands. The code at line number 17 reads a command from the user and the switch statement at line number 18 handles the commands. The user can take this chance to set breakpoints on the debuggee and then tell the debugger to run the debuggee. When the debuggee is running, the parent just wait()s for the notification events about the debuggee from the kernel as at line number 50. If something happens to the debuggee, such as segmentation fault, divide by 0, and breakpoint trap, the kernel stops the debuggee and notifies the debugger; the debugger has to check what has happened to the debuggee and inform the user accordingly as at line number 51, and then goes back to the beginning of the loop to wait for the user's command. At this state, the user can examine and modify the debuggee's core image through the debugger.
Listing 2: Basic implementation of a debugger using ptrace
![]()
int inf_status;
int command;
pid = fork();
if(pid < 0)
handle_error("fork");
if(pid == 0){ // I am the child
ptrace(PTRACE_TRACEME, 0, 0, 0); /* tell os that I will be debugged by my parent */
exec("program_to_be_debugged"); /* execute the debuggee */
}
// I am the parent
for(; ;){
print_prompt("dbg>");
command = read_command();
switch(command){
case RUN:
case CONT:
ptrace(PTRACE_CONT, pid, 0, 0); // continue the debuggee
break;
case SINGLESTEPPING:
ptrace(PTRACE_SINGLESTEP, pid, 1, 0);
break;
case SETBREAKPOINT:
/* use ptrace with PTRACE__PEEKTEXT to read the bytes
at the breakpoint location and save the bytes somewhere.
Use ptrace with PTRACE_POKETEXT to write a breakpoint
trap instruction at the breakpoint location. */
break;
case REMOVEBREAKPOINT:
/* use ptrace with PTRACE_POKETEXT to write the original
bytes back to the breakpoint location. */
break;
case READVARIABE:
/* use PTRACE+PEEKDATA to read the contents of a variable */
break;
case REAREGISTER:
/* read register;
break;
case TERMINATE:
// kill the debuggee
break;
... // more command
}
if(command == RUN || command == SINGLESTEPPING || command == CONT){
wait(&inf_status); //wait for the signals from the debuggee
handle_signals(inf_status);
}
}
![]()
2.1.2 Introduction to control breakpoints
The most basic functionality of a debugger is to set control breakpoints on the debuggee. The implementation of control breakpoint is to replace the instruction at the address where the programmer wants the program to stop with a special trap instruction. The purpose of this special instruction is to stop the execution of the debuggee at the location, and turn over the control to the debugger. Many computer systems provide a trap instruction, such as INT3 in Intel x86, for this purpose. Some other systems like PowerPC does not provide such kind of trap instruction, and any illegal instruction can be used for the same purpose.
Given an address in the debuggee's address space where a control breakpoint is to be set, the debugger reads the current instruction at the specified location, saves it for later replacement, and then writes the trap instruction at that location.
When the processor executes one of the trap instructions, it stops the executing debuggee and generates a special trap to the operating system. The operating system notifies the debugger that the debuggee has stopped, why it stopped, and where it stopped, including which thread of execution was running when the trap occurred. It is now up to the debugger to react accordingly. At this point, the user can examine the state of the debuggee, modify the state of some variables, remove breakpoints, set new breakpoints, single step the debuggee or continue the execution.
If the user wants to continue the execution of the debuggee, the debugger must put the instruction it saved earlier back to the original location, and have the debuggee single-stop this one instruction, and then set the breakpoint instruction again before allowing the debuggee to again proceed.
Some debuggers like GDB can specify a condition on a control breakpoint, and this is call conditional control breakpoint. When a breakpoint trap occurs, the debugger evaluates the condition. If the condition is true, then the debugger allows the user to examine the state of the debuggee; otherwise, the debugger continue the execution of the debuggee. For example, GDB uses "break address if con" to specify the condition on the breakpoint at the location "address". Con is an expression in GDB. GDB's breakpoint commands accept an expression, such as a+c>d, and compute its value. Any kind of constant, variable or operator defined by the programming language the user are using is valid in an expression in GDB. This includes conditional expressions, function calls, casts and string constants. GDB uses an interpreter to evaluate a condition expression. GDB first looks up the address of each variable contained in the expression from the symbol table, and then uses ptrace() to read the values of the variables from the debuggee address space to evaluate the condition.
2.1.3 Introduction to /proc file system
The /proc file system is not just a debugging API; it is a file system that provides access to the state of each process and light-weight process (lwp) in the system. The name of each entry in the /proc directory is a decimal number corresponding to a process-ID. These entries are themselves subdirectories. Access to process state is provided by additional files contained within each subdirectory [Pr56]. The most interesting files to the debugger are the 'as' file and the 'ctl' file. For the SunOS, the normal UNIX file operations such as open(), write() and read() are used to read and modify the files, and ioctl() is used for some other UNIX systems.
To control a debuggee process' execution such as starting and stopping it, the debugger must write some commands to the 'ctl' file. The 'ctl' is a write-only file to which structured messages (control messages) are written to direct the OS to change some aspect of the process's state or control its behavior in some way. The effect of a control message is immediately reflected in the state of the process visible through appropriate status and information files. Some important control messages and their actions are described in detail in Table 2. Table 2 only lists some of the control messages, and there are a lot more control messages that are defined in the /proc manual page.
The file 'as' contains the address-space image of the process; it can be opened for both reading and writing. lseek() is used to position the file at the virtual address of interest and then the address space can be examined or changed through read() or write() (or by using pread() or pwrite() for the combined operation). To set control breakpoints, read or modifies some values of some variables, 'as' is the file the debugger needs to access.
The file 'status' contains state information about the process. Through this the debugger can access the values of the debuggee's registers.
Table 2 some important control messages and their actions
|
Control Messages |
Actions |
|---|---|
|
PCSTOP |
When applied to the debuggee process's control file, PCSTOP directs the debuggee to stop and waits for it to stop. |
|
PCRUN |
Make the debuggee runnable again after a stop. This operation takes a long operand containing zero or more of the flags. PRSETP is one of the flags which directs the debuggee to execute a single machine instruction. On completion of the instruction, a trace trap occurs, and the kernel notifies the debugger. |
|
PCSTRACE |
Define a set of UNIX signals to be traced in the debuggee process. |
|
PCSFAULT |
Defines a set of hardware faults to be traced (examples: illegal instruction, integer overflow, page fault etc.) |
|
PCSENTRY |
Instructs the debuggee to stop on entry to specific system calls. |
|
PCSEXIT |
Instructs the debuggee to stop on exit from specific system calls. After the debugger fork()s a child process, the child process has to use this control messages to instruct itself to stop as soon as the exec() system call is finished so that as soon as the debuggee is loaded into the memory, it is suspended to wait for the debugger's instructions. |
|
PCKILL |
Send a terminate signal to kill the debuggee |
|
PCSREG |
Set the general registers of the debuggee |
The basic implementation of a debugger using /proc file system is illustrated in listing 3. Comparing the listing 3 to listing 2, we can see that their basic structures are the same. The only difference is that the API used to control the debuggee is different. Most operating systems provide the same basic functionality for developing a source level debugger, and the basic structure and algorithm of a debugger remain the same across different platforms, although their debugging APIs look different and work differently.
Listing 3 Basic implementation of a debugger using /proc
![]()
#include <sys/fault.h>
#include <sys/types.h>
#include <sys/signal.h>
#include <sys/syscall.h>
#include <procfs.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/stat.h>
struct proc_ctl {
int cmd;
long data;
};
struct sys_ctl {
int cmd;
sysset_t sysset;
};
main(int argc, char **argv)
{
pid_t pid;
pid_t tpid;
int s;
unsigned int pc;
char str[1024];
int cfd;
struct proc_ctl pctl;
int st;
int i;
pid = fork();
if(pid < 0)
handle_error("fork");
if(pid == 0){
char str[1024];
int cfd; // 'ctl' file descriptor
struct sys_ctl exitset;
// get 'ctl' file path of this process
sprintf(str, "/proc/%d/ctl", getpid());
cfd = open(str, O_WRONLY); // open the 'ctl' file
premptyset(&exitset.sysset); // reset the 'ctl' data structure
praddset(&exitset.sysset, SYS_execve); /* set the 'ctl' data structure to instruct the debuggee to stop after the exec() call */
exitset.cmd = PCSEXIT;
//set the command
write (cfd, (char *) &exitset, sizeof (struct sys_ctl));
// load the debuggee
execve(argv[1], &argv[1],NULL);
}
//I am the parent, open my child debuggee's 'ctl' file
sprintf(str, "/proc/%d/ctl",pid);
cfd = open(str, O_WRONLY);
for(;;){
print_prompt("dbg>");
command = read_command();
//handle commands
switch(command){
case RUN:
case CONT:
pctl.cmd = PCRUN;
pctl.data = PRCFAULT|PRCSIG;
//made the debuggee runnable
write (cfd, (char *) &pctl, sizeof (struct proc_ctl));
break;
....
}
if(command == RUN || command == SINGLESTEPPING || command == CONT){
wait(&inf_status); //wait for the signals from the debuggee
handle_signals(inf_status);
}
}
}
![]()
2.1.4 Comparison of ptrace() and /proc and their disadvantage
Compare to ptrace(), /proc is more efficient and flexible. The most obvious drawback of ptrace() is that each ptrace() system call can only read or write 32-bit data to the debuggee process. /proc does not have this limitation. /proc can access the debuggee's address space like the ordinary file using open(), write(), lseek(), and read(). It is also easier to use /proc to develop a debugger to debug the programs which have more than one thread. Each thread has a directory entry under the process directory in the /proc files system. The debugger can use the exact algorithm of debugging the main process to debug each individual thread. However, ptrace() does not support this capability.
2.1.5 The disadvantage of separate-process debugger
A separate-process debugger guarantees that an error in the debuggee cannot affect the debugger's state. This guarantee comes with a price, the inter-process communication overhead. Handling of a breakpoint trap needs at least 4 context switches and 4 system calls. When the debuggee executes a breakpoint trap, the kernel stops the debuggee and switches to the debugger. After the debugger handles the breakpoint trap, the debugger writes back the original instruction at the breakpoint trap location, set single step bit, and switches to the debuggee to single step the instruction, and then the execution switches back to the debugger to set the breakpoint again, and finally it switches back to the debuggee to continue its normal execution.
The overhead becomes even worse when conditions are used on control breakpoints. Usually the conditions are expressed as an expression such as "break if sum == 50". The debuggers like GDB use an interpreter to verify the condition of the breakpoints.
To minimize the overhead and improve the debugger's efficiency, many tasks should be moved into of the kernel or the debuggee. This survey will also discuss the methods on reducing the overhead.
2.2 Same-process debugger
The algorithm used by the same-process debuggers and the separate-process debuggers are basically the same. The only difference is that separate-process debuggers rely on the OS support to control the debuggee. A same-process debugger is a debugger that links its code and the debuggee's code into a single executable program. Both debugger and debuggee are living in the same address space. Therefore, there is no need for operating system support to set breakpoints, read and write the debuggee's memory, and examine the state of the debuggee. There is no context switch between the two modules. Switching between the debugger and the debuggee is just a jump instruction. Since a same-process debugger does not need kernel support, it is very portable.
One problem with same-process debuggers is that the debuggee is buggy, this automatically make the debugger buggy too. It is possible for a buggy program randomly modifies its address space and crashes. If the debuggee crashes itself, the debugger crashes too. This makes finding a bug impossible.
3 Current implementation of data breakpoints
Data breakpoint allows the user to specify a memory location to monitor; therefore, sometimes they are called watchpoints. Whenever the memory location is read-referenced or write referenced, the debugger returns the control to the user and let the user examine the data. Since it is too slow to check every single read and write, all data breakpoint implementations only check the write operation because only the write operations may modify the location. Data breakpoint is one of the most important features of a debugger because data corruption is a common and very difficult type of bug to identify. However, it is not one that has been widely adopted and standardized. The reason why data breakpoint is not widely used is its performance overhand. It is difficult to be implemented efficiently and has only limited support in most current debuggers. Even though debuggers such as GDB supports data breakpoint, they are too slow to use if the data breakpoint feature is turn on. It is about 100 to 1000 times slowdown than the normal execution.
The two basic data breakpoint implementations are hardware data breakpoint and software data breakpoint. The most efficient implementation of data breakpoints requires hardware support. It is becoming more common for machines to provide hardware support for data breakpoints, but it is by no means universal. While they are efficient, hardware-supported data breakpoints are not general, typically allowing the user to watching only a few memory words.
A debugger needs to know when a specified memory location is modified if the data breakpoint feature is turned on. Without hardware support, if we know which write instructions will modify the memory location, we only need to monitor those instructions. However, pointers in a program make it much harder. In many situations, we are unable to know which instructions will modify the location. Therefore the only solution is to check every single write when the program is run. Without hardware support, the simplest way to check if the location is modified is to single-step the debuggee program and give the debugger the control after each instruction to check if the instruction has modified a specified location. Single-stepping and checking every instruction is too slow to be used. More recently software implementations of data breakpoints try to reduce the number of checks, and make the checks faster. There are three ways[RP96] that have been proposed to reduce the number of checks. The first is to check only write instructions instead of all instructions; the second is to check only the write instructions that write to the same page as the watched locations; the third is to checkpoint, that is, periodically stop and check the condition with the ability to roll back to a previous state if the breakpoint condition has been met. There are many ways [RP96] proposed to make the check faster, from moving the checks outside to the debugger and into the kernel or the debuggee itself, to using special hardware that may be available, to use more efficient data structures and algorithms to do the checks, and various combinations of the above.
The recent proposed solutions of efficient data breakpoints are trap patching, code patching, checkpointing, and virtual memory.
3.1 Hardware Support
Many current processors provide direct support for data breakpoints including Intel x86 and the MIPS R4000. Typically, specialized registers, called debug registers, are used to specify the region of memory to be monitored. Intel x86 processors have four debug address registers[IN99], which can be used to set data breakpoints or control breakpoints. Each of the four debug-address registers hold a 32-bit linear address of a breakpoint. Breakpoint comparisons are made before physical address translation occurs. A hardware trap is generated when a write or read occurs to a monitored region of memory. Although hardware support is typically fast, it is also the most machine dependent. The biggest disadvantage of this approach is the extremely limited functionality that is provided by existing or proposed hardware. Currently, most CPU only provide up to four debug registers. At any moment, only up to four memory locations can be watched.
3.2 Single-Stepping
The popular debugger GDB first tries to set hardware data breakpoints when the user wants to use data breakpoints. If the hardware or the kernel does not support hardware data breakpoints, it will set software data breakpoints. GDB reports that software data breakpoints are extremely slow (100 to 1000 times slower) because GDB's software data breakpoints are implemented using single-stepping. Many CPUs supports single-stepping. When single-stepping is turned on, the CPU returns the control to the debugger each time it executes an instruction. If single-stepping is used to implement data breakpoints, then each instruction is checked after the control is returned to the debugger. If the instruction referenced a memory location whose effective address matches a breakpoint, the user is notified; otherwise, the next instruction is executed. Single-stepping is typically slow because for each instruction, it requires multiple protection boundary crossings and system calls.
3.3 Trap Patching
The obvious way to improve the performance of data breakpoints is to reduce the number of checks. The only instructions needed to be checked are those store instructions because only the store instructions can modify memory locations. Thus it is only necessary to check the data breakpoint condition before or after store instructions. The single-stepping method stops every instruction and check if the target location is accessed. However, typically only about 30% of instructions are memory reference instructions, and only a third of those are writes [KE93].
Trap Patching improves over single stepping by only checking the store instructions. Trap patching replaces all store instructions with a trap to the debugger. When the trap instruction executes, control is given to the debugger which determines whether a target location is accessed and whether the user specified condition on the location is met. If a target location is modified and the specified condition is true, the debugger returns the control to the user; otherwise, the debugger writes back the write instruction to the target location, single-steps the write instruction, replaces the write instruction with the trap again, and then the debuggee continues.
The advantage of trap patching is to reduce the number of checks; however, the overhead
of checks is till very high since each check incurs a few system calls and context switches like control breakpoints or single stepping data breakpoints. The insertion of the first data breakpoint and the deletion of the last data breakpoint are also expensive operations because each insertion and deletion requires at least one system call for each write instruction to replace the instruction. The insertion of the first data breakpoints also requires scanning the whole program an instruction at a time and replacing write instructions with traps.
3.4 Code Patching
Every memory write access check is till very expensive since the check is performed by the debugger. To further improve the speed of data breakpoints, the checks should be inside the debuggee instead of the debugger. Like trap patching, code patching checks every single write. Instead of trapping and returning control to the debugger, however, the check is done by the debuggee itself. This requires adding code to the executable to be debugged that checks the target of every write instruction.
There has been considerable research into uses of code patching. Code patching was first
implemented for fast control breakpoint [KE90]. Later, code patching was advocated as a good solution for data breakpoints in the simulation study from UC Berkeley[WL93]. After the study in UC Berkeley, some debuggers system such as KPC's debugger[CT94] use code patching to implement data breakpoints. Code patching patches each store instruction with a jump to a new code segment, called a displaced handler. The displaced handler first checks the data breakpoint condition and halts the program if necessary. Otherwise, the displaced handler simulates the displaced instruction and jumps to the logical successor of the displaced instruction.
The simulation study from UC Berkeley has shown that code patching is the second best method among hardware registers, simple virtual memory, trap patching and code patching. Roberts' study [RP96] also show the same result. The other advantage of code patching is that it is very portable in that it does not rely on any hardware or operating system support. However, code patching also suffers from a few problems. The first problem is that the debugger needs to dynamically call function from the debuggee to insert and delete data breakpoints since all the check is done inside the debuggee. Under some UNIX systems, the dynamic function calling code is extremely slow due to its excessive use of ptrace(). The second problem is that Heisenbug bugs will go away when new code is inserted. Finally, the biggest problem is that there is no guarantee that a program with data corruption problems will not corrupt the data structure used to check the watchpoints since they reside in the same address space.
3.5 Virtual Memory
Trap patching and code patching check every single store instruction. To further reduce the number of checks, only those writes that write to the pages containing a data breakpoint need to be checked. Virtual memory can be used to write protect all pages which contain a data breakpoint. When a write to a protected page occurs, it causes a trap to the debugger, which evaluates the condition. When execution continues, the debugger unprotects the page, single steps the instruction, protects the page, and then allows the program to run in full speed.
Although virtual memory significantly reduces the number of checks, both Roberts' study and UC Berkeley's simulation show that the performance of code patching is much better than virtual memory. The performance overhead is mostly due to the high overhead of checks. When trap occurs, one crossing is needed from the debuggee to debugger. After the debugger evaluates the conditions, it needs to make one system call to unprotect the page, another to enter the debuggee and single-step the instruction. Then the OS makes another context switch from the debuggee to the debugger, and the debugger makes another system call to re-protect the page and finally the last crossing from debugger to debuggee is needed for the debuggee to continue the execution.
Virtual memory has been used by openVMS's debugger[OD96] to implement data breakpoints. The openVMS's debugger manual mentions that virtual memory is only used for watching the static variables. For those non-static variables that are allocated inside a call stack, they use single-stepping to implement data breakpoints because performance problems arise if the call stack is write-protected. However, the manual does not mention anything about what kind of the problems arise. Definitely, the virtual memory scheme has trouble with local variables. The overhead of watching the local variable is expected to be significantly high since the stack needs to be written all the time.
3.6 Checkpointing
Another way to reduce the number of checks is to use checkpointing[KE93]. Though there is no evidence that this scheme has been implemented, this is an interesting scheme. The checkpointing implementation of data breakpoints is similar to using checkpointing to implement reverse execution such as IGOR (will be explained in the coming sections) does. The idea behind the checkpointing scheme is the debugger periodically stops the program and checks the condition. If the condition is false, then the debugger checkpoints the state of the debuggee and continue the debuggee's execution. The debugger repeats this process until the condition is true. Then the debugger restarts the debuggee from the previous checkpoint but uses a smaller checkpoint interval until it homes in to the exact location where the condition turns true.
Checkpointing only works for certain kinds of data breakpoints, those in which a condition turns true and stays true for the remainder of the program. In this context, then the IGOR system is also a data breakpoint system though IGOR was designed to be a reverse execution debugger. When using IGOR, you can tell the IGOR system to go back to the location where makes an expression, such as a>10 or a < 4, becomes true. Then IGOR goes back to the closest checkpoint and restarts the execution from the checkpoint.
There is no performance analysis on this scheme. But this scheme uses checkpointing and this means it also has problems which time and space. We will discuss the problems associated with checkpointing later.
3.7 Improvement on the data breakpoint implementations.
The data breakpoint implementations can be further improved with special kernel support[RP96] or special data structure[WL93.]. In the case of virtual memory and trap patching, the checks can be moved into the kernel so that two context switches, from the kernel to debugger and from debugger back to kernel, and some system calls (read data from debuggee's address space) can be saved.
When a write operation occurs, a write check has to be performed to determine whether the write instruction's target address references a monitored region (i.e. a data breakpoint is set at the region). This operation is called address lookup. Address lookup has to be fast to reduce the check overhead. Some older implementations use hash table too perform the address lookup. Hash table is very slow since it needs many instructions to verify if an address is one of the target address. A segmented bitmap[WL93] can be used to improve the performance of the checks because it minimizes the average number of memory accesses required. The segmented bitmap is an efficient data structure for checking whether a given target address is part of a data breakpoint condition.
The segmented bitmap data structure uses one bit to represent each word allocated by the program being debugged. Each bit indicates whether or not the corresponding word is monitored. This scheme uses roughly 3% of the total memory used by the program. However, it incurs at most two memory accesses per address lookup. To reduce the space overhead, the bitmap is organized into segments with an extra load instruction overhead.
3.8 Performance Comparison
Both Robert Wahbe[WA92] and Paul E. Roberts[RP96] have done research on the performance of some of the above implementations. Wahbe used simulation to compare the performance of hardware, code patching, virtual memory and trap patching. Roberts implemented single stepping, code patching, virtual memory, and virtual memory with kernel support for the checks into GDB. We use the data from both studies to compare the performance of these implementation techniques.
Table 3 list some mean relative overhead data from UC Berkeley, and Figure 1 shows the same data in graph. From the data we can see that the native hardware has the smallest overhead and for most of the benchmarks, the overhead almost can be neglected. Code patching is the best software implementation whose overhead is less than 4 times of the original run time. In most cases, virtual memory scheme's performance is also not too bad. However, for most of the cases, trap patching probably is too slow to be useful.
Table 3: Mean Relative Overhead Data from UC Berkeley


Figure 1: Mean Relative Overhead Data From UC Berkeley
Table 4, Table 5, and Table 6 list the results of Roberts' study. Robert expressed his data in a different way. He divided his experiments into three categories, global variable test, heap variable test, and stack variable test. For the software implementation of data breakpoints, Roberts and Wahbe have the same conclusion that code patching has the best performance. The worse code patch overhead from both result is less than 4 times of the original runtime. Roberts implemented the virtual memory scheme both with (user virtual memory) and without the support from kernel (kernel virtual memory). Roberts' kernel virtual memory scheme implementation is developed such that the checks are performed inside the kernel. However, The user virtual memory overhead of Roberts' result is worse than Wahbe's result. The kernel virtual memory's overhead is close to Wahbe's virtual memory scheme's result. Roberts' stack test proves that virtual memory is too slow for stack variables. Virtual memory is not good for local variables because the stack is modified all the time and it should not be write protected.
Table 4: Median time for global tests in seconds from Roberts' study
|
|
No wp |
Single stepping |
Code patching |
User virtual memory |
Kernel virtual memory |
|---|---|---|---|---|---|
|
Eqn |
57.3 |
2750000 |
69.6 |
1700 |
81.2 |
|
esp |
30.4 |
1370000 |
55.7 |
9450 |
159 |
|
xli |
9.3 |
458000 |
20.3 |
12700 |
190 |
|
gcc |
74.4 |
3440000 |
126 |
643 |
83.2 |
Table 5: Median time for heap tests in seconds from Roberts' study
|
|
No wp |
Single stepping |
Code patching |
User virtual memory |
Kernel virtual memory |
|---|---|---|---|---|---|
|
Eqn |
57.3 |
2750000 |
77.1 |
76.7 |
57.8 |
|
esp |
30.4 |
1370000 |
55.9 |
17400 |
160 |
|
xli |
9.3 |
458000 |
641 |
16300 |
212 |
|
gcc |
74.4 |
3440000 |
126 |
767 |
81.1 |
Table 6: Median time for stack tests in seconds from Roberts' study
|
|
No wp |
Single stepping |
Code patching |
User virtual memory |
Kernel virtual memory |
|---|---|---|---|---|---|
|
eqn |
57.3 |
2750000 |
70 |
98100 |
2080 |
|
esp |
30.4 |
1370000 |
54 |
87700 |
1120 |
|
xli |
9.3 |
458000 |
21 |
3000 |
40 |
|
gcc |
74.4 |
3440000 |
129 |
3610 |
296 |
The above results show that code patching is less than 4 times slow down and it is the best way to implement data breakpoints. Virtual memory with kernel support is also not too bad if it is not used for local variables. For the heap variables and the global variables, sometimes kernel virtual memory performs better than code patching.
4 Current Implementations of Reverse Execution
There is no direct hardware support for reverse execution. The current implementations for reverse execution are solely software based. A process is a state machine and it moves to a new state after the execution of each instruction instance. A process state includes the contents of the process' address space and the CPU state. After the execution of an instruction instance, either the CPU state or the contents of the address space is changed or both. Reverse execution is a debugger feature that allows one to go back to a previous state. In other words, reverse execution undoes the effect of each instruction instance between the latest instruction instance to the target restored point. We use instruction instance here because an instruction can be executed many times if it is inside a loop. Current processors are not reversible. This means that the current processors can not be instructed to undo the effect of an instruction instance, such that they can not undo an "add" operation or a "mov" operation. Therefore, additional information must be recorded in order to restore past process states.
To support reverse execution, the debugger needs to save the debuggee's states or history information. Currently there are two ways to implement reverse execution. The first method is to checkpoint CPU states and address space periodically. The second method is to checkpoint the CPU states and record the execution history. The execution history is recorded in the form of history elements. Each history element consists of an address and the associated value. Saving information to restore a process' state is time- and space-consuming. All the current implementation efforts of reverse execution focus on how to reduce the time and space overhead associated with maintaining the debuggee process' states.
The ideal way to implement reverse execution is to undo each instruction instance one by one. However, it is very difficult to implement such a reverse execution system because this requires the debugger to save the state of each instruction instance of the debuggee, and this is too slow to be practical. Besides, there would not be enough space to store the states. Therefore, All the current implementations use some sort of checkpointing to minimize the time and space required to support reverse execution. Most implementations of reverse execution do not actually execute reversely. Many people call the current reverse execution systems as replay systems. Figure 2 Shows how some reverse execution systems such as IGOR and RECAP work. C1, C2, C3, and C4 in the figure are checkpoints. Before reverse execution begins, the user has to specify a point to where he wants the execution to revert to. Then the debugger picks the closest checkpoint to the "revert" location and executes the debuggee forward to the specified point. For example, if we want to reverse the execution to the point R1, then the debugger first chooses the checkpoint at C3, restores the state of the debuggee process to C3 using the checkpointed information, and re-executes the debuggee process forward from C3 until R1 is reached.

Figure 2: Checkpointing and reverse execution
As mentioned earlier, there are two methods to use checkpointing to support reverse execution. This first one is to checkpoint the CPU states and address space of the debuggee process periodically and when reverse execution is needed, the debuggee reexecutes forward from a specific checkpoint. The RECAP system and the IGOR system use this method. The other method is to save history information, and periodically checkpoint the CPU state. When reverse execution is needed, a specific process state is reconstructed by restoring the address from the saved history information incrementally, and restoring the checkpointed CPU state at the specific checkpoint, and the debuggee reexecutes forward from the reconstructed state. The History Cache method and the Optimal Tracing algorithm use this method. The SPYDER system also uses history information, but unlike the other two systems which used replay to implement reverse execution, it backtracks to a previous statement by undoing the side effect of each statement in the reverse execution order.
Saving address space and history elements requires huge disk space. No system actually saves the whole address space at each checkpoint or saves every single history element. Both IGOR and RECAP only save the modified pages at each checkpoint. The History Cache scheme only saves the old value of an address in an interval when the address is first modified in the interval. The Optimal Tracing algorithm only saves the value of an address when the address is first read referenced in an interval. SPYDER saves the value of an address before it executing a statement which will modify the value of the address. A statement here can be a loop statement or an if statement.
Use checkpointing to implement reverse execution has many problems beside the performance overhead and the space requirement. Signals and system calls can cause problems because some system calls are not reversible and signals need to be regenerated at the exact time during reverse execution. For example, a system is made to print a line and we cannot undo the printing. Some of the current implementations just ignore system calls completely. Some other implementations do not repeat system calls during reverse execution, and they retrieve the results of the system calls from a log, which are recorded during the normal execution. In order to simulate signals during the reverse execution, we need to know when the signals are received during the normal execution. Some implementations also ignore both signals and system calls, but some other implementations just simulate the signals approximately by recording the time when it occurs during the original run. The best way to implement singles is to use software counter[PL89] which identifies each execution instance of an instruction from a process. By using software counter, we can record the software counter value when a signal occurs, and we can regenerate the signal at the same software counter value during replay.
We have studied some reverse execution systems, and only one of them has the single-step-back ability. Like single-step-forward, the single-step-backward is also an important debugging feature. It is very common that programmers single-step-forward one more instruction than desired by accident when they use sing-stepping-forward to debug their program. It would be nice if they can single-step-backward to the previous instruction. The difficulty of implementing single-stepping-backward is that we need a way to identify each individual execution instance of an instruction. All existing implementations of reverse execution goes back to a previous checkpointed state and reexecute the program forward to a specified point. If an instruction is inside a loop, most likely it will be executed many times. To single-step-backward one instruction, we need to know when we should stop the execution of the program after it executes the instruction. Ideally we should have an instruction counter (not an program counter) which always increments itself by one each time an instruction is executed. If the debuggee stops at an instruction with an instruction counter number N, and the user wants to revert the execution to N-1, then the debugger can restart the execution of the debuggee forward again from a checkpoint state at an instruction with an instruction counter number M after the debugger resetting the counter to M. As soon as the counter reaches N-1, then the debuggee is stopped. This implementation creates the illusion of single-stepping-backward. Currently there is no CPU that supports hardware instruction counter. However, software instruction counter can be implemented, and we will give more detail on software instruction counter later.
We will explain some reverse execution systems in the following sections, with a focus on how they save the state and handle signals and system calls.
4.1 IGOR
IGOR[FB88] is a debugging system running under the DUNE distributed operating system. The current (1988) IGOR system only supports debugging of single-thread programs. IGOR supports reverse execution, selective searching of execution history, substitution of the data and executable parts of the programs. IGOR requires the support from the OS to perform reverse execution because it uses the virtual memory system for the checkpointing.
4.1.1 Checkpointing and Implementation
IGOR relies on the virtual memory system to save the execution state of the debuggee. At each time interval, IGOR saves a copy of every page that has been changed since the last checkpoint. IGOR introduces two new system calls to support the checkpointing. The first system call is pagemod(), which returns a list of all memory pages that have been written since the previous call to pagemeod(). The implementation of pagemod() us the virtual memory's "dirty bit". At each checkpoint, the contents of these memory pages are saved in the checkpoint files, together with the contents of the registers and program counter. The interval between checkpoints is controlled with another new system call ualarm(). This system call is similar to the standard UNIX alarm() system call, in that it provides an interrupt to the program after a specified amount of time has elapsed. IGOR also needs to modify the compiler to insert some special code into the debuggee's program startup code to turn the checkpointing on or off.
4.1.2 Reversible Execution
To perform the reverse execution, the IGOR user has to specify the point to which execution is to revert to. This is similar to setting a data breakpoint. For example, the user might wish to back up to the first point at which x > 0, or to the first point for which n ==10. The IGOR system then locates the last checkpoint before the desired point and the execution restarts from the checkpoint. IGOR first reconstructs a core image of the specified checkpoint by retrieving the latest image of each page as of the desired checkpoint. The data are then copied in from the reconstructed core image, the stack and registers are restored, and the execution resumes. Actually, during the reverse execution, IGOR uses a object code interpreter to simulates the execution of the user's program forward from the checkpoint, and the interpreter stops when the selection criterion is met. The current implementation of IGOR only supports simple expressions like a > 10, a < 10, and a == 10.
4.1.3 Signals and System calls
IGOR does not mention how it handle signals. It only describes how to restart input/output during reverse execution. Before IGOR reexecutes a user program, it runs a prestart() routine first to restore some I/O states. IGOR users can provide their own prestart() routine to reestablish the I/O environment and perform any other desired actions prior to resuming the normal execution. If the user does not supply a prestart() routine, IGOR supplies a default routine. It queries the user about the name of each file that should be opened, the mode for each file (read, write, etc.), and the position within the file at which the file pointer should be positioned.
4.1.4 Performance
The data from the IGOR paper shows that the time overhead for checkpointing (Table 7) is reasonably good. The worse-case checkpointing overhead reported is 380% for the checkpointing interval of 0.1 second. This overhead is small enough for practical application. The amount of disk space (Table 8) required to store the checkpointed pages depends on the checkpointing interval. If the interval is not very small, then space requirement is not a problem. The worse-case space requirement is 30.7 pages per checkpoint for the checkpointing interval of 0.1 seconds. If the size of a page is 4k, then the space requirement is 1.2Mbyte/second. The worse-case space requirement is 41.2 pages per checkpoint for the checkpointing interval of 1 second, which require 0.16Mbytes/second. Table 7 shows the execution time overhead of IGOR's checkpointing for some programs. Table 8 Indicates the average percentage of non-text memory pages that were written out during the execution. The most serious drawback of IGOR is the overhead of the object code interpreter, which is approximately 140 times slower than actual execution. If we want to reverse execute a debuggee program for 2 seconds of the normal execution time, then it will take about 5 minutes for the object interpreter to finish the reverse execution.
Table 7: Checkpointing Overhead of the IGOR system

Table 8: Checkpointing space requirement of the IGOR system

4.2 RECAP
The basic idea of RECAP[PL89] is similar to IGOR. Both systems try to only save the modified pages at each checkpoint. The difference is that IGOR creates two new system calls, and RECAP depends on the copy-on-write fork() system call to do the job. However, RECAP mostly focuses on handling the signals and system calls since its checkpointing mechanism is much simpler than IGOR's checkpointing mechanism. RECAP is a debugger system for parallel programs using shared memory or message passing, therefore, it must handle the events such as shared memory accesses carefully.
4.2.1 Signals and System calls
RECAP's approach of reverse execution is to record all events that affect the state of the process and to checkpoint the state of a process periodically.
RECAP handles external events such as system calls, shared memory access, message receiving, and signals by recording the result of each external event. During replay all events are retrieved from the log instead of regenerated.
RECAP inserts a switch between a process and its external interactions to filter out the events. During the normal forward execution, the switch monitor all events and records them to a log as shown in Figure 3a. The outside circle represents the external environment. During replay, the switch is set to replaying the events from the log as shown in Figure 3b.
When a signal occurs, RECAP's runtime library logs the signal number, the time when it occurs, and the program counter where the process was interrupted. Ideally, the signals should occur at the same point in a replay as during the normal execution. However, under RECAP the signal time is approximated by the process' CPU time during replay. We will explain how RECAP regenerates signals later.

(a) Monitoring phase (b)Replay phase
Figure 3: Event monitoring and self-contained replay
4.2.2 Checkpointing
RECAP obtains its checkpointing for free by using UNIX copy-on-write fork() system call. During the normal execution, RECAP periodically stops a process and forces it to take a checkpoint. RECAP implements a checkpoint as a suspended process. Thus, to create a checkpoint, a process simply fork()s a copy of itself. The parent continues immediately, the child process is suspended until needed for replay. Forking a process automatically saves the process' registers. The original debuggee process' pages are copied to the child process only when they are modified. The copy-on-write fork() system call saves checkpointing a great deal of time since RECAP does not need to copy all pages at each checkpoint and it does not need to output the checkpoints to the disk.
4.2.3 Replaying
RECAP provides its users several ways to perform reverse execution. The first method is that the user needs to specify a particular time in the execution of the debuggee process, then RECAP chooses one of the suspended checkpoint processes and continues the chosen checkpoint process using the event log, until it reaches the specified time (not instruction). For example, if a debuggee executes for 5 seconds (CPU-time), then the user can ask RECAP to reverse the debuggee's execution to one second earlier. The chosen process must first fork a new process from the checkpoint, otherwise the checkpoint will be lost as a result of changes made during replay. The second method allows the user step execution backward repeatedly at a specified rate (in term of execution time).
During replay, a shared memory read retrieves the value from the event log instead of physical memory. Similarly, system calls are not executed again, but the affected variables are set using the values in the log.
To reproduce signals faithfully, RECAP suspends execution just before the time when a signal occurred and set a breakpoint at the program counter where the signal occurred. When the breakpoint is reached, the actual time can be checked against the saved signal time and execution continued if the desired time has not been reached. This approach is not precise because process timings are not reproducible, but is probably close enough in practice.
4.2.4 Implementation
A process executing under RECAP must use a different run-time library that logs system calls, records the time at which signals occur, and generates checkpoints. A system call may have a return value and affect one or more parameters. Each of these values is recorded in the process's event log. When a signal occurs, the run-time library logs the signal number, the time when it occurs, and the program counter where the process was interrupted. Ideally, the "time" should be a count of how many instructions were executed prior to the signal. Such a count could be obtained and used in replay if the hardware provided an instruction counter that decrements a counter after every instruction and generates a trap when the counter reaches zero.
4.2.5 Performance
There is no performance detail about RECAP. Using copy-on-write fork() system call to do checkpointing should be very fast. The overhead of starting a reverse execution is the same as restarting a suspended process. However, RECAP relies on swap space and memory to store the checkpointing information. This depends on how long a process runs and how often a checkpoint is performed. If the checkpoint interval is too short and the program runs for very long, the process table of the system will be full or the memory and swap space will be exhausted and the system will crash. The RECAP paper also mentions that logging information also requires a large amount of storage. Event for a slow processor, the event log would require 1MB of data per process per second.
4.3 SPYDER
Unlike other systems, where the state of a process is defined as the contents of address space and the CPU state, the SPYDER[AD91] system is based on the notion of execution history, which are the values of all variables and the program counter at checkpoint time. During the normal execution, SPYDER records the program counter and the old values of all variables that will be changed by a statement before the debuggee executes the statement instance. The SPYDER paper does not mention any thing about saving the CPU state and how to replay the execution. The whole paper focuses on how to save the execution history and backtrack to a previous statement.
The design goal of SPYDER is to bound the amount of space required to store the execution history information. SPYDER introduces two methods for implementing reverse execution. The first method is called the execution history approach, which records the execution history of each statement instance. This approach requires too much space and the space requirement is not bounded. The second approach, which is called the structured backtracking approach, will bound the space requirement by imposing some restrictions on the flexibility of reverse execution.
4.3.1 Execution History Approach
SPYDER uses a different approach than the previous two systems to implement execution backtracking. SPYDER does not checkpoint the execution of a program, instead it records the execution history of the program. During reverse execution, SPYDER uses the data saved in the execution history to undo the effects of program execution.
SPYDER defines the state of a program as follows: values of all variables in the program at the time, and the location of the program control. The execution of a statement basically modifies the program control and probably the value of some variables. Backtracking over the statement requires to undo the two effects. To undo the two effects requires recording the execution history of the program and the variable values before they are changed.
With each assignment statement SPYDER associates a change-set. The change-set consists of all variables whose values are modified by the statement. To be able to backtrack to any statement arbitrarily far back in execution, SPYDER records the complete execution history of statements and the corresponding previous values of variables in their change-set. Then SPYDER can backtrack to any statement by restoring the previously saved values of change-set variables starting at the current location and going backwards until that statement is encountered in the saved execution history.
The problem associates with this approach is that the space required to save the execution history and the change-set values can not be bounded at the compile time. To overcome this problem, SPYDER introduces the following approach, which can bound the space needed to save the execution history and change-set values at compile time.
4.3.2 The structured Backtracking Approach
Instead of associating a change-set with each individual statement only, SPYDER also associates a change-set with each composite statement like if statement or while loop. The change-set of a composite statement is the variables which are changed by the statement. For example, the change-set of a while loop will consist of all variables that could be modified if the loop body is executed one or more times.
If C is a function to calculate the change-set of a statement, and S1 and S2 are source statements, then SPYDER computes the change-sets of a composite statement as follows:
C(S1, S2) = C(S1)UC(S2)
S1 and S2 are the sub-statements of the composite statement. The change-set of the composite statement must be the union of the change-set of S1 and the change-set of S2. For example, the change-set of the while loop and if statement are computed as the follows:
C(if cond then S) = C(S)
C(if cond then S1 else S2) = C(S1)UC(S2)
C(while cond do S) = C(S)
To save space and bound the space at compile time, SPYDER only allocates space to save just one instance of each variable in its change-set for each statement no matter it is a simple or composite statement.
Whenever the control reaches a statement, SPYDER saves the current values of variables in its change-set in the same space. So every time when a statement in a loop body gets executed, the current values of variables in its change-set overwrite the previously saved values. Therefore, this implementation of change-set poses a constraint on the flexibility of backtracking, i.e., SPYDER can not backtrack to any arbitrary statement from the current statement.
4.3.3 Backtracking
Using execution history, SPYDER can backtrack to any statement instance by restoring the previously saved values of change-set variables starting at the current location and going backwards until the target statement is reached. However, by using the structured backtracking approach SPYDER can not backtrack to any arbitrary statement because of the limitation we mentioned above. The structure backtracking approach poses the following restrictions on reverse execution:
One cannot backtrack from a statement in iteration n to a statement in iteration n-1 inside the same loop.
One cannot backtrack from a statement outside a loop to a statement inside it.
One cannot backtrack from a statement outside an if statement block to a statement inside the if statement block.
Note that one can backtrack from a statement inside a loop to a statement outside it just as it is possible to break from inside a loop to the outside. Procedures are treated as composite statements. One can only backtrack from a statement inside a procedure to a statement outside it, but not other way around.
The SPYDER paper mentions that if one needs to backtrack to a statement inside a composite statement from a statement outside it, one can always backtrack first to the beginning of the composite statement, and then execute forward to the desired statement. However, there is no evidence in the paper that SPYDER records the CPU-state of the debuggee process. Therefore, we are not sure how the debuggee can be execute forward after it is backtracked to a previous statement. If SPYDER treats registers as variables, the CPU state can be stored inside the change-sets, and it is possible to execute the debuggee forward after reverting back to a previous statements.
4.3.4 Bounds on Space Requirements
The main problem SPYDER tries to solve is the space requirement for saving execution history information during the normal execution so that the old states can be restored from the saved information. SPYDER's algorithm bounds the space requirement and the space requirement can be calculated based on the following formula:
![]()
A is the total number of statements in the program, si is the size of the change-set of the ith statement, ni is the nesting level of the ith statement. If a statement is nested inside two loops, then the nesting level should be 2 for the statement. S is the sum of sizes of change-sets of all statements in the program.
The space requirement can be calculated at compile time if the program does not have pointer. However, if a program has pointer, then the si has to be calculated at run time. But the space bounds derived from the above formula would still hold.
4.3.5 Signals and System Calls
There is no information about how signals and system calls are handled in the SPYDER paper.
4.3.6 performance
The SPYDER paper also does not provide any information about the performance of the system; therefore, we have no idea about the efficiency of SPYDER.
4.4 Optimal Tracing and Incremental Re-Execution for Debugging Run Program
To reduce the time and space cost of checkpointing, Robert and Mark [NW94] present adaptive tracing strategies that provide bounded-time incremental replay and that are nearly optimal. Their implementation on a Sparc 10 traces less than 15 kilobytes/sec for CPU-intensive programs and for interactive programs the slowdown is small enough that tracing can be left on all the time. They overcome the time and space problem by adaptively deciding what and when to trace. To decide when to trace, they divide the execution into windows - contiguous intervals of the execution that can be individually replayed - and produce a set of traces for each window. To decide what to trace, they perform on-line analysis at each memory reference to dynamically determine which to record.
4.4.1 What to trace
There are two considerations to decide what to trace. First, a replay is correct if and only if each read from the memory retrieves the same value as during the original execution. Second, a window's replay is m-bounded if and only if they need to scan at most M previous trace records before the window's trace record to restore its state. The following proposition show what must be traced.
Proposition:
In general, enough trace data exist to provide an M-bounded replay of a window iff for every read in the window, either
the value read is saved in one of the M trace records preceding the start of the window's trace, or
a write to the same memory location is previously made within the same window.
Based on above proposition, they identify a class of unique-spanningM reads, where a unique-spanningM read is the first read from an address A in a window where A was written in an earlier window but its value is not traced in the M trace records recorded before the window. They show that tracing unique-spanningM reads is sufficient for incremental replay with bounded time.
4.4.2 when to trace
Netzer and Weaver proved that optimally deciding when to start a new window has polynomial time complexity. However, it requires more than constant work at each memory reference, which is too inefficient to be practical. They solve this problem with an approximation by simply fixing each window size at T where T is the number of memory references. Although the window size is fixed, their tracing is almost optimal. They also proved that the cases M=0 and M =Infinitive perform well, and it is unclear whether other values of M are of practical interest. In their case, they actually never implemented the unique-spanningM. They only implemented unique-spanning0 and unique-spanninginf and their experiment result shows that unique-spanning0 and unique-spanninginf work well enough so that they did not need to implement unique-spanningM. In the case of unique-spanning0, only one record is needed to scan to start a replay. For unique-spanninginf, sometimes all records need to be scanned.
4.4.3 Conceptual Algorithm
For each memory reference or entering a new window, the optimal tracing algorithm has to perform some actions. They call the actions as read_hook(), write_hook(), and window_boundary_hook(). Figure 4 shows their algorithms. In their algorithm, Traced is a bitvector that indicates which address have been traced since last written, and the NeedToTrace bitvector marks which addresses have not been written in the current window or traced since last written.
For the case unique-spanning0 (Figure 4b), every bit in the NeeToTrace bitvector is set to true when a new window is first entered. If a read reference is make to an address a and NeedToTrace[a] is true, the location a is traced and NeedToTrace[a] is set to false so that the subsequent reads to the same location in the same window will not be traced. When write reference is made to address a, it simply turns NeedToTrace[a] into false to indicate that the same address will not need to be traced in the same window.
The unique-spanninginf (Figure 4c)has an extra bitvector call Traced. For each new window, an address a needs to be traced only when NeedToTrace[a] is false. When entering a new window, NeedToTrace[a] is initialized to NOT Traced(a) for every address a. If a read reference is made to an address a and NeedToTrace[a] is true, the location a is traced and NeedToTrace[a] is set to false and Traced[a] is set to true so that subsequent reads to the same location, no matter in the same window or the subsequent windows, will not be traced. When a write reference is made to address a, it simply turns NeedToTrace[a] into false to indicate that the same address does not need to be traced in the same window, and turns Traced[a] to false to indicate that the same address will need to be braced in the subsequent windows.
The algorithm of unique-spanningM (Figure 4a)turns the bitvector variable Traced in the algorithm for unique-spanninginf into a counter. When entering a new window, an address a needs to be traced only when Traced[a] is 0; otherwise, it decreases Traced[a] by 1. When a read reference is made to an address a, if NeedToTrace[a] is true, then the address a is traced, NeedToTrace[a] is turned into false so that subsequent reads in the some window will not be traced, Traced[a] is assigned a new counter c so that the reads to the same address will not be traced in the subsequent c windows. If a write reference is made to address a, it turns NeedToTrace[a] into false to indicate that the same address will not need to be traced in the same window, and assign 0 to Traced[a] to indicate that a is needed to be traced in the subsequent windows.
window_boundary_hook{
foreach a
if(Traced[a]>0)
NeedToTrace[a] = 0;
Trace[a]--;
else
NeedToTrace[a] = 1;
}
read_hook(a){
if(NeedToTrace[a])
trace address & value;
NeedToTrace[a] = 0;
Traced[a]=C;
}
write_hook(a){
NeedToTrace[a]=0;
Traced[a]= 0;
}
(a) unique-spanningM reads
window_boundary_hook{
NeeToTrace bit-vector =1's;
}
read_hook(a){
if(NeedToTrace[a])
trace address & value;
NeedToTrace[a]=0;
}
write_hook(a){
NeedToTrace[a] = 0;
}
(b) unique-spanning0 reads
window_boundary_hook{
NeedToTrace bit-vector =
NOT (Traced bit-vector);
}
read_hook(a){
if(NeedToTrace[a])
trace address & value;
NeedToTrace[a]=0;
Trace[a]=1;
}
write_hook(a){
NeedToTrace[a]=0;
Traced[a]=0;
}
(c) unique-spanninginf reads
Figure 4: Fixed-window-size tracing algorithms, performed at each window boundary(window_boundary_hook), each read(read_hook) and each write (write_hook)
4.4.4 System calls and signals
They instrument each system call to perform several actions. First, they have to estimate which address might be modified during the call. Second, after the system call, for each address a modified by the system call, they reset NeedToTrace[a] to true. By resetting bitvector entries corresponding to the modified locations, any subsequent reads of these locations will be detected as unique-spanning and traced as usual. The side-effects of system calls are recorded into a log for later replay.
For some system calls, they can treat the calls as any other sequence of reads and writes. However, this scheme fails for some interactions with the environment where state is maintained outside of the user's process, such as on a bit-mapped display. They proposed a scheme in which when state changes to an external device must be reproduced, the device's state must also be traced during execution and restored during replay when the system call is reached. Instrumentation must be added to the system call to handle the details of how the device's state is accessed.
They use software instruction counter (SIC) to replay interrupts and signals. The software instruction counter is incremented by one at each backward branch and procedure call. The value of this counter together with the PC uniquely identifies each instruction instance. By instrumenting the program to maintain a software instruction counter, interrupt handlers can trace the value of the counter and the PC when an interrupt occurs. During replay a watchpoint can be set for the location at which the interrupt initially occurred to check the instruction counter against the traced value. When the counter reaches the traced value, a jump can be make to the interrupt handler.
4.4.5 Implementation
For each read and write operation, the optimal tracing algorithm requires on-line analysis, such as deciding if a window boundary is reached, or the contents of a memory location needs to be recorded. They instrument programs at the assembly level to do the analysis.
To make the bitvector operations efficient, they use two-level bitvectors similar to the bitmap data structure for improving the data breakpoint implementations. The bitvectors maintain one bit per four byte word. A flat bitvector for a 32-bit address space would occupy 128 megabytes of storage. A two-level bitvector is a table of pointers to bitvector fragments, with each fragment being allocated only when a reference is made to the memory region it represents.
4.4.6 Replaying
To replay a given window, some of the memory's state must be restored before the reexecution. If the unique-spanningM algorithm was used for the tracing, then M previous windows' traces and the current window's trace must be scanned to restore the memory state. A linear pass is needed to place into memory the value in each trace record. The current window is scanned up to the first system call if there is any. The traces after the system call will be restored after the reexecution of the system call.
System calls are instrumented so that system calls during replay do not go to the kernel. Instead, additional state is restored before returning to reproduce the side-effects of system calls. After a system call, again the traces after the system call are restored up to the next system call, etc.
4.4.7 Performance
The comparison of fixed-window and optimal tracing shows that fixing the window size appears to work well. They used four programs to test the performance of both fixed-window and optimal tracing. The fixed-window algorithms generated traces at most 50% larger than optimal for the program ccom and event. For compress and mesh, the algorithms were nearly optimal.
They conducted an run-time overhead measurement on both unique-spanning0 reads and
unique-spanninginf reads, and the result shows that they can trace long-running programs with a 1.75-7 times slowdown generating less than 15 kilobytes/sec of trace. Interactive programs can be traced with a trace generation rate on 100's of bytes/sec and no noticeable slowdown.
4.5 History Cache: Hardware Support for Reverse Execution
To save history information using software approach such SPYDER and the Optimal Tracing algorithm, a program has to be instrumented. Instrumentation introduces many problems such as
program size becomes larger
program runs slower
either an extra tool is needed or the compiler, the linker, or the assembler has to be modified for the code instrumentation.
Heisenbug bugs
buggy program may corrupt the instrumented code and its data structures.
It would be nice that the history information is recorded without modifying the original program and without introducing extra overhead. This requires the help from hardware. The History Cache scheme is the proposal using hardware to support for reverse execution. The History Cache[SO94] scheme uses a memory cache to do both recording and compacting the history information. The History Cache scheme is not a debugging system, it is only the hardware support for reverse execution, and you can use it for any application you want.
4.5.1 Implementation
In order to restore the previous state faithfully, we have to record at least three things before executing a new instruction instance. An instruction may modify the value of a register or the value of an address location, and the program counter also is changed. Assume a register is mapped to an unused address location, then before executing each instruction, we can save the program counter, the address which will be modified by the instruction, and the value of the address. We called this three things as the history data.
It requires huge disk space to save the history data of each instruction. If we need 3 words to save the history data, and each word is 4 bytes, then we need 12 Mbytes of history for 1 million instructions. To save disk space, The History Cache scheme only records the history elements that contain changes to the main memory because statistics of executed instructions in many applications show that a Store operation to the main memory constitutes around 10% of all executed instructions. However, CPU state is also required to reconstruct the previous process state. Therefore, The History Cache scheme also periodically checkpoint the CPU state.
The History Cache scheme reduced the spaced need further by compacting the size of the history. The history cache sits between the processor and the main memory as in Figure 5. At each Store to a location in the main memory, the old value of the location is saved to the history cache. To compact the history size, the new value is discarded if the history cache already holds the location. If there is no space in the history cache, an element from the history cache is chosen to be flushed to the disk. Periodically the history cache is flushed to the disk with the processor state appended to the history.

Figure 5: The history Cache
4.5.2 Signals, System calls and Reverse Execution
The only purpose of The History Cache is to provide hardware support for recording history information and it is not a complete software system; therefore, there is no information for handling the signals, system calls and reverse execution.
4.5.3 Performance
The History Cache compacts the history data 2.5 to 25 time. As we mentioned before that normally 12 Mbytes space is required for every 1 million executed instructions. Using a history cache size of 1K entries, the programs produce from 31 Kbytes to 190Kbytes of history data for 1 million executed instructions. The history cache is thus effective in significantly compacting the history.
4.6 Comparison of the above approaches
Each of the above approaches has their own strength and weakness. The biggest weakness is that all approaches do not handle single-stepping backward very well. Actually the only approach addresses single-stepping is SPYDER. However, SPYDER can not step back into a loop or a procedure. All other approaches do not address this problem at all. The difficulty of implementing single-stepping backward is that we need some way to identify each instruction. Robert Zulawnick[ZU95] has developed a reverse execution debugger system called Reverse Debugger Using Software Counter method. We did not discuss this method in our paper because the reverse execution is always restarted from the beginning of the program. However, this system uses a very good way to handle singe-stepping backward. The debuggee of this system has to be compiled with a special library and the compiler inserts a function call in front of each statement to count the number of statements. Before a statement is executed, a counter is incremented by 1. When single-stepping backward is required, the debuggee is restarted from the beginning, and a new counter is used. Then the debuggee executes until the new counter is equal to counter-1. The approach gives an illusion of single-stepping backward. The drawback of this system is that it always restart the debuggee from the beginning. However, this software counter method should work with other approaches two. This approach gives single-stepping backward in the statement level.
The problem associated with IGOR is that IGOR uses an object code interpreter during reverse execution and it runs 140 times slower. The other problem of IGOR is that it does not handle signals and system calls automatically.
RECAP handles signals an system calls very well but we believe that RECAP will not work if a program runs too long or the checkpoint interval is too small because RECAP uses fork() to implement checkpointing. The speed of the checkpointing should be fast, but the process table or the swap space will gradually run out.
The Optimal Tracing algorithm works the best for handling signals because it uses a software counter which can regenerate signals at the exact places. In term of overhead and space requirement, optimal tracing also works the best.
IGOR and RECAP can be implemented more easily since they rely on the support from the virtual memory system or the kernel. Optimal Tracing, SYPDER and RECAP are more difficult to implement since they need to instrument the programs. Optimal Tracing, SYPDER and RECAP are not reliable because the data structure for recording history information lives inside the same address space as the debuggee.
5 Our idea of implementing fast data breakpoint and reverse execution
Both same process debugger and separate debugger perform the same debugging tasks. The only reason to separate the debugger from the debuggee is to prevent the debuggee crashing the debugger. If the debugger and the debuggee reside in the same address space, when the debuggee crashes, there is no way to prevent it from crashing the debugger. This means that when the debuggee dies, most likely the debugger dies too. Therefore, all most all current symbolic debuggers use a separate-process debugger architecture.
With separate process debuggers, performance is a major problem of supporting both data breakpoint and reverse execution. For every memory read/write access, some actions such as memory check have to be performed. If the actions are performed in the debugger, the context switches between the debugger and debuggee impose a great deal of overhead. Therefore, some implementations move the actions from the debugger into the kernel or the debuggee itself.
Additionally, simple data breakpoint and control breakpoint are sometimes not good enough. The users may want to add some conditions to the breakpoints. For example, the users may want to stop the debuggee if a+c>b when the data breakpoint or control breakpoint is reached, where a, b, c are the variables of the debuggee. Debuggers like GDB use interpreter to check the conditions. For the above example, if the breakpoint is a data breakpoint, then every time a, or c, or b is changed, several context switches and system calls are needed to check the condition. First the debugger has to make three system calls to get the value of a, c, and b, and then check the condition. If the condition is true, the debugger stops the debuggee and returns the control to the user. If the condition is false, the debuggee continues execution. In order to improve the performance of the conditional control/data breakpoints, the condition check should be inside the kernel or the debuggee.
Code patching has been proved to be the best software implementation of data breakpoint, conditional control breakpoints [KE90], and reverse execution (Optimal Tracing Scheme) [NW94]. Code patching reduces performance overhead by moving some debugging tasks into the debuggee from the debugger. Code patching instruments the debuggee to check every memory accesses if a data breakpoint is involved, examine expressions' condition for control breakpoints, and record history information for reverse execution. The other advantage of code patching is that we can use the same code to implement both data breakpoint and reverse execution since both techniques need to perform some actions on every memory access. The disadvantage of instrumentation is that the data structures and the code for the checking or recording live in the same address space with the debuggee. If the debuggee has corrupted pointers, it may crash the data structures and instrumentation code without the debugger noticing it. This unreliability of code patching prevents itself to be widely adopted as a debugging technique. If code patching has some mechanism to prevent the debuggee from crashing the instrumentation code and its data structures, we can achieve the best of both worlds. In this section, we are going to propose a new design that uses the safe user-level extensions mechanism[CV99] to implement code patching to support conditional control breakpoints, data breakpoints and reverse execution, with protections against the debuggee crashing the instrumentation code and its data structures.

Figure 6: The layout of the user portion of a Palladium process's virtual address space. In this case, the user program has only one extension. The extension also spans the user address space (0-3GByte) and put at SPL 3, and the pages therein are at PPL 1. The extended application itself is at SPL 2, and its pages at PPL 0.
5.1 Safe user-level extensions
Palladium is a novel intra-address space protection mechanism developed by Chiueh[CV99], which exploits the segmentation and paging hardware in the Intel X86 architecture and efficiently supports safe kernel-level and user level extensions in a way that is largely transparent to programmers and existing programming tools. A user-level process in Linux can dynamically load an extension module into its address space using dlopen(), dlsym() and dlclose(). There is no protection between an extensible application and its extension modules. Typically if an extension module crashes, the application being extended crashes too. Palladium provides a protection mechanism for extensible applications such that even if its extension modules crash, the extensible application can still continue running normally.
The Intel X86 architecture virtual memory hardware supports both variable-length segments and fixed-sized pages. Each segment can be in one of four possible segment privilege levels (SPL), and each virtual page can be in one of two possible page privilege levels (PPL). SPL 0 is the most privileged level and SPL 3 is the least privileged level. Similarly, PPL 0 is more privileged than PPL1. By default SPL 0 and SPL 1 are mapped into PPL 0, and SPL2 and SPL3 are mapped into PPL1. Therefore, code segments at SPL 3 do not have the privilege to access pages at PPL 0.
Palladium combines both segment-level privilege and page-level privilege to support intra-address space protection. Figure 6 shows the layout of the user portion of a Palladium process' virtual address space. An extensible application promotes itself to SPL 2, and all of its writable pages are set to PPL 0. It also creates an extension segment that is at SPL3, and all pages of the extension segment are set to PPL1. Both process segment and extension segment span the same 0 to 3 Gbytes; therefore, a user-level extension can access anything in the 0-3 Gbyte address range at the segment level. However, at the page level, an extension cannot access those pages that the application chooses to hide and therefore are at PPL 0, because the paging hardware prevents SPL 3 code segments from accessing PPL 0 pages. Therefore, extensions can only access their own code, data and stack, as well as shared libraries and data regions exposed by the extensible application.
5.2 A new design
Our new design uses code patching with Palladium's intra-address space protection mechanism to implement conditional control breakpoints, data breakpoints and reverse execution. The Optimal Tracing algorithm for reverse execution and the Code Patching scheme for data breakpoints can be combined together to save some performance overheads. For each memory write access, the Optimal Tracing algorithm has to turn the NeedToTrace variable of the memory location to false, and the Code Patching scheme needs to check if the memory location is one of the data breakpoint targets. Therefore, we can add some code into the write_hook() operation of the Optimal Tracing algorithm to perform the same operation as the Code Patching scheme.
To prevent a debuggee from crashing the instrumentation code and its data structures, the instrumentation code and its data structures have to be at higher privilege level than the debuggee. Therefore, the instrumentation code and its data structures are implemented as an extensible application (instrumentation module), and the debuggee is compiled into an extension module (debuggee module). During debugging, the debugger loads the instrumentation module first and puts it at SPL2, and all its pages at PPL0, Then the debugger loads the debuggee module as an extension of the instrumentation module, and puts it at SPL3, and its pages at PPL1, as shown in Figure 7. Even if a debuggee module has corrupted pointers, Palladium can prevent the debuggee module from crashing the instrumentation module.

Figure 7: Layout of the debuggee process's virtual address space using the safe user-level extensions mechanism. The actual debuggee program is put into an extension, and the instrumentation code and its data structures are loaded as an extensible application.
During debugging, the instrumentation module runs first, and it can do some initialization before it calls the main function of the debuggee module to start the debuggee. To support data breakpoints and reverse execution, the above modified version of the write_hook() procedure of the Optimal Tracing algorithm can be implemented as a procedure of the instrumentation module. For every memory write access, if some data breakpoints are set, the procedure also needs to examine if the memory location is a data breakpoint target address. The compiler needs to insert a call instruction before each memory access instruction of the debuggee code to call the procedure.
To support conditional control breakpoints, the expression interpreter can also be implemented as a procedure (interpreter procedure) of the instrumentation module. If a conditional control breakpoint is set at an instruction, such as the instructionm in Figure 8, then the debugger replaces the instruction with a branch to some replacement code which makes a call to the interpreter procedure, as shown in Figure 8. When the debuggee is compiled, the compiler reserves some memory space on the debuggee, so that the debugger can dynamically insert the replacement code into the reserved memory space of the debuggee at runtime if a conditional control breakpoint is set. The purpose of the replacement code is to execute the replaced instruction after the return from the call to the interpreter procedure. In Figure 8, the instructionm' inside the replacement code is the copy of the replaced instruction, and it will be executed before the replacement code jumps back to the successor of the replaced instruction, which is the instructionn in this case. During debugging, the interpreter procedure examines the condition of each control breakpoint and returns the control to the debugger if the condition is true; otherwise, it returns to the replacement code of the debuggee to execute the replaced instruction and continue the debuggee's execution. To support conditional data breakpoints, the procedure for checking memory access can make a call to this interpreter procedure to evaluate the expression if a data breakpoint is reached.

Figure 8: Implementation of conditional control breakpoints using Palladium. In this example, a conditional control breakpoint is set at instructionm. Our design replaces instructionm with a branch instruction that jumps to the dynamically inserted replacement code. The replacement code makes a call to the interpreter procedure of the instrumentation module to evaluate the condition. After the interpreter procedure returns, the replacement code executes instructionm before it branches back to instructionn.
Code patching is not widely employed to implement data breakpoints and reverse execution because of its unreliability. By using our new design, any existing debugger like GDB can be extended safely to support both data breakpoints and reverse execution with lower performance overhead. Though we still use interpreter in our design to implement conditional control breakpoints and data breakpoints, its performance should be a several orders of magnitude faster than any other debuggers which evaluate the condition inside the debugger. Our implementation of conditional control/data breakpoints reduces the performance overhead because it does not need to do any context switches and make any system calls. It can access the memory contents of the debuggee directly.
6 Conclusion
In this survey we have reviewed and studied the current implementations of both data breakpoints and reverse execution. Comparing all the implementations we have reviewed so far, the best way to improve the performance of data breakpoints and reverse execution is to move these two debugging tasks into the debuggee, such as Code Patching scheme and Optimal Tracing do. However, without intra-address space protection, it is not reliable to move the tasks into the debuggee since the debuggee can crash the necessary data structures. This paper proposes a new design based on the safe-extension mechanism of Palladium to implement conditional control/data breakpoints and reverse execution using code patching. The save extension mechanism of Palladium guarantees to prevent the debuggee from crashing the instrumentation code and data structures.
References
[AD90] Hiralal Agrawal, Richard A. DeMillo, and Eugene H. Spafford. Efficient debugging with slicing and backtracking. Technical Report SERC- TR-80-P, Software Engineering Research Center, Purdue University, West Lafayette, IN, 1990
[AD91] H. Agrawal, R. A. DeMillo, and E. H. Spaafford. An Execution- Backtracking Approach to Debugging. IEEE Software, pages
21--26, May 1991.
[Br85] Bernd Bruegge. Adaptability and Portability of Symbolic Debuggers. PhD thesis, Department of computer Science, Carnegie-Mellon University, September 1995. CMU-CS-85-174
[CT94] Max Copperman and Jeff Thomas. Poor man's watchpoints. Xerox Palo Alto Research Center and Kubota Pacific Computer, Inc, Unpublished 1994.
[CV99] Tzi-cker Chiueh, Ganesh Venkitachalam, and Prashant Pradhan. Integrating Segmentation and Paging Protection for Safe, Efficient and Transparent Software Extensions. In Proc. of the 17th ACM Symposium on Operating System Principles (SOSP '99), pages 140-- 153, December 1999.
[FB88] Feldman, S. I. and Brown, C. B. (1988). IGOR: A system for program debugging via reversible execution. In Proc. ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, pages 112--123. Published as SIGPLAN Notices, 24(1), Jan.
[FM96] Abramson, D., Foster, I., Michalakes, J., Sosic, R., Relative Debugging: A New Methodology for Debugging Scientific Applications, Communications of the ACM, Vol. 39, No. 11, 69 -- 77 (November 1996).
[GO86] L Gugerty and G Olson; Debugging by skilled and novice programmers, Conference proceedings on Human factors in computing systems , 1986, Pages 171 - 174
[IN99] Intel Architecture Software developer's manual, Volume 3: system programming 1999
[KE90] Peter B. Kessler. Fast breakpoints: Design and implementation. In Proceedings of the 1990 Conference on Programming Language Design and Implementation (PLDI), pages 78--84, June 1990.
[KE93] David Keppel. Fast data breakpoints. Technical Report 93-04-06,
University of Washington, 1993.
[KO] Adam Kolawa, The Evolution of Software Debugging, http://www.parasoft.com/papers/vision.htm
[ML89] J. M. Mellor-Crummey and T. J. LeBlanc; A software instruction counter, Proceedings of the third international conference on Architectural support for programming languages and operating systems , A software instruction counter 1989, Pages 78 - 86
[MS94] S. Mukherjea and J.T. Stasko. Toward visual debugging: Integrating algorithm animation capabilities within a a source-level debugger. ACM Trans. on ComputerHuman Interaction 1(3), pp. 215-244, 1994.
[NW94] Robert H. B. Netzer and Mark H. Weaver. Optimal tracing and incremental reexecution for debugging long-running programs. In ACM SIGPLAN Conference on Programming Language Design and implementation, Orlando, FL, June 1994
[OA88] Edward G. Okie and James D. Arthur; The execution history approach to intelligent debugging, Proceedings of the 1988 ACM sixteenth annual conference on Computer science , 1988, Pages 273 - 281
[OD96] OpenVMS Debugger Manual
http://sumerwww.nascom.nasa.gov/doc/odl/openvms/ssb71/4538/4538P.HTM
November 1996
[PA91] Hsin Pan, Debugging with Dynamic Instrumentation and Test-Based Knowledge, Technical Report SERC-TR-105-P, Department of Computer Sciences, Purdue University,1991
[PL89] Douglas Z. Pan and Mark A. Linton. Supporting Reverse Execution of Parallel Programs. Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, 24(1):124--129, January 1989.
[Pr56] SunOS 5.6 proc manual
[PX95] J. S. Plank, J. Xu, and R. H. B. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS- 95-302, University of Tennessee, August 1995.
[RO96] Jonathan B. RosenBerg, How Debuggers Work Algorithms, Data Structures, and Architecture, Wiley Computer Publishing John Wiley & Sons, Inc. 1996
[RP96] Paul E. Roberts, Implementation and Evaluation of Data Breakpoint
Schemes in an Interactive Debugger, thesis, Department of Computer
Science, The University of Utah, December 1996
[SH96] Michael W. Shapiro , Tracing of User Programs for Incremental Replay, Dept. of Computer Science Brown University, Technical Report, May 19, 1997
[SH97] Michael W. Shapiro, RDB: A System for Incremental Replay Debugging, Dept. of Computer Science Brown University, Technical Report, May 19, 1997
[SO94] Sosic, Rok. History cache: hardware support for reverse execution. Computer Architecture News, 22 , 5 , 11-18. 1994
[SR99] Richard M. Stallman and Roland H. Pesch, Debugging with GDB The GNU Source-Level Debugger Seventh Edition, for GDB version 4.18
GDB document, Free Software Foundation, Inc. February 1999
[WA92] R. Wahbe. Efficient data breakpoints. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 200--212, Boston, Massachusetts, 1992.
[WL93] Rober Wahbe. Steven Lucco, and Susan L. Graham. Practical data breakpoints: Design and implementation. In Proceedings of the
ACMSIGPLAN'93 Conference on Programming Language Design and Implementation, pages 1-12, June 1993
[ZU95] Robert Zulawnick, Reverse Debugger Using Software Counter Method, master thesis, university of southern maine school of applied science, 1995