Effective Data-Race Detection for the Kernel

John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk
Microsoft Research
{jerick, madanm, sburckha, kirko}@microsoft.com

Abstract

Data races are an important class of concurrency errors where two threads erroneously access a shared memory location without appropriate synchronization. This paper presents DataCollider, a lightweight and effective technique for dynamically detecting data races in kernel modules. Unlike existing data-race detection techniques, DataCollider is oblivious to the synchronization protocols (such as locking disciplines) the program uses to protect shared memory accesses. This is particularly important for low-level kernel code that uses a myriad of complex architecture/device-specific synchronization mechanisms. To reduce the runtime overhead, DataCollider randomly samples a small percentage of memory accesses as candidates for data-race detection. The key novelty of DataCollider is that it uses breakpoint facilities already supported by many hardware architectures to achieve negligible runtime overheads. We have implemented DataCollider for the Windows 7 kernel and have found 25 confirmed erroneous data races, of which 12 have already been fixed.

1. Introduction

Concurrent systems are hard to design, arguably because of the difficulties of finding and fixing concurrency errors. Data races are an important class of concurrency errors, where the program fails to use proper synchronization when accessing shared data. The effects of an erroneous data race can range from immediate program crashes to silent lost updates and data corruptions that are hard to reproduce and debug.

Two memory accesses in a program are said to conflict if they access the same memory location and at least one of them is a write. A program contains a data race if two conflicting accesses can occur concurrently. Figure 1 shows a variation of a data race we found in the Windows kernel. The threads appear to be accessing different fields. However, these bit-fields are mapped to the same word by the compiler, and the concurrent accesses result in a data race. In this case, an update to the statistics field possibly hides an update to the status field.

This paper presents DataCollider, a tool for dynamically detecting data races in kernel modules. DataCollider is lightweight. It samples a small number of memory accesses for data-race detection and uses code-breakpoint and data-breakpoint¹ facilities available in modern hardware architectures to efficiently perform this sampling. As a result, DataCollider has no runtime overhead for non-sampled memory accesses, allowing the tool to run with negligible overheads for low sampling rates.

We have implemented DataCollider for the 32-bit Windows kernel running on the x86 architecture, and used it to detect data races in the core kernel and several modules such as the filesystem, the networking stack, the storage drivers, and a network file system. We have found a total of 25 erroneous data races, of which 12 have already been fixed at the time of writing. In our experiments, the tool is able to find erroneous data races for sampling rates that incur runtime overheads of less than 5%.

Researchers have proposed a multitude of dynamic data-race detectors [1,2,3,4,5,6,7] for user-mode programs. In essence, these tools work by dynamically monitoring the memory accesses and synchronizations performed during a concurrent execution. As data races manifest rarely at runtime, these tools attempt to infer conflicting accesses that could have executed concurrently. The tools differ in how they perform this inference, either

¹ Data breakpoints are also called hardware watchpoints.
using the happens-before [8] ordering induced by the synchronization operations [4,5,6], lock-set-based reasoning [1], or a combination of the two [2,3,7].

struct{
  int status:4;
  int pktRcvd:28;
} st;

Thread 1            Thread 2
st.status = 1;      st.pktRcvd ++;

Figure 1: An example of a data race. Even though the threads appear to be modifying different variables in the source code, the variables are bit fields mapping to the same integer.

AtPeriodicIntervals() {
  // determine k based on desired
  // memory access sampling rate
  repeat k times {
    pc = RandomlyChosenMemoryAccess();
    SetCodeBreakpoint( pc );
  }
}

OnCodeBreakpoint( pc ) {
  // disassemble the instruction at pc
  (loc, size, isWrite) = disasm( pc );
  DetectConflicts(loc, size, isWrite);
  // set another code breakpoint
  pc = RandomlyChosenMemoryAccess();
  SetCodeBreakpoint( pc );
}

DetectConflicts( loc, size, isWrite ) {
  temp = read( loc, size );
  if ( isWrite )
    SetDataBreakpointRW( loc, size );
  else
    SetDataBreakpointW( loc, size );

  delay();

  ClearDataBreakpoint( loc, size );
  temp' = read( loc, size );

  if ( temp != temp' || data breakpoint fired )
    ReportDataRace( );
}

Figure 2: The basics of the DataCollider algorithm. Right before a read or write access to a shared memory location, chosen at random, DataCollider monitors for any concurrent accesses that conflict with the current access.

There are several challenges in engineering a data-race detection tool for the kernel based on previous approaches. First, kernel-mode code operates at a lower concurrency abstraction than user-mode code, which can rely on clean abstractions of threads and synchronizations provided by the kernel. In the kernel, the same thread context can execute code from a user-mode process, a device interrupt service routine, or a deferred procedure call (DPC). In addition, it is an onerous task to understand the semantics of complex synchronization primitives in order to infer the happens-before relation or lock-sets. For instance, Windows supports more than a dozen locks with different semantics on how the lock holder synchronizes with hardware interrupts, the scheduler, and the DPCs. It is also common for kernel modules to roll out custom implementations of synchronization primitives.

Second, hardware-facing kernel modules need to synchronize with hardware devices that concurrently modify device state and memory. It is important to design a data-race detection tool that can find these otherwise hard-to-find data races between the hardware and the kernel.

Finally, existing dynamic data-race detectors add prohibitive run-time overheads. It is not uncommon for such tools to incur up to 200x slowdowns [9]. The overhead is primarily due to the need to monitor and process all memory and synchronization operations at run time. Significant engineering effort in building data-race detectors goes into reducing the runtime overhead and the associated memory and log management [9,3]. Replicating these efforts within the constraints of kernel programming is an arduous, if not impossible, task. Moreover, these tools rely on invasive instrumentation techniques that are difficult to get right on low-level kernel code.

DataCollider uses a different approach to overcome these challenges. The crux of the algorithm is shown in Figure 2. DataCollider samples a small number of memory accesses at runtime by inserting code breakpoints at randomly chosen memory access instructions. When a code breakpoint fires, DataCollider detects data races involving the sampled memory access for a small time window. It simultaneously employs two strategies to do so. First, DataCollider sets a data breakpoint to trap conflicting accesses by other threads. To detect conflicting writes performed by hardware devices and by processors accessing the memory location through a different virtual address, DataCollider uses a repeated-read strategy. It reads the value once before and once after the delay. A change in value is an indication of a conflicting write, and hence a data race.

The DataCollider algorithm has two features that make it suitable for kernel data-race detection. First and foremost, it is easy to implement. Barring some implementation details (Section 3), the entire algorithm is shown in Figure 2. In addition, it is entirely oblivious to the synchronization protocols used by the kernel and the hardware, a welcome design point as DataCollider does not have to understand the complex semantics of kernel synchronization primitives.

When DataCollider finds a data race through the data-breakpoint strategy, it catches both threads "red-handed," as they are about to execute conflicting accesses. This greatly simplifies the debugging of data-race reports from DataCollider, as the tool can collect useful debugging information, such as the stack trace of the racing threads along with their context information, without incurring this overhead on non-sampled or non-racy accesses.

Not all data races are erroneous. Such benign races include races that do not affect the program outcome, such as updates to logging/debugging variables, and races that affect the program outcome in a manner acceptable to the programmer, such as conflicting updates to a low-fidelity counter. DataCollider uses a post-processing phase that prunes and prioritizes the data-race reports before showing them to the user. In our experience with DataCollider, we have observed that only around 10% of data-race reports correspond to real errors, making the post-processing step absolutely crucial for the usability of the tool.

2. Background and Motivation

Shared-memory multiprocessors are specifically built to allow concurrent access to shared data. So why do data races represent a problem at all?

The key motivation for data-race detection is the empirical fact that programmers most often use synchronization to restrict accesses to shared memory. Data races can thus be an indication of incorrect or insufficient synchronization in the program. In addition, data races can also reveal programming mistakes not directly related to concurrency, such as buffer overruns or use-after-free, which indirectly result in inadvertent sharing of memory.

Another important reason for avoiding data races is to protect the program from the weak memory models of the compiler and the hardware. Both the compiler and hardware can reorder instructions and change the behavior of racy programs in complex and confusing ways [10,11]. Even if a racy program works correctly for the current compiler and hardware configuration, it might fail on future configurations that implement more aggressive memory-model relaxations.

While bugs caused by data races may of course be found using more conventional testing approaches such as stress testing, the latter often fails to provide actionable information to the programmer. Clearly, a data-race report including stack traces or data values (or, even better, a core dump demonstrating the actual data race) is easier to understand and fix than a silent data corruption that leads to an obscure failure at some later point during program execution.

2.1. Definition of Data Race

There is no "gold standard" for defining data races; several researchers have used the term to mean different things. For our definition, we consulted two respected standards (Posix threads [12] and the drafts of the C++ and C memory model standards [11,10]) and generalized their definitions to account for the particularities of kernel code. Our definition of data race is:

• Two operations that access main memory are called conflicting if
  o the physical memory they access is not disjoint,
  o at least one of them is a write, and
  o they are not both synchronization accesses.
• A program has a data race if it can be executed on a multiprocessor in such a way that two conflicting memory accesses are performed simultaneously (by processors or any other device).

This definition is a simplification of [11,10] insofar as we replaced the tricky notion of "not ordered before" with the unambiguous "performed simultaneously" (which refers to real time).

An important part of our definition is the distinction between synchronization and data accesses. Clearly, some memory accesses participate in perfectly desirable races: for example, a mutex implementation may perform a "release" by storing the value 0 in a shared location, while another thread is performing an acquire and reads the same memory location. However, this is not a data race because we categorize both of these accesses as synchronization accesses. Synchronization accesses either involve hardware synchronization primitives such as interlocked instructions or use volatile or atomic annotations supported by the compiler.

Note that our definition is general enough to apply to code running in the kernel, which poses some unique problems not found in user-mode code. For example, in some cases data races can be avoided by turning off interrupts; also, processes can exhibit a data race when accessing different virtual addresses that map to the same physical address. We talk more about these topics in Section 2.3.4.

2.2. Precision of Detection

Clearly, we would like data-race detection tools to report as many data races as possible without inundating the user with false error reports. We use the following terminology to discuss the precision and completeness of data-race detectors. A missed race is a data race that the tool does not warn about. A benign data race is a data race that does not adversely affect the behavior of the program. Common examples of benign data races include threads racing on updates to logging or statistics variables and threads concurrently updating a shared counter where the occasional incorrect update of the counter does not affect the outcome of the program. On the other hand, a false data race is an error report that does not correspond to a data race in the program. Static data-race detection techniques commonly produce false data races due to their inherent inability to precisely reason about program paths, aliased heap objects, and function pointers. Dynamic data-race detectors can report false data races if they do not identify or do not understand the semantics of all the synchronizations used by the program.

2.3. Related Work

Researchers have proposed and built a plethora of race detection tools. We now discuss the major approaches and implementation techniques appearing in related work. We describe both happens-before-based and lock-set-based tracking in some detail (Sections 2.3.2 and 2.3.3), before explaining why neither one is very practical for data-race detection in the kernel (Section 2.3.4).

2.3.1. Static vs. Dynamic

Data-race detection can be broadly categorized into static race detection [13,14,15,16,17], which typically analyzes source or byte code without directly executing the program, and dynamic race detection [1,2,3,4,5,6,7], which instruments the program and monitors its execution online or offline.

Static race detectors have been successfully applied to large code bases [13,14]. However, as they rely on approximate information, such as pointer aliasing, they are prone to excessive false warnings. Some tools, especially those targeting large code bases, approach this issue by filtering the reported warnings using heuristics [13]. Such heuristics can successfully reduce the false warnings to a tolerable level, but may unfortunately also eliminate correct warnings and lead to missed races. Other tools, targeted towards highly motivated users who wish to interactively prove the absence of data races, report all potential races to the user and rely on user-supplied annotations that indicate synchronization disciplines [16,17].

Dynamic data-race detectors are less prone to false warnings than static techniques because they monitor an actual execution of the program. However, they may miss races because successful detection might require an error-inducing input and/or an appropriate thread schedule. Also, many dynamic detectors employ several heuristics and approximations that can lead to false alarms.

Dynamic data-race detectors can be classified into categories based on whether they model a happens-before relation [6,5,7] (see Section 2.3.2), lock sets [1] (see Section 2.3.3), or both [2,18].

2.3.2. Happens-Before Tracking

Dynamic data-race detectors do not just detect data races that actually took place (in the sense that the conflicting accesses were truly simultaneous during the execution), but look for evidence that such a schedule would have been possible for a slightly different timing. Tracking a happens-before relation on program events [8] is one way to infer the existence of a racy schedule. This transitive relation is constructed by recording both the ordering of events within a thread and the ordering effects of synchronization operations across threads.

Once we can properly track the happens-before relation, race detection is straightforward: for any two conflicting accesses A and B, we simply check whether A happens-before B, or B happens-before A, or neither. If neither, we know there exists a schedule where A and B are simultaneous. If properly tracked, happens-before does not lead to any false alarms. However, precise tracking can be difficult to achieve in practice, as discussed in Section 2.3.4.

2.3.3. Lock Sets

When detecting races in programs that follow a strict and consistent locking discipline, using a lock-set approach can provide some benefits. The basic idea is to examine the lock set of each data access (that is, the set of locks held during the access) and then to take for each memory location the intersection of the lock sets of all accesses to it. If that intersection is empty, the variable is not consistently protected by any one lock and a warning is issued.

The main limitation of the lock-set approach is that it does not check for true data races but for violations of a specific locking discipline. Unfortunately, many applications (and in particular kernel code) use locking disciplines that are complex and use synchronization other than locks.

Whenever a program departs from a simple locking scheme in any of the above ways, lock-set-based race detectors will be forced to either issue false warnings, or to use heuristics to suppress these warnings. The latter approach is common, especially in the form of state machines that track the "sharing status" of a variable [1,3]. Such heuristics are necessarily imperfect compromises, however (they always fail to suppress some false warnings and always suppress some correct warnings), and it is not clear how to tune them to be useful for a wide range of applications.

2.3.4. Problems with Tracking Synchronizations

Both lock-set and happens-before tracking require a thorough understanding of the synchronization semantics, lest they produce false alarms or miss races. There are two fundamental difficulties we encountered when trying to apply these techniques in the kernel:

• Abstractions that we take for granted in user mode (such as threads) are no longer clearly defined in kernel mode.
• The synchronization vocabulary of kernel code is much richer and may include complicated sequences and ordering mechanisms provided by the hardware.

For example, interrupts and interrupt handlers break the thread abstraction, as the handler code may execute in a thread context without being part of that thread in a logical sense. Similar problems arise when a thread calls into the kernel scheduler. The code executing in the scheduler is not logically part of that same thread.

Another example illustrating the difficulty of modeling synchronization inside the kernel is DMA accesses. Such accesses are not executing inside a thread (in fact, they are not even executing on a processor). Clearly, traditional monitoring techniques have a problem because they cannot "instrument" the DMA access.

A similar case holds for interrupt processing. For example, code may first write some data and then raise an interrupt, and then the same data is read by an interrupt handler. Lock sets would report a false alarm because the data is not locked. But even happens-before techniques are problematic, because they would need to precisely track the causality between the instruction that set the interrupt and the interrupt handler.

For these reasons, we decided to employ a design that entirely avoids modeling the happens-before ordering or lock-sets. As our results show, somewhat surprisingly, neither one is required to build an effective data-race detector.

2.3.5. Sampling to Reduce Overhead

To detect races, dynamic data-race detectors need to monitor the synchronizations and memory accesses performed at runtime. This is typically done by instrumenting the code and inserting extra monitoring code for each data access. As the monitoring code executes at every memory access, the overhead can be quite substantial.

One way to ameliorate this issue is to exclude some data accesses from processing. Prior work has identified several promising strategies: adaptive sampling that backs off hot locations [5] (the idea is that for such locations the monitoring can be less frequent and still detect races), or performing the full monitoring only for a fixed fraction of the time [4] (the idea is that the probability of catching a race is roughly proportional to this fraction multiplied by the number of times the race repeats). But these techniques still suffer from the cost of sampling, performed at every memory access. DataCollider avoids this problem by using hardware breakpoint mechanisms.
3. DataCollider Implementation

This section describes the implementation of the DataCollider algorithm for the Windows kernel on the x86 architecture. The implementation heavily uses the code and data breakpoint mechanisms available on x86. The techniques described in this paper can be extended to other architectures and to user-mode code, but we have not pursued this direction in this paper.

Figure 2 describes the basics of the DataCollider algorithm. DataCollider uses the sampling algorithm, described in Section 3.1, to process a small percentage of memory accesses for data-race detection. For each of the sampled memory accesses, DataCollider uses a conflict-detection mechanism, described in Section 3.2, to find data races involving the sampled access. After detecting data races, DataCollider uses several heuristics, described in Section 3.3, to prune benign data races.

3.1. The Sampling Algorithm

There are several challenges in designing a good sampling algorithm for data-race detection. First, data races involve two memory accesses, both of which need to be sampled to detect the race. If memory accesses are sampled independently, then the probability of finding the data race is a product of the individual sampling probabilities. DataCollider avoids this multiplicative effect by sampling the first access and using a data breakpoint to trap the second access. This allows DataCollider to be effective at low sampling rates.

Second, data races are rare events – most executed instructions do not result in a data race. The sampling algorithm should weed out the small percentage of racing accesses from the majority of non-racing accesses. The key intuition behind the sampling algorithm is that if a program location is buggy and fails to use the right synchronization when accessing shared data, then every dynamic execution of that buggy code is likely to participate in a data race. Accordingly, DataCollider performs static sampling of program locations rather than dynamic sampling of executed instructions. A static sampler provides equal preference to rarely executed instructions (which are likely to have bugs hidden in them) and frequently executed instructions.

3.1.1. Static Sampling Using Code Breakpoints

The static sampling algorithm works as follows. Given a program binary, DataCollider disassembles the binary to generate a sampling set consisting of all program locations that access memory. The tool currently requires the debugging symbols of the program binary to perform this disassembly. This requirement can be relaxed by using sophisticated disassemblers [19] in the future.

DataCollider performs a simple static analysis to identify instructions that are guaranteed to only touch thread-local stack locations and removes them from the sampling set. Similarly, DataCollider removes synchronizing instructions from the sampling set by removing instructions that access memory locations tagged as "volatile" or those that use hardware synchronization primitives, such as interlocked. This prevents DataCollider from reporting races on synchronization variables. However, DataCollider can still detect a data race between a synchronization access and a regular data access, if the latter is in the sampling set.

DataCollider samples program locations from the sampling set by inserting code breakpoints. The initial breakpoints are set at a small number of program locations chosen uniformly randomly from the sampling set. If and when a code breakpoint fires, DataCollider performs conflict detection for the memory access at that breakpoint. Then, DataCollider chooses another program location uniformly randomly from the sampling set and sets a breakpoint at that location.

This algorithm uniformly samples all program locations in the sampling set irrespective of the frequency with which the program executes these locations. This is because the choice of inserting a code breakpoint is performed uniformly at random for all locations in the sampling set. Over a period of time, the breakpoints will tend to reside at rarely executed program locations, increasing the likelihood that those locations are sampled the next time they execute.

If DataCollider has information on which program locations are likely to participate in a race, either through user annotations or through prior analysis [20], then the tool can prioritize those locations by biasing their selection from the sampling set.

3.1.2. Controlling the Sampling Rate

While the program cannot affect the sampling distribution over program locations, the sampling rate is intimately tied to how frequently the program executes locations with a code breakpoint. In the worst case, if all of the breakpoints are set on dead code, DataCollider will stop performing data-race detection altogether. To avoid this and to better control the sampling rate, DataCollider periodically checks the number of breakpoints fired every second, and adjusts the number of breakpoints set in the program based on whether the experienced sampling rate is higher or lower than the target rate.

3.2. Conflict Detection

As described in the previous section, DataCollider picks a small percentage of memory accesses as likely candidates for data-race detection. For these sampled accesses, DataCollider pauses the current thread, waiting to see if another thread makes a conflicting access to the same memory location. It uses two strategies: data breakpoints and repeated reads. DataCollider uses these two strategies simultaneously, as each complements the weaknesses of the other.

3.2.1. Detecting Conflicts with Data Breakpoints

Modern hardware architectures provide a facility to trap when a processor reads or writes a particular memory location. This is crucial for efficient support for data breakpoints in debuggers. The x86 hardware supports four data breakpoint registers. DataCollider uses them to effectively monitor possible conflicting accesses to the currently sampled access.

When the current access is a write, DataCollider instructs the processor to trap on a read or write to the memory location. If the current access is a read, DataCollider instructs the processor to trap only on a write, as concurrent reads to the same location do not conflict. If no conflicting accesses are detected, DataCollider resumes the execution of the current thread after clearing the data breakpoint registers.

Each processor has a separate set of data breakpoint registers. DataCollider uses an inter-processor interrupt to update the breakpoints on all processors atomically. This also synchronizes multiple threads attempting to sample different memory locations concurrently.

An x86 instruction can access variable-sized memory. For 8-, 16-, or 32-bit accesses, DataCollider sets a breakpoint of the appropriate size. The x86 processor traps if another instruction accesses a memory location that overlaps with a given breakpoint. Luckily, this is precisely the semantics required for data-race detection. For accesses that span more than 32 bits, DataCollider uses more than one breakpoint, up to the maximum available of four. If DataCollider runs out of breakpoint registers, it simply resorts to the repeated-read strategy discussed below.

When a data breakpoint fires, DataCollider has successfully detected a race. More importantly, it has caught the racing threads "red-handed" – the two threads are at the point of executing conflicting accesses to the same memory location.

One particular shortcoming of data breakpoint support in x86 that we had to work around was the fact that, when paging is enabled, x86 performs the breakpoint comparisons based on the virtual address and has no mechanism to modify this behavior. Two concurrent accesses to the same virtual addresses but different physical addresses do not race. In Windows, most of the kernel resides in the same address space, with two exceptions.

Kernel threads accessing the user address space cannot conflict if the threads are executing in the context of different processes. If a sampled access lies in the user address space, DataCollider does not use breakpoints and defaults to the repeated-read strategy.

Similarly, a range of the kernel address space, called session memory, is mapped to different address spaces based on the session the process belongs to. When a sampled access lies in the session memory space, DataCollider sets a data breakpoint but checks if the conflicting accesses belong to the same session before reporting the conflict to the user.

Finally, a data breakpoint will miss conflicts if a processor uses a different virtual address mapped to the same physical address as the sampled access. Similarly, data breakpoints cannot detect conflicts arising from hardware devices directly accessing memory. The repeated-read strategy discussed below covers all these cases.

3.2.2. Detecting Conflicts with Repeated Reads

The repeated-read strategy relies on a simple insight: if a conflicting write changes the value of a memory location, DataCollider can detect this by repeatedly reading the memory location, checking for value changes. An obvious disadvantage of this approach is that it cannot detect conflicting reads. Similarly, it cannot detect multiple conflicting writes the last of which writes the same value as the initial value. Despite these shortcomings, we have found this strategy to be very useful in practice. This is the first strategy we implemented (as it is easier to implement than using data breakpoints) and we were able to find several kernel bugs with this approach.
However, the repeated-read strategy catches only one of the two threads "red-handed." This makes it harder to debug data races, as one does not know which thread or device was responsible for the conflicting write. This was our prime motivation for using data breakpoints.

3.2.3. Inserting Delays

For a sampled memory access, DataCollider attempts to detect a conflicting access to the same memory location by delaying the thread for a short amount of time. For DataCollider to be successful, this delay has to be long enough for the conflicting access to occur. On the other hand, delaying the thread for too long can be dangerous, especially if the thread holds some resource crucial for the proper functioning of the entire system. In general, it is impossible to predict how long the delay should be. After experimenting with many values, we chose the following delay algorithm.

Depending on the IRQL (Interrupt Request Level) of the executing thread, DataCollider delays the thread for a preset maximum amount of time. At IRQLs higher than the DISPATCH level (the level at which the kernel scheduler operates), DataCollider does not insert any delay. We considered inserting a small window of delay at this level to identify possible data races between interrupt service routines, but we did not expect DataCollider to be effective at such short delays.

Threads running at the DISPATCH level cannot yield the processor to another thread. As such, the delay is simply a busy loop. We currently delay threads at this level for a random amount of time less than 1 ms. For lower IRQLs, DataCollider delays the thread for a maximum of 15 ms by spinning in a loop that yields the current time quantum. During this loop, the thread repeatedly checks whether other threads are making progress by inspecting the rate at which breakpoints fire. If progress is not detected, the waiting thread prematurely stops its wait.

3.3. Dealing with Benign Data Races

Research on data-race detection has amply noted the fact that not all data races are erroneous. A practical data-race detection tool should effectively prune or deprioritize these benign data races when reporting to the user. However, inferring whether or not a data race is benign can be tricky and might require a deep understanding of the program. For instance, a data race between two concurrent non-atomic counter updates might be benign if the counter is a statistics variable whose fidelity is not important to the behavior of the program. However, if the counter is used to maintain the number of references to a shared object, then the data race could lead to a memory leak or a premature free of the object.

During the initial runs of the tool, we found that around 90% of the data-race reports were benign. Inspecting these, we identified the following patterns, which can be recognized through simple static and/or dynamic analysis, and incorporated them into a post-processing pruning phase.

Statistics Counters: Around half of the benign data races involved conflicting updates to counters that maintain various statistics about the program behavior [21]. These counters are not necessarily write-only and could affect the control flow of the program. A common scenario is to use these counter values to perform periodic computation, such as flushing a log buffer. If DataCollider reports several data races involving an increment instruction, and the value of the memory location consistently increases across these reports, then the pruning phase tags these data races as statistics-counter races. Checking for an increase in memory values helps the pruning phase distinguish these statistics counters from reference counters, which are usually both incremented and decremented.

Safe Flag Updates: The next prominent class of benign races involves a thread reading a flag bit in a memory location while another thread updates a different bit in the same memory location. By analyzing a few memory instructions before and after the memory access, the pruning phase identifies read-write conflicts that involve different bits. On the other hand, write-write conflicts can result in lost updates (as shown in Figure 1) and are not tagged as benign.

Special Variables: Some of the data races reported by DataCollider involve special variables in the kernel where races are expected. For instance, Windows maintains the current time in a variable that is read by many threads while being updated by the timer interrupt. The pruning phase has a database of such variables and prunes races involving them.

While it is possible to design other patterns that identify benign data races, one has to trade off the benefit of the pruning achieved against the risk of missing real data races. For instance, we initially designed a pattern to classify two writes that write the same value as benign. However, very few data-race reports matched this property. On the other hand, Figure 4 shows an example of a harmful data race that we found involving two such writes.

Also, we have made an explicit decision to make the benign data races available to the user, but deprioritized
against races that are less likely to be benign. Some of our users are interested in browsing through the pruned benign races to identify potential portability problems and memory-model issues in their code. We also found an instance where a benign race, despite being harmless, indicated unintended sharing in the code and resulted in a design change.

Data Races Reported           Count
Fixed                         12
Confirmed and Being Fixed     13
Under Investigation           8
Harmless                      5
Total                         38

Figure 3: Bugs reported to the developers after excluding benign data-race reports.

4. Evaluation

There are two metrics for measuring the success of a data-race detection tool. First, is it able to find data races that programmers deem important enough to fix? Second, is it able to scale to a large system, which in our case is the Windows operating system, with reasonable runtime overheads? This section presents a case for an affirmative claim on both metrics.

4.1. Experimental Setup

For the discussion in this section, we applied DataCollider to several modules in the Windows operating system. DataCollider has been used on class drivers, various PnP drivers, local and remote file system drivers, storage drivers, and the core kernel executive itself. We are able to successfully boot the operating system with DataCollider and run existing kernel stress tests.

4.2. Bugs Found

Figure 3 presents the data-race reports produced by the different versions of DataCollider during its entire development. We reported a total of 38 data-race reports to the developers. This figure does not reflect the number of benign data races pruned heuristically and manually; we defer the discussion of benign data races to Section 4.4.

Of these 38 reports, 25 have been confirmed as bugs, and 12 of them have already been fixed. The developers indicated that 5 of these are indeed harmless. For instance, one of the benign data races results in a driver issuing an idempotent request to the device. While this could result in a performance loss, the expected frequency of the data race did not justify the cost of adding synchronization in the common case. Identifying such benign races requires intimate knowledge of the code and would not be possible without the programmers' help.

As DataCollider naturally delays the racing access that temporally occurs first, it is likely to explore both outcomes of the race. Despite this, only one of the 38 data races crashed the kernel in our experiments. This indicates that the effects of an erroneous data race are not immediately apparent for the particular input or the hardware configuration of the current run.

We discuss two interesting error reports below.

4.2.1. A Boot Hang Caused by a Data Race

A hardware vendor was consistently seeing a kernel hang at boot-up time. This was not reproducible in any of the in-house machine configurations, until the vendor actually shipped the hardware to the developers. After inspecting the hang, a developer noticed a memory corruption in a driver that could be the result of a race condition. When analyzing the driver in question, DataCollider found the data race within an hour of testing on a regular in-house machine (on which the kernel did not hang). Once the source of the corruption was found (a status update performed non-atomically), the bug was immediately fixed.
void AddToCache() {
    // ...
A:  x &= ~(FLAG_NOT_DELETED);
B:  x |= FLAG_CACHED;

    MemoryBarrier();
    // ...
}

AddToCache();
assert( x & FLAG_CACHED );

Figure 4: An erroneous data race when the AddToCache function is called concurrently. Though the data race appears benign, as the conflicting accesses "write the same values," the assert can fail on some thread schedules.

Figure 5: Runtime overhead of DataCollider with increasing sampling rate, measured in terms of the number of code breakpoints firing per second. The overhead tends to zero as the sampling rate is reduced, indicating that the tool has negligible base overhead.

Figure 6: The number of data races, uniquely identified by the pair of racing program locations, against runtime overhead. DataCollider is able to report data races even at overheads under 5%.

4.2.2. A Not-So-Benign Data Race

Figure 4 shows an erroneous data race. The function AddToCache performs two non-atomic updates to the flag variable. DataCollider produced an error report with two threads simultaneously updating the flag at location B. Usually, two instructions writing the same values is a good hint that the data race is benign. However, the presence of the memory barrier indicated that this report required further attention: the developer was well aware of the consequences of concurrency, and the rest of the code relied on crucial invariants on the flag updates. When we reported this data race to the developer, he initially tagged it as benign. On further discussion, we discovered that the code relied on the invariant that the CACHED bit is set after a call to AddToCache. The data race can break this invariant when a concurrent thread overwrites the CACHED bit when performing the update at A, but gets preempted before setting the bit at B.
4.2.3. How Fixed

While data races can be hard to find and can result in mysterious crashes, our experience is that most are relatively easy to fix. Of the 12 fixed bugs, 3 were the result of missing locks. The developer could easily identify the locking discipline that was meant to be followed and could decide which lock to add without the fear of a deadlock. 6 data races were fixed by using an atomic instruction, such as an interlocked increment, to perform the read-modify-write of the shared variable. 2 bugs were the result of unintended sharing and were fixed by making the particular variable thread-local. Finally, one bug indicated a broken design due to a recent refactoring and resulted in a design change.

4.3. Runtime Overhead

Users have an inherent aversion to dynamic analysis tools that add prohibitive runtime overheads. The obvious reason is the associated wastage of test resources: a slowdown of ten means that only one-tenth the amount of testing can be done with a given amount of resources. More importantly, runtime overheads introduced by a tool can affect the real-time execution of the
program. The operating system could start a recovery action if a device interrupt takes too long to finish. Or a test harness can incorrectly tag a kernel build as faulty if it takes too long to boot.

To measure the runtime overhead of DataCollider, we repeatedly measured the time taken for the boot-shutdown sequence at different sampling rates and compared against a baseline Windows kernel running without DataCollider. These experiments were done on the x86 version of Windows 7 running on a virtual machine with 2 processors and 512 MB of memory. The host machine is an Intel Core2-Quad 2.4 GHz machine with 4 GB of memory running Windows Server 2008. The guest machine was limited to 50% of the processing resources of the host. This was done to prevent any background activity on the host from perturbing the performance of the guest.

Figure 5 shows the runtime overhead of DataCollider for different sampling rates, measured by the average number of code breakpoints fired per second during the run. As expected, the overhead increases roughly linearly with the sampling rate. More interestingly, as the sampling rate tends to zero, DataCollider's overhead reaches zero. This indicates that DataCollider can be "always on" in various testing and deployment scenarios, allowing the user to tune the overhead to any acceptable limit.

Figure 6 shows the number of data races detected for different runtime costs. DataCollider is able to detect data races even at overheads of less than 5%, indicating the utility of the tool at low overheads.

4.4. Benign Data Races

Finally, we performed an experiment to measure the efficacy of our pruning algorithm for benign data races. The results are shown in Figure 7. We enabled DataCollider while running kernel stress tests for 2 hours, sampling at approximately 1000 code breakpoints per second. DataCollider found a total of 113 unique data races. The patterns described in Section 3.3 can identify 86 (76%) of these as benign. We manually (and painfully) triaged these reports to ensure that these races were truly benign. Of the remaining races, we manually identified 18 as not erroneous. 8 of them involved the double-checked locking idiom, where a thread performs a racy read of a flag without holding a lock, but reconfirms the value after acquiring the lock. 8 were accesses to volatile variables whose type DataCollider's analysis was unable to infer; these reports can be avoided with a more sophisticated analysis for determining the program types. This table demonstrates that a significant percentage of benign data races can be heuristically pruned without the risk of missing real data races. During this process, we found 9 potentially harmful data races, of which 5 have already been confirmed as bugs.

Data Race Category                        Count
Benign – Heuristically Pruned
    Statistics Counter                    52
    Safe Flag Update                      29
    Special Variable                      5
    Subtotal                              86
Benign – Manually Pruned
    Double-checked locking                8
    Volatile                              8
    Write Same Value                      1
    Other                                 1
    Subtotal                              18
Real
    Confirmed                             5
    Investigating                         4
    Subtotal                              9
Total                                     113

Figure 7: Categorization of data races found by DataCollider during kernel stress.

5. Conclusion

This paper describes DataCollider, a lightweight and effective data-race detector specifically designed for low-level systems code. Using our implementation of DataCollider for the Windows operating system, we have found to date 25 erroneous data races, of which 12 are already fixed.

We would like to thank our shepherd Junfeng Yang and all our anonymous reviewers for valuable feedback on the paper.

References

[1] Stefan Savage, Michael Burrows, Greg Nelson, and Patrick Sobalvarro, "Eraser: A Dynamic Data Race Detector for Multithreaded Programs," ACM Transactions on Computer Systems, vol. 15, no. 4, pp. 391-411, 1997.

[2] Robert O'Callahan and Jong-Deok Choi, "Hybrid Dynamic Data Race Detection," SIGPLAN Not., vol. 38, no. 10, pp. 167-178, 2003.

[3] Yuan Yu, Tom Rodeheffer, and Wei Chen, "RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking," in Symposium on Operating Systems Principles (SOSP), 2005, pp. 221-234.
[4] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley, "PACER: Proportional Detection of Data Races," in Programming Language Design and Implementation (PLDI), 2010.

[5] Daniel Marino, Madanlal Musuvathi, and Satish Narayanasamy, "LiteRace: Effective Sampling for Lightweight Data-Race Detection," in Programming Language Design and Implementation (PLDI), 2009, pp. 134-143.

[6] Cormac Flanagan and Stephen N. Freund, "FastTrack: Efficient and Precise Dynamic Race Detection," in Programming Language Design and Implementation (PLDI), 2009, pp. 121-133.

[7] E. Pozniansky and A. Schuster, "MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs," Concurrency and Computation: Practice and Experience, vol. 19, no. 3, pp. 327-340, 2007.

[8] Leslie Lamport, "Time, clocks, and the ordering of events in a distributed system," Communications of the ACM, vol. 21, no. 7, pp. 558-565, 1978.

[9] Paul Sack, Brian E. Bliss, Zhiqiang Ma, Paul Petersen, and Josep Torrellas, "Accurate and Efficient Filtering for the Intel Thread Checker Race Detector," in Workshop on Architectural and System Support for Improving Software Dependability, 2006, pp. 34-41.

[10] Hans Boehm and Sarita Adve, "Foundations of the C++ Concurrency Memory Model," HP Labs, Technical Report HPL-2008-56, 2008.

[11] Hans Boehm. (2009, Sep.) N1411: Memory Model Rationale. [Online]. http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1411.htm

[12] IEEE, POSIX.1c, Threads extensions, 1995, IEEE Std 1003.1c.

[13] Dawson Engler and Ken Ashcraft, "RacerX: Effective, Static Detection of Race Conditions and Deadlocks," in Symposium on Operating Systems Principles (SOSP), 2003, pp. 237-252.

[14] Mayur Naik, Alex Aiken, and John Whaley, "Effective Static Race Detection for Java," in Programming Language Design and Implementation (PLDI), 2006, pp. 308-319.

[15] Cormac Flanagan and Stephen Freund, "Type-Based Race Detection for Java," in Programming Language Design and Implementation (PLDI), Vancouver, 2000, pp. 219-232.

[16] Zachary Anderson, David Gay, and Mayur Naik, "Lightweight Annotations for Controlling Sharing in Concurrent Data Structures," in Programming Language Design and Implementation (PLDI), Dublin, 2009.

[17] Chandrasekhar Boyapati, Robert Lee, and Martin Rinard, "Ownership Types for Safe Programming: Preventing Data Races and Deadlocks," in Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), 2002, pp. 211-230.

[18] A. Dinning and E. Schonberg, "Detecting access anomalies in programs with critical sections," in Workshop on Parallel and Distributed Debugging, 1991, pp. 85-96.

[19] The IDA Pro Disassembler and Debugger. [Online]. http://www.hex-rays.com/idapro/

[20] Koushik Sen, "Race Directed Random Testing of Concurrent Programs," in Programming Language Design and Implementation (PLDI), 2008, pp. 11-21.

[21] Satish Narayanasamy, Zhenghao Wang, Jordan Tigani, Andrew Edwards, and Brad Calder, "Automatically Classifying Benign and Harmful Data Races Using Replay Analysis," in Programming Language Design and Implementation (PLDI), 2007, pp. 22-31.

[22] Donald E. Knuth, The Art of Computer Programming, Volume 2. Addison-Wesley Longman, 1997.

[23] Steven C. Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder P. Singh, and Anoop Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in ISCA '95: International Symposium on Computer Architecture, 1995, pp. 24-26.

[24] Amitabh Srivastava and Alan Eustace, "ATOM: A System for Building Customized Program Analysis Tools," in Programming Language Design and Implementation (PLDI), 1994, pp. 196-205.
