Monday, November 9, 2009

PRES

PRES is a neat tool that helps developers reproduce concurrency bugs found in the field. Some tools try to search every possible execution path during replay, but they can take many replays (1000+) before they reproduce the problem. The search space is that large because a multiprocessor system has many sources of nondeterminism, chief among them that threads can be interleaved in any order. The other approach is to capture run-time data while the system is deployed so that any problem can be reproduced, but this comes at a high performance cost, sometimes slowing the system down by 10x.
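To make the interleaving problem concrete, here is a toy model of my own (not anything from PRES): two threads each perform a non-atomic increment, split into a read and a write, and the script enumerates every possible schedule. The names `interleavings` and `run` are made up for this sketch.

```python
from itertools import combinations

# Each thread does a non-atomic increment: read the counter, then write read + 1.
THREAD_OPS = ["read", "write"]

def interleavings(n_a, n_b):
    """Yield every schedule of threads A and B, preserving each thread's own order."""
    total = n_a + n_b
    for slots in combinations(range(total), n_a):
        yield tuple("A" if i in slots else "B" for i in range(total))

def run(schedule):
    """Execute one schedule and return the final counter value."""
    counter = 0
    local = {"A": 0, "B": 0}   # value each thread last read
    pc = {"A": 0, "B": 0}      # next operation index per thread
    for thread in schedule:
        op = THREAD_OPS[pc[thread]]
        pc[thread] += 1
        if op == "read":
            local[thread] = counter
        else:
            counter = local[thread] + 1
    return counter

schedules = list(interleavings(len(THREAD_OPS), len(THREAD_OPS)))
# A correct run ends with counter == 2; anything else is a lost update.
buggy = [s for s in schedules if run(s) != 2]
print(f"{len(buggy)} of {len(schedules)} schedules lose an update")
```

Even this tiny program has several schedules, and the count grows combinatorially with more threads and more operations, which is why blind searching can need so many replays.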

PRES takes a middle ground by recording only some of the run-time data, such as the order in which functions are called. When a developer needs to reproduce a bug, they gather this data and PRES uses it to limit the number of replays required. It automatically steers the replay system away from paths that did not happen in the field.
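A rough sketch of that pruning idea, under heavy assumptions: the program, the recorded call order, and every name below (`PROGRAM`, `FIELD_SKETCH`, `reproduce`, and so on) are hypothetical, and the real replayer is certainly more sophisticated than a filter over enumerated schedules. The point is only that a cheap recording of function call order rules out schedules before they are ever replayed.

```python
from itertools import combinations

# Hypothetical program: each thread enters a function, then does a
# non-atomic read-modify-write on a shared balance that starts at 100.
PROGRAM = {
    "A": [("call", "deposit"), ("read", None), ("write", 10)],
    "B": [("call", "withdraw"), ("read", None), ("write", -10)],
}

def all_schedules(program):
    """Yield every interleaving of threads A and B."""
    n_a, n_b = len(program["A"]), len(program["B"])
    total = n_a + n_b
    for slots in combinations(range(total), n_a):
        yield tuple("A" if i in slots else "B" for i in range(total))

def call_order(program, schedule):
    """Order of function calls a schedule would produce (computed without replaying)."""
    pc = {"A": 0, "B": 0}
    calls = []
    for thread in schedule:
        kind, arg = program[thread][pc[thread]]
        pc[thread] += 1
        if kind == "call":
            calls.append(arg)
    return calls

def run(program, schedule):
    """Replay one schedule and return the final balance."""
    balance, local, pc = 100, {}, {"A": 0, "B": 0}
    for thread in schedule:
        kind, arg = program[thread][pc[thread]]
        pc[thread] += 1
        if kind == "read":
            local[thread] = balance
        elif kind == "write":
            balance = local[thread] + arg
    return balance

def reproduce(program, sketch, failed):
    """Replay only schedules consistent with the recorded sketch.

    Returns (replays used, reproducing schedule or None, candidates kept).
    """
    candidates = [s for s in all_schedules(program)
                  if call_order(program, s) == sketch]
    for attempts, schedule in enumerate(candidates, start=1):
        if failed(run(program, schedule)):
            return attempts, schedule, len(candidates)
    return len(candidates), None, len(candidates)

# Assumed field data: withdraw was entered before deposit, and the
# balance ended at 110 (the withdrawal was lost).
FIELD_SKETCH = ["withdraw", "deposit"]
attempts, schedule, candidates = reproduce(
    PROGRAM, FIELD_SKETCH, lambda balance: balance == 110)
print(attempts, schedule, candidates)
```

In this toy case the recorded call order cuts the candidate schedules in half before any replay runs. A richer recording would prune more, at the price of more overhead in production, which is exactly the trade-off described above.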

I thought it also made a wise choice to start with the possible race conditions closest to the time the bug occurred and work outward. This makes it more likely to find the race condition quickly, since there is usually some locality to a bug. That is not always the case, though, so it would be neat to offer other heuristics, such as targeting specific threads or data for race conditions.
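Here is how I picture the two orderings, as a minimal sketch. The `RacePair` type, the trace indices, and both functions are my own invention for illustration, not PRES's data structures; the second function is the kind of alternative heuristic I have in mind.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RacePair:
    """Hypothetical candidate race: two unsynchronized accesses to one variable."""
    variable: str
    first_index: int    # trace position of the earlier access
    second_index: int   # trace position of the later access

def order_by_proximity(candidates, failure_index):
    """Candidates that completed before the failure, nearest to it first."""
    before = [c for c in candidates if c.second_index <= failure_index]
    return sorted(before, key=lambda c: failure_index - c.second_index)

def order_by_target(candidates, failure_index, variables):
    """Alternative: races on the named variables first, then by proximity.

    Relies on sorted() being stable, so proximity order is kept within each group.
    """
    ranked = order_by_proximity(candidates, failure_index)
    return sorted(ranked, key=lambda c: c.variable not in variables)

candidates = [
    RacePair("cache", 5, 12),
    RacePair("refcount", 80, 95),
    RacePair("queue", 40, 60),
    RacePair("log", 99, 120),   # finishes after the failure, so it is ignored
]
nearest_first = order_by_proximity(candidates, failure_index=100)
cache_first = order_by_target(candidates, failure_index=100, variables={"cache"})
print([c.variable for c in nearest_first])
print([c.variable for c in cache_first])
```

With proximity ordering, the race on `cache` is tried last because it happened long before the failure. If I already suspect the cache, the targeted ordering lets me try it first.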

I often develop multi-threaded applications for Windows in both user and kernel mode. While a system like PRES could be useful sometimes, I have not needed it because other tools work well enough. Between static analysis tools (PREfast and Static Driver Verifier) and dynamic analysis tools (Driver Verifier, Application Verifier, UMDH), I am typically warned of race conditions without waiting for them to occur. When a bug does occur, a dump of the process (available on any Vista+ box) can usually pinpoint the problem. Kernel-level code typically bug checks the system soon after a problem, which also produces a dump that can be used.

That said, PRES can still have its place for some heisenbugs that occur in the field.
