Catching memory corruption with hardware breakpoints

TL;DR

Here’s my library for setting hardware breakpoints from code: github.com/biocomp/hw_break. I used it to catch the memory corruption behind a rare crash.

Longer story

At work, our service crashed rarely, and only in certain situations. Crash dumps showed that a std::function object I was calling was corrupt.

It was memory corruption, but it had happened sometime before the call, so the crash dump could not tell me the root cause.

Attaching a debugger and stepping through the code was not an option either: the situation was very rare, and the code ran very often.

Debuggers such as Visual Studio and WinDbg support data breakpoints, which break when the data at an address changes (or when it’s read or executed). But there was a problem: the address of the std::function object was different on every call. I would still have had to break at every call and move the breakpoint to the new address.

Then I wondered whether I could set a similar breakpoint programmatically from within my code (luckily, I could recompile the service). It turns out hardware breakpoints are a processor feature: on x86 you can set up to 4 per thread through the debug registers (DR0–DR3 hold the addresses, and DR7 enables and configures them).

Setting a breakpoint is then a matter of writing the right register values, and on Windows this is done with the GetThreadContext/SetThreadContext APIs. The CONTEXT structure contains the debug registers among the others, and you modify them to set up hardware breakpoints.

I found this library, which has good explanations of how everything works. But the library started a new thread for every breakpoint it set or removed. It does this because the documentation says you can’t call SetThreadContext on a thread without suspending it first, and a suspended thread can’t make that call itself. I suspected this would be slow, and it was: the service became uselessly slow, and the crash no longer reproduced.

SetThreadContext can update all kinds of registers, including the ones that hold the current function’s parameters and return values. However, since I was only changing the debug registers, and only on my own thread, I figured it would be safe to do this without suspending the thread.

I wrote my own single-header C++ library that supports only the current thread: github.com/biocomp/hw_break.

And indeed, it worked! It was fast, and it caught the corruption rather quickly. The culprit was a buffer overrun: when the data written into the buffer was large enough, it corrupted a bunch of stack values below the buffer, including my std::function.