December 22, 2015

Memory Corruption: Finding and Fixing Elusive Crashes

By Gil Gribb

There is no end to the creative ways code can fail, but issues with random memory overwrites are almost invariably related to dynamic memory allocation. Some system allocates a block of memory, uses it for a while, returns it to the allocator, and then, fatally, continues to use it. Meanwhile, the allocator has recycled the memory block and given it to some other system for use. Memory is a shared resource, so the block could be given to another subsystem, another thread, or, on a console, even to the GPU.

Pointers into memory blocks that have been returned to the allocator are termed stale. Either reading or writing through a stale pointer is likely to cause bugs and crashes. Reads through stale pointers tend to be less mysterious, because they are likely to crash in the same subsystem as the bad code. Writes through stale pointers cause a baffling variety of bugs and crashes that may not look related at all.

The conventional solution to finding stale-pointer bugs is to use special allocators that employ various strategies to identify these issues. An incomplete list of the ways an allocator can try to shake these bugs loose:

  • After a block is freed it is filled with a special bit pattern that is likely to cause problems for anyone reading the block through a stale pointer.
  • After a block is freed, it is filled with a special bit pattern and set aside for a while. Before the block is eventually recycled, the bit pattern is checked. If the bit pattern has changed, we can assume someone wrote through a stale pointer.
  • New blocks are created with padding at the start and end of the block that contains a special bit pattern. These are called canaries, as in "canary in a coal mine". When the block is freed (or possibly at other times), the canaries are checked, and if they are not the expected values, we can assume that someone wrote off either the beginning or the end of the block. This isn't a stale pointer per se, but it can result in the same sort of mysterious behavior. (A combined sketch of fill patterns and canaries follows this list.)
  • Modern CPUs have hardware features that allow memory ranges, called pages, to be protected against read and/or write. This is the focus of the rest of the post.
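
To make the fill-pattern and canary strategies concrete, here is a minimal sketch of a debug allocation wrapper. The names (DebugMalloc, DebugFree, CanaryValue, FreeFill) are illustrative, not taken from any particular engine or allocator, and a real implementation would also set freed blocks aside and re-check the fill pattern later.

#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Illustrative values; real allocators pick patterns unlikely to be valid pointers or floats.
static const uint32_t CanaryValue = 0xDEADBEEF;
static const uint8_t  FreeFill    = 0xDD;

struct BlockHeader
{
    size_t   UserSize;     // size the caller asked for
    uint32_t FrontCanary;  // catches writes off the start of the block
};

void* DebugMalloc(size_t Size)
{
    // Layout: [BlockHeader][user data][rear canary]
    uint8_t* Raw = static_cast<uint8_t*>(malloc(sizeof(BlockHeader) + Size + sizeof(uint32_t)));
    if (!Raw)
        return nullptr;
    BlockHeader* Header = reinterpret_cast<BlockHeader*>(Raw);
    Header->UserSize = Size;
    Header->FrontCanary = CanaryValue;
    memcpy(Raw + sizeof(BlockHeader) + Size, &CanaryValue, sizeof(uint32_t));
    return Raw + sizeof(BlockHeader);
}

void DebugFree(void* Ptr)
{
    uint8_t* Raw = static_cast<uint8_t*>(Ptr) - sizeof(BlockHeader);
    BlockHeader* Header = reinterpret_cast<BlockHeader*>(Raw);

    // If either canary changed, someone wrote off the beginning or end of the block.
    uint32_t RearCanary;
    memcpy(&RearCanary, Raw + sizeof(BlockHeader) + Header->UserSize, sizeof(uint32_t));
    assert(Header->FrontCanary == CanaryValue && "write before the start of the block");
    assert(RearCanary == CanaryValue && "write past the end of the block");

    // Fill the block so reads through stale pointers see obvious garbage.
    memset(Ptr, FreeFill, Header->UserSize);
    free(Raw);
}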

Using Memory Protection to Track Down Memory Bugs

The virtual memory systems of modern CPUs offer lots of features beyond the scope of this post. For tracking down memory bugs, we use two features:

  • The ability to allocate vast amounts of address space. In some sense, this means we never need to use the same range of addresses twice, which makes it easy to distinguish legitimate memory accesses from bugs.
  • The ability to mark ranges of address space as unreadable and unwritable. This causes the program to crash when any attempt is made to read or write through a stale pointer.

CPUs have a fixed lower bound on the granularity of memory protection. Memory is allocated, mapped, freed and protected in pages. Most CPUs offer a variety of page sizes, but typically the smallest page available is 4096 bytes. Therefore, in order to use the virtual memory features of the CPU to check for memory bugs, we need to allocate in 4k chunks. There is a lot of waste when we allocate 4096 bytes but only need 16, so an allocator that allocates whole pages in order to use memory protection is going to use a lot more memory. A game that might normally run in 2GB of memory might consume 30GB with an allocator like this. Performance also suffers, as we are calling into the OS on every allocation and free.
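
As a rough illustration of the idea (not the actual allocator that was used on Paragon), here is a minimal sketch built on the POSIX mmap and mprotect calls; on Windows the equivalents are VirtualAlloc and VirtualProtect, and the function names here are purely illustrative:

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Round the request up to whole pages and map fresh address space for it.
void* PageAlloc(size_t Size)
{
    const size_t PageSize = (size_t)sysconf(_SC_PAGESIZE);   // typically 4096
    const size_t Bytes = ((Size + PageSize - 1) / PageSize) * PageSize;
    void* Block = mmap(nullptr, Bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (Block == MAP_FAILED) ? nullptr : Block;
}

// "Free" by revoking all access instead of unmapping. The address range is
// never handed out again, so any later read or write through a stale pointer
// faults immediately instead of silently corrupting recycled memory.
void PageFree(void* Block, size_t Size)
{
    const size_t PageSize = (size_t)sysconf(_SC_PAGESIZE);
    const size_t Bytes = ((Size + PageSize - 1) / PageSize) * PageSize;
    mprotect(Block, Bytes, PROT_NONE);
}

Because freed ranges are never unmapped or recycled, address space is consumed permanently, which is a large part of why memory usage balloons the way it does.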

We can also use the CPU hardware to guard against reading or writing past the end of the memory block by putting the block at the end of the page and protecting the next page. Alternatively, we could guard against reading or writing before the block by putting it at the start of the page and making sure the previous page is protected. For odd-sized or small memory blocks, there is no way to protect against both access before and after the block, because of the page granularity.
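
Here is a sketch of the first placement, again using POSIX calls and an illustrative function name: the block is placed flush against a protected guard page, so any read or write past its end faults immediately.

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Allocate Size bytes so that the block ends exactly at a protected guard page.
void* GuardedAllocAtEnd(size_t Size)
{
    const size_t PageSize = (size_t)sysconf(_SC_PAGESIZE);
    const size_t DataBytes = ((Size + PageSize - 1) / PageSize) * PageSize;

    // Map the data pages plus one extra page that will act as the guard.
    void* Mapping = mmap(nullptr, DataBytes + PageSize, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (Mapping == MAP_FAILED)
        return nullptr;
    uint8_t* Base = static_cast<uint8_t*>(Mapping);

    // Revoke access to the last page; touching it is an instant fault.
    mprotect(Base + DataBytes, PageSize, PROT_NONE);

    // Return a pointer such that the block ends at the guard page boundary.
    // For odd sizes this sacrifices alignment, and the front of the first page
    // is left unprotected, exactly the trade-off described above.
    return Base + (DataBytes - Size);
}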

Recently, Paragon was suffering from the symptoms of a random memory overwrite. Fortunately, an active member of our open source community had a great implementation of such an allocator, which he offered as a pull request and discussed in a blog post.

We hooked this up and almost immediately found five critical bugs. In some cases the actual bugs were quite complicated, but there were also simple issues like this one:

GameplayCueDataMap.Add(ThisGameplayCueTag) = GameplayCueDataMap.FindChecked(Parent);

Here GameplayCueDataMap is an associative array, a TMap in UE4 parlance. First, FindChecked locates the parent item and returns a reference (i.e., a pointer) to it. Then Add is called, which causes data structures internal to the TMap to resize, freeing the memory block that the returned reference points into. Finally, when the assignment is attempted, the stale pointer is dereferenced. Without a special debug allocator, it is not likely this bug would ever be found. It is not even likely to exhibit incorrect behavior. You can be sure that it will wait until well after midnight on the worst possible day of the year to finally rear its ugly head.

The solution is simple: force the dereference before you potentially expand the container. Then there is no stale pointer.

int32 ParentValue = GameplayCueDataMap.FindChecked(Parent);
GameplayCueDataMap.Add(ThisGameplayCueTag, ParentValue);

Fortunately these sorts of bugs are not very common. When they do occur, using the right tool is key to tracking them down.