Cache COW — a debugging adventure on Minoca OS

Sharing my own copy-on-write story in light of the Dirty COW bug.

What follows is the story of how I tracked down a particularly obscure bug in the early days of Minoca OS. Here goes.

Catching the Bug

In January 2015, the Texas Instruments PandaBoard was the only ARM platform we supported. And it occasionally failed to boot.

Back then, Minoca’s nightly build process ran on an x86 PC we found on the side of the road and the ARMv7-based PandaBoard. When all went well, Minoca OS would compile itself, install the natively built image to a second partition, reboot into that partition, and run tests on the new build. But on some rare mornings, the “street PC” was up and running tests while the PandaBoard was hung. Without a debugger attached, there wasn’t much more to do than reboot.

Given the hang’s rarity, simply attaching a debugger to the PandaBoard every night would take too long to reach a diagnosis. The bug begged for a reliable reproduction, but there were only two solid facts to work from. The first was that it never hit on x86. In most cases, that means the bug is cache related, as x86 is cache coherent and ARM is not. The second was that it only hung on reboot, never in the middle of compiling the OS. Without enough detail to craft a targeted repro, attaching a debugger to the PandaBoard and sticking it in a reboot loop seemed like a good place to start.

While not a very creative solution, it was worth a shot at gaining more information about the bug. And voila, in about fifteen minutes, there was some action:

Error Code: CRASH_PAGE_FAULT (0x1)
Parameter 1: 0x00000000
Parameter 2: 0xbf8d46a7
Parameter 3: 0xbfbbbe48
Parameter 4: 0x00000001

It crashed! Maybe this was the culprit. The routine that prints this message, KeCrashSystemEx, actually spins forever without a debugger attached, so this would have manifested as a hang during the nightly builds.

Dissecting the Crash

Analysis of the crash parameters indicated that the USB hub driver faulted attempting to access NULL (Parameter 1) while executing the instruction saved in Parameter 2.

1 : kd>u 0xbf8d46a7
usbhub!UsbHubpRemoveDevice+0x7:
bf8d46a7 6808 ldr r0, [r1] <-- The crash is here!
bf8d46a9 f7ff ef0e blx usbhub+0x4c8 [0xbf8d44c8]
bf8d46ad 6860 ldr r0, [r4, #4]
bf8d46af f7ff eeee blx usbhub+0x48c [0xbf8d448c]
bf8d46b3 6820 ldr r0, [r4]

It was a load from the address stored in r1.

While it was tempting to investigate why r1 was NULL, the conundrum staring me in the face was why UsbHubpRemoveDevice was executing during boot! The only hub attached to the PandaBoard was internal and could not be removed. A NULL dereference did not seem like the real problem. The next step was to figure out what was calling UsbHubpRemoveDevice.

The trap frame (Parameter 3), which stores the register context at the time of a crash, was the best place to find that answer. The saved link register records the return address for the executing function, and it pointed to an address within UsbHubDispatchStateChange. Inspecting the C code for that routine, I found nothing amiss. Device removal is a valid state change operation. The assembly, however, told a different story.

1 : kd>u 0xbf8d476b-0x4
usbhub!UsbHubDispatchStateChange+0xa3:
bf8d4767 f7ff ff8b bl usbhub!UsbHubpEnumerateChildren+0x1
bf8d476b bdb0 pop {r4, r5, r7, pc} <-- Saved Link Register
bf8d476d 6ce3 ldr r3, [r4, #76]
bf8d476f 2b02 cmp r3, #2
bf8d4771 d103 bne usbhub!UsbHubDispatchStateChange+0xb7
bf8d4773 4620 mov r0, r4
bf8d4775 4629 mov r1, r5
bf8d4777 f7ff ff93 bl usbhub!UsbHubpRemoveDevice+0x1 [0xbf8d46a1]
bf8d477b bdb0 pop {r4, r5, r7, pc} <-- Expected Link Register
bf8d477d 0134 movs r4, r6

The 16-byte discrepancy between the saved link register and the expected link register for a call to UsbHubpRemoveDevice was enough evidence that the crash was not a simple case of an uninitialized pointer. But I wanted more confirmation that the link register was correct and that the attempt to remove a hub during boot was bogus. Luckily, the trap frame also saves the stack pointer at the time of a crash.

The Minoca Debugger’s dps command can be used to dump memory, pointer by pointer, resolving each address to its corresponding symbol. This was the stack, snipped for brevity:

1 : kd>dps 0xbfbbbe98
bfbbbe98 ff800000 0xff800000
bfbbbe9c 80717eb0 0x80717eb0
bfbbbea0 fffffff8 0xfffffff8
bfbbbea4 bf8d476b usbhub!UsbHubDispatchStateChange+0xa7
...
bfbbbeb0 bfbbbee4 0xbfbbbee4
bfbbbeb4 bfa24dc7 kernel!IopCallDriver+0xa7
...
bfbbbee0 bfbbbef4 0xbfbbbef4
bfbbbee4 bfa24e7b kernel!IopPumpIrpThroughStack+0x23
...
bfbbbef0 bfbbbf2c 0xbfbbbf2c
bfbbbef4 bfa24ff3 kernel!IoSendSynchronousIrp+0xf7
...
bfbbbf28 bfbbbf54 0xbfbbbf54
bfbbbf2c bfa250e7 kernel!IopSendStateChangeIrp+0x33
...
bfbbbf50 bfbbbf74 0xbfbbbf74
bfbbbf54 bfa1a73f kernel!IopQueryChildren+0x23 <-- Look here!
...
bfbbbf70 bfbbbf84 0xbfbbbf84
bfbbbf74 bfa1a801 kernel!IopProcessWorkEntry+0x21
...
bfbbbf80 bfbbbfb4 0xbfbbbfb4
bfbbbf84 bfa1a939 kernel!IopDeviceWorker+0x109
...
bfbbbfb0 bfbbbfe4 0xbfbbbfe4
bfbbbfb4 bfa152ad kernel!KepWorkerThread+0xad
...
bfbbbfe0 bfbbbffc 0xbfbbbffc
bfbbbfe4 bfa0dae7 kernel!PspKernelThreadStart+0x13

The presence of IopQueryChildren further up the stack was a great indication that the link register was correct and that UsbHubpEnumerateChildren actually got called before the processor ended up in UsbHubpRemoveDevice.

A quick inspection of the enumerate children routine’s C code showed it to be quite simple. It called two functions and returned. But how far did it get? To answer that, I had to go to the assembly again.

kd>u UsbHubpEnumerateChildren
usbhub!UsbHubpEnumerateChildren+0x1:
bf8d4681 b598 push {r3, r4, r7, lr}
bf8d4683 af03 add r7, sp, #12
bf8d4685 4604 mov r4, r0
bf8d4687 6849 ldr r1, [r1, #4]
bf8d4689 f7ff ef06 blx usbhub+0x498
bf8d468d 4602 mov r2, r0
bf8d468f 4b03 ldr r3, usbhub!UsbHubpEnumerateChildren+0x1c
bf8d4691 447b add r3, pc
bf8d4693 6818 ldr r0, [r3]
bf8d4695 4621 mov r1, r4
bf8d4697 f7ff eef4 blx usbhub+0x480
bf8d469b bd98 pop {r3, r4, r7, pc}

The stack proved that the first instruction executed. The saved link register from the trap frame was pushed along with three more values, all of which can be seen in the stack dump above. The second instruction added 12 bytes to the stack pointer value and stored the result in r7, without modifying the stack pointer itself. My calculation of what should have been stored in r7 matched the value saved in the trap frame for that register. That was enough evidence to convince me the add instruction had executed.

Then the trail ran cold. A move, a load, and a branch followed, but nothing in the stack or trap frame confirmed their execution. The branch seemed most interesting, but it just jumped to the procedure linkage table (PLT) for the USB hub driver:

bf8d4498 e28fc600 add ip, pc, #0  ; 0x0
bf8d449c e28cca01 add ip, ip, #4096 ; 0x1000
bf8d44a0 e5bcf438 ldr pc, [ip, #+1080]! ; 0x438

The PLT looked fine though. Each PLT entry calculates the address of a corresponding global offset table (GOT) entry. The kernel’s image loader fills each GOT entry with the address of a routine that is implemented in another binary. In this case, the GOT entry in question stored the address of UsbHubQueryChildren — the routine that the branch instruction was attempting to execute. The second branch instruction in UsbHubpEnumerateChildren jumped to a different PLT entry, but its corresponding GOT entry correctly pointed to IoCompleteIrp.
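The PLT/GOT mechanism can be sketched in plain C: the GOT is just a table of function pointers filled in by the loader, and each PLT stub is a jump through its slot. This is only a model — the names and the single-entry table are made up for illustration, not Minoca’s actual loader code:

```c
/*
 * A sketch of PLT/GOT indirection. The imported routine, the loader
 * fixup, and the stub are all hypothetical stand-ins.
 */

/* An imported routine that "lives in another binary". */
static int UsbHubQueryChildrenStandIn(void) {
    return 42;
}

/* The GOT: one function pointer slot per imported symbol. */
static int (*Got[1])(void);

/*
 * What the image loader does at load time: resolve each imported
 * symbol and patch its address into the matching GOT slot.
 */
void LoaderFixupGot(void) {
    Got[0] = UsbHubQueryChildrenStandIn;
}

/*
 * What a PLT stub does at run time: compute the GOT slot's address
 * and jump through whatever pointer the loader stored there.
 */
int PltStub0(void) {
    return Got[0]();
}

/* Drive the whole path: fix up the GOT, then call through the PLT. */
int CallThroughPlt(void) {
    LoaderFixupGot();
    return PltStub0();
}
```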

This really was rock bottom. Had either of those two routines executed, even a single instruction, the stack would have told a different story.

Educated Guess and Check

So, to take a step back, I was looking at an ARM-only crash in driver code where the program counter did not line up with the link register and the stack trace left off around a branch to the PLT. Wanting to sanity check this, I let the reboot loop run again. Similar crash: a page fault on a bogus address. Similar dead end: an execution stack that failed in the USB hub driver.

Recalling the number one rule about ARM-only crashes, the next step was to try to reproduce the bug with the caches disabled. Given that the program counter was going off the deep end, the PandaBoard’s 32KB L1 instruction cache seemed like a good place to start. I removed bit 12 from MMU_CONTROL_DEFAULT_VALUE, which gets set in the system control register (SCTLR), rebuilt the image and fired off the reboot loop. If the bug disappeared, then Minoca’s memory manager or image loader was probably missing an instruction cache invalidation when loading driver code. The bug hit. And it also hit with the branch predictor and L2 caches disabled.
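Disabling the instruction cache for an experiment like this is a one-bit change to the value written into the SCTLR at boot. Bit 12 really is the ARMv7 I-cache enable; the macro names and the default value below are assumptions for the sketch, not Minoca’s actual definitions:

```c
#include <stdint.h>

/* ARMv7 SCTLR bit positions (architectural). */
#define MMU_CONTROL_MMU_ENABLE        (1u << 0)
#define MMU_CONTROL_DATA_CACHE        (1u << 2)
#define MMU_CONTROL_INSTRUCTION_CACHE (1u << 12)

/* Hypothetical default: MMU on, both caches on. */
#define MMU_CONTROL_DEFAULT_VALUE \
    (MMU_CONTROL_MMU_ENABLE | MMU_CONTROL_DATA_CACHE | \
     MMU_CONTROL_INSTRUCTION_CACHE)

/*
 * Build the SCTLR value for the experiment: everything in the default
 * configuration except the instruction cache enable bit.
 */
uint32_t BuildSctlrNoIcache(void) {
    return MMU_CONTROL_DEFAULT_VALUE & ~MMU_CONTROL_INSTRUCTION_CACHE;
}
```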

Unfortunately, I could not try it with the L1 data cache disabled. Not only would it have been painfully slow, but Minoca OS depends on the ldrex and strex instructions, which require an exclusive monitor. The PandaBoard only implements that in its L1 data cache.

With caching tests out of the way, the second rule of thumb for tracking down weird bugs is to test on a single core. Running multiple cores introduces inter-processor interrupts (IPIs) for shooting down translation lookaside buffer (TLB) entries, so perhaps a stale TLB entry was causing problems. Alas, rebuilding the OS with HlRunSingleProcessor set to TRUE still reproduced the bug.

Other hypotheses began to take form. Could the interrupt handler have been incorrectly restoring the pre-interrupt context, clobbering the program counter? No. That would have hit more often and in other code paths. Was there something wrong with this USB hub driver’s PLT? The disassembly looked correct and there was nothing special about the USB hub driver. Whatever explanations I could come up with were never as convincing as that number one rule — it had to be cache related.

One Last Theory

Not being able to boot Minoca OS with the L1 data cache disabled still bugged me. But if I couldn’t disable it, maybe I could be more aggressive about cleaning it. I decided to make each cache clean and invalidate request operate on the entire cache, not just one cache line. After modifying MmpCleanCacheRegion to call ArCleanEntireCache and MmpInvalidateInstructionCacheRegion to call ArInvalidateInstructionCache, I put the PandaBoard into the reboot loop. 12 hours later, I returned. No crash. The reboot loop had been taking only 15 minutes to reproduce the bug, and now it was still rebooting cleanly after 12 hours. Cleaning and invalidating the entire caches on every request was no permanent solution, but it was progress!

How did these cache changes relate to the USB hub driver that was executing into the abyss? Well, if the driver’s .text section (i.e. its executable code) was read from disk into the data cache and never flushed, the instruction execution engine would fetch and execute whatever random bytes were sitting in RAM.

There was a lot going on in that last sentence, so let me explain. The ARMv7 architecture implements separate data and instruction caches. Normal reads and writes operate out of the data cache while instruction fetches read from the instruction cache. All cache misses are filled from RAM — that is, the instruction cache does not read from the data cache. When executable code is modified, the changes are stored in the data cache. Those changes will not be observed by instruction fetches until the data cache is flushed to RAM and the instruction cache is invalidated to remove stale entries. (If that didn’t quite make sense, there is a diagram below.)
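The two-step maintenance sequence described above — clean the data cache to RAM, then invalidate the stale instruction cache lines — is exactly what GCC and Clang expose portably as `__builtin___clear_cache`. A user-space sketch of copying code and making it safely fetchable (the memcpy stands in for the kernel’s page copy; on cache-coherent x86 the builtin is effectively a no-op):

```c
#include <string.h>

/*
 * Copy a region that may contain instructions. After the memcpy, the
 * new bytes live in the data cache; before they can be executed, the
 * data cache must be cleaned to RAM and the stale instruction cache
 * lines invalidated. __builtin___clear_cache performs that pair of
 * operations on ARM.
 */
void CopyCode(void *Destination, const void *Source, unsigned long Size) {
    memcpy(Destination, Source, Size);
    __builtin___clear_cache((char *)Destination,
                            (char *)Destination + Size);
}

/* Exercise the copy path and verify the bytes arrived intact. */
int CopyCodeDemo(void) {
    char Source[4] = {0x01, 0x02, 0x03, 0x04};
    char Destination[4] = {0};

    CopyCode(Destination, Source, sizeof(Source));
    return memcmp(Destination, Source, sizeof(Source));
}
```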

Could this explain the similar, yet inconsistent outcomes of each crash — RAM had different bytes each time — and its rarity — most of the time the data cache got naturally flushed before the driver executed? It seemed plausible, but the PandaBoard’s SD driver used DMA. The .text section would have never been in the data cache unless the image loader modified the .text section, yet there were no text relocations in the image’s .dynamic section:

Z:/src/armv7dbg/bin$ arm-none-minoca-readelf.exe -d usbhub.drv
Dynamic section at offset 0x834 contains 13 entries:
Tag Type Name/Value
0x00000001 (NEEDED) Shared library: [kernel]
0x00000001 (NEEDED) Shared library: [usbcore.drv]
0x0000000e (SONAME) Library soname: [usbhub.drv]
0x00000004 (HASH) 0x94
0x00000005 (STRTAB) 0x2c0
0x00000006 (SYMTAB) 0x140
0x0000000a (STRSZ) 316 (bytes)
0x0000000b (SYMENT) 16 (bytes)
0x00000003 (PLTGOT) 0x18c4
0x00000002 (PLTRELSZ) 112 (bytes)
0x00000014 (PLTREL) REL
0x00000017 (JMPREL) 0x3fc
0x00000000 (NULL) 0x0

Before losing all hope on this theory, I figured it wouldn’t hurt to make sure nothing was modifying the USB hub driver’s code. The driver’s .text section had been consistently loading at the same address, so I put a check in MmHandleFault to see if the system ever write-faulted on that address. Sure enough, it hit! The faulting address was owned by a privately mapped image section that was backed by the first page of the USB hub driver. The trap frame indicated that PspImMapImageSegment was zeroing the slack space between image sections.
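The check itself amounts to a range test against the watched .text region inside the page fault handler. A minimal sketch — the function name, the base address, and the size are hypothetical, chosen only to resemble where the driver had been loading:

```c
#include <stdint.h>

/* Hypothetical watched range covering the driver's .text section. */
#define WATCH_BASE 0xbf8d4000u
#define WATCH_SIZE 0x1000u

/*
 * Return nonzero if a write fault landed inside the watched range.
 * In the real experiment, a hit here would break into the debugger.
 */
int FaultIsWatched(uint32_t FaultingAddress, int WriteFault) {
    if (WriteFault == 0) {
        return 0;
    }

    return (FaultingAddress >= WATCH_BASE) &&
           (FaultingAddress < (WATCH_BASE + WATCH_SIZE));
}
```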

Caches and Copy-on-write

Following the write fault further, I observed it call MmpIsolateImageSection, allocate a private page for itself, and copy the image section’s existing contents to the private page. It was performing a copy-on-write! For those unfamiliar with the necessity of copy-on-write behavior, let me explain. Applications will often directly map files into memory (using mmap) to eliminate the system call overhead for I/O operations. The kernel loves this because it can dole out the same physical pages to all applications, saving memory. This gets tricky when an application writes to a virtual address that directly maps a file. Should the write modify the physical page shared by all applications? In most cases, the answer is no. The kernel traps the write attempt, allocates a new private page and copies the contents of the shared page over. The writing application can go on its way and the other applications are none the wiser.
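The private-mapping behavior is easy to observe from user space on any POSIX system: write through a MAP_PRIVATE mapping and the file on disk never changes, because the kernel satisfies the write with a private copy-on-write page. A self-contained sketch (the temp file path is arbitrary):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a file privately, scribble on the mapping, and report whether
 * the file on disk kept its original contents. It should: the write
 * triggers a copy-on-write onto a private page.
 */
int PrivateMappingPreservesFile(void) {
    char Buffer[4];
    int File;
    char *Mapping;
    const char *Path = "/tmp/cow-demo";
    int Preserved;

    File = open(Path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (File < 0) {
        return -1;
    }

    if (write(File, "orig", 4) != 4) {
        close(File);
        return -1;
    }

    Mapping = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_PRIVATE, File, 0);
    if (Mapping == MAP_FAILED) {
        close(File);
        return -1;
    }

    /* This write triggers the copy-on-write. */
    memcpy(Mapping, "zzzz", 4);

    /* Read the file back; the private write must not have reached it. */
    lseek(File, 0, SEEK_SET);
    if (read(File, Buffer, 4) != 4) {
        Preserved = -1;
    } else {
        Preserved = (memcmp(Buffer, "orig", 4) == 0);
    }

    munmap(Mapping, 4);
    close(File);
    unlink(Path);
    return Preserved;
}
```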

This copy-on-write behavior is even more important when it comes to modifying drivers. The shared physical pages mapped into the driver’s image section are actually borrowed from the page cache. If the writes were allowed to operate on those shared pages, the changes would end up on the hard disk. Remember, the system was zeroing slack space in the driver. A bunch of zeros hitting the disk would have corrupted the binary. The copy-on-write was necessary and expected.

That said, something was still wrong. MmpCopyPage was caching the contents of the original physical page (arrow 1 above) as it copied them into data cache entries for the newly allocated physical page (arrow 2), but it was never cleaning those entries out to RAM (arrow 3). The simple act of zeroing the slack space was triggering a copy-on-write and dragging the entire .text section over to a new physical page. Without a data cache clean, the instruction cache was filling its cache lines (arrow 4) with whatever random bytes happened to be in the new physical page, and the processor was glad to execute them and do odd things, like trying to remove an internal USB hub during boot.

Things were adding up. After inserting the below code into MmpCopyPage, I put the PandaBoard into its reboot loop one last time. No crashes! Bug fixed!

//
// If the page may contain executable code, clean the data cache so the
// copied bytes reach RAM before any instruction fetch can see the page.
//

if ((Section->Flags & IMAGE_SECTION_EXECUTABLE) != 0) {
    MmpSyncSwapPage(ProcessorBlock->SwapPage, PageSize);
}

This Cache COW bug — as I’m now affectionately calling it — was particularly memorable because of how hard it was to reproduce. Typically, I aim to isolate a bug with a targeted test application, but caches are fickle. Sometimes you just have to trust your gut and remember the number one rule about ARM-only bugs. The cache is king.