Brendon Scheinman

When Debuggers Lie

Here at OkCupid (we're hiring btw) we recently spent a lot of time trying to debug an application panic that seemed like it couldn't possibly happen. Well, it turns out that the panic was happening, but we thought it was triggered from a different code path than it actually was, because the debugger seemingly lied to us about the call stack at the time of the panic! However, saying that the debugger lied to us would be less accurate than saying that the compiler lied to the debugger. To figure out how this happened, let's take a look at a simplified version of the code that caused this issue:

 1 #include <stdio.h>
 2 #include <stdlib.h>
 3 #include <string.h>
 4
 5 static int items[10];
 6
 7 static inline int
 8 get_nonzero_item(size_t i) {
 9     if (items[i] == 0) {
10         printf("the requested item was 0!\n");
11         abort();
12     }
13
14     return items[i];
15 }
16
17 static inline int
18 get_nonzero_three(void) {
19     return get_nonzero_item(3);
20 }
21
22 static inline int
23 get_nonzero_five(void) {
24     return get_nonzero_item(5);
25 }
26
27 int
28 main(void) {
29     int i;
30     int item;
31
32     if (getenv("FILL")) {
33         for (i = 0; i < sizeof(items) / sizeof(items[0]); ++i) {
34             items[i] = i;
35         }
36     } else {
37         memset(items, 0, sizeof items);
38     }
39
40     if (getenv("FOO")) {
41         item = get_nonzero_three();
42     } else {
43         item = get_nonzero_five();
44     }
45
46     printf("item = %d\n", item);
47
48     return 0;
49 }

The key part of this code that caused the bug is that two inlined code paths lead to the exact same print/abort code, which the compiler will combine together as part of code optimization. This optimization makes it impossible for the debugger to determine which code path led to that failure. Let's try this out for ourselves by getting both possible panic paths to run. We'll use clang for this example, although I confirmed that equivalent behavior occurs with gcc 5.4.0.

$ clang --version
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
$ clang -std=c99 -O3 -g -o inline_merge inline_merge.c
$ ./inline_merge
the requested item was 0!
Aborted (core dumped)
$ FOO=1 ./inline_merge
the requested item was 0!
Aborted (core dumped)

The first invocation of our test program should have panicked within get_nonzero_five because the environment variable FOO was not set, and the second invocation should have panicked within get_nonzero_three. Let's take a look at the resulting core dumps to see whether this is really the case.

$ gdb ./inline_merge core_with_foo                                                             
...
(gdb) bt
#0  0x00007f1453115428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f145311702a in __GI_abort () at abort.c:89
#2  0x00000000004006ac in get_nonzero_item (i=3) at inline_merge.c:11
#3  get_nonzero_three () at inline_merge.c:19
#4  main () at inline_merge.c:41

This looks correct-- the environment variable FOO was set, so we tried calling get_nonzero_three and panicked. How about the other core file?

$ gdb ./inline_merge core_without_foo
...
(gdb) bt
#0  0x00007f07be73f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f07be74102a in __GI_abort () at abort.c:89
#2  0x00000000004006ac in get_nonzero_item (i=3) at inline_merge.c:11
#3  get_nonzero_three () at inline_merge.c:19
#4  main () at inline_merge.c:41

Well that can't be right! We didn't set FOO so the program should have called get_nonzero_five instead of get_nonzero_three. How did this happen? Let's take a look at the instructions that the compiler generated for the above code. Because everything got inlined, we can just look at the disassembly for main.

$ objdump -d inline_merge | less
...
0000000000400600 <main>:
  400600:       55                      push   %rbp
  400601:       53                      push   %rbx
  400602:       50                      push   %rax
  400603:       bf 60 07 40 00          mov    $0x400760,%edi
  400608:       e8 93 fe ff ff          callq  4004a0 <getenv@plt>
  40060d:       48 85 c0                test   %rax,%rax
  400610:       74 39                   je     40064b <main+0x4b>
  400612:       0f 28 05 27 01 00 00    movaps 0x127(%rip),%xmm0        # 400740 <_IO_stdin_used+0x10>
  400619:       0f 29 05 40 0a 20 00    movaps %xmm0,0x200a40(%rip)        # 601060 <items>
  400620:       0f 28 05 29 01 00 00    movaps 0x129(%rip),%xmm0        # 400750 <_IO_stdin_used+0x20>
  400627:       0f 29 05 42 0a 20 00    movaps %xmm0,0x200a42(%rip)        # 601070 <items+0x10>
  40062e:       48 b8 08 00 00 00 09    movabs $0x900000008,%rax
  400635:       00 00 00
  400638:       48 89 05 41 0a 20 00    mov    %rax,0x200a41(%rip)        # 601080 <items+0x20>
  40063f:       bb 03 00 00 00          mov    $0x3,%ebx
  400644:       bd 05 00 00 00          mov    $0x5,%ebp
  400649:       eb 20                   jmp    40066b <main+0x6b>
  40064b:       0f 57 c0                xorps  %xmm0,%xmm0
  40064e:       0f 29 05 1b 0a 20 00    movaps %xmm0,0x200a1b(%rip)        # 601070 <items+0x10>
  400655:       0f 29 05 04 0a 20 00    movaps %xmm0,0x200a04(%rip)        # 601060 <items>
  40065c:       48 c7 05 19 0a 20 00    movq   $0x0,0x200a19(%rip)        # 601080 <items+0x20>
  400663:       00 00 00 00
  400667:       31 ed                   xor    %ebp,%ebp
  400669:       31 db                   xor    %ebx,%ebx
  40066b:       bf 65 07 40 00          mov    $0x400765,%edi
  400670:       e8 2b fe ff ff          callq  4004a0 <getenv@plt>
  400675:       48 85 c0                test   %rax,%rax
  400678:       74 06                   je     400680 <main+0x80>
  40067a:       85 db                   test   %ebx,%ebx
  40067c:       75 08                   jne    400686 <main+0x86>
  40067e:       eb 1d                   jmp    40069d <main+0x9d>
  400680:       85 ed                   test   %ebp,%ebp
  400682:       89 eb                   mov    %ebp,%ebx
  400684:       74 17                   je     40069d <main+0x9d>
  400686:       bf 69 07 40 00          mov    $0x400769,%edi
  40068b:       31 c0                   xor    %eax,%eax
  40068d:       89 de                   mov    %ebx,%esi
  40068f:       e8 3c fe ff ff          callq  4004d0 <printf@plt>
  400694:       31 c0                   xor    %eax,%eax
  400696:       48 83 c4 08             add    $0x8,%rsp
  40069a:       5b                      pop    %rbx
  40069b:       5d                      pop    %rbp
  40069c:       c3                      retq
  40069d:       bf 80 07 40 00          mov    $0x400780,%edi
  4006a2:       e8 19 fe ff ff          callq  4004c0 <puts@plt>
  4006a7:       e8 04 fe ff ff          callq  4004b0 <abort@plt>
  4006ac:       0f 1f 40 00             nopl   0x0(%rax)

The environment variable check to determine which function to call takes place at instruction 400670. Then, if FOO is set, we execute these instructions:

40067a:       85 db                   test   %ebx,%ebx
40067c:       75 08                   jne    400686 <main+0x86>
40067e:       eb 1d                   jmp    40069d <main+0x9d>

Otherwise, we execute these:

400680:       85 ed                   test   %ebp,%ebp
400682:       89 eb                   mov    %ebp,%ebx
400684:       74 17                   je     40069d <main+0x9d>

Earlier in the function, we assigned the values for items[i] to %ebx and %ebp to correspond to the possibility of calling get_nonzero_item with 3 and 5, respectively; the check for FOO determines which of these two values we end up using. However, the most important part of these two snippets is that both of them jump to the address 40069d if the result was zero. What is going on at that address?

40069d:       bf 80 07 40 00          mov    $0x400780,%edi
4006a2:       e8 19 fe ff ff          callq  4004c0 <puts@plt>
4006a7:       e8 04 fe ff ff          callq  4004b0 <abort@plt>
4006ac:       0f 1f 40 00             nopl   0x0(%rax)

These instructions implement the printing and panicking code from get_nonzero_item on lines 10-11. After inlining the functions get_nonzero_three and get_nonzero_five, the compiler noticed that this block of code was duplicated between the two inlined functions and combined them as an optimization. Let's look at the instruction pointer in main at the time of the panic (it has the same value in both coredumps):

(gdb) frame 4
#4  main () at inline_merge.c:41
41              item = get_nonzero_three();
(gdb) p/x $rip
$1 = 0x4006ac

The program was executing this block that was shared between the two inlined functions. Once execution has reached this point, how can the debugger know which inlined code path led there? In some cases this might be possible using dynamic analysis but in general that problem is extremely difficult, if not impossible, to solve reliably. So why did gdb choose to tell us that we had executed get_nonzero_three in both cases? Why didn't it choose get_nonzero_five both times instead? Let's take a look at the debug information that the compiler generated:

$ readelf --debug=info inline_merge | less
...
<1><9b>: Abbrev Number: 11 (DW_TAG_subprogram)
   <9c>   DW_AT_low_pc      : 0x400600
   <a4>   DW_AT_high_pc     : 0xac
   <aa>   DW_AT_name        : (indirect string, offset: 0xb6): main
   ...
<2><ca>: Abbrev Number: 13 (DW_TAG_inlined_subroutine)
   <cb>   DW_AT_abstract_origin: <0x83>
   <cf>   DW_AT_ranges      : 0x0
   <d3>   DW_AT_call_file   : 1
   <d4>   DW_AT_call_line   : 41
...
<2><e8>: Abbrev Number: 15 (DW_TAG_inlined_subroutine)
   <e9>   DW_AT_abstract_origin: <0x8f>
   <ed>   DW_AT_low_pc      : 0x400680
   <f5>   DW_AT_high_pc     : 0x6
   <f9>   DW_AT_call_file   : 1
   <fa>   DW_AT_call_line   : 43

This is the DWARF debugging information (check out these two great resources for an introduction to DWARF) representing the main function and the inlined instances of the functions get_nonzero_three and get_nonzero_five, respectively. The DW_TAG_inlined_subroutine blocks provide information about which instructions correspond to each inlined function so that the debugger knows which ones to display in the backtrace. Let's examine the relevant instructions for each inlined function. The DW_AT_ranges attribute on the first inlined function block tells us that the instruction ranges covered by this inlined function are defined in another DWARF section called .debug_ranges at offset 0x0 so let's take a look at what's there:

$ readelf --debug=Ranges inline_merge | less
...
Offset   Begin    End
00000000 000000000040067a 0000000000400680
00000000 000000000040069d 00000000004006ac
00000000 <End of list>
...

This means that the inlined function get_nonzero_three covers two blocks of instructions: 40067a-400680 and 40069d-4006ac. Notice that these correspond exactly to two blocks of instructions that we looked at above: the check of if (items[3] == 0) and the log/panic block at the end. Most importantly, this second block includes the instruction pointer in main at the time of the panic. Now how about get_nonzero_five? It has the DWARF attributes DW_AT_low_pc and DW_AT_high_pc, which tell us the start and end instructions of its range. This is effectively a shorthand version for a single row in the ranges table above. The range for get_nonzero_five is 400680-400686. This corresponds exactly to the block of instructions above that implements if (items[5] == 0). However, it notably does not include the range 40069d-4006ac, representing the log/panic block! Therefore when the debugger checks which inlined functions are relevant for the instruction pointer 0x4006ac, it matches get_nonzero_three but not get_nonzero_five, and this happens regardless of which function was actually called to lead us to that point!

How can this be displayed correctly?

There's really nothing that the debugger can do in this case, given the debugging information available to it. That information indicates that the current instruction pointer is within the range of get_nonzero_three but not get_nonzero_five, so the debugger is actually doing exactly the right thing based on that information. What information could the compiler provide instead to enable the debugger to provide useful feedback to developers? Once this optimization is performed, the most accurate information that a debugger could possibly provide in this case is that main could be executing either the inlined function get_nonzero_three or the inlined function get_nonzero_five but that it can't say for sure. Another alternative that's better than the status quo would be to omit ambiguous inlined function calls altogether and only display calls that definitively happened based on the current state. gdb already does this with chains of inlined calls that have multiple possible execution paths, but debugging information in this case made it appear as though the execution path was not ambiguous.

The compiler might have indicated this ambiguity by including the block 40069d-4006ac in the debugging entry for get_nonzero_five as well, which would have allowed the debugger to choose either of the display approaches described above. Based on a cursory reading of the DWARF spec I didn't see anything disallowing overlapping instruction ranges for concrete instances of inlined functions. However, if this is not allowed then I think the best alternative would be to include the overlapping range in neither inlined instance, which would cause the debugger to display neither function in the stack trace. This approach would also require marking get_nonzero_item as a direct inlined call from main for the ambiguous code block; otherwise it would be omitted from the stack trace as well.

Omitting those calls altogether is preferable to potentially including an incorrect call in a stack trace, because the latter can lead developers down the wrong debugging path. In our case, if the debugger had told us that there were two execution paths that could have led to the panic, we would have known to explore both possibilities and would have quickly identified the true root cause. Instead, we spent a lot of time debugging a root cause that never happened in the first place.

What can developers do about this?

Given the current behavior of compilers, there are a few takeaways that can help developers with these situations. The first is simply being aware that these sorts of optimizations are possible and that stack traces can be misleading in such cases. Part of the reason why it took us awhile to debug our issue was that we assumed that the stack trace we saw was accurate. We knew that variable values are often optimized out and can sometimes be incorrect in optimized builds, but we thought that stack traces should still be reliable.

The second takeaway is that error handling should always provide information about the data factoring into the error. For instance, if the error message in this program had included the index that caused the fault, we could have debugged the issue based on the error message alone; we would also have had a clear indication that the stack trace was incorrect. With the code written above, the faulty stack trace was still consistent with the log output that we saw, so there wasn't a clear reason to doubt the stack trace. Additionally, in this particular case, making the error message depend on the input data might have prevented the compiler from performing the optimization that caused this bug in the first place, because the two failure cases would have shared much less code, but that won't reliably be the case in all situations. A more general expression of this takeaway is: when writing error handling code, we should always aim to make it as easy as possible to distinguish between multiple causes of, or inputs into, an issue.