While working on a later patch, which changes gdb.base/foll-vfork.exp,
I noticed that sometimes I would hit this assert:
x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
I eventually tracked it down to a combination of schedule-multiple
mode being on, target-non-stop being off, follow-fork-mode being set
to child, and some bad timing. The failing case is pretty simple: a
single-threaded application performs a vfork, the child process then
execs some other application, while the parent process (once the vfork
child has completed its exec) just exits.
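As a concrete illustration, the failing scenario boils down to
something like the following; this is my own minimal sketch, not the
actual source used by the gdb.base/foll-vfork.exp test:

#include <unistd.h>

int
main (void)
{
  pid_t pid = vfork ();
  if (pid == 0)
    {
      /* The vfork child: replace ourselves with some other program.  */
      execlp ("/bin/true", "true", (char *) NULL);
      _exit (127);   /* Only reached if the exec fails.  */
    }

  /* The vfork parent: the kernel holds us stopped until the child
     has exec'd (or exited), after which we just exit.  */
  return 0;
}

Debugging a program like this with 'set follow-fork-mode child',
'set schedule-multiple on', and 'maint set target-non-stop off' in
effect is what, given bad timing, trips the assert above.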
As best I understand things, here's what happens when things go
wrong:
1. The parent process performs a vfork, GDB sees the VFORKED event
and creates an inferior and thread for the vfork child,
2. GDB resumes the vfork child process. As schedule-multiple is on
and target-non-stop is off, this is translated into a request to
start all processes (see user_visible_resume_ptid),
3. In the linux-nat layer we spot that one of the threads we are
about to start is a vfork parent, and so we don't start that thread
(see resume_lwp, sketched just after this list); the vfork child
thread is resumed,
4. GDB waits for the next event, eventually entering
linux_nat_target::wait, which in turn calls linux_nat_wait_1,
5. In linux_nat_wait_1 we eventually call
resume_stopped_resumed_lwps; this should restart threads that have
stopped but don't actually have anything interesting to report.
6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
vfork parents like resume_lwp does, so at this point the vfork
parent is resumed. This feels like the start of the bug, and this
is where I'm proposing to fix things, but resuming the vfork parent
isn't the worst thing in the world, because....
7. As the vfork child is still alive the kernel holds the vfork
parent stopped,
8. Eventually the child performs its exec and GDB is sent an EXECD
event. However, because the parent is resumed, as soon as the child
performs its exec the vfork parent also sends a VFORK_DONE event to
GDB,
9. Depending on timing, both of these events might seem to arrive in
GDB at the same time. Normally GDB expects to see the EXECD or
EXITED/SIGNALLED event from the vfork child before getting the
VFORK_DONE in the parent. We know this because it is as a result of
the EXECD/EXITED/SIGNALLED event that GDB detaches from the parent
(see handle_vfork_child_exec_or_exit for details). Further, the
comment in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE
indicates that when we remain attached to the child (not the parent)
we should not expect to see a VFORK_DONE,
10. If both events arrive at the same time then GDB will randomly
choose one event to handle first; in some cases this will be the
VFORK_DONE. As described above, upon seeing a VFORK_DONE GDB
expects that (a) the vfork child has finished (in this case not
completely true: the child has finished, but GDB has not yet
processed the event associated with that completion), and (b) we are
remaining attached to the parent, and so GDB resumes the parent
process,
11. GDB now handles the EXECD event. In our case we are detaching
from the parent, so GDB calls target_detach (see
handle_vfork_child_exec_or_exit),
12. While this has been going on the vfork parent is executing, and
might even exit,
13. In linux_nat_target::detach the first thing we do is stop all
threads in the process we're detaching from, the result of the stop
request will be cached on the lwp_info object,
14. In our case, though, the vfork parent has exited, so when GDB
waits for the thread, instead of a stop due to a signal, we get a
thread-exited status,
15. Later in the detach process we try to resume the threads just
prior to making the ptrace call to actually detach (see
detach_one_lwp); as part of resuming a thread we try to touch some
registers within the thread, and before doing this GDB asserts that
the thread is stopped,
16. An exited thread is not classified as stopped, and so the assert
triggers!
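For reference, the guard in resume_lwp mentioned in step #3 looks
roughly like this; treat it as a paraphrased sketch from memory
rather than the exact upstream code:

static void
resume_lwp (struct lwp_info *lp, int step, enum gdb_signal signo)
{
  if (lp->stopped)
    {
      inferior *inf = find_inferior_ptid (linux_target, lp->ptid);

      if (inf->vfork_child != nullptr)
        {
          /* A vfork parent is held stopped by the kernel until its
             vfork child execs or exits; don't try to resume it.  */
          linux_nat_debug_printf ("Not resuming LWP %s (vfork parent)",
                                  lp->ptid.to_string ().c_str ());
        }
      else
        {
          /* ... actually resume the LWP ...  */
        }
    }
}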
So there are two bugs that I see here. The first, and most
critical, is in step #6. I think that resume_stopped_resumed_lwps
should not resume a vfork parent, just like resume_lwp doesn't resume
a vfork parent.
With this change in place the vfork parent will remain stopped in
step #6; the VFORK_DONE event from #8 will then never be sent, and
instead GDB will only see the EXECD/EXITED/SIGNALLED event. The
problems in #9 and #10 are therefore skipped and we arrive at #11,
handling the EXECD event. As the parent is still stopped, #12
doesn't apply, and in #13, when we try to stop the process, we will
see that it is already stopped; there's no risk of the vfork parent
exiting before we get to this point. And finally, in #15 we are safe
to poke the process registers because the process will not have
exited by this point.
However, I did mention two bugs.
The second bug I've not yet managed to actually trigger, but I'm
convinced it must exist: if we forget vforks for a moment, in step #13
above, when linux_nat_target::detach is called, we first try to stop
all threads in the process GDB is detaching from. If we imagine a
multi-threaded inferior with many threads, and GDB running in
non-stop mode, then, if the user tries to detach, there is a chance
that a thread could exit just as linux_nat_target::detach is entered,
in which case we should be able to trigger the same assert.
But, like I said, I've not (yet) managed to trigger this second bug,
and even if I could, the fix would not belong in this commit, so I'm
pointing this out just for completeness.
There's no test included in this commit. In a couple of commits'
time I will expand gdb.base/foll-vfork.exp, which is when this bug
would be exposed. Unfortunately there are at least two other bugs
(separate
from the ones discussed above) that need fixing first, these will be
fixed in the next commits before the gdb.base/foll-vfork.exp test is
expanded.
If you do want to reproduce this failure then you will almost
certainly need to run the gdb.base/foll-vfork.exp test in a loop, as
the failures are all very timing sensitive. I've found that running
multiple copies in parallel makes the failure more likely to appear;
I usually run ~6 copies in parallel and expect to see a failure
within 10 minutes.
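The fix, then, is to make resume_stopped_resumed_lwps skip vfork
parents, mirroring the existing check in resume_lwp. The relevant
part of the change to gdb/linux-nat.c is: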
 static int
 resume_stopped_resumed_lwps (struct lwp_info *lp, const ptid_t wait_ptid)
 {
-  if (!lp->stopped)
+  inferior *inf = find_inferior_ptid (linux_target, lp->ptid);
+
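+  /* A vfork parent is held stopped by the kernel until its vfork
+     child has exec'd or exited; leave it stopped, just as
+     resume_lwp does.  */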
+  if (inf->vfork_child != nullptr)
+    {
+      linux_nat_debug_printf ("NOT resuming LWP %s (vfork parent)",
+                              lp->ptid.to_string ().c_str ());
+    }
+  else if (!lp->stopped)
     {
       linux_nat_debug_printf ("NOT resuming LWP %s, not stopped",
                               lp->ptid.to_string ().c_str ());