gdbserver/linux: set lwp !stopped when failing to resume
I see some failures, at least in gdb.multi/multi-re-run.exp and
gdb.threads/interrupted-hand-call.exp. Running `stress -C $(nproc)` at
the same time as the test makes those tests relatively frequent.
Let's take gdb.multi/multi-re-run.exp as an example. The failure looks
like this, an unexpected "no resumed":
continue
Continuing.
No unwaited-for children left.
(gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=2: iter=1: continue until exit
The situation is:
- Inferior 1 is stopped somewhere, it won't really play a role here.
- Inferior 2 has 2 threads, both stopped.
- We resume inferior 2, the leader thread is expected to exit, making
the process exit.
From GDB's perspective, a failing run looks like this:
[infrun] fetch_inferior_event: enter
[infrun] scoped_disable_commit_resumed: reason=handling event
[infrun] do_target_wait: Found 2 inferiors, starting at #1
[infrun] random_pending_event_thread: None found.
[remote] wait: enter
[remote] Packet received: T0506:
20dcffffff7f0000;07:
20dcffffff7f0000;10:
9551555555550000;thread:pae4cd.ae4cd;core:e;
[remote] wait: exit
[infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
[infrun] print_target_wait_results: 713933.713933.0 [Thread 713933.713933],
[infrun] print_target_wait_results: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
[infrun] handle_inferior_event: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
[infrun] clear_step_over_info: clearing step over info
[infrun] context_switch: Switching context from 0.0.0 to 713933.713933.0
[infrun] handle_signal_stop: stop_pc=0x555555555195
[infrun] start_step_over: enter
[infrun] start_step_over: stealing global queue of threads to step, length = 0
[infrun] operator(): step-over queue now empty
[infrun] start_step_over: exit
[infrun] process_event_stop_test: no stepping, continue
[remote] Sending packet: $Z0,
555555555194,1#8e
[remote] Packet received: OK
[infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [713933.713933.0] at 0x555555555195
[remote] Sending packet: $QPassSignals:e;10;14;17;1a;1b;1c;21;24;25;2c;4c;97;#0a
[remote] Packet received: OK
[remote] Sending packet: $vCont;c:pae4cd.-1#9f
[infrun] prepare_to_wait: prepare_to_wait
[infrun] reset: reason=handling event
[infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] fetch_inferior_event: exit
[infrun] fetch_inferior_event: enter
[infrun] scoped_disable_commit_resumed: reason=handling event
[infrun] do_target_wait: Found 2 inferiors, starting at #0
[infrun] random_pending_event_thread: None found.
[remote] wait: enter
[remote] Packet received: N
[remote] wait: exit
[infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
[infrun] print_target_wait_results: -1.0.0 [process -1],
[infrun] print_target_wait_results: status->kind = NO_RESUMED
[infrun] handle_inferior_event: status->kind = NO_RESUMED
[remote] Sending packet: $Hgp0.0#ad
[remote] Packet received: OK
[remote] Sending packet: $qXfer:threads:read::0,1000#92
[remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="
40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="
40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="
40b6c6f7ff7f0000"/>\n</threads>\n
[infrun] stop_waiting: stop_waiting
[remote] Sending packet: $qXfer:threads:read::0,1000#92
[remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="
40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="
40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="
40b6c6f7ff7f0000"/>\n</threads>\n
[infrun] infrun_async: enable=0
[infrun] reset: reason=handling event
[infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] fetch_inferior_event: exit
We can see that we resume the inferior with vCont;c, but got NO_RESUMED.
When the test passes, we get an EXITED status to indicate the process
has exited.
From GDBserver's point of view, it looks like this. The logs contain
some logging I added and that are part of this patch.
[remote] getpkt: getpkt ("vCont;c:pae4cf.-1"); [no ack sent]
[threads] resume: enter
[threads] thread_needs_step_over: Need step over [LWP 713931]? Ignoring, should remain stopped
[threads] thread_needs_step_over: Need step over [LWP 713932]? Ignoring, should remain stopped
[threads] get_pc: pc is 0x555555555195
[threads] thread_needs_step_over: Need step over [LWP 713935]? No, no breakpoint found at 0x555555555195
[threads] get_pc: pc is 0x7ffff7d35a95
[threads] thread_needs_step_over: Need step over [LWP 713936]? No, no breakpoint found at 0x7ffff7d35a95
[threads] resume: Resuming, no pending status or step over needed
[threads] resume_one_thread: resuming LWP 713935
[threads] proceed_one_lwp: lwp 713935
[threads] resume_one_lwp_throw: continue from pc 0x555555555195
[threads] resume_one_lwp_throw: Resuming lwp 713935 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: NOW ptid=713935.713935.0 stopped=0 resumed=0
[threads] resume_one_thread: resuming LWP 713936
[threads] proceed_one_lwp: lwp 713936
[threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
[threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
[threads] resume: exit
[threads] wait_1: enter
[threads] wait_1: [<all threads>]
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
[threads] resume_stopped_resumed_lwps: resuming stopped-resumed LWP LWP 713935.713936 at
7ffff7d35a95: step=0
[threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
[threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
[threads] operator(): check_zombie_leaders: leader_pid=713931, leader_lp!=NULL=1, num_lwps=2, zombie=0
[threads] operator(): check_zombie_leaders: leader_pid=713935, leader_lp!=NULL=1, num_lwps=2, zombie=1
[threads] operator(): Thread group leader 713935 zombie (it exited, or another thread execd).
[threads] delete_lwp: deleting 713935
[threads] wait_for_event_filtered: exit (no unwaited-for LWP)
sigchld_handler
[threads] wait_1: ret = null_ptid, TARGET_WAITKIND_NO_RESUMED
[threads] wait_1: exit
What happens is:
- We resume the leader (713935) successfully.
- The leader exits.
- We resume the secondary thread (713936), we get ESRCH. This is
expected this the leader has exited.
- resume_one_lwp_throw throws, it's caught by resume_one_lwp.
- resume_one_lwp checks with check_ptrace_stopped_lwp_gone that the
failure can be explained by the LWP becoming zombie, and swallows the
error.
- Note that this means that the secondary lwp still has stopped==1.
- wait_1 is called, probably because linux_process_target::resume marks
the async pipe at the end.
- The exit event isn't ready yet, probably because the machine is under
load, so waitpid returns nothing.
- check_zombie_leaders detects that the leader is zombie and deletes
- We try to find a resumed (non-stopped) LWP to get an event from,
there's none since the leader (that was resumed) is now deleted, and
the secondary thread is still marked stopped.
wait_for_event_filtered returns -1, causing wait_1 to return
NO_RESUMED.
What I notice here is that there is some kind of race between the
availability of the process' exit notification and the call to wait_1
that results from marking the async pipe at the end of resume.
I think what we want from this wait_1 invocation is to keep waiting, as
we will eventually get thread exit notifications for both of our
threads.
The fix I came up with is to mark the secondary thread as !stopped (or
resumed) when we fail to resume it. This makes wait_1 see that there is
at least one resume lwp, so it won't return NO_RESUMED. I think this
makes sense to consider it resumed, because we are going to receive an
exit event for it. Here's the GDBserver logs with the fix applied:
[threads] resume: enter
[threads] thread_needs_step_over: Need step over [LWP 724595]? Ignoring, should remain stopped
[threads] thread_needs_step_over: Need step over [LWP 724596]? Ignoring, should remain stopped
[threads] get_pc: pc is 0x555555555195
[threads] thread_needs_step_over: Need step over [LWP 724597]? No, no breakpoint found at 0x555555555195
[threads] get_pc: pc is 0x7ffff7d35a95
[threads] thread_needs_step_over: Need step over [LWP 724598]? No, no breakpoint found at 0x7ffff7d35a95
[threads] resume: Resuming, no pending status or step over needed
[threads] resume_one_thread: resuming LWP 724597
[threads] proceed_one_lwp: lwp 724597
[threads] resume_one_lwp_throw: continue from pc 0x555555555195
[threads] resume_one_lwp_throw: Resuming lwp 724597 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: NOW ptid=724597.724597.0 stopped=0 resumed=0
[threads] resume_one_thread: resuming LWP 724598
[threads] proceed_one_lwp: lwp 724598
[threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
[threads] resume_one_lwp_throw: Resuming lwp 724598 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
[threads] resume: exit
[threads] wait_1: enter
[threads] wait_1: [<all threads>]
sigchld_handler
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
[threads] operator(): check_zombie_leaders: leader_pid=724595, leader_lp!=NULL=1, num_lwps=2, zombie=0
[threads] operator(): check_zombie_leaders: leader_pid=724597, leader_lp!=NULL=1, num_lwps=2, zombie=1
[threads] operator(): Thread group leader 724597 zombie (it exited, or another thread execd).
[threads] delete_lwp: deleting 724597
[threads] wait_for_event_filtered: sigsuspend'ing
sigchld_handler
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 724598, ERRNO-OK
[threads] wait_for_event_filtered: waitpid 724598 received 0 (exited)
[threads] filter_event: 724598 exited
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 724597, ERRNO-OK
[threads] wait_for_event_filtered: waitpid 724597 received 0 (exited)
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
sigchld_handler
[threads] wait_1: ret = LWP 724597.724598, exited with retcode 0
[threads] wait_1: exit
Change-Id: Idf0bdb4cb0313f1b49e4864071650cc83fb3c100