LP#
1953044: defer final cleanup of reaped children
This patch makes OpenSRF::Server->reap_children() put
the objects representing reaped children onto a "zombie list"
for later cleanup. Doing this avoids a race condition
where ->reap_children(), when invoked via a signal, can destroy
a child object out from underneath the loop in ->check_status()
that sets up an IO::Select object to check for input from
child pipes. When this race condition is triggered, the
listener will crash with the following error and attempt
to reset itself:
server: died with error Use of freed value in iteration
at /usr/lib/x86_64-linux-gnu/perl/5.28/IO/Select.pm line 70.
Because the reset logic itself has some problems, this type of
failure can result in a service that appears to be active
but whose listener thinks it has more active children than
is actually the case, degrading (or preventing) responsiveness.
In particular, one symptom of this bug is observing the size
of the service's backlog queue growing even when it hasn't
actually spawned the maximum number of drones it's configured to
have.
To test
-------
[1] Fire a lot of requests at an OpenSRF service and check for
the error listed above. On Perl 5.24 or later, this will
eventually happen.
To increase the chances of such an error happening quickly,
I found that the following change to opensrf.Slooooooow.wait
was enough to trigger the bug in a reasonable amount of
time. (Throwing a lot of parallel requests at the service
seems to help reduce the waiting time.)
@@ -24,7 +24,10 @@ sub wait_for_it {
$pause //= 1;
$log->info("Holding for $pause seconds...");
- sleep($pause);
+ #sleep($pause);
+ select(undef, undef, undef, 0.1);
+
+ if (int(rand(10)) <= 5) { die; }
$log->info("Done waiting, time to return.");
return [$pause, @_]
}
[2] Apply the patch and repeat step 1. This time, the error
should not occur.
Signed-off-by: Galen Charlton <gmc@equinoxOLI.org>