summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorEd Santiago <santiago@redhat.com>2019-12-28 07:48:33 -0700
committerEd Santiago <santiago@redhat.com>2019-12-28 08:00:05 -0700
commit0f78f345d9b063987fc14685a13898191e6f1304 (patch)
tree78c4241277148375fb91f63199459e6a5454e5c8
parent24b4921508be011b8fc4707d262116f77c8a176b (diff)
downloadpodman-0f78f345d9b063987fc14685a13898191e6f1304.tar.gz
podman-0f78f345d9b063987fc14685a13898191e6f1304.tar.bz2
podman-0f78f345d9b063987fc14685a13898191e6f1304.zip
Fix race condition in kill test leading to hang
When you open a FIFO for reading, but there's no writer, you hang. This is just one of those obscure UNIXisms we all know but just forget all too often. My last PR was guilty of introducing such a condition; I caught it by accident while testing other stuff. In short, the signal container was doing 'echo DONE' as its last step, and we (BATS) were reading the FIFO to check for it; but if the container exited before we opened the FIFO for read, the open would hang. This is not a hang that we can catch in the test: it would hang the entire job forever. CI would presumably time out eventually, but with no useful indication of the cause of the error. Solution: use 'exec' to open the FIFO early and keep it open, and use 'read -u FD' instead of 'read <$fifo': the former reads from an open FD, the latter forces a new open() each time. There is a shorter, more maintainable solution -- see #4755 -- but that suffers from the same hanging problem in the (unlikely) case where the signal-handling container exits, e.g. if signal handling is broken in podman. The test would hang, with no helpful indicator. Although this PR is a little more advanced scripting, I have commented the relevant code well and believe the maintenance cost is worth the risk of undebuggable hangs. There is still a hang risk: if 'podman logs -f' fails and exits immediately, the 'exec' will hang. I can't think of a non-racy way to prevent that, and choose to live with that risk. Tested by temporarily including 9 (SIGKILL) in the signals list. The read timeout triggers, and the end user has a fair chance of tracking down the root cause. Signed-off-by: Ed Santiago <santiago@redhat.com>
-rw-r--r--test/system/130-kill.bats17
1 files changed, 13 insertions, 4 deletions
diff --git a/test/system/130-kill.bats b/test/system/130-kill.bats
index af3409b9b..aae7f114f 100644
--- a/test/system/130-kill.bats
+++ b/test/system/130-kill.bats
@@ -16,10 +16,19 @@ load helpers
# and confirm that signals are received. We can't use run_podman here.
local fifo=${PODMAN_TMPDIR}/podman-kill-fifo.$(random_string 10)
mkfifo $fifo
- $PODMAN logs -f $cid >$fifo &
+ $PODMAN logs -f $cid >$fifo </dev/null &
podman_log_pid=$!
+
+ # Open the FIFO for reading, and keep it open. This prevents a race
+ # condition in which the container can exit (e.g. if for some reason
+ # it doesn't handle the signal) and we (this test) try to read from
+ # the FIFO. Since there wouldn't be an active writer, the open()
+ # would hang forever. With this exec we keep the FD open, allowing
+ # 'read -t' to time out and report a useful error.
+ exec 5<$fifo
+
# First container emits READY when ready; wait for it.
- read -t 10 ready <$fifo
+ read -t 10 -u 5 ready
is "$ready" "READY" "first log message from container"
# Helper function: send the given signal, verify that it's received.
@@ -28,7 +37,7 @@ load helpers
local signum=${2:-$1} # e.g. if signal=HUP, we expect to see '1'
run_podman kill -s $signal $cid
- read -t 10 actual <$fifo
+ read -t 10 -u 5 actual || die "Timed out: no ACK for kill -s $signal"
is "$actual" "got: $signum" "Signal $signal handled by container"
}
@@ -46,7 +55,7 @@ load helpers
# Done. Tell the container to stop, and wait for final DONE
run_podman exec $cid touch /stop
- read -t 5 done <$fifo
+ read -t 5 -u 5 done || die "Timed out waiting for DONE from container"
is "$done" "DONE" "final log message from container"
# Clean up