aboutsummaryrefslogtreecommitdiff
path: root/test/system/260-sdnotify.bats
Commit message (Collapse)AuthorAge
* k8systemd: run k8s workloads in systemdValentin Rothberg2022-05-17
| | | | | | | | | | | | | | | | | | | | Support running `podman play kube` in systemd by exploiting the previously added "service containers". During `play kube`, a service container is started before all the pods and containers, and is stopped last. The service container communicates its conmon PID via sdnotify. Add a new systemd template to dispatch such k8s workloads. The argument of the template is the path to the k8s file. Note that the path must be escaped for systemd not to bark: Let's assume we have a `top.yaml` file in the home directory: ``` $ escaped=$(systemd-escape ~/top.yaml) $ systemctl --user start podman-play-kube@$escaped.service ``` Closes: https://issues.redhat.com/browse/RUN-1287 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
* sdnotify: send MAINPID only onceValentin Rothberg2022-05-12
| | | | | | | | | Send the main PID only once. Previously, `(*Container).start()` and the conmon handler sent them ~simultaneously and went into a race. I noticed the issue while debugging a WIP PR. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
* sdnotify test: accept MAINPID anywhereEd Santiago2021-09-30
| | | | | | | | | | systemd sometimes spits out lines in the wrong order. Deal with it. This fixes an infrequent flake that I haven't filed because I didn't understand it well enough. (Hence, this reduces BUGS but does not reduce BUG COUNT. Sorry!) Signed-off-by: Ed Santiago <santiago@redhat.com>
* System tests: add cleanup & debugging outputEd Santiago2021-09-01
| | | | | | | | | | | | Cleanup: the final 'play' test wasn't cleaning up after itself, leading to angry warning messages when rerunning tests (in my environment; never in CI) Debug: I'm seeing a lot of "Could not parse READY=1 as MAINPID=nnn" flakes in the sdnotify:container test (nine in the past month). Add debug traces to help diagnose in future flakes. Signed-off-by: Ed Santiago <santiago@redhat.com>
* Implement SD-NOTIFY proxy in conmonDaniel J Walsh2021-08-20
| | | | | | | | | | | | | | | | This leverages conmon's ability to proxy the SD-NOTIFY socket. This prevents locking caused by OCI runtime blocking, waiting for SD-NOTIFY messages, and instead passes the messages directly up to the host. NOTE: Also re-enable the auto-update tests which has been disabled due to flakiness. With this change, Podman properly integrates into systemd. Fixes: #7316 Signed-off-by: Joseph Gooch <mrwizard@dok.org> Signed-off-by: Daniel J Walsh <dwalsh@redhat.com> Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
* System tests: honor $OCI_RUNTIME (for CI)Ed Santiago2021-05-03
| | | | | | | | | | | | | | | | | | | | | | | Some CI systems set $OCI_RUNTIME as a way to override the default crun. Integration (e2e) tests honor this, but system tests were not aware of the convention; this means we haven't been testing system tests with runc, which means RHEL gating tests are now failing. The proper solution would be to edit containers.conf on CI systems. Sorry, that would involve too much CI-VM work. Instead, this PR detects $OCI_RUNTIME and creates a dummy containers.conf file using that runtime. Add: various skips for tests that don't work with runc. Refactor: add a helper function so we don't need to do the complicated 'podman info blah blah .OCIRuntime.blah' thing in many places. BUG: we leave a tmp file behind on exit. Signed-off-by: Ed Santiago <santiago@redhat.com>
* sdnotify tests: try real hard to kill socat processesEd Santiago2021-03-11
| | | | | | | | | | | | | | | | | | | | | | | podman gating tests are hanging in the new Fedora CI setup; long and tedious investigation suggests that 'socat' processes are being left unkilled, which then causes BATS to hang when it (presumably) runs a final 'wait' in its end cleanup. The two principal changes are to exec socat in a subshell with fd3 closed, and to pkill its child processes before killing the process itself. I don't know if both are needed. The pkill definitely is; the exec may just be superstition. Since I've wasted more than a day of PTO time on this, I'm okay with a little superstition. What I do know is that with these two changes, my reproducer fails to reproduce in over one hour of trying (normally it fails within 5 minutes). AND, update: only rawhide (f35) leaves stray socat processes behind. f33 and ubuntu do not, so 'pkill -P' fails. I really have no idea what's going on. Signed-off-by: Ed Santiago <santiago@redhat.com>
* system tests: the catch-up gameEd Santiago2020-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | - run test: minor cleanup to .containerenv test. Basically, make it do only two podman-runs (they're expensive) and tighten up the results checks - ps test: add ps -a --storage. Requires small tweak to run_podman helper, so we can have "timeout" be an expected result - sdnotify test: workaround for #8718 (seeing MAINPID=xxx as last output line instead of READY=1). As found by the newly-added debugging echos, what we are seeing is: MAINPID=103530 READY=1 MAINPID=103530 It's not supposed to be that way; it's supposed to be just the first two. But when faced with reality, we must bend to accommodate it, so let's accept READY=1 anywhere in the output stream, not just as the last line. Signed-off-by: Ed Santiago <santiago@redhat.com>
* Tests: Fix common flakes, and improve apiv2 test logEd Santiago2020-10-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - apiv2 - the 'ten /info requests' test is flaking often, taking ~8 seconds (our limit is 7, up from 5 a few weeks ago). Brent suggested that the first /info call might be expensive, because it needs to access storage. So, let's prime it by running one /info outside the timing loop. And, because even that continues to fail, bump it up to 10 seconds and file #8076 to track the slowdown. - toolbox test - WaitForReady() has timed out, even on one occasion causing a run failure because it failed 3 times. Solution: bump up timeout from 2s to 5s. Not really great, but CI systems are underpowered, and it's not unreasonable that 2s might be too low. - sdnotify test - add a 'podman wait' between stop & rm. This may prevent a "cannot rm container as it is running" race condition. While working on this, Brent and I noticed a few ways that test-apiv2 logging can be improved: - test name: when request is POST, display the jsonified parameters, not the original input ones. This should make it much easier to reproduce failures. - use curl's "--write-out" option to capture http code, content type, and request time. We were getting the first two via grep from logged headers; this is cleaner. And there was no other way to get timing. We now include the timing as X-Response-Time in the log file. - abort on *any* curl error, not just 7 (cannot connect). Any error at all from curl is bad news. Signed-off-by: Ed Santiago <santiago@redhat.com>
* System test additionsEd Santiago2020-10-14
| | | | | | | | | | | | | | | | | | | | | | | | - run --userns=keep-id: confirm that $HOME gets set (#8013) - inspect: confirm that JSON output is a sane number of lines (10 or more), not an unreadable one-liner (#8011 and #8021). Do so with image, pod, network, volume because the code paths might be different. - cgroups: confirm that 'run' preserves cgroup manager (#7970) - sdnotify: reenable tests, and hope CI doesn't hang. This test was disabled on August 18 because CI jobs were hanging and timing out. My suspicion was that it was #7316, which in turn seems to have hinged on conmon #182. The latter was merged on Sep 16, so let's cross our fingers and see what happens. Also: remove inaccurate warning from a networking test. And, wow, fix is_cgroupsv2(), it has never actually worked. Signed-off-by: Ed Santiago <santiago@redhat.com>
* Migrate away from docker.ioEd Santiago2020-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CI and system tests currently pull some images from docker.io. Eliminate that, by: - building a custom image containing much of what we need for testing; and - copying other needed images to quay.io (Reason: effective 2020-11-01 docker.io will limit the number of image pulls). The principal change is to create a new quay.io/libpod/testimage, using the new test/system/build-testimage script, instead of relying on quay.io/libpod/alpine_labels. We also switch to using a hardcoded :YYYYMMDD tag, instead of :latest, in an attempt to futureproof our CI. This image includes 'httpd' from busybox-extras, which we use in our networking test (previously we had to pull and run busybox from docker.io). The testimage can and should be extended as needed for future tests, e.g. adding test file content or other useful tools. For the '--pull' tests which require actually pulling from the registry, I've created an image with the same name but tagged :00000000 so it will never be pulled by default. Since this image is only used minimally, it's just busybox. Unfortunately there remain two cases we cannot solve in this tiny alpine-based image: 1) docker registry 2) systemd For those, I've (manually) run: podman pull [ docker.io/library/registry:2.7 | registry.fedoraproject.org/fedora:31 ] podman tag !$ quay.io/... podman push !$ ...and amended the calling tests accordingly. I've tried to make the the smallest reasonable diff, not the smallest possible one. I hope it's a reasonable tradeoff. Signed-off-by: Ed Santiago <santiago@redhat.com>
* system tests: enable more remote tests; cleanupEd Santiago2020-08-19
| | | | | | | | | | | | | | | | | | | | | | | | info, images, run, networking tests: remove some skip_if_remote()s that were added in the varlink days. All of these tests now seem to work with APIv2. help test: check that first output line from 'podman --help' is the program description (regression check for #7273). load test: clean up stray images, rewrite test to make it conform to existing convention. In the process, discover and file #7337 exec test (and networking): file #7360, and add FIXME comment to skip()s suggesting evaluating those tests once that is fixed. pod test: now that #6328 is fixed, use 'podman pod inspect --format' instead of relying on jq Various other tests: add an explanation of why test is disabled so we can more easily distinguish "this will never be meaningful under remote" vs "hey, doesn't work for now, but maybe someday". Signed-off-by: Ed Santiago <santiago@redhat.com>
* Re-disable sdnotify tests to try to fix CIEd Santiago2020-08-18
| | | | | | | | Some CI tests are hanging, timing out in 60 or 120 minutes. I wonder if it's #7316, the bug where all podman commands hang forever if NOTIFY_SOCKET is set? Signed-off-by: Ed Santiago <santiago@redhat.com>
* system tests: enable sdnotify testsEd Santiago2020-08-13
| | | | | | | | | | | | | | | | | | | | | | | | Oops. PR #6693 (sdnotify) added tests, but they were disabled due to broken crun on f31. I tried for three weeks to get a magic CI:IMG PR to update crun on the CI VMs ... but in that time I forgot to actually enable those new tests. This PR removes a 'skip', replacing it with a check that systemd is running plus one more to make sure our runtime is crun. It looks like sdnotify just doesn't work on Ubuntu (it hangs), and my guess is that it's a crun/runc issue. I also changed the test image from fedora:latest to :31, because, sigh, fedora:latest removed the systemd-notify tool. WARNING WARNING WARNING: the symptom of a missing systemd-notify is that podman will hang forever, not even stopped by the timeout command in podman_run! (Filed: #7316). This means that if the sdnotify-in-container test ever fails, the symptom will be that Cirrus itself will time out (2 hours?). This is horrible. I don't know what to do about it other than push for a fix for 7316. Signed-off-by: Ed Santiago <santiago@redhat.com>
* BATS system tests for new sdnotifyEd Santiago2020-07-06
Signed-off-by: Ed Santiago <santiago@redhat.com>