diff options
author | Valentin Rothberg <vrothberg@redhat.com> | 2022-06-10 12:38:28 +0200 |
---|---|---|
committer | Matthew Heon <mheon@redhat.com> | 2022-06-23 09:11:57 -0400 |
commit | 30e7cbccc1942d4f8e3b4f489b9f33f30dec7233 (patch) | |
tree | 532d323a187caf4309b0ec0f5a5ace46574f27d0 /pkg/domain | |
parent | 15188dce0566ffcba9e7da6fbde69625f49e0e16 (diff) | |
download | podman-30e7cbccc1942d4f8e3b4f489b9f33f30dec7233.tar.gz podman-30e7cbccc1942d4f8e3b4f489b9f33f30dec7233.tar.bz2 podman-30e7cbccc1942d4f8e3b4f489b9f33f30dec7233.zip |
libpod: fix wait and exit-code logic
This commit addresses three intertwined bugs to fix an issue when using
Gitlab runner on Podman. The three bug fixes are not split into
separate commits as tests won't pass otherwise; avoidable noise when
bisecting future issues.
1) Podman conflated states: even when asking to wait for the `exited`
state, Podman returned as soon as a container transitioned to
`stopped`. The issues surfaced in Gitlab tests to fail [1] as
`conmon`'s buffers have not (yet) been emptied when attaching to a
container right after a wait. The race window was extremely narrow,
and I only managed to reproduce with the Gitlab runner [1] unit
tests.
2) The clearer separation between `exited` and `stopped` revealed a race
condition predating the changes. If a container is configured for
autoremoval (e.g., via `run --rm`), the "run" process competes with
the "cleanup" process running in the background. The window of the
race condition was sufficiently large that the "cleanup" process has
already removed the container and storage before the "run" process
could read the exit code and hence waited indefinitely.
Address the exit-code race condition by recording exit codes in the
main libpod database. Exit codes can now be read from a database.
When waiting for a container to exit, Podman first waits for the
container to transition to `exited` and will then query the database
for its exit code. Outdated exit codes are pruned during cleanup
(i.e., non-performance critical) and when refreshing the database
after a reboot. An exit code is considered outdated when it is older
than 5 minutes.
While the race condition predates this change, the waiting process
has apparently always been fast enough in catching the exit code due
to issue 1): `exited` and `stopped` were conflated. The waiting
process hence caught the exit code after the container transitioned
to `stopped` but before it `exited` and got removed.
3) With 1) and 2), Podman is now waiting for a container to properly
transition to the `exited` state. Some tests did not pass after 1)
and 2) which revealed the third bug: `conmon` was executed with its
working directory pointing to the OCI runtime bundle of the
container. The changed working directory broke resolving relative
paths in the "cleanup" process. The "cleanup" process error'ed
before actually cleaning up the container and waiting "main" process
ran indefinitely - or until hitting a timeout. Fix the issue by
executing `conmon` with the same working directory as Podman.
Note that fixing 3) *may* address a number of issues we have seen in the
past where for *some* reason cleanup processes did not fire.
[1] https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27119#note_970712864
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
[MH: Minor reword of commit message]
Signed-off-by: Matthew Heon <mheon@redhat.com>
Diffstat (limited to 'pkg/domain')
-rw-r--r-- | pkg/domain/infra/abi/containers.go | 22 |
1 files changed, 4 insertions, 18 deletions
diff --git a/pkg/domain/infra/abi/containers.go b/pkg/domain/infra/abi/containers.go index 1934533cf..281e448f6 100644 --- a/pkg/domain/infra/abi/containers.go +++ b/pkg/domain/infra/abi/containers.go @@ -16,7 +16,6 @@ import ( "github.com/containers/image/v5/manifest" "github.com/containers/podman/v4/libpod" "github.com/containers/podman/v4/libpod/define" - "github.com/containers/podman/v4/libpod/events" "github.com/containers/podman/v4/libpod/logs" "github.com/containers/podman/v4/pkg/checkpoint" "github.com/containers/podman/v4/pkg/domain/entities" @@ -939,6 +938,7 @@ func (ic *ContainerEngine) ContainerStart(ctx context.Context, namesOrIds []stri } return reports, errors.Wrapf(err, "unable to start container %s", ctr.ID()) } + exitCode = ic.GetContainerExitCode(ctx, ctr) reports = append(reports, &entities.ContainerStartReport{ Id: ctr.ID(), @@ -1099,25 +1099,11 @@ func (ic *ContainerEngine) ContainerRun(ctx context.Context, opts entities.Conta func (ic *ContainerEngine) GetContainerExitCode(ctx context.Context, ctr *libpod.Container) int { exitCode, err := ctr.Wait(ctx) - if err == nil { - return int(exitCode) - } - if errors.Cause(err) != define.ErrNoSuchCtr { - logrus.Errorf("Could not retrieve exit code: %v", err) + if err != nil { + logrus.Errorf("Waiting for container %s: %v", ctr.ID(), err) return define.ExecErrorCodeNotFound } - // Make 4 attempt with 0.25s backoff between each for 1 second total - var event *events.Event - for i := 0; i < 4; i++ { - event, err = ic.Libpod.GetLastContainerEvent(ctx, ctr.ID(), events.Exited) - if err != nil { - time.Sleep(250 * time.Millisecond) - continue - } - return event.ContainerExitCode - } - logrus.Errorf("Could not retrieve exit code from event: %v", err) - return define.ExecErrorCodeNotFound + return int(exitCode) } func (ic *ContainerEngine) ContainerLogs(ctx context.Context, containers []string, options entities.ContainerLogsOptions) error { |