Skip to content
This repository was archived by the owner on Dec 13, 2018. It is now read-only.

WIP: Add Checkpoint and Restore support to libcontainer#479

Merged
crosbymichael merged 51 commits into
masterfrom
criu
May 21, 2015
Merged

WIP: Add Checkpoint and Restore support to libcontainer#479
crosbymichael merged 51 commits into
masterfrom
criu

Conversation

@crosbymichael

Copy link
Copy Markdown
Contributor

This is a work in progress PR so that we can move the discussion around this work into a PR to help move things along and get other people involved.

We are trying to get nsinit work as an easy way to test but the first class support should be for using it from a daemon such as docker.

Any more work can be done on the criu branch of this repo via pull requests.

Comment thread configs/config.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: Could the json property be called something else?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp
Sure, any suggestions? The name ext_pipes was chosen because the standard file descriptors are actually external pipes but I am open to better suggestions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SaiedKazemi I was thinking that the name of the exported field and the json tag be similar like elsewhere in the code. If the name is StdFds, then maybe std_fds for the json tag.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, +1. or rename StdFds to ExtPipes :)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Will do.

@LK4D4

LK4D4 commented Mar 24, 2015

Copy link
Copy Markdown
Contributor

Supercool.
Gofmt failed.

@jessfraz

Copy link
Copy Markdown
Contributor

so cool so cool so cool, can't contain myself

On Tue, Mar 24, 2015 at 1:49 PM, Alexander Morozov <notifications@github.com

wrote:

Supercool.
Gofmt failed.


Reply to this email directly or view it on GitHub
#479 (comment).

Comment thread container_linux.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we have hashed this before :) Should we check for the presence of a minimum version of criu that supports all the flags or maybe just document it, so distributions know what version they need?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the flags have been there for many years and we haven't introduced new flags in a long time. But if someone is running an old version of CRIU that doesn't support a specific flag, CRIU will print an error message and exit -- no damage done. That said, we can certainly add code to do version check every time but I really think it'd be an overkill :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SaiedKazemi That is fair. I think if we just document a minimum version, that will be helpful :)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for validating compatibility.

@vishh

vishh commented Mar 24, 2015

Copy link
Copy Markdown
Contributor

👍

@hqhq

hqhq commented Mar 26, 2015

Copy link
Copy Markdown
Contributor

Looking forward... 👍

Comment thread container_linux.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of all the debug code?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishh
Since we're still debugging, it makes a lot easier to add additional command line options when invoking CRIU through the environment. Otherwise, we need to edit and recompile source for every little change in the options. I will remove it soon.

@crosbymichael

Copy link
Copy Markdown
Contributor Author

Just for an FYI, nsinit is just for testing, the use case we should optimize for is the daemon(docker) use cases. We should have the ability to make checkpoint/restore options configurable.

Comment thread container_linux.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete dead code?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishh
Will do.

@boucher

boucher commented Mar 31, 2015

Copy link
Copy Markdown
Contributor

I'm interested in a slightly different use case than possible with the way this PR would work. Specifically, I want to be able to checkpoint a container without stopping it, and then later be able to either "roll back" the container to that point in time or alternatively spin up a new container from that checkpoint. Is there any interest in the idea of allowing you to specify an explicit path to checkpoint to and restore from (and, additionally, flags like --leave-running).

@boucher

boucher commented Apr 1, 2015

Copy link
Copy Markdown
Contributor

I'm trying to get this pull request running and having some trouble. What I've tried so far is to build libcontainer (with it's Dockerfile, and the addition of criu to that Dockerfile), and run the following test:

docker run -d --privileged -w /busybox dockercore/libcontainer nsinit exec -- sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 3; done'

followed by nsinit checkpoint from within the resulting container. That fails, with this output log:

https://gist.github.com/boucher/d2539bcaa1311c5d5f5c

Any suggestions? Or, is there a better way to test this?

(edit: I should note that I tried both master and the 1.5 stable branch of criu)

@boucher

boucher commented Apr 3, 2015

Copy link
Copy Markdown
Contributor

Just wanted to follow up in this thread that I was able to get this PR running outside of Docker, directly on a linux host, but haven't yet been able to get it working inside of Docker.

Comment thread container_linux.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on getting this branch up and running in Docker, but I'm a bit unsure about what to do here. Because we're not passing the -d flag to criu, this Wait will actually just wait forever -- that's fine in the sort of direct nsinit resume an interactive container case (which I had used to test this branch), but it doesn't work when dealing with the Docker daemon. Should this be a configurable flag we're passing in, or should the wait be handled at the nsinit level and not here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here is a mistake. I've fixed it in #486

Comment thread container_linux.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the purpose of this file now just to be a flag that things are checkpointed? (Previously, this is where things were being stored if no directory was specified)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@boucher Yes, it's a flag. It's used in currentStatus() to determine the Checkpointed state.

@boucher

boucher commented Apr 17, 2015

Copy link
Copy Markdown
Contributor

I'm getting this error on some restore attempts when running this branch:

Error (cr-service.c:58): Failed unpacking request: Success
Error (cr-service.c:694): Can't recv request: Success
data too short after length-prefix of 1217

I added a log in criuSwrk to print all the opts right before actually sending them, and other than the actual directory paths I seem to have two identical set of options that work and don't work (though the underlying container isn't the same, but the error message seems to imply a communication failure not an actual restore failure). I'm running in two different VMs so I'm going to try and narrow down the differences some more to figure out what's going on but I figured I'd post this here to see if there were any ideas.

@avagin

avagin commented Apr 17, 2015

Copy link
Copy Markdown
Contributor

@boucher I think we use too small buffer in CRIU. Probably we need to allocate it dinamically. Could you try out this patch for criu?

diff --git a/include/cr-service-const.h b/include/cr-service-const.h
index 668882b..66a059f 100644
--- a/include/cr-service-const.h
+++ b/include/cr-service-const.h
@@ -1,7 +1,7 @@
 #ifndef __CR_SERVICE_CONST_H__
 #define __CR_SERVICE_CONST_H__

-#define CR_MAX_MSG_SIZE 1024
+#define CR_MAX_MSG_SIZE 16384
 #define CR_DEFAULT_SERVICE_ADDRESS "/var/run/criu_service.socket"

 #endif /* __CR_SERVICE_CONST_H__ */

@avagin

avagin commented Apr 17, 2015

Copy link
Copy Markdown
Contributor

The last issue is a criu bug: http://lists.openvz.org/pipermail/criu/2015-April/019928.html

@avagin

avagin commented Apr 17, 2015

Copy link
Copy Markdown
Contributor

@crosbymichael @SaiedKazemi @LK4D4 @mrunalp @vmarmol I think I've fixed all known issue. Maybe it's time to merge this branch?

@SaiedKazemi

Copy link
Copy Markdown

@avagin @crosbymichael @LK4D4 @vmarmol
Thanks Andrew. I agree it's time to merge into master.

@crosbymichael

Copy link
Copy Markdown
Contributor Author

Ok, so I should be able to build this branch, and everything with work flawlessly the first time?

avagin and others added 11 commits May 20, 2015 15:19
In a future we may want to have more external descriptors.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
Signed-off-by: Andrey Vagin <avagin@openvz.org>
…a const.

Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
container.Checkpoint() doesn't clear c.initProcess and
it's used on restore.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
…equiring that one already exists.

Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
Adds iptables as a requirement of criu in the Dockerfile.

Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
The format of criu version is X.Y[.Z]. The current code can not parse X.Y,
because scanf returns the "input does not match format" error.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Docker-DCO-1.1-Signed-off-by: Ross Boucher <rboucher@gmail.com> (github: boucher)
@sirupsen

Copy link
Copy Markdown

This is really exciting work! Is there any update on the status of this PR?

README example for using checkpoint/restore.
@crosbymichael

Copy link
Copy Markdown
Contributor Author

I'm getting the error:

nsinit checkpoint --image-path /tmp/checkpoint
criu failed: type DUMP errno 0

@crosbymichael

Copy link
Copy Markdown
Contributor Author

criu is compiled via the v1.5.2 tag and running on ubuntu 15.04 (the new LTS)

@boucher

boucher commented May 21, 2015

Copy link
Copy Markdown
Contributor

Can you send the dump log?

On Thursday, May 21, 2015, Michael Crosby notifications@github.com wrote:

criu is compiled via the v1.5.2 tag


Reply to this email directly or view it on GitHub
#479 (comment).

Sent from Gmail Mobile

@crosbymichael

Copy link
Copy Markdown
Contributor Author

@boucher i don't have the dump.log in the image dir? is it located somewhere else?

@avagin

avagin commented May 21, 2015

Copy link
Copy Markdown
Contributor

@crosbymichael Could you attache a strace log?
strace -fo strace.log -s 256 nsinit checkpoint --image-path /tmp/checkpoint

@crosbymichael

Copy link
Copy Markdown
Contributor Author

@avagin

avagin commented May 21, 2015

Copy link
Copy Markdown
Contributor

@crosbymichael this log doesn't contain "criu failed:...". dump.log is saved in /var/run/nsinit/nsinit/criu.work/

'''
write(1023, "(00.235986) Dumping finished successfully\n", 42) = 42
'''

@crosbymichael

Copy link
Copy Markdown
Contributor Author

ok, it was a setuid issue with my binary

@crosbymichael

Copy link
Copy Markdown
Contributor Author

LGTM

@crosbymichael

Copy link
Copy Markdown
Contributor Author

thanks @avagin @boucher @SaiedKazemi

crosbymichael added a commit that referenced this pull request May 21, 2015
WIP: Add Checkpoint and Restore support to libcontainer
@crosbymichael crosbymichael merged commit 35d1c0e into master May 21, 2015
@crosbymichael crosbymichael deleted the criu branch May 21, 2015 20:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.