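Containerd snapshotters fall into two groups: core snapshotters and non-core snapshotters.
According to containerd/docs/snapshotters[1], the core snapshotters include, among others, zfs[2] and devmapper[3].
The non-core snapshotters include fuse-overlayfs[4], nydus[5], overlaybd[6] and stargz[7].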
Containerd abstracts snapshots behind a plugin interface, so users can implement their own snapshot handling logic:
containerd/blob/main/core/snapshots/snapshotter.go[8]
The interface is defined as follows:
type Snapshotter interface {
	// Stat returns the info for an active or committed snapshot by name or
	// key.
	//
	// Should be used for parent resolution, existence checks and to discern
	// the kind of snapshot.
	Stat(ctx context.Context, key string) (Info, error)

	// Update updates the info for a snapshot.
	//
	// Only mutable properties of a snapshot may be updated.
	Update(ctx context.Context, info Info, fieldpaths ...string) (Info, error)

	// Usage returns the resource usage of an active or committed snapshot
	// excluding the usage of parent snapshots.
	//
	// The running time of this call for active snapshots is dependent on
	// implementation, but may be proportional to the size of the resource.
	// Callers should take this into consideration. Implementations should
	// attempt to honor context cancellation and avoid taking locks when making
	// the calculation.
	Usage(ctx context.Context, key string) (Usage, error)

	// Mounts returns the mounts for the active snapshot transaction identified
	// by key. Can be called on a read-write or readonly transaction. This is
	// available only for active snapshots.
	//
	// This can be used to recover mounts after calling View or Prepare.
	Mounts(ctx context.Context, key string) ([]mount.Mount, error)

	// Prepare creates an active snapshot identified by key descending from the
	// provided parent. The returned mounts can be used to mount the snapshot
	// to capture changes.
	//
	// If a parent is provided, after performing the mounts, the destination
	// will start with the content of the parent. The parent must be a
	// committed snapshot. Changes to the mounted destination will be captured
	// in relation to the parent. The default parent, "", is an empty
	// directory.
	//
	// The changes may be saved to a committed snapshot by calling Commit. When
	// one is done with the transaction, Remove should be called on the key.
	//
	// Multiple calls to Prepare or View with the same key should fail.
	Prepare(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)

	// View behaves identically to Prepare except the result may not be
	// committed back to the snapshot snapshotter. View returns a readonly view on
	// the parent, with the active snapshot being tracked by the given key.
	//
	// This method operates identically to Prepare, except the mounts returned
	// may have the readonly flag set. Any modifications to the underlying
	// filesystem will be ignored. Implementations may perform this in a more
	// efficient manner that differs from what would be attempted with
	// `Prepare`.
	//
	// Commit may not be called on the provided key and will return an error.
	// To collect the resources associated with key, Remove must be called with
	// key as the argument.
	View(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)

	// Commit captures the changes between key and its parent into a snapshot
	// identified by name. The name can then be used with the snapshotter's other
	// methods to create subsequent snapshots.
	//
	// A committed snapshot will be created under name with the parent of the
	// active snapshot.
	//
	// After commit, the snapshot identified by key is removed.
	Commit(ctx context.Context, name, key string, opts ...Opt) error

	// Remove the committed or active snapshot by the provided key.
	//
	// All resources associated with the key will be removed.
	//
	// If the snapshot is a parent of another snapshot, its children must be
	// removed before proceeding.
	Remove(ctx context.Context, key string) error

	// Walk will call the provided function for each snapshot in the
	// snapshotter which match the provided filters. If no filters are
	// given all items will be walked.
	// Filters:
	//  name
	//  parent
	//  kind (active,view,committed)
	//  labels.(label)
	Walk(ctx context.Context, fn WalkFunc, filters ...string) error

	// Close releases the internal resources.
	//
	// Close is expected to be called on the end of the lifecycle of the snapshotter,
	// but not mandatory.
	//
	// Close returns nil when it is already closed.
	Close() error
}
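To make the lifecycle described in the interface comments concrete, here is a minimal sketch of the Prepare / View / Commit / Remove rules. It is a toy in-memory model, not containerd code: the type toySnapshotter and its helpers are hypothetical, and mounts, contexts and options are left out on purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// Kind mirrors the three kinds named in the Walk filter comment.
type Kind string

const (
	KindActive    Kind = "active"
	KindView      Kind = "view"
	KindCommitted Kind = "committed"
)

type snapshot struct {
	kind   Kind
	parent string
}

// toySnapshotter is a hypothetical in-memory model of the state machine only.
type toySnapshotter struct {
	snaps map[string]snapshot
}

func newToySnapshotter() *toySnapshotter {
	return &toySnapshotter{snaps: map[string]snapshot{}}
}

// create backs both Prepare and View: the key must be new and the parent,
// if given, must already be committed.
func (s *toySnapshotter) create(kind Kind, key, parent string) error {
	if _, ok := s.snaps[key]; ok {
		return fmt.Errorf("snapshot %q already exists", key)
	}
	if parent != "" {
		if p, ok := s.snaps[parent]; !ok || p.kind != KindCommitted {
			return fmt.Errorf("parent %q is not a committed snapshot", parent)
		}
	}
	s.snaps[key] = snapshot{kind: kind, parent: parent}
	return nil
}

func (s *toySnapshotter) Prepare(key, parent string) error { return s.create(KindActive, key, parent) }
func (s *toySnapshotter) View(key, parent string) error    { return s.create(KindView, key, parent) }

// Commit turns the active snapshot key into the committed snapshot name
// and removes key, keeping the same parent.
func (s *toySnapshotter) Commit(name, key string) error {
	active, ok := s.snaps[key]
	if !ok || active.kind != KindActive {
		return fmt.Errorf("%q is not an active read-write snapshot", key)
	}
	if _, exists := s.snaps[name]; exists {
		return fmt.Errorf("snapshot %q already exists", name)
	}
	delete(s.snaps, key)
	s.snaps[name] = snapshot{kind: KindCommitted, parent: active.parent}
	return nil
}

// Remove refuses to delete a snapshot that still has children.
func (s *toySnapshotter) Remove(key string) error {
	if _, ok := s.snaps[key]; !ok {
		return errors.New("snapshot not found")
	}
	for child, sn := range s.snaps {
		if sn.parent == key {
			return fmt.Errorf("%q is still the parent of %q", key, child)
		}
	}
	delete(s.snaps, key)
	return nil
}

func main() {
	s := newToySnapshotter()
	_ = s.Prepare("layer1-rw", "")          // unpack a layer into an active snapshot
	_ = s.Commit("layer1", "layer1-rw")     // freeze it as a committed snapshot
	_ = s.Prepare("container-rw", "layer1") // writable rootfs for a container
	fmt.Println(s.Remove("layer1"))         // fails: a child still depends on it
}
```

A real snapshotter additionally returns the mounts for each active snapshot; the sketch only tracks which transitions are allowed.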
Project: containerd/stargz-snapshotter[9]
(1) Image format eStargz[11]: an image format that supports lazy loading.
(2) eStargz supports workload-based optimization[10].
# Back up the existing containerd config file, then write a new one
mv /etc/containerd/config.toml /etc/containerd/config.toml.bak
tee /etc/containerd/config.toml <<-EOF
version = 2
# Enable stargz snapshotter for CRI
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "stargz"
disable_snapshot_annotations = false
# Plug stargz snapshotter into containerd
[proxy_plugins]
[proxy_plugins.stargz]
type = "snapshot"
address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
EOF
# Download the stargz-snapshotter binaries
wget https://github.com/containerd/stargz-snapshotter/releases/download/v0.14.3/stargz-snapshotter-v0.14.3-linux-amd64.tar.gz
tar -C /usr/local/bin -xvf stargz-snapshotter-v0.14.3-linux-amd64.tar.gz
# Download the systemd unit file and start the service
wget -O /etc/systemd/system/stargz-snapshotter.service https://raw.githubusercontent.com/containerd/stargz-snapshotter/main/script/config/etc/systemd/system/stargz-snapshotter.service
systemctl enable --now stargz-snapshotter
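# Restart containerd so the configuration takes effect (loads the stargz plugin)
systemctl restart containerd
# Pull the image (an already optimized eStargz image)
ctr-remote image rpull --plain-http ghcr.io/stargz-containers/alpine:3.15.3-esgz
# Start a container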
ctr-remote run --rm -t --snapshotter=stargz ghcr.io/stargz-containers/alpine:3.15.3-esgz test echo hello
# Pull the source image that will be optimized
ctr-remote image rpull --plain-http ghcr.io/stargz-containers/golang:1.15.3-buster-org
# Optimize based on the entrypoint
ctr-remote image optimize --oci \
--entrypoint='[ "/bin/bash", "-c" ]' --args='[ "go version" ]' \
ghcr.io/stargz-containers/golang:1.15.3-buster-org \
localhost:5000/golang:1.15.3-esgz-go-abin
ctr-remote run --rm -t --snapshotter=stargz localhost:5000/golang:1.15.3-esgz-go-abin test

# Optimize based on a runtime workload that bind-mounts a local file
tee /tmp/hello.go <<- EOF
package main
import "fmt"
func main() {
fmt.Println("hello world")
}
EOF
ctr-remote image optimize --oci \
--mount=type=bind,source=/tmp/hello.go,destination=/hello.go,options=bind:ro \
--entrypoint='[ "/bin/bash", "-c" ]' --args='[ "go build -o /hello /hello.go && /hello" ]' \
ghcr.io/stargz-containers/golang:1.15.3-buster-org \
localhost:5000/golang:1.15.3-esgz-abin-hello-world
ctr-remote run --rm -t --snapshotter=stargz localhost:5000/golang:1.15.3-esgz-abin-hello-world test

Other related options:
| Option | Description |
|---|---|
| --entrypoint | Entrypoint of the container (in JSON array) |
| --args | Arguments for the entrypoint (in JSON array) |
| --env | Environment variables in the container |
| --user | User name to run the process in the container |
| --cwd | Working directory |
| --period | Time in seconds to profile file accesses |
| -t or --terminal | Attach terminal to the container. This flag must be specified with -i |
| -i | Attach stdin to the container |
Optimization with CNI networking enabled:
ctr-remote image optimize --oci \
--cni \
--entrypoint='[ "/bin/bash", "-c" ]' \
--args='[ "curl example.com" ]' \
ghcr.io/stargz-containers/golang:1.15.3-buster-org \
registry2:5000/golang:1.15.3-esgz-curl
(1) Chunk-level prefetch (of the so-called prioritized files)
(2) Runtime performance optimization

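eStargz encodes prefetch information (the prioritized files) as the order of file entries, delimited by landmark entries.
File entries fall into two classes:
(1) files prioritized for prefetch
(2) all other files (not prioritized)
If the archive contains no prioritized files, it must carry the no-prefetch landmark.
If the archive contains one or more prioritized files, the two classes form two separate regions, and the prefetch landmark is placed at the boundary between them.
A landmark is a regular file whose content is the 4-bit value 0xf, and it must be recorded in the TOC metadata (as a TOCEntry). The prefetch landmark is named .prefetch.landmark and the no-prefetch landmark is named .no.prefetch.landmark.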
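An image's workload is the runtime configuration defined in its Dockerfile, including the entrypoint command, environment variables and user.
Stargz snapshotter provides the image conversion command ctr-remote images optimize for creating optimized eStargz images. It runs the specified workload in a sandbox and records every file access; the accessed files become the prioritized files. The eStargz archive is then generated as follows:
(1) Place the prioritized files at the head of the archive, sorted by access order.
(2) Place the prefetch landmark file right after them.
(3) Place all other (non-prioritized) files after the prefetch landmark.
Before running a container, stargz snapshotter prefetches and pre-caches the range that holds the prioritized files (everything before the prefetch landmark).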
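The image is mounted in a new namespace, and a fanotifier[12] process is started to record file accesses.
Linux file notification events: when a user-space process operates on a directory or file, it issues a system call, and the kernel's notification subsystem reports that operation to a listening user-space process (the listener).
dnotify: introduced in kernel 2.4 in 2001. It can only watch directories and uses the signal[13] mechanism to notify the listener, so it can carry very little information.
inotify: debuted in kernel 2.6.13 in 2005. Besides directories, it can also watch events on regular files. It drops the signal mechanism and delivers event information to the listener through an event queue.
fanotify: introduced in kernel 2.6.36. It removes the notify-only limitation by letting the listener intervene in and change the outcome of file events, moving from passive notification to active access control.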
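The recording code follows these steps:
(1) Create the fanotifier routine (using the kernel API).
(2) Communicate with the fanotifier routine over a pipe: start recording and fetch the recorded files.
(3) Use the mount namespace created by the fanotifier.
(4) Create the container and the task.
(5) Start the fanotifier routine.
(6) Start the routine that collects the fanotifier data.
(7) Run the task (containerd API).
(8) Collect the recorded files.
fanotify[14] does not support reporting which byte range of a file was accessed.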

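Soci-snapshotter is not currently part of the containerd project.
Project: awslabs/soci-snapshotter[15]
(1) Implements lazy loading for standard OCI images with no image conversion step (conversion is seen as a problem because it causes signature issues).
Soci-snapshotter builds an index artifact (the SOCI index) that is stored in the registry, and looks up the SOCI index in the registry through the mechanism defined by the OCI Reference Types working group[16].
(2) Background prefetching.
SOCI relies on the registry's referrers feature to work with indices and manifests, but most existing registries do not support it. The ORAS project[17] lets users push OCI Artifacts to a registry and pull them back.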
# Remove the locally running registry
docker rm -f registry
# Start a registry based on the ORAS project
docker run -d -p 5000:5000 --restart=always --name registry ghcr.io/oras-project/registry:v1.0.0-rc
# Pull the image
ctr i pull docker.io/library/rabbitmq:latest
# Tag it
ctr i tag docker.io/library/rabbitmq:latest localhost:5000/rabbitmq:latest
# Push it to the local registry
ctr i push --platform linux/amd64 --plain-http localhost:5000/rabbitmq:latest
curl http://localhost:5000/v2/_catalog
curl http://localhost:5000/v2/rabbitmq/tags/list
# Show the manifest details
curl http://localhost:5000/v2/rabbitmq/manifests/latest
# Fetch and build soci-snapshotter
git clone https://github.com/awslabs/soci-snapshotter.git
cd soci-snapshotter
make
# Create the SOCI index: this produces ztoc files (one per image layer) and a manifest (which links the ztocs to the image layers)
./out/soci create --oras localhost:5000/rabbitmq:latest
# Show information about a ztoc
./out/soci ztoc info sha256:6e956974b9a4c9b1f5ae93e4be741e5a5c6cb55bf3181e68e89fa60ef8884c28
# List the index manifests
./out/soci index list
# Show information about an index manifest
./out/soci index info sha256:ee961dc6c0aea50ac4975918070e55ed7239aaa9e9c4d68191be2745c6a70a24 | jq
# Push the SOCI manifest to the registry
./out/soci push --plain-http localhost:5000/rabbitmq:latest
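# Configure the snapshotter plugin for containerd.
# Only the [proxy_plugins.soci] sub-table is appended, so an existing
# [proxy_plugins] table in the file is not defined a second time.
tee -a /etc/containerd/config.toml <<- EOF
[proxy_plugins.soci]
type = "snapshot"
address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
EOF
sudo systemctl restart containerd
ctr plugin ls | grep snapshotter | grep soci
# Start the SOCI snapshotter in the background and log to a file
./out/soci-snapshotter-grpc &> ./soci-snapshotter-logs &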
# Pull the image through the SOCI snapshotter
./out/soci image rpull --plain-http --soci-index-digest sha256:ee961dc6c0aea50ac4975918070e55ed7239aaa9e9c4d68191be2745c6a70a24 localhost:5000/rabbitmq:latest
mount | grep soci
# Start a container
ctr run --snapshotter soci --net-host localhost:5000/rabbitmq:latest soci
# List containers, then delete the test container
ctr c ls
ctr c rm soci
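A ztoc file records the metadata of one layer (where each file is located within the image layer).

The manifest records the metadata of the image (each digest is the digest of the corresponding generated ztoc).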
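Why per-file positions are enough for lazy loading can be shown with a little arithmetic. The sketch below is a simplified model, not the SOCI ztoc format: toyZtoc, fileEntry and spansFor are hypothetical names, and it assumes the layer is divided into spans that each cover a fixed number of uncompressed bytes. Given a file's offset and size, only the spans that overlap it need to be fetched.

```go
package main

import (
	"errors"
	"fmt"
)

// fileEntry is a hypothetical, reduced TOC entry.
type fileEntry struct {
	Offset int64 // where the file's data starts in the uncompressed layer
	Size   int64 // length of the file's data in bytes
}

// toyZtoc is a simplified stand-in for a per-layer table of contents.
type toyZtoc struct {
	SpanSize int64 // assumption: every span covers this many uncompressed bytes
	Files    map[string]fileEntry
}

var errNoData = errors.New("file is unknown or has no data to fetch")

// spansFor returns the first and last span index (inclusive) that must be
// fetched to read the whole file.
func (z toyZtoc) spansFor(name string) (first, last int64, err error) {
	f, ok := z.Files[name]
	if !ok || f.Size <= 0 || z.SpanSize <= 0 {
		return 0, 0, errNoData
	}
	first = f.Offset / z.SpanSize
	last = (f.Offset + f.Size - 1) / z.SpanSize // index of the span holding the last byte
	return first, last, nil
}

func main() {
	z := toyZtoc{
		SpanSize: 4 << 20, // 4 MiB, an illustrative value
		Files: map[string]fileEntry{
			"usr/lib/big.so": {Offset: 6 << 20, Size: 5 << 20},
		},
	}
	first, last, err := z.spansFor("usr/lib/big.so")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("fetch spans %d through %d\n", first, last)
}
```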

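SOCI aims to make prefetching more flexible.
Prefetching is usually tied to a workload rather than to an image or a base layer. For example, a Python3 base layer may be shared by many applications, yet workload-based optimization cannot share that base layer because every application has a different startup order.
(1) Registry storage usage therefore grows sharply and the cache hit rate drops.
(2) Whenever the base layer changes, every application image has to be re-optimized.
In addition, some workloads need prefetching at sub-file granularity. A machine-learning workload, for example, may read only a small header from each of many large files right after startup.
SOCI's answer is a separate load order document (LOD) that specifies which files or file segments to load. An image can have several load order documents, and at container start an administrator-defined policy can select which one to use.
Load-order-based optimization is not yet implemented in SOCI. (More than three years after this article was first written, a recent check shows it is still not implemented.)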
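The relevant code paths are:
(1) Building the ztocs.
(2) Building the manifest.
(3) Pulling an image: build the mapping from image layers to ztocs, then process and cache the other image layers in parallel.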
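Project: containerd/accelerated-container-image[18]
(1) Images are exported as block devices through TCMU[19]. TCM is another name for LIO, the kernel's iSCSI target. It originally ran entirely in the kernel; TCMU (TCM in Userspace) forwards I/O to user space, where a daemon performs the backend reads and writes. The architecture resembles FUSE, only at a different layer (block device rather than filesystem).
(2) Defines its own image format, so images must be converted.
Install overlaybd-snapshotter: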
git clone https://github.com/containerd/accelerated-container-image.git
cd accelerated-container-image
make
sudo make install
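# Create the configuration file (make sure its directory exists first)
sudo mkdir -p /etc/overlaybd-snapshotter
sudo tee /etc/overlaybd-snapshotter/config.json <<- EOF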
{
"root": "/var/lib/containerd/io.containerd.snapshotter.v1.overlaybd",
"address": "/run/overlaybd-snapshotter/overlaybd.sock",
"verbose": "info",
"rwMode": "overlayfs",
"logReportCaller": false,
"autoRemoveDev": false
}
EOF
# Start the overlaybd-snapshotter service
sudo systemctl enable /opt/overlaybd/snapshotter/overlaybd-snapshotter.service
sudo systemctl start overlaybd-snapshotter
Install overlaybd-tcmu:
sudo apt install -y libcurl4-openssl-dev libssl-dev libaio-dev libnl-3-dev libnl-genl-3-dev libgflags-dev
# Install zstd
git clone https://github.com/facebook/zstd.git
cd zstd
make -j 8
sudo make install
cd ..
# Install cmake (version 3.15 or later is required)
wget https://github.com/Kitware/CMake/releases/download/v3.25.1/cmake-3.25.1.tar.gz
tar -zxvf cmake-3.25.1.tar.gz
cd cmake-3.25.1
./bootstrap
make -j 8
sudo make install
cd ..
# Load the target_core_user kernel module
sudo modprobe target_core_user
lsmod | grep target_core_user
git clone https://github.com/containerd/overlaybd.git
cd overlaybd
git submodule update --init
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 8
sudo make install
# Start the overlaybd-tcmu service
sudo systemctl enable /opt/overlaybd/overlaybd-tcmu.service
sudo systemctl start overlaybd-tcmu
# Register the overlaybd proxy plugin with containerd
sudo tee -a /etc/containerd/config.toml <<- EOF
[proxy_plugins.overlaybd]
type = "snapshot"
address = "/run/overlaybd-snapshotter/overlaybd.sock"
EOF
sudo systemctl restart containerd
sudo ctr plugins ls | grep snapshotter
sudo nerdctl run --net host -it --rm --snapshotter=overlaybd registry.hub.docker.com/overlaybd/redis:6.2.1_obd
sudo nerdctl pull golang:1.15.3
sudo /opt/overlaybd/snapshotter/ctr obdconv docker.io/library/golang:1.15.3 localhost:5000/golang:1.15.3-obd
sudo nerdctl push localhost:5000/golang:1.15.3-obd
# Start a container
sudo nerdctl run --net host -it --rm --snapshotter=overlaybd localhost:5000/golang:1.15.3-obd
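We had planned to integrate eBPF into an NRI plugin to capture the files a workload accesses at run time and use them as prefetch data, but our focus shifted and the work was not continued. See
containerd/nydus-snapshotter/pull/506[20]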
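Cloud vendors have built different snapshotters on top of the containerd snapshot plugin interface. The underlying techniques differ, but the goal is the same: lazy loading of images. On that foundation they have explored further ways to cut image download time, such as avoiding image conversion and identifying precisely which image files a workload needs.
Slow container startup caused by long image pulls is a real industry-wide problem, especially for AI workloads, where ever-larger images place heavy demands on storage and network bandwidth as well as on the node's own resources and performance.
Lazy loading also has an inherent drawback: a cache miss significantly increases I/O latency. A container can be started within seconds, but whether the application inside can then access remote storage as if it were local files is another matter. The common approach is to start the container quickly and download the image files asynchronously; the end result (when different containers use the same image) is similar to plain overlay without lazy loading.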
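The "start fast, download asynchronously" approach can be sketched as a read-through cache with a background prefetcher. This is a toy model under stated assumptions, not code from any of the snapshotters above: lazyBlob and its methods are hypothetical, the fetch callback stands in for a ranged request to the registry, and the lock is held during the fetch purely to keep the sketch short.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyBlob is a hypothetical read-through cache over a remote blob that is
// split into numbered chunks.
type lazyBlob struct {
	mu      sync.Mutex
	cache   map[int][]byte
	fetch   func(chunk int) []byte // stand-in for a ranged registry request
	fetches int                    // number of remote round trips so far
}

func newLazyBlob(fetch func(chunk int) []byte) *lazyBlob {
	return &lazyBlob{cache: map[int][]byte{}, fetch: fetch}
}

// read serves a chunk from the local cache; on a miss it pays the remote
// round trip, which is where the extra I/O latency comes from.
func (b *lazyBlob) read(chunk int) []byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	if data, ok := b.cache[chunk]; ok {
		return data
	}
	b.fetches++
	data := b.fetch(chunk)
	b.cache[chunk] = data
	return data
}

// prefetchAll downloads chunks 0..n-1 in the background so that later reads
// become cache hits; the caller can wait on the returned WaitGroup.
func (b *lazyBlob) prefetchAll(n int) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			b.read(i)
		}
	}()
	return &wg
}

func main() {
	blob := newLazyBlob(func(chunk int) []byte {
		return []byte(fmt.Sprintf("chunk-%d", chunk))
	})
	done := blob.prefetchAll(4)      // the container is already running at this point
	fmt.Printf("%s\n", blob.read(2)) // served from cache or fetched on demand
	done.Wait()
	fmt.Println("remote fetches:", blob.fetches)
}
```

Once the background download finishes, every read is local, which matches the observation that the steady state resembles plain overlay.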
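Lazy loading therefore solves the problem only in specific scenarios, and whether it is worthwhile depends heavily on the workload. Factors to weigh include the share of a task's end-to-end (E2E) time taken by container startup (including image pull time, which with lazy loading covers only metadata) and the total duration of the task itself.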
[1] containerd/containerd/docs/snapshotters: https://github.com/containerd/containerd/tree/main/docs/snapshotters
[2] zfs: https://github.com/containerd/zfs
[3] devmapper: https://github.com/containerd/containerd/blob/main/docs/snapshotters/devmapper.md
[4] fuse-overlayfs: https://github.com/containerd/fuse-overlayfs-snapshotter
[5] nydus: https://github.com/containerd/nydus-snapshotter
[6] overlaybd: https://github.com/containerd/accelerated-container-image
[7] stargz: https://github.com/containerd/stargz-snapshotter
[8] containerd/containerd/blob/main/core/snapshots/snapshotter.go: https://github.com/containerd/containerd/blob/main/core/snapshots/snapshotter.go
[9] containerd/stargz-snapshotter: https://github.com/containerd/stargz-snapshotter
[10] Workload-based optimization: https://github.com/containerd/stargz-snapshotter/blob/main/docs/ctr-remote.md#optimizing-an-image
[11] eStargz: https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md
[12] fanotifier: https://zhuanlan.zhihu.com/p/186027813
[13] signal: https://zhuanlan.zhihu.com/p/77598393
[14] fanotify: https://lwn.net/Articles/339399/
[15] awslabs/soci-snapshotter: https://github.com/awslabs/soci-snapshotter
[16] OCI Reference Types working group: https://github.com/opencontainers/wg-reference-types
[17] ORAS project: https://oras.land/
[18] containerd/accelerated-container-image: https://github.com/containerd/accelerated-container-image
[19] TCMU: https://www.kernel.org/doc/Documentation/target/tcmu-design.txt
[20] containerd/nydus-snapshotter/pull/506: https://github.com/containerd/nydus-snapshotter/pull/506