问题描述
Docker守护进程在异常断电后卡在activating状态,并且内存占用在无限增加。
[root@node01 docker]# service docker statusRedirecting to /bin/systemctl status docker.service● docker.service - Docker Application Container Engine Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/docker.service.d └─restart.conf Active: activating (start) since Tue 2021-12-07 15:44:32 CST; 1min 18s ago Docs: https:/docs.docker.com Main PID: 274797 (dockerd) Tasks: 481 Memory: 15.4G -- 占用不正常 CGroup: /system
[root@node01 docker]# service docker statusRedirecting to /bin/systemctl status docker.service● docker.service - Docker Application Container Engine Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/docker.service.d └─restart.conf Active: deactivating (stop-sigterm) Docs: https:/docs.docker.com Main PID: 274797 (dockerd) Tasks: 481 Memory: 247.3G -- 一直在疯狂增加 CGroup: /system.slice/docker.service └─274
定位过程
首先,查看docker版本和docker info信息
[root@node01files]# docker versionClient: Docker Engine - Community Version: 20.10.7 API version: 1.41 Go version: go1.13.15 Git commit: f0df350 Built: Wed Jun 2 11:58:10 2021 OS/Arch: linux/amd64 Context: default Experimental: trueServer: Docker Engine - Community Engine: Version: 20.10.7 API version: 1.41 (minimum version 1.12) Go version: go1.13.15 Git commit: b0f5bc3runc: Version: 1.0.0-rc95 GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7 docker-init: Version: 0.19.0 GitCommit: de40ad0
[root@node01 files]# docker infoClient: Context: default Debug Mode: false Plugins: app: Docker App (Docker Inc., v0.9.1-beta3) buildx: Build with BuildKit (Docker Inc., v0.5.1-docker) scan: Docker Scan (Docker Inc., v0.8.0)Server: Server Version: 20.10.7 Storage Driver: overlay2 Backing Filesystem: xfs Supports d_type: true Native Overlay Diff: true userxattr: false Logging Driver: json-file Cgroup Driver: systemd Cgroup Version: 1 Plugins: Volume: local Network: bridge host ipvlan macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc Default Runtime: runc Init Binary: docker-init containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7 init version: de40ad0 Security Options: seccomp Profile: default Kernel Version: 3.10.0-957.21.3.el7.x86_64 Operating System: CentOS 7.6 OSType: linux Architecture: x86_64 Total Memory: 256GiB Name: node01 Docker Root Dir: /var/lib/docker Debug Mode: false File Descriptors: 366 Goroutines: 290 EventsListeners: 0 Registry: https:/index.docker.io/v1/ Labels: Experimental: false Insecure Registries: 127.0.0.0/8 Live Restore Enabled: true
版本还算比较新,查看docker日志:
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.831370861+08:00" level=info msg="Starting up"Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950367668+08:00" level=info msg="parsed scheme: \"unix\"" module=grpcDec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950430356+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpcDec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950486773+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix://run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpcDec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950524941+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpcDec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004622322+08:00" level=info msg="parsed scheme: \"unix\"" module=grpcDec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004663918+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpcDec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004697382+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix://run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpcDec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004726533+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpcDec 07 14:58:14 node01 dockerd[239420]: time="2021-12-07T14:58:14.832862519+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"Dec 07 14:59:00 node01 dockerd[239420]: time="2021-12-07T14:59:00.039554832+08:00" level=info msg="Loading containers: start."Dec 07 15:02:27 node01 dockerd[239420]: fatal error: runtime: out of memoryDec 07 15:02:27 node01 dockerd[239420]: runtime stack:Dec 07 15:02:27 node01 dockerd[239420]: runtime.throw(0x5604d0deee9d, 0x16)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/panic.go:774 +0x74Dec 07 15:02:28 node01 dockerd[239420]: runtime.sysMap(0xfb6c000000, 0x5c000000, 0x5604d35dab58)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mem_linux.go:169 +0xc7Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).sysAlloc(0x5604d35be240, 0x5beee000, 0x7f6fd1fbab88, 0x5604cf22bf6b)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:701 +0x1cfDec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).grow(0x5604d35be240, 0x2df77, 0xffffffff)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1255 +0xa5Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).allocSpanLocked(0x5604d35be240, 0x2df77, 0x5604d35dab68, 0x5604cf212927)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1170 +0x268Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc_m(0x5604d35be240, 0x2df77, 0x7f6c01180100, 0x7f6c01184350)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1022 +0xc6Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc.func1()Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1093 +0x4eDec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc(0x5604d35be240, 0x2df77, 0x7f6fd1010100, 0x7f6c01184350)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1092 +0x8cDec 07 15:02:28 node01 dockerd[239420]: runtime.largeAlloc(0x5beee000, 0x5604cf250001, 0x7f6c01184350)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1138 +0x99Dec 07 15:02:28 node01 dockerd[239420]: runtime.mallocgc.func1()Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1033 +0x48Dec 07 15:02:28 node01 dockerd[239420]: runtime.systemstack(0x0)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/asm_amd64.s:370 +0x63Dec 07 15:02:28 node01 dockerd[239420]: runtime.mstart()Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/proc.go:1146Dec 07 15:02:28 node01 dockerd[239420]: goroutine 1 [running]:Dec 07 15:02:28 node01 dockerd[239420]: runtime.systemstack_switch()Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0008bff70 sp=0xc0008bff68 pc=0x5604cf2528e0Dec 07 15:02:28 node01 dockerd[239420]: runtime.mallocgc(0x5beee000, 0x5604d1e8fde0, 0x1, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1032 +0x8a7 fp=0xc0008c0010 sp=0xc0008bff70 pc=0x5604cf200ef7Dec 07 15:02:28 node01 dockerd[239420]: runtime.growslice(0x5604d1e8fde0, 0xef119e4000, 0x3107eaa, 0x3107eaa, 0x3107eab, 0x5604cf20146b, 0xfb6bbbfb80, 0x7f6c01184350)Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/slice.go:181 +0x1e4 fp=0xc0008c0078 sp=0xc0008c0010 pc=0x5604cf23b0b4Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).first(0xc0008c0288)Dec 07 15:02:28 node01 dockerd[239420]: docker/vendor/go.etcd.io/bbolt/cursor.go:182 +0x138 fp=0xc0008c00d8 sp=0xc0008c0078 pc=0x5604d012d348Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).next(0xc0008c0288, 0x0, 0x9f, 0x9f, 0x7f7150506429, 0x2b8, 0xc0008c0188, 0x5604cf23ee10)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/cursor.go:234 +0x84 fp=0xc0008c0128 sp=0xc0008c00d8 pc=0x5604d012d684Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).Next(0xc0008c0288, 0x7f715050638a, 0x9f, 0x9f, 0xfb6bbcf720, 0x9f, 0x2b8)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/cursor.go:75 +0x43 fp=0xc0008c0198 sp=0xc0008c0128 pc=0x5604d012ca43Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb.(*BoltDB).List.func1(0xc0034350a0, 0xc0008c0200, 0xc0034350a0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb/boltdb.go:288 +0x19a fp=0xc0008c02b8 sp=0xc0008c0198 pc=0x5604d0142a4aDec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*DB).View(0xc0001dc200, 0xc0008c03f8, 0x0, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/db.go:725 +0xaa fp=0xc0008c0340 sp=0xc0008c02b8 pc=0x5604d0130a3aDec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb.(*BoltDB).List(0xc00482f2c0, 0xc001a7af00, 0x5e, 0x0, 0x0, 0x0, 0x0, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb/boltdb.go:279 +0x1b3 fp=0xc0008c0430 sp=0xc0008c0340 pc=0x5604d0141583Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*cache).kmap(0xc003bf5900, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x5e, 0xc001a7ae40)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/cache.go:43 +0x286 fp=0xc0008c0560 sp=0xc0008c0430 pc=0x5604cfcc81a6Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*cache).list(0xc003bf5900, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x0, 0x0, 0x0, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/cache.go:164 +0x7e fp=0xc0008c0678 sp=0xc0008c0560 pc=0x5604cfcc988eDec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*datastore).List(0xc0079ff680, 0xc001a7ae40, 0x5e, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x0, 0x0, 0x0, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/datastore.go:517 +0x205 fp=0xc0008c0730 sp=0xc0008c0678 pc=0x5604cfccc745Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.(*network).getEndpointsFromStore(0xc00289a380, 0xc004a99258, 0x7, 0x5604d2206920, 0xc001a515f0, 0xc002f3ff48)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/store.go:190 +0x343 fp=0xc0008c09f8 sp=0xc0008c0730 pc=0x5604d01acaf3Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.(*controller).reservePools(0xc0004b7400)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/controller.go:977 +0x4c1 fp=0xc0008c0c28 sp=0xc0008c09f8 pc=0x5604d0159701Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.New(0xc001055d00, 0x9, 0x10, 0xc0007ac870, 0xc001387fb0, 0xc001055d00, 0x9)Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/controller.go:245 +0x615 fp=0xc0008c0dd8 sp=0xc0008c0c28 pc=0x5604d0154815Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.(*Daemon).initNetworkController(0xc00000c1e0, 0xc000207080, 0xc001387fb0, 0xc000236f50, 0xc00000c1e0, 0xc0007170c8, 0xc001387fb0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon_unix.go:855 +0xa9 fp=0xc0008c0e90 sp=0xc0008c0dd8 pc=0x5604d0a49429Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.(*Daemon).restore(0xc00000c1e0, 0xc00004e5c0, 0xc000220000)Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon.go:490 +0x50b fp=0xc0008c1088 sp=0xc0008c0e90 pc=0x5604d0a3898bDec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.NewDaemon(0x5604d21de4c0, 0xc00004e5c0, 0xc000207080, 0xc0007ac870, 0x0, 0x0, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon.go:1147 +0x2be8 fp=0xc0008c19c8 sp=0xc0008c1088 pc=0x5604d0a3c938Dec 07 15:02:28 node01 dockerd[239420]: main.(*DaemonCli).start(0xc0007ab410, 0xc000218600, 0x0, 0x0)Dec 07 15:02:28 node01 dockerd[239420]: /docker/cmd/dockerd/daemon.go:195 +0x743 fp=0xc0008c1d00 sp=0xc0008c19c8 pc=0x5604d0dbb3c3Dec 07 15:02:28 node01 dockerd[239420]: main.runDaemon(...)
日志中存在明显的异常堆栈打印,分析可知,启动过程中走到Loading containers: start.之后卡住,然后打印fatal error: runtime: out of memory,也就是内存爆了。根据堆栈信息,可以看出异常的调用路径如下:
NewDaemon —> restore() —> initNetworkController —> libnetwork.New() —> reservePools() —> getEndpointsFromStore() —> List —> cache.list(kvObject) —> cache.kmap(kvObject) —> List(keyPrefix) —> Next —> next —> first
拉取对应版本的代码,根据上述调用过程,找到指定的代码位置:
// List returns the range of keys starting with the passed in prefixfunc (b *BoltDB) List(keyPrefix string) ([]*store.KVPair, error) { var ( db *bolt.DB err error ) b.Lock() defer b.Unlock() kv := []*store.KVPair{} if db, err = b.getDBhandle(); err != nil { return nil, err } defer b.releaseDBhandle() err = db.View(func(tx *bolt.Tx) error { bucket := tx.Bucket(b.boltBucket) if bucket == nil { return store.ErrKeyNotFound } cursor := bucket.Cursor() prefix := []byte(keyPrefix) for key, v := cursor.Seek(prefix); bytes.HasPrefix(key, prefix); key, v = cursor.Next() { dbIndex := binary.LittleEndian.Uint64(v[:libkvmetadatalen]) v = v[libkvmetadatalen:] val := make([]byte, len(v)) copy(val, v) kv = append(kv, &store.KVPair{ Key: string(key), Value: val, LastIndex: dbIndex, }) } return nil }) if len(kv) == 0 { return nil, store.ErrKeyNotFound } return kv, err}
可以看出,这段是读取boltdb数据库(用于缓存网络配置),从之前了解看,这个数据库在异常断电时很容易损坏,所以怀疑是数据库损坏了,导致此处的遍历读取超出了预期的循环次数,而每次循环都会创建变量,分配内存,最终被内核OOM。
在docker社区查找相关issue[1] [2],发现确实存在boltdb数据库损坏的现象,不过最终报错的现象不太一样。最后,在docker社区也提了个issue[3],社区反馈也是怀疑boltdb数据库损坏,并建议可以把local-kv.db文件删除再重启来恢复。
个人觉得,上面报错的地方可以优化一下,对db文件做一次检查,如果检查到异常,提前抛异常,而不是不停地吃内存(由于异常的环境被破坏了,这个想法需要等复现后再考虑优化)。
解决方案
- 临时解决:删除local-kv.db文件再重启docker服务。
- 永久解决:优化异常代码。
参考资料
- https://github.com/moby/moby/issues/37238
- https://github.com/moby/moby/issues/42099
- https://github.com/moby/moby/issues/43072
版权声明:内容来源于互联网和用户投稿 如有侵权请联系删除