While troubleshooting an issue today, I found that a node in the cluster had been removed abnormally and was marked NotReady. All pods of the services running on that node went to Unknown or NodeLost status, and as a result some services could no longer run normally.
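To see the affected node and the pods scheduled on it, the usual kubectl checks look roughly like this (the node name is a placeholder; adjust to your cluster):

# kubectl get nodes
# kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>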
I first removed the faulty node from the cluster and got the services rescheduled normally. Then, checking with kubectl and /var/log/syslog (the Ubuntu counterpart of /var/log/messages), I found the error below being logged over and over. From the message we can infer that this compute node has an orphaned pod whose volume paths are still present on disk, which blocks the kubelet from garbage-collecting it:
Jan 21 16:45:44 localhost kubelet[1277]: E0121 16:45:44.079748 1277 kubelet_volumes.go:128] Orphaned pod "86d60ee9-9fae-11e8-8cfc-525400290b20" found, but volume paths are still present on disk. : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Jan 21 16:45:46 localhost kubelet[1277]: E0121 16:45:46.069180 1277 kubelet_volumes.go:128] Orphaned pod "86d60ee9-9fae-11e8-8cfc-525400290b20" found, but volume paths are still present on disk. : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Jan 21 16:45:48 localhost kubelet[1277]: E0121 16:45:48.077430 1277 kubelet_volumes.go:128] Orphaned pod "86d60ee9-9fae-11e8-8cfc-525400290b20" found, but volume paths are still present on disk. : There were a total of 1 errors similar to this. Turn up verbosity to see them.
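To watch these errors as they happen, you can follow the kubelet logs; a sketch, assuming the kubelet runs as a systemd unit named kubelet (the unit name may differ on your setup):

# journalctl -u kubelet -f | grep "Orphaned pod"
# grep "Orphaned pod" /var/log/syslog | tail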
Using the UID from the error, we can go into the kubelet pods directory and see that it still contains the container's data; the etc-hosts file even retains the pod name:
# cd /var/lib/kubelet/pods/86d60ee9-9fae-11e8-8cfc-525400290b20
/var/lib/kubelet/pods/86d60ee9-9fae-11e8-8cfc-525400290b20# ls
containers etc-hosts plugins volumes
/var/lib/kubelet/pods/86d60ee9-9fae-11e8-8cfc-525400290b20# cat etc-hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
172.16.1.180 omc-test-2509590746-mw56s
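If you suspect more than one pod is affected, a quick way to find candidates is to compare the UIDs present on disk against the UIDs the API server still knows about. A minimal sketch, assuming kubectl works from the node and the default /var/lib/kubelet/pods path:

#!/usr/bin/env bash
# Flag pod directories on this node whose UID the API server no longer knows about.
KUBELET_PODS_DIR=/var/lib/kubelet/pods

# UIDs of every pod the API server currently tracks, cluster-wide.
known_uids=$(kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}')

for dir in "$KUBELET_PODS_DIR"/*; do
    uid=$(basename "$dir")
    if ! grep -qx "$uid" <<<"$known_uids"; then
        echo "possibly orphaned: $dir"
    fi
done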
Searching around turned up the same problem upstream:
https://github.com/kubernetes/kubernetes/issues/60987
Using the pod name from the etc-hosts file, I first confirmed that no such instance was still running anywhere, then followed the hint in the issue and deleted the pod directory:
# rm -rf 86d60ee9-9fae-11e8-8cfc-525400290b20
Note, however, that this method carries some risk: it is not yet certain whether it can cause data loss. Only run it after you have confirmed it is safe, or look for a better fix in the issue above.
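One way to confirm is to check that nothing under the pod directory is still mounted before deleting it; removing files through a live mount could destroy data on the backing volume. A sketch using the UID from this case:

# mount | grep 86d60ee9-9fae-11e8-8cfc-525400290b20

If the command prints anything, umount those paths first, and only delete the directory once it prints nothing.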
Checking the logs again afterwards, syslog no longer keeps printing these errors.