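This article describes how to use the Exit Code in a Pod's abnormal status information to locate a problem further.
Viewing Pod abnormal status information
Run the following command to view the status information of the abnormal Pod.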
kubectl describe pod <pod name>
The output looks like the following:
Containers:
  kubedns:
    Container ID:   docker://5fb8adf9ee62afc6d3f6f3d9590041818750b392dff015d7091eaaf99cf1c945
    Image:          ccr.ccs.tencentyun.com/library/kubedns-amd64:1.14.4
    Image ID:       docker-pullable://ccr.ccs.tencentyun.com/library/kubedns-amd64@sha256:40790881bbe9ef4ae4ff7fe8b892498eecb7fe6dcc22661402f271e03f7de344
    Ports:          10053/UDP, 10053/TCP, 10055/TCP
    Host Ports:     0/UDP, 0/TCP, 0/TCP
    Args:
      --domain=cluster.local.
      --dns-port=10053
      --config-dir=/kube-dns-config
      --v=2
    State:          Running
      Started:      Tue, 27 Aug 2019 10:58:49 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 27 Aug 2019 10:40:42 +0800
      Finished:     Tue, 27 Aug 2019 10:58:27 +0800
    Ready:          True
    Restart Count:  1
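The same Last State data is also available in machine-readable form: `kubectl get pod <pod name> -o json` reports it under `status.containerStatuses[].lastState.terminated`. The following is a minimal sketch of pulling the exit code and reason out of that JSON. The helper name `last_exit_codes` and the trimmed `SAMPLE` document are hypothetical and only mirror the output shown above.

```python
import json


def last_exit_codes(pod_json: str) -> dict:
    """Map container name -> (exit code, reason) of its last termination.

    Hypothetical helper: expects the JSON printed by
    `kubectl get pod <pod name> -o json`.
    """
    pod = json.loads(pod_json)
    result = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:  # containers that never restarted have no entry here
            result[status["name"]] = (
                terminated.get("exitCode"),
                terminated.get("reason"),
            )
    return result


# Hypothetical, heavily trimmed sample that mirrors the describe output above.
SAMPLE = json.dumps({
    "status": {
        "containerStatuses": [
            {
                "name": "kubedns",
                "restartCount": 1,
                "lastState": {
                    "terminated": {"exitCode": 255, "reason": "Error"}
                },
            }
        ]
    }
})

if __name__ == "__main__":
    print(last_exit_codes(SAMPLE))
```

In practice the JSON would come from the command's standard output rather than from a hard-coded sample.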
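In the container list of the output, the Exit Code under the Last State field is the status code with which the program last exited. A non-zero value means the program exited abnormally, and the code can be used to analyze the cause further.
Exit code description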
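Exit codes must be in the range 0 - 255.
0 indicates a normal exit.
If the program exits because of an external interruption, the exit code falls in the range 129 - 255. For example, kill -9 sends SIGKILL to the program, and Ctrl+C sends SIGINT.
Abnormal exits caused by the program itself usually have exit codes in the range 1 - 128. In some scenarios, a program may also choose to use exit codes in the range 129 - 255.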
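If the specified exit code is outside the range 0 - 255 (for example, exit(-1)), it is converted automatically, and the reported exit code still falls within 0 - 255. With the requested exit code denoted as code, the conversion is as follows:
For a negative exit code:
(256 - (|code| % 256)) % 256
For a positive exit code:
code % 256
The outer % 256 in the negative case matters only when |code| is a multiple of 256, where the result is 0. Both cases are equivalent to keeping the low 8 bits of code.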
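The conversion can be checked with a short sketch. The function names below are hypothetical. `normalize_by_formula` follows the two formulas, with a final modulo in the negative case so that multiples of 256 map to 0, and `normalize_exit_code` keeps only the low 8 bits, which should give the same result.

```python
def normalize_by_formula(code: int) -> int:
    """Apply the two conversion formulas to an arbitrary integer exit code."""
    if code < 0:
        # The final % 256 maps -256, -512, ... to 0 instead of 256.
        return (256 - (abs(code) % 256)) % 256
    return code % 256


def normalize_exit_code(code: int) -> int:
    """Equivalent shortcut: keep only the low 8 bits of the code."""
    return code & 0xFF


if __name__ == "__main__":
    for requested in (-1, 1, 255, 256, 300, -256):
        print(requested, "->", normalize_exit_code(requested))
```

For example, a requested code of -1 is reported as 255, and 256 is reported as 0.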
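Common abnormal exit codes
137: the program was killed by the SIGKILL signal. Possible causes:
The container in the Pod reached its memory resource limit (resources.limits), which is the usual cause. An example is running out of memory (OOM). Resource limits are enforced through Linux cgroups, so when a container's memory reaches the limit, the cgroup forcibly stops it (similar to kill -9). In this case, describe pod shows the Reason as OOMKilled.
The host itself ran out of resources (OOM), so the kernel chose some processes to stop in order to free memory.
Note:
Whether the process was stopped by a cgroup limit or because the node itself ran out of resources, a record can be found in the system log, as follows:
On Ubuntu, the system log is stored in /var/log/syslog. On CentOS, it is stored in /var/log/messages. On both systems, the log can also be viewed with the journalctl -k command.
The livenessProbe (liveness check) failed, causing kubelet to stop the container in the Pod.
The program was stopped by a malicious trojan process.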
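1 and 255: these usually indicate a general error, and the specific cause has to be located through the container logs. For example, the program may have been written to exit abnormally with exit(1) or exit(-1), and -1 is converted to 255 according to the rules above.
Linux standard signals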
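When a Linux program is interrupted from outside, it is sent a signal, and its exit code is the signal value plus 128. For example, the value of SIGKILL is 9, so the exit code is 9 + 128 = 137. The following table lists more standard signal values. Where three values are given, the first is usually valid for alpha and sparc, the middle one for x86, arm, and most other architectures, and the last one for mips.
Signal | Value | Action | Comment |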
SIGHUP | 1 | Term | Hangup detected on controlling terminal or death of controlling process |
SIGINT | 2 | Term | Interrupt from keyboard |
SIGQUIT | 3 | Core | Quit from keyboard |
SIGILL | 4 | Core | Illegal Instruction |
SIGABRT | 6 | Core | Abort signal from abort(3) |
SIGFPE | 8 | Core | Floating-point exception |
SIGKILL | 9 | Term | Kill signal |
SIGSEGV | 11 | Core | Invalid memory reference |
SIGPIPE | 13 | Term | Broken pipe: write to pipe with no readers; see pipe(7) |
SIGALRM | 14 | Term | Timer signal from alarm(2) |
SIGTERM | 15 | Term | Termination signal |
SIGUSR1 | 30,10,16 | Term | User-defined signal 1 |
SIGUSR2 | 31,12,17 | Term | User-defined signal 2 |
SIGCHLD | 20,17,18 | Ign | Child stopped or terminated |
SIGCONT | 19,18,25 | Cont | Continue if stopped |
SIGSTOP | 17,19,23 | Stop | Stop process |
SIGTSTP | 18,20,24 | Stop | Stop typed at terminal |
SIGTTIN | 21,21,26 | Stop | Terminal input for background process |
SIGTTOU | 22,22,27 | Stop | Terminal output for background process |
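The "signal value plus 128" rule can be inverted to recover the signal from an exit code. The sketch below assumes the architecture-independent values from the table above. The names `SIGNALS` and `signal_from_exit_code` are hypothetical.

```python
from typing import Optional

# Signal values copied from the table above (only the rows whose value
# is the same on every architecture).
SIGNALS = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 4: "SIGILL", 6: "SIGABRT",
    8: "SIGFPE", 9: "SIGKILL", 11: "SIGSEGV", 13: "SIGPIPE",
    14: "SIGALRM", 15: "SIGTERM",
}


def signal_from_exit_code(code: int) -> Optional[str]:
    """Return the signal implied by an exit code in 129 - 255, else None."""
    if 128 < code <= 255:
        number = code - 128
        # Fall back to a generic label for signals not listed in SIGNALS.
        return SIGNALS.get(number, f"signal {number}")
    return None


if __name__ == "__main__":
    for code in (137, 130, 143, 1):
        print(code, "->", signal_from_exit_code(code))
```

Because a program may also choose an exit code in this range itself, the result is a hint to confirm against the container logs rather than proof that a signal was delivered.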
C/C++ exit codes
/usr/include/sysexits.h standardizes exit codes (for C/C++ only), as shown in the following table:
Definition | Exit code | Description |
#define EX_OK | 0 | successful termination |
#define EX__BASE | 64 | base value for error messages |
#define EX_USAGE | 64 | command line usage error |
#define EX_DATAERR | 65 | data format error |
#define EX_NOINPUT | 66 | cannot open input |
#define EX_NOUSER | 67 | addressee unknown |
#define EX_NOHOST | 68 | host name unknown |
#define EX_UNAVAILABLE | 69 | service unavailable |
#define EX_SOFTWARE | 70 | internal software error |
#define EX_OSERR | 71 | system error (e.g., can't fork) |
#define EX_OSFILE | 72 | critical OS file missing |
#define EX_CANTCREAT | 73 | can't create (user) output file |
#define EX_IOERR | 74 | input/output error |
#define EX_TEMPFAIL | 75 | temp failure; user is invited to retry |
#define EX_PROTOCOL | 76 | remote error in protocol |
#define EX_NOPERM | 77 | permission denied |
#define EX_CONFIG | 78 | configuration error |
#define EX__MAX | 78 | maximum listed value |
Exit code reference
For the meanings of more exit codes, see the following table:
Exit code | Meaning | Example | Comments |
1 | Catchall for general errors | let "var1 = 1/0" | Miscellaneous errors, such as "divide by zero" and other impermissible operations |
2 | Misuse of shell builtins (according to Bash documentation) | empty_function() {} | Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison). |
126 | Command invoked cannot execute | /dev/null | Permission problem or command is not an executable |
127 | "command not found" | illegal_command | Possible problem with $PATH or a typo |
128 | Invalid argument to exit | exit 3.14159 | exit takes only integer args in the range 0 - 255 |
128+n | Fatal error signal "n" | kill -9 $PPID of script | $? returns 137 (128 + 9) |
130 | Script terminated by Control-C | Ctl-C | Control-C is fatal error signal 2, (130 = 128 + 2, see above) |
255* | Exit status out of range | exit -1 | exit takes only integer args in the range 0 - 255 |
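A few rows of the table can be reproduced locally. The sketch below assumes a POSIX-like system with `sh` on the PATH. The helper names and the command name `no_such_command_for_demo` are hypothetical. It runs each example as a child process and collects the exit code that the parent observes.

```python
import subprocess
import sys


def exit_code_of(argv: list) -> int:
    """Run a command quietly and return the exit code seen by the parent."""
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


def observed_codes() -> dict:
    """Reproduce a few rows of the reference table as child processes."""
    return {
        # Shell cannot find the command.
        "command not found": exit_code_of(["sh", "-c", "no_such_command_for_demo"]),
        # /dev/null exists but is not executable.
        "cannot execute": exit_code_of(["sh", "-c", "/dev/null"]),
        # Plain general error.
        "general error": exit_code_of(["sh", "-c", "exit 1"]),
        # Out-of-range code requested by a Python child process.
        "exit(-1)": exit_code_of([sys.executable, "-c", "import sys; sys.exit(-1)"]),
    }


if __name__ == "__main__":
    for label, code in observed_codes().items():
        print(f"{label}: {code}")
```

On a typical Linux host the four entries should come out as 127, 126, 1, and 255, matching the corresponding table rows.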