0. Preface: Why You Need "Power Tools" Rather Than Just "Everyday Commands"
Sometimes an API's 5xx responses spike, a sweep through the application layer turns up nothing obvious, and it looks more like intermittent network reachability jitter. When that happens I like to straighten the path out along a single "line": container network namespace → veth → host bridge (docker0/cni0) → kube-proxy/NAT → underlay (physical/tunnel) → target node / external network. Probe that line segment by segment and one of the points will expose the truth. Below is my complete operation log from that night, commands and outputs preserved, ready to grab and reuse.
$ kubectl get pods -o wide -n prod | head -5
NAME READY STATUS RESTARTS AGE IP NODE
api-67f6cc8974-2kmdg 1/1 Running 0 4d 10.244.1.23 node-a
api-67f6cc8974-lxw4n 1/1 Running 0 4d 10.244.2.57 node-b
redis-0 1/1 Running 0 4d 10.244.1.42 node-a
gateway-7f9d64b8d7-2xk6d 1/1 Running 0 4d 10.244.3.18 node-c
$ kubectl get svc,ep -n prod | egrep 'api|redis'
service/api ClusterIP 10.96.12.34 <none> 80/TCP
endpoints/api 10.244.1.23:8080,10.244.2.57:8080
service/redis ClusterIP 10.96.98.76 <none> 6379/TCP
endpoints/redis 10.244.1.42:6379
I'll pick three paths as samples:
1) api@node-a → redis@node-a (pod to pod, same node)
2) api@node-b → api@node-a (pod to pod, across nodes)
3) api@* → 10.96.12.34:80 (pod to Service VIP)
Container images are often slimmed down to the point where even ping is missing; if the image has no tools, fall back to a busybox debug container (sketched below) or enter the network namespace from the host with nsenter.
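One low-friction way to get tools without touching the image, assuming the cluster supports ephemeral containers (Kubernetes 1.23+) and can pull a busybox image, is kubectl debug; the ephemeral container shares the target pod's network namespace, so the eth0 and pod IP you see are exactly the ones under test:
$ kubectl -n prod debug -it api-67f6cc8974-2kmdg --image=busybox -- sh
/ # ip -br addr    # same netns as the api pod, so this shows 10.244.1.23 on eth0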
$ kubectl -n prod exec -it api-67f6cc8974-2kmdg -- sh
/ # uname -a
Linux api-67f6cc8974-2kmdg 5.15.0-1051-azure #59~20.04 SMP x86_64 GNU/Linux
If the container has no shell at all, use nsenter from the host:
$ PID=$(docker inspect -f '{{.State.Pid}}' api-67f6cc8974-2kmdg 2>/dev/null || crictl inspect --output go-template --template '{{.info.pid}}' <containerID>)
$ sudo nsenter -t "$PID" -n bash -lc 'ip -br a; ip route'
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 10.244.1.23/24 fe80::2c1f/64
default via 10.244.1.1 dev eth0
/ # ping -c2 10.244.1.42
PING 10.244.1.42 (10.244.1.42): 56 data bytes
64 bytes from 10.244.1.42: seq=0 ttl=64 time=0.374 ms
64 bytes from 10.244.1.42: seq=1 ttl=64 time=0.341 ms
--- 10.244.1.42 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 0.341/0.357/0.374/0.016 ms
/ # nc -zv 10.244.1.42 6379
10.244.1.42 (10.244.1.42:6379) open
/ # ping -c4 10.244.2.57
PING 10.244.2.57 (10.244.2.57): 56 data bytes
64 bytes from 10.244.2.57: seq=0 ttl=63 time=0.892 ms
64 bytes from 10.244.2.57: seq=1 ttl=63 time=1.104 ms
64 bytes from 10.244.2.57: seq=2 ttl=63 time=25.331 ms
64 bytes from 10.244.2.57: seq=3 ttl=63 time=120.442 ms
--- 10.244.2.57 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 0.892/36.942/120.442/51.012 ms
The latency swings wildly, which smells like tunnel or queue congestion. Verify with TCP next:
/ # curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' http://10.244.2.57:8080/health
200 0.123
/ # curl -sS -o /dev/null -w '%{http_code} %{remote_ip}\n' http://10.96.12.34/health
200 10.96.12.34
If you see intermittent 000s or timeouts here, the usual suspects are kube-proxy, endpoint churn, or SNAT.
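A single 200 doesn't rule out flakiness; a minimal sketch is to hammer the Service VIP in a loop and count the non-200 results (same /health path and VIP as above, plain busybox sh syntax):
/ # fail=0; i=0; while [ $i -lt 100 ]; do
>   code=$(curl -s -o /dev/null -m 2 -w '%{http_code}' http://10.96.12.34/health)
>   [ "$code" = "200" ] || fail=$((fail+1)); i=$((i+1))
> done; echo "non-200: $fail/100"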
/ # getent hosts api.prod.svc.cluster.local
10.96.12.34 api.prod.svc.cluster.local
/ # dig +short api.prod.svc.cluster.local @10.96.0.10
10.96.12.34
DNS isn't flapping. Keep moving to the next hop on the "line".
$ # on node-a, the host of Pod IP 10.244.1.23
$ ip -br addr | egrep 'cni0|flannel|vxlan|docker0'
cni0 UP 10.244.1.1/24
flannel.1 UP 10.244.1.0/32
$ ip route | grep 10.244.1.23
10.244.1.23 dev vethb42c3 scope link
$ ethtool -S vethb42c3 | egrep 'rx_|tx_' | head
rx_packets: 129284
tx_packets: 130112
rx_dropped: 0
tx_dropped: 12
tx_dropped is non-zero; make a note of it.
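A non-zero counter may be ancient history; what matters is whether it is still climbing. A quick sketch (same veth name as above) samples it twice:
$ ethtool -S vethb42c3 | grep tx_dropped; sleep 10; ethtool -S vethb42c3 | grep tx_dropped
If the second reading is higher, drops are happening right now; if it stays at 12, it was a one-off.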
$ bridge fdb show br cni0 | head -3
02:42:ed:7a:1b:2c dev vethb42c3 master cni0
12:2e:9a:15:00:aa dev veth9a7d1 master cni0
33:33:00:00:00:16 dev cni0 self permanent
$ ip neigh show dev cni0 | head -2
10.244.1.23 lladdr 02:42:ed:7a:1b:2c REACHABLE
10.244.1.42 lladdr 02:42:6a:aa:bb:cc STALE
$ kubectl -n kube-system get cm kube-proxy -o yaml | grep -i mode
mode: "ipvs"
In IPVS mode, inspect the virtual service and its backends:
$ sudo ipvsadm -Ln | egrep '10.96.12.34|8080' -A2
TCP 10.96.12.34:80 rr
-> 10.244.1.23:8080 Masq 1 0 0
-> 10.244.2.57:8080 Masq 1 0 0
In iptables mode, check the NAT table hit counters instead:
$ sudo iptables -t nat -vnL KUBE-SERVICES | head -3
pkts bytes target prot opt in out source destination
600 36000 KUBE-SVC-XXXX tcp -- * * 0.0.0.0/0 10.96.12.34 /* api */ tcp dpt:80
$ sudo conntrack -S
cpu=0 found=524288 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 id=0
$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 524288
When the conntrack table hits its limit, brand-new connections start timing out for no obvious reason.
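Before raising anything, check how close the live entry count is to the limit; both conntrack -C and the matching sysctl report it (a quick sketch):
$ sudo conntrack -C
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
If the count brushes the ceiling during peaks, temporarily raise the limit (and budget the extra kernel memory):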
$ sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
net.netfilter.nf_conntrack_max = 1048576
Different CNIs take different datapaths, but the command-level approach is the same.
$ ss -lunp | grep 8472
udp UNCONN 0 0 10.0.0.11:8472 0.0.0.0:* users:(("flanneld",pid=1320,fd=12))
$ sudo tcpdump -ni any udp port 8472 -c 4
14:01:21 VXLAN, flags [I] (0x08), vni 1
14:01:21 VXLAN, flags [I] (0x08), vni 1
14:01:22 VXLAN, flags [I] (0x08), vni 1
14:01:22 VXLAN, flags [I] (0x08), vni 1
Packets are flowing, so the tunnel itself is alive.
$ ss -tnp | grep ':179 '
ESTAB 0 0 10.0.0.11:60514 10.0.0.12:179 users:(("bird",pid=980,fd=18))
$ sudo calicoctl node status | sed -n '1,15p'
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+-------------+
| 10.0.0.12 | node-to-node mesh | up | 02:11:34 | Established |
+--------------+-------------------+-------+----------+-------------+
$ cilium status | head -6
KVStore: Ok Disabled
Kubernetes: Ok 1.28 (v1.28.3) [linux/amd64]
Cilium: Ok 1.14.5
$ sudo cilium monitor -t drop -n 5
xx drop (Policy denied) flow 0x0 to endpoint 234, identity 12345->56789
If you see policy denies here, stop blaming TCP; the traffic is being blocked by a network policy.
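A quick way to confirm is to list the policies that could select the pod; plain NetworkPolicy objects are shown below, and on Cilium you would also check the CiliumNetworkPolicy CRDs if they are installed (the policy name is a placeholder):
$ kubectl -n prod get networkpolicy
$ kubectl -n prod describe networkpolicy <name>    # check podSelector and ingress/egress rules
$ kubectl -n prod get ciliumnetworkpolicies 2>/dev/null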
Containers commonly run at MTU 1450 (VXLAN header overhead) while the physical NIC sits at 1500; if some segment of the path is smaller still (cloud load balancers, VPNs), fragmentation and the DF bit start causing trouble.
$ ip -br link | egrep 'eth0|cni0|flannel|vxlan'
eth0 UP mtu 1500 …
cni0 UP mtu 1450 …
flannel.1 UP mtu 1450 …
Test the path MTU from inside the Pod (DF set, no fragmentation):
/ # ping -M do -s 1472 8.8.8.8 -c2
PING 8.8.8.8 (8.8.8.8): 1472 data bytes
From 10.0.0.1 icmp_seq=1 Frag needed and DF set (mtu = 1450)
The path MTU is 1450, so the largest ICMP payload that fits with DF set is 1450 - 20 (IP) - 8 (ICMP) = 1422; retry with that size:
/ # ping -M do -s 1422 8.8.8.8 -c2
64 bytes from 8.8.8.8: seq=0 ttl=113 time=35.2 ms
64 bytes from 8.8.8.8: seq=1 ttl=113 time=35.1 ms
Conclusion: the effective path MTU is 1450. In production you either lower the NIC/tunnel/Pod MTUs consistently, pin the MTU on the runtime side, or do MSS clamping at the edge.
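Where the MTU knob lives depends on the runtime and CNI; a sketch for a plain Docker bridge setup is the daemon-wide "mtu" key (the standard bridge CNI plugin, for comparison, exposes an "mtu" field in its conflist under /etc/cni/net.d):
$ cat /etc/docker/daemon.json
{
  "mtu": 1450
}
$ sudo systemctl restart docker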
$ kubectl -n kube-system get cm kubelet-config -o yaml | grep -i hairpin
hairpinMode: "hairpin-veth"
If hairpin mode is off, the symptom is Pods intermittently failing when they hit a Service whose traffic gets load-balanced back to themselves. hairpinMode is a kubelet-level setting (hairpin-veth or promiscuous-bridge); after changing it, restart kubelet. Some CNIs instead need hairpinMode: true in their bridge plugin config.
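For CNIs that delegate to the standard bridge plugin, the switch lives in the conflist; a minimal sketch (the file name and surrounding fields are illustrative, keep whatever your CNI already writes there):
$ cat /etc/cni/net.d/10-mynet.conflist
{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "plugins": [
    {
      "type": "bridge", "bridge": "cni0",
      "hairpinMode": true, "isGateway": true, "ipMasq": true,
      "ipam": { "type": "host-local", "subnet": "10.244.1.0/24" }
    }
  ]
}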
$ sysctl net.ipv4.conf.all.rp_filter
net.ipv4.conf.all.rp_filter = 1
With multiple NICs or asymmetric routes in the container path, relax it to loose mode:
$ sudo sysctl -w net.ipv4.conf.all.rp_filter=2
net.ipv4.conf.all.rp_filter = 2
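sysctl -w does not survive a reboot; to persist it, a drop-in file is the usual route (a sketch, the file name is arbitrary):
$ echo 'net.ipv4.conf.all.rp_filter = 2' | sudo tee /etc/sysctl.d/90-rp-filter.conf
$ sudo sysctl --system
Then sanity-check the egress route and SNAT: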
$ ip route get 1.1.1.1
1.1.1.1 via 10.0.0.1 dev eth0 src 10.0.0.11
$ sudo iptables -t nat -vnL POSTROUTING | egrep 'MASQUERADE|KUBE-'
120K 7.2M MASQUERADE all -- * eth0 10.244.0.0/16 0.0.0.0/0
Missing SNAT means return packets take the wrong path; the symptom is SYNs going out with no SYN-ACK ever coming back.
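If the rule really is missing (normally the CNI or kube-proxy maintains it, so treat this as a stopgap sketch while you fix the component that should own it), the classic form is:
$ sudo iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -o cni0 -j MASQUERADE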
Packet capture should be like surgery: cut only at the key points and look only at the key fields.
Capture inside the Pod:
/ # tcpdump -ni eth0 tcp port 80 -c 4
14:22:01 IP 10.244.1.23.46322 > 10.96.12.34.80: Flags [S], seq 12345, win 64240
14:22:02 IP 10.244.1.23.46322 > 10.96.12.34.80: Flags [S], seq 12345, win 64240
14:22:04 IP 10.244.1.23.46322 > 10.96.12.34.80: Flags [S], seq 12345, win 64240
14:22:08 IP 10.244.1.23.46322 > 10.96.12.34.80: Flags [S], seq 12345, win 64240
Only SYNs, no SYN-ACK. Capture again on the host's cni0:
$ sudo tcpdump -ni cni0 host 10.244.1.23 and tcp port 80 -c 4
14:22:01 IP 10.244.1.23.46322 > 10.96.12.34.80: Flags [S]
14:22:01 IP 10.96.12.34.80 > 10.244.1.23.46322: Flags [S.], ack 12346
14:22:02 IP 10.244.1.23.46322 > 10.96.12.34.80: Flags [S]
14:22:02 IP 10.96.12.34.80 > 10.244.1.23.46322: Flags [S.], ack 12346
The host sees the SYN-ACK, but the Pod never receives the reply, so the problem sits at the namespace boundary (veth / hairpin / netfilter). Common causes are veth queue drops (tx_dropped climbing), hairpin mode disabled, or a netfilter/policy rule eating the return packet; the counters from ethtool -S vethX and iptables -vnL will pin down which one. Next, capture the VXLAN traffic:
$ sudo tcpdump -ni any udp port 8472 -c 6
14:25:10 IP 10.0.0.11.4789 > 10.0.0.12.8472: VXLAN, vni 1, encapsulated Ethernet
14:25:10 IP 10.0.0.12.8472 > 10.0.0.11.4789: VXLAN, vni 1, encapsulated Ethernet
...
Then capture on the physical interface:
$ sudo tcpdump -ni eth0 host 10.0.0.12 and udp port 8472 -vv -c 4
14:25:12 IP (tos 0x0, ttl 64) 10.0.0.11.4789 > 10.0.0.12.8472: UDP, length 1542
14:25:13 IP (tos 0x0, ttl 64) 10.0.0.11.4789 > 10.0.0.12.8472: UDP, length 1542
A length of 1542 exceeds 1500; if the path doesn't support jumbo frames you get fragmentation or drops, which matches the MTU symptoms above. Remedy: lower the tunnel interface, cni0, and Pod MTUs consistently (1450 or smaller), or enable MSS clamping at the edge:
$ sudo iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Capture port 53 inside the Pod:
/ # tcpdump -ni eth0 port 53 -c 6
14:30:11 IP 10.244.1.23.50524 > 10.96.0.10.53: 1234+ A? api.prod.svc.cluster.local.
14:30:11 IP 10.96.0.10.53 > 10.244.1.23.50524: 1234 1/0/0 A 10.96.12.34
...
On the CoreDNS side, check QPS and latency:
$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20 | sed -n '1,5p'
[INFO] 14:30:11.123 query: A api.prod.svc.cluster.local. (1 servers, time 0.5 ms)
If resolution is slow, odds are it's the upstream recursive DNS or a NetworkPolicy restricting UDP/53.
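To separate in-cluster from upstream latency, time the same kind of query against CoreDNS and against the upstream resolver directly (10.96.0.10 is the cluster DNS from earlier; example.com and 8.8.8.8 are just stand-ins for your real external name and upstream):
/ # time dig +short api.prod.svc.cluster.local @10.96.0.10
/ # time dig +short example.com @10.96.0.10    # external name via CoreDNS forwarding
/ # time dig +short example.com @8.8.8.8       # upstream directly, bypassing CoreDNS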
Keep this in your ops toolbox; it saves all the back-and-forth typing.
cat > knet.sh <<'EOF'
#!/usr/bin/env bash
# knet.sh POD [NAMESPACE]: locate the pod's node and host-side veth, then capture on it.
set -euo pipefail
POD="$1"; NS="${2:-default}"
NODE=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeName}')
IP=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.podIP}')
CID=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].containerID}' | sed -E 's/.*:\/\///')
ssh "$NODE" "sudo bash -lc '
  PID=\$(crictl inspect \"$CID\" 2>/dev/null | jq -r .info.pid)
  [ -n \"\$PID\" ] && [ \"\$PID\" != null ] || PID=\$(docker inspect -f {{.State.Pid}} \"$CID\")
  # Inside the pod netns, eth0 shows up as eth0@ifN; N is the ifindex of the
  # host-side veth peer, so match it against /sys/class/net/*/ifindex on the host.
  PEER=\$(nsenter -t \"\$PID\" -n ip link show eth0 | grep -o \"@if[0-9]*\" | tr -dc \"0-9\")
  VETH=\$(basename \$(dirname \$(grep -lw \"\$PEER\" /sys/class/net/*/ifindex)))
  echo NODE=$NODE POD=$POD IP=$IP VETH=\$VETH
  tcpdump -ni \"\$VETH\" -c 10 port 80 or port 53
'"
EOF
chmod +x knet.sh
Example run:
$ ./knet.sh api-67f6cc8974-2kmdg prod
NODE=node-a POD=api-67f6cc8974-2kmdg IP=10.244.1.23 VETH=vethb42c3
14:41:10 IP 10.244.1.23.50122 > 10.96.12.34.80: Flags [S]
14:41:10 IP 10.96.12.34.80 > 10.244.1.23.50122: Flags [S.]
14:41:11 IP 10.244.1.23.50524 > 10.96.0.10.53: 1234+ A? api.prod.svc.cluster.local.
14:41:11 IP 10.96.0.10.53 > 10.244.1.23.50524: 1234 1/0/0 A 10.96.12.34
...
The incident scene only happens once; keep all the evidence.
$ sudo tcpdump -ni cni0 host 10.244.1.23 -w /tmp/api-$$.pcap
tcpdump: listening on cni0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
42 packets captured
$ sudo iptables-save > /tmp/iptables-$(date +%F-%H%M).rules
$ sudo nft list ruleset > /tmp/nft-$(date +%F-%H%M).rules 2>/dev/null || true
$ sudo ss -s > /tmp/ss-$(date +%F-%H%M).txt
$ sudo conntrack -L | head -5 > /tmp/conntrack-$(date +%F-%H%M).txt
Quick-read cheat sheet:
1) ip route get shows a sane egress but the iptables MASQUERADE counter stays at 0 → missing SNAT.
2) cilium monitor -t drop fires, or an iptables -vnL hit counter spikes → network policy / firewall.
3) conntrack creeps toward its limit → raise nf_conntrack_max and shave the peak.
ping only tells you the target is breathing; curl proves it can talk; tcpdump lets you watch the blood circulate. Every hop of container networking can be verified; turn the black box into a transparent pipe and the problem stops being a mystery. Swap in your own IPs and interface names, walk the same path, and within ten minutes you'll know which segment is broken. The rest is just patching.
Let me make one thing clear up front: in daily life everyone calls me Bo-ge ("Brother Bo"). It has nothing to do with seniority; I'm just getting on in years, and it's only a nickname. I call my post-2000 junior colleagues "brother" too: Brother Zhang, Brother Li. But at offline events the title gets awkward when the sponsors use it. Last time, a group's planning director opened a big meeting with: "We're feeling great today! Let's welcome Bo-ge from the IT ops circle to say a few words." That atmosphere plus that nickname just doesn't line up in this industry. Every time it happens I want to answer: "Meeting you all is fate, thank you for the kindness, no more words needed, it's all in the wine. I'll down mine, you take your time!" So from now on let's go with Lao Yang ("Old Yang"): down-to-earth, low-key, and friendly enough. I think it works.