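I have been patching a lot of vulnerabilities lately. Yesterday I was told that several physical machines suddenly needed a kernel upgrade. The reason given: on Red Hat 7.4, kernels between 3.10.0-693.el7 and 3.10.0-693.5.2.el7 may hang the host because of the NFSv4 client. On receiving this notice I started investigating, and after tracing through the Red Hat website I finally found the relevant evidence. Let me walk you through it.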
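Before looking at the evidence, it can help to confirm whether a given host is in the range quoted above. The following is a minimal sketch, not a Red Hat tool: the function name `is_affected_kernel` is made up for illustration, and it assumes both bounds of the range are inclusive, which the notice does not state explicitly.

```shell
# Return 0 if the kernel release string lies between 3.10.0-693.el7 and
# 3.10.0-693.5.2.el7. Treating both bounds as inclusive is an assumption.
is_affected_kernel() {
    local release="$1" build minor patch
    case "$release" in
        3.10.0-693.el7*) return 0 ;;     # first kernel of the 693 series
        3.10.0-693.*.el7*) ;;            # later 693 build, inspected below
        *) return 1 ;;                   # any other kernel series
    esac
    # Isolate the part between "693." and ".el7", e.g. "1.1" or "5.2".
    build="${release#3.10.0-693.}"
    build="${build%%.el7*}"
    minor="${build%%.*}"
    case "$build" in
        *.*) patch="${build#*.}"; patch="${patch%%.*}" ;;
        *)   patch=0 ;;
    esac
    # Refuse anything that is not purely numeric.
    case "$minor$patch" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$minor" -lt 5 ] || { [ "$minor" -eq 5 ] && [ "$patch" -le 2 ]; }
}

# Check the running kernel.
if is_affected_kernel "$(uname -r)"; then
    echo "kernel $(uname -r) is inside the quoted range"
else
    echo "kernel $(uname -r) is outside the quoted range"
fi
```

The build number is parsed by hand rather than passed to a generic version sort, because the `.el7.x86_64` suffix makes generic version ordering unreliable for these strings.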
[17596.853096] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [10.1.1.xx-ma:11637]
[17596.853853] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache vmw_vsock_vmci_transport vsock sb_edac edac_core coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon joydev pcspkr sg parport_pc parport shpchp vmw_vmci i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi vmwgfx drm_kms_helper sd_mod syscopyarea crc_t10dif sysfillrect crct10dif_generic sysimgblt fb_sys_fops ttm drm crct10dif_pclmul ata_piix crct10dif_common crc32c_intel libata serio_raw vmxnet3 vmw_pvscsi i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
[17596.853900] CPU: 1 PID: 11637 Comm: 172.32.xx.xx-ma Tainted: G L ------------ 3.10.0-693.1.1.el7.x86_64 #1
[17596.853901] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[17596.853903] task: ffff8804242f5ee0 ti: ffff8802cc220000 task.ti: ffff8802cc220000
[17596.853904] RIP: 0010:[<ffffffffc058489a>] [<ffffffffc058489a>] nfs_reap_expired_delegations+0x9a/0x220 [nfsv4]
[17596.853921] RSP: 0018:ffff8802cc223df8 EFLAGS: 00000206
[17596.853922] RAX: 0000000000000004 RBX: ffff88041ce0d000 RCX: 0000000000000003
[17596.853923] RDX: 0000000000000000 RSI: ffff8800b769d848 RDI: ffff8800bb556000
[17596.853924] RBP: ffff8802cc223e58 R08: ffff88041be93540 R09: 0000000000000000
[17596.853925] R10: 0000000000000000 R11: 7fffffffffffffff R12: ffff88041ce0d000
[17596.853926] R13: ffffffffc0584a6d R14: ffff8802cc223d78 R15: ffff8800b769d7c0
[17596.853927] FS: 0000000000000000(0000) GS:ffff88043fc40000(0000) knlGS:0000000000000000
[17596.853928] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17596.853929] CR2: 00007fd8449a7000 CR3: 00000000019f2000 CR4: 00000000000407e0
[17596.853932] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17596.853934] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17596.853935] Stack:
[17596.853936] ffffffffc059d3c0 ffff88041be93540 ffff88041329a000 0000000000000000
[17596.853937] 04cdd20102072112 0000000400000000 00000000f389a2ad ffff88042c49a400
[17596.853939] ffff88042c49a400 ffff88042c49a4c8 ffff88042c49a530 0000000000000000
[17596.853940] Call Trace:
[17596.853949] [<ffffffffc0580c22>] nfs4_state_manager+0x5f2/0x8c0 [nfsv4]
[17596.853955] [<ffffffffc0580ef0>] ? nfs4_state_manager+0x8c0/0x8c0 [nfsv4]
[17596.853961] [<ffffffffc0580f0f>] nfs4_run_state_manager+0x1f/0x40 [nfsv4]
[17596.853964] [<ffffffff810b098f>] kthread+0xcf/0xe0
[17596.853966] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[17596.853970] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[17596.853972] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[17596.853972] Code: 24 10 4c 8b 7c 24 10 49 39 df 75 1b e9 e8 00 00 00 49 8b 07 48 89 44 24 10 4c 8b 7c 24 10 49 39 df 0f 84 d2 00 00 00 49 8b 47 48 <a8> 10 75 e2 49 8b 47 48 a8 40 74 da 49 8b be 70 03 00 00 e8 8e
[949664.745423] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [192.168.0.xx-m:17637]
[949664.745429] Modules linked in: binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev crc32_pclmul sg ghash_clmulni_intel virtio_balloon joydev virtio_rng aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc i2c_piix4 parport pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_net virtio_console virtio_scsi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata i2c_core crct10dif_pclmul
[949664.745491] crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy virtio dm_mirror dm_region_hash dm_log dm_mod
[949664.745501] CPU: 5 PID: 17637 Comm: 192.168.0.xx-m Tainted: G L ------------ 3.10.0-693.el7.x86_64 #1
[949664.745504] Hardware name: Red Hat RHEV Hypervisor, BIOS 1.9.1-5.el7_3.2 04/01/2014
[949664.745506] task: ffff880fb3299fa0 ti: ffff880fe12a8000 task.ti: ffff880fe12a8000
[949664.745508] RIP: 0010:[<ffffffffc0507429>] [<ffffffffc0507429>] nfs_mark_return_delegation.isra.4+0x19/0x20 [nfsv4]
[949664.745535] RSP: 0018:ffff880fe12abdc0 EFLAGS: 00000202
[949664.745537] RAX: ffff880fe8760800 RBX: 00000000f50f06d8 RCX: 000000000000000f
[949664.745539] RDX: 000000000000000f RSI: ffff880411667100 RDI: ffff880fe6a99800
[949664.745540] RBP: ffff880fe12abdc0 R08: 0000000000000000 R09: 0000000000000000
[949664.745541] R10: ffff880fff359c40 R11: ffffea003c3f8980 R12: 0000000000000010
[949664.745543] R13: ffffffffc05074de R14: ffffffffffffff10 R15: ffff880fe6a99800
[949664.745545] FS: 0000000000000000(0000) GS:ffff880fff340000(0000) knlGS:0000000000000000
[949664.745547] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[949664.745548] CR2: 0000000000000004 CR3: 00000000019f2000 CR4: 00000000000006e0
[949664.745555] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[949664.745556] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[949664.745557] Stack:
[949664.745559] ffff880fe12abde8 ffffffffc0507551 ffff880fb2f5aed0 ffff880fe87608c8
[949664.745561] ffff880fe8760800 ffff880fe12abe58 ffffffffc05089ae ffffffffc05213c0
[949664.745564] ffff880ffa8e2180 ffff880411667100 0000000000000000 00592902022a921d
[949664.745566] Call Trace:
[949664.745580] [<ffffffffc0507551>] nfs_revoke_delegation+0x71/0x90 [nfsv4]
[949664.745592] [<ffffffffc05089ae>] nfs_reap_expired_delegations+0x1ae/0x220 [nfsv4]
[949664.745603] [<ffffffffc0504c22>] nfs4_state_manager+0x5f2/0x8c0 [nfsv4]
[949664.745626] [<ffffffffc0504ef0>] ? nfs4_state_manager+0x8c0/0x8c0 [nfsv4]
[949664.745637] [<ffffffffc0504f0f>] nfs4_run_state_manager+0x1f/0x40 [nfsv4]
[949664.745643] [<ffffffff810b098f>] kthread+0xcf/0xe0
[949664.745647] [<ffffffff8108ddeb>] ? do_exit+0x6bb/0xa40
[949664.745649] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[949664.745654] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[949664.745656] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[949664.745657] Code: 48 8b 07 f0 80 88 28 01 00 00 20 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 f0 80 4e 48 02 48 8b 07 f0 80 88 28 01 00 00 20 <5d> c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 4c 8d af
To sum up, the cause is: on RHEL 7.4, the NFSv4 state manager thread gets stuck in an infinite loop in nfs_reap_expired_delegations, which hangs the NFSv4 client.
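Both traces share one signature: a soft lockup report followed by nfs_reap_expired_delegations in the RIP or call trace. A rough way to look for that signature in saved kernel logs is sketched below. The function name `has_nfs4_lockup_signature` is hypothetical, and a match only suggests this bug; it does not prove it.

```shell
# Read kernel log text on stdin and exit 0 if a soft lockup report is
# followed (on the same or a later line) by nfs_reap_expired_delegations.
has_nfs4_lockup_signature() {
    awk '
        /BUG: soft lockup/ { lockup = 1 }
        lockup && /nfs_reap_expired_delegations/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example use (run as a user allowed to read the kernel log):
#   dmesg | has_nfs4_lockup_signature && echo "signature found"
#   has_nfs4_lockup_signature < /var/log/messages && echo "signature found"
```

The check is deliberately loose about line breaks, so it also works on logs whose entries have been wrapped or joined.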
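Because I am rather lazy, and because upgrading the kernel on physical machines is too risky, I settled on solutions 4 and 5. The rest of this article covers those two fixes.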
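Make NFSv3 the default. An NFS client automatically negotiates the NFS version to use with the NFS server and always picks the highest version available, so once the server stops offering NFSv4, clients fall back to NFSv3.
Method:
In the configuration file /etc/sysconfig/nfs, set: RPCNFSDARGS="-N 4"
Restart the NFS service: systemctl restart nfs
Remount on the client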
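The server-side edit above can be scripted so that it is safe to run more than once. This is only a sketch: the helper name `set_rpcnfsdargs` and the temp-file approach are illustrative choices, and it assumes the file holds at most one uncommented `RPCNFSDARGS=` line.

```shell
# Set RPCNFSDARGS in a sysconfig-style file, replacing any existing
# uncommented assignment. Hypothetical helper, not part of nfs-utils.
set_rpcnfsdargs() {
    local conf="$1" value="$2" tmp
    tmp="$(mktemp)" || return 1
    # Keep every line except an existing RPCNFSDARGS assignment.
    grep -v '^RPCNFSDARGS=' "$conf" > "$tmp" || true
    printf 'RPCNFSDARGS="%s"\n' "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" || { rm -f "$tmp"; return 1; }
    rm -f "$tmp"
}

# On the server (as root), using the path and service name from the text:
#   set_rpcnfsdargs /etc/sysconfig/nfs "-N 4"
#   systemctl restart nfs
# On the client, after remounting, `nfsstat -m` shows the negotiated
# version in the vers= mount option.
```

Running the helper twice leaves a single `RPCNFSDARGS="-N 4"` line, which avoids stacking duplicate assignments.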
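The following steps are performed on the server side. As far as I understand, the Linux NFS server hands out NFSv4 delegations through file leases, so turning leases off stops the server from granting delegations:
Temporary (until reboot): echo '0' > /proc/sys/fs/leases-enable
Permanent: add "fs.leases-enable = 0" to the configuration file /etc/sysctl.conf
Restart the service so the change takes effect for clients: systemctl restart nfs-server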
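To verify the lease setting and persist it without adding duplicate lines, something along these lines could be used. Both function names are made up for this sketch, and the default paths are simply the ones quoted above.

```shell
# Print "disabled" if the lease switch reads 0, otherwise "enabled".
leases_state() {
    local proc_file="${1:-/proc/sys/fs/leases-enable}"
    local value
    value="$(cat "$proc_file")" || return 2
    if [ "$value" = "0" ]; then
        echo disabled
    else
        echo enabled
    fi
}

# Append "fs.leases-enable = 0" to a sysctl file only if it is not there yet.
persist_leases_off() {
    local conf="${1:-/etc/sysctl.conf}"
    grep -Eq '^[[:space:]]*fs\.leases-enable[[:space:]]*=[[:space:]]*0[[:space:]]*$' "$conf" 2>/dev/null \
        || echo 'fs.leases-enable = 0' >> "$conf"
}

# On the server (as root):
#   echo 0 > /proc/sys/fs/leases-enable
#   persist_leases_off
#   leases_state
#   systemctl restart nfs-server
```

Note that `persist_leases_off` does not rewrite an existing line that sets a different value; in that case the file still needs a manual edit.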
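Anything in life can be distilled into a solution or an article. For example, today Mr. Jiang from the MySQL community, by exploring a single parameter, replication_optimize_for_static_plugin_config, showed us just how broad and deep MySQL is.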
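Original content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com for removal.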