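Recently the cluster's archive log disk group (ARCH) raised an 80% usage alert. A surge in business transactions pushed the archive volume from the original 80G+ up to 150G, so ARCH was expanded by another 500GB. The same expansion had been done N times before without a single problem, but this time it nearly ended in disaster.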
[root@dbrac1 ~]# echo "- - -" > /sys/class/scsi_host/host0/scan
[root@dbrac1 ~]# echo "- - -" > /sys/class/scsi_host/host1/scan
[root@dbrac1 ~]# echo "- - -" > /sys/class/scsi_host/host2/scan
[root@dbrac1 ~]# echo "- - -" > /sys/class/scsi_host/host3/scan
[root@dbrac1 ~]# echo "- - -" > /sys/class/scsi_host/host4/scan
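The five scan commands above can also be written as a loop, so that no host is missed when the number of HBAs differs between nodes. This is only a sketch: `rescan_scsi_hosts` is a hypothetical helper name, and the optional directory argument exists purely so the loop can be tried against a scratch directory before it is pointed at the real sysfs tree.

```shell
#!/bin/sh
# Rescan every SCSI host found under a sysfs-style directory.
# Hypothetical helper: the base directory defaults to /sys/class/scsi_host.
rescan_scsi_hosts() {
    base="${1:-/sys/class/scsi_host}"
    for scan_file in "$base"/host*/scan; do
        # Skip the unexpanded glob when no host directory exists.
        [ -e "$scan_file" ] || continue
        # "- - -" is a wildcard for channel, target and LUN.
        echo "- - -" > "$scan_file" && echo "rescanned ${scan_file%/scan}"
    done
}
```

On a real node it would be called as root with no argument: `rescan_scsi_hosts`.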
[root@dbrac1 ~]# multipath -l
mpathv (560002ac0000000000000007a00020406) dm-24 3PARdata,VV
size=500G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
|- 1:0:0:17 sdbv 68:144 active undef unknown
|- 3:0:0:17 sdbx 68:176 active undef unknown
|- 1:0:1:17 sdbw 68:160 active undef unknown
`- 3:0:1:17 sdby 68:192 active undef unknown
[root@dbrac1 ~]# vim /etc/multipath.conf
defaults {
user_friendly_names yes
}
multipaths {
multipath {
no_path_retry fail
wwid 560002ac0000000000000007a00020406
alias ASM-ARCH2
}
}
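Because the same alias has to be configured on every RAC node, it can help to read back which alias a given WWID is mapped to in each node's file. The following is a minimal sketch, assuming the simple layout shown above (one keyword per line, unquoted values); `alias_for_wwid` is a hypothetical helper, not part of multipath-tools.

```shell
#!/bin/sh
# Print the alias configured for a WWID in a multipath.conf-style file.
# Usage: alias_for_wwid <conf-file> <wwid>
# Assumes unquoted values and "multipath {" on its own line.
alias_for_wwid() {
    awk -v want="$2" '
        $1 == "multipath" && $2 == "{" { in_block = 1; wwid = ""; alias = ""; next }
        in_block && $1 == "wwid"  { wwid = $2 }
        in_block && $1 == "alias" { alias = $2 }
        in_block && $1 == "}" {
            if (wwid == want) print alias
            in_block = 0
        }
    ' "$1"
}
```

For the configuration above, `alias_for_wwid /etc/multipath.conf 560002ac0000000000000007a00020406` would be expected to print `ASM-ARCH2`.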
-- Restart multipath
[root@dbrac1 ~]# /etc/init.d/multipathd restart
ok
Stopping multipathd daemon: [ OK ]
Starting multipathd daemon: [ OK ]
[root@dbrac1 ~]# multipath -l
ASM-ARCH2 (560002ac0000000000000007a00020406) dm-24 3PARdata,VV
size=500G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
|- 1:0:0:17 sdbv 68:144 active undef unknown
|- 3:0:0:17 sdbx 68:176 active undef unknown
|- 1:0:1:17 sdbw 68:160 active undef unknown
`- 3:0:1:17 sdby 68:192 active undef unknown
[root@dbrac1 ~]# chown grid.asmadmin /dev/mapper/ASM-ARCH2
[root@dbrac1 ~]# chmod 660 /dev/mapper/ASM-ARCH2
SQL> set linesize 200;
SQL> col name format a20;
SQL> select group_number,name,TOTAL_MB, FREE_MB from v$asm_diskgroup;
GROUP_NUMBER NAME TOTAL_MB FREE_MB
------------ -------------------- ---------- ----------
1 ARCH 512000 134955
......
6 rows selected.
SQL> col name format a20;
SQL> col path format a30;
SQL> select name,path,mode_status,state,disk_number,failgroup from v$asm_disk;
NAME PATH MODE_ST STATE DISK_NUMBER FAILGROUP
-------------------- ------------------------------ ------- -------- ----------- -----------------------
/dev/mapper/ASM-ARCH2 ONLINE NORMAL 0
ARCH_0000 /dev/mapper/ASM-ARCH1 ONLINE NORMAL 0 ARCH_0000
......
SQL> alter diskgroup ARCH add disk '/dev/mapper/ASM-ARCH2' rebalance power 1;
SQL> set line 800
SQL> select group_number,name,TOTAL_MB, FREE_MB from v$asm_diskgroup;
GROUP_NUMBER NAME TOTAL_MB FREE_MB
------------ ------------------------------ ---------- ----------
1 ARCH 1024000 634424
......
6 rows selected.
SQL> select * from v$asm_operation;
no rows selected
SQL> select group_number,name,total_mb,free_mb,total_mb-free_mb used_mb from v$asm_disk_stat;
GROUP_NUMBER NAME TOTAL_MB FREE_MB USED_MB
------------ ------------------------------ ---------- ---------- ----------
1 ARCH_0000 512000 317197 194803
1 ARCH_0001 512000 317227 194773
......
18 rows selected.
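This is a 4-node RAC cluster. Alerts suddenly came in that the database instances on the other 3 nodes were down, leaving the node I was working on as the only one still carrying business traffic. Its session count shot straight up to 1300 (fortunately the database's maximum session setting is fairly high: 2500). My first suspicion was that the newly added disk path had not been given the correct permissions on the other 3 nodes.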
835 2024-11-20 14:49:34 echo "- - -" > /sys/class/scsi_host/host0/scan
836 2024-11-20 14:49:34 echo "- - -" > /sys/class/scsi_host/host1/scan
837 2024-11-20 14:49:34 echo "- - -" > /sys/class/scsi_host/host2/scan
838 2024-11-20 14:49:34 echo "- - -" > /sys/class/scsi_host/host3/scan
839 2024-11-20 14:49:35 echo "- - -" > /sys/class/scsi_host/host4/scan
840 2024-11-20 14:49:37 multipath -l
841 2024-11-20 14:49:50 /etc/init.d/multipathd reload
842 2024-11-20 14:49:54 multipath -l
843 2024-11-20 14:51:05 exit
844 2024-11-20 15:58:27 vim /etc/multipath.conf
845 2024-11-20 15:58:47 /etc/init.d/multipathd restart
846 2024-11-20 15:58:49 multipath -l
847 2024-11-20 15:58:58 vim /etc/multipath.conf
848 2024-11-20 15:59:09 /etc/init.d/multipathd restart
849 2024-11-20 15:59:11 multipath -l
850 2024-11-20 15:59:46 cat /var/log/messages
851 2024-11-20 16:02:08 chown grid.asmadmin /dev/mapper/ASM-ARCH2
852 2024-11-20 16:03:16 chmod 660 /dev/mapper/ASM-ARCH2
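The history above ends with `chown` and `chmod` on `/dev/mapper/ASM-ARCH2`. To rule out the permission suspicion on a node, the resulting ownership and mode can be read back and compared with what was set (`grid:asmadmin` and `660`). This is a sketch that assumes GNU `stat`; `device_perms` and `check_device_perms` are hypothetical helper names.

```shell
#!/bin/sh
# Print "owner:group mode" for a device path.
# -L follows the /dev/mapper symlink to the underlying dm device.
device_perms() {
    stat -L -c '%U:%G %a' "$1"
}

# Compare the actual value with an expected "owner:group mode" string.
# Returns 0 on match, 1 on mismatch, 2 if the path cannot be read.
check_device_perms() {
    actual=$(device_perms "$1") || return 2
    if [ "$actual" = "$2" ]; then
        echo "OK $1 $actual"
    else
        echo "MISMATCH $1 $actual (expected $2)"
        return 1
    fi
}
```

On each node this would be run as `check_device_perms /dev/mapper/ASM-ARCH2 "grid:asmadmin 660"`.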
Nov 20 14:42:24 dbrac2 kernel: sd 3:0:0:1: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Nov 20 14:42:24 dbrac2 kernel: sd 1:0:0:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Nov 20 14:42:24 dbrac2 kernel: sd 3:0:1:5: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Nov 20 14:42:24 dbrac2 kernel: sd 1:0:1:11: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Nov 20 14:47:16 dbrac2 puppet-agent[19326]: Finished catalog run in 4.88 seconds
Nov 20 14:47:23 dbrac2 sshd[22245]: Accepted password for hnyunwei from 10.10.6.15 port 10266 ssh2
Nov 20 14:47:28 dbrac2 kernel: scsi: host 0 channel 0 id 0 lun4194304 has a LUN larger than allowed by the host adapter
Nov 20 14:47:29 dbrac2 kernel: scsi: host 0 channel 3 id 0 lun4194304 has a LUN larger than allowed by the host adapter
Nov 20 14:48:55 dbrac2 kernel: scsi: host 0 channel 0 id 0 lun4194304 has a LUN larger than allowed by the host adapter
Nov 20 14:48:56 dbrac2 kernel: scsi: host 0 channel 3 id 0 lun4194304 has a LUN larger than allowed by the host adapter.
......
Nov 20 14:50:02 dbrac2 multipathd: sdbr: couldn't get asymmetric access state
Nov 20 14:50:02 dbrac2 multipathd: sdbs: couldn't get asymmetric access state
Nov 20 14:50:02 dbrac2 multipathd: sdbt: couldn't get asymmetric access state
Nov 20 14:50:02 dbrac2 multipathd: sdbu: couldn't get asymmetric access state
Nov 20 14:50:03 dbrac2 kernel: device-mapper: table: 253:24: multipath: error getting device
Nov 20 14:50:03 dbrac2 kernel: device-mapper: ioctl: error adding target to table
Nov 20 14:50:03 dbrac2 multipathd: mpatha: ignoring map
......
Nov 20 14:50:05 dbrac2 multipathd: mpathv: load table [0 20971520 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 4 1 68:144 1 68:176 1 68:160 1 68:192 1]
......
Nov 20 14:50:05 dbrac2 multipathd: mpathv: event checker started
Nov 20 14:50:05 dbrac2 kernel: sd 1:0:0:17: alua: port group 01 state A preferred supports tolusnA
Nov 20 14:50:05 dbrac2 kernel: sd 3:0:0:17: alua: port group 01 state A preferred supports tolusnA
Nov 20 14:50:05 dbrac2 kernel: sd 1:0:1:17: alua: port group 01 state A preferred supports tolusnA
Nov 20 14:50:05 dbrac2 kernel: sd 3:0:1:17: alua: port group 01 state A preferred supports tolusnA
Nov 20 14:50:05 dbrac2 multipathd: dm-24: remove map (uevent)
Nov 20 14:50:05 dbrac2 multipathd: mpathv: stop event checker thread (140737345021696)
Nov 20 14:50:05 dbrac2 multipathd: dm-24: remove map (uevent)
Nov 20 14:50:05 dbrac2 multipathd: dm-24: devmap not registered, can't remove
Nov 20 14:50:05 dbrac2 multipathd: dm-24: adding map
Nov 20 14:50:05 dbrac2 multipathd: mpathv: event checker started
Nov 20 14:50:05 dbrac2 multipathd: mpathv: devmap dm-24 added
......
Nov 20 16:21:47 dbrac2 kernel: rport-1:0-16: blocked FC remote port time out: removing rport
Nov 20 16:21:47 dbrac2 kernel: rport-2:0-85: blocked FC remote port time out: removing rport
Nov 20 16:26:19 dbrac2 kernel: rport-4:0-2: blocked FC remote port time out: removing rport
Nov 20 16:26:19 dbrac2 kernel: rport-3:0-5: blocked FC remote port time out: removing rport
Nov 20 16:26:19 dbrac2 kernel: rport-2:0-4: blocked FC remote port time out: removing rport
Nov 20 16:26:19 dbrac2 kernel: rport-1:0-5: blocked FC remote port time out: removing rport
Nov 20 16:30:48 dbrac2 sshd[123983]: Accepted password for hnyunwei from 10.10.6.9 port 55959 ssh2
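One detail in the log above is easy to overlook: the `mpathv: load table [0 20971520 multipath ...]` line records the length of the map. Device-mapper table lengths are expressed in 512-byte sectors, so 20971520 sectors works out to 10 GiB, not the 500G that `multipath -l` reported on dbrac1. The arithmetic can be sketched as follows; both function names are hypothetical.

```shell
#!/bin/sh
# Extract the sector count from a multipathd "load table" log line on stdin.
table_sectors() {
    sed -n 's/.*load table \[0 \([0-9][0-9]*\) multipath.*/\1/p'
}

# Convert a count of 512-byte sectors to whole GiB.
sectors_to_gib() {
    echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}
```

For example, `grep 'load table' /var/log/messages | table_sectors` would list the sector counts, and `sectors_to_gib 20971520` prints `10`. A 500 GiB LUN would show 1048576000 sectors.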
Wed Nov 20 16:06:36 2024
NOTE: ASMB terminating
Errors in file /u01/oracle/diag/rdbms/dbrac/rac2/trace/rac2_asmb_77922.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 2136 Serial number: 3
Errors in file /u01/oracle/diag/rdbms/dbrac/rac2/trace/rac2_asmb_77922.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 2136 Serial number: 3
Wed Nov 20 16:06:36 2024
System state dump requested by (instance=2, osid=77922 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/oracle/diag/rdbms/dbrac/rac2/trace/rac2_diag_77675.trc
ASMB (ospid: 77922): terminating the instance due to error 15064
Wed Nov 20 16:06:36 2024
opiodr aborting process unknown ospid (16621) as a result of ORA-1092
Wed Nov 20 16:06:36 2024
opiodr aborting process unknown ospid (80307) as a result of ORA-1092
Wed Nov 20 16:06:36 2024
opiodr aborting process unknown ospid (50410) as a result of ORA-1092
Wed Nov 20 16:06:36 2024
ORA-1092 : opitsk aborting process
Wed Nov 20 16:06:36 2024
opiodr aborting process unknown ospid (140718) as a result of ORA-1092
Wed Nov 20 16:06:36 2024
ORA-1092 : opitsk aborting process
Wed Nov 20 16:06:37 2024
ORA-1092 : opitsk aborting process
Wed Nov 20 16:06:37 2024
ORA-1092 : opitsk aborting process
Wed Nov 20 16:06:37 2024
ORA-1092 : opitsk aborting process
Wed Nov 20 16:06:38 2024
2024-11-20 16:06:36.265:
[/u01/grid/11.2.0.3/product/bin/oraagent.bin(11611)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/grid/11.2.0.3/product/log/dbrac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-11-20 16:06:36.689:
[ohasd(10948)]CRS-2765:Resource 'ora.asm' has failed on server 'dbrac2'.
2024-11-20 16:06:36.694:
[/u01/grid/11.2.0.3/product/bin/oraagent.bin(11611)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/grid/11.2.0.3/product/log/dbrac2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-11-20 16:06:36.768:
[crsd(77204)]CRS-2765:Resource 'ora.dbrac.db' has failed on server 'dbrac4'.
2024-11-20 16:06:36.772:
[crsd(77204)]CRS-2765:Resource 'ora.asm' has failed on server 'dbrac4'.
2024-11-20 16:06:36.972:
[/u01/grid/11.2.0.3/product/bin/oraagent.bin(77425)]CRS-5011:Check of resource "dbrac" failed: details at "(:CLSN00007:)" in "/u01/grid/11.2.0.3/product/log/dbrac2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2024-11-20 16:06:37.063:
-- Before restart
ASM-ARCH2 (560002ac0000000000000007a00020406) dm-24 3PARdata,VV
size=10G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
|- 1:0:0:17 sdbv 68:144 active undef unknown
|- 3:0:0:17 sdbx 68:176 active undef unknown
|- 1:0:1:17 sdbw 68:160 active undef unknown
`- 3:0:1:17 sdby 68:192 active undef unknown
-- After restart:
ASM-ARCH2 (560002ac0000000000000007a00020406) dm-24 3PARdata,VV
size=500G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
|- 1:0:0:17 sdbr 68:80 active ready running
|- 3:0:0:17 sdbt 68:112 active ready running
|- 1:0:1:17 sdbs 68:96 active ready running
`- 3:0:1:17 sdbu 68:128 active ready running
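The two listings above differ in the reported size (10G before the restart, 500G after). A pre-check of the following kind, run on every node before `alter diskgroup ... add disk`, might have flagged that mismatch. It is only a sketch and not the author's procedure: `mpath_size` is a hypothetical helper that reads `multipath -l` output on stdin and prints the size of one named map.

```shell
#!/bin/sh
# Print the size= value of the multipath map whose name is given as $1.
# Input: "multipath -l" (or "multipath -ll") output on stdin.
mpath_size() {
    awk -v name="$1" '
        $1 == name { found = 1; next }
        found && /^size=/ { sub(/^size=/, "", $1); print $1; exit }
    '
}

# Example use on one node (requires root):
#   multipath -l | mpath_size ASM-ARCH2
# Running the same command on every node (for instance over ssh) and
# comparing the results would show whether all nodes see the same LUN size.
```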
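Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com for removal.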