4 TB production database inaccessible due to a "rejecting I/O to offline device" disk failure

2022-06-22 12:19:00 【weixin_ forty-one million five hundred and sixty-one thousand n】

1、 Project background

An important project runs on an Oracle database set up as one primary and one standby using ADG (Active Data Guard). The total data volume is about 4 TB, and the system had been running normally for nearly 5 years. Recently, severe degradation of disk performance on the standby caused log apply to fall behind: 1.4 TB of archive logs had not been applied (about 350 GB of logs are generated per day, so the standby was 4 days behind). All applications were therefore switched to the primary, and the standby was taken out for maintenance. Only 2 days after the switch, the primary went down on a Friday night, which is why this article exists. It aims to offer some ideas and methods to anyone who runs into the same problem, though of course I hope you never need it.
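The backlog figure above follows directly from the daily log volume. A minimal sketch of that arithmetic is shown here; the `catch_up_days` helper and its apply-rate parameter are hypothetical additions for illustration, since the article does not state how fast the standby could apply logs.

```python
# Figures taken from the article: about 350 GB of archive logs per day,
# and a standby that is 4 days behind.
DAILY_ARCHIVE_GB = 350
DAYS_BEHIND = 4


def backlog_gb(daily_gb, days):
    """Archive-log volume still waiting to be applied on the standby."""
    return daily_gb * days


def catch_up_days(backlog, daily_gb, apply_gb_per_day):
    """Days needed to clear the backlog while new logs keep arriving.

    apply_gb_per_day is a hypothetical apply rate, not a value from the article.
    """
    surplus = apply_gb_per_day - daily_gb
    if surplus <= 0:
        # Applying no faster than logs arrive: the gap never closes.
        return float("inf")
    return backlog / surplus


backlog = backlog_gb(DAILY_ARCHIVE_GB, DAYS_BEHIND)
print(f"backlog: {backlog} GB (about {backlog / 1000:.1f} TB)")
```

The point of the second function is that a degraded standby only catches up if it applies logs faster than the primary produces them; at or below 350 GB per day the delay can only grow.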

2、 System environment

2 database servers: System x3950 X6, 16 hard disks in total. The first 4 disks are 300 GB each and the other 12 are 4 TB each; every 4 disks form one RAID 10 group.
Operating system version: Red Hat 6.5
Database version: Oracle 11.2.0.4
OGG version: OGG 12.2.0.1
Database architecture: one primary and one standby built on ADG

3、 Fault description

1、 Running ls in several directories where the disk partitions are mounted shows no content and reports an input/output error
2、 /var/log/messages contains the following error messages
Jan 31 03:47:03 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
3、 The database cannot be accessed; 45 of its data files are on the sdc2 and sdc3 partitions.
4、 The database backup is on the sdd1 partition, so a timely restore from backup is not possible.
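When the log is long, it helps to count how many of these kernel messages each device produced. The following is a minimal sketch, assuming the message format is exactly the one shown in the excerpt above; the function name and the third sample line are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines of the form shown in the article, for example:
#   Jan 31 03:47:03 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device
OFFLINE_RE = re.compile(
    r"kernel: sd (?P<addr>\d+:\d+:\d+:\d+): rejecting I/O to offline device"
)


def count_offline_rejections(lines):
    """Count 'rejecting I/O to offline device' messages per SCSI address."""
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


sample = [
    "Jan 31 03:47:03 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device",
    "Jan 31 03:49:47 test-db-2 kernel: sd 1:2:0:0: rejecting I/O to offline device",
    # Made-up line, included only to show that other messages are ignored.
    "Jan 31 03:49:47 test-db-2 kernel: some unrelated message",
]
sample_counts = count_offline_rejections(sample)
print(sample_counts)
```

To run it against a real log, pass an open file instead of the sample list, for example `count_offline_rejections(open("/var/log/messages"))`. Other kernel versions may word the message differently, in which case the pattern would need adjusting.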

4、 Troubleshooting

1、 Determine why the database cannot be opened normally
2、 Check whether any hard disk on the database server is raising an alarm
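The first symptom in the fault description, ls failing with an input/output error, can also be checked from a script before looking at the hardware. This is a rough sketch only: the `probe_directory` helper and the example paths are hypothetical, because the article does not list the actual mount points.

```python
import errno
import os


def probe_directory(path):
    """Try to list a directory and summarize the outcome as a short string."""
    try:
        os.listdir(path)
        return "ok"
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Same condition that makes ls report "Input/output error".
            return "io-error"
        # Any other failure (missing path, permissions, ...) is reported by name.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"


# Hypothetical paths; replace them with the real mount points of the partitions.
for path in [".", "/path/that/does/not/exist"]:
    print(path, "->", probe_directory(path))
```

A result of `io-error` on a mount point points at the block device or file system rather than at the database itself, which matches the order of the two checks above.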

5、 Problem analysis and solution

1、 The disk I/O errors made the 45 data files on the affected partitions unusable, so the database could not be opened directly. After the 45 data files were taken offline the database could be opened, but business access then reported a large number of missing data files, so the database still could not serve normal external access.
2、 The server has 16 hard disks, grouped four at a time into RAID 10 arrays that correspond to sda, sdb, sdc and sdd. One disk in the sdd group was showing a yellow alarm light. Normally a single disk with a yellow light should not affect service, so there may be a logical error here. One option is to umount the affected partitions and check them for bad blocks with fsck (unfortunately, fsck could not detect these partitions). The next step considered was restarting the server (shut down first, then power on).
Reasons for the restart:
1、 The database cannot provide normal external access
2、 The database backup is inaccessible
3、 Only one disk has a yellow light; the restart makes it possible to see whether other errors appear and whether the file system can be repaired automatically
The shutdown of the database server ran for nearly 1 hour without completing and remained stuck on the shutdown screen.
Finally, the server was forced off with its power button, the disk with the yellow light was pulled out, and the server was restarted.

After the database server started normally, sdc2 and sdc3 were mounted and accessible, while sdd still could not be mounted. The next step was to open the database and bring all of the offline data files back online.
First, recover data file 30 and then bring it online:
SQL> alter database datafile 30 online;
alter database datafile 30 online
*
ERROR at line 1:
ORA-01113: file 30 needs media recovery
ORA-01110: data file 30: '/test/test2/teststg01.dbf'

SQL> recover datafile 30;
Media recovery complete.
SQL> select file#,status,name from v$datafile;

30 OFFLINE /test/test2/teststg01.dbf
31 RECOVER /test/test2/teststg02.dbf
32 RECOVER /test/test2/teststg03.dbf
33 RECOVER /test/test2/bigdata01.dbf

…

SQL> alter database datafile 30 online;
Database altered.
Then run the remaining files as a batch:
SQL> recover datafile 31,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,63;

SQL> alter database datafile 31,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,63 online;
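Typing long file lists by hand is error-prone, so the batch statements can be generated from the status listing instead. The sketch below assumes the three-column `file# status name` layout printed by the v$datafile query in this article, and assumes (based on the output above, where file 30 shows OFFLINE after recovery) that RECOVER means recovery is still needed while OFFLINE only needs to be brought online. The function names are hypothetical, and the generated statements should be reviewed before running them.

```python
def parse_datafile_rows(text):
    """Parse 'file# status name' rows as printed by the v$datafile query."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines that look like: <number> <status> <path>
        if len(parts) == 3 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


def build_recovery_statements(rows):
    """Build the batch RECOVER and ONLINE statements for files not yet online."""
    to_recover = sorted(n for n, status, _ in rows if status == "RECOVER")
    to_online = sorted(n for n, status, _ in rows if status in ("RECOVER", "OFFLINE"))
    statements = []
    if to_recover:
        statements.append(f"recover datafile {','.join(map(str, to_recover))};")
    if to_online:
        statements.append(
            f"alter database datafile {','.join(map(str, to_online))} online;"
        )
    return statements


# Rows copied from the listing in the article.
listing = """
30 OFFLINE /test/test2/teststg01.dbf
31 RECOVER /test/test2/teststg02.dbf
32 RECOVER /test/test2/teststg03.dbf
33 RECOVER /test/test2/bigdata01.dbf
"""
statements = build_recovery_statements(parse_datafile_rows(listing))
for statement in statements:
    print(statement)
```

Files that are already ONLINE are skipped, so the script can be rerun against a fresh listing after each batch to see what is left.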

All data files are now online. Next, export a dump backup and copy it to other servers.

6、 Summary and reflection

Once all business data had been exported, it was confirmed that the data from before the failure was intact. The next step is to restore business service promptly; there is still a lot of work to do, which I will cover in a later update. This failure exposed many problems, however, so it is worth summarizing and reflecting on them to avoid similar situations.

When the standby database first ran into problems, the following points should have been handled better:

1、 Although we reported the issue to the company promptly, it did not get enough attention from the relevant leaders (because it was not the primary database, or the leaders were not on site, and where would the budget come from?)
2、 We should have gone to the machine room for an inspection immediately, since the two sets of equipment were purchased at the same time (this was missed because the procedures for entering the machine room are complicated and the division of responsibilities was unclear)
3、 Backups taken on the primary database should be copied to other servers promptly
4、 Prepare additional standby servers, ideally with the same configuration. If that is not possible, a lower configuration is acceptable, as long as the most important business data is synchronized in real time.

Copyright notice
This article was written by [weixin_ forty-one million five hundred and sixty-one thousand n]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/173/202206221138210145.html