Recovering a SQL 2012 cluster from a corrupt master database


Emergency support can be a bit stressful, especially out of hours on a system you’ve never worked on before.

We recently stepped in to help a customer whose two-node SQL cluster had gone down late one night, and managed to get it back up in a few hours. A major disk corruption on the data and logs volumes of the cluster shared disks had taken out both the system databases and the user databases.

The Errors

1. The SQL service in failover cluster manager is showing as failed.

2. The SQL Server service won’t start manually.

3. Error in the event log

The Fix

Step 1.

Put the shared disks into maintenance mode in Failover Cluster Manager and run chkdsk “drive” /F /X on each of them. Repairs will happen as necessary. In this case, a lot of repairs were made.

Step 2.

Take the disks out of maintenance mode and browse through to the data and logs volumes. We noted zero-size user databases. Those are gone for sure and need to be restored from backup once the system databases are fixed.

Step 3.

Stop, think and ignore the interweb search results you get back.

There are many mentions on the internet of using /ACTION=REBUILDDATABASE to rebuild the system databases, to get you back to the point where SQL will start and let you restore a backup of the system databases.

This method didn’t work no matter what media or setup.exe we used. It failed silently and gave no errors.

Go and find yourself another SQL instance running the same version and restore the master backup there. As per this blog post, restore it with an alternative name such as “restore_master” for both the database name and the database MDF and LDF files, to avoid conflicts with the running SQL instance.
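As a rough sketch of that restore, assuming the master backup has been copied to the helper instance, something like the following T-SQL should do it. The paths and the backup file name are placeholders, not the ones we used, and the logical names master and mastlog are the defaults for the master database, which the first statement lets you confirm. Taking the database offline at the end is one way (an assumption on our part, not the only option) to release the files so they can be copied.

```sql
-- Run on the healthy helper instance, NOT on the damaged one.
-- All paths and the backup file name below are placeholders.

-- 1. Confirm the logical file names stored in the master backup.
RESTORE FILELISTONLY
FROM DISK = N'D:\Restore\master.bak';

-- 2. Restore the backup as an ordinary user database named restore_master,
--    moving both files to new names so they cannot clash with the
--    helper instance's own master files.
RESTORE DATABASE restore_master
FROM DISK = N'D:\Restore\master.bak'
WITH
    MOVE N'master'  TO N'D:\Restore\restore_master.mdf',
    MOVE N'mastlog' TO N'D:\Restore\restore_master.ldf',
    RECOVERY;

-- 3. Take the database offline so SQL Server closes the MDF and LDF files
--    and they can be copied to the damaged instance.
ALTER DATABASE restore_master SET OFFLINE WITH ROLLBACK IMMEDIATE;
```

Once the files have been copied, the restore_master database can be dropped from the helper instance.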

Copy the restored MDF and LDF files over to your damaged SQL instance, rename the old master files out of the way, and rename the restored files to the correct names.

Step 4.

Start SQL. Relax and restore the user databases as needed.
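Once the instance is running again, a quick check along these lines can show which user databases still need attention. This is a minimal sketch: SalesDB and the backup path are hypothetical, and WITH REPLACE assumes you really do want to overwrite the damaged files.

```sql
-- List every database and its state. Anything that is not ONLINE
-- (for example RECOVERY_PENDING or SUSPECT) is a candidate for restore.
SELECT name, state_desc
FROM sys.databases
ORDER BY name;

-- Restore one user database over the damaged copy.
-- SalesDB and the backup path are hypothetical placeholders.
RESTORE DATABASE SalesDB
FROM DISK = N'E:\Backups\SalesDB_full.bak'
WITH REPLACE, RECOVERY;
```

If you also have log backups to apply, restore the full backup WITH NORECOVERY instead and recover after the last log restore.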