So, it happened. After 3 years and 4 months in my current position, we have corruption! Every DBA's nightmare, right? Now, before I go any further, good planning won the day, and no data was harmed during the making of this blog. Thanks to a good backup routine and page-level restores (how I love page-level restores), each time the corruption recurred (and it did) I was able to recover the database to its full state with zero data loss.
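For anyone who hasn't had the pleasure, a page-level restore looks roughly like the sketch below. The database name, page IDs, and backup paths are all placeholders; in practice the damaged page IDs come from the CHECKDB output or from msdb.dbo.suspect_pages.

```sql
-- Restore only the damaged pages from the last full backup
-- (file:page IDs below are hypothetical)
RESTORE DATABASE SalesDB PAGE = '1:57, 1:202'
FROM DISK = N'D:\Backups\SalesDB_Full.bak'
WITH NORECOVERY;

-- Roll forward with the existing log backups...
RESTORE LOG SalesDB FROM DISK = N'D:\Backups\SalesDB_Log1.trn' WITH NORECOVERY;

-- ...then take a tail-log backup and restore it to bring the pages current
BACKUP LOG SalesDB TO DISK = N'D:\Backups\SalesDB_Tail.trn';
RESTORE LOG SalesDB FROM DISK = N'D:\Backups\SalesDB_Tail.trn' WITH RECOVERY;
```

The rest of the database stays online throughout (on Enterprise edition), which is exactly why I love them.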
The first time this instance corrupted, a full investigation was launched. A Premier support ticket was opened with MS, while internal investigations by the SAN, VM and SQL teams took place. Microsoft came back reporting all clear; the only thing they could find was some outdated VMware drivers. So we opened a ticket with VMware to investigate, and again the all clear came back: yes, the drivers were old, but still supported for our current ESX version (5.1). EMC were called in to evaluate the logs of the array, to make sure there were no problems there. Again, the all clear came through. Heads were scratched, but we all decided it was one of those unknown events and we would put it behind us, while keeping a close eye on the server and increasing the frequency of integrity checks (from daily to every 3 hours).
A few weeks passed peacefully and then, boom! I got my alerts in the middle of the night: “Database Integrity Check Failed on Server”! Crap! Same server! Crap! After another 12-hour outage, and much yawning on my part, the system was back up and recovered. Again, we launched our internal dialogue to look at every possibility: was it the SAP install? Was it the SQL install (done by a contractor, I might add)? Was the OS at fault? Were the disks bad? All involved came back with a resounding “it's all good”, so we kept monitoring and hoped for the best. Christmas passed, and then suddenly, bedlam. The database began corrupting almost constantly, and I began a 6-day cycle of repair / corrupt / repair / corrupt. During this cycle, and before it, we tried all of the following, to no avail:
- Migrate VM to multiple new hosts.
- Run chkdsk on all SQL data and log drives.
- Run a SQL repair.
- Delete the database completely and restore from a clean backup.
- De-provision and re-provision all disks (some had been badly provisioned as lazy-zeroed rather than eager-zeroed).
And then, one of our more competent VM admins arose from his winter hibernation and pointed out that we had a 5-month-old snapshot attached to the VM!!
After several minutes banging my head on my desk, and recriminations saved for later, I asked how long it would take to remove the snapshot (hours, as it happens), and why no one had pointed this out before (much shuffling of feet ensued). Once the snapshot was removed, I did a final recovery and brought the database back online. We are now 7 days and counting, with no recurrence.
When a snapshot is taken of a VM, it creates delta disks (vmdk and vmdk-delta files) for each virtual disk connected to the VM. Any changes made to the VM are written to these delta disks, and over time the files grow. The amount of growth is directly proportional to the I/O performed on the VM, which, as you can imagine, is high on a SQL Server, so over the 5-month lifetime of this snapshot the growth steadily increased, and with it the risk to the data. The team had already been advised never to use snapshots as a backup method for any of the SQL instances in the environment, and all old snapshots were supposed to have been removed, but sometimes mistakes are made.
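This kind of mistake is easy to check for. A periodic sweep with VMware's PowerCLI along these lines (a sketch; the vCenter address and the 30-day threshold are assumptions for illustration) would have flagged our snapshot months earlier:

```powershell
# Requires the VMware PowerCLI module and rights to the vCenter server
Connect-VIServer -Server vcenter.example.com

# List every snapshot older than 30 days, with its size on disk
Get-VM | Get-Snapshot |
    Where-Object { $_.Created -lt (Get-Date).AddDays(-30) } |
    Select-Object VM, Name, Created, SizeGB |
    Sort-Object Created
```

Drop that into a scheduled task that emails the output and stale snapshots stop being a surprise.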
There is no moral to this story, and no diagnostic tool to detect the cause; just a rant and a share, in the hope that someone can diagnose the same problem early and avoid the pain I went through.
Thanks for reading.