Server crash: lesson learned

How much backup do you need so that your web server is “safe” from hardware failure? Well, I had made duplicate disks (identical and bootable), one of which is constantly running as the backup server. In addition, any changing files (websites, uploads, MySQL) are backed up once a day and transferred to the backup server and to a third backup. You would think this is good enough? No!

On April 11, 2014, I noticed that the main server was not responding. I took a look at the computer and it showed a SMART error for the main drive, so it refused to boot. SMART is a way of detecting imminent hard drive failure (before the drive has totally failed). In the BIOS I was able to disable the SMART check and the drive booted OK.

Now come my mistakes.

Mistake 1: I did not try to fix a corrupted table in MySQL, which prevented a full database dump from within Unix. However, I was able to use phpMyAdmin to do an export (the exported file was saved on a PC client); it also complained about a corrupted table. The table belonged to a WordPress plugin (Counterize). I did not, at the time, know that phpMyAdmin has an option to repair tables. Had I done that, life would probably have been much easier.
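
For reference, a corrupted table can also be repaired from the command line; a minimal sketch, where the database name wordpress and the table name wp_counterize are placeholders, not the actual names from my setup:

# check one table and repair it if the check finds corruption
mysqlcheck -u root -p --auto-repair wordpress wp_counterize
# or equivalently, from inside the mysql client:
# REPAIR TABLE wp_counterize;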

Mistake 2a: I did not first check the integrity of the dump to see whether it was restorable (e.g., on a backup server). This cost me 2 days. I also forgot my database structure… I thought I had only one database but actually had 3! So I lost some data on photo galleries, which I had to re-upload.
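
The check itself is simple; a rough sketch of what I should have been doing on the backup server (the scratch database name is a placeholder, and the dump file is the one from the appendix):

# load the dump into a throwaway database and make sure the tables come back
mysql -u root -p -e "CREATE DATABASE restore_test"
gunzip -c Mysql-Fri.sql.gz | mysql -u root -p restore_test
mysql -u root -p -e "SHOW TABLES" restore_test

(If the dump contains its own CREATE DATABASE/USE statements, it will restore into those databases instead of the scratch one.)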

Mistake 2b: I failed to see that all the previous daily dumps were no good! The crontab scripts did dump daily, but only once in a while did they produce a good one; most were of much smaller sizes (e.g., 56 MB instead of the correct 230 MB. There was only one good backup in Jan 2014; all others since then were smaller, then suddenly a good one in week 11, Feb 2014). Strangely, the web servers were functioning correctly all this time, suggesting that the HD was having issues saving the ASCII file while the binary files that MySQL directly accesses were fine. This theory is contradicted, though, by the fact that a large tar file I made on April 11th worked fine, so all the static files were up to date.
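
A small size check in the nightly crontab script would have flagged this months earlier; a sketch (the path, threshold, and mail target are placeholders):

#!/bin/sh
# warn if tonight's dump comes out suspiciously small
DUMP=/backup/Mysql-$(date +%a).sql.gz
MIN=4000000                        # placeholder: set a bit below the normal dump size
SIZE=$(stat -f %z "$DUMP")         # file size in bytes (FreeBSD stat)
if [ "$SIZE" -lt "$MIN" ]; then
    echo "$DUMP is only $SIZE bytes" | mail -s "MySQL dump looks too small" root
fi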

Mistake 3: I wanted to duplicate the failing drive, so I carefully (I thought!) checked the BIOS to make sure the failing drive was the first drive to be booted and the slave drive was the second (in retrospect, that slave should not have been in the boot list at all, or should have had its boot sector or contents carefully destroyed). BUT the computer booted using the slave as the master and promptly called gmirror, which started duplicating the slave onto the master! I quickly rebooted, but too late; I lost all the contents of the failing drive. The correct thing to do would have been to let this drive run and remotely back up all the data (after first taking the time to learn how to fix a SQL table!), then boot it up as a non-main server in case I missed something (e.g., the access logs!).
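
Pulling data off a still-running (but failing) drive over the network is straightforward; a sketch using rsync, which is not necessarily what my own scripts use (host names and paths are placeholders):

# run from the backup server: copy the web tree and the daily dumps off the failing machine
rsync -avz michiganbees.net:/usr/local/www/ /backup/www/
rsync -avz michiganbees.net:/backup/ /backup/old-server/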

What was left, then, was a tar file and a SQL dump file, which I spent 2 days restoring. Finally, on April 13th, it worked, restoring all the contents of michiganbees.org, but bees.msu.edu was only up to 2012. Luckily I had ported that MySQL database to an outside server (paying $50 per year) so I could restore its contents back from there.

In the end:

1. I lost one page I made (bees.msu.edu/flowers, which I redid).
2. I lost all the access log files for 2013 and up to 2014. Not sure why the 2013 log files were not backed up on the backup server.
3. I lost about one week’s time trying to get everything back to working order.

But it was just pure luck that I did not lose data on michiganbees.org! I could very well have lost all the data there, since the SQL backup was no good.

Improvements:

1. Now I have made sure the tar script works. Previously the incremental tar was not working properly.

2. Now I automatically restore a copy of the main server to the backup server every morning, so by 7 am, ww2.michiganbees.net = michiganbees.net. I also get an email notifying me to this effect, including the size of the MySQL file. This ensures the MySQL data is good and restorable on another server, which was not being verified previously. (A rough sketch is below, after this list.)

3. I used dump/restore to duplicate a few HDs and they are bootable. I am doing away with gmirror, which has messed me up at least twice.
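
The morning restore in item 2 is essentially this; a minimal sketch, assuming the main server leaves its dump in /backup, the backup server has passwordless scp set up, and ~/.my.cnf holds the MySQL password (all of these details are placeholders):

#!/bin/sh
# run from cron on ww2 before 7 am
DUMP=Mysql-$(date +%a).sql.gz
scp michiganbees.net:/backup/$DUMP /backup/ || exit 1
gunzip -c /backup/$DUMP | mysql -u root                 # restore into the local MySQL
ls -l /backup/$DUMP | mail -s "ww2 restore done" root   # report success and the file size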

Still to do: a script to extract the tar file once a week at the backup server, so it is always ready to be booted as the main server (a rough sketch is below).
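
Something along these lines should do, assuming the weekly tar ends up in /backup on the backup server (the file name and web root are placeholders):

# weekly crontab entry on the backup server (Sunday 5 am)
0 5 * * 0  tar -xzpf /backup/www-full.tar.gz -C /usr/local/www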

Appendix:

1. File sizes of the SQL dumps:

Prior to crash:
-rw-r--r--  1 zach  user   4914155 Apr  6 04:30 Mysql-Sun.sql.gz
-rw-r--r--  1 zach  user   5877569 Apr  7 04:30 Mysql-Mon.sql.gz
-rw-r--r--  1 zach  user   5961466 Apr 11 04:30 Mysql-Fri.sql.gz

Post crash:
-rw-r--r--  1 zach  user  14383209 Apr 18 21:44 Mysql-15.sql.gz
-rw-r--r--  1 zach  user  25144923 May  9 04:14 Fri-drone.sql.gz
-rw-r--r--  1 zach  user  26538228 May 12 04:13 Mon-drone.sql.gz

2. Steps to duplicate a system drive (assuming the USB-connected drive is da0 and the main drive is ada0s1):

gpart destroy -F da0                                      # wipe any existing partition table on the new drive
gpart create -s GPT da0                                   # create a fresh GPT partition table
gpart add -t freebsd-boot -l gpboot -b 40 -s 512K da0     # small boot partition
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 da0    # install the boot code
gpart add -t freebsd-swap -l gpswap -s 4096M da0          # 4 GB swap
gpart add -t freebsd-ufs -l gptfs da0                     # rest of the drive as the root filesystem
newfs -U /dev/gpt/gptfs                                   # build a UFS filesystem with soft updates
mount /dev/gpt/gptfs /mnt
cd /mnt
dump -0Lauf - /dev/ada0s1 | restore -rf -                 # full (level 0) live dump of the main drive, restored into /mnt