Tonight presents a valuable lesson. I had a box running heavy MySQL duty that would crash at odd times. I could get MySQL to start, but the processes would die, it wouldn’t terminate cleanly, and even on a freshly started copy it was giving me “out of memory” errors. After fighting this for some time (say hours) and assuming that it was me the user, I checked the system in a bout of frustration.
Being a Xeon, my first look after rebooting it was in the error log of the BIOS. It had a lone ECC error in the log. Where I couldn’t even run show databases; before it will go through a check and stay up now. I bring this up as it presents two invaluable lessons:
A)It’s usually the software or the sysadmin that screws a server up. Not the hardware. That being said it is best to consider it. This is the second time I’ve seen a machine with ECC RAM screw up like this in two years and several hundred servers later. I have seen maybe 20 ECC equipped machines that actually had DIMMs that were bad. Probably half that. With that being said MySQL tends to show it first.
B)ECC RAM is worth the extra outlay in the datacenter. This could have easily not been detected for a long period of time, and cost a client and the next client that would have been put on the server.