A source of an amazing amount of misery for our website customers comes from MySQL CHARSET encoding. Many times, a customer will export their database from their previous deployment, install it on their new server, and discover strange characters embedded in their database.
We’ve actually written a couple of utilities to repair these problems in the past, including on in Ruby. However, they are very specific to the individual account you’re working on. Additionally, it’s not unusual for a customer to have a partially broken installation due to attempts to fix it. We had a customer some months ago who gave us a veeery long weekend after dumping the site, then shutting down the old account, and then requesting we do the import for them.
Unfortunately, the encoding on the dump was wrong, and after importing it, we have a plethora of issues. In addition, the database had a mix of unicode (utf-8) and latin1 encoding due to several previous attempts to fix the problem, or perhaps upgrades (nobody was sure), so we ended up having to sift through literally gigs of mysql .sql dumps by hand to fix the problem.
It turns out that:
That last one is sooo not fun.
If you’re in this situation, you should seriously read this post from yk, which basically lays out the basics on converting table encoding. I’ll rate that post a 4 out of 10. Useful, but it pales in comparison to Derek Siver’s post on oreillynet, which actually lays out a better explanation than the notes I kept on our process. It’s one of the best posts I’ve come across on fixing the problem, so I’m actually going to link to it rather than spend two hours writing up our notes, which are substantially the same. Thanks for the Monday evening back Derek. I think I’ll go watch my MythTV box.
I cannot reiterate enough the importance of starting your MySQL database projects exclusively with UTF-8 encoding - it will save you from more headaches down the road.