Secondary menu

MySQL CHARSET: latin1 to utf8 and back again: seeking XEN with character encoding

MySQL CHARSET: latin1 to utf8 and back again: seeking XEN with character encoding

Written by on

A source of an amazing amount of misery for our website customers comes from MySQL CHARSET encoding. Many times, a customer will export their database from their previous deployment, install it on their new server, and discover strange characters embedded in their database.

We’ve actually written a couple of utilities to repair these problems in the past, including on in Ruby. However, they are very specific to the individual account you’re working on. Additionally, it’s not unusual for a customer to have a partially broken installation due to attempts to fix it. We had a customer some months ago who gave us a veeery long weekend after dumping the site, then shutting down the old account, and then requesting we do the import for them.

Unfortunately, the encoding on the dump was wrong, and after importing it, we have a plethora of issues. In addition, the database had a mix of unicode (utf-8) and latin1 encoding due to several previous attempts to fix the problem, or perhaps upgrades (nobody was sure), so we ended up having to sift through literally gigs of mysql .sql dumps by hand to fix the problem.

Not fun.

It turns out that:

  • You can have different encodings set for the MySQL server and client
  • PHP can have it’s own encoding type, which is different than either MySQL setting
  • Dumps can have mixed types in them (as stated)
  • Altering MySQL table character encoding can only be done on a per-table basis, which can be fixed, but which can introduce double encoding if you have a mixed database if you script it, causing you to have to revert back to the original dump mid-process if you make a mistake

That last one is sooo not fun.

If you’re in this situation, you should seriously read this post from yk, which basically lays out the basics on converting table encoding. I’ll rate that post a 4 out of 10. Useful, but it pales in comparison to Derek Siver’s post on oreillynet, which actually lays out a better explanation than the notes I kept on our process. It’s one of the best posts I’ve come across on fixing the problem, so I’m actually going to link to it rather than spend two hours writing up our notes, which are substantially the same. Thanks for the Monday evening back Derek. I think I’ll go watch my MythTV box.

I cannot reiterate enough the importance of starting your MySQL database projects exclusively with UTF-8 encoding - it will save you from more headaches down the road.