Anyone working on the web should have an understanding of Unicode. Lots of effort has been made to spread the word, but still a surprisingly large number of people, including software developers, don't get it. Joel Spolsky's Absolute Minimum Every Software Developer ... Must Know About Unicode ... is probably the standard place to send people who don't yet grasp the concept (so if that's you, read the link).
Unfortunately, knowing the basics of character encoding will only get you so far when working with "dirty data." I don't know about you, but I have this happen a lot: a large dataset gets sent to me, or is exported from some application, and it looks like text. As in: ASCII encoding. And it needs to be imported.
No problem, I'll write a quick Python script to suck in all this nicely formatted, tab-delimited text data, translate it to match a Django model and I can get back to my real life.
But the dataset is way too large for me to look over completely and it turns out there are some characters in there that aren't ASCII. Oops. My Django import script spits out UnicodeDecodeErrors like they're sour grapes.
Dirty dot data
The first thing is to figure out how this text file was encoded. What I'm calling dirty data are really just text files that I don't know how to decode without dusting them off (i.e., cycling through the obvious possible encodings). These are clearly not ASCII, so that leaves several dozen candidates (that Python knows about). The most obvious place to start, outside Unicode, is latin-1.
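Here is a rough sketch of that cycling step, using only the standard library and Python 3 syntax. The helper name guess_encoding and the candidate list are my own inventions for illustration, not anything Python or Django provides. One caveat baked into the ordering: latin-1 maps every possible byte to a character, so it never fails and has to go last.

```python
# Hypothetical helper: return the first candidate encoding that decodes the
# bytes without raising an error. This proves the bytes are *decodable* with
# that encoding, not that the encoding is the right one.
CANDIDATES = ("ascii", "utf-8", "latin-1")  # latin-1 last: it never fails


def guess_encoding(raw, candidates=CANDIDATES):
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None  # nothing in the list could decode the bytes


print(guess_encoding(b"plain text"))
print(guess_encoding(b"caf\xe9"))
```

Treat the answer as a hint rather than proof: a file that decodes cleanly as latin-1 may still have been written in some other single-byte encoding, so it's worth eyeballing a few decoded lines.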
If you're like me, you didn't realize Excel on Windows exports tab-delimited files as latin-1 text (aka ISO-8859-1 aka ISO-Latin). Latin-1, FYI, is an obsolete character set. I'll give Microsoft a pass since support for it wasn't disbanded until mid-2004 and I'm working in Office 2003. I won't mention it's version SP2, released in September 2005, because that's just being ornery.
I should also note that this version of Excel does not support a Unicode export to tab-delimited text, only CSV (comma-separated values). This appears to have bitten many Internet denizens in the past (probably because Excel calls it, simply, "Unicode Text - *.txt"). These exports are UTF-16 files, too, so they won't act like ASCII, if that's what you were expecting.
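To see why UTF-16 won't act like ASCII, here is a small sketch in Python 3 using the standard codecs module. It assumes the file begins with a byte order mark (BOM), which is my assumption about the export and worth checking on a real file. The function name starts_with_utf16_bom is made up for this example.

```python
import codecs


def starts_with_utf16_bom(raw):
    """Return True if the bytes begin with a UTF-16 byte order mark."""
    # Note: a UTF-32 little-endian BOM begins with the same two bytes,
    # so this is a quick sniff, not a full detector.
    return raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# Python's "utf-16" codec writes a BOM followed by two bytes per character.
sample = u"a\tb".encode("utf-16")

print(starts_with_utf16_bom(sample))    # BOM present
print(starts_with_utf16_bom(b"a\tb"))   # plain ASCII bytes, no BOM
print(len(sample))                      # 2 BOM bytes + 2 bytes per character
```

Every ASCII character in the UTF-16 sample is padded with a zero byte, which is why a script expecting one byte per character chokes on it.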
Django encoding utils
Everything internal to Django is stored in Unicode. When model data is saved to the database back-end, Django takes care of everything encoding-wise.
But Django, just like you, cannot figure out how the data it's given is encoded unless it's told. If this is known (or otherwise discerned), translating to Unicode is super-easy with the built-in conversion functions. Specifically, the function you want is smart_unicode:
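from django.utils.encoding import smart_unicode

# Read the raw (still encoded) lines and convert each one to Unicode.
with open('file.txt') as source:
    for line in source:
        # With no encoding argument, smart_unicode assumes UTF-8.
        translated_data = smart_unicode(line)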
Easy enough, right? Almost. I said this hypothetical data was encoded as Latin-1, but smart_unicode, by default, expects the strings you give it to be encoded as UTF-8. If you give it plain ASCII, it will work because UTF-8 is backward-compatible with ASCII. But almost anything else will throw an exception as soon as it hits a byte sequence that isn't valid UTF-8.
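You can reproduce the failure without Django at all. This is a minimal sketch in Python 3 using plain bytes.decode; the sample strings are invented for the demonstration.

```python
ascii_bytes = b"cafe"
latin1_bytes = u"caf\xe9".encode("latin-1")  # "café": é is the single byte 0xE9

# Plain ASCII is already valid UTF-8, so this decodes without complaint.
decoded_ascii = ascii_bytes.decode("utf-8")

# In UTF-8, 0xE9 announces a three-byte sequence. Nothing follows it here,
# so the decoder gives up.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

print(decoded_ascii, failed)
```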
You fix it by telling smart_unicode what encoding you're giving it:
data = smart_unicode(line, encoding='latin-1')
data = smart_unicode(line, encoding='utf_16')
And so on.
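If you'd rather decode the whole file up front, the standard library can do it as the file is read. The following is only a sketch in Python 3 syntax: read_rows is a hypothetical helper, and the sample file it builds is made-up data standing in for a real export.

```python
import io
import os
import tempfile


def read_rows(path, encoding):
    """Decode a tab-delimited file and return its rows as lists of text."""
    with io.open(path, encoding=encoding) as handle:
        return [line.rstrip(u"\r\n").split(u"\t") for line in handle]


# Build a small latin-1 sample file so the sketch can run on its own.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"name\tcity\nJos\xe9\tM\xe1laga\n".encode("latin-1"))

rows = read_rows(sample_path, "latin-1")
os.remove(sample_path)

print(rows)
```

Every value in rows is already Unicode text, so it can go straight into model fields without a per-line conversion call.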