[ Culled from emails by shreevatsa and anubhav. ]
General tips
- Detailed information
- You can read this or the introduction section here to get a rough understanding of the issues involved, as I mentioned before on this mailing list here.
- Handling text correctly
- Getting Unicode right in Python
- When programming in Python, always remain aware of whether a particular object is “unicode” (code points) or “str” (bytes).
- Basically, a Unicode code point logically represents a character (like “092E: DEVANAGARI LETTER MA”), independent of any encoding. Python has both “unicode” objects, which are sequences of such code points, and “str” objects, which are the C-like representation of the actual bytes used to represent the characters **in some encoding**.
- General info: This is like the distinction between Java’s “String” and “byte[]” types (see here).
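To make the code point versus bytes distinction concrete, here is a minimal sketch using the DEVANAGARI LETTER MA example from above. The variable names are mine, and the explicit u'' prefix is only there so the snippet reads the same in Python 2.7 and Python 3 (where the two types are called “str” and “bytes” rather than “unicode” and “str”).

```python
# One code point: U+092E DEVANAGARI LETTER MA.
text = u'\u092e'

# The same character as actual bytes, in two different encodings.
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')

print(len(text))         # 1 (code points)
print(len(utf8_bytes))   # 3 (bytes in UTF-8)
print(len(utf16_bytes))  # 2 (bytes in UTF-16, little-endian, no BOM)

# Decoding with the matching encoding recovers the original code point.
assert utf8_bytes.decode('utf-8') == text
assert utf16_bytes.decode('utf-16-le') == text
```

The character is the same in both cases; only the byte representation depends on the encoding.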
Use unicode internally
- Always use the “unicode” type internally. So:
- As soon as you see some input from the external world, decode it immediately. For example, the stream of bytes in a file may represent a stream of characters in the ‘utf-8’ encoding, so decode from the file into ‘unicode’ characters whenever you read from it.
- To this end, I’ve taken to putting
    from __future__ import unicode_literals
at the top of my Python programs, so that whenever I write a line of code like
    s = 'hello world'
it is equivalent to writing
    s = u'hello world'
That is, all string literals are interpreted as Unicode by default. This is already the behavior in Python 3.
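The “decode as soon as input arrives” rule can be sketched roughly as follows. The helper name read_text and the throwaway temporary file are purely illustrative, and UTF-8 is an assumption about the input; use whatever encoding the file is actually in.

```python
import io
import os
import tempfile

def read_text(path, encoding='utf-8'):
    """Read a file as bytes and decode at once, returning text (code points)."""
    with io.open(path, 'rb') as f:
        raw = f.read()            # bytes from the outside world
    return raw.decode(encoding)   # text from here on

# Demonstration with a throwaway file holding two UTF-8 encoded characters.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(u'\u092e\u0928'.encode('utf-8'))
    content = read_text(path)
finally:
    os.remove(path)

print(len(content))  # counts code points, not bytes
```

Everything downstream of read_text then only ever sees text, so the encoding question is settled in exactly one place.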
Outputting stuff
- Whenever you write something to output (even “standard output”), always encode it explicitly and write out the actual stream of bytes, so that there can be no confusion about the encoding.
- By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode.
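Here is a small sketch of encoding explicitly on the way out. The helper write_text is a made-up name, and UTF-8 is just an assumed choice of output encoding. The demonstration writes to an in-memory byte buffer so that it behaves the same everywhere.

```python
import io

def write_text(byte_stream, text, encoding='utf-8'):
    """Encode text explicitly and write the resulting bytes to a binary stream."""
    byte_stream.write(text.encode(encoding))

# Any binary stream works; an in-memory buffer is used here for illustration.
buf = io.BytesIO()
write_text(buf, u'\u092e\n')

# For standard output, pass the byte-level stream instead:
#   Python 3: write_text(sys.stdout.buffer, u'\u092e\n')
#   Python 2: write_text(sys.stdout, u'\u092e\n')
```

Because the program emits bytes in a known encoding, the result does not depend on what Python guesses about the terminal or pipe it is attached to.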
File input
- csv reading:
- Use unicodecsv.reader instead of csv.reader.
- Confusingly, it looks like unicodecsv.reader expects the file to be opened in the default way (as plain bytes), not through a UTF-8 decoding wrapper; it appears to do the decoding itself.
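For comparison, here is the same “bytes in, text rows out” idea sketched with only the standard library. Note that this is not unicodecsv: it assumes Python 3, where the built-in csv module works on text directly, and the helper name read_csv_rows and the sample data are made up for illustration.

```python
import csv
import io

def read_csv_rows(byte_stream, encoding='utf-8'):
    """Decode a binary stream and parse it as CSV, returning rows of text."""
    # newline='' lets the csv module handle line endings itself.
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding, newline='')
    return list(csv.reader(text_stream))

# Sample input: UTF-8 encoded bytes, as they would come from a file opened in 'rb' mode.
data = u'name,letter\nma,\u092e\n'.encode('utf-8')
rows = read_csv_rows(io.BytesIO(data))

print(len(rows))  # number of parsed rows, including the header
```

The decoding step is stated explicitly and happens once, right where the bytes enter the program.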
Grapheme cluster handling
- See here.