संस्कृत-साङ्गणकाः Sanskrit Programmers

[ Culled from emails by shreevatsa and anubhav. ]

General tips
⤵️
↑

Detailed information
- You canread thisor the introductionsection hereto get a rough understanding of the issues involved, as I mentioned before onthis mailing list here.]
- Handling text correctly
- Getting Unicode right in Python
When programming in Python, always remain aware of whether a particular object is “unicode” (code points) or “str” (bytes).
- Basically, Unicode code points logically represent a character (like “092E: DEVANAGARI LETTER MA”), independent of encoding. Python contains both “unicode” objects that are these, or the C-like representation of the actual bytes used to represent the characters **in some encoding**.
- General info: This is like Java’s “string” and “bytes” types (see here).

Always use the “unicode” type internally. So:
- As soon as you see some input from the external world, decode it immediately [e.g. for a file, the stream of bytes it contains may represent a stream of characters in the ‘utf-8’ encoding, so decode from the file into ‘unicode’ characters whenever you read from it], and
To this end, I’ve taken to putting
“from __future__ import unicode_literals”
at the top of my Python programs, so that whenever I write a line of code like
s = ‘hello world’
it is equivalent to writing
s = u’hello world’
That is, so that all literals are interpreted as Unicode by default. This is the default in Python 3.

whenever you write something to output (even “standard output”), always encode it and write out the actual stream of bytes to the output, so that there can be no confusion.
By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode.

csv reading:
- use unicodecsv.reader instead of csv.
- Confusingly, it looks like unicodecsv.reader expects the file to be opened in the default encoding, not UTF-8.