UTF8 follies

April 8, 2011

Character encoding issues are shortening my life expectancy. It’s been a few years since I dealt with them regularly, so they bite me from time to time.

Recently I puzzled over a set of strings that were showing up double-encoded in my Postgres database. These strings were encoded from incoming Latin-1 to UTF8, travelled around the program with their utf8 flags set, and went through JSON->decode and encode multiple times without apparent harm, but once in the database they had the telltale double-encoding gibberish characters. E.g.:

use Encode;
# note: no "use utf8" here, so Perl reads the literal below as
# raw UTF-8 bytes and treats each byte as a Latin-1 character
my $utf8_string = "Télévision extrême à domicile";
print encode('utf8', $utf8_string);
# becomes TÃ©lÃ©vision extrÃªme Ã domicile

I spent an absurdly long time capturing strings and dumping them during execution:

say qx{echo $str | od -c }

od being a handy Unix utility I became very familiar with in my days of converting bibliographic data.
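For what it's worth, here is a minimal sketch of what od -c reveals (assuming a UTF-8 terminal; the exact od output layout may vary slightly by platform). A correctly encoded é is the two octal bytes 303 251; a double-encoded one balloons into four, 303 203 302 251:

$ perl -e 'print "Télévision"' | od -c
0000000   T 303 251   l 303 251   v   i   s   i   o   n
0000014
$ perl -e 'print "TÃ©lÃ©vision"' | od -c
0000000   T 303 203 302 251   l 303 203 302 251   v   i   s   i   o   n
0000020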

Turns out I was focusing on the wrong leads. These strings became values in a hash which was run through JSON->encode before saving to db. What prompted the re-encoding was not the values of the hash, but the keys. These were string literals in my program which were not marked as UTF8. Perl looked at my keys, looked at my values, saw a mixed bag, and decided to run the whole thing through the shredder again, just to be on the safe side.
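Here is a minimal sketch of the trap, using the core JSON::PP module for illustration (JSON::XS behaves the same way here) and assuming the source file is saved as UTF-8:

use strict;
use warnings;
use JSON::PP;
# deliberately no "use utf8": the key literal below is stored as
# raw UTF-8 bytes, which Perl treats as Latin-1 characters

my %row = ( "télé" => "ok" );    # key is the bytes 74 C3 A9 6C C3 A9

# ->utf8 makes encode() return UTF-8 octets; the key gets upgraded
# byte-by-byte as if it were Latin-1, then encoded all over again
print JSON::PP->new->utf8->encode( \%row ), "\n";
# {"tÃ©lÃ©":"ok"}   <- the key is now double-encoded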

The solution was simple:
use utf8;

Because my keys were source-code string literals, I wanted my source code to be read as UTF-8. use utf8 ensures that: the pragma tells Perl the source itself is UTF-8, so string literals come out as properly flagged character strings, just like my values.
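The same sketch as above with the pragma in place: the key literal is now decoded at compile time into a character string and survives the trip to JSON intact:

use strict;
use warnings;
use utf8;       # the source itself is UTF-8; literals become character strings
use JSON::PP;

my %row = ( "télé" => "ok" );    # key now carries the utf8 flag

print JSON::PP->new->utf8->encode( \%row ), "\n";
# {"télé":"ok"}   <- encoded exactly once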
