Unicode and Postgres — Russ Garrett

Sunday 18^th January, 2009

Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They’re implemented as functional indexes on UPPER(name). Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS’s C library. There are a couple of issues with this:

Your data is at the mercy of changes to this library (changes to glibc are, let’s face it, is a bit opaque), which is especially troublesome in when your unique indexes depend on it; You can end up being unable to import your data into a new database running a slightly different OS
You have to pick a language (like en_gb) to base the collation on. I’m not sure happens when you try and sort a truly international dataset like ours using a specific locale, but it certainly doesn’t feel right. There’s no way of implementing the default Unicode collation algorithm

Because of this, our Postgres database cluster is configured with a using the “C” locale and the UNICODE encoding. The “C” locale is a cop-out: it only covers the basic Latin characters, so if you try and do anything with non-basic-latin characters, it doesn’t work:

    db=# SELECT UPPER('Café');
    upper
    -------
    CAFé

This is essentially why Last.fm scrobbles aren’t case-sensitive for languages other than plain English. We’re not planning on changing the way our constraints work on a DB level, it’s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be done on a higher level.

Global sorting on last.fm, such as you can find on your library page, is handled by a separate service which is aware of the default Unicode collation.

The Right Way

If I were designing the Last.fm DB from scratch today, I’d use the pg_collkey Unicode Collation functions for Postgres, which lets you interface with the ICU libraries for Unicode.

The collkey function provided by pg_collkey will return a unique binary key representing the normalized version of text:

db=# SELECT collkey('Café', 'root', true, 1, true);
collkey
---------
-)31
(1 row)
db=# SELECT collkey('Cafe', 'root', true, 1, true);
collkey
---------
-)31
(1 row)

So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation:

CREATE UNIQUE INDEX table_collkey ON table(collkey(column, 'root', true, 1, true));