Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They’re implemented as functional indexes on
UPPER(name). Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS’s C library. There are a couple of issues with this:
- Your data is at the mercy of changes to this library (changes to glibc are, let’s face it, is a bit opaque), which is especially troublesome in when your unique indexes depend on it; You can end up being unable to import your data into a new database running a slightly different OS
- You have to pick a language (like
en_gb) to base the collation on. I’m not sure happens when you try and sort a truly international dataset like ours using a specific locale, but it certainly doesn’t feel right. There’s no way of implementing the default Unicode collation algorithm
Because of this, our Postgres database cluster is configured with a using the “C” locale and the UNICODE encoding. The “C” locale is a cop-out: it only covers the basic Latin characters, so if you try and do anything with non-basic-latin characters, it doesn’t work:
This is essentially why Last.fm scrobbles aren’t case-sensitive for languages other than plain English. We’re not planning on changing the way our constraints work on a DB level, it’s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be done on a higher level.
Global sorting on last.fm, such as you can find on your library page, is handled by a separate service which is aware of the default Unicode collation.
The Right Way
The collkey function provided by
pg_collkey will return a unique binary key representing the normalized version of text:
So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation: