Archive for the ‘Databases’ Category
* Aggregating boolean arrays in Postgres
Posted on August 26th, 2009. Filed under Databases.
We use arrays of booleans – a non-standard Postgres feature – quite frequently for storing permissions and similar data. It’s an elegant way to denormalize potentially scary schemas, as long as your client libraries support them.
We needed a function to count the number of true values across a whole table’s worth of boolean arrays – basically a histogram of how many true values in each array field. I thought it would be useful to share because I use PL/PgSQL so occasionally I always forget how to write it, and examples are useful.
Here’s the code:
CREATE OR REPLACE FUNCTION boolean_array_count(INTEGER[], BOOLEAN[]) RETURNS integer[] AS $BODY$ DECLARE r INTEGER[]; size INTEGER; BEGIN IF $2 IS NULL THEN RETURN $1; END IF; size := max(coalesce(array_upper($1, 1), 1), array_upper($2, 1)); FOR i IN 1..size LOOP IF $2[i] = true THEN r[i] = coalesce($1[i], 0) + 1; ELSE r[i] = coalesce($1[i], 0); END IF; END LOOP; RETURN r; END; $BODY$ LANGUAGE 'plpgsql' VOLATILE; CREATE AGGREGATE boolean_array_count (BOOLEAN[]) ( SFUNC = boolean_array_count, STYPE = integer[] );
You also need an implementation of the inexplicably missing max(integer, integer) function:
CREATE OR REPLACE FUNCTION max(integer, integer) RETURNS integer AS $BODY$ BEGIN IF $1 > $2 THEN RETURN $1; END IF; RETURN $2; END; $BODY$ LANGUAGE 'plpgsql' IMMUTABLE STRICT SECURITY DEFINER;
* Unicode and Postgres
Posted on January 18th, 2009. Filed under Databases.
Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They’re implemented as functional indexes on UPPER(name). Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS’s C library. There are a couple of issues with this:
- Your data is at the mercy of changes to this library (changes to glibc are, let’s face it, is a bit opaque), which is especially troublesome in when your unique indexes depend on it; You can end up being unable to import your data into a new database running a slightly different OS
- You have to pick a language (like en_gb) to base the collation on. I’m not sure happens when you try and sort a truly international dataset like ours using a specific locale, but it certainly doesn’t feel right. There’s no way of implementing the default Unicode collation algorithm
Because of this, our Postgres database cluster is configured with a using the “C” locale and the UNICODE encoding. The “C” locale is a cop-out: it only covers the basic Latin characters, so if you try and do anything with non-basic-latin characters, it doesn’t work:
db=# SELECT UPPER('Café');
upper
-------
CAFé
This is essentially why Last.fm scrobbles aren’t case-sensitive for languages other than plain English. We’re not planning on changing the way our constraints work on a DB level, it’s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be done on a higher level.
Global sorting on last.fm, such as you can find on your library page, is handled by a separate service which is aware of the default Unicode collation.
The Right Way
If I were designing the Last.fm DB from scratch today, I’d use the pg_collkey Unicode Collation functions for Postgres, which lets you interface with the ICU libraries for Unicode.
The collkey function provided by pg_collkey will return a unique binary key representing the normalized version of text:
db=# SELECT collkey('Café', 'root', true, 1, true);
collkey
---------
-)31
(1 row)
db=# SELECT collkey('Cafe', 'root', true, 1, true);
collkey
---------
-)31
(1 row)
So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation:
CREATE UNIQUE INDEX table_collkey ON table(collkey(column, 'root', true, 1, true));