Archive for the ‘Databases’ Category

* Aggregating boolean arrays in Postgres

Posted on August 26th, 2009. Filed under Databases.


We use arrays of booleans – a non-standard Postgres feature – quite frequently for storing permissions and similar data. It’s an elegant way to denormalize potentially scary schemas, as long as your client libraries support them.

We needed a function to count the number of true values across a whole table’s worth of boolean arrays – basically a histogram of how many true values in each array field. I thought it would be useful to share because I use PL/PgSQL so occasionally I always forget how to write it, and examples are useful.

Here’s the code:

CREATE OR REPLACE FUNCTION boolean_array_count(INTEGER[], BOOLEAN[]) 
RETURNS integer[] AS
$BODY$
DECLARE
	r INTEGER[];
	size INTEGER;
BEGIN
	IF $2 IS NULL THEN
		RETURN $1;
	END IF;
	size := max(coalesce(array_upper($1, 1), 1), array_upper($2, 1));
	FOR i IN 1..size LOOP
		IF $2[i] = true THEN
			r[i] = coalesce($1[i], 0) + 1;
		ELSE
			r[i] = coalesce($1[i], 0);
		END IF;
	END LOOP;
	RETURN r;
END;
$BODY$ 
LANGUAGE 'plpgsql' VOLATILE;
 
CREATE AGGREGATE boolean_array_count (BOOLEAN[]) (
	SFUNC = boolean_array_count,
	STYPE = integer[]
);

You also need an implementation of the inexplicably missing max(integer, integer) function:

CREATE OR REPLACE FUNCTION max(integer, integer) RETURNS integer AS
$BODY$
BEGIN
	IF $1 > $2 THEN
		RETURN $1;
	END IF;
	RETURN $2;
END;
$BODY$ LANGUAGE 'plpgsql' IMMUTABLE STRICT SECURITY DEFINER;


* Unicode and Postgres

Posted on January 18th, 2009. Filed under Databases.


Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They’re implemented as functional indexes on UPPER(name). Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS’s C library. There are a couple of issues with this:

  • Your data is at the mercy of changes to this library (changes to glibc are, let’s face it, is a bit opaque), which is especially troublesome in when your unique indexes depend on it; You can end up being unable to import your data into a new database running a slightly different OS
  • You have to pick a language (like en_gb) to base the collation on. I’m not sure happens when you try and sort a truly international dataset like ours using a specific locale, but it certainly doesn’t feel right. There’s no way of implementing the default Unicode collation algorithm

Because of this, our Postgres database cluster is configured with a using the “C” locale and the UNICODE encoding. The “C” locale is a cop-out: it only covers the basic Latin characters, so if you try and do anything with non-basic-latin characters, it doesn’t work:


db=# SELECT UPPER('Café');
upper
-------
CAFé

This is essentially why Last.fm scrobbles aren’t case-sensitive for languages other than plain English. We’re not planning on changing the way our constraints work on a DB level, it’s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be done on a higher level.

Global sorting on last.fm, such as you can find on your library page, is handled by a separate service which is aware of the default Unicode collation.

The Right Way

If I were designing the Last.fm DB from scratch today, I’d use the pg_collkey Unicode Collation functions for Postgres, which lets you interface with the ICU libraries for Unicode.

The collkey function provided by pg_collkey will return a unique binary key representing the normalized version of text:


db=# SELECT collkey('Café', 'root', true, 1, true);
collkey
---------
-)31
(1 row)
db=# SELECT collkey('Cafe', 'root', true, 1, true);
collkey
---------
-)31
(1 row)

So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation:

CREATE UNIQUE INDEX table_collkey ON table(collkey(column, 'root', true, 1, true));



About Me

I build infrastructure.

I currently work for Smarkets as Head of Tech Operations. Before that I worked at Last.fm. I also co-founded the London Hackspace.

I live in London and sometimes moonlight as a freelance photographer.

Links

Projects