<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Russ Garrett &#187; Databases</title>
	<atom:link href="http://russ.garrett.co.uk/category/databases/feed/" rel="self" type="application/rss+xml" />
	<link>http://russ.garrett.co.uk</link>
	<description></description>
	<lastBuildDate>Wed, 02 Jun 2010 21:20:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Aggregating boolean arrays in Postgres</title>
		<link>http://russ.garrett.co.uk/2009/08/26/aggregate-boolean-arrays-in-postgres/</link>
		<comments>http://russ.garrett.co.uk/2009/08/26/aggregate-boolean-arrays-in-postgres/#comments</comments>
		<pubDate>Wed, 26 Aug 2009 11:30:54 +0000</pubDate>
		<dc:creator>Russ</dc:creator>
				<category><![CDATA[Databases]]></category>

		<guid isPermaLink="false">http://russ.garrett.co.uk/?p=83</guid>
		<description><![CDATA[We use arrays of booleans &#8211; a non-standard Postgres feature &#8211; quite frequently for storing permissions and similar data. It&#8217;s an elegant way to denormalize potentially scary schemas, as long as your client libraries support them. We needed a function to count the number of true values across a whole table&#8217;s worth of boolean arrays [...]]]></description>
			<content:encoded><![CDATA[<p>We use arrays of booleans &#8211; a non-standard Postgres feature &#8211; quite frequently for storing permissions and similar data. It&#8217;s an elegant way to denormalize potentially scary schemas, as long as your client libraries support them.</p>
<p>We needed a function to count the number of true values across a whole table&#8217;s worth of boolean arrays: basically a histogram of how many rows have a true value at each array position. I thought it would be useful to share because I use PL/pgSQL so rarely that I always forget how to write it, and examples are useful.</p>
<p>Here&#8217;s the code:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">OR</span> <span style="color: #993333; font-weight: bold;">REPLACE</span> <span style="color: #993333; font-weight: bold;">FUNCTION</span> boolean_array_count<span style="color: #66cc66;">&#40;</span>INTEGER<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">BOOLEAN</span><span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> 
RETURNS integer<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #993333; font-weight: bold;">AS</span>
$BODY$
DECLARE
	r INTEGER<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span>;
	size INTEGER;
BEGIN
	<span style="color: #993333; font-weight: bold;">IF</span> $<span style="color: #cc66cc;">2</span> <span style="color: #993333; font-weight: bold;">IS</span> <span style="color: #993333; font-weight: bold;">NULL</span> THEN
		<span style="color: #993333; font-weight: bold;">RETURN</span> $<span style="color: #cc66cc;">1</span>;
	END <span style="color: #993333; font-weight: bold;">IF</span>;
	size :<span style="color: #66cc66;">=</span> max<span style="color: #66cc66;">&#40;</span>coalesce<span style="color: #66cc66;">&#40;</span>array_upper<span style="color: #66cc66;">&#40;</span>$<span style="color: #cc66cc;">1</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">,</span> array_upper<span style="color: #66cc66;">&#40;</span>$<span style="color: #cc66cc;">2</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
	<span style="color: #993333; font-weight: bold;">FOR</span> i <span style="color: #993333; font-weight: bold;">IN</span> 1<span style="color: #66cc66;">..</span>size LOOP
		<span style="color: #993333; font-weight: bold;">IF</span> $<span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> true THEN
			r<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> coalesce<span style="color: #66cc66;">&#40;</span>$<span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">+</span> <span style="color: #cc66cc;">1</span>;
		ELSE
			r<span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> coalesce<span style="color: #66cc66;">&#40;</span>$<span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#41;</span>;
		END <span style="color: #993333; font-weight: bold;">IF</span>;
	END LOOP;
	<span style="color: #993333; font-weight: bold;">RETURN</span> r;
END;
$BODY$ 
<span style="color: #993333; font-weight: bold;">LANGUAGE</span> <span style="color: #ff0000;">'plpgsql'</span> VOLATILE;
&nbsp;
<span style="color: #993333; font-weight: bold;">CREATE</span> AGGREGATE boolean_array_count <span style="color: #66cc66;">&#40;</span><span style="color: #993333; font-weight: bold;">BOOLEAN</span><span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#40;</span>
	SFUNC <span style="color: #66cc66;">=</span> boolean_array_count<span style="color: #66cc66;">,</span>
	STYPE <span style="color: #66cc66;">=</span> integer<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span>
<span style="color: #66cc66;">&#41;</span>;</pre></div></div>

<p>You also need a two-argument max(integer, integer) function, because the built-in max is an aggregate. (The built-in <code>GREATEST</code> would do the same job.)</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">OR</span> <span style="color: #993333; font-weight: bold;">REPLACE</span> <span style="color: #993333; font-weight: bold;">FUNCTION</span> max<span style="color: #66cc66;">&#40;</span>integer<span style="color: #66cc66;">,</span> integer<span style="color: #66cc66;">&#41;</span> RETURNS integer <span style="color: #993333; font-weight: bold;">AS</span>
$BODY$
BEGIN
	<span style="color: #993333; font-weight: bold;">IF</span> $<span style="color: #cc66cc;">1</span> <span style="color: #66cc66;">&gt;</span> $<span style="color: #cc66cc;">2</span> THEN
		<span style="color: #993333; font-weight: bold;">RETURN</span> $<span style="color: #cc66cc;">1</span>;
	END <span style="color: #993333; font-weight: bold;">IF</span>;
	<span style="color: #993333; font-weight: bold;">RETURN</span> $<span style="color: #cc66cc;">2</span>;
END;
$BODY$ <span style="color: #993333; font-weight: bold;">LANGUAGE</span> <span style="color: #ff0000;">'plpgsql'</span> IMMUTABLE STRICT SECURITY DEFINER;</pre></div></div>
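
<p>As a usage sketch, assuming both definitions above are installed: the <code>user_permissions</code> table and its rows are made up for illustration and are not part of the original schema.</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">-- Hypothetical table: one boolean[] of permission flags per user
CREATE TABLE user_permissions (
	user_id INTEGER PRIMARY KEY,
	flags   BOOLEAN[]
);
&nbsp;
INSERT INTO user_permissions VALUES (1, '{t,f,t}');
INSERT INTO user_permissions VALUES (2, '{t,t,f}');
INSERT INTO user_permissions VALUES (3, '{f,f,t}');
&nbsp;
-- One counter per array position, summed over every row
SELECT boolean_array_count(flags) FROM user_permissions;</pre></div></div>

<p>Working through the state function by hand, position 1 is true in two rows, position 2 in one and position 3 in two, so the result should come out as <code>{2,1,2}</code>.</p>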

]]></content:encoded>
			<wfw:commentRss>http://russ.garrett.co.uk/2009/08/26/aggregate-boolean-arrays-in-postgres/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode and Postgres</title>
		<link>http://russ.garrett.co.uk/2009/01/18/unicode-postgres/</link>
		<comments>http://russ.garrett.co.uk/2009/01/18/unicode-postgres/#comments</comments>
		<pubDate>Sun, 18 Jan 2009 21:06:53 +0000</pubDate>
		<dc:creator>Russ</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[normalization]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[unicode]]></category>

		<guid isPermaLink="false">http://russ.garrett.co.uk/?p=24</guid>
		<description><![CDATA[Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They&#8217;re implemented as functional indexes on UPPER(name). Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS&#8217;s C library. There are a couple of [...]]]></description>
			<content:encoded><![CDATA[<p>Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They&#8217;re implemented as functional indexes on <code>UPPER(name)</code>. Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS&#8217;s C library. There are a couple of issues with this:</p>
<ul>
<li>Your data is at the mercy of changes to this library (changes to glibc are, let&#8217;s face it, a bit opaque), which is especially troublesome when your unique indexes depend on it: you can end up being unable to import your data into a new database running a slightly different OS.</li>
<li>You have to pick a language (like en_GB) to base the collation on. I&#8217;m not sure what happens when you try to sort a truly international dataset like ours using a specific locale, but it certainly doesn&#8217;t feel right. There&#8217;s no way of implementing the <a href="http://en.wikipedia.org/wiki/Unicode_collation_algorithm">default Unicode collation algorithm</a>.</li>
</ul>
<p>Because of this, our Postgres database cluster is configured using the &#8220;C&#8221; locale and the UNICODE encoding. The &#8220;C&#8221; locale is a cop-out: it only covers the basic Latin characters, so if you try to do anything with characters outside basic Latin, it doesn&#8217;t work:</p>
<p><code><br />
db=# SELECT UPPER('Café');<br />
upper<br />
-------<br />
CAFé<br />
</code></p>
<p>This is essentially why Last.fm scrobbles aren&#8217;t case-insensitive for languages other than plain English. We&#8217;re not planning on changing the way our constraints work at the DB level: it&#8217;s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be made at a higher level.</p>
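<p>A sketch of how a unique index on <code>UPPER(name)</code> behaves in this setup. The <code>artist_upper_demo</code> table is hypothetical, and the outcomes noted in the comments assume the &#8220;C&#8221; locale behaviour shown above:<br />
<code lang="sql"><br />
CREATE TABLE artist_upper_demo (name text NOT NULL);<br />
CREATE UNIQUE INDEX artist_upper_demo_key ON artist_upper_demo (UPPER(name));<br />
INSERT INTO artist_upper_demo VALUES ('Cafe');<br />
INSERT INTO artist_upper_demo VALUES ('CAFE'); -- rejected: both upper-case to CAFE<br />
INSERT INTO artist_upper_demo VALUES ('Café');<br />
INSERT INTO artist_upper_demo VALUES ('CAFÉ'); -- accepted: UPPER leaves é and É untouched, so the keys differ<br />
</code></p>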
<p>Global sorting on Last.fm, such as you can find on your <a href="http://www.last.fm/user/Russ/library">library page</a>, is handled by a separate service which is aware of the default Unicode collation.</p>
<h1>The Right Way</h1>
<p>If I were designing the Last.fm DB from scratch today, I&#8217;d use the <a href="http://www.flexiguided.de/publications.pgcollkey.en.html">pg_collkey</a> Unicode Collation functions for Postgres, which let you interface with the <a href="http://www.ibm.com/software/globalization/icu/">ICU</a> libraries for Unicode.</p>
<p>The collkey function provided by pg_collkey will return a unique binary key representing the normalized version of text:</p>
<p><code><br />
db=# SELECT collkey('Café', 'root', true, 1, true);<br />
collkey<br />
---------<br />
-)31<br />
(1 row)<br />
db=# SELECT collkey('Cafe', 'root', true, 1, true);<br />
collkey<br />
---------<br />
-)31<br />
(1 row)<br />
</code></p>
<p>So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation:<br />
<code lang="sql"><br />
CREATE UNIQUE INDEX table_collkey ON table(collkey(column, 'root', true, 1, true));<br />
</code></p>
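<p>A sketch of the resulting behaviour, assuming pg_collkey is installed. The <code>artist_collkey_demo</code> table is hypothetical, and the comment relies on the two collkey results shown above being equal:<br />
<code lang="sql"><br />
CREATE TABLE artist_collkey_demo (name text NOT NULL);<br />
CREATE UNIQUE INDEX artist_collkey_demo_key ON artist_collkey_demo (collkey(name, 'root', true, 1, true));<br />
INSERT INTO artist_collkey_demo VALUES ('Café');<br />
INSERT INTO artist_collkey_demo VALUES ('Cafe'); -- should be rejected: same collation key as 'Café'<br />
</code></p>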
]]></content:encoded>
			<wfw:commentRss>http://russ.garrett.co.uk/2009/01/18/unicode-postgres/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
