<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Russ Garrett &#187; unicode</title>
	<atom:link href="http://russ.garrett.co.uk/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://russ.garrett.co.uk</link>
	<description></description>
	<lastBuildDate>Wed, 02 Jun 2010 21:20:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Unicode and Postgres</title>
		<link>http://russ.garrett.co.uk/2009/01/18/unicode-postgres/</link>
		<comments>http://russ.garrett.co.uk/2009/01/18/unicode-postgres/#comments</comments>
		<pubDate>Sun, 18 Jan 2009 21:06:53 +0000</pubDate>
		<dc:creator>Russ</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[normalization]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[unicode]]></category>

		<guid isPermaLink="false">http://russ.garrett.co.uk/?p=24</guid>
		<description><![CDATA[Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They&#8217;re implemented as functional indexes on UPPER(name). Postgres is capable of being configured with Unicode locales, however this effectively offloads the normalization/collation decisions to the OS&#8217;s C library. There are a couple of [...]]]></description>
			<content:encoded><![CDATA[<p>Due to the way our database is set up, Last.fm has some fairly huge case-insensitive unique keys on text columns (artist, album, track, etc). They&#8217;re implemented as functional indexes on <code>UPPER(name)</code>. Postgres can be configured with Unicode locales, but this effectively offloads the normalization and collation decisions to the OS&#8217;s C library. There are a couple of issues with this:</p>
<ul>
<li>Your data is at the mercy of changes to this library (changes to glibc are, let&#8217;s face it, a bit opaque), which is especially troublesome when your unique indexes depend on it. You can end up unable to import your data into a new database running a slightly different OS.</li>
<li>You have to pick a language (like en_GB) to base the collation on. I&#8217;m not sure what happens when you try to sort a truly international dataset like ours using a specific locale, but it certainly doesn&#8217;t feel right. There&#8217;s no way of using the <a href="http://en.wikipedia.org/wiki/Unicode_collation_algorithm">default Unicode collation algorithm</a>.</li>
</ul>
<p>Because of this, our Postgres database cluster is configured with the &#8220;C&#8221; locale and the UNICODE encoding. The &#8220;C&#8221; locale is a cop-out: it only covers the basic Latin characters, so case conversion leaves anything outside that range untouched:</p>
<p><code><br />
db=# SELECT UPPER('Café');<br />
upper<br />
-------<br />
CAFé<br />
</code></p>
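<p>To make the failure concrete, here is a minimal Python sketch of ASCII-only uppercasing, which is roughly the behaviour shown above. The function name <code>c_locale_upper</code> is made up for illustration, and this only approximates what the &#8220;C&#8221; locale does; it is not Postgres&#8217;s actual implementation:</p>

```python
import string


def c_locale_upper(text):
    """Uppercase only ASCII a-z; every other character is left untouched.

    Hypothetical helper that imitates UPPER() under the "C" locale.
    """
    return "".join(
        ch.upper() if ch in string.ascii_lowercase else ch
        for ch in text
    )


def same_index_key(a, b):
    """True if two names would collide in a unique index on the uppercased value."""
    return c_locale_upper(a) == c_locale_upper(b)
```

<p>Here <code>c_locale_upper('Café')</code> gives <code>CAFé</code>, so <code>same_index_key('cafe', 'CAFE')</code> is true while <code>same_index_key('café', 'CAFÉ')</code> is false: the two accented spellings produce different keys and both get past the unique index.</p>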
<p>This is essentially why Last.fm scrobbles aren&#8217;t case-insensitive for languages other than plain English. We&#8217;re not planning on changing the way our constraints work at the DB level; it&#8217;s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be done at a higher level.</p>
<p>Global sorting on Last.fm, such as you can find on your <a href="http://www.last.fm/user/Russ/library">library page</a>, is handled by a separate service which is aware of the default Unicode collation.</p>
<h1>The Right Way</h1>
<p>If I were designing the Last.fm DB from scratch today, I&#8217;d use the <a href="http://www.flexiguided.de/publications.pgcollkey.en.html">pg_collkey</a> Unicode collation functions for Postgres, which let you interface with the <a href="http://www.ibm.com/software/globalization/icu/">ICU</a> libraries for Unicode.</p>
<p>The <code>collkey</code> function provided by pg_collkey returns a binary key representing the normalized version of the text. Strings that differ only in ways the chosen settings ignore get the same key:</p>
<p><code><br />
db=# SELECT collkey('Café', 'root', true, 1, true);<br />
collkey<br />
---------<br />
-)31<br />
(1 row)<br />
db=# SELECT collkey('Cafe', 'root', true, 1, true);<br />
collkey<br />
---------<br />
-)31<br />
(1 row)<br />
</code></p>
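<p>The idea behind such a key can be sketched in a few lines of Python. This is only a rough stand-in, assuming that stripping combining marks and case-folding is close enough to show the principle; it is not the ICU algorithm, it does not produce the same bytes as <code>collkey</code>, and the name <code>approx_primary_key</code> is made up for illustration:</p>

```python
import unicodedata


def approx_primary_key(text):
    """Rough approximation of an accent- and case-insensitive key (not ICU).

    Hypothetical helper: decomposes characters, drops combining marks
    (accents), then case-folds what is left.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()
```

<p>With this sketch, <code>approx_primary_key('Café')</code> and <code>approx_primary_key('Cafe')</code> both come out as <code>cafe</code>, which mirrors the two identical keys in the query output above. Unlike the real thing, it makes no attempt at language-aware ordering.</p>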
<p>So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation:<br />
<code lang="sql"><br />
CREATE UNIQUE INDEX table_collkey ON table(collkey(column, 'root', true, 1, true));<br />
</code></p>
]]></content:encoded>
			<wfw:commentRss>http://russ.garrett.co.uk/2009/01/18/unicode-postgres/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
