Datacenter Security: A Cautionary Tale

Posted on March 12th, 2009. Filed under Systems Admin.

Last.fm has a presence in three datacenters in the London area – this is currently more because of necessity than redundancy; datacenter space is at a real premium in the UK and we can’t fulfill our power growth needs at just one site. We operate our own 20 gigabit fibre ring between these, which means we can basically treat all of our London presence as one site from a latency perspective. We don’t yet have enough redundancy to tolerate the complete loss of a site with no effect on the service (but we’re getting there).

Our largest (provisioned) site is at the Level3 facility in Braham Street, near Aldgate. It has fairly standard security, with the appropriate number of security gimmicks; I enter using a proximity card and PIN, take the lift up to our floor, and then get onto the floor using my card and a hand scan.

At 4am on Monday of this week, the Braham Street facility suffered a break-in. Three men managed to batter down an external fire escape door, made their way to our floor, and broke down the door to the data floor. They then proceeded to try to break into a suite. They failed to get into one suite, leaving the door pretty mangled, and so they moved on to the one next door. Our suite.

One of our routers (please don’t steal it)

The robbers succeeded in breaking our door down. They then made their way to the back of our suite, and picked out a very specific rack: the one which holds our core router for that site, a Cisco 6500-series machine. I think they probably knew what they were looking for; these routers contain several cards which are probably the most lightweight, valuable items we have. These are popular things to steal.

The only thing which stopped the entirety of Last.fm going down on Monday morning was the robbers not spotting that the door to that rack was unlocked. They had started to crowbar the door off when site security and the police apprehended them. They were taken to court the following day and pled guilty, sentencing to follow.

That was a bit too close for comfort.

The scary thing is that this isn’t the first time Braham Street has been burgled. In 2006 the thieves were successful and managed to steal cards out of Level3’s own routers, bringing down part of their London network. Also in 2006, thieves simply walked into the Easynet (formerly Interxion) facility in Brick Lane and loaded up a van with £6m worth of kit. More recently, there have been two burglaries at BT telephone exchanges in the London area, where the thieves also came out with a tidy number of cards from the routers powering their new 21st century network.

I guess the moral of this story is that even if you think you’re resilient against everything, are you resilient against thieves walking in and stealing bits of your network?


Unicode and Postgres

Posted on January 18th, 2009. Filed under Databases.

Due to the way our database is set up, Last.fm has some fairly huge case-insensitive text unique keys (artist, album, track, etc). They’re implemented as functional indexes on UPPER(name). Postgres is capable of being configured with Unicode locales; however, this effectively offloads the normalization/collation decisions to the OS’s C library. There are a couple of issues with this:

  • Your data is at the mercy of changes to this library (changes to glibc are, let’s face it, a bit opaque), which is especially troublesome when your unique indexes depend on it. You can end up being unable to import your data into a new database running a slightly different OS.
  • You have to pick a language (like en_GB) to base the collation on. I’m not sure what happens when you try to sort a truly international dataset like ours using a specific locale, but it certainly doesn’t feel right. There’s no way of implementing the default Unicode collation algorithm.

Because of this, our Postgres database cluster is configured using the “C” locale and the UNICODE encoding. The “C” locale is a cop-out: it only covers the basic Latin characters, so if you try to do anything with non-basic-Latin characters, it doesn’t work:


db=# SELECT UPPER('Café');
upper
-------
CAFé

This is essentially why Last.fm scrobbles aren’t case-insensitive for languages other than plain English. We’re not planning on changing the way our constraints work at the DB level; it’s too tricky to do when you have a table with hundreds of millions of existing strings to de-duplicate. Any changes to the case sensitivity of scrobbles in the future will be done at a higher level.
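
A rough way to see the same behaviour outside the database: Python’s bytes.upper() only touches ASCII letters, much like UPPER under the “C” locale, while str.upper() applies the full Unicode case mapping. This is only an analogy for illustration (the helper names are made up), not a description of what Postgres does internally.

```python
# Analogy only: bytes.upper() uppercases ASCII letters and leaves every other
# byte alone, which mirrors how UPPER() behaves under the "C" locale.


def ascii_only_upper(text: str) -> str:
    """Uppercase a-z only, leaving every non-ASCII character untouched."""
    return text.encode("utf-8").upper().decode("utf-8")


def unicode_upper(text: str) -> str:
    """Uppercase using Python's full Unicode case mapping."""
    return text.upper()


if __name__ == "__main__":
    print(ascii_only_upper("Café"))  # the é stays in lower case
    print(unicode_upper("Café"))     # the é is upper-cased too
```

With an ASCII-only mapping, “café” and “CAFÉ” produce different keys, so a unique index built on it treats them as two distinct names.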

Global sorting on last.fm, such as you can find on your library page, is handled by a separate service which is aware of the default Unicode collation.

The Right Way

If I were designing the Last.fm DB from scratch today, I’d use the pg_collkey Unicode Collation functions for Postgres, which let you interface with the ICU libraries for Unicode.

The collkey function provided by pg_collkey will return a unique binary key representing the normalized version of text:


db=# SELECT collkey('Café', 'root', true, 1, true);
collkey
---------
-)31
(1 row)
db=# SELECT collkey('Cafe', 'root', true, 1, true);
collkey
---------
-)31
(1 row)

So, to create an index which will enforce uniqueness on a text column while ignoring accents, case, and punctuation:

CREATE UNIQUE INDEX table_collkey ON table(collkey(column, 'root', true, 1, true));
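
For readers without pg_collkey to hand, the idea of a key that ignores accents, case and punctuation can be approximated with Python’s standard library. This is a much cruder stand-in than ICU collation keys: loose_key is a hypothetical helper, and its output will not match collkey’s. It only shows the kind of equivalence such an index enforces.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Build a comparison key that ignores accents, case and punctuation.

    Rough approximation only: decompose characters (NFKD), drop combining
    marks and punctuation, then case-fold what is left.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch
        for ch in decomposed
        if not unicodedata.combining(ch)
        and not unicodedata.category(ch).startswith("P")
    ]
    return "".join(kept).casefold()


if __name__ == "__main__":
    for name in ("Café", "Cafe", "CAFE", "AC/DC"):
        print(repr(name), "->", repr(loose_key(name)))
```

Two names that map to the same key would collide under a unique index built on that key, which is the behaviour the collkey index above is after.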


Linux Kernel Tuning

Posted on January 1st, 2009. Filed under Systems Admin.

As promised in my netbooting post, here’s an annotated walkthrough of the Linux kernel tuning parameters that we use fairly constantly at Last.fm.

Many of these parameters are documented in the files under Documentation/ in a Linux source tree; however, it’s generally a pain to find parameters in that mess, so I will distill some of that here. I’ll update this as I learn more.

Networking Tuning

These are the most important settings, especially if you’re using Gigabit networking (which everyone should be!). Although these are fairly aggressive, there shouldn’t be any penalty to applying them to every server (we tend to). They are all sysctl settings.

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

The hard limits for the maximum amount of socket buffer space, in bytes. Of course 16MB per socket sounds like a lot, but most sockets won’t use anywhere near this much, and it’s nice to be able to expand if necessary.

net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

These are the corresponding settings for TCP, in the format (min, default, max) bytes. The max value can’t be larger than the equivalent net.core.{r,w}mem_max.
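
As a quick sanity check of that relationship, here is a small Python sketch that parses sysctl-style lines and confirms the TCP maximum does not exceed the core maximum. The function names are hypothetical, and it works on text you paste in rather than on the live system.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines into {key: [value tokens]}, ignoring comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.split()
    return settings


def tcp_max_fits_core_max(settings, direction):
    """direction is 'r' (receive buffers) or 'w' (send buffers)."""
    core_max = int(settings["net.core.%smem_max" % direction][0])
    tcp_fields = settings["net.ipv4.tcp_%smem" % direction]
    _minimum, _default, tcp_max = (int(field) for field in tcp_fields)
    return tcp_max <= core_max


# The values from this post.
EXAMPLE = """
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
"""

if __name__ == "__main__":
    parsed = parse_sysctl(EXAMPLE)
    for direction in ("r", "w"):
        print(direction, tcp_max_fits_core_max(parsed, direction))
```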

net.ipv4.tcp_mem

Don’t touch tcp_mem for two reasons: Firstly, unlike tcp_rmem and tcp_wmem it’s in pages, not bytes, so it’s likely to confuse the hell out of you. Secondly, it’s already auto-tuned very well by Linux based on the amount of RAM.
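
If you do want to read the auto-tuned tcp_mem values, the pages-to-bytes conversion is simple arithmetic. A minimal sketch follows; the helper names and the sample figures are made up for illustration, and the page size defaults to whatever Python reports for the machine it runs on.

```python
import mmap


def tcp_mem_pages_to_bytes(pages, page_size=mmap.PAGESIZE):
    """tcp_mem is expressed in pages, so multiply by the page size."""
    return pages * page_size


def describe_tcp_mem(line, page_size=mmap.PAGESIZE):
    """Convert a 'low pressure high' tcp_mem line into byte counts."""
    low, pressure, high = (int(field) for field in line.split())
    return {
        "low": tcp_mem_pages_to_bytes(low, page_size),
        "pressure": tcp_mem_pages_to_bytes(pressure, page_size),
        "high": tcp_mem_pages_to_bytes(high, page_size),
    }


if __name__ == "__main__":
    # Hypothetical values, not taken from a real machine.
    sample = "188319 251092 376638"
    for name, size in describe_tcp_mem(sample, page_size=4096).items():
        print("%-8s %d bytes" % (name, size))
```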

net.ipv4.tcp_max_syn_backlog = 4096

Increase the number of outstanding SYN requests allowed.
Note: some people (including myself) have used tcp_syncookies to handle the problem of too many legitimate outstanding SYNs. I quote the Linux documentation:

Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see synflood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.

net.core.netdev_max_backlog = 2500

Raises the limit on packets queued on the input side when an interface receives them faster than the kernel can process them, which helps on gigabit Ethernet links.

VM

vm.min_free_kbytes = 65536

This tells the kernel to try and keep 64MB of RAM free at all times. It’s useful in two main cases:

  • Swap-less machines, where you don’t want incoming network traffic to overwhelm the kernel and force an OOM before it has time to flush any buffers.
  • x86 machines, for the same reason: the x86 architecture only allows DMA transfers below approximately 900MB of RAM. So you can end up with the bizarre situation of an OOM error with tons of RAM free.

vm.swappiness = 0

It’s said that altering swappiness can help you when you’re running under high memory pressure with software that tries to do its own memory management (e.g. MySQL). We’ve had limited success with this and I’d much prefer to use software which doesn’t pretend to know more about your hardware than the OS (e.g. PostgreSQL). Not that I’m bitter.

vm.overcommit_memory=1

The overcommit_memory sysctl isn’t something you’ll usually have to change if your software isn’t insane, but our netboot setup uses it so I thought I’d mention it. From the documentation:

  • 0 – Heuristic overcommit handling. Obvious overcommits of
    address space are refused. Used for a typical system. It
    ensures a seriously wild allocation fails while allowing
    overcommit to reduce swap usage. root is allowed to
    allocate slightly more memory in this mode. This is the
    default.
  • 1 – Always overcommit. Appropriate for some scientific
    applications.
  • 2 – Don’t overcommit. The total address space commit
    for the system is not permitted to exceed swap + a
    configurable percentage (default is 50) of physical RAM.
    Depending on the percentage you use, in most situations
    this means a process will not be killed while accessing
    pages but will receive errors on memory allocation as
    appropriate.

For more info on this, see the overcommit accounting documentation.
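
To make mode 2 concrete, here is the arithmetic from the quoted description as a small Python sketch: swap plus a percentage of physical RAM. The function name is hypothetical and the figures are examples; the kernel’s own accounting has further details that this ignores.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate commit limit under vm.overcommit_memory=2.

    Follows the description quoted above: swap plus overcommit_ratio
    percent of physical RAM. Simplified sketch only.
    """
    return swap_kb + ram_kb * overcommit_ratio // 100


if __name__ == "__main__":
    ram_kb = 8 * 1024 * 1024   # example: 8GB of RAM
    swap_kb = 2 * 1024 * 1024  # example: 2GB of swap
    print("with swap:", commit_limit_kb(ram_kb, swap_kb), "kB")
    print("swapless: ", commit_limit_kb(ram_kb, 0), "kB")
```

Note what the swapless case implies: with the default percentage, only about half of physical RAM can be committed in mode 2.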

Disk Tuning

Really the only thing to note here is to set elevator=deadline in your kernel command line if you’re using RAID. This changes the IO scheduler to deadline, which has empirically been found to be best for almost all server workloads.

You can probably get a percent or two more out by tuning other settings, but we’ve found it’s not worth it.


Configuring NetFlow on the Catalyst 6500

Posted on December 29th, 2008. Filed under Networking.

A quick note on the black art of Cisco configuration. Conveniently the Catalyst 6500 series (and likely higher models which use dCEF) has a different method of configuring NetFlow from lower-end switches. The Cisco docs don’t really touch on why this is. (This guide is based on IOS 12.2(33)SXH on the Sup720. Your mileage most likely will vary.)

So, firstly enable NetFlow like you would on any other IOS switch. It’s worth noting that at some point during the configuration you’ll likely get one of those trademark heart-stopping console freezes for up to 20 seconds. It’s not clear if this actually interrupts switching.

switch(config)#interface Te2/2
switch(config-if)#ip flow ingress
switch(config-if)#ip flow egress

I understand that this command used to be called ip route-cache flow, just to add to the confusion.

Now enable NDE to export your data to something like flow-tools:

switch(config)#ip flow-export source Vlan1
switch(config)#ip flow-export version 5 origin-as
! This is where it hangs a while...
switch(config)#ip flow-export destination x.x.x.x yyyy
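
If you want to confirm that version 5 export datagrams are actually arriving at the collector address before wiring up flow-tools, the 24-byte NetFlow v5 header is easy to decode. The following Python sketch is an assumption-laden debugging aid, not a replacement for a real collector: parse_v5_header is a hypothetical helper, and you would feed it datagrams read from a UDP socket bound to the port you configured.

```python
import struct
from collections import namedtuple

# NetFlow v5 header: 24 bytes, network byte order.
V5_HEADER = struct.Struct("!HHIIIIBBH")

V5Header = namedtuple(
    "V5Header",
    "version count sys_uptime unix_secs unix_nsecs "
    "flow_sequence engine_type engine_id sampling_interval",
)


def parse_v5_header(datagram: bytes) -> V5Header:
    """Decode the header at the start of a NetFlow v5 export datagram."""
    if len(datagram) < V5_HEADER.size:
        raise ValueError("datagram too short for a NetFlow v5 header")
    header = V5Header(*V5_HEADER.unpack_from(datagram))
    if header.version != 5:
        raise ValueError("not a NetFlow v5 datagram (version %d)" % header.version)
    return header


if __name__ == "__main__":
    # A synthetic header for demonstration; real data would come from the switch.
    fake = V5_HEADER.pack(5, 2, 1000, 1230508800, 0, 42, 0, 0, 0)
    print(parse_v5_header(fake))
```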

At this point you can run sh ip flow export to see your many flows being exported. Well, except you can’t, because on the 6500, the ip flow class of commands only deals with NetFlow for packets which hit the supervisor module, i.e. forwarding cache misses. (Older cat6500 hardware would merit a discussion of MSFCs and PFCs here, but my hardware isn’t old, so we don’t need that complication.)

So, to enable NetFlow and NDE for dCEF switched packets throughout the switch, the appropriate incantations are done using the mls series of commands:

switch(config)#mls netflow
switch(config)#mls nde sender version 5
switch(config)#mls flow ip interface-full

Confusingly, although it uses the NDE collector you configured earlier, you must view the MLS NDE stats differently, by using sh mls nde.

More detail can be found in the Configuring Netflow section of the Catalyst 6500 config guide.


Diskless Web Serving for Fun and Profit

Posted on December 3rd, 2008. Filed under Systems Admin.

We’ve used network-booting diskless servers at Last.fm ever since we got our third web server back in 2004. I think it’s one of the best architectural decisions we’ve made, yet there’s precious little information around about running diskless servers. Hopefully this article will go some way towards rectifying that.

Why?

First, a summary of why diskless web serving rocks so much:

  • No disks means fewer failures, hence less maintenance. I hate disks.
  • One single image means only one copy of your web serving environment to maintain.
  • It’s very easy to bring web servers online – seconds from power-on to web serving.

There are situations where diskless web serving won’t work, primarily when the content you want to serve won’t fit into an economical amount of RAM. If your code base isn’t enormous and you serve your static assets separately (which you absolutely should be doing), you shouldn’t hit this problem.

Of course, you can use this for other purposes than just web serving – we also boot our Hadoop clusters off the network.

What you need

  • A server to boot from: the hardware requirements for this are minimal, however it must be reliable. If this dies, your entire web cluster does. Availability for this can be improved using HA-NFS and similar trickery.
  • Between 1 and N web nodes. The only hardware requirement for your web nodes is that they support PXE booting, but pretty much everything does these days.

I’m going to assume that you’re using Debian/Ubuntu on your machines. Debian’s installation tools are very handy for getting this running; your mileage may vary with other operating systems.

Booting

Here’s how a server will boot up into a web serving environment:

  1. The machine powers on. The BIOS is configured to use PXE in its boot order.
  2. The network card’s PXE code brings up the link and gets an IP via DHCP (usually this will be a static MAC->IP mapping).
  3. The DHCP server contains the “next-server” and “filename” options. The PXE client connects to the next-server, grabs the file (which happens to be PXELINUX), and executes it.
  4. PXELINUX grabs the boot configuration for the machine, which has the Linux kernel and initrd details in. It grabs them from TFTP and runs Linux.
  5. Linux starts and runs the initrd.
  6. The initrd mounts the root filesystem, switches to it, then starts init. From here on, it’s the plain Linux boot sequence.

It’s a bit of a marathon, but the stages are actually quite simple. (And it’s magical when it works.) For the purposes of tying things together nicely, I’m going to run through configuring these steps roughly backwards.

Bootstrap a Linux install

You’ll need a Linux install for your webservers to run. This will be a full Linux filesystem which will exist on the boot server. To make this we use Debian’s excellent debootstrap tool. We’ll put a Debian Etch install in the directory /netboot/root:

root@bootserver:/netboot# debootstrap etch ./root
I: Retrieving Release
I: Retrieving Packages
I: Validating Packages
...

Once that finishes (it will take a while), you should have a pristine Debian install. One thing you should do immediately is set up the debian_chroot file so you know if you’re inside the image or not:

root@bootserver:/netboot# echo "webserver" > /netboot/root/etc/debian_chroot

Now you can chroot into the filesystem and start configuring your new install:

root@bootserver:/netboot# chroot /netboot/root
(webserver)root@bootserver:/#

You’re now inside the image. You need to set up the fstab so that /proc gets mounted on boot. On a web server it’s also good security practice to mount /tmp noexec:

none        /       tmpfs   defaults        0       0
proc        /proc   proc    defaults        0       0
none        /tmp    tmpfs   noexec          0       0

Tuning: Linux Swapless Memory Management

It’s worth noting that Linux’s memory management strategy doesn’t take kindly to being run with high memory pressures and no swap. People have tried various approaches to solving this in a diskless environment, even going so far as putting swap partitions on network block devices. We’ve found that it’s not too hard to keep things under control if you’re careful, even with PHP’s poor memory management.

There are two methods we use: firstly, leave a 10% safety margin when setting Apache’s MaxClients. 10% “wasted” RAM may seem bad, but it’s a small price to pay for maintainability.
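
The safety-margin arithmetic can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name, the idea of reserving a fixed amount for the tmpfs root and the OS, and the example figures. Measure your own per-process memory use before trusting any number it produces.

```python
def max_clients(total_ram_mb, reserved_mb, per_child_mb, margin_percent=10):
    """Estimate an Apache worker limit that leaves a safety margin.

    total_ram_mb   -- physical RAM in the web node
    reserved_mb    -- RAM set aside for the tmpfs root, kernel and other daemons
    per_child_mb   -- measured resident size of one Apache child
    margin_percent -- share of the remaining RAM deliberately left unused
    """
    usable_mb = (total_ram_mb - reserved_mb) * (100 - margin_percent) // 100
    return usable_mb // per_child_mb


if __name__ == "__main__":
    # Hypothetical node: 8GB RAM, 1GB reserved, 50MB per Apache child.
    print(max_clients(8192, 1024, 50))
```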

Secondly, put these settings in /etc/sysctl.conf:

vm.overcommit_memory=1
vm.vfs_cache_pressure=300
vm.min_free_kbytes=32768

I might document these better in a future post, but it’s at least a good start.

Bear in mind that in statistics, the size of the tmpfs root filesystem will not show up as “used” memory; it will show up as “cached”. Annoyingly, there’s no easy way of telling how much of your cached memory is essential and how much can be “swapped out”.

(Oh, and keep an eye out for shared memory leaks if you’re seeing inexplicable out-of-memory issues. That one kept me guessing for 4 months. Shared memory is also accounted for under “cached”.)
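
One partial workaround, assuming a kernel newer than the ones this post was written against: later kernels add a Shmem line to /proc/meminfo covering tmpfs and shared memory, which lets you estimate how much of “cached” could really be dropped. The Python sketch below works on meminfo-style text; the helper names and sample figures are made up.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, values as integers."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def droppable_cache_kb(meminfo):
    """Estimate cache that is not tmpfs or shared memory.

    Assumes a Shmem field is present; without it the estimate falls back to
    plain Cached, which overstates what can be reclaimed.
    """
    return meminfo["Cached"] - meminfo.get("Shmem", 0)


# Made-up figures for illustration.
SAMPLE = """MemTotal:        8192000 kB
Cached:          3000000 kB
Shmem:           1200000 kB
"""

if __name__ == "__main__":
    print(droppable_cache_kb(parse_meminfo(SAMPLE)), "kB")
```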

The kernel and initrd

Next, you should choose which kernel you want to use. It might be possible to use a distribution’s stock kernel, but it simplifies things a lot if you have a custom kernel with the modules you require statically compiled in. Generally this is pretty minimal: just Ethernet, NFS and possibly USB HID drivers are needed.

You now need to build a skeleton initrd, which you can do with mkinitrd, then mount it:

root@bootserver:~# mkinitrd -o ./netboot.img
root@bootserver:~# mkdir netboot
root@bootserver:~# mount -o loop ./netboot.img ./netboot
root@bootserver:~# ls ./netboot
bin  bin2  dev  dev2  devfs  etc  keyscripts  lib  lib64  linuxrc  linuxrc.conf  loadmodules  mnt  proc  sbin  script  scripts  sys  tmp  usr  var

I’m going to skirt around the finer points of initrd construction here; the initrd mkinitrd provides is slight overkill for our needs. The important thing is that you add a custom linuxrc file into your initrd to configure the network and root filesystem. Here’s the one we use – it’s a little crude but it works (refinements welcome).

Once that’s done, unmount and compress it:

root@bootserver:~# umount ./netboot
root@bootserver:~# gzip -9 ./netboot.img

TFTP

Set up a TFTP server of your choice. I like atftpd, but I find it quite hard to get excited about TFTP servers so I won’t prescribe one. I’ll assume that it’s working and it’s serving from the directory /tftpboot. Install PXELINUX into that directory, as well as your kernel (vmlinuz) and initrd. You should have this directory structure:

/tftpboot/netboot.img.gz              (your initrd)
/tftpboot/vmlinuz-2.6.27.7-amd64      (the kernel)
/tftpboot/pxelinux.0                  (PXELINUX itself)
/tftpboot/pxelinux.cfg/               (PXELINUX config directory)

Into the pxelinux.cfg directory, you can now put a file called default, in this format:

LABEL linux
        KERNEL vmlinuz-2.6.27.7-amd64
        APPEND initrd=netboot.img.gz ramdisk_size=8192
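
The default file is only the last resort: PXELINUX first looks for more specific config files named after the client, which is handy if one node needs a different kernel. The Python sketch below lists the names it tries for a given MAC and IP, as I understand the search order; the function name is hypothetical, and the client-UUID lookup that comes before the MAC is left out.

```python
import ipaddress
from typing import List


def pxelinux_config_candidates(mac: str, ip: str) -> List[str]:
    """List the pxelinux.cfg/ filenames tried for a client, most specific first.

    Order: the MAC address (prefixed with ARP type 01, lower-case, dashes),
    then the IP address in upper-case hex with one digit removed at a time,
    and finally 'default'. The client UUID step is omitted from this sketch.
    """
    names = ["01-" + mac.lower().replace(":", "-")]
    ip_hex = "%08X" % int(ipaddress.IPv4Address(ip))
    names.extend(ip_hex[:length] for length in range(len(ip_hex), 0, -1))
    names.append("default")
    return names


if __name__ == "__main__":
    for name in pxelinux_config_candidates("00:E0:81:2F:64:6C", "10.0.1.1"):
        print("pxelinux.cfg/" + name)
```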

DHCP

Lastly, you need to set up your DHCP server to send the correct boot options to your web nodes. I’ll assume you’re using ISC dhcpd v3, which seems to be a decent enough DHCP server. In dhcpd.conf, we create a separate group for web servers (this is just a snippet, it assumes you have a working config beforehand):

group {
        next-server 10.0.0.10;          # IP address of your boot server
        filename "/pxelinux.0";         # Path of pxelinux on your boot server, relative to the tftp root
        option root-path "10.0.0.10:/export/root,actimeo=120";   # Where to mount your root from
 
        # An example web node, statically mapped by MAC address:
        host www1 {
                hardware ethernet 00:E0:81:2F:64:6C;
                fixed-address 10.0.1.1;
                option host-name "www1";
        }
}
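
Once you have more than a handful of web nodes, typing host blocks by hand gets old. Here is a minimal Python sketch that generates them from a list. The helper names are hypothetical, and www2 with its MAC address is invented for the example; paste the output inside the group block above.

```python
def host_stanza(name, mac, ip):
    """Render one dhcpd host block in the same layout as the example above."""
    return (
        "        host %s {\n"
        "                hardware ethernet %s;\n"
        "                fixed-address %s;\n"
        "                option host-name \"%s\";\n"
        "        }\n"
    ) % (name, mac, ip, name)


def web_node_stanzas(nodes):
    """Render host blocks for an iterable of (name, mac, ip) tuples."""
    return "".join(host_stanza(name, mac, ip) for name, mac, ip in nodes)


# www1 is the node from this post; www2 is a made-up second node.
NODES = [
    ("www1", "00:E0:81:2F:64:6C", "10.0.1.1"),
    ("www2", "00:E0:81:2F:64:6D", "10.0.1.2"),
]

if __name__ == "__main__":
    print(web_node_stanzas(NODES), end="")
```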

Tuning: The NFS root filesystem

You can see ,actimeo=120 in the root-path option. This is a standard mount option for NFS, and it’s used to control the stat (or getattr) cache. On your web nodes, all your system files will be NFS-mounted. In some cases files in these directories will be hit very frequently (glibc loves statting /etc/localtime) – you don’t want to incur a network trip every time that happens. This setting caches file attributes for 120 seconds, so be aware that a change made on the boot server can take up to two minutes to show up on the nodes.

Icing

Provided I haven’t missed anything, you should be able to boot a node and have it load up your Linux install. That’s the hard part.

We have init scripts which copy our web codebase onto the tmpfs ramdisk, then launch Apache and Memcache with parameters appropriate to the machine spec. Those are pretty specialist, though, so I’m not publishing them.

To finish by way of a list of credits, here’s a quick list of the other things which keep our web cluster ticking over smoothly:

  • Ganglia: lightweight, comprehensive low-level monitoring.
  • Cacti: customisable higher-level monitoring.
  • dsh: distributed shell.
  • Perlbal: configurable layer 7 load-balancing.
  • LVS: fast layer 4 load-balancing.


Simulating Network Trouble to Catch the Unexpected

Posted on October 29th, 2008. Filed under Networking.

We recently had an issue where one of two fibre pairs between one of our core switches and a new high-capacity edge rack got nudged a little too violently, then started throwing errors in one direction. Annoyingly, this didn’t get detected, and since it was a load-balanced link connectivity seemed fine.

Until we moved one of our user charts machines, which turns out to be quite sensitive to packet loss, into the rack. When we hit peak traffic (just about time to go to the pub), it started to time out, and we started serving and rendering the wrong charts in the wrong places. People hate it when that happens.

The tool for consistently reproducing these problems is already built into every modern Linux distribution. The traffic-shaping system includes the netem module which provides a huge array of network emulation possibilities. In our case, it was as simple as:

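# tc qdisc add dev eth0 root netem loss 5%

We re-ran our test and it failed first time. (If a netem qdisc is already attached to the interface, use change instead of add; tc qdisc del dev eth0 root removes it when you’re done.)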
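
It helps to see why a seemingly small loss rate bites so hard. Assuming each packet is dropped independently (a simplification; real loss on a damaged fibre tends to come in bursts), the chance that an exchange of N packets loses at least one is 1 - (1 - p)^N. A small Python sketch, with a hypothetical function name:

```python
def chance_of_any_loss(loss_rate, packets):
    """Probability that at least one of `packets` packets is dropped.

    Assumes independent drops at a fixed rate, which is a simplification.
    """
    return 1.0 - (1.0 - loss_rate) ** packets


if __name__ == "__main__":
    for packets in (1, 5, 20, 100):
        print("%3d packets: %.1f%%" % (packets, 100 * chance_of_any_loss(0.05, packets)))
```

Under that assumption, most multi-packet requests hit at least one retransmission at 5% loss, which is plenty to push a timeout-sensitive service over the edge.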

For a detailed reference on Linux network emulation see this documentation.


First, I guess.

Posted on October 29th, 2008. Filed under Meta.

So, finally I have one of these blog things again.

I’ve mainly started this up again because I need somewhere to dump technical systemsy stuff which nobody else seems to write/care about. Plus RJ has one now and I can’t be outdone on that ground.

Hopefully it’ll at least be a bit more useful than the previous page-of-links.


About Me

I build infrastructure.

I currently work for Smarkets as Head of Tech Operations. Before that I worked at Last.fm. I also co-founded the London Hackspace.

I live in London and sometimes moonlight as a freelance photographer.
