Wednesday 3rd December, 2008

We’ve used network-booting diskless servers at Last.fm ever since we got our third web server back in 2004. I think it’s one of the best architectural decisions we’ve made, yet there’s precious little information around about running diskless servers. Hopefully this article will go some way towards rectifying that.

Why?

First, a summary of why diskless web serving rocks so much:

No disks mean fewer failures, hence less maintenance. I hate disks.
One single image means only one copy of your web serving environment to maintain.
It’s very easy to bring web servers online - seconds from power-on to web serving.

There are situations where diskless web serving won’t work, primarily when the content you want to serve won’t fit into an economical amount of RAM. If your code base isn’t enormous and you serve your static assets separately (which you absolutely should be doing), you shouldn’t hit this problem.

Of course, you can use this for other purposes than just web serving - we also boot our Hadoop clusters off the network.

What you need

A server to boot from: the hardware requirements for this are minimal, but it must be reliable. If it dies, so does your entire web cluster. Its availability can be improved using HA-NFS and similar trickery.
Between 1 and N web nodes. The only hardware requirement for your web nodes is that they support PXE booting, but pretty much everything does these days.

I’m going to assume that you’re using Debian/Ubuntu on your machines. Debian’s installation tools are very handy for getting this running; your mileage may vary with other operating systems.

Booting

Here’s how a server will boot up into a web serving environment:

The machine powers on. The BIOS is configured to use PXE in its boot order.
The network card’s PXE code brings up the link and gets an IP via DHCP (usually this will be a static MAC->IP mapping).
The DHCP response contains the “next-server” and “filename” options. The PXE client connects to the next-server over TFTP, grabs the file (which happens to be PXELINUX), and executes it.
PXELINUX grabs the boot configuration for the machine, which names the Linux kernel and initrd to use. It fetches both over TFTP and boots Linux.
Linux starts and runs the initrd.
The initrd mounts the root filesystem, switches to it, then starts init. From here on, it’s the plain Linux boot sequence.

It’s a bit of a marathon, but the stages are actually quite simple. (And it’s magical when it works.) For the purposes of tying things together nicely, I’m going to run through configuring these steps roughly backwards.

Bootstrap a Linux install

You’ll need a Linux install for your webservers to run. This will be a full Linux filesystem which will exist on the boot server. To make this we use Debian’s excellent debootstrap tool. We’ll put a Debian Etch install in the directory /netboot/root:

root@bootserver:/netboot# debootstrap etch ./root
I: Retrieving Release
I: Retrieving Packages
I: Validating Packages
...

Once that finishes (it will take a while), you should have a pristine Debian install. One thing you should do immediately is set up the debian_chroot file so you know if you’re inside the image or not:

root@bootserver:/netboot# echo "webserver" > /netboot/root/etc/debian_chroot

Now you can chroot into the filesystem and start configuring your new install:

root@bootserver:/netboot# chroot /netboot/root
(webserver)root@bootserver:/#

You’re now inside the image. You need to set up the fstab so that /proc gets mounted on boot. On a web server it’s also good security practice to mount /tmp noexec:

none        /       tmpfs   defaults        0       0
proc        /proc   proc    defaults        0       0
none        /tmp    tmpfs   noexec          0       0

Tuning: Linux Swapless Memory Management

It’s worth noting that Linux’s memory management strategy doesn’t take kindly to being run with high memory pressures and no swap. People have tried various approaches to solving this in a diskless environment, even going so far as putting swap partitions on network block devices. We’ve found that it’s not too hard to keep things under control if you’re careful, even with PHP’s poor memory management.

There are two methods we use: firstly, leave a 10% safety margin when setting your Apache MaxClients. 10% “wasted” RAM may seem bad, but it’s a small price to pay for maintainability.
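
As a rough illustration of that arithmetic, here’s a minimal sketch of how a child limit (MaxClients in Apache 2.2’s prefork MPM) could be worked out. The function name and all the figures are made up for the example, and it assumes the margin is 10% of total RAM; measure your own per-child size under real load.

```shell
#!/bin/sh
# Estimate an Apache child limit that leaves a safety margin of RAM free.
# All names and figures here are illustrative assumptions.

# max_clients TOTAL_MB RESERVED_MB PER_CHILD_MB MARGIN_PERCENT
#   TOTAL_MB        physical RAM in the node
#   RESERVED_MB     RAM already spoken for (tmpfs root, memcached, kernel...)
#   PER_CHILD_MB    resident size of one Apache child
#   MARGIN_PERCENT  share of total RAM to leave unallocated
max_clients() {
    total_mb=$1
    reserved_mb=$2
    per_child_mb=$3
    margin_pct=$4

    margin_mb=$(( total_mb * margin_pct / 100 ))
    usable_mb=$(( total_mb - reserved_mb - margin_mb ))

    # Nothing left to give to Apache: report zero rather than a negative.
    if [ "$usable_mb" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( usable_mb / per_child_mb ))
}

# Example: 8 GB node, 2 GB reserved, 40 MB per child, 10% margin.
max_clients 8192 2048 40 10   # prints 133
```

Integer division rounds down, which errs on the side of a slightly bigger margin.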

Secondly, put these settings in /etc/sysctl.conf:

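# Always overcommit: never refuse an allocation up front
vm.overcommit_memory=1
# Above the default of 100, so dentry and inode caches are reclaimed more eagerly
vm.vfs_cache_pressure=300
# Keep at least 32 MB free for the kernel's own use
vm.min_free_kbytes=32768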

I might document these better in a future post, but it’s at least a good start.

Bear in mind that in memory statistics, the size of the tmpfs root filesystem will not show up as “used” memory; it will show up as “cached”. Annoyingly, there’s no easy way of telling how much of your cached memory is essential and how much can be “swapped out”.
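
On kernels more recent than the 2.6.27 used here, /proc/meminfo gained a Shmem line which covers tmpfs and shared memory, and that figure is included in Cached. Assuming your kernel has it, the sketch below subtracts one from the other to get a rough upper bound on the cache that can actually be dropped. The function name is my own, and the result still includes file pages that are in active use, so treat it as an estimate only.

```shell
#!/bin/sh
# evictable_cache_kb [FILE]
# Print Cached minus Shmem (in kB) from a /proc/meminfo-style file.
# If there is no Shmem line (older kernels), this just prints Cached.
evictable_cache_kb() {
    awk '
        $1 == "Cached:" { cached = $2 }
        $1 == "Shmem:"  { shmem = $2 }
        END { print cached - shmem }
    ' "${1:-/proc/meminfo}"
}

# Report for the running system, if it exposes /proc/meminfo.
if [ -r /proc/meminfo ]; then
    evictable_cache_kb
fi
```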

(Oh, and keep an eye out for shared memory leaks if you’re seeing inexplicable out-of-memory issues. That one kept me guessing for 4 months. Shared memory is also accounted for under “cached”.)

The kernel and initrd

Next, you should choose which kernel you want to use. It might be possible to use a distribution’s stock kernel, but it simplifies things a lot if you have a custom kernel with the modules you require statically compiled in. Generally this is pretty minimal: just Ethernet, NFS and possibly USB HID drivers are needed.

You now need to build a skeleton initrd, which you can do with mkinitrd, then mount it:

root@bootserver:~# mkinitrd -o ./netboot.img
root@bootserver:~# mkdir netboot
root@bootserver:~# mount -o loop ./netboot.img ./netboot
root@bootserver:~# ls ./netboot
bin  bin2  dev  dev2  devfs  etc  keyscripts  lib  lib64  linuxrc  linuxrc.conf  loadmodules  mnt
proc  sbin  script  scripts  sys  tmp  usr  var

I’m going to skirt around the finer points of initrd construction here; the initrd mkinitrd provides is slight overkill for our needs. The important thing is that you add a custom linuxrc file into your initrd to configure the network and root filesystem. Here’s the one we use - it’s a little crude but it works (refinements welcome).

Once that’s done, unmount and compress it:

root@bootserver:~# umount ./netboot
root@bootserver:~# gzip -9 ./netboot.img

TFTP

Set up a TFTP server of your choice. I like atftpd, but I find it quite hard to get excited about TFTP servers so I won’t prescribe one. I’ll assume that it’s working and it’s serving from the directory /tftpboot. Install PXELINUX into that directory, as well as your kernel (vmlinuz) and initrd. You should have this directory structure:

/tftpboot/netboot.img.gz              (your initrd)
/tftpboot/vmlinuz-2.6.27.7-amd64      (the kernel)
/tftpboot/pxelinux.0                  (PXELINUX itself)
/tftpboot/pxelinux.cfg/               (PXELINUX config directory)

Into the pxelinux.cfg directory, you can now put a file called default, in this format:

LABEL linux
    KERNEL vmlinuz-2.6.27.7-amd64
    APPEND initrd=netboot.img.gz ramdisk_size=8192
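
The default file applies to every node. PXELINUX also looks for more specific files in pxelinux.cfg before falling back to default: first one named after the client’s MAC address (prefixed with 01- for Ethernet), then ones named after its IP address in uppercase hex. If you ever need a per-node config, this sketch works out those names; the two helper functions are my own invention, not part of PXELINUX.

```shell
#!/bin/sh
# pxe_mac_name MAC
# Config file name PXELINUX tries for an Ethernet MAC address:
# "01-" followed by the MAC in lowercase, with dashes instead of colons.
pxe_mac_name() {
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F' 'a-f' | tr ':' '-')"
}

# pxe_ip_name IPV4
# Config file name PXELINUX tries for an IPv4 address:
# the four octets written as uppercase hex.
pxe_ip_name() {
    printf '%s\n' "$1" |
        awk -F. '{ printf "%02X%02X%02X%02X\n", $1, $2, $3, $4 }'
}

# The example web node from the DHCP section of this article:
pxe_mac_name 00:E0:81:2F:64:6C   # prints 01-00-e0-81-2f-64-6c
pxe_ip_name 10.0.1.1             # prints 0A000101
```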

DHCP

Lastly, you need to set up your DHCP server to send the correct boot options to your web nodes. I’ll assume you’re using ISC dhcpd v3, which seems to be a decent enough DHCP server. In dhcpd.conf, we create a separate group for web servers (this is just a snippet, it assumes you have a working config beforehand):

group {
    next-server 10.0.0.10;          # IP address of your boot server
    filename "/pxelinux.0";         # Path of pxelinux on your boot server, relative to the tftp root
    option root-path "10.0.0.10:/export/root,actimeo=120";   # Where to mount your root from

    # An example web node, statically mapped by MAC address:
    host www1 {
            hardware ethernet 00:E0:81:2F:64:6C;
            fixed-address 10.0.1.1;
            option host-name "www1";
    }
}
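
Once you have more than a handful of nodes, typing host entries by hand gets old. Here’s a minimal sketch that generates them from a list of name, MAC and IP; the dhcp_host function and the www2 line are hypothetical, made up for illustration. Paste the output inside the group block.

```shell
#!/bin/sh
# dhcp_host NAME MAC IP
# Print one dhcpd.conf host entry in the same layout as the example above.
dhcp_host() {
    cat <<EOF
    host $1 {
            hardware ethernet $2;
            fixed-address $3;
            option host-name "$1";
    }
EOF
}

# Hypothetical inventory, one node per line: name, MAC address, IP address.
while read -r name mac ip; do
    dhcp_host "$name" "$mac" "$ip"
done <<EOF
www1 00:E0:81:2F:64:6C 10.0.1.1
www2 00:E0:81:2F:64:6D 10.0.1.2
EOF
```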

Tuning: The NFS root filesystem

You can see ,actimeo=120 in the root-path option. This is a standard mount option for NFS, and it’s used to control the stat (or getattr) cache. On your web nodes, all your system files will be NFS-mounted. In some cases files in these directories will be hit very frequently (glibc loves statting /etc/localtime) - you don’t want to incur a network trip every time that happens. This setting caches file attributes for 120 seconds, so be aware that a change made on the boot server can take up to two minutes to show up on the nodes.
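
The root-path value packs three things into one string: the NFS server, the exported directory, and any mount options after the first comma. Whatever runs in your initrd has to pull those apart before it can mount anything. This is a minimal sketch of that step, not our actual linuxrc; the function, the variable names and the extra ro,nolock options are assumptions for the example, and it only prints the mount command rather than running it.

```shell
#!/bin/sh
# parse_root_path "SERVER:/DIR[,OPTIONS]"
# Split a DHCP root-path value into ROOT_SERVER, ROOT_DIR and ROOT_OPTS.
parse_root_path() {
    ROOT_SERVER=${1%%:*}        # everything before the first colon
    rest=${1#*:}                # everything after it
    ROOT_DIR=${rest%%,*}        # up to the first comma
    case $rest in
        *,*) ROOT_OPTS=${rest#*,} ;;   # mount options after the first comma
        *)   ROOT_OPTS= ;;             # no options given
    esac
}

parse_root_path "10.0.0.10:/export/root,actimeo=120"

# ro and nolock are assumed defaults for this sketch; append ours if present.
opts="ro,nolock${ROOT_OPTS:+,$ROOT_OPTS}"
echo "mount -t nfs -o $opts $ROOT_SERVER:$ROOT_DIR /mnt"
```

With the value from the dhcpd.conf snippet, this prints mount -t nfs -o ro,nolock,actimeo=120 10.0.0.10:/export/root /mnt.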

Icing

Provided I haven’t missed anything, you should be able to boot a node and have it load up your Linux install. That’s the hard part.

We have init scripts which copy our web codebase onto the tmpfs ramdisk, then launch Apache and Memcache with parameters appropriate to the machine spec. Those are pretty specialist, though, so I’m not publishing them.

To finish with some credits, here’s a quick list of the other things which keep our web cluster ticking over smoothly:

Ganglia: lightweight, comprehensive low-level monitoring.
Cacti: customisable higher-level monitoring.
dsh: distributed shell.
Perlbal: configurable layer 7 load-balancing.
LVS: fast layer 4 load-balancing.