We’ve used network-booting diskless servers at Last.fm ever since we got our third web server back in 2004. I think it’s one of the best architectural decisions we’ve made, yet there’s precious little information around about running diskless servers. Hopefully this article will go some way towards rectifying that.
Why?
First, a summary of why diskless web serving rocks so much:
- No disks mean fewer failures, hence less maintenance. I hate disks.
- One single image means only one copy of your web serving environment to maintain.
- It’s very easy to bring web servers online - seconds from power-on to web serving.
There are situations where diskless web serving won’t work, primarily when the content you want to serve won’t fit into an economical amount of RAM. If your code base isn’t enormous and you serve your static assets separately (which you absolutely should be doing), you shouldn’t hit this problem.
Of course, you can use this for other purposes than just web serving - we also boot our Hadoop clusters off the network.
What you need
- A server to boot from: the hardware requirements for this are minimal, but it must be reliable. If this dies, your entire web cluster does. Availability for this can be improved using HA-NFS and similar trickery.
- Between 1 and N web nodes. The only hardware requirement for your web nodes is that they support PXE booting, but pretty much everything does these days.
I’m going to assume that you’re using Debian/Ubuntu on your machines. Debian’s installation tools are very handy for getting this running; your mileage may vary with other operating systems.
Booting
Here’s how a server will boot up into a web serving environment:
- The machine powers on. The BIOS is configured to use PXE in its boot order.
- The network card’s PXE code brings up the link and gets an IP via DHCP (usually this will be a static MAC->IP mapping).
- The DHCP response carries the “next-server” and “filename” options. The PXE client connects to the next-server over TFTP, grabs the file (which happens to be PXELINUX), and executes it.
- PXELINUX grabs the boot configuration for the machine, which names the Linux kernel and initrd to use. It fetches both over TFTP and boots Linux.
- Linux starts and runs the initrd.
- The initrd mounts the root filesystem, switches to it, then starts init. From here on, it’s the plain Linux boot sequence.
It’s a bit of a marathon, but the stages are actually quite simple. (And it’s magical when it works.) For the purposes of tying things together nicely, I’m going to run through configuring these steps roughly backwards.
Bootstrap a Linux install
You’ll need a Linux install for your webservers to run. This will be a full Linux filesystem which will exist on the boot server. To make this we use Debian’s excellent debootstrap tool. We’ll put a Debian Etch install in the directory /netboot/root:
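As a rough sketch, the bootstrap step could look like the following. The suite (etch) and target directory come from the text; the build_image wrapper is a hypothetical helper and the mirror URL is an assumption, so substitute a mirror you trust.

```shell
# Sketch only: run as root on the boot server.
# build_image is a hypothetical wrapper; debootstrap takes SUITE TARGET [MIRROR].
build_image() {
    suite="$1"
    target="$2"
    mirror="$3"
    mkdir -p "$target"
    debootstrap "$suite" "$target" "$mirror"
}

# Assumed invocation (the mirror URL is a guess, adjust to one you trust):
# build_image etch /netboot/root http://archive.debian.org/debian/
```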
Once that finishes (it will take a while), you should have a pristine Debian install. One thing you should do immediately is set up the debian_chroot file so you know if you’re inside the image or not:
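A minimal sketch of that step, assuming the image lives in /netboot/root as above. The mark_image helper and the label "netboot" are illustrative choices, not anything the setup depends on.

```shell
# Sketch only: mark_image is a hypothetical helper, and "netboot" is an assumed label.
# Debian's default bashrc puts the contents of /etc/debian_chroot in the shell prompt.
mark_image() {
    image_root="$1"
    label="$2"
    echo "$label" > "$image_root/etc/debian_chroot"
}

# Assumed invocation on the boot server:
# mark_image /netboot/root netboot
```

With the file in place, a bash prompt inside the image should start with (netboot), which makes it much harder to edit the wrong filesystem by accident.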
Now you can chroot into the filesystem (chroot /netboot/root) and start configuring your new install.
You’re now inside the image. You need to set up the fstab so that /proc gets mounted on boot. On a web server it’s also good security practice to mount /tmp noexec:
none / tmpfs defaults 0 0
proc /proc proc defaults 0 0
none /tmp tmpfs noexec 0 0
Tuning: Linux Swapless Memory Management
It’s worth noting that Linux’s memory management strategy doesn’t take kindly to being run with high memory pressure and no swap. People have tried various approaches to solving this in a diskless environment, even going so far as putting swap partitions on network block devices. We’ve found that it’s not too hard to keep things under control if you’re careful, even with PHP’s poor memory management.
There are two methods we use: firstly, leave a 10% safety margin when setting Apache’s MaxClients (MaxRequestWorkers in Apache 2.4). 10% “wasted” RAM may seem bad, but it’s a small price to pay for maintainability.
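To make the safety margin concrete, here is a small sketch of the arithmetic. The max_clients helper and all the numbers are made up for illustration: it assumes the margin is 10% of total RAM, that some memory is reserved for the tmpfs image and the OS, and that you have measured a typical per-child memory footprint yourself.

```shell
# Sketch only: max_clients is a hypothetical helper, all sizes are in MB.
# Usage: max_clients TOTAL_MB RESERVED_MB PER_CHILD_MB
# Keeps 10% of total RAM free, subtracts memory reserved for the tmpfs
# image and the OS, then divides what is left by the per-child footprint.
max_clients() {
    total_mb="$1"
    reserved_mb="$2"
    per_child_mb="$3"
    usable_mb=$(( total_mb * 90 / 100 - reserved_mb ))
    echo $(( usable_mb / per_child_mb ))
}

# Example with assumed figures: 8 GB box, 1 GB reserved, 40 MB per child.
max_clients 8192 1024 40
```

With those assumed figures the result is 158, which is the value you would give Apache as its child limit.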
Secondly, put these settings in /etc/sysctl.conf:
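# Always allow memory allocations to overcommit rather than refusing them
vm.overcommit_memory=1
# Reclaim dentry and inode caches more aggressively than the default of 100
vm.vfs_cache_pressure=300
# Keep at least 32 MB of memory free for the kernel
vm.min_free_kbytes=32768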
I might document these better in a future post, but it’s at least a good start.
Bear in mind that in memory statistics, the size of the tmpfs root filesystem will not show up as “used” memory; it will show up as “cached”. Annoyingly, there’s no easy way of telling how much of your cached memory is essential and how much can be “swapped out”.
(Oh, and keep an eye out for shared memory leaks if you’re seeing inexplicable out-of-memory issues. That one kept me guessing for 4 months. Shared memory is also accounted for under “cached”.)
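On more recent kernels, /proc/meminfo includes a Shmem line that covers tmpfs and shared memory, which gives a rough way to separate the two. The sketch below assumes that field is present; the reclaimable_cache_kb helper is a hypothetical name, and the result is only an estimate of the cache the kernel could drop.

```shell
# Sketch only: reclaimable_cache_kb is a hypothetical helper.
# Reads meminfo-format text on stdin and prints Cached minus Shmem, in kB.
# If there is no Shmem line (older kernels), it prints Cached unchanged.
reclaimable_cache_kb() {
    awk '/^Cached:/ {c = $2} /^Shmem:/ {s = $2} END {print c - s}'
}

# Report for the running machine, if /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    reclaimable_cache_kb < /proc/meminfo
fi
```

A steadily growing Shmem figure with a constant tmpfs size is also a reasonable hint that something is leaking shared memory.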
The kernel and initrd
Next, you should choose which kernel you want to use. It might be possible to use a distribution’s stock kernel, but it simplifies things a lot if you have a custom kernel with the modules you require statically compiled in. Generally this is pretty minimal: just Ethernet, NFS and possibly USB HID drivers are needed.
You now need to build a skeleton initrd, which you can do with mkinitrd, then mount it:
I’m going to skirt around the finer points of initrd construction here; the initrd mkinitrd provides is slight overkill for our needs. The important thing is that you add a custom linuxrc file into your initrd to configure the network and root filesystem.
Here’s the one we use - it’s a little crude but it works (refinements welcome).
Once that’s done, unmount and compress it:
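A sketch of the unmount-and-compress step, assuming the initrd is an uncompressed filesystem image that was loop-mounted for editing. The pack_initrd helper and both paths in the example are illustrative; only the resulting netboot.img.gz name matches the rest of this article.

```shell
# Sketch only: pack_initrd is a hypothetical helper; run it as root.
# Unmounts the loop-mounted image, then gzips it in place (IMAGE becomes IMAGE.gz).
pack_initrd() {
    mount_point="$1"
    image="$2"
    umount "$mount_point" && gzip -9 "$image"
}

# Assumed invocation (both paths are illustrative):
# pack_initrd /mnt/initrd /tftpboot/netboot.img
```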
TFTP
Set up a TFTP server of your choice. I like atftpd, but I find it quite hard to get excited about TFTP servers so I won’t prescribe one. I’ll assume that it’s working and it’s serving from the directory /tftpboot. Install PXELINUX into that directory, as well as your kernel (vmlinuz) and initrd. You should have this directory structure:
/tftpboot/netboot.img.gz (your initrd)
/tftpboot/vmlinuz-2.6.27.7-amd64 (the kernel)
/tftpboot/pxelinux.0 (PXELINUX itself)
/tftpboot/pxelinux.cfg/ (PXELINUX config directory)
Into the pxelinux.cfg directory, you can now put a file called default, in this format:
LABEL linux
KERNEL vmlinuz-2.6.27.7-amd64
APPEND initrd=netboot.img.gz ramdisk_size=8192
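The file called default is only the fallback. PXELINUX first looks in pxelinux.cfg for a file named after the client’s MAC address, then for files named after its IP address in hex (dropping one trailing digit at a time), and only then for default. If you ever need a per-node config, the sketch below shows how those names are derived; the two helper functions are hypothetical, and the MAC and IP are simply example values for a node called www1.

```shell
# Sketch only: pxe_mac_name and pxe_ip_name are hypothetical helpers.

# MAC-based name: "01-" (Ethernet) followed by the MAC in lowercase with dashes.
pxe_mac_name() {
    echo "01-$(echo "$1" | tr 'A-F:' 'a-f-')"
}

# IP-based name: the IPv4 address written as 8 uppercase hex digits.
pxe_ip_name() {
    old_ifs="$IFS"
    IFS=.
    set -- $1
    IFS="$old_ifs"
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

pxe_mac_name 00:E0:81:2F:64:6C
pxe_ip_name 10.0.1.1
```

For those example values the names are 01-00-e0-81-2f-64-6c and 0A000101, so a file at /tftpboot/pxelinux.cfg/0A000101 would apply to that one node only.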
DHCP
Lastly, you need to set up your DHCP server to send the correct boot options to your web nodes. I’ll assume you’re using ISC dhcpd v3, which seems to be a decent enough DHCP server. In dhcpd.conf, we create a separate group for web servers (this is just a snippet, it assumes you have a working config beforehand):
group {
    next-server 10.0.0.10; # IP address of your boot server
    filename "/pxelinux.0"; # Path of pxelinux on your boot server, relative to the tftp root
    option root-path "10.0.0.10:/export/root,actimeo=120"; # Where to mount your root from
    # An example web node, statically mapped by MAC address:
    host www1 {
        hardware ethernet 00:E0:81:2F:64:6C;
        fixed-address 10.0.1.1;
        option host-name "www1";
    }
}
Tuning: The NFS root filesystem
You can see ,actimeo=120 in the root-path option. This is a standard mount option for NFS, and it’s used to control the stat (or getattr) cache. On your web nodes, all your system files will be NFS-mounted. In some cases files in these directories will be hit very frequently (glibc loves statting /etc/localtime) - you don’t want to incur a network trip every time that happens. This setting caches file attributes for 120 seconds, so be aware that changes made on the boot server can take up to two minutes to show up on the nodes.
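The root-path value is just a string of the form server:path,options, and something in the initrd has to pull it apart before mounting. As an illustration of that step (not the script we actually use), here is a sketch using plain shell parameter expansion; split_root_path is a hypothetical helper name.

```shell
# Sketch only: split_root_path is a hypothetical helper.
# Splits "server:/path,opt1,opt2" into the pieces an NFS mount needs.
split_root_path() {
    root_path="$1"
    server="${root_path%%:*}"    # everything before the first colon
    rest="${root_path#*:}"       # everything after the first colon
    export_path="${rest%%,*}"    # the path, up to the first comma
    case "$rest" in
        *,*) options="${rest#*,}" ;;  # mount options after the first comma
        *)   options="" ;;            # no options were given
    esac
    printf 'server=%s path=%s options=%s\n' "$server" "$export_path" "$options"
}

split_root_path "10.0.0.10:/export/root,actimeo=120"
```

For the value in the DHCP snippet this prints server=10.0.0.10 path=/export/root options=actimeo=120, which maps directly onto the server, export and -o arguments of an NFS mount.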
Icing
Provided I haven’t missed anything, you should be able to boot a node and have it load up your Linux install. That’s the hard part.
We have init scripts which copy our web codebase onto the tmpfs ramdisk, then launch Apache and Memcache with parameters appropriate to the machine spec. Those are pretty specialist, though, so I’m not publishing them.
To finish by way of a list of credits, here’s a quick list of the other things which keep our web cluster ticking over smoothly: