Linux Kernel Tuning — Russ Garrett

Thursday 1st January, 2009

As promised in my netbooting post, here’s an annotated walkthrough of the Linux kernel tuning parameters that we use fairly constantly at Last.fm.

Many of these parameters are documented in the files under Documentation/ in a Linux source tree, but it’s generally a pain to find anything in that mess, so I’ll distill some of it here. I’ll update this as I learn more.

Networking Tuning

These are the most important settings, especially if you’re using Gigabit networking (which everyone should be!). Although these are fairly aggressive, there shouldn’t be any penalty to applying them to every server (we tend to). They are all sysctl settings.

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

The hard limits for the maximum amount of socket buffer space, in bytes. Of course 16MB per socket sounds like a lot, but most sockets won’t use anywhere near this much, and it’s nice to be able to expand if necessary.
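To see what a machine is currently using, every sysctl is also exposed as a file under /proc/sys, with the dots in the name turned into slashes. The following is a minimal sketch of that mapping in Python; the helper names sysctl_path and read_sysctl are made up for illustration, and it assumes none of the name components contain a dot themselves.

```python
from pathlib import Path


def sysctl_path(name: str) -> Path:
    """Map a dotted sysctl name to its file under /proc/sys.

    Assumes no component of the name contains a dot of its own.
    """
    return Path("/proc/sys") / name.replace(".", "/")


def read_sysctl(name: str) -> str:
    """Return the current value of a sysctl as a stripped string (Linux only)."""
    return sysctl_path(name).read_text().strip()
```

On a Linux box, read_sysctl("net.core.rmem_max") returns the current limit as a string, so it is an easy way to confirm that a change has actually been applied.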

net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

These are the corresponding settings for TCP, in the format (min, default, max) bytes. The max value can’t be larger than the equivalent net.core.{r,w}mem_max.
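As a rough sanity check, the three fields can be parsed and the max compared against the core limit. This is only a sketch: TcpBuf, parse_tcp_buf and exceeds_core_max are hypothetical names, and the values are passed in as strings and integers rather than read from the system.

```python
from typing import NamedTuple


class TcpBuf(NamedTuple):
    """The (min, default, max) triple used by tcp_rmem and tcp_wmem, in bytes."""
    minimum: int
    default: int
    maximum: int


def parse_tcp_buf(value: str) -> TcpBuf:
    """Parse a value such as '4096 87380 16777216' (spaces or tabs)."""
    fields = value.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {value!r}")
    return TcpBuf(*(int(f) for f in fields))


def exceeds_core_max(buf: TcpBuf, core_max: int) -> bool:
    """True if the TCP max is larger than the matching net.core.{r,w}mem_max."""
    return buf.maximum > core_max
```

With the settings above, parse_tcp_buf("4096 87380 16777216") compared against a core max of 16777216 is within the limit.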

net.ipv4.tcp_mem

Don’t touch tcp_mem for two reasons: Firstly, unlike tcp_rmem and tcp_wmem it’s in pages, not bytes, so it’s likely to confuse the hell out of you. Secondly, it’s already auto-tuned very well by Linux based on the amount of RAM.
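If you only want to read tcp_mem rather than change it, converting the page counts to something human-sized takes one multiplication. A minimal sketch, assuming the three fields are page counts as described above; the function names are illustrative and the page size is looked up with os.sysconf unless one is passed in.

```python
import os
from typing import List, Optional


def pages_to_bytes(pages: int, page_size: Optional[int] = None) -> int:
    """Convert a page count to bytes, using the system page size by default."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")  # Unix only
    return pages * page_size


def tcp_mem_to_mib(value: str, page_size: Optional[int] = None) -> List[float]:
    """Convert a tcp_mem string of three page counts to MiB."""
    return [pages_to_bytes(int(p), page_size) / 2**20 for p in value.split()]
```

For example, with 4096-byte pages a (made-up) tcp_mem value of "196608 262144 393216" works out to 768, 1024 and 1536 MiB.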

net.ipv4.tcp_max_syn_backlog = 4096

Increase the number of outstanding SYN requests allowed. Note: some people (including myself) have used tcp_syncookies to handle the problem of too many legitimate outstanding SYNs. I quote the Linux documentation:

Note, that syncookies is fallback facility. It MUST NOT be used to help highly loaded servers to stand against legal connection rate. If you see synflood warnings in your logs, but investigation shows that they occur because of overload with legal connections, you should tune another parameters until this warning disappear.

net.core.netdev_max_backlog = 2500

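The maximum number of packets queued on the input side when an interface receives them faster than the kernel can process them. Raising it is standard tuning for gigabit Ethernet and helps avoid dropped packets under bursts.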

VM

vm.min_free_kbytes = 65536

This tells the kernel to try and keep 64MB of RAM free at all times. It’s useful in two main cases:

Swap-less machines, where you don’t want incoming network traffic to overwhelm the kernel and force an OOM before it has time to flush any buffers.
32-bit x86 machines, for the same reason: the x86 architecture only allows DMA transfers below approximately 900MB of RAM. So you can end up with the bizarre situation of an OOM error with tons of RAM free.

vm.swappiness = 0

It’s said that altering swappiness can help you when you’re running under high memory pressure with software that tries to do its own memory management (e.g. MySQL). We’ve had limited success with this and I’d much prefer to use software which doesn’t pretend to know more about your hardware than the OS (e.g. PostgreSQL). Not that I’m bitter.

vm.overcommit_memory = 1

The overcommit_memory sysctl isn’t something you’ll usually have to change if your software isn’t insane, but our netboot setup uses it so I thought I’d mention it. From the documentation:

0 - Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.
1 - Always overcommit. Appropriate for some scientific applications.
2 - Don’t overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable percentage (default is 50) of physical RAM. Depending on the percentage you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.

For more info on this, see the overcommit accounting documentation.
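The limit described for mode 2 is simple arithmetic: swap plus a percentage of physical RAM, where the percentage is the vm.overcommit_ratio sysctl. A simplified sketch of that calculation follows; the function name is made up, and it ignores refinements such as huge pages that the kernel also accounts for.

```python
def commit_limit_kb(ram_kb: int, swap_kb: int, ratio_percent: int = 50) -> int:
    """Approximate mode-2 commit limit in kB: swap + ratio% of physical RAM.

    Simplified: the kernel's own calculation has further adjustments
    (for example, for huge pages) that are not modelled here.
    """
    return swap_kb + ram_kb * ratio_percent // 100
```

So a machine with 8GB of RAM and 2GB of swap would, at the default 50 percent, refuse allocations once roughly 6GB of address space is committed. On a swap-less box at the default ratio, the limit is only half of RAM.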

Disk Tuning

Really the only thing to note here is to set elevator=deadline in your kernel command line if you’re using RAID. This changes the IO scheduler to deadline, which has empirically been found to be best for almost all server workloads.
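To check which scheduler a disk is actually using, the kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets. A minimal sketch of parsing that line; active_scheduler is a hypothetical helper and takes the file contents as a string.

```python
import re
from typing import Optional


def active_scheduler(sysfs_value: str) -> Optional[str]:
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_value)
    return match.group(1) if match else None
```

Feeding it the contents of, say, /sys/block/sda/queue/scheduler (the device name is just an example) should give "deadline" once the boot parameter has taken effect.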

You can probably get a percent or two more out by tuning other settings, but we’ve found it’s not worth it.