We recently had an issue where one of two fibre pairs between one of our core switches and a new high-capacity edge rack got nudged a little too violently and started throwing errors in one direction. Annoyingly, this went undetected, and since the link was load-balanced, connectivity seemed fine.
Until we moved one of our user charts machines, which turned out to be quite sensitive to packet loss, into the rack. When we hit peak traffic (just about time to go to the pub), it started to time out, and we started serving and rendering the wrong charts in the wrong places. People hate it when that happens.
The tool for consistently reproducing these problems is already built into every modern Linux distribution.
Linux's traffic control (tc) subsystem includes the netem module, which provides a huge array of network emulation options: packet loss, delay, jitter, duplication, corruption and reordering. In our case, it was as simple as:
# tc qdisc change dev eth0 root netem loss 5%
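One caveat: `tc qdisc change` assumes a netem qdisc is already attached to the interface; on a fresh interface the first command is `add`. A minimal lifecycle sketch, assuming the `eth0` interface and 5% loss figure from the command above (the delay values are purely illustrative):

```shell
# Attach a netem qdisc that drops 5% of outgoing packets (requires root).
tc qdisc add dev eth0 root netem loss 5%

# Adjust an already-attached netem qdisc, e.g. also add 100ms latency
# with 20ms of jitter on top of the loss.
tc qdisc change dev eth0 root netem loss 5% delay 100ms 20ms

# Inspect what is currently attached to the interface.
tc qdisc show dev eth0

# Remove the emulation when finished, restoring the default qdisc.
tc qdisc del dev eth0 root
```

Note that netem shapes egress only, so to emulate loss in both directions you apply it on both ends of the conversation (or on an intermediate box).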
We re-ran our test and it failed first time.
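Before re-running a full test suite, it's worth confirming the emulation is actually biting; a quick ping burst from a peer machine shows the induced loss (the hostname here is hypothetical):

```shell
# From another machine, send a burst of pings at the host under emulation.
# The summary line should report roughly the configured loss rate --
# expect it to wander around 5% for short bursts, since the drops are random.
ping -c 200 -i 0.2 test-host.example.com
```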
For a detailed reference on Linux network emulation, see the tc-netem(8) man page.