Green Threads and Pipes in Python

Friday 16th December, 2011

I’ve been hacking on WAL-E, a nice little Postgres backup system from Heroku which uses gevent for concurrency. Many of my changes are related to UNIX pipelines, and I’ve run into a subtle issue which affects not only gevent but also Eventlet (which is our coroutine library of choice at Smarkets).

Here’s a trivial example, an Eventletized version of an example in the Python manual:

from eventlet.green.subprocess import Popen, PIPE

fp = open('./input.file', 'r')  # Should be reasonably large
tf = open('./output.file', 'w')

# Equivalent to: sort < input.file | cat - > output.file
p1 = Popen(['sort'], stdin=fp, stdout=PIPE)
p2 = Popen(['cat', '-'], stdin=p1.stdout, stdout=tf)
p1.stdout.close()  # Allow sort to receive SIGPIPE if cat exits first

p1.wait()
p2.wait()

You’ll get the following errors:

cat: -: Resource temporarily unavailable
sort: write failed: standard output: Broken pipe
sort: write error

The problem is that Eventlet doesn’t expect you to actually pipe data between separate processes. It assumes that you’ll be using the p1.stdout file descriptor from within your Python process, and it helpfully marks it as non-blocking for you so methods like communicate won’t block the other green threads. When you hand that file descriptor to cat, the flag is preserved, because O_NONBLOCK is set on the open file description that both processes share. cat isn’t happy when it tries to read from what it thinks is a blocking pipe and gets EAGAIN, which it reports as “Resource temporarily unavailable”.
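The flag sharing can be seen without Eventlet at all. The following is a minimal sketch using only the standard library on a POSIX system; it sets O_NONBLOCK by hand, which is an assumption about roughly what a green subprocess wrapper does, not a copy of Eventlet’s code. os.dup stands in for the descriptor a child process would inherit.

```python
import errno
import fcntl
import os

# Create a pipe and mark its read end non-blocking, roughly what a green
# subprocess wrapper does to the pipes it creates (an assumption here).
read_fd, write_fd = os.pipe()
flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)
fcntl.fcntl(read_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Duplicate the descriptor. A child process that inherits the pipe gets a
# descriptor that refers to the same open file description, like this one.
dup_fd = os.dup(read_fd)
dup_is_nonblocking = bool(fcntl.fcntl(dup_fd, fcntl.F_GETFL) & os.O_NONBLOCK)

# Reading from the empty pipe fails immediately instead of waiting for data.
try:
    os.read(dup_fd, 1)
    read_errno = None
except OSError as exc:
    read_errno = exc.errno

print(dup_is_nonblocking, read_errno == errno.EAGAIN)
```

The duplicate was never touched with fcntl, yet it reports the flag and the read fails with EAGAIN, which is the same situation cat finds itself in.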

Gevent doesn’t have a patched version of the subprocess library, but the pattern of patching stdin and stdout of Popen is repeated in a lot of gevent-using code, including within WAL-E itself.

I’m not sure if there’s an easy fix for this; the code can’t know whether you’ll be using the pipe yourself or passing it on to another process. In any case I’d rather be explicit about changing the options.
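Being explicit could look something like the sketch below. clear_nonblocking is a hypothetical helper, not part of Eventlet or gevent, and I’m assuming that clearing the flag on the descriptor just before handing it to the second process is enough; I haven’t checked this against every Eventlet version.

```python
import fcntl
import os


def clear_nonblocking(fd):
    """Clear O_NONBLOCK on fd and return True if it had been set.

    Hypothetical helper: call it on a pipe that is about to be handed to
    another process rather than read from within Python.
    """
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    was_set = bool(flags & os.O_NONBLOCK)
    if was_set:
        fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
    return was_set


# Hypothetical use with the pipeline above, before starting the second process:
#     clear_nonblocking(p1.stdout.fileno())
#     p2 = Popen(['cat', '-'], stdin=p1.stdout, stdout=tf)
```

Because the flag is shared, the pipe becomes blocking for the Python process too, so this only makes sense when the parent closes its copy and never reads from it, as the example does with p1.stdout.close().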