This is an attempt to build a fairly comprehensive list of metrics you should be monitoring on a PostgreSQL 9.4 database server. If you’ve found any particular metrics useful which aren’t listed here, please let me know.
Many thanks to the Government PaaS team at GDS, who I’ve been working with, for giving me the chance to put this list together.
System Metrics
You should be recording a comprehensive set of system performance metrics for any database server, but these are the ones you really want to keep an eye on:
Free disk space
This seems obvious, but you want to keep at least 10% of the disk free on the Postgres data partition, since disk usage can fluctuate during vacuuming, especially under heavy write load. Running out of disk space will (at best) be detrimental to your availability. Alert on it.
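Free space itself is an OS-level metric, but it also helps to know where the space is going from inside Postgres. A minimal sketch listing databases by size:

```sql
-- Rough view of where disk space is going; not a substitute for
-- OS-level free-space monitoring of the data partition.
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```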
CPU Usage
Again, pretty self-explanatory; max out the CPU and you’re going to get slowdowns. High user CPU usage can indicate badly optimised queries, but if your queries are reasonably well-optimised, being CPU-bound is a pretty good situation for Postgres to be in.
High system CPU usage can be an indication of too many queries running concurrently (also check the number of context switches).
I/O usage
The percentage of CPU time spent in iowait should be your first port of call if you’re seeing Postgres slowness – it indicates how much time the machine spends waiting for the disk.
It’s worth keeping metrics for the various individual I/O stats (IOPS, merged I/O transactions, queue size, service time, average wait time, etc.), which will help you drill down on any I/O problems. If you’ve got separate partitions for WAL and data, break the stats down per partition.
Postgres Metrics
Total number of connections
This will tell you how close you are to hitting your `max_connections` limit, and show up any clients which are leaking database connections.
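A sketch of the kind of query that covers this, comparing the current count against the configured limit:

```sql
-- Current connections vs. the max_connections limit.
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```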
Number of connections by state
A query along these lines breaks down connections by state:
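```sql
-- One row per connection state.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
```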
The possible states of interest are:
- `active`: connections currently executing queries. A large number tends to indicate DB slowness.
- `idle`: idle connections, not in a transaction.
- `idle in transaction`: connections with an open transaction, not executing a query. Lots of these can indicate long-running transactions.
- `idle in transaction (aborted)`: connections in a transaction in which an error has occurred and the transaction hasn’t been rolled back.
Connections waiting for a lock
The number of connections blocked waiting for a lock can be an indicator of a slow transaction with an exclusive lock.
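On 9.4 this is visible via the boolean `waiting` column of `pg_stat_activity` (later versions replaced it with `wait_event`); a sketch:

```sql
-- Number of backends currently blocked waiting on a lock.
SELECT count(*) FROM pg_stat_activity WHERE waiting;
```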
Maximum transaction age
Long-running transactions are bad because they prevent Postgres from vacuuming old data. This causes database bloat and, in extreme circumstances, a forced shutdown to prevent transaction ID (xid) wraparound. Transactions should be kept as short as possible, ideally less than a minute.
Alert if this number gets greater than an hour or so.
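A sketch of one way to measure it, using `pg_stat_activity`:

```sql
-- Age of the oldest open transaction; xact_start is NULL for
-- connections that aren't in a transaction, so they're ignored.
SELECT max(now() - xact_start) AS oldest_xact_age
FROM pg_stat_activity;
```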
Checkpoint interval
Checkpointing too frequently is bad for performance. Postgres will warn you in its logs about frequent checkpoints, or you can monitor the frequency in the `pg_stat_bgwriter` view (more info in this article by 2ndQuadrant).
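For example, comparing scheduled checkpoints against requested ones gives a feel for the frequency; a minimal sketch:

```sql
-- If checkpoints_req (forced by WAL volume) dominates checkpoints_timed
-- (scheduled), checkpoints are firing more often than configured.
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
```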
Query execution time
Ideally, measure this at your application level, not at the Postgres level. Alternatively, periodically log all queries by setting `log_min_duration_statement = 0` and analyze them, or investigate the `pg_stat_statements` module (tools linked below).
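As a sketch of the `pg_stat_statements` route (the module needs to be in `shared_preload_libraries` and created as an extension first; 9.4 has no `mean_time` column, so the average is derived):

```sql
-- Top queries by total time spent. Times are in milliseconds.
SELECT query,
       calls,
       total_time,
       total_time / calls AS avg_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```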
Beware that a global average query execution time means very little – you should be breaking execution time down by query, and distribution often matters more than the average. Measure percentiles, especially when your web page loads depend on these queries.
Log Archiving
If you’re archiving WAL files, for example with WAL-E, you should at least:

- Check that your WAL files are being archived (there should be a new one at least every `checkpoint_timeout` seconds); a query sketch follows this list.
- Check that your last base backup ran and succeeded.
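Postgres 9.4 introduced the `pg_stat_archiver` view, which covers the first of these from inside the database (the base backup check depends on your archiving tool); a sketch:

```sql
-- Last WAL segment archived successfully, and any recent failures.
SELECT last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal
FROM pg_stat_archiver;
```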
Streaming Replication
Standby server status
Run a query like the one below on the primary to tell you whether your standby is connected and replicating:
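```sql
-- One row per connected standby; no rows means no standby is replicating.
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;
```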
Standby server lag in bytes
This tells you how many bytes behind the primary the standby is (again, a sketch to run on the primary):
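```sql
-- WAL bytes the primary has written that the standby hasn't yet replayed.
-- (The pg_xlog_* names are the 9.4 ones; Postgres 10 renamed them pg_wal_*.)
SELECT application_name,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
         AS lag_bytes
FROM pg_stat_replication;
```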
There are queries which tell you the standby lag in seconds, but these tend to break on low-traffic databases when there are no writes being replicated.
More Performance Troubleshooting Tools
- I have a collection of Postgres views which I’ve found useful when investigating performance issues.
- `pg_stat_statements` is a module which collects detailed, normalised query statistics and exposes them as a database view for analysis.
- pg-query-analyzer and pgbadger are two query log analysis tools. They’re good for doing one-off analysis, but you don’t want to be logging all queries continuously on a busy server.
- pganalyze is a hosted performance monitoring service which makes use of `pg_stat_statements`. Their stats collector process is open source.
- PgBouncer is a Postgres connection pooler which can be helpful with some Postgres workloads. It provides some additional metrics which might be useful.
To comment on this post, mention me on mastodon, or drop me an email.