This is an attempt to build a fairly comprehensive list of metrics you should be monitoring on a PostgreSQL 9.4 database server. If you’ve found any particular metrics useful which aren’t listed here, please let me know.
You should be recording a comprehensive set of system performance metrics for any database server, but these are the ones you really want to keep an eye on:
Free disk space
This seems obvious, but you want to keep 10% available disk space on the Postgres data partition, since disk space can fluctuate during vacuuming, especially with high write loads. Running out of disk space will (at best) be detrimental to your availability. Alert on it.
Again, pretty self-explanatory; max out the CPU and you’re going to get slowdowns. High user CPU usage can indicate badly optimised queries, but if your queries are reasonably well-optimised, being CPU-bound is a pretty good situation for Postgres to be in.
High system CPU usage can be an indication of too much query parallelism (also check out the number of context switches).
CPU percentage of IOwait should be your first port of call if you’re seeing Postgres slowness – it indicates the amount of time the machine is waiting for the disk.
It’s worth keeping metrics for the various individual I/O stats (iops, merged I/O transactions, queue size, service time, average wait time, etc.) which will help you drill down on any I/O problems. If you’ve got separate partitions for WAL and data, break those down.
Total number of connections
This will tell you how close you are to hitting your
limit, and show up any clients which are leaking database connections.
SELECT count(*) FROM pg_stat_activity;
Number of connections by state
This query breaks down connections by state:
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
The possible states of interest are:
- Connections currently executing queries. A large number tends to indicate DB slowness.
- Idle connections, not in a transaction.
idle in transaction
- Connections with an open transaction, not executing a query. Lots of these can indicate long-running transactions.
idle in transaction (aborted)
- Connection is in a transaction, but an error has occurred and the transaction hasn’t been rolled back.
Connections waiting for a lock
The number of connections blocked waiting for a lock can be an indicator of a slow transaction with an exclusive lock.
SELECT count(distinct pid) FROM pg_locks WHERE granted = false;
Maximum transaction age
Long-running transactions are bad because they prevent Postgres from vacuuming old data. This causes database bloat and, in extreme circumstances, shutdown due to transaction ID (xid) wraparound. Transactions should be kept as short as possible, ideally less than a minute.
Alert if this number gets greater than an hour or so.
SELECT max(now() - xact_start) FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active');
Checkpointing too frequently is a bad thing for performance. Postgres will warn you in
its logs about frequent checkpoints, or you can monitor the frequency in
pg_stat_bgwriter table (more info in this
Query execution time
Ideally measure this at your application level, not at the Postgres level. Alternatively,
log queries periodically by setting
log_min_duration_statement = 0
and analyze them, or investigate the
(tools linked below).
Beware that a global average query execution time means very little – you should be breaking execution time down by query, and distribution often matters more than the average. Measure percentiles, especially when your web page loads depend on these queries.
If you’re archiving logs, for example with WAL-E, you should at least:
- Check that your WAL files are being archived (there should be a new
one at least every
- Check your last base backup ran and succeeded.
Standby server status
Check this to tell you whether your standby is connected and replicating (run this query on the primary):
SELECT state FROM pg_stat_replication WHERE client_hostname = '<hostname>'
Standby server lag in bytes
This tells you how many bytes behind the primary the standby is (again, run on the primary):
SELECT pg_xlog_location_diff(sent_location, replay_location) AS byte_lag FROM pg_stat_replication WHERE client_hostname = '<hostname>'
There are queries which tell you the standby lag in seconds, but these tend to break on low-traffic databases when there are no writes being replicated.
More Performance Troubleshooting Tools
- I have a collection of Postgres views which I’ve found useful when investigating performance issues.
pg_stat_statementsis a module which collects detailed, normalised query statistics and exposes them as a database view for analysis.
- pg-query-analyzer and pgbadger are two query log analysis tools. They’re good for doing one-off analysis, but you don’t want to be logging all queries continuously on a busy server.
- pganalyze is a hosted performance monitoring
service which makes use of
pg_stat_statements. Their stats collector process is open source.
- PgBouncer is a Postgres connection pool which can be helpful with some Postgres workloads. It provides some additional metrics which might be useful.