
PostgreSQL tables and indexes tend to accumulate bloat over time due to how the MVCC model handles updates and deletes. This guide helps you detect and safely remove bloat using proven, production-ready techniques. It covers lightweight and in-depth analysis methods, explains when action is needed, and shows how to clean up bloated tables and indexes with minimal disruption, before performance or storage becomes a concern.


How This Guide Can Help

If storage is growing faster than expected, or queries are getting slower for no clear reason, bloat may be the cause.

This guide will help you:

  • Spot and estimate bloat without impacting performance
  • Decide when and how to act, using safe, production-friendly tools

Table of Contents

  1. Introduction
  2. Why Bloat Happens
  3. Estimating Bloat
  4. Interpreting Results and Deciding When to Act
  5. Fixing Bloat
  6. Safety Tips and Space Requirements
  7. Summary

1. Introduction

PostgreSQL uses a Multi-Version Concurrency Control (MVCC) model to ensure high concurrency and data consistency. But this mechanism comes with a hidden cost: PostgreSQL does not immediately overwrite or remove rows that are updated or deleted. Instead, it creates new versions and leaves the old ones behind as “dead tuples” — invisible to queries, but still occupying disk space.

Over time, this leads to internal fragmentation, known as bloat.

Bloat affects both tables and indexes, and is especially harmful in high-velocity transactional systems. If left unmanaged, it gradually degrades database performance and increases operational costs.

Consequences of Bloat

  • Storage costs: bloated tables and indexes can consume 2–5× more space than needed, increasing storage and backup requirements
  • Query performance: more I/O is required to read the same amount of data, so query execution becomes slower
  • Maintenance efficiency: autovacuum works less efficiently, especially when scanning bloated indexes

2. Why Bloat Happens

Bloat is a natural side effect of PostgreSQL’s MVCC design. It accumulates due to:

  • Frequent UPDATEs and DELETEs - each change creates a new row version and marks the old one as dead. If VACUUM doesn’t clean it up in time, the space stays occupied.
  • Insufficient Autovacuum Frequency - if autovacuum runs too rarely, PostgreSQL writes new data to fresh pages instead of reusing space, causing table growth.
  • Long-running Transactions - they block VACUUM from reclaiming dead tuples by holding back the xmin horizon; the query after this list shows how to spot such sessions.
  • Mass DELETEs - large batch deletions generate many dead rows at once, which can’t be reused unless followed by INSERTs.
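
One practical way to see the long-running-transaction case is to look for sessions that are holding back the xmin horizon. The query below is a minimal sketch against the standard pg_stat_activity view (no custom objects are assumed); it lists backends whose snapshots still pin old row versions:

-- sessions ordered by how far back they hold the xmin horizon
SELECT pid, state, xact_start,
       age(backend_xmin) AS xmin_age,
       left(query, 60)   AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;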

3. Estimating Bloat

Before you can fix bloat, you need to understand where it lives — and how bad it is. PostgreSQL doesn’t track bloat explicitly, so you need to estimate it using either catalog-based queries or low-level inspection tools.

There are two main approaches:

3.1 Lightweight Estimation (SQL-based)

This method uses system catalog statistics (like pg_stat_user_tables and pg_stats) to estimate how much space is wasted in tables and indexes. These queries are safe to run even on busy production systems.

Pros:

  • Very fast and safe
  • No impact on performance
  • Can be automated and scheduled regularly

Cons:

  • Estimates only — results may be inaccurate if statistics are outdated
  • Cannot detect TOAST bloat or deduplication in indexes
  • Index bloat estimation is especially unreliable for complex cases
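
Before running the full estimation scripts, a quick and safe first pass is to read the dead-tuple counters the statistics collector already maintains. This is only a rough sketch based on the standard pg_stat_user_tables view; it reports dead-tuple counts, not true bloat percentages:

-- tables with the most dead tuples, plus the last autovacuum time
SELECT schemaname, relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;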

Recommended scripts by ioguix: table_bloat.sql for tables and btree_bloat.sql for B-tree indexes.

3.2 Accurate Estimation (pgstattuple-based)

The pgstattuple extension inspects the physical layout of tables and indexes to provide accurate bloat information. It has two modes of operation:

  • pgstattuple() — performs a full-table scan and returns exact statistics for live tuples, dead tuples, and free space. Works for both tables and indexes.
  • pgstattuple_approx() — a faster alternative that uses the visibility map to skip pages containing only visible tuples, returning approximate results. This function works only for tables, not indexes.

Pros:

  • Accurate low-level bloat data (for both tables and indexes)
  • Includes TOAST bloat

Cons:

  • Can generate significant I/O load, especially on large indexes
  • May still read large parts of the table, depending on vacuum status
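
A minimal usage sketch is shown below. The relation names (public.orders, orders_pkey) are placeholders, and the pgstattuple extension has to be installed first; pgstatindex, which ships with the same extension, covers B-tree indexes:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- exact statistics for a table (full scan)
SELECT * FROM pgstattuple('public.orders');

-- faster approximation that skips all-visible pages (tables only)
SELECT * FROM pgstattuple_approx('public.orders');

-- B-tree index statistics, including avg_leaf_density and leaf_fragmentation
SELECT * FROM pgstatindex('public.orders_pkey');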

I recommend using this method selectively — for the most critical tables and indexes only — and ideally during non-peak hours. Example scripts are available in the GitHub repository: table_bloat_approx.sql for tables and index_bloat.sql for indexes.

💡 Tip: Start with lightweight estimation to identify likely candidates, then validate the most suspicious objects using pgstattuple.


4. Interpreting Results and Deciding When to Act

Once you have estimated the level of bloat in your tables and indexes, the next step is deciding what to do with that information. Not all bloat is equally harmful, and trying to remove every bit of it may waste time and resources.

  • Tables with more than 20% bloat: consider repacking
  • Tables larger than several GB: action may still be justified even at a lower bloat percentage
  • Indexes with more than 40% bloat: consider rebuilding
  • Frequently scanned indexes with 20–30% bloat: consider rebuilding if query performance is degraded

💡 Tip: These thresholds are not strict rules — context matters. For example, a 5 GB table with 50% bloat may be less critical than a 100 GB table with 20% bloat that’s accessed constantly.
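
To tie these thresholds to actual measurements, one hedged approach is to compare the dead and free space fractions reported by pgstattuple with the table's total size. The table name below is illustrative, and some free space is normal (fillfactor headroom, recently vacuumed pages), so treat the sum only as a rough signal:

-- approximate bloat fraction for one table, alongside its total size
SELECT pg_size_pretty(pg_total_relation_size('public.orders')) AS total_size,
       dead_tuple_percent,
       free_percent,
       dead_tuple_percent + free_percent AS approx_bloat_pct
FROM pgstattuple('public.orders');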


5. Fixing Bloat

Once you’ve identified bloated tables and indexes that require action, PostgreSQL provides several safe options to clean them up with minimal downtime.

I recommend starting with the least intrusive method that suits your situation, especially on production systems.

5.1 Rebuilding Indexes with REINDEX CONCURRENTLY

PostgreSQL’s built-in REINDEX CONCURRENTLY command allows you to rebuild an index without blocking reads or writes.

When to use:

  • You’ve identified bloated indexes (typically >40% bloat)
  • The table itself is not bloated, or you want to avoid a full table rewrite
  • You want a quick, safe fix with minimal impact on application queries

REINDEX INDEX CONCURRENTLY idx_users_email;

Pros:

  • Safe to use in production
  • No long locks on the table
  • No external tools required

Cons:

  • Only applies to indexes (not tables)
  • Requires disk space for the new index copy
  • Slower than standard REINDEX due to concurrent safety

💡 Tip: Focus on indexes that are frequently scanned or used in JOIN/WHERE conditions.
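
REINDEX CONCURRENTLY requires PostgreSQL 12 or later, and on those versions you can watch the rebuild from another session via the pg_stat_progress_create_index view. A simple sketch:

-- progress of concurrent index builds/rebuilds, one row per backend
SELECT pid,
       relid::regclass       AS table_name,
       index_relid::regclass AS index_name,
       phase,
       blocks_done,
       blocks_total
FROM pg_stat_progress_create_index;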

5.2 Repacking Tables with pg_repack

For bloated tables, pg_repack is the most reliable and widely used option. It rewrites the table and its indexes in the background, allowing continued read/write access during the operation.

When to use:

  • Tables with significant bloat (>20%)
  • Large tables that can’t be locked for extended periods
  • When you need to clean both the table and its indexes in one operation

pg_repack -d mydb -t bloated_table

Pros:

  • Cleans both tables and their indexes in one run
  • Minimal locking (only at start/end of operation)
  • Supports parallel jobs and custom ordering

Cons:

  • Requires installation of the pg_repack extension
  • Needs additional disk space (typically 2–3× table size)
  • May briefly block DDLs or long-running queries

I recommend running pg_repack directly on the database server via a terminal multiplexer (e.g., tmux, screen) to avoid session drops during long-running operations.
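
As a concrete but hedged example, the invocation below combines the options discussed in this guide; the database, table, and job count are placeholders, and you should confirm the flags against the pg_repack version you have installed:

# run on the database server, inside tmux or screen
pg_repack -d mydb -t bloated_table --jobs=2 --wait-timeout=60 --no-kill-backend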

⚠️ Warning:

  • If you run pg_repack remotely and the connection drops (e.g., via SSH or a load balancer), it can leave behind long-lived locks and block access to the table — especially with --no-kill-backend. See pg_repack issue #456: https://github.com/reorg/pg_repack/issues/456
  • If a repack operation fails or is interrupted, use the documented cleanup steps to remove temporary objects before retrying. Don’t skip cleanup — leftover objects can block future repack attempts.

6. Safety Tips and Space Requirements

Fixing bloat is safe — if done correctly. But repacking large tables or rebuilding large indexes can temporarily increase disk usage and create unexpected load on the system.

6.1 Disk Space Planning

Repacking and rebuilding objects require temporary space to hold the new copies.

  • Standard operations: free space equal to the size of the table or index being rebuilt
  • Very large tables: additional free space for temporary objects, WAL, and sorting/buffering
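
To estimate how much temporary space a given object will need, it helps to check its current on-disk footprint first. A quick sketch using standard size functions (the relation name is a placeholder):

-- heap, index, and total size of one relation
SELECT pg_size_pretty(pg_table_size('public.orders'))          AS table_size,
       pg_size_pretty(pg_indexes_size('public.orders'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('public.orders')) AS total_size;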

6.2 Timing Recommendations

Even though REINDEX CONCURRENTLY and pg_repack are designed for online use, they can still:

  • Increase I/O (especially for large objects)
  • Delay autovacuum
  • Create lock contention during short critical phases

Best practices:

  • Run during maintenance windows when possible
  • Process in small batches (e.g., one table/index at a time)
  • Avoid peak traffic hours

6.3 Monitor Blocking and Sessions

During the critical locking phase at the start and end of pg_repack, queries that touch the same table will be blocked briefly.

Recommendations:

  • Use --wait-timeout to control how long it waits for locks
  • Consider --no-kill-backend if you prefer to wait rather than terminate conflicting sessions
  • Actively monitor blocked queries (pg_stat_activity) during the operation
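
For the last point, a minimal sketch of a blocking check against pg_stat_activity (pg_blocking_pids() is available since PostgreSQL 9.6):

-- sessions that are currently blocked, and which PIDs block them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;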

7. Summary

Bloat in PostgreSQL is a common but often overlooked issue that can silently degrade performance and waste disk space.

┌─────────────────────────────────────────────────────────────────┐
│  1. ESTIMATE                                                    │
│     Use lightweight SQL queries or pgstattuple extension        │
├─────────────────────────────────────────────────────────────────┤
│  2. ANALYZE                                                     │
│     Apply thresholds: >20% for tables, >40% for indexes         │
├─────────────────────────────────────────────────────────────────┤
│  3. FIX                                                         │
│     Use REINDEX CONCURRENTLY (indexes) or pg_repack (tables)    │
└─────────────────────────────────────────────────────────────────┘

By following these steps, you can keep your PostgreSQL database lean, fast, and cost-efficient.


Quick Reference

  • Estimate table bloat: ioguix table_bloat.sql (SQL query)
  • Estimate index bloat: ioguix btree_bloat.sql (SQL query)
  • Accurate bloat analysis: SELECT * FROM pgstattuple('table_name'); (pgstattuple extension)
  • Rebuild index online: REINDEX INDEX CONCURRENTLY idx_name; (built-in PostgreSQL)
  • Repack table online: pg_repack -d dbname -t table_name (pg_repack)