Published on

# Benchmark: Single-Column Indexes vs. Compound Indexes

I’ve known about compound indexes for many years but never really thought about them in detail. That doesn’t mean, I’ve never used them – quite on the contrary. At some point in the past I’ve made up some assumptions about when to use them and when to avoid them. However, until now I’ve never taken the time to put my assumptions to the test.

## Benchmark Setup

First off, we need a table with records to run the benchmark against:

``````DROP TABLE IF EXISTS t;
CREATE TABLE t (a INTEGER NOT NULL, b INTEGER NOT NULL);
-- populate with one million records:
INSERT INTO t (a, b)
SELECT * FROM GENERATE_SERIES(1,1000) AS a
CROSS JOIN GENERATE_SERIES(1,1000) AS b;``````

Next, we need some random numbers to query for. I decided to compute these numbers in advance to avoid the random number generator during benchmark execution:

``````DROP TABLE IF EXISTS randx, randy;
SELECT TRUNC( 1 + (1000 * RANDOM()))::INTEGER AS x INTO randx FROM GENERATE_SERIES(1,50);
SELECT TRUNC( 1 + (1000 * RANDOM()))::INTEGER AS y INTO randy FROM GENERATE_SERIES(1,50);``````

Now, we have two new relations (named `randx` and `randy`), each containing 50 random numbers between 1 and 1000. Finally, we are defining three functions to repeatedly perform `SELECT` queries with different `WHERE` clauses. To get comparable results, these functions must all perform the same number of queries.

``````-- performs 5000 `SELECT * FROM t WHERE a = ?;` queries...
CREATE OR REPLACE FUNCTION benchmark_where_a ()
RETURNS VOID LANGUAGE plpgsql AS \$\$
DECLARE
x INTEGER;
y INTEGER;
BEGIN
FOR x IN SELECT * FROM randx LOOP
FOR y IN SELECT * FROM randy LOOP
PERFORM COUNT(*) FROM t WHERE t.a = x;
PERFORM COUNT(*) FROM t WHERE t.a = y;
END LOOP;
END LOOP;
END;
\$\$;

-- performs 5000 `SELECT * FROM t WHERE b = ?;` queries...
CREATE OR REPLACE FUNCTION benchmark_where_b ()
RETURNS VOID LANGUAGE plpgsql AS \$\$
DECLARE
x INTEGER;
y INTEGER;
BEGIN
FOR x IN SELECT * FROM randx LOOP
FOR y IN SELECT * FROM randy LOOP
PERFORM COUNT(*) FROM t WHERE t.b = x;
PERFORM COUNT(*) FROM t WHERE t.b = y;
END LOOP;
END LOOP;
END;
\$\$;

-- performs 5000 `SELECT * FROM t WHERE a = ? AND b = ?;` queries...
CREATE OR REPLACE FUNCTION benchmark_where_ab ()
RETURNS VOID LANGUAGE plpgsql AS \$\$
DECLARE
x INTEGER;
y INTEGER;
BEGIN
FOR x IN SELECT * FROM randx LOOP
FOR y IN SELECT * FROM randy LOOP
PERFORM COUNT(*) FROM t WHERE t.a = x AND t.b = y;
PERFORM COUNT(*) FROM t WHERE t.a = y AND t.b = x;
END LOOP;
END LOOP;
END;
\$\$;``````

## Benchmark Execution

Ensure that the query optimizer uses all available indexes and doesn’t opt to use sequential scans:

``SET enable_seqscan = off;``

Benchmark without indexes:

``````EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();``````

Benchmark with index on `(a)`:

``````CREATE INDEX t_a_idx ON t (a);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_a_idx;``````

Benchmark with index on `(b)`:

``````CREATE INDEX t_b_idx ON t (b);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_b_idx;``````

Benchmark with indexes on `(a)` and `(b)`:

``````CREATE INDEX t_a_idx ON t (a);
CREATE INDEX t_b_idx ON t (b);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_a_idx, t_b_idx;``````

Benchmark with compound index on `(a,b)`:

``````CREATE INDEX t_ab_idx ON t (a,b);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_ab_idx;``````

Benchmark with compound index on `(b,a)`:

``````CREATE INDEX t_ba_idx ON t (b,a);
EXPLAIN ANALYZE SELECT benchmark_where_a();
EXPLAIN ANALYZE SELECT benchmark_where_b();
EXPLAIN ANALYZE SELECT benchmark_where_ab();
DROP INDEX IF EXISTS t_ba_idx;``````

## Benchmark Results

Index `WHERE a=?` `WHERE b=?` `WHERE a=? AND b=?`
None 406007 ms 426321 ms 400189 ms
`(a)` 731 ms 427019 ms 813 ms
`(b)` 402615 ms 3259 ms 3277 ms
`(a)`, `(b)` 716 ms 3364 ms 1161 ms
`(a,b)` 725 ms 85071 ms 57 ms
`(b,a)` 80956 ms 3441 ms 55 ms

Bold text indicates, that no sequential table scans were required.

## Conclusion

A compound index on `(a,b)`

1. performs much better than two separate indexes on `(a)` and `(b)` in `WHERE a=? AND b=?` queries.
2. performs mostly identical to an index on `(a)` in `WHERE a=?` queries.
3. is useless in `WHERE b=?` queries. Therefore, the order of columns in the compound index matters: `(a,b)` does not equal `(b,a)`.

Strangely, queries with `WHERE a=?` are consistently faster than queries with `WHERE b=?`. I’m not sure why this is the case – might have something to do with `a` being the table’s first column.

Notes
Software: PostgreSQL v9.3.1
Hardware: MacBook Pro with 2.7 GHz and 16 GB RAM