Repairing Corrupted Indexes with REINDEX
ing
by Christoph Schiessl on PostgreSQL
I recently published an article explaining how to alphabetically order by columns using a custom ENUM
type. For illustration, I defined the type marital_status
and the table people
.
postgres=# CREATE TYPE marital_status_type
postgres-# AS ENUM ('single', 'married', 'divorced', 'other');
CREATE TYPE
postgres=# CREATE TABLE people (
postgres(# id SERIAL PRIMARY KEY,
postgres(# marital_status marital_status_type NOT NULL
postgres(# );
CREATE TABLE
postgres=# CREATE INDEX index_people_on_marital_status
postgres-# ON people (marital_status);
CREATE INDEX
Furthermore, I also defined an IMMUTABLE
function to typecast the ENUM
values to TEXT
so that I could create an index for the alphabetical ordering.
postgres=# CREATE FUNCTION marital_status_as_text(marital_status marital_status_type)
postgres-# RETURNS TEXT LANGUAGE SQL IMMUTABLE AS $sql$ SELECT CAST(marital_status AS TEXT); $sql$;
CREATE FUNCTION
postgres=# CREATE INDEX index_people_on_marital_status_text
postgres-# ON people (marital_status_as_text(marital_status));
CREATE INDEX
This function was needed because the typecast itself is mutable! You can think about this typecast as a function whose output depends not only on its parameter but also on external factors. Another way to say this is that the typecast to TEXT
is not referentially transparent precisely because of its dependency on factors other than its parameter. You may get different results if you call the function multiple times, even though you use the same parameters. In PostgreSQL terminology, referentially transparent functions are called IMMUTABLE
; certain operations work only with such functions. The creation of INDEX
es is a prime example of this restriction.
postgres=# INSERT INTO people (marital_status)
postgres-# VALUES ('single'), ('married'), ('divorced'), ('other');
INSERT 0 4
postgres=# SET SESSION enable_seqscan = off;
SET
postgres=# -- Query using the index `index_people_on_marital_status` ...
postgres=# SELECT * FROM people WHERE marital_status = 'other'::marital_status_type;
id | marital_status
----+----------------
4 | other
(1 row)
postgres=# -- Query using the index `index_people_on_marital_status_text` ...
postgres=# SELECT * FROM people WHERE marital_status_as_text(marital_status) = 'other'::TEXT;
id | marital_status
----+----------------
4 | other
(1 row)
So, we told PostgreSQL the function marital_status_as_text
is IMMUTABLE
, which was needed so that the function could be used to define an INDEX
. However, given that the function is just a wrapper around the typecast, which is not referentially transparent, it cannot be IMMUTABLE
. So, essentially, we have lied to PostgreSQL, but I have not yet explained the external factors that influence the typecast. Well, the typecast returns the name of the ENUM
value as it was defined when we created the custom type. This is a problem because these values can be renamed later on. For instance, we may rename the value other
to unknown
.
postgres=# ALTER TYPE marital_status_type
postgres-# RENAME VALUE 'other' TO 'unknown';
ALTER TYPE
The problem arises because we already had data in the people
table before the rename. Now, we have a funny situation — note that I'm setting enable_seqscan=off
to force the Query Planner to use our indexes:
postgres=# SET SESSION enable_seqscan = off;
SET
postgres=# SELECT * FROM people WHERE marital_status = 'unknown'::marital_status_type;
id | marital_status
----+----------------
4 | unknown
(1 row)
postgres=# SELECT * FROM people WHERE marital_status_as_text(marital_status) = 'unknown'::TEXT;
id | marital_status
----+----------------
(0 rows)
So, we get different results for essentially the same query. If the query uses the ENUM
value directly, we get a single row as a result, but if we search based on the corresponding TEXT
value, we get an empty result set. This is clearly not the expected behavior! But it gets even worse because depending on the planned query (with or without index usage), we can also get different results for one-to-one the same query.
postgres=# -- turn off sequential scans to force index scans ...
postgres=# SET SESSION enable_seqscan = off;
SET
postgres=# SET SESSION enable_indexscan = DEFAULT;
SET
postgres=# SET SESSION enable_bitmapscan = DEFAULT;
SET
postgres=# SELECT * FROM people WHERE marital_status_as_text(marital_status) = 'other'::TEXT;
id | marital_status
----+----------------
4 | unknown
(1 row)
postgres=# -- turn off index scans to force sequential scans ...
postgres=# SET SESSION enable_seqscan = DEFAULT;
SET
postgres=# SET SESSION enable_indexscan = off;
SET
postgres=# SET SESSION enable_bitmapscan = off;
SET
postgres=# SELECT * FROM people WHERE marital_status_as_text(marital_status) = 'other'::TEXT;
id | marital_status
----+----------------
(0 rows)
Crazy stuff, right? The data tracked by the index is out of sync with the data in the table — the index is now said to be corrupted. This is 100% our fault because we incorrectly declared our function as IMMUTABLE
, and PostgreSQL took us at our word. So, there is no way it could have known that renaming the ENUM
value would affect the index. Anyway, you should never do something like this in the first place, but what should you do if you nonetheless find yourself in a situation like this? There's only one thing that you can do, really, ...
REINDEX
ing
Your only option is to rebuild the corrupted indexes. In this case, we have only one corrupted index, and we can start a rebuild with a simple REINDEX
statement.
postgres=# REINDEX INDEX index_people_on_marital_status_text;
REINDEX
We can rerun the same experiment to confirm that the index is back in sync with the table's data.
postgres=# SET SESSION enable_seqscan = off;
SET
postgres=# SET SESSION enable_indexscan = DEFAULT;
SET
postgres=# SET SESSION enable_bitmapscan = DEFAULT;
SET
postgres=# SELECT * FROM people WHERE marital_status_as_text(marital_status) = 'other'::text;
id | marital_status
----+----------------
(0 rows)
postgres=# SET SESSION enable_seqscan = DEFAULT;
SET
postgres=# SET SESSION enable_indexscan = off;
SET
postgres=# SET SESSION enable_bitmapscan = off;
SET
postgres=# SELECT * FROM people WHERE marital_status_as_text(marital_status) = 'other'::text;
id | marital_status
----+----------------
(0 rows)
Voilà! The problem is fixed, but not its root cause. If we now renamed another ENUM
value, we would run into the same problem again. The lesson here is that you must be careful when defining custom functions. Namely, you can run into issues if you set your functions' volatility class to be stricter than they really are.
That's everything for today! Thank you for reading, and see you soon!