unique constraints with `NULL`able columns
some versions of the sql standard dictate that `NULL` values do not compare equal. this can make creating unique indexes with `NULL`able column members difficult. but it can be done with this semi-clean workaround!
i wrote the original version of this document around 6 months ago as a sort of
"note to self" at work. at the time, i was working with postgresql 14. funny
enough, postgresql 15 (to be released imminently) obviates the need for the
approach described in this post; pg15 contains a new feature to allow `UNIQUE` and `NULL` to work better together via `UNIQUE NULLS NOT DISTINCT`.
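for the curious, the pg15 flavor looks roughly like this (a sketch based on the pg15 release notes; the table and column names are made up, and none of this runs on the pg14 setup this post was written against):

```sql
-- pg15+: ask the unique index to treat NULLs as equal to each other
CREATE UNIQUE INDEX uix_example
ON some_table (col_a, col_b)
NULLS NOT DISTINCT;
```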
sometimes, when dealing with a sql database, we want to add a unique constraint to a table to ensure that certain kinds of rows can only appear once in that table. we might do this to offload some work from the application to the database, and so that we get a loud error from the database when something would happen to violate the unique constraint (which tips us off that our application business logic likely has a bug).
however, while unique constraints are a great tool, their usefulness is sometimes hindered by a critical piece of the sql standard pertaining to `NULL`: two `NULL` values do not compare equal. since a unique constraint uses a check of equality between two rows to see if there has been a uniqueness violation, rows which contain `NULL`able values need to be treated specially. if you don't, two "identical" rows won't be caught as non-unique if they both have a `NULL` for a value in the same slot. (as a simple example, suppose we have a table with 3 columns of nullable ints. trying to insert unique rows `(5,5,5), (7,7,7)` is fine. trying to insert non-unique rows `(5,5,5), (5,5,5)` is not fine, and the database will correctly reject such an operation with an error. however, inserting non-unique rows when a `NULL` is involved changes things: inserting `(5,5,NULL), (5,5,NULL)` will be allowed by the database. why? the final `NULL`s do not compare as equal despite both having the same non-value of `NULL`, resulting in the rows themselves not comparing as equal. depending on our schema and application, there are some cases where we'd prefer the `NULL`s be considered equal, and as such we'd rather the database also reject this insert operation!)
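you can see this behavior without spinning up a postgres instance: python's built-in `sqlite3` module is enough, since sqlite follows the same NULLs-are-distinct rule for unique constraints. (the table and column names here are made up for the demo.)

```python
# demo: two NULL values do not compare equal for a UNIQUE constraint.
# sqlite is used as a stand-in; it shares this rule with postgres 14.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a int, b int, c int, UNIQUE (a, b, c))")

# exact duplicates with no NULLs are rejected...
conn.execute("INSERT INTO t VALUES (5, 5, 5)")
try:
    conn.execute("INSERT INTO t VALUES (5, 5, 5)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# ...but "duplicate" rows that contain a NULL both slip in
conn.execute("INSERT INTO t VALUES (5, 5, NULL)")
conn.execute("INSERT INTO t VALUES (5, 5, NULL)")  # no error!
print(conn.execute("SELECT count(*) FROM t WHERE c IS NULL").fetchone()[0])  # 2
```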
okay-- so sometimes we want a column's `NULL` values to compare as equal. well, (in postgres) you can't; this is specified by the sql standard and postgres 14 adheres to it. an ugly way to progress would be to give up on nullability, add `NOT NULL` to every column that you want to be a part of the uniqueness constraint, and try to salvage things (adjust your queries, propagate the change to systems that work with the database, etc.). personally, i think that sounds quite messy and isn't a satisfying "solution".
to actually progress, we need to replace `NULL`s with a substitute value for which equality actually works the way we want, but only in the eye of our uniqueness constraint. we want the actual rows stored to genuinely store `NULL` if the inserted value is `NULL`, but the index powering the uniqueness constraint to see a different value. can we do this? spoiler: the answer is yes.
postgres offers functional (unique) indices to provide the "treat values used by the index differently from the actual row value" behavior described above. we just need a function that can convert `NULL`s to a substitute value, but leaves non-`NULL` values alone. postgres has a built-in function for this: `coalesce(value1, [value2, [...]])`, which returns the first argument passed in (`valueN`) which is non-`NULL`. so, we can easily use something like `coalesce(column_name, substitute)` and know we'll either get the row's real value or our constant `substitute` as a result (always some value, never `NULL`, if `substitute` is hard-coded and non-`NULL`).
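a quick way to convince yourself of `coalesce`'s behavior (again using sqlite as a stand-in; its `coalesce` works the same way as postgres's for our purposes):

```python
# coalesce() returns its first non-NULL argument, leaving non-NULL
# values untouched and swapping NULLs for the fallback.
import sqlite3

conn = sqlite3.connect(":memory:")
print(conn.execute("SELECT coalesce(NULL, -1)").fetchone()[0])       # -1
print(conn.execute("SELECT coalesce(443, -1)").fetchone()[0])        # 443
print(conn.execute("SELECT coalesce(NULL, NULL, 7)").fetchone()[0])  # 7
```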
we need to pick a "sentinel" value `substitute` for each (nullable) column we transform in our functional unique index this way. this is actually worth putting some thought into. ideally we'd want our sentinel value to have two properties: 1) it is a value of the type of the column (this is actually required for the schema to even be valid), and 2) it is a value that rows will never take for this column, preventing any confusion / collision scenarios.

to explain, let's imagine a table that stores two columns which represent network source and destination ports (both nullable) and a textual note; additionally, we want port pairings to be unique across the table (even if one or both of the values are `NULL`). this means we want our unique index to reject inserting `(80, 443, 'a'), (80, 443, 'b')` (non-unique port pairing, textual note doesn't affect uniqueness) as well as `(80, NULL, 'a'), (80, NULL, 'b')` (non-unique port pairing, with `NULL`s that are rejected as being the "same").
so far we have a rough idea of what we want:
```sql
CREATE TABLE ports (
  src_port smallint CHECK (src_port IS NULL OR src_port BETWEEN 0 AND 65535),
  dest_port smallint CHECK (dest_port IS NULL OR dest_port BETWEEN 0 AND 65535),
  note text NOT NULL
);

CREATE UNIQUE INDEX uix_port_pairs
ON ports (
  coalesce(src_port, sentinel_goes_here),
  coalesce(dest_port, sentinel_goes_here)
  /* no `note` in unique index! doesn't factor in */
);
```
we need to put a `smallint` in each of the `sentinel_goes_here` slots (property #1, and required for a valid schema). we don't want whatever we put there to show up in actual day-to-day use of the table; this is important so our functional index can tell the difference between a real row value vs. a value that is standing in for the row's `NULL`. a trick that works excellently here is to use a sentinel value that would be rejected by the column's check constraint. it's actually fine to index such a value for a column, plus you are assured that no row will ever be able to exist with a bad value because such operations will cause `CHECK` errors. since the valid domain for ports is `BETWEEN 0 AND 65535` (inclusive), how about a sentinel value of... `-1`?
```sql
CREATE TABLE ports (
  src_port smallint CHECK (src_port IS NULL OR src_port BETWEEN 0 AND 65535),
  dest_port smallint CHECK (dest_port IS NULL OR dest_port BETWEEN 0 AND 65535),
  note text NOT NULL
);

CREATE UNIQUE INDEX uix_port_pairs
ON ports (
  coalesce(src_port, -1),
  coalesce(dest_port, -1)
);
```
let's give it a spin:
```sql
/* this insert works ok! */
INSERT INTO ports (src_port, dest_port, note) VALUES
  (80, 443, 'http to https'),     -- indexed as (80, 443)
  (NULL, 80, 'unsure to http'),   -- indexed as (-1, 80)
  (NULL, NULL, 'kinda vacuous!'); -- indexed as (-1, -1)
```
let's try to sneak a row into the table that would be a collision, meaning a (src_port, dest_port) value pair of `(80, 443)`, `(NULL, 80)`, or `(NULL, NULL)`:
```sql
INSERT INTO ports (src_port, dest_port, note) VALUES
  (80, 443, 'different note, still rejected'); -- index sees (80, 443)
/* => error! */

INSERT INTO ports (src_port, dest_port, note) VALUES
  (NULL, 80, ''); -- index sees (-1, 80)
/* => error! */

INSERT INTO ports (src_port, dest_port, note) VALUES
  (NULL, NULL, 'these NULLs compare equal in unique index'); -- index sees (-1, -1)
/* => error! */
```
now, let's try intentionally adding rows to our table that would interfere with our unique index by using the sentinel value in our row:
```sql
INSERT INTO ports (src_port, dest_port, note) VALUES
  (-1, 80, 'trying to collide with "unsure to http"');
/* => error! check constraint violation */

INSERT INTO ports (src_port, dest_port, note) VALUES
  (-1, -1, 'trying to collide with "vacuous"');
/* => error! check constraint violation */
```
to summarize:
sometimes we want to ensure rows are unique across a table, considering `NULL`s as the same, but the sql standard can make that difficult if any of the involved columns of the uniqueness constraint are nullable, because `NULL`s do not compare equal (including for the purposes of comparisons in a unique constraint).

we can fix this problem by using a functional unique index and postgres's `coalesce()` function to modify the row's values in the eyes of the unique index so that none of them contain `NULL`. this requires a valid "sentinel" value of the column's core type, ideally one that a row will never take on in that column anyway. if the relevant column has a `CHECK` constraint, a perfect choice for the sentinel value is one that is outside the valid domain of the `CHECK`.