How to migrate from Django’s PostgreSQL CI Fields to use a case-insensitive collation

If you upgrade to Django 4.2, you may see system check warnings like:
example.User.email: (fields.W906) django.contrib.postgres.fields.CIEmailField is deprecated. Support for it (except in historical migrations) will be removed in Django 5.1.
HINT: Use EmailField(db_collation="…") with a case-insensitive non-deterministic collation instead.
In this post we’ll cover why this has changed, and how to perform the necessary migration. You can actually migrate from Django 3.2, and may want to, because the alternative uses more accurate and flexible Unicode rules.
What changed, why
As covered in the miscellaneous release notes, three field classes and a mixin are deprecated:
CICharField
CIEmailField
CITextField
CIText
mixin
The CI*
fields provide Case-Insensitive versions of their vanilla counterparts. These use the citext module built-in to PostgreSQL. But, since version 12, PostgreSQL discourages using citext, stating:
Consider using nondeterministic collations instead of this module. They can be used for case-insensitive comparisons, accent-insensitive comparisons, and other combinations, and they handle more Unicode special cases correctly.
Essentially, collations are more featureful and correct.
The deprecation in Django 4.2 encourages you to follow PostgreSQL’s advice. You can specify a collation for CharField
, TextField
, and subclasses like EmailField
, with the db_collation
argument (added in Django 3.2). So for case-insensitive fields in PostgreSQL, you now need to create a case-insensitive collation and use it.
Django Fellow Mariusz Felisiak suggested this change in January 2022. Then in July that year, he then worked on it, in Ticket #33872. Thank you Mariusz for this tidy-up!
What to do
To deal with this deprecation, you need to migrate each CI*Field
to its vanilla variant, with an appropriate case-insensitive collation. All the pieces are in place for this migration if you are using Django 3.2+ with PostgreSQL 12+, so you can make the change well ahead of upgrading to Django 4.2.
1. Add a new collation
As the release notes say, you need a “case-insensitive non-deterministic collation”. Let’s deconstruct that:
- A collation is a specification of how to compare and sort strings.
- Case-insensitive: upper and lower case letters compare equal, and sort together.
- Non-deterministic: a PostgreSQL term meaning that the collation needs to be used for sorting - PostgreSQL cannot rely on sorting by byte values.
Okay, so how do you create such a collation? Here’s an example from the PostgreSQL documentation on non-deterministic collations:
CREATE COLLATION case_insensitive (
provider = icu, locale = 'und-u-ks-level2', deterministic = false
);
This might be the SQL you want, but let’s unpack it before looking at how to run it in Django migrations.
CREATE COLLATION
- creates a new collation within the current schema.case_insensitive
- our chosen name for the collation.provider = icu
- tells PostgreSQL to use the ICU library (International Components for Unicode) for this collation. You will need a version of PostgreSQL that was compiled with ICU, as per the managing collation docs.It’s normally available in OS packages, Docker images, and managed providers, etc. If you’re cautious, you can check with:
example=# SELECT EXISTS(SELECT * FROM pg_collation WHERE collname LIKE '%-x-icu'); exists -------- t (1 row)
und-u-ks-level2
is a language tag with extensions. It may look like gobbledygook at first, but we can systematically deconstruct it using the Unicode standards:und
- undetermined language, which activates the Unicode “root collation”. This collation sorts symbols first, then alphabetically per script.You may wish to use language-specific sorting, in which case swap this for a language code, e.g.
de
for Germany.-u-
- this specifies that what follows are extra “Extension U” Unicode attributes. These are all drawn from this table.ks-level2
- theks
attribute defines the collation strength, here set to “level 2”. Strength level 2 doesn’t include case in comparisons, only letters and accents. The best explanation of the strength levels I found was in the ICU documentation.You may wish to use level 1 instead, for accent-insensitive comparisons.
Another neat attribute you might want to add (after another
-
):kn-true
- numeric ordering. This will sort e.g. “a-20” before “a-100”.deterministic = false
tells PostgreSQL the collation is non-deterministic, as above.
Okay so that’s the collation creation SQL analyzed. Time to put this statement into a migration. You could use RunSQL
, but it’s more Djangoey and reversible to use the CreateCollation
operation (added in Django 3.2).
Create an empty migration in your “main” or “core” Django app:
$ ./manage.py makemigrations example --empty --name case_insensitive_collation
Migrations for 'example':
example/migrations/0005_case_insensitive_collation.py
Then, edit that migration file to include a CreateCollation
operation:
from django.contrib.postgres.operations import CreateCollation
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
("example", "0004_something_or_other"),
]
operations = [
CreateCollation(
"case_insensitive",
provider="icu",
locale="und-u-ks-level2",
deterministic=False,
),
]
Using the sqlmigrate
command, you can see the SQL that Django will run for the migration, forwards and backwards:
$ python manage.py sqlmigrate example 0005
BEGIN;
--
-- Create collation case_insensitive
--
CREATE COLLATION "case_insensitive" (locale="und-u-ks-level2", provider="icu", deterministic=false);
COMMIT;
$ python manage.py sqlmigrate example 0005 --backwards
BEGIN;
--
-- Create collation case_insensitive
--
DROP COLLATION "case_insensitive";
COMMIT;
Alright, step one done 😅.
2. Alter fields
Replace all CI*Field
s in your models with their vanilla variants, specifying db_collation
with your collation name. For example, to update this User
class:
class User(AbstractBaseUser):
- email = CIEmailField(_("email address"), unique=True)
+ email = models.EmailField(_("email address"), db_collation="case_insensitive", unique=True)
...
Then, generate migrations:
$ ./manage.py makemigrations
Migrations for 'users':
users/migrations/0010_alter_user_email.py
- Alter field email on user
(If you have models in your “main” / “core” app, you may wish to combine with the above CreateCollation
migration.)
Check the new migration with sqlmigrate
as well:
$ ./manage.py sqlmigrate users 0010
BEGIN;
--
-- Alter field email on user
--
DROP INDEX IF EXISTS "users_user_email_243f6e77_like";
ALTER TABLE "users_user" ALTER COLUMN "email" TYPE varchar(254) COLLATE "case_insensitive";
COMMIT;
$ ./manage.py sqlmigrate users 0010 --backwards
BEGIN;
--
-- Alter field email on user
--
DROP INDEX IF EXISTS "users_user_email_243f6e77_like";
ALTER TABLE "users_user" ALTER COLUMN "email" TYPE citext;
COMMIT;
All looks as expected, with ALTER COLUMN
statements to change the column’s data type.
(The DROP INDEX IF EXISTS
statement ensures no “like” index exists. This is an index that Django adds for CharField
/ TextField
classes, to speed up SQL LIKE
matches. Setting db_collation
makes Django remove any such index, since non-deterministic collations do not support LIKE
, or such an index. Since the column previously used the citext
type, no such index will exist anyway.)
One caveat: if you’re moving from CICharField
or CIEmailField
to their vanilla variants, your data might be too long. The CI*Fields
all use the citext
type, which is unlimited in size, whilst CharField
and EmailField
use varchar
. If your data doesn’t fit, PostgreSQL will error at migration time like:
silvr@a67bfe06cb5c:/opt/silvr$ ./manage.py migrate users 0010
Operations to perform:
Target specific migration: 0010_alter_user_email, from users
Running migrations:
Applying users.0010_alter_user_email...Traceback (most recent call last):
...
File "/.../site-packages/django/db/backends/utils.py", line 89, in _execute
return self.cursor.execute(sql, params)
django.db.utils.DataError: value too long for type character varying(254)
You can check ahead of time by querying your database:
In [2]: from django.db.models.functions import Length
In [3]: from django.db.models import Max
In [4]: User.objects.aggregate(m=Max(Length("email")))
Out[4]: {'m': 312}
You can mitigate this by specifying a larger max_length
, or migrating to TextField
instead.
3. Test-ify
With more involved model definitions, it’s best to test to ensure all the pieces are working correctly. Here are a couple tests checking the above User.email
field is case-insensitive and unique:
from django.db import IntegrityError
from django.test import TestCase
from users.models import User
class UserTests(TestCase):
def test_email_case_insensitive_search(self):
user = User.objects.create(email="Hacker@example.com")
user2 = User.objects.get(email="hacker@example.com")
assert user == user2
def test_email_case_insensitive_unique(self):
User.objects.create(email="Hacker@example.com")
msg = 'duplicate key value violates unique constraint "users_user_email_key"'
with self.assertRaisesMessage(IntegrityError, msg):
User.objects.create(email="hacker@example.com")
Adapt as necessary.
4. Adjust queries
Some of Django’s lookups, like icontains
, map to a SQL LIKE
. PostgreSQL does not allow such queries with nondeterministic collations—you will see this error if such a query is attempted:
NotSupportedError: nondeterministic collations are not supported for LIKE
One solution is to use the Collate
database function to create a differently-collated annotation for the field, and filter against that:
from django.db.models.functions import Collate
from users.models import User
User.objects.annotate(
email_deterministic=Collate("email", "und-x-icu"),
).filter(email_deterministic__icontains="example")
und-x-icu
is a general purpose, language-agnostic Unicode collation from ICU.
I encountered this within the admin, which uses icontains
by default for search fields. So the solution was to add the annotation in get_queryset()
and then declare it in search_fields
:
...
class UserAdmin(BaseUserAdmin):
search_fields = ("email_deterministic", ...)
...
def get_queryset(self, request: HttpRequest) -> QuerySet[User]:
return (
super()
.get_queryset(request)
.annotate(
email_deterministic=Collate("email", "und-x-icu"),
)
)
You may want to check for performance regressions, since such lookups probably need table scans. That said, it’s likely LIKE
against citext
columns also requires a table scan.
Alright, that is all it takes to migrate
Maybe drop citext
too
If you’ve migrated all your CI*Field
s, you may want to remove the citext extension from your database. This will save a sliver of memory in PostgreSQL, and it will make it a bit harder to add CI*Field
s in the future.
Django only has a PostgreSQL CreateExtension
operation, so you need to use RunSQL
to drop the extension. Here’s an example that you can put into a migration:
migrations.RunSQL(
sql='DROP EXTENSION IF EXISTS "citext"',
reverse_sql='CREATE EXTENSION IF NOT EXISTS "citext"',
)
Adding new case-insensitive db_collation
fields can fail
If you try to directly add a new field with db_collation
set to a nondeterministic collation, the migration will fail on Django <4.2 with:
django.db.utils.NotSupportedError: nondeterministic collations are not supported for operator class "varchar_pattern_ops"
Check sqlmigrate
for the migration:
$ ./manage.py sqlmigrate users 0010
BEGIN;
--
-- Add field email to user
--
ALTER TABLE "users_user" ADD COLUMN "email" varchar(254) COLLATE "case_insensitive" NULL UNIQUE;
CREATE INDEX "users_user_email_0c800cea_like" ON "users_user" ("email" varchar_pattern_ops);
COMMIT;
The statement that fails is the CREATE INDEX
. Django creates such indexes by default for varchar
and text
fields to speed up LIKE
queries. But as we’ve seen, they aren’t supported when using a nondeterministic collation.
Django 4.2 includes a fix to not create this index when db_collation
is declared: Ticket #33901. On earlier versions, you can edit your migration to use RunSQL
with the CREATE INDEX
statement removed:
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
("users", "0009_something_something"),
]
operations = [
migrations.RunSQL(
sql="""
ALTER TABLE "users_user" ADD COLUMN "email" varchar(254) COLLATE "case_insensitive" NULL UNIQUE;
""",
reverse_sql="""
ALTER TABLE "users_user" DROP COLUMN "email" CASCADE;
""",
state_operations=[
migrations.AddField(
model_name="user",
name="email",
field=models.EmailField(
db_collation="case_insensitive",
max_length=254,
null=True,
unique=True,
verbose_name="email address",
),
),
],
),
]
(Grab reverse_sql
from sqlmigrate --backwards
.)
Fin
I wrote this post whilst dealing with the migration for my client Silvr. Thank you to Anna Bierońska and Pascal Fouque there for reviewing the post and PR. And thanks to the original deprecator Mariusz Felisiak for reviewing the post too.
And in any case: Good luck, good luck, GOOD LUCK, …
—Adam
Learn how to make your tests run quickly in my book Speed Up Your Django Tests.
One summary email a week, no spam, I pinky promise.
Related posts:
- Migrate PostgreSQL IDs from serial to identity after upgrading to Django 4.1
- How to optimize PostgreSQL queries from Django using pgMustard
Tags: django, postgresql