Django REST framework: pagination on PostgreSQL triggers

Sat 02 April 2016

Django and Django REST Framework use SQL COUNT in pagination. As your database grows SQL COUNT becomes too slow. Fortunately the frameworks are well designed and allow to customize a way items are count. Let me illustrate that on a typical "books" example.

class Author(models.Model):
    name = models.CharField()

    class Meta:
        db_table = 'author'


class Book(models.Model):
    author = models.ForeignKey(Author, related_name='books')

    class Meta:
        db_table = 'book'

It has been working fine for some time but authors are so productive and wrote thousands of books. As a result alice.books.count() takes a few seconds to execute. One option is to store books count in Author model which might be helpful in other SQL queries. Another one is to keep count somewhere else, e.g., Redis.

class Author(models.Model):
    name = models.CharField()
    books_count = models.PositiveIntegerField(null=True)

    class Meta:
        db_table = 'author'

Note that books_count is nullable due to existing data on production servers. Setting a default value 0 on schema migration would take long to update every author record, also it's hard to tell whether an author record is already migrated (books_count is calculated or not).

Now we have Author.books_count field which should keep track of new books. You can implement that based on Django signals or SQL triggers.

Signals might fail, even with send_robust. Updating a books count in a signal receiver slows down an application (two to three DB queries and acquiring a DB lock on author record):

@receiver(post_save, sender=Book, dispatch_uid='update_author_books_count')
def update_author_books_count(sender, instance, created, **kwargs):
    if not created:
        return
    book = instance

    with atomic():
        author = Author.objects.select_for_update().get(pk=book.author_id)
        if author.books_count is None:
            author.books_count = author.books.count()
        else:
            author.books_count += 1
        author.save()

Moving an update logic from a signal receiver to Celery task doesn't look promising -- books_count would be inconsistent between a task trigger and its execution or simply Celery task might fail. Another issue is that a model signal is not "attached" to DB transaction. For example, a Celery task is run before a DB transaction is committed. That is usually quick fixed with task_name.apply_async(countdown=1) to delay a task execution, but I would rather recommend something like django-transaction-hooks.

What about SQL trigger option? A bad thing is that a business logic leaks from an app to DB. But with a good documentation and tests it's maintainable: SQL code is stored in a data migration module.

$ python manage.py makemigrations --empty yourappname

from django.db import migrations

CREATE_FUNCTION_SQL = """
CREATE OR REPLACE FUNCTION update_books_count_on_author() RETURNS trigger AS $$
    BEGIN
        UPDATE author SET books_count =
            CASE
                WHEN books_count IS NULL THEN (
                    SELECT count(*) FROM book WHERE author_id = author.id
                )
                ELSE books_count + 1
            END
        WHERE id = NEW.author_id;

        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;
"""

CREATE_TRIGGER_SQL = """
CREATE TRIGGER books_count_update
    AFTER INSERT ON book
    FOR EACH ROW
    EXECUTE PROCEDURE update_books_count_on_author();
"""


class Migration(migrations.Migration):
    dependencies = []
    operations = [
        migrations.RunSQL(CREATE_FUNCTION_SQL),
        migrations.RunSQL(CREATE_TRIGGER_SQL),
    ]

Here is a pitfall though. When a new book is created and author is updated on the same DB transaction, then books_count value might be overwritten.

with atomic():
    author = Author.objects.select_for_update().get(pk=author_id)
    Book.objects.create(author=author)
    author.name = 'Bob'
    author.save()

You can either explicitly list fields to update author.save(update_fields=['name']) or use django-save-the-change. Let's document that in the model docstring.

from save_the_change.mixins import SaveTheChange


class Author(SaveTheChange, models.Model):
    """Author, e.g., Terry Pratchett.

    :attribute books_count: How many books writer has. SQL COUNT is
        expensive operation, so we store calculated value and update it
        by SQL trigger (check a data migration module for details).
        It's important to save only fields that were updated in the model.
        Otherwise SQL trigger's results are overwritten by Django ORM.
        For example:

        1. author is requested with a lock (books_count = 1)
        2. new book is created
        3. SQL trigger updates author's books_count field (now it is 2)
        4. author instance is saved with the old value of books_count = 1.

        SaveTheChange mixin helps to prevent it.

    """

To benefit from books_count field in Django REST Framework we need a custom pagination class which implements Paginator.count property. The idea is to extract author ID from paginator's SQL, query a books count from Author model and return it, instead of default Book.objects.filter(author_id=author_id).count().

from django.core.paginator import Paginator
from rest_framework import pagination
from rest_framework.viewsets import ReadOnlyModelViewSet


class BookViewSet(ReadOnlyModelViewSet):
    pagination_class = BookPagination


class BookPagination(pagination.PageNumberPagination):
    django_paginator_class = CachedBookCountPaginator


class CachedBookCountPaginator(Paginator):
    @cached_property
    def count(self):
        """Return the total number of books, across all pages.

        It parses a SQL and learns what author ID was requested
        based on ``self.object_list.query``. After that we can get
        a cached books count from Author model.

        """
        # There is query.where, but I could't find an author ID easily.
        # Moreover query.where internals might be changed.
        sql = str(self.object_list.query)
        author_id = self._get_author_id_from_sql(sql)

        author = Author.objects.get(pk=author_id).only('books_count')
        # In case we got unsynced author, we fallback to SQL COUNT.
        if author.books_count is None:
            return self.query_count()

        return author.books_count

    def query_count(self):
        """Request books count from DB.

        We need this method to facilitate testing (mocks).

        """
        return super(CachedBookCountPaginator, self).count

@classmethod
def _get_author_id_from_sql(cls, sql):
    pass

I hope this helps. Cheers!

Category: Python Tagged: python django postgresql django rest framework pagination

Sat 02 April 2016

Comments