How to add a robots.txt to your Django site

Those Steaming Robots

robots.txt is a standard file to communicate to “robot” crawlers, such as Google’s Googlebot, which pages they should not crawl. You serve it on your site at the root URL /robots.txt, for example https://example.com/robots.txt.

To add such a file to a Django application, you have a few options.

You could serve it from a web server outside your application, such as nginx. The downside of this approach is that if you move your application to a different web server, you’ll need to redo that configuration. Also, you might be tracking your application code in Git but not your web server configuration, and it’s best to keep your robots rules under version control so you can track changes to them.

The approach I favour is serving it as a normal URL from within Django. It becomes another view that you can test and update over time. Here are a couple of approaches to do that.

With a Template

This is the easiest approach. It keeps the robots.txt file in a template and simply renders it at the URL.

First, add a new template called robots.txt in your root templates directory, or in your “core” app’s templates directory:

User-Agent: *
Disallow: /private/
Disallow: /junk/

User-agent: GPTBot
Disallow: /

(The second rule there tells OpenAI’s GPTBot crawler not to crawl your site, so your content isn’t used to train its models such as ChatGPT. For a full list of rules to block AI companies, see Neil Clarke’s post.)

Second, add a urlconf entry:

from django.urls import path
from django.views.generic.base import TemplateView


urlpatterns = [
    # ...
    path(
        "robots.txt",
        TemplateView.as_view(template_name="robots.txt", content_type="text/plain"),
    ),
]

This creates a new view directly inside the URLconf, rather than importing it from views.py. This is not the best idea, since it’s mixing the layers in one file, but it’s often done pragmatically to avoid extra lines of code for simple views.
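If you’d rather keep the view out of the URLconf, one alternative is a small TemplateView subclass in your app’s views.py. A sketch, using the “core” app layout from this post (file paths and names are illustrative):

# core/views.py
from django.views.generic.base import TemplateView


class RobotsTxtView(TemplateView):
    template_name = "robots.txt"
    content_type = "text/plain"


# urls.py
from django.urls import path

from core.views import RobotsTxtView

urlpatterns = [
    # ...
    path("robots.txt", RobotsTxtView.as_view()),
]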

We need to set content_type to text/plain to serve it as a text document, rather than the default text/html.

After this is in place, you should be able to run python manage.py runserver and see the file served at http://localhost:8000/robots.txt (or the equivalent URL for your runserver setup).

With a Custom View

This is a slightly more flexible approach. Using a view, you can add custom logic, such as checking the Host header and serving different content per domain. It also means you don’t need to worry about template variables being HTML-escaped, which would be incorrect for the plain-text format.

First, add a new view, in your “core” app:

from django.http import HttpResponse
from django.views.decorators.http import require_GET

robots_txt_content = """\
User-Agent: *
Disallow: /private/
Disallow: /junk/

User-agent: GPTBot
Disallow: /
"""


@require_GET
def robots_txt(request):
    return HttpResponse(robots_txt_content, content_type="text/plain")

We’re using Django’s require_GET decorator to restrict to only GET requests. Class-based views already do this, but we need to think about it ourselves for function-based views.

We define the robots.txt content as a module-level string in Python and return it in an HttpResponse on each request.
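If you want the per-domain behaviour mentioned earlier, the view can branch on request.get_host() instead of returning one constant. A minimal sketch, with made-up hostnames and rules you’d replace with your own:

from django.http import HttpResponse
from django.views.decorators.http import require_GET

# Hypothetical per-host rules; keys must match what request.get_host() returns.
ROBOTS_TXT_BY_HOST = {
    "example.com": "User-Agent: *\nDisallow: /private/\n",
    "staging.example.com": "User-Agent: *\nDisallow: /\n",
}

# Fallback for unrecognized hosts: disallow everything.
DEFAULT_ROBOTS_TXT = "User-Agent: *\nDisallow: /\n"


@require_GET
def robots_txt(request):
    content = ROBOTS_TXT_BY_HOST.get(request.get_host(), DEFAULT_ROBOTS_TXT)
    return HttpResponse(content, content_type="text/plain")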

Second, add a urlconf entry:

from django.urls import path
from core.views import robots_txt

urlpatterns = [
    # ...
    path("robots.txt", robots_txt),
]

Again, you should be able to check this on runserver.

Testing

As I wrote above, one of the advantages of serving this from Django is that we can test it. Automated tests will guard against accidental breakage of the code, or removal of the URL.

You can add some basic tests in a file like core/tests/test_views.py:

from http import HTTPStatus

from django.test import TestCase


class RobotsTxtTests(TestCase):
    def test_get(self):
        response = self.client.get("/robots.txt")

        assert response.status_code == HTTPStatus.OK
        assert response["content-type"] == "text/plain"
        assert response.content.startswith(b"User-Agent: *\n")

    def test_post_disallowed(self):
        response = self.client.post("/robots.txt")

        assert response.status_code == HTTPStatus.METHOD_NOT_ALLOWED

Run the tests with python manage.py test core.tests.test_views. It’s also a good idea to check they are actually being run by making them fail, for example by commenting out the entry in the URLconf.
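If you went with the custom view, you can also pin the full body to the module-level constant, so the test fails if either side drifts. A sketch, assuming the constant lives in core/views.py as above:

from django.test import TestCase

from core.views import robots_txt_content


class RobotsTxtContentTests(TestCase):
    def test_content_matches_constant(self):
        response = self.client.get("/robots.txt")

        assert response.content.decode() == robots_txt_content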

Checker

If you have a complicated set of robots.txt rules, you’ll want to run them through a checker after you deploy. Google’s checker seems to be the de facto standard; see their webmasters page.
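For a quick programmatic check, the standard library’s urllib.robotparser can fetch and parse the file, then answer allow/disallow questions. A small sketch, assuming runserver is up on port 8000 with the rules from earlier:

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt served by the local development server.
parser = RobotFileParser("http://localhost:8000/robots.txt")
parser.read()

# Ask whether a generic crawler may fetch particular paths.
print(parser.can_fetch("*", "http://localhost:8000/private/secret/"))  # False
print(parser.can_fetch("*", "http://localhost:8000/blog/"))  # True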

Django-Robots

If you want to control your robots.txt rules from your database, there’s a Jazzband package called django-robots. I haven’t used it, but it seems well maintained. It also supports some rules outside the core standard, like pointing crawlers at your sitemap.

Fin

Hope this helps you control those robots,

—Adam

