365
votes

In Django doc,

select_related() "follows" foreign-key relationships, selecting additional related-object data when it executes its query.

prefetch_related() does a separate lookup for each relationship, and does the "joining" in Python.

What does it mean by "doing the joining in python"? Can someone illustrate with an example?

My understanding is that for foreign key relationship, use select_related; and for M2M relationship, use prefetch_related. Is this correct?

4
Performing the join in python means that the join will not happen in the database. With a select_related, your join happens in the database and you only suffer one database query. With prefetch_related, you will be executing two queries and then the results will be 'joined' by the ORM so you can still type object.related_setMark Galloway
As a footnote, Timmy O'Mahony can also explain their differences using database hits: linkMærcos

4 Answers

507
votes

Your understanding is mostly correct. You use select_related when the object that you're going to be selecting is a single object, so OneToOneField or a ForeignKey. You use prefetch_related when you're going to get a "set" of things, so ManyToManyFields as you stated or reverse ForeignKeys. Just to clarify what I mean by "reverse ForeignKeys" here's an example:

class ModelA(models.Model):
    pass

class ModelB(models.Model):
    a = ForeignKey(ModelA)

ModelB.objects.select_related('a').all() # Forward ForeignKey relationship
ModelA.objects.prefetch_related('modelb_set').all() # Reverse ForeignKey relationship

The difference is that select_related does an SQL join and therefore gets the results back as part of the table from the SQL server. prefetch_related on the other hand executes another query and therefore reduces the redundant columns in the original object (ModelA in the above example). You may use prefetch_related for anything that you can use select_related for.

The tradeoffs are that prefetch_related has to create and send a list of IDs to select back to the server, this can take a while. I'm not sure if there's a nice way of doing this in a transaction, but my understanding is that Django always just sends a list and says SELECT ... WHERE pk IN (...,...,...) basically. In this case if the prefetched data is sparse (let's say U.S. State objects linked to people's addresses) this can be very good, however if it's closer to one-to-one, this can waste a lot of communications. If in doubt, try both and see which performs better.

Everything discussed above is basically about the communications with the database. On the Python side however prefetch_related has the extra benefit that a single object is used to represent each object in the database. With select_related duplicate objects will be created in Python for each "parent" object. Since objects in Python have a decent bit of memory overhead this can also be a consideration.

28
votes

Both methods achieve the same purpose, to forego unnecessary db queries. But they use different approaches for efficiency.

The only reason to use either of these methods is when a single large query is preferable to many small queries. Django uses the large query to create models in memory preemptively rather than performing on demand queries against the database.

select_related performs a join with each lookup, but extends the select to include the columns of all joined tables. However this approach has a caveat.

Joins have the potential to multiply the number of rows in a query. When you perform a join over a foreign key or one-to-one field, the number of rows won't increase. However, many-to-many joins do not have this guarantee. So, Django restricts select_related to relations that won't unexpectedly result in a massive join.

The "join in python" for prefetch_related is a little more alarming then it should be. It creates a separate query for each table to be joined. It filters each of these table with a WHERE IN clause, like:

SELECT "credential"."id",
       "credential"."uuid",
       "credential"."identity_id"
FROM   "credential"
WHERE  "credential"."identity_id" IN
    (84706, 48746, 871441, 84713, 76492, 84621, 51472);

Rather than performing a single join with potentially too many rows, each table is split into a separate query.

7
votes

Gone through the already posted answers. Just thought it would be better if I add an answer with actual example.

Let' say you have 3 Django models which are related.

class M1(models.Model):
    name = models.CharField(max_length=10)

class M2(models.Model):
    name = models.CharField(max_length=10)
    select_relation = models.ForeignKey(M1, on_delete=models.CASCADE)
    prefetch_relation = models.ManyToManyField(to='M3')

class M3(models.Model):
    name = models.CharField(max_length=10)

Here you can query M2 model and its relative M1 objects using select_relation field and M3 objects using prefetch_relation field.

However as we've mentioned M1's relation from M2 is a ForeignKey, it just returns only 1 record for any M2 object. Same thing applies for OneToOneField as well.

But M3's relation from M2 is a ManyToManyField which might return any number of M1 objects.

Consider a case where you have 2 M2 objects m21, m22 who have same 5 associated M3 objects with IDs 1,2,3,4,5. When you fetch associated M3 objects for each of those M2 objects, if you use select related, this is how it's going to work.

Steps:

  1. Find m21 object.
  2. Query all the M3 objects related to m21 object whose IDs are 1,2,3,4,5.
  3. Repeat same thing for m22 object and all other M2 objects.

As we have same 1,2,3,4,5 IDs for both m21, m22 objects, if we use select_related option, it's going to query the DB twice for the same IDs which were already fetched.

Instead if you use prefetch_related, when you try to get M2 objects, it will make a note of all the IDs that your objects returned (Note: only the IDs) while querying M2 table and as last step, Django is going to make a query to M3 table with the set of all IDs that your M2 objects have returned. and join them to M2 objects using Python instead of database.

This way you're querying all the M3 objects only once which improves performance.

-3
votes

As Django documentation says:

prefetch_related()

Returns a QuerySet that will automatically retrieve, in a single batch, related objects for each of the specified lookups.

This has a similar purpose to select_related, in that both are designed to stop the deluge of database queries that is caused by accessing related objects, but the strategy is quite different.

select_related works by creating an SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.

prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python. This allows it to prefetch many-to-many and many-to-one objects, which cannot be done using select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related. It also supports prefetching of GenericRelation and GenericForeignKey, however, it must be restricted to a homogeneous set of results. For example, prefetching objects referenced by a GenericForeignKey is only supported if the query is restricted to one ContentType.

More information about this: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#prefetch-related