Django: Queryset distinct clause performance issue

【Question title】: Django: Queryset distinct clause performance issue
【Posted】: 2019-01-03 14:06:45
【Question】:

I'm trying to remove duplicates from a database query. Here are my models:

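from django.db import models

class Restaurant(models.Model):
    # max_length is required on CharField; 255 is a placeholder value.
    name = models.CharField(max_length=255, db_index=True)

class InformationSheet(models.Model):
    # on_delete is required since Django 2.0; CASCADE is assumed here.
    owner = models.ForeignKey(Restaurant, on_delete=models.CASCADE, related_name='sheet')
    latitude = models.DecimalField(max_digits=10, decimal_places=6)
    longitude = models.DecimalField(max_digits=10, decimal_places=6)

    class Meta:
        indexes = [
            models.Index(fields=['latitude', 'longitude', 'owner']),
        ]

class Availability(models.Model):
    # ForeignKey columns are indexed by default, so db_index=True is redundant.
    restaurant = models.ForeignKey(Restaurant, on_delete=models.CASCADE, db_index=True)
    # Supplier is defined elsewhere in the project (not shown in the question).
    supplier = models.ForeignKey(Supplier, on_delete=models.CASCADE)

    class Meta:
        indexes = [
            models.Index(fields=['restaurant', 'supplier']),
        ]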

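I need to select the information sheets that lie between given GPS coordinates and whose restaurant has an availability defined for the given suppliers.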

    suppliers = [1, 2, 3]
    sheets = InformationSheet.objects.filter(
        latitude__gte=lat_start,
        latitude__lte=lat_end,
        longitude__gte=long_min,
        longitude__lte=long_max,
        owner__availability__supplier_id__in=suppliers
    ).distinct()

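The table has a few hundred thousand entries. The generated SQL query was fast at first, but adding the distinct clause to remove the duplicates makes it too slow for my needs, because distinct prevents the index from being used.

How should I proceed?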

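【Comments】:

I find this strange, since these are all foreign keys and Django puts an index on foreign keys by default anyway. DISTINCT has some impact, but it should be fairly limited (it can simply keep track of what has already been enumerated).
With PostgreSQL, you can use .distinct('distinct_field_name') to compare only one field (typically your id) instead of the whole row. docs.djangoproject.com/en/2.0/ref/models/querysets/…
@WillemVanOnsem I added the query plan, which looks fine... and I had already tried applying distinct to the id field only, but the improvement wasn't great (about 10 ms less).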
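
The duplicates in the question come from the join on availability: a sheet is returned once per matching availability row, and DISTINCT then has to collapse them. A minimal sketch of that effect, and of an EXISTS subquery that never produces the duplicates in the first place, is below. It uses Python's built-in sqlite3 only to stay self-contained (the thread is about PostgreSQL), and the table names, column names, and sample rows are simplified assumptions rather than the poster's real schema. Whether EXISTS is actually faster on the real data is not established here. In Django the same shape could probably be written with `Exists` and `OuterRef` from `django.db.models`; that mapping is a suggestion, not something the thread confirms.

```python
import sqlite3

# In-memory stand-in schema; table and column names are simplified assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sheet (
        id INTEGER PRIMARY KEY, owner_id INTEGER, latitude REAL, longitude REAL);
    CREATE TABLE availability (
        id INTEGER PRIMARY KEY, restaurant_id INTEGER, supplier_id INTEGER);
    CREATE INDEX availability_idx ON availability (restaurant_id, supplier_id);

    INSERT INTO sheet (id, owner_id, latitude, longitude) VALUES
        (1, 10, 48.85, 2.35), (2, 20, 45.76, 4.83), (3, 30, 43.30, 5.37);
    INSERT INTO availability (restaurant_id, supplier_id) VALUES
        (10, 1), (10, 2), (10, 3), (20, 2), (30, 9);
""")

# Bounding box in the order lat_start, lat_end, long_min, long_max.
BOX = (40.0, 50.0, 0.0, 10.0)
WHERE_BOX = "s.latitude BETWEEN ? AND ? AND s.longitude BETWEEN ? AND ?"


def ids(sql):
    """Run a query with the bounding-box parameters and return the sheet ids."""
    return [row[0] for row in conn.execute(sql, BOX)]


# Plain join: one output row per matching availability row.
joined = ids(f"""
    SELECT s.id FROM sheet s
    JOIN availability a ON a.restaurant_id = s.owner_id
    WHERE {WHERE_BOX} AND a.supplier_id IN (1, 2, 3)
    ORDER BY s.id""")

# Join plus DISTINCT: duplicates are produced, then removed.
deduplicated = ids(f"""
    SELECT DISTINCT s.id FROM sheet s
    JOIN availability a ON a.restaurant_id = s.owner_id
    WHERE {WHERE_BOX} AND a.supplier_id IN (1, 2, 3)
    ORDER BY s.id""")

# EXISTS: each sheet is tested once, so no duplicates are ever produced.
via_exists = ids(f"""
    SELECT s.id FROM sheet s
    WHERE {WHERE_BOX} AND EXISTS (
        SELECT 1 FROM availability a
        WHERE a.restaurant_id = s.owner_id AND a.supplier_id IN (1, 2, 3))
    ORDER BY s.id""")

print(joined)        # [1, 1, 1, 2]
print(deduplicated)  # [1, 2]
print(via_exists)    # [1, 2]
```

Sheet 1 appears three times in the plain join because its restaurant has three matching suppliers. The DISTINCT and EXISTS variants return the same ids, so comparing them with EXPLAIN ANALYZE on the real database would show whether the rewrite helps.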

Tags: django postgresql django-models django-queryset distinct

【Solution 1】:

Limit  (cost=1.12..325.09 rows=30 width=215)
  ->  Unique  (cost=1.12..76866.65 rows=7118 width=215)
        ->  Nested Loop  (cost=1.12..76848.85 rows=7118 width=215)
              ->  Nested Loop  (cost=0.70..73308.59 rows=7118 width=219)
                    ->  Index Scan using restaurant_sheet_pkey on accommodation_sheet  (cost=0.42..12290.52 rows=193562 width=215)
                          Filter: ((latitude >= '-180.0000000'::numeric) AND (latitude <= 180.0000000) AND (longitude >= '-200.0000000'::numeric) AND (longitude <= 200.0000000))
                    ->  Index Only Scan using restaur_resta_i_3e16b7_idx on restaurant_availability  (cost=0.28..0.31 rows=1 width=4)
                          Index Cond: (hotel_id = restaurant_sheet.owner_id)
                          Filter: (supplier_id = ANY ('{1,2,3}'::integer[]))
              ->  Index Only Scan using restaurant_restaurant_pkey on restaurant_restaurant  (cost=0.42..0.50 rows=1 width=4)
                    Index Cond: (id = restaurant_sheet.owner_id)

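Actually, looking at the query plan, the indexes are being used, so I don't understand why the query takes 500 ms (and that is with only part of our database). I have an SSD, and so does our dev server.

I think my model structure is badly designed, but I really don't know how to fix it.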
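
As a rough aid to reading the plan above, the sketch below does the arithmetic on the planner's cost figures. The numbers are copied from the plan; the interpretation is an assumption on my part. These are planner estimates in arbitrary cost units, not measured times, since the plan shows no actual timings. Running `EXPLAIN (ANALYZE, BUFFERS)` would give the real ones.

```python
# Cost estimates copied from the EXPLAIN output above (planner units, not ms).
unique_total = 76866.65              # Unique node, total cost
outer_join_total = 76848.85          # outer Nested Loop (input of Unique)
inner_join_total = 73308.59          # inner Nested Loop
sheet_scan_total = 12290.52          # Index Scan on the sheet table
sheet_scan_rows = 193562             # estimated rows from that scan
availability_probe_total = 0.31      # one Index Only Scan probe on availability

# Estimated cost the Unique (DISTINCT) step adds on top of its input.
unique_overhead = unique_total - outer_join_total

# Estimated cost of probing the availability index once per scanned sheet row.
probe_cost = sheet_scan_rows * availability_probe_total

# Cost the inner nested loop adds on top of the sheet scan itself.
join_overhead = inner_join_total - sheet_scan_total

print(f"Unique step adds:        {unique_overhead:.2f}")
print(f"Per-row probes estimate: {probe_cost:.2f}")
print(f"Inner join adds:         {join_overhead:.2f}")
```

On these estimates the Unique step adds about 17.80 cost units, while the per-row probes into the availability index account for roughly 60,000 of the 61,000 units added by the inner join. The latitude and longitude bounds also appear as a Filter on a primary-key index scan rather than as an Index Cond, which suggests the composite index on the coordinates is not what narrows the scan in this plan.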

【Discussion】: