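Find the most frequent value per group in a table column

【Question title】: Find the most frequent value per group in a table column
【Posted】: 2021-12-09 19:41:50
【Question description】:

I need to find the most frequent value of object_of_search for each ethnicity. How can I do this? Subqueries in the SELECT clause and correlated subqueries are not allowed. Something like: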

mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"

But this does not aggregate, and it returns many rows for each combination of ethnicity and object_of_search:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
---------------------------+-------------------+------------------+--------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms
 ethnicity3                |                 2 |              100 | Firearms
 ethnicity1                |                 5 |               60 | Cat
 ethnicity1                |                 5 |               60 | Dog
 ethnicity2                |                 3 | 66.6666666666667 | Firearms
 ethnicity1                |                 5 |               60 | Psychoactive substances
 ethnicity1                |                 5 |               60 | Fireworks

It should look like this:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
---------------------------+-------------------+------------------+--------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms

The table is available in a fiddle.
Query:

SELECT DISTINCT
    stopAndSearches.officer_defined_ethnicity,
    count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity) AS "Sas for ethnicity",
    sum(case when stopAndSearches.outcome = 'Arrest' then 1 else 0 end)
       OVER (PARTITION BY stopAndSearches.officer_defined_ethnicity)::float /
       count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity)::float * 100 AS "Arrest rate",
    mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"
FROM stopAndSearches
GROUP BY stopAndSearches.sas_id, stopAndSearches.officer_defined_ethnicity;

Table:

CREATE TABLE IF NOT EXISTS stopAndSearches(
    "sas_id" bigserial PRIMARY KEY,
    "officer_defined_ethnicity" VARCHAR(255),
    "object_of_search" VARCHAR(255),
    "outcome" VARCHAR(255)
);

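【Question discussion】:

Tags: sql postgresql greatest-n-per-group

【Solution 1】:

Update: Fiddle

This should solve the specific "object per ethnicity" problem.

Note that this does not resolve ties in the counts. That was not part of the question/request.

Adjust your SQL to include this logic to provide that detail: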

WITH cte AS (
        SELECT officer_defined_ethnicity
             , object_of_search
             , COUNT(*) AS n
             , ROW_NUMBER() OVER (PARTITION BY officer_defined_ethnicity ORDER BY COUNT(*) DESC) AS rn
          FROM stopAndSearches
         GROUP BY officer_defined_ethnicity, object_of_search
     )
SELECT * FROM cte
 WHERE rn = 1
;

Result:

 officer_defined_ethnicity | object_of_search | n | rn
---------------------------+------------------+---+----
 ethnicity1                | Cat              | 1 |  1
 ethnicity2                | Stolen goods     | 2 |  1
 ethnicity3                | Fireworks        | 1 |  1
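
If ties should be reported rather than resolved arbitrarily, one possible variation is to replace ROW_NUMBER() with RANK(), so that every object_of_search sharing the highest count within an ethnicity gets rank 1. This is only a sketch against the table from the question; the alias rnk is chosen here for illustration and is not part of the original answer.

```sql
-- Sketch: RANK() assigns the same rank to tied counts,
-- so all equally frequent values per ethnicity are returned.
WITH cte AS (
        SELECT officer_defined_ethnicity
             , object_of_search
             , COUNT(*) AS n
             , RANK() OVER (PARTITION BY officer_defined_ethnicity
                            ORDER BY COUNT(*) DESC) AS rnk  -- rnk: illustrative alias
          FROM stopAndSearches
         GROUP BY officer_defined_ethnicity, object_of_search
     )
SELECT officer_defined_ethnicity, object_of_search, n
  FROM cte
 WHERE rnk = 1
 ORDER BY officer_defined_ethnicity, object_of_search;
```

With this variant an ethnicity can appear on several rows, one per tied value, so it fits cases where the caller wants to see the ties instead of a single pick.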

【Discussion】:

【Solution 2】:

SELECT DISTINCT ON (1)
       officer_defined_ethnicity, object_of_search, count(*) AS ct
FROM   stop_and_searches
GROUP  BY 1, 2
ORDER  BY 1, 3 DESC, 2;

Or, more explicitly:

SELECT DISTINCT ON (officer_defined_ethnicity)
       officer_defined_ethnicity, object_of_search, count(*) AS ct
FROM   stop_and_searches
GROUP  BY officer_defined_ethnicity, object_of_search
ORDER  BY officer_defined_ethnicity, ct DESC, object_of_search;

 officer_defined_ethnicity | object_of_search | ct
---------------------------+------------------+----
 ethnicity1                | Cat              | 1
 ethnicity2                | Stolen goods     | 2
 ethnicity3                | Firearms         | 1

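db<>fiddle here

We only need a single query level, since DISTINCT ON is applied after GROUP BY.

First, aggregate with GROUP BY to get the count per (officer_defined_ethnicity, object_of_search).
Then use DISTINCT ON to pick the row with the highest count per officer_defined_ethnicity.

I added object_of_search as the third ORDER BY item to act as a tiebreaker and produce a deterministic result:
in case of a tie, the first object_of_search in alphabetical sort order is picked.
Adapt to your needs.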
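
As one possible adaptation, the per-ethnicity totals and arrest rate from the question can be carried along in the same query level, because window functions are evaluated after GROUP BY and before DISTINCT ON. The sketch below assumes the table name stop_and_searches used in this answer and the column aliases from the question; the window name w is introduced here for illustration.

```sql
-- Sketch: window functions over the per-group aggregates give per-ethnicity totals;
-- DISTINCT ON then keeps the row with the highest count for each ethnicity.
SELECT DISTINCT ON (officer_defined_ethnicity)
       officer_defined_ethnicity,
       sum(count(*)) OVER w                                AS "Sas for ethnicity",
       100.0 * sum(count(*) FILTER (WHERE outcome = 'Arrest')) OVER w
             / sum(count(*)) OVER w                        AS "Arrest rate",
       object_of_search                                    AS "Most frequent object of search"
FROM   stop_and_searches
GROUP  BY officer_defined_ethnicity, object_of_search
WINDOW w AS (PARTITION BY officer_defined_ethnicity)       -- w: illustrative window name
ORDER  BY officer_defined_ethnicity, count(*) DESC, object_of_search;
```

The ORDER BY keeps the same tiebreaker as above, so the picked object_of_search stays deterministic.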

This is simpler and typically faster than a subquery with row_number(). See:

Select first row in each GROUP BY group? - Benchmarks

【Discussion】:
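
For comparison, the mode() aggregate attempted in the question can also return one row per ethnicity once the query groups by officer_defined_ethnicity alone and drops the window functions and sas_id from GROUP BY. This is a sketch, not part of the original answer; it assumes PostgreSQL 9.4 or later (for mode() and FILTER) and the table name stop_and_searches used above. The PostgreSQL documentation describes the choice among equally frequent values as arbitrary, so the DISTINCT ON query is the safer option when a specific tiebreaker matters.

```sql
-- Sketch: plain aggregates with a single GROUP BY column, one row per ethnicity.
SELECT officer_defined_ethnicity,
       count(*)                                         AS "Sas for ethnicity",
       100.0 * count(*) FILTER (WHERE outcome = 'Arrest')
             / count(*)                                 AS "Arrest rate",
       mode() WITHIN GROUP (ORDER BY object_of_search)  AS "Most frequent object of search"
FROM   stop_and_searches
GROUP  BY officer_defined_ethnicity;
```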