Count distinct values in one column based on other columns: answers

【Question title】: Count distinct values in one column based on other columns
【Posted】: 2019-03-09 04:33:34
【Question】:

I have a table that looks like this:

app_id  supplier_reached    creation_date   platform
10001       1            9/11/2018         iOS
10001       2            9/18/2018         iOS
10002       1            5/16/2018       android
10003       1            5/6/2018        android
10004       1            10/1/2018       android
10004       1            2/3/2018        android
10004       2            2/2/2018           web
10005       4            1/5/2018           web
10005       2            5/1/2018        android
10006       3            10/1/2018         iOS
10005       4            1/1/2018          iOS

The goal is to find the number of unique app_id values submitted per month.

If I simply do a count(distinct app_id), I get the following result:

Group by month  count(app number)
   January            1
   February           1
   May                3
   September          1
   October            2

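However, an application is also considered unique based on a combination of other fields. For example, in January the app_id is the same, but the combination of app_id, supplier_reached and platform has two different values, so that app_id should be counted twice. Following the same pattern, the desired result is: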

Group by month  Desired answer
   January            2
   February           2
   May                3
   September          2
   October            2

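Finally, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.

Is there a way to do this type of count in SQL?

I am using Redshift.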

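【Question comments】:

Tags: sql postgresql count amazon-redshift

【Solution 1】:

As other answers note, count(distinct ...) does not work with multiple fields in Redshift.

You can first group by the columns that should be unique together, then count the resulting records like this: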

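-- assumes a month column is available, e.g. derived from creation_date
select month, count(1) as app_number
from (
    select month, app_id, supplier_reached, platform
    from your_table
    group by 1, 2, 3, 4
) t  -- alias for the derived table
group by 1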
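
As a rough check of this approach against the sample data from the question, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for Redshift, which is an assumption made only to keep the example self-contained. The table name `apps`, the alias `t` and the integer `month` column (extracted from creation_date while loading) are illustrative choices, not part of the original question.

```python
import sqlite3

# Sample rows from the question:
# (app_id, supplier_reached, creation_date, platform)
ROWS = [
    (10001, 1, "9/11/2018", "iOS"),
    (10001, 2, "9/18/2018", "iOS"),
    (10002, 1, "5/16/2018", "android"),
    (10003, 1, "5/6/2018", "android"),
    (10004, 1, "10/1/2018", "android"),
    (10004, 1, "2/3/2018", "android"),
    (10004, 2, "2/2/2018", "web"),
    (10005, 4, "1/5/2018", "web"),
    (10005, 2, "5/1/2018", "android"),
    (10006, 3, "10/1/2018", "iOS"),
    (10005, 4, "1/1/2018", "iOS"),
]


def distinct_apps_per_month(rows):
    """Count distinct (app_id, supplier_reached, platform) combinations per month."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE apps (app_id INTEGER, supplier_reached INTEGER, "
        "month INTEGER, platform TEXT)"
    )
    # Assumption: dates are M/D/YYYY strings, so the month is the first field.
    conn.executemany(
        "INSERT INTO apps VALUES (?, ?, ?, ?)",
        [(a, s, int(d.split("/")[0]), p) for a, s, d, p in rows],
    )
    query = """
        SELECT month, COUNT(1) AS app_number
        FROM (
            -- one row per distinct combination within each month
            SELECT month, app_id, supplier_reached, platform
            FROM apps
            GROUP BY month, app_id, supplier_reached, platform
        ) AS t
        GROUP BY month
        ORDER BY month
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


print(distinct_apps_per_month(ROWS))
```

With the sample rows, the per-month counts line up with the desired answer in the question (2, 2, 3, 2, 2 for months 1, 2, 5, 9 and 10).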

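【Comments】:

Regarding "there can be many other columns in the table which may .... contribute to the uniqueness": just add or change the columns used in the inner subquery to suit your purpose. Also note Erwin Brandstetter's answer, which gives good reasons to avoid concatenation.

【Solution 2】:

I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation: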

count(distinct app_id || ':' || supplier_reached || ':' || platform)

【Comments】:

【Solution 3】:

Your goal is stated incorrectly.

You don't want

to find the unique number of app_id submitted per month

You want

to find the unique number of app_id + supplier_reached + platform submitted per month.

So you need to use either a) a combination of columns, e.g. count(distinct col1||col2||col3), or b)

select t1.month, count(t1.*)
from
  (select distinct
         app_id,
         supplier_reached,
         platform,
         month
   from sometable) t1
group by month

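【Comments】:

I tried this, but removing distinct causes applications to be double-counted, because in some cases the records themselves are duplicated. :|
No, I don't want "to find the unique of app_id + supplier_reached + platform". The definition given by the stakeholders states that, even though the app_id may be the same (due to the DB design), the uniqueness of an application is characterized by multiple fields. I'm just trying to count according to the definition provided.

【Solution 4】:

Actually, you can conveniently count distinct ROW values in Postgres: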

SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM   tbl
GROUP  BY 1;

The ROW keyword is just noise here, so this is equivalent:

count(DISTINCT ROW(app_id, supplier_reached, platform))

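I would discourage concatenating columns for this purpose. It is comparatively expensive, error-prone (think of different data types and locale-dependent text representations), and it introduces corner-case bugs if the separator can occur in the column values.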

Alas, row constructors are not supported by Redshift. Its list of unsupported features includes:

...
Value expressions
    Subscripted expressions  
    Array constructors  
    Row constructors
...
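
The separator corner case mentioned above can be illustrated with a small Python sketch. The two rows are made-up values, chosen so that the ':' separator also appears inside the data. Comparing tuples here mimics counting distinct row values, while joining strings mimics the concatenation workaround; the function names are hypothetical.

```python
# Two hypothetical rows that differ, but whose values contain the separator ':'.
rows = [
    ("10001", "x:", "y"),
    ("10001", "x", ":y"),
]


def count_distinct_concat(rows, sep=":"):
    """Mimics count(distinct a || sep || b || sep || c)."""
    return len({sep.join(r) for r in rows})


def count_distinct_tuples(rows):
    """Mimics counting distinct row values (a, b, c)."""
    return len(set(rows))


# Both rows concatenate to the same string "10001:x::y", so they collapse into one.
print(count_distinct_concat(rows))
# As tuples, the rows remain distinct.
print(count_distinct_tuples(rows))
```

The concatenated keys collide, so the concatenation count undercounts, whereas the tuple comparison keeps the two rows apart.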

【Comments】: