Answers: Count based on separate group by columns and with a condition

【Question title】: Count based on separate group by columns and with a condition
【Posted】: 2017-04-20 22:17:54
【Question description】:

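I am trying to combine three separate queries into one that still produces the same results, but as a single table. ColumnA and ColumnB are both dates in 'yyyy-mm-dd' format; ideally the final result would be a single date column plus a separate count column for each query.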

select columnA, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnA

select columnB, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnB

select columnB, count(distinct columnC)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
and columnX in ('itemA','ItemB')
group by columnB

【Question comments】:

Looks like a textbook use case for UNION ALL.
On reflection, though, I think Gordon understood the question better.

Tags: sql count hive

【Solution 1】:

Go with UNION ALL:

select columnA, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnA
UNION ALL
select columnB, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnB
UNION ALL
select columnB, count(distinct columnC)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
and columnX in ('itemA','ItemB')
group by columnB
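
One thing to keep in mind with this form: UNION ALL stacks the three result sets on top of each other, so the same date can appear up to three times and nothing in the output says which query a row came from. A minimal sketch of one way to keep the rows distinguishable is to add a literal label column. The example below runs the SQL through Python's built-in sqlite3 module as a stand-in for Hive; the table name `data_table`, the column `ts` (in place of `timestamp`), the label column `src`, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_table "
    "(ts TEXT, columnA TEXT, columnB TEXT, columnC TEXT, columnX TEXT)"
)
# Hypothetical sample rows; the last one falls outside the date window.
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?, ?, ?, ?)",
    [
        ("2017-01-02", "2017-01-01", "2017-01-01", "c1", "itemA"),
        ("2017-01-03", "2017-01-01", "2017-01-02", "c1", "ItemB"),
        ("2017-01-04", "2017-01-02", "2017-01-02", "c2", "other"),
        ("2017-02-01", "2017-01-01", "2017-01-01", "c3", "itemA"),
    ],
)

# Same three queries, each tagged with a literal label in the first column.
labeled_union = """
SELECT 'a_count' AS src, columnA AS dte, COUNT(*) AS cnt
FROM data_table
WHERE ts BETWEEN '2017-01-01' AND '2017-01-07'
GROUP BY columnA
UNION ALL
SELECT 'b_count', columnB, COUNT(*)
FROM data_table
WHERE ts BETWEEN '2017-01-01' AND '2017-01-07'
GROUP BY columnB
UNION ALL
SELECT 'c_distinct', columnB, COUNT(DISTINCT columnC)
FROM data_table
WHERE ts BETWEEN '2017-01-01' AND '2017-01-07'
  AND columnX IN ('itemA', 'ItemB')
GROUP BY columnB
"""

# UNION ALL gives no ordering guarantee, so sort on the client side.
union_rows = sorted(conn.execute(labeled_union).fetchall())
for row in union_rows:
    print(row)
```

The result is still one row per (query, date) pair rather than one row per date, which is why the join-based answers are closer to what the question asks for.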

【Comments】:

【Solution 2】:

The query below expresses what you want to do:

select d.dte, coalesce(a.cnt, 0) as acnt, coalesce(b.cnt, 0) as bcnt,
       b.c_cnt
from (select columnA as dte from data.table where timestamp between '2017-01-01' and '2017-01-07'
      union
      select columnB from data.table where timestamp between '2017-01-01' and '2017-01-07'
     ) d left join
     (select columnA, count(*) as cnt
      from data.table
      where timestamp between '2017-01-01' and '2017-01-07'
      group by columnA
     ) a
     on d.dte = a.columnA left join
     (select columnB, count(*) as cnt,
             count(distinct case when columnX in ('itemA','ItemB') then columnC end) as c_cnt
      from data.table
      where timestamp between '2017-01-01' and '2017-01-07'
      group by columnB
     ) b
     on d.dte = b.columnB;

I think this is Hive-compatible, but Hive occasionally deviates from other SQL dialects in surprising ways.
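
To see what shape this query returns, here is a small sketch that runs the same structure against a few made-up rows. It uses Python's built-in sqlite3 module rather than Hive, so it says nothing about Hive compatibility; the table name `data_table`, the column `ts` (in place of `timestamp`), and the sample data are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_table "
    "(ts TEXT, columnA TEXT, columnB TEXT, columnC TEXT, columnX TEXT)"
)
# Hypothetical rows: 2017-01-01 occurs only in columnA,
# 2017-01-02 only in columnB, 2017-01-03 in both.
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?, ?, ?, ?)",
    [
        ("2017-01-02", "2017-01-01", "2017-01-02", "c1", "itemA"),
        ("2017-01-03", "2017-01-01", "2017-01-02", "c1", "ItemB"),
        ("2017-01-04", "2017-01-03", "2017-01-02", "c2", "other"),
        ("2017-01-05", "2017-01-03", "2017-01-03", "c3", "itemA"),
    ],
)

# Shared date filter, repeated in every subquery as in the query above.
window = "ts BETWEEN '2017-01-01' AND '2017-01-07'"

joined_query = f"""
SELECT d.dte,
       COALESCE(a.cnt, 0) AS acnt,
       COALESCE(b.cnt, 0) AS bcnt,
       b.c_cnt
FROM (SELECT columnA AS dte FROM data_table WHERE {window}
      UNION
      SELECT columnB FROM data_table WHERE {window}
     ) d
LEFT JOIN
     (SELECT columnA, COUNT(*) AS cnt
      FROM data_table
      WHERE {window}
      GROUP BY columnA
     ) a
     ON d.dte = a.columnA
LEFT JOIN
     (SELECT columnB, COUNT(*) AS cnt,
             COUNT(DISTINCT CASE WHEN columnX IN ('itemA', 'ItemB')
                                 THEN columnC END) AS c_cnt
      FROM data_table
      WHERE {window}
      GROUP BY columnB
     ) b
     ON d.dte = b.columnB
ORDER BY d.dte
"""

joined_rows = conn.execute(joined_query).fetchall()
for row in joined_rows:
    print(row)
```

Each date appears exactly once. Note that `c_cnt` is not wrapped in `coalesce`, so a date that occurs only in columnA comes back with NULL (Python `None`) in that column rather than 0.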

【Comments】:

【Solution 3】:

The following seems to be what you want:

select columnA, count(*) as cnt from data.table where timestamp between '2017-01-01' and '2017-01-07' group by columnA
Union All
select columnB, count(*) as cnt from data.table where timestamp between '2017-01-01' and '2017-01-07' group by columnB
Union All
select columnB, count(distinct columnC) as cnt from data.table where timestamp between '2017-01-01' and '2017-01-07' and columnX in ('itemA','ItemB') group by columnB

【Comments】:

【Solution 4】:

I was able to get it working with the following:

With pullA as
(
  select columnA, count(*) as A_count
  from data.table
  group by columnA
),
pullB as
(
  select columnB, count(*) as B_count
  from data.table
  group by columnB
),

pullC as
(
  select columnB, count(distinct columnC) as C_count
  from data.table
  where columnX in ('itemA', 'itemB')
  group by columnB
)

select pullB.columnB, A_count, B_count, C_count
from pullB
left join pullA
on pullB.columnB = pullA.columnA
left join pullC
on pullB.columnB = pullC.columnB

Is this approach more or less efficient than the union or subquery approaches?
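
Before comparing efficiency, it may be worth checking that the CTE form returns the same rows as the alternatives. The sketch below runs a CTE query of this shape, with join keys qualified by CTE name, against a few made-up rows using Python's built-in sqlite3 module as a stand-in for Hive. The table name `data_table`, the extra `C_distinct` column, and the sample data are assumptions for illustration; the date filter is left out to keep the example short.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_table "
    "(columnA TEXT, columnB TEXT, columnC TEXT, columnX TEXT)"
)
# Hypothetical rows: 2017-01-01 occurs only in columnA.
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?, ?, ?)",
    [
        ("2017-01-01", "2017-01-02", "c1", "itemA"),
        ("2017-01-01", "2017-01-02", "c1", "itemA"),
        ("2017-01-03", "2017-01-02", "c2", "other"),
        ("2017-01-03", "2017-01-03", "c3", "itemA"),
    ],
)

cte_query = """
WITH pullA AS (
  SELECT columnA, COUNT(*) AS A_count
  FROM data_table
  GROUP BY columnA
),
pullB AS (
  SELECT columnB, COUNT(*) AS B_count
  FROM data_table
  GROUP BY columnB
),
pullC AS (
  -- Row count and distinct count side by side, to show where they differ.
  SELECT columnB,
         COUNT(*) AS C_count,
         COUNT(DISTINCT columnC) AS C_distinct
  FROM data_table
  WHERE columnX IN ('itemA', 'itemB')
  GROUP BY columnB
)
SELECT pullB.columnB, A_count, B_count, C_count, C_distinct
FROM pullB
LEFT JOIN pullA ON pullB.columnB = pullA.columnA
LEFT JOIN pullC ON pullB.columnB = pullC.columnB
ORDER BY pullB.columnB
"""

cte_rows = conn.execute(cte_query).fetchall()
for row in cte_rows:
    print(row)
```

Two things stand out on this sample. Because the joins start from `pullB`, a date that appears only in columnA (here 2017-01-01) is missing from the output. And `count(*)` and `count(distinct columnC)` diverge as soon as a columnC value repeats within a date, so the two are not interchangeable.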

【Comments】: