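【Question title】: Count based on separate group by columns and with a condition
【Posted】: 2017-04-20 22:17:54
【Question description】:

I'm trying to combine three separate queries into one that still produces the same results, but as a single table. ColumnA and ColumnB are both dates in 'yyyy-mm-dd' format, and ideally the end result would be a single date column plus a separate count for each query.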

select columnA, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnA

select columnB, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnB

select columnB, count(distinct columnC)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
and columnX in ('itemA','ItemB')
group by columnB

【Comments】:

  • Looks like a textbook use case for UNION ALL
  • Although on reflection, I think Gordon understood it better.

Tags: sql count hive


【Solution 1】:

Go with UNION ALL:

select columnA, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnA
UNION ALL
select columnB, count(*)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnB
UNION ALL
select columnB, count(distinct columnC)
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
and columnX in ('itemA','ItemB')
group by columnB
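
One caveat with stacking the three result sets: once combined, nothing in the output says which query a row came from. A minimal sketch of one way around that, assuming a literal label column is acceptable. The `src`, `dte`, and `cnt` names are illustrative and not part of the original answer; `data.table` and `timestamp` are the question's placeholder names and may need backtick quoting in Hive.

```sql
-- src is a hypothetical label column identifying the originating query
select 'A_count' as src, columnA as dte, count(*) as cnt
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnA
union all
select 'B_count' as src, columnB as dte, count(*) as cnt
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
group by columnB
union all
select 'C_distinct' as src, columnB as dte, count(distinct columnC) as cnt
from data.table
where timestamp between '2017-01-01' and '2017-01-07'
and columnX in ('itemA','ItemB')
group by columnB
```

This still returns one row per date per query rather than one row per date, so it does not give the side-by-side counts the question asks for.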

【Comments】:

【Solution 2】:

The query below expresses what you want to do:

    select d.dte, coalesce(a.cnt, 0) as acnt, coalesce(b.cnt, 0) as bcnt,
           b.c_cnt
    from (select columnA as dte from data.table where timestamp between '2017-01-01' and '2017-01-07'
          union
          select columnB from data.table where timestamp between '2017-01-01' and '2017-01-07'
         ) d left join
         (select columnA, count(*) as cnt
          from data.table
          where timestamp between '2017-01-01' and '2017-01-07'
          group by columnA
         ) a
         on d.dte = a.columnA left join
         (select columnB, count(*) as cnt,
                 count(distinct case when columnX in ('itemA','ItemB') then columnC end) as c_cnt
          from data.table
          where timestamp between '2017-01-01' and '2017-01-07'
          group by columnB
         ) b
         on d.dte = b.columnB;
    

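I think this is Hive-compatible, but occasionally Hive deviates from other SQL dialects in surprising ways.

【Comments】:

【Solution 3】:

The following seems to be what you want: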
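
One such deviation worth checking: as far as I know, Hive releases before 1.2.0 accept only `UNION ALL`, not a bare `union`. If that applies to your installation, the date-list subquery `d` can be rewritten roughly as follows. This is a sketch under that assumption, with `u` as an illustrative alias, not a tested replacement.

```sql
-- Build the distinct list of dates without relying on UNION (DISTINCT)
select distinct u.dte
from (select columnA as dte
      from data.table
      where timestamp between '2017-01-01' and '2017-01-07'
      union all
      select columnB as dte
      from data.table
      where timestamp between '2017-01-01' and '2017-01-07'
     ) u
```

Deduplicating after the `union all` yields the same set of dates as the `union` version.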

      select columnA, count(*) as cnt from data.table where timestamp between '2017-01-01' and '2017-01-07' group by columnA
      Union All
      select columnB, count(*) as cnt from data.table where timestamp between '2017-01-01' and '2017-01-07' group by columnB
      Union All
      select columnB, count(distinct columnC) as cnt from data.table where timestamp between '2017-01-01' and '2017-01-07' and columnX in ('itemA','ItemB') group by columnB
      

【Comments】:

【Solution 4】:

I was able to get it working with the following:

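        With pullA as
        (
          select columnA, count(*) as A_count
          from data.table
          group by columnA
        ),
        pullB as
        (
          select columnB, count(*) as B_count
          from data.table
          group by columnB
        ),

        pullC as
        (
          -- rows per columnB restricted to the two items
          -- (the question counts distinct columnC here instead of rows)
          select columnB, count(*) as C_count
          from data.table
          where columnX in ('itemA', 'itemB')
          group by columnB
        )

        -- pullB drives the joins, so dates that appear only in columnA are dropped
        select pullB.columnB, pullA.A_count, pullB.B_count, pullC.C_count
        from pullB
        left join pullA
        on pullB.columnB = pullA.columnA
        left join pullC
        on pullB.columnB = pullC.columnB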
        

Is this approach more or less efficient than the union or subquery approaches?

【Comments】:
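
For comparison, here is a sketch of a variant that replaces the joins with a single aggregation: each row is emitted once per date column with a source tag, then counted conditionally. The `src`, `dte`, `u`, and `*_cnt` names are illustrative, and the date filter and item list are taken from the question. Whether it is faster depends on the engine and the data, so comparing the `EXPLAIN` output of each version on your own cluster is the safer guide.

```sql
select u.dte,
       sum(case when u.src = 'A' then 1 else 0 end) as a_cnt,
       sum(case when u.src = 'B' then 1 else 0 end) as b_cnt,
       -- distinct columnC values, counted only for columnB rows with the two items
       count(distinct case when u.src = 'B' and u.columnX in ('itemA','ItemB')
                           then u.columnC end) as c_cnt
from (select columnA as dte, 'A' as src, columnX, columnC
      from data.table
      where timestamp between '2017-01-01' and '2017-01-07'
      union all
      select columnB as dte, 'B' as src, columnX, columnC
      from data.table
      where timestamp between '2017-01-01' and '2017-01-07'
     ) u
group by u.dte
```

Unlike the CTE version above, this keeps dates that appear only in columnA, and reports 0 rather than NULL when a date has no matching rows.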
