【Title】: Duplicate removal by DISTINCT vs GROUP BY in Spark SQL
【Posted】: 2020-12-26 03:11:26
【Question】:

I am using Spark SQL 2.4.

A question has bothered me for a long time: which of DISTINCT or GROUP BY (without any aggregation) removes duplicates from a table more efficiently, i.e. with better query performance?

With DISTINCT, I would use the following:

select distinct 
       id, 
       fname, 
       lname, 
       age
from emp_table;

With GROUP BY, I would simply use:

select id,
       fname,
       lname,
       age
from emp_table
group by 1, 2, 3, 4;

I read somewhere about Spark SQL that DISTINCT should be used only when the dataset's cardinality is high, and GROUP BY otherwise. However, in my day-to-day work I have found that DISTINCT outperforms GROUP BY even when the cardinality is low.

So my question is: which one performs better, and in which scenarios?

Could someone please advise me on this - in which cases does a query using DISTINCT perform better than one using GROUP BY?

Thanks

【Comments】:

  • Try .explain and tell us your conclusions.

Tags: apache-spark apache-spark-sql


【Answer 1】:

They are functionally equivalent and generate the same query plan. Use DISTINCT for clarity.
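To see why one plan can serve both queries: SELECT DISTINCT over all columns and GROUP BY over all columns with no aggregates both mean "keep one copy of each distinct row tuple". A plain-Python analogy (not Spark code; the sample rows are made up) of that shared semantics:

```python
# Both "DISTINCT a, b, c, d" and "GROUP BY a, b, c, d with no aggregate
# functions" collapse to the same operation: deduplicate full row tuples.
rows = [(1, "fn", "ln", 30), (1, "fn", "ln", 30), (2, "x", "y", 25)]

distinct_rows = list(dict.fromkeys(rows))     # DISTINCT: unique rows
grouped_rows = list({r: None for r in rows})  # GROUP BY all columns, no aggregates

assert distinct_rows == grouped_rows == [(1, "fn", "ln", 30), (2, "x", "y", 25)]
```

Since both reduce to the same operation, Catalyst rewrites them to the same Aggregate node, which is what the plans in the other answer show.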

【Comments】:

    【Answer 2】:

    Here are the query plans for the two queries. As @thebluephantom said, they are identical, so there should not be any performance difference.

    create table t1 (a int, b int, c int, d int);
    
    explain select a,b,c,d from t1 group by 1,2,3,4;
    == Physical Plan ==
    *(2) HashAggregate(keys=[a#14, b#15, c#16, d#17], functions=[])
    +- Exchange hashpartitioning(a#14, b#15, c#16, d#17, 200), true, [id=#33]
       +- *(1) HashAggregate(keys=[a#14, b#15, c#16, d#17], functions=[])
          +- Scan hive default.t1 [a#14, b#15, c#16, d#17], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#14, b#15, c#16, d#17], Statistics(sizeInBytes=8.0 EiB)
    
    explain select distinct a,b,c,d from t1;
    == Physical Plan ==
    *(2) HashAggregate(keys=[a#23, b#24, c#25, d#26], functions=[])
    +- Exchange hashpartitioning(a#23, b#24, c#25, d#26, 200), true, [id=#58]
       +- *(1) HashAggregate(keys=[a#23, b#24, c#25, d#26], functions=[])
          +- Scan hive default.t1 [a#23, b#24, c#25, d#26], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#23, b#24, c#25, d#26], Statistics(sizeInBytes=8.0 EiB)
    

    An extended explain shows that the queries become identical after optimization:

    explain extended select a,b,c,d from t1 group by 1,2,3,4;
    == Parsed Logical Plan ==
    'Aggregate [1, 2, 3, 4], ['a, 'b, 'c, 'd]
    +- 'UnresolvedRelation [t1]
    
    == Analyzed Logical Plan ==
    a: int, b: int, c: int, d: int
    Aggregate [a#41, b#42, c#43, d#44], [a#41, b#42, c#43, d#44]
    +- SubqueryAlias spark_catalog.default.t1
       +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#41, b#42, c#43, d#44], Statistics(sizeInBytes=8.0 EiB)
    
    == Optimized Logical Plan ==
    Aggregate [a#41, b#42, c#43, d#44], [a#41, b#42, c#43, d#44]
    +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#41, b#42, c#43, d#44], Statistics(sizeInBytes=8.0 EiB)
    
    == Physical Plan ==
    *(2) HashAggregate(keys=[a#41, b#42, c#43, d#44], functions=[], output=[a#41, b#42, c#43, d#44])
    +- Exchange hashpartitioning(a#41, b#42, c#43, d#44, 200), true, [id=#108]
       +- *(1) HashAggregate(keys=[a#41, b#42, c#43, d#44], functions=[], output=[a#41, b#42, c#43, d#44])
          +- Scan hive default.t1 [a#41, b#42, c#43, d#44], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#41, b#42, c#43, d#44], Statistics(sizeInBytes=8.0 EiB)
    
    explain extended select distinct a,b,c,d from t1;
    == Parsed Logical Plan ==
    'Distinct
    +- 'Project ['a, 'b, 'c, 'd]
       +- 'UnresolvedRelation [t1]
    
    == Analyzed Logical Plan ==
    a: int, b: int, c: int, d: int
    Distinct
    +- Project [a#50, b#51, c#52, d#53]
       +- SubqueryAlias spark_catalog.default.t1
          +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#50, b#51, c#52, d#53], Statistics(sizeInBytes=8.0 EiB)
    
    == Optimized Logical Plan ==
    Aggregate [a#50, b#51, c#52, d#53], [a#50, b#51, c#52, d#53]
    +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#50, b#51, c#52, d#53], Statistics(sizeInBytes=8.0 EiB)
    
    == Physical Plan ==
    *(2) HashAggregate(keys=[a#50, b#51, c#52, d#53], functions=[], output=[a#50, b#51, c#52, d#53])
    +- Exchange hashpartitioning(a#50, b#51, c#52, d#53, 200), true, [id=#133]
       +- *(1) HashAggregate(keys=[a#50, b#51, c#52, d#53], functions=[], output=[a#50, b#51, c#52, d#53])
          +- Scan hive default.t1 [a#50, b#51, c#52, d#53], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#50, b#51, c#52, d#53], Statistics(sizeInBytes=8.0 EiB)
    

    which actually suggests that the query engine seems to prefer the GROUP BY form.
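    Both physical plans above follow the same two-phase pattern: a partial HashAggregate within each input partition, an Exchange that hash-partitions rows on all four columns, then a final HashAggregate. A rough sketch of that pattern in plain Python (standing in for Spark's execution; the data and partition counts are made up for illustration):

```python
# Simulates the two-phase distinct seen in the plans:
#   (1) partial HashAggregate: dedupe within each input partition,
#   (2) Exchange hashpartitioning: shuffle rows by hash of the full row,
#   (3) final HashAggregate: dedupe within each reducer partition.
def distinct_two_phase(partitions, n_reducers=3):
    # Phase 1: partial aggregate - local dedupe per partition
    partial = [set(p) for p in partitions]
    # Exchange: rows with equal values hash to the same reducer
    shuffled = [[] for _ in range(n_reducers)]
    for part in partial:
        for row in part:
            shuffled[hash(row) % n_reducers].append(row)
    # Phase 2: final aggregate - dedupe per reducer; no duplicates
    # can span reducers, since equal rows land in the same bucket
    out = []
    for bucket in shuffled:
        out.extend(set(bucket))
    return out

data = [[(1, "a"), (1, "a"), (2, "b")], [(2, "b"), (3, "c")]]
assert sorted(distinct_two_phase(data)) == [(1, "a"), (2, "b"), (3, "c")]
```

    The partial aggregate is what keeps the shuffle small: each task only ships its partition's unique rows across the Exchange.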

    【Comments】:

    • Thanks for your reply. Could you help me understand why you say the GROUP BY clause is preferred, based on the query plans above?
    • @Matthew Sorry if I worded that poorly. I was only saying that the optimized query looks more like the unoptimized GROUP BY query than the unoptimized DISTINCT query. That is just an observation and has nothing to do with how the query actually runs. I would suggest using DISTINCT in your real queries for readability.
    • @thebluephantom Not really, I was just trying to understand this - it is actually the first time I have run explain extended.
    • No, I meant in general? You skyrocketed, just interested...