【发布时间】:2017-06-07 21:40:51
【问题描述】:
我是 spark 新手,需要一些帮助来调试 spark 中非常慢的性能。 我正在做以下转换,它已经运行了 2 个多小时。
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext( sc )
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b33f7a0
scala> val t1_df = hiveContext.sql("select * from T1" )
scala> t1_df.registerTempTable( "T1" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> t1_df.count
17/06/07 07:26:51 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res3: Long = 1732831
scala> val t1_df1 = t1_df.dropDuplicates( Array("c1","c2","c3", "c4" ))
scala> df1.registerTempTable( "ABC" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).count
[Stage 4:====================================================> (89 + 8) / 97]
我正在使用 spark2.1.0 并在 7 个节点的亚马逊 VM 集群上从 hive.2.1.1 读取数据,每个节点具有 250GB RAM 和 64 个虚拟内核。有了这个庞大的资源,我期待这个对 170 万个记录的简单查询能够飞起来,但它的速度非常慢。 任何指针都会有很大帮助。
更新: 添加解释计划:
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).explain
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull((c1#26 = c1#26#1398)) || (c1#26 = c1#26#1398))
:- FileScan parquet default.t1_pq[cols
more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<hdr_msg_src:string,hdr_recv_tsmp:timestamp,hdr_desk_id:string,execprc:string,dreg:string,c...
+- BroadcastExchange IdentityBroadcastMode
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- Exchange hashpartitioning(c1#26, c2#59, c3#60L, c4#82, 200)
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- *FileScan parquet default.atn_load_pq[c1#26,c2#59,c3#60L,c4#82] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:string,c2:string,c3:bigint,c4:string>
【问题讨论】:
-
您分配了多少资源(
spark.executor.instances和spark.executor.cores),请确保您为这些设置输入了一些合理的数字。此外,查看 SparkUI 以检查作业/阶段是否具有足够的并行度。 -
你能为我们做一件事吗?使用此命令添加命令以显示物理计划:
hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).explain() -
Thiago,添加了解释计划。
标签: performance apache-spark hive