Posted: 2021-07-23 07:17:52
Question:
I have a Spark Scala job that selects a score for a given time interval. I have run the job three times against the same set of data, and each run produces slightly different score results. The score is chosen inside a UDF that receives a Seq of candidate scores to evaluate. At the moment I am only selecting a single score, so it should simply return the highest one, but I am not seeing a consistently highest score returned. I'm not sure why this is happening; any help would be appreciated, and I can add more information if needed.
RUN ONE:
- Grab data from s3
- Use pushdown predicates to get filtered data
- Filter events with certain business rules, repartition, and mapPartition to clean up data
- Filter data that is already good to go
- Union the two dataframes
- Join on small table
- GroupBy & Aggregate to get a sum
- GroupBy & Aggregate to get a List of sums
- UDF with business logic to randomly select sum (currently just selecting highest sum)
- Join on small table
- Partition data and write to S3
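One place run-to-run differences can creep in: after a groupBy with collect_list, Spark gives no ordering guarantee for the collected list, so if the UDF's selection depends on element order (for example, ties broken by position), the winner can change between runs. A minimal pure-Scala model of that selection step (Scored and pickHighest are hypothetical names, not from the job):

```scala
// Hypothetical model of the selection UDF's input: (candidate id, summed score).
case class Scored(id: String, sum: Double)

// Seq.maxBy keeps the FIRST element with the maximal value, so when two
// candidates tie on sum, the winner depends on the order the rows arrived in.
def pickHighest(scored: Seq[Scored]): Scored = scored.maxBy(_.sum)

val runA = Seq(Scored("a", 10.0), Scored("b", 10.0))
val runB = runA.reverse // same data, different shuffle order

// pickHighest(runA).id == "a", but pickHighest(runB).id == "b"
```

With distinct sums the result is order-independent; it is only ties (or floating-point sums that differ depending on aggregation order) that make the selection sensitive to shuffle order.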
RUN TWO:
** Same as Run One but I no longer repartition and I run the UDF after I join
- Grab data from s3
- Use pushdown predicates to get filtered data
- Filter events with certain business rules and mapPartition to clean up data
- Filter data that is already good to go
- Union the two dataframes
- Join on small table
- GroupBy & Aggregate to get a sum
- GroupBy & Aggregate to get a List of sums
- Join on small table
- UDF with business logic to randomly select sum (currently just selecting highest sum)
- Partition data and write to S3
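If the order sensitivity above is the cause, one way to make the selection reproducible regardless of pipeline shape is to break ties on a stable key before picking, so the result no longer depends on arrival order. A sketch, again with hypothetical names:

```scala
// Hypothetical model of the selection input: (candidate id, summed score).
case class Scored(id: String, sum: Double)

// Sorting on (sum descending, id ascending) before taking the head makes
// the chosen element a pure function of the data, not of shuffle order.
def pickHighestStable(scored: Seq[Scored]): Scored =
  scored.sortBy(s => (-s.sum, s.id)).head
```

Any deterministic tie-breaker works; the point is that the UDF must not rely on the incoming Seq's order.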
RUN THREE:
** Same as Run Two but passed in spark conf: spark.sql.shuffle.partitions=400
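Note that spark.sql.shuffle.partitions only changes how many partitions shuffles produce; it does not by itself make an order-dependent UDF deterministic. Besides passing it on spark-submit, it can be set in code (assuming an existing SparkSession named spark):

```scala
// Equivalent to --conf spark.sql.shuffle.partitions=400 on spark-submit.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```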
Comments:
- Compare the execution plans. If you find that the UDF is being optimized differently between runs, try marking the UDF as nondeterministic: val myUdf = udf { ... }.asNondeterministic()
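To expand on that suggestion: asNondeterministic() tells Catalyst it may not re-evaluate, reorder, or push the UDF through filters during optimization. A sketch of applying it to a selection UDF like the one described (the body here is a placeholder, not the actual business logic):

```scala
import org.apache.spark.sql.functions.udf

// Marking the UDF nondeterministic prevents the optimizer from duplicating
// or repositioning its evaluation in the plan; placeholder scoring body.
val pickScore = udf { (sums: Seq[Double]) => sums.max }.asNondeterministic()
```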
Tags: apache-spark apache-spark-sql user-defined-functions