Answers: Out of memory issue when comparing two large datasets using Spark Scala

【Question title】: Out of memory issue when comparing two large datasets using Spark Scala
【Posted】: 2016-12-15 22:59:28
【Question description】:

Every day I use a Spark Scala program to import 10 million records from MySQL into Hive and compare yesterday's dataset with today's.

val yesterdayDf=sqlContext.sql("select * from t_yesterdayProducts");
val todayDf=sqlContext.sql("select * from t_todayProducts");
val diffDf=todayDf.except(yesterdayDf);

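I am running on a 3-node cluster, and the program works fine for up to 4 million records. Beyond 4 million records we run into out-of-memory errors because there is not enough RAM.

I would like to know the best way to compare two large datasets.

【Discussion】:

Do the DataFrames have a unique key?
Yes, Thiago. The tables have a unique key.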
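Actually, the Spark SQL API call except implicitly calls subtract from the Spark RDD API. If you have a key, you can try subtractByKey instead. Note that it is defined on pair RDDs rather than on DataFrames, so both sides have to be keyed by that column first, for example todayDf.rdd.keyBy(...).subtractByKey(yesterdayDf.rdd.keyBy(...));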

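Tags: apache-spark apache-spark-sql spark-streaming hadoop2

【Solution 1】:

Have you checked how many partitions you have? yesterdayDf.rdd.partitions.size gives you that number for the yesterdayDf DataFrame, and you can do the same for the other DataFrame.

You can also try yesterdayDf.repartition(1000) // (a large number) and see whether the OOM problem goes away.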
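
The two steps above (inspect the partition count, then repartition if it is low) can be sketched as follows. This is only an illustration and assumes Spark 2.x with a `SparkSession`; the helper name `ensurePartitions`, the local master, and the generated sample data are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionCheck {
  // Hypothetical helper: returns `df` unchanged if it already has at least
  // `minPartitions` partitions, otherwise repartitions it (a full shuffle).
  def ensurePartitions(df: DataFrame, minPartitions: Int): DataFrame =
    if (df.rdd.getNumPartitions >= minPartitions) df
    else df.repartition(minPartitions)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on the cluster this would be YARN.
    val spark = SparkSession.builder()
      .appName("partition-check")
      .master("local[2]")
      .getOrCreate()

    // Stand-in for sqlContext.sql("select * from t_yesterdayProducts").
    val yesterdayDf = spark.range(0, 1000).toDF("id")
    println(s"partitions before: ${yesterdayDf.rdd.getNumPartitions}")

    val yesterdayPartDf = ensurePartitions(yesterdayDf, 8)
    println(s"partitions after: ${yesterdayPartDf.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

More partitions mean smaller tasks, which lowers the memory each task needs at once, but the repartition itself shuffles the whole dataset, so it is worth checking the current count before adding one.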

【Discussion】:

Thanks for the quick response. I tried it this way: val yesterdayDf=sqlContext.sql("select * from t_yesterdayProducts"); val yesterdayPartDf=yesterdayDf.repartition(1000); val todayDf=sqlContext.sql("select * from t_todayProducts"); val todayPartDf=todayDf.repartition(1000); val diffDf=todayPartDf.except(yesterdayPartDf); But after 993 tasks of the first DataFrame had completed, it threw an OOM error.
@ArvindKumarAnugula Try 1200 or 1500 instead of 1000. Also specify --executor-memory 32G or higher.
The exact memory error is GC memory overhead exceeded.
Each node has only 16 GB of RAM. Can I use --executor-memory 16G?
You can see the configuration I am passing here: --master yarn-client --executor-memory 16G --num-executors 12 --executor-cores 4 --driver-memory 8G

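【Solution 2】:

It is hard to say what causes this problem. One possibility is that, for some reason, the workers are pulling in too much data. Try trimming the DataFrames down before running except. In reply to my question in the comments you said you have a key column, so select only that, like this: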

val yesterdayDfKey = yesterdayDf.select("key-column")
val todayDfKey = todayDf.select("key-column")
val diffDf=todayDfKey.except(yesterdayDfKey);

This gives you a DataFrame that contains only the keys. Then you can use a join to filter the full rows, as described in this post.
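
A minimal sketch of that second step, joining the key-only difference back to today's full rows, is shown here. It assumes Spark 2.x with a `SparkSession`; the helper name `newRows`, the column names `id` and `name`, and the sample rows are hypothetical stand-ins for the real tables and key column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KeyDiff {
  // Hypothetical helper: full rows of `today` whose key is absent from
  // `yesterday`. Only the key column takes part in the except, so far less
  // data is compared than with a whole-row except.
  def newRows(today: DataFrame, yesterday: DataFrame, keyCol: String): DataFrame = {
    val newKeys = today.select(keyCol).except(yesterday.select(keyCol))
    today.join(newKeys, Seq(keyCol), "inner")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("key-diff")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-ins for t_yesterdayProducts and t_todayProducts.
    val yesterdayDf = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val todayDf = Seq((1, "a"), (2, "b2"), (3, "c")).toDF("id", "name")

    // Only the row with id 3 is returned: id 2 changed its name, but its key
    // already existed yesterday, so a key-only comparison does not report it.
    newRows(todayDf, yesterdayDf, "id").show()

    spark.stop()
  }
}
```

Note the change in meaning compared with the original `todayDf.except(yesterdayDf)`: a whole-row except also reports rows whose non-key columns changed, while this key-only variant reports new keys only.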

【Discussion】:

【Solution 3】:

You also need to make sure that your yarn.nodemanager.resource.memory-mb is larger than your --executor-memory.
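
The arithmetic behind this check can be sketched as follows. On YARN each executor container needs the executor heap plus an off-heap overhead. The sketch assumes the overhead is the larger of 384 MB and 10% of the executor memory; the actual default depends on the Spark version and on `spark.yarn.executor.memoryOverhead`, and YARN's rounding to its allocation increment is ignored. The object and function names are hypothetical, and the node manager value is an assumed figure for a 16 GB node.

```scala
object ContainerFit {
  // Assumed default overhead: max(384 MB, 10% of executor memory).
  def overheadMb(executorMemoryMb: Long, factor: Double = 0.10, minMb: Long = 384L): Long =
    math.max((executorMemoryMb * factor).toLong, minMb)

  // Memory YARN must grant for one executor container (rounding ignored).
  def containerMb(executorMemoryMb: Long): Long =
    executorMemoryMb + overheadMb(executorMemoryMb)

  // True if one executor container fits under yarn.nodemanager.resource.memory-mb.
  def fits(executorMemoryMb: Long, nodeManagerMb: Long): Boolean =
    containerMb(executorMemoryMb) <= nodeManagerMb

  def main(args: Array[String]): Unit = {
    val executorMb = 16L * 1024 // --executor-memory 16G, as quoted in the thread
    val nodeManagerMb = 16L * 1024 // assumption: the whole 16 GB node given to YARN

    println(s"container needs ${containerMb(executorMb)} MB")
    println(s"fits on node: ${fits(executorMb, nodeManagerMb)}")
  }
}
```

Under these assumptions a 16G executor needs more than 16 GB per container, so it cannot be scheduled on a 16 GB node; a smaller `--executor-memory` that leaves room for the overhead and for the operating system is the safer choice.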

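【Discussion】:

This kind of suggestion belongs in the comments.

【Solution 4】:

You can also try joining the two DataFrames on the key with a left_anti join, and then check the record count.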
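
A minimal sketch of the left_anti approach follows. It assumes Spark 2.0 or later, where the `"left_anti"` join type is available, and uses a `SparkSession`; the helper name `missingFrom`, the column names, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AntiJoinDiff {
  // Hypothetical helper: rows of `today` that have no match in `yesterday`
  // on `keyCol`. A left anti join keeps only columns from the left side.
  def missingFrom(today: DataFrame, yesterday: DataFrame, keyCol: String): DataFrame =
    today.join(yesterday, Seq(keyCol), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("anti-join-diff")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-ins for t_yesterdayProducts and t_todayProducts.
    val yesterdayDf = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val todayDf = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")).toDF("id", "name")

    val diffDf = missingFrom(todayDf, yesterdayDf, "id")
    println(s"records only in today: ${diffDf.count()}")

    spark.stop()
  }
}
```

As with the key-only except, joining on the key alone reports keys that are new today, not rows whose other columns changed.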

【Discussion】: