[Title]: Running into 'java.lang.OutOfMemoryError: Java heap space' when using toPandas() and databricks-connect
[Posted]: 2021-03-21 03:25:48
[Question]:

I am trying to convert a PySpark DataFrame of size [2734984 rows x 11 columns] to a pandas DataFrame by calling toPandas(). While this works perfectly well in an Azure Databricks notebook (11 seconds), I run into a java.lang.OutOfMemoryError: Java heap space exception when I run the exact same code via databricks-connect (the db-connect version and the Databricks runtime version match; both are 7.1).

I have already increased the Spark driver memory (100g) and maxResultSize (15g). I suspect the error lies somewhere in databricks-connect, since I cannot reproduce it from a notebook.
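For reference, the two settings mentioned above would look like this in a Spark properties file or in the cluster's Spark config (the property names are standard Spark configuration keys; the values are the ones quoted above):

```
spark.driver.memory        100g
spark.driver.maxResultSize 15g
```

Note that when databricks-connect is involved, these cluster-side settings do not govern the heap of the local client JVM.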

Any hints?

The error is as follows:

```

Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
    at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
    at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
    at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
    at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
    at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
    at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
    at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
    at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
    at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
    at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
    at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
    at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)
```

[Question comments]:

Tags: python pandas pyspark databricks databricks-connect


[Solution 1]:

This is probably because databricks-connect executes toPandas on the client machine, which can then run out of memory. You can increase the local driver memory by setting spark.driver.memory in the (local) configuration file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained via databricks-connect get-spark-home.
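Concretely, the file only needs a single property line like the one below (10g is an example value taken from the discussion, not a recommendation; size it to your workload). The directory to put it in is reported by databricks-connect get-spark-home:

```
spark.driver.memory 10g
```

If spark-defaults.conf does not exist under ${spark_home}/conf, it can simply be created; the local client JVM picks the setting up on the next run.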

[Discussion]:

• Great, that did it! I suspected something like this but had no idea where to start. For anyone with the same problem: \conf\spark-defaults.conf did not exist in my case; I simply created it and inserted the following one-liner: spark.driver.memory 10g. When running toPandas I can clearly see the memory load on my local machine increase. Thanks again!
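A rough back-of-envelope estimate (a sketch, assuming ~8 bytes per cell; strings plus Arrow/pandas overhead can easily multiply this several times) shows why the local JVM heap matters here: the stack trace shows the client decompressing Arrow batches (ChunkDecoder.decode), so the whole result must fit in the client driver's heap, not just the cluster's.

```python
# Back-of-envelope size of the DataFrame from the question:
# 2,734,984 rows x 11 columns, assuming ~8 bytes per cell.
rows, cols, bytes_per_cell = 2_734_984, 11, 8
approx_bytes = rows * cols * bytes_per_cell
print(f"{approx_bytes / 1024**2:.0f} MiB")  # ~230 MiB before any overhead
```

Even at a few multiples of ~230 MiB, a default-sized client JVM heap can overflow during decompression, which is consistent with the fix of raising spark.driver.memory locally.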