Spark结构化流内存绑定答案

【问题标题】：Spark Structured Streaming Memory BoundSpark结构化流内存绑定
【发布时间】：2018-04-30 22:05:53
【问题描述】：

我正在处理平均负载为 100 Mb/s 的流。我有六个执行程序，每个执行程序分配了 12 Gb 的内存。但是，由于数据加载，我在几分钟内在 spark 执行程序中遇到内存不足错误（错误 52）。尽管 Spark 数据帧在概念上是 unbounded，但它似乎受总执行器内存的限制？

我的想法是大约每五分钟将数据帧/流保存为镶木地板。但是，在那之后，spark 似乎没有直接的机制来清除数据帧？

val out = df.
  writeStream.
  format("parquet").
  option("path", "/applications/data/parquet/customer").
  option("checkpointLocation", "/checkpoints/customer/checkpoint").
  trigger(Trigger.ProcessingTime(300.seconds)).
  outputMode(OutputMode.Append).
  start

【问题讨论】：

标签： scala apache-spark spark-dataframe spark-structured-streaming

【解决方案1】：

似乎没有直接的方法可以做到这一点。作为this conflicts with the general Spark model that operations be rerunnable in case of failure。

但是，我会在 2018 年 2 月 8 日 13:21 分享与 comment 相同的观点 issue。

【讨论】：