Kafka max.poll.records does not work in Spark Streaming

【Question title】: The Kafka max.poll.records setting does not work in Spark Streaming
【Posted】: 2019-03-02 20:04:40
【Question description】:

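My Spark Streaming version is 2.0, my Kafka version is 0.10.0.1, and I use spark-streaming-kafka-0-10_2.11. I read Kafka records with the direct approach, and I now want to limit the maximum number of messages fetched per batch. So I set max.poll.records, but it has no effect. Is the number of consumers in Spark equal to the number of partitions in Kafka? If so, is the maximum number of records per batch in Spark Streaming max.poll.records * consumers?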

【Comments】:

That property is an upper bound, not an exact number. Also, I'm not sure what you are asking about consumers, but how many executors do you have?

Tags: apache-spark apache-kafka spark-streaming kafka-consumer-api

【Solution 1】:

max.poll.records sets an upper bound on the number of records returned by a single poll.

In Spark Streaming, several polls may happen within one batch, so max.poll.records is not much use for limiting the batch size. You should use spark.streaming.kafka.maxRatePerPartition instead. According to the documentation:

An important one is spark.streaming.kafka.maxRatePerPartition, which is the maximum rate (in messages per second) at which each Kafka partition will be read by this direct API.

So the maximum number of records per batch will be

(spark.streaming.kafka.maxRatePerPartition) * (batch duration in seconds) * (number of kafka partitions)

For example, if the topic has 2 partitions, the batch duration is 30 seconds, and spark.streaming.kafka.maxRatePerPartition is 1000, you will see at most 60000 (1000 * 30 * 2) records per batch.
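
The arithmetic can be sketched in a few lines of Python. This only illustrates the formula and is not Spark's internal logic. The helper names max_records_per_batch and polls_per_partition, and the max.poll.records value of 500, are assumptions made for the example. The second helper is a simplified model of how many polls one consumer needs to drain its share of a batch, which is why max.poll.records alone does not cap the batch size.

```python
import math


def max_records_per_batch(max_rate_per_partition: int,
                          batch_seconds: int,
                          num_partitions: int) -> int:
    """Upper bound on records per batch: rate * batch duration * partitions."""
    return max_rate_per_partition * batch_seconds * num_partitions


def polls_per_partition(records_in_range: int, max_poll_records: int) -> int:
    """Simplified model: polls needed to read one partition's share of a batch,
    assuming every poll returns the maximum allowed number of records."""
    return math.ceil(records_in_range / max_poll_records)


# Values from the example: 1000 msg/s per partition, 30 s batches, 2 partitions.
cap = max_records_per_batch(1000, 30, 2)
print(cap)  # 60000

# Assumed max.poll.records of 500: the batch is still up to 60000 records,
# it just takes more polls to fetch each partition's 30000 records.
print(polls_per_partition(cap // 2, 500))  # 60
```

Lowering max.poll.records in this model only raises the number of polls; the batch cap itself does not change.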

You can also enable spark.streaming.backpressure.enabled to get a more adaptive rate based on how long batches take to process.
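
One possible way to pass both settings is as --conf key=value arguments to spark-submit. The sketch below only builds that argument list and does not launch anything. The helper name spark_submit_conf_args is hypothetical, and the rate of 1000 is the example value used in this answer, not a recommendation.

```python
def spark_submit_conf_args(settings: dict) -> list:
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# The two properties discussed above; the values are illustrative.
settings = {
    "spark.streaming.kafka.maxRatePerPartition": 1000,
    "spark.streaming.backpressure.enabled": "true",
}

print(" ".join(spark_submit_conf_args(settings)))
# --conf spark.streaming.kafka.maxRatePerPartition=1000 --conf spark.streaming.backpressure.enabled=true
```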

More information on how the Kafka direct stream works under the hood

【Comments】: