【问题标题】:Kafka "partition.assignment.strategy" in PysparkPyspark 中的卡夫卡“partition.assignment.strategy”
【发布时间】:2021-04-29 14:52:32
【问题描述】:

我正在尝试读取数据以将其转换为 Dataframe,我的软件的当前版本如下:

  1. spark-2.4.7-bin-hadoop2.7
  2. kafka_2.12-2.7.0

Kafka 正在工作,我存储了以下数据,我正在尝试读取:

~/development/kafka_home/kafka_2.13-2.6.0$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testtopic --from-beginning
{"transaction_id": "1", "transaction_card_type": "Visa", "transaction_amount": 181.76, "transaction_datetime": "2021-01-25 15:44:44"}
{"transaction_id": "2", "transaction_card_type": "MasterCard", "transaction_amount": 228.62, "transaction_datetime": "2021-01-25 15:44:45"}
{"transaction_id": "3", "transaction_card_type": "Visa", "transaction_amount": 483.48, "transaction_datetime": "2021-01-25 15:44:46"}
{"transaction_id": "4", "transaction_card_type": "MasterCard", "transaction_amount": 477.87, "transaction_datetime": "2021-01-25 15:44:47"}
{"transaction_id": "5", "transaction_card_type": "MasterCard", "transaction_amount": 304.52, "transaction_datetime": "2021-01-25 15:44:48"}
{"transaction_id": "1", "transaction_card_type": "MasterCard", "transaction_amount": 346.99, "transaction_datetime": "2021-01-25 16:38:44"}
{"transaction_id": "2", "transaction_card_type": "Maestro", "transaction_amount": 384.33, "transaction_datetime": "2021-01-25 16:38:45"}
{"transaction_id": "3", "transaction_card_type": "MasterCard", "transaction_amount": 394.95, "transaction_datetime": "2021-01-25 16:38:46"}
{"transaction_id": "4", "transaction_card_type": "Visa", "transaction_amount": 22.75, "transaction_datetime": "2021-01-25 16:38:47"}
{"transaction_id": "5", "transaction_card_type": "MasterCard", "transaction_amount": 492.01, "transaction_datetime": "2021-01-25 16:38:48"}

我正在 PySpark 中执行以下代码

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

KAFKA_TOPIC_NAME_CONS = "testtopic"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost:9092'

spark = SparkSession \
    .builder \
    .appName("PySpark Structured Streaming with Kafka Demo") \
    .config("spark.jars", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.jars", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.jars", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .config("spark.executor.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.executor.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.executor.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .config("spark.driver.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.driver.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.driver.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .config("spark.executor.extraLibrary", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.executor.extraLibrary", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.executor.extraLibrary", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .getOrCreate()

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "testtopic").load()
ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
query = ds \
  .writeStream \
  .queryName("tableName") \
  .option("partition.assignment.strategy", "range")
  .format("console") \
  .start()

我得到的错误如下:

21/01/25 18:53:41 WARN kafka010.KafkaOffsetReader:尝试 1 时出错 获取 Kafka 偏移量:org.apache.kafka.common.config.ConfigException: 缺少所需的配置“partition.assignment.strategy” 没有默认值。

我做了一些研究,他们说名为“kafka-clients-1.1.0.jar”的 .jar 文件似乎是问题所在,但是我已经用相同的方法训练了 2.6.0 和 1.1.0 版本结果。

**

编辑:

**

我在“spark-defaults”中添加了以下内容

spark.jars /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar
spark.executor.extraClassPath /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar
spark.driver.extraClassPath /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar
spark.executor.extraLibrary /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar

并通过以下方式创建我的会话:

spark = SparkSession \
    .builder \
    .appName("PySpark Structured Streaming with Kafka Demo") \
    .getOrCreate()

我仍然收到以下错误:

java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister:提供者 org.apache.spark.sql.kafka010.KafkaSourceProvider 不能 实例化

对于这行代码:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "testtopic").load()

【问题讨论】:

    标签: python apache-spark pyspark apache-kafka


    【解决方案1】:

    Spark docs中提到的,只需要包含以下依赖即可:

    groupId = org.apache.spark
    artifactId = spark-streaming-kafka-0-10_2.11
    version = 2.4.7 <-- replace this by your appropriate Spark version
    

    Spark 警告不要直接使用 kafka-clients*.jar,因为它已经包含了这些 jar,并且为同一个库添加多个 jar 会使调试更加困难。

    不要手动添加对 org.apache.kafka 工件的依赖项(例如 kafka-clients)。 spark-streaming-kafka-0-10 工件已经具有适当的传递依赖项,不同的版本可能以难以诊断的方式不兼容。

    【讨论】:

    • 我已经为我的特定版本下载了您提到的 .jar 文件。现在我得到以下错误: df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "testtopic") .load() java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: 提供者 org.apache.spark.sql.kafka010.KafkaSourceProvider 无法实例化
    • 只是为了确保您已删除您在类路径中添加的与 kafka 相关的所有 jar,包括问题中提到的多个版本?通常会抛出此错误,因为您使用的是旧版本的库。
    • 我还建议您通过参数而不是代码来提供这些 jar。
    • 一个相关的 SO 问题是here
    • 我已经添加了对第一篇文章中提到的主题的评论,因此它更具可读性。我仍然有同样的问题
    猜你喜欢
    • 2017-04-15
    • 1970-01-01
    • 2018-09-15
    • 2014-10-19
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2020-08-27
    • 1970-01-01
    相关资源
    最近更新 更多