【Question Title】: Cassandra with PySpark and Python >= 3.6
【Posted】: 2021-11-15 06:16:54
【Question】:

I am new to Cassandra and PySpark. Initially I installed Cassandra 3.11.1, OpenJDK 1.8, PySpark 3.x, and Scala 2.12. After starting my Python server I ran into a lot of errors, as shown below.

raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.NoClassDefFoundError: scala/Product$class
        at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
        at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
        at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
        at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
        at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
        at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 23 more
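For context, the o33.load call in the traceback corresponds to a Cassandra DataFrame read like the sketch below (the keyspace and table names ks/tbl are hypothetical placeholders). A NoClassDefFoundError on scala/Product$class typically means the connector JAR was built for a different Scala version than the one Spark itself runs on:

from pyspark.sql import SparkSession

# A plain session; the Cassandra connector JAR must match both the Spark
# major version and the Scala version Spark was compiled against.
spark = SparkSession.builder.appName("cassandra-read").getOrCreate()

# This read ends up in DataFrameReader.load, the call that fails above.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="tbl")   # placeholder names
      .load())
df.show()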

I do not know exactly what this error means, but after some research I realized there were problems with the PySpark-to-Cassandra connection, so I also checked the versions. In my research I found that Cassandra versions other than 4.x are not compatible with Python 3.9. I uninstalled Cassandra and tried to install a Cassandra 4 distribution, but another set of errors was raised after running the command:

wget http://mirror.cogentco.com/pub/apache/cassandra/4.0-beta2/apache-cassandra-4.0-beta2-bin.tar.gz

Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cassandra : Depends: python3 (>= 3.6) but 3.5.1-3 is to be installed
             Recommends: ntp but it is not going to be installed or
                         time-daemon
E: Unable to correct problems, you have held broken packages.
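Note that this error comes from apt and the Debian cassandra package, not from the tarball fetched by the wget command above. The tarball does not go through apt at all, so as a sketch (assuming the download succeeded and default paths), it can be extracted and run directly, sidestepping the python3 dependency check:

tar xzf apache-cassandra-4.0-beta2-bin.tar.gz
cd apache-cassandra-4.0-beta2
bin/cassandra -f    # -f keeps Cassandra in the foreground; cqlsh still needs a compatible Python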

Can someone help me understand this problem, and how to install Cassandra and PySpark alongside Python 3.9? Is this a version incompatibility?

Update to the question based on the answer:

I updated my versions on another machine.

Currently I am using the following versions: PySpark 3.0.1, Cassandra 4.0, cqlsh 5.0.1, Python 3.6, Scala 2.12.

I tried connector versions 3.0.0 and 3.1.0, and both give me errors:

UNRESOLVED DEPENDENCY: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


.......
        raise Exception("Java gateway process exited before sending its port number")
    Exception: Java gateway process exited before sending its port number

Connection string used: --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell, since the PySpark version is now 3.0.1.
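For what it's worth, when that argument string is supplied through the PYSPARK_SUBMIT_ARGS environment variable, it has to be set before PySpark launches the JVM; otherwise the package is never resolved and the gateway can die with the "Java gateway process exited" error above. A minimal sketch, using the connector coordinates from the question:

import os

# Must be set before the first SparkSession is created; the string
# has to end with 'pyspark-shell'.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 "
    "--conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()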

【Question Comments】:

  • Could you share how you start your Spark environment and the Spark session?
  • '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell' is the code used to connect Spark to Cassandra
  • Do you see Spark fetching that package in the logs?
  • No! It shows the errors

Tags: python-3.x apache-spark pyspark cassandra spark-cassandra-connector


【Solution 1】:

You are using the wrong version of the Cassandra connector: with PySpark 3.x you need the corresponding connector release, 3.0 or 3.1. The version you passed (2.3.0, built for Scala 2.11) targets older versions of Spark:

pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0

P.S. Cassandra 4.0 has also been released already, so there is no point in using beta2.
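Equivalently, the package can be requested from inside a script through the spark.jars.packages configuration instead of the command line; a minimal sketch, assuming connector 3.1.0 and a Cassandra node on localhost (keyspace/table names are placeholders):

from pyspark.sql import SparkSession

# spark.jars.packages makes Spark resolve the connector from Maven
# Central at startup; the _2.12 suffix must match Spark's Scala build.
spark = (SparkSession.builder
         .appName("cassandra-example")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="tbl")   # placeholder names
      .load())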

【Comments】:

  • I changed it to 3.1.0, but that gives me the error UNRESOLVED DEPENDENCIES com.datastax.spark#spark-cassandra-connector_2.12;3.1.0: not found
  • This is my connection string: --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell
  • It works fine for me. Check whether a firewall is blocking access to Maven Central or something similar; if you use your company's Maven repositories, specify them with the --repositories option, as in the example below.
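For example, a company mirror could be passed like this (the repository URL is a placeholder):

pyspark --repositories https://repo.example.com/maven --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0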