【发布时间】:2021-11-15 06:16:54
【问题描述】:
我是 Cassandra 和 Pyspark 的新手,最初我安装了 cassandra 版本 3.11.1、openjdk 1.8、pyspark 3.x 和 scala 1.12。运行我的 python 服务器后,我遇到了很多错误,如下所示。
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 23 more
我不知道这个错误到底是什么,但经过一些研究后,我意识到 pyspark Cassandra 连接存在一些问题。然后我也检查了版本。在我的研究中,我发现 4.x 以外的 Cassandra 版本与 Python3.9 不兼容。我卸载了 Cassandra 并尝试安装 cassandra4 发行版,但在运行命令后又引发了另一组错误:
wgethttp://mirror.cogentco.com/pub/apache/cassandra/4.0-beta2/apache-cassandra-4.0-beta2-bin.tar.gz
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
cassandra : Depends: python3 (>= 3.6) but 3.5.1-3 is to be installed
Recommends: ntp but it is not going to be installed or
time-daemon
E: Unable to correct problems, you have held broken packages.
有人可以帮我理解这个问题吗?如何安装 Cassandra 和 Pyspark 以及 Python3.9。这里是不是版本不兼容?
根据答案更新问题
我在另一台机器上更新了我的版本:
目前,我正在使用以下版本:Pyspark 3.0.1 Cassandra:4.0 cqlsh:5.0.1 python:3.6 Scala:2.12
我尝试使用连接器 3.0.0 和 3.1.0 都给我错误:
UNRESOLVED DEPENDENCY: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
.......
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
使用的连接字符串:--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell 因为 pyspark 版本现在是 3.0.1。
【问题讨论】:
-
您能分享一下您是如何启动 Spark 环境和 Spark 会话的吗?
-
'--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell'用于将 spark 连接到 cassandra 的代码
-
你看到 Spark 在日志中获取那个包了吗?
-
不!显示错误
标签: python-3.x apache-spark pyspark cassandra spark-cassandra-connector