【Posted】: 2026-01-19 04:50:01
【Problem description】:
I am running the following code in a Jupyter notebook on Windows 10:
airline= 'C:\spark-3.0.0-preview-bin-hadoop2.7\airlines.csv'
airlines = sc.textFile(airline)
print(airlines)
C:\spark-3.0.0-preview-bin-hadoop2.7irlines.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
(which appears to give me the correct output)
But when I try to run print(airlines.first())
I get this error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-13-36bc0976264b> in <module>
----> 1 print(airlines.first())
.
.
.
Py4JJavaError: An error occurred while calling o30.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/spark-3.0.0-preview-bin-hadoop2.7irlines.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.api.java.JavaRDDLike.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.JavaRDDLike.partitions$(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
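The mangled path in both the printed RDD and the traceback ("...hadoop2.7irlines.csv") points to a Python string-escape issue rather than a Spark one: in an ordinary string literal, `\a` is the escape for the ASCII bell character, so the backslash before `airlines.csv` is consumed before Spark ever sees the path. A minimal sketch, using the path from the question (the raw-string and forward-slash variants are suggested fixes, not code from the original post):

```python
# In a plain string literal, '\a' is an escape for the bell character
# (0x07), so the path silently loses the '\a' before 'irlines.csv'.
# (Newer Pythons also warn about the unrecognized '\s' escape.)
plain = 'C:\spark-3.0.0-preview-bin-hadoop2.7\airlines.csv'
print(repr(plain))  # the '\x07' shows where '\airlines' was eaten

# Two common fixes: a raw string, or forward slashes (which Windows accepts).
raw = r'C:\spark-3.0.0-preview-bin-hadoop2.7\airlines.csv'
fwd = 'C:/spark-3.0.0-preview-bin-hadoop2.7/airlines.csv'
print('\\airlines.csv' in raw, '/airlines.csv' in fwd)
```

Either fixed form keeps `airlines.csv` intact, which should make `sc.textFile(...)` resolve the file at its real location.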
【Comments】:
- What is sc here?
- Spark is lazy, meaning a file is not loaded until it is needed. That is why your first statement raises no error: it does not actually read the file. Only once you call collect, first, toPandas, show, or any other "action" are the required "transformations", such as sc.textFile, executed. The error simply says that your file does not exist at the given input path. Looks easy to fix!
- @akshat, sc is the default Spark context object
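The laziness described in the comment above can be mimicked in plain Python with a generator, as a rough analogy (this is illustrative only, not Spark itself, and the file path is made up): building the pipeline does no I/O, so the bad path only surfaces when the result is consumed, just as the error here appears at first() rather than at textFile().

```python
# Rough analogy for Spark's lazy evaluation: defining the pipeline
# does no I/O; only consuming it does.
def read_lines(path):
    with open(path) as f:     # runs only once the generator is iterated
        for line in f:
            yield line

lines = read_lines('no/such/airlines.csv')  # no error yet ("transformation")
try:
    next(lines)                             # I/O happens here ("action")
except FileNotFoundError as err:
    print('failure surfaces only on consumption:', err)
```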
Tags: python apache-spark pyspark