【Question Title】: pyspark - Error while loading .csv file from url to Spark
【Posted】: 2020-10-21 10:21:56
【Question Description】:

Loading data from a URL with PySpark:

url = "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("cities.csv"), header=True)

However, the following error occurs:

spark.read.csv(SparkFiles.get("cities.csv"), header=True)
[Stage 0:>                                                                                                                                                                                    
(0 + 1) / 1]20/06/30 19:10:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: File /tmp/spark-1ee8b00f-8657-4cdc-8d7b-e3bc473bbce7/userFiles-f9e0a88d-8678-48c4-a21b-c06ce76d528b/cities.csv exists and does not match contents of https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv
    

20/06/30 19:10:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 499, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.csv.

How should I resolve this issue?

【Question Discussion】:

Tags: python apache-spark pyspark py4j


    【Solution 1】:

    The problem is your URL. To read data from GitHub, you must pass the raw URL, not the HTML page of the file.

    On the file's page, click "Raw" and copy that URL to fetch the data:

    url = 'https://raw.githubusercontent.com/jokecamp/FootballData/master/openFootballData/cities.csv'
    from pyspark import SparkFiles
    spark.sparkContext.addFile(url)
    df = spark.read.csv(SparkFiles.get("cities.csv"), header=True)
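
    The blob-to-raw conversion can also be done programmatically instead of clicking "Raw" by hand. Below is a minimal sketch; `github_raw_url` is a hypothetical helper name, not a Spark or GitHub API:

    ```python
    def github_raw_url(blob_url: str) -> str:
        # Hypothetical helper: rewrite a github.com "blob" page URL
        #   https://github.com/<user>/<repo>/blob/<branch>/<path>
        # into its raw-content equivalent
        #   https://raw.githubusercontent.com/<user>/<repo>/<branch>/<path>
        return blob_url.replace(
            "https://github.com/", "https://raw.githubusercontent.com/"
        ).replace("/blob/", "/", 1)

    url = github_raw_url(
        "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
    )
    # url is now the raw.githubusercontent.com address usable with addFile()
    ```

    Note that `SparkContext.addFile` caches the downloaded file for the lifetime of the application, so if you already added the bad (HTML) version under the same name, restart the Spark session before adding the corrected URL.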
    

    【Discussion】:

    • Shubham Jain, that solved the problem perfectly. Thank you very much!