【问题标题】:pyspark read file from AWS S3 not workingpyspark 从 AWS S3 读取文件不起作用
【发布时间】:2019-09-17 18:47:13
【问题描述】:

我使用 brew 安装了 spark 和 hadoop:

brew info hadoop #=> hadoop: stable 3.1.2
brew info apache-spark #=> apache-spark: stable 2.4.4

我现在正在尝试加载托管在 s3 上的 csv 文件,尝试了许多不同的方法但没有成功(这里是其中之一):

import pyspark

conf = pyspark.SparkConf()

sc = pyspark.SparkContext('local[4]', conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "key here")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "key here")

sql = pyspark.SQLContext(sc)
df = sql.read.csv('s3a://pilo/fi/data_2014_1.csv')

我收到此错误:

19/09/17 14:34:52 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/ingestion/clients/aws_s3.py", line 51, in <module>
    df = sql.read.csv('s3a://pilo/fi/data_2014_1.csv')
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 476, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:618)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
        ... 30 more

它看起来与我的 aws s3 凭据有关,但不确定如何设置。 (目前我的敬畏证书在我的 bash_profile 中)请帮助。

【问题讨论】:

    标签: apache-spark hadoop amazon-s3 pyspark


    【解决方案1】:

    也许thisisprabin 的解决方案here 会有所帮助?

    将以下内容添加到此文件“hadoop/etc/hadoop/core-site.xml”

    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>***</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>***</value>
    </property>
    

    sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
    
    sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/
    

    【讨论】:

    • 谢谢。我在哪里可以找到 Hadoop 在我本地的位置
    • 我猜是在/usr/local/Cellar/hadoop/3.1.2下,配置文件在/usr/local/Cellar/hadoop/3.1.2/libexec/etc/下hadoop/.
    • 你好@Theofanis,我已经通过自制软件安装了pyspark,但我找不到hadoop文件夹,能够运行pyspark并且它按预期工作。如何找到 core-site.xml?
    猜你喜欢
    • 2018-08-19
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2021-01-05
    • 2015-08-30
    • 2019-10-27
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多