【Question Title】: Pyspark save to S3
【Posted】: 2017-02-06 14:46:58
【Question Description】:

I am trying to save a large file to an Amazon S3 bucket. The following code works perfectly:

sqlContext.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"]).select('A').repartition(1).write \
    .format("text") \
    .mode("overwrite") \
    .option("header", "false") \
    .option("codec", "gzip") \
    .save("s3n://BUCKETNAME/temp.txt")

Saving my full DataFrame, however, fails, and my notebook shows the following error:

Py4JJavaError: An error occurred while calling o1274.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Date
    at org.jets3t.service.model.StorageObject.getLastModifiedDate(StorageObject.java:376)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:176)

In the Spark application UI, the job is reported as successful.

I have the following configuration:

sc._jsc.hadoopConfiguration().set("fs.s3n.multipart.uploads.enabled", "true")

To debug this, I tried the following, which works fine:

sqlContext.createDataFrame(full_df.select('columnA').take(5),['columnA']).select('columnA').repartition(1).write \
    .format("text") \
    .mode("overwrite") \
    .option("header", "false") \
    .option("codec", "gzip") \
    .save("s3n://BUCKETNAME/temp.txt")

I found the following link that seems related to this issue, but I could not find a working Jets3t package.

Can anyone help with this mysterious error?

【Question Comments】:

Tags: apache-spark amazon-s3 pyspark spark-dataframe


【Solution 1】:

Switch from s3n to s3a, using the Hadoop 2.7 JARs. The days of s3n are over; it only receives fixes for regressions.
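A minimal sketch of that switch, as `spark-defaults.conf` entries (the artifact versions and placeholder credentials are assumptions; `hadoop-aws` must match the cluster's Hadoop version, and `hadoop-aws:2.7.x` pairs with `aws-java-sdk:1.7.4`):

```
# spark-defaults.conf -- illustrative s3a setup for Hadoop 2.7
spark.jars.packages              org.apache.hadoop:hadoop-aws:2.7.2,com.amazonaws:aws-java-sdk:1.7.4
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
```

With that in place, the write from the question only needs the URI scheme changed, e.g. `.save("s3a://BUCKETNAME/temp.txt")`.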

【Discussion】:

• This gives me a Py4JJavaError: An error occurred while calling o631.save. : java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V, which seems to be caused by the Hadoop version. In my spark-defaults.conf I have the following: com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2, which seems consistent with this page. However, the installed Hadoop is Hadoop 2.0.0-cdh4.7.1 (per hadoop version), which is something that cannot be changed...