NumberFormatException error while writing a dataframe to HDFS: answers

【Question title】：NumberFormatException error while writing a dataframe to HDFS
【Posted】：2017-10-30 15:41:06
【Question】：

I am writing a dataframe to HDFS with the following code:

final_df.write.format("com.databricks.spark.csv").option("header", "true").save("path_to_hdfs")

It gives me the following error:

Caused by: java.lang.NumberFormatException: For input string: "124085346080"

The full stack trace is below:

at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "124085346080"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:583)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more

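If it is trying to convert "124085346080" to an int, I can't see any reason why it shouldn't be able to.

Any suggestions as to why this is happening and what can be done to fix it?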

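【Comments】：

Can you provide the schema of your dataframe and a sample of the data? Also, I'd guess the error does not come from writing the CSV itself.
Ramesh: the error is correctly resolved below; the number mentioned above is outside the INT range. Since the action is only triggered when the frame is written, the error does in fact surface at the CSV write stage.
Thanks for letting me know :)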

Tags: scala hadoop apache-spark pyspark hdfs

【Solution 1】：

124,085,346,080 is far larger than Integer.MAX_VALUE (2,147,483,647), so it cannot be converted to an integer. Use long instead.
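
The range argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch in plain Python: the helper names `fits_int` and `fits_long` are made up for this example, and the bounds are those of the JVM's 32-bit int and 64-bit long.

```python
# Bounds of the JVM's 32-bit int and 64-bit long types.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def fits_int(text):
    """Return True if the decimal string fits in a 32-bit signed integer."""
    return INT_MIN <= int(text) <= INT_MAX


def fits_long(text):
    """Return True if the decimal string fits in a 64-bit signed integer."""
    return LONG_MIN <= int(text) <= LONG_MAX


value = "124085346080"   # the input string from the exception message
print(INT_MAX)           # 2147483647
print(fits_int(value))   # False
print(fits_long(value))  # True
```

The value is roughly 58 times larger than the int maximum, which is why Integer.parseInt rejects it even though the string is a well-formed number.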

【Discussion】：

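【Solution 2】：

The value 124085346080 cannot be stored in an integer type, because the largest value an integer can hold is Integer.MAX_VALUE (2147483647). Try using long: change the column's type to long in the schema when reading the data.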
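
As a rough sketch of what changing the schema to long at read time might look like in PySpark: the code below assumes the source is a CSV file with a header row and that `spark` is an existing SparkSession. The function name and the column names `account_id` and `name` are hypothetical, since the question does not show its schema or its read code.

```python
def read_csv_with_long_column(spark, path):
    """Read a headered CSV with an explicit schema instead of inferred types.

    `spark` is an existing SparkSession. The column names below are
    hypothetical placeholders, not taken from the original question.
    """
    # Imported inside the function so the sketch can be defined even where
    # pyspark is not installed.
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    schema = StructType([
        # LongType is a 64-bit signed integer, wide enough for 124085346080.
        StructField("account_id", LongType(), True),
        StructField("name", StringType(), True),
    ])
    return spark.read.option("header", "true").schema(schema).csv(path)
```

Because Spark evaluates lazily, a column typed too narrowly at read time only fails once an action such as the write runs, which matches the stack trace above: the parse error comes from the CSV reader (CSVTypeCast.castTo) while it is being driven by the writer.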

【Discussion】：

Since pyspark does not support LongType(), do I have any option other than DecimalType()?
I believe it does support LongType(); please see spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/sql/…