【Posted】: 2017-05-22 00:58:05
【Problem description】:
I have a text file whose lines are strings in the following format:
"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","St Louis, MO","B-5","24160 L221","2","female"
I want to split the string on commas (,) but ignore any comma that appears inside double quotes ("). I am using Spark and Scala with a case class to create a DataFrame. I tried the code below, but I get an error:
val tit_rdd = td.map(td=>td.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")).map(td=>tit(td(0).replaceAll("\"","").toInt ,
td(1).replaceAll("\"",""),
td(2).toInt,
td(3).replaceAll("\"",""),
td(4).toDouble,
td(5).replaceAll("\"",""),
td(6).replaceAll("\"",""),
td(7).replaceAll("\"",""),
td(8).replaceAll("\"",""),
td(9).replaceAll("\"","").toInt,
td(10).replaceAll("\"","")))
The case class is as follows:
case class tit (Num: Int, Class: String, Survival_Code: Int, Name: String, Age: Double, Province: String, Address: String, Coach_No: String, Coach_ID: String, Floor_No:Int, Gender:String)
Error:
17/05/21 14:52:39 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:40)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:31)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
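The trace bottoms out in StringOps.toInt, and "For input string: \"\"" means toInt was handed an empty string, so at least one row must contain a blank numeric field. A minimal, Spark-free repro sketch (the row below is hypothetical, with the same layout as the sample but a blank Age field; the split regex is the one from the question):

```scala
import scala.util.Try

object EmptyFieldRepro {
  def main(args: Array[String]): Unit = {
    // Hypothetical row: same 11-column layout as the question, but Age (index 4) is blank.
    val row = """"9","2nd",0,"Doe, Mr John",,"Southampton","London","C-1","12345","1","male""""
    // Same quote-aware split; the -1 limit also keeps trailing empty fields,
    // which String.split silently drops by default.
    val cols = row.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1)
    println(cols.length)                      // 11 — the quoted comma in the name is not split on
    println(cols(4).isEmpty)                  // true — the blank field survives as ""
    println(Try(cols(4).toDouble).isFailure)  // true — "".toDouble throws NumberFormatException
  }
}
```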
【Discussion】:
Tags: scala apache-spark dataframe rdd case-class
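One way around the exception is to make the numeric conversions tolerant of empty fields. A minimal sketch (plain Scala, helper names hypothetical; the regex and sample row are the ones from the question) that wraps each conversion in scala.util.Try and falls back to a default:

```scala
import scala.util.Try

object SafeSplitSketch {
  // Quote-aware split from the question: a comma is a delimiter only when
  // an even number of double quotes remains to the end of the line.
  val splitRegex = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

  def unquote(s: String): String = s.replaceAll("\"", "")

  // Defensive numeric parsing: blank or malformed fields become a default
  // instead of throwing NumberFormatException.
  def toIntOr(s: String, default: Int): Int =
    Try(unquote(s).trim.toInt).getOrElse(default)

  def toDoubleOr(s: String, default: Double): Double =
    Try(unquote(s).trim.toDouble).getOrElse(default)

  def main(args: Array[String]): Unit = {
    val line = """"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","St Louis, MO","B-5","24160 L221","2","female""""
    // limit -1 keeps trailing empty fields, so every row yields all columns
    val fields = line.split(splitRegex, -1)
    println(fields.length)           // 11
    println(toIntOr(fields(0), -1))  // 1
    println(toIntOr("", -1))         // -1 — an empty field no longer throws
  }
}
```

The same toIntOr/toDoubleOr calls can replace the bare .toInt/.toDouble inside the tit(...) constructor in the map.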