Title: How to split a string at commas (,) but ignore commas inside double quotes (",")
Posted: 2017-05-22 00:58:05
Question:

I have a string from a text file in the following format:

"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","St Louis, MO","B-5","24160 L221","2","female"

I want to split the string at commas (,) but ignore commas (,) inside double quotes (""). I am using Spark and Scala with a case class to create a DataFrame. I tried the code below, but I get an error:

val tit_rdd = td.map(td=>td.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")).map(td=>tit(td(0).replaceAll("\"","").toInt ,
                                                            td(1).replaceAll("\"",""),
                                                            td(2).toInt,
                                                            td(3).replaceAll("\"",""),
                                                            td(4).toDouble,
                                                            td(5).replaceAll("\"",""),
                                                            td(6).replaceAll("\"",""),
                                                            td(7).replaceAll("\"",""),
                                                            td(8).replaceAll("\"",""),
                                                            td(9).replaceAll("\"","").toInt,
                                                            td(10).replaceAll("\"","")))

The case class is defined as follows:

case class tit (Num: Int, Class: String, Survival_Code: Int, Name: String, Age: Double, Province: String, Address: String, Coach_No: String, Coach_ID: String, Floor_No:Int, Gender:String)

Error:

17/05/21 14:52:39 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:592)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:40)
    at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:31)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Comments:

    Tags: scala apache-spark dataframe rdd case-class


    Solution 1:

    The NumberFormatException is thrown because one of the numeric fields in your data is empty and you are trying to convert it to an Integer with .toInt.

    The solution is to wrap the conversions in Try with getOrElse, as shown below:

    import scala.util.Try

    val tit_rdd = td.map(td=>td.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)"))
      .map(td=>tit(Try(td(0).replaceAll("\"","").toInt) getOrElse 0 ,
      td(1).replaceAll("\"",""),
      Try(td(2).toInt) getOrElse 0,
      td(3).replaceAll("\"",""),
      Try(td(4).toDouble) getOrElse 0.0,
      td(5).replaceAll("\"",""),
      td(6).replaceAll("\"",""),
      td(7).replaceAll("\"",""),
      td(8).replaceAll("\"",""),
      Try(td(9).replaceAll("\"","").toInt) getOrElse 0,
      td(10).replaceAll("\"","")))
    

    That should solve the problem.
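
    If you would rather not repeat the Try(...) getOrElse pattern for every numeric column, a couple of small helpers along these lines could keep the mapping tidy (a sketch; the names safeInt and safeDouble are mine, not part of the original answer):

    import scala.util.Try

    // Hypothetical helpers: strip the quotes and fall back to a default
    // when the field is empty or not a valid number.
    def safeInt(s: String, default: Int = 0): Int =
      Try(s.replaceAll("\"", "").trim.toInt) getOrElse default

    def safeDouble(s: String, default: Double = 0.0): Double =
      Try(s.replaceAll("\"", "").trim.toDouble) getOrElse default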

    Another way to turn the text file into a DataFrame is to use the Databricks CSV reader:

    sqlContext.read.format("com.databricks.spark.csv").load("path to the text file")
    

    This will generate default header names such as _c0, _c1, and so on.
    What you can do is put a header line in your text file and set the header option in the call above:

    sqlContext.read.format("com.databricks.spark.csv").option("header", true).load("path to the text file")
    

    You can play around with more options yourself.
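
    For example, a sketch with a few commonly used spark-csv options (the values below are assumptions about your file, adjust as needed):

    sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")        // first line holds column names
      .option("inferSchema", "true")   // infer column types instead of all strings
      .option("quote", "\"")           // commas inside these quotes are not split on
      .load("path to the text file")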

    Comments:

      Solution 2:

      Hope this helps. The idea is to first replace every "," that acts as a field separator with "#", and then split on "#" (st below holds the sample line from the question):

      scala> st.replace("""","""", "#").replace("""",""","#").replace(""","""", "#").replace(""""""", "").split("#").map("\"" + _ + "\"")
      res1: Array[String] = Array("1", "1st", "1", "Allen, Miss Elisabeth Walton", "29.0000", "Southampton", "St Louis, MO", "B-5", "24160 L221", "2", "female")
      scala> res1.size
      res2: Int = 11
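
      Plugged into the RDD pipeline from the question, that replacement trick could look roughly like this (a sketch; the helper name splitLine is mine, and empty numeric fields would still need the Try(...) getOrElse guard from Solution 1):

      // Hypothetical helper: rewrite the quote-comma-quote delimiters to '#',
      // drop the remaining quotes, then split on '#'.
      def splitLine(line: String): Array[String] =
        line.replace("\",\"", "#")   // "," between two quoted fields
            .replace("\",", "#")     // ", between a quoted and an unquoted field
            .replace(",\"", "#")     // ," between an unquoted and a quoted field
            .replace("\"", "")       // strip any quotes that are left
            .split("#")

      val tit_rdd = td.map(splitLine).map(f =>
        tit(f(0).toInt, f(1), f(2).toInt, f(3), f(4).toDouble,
            f(5), f(6), f(7), f(8), f(9).toInt, f(10)))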
      

      Comments:

        Solution 3:

        You should use Spark's built-in CSV reader.
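
        A minimal sketch of what that could look like, assuming Spark 2.x where spark.read.csv is available (the path is a placeholder):

        // The built-in reader's default quote character is '"', so commas
        // inside quoted fields are not treated as delimiters.
        val df = spark.read
          .option("header", "false")     // the sample file has no header line
          .option("inferSchema", "true") // infer column types
          .csv("path to the text file")

        df.show()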

        Comments:

          Solution 4:

          You can load the CSV data with Spark-CSV; it handles commas inside double quotes for you.

          Here is how you can use it:

          import org.apache.spark.sql.{Encoders, SparkSession}
          
            val spark =
              SparkSession.builder().master("local").appName("test").getOrCreate()
          
            import spark.implicits._
          
            // Derive the DataFrame schema from the tit case class
            val titschema = Encoders.product[tit].schema
          
            val dfList = spark.read.schema(schema = titschema).csv("data.csv").as[tit]
          
            dfList.show()
          
            case class tit(Num: Int,
                           Class: String,
                           Survival_Code: Int,
                           Name: String,
                           Age: Double,
                           Province: String,
                           Address: String,
                           Coach_No: String,
                           Coach_ID: String,
                           Floor_No: Int,
                           Gender: String)
          

          I hope this helps!

          If you want to create the schema the same way as for SQLContext.createDataFrame, you can use Scala reflection:

          import org.apache.spark.sql.catalyst.ScalaReflection
          import org.apache.spark.sql.types.StructType

          val titschema = ScalaReflection.schemaFor[tit].dataType.asInstanceOf[StructType]
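
          A brief sketch of how that schema might then be used (rowRdd here is an assumed RDD[Row] built from your parsed fields, not part of the original answer):

          import org.apache.spark.sql.Row

          // rowRdd is assumed to already contain Rows whose values match the schema,
          // e.g. Row(1, "1st", 1, "Allen, Miss Elisabeth Walton", 29.0, ...).
          val df = sqlContext.createDataFrame(rowRdd, titschema)
          df.show()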
          

          Comments:
