【Title】: Transformation of data into a list of class objects in Spark Scala
【Posted】: 2016-12-24 03:31:41
【Question】:

I am trying to write a Spark transformation that converts the data below into a list of objects of the class shown after it. I am completely new to Scala and Spark; I tried splitting the data and putting it into a case class, but I could not merge the records back together per player. Any help is appreciated.

Data:

FirstName,LastName,Country,match,Goals
Cristiano,Ronaldo,Portugal,Match1,1
Cristiano,Ronaldo,Portugal,Match2,1
Cristiano,Ronaldo,Portugal,Match3,0
Cristiano,Ronaldo,Portugal,Match4,2
Lionel,Messi,Argentina,Match1,1
Lionel,Messi,Argentina,Match2,2
Lionel,Messi,Argentina,Match3,1
Lionel,Messi,Argentina,Match4,2

Desired output:

case class PlayerStats(
    FirstName: String,
    LastName: String,
    Country: String,
    matchandscore: Map[String, Int]
)

【Comments】:

    Tags: scala class apache-spark transformation


    【Solution 1】:

    Assuming you have already loaded the data into an RDD[String] named data:

    import org.apache.spark.rdd.RDD

    case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

    val result: RDD[PlayerStats] = data
      .filter(!_.startsWith("FirstName"))     // drop the header line
      .map(_.split(","))                      // split each line into fields
      .map { case Array(fn, ln, cntry, mn, g) =>  // map fields into case-class instances
        PlayerStats(fn, ln, cntry, Map(mn -> g.toInt))
      }
      .keyBy(p => (p.FirstName, p.LastName))  // key by player
      .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore))  // merge per-match maps
      .map(_._2)                              // drop the key
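    The keyBy/reduceByKey merge above can be sketched with plain Scala collections, no cluster needed; the rows below are an abbreviated sample of the question's data:

    ```scala
    // Plain-collections analogue of keyBy + reduceByKey: group rows per player,
    // then merge the per-match maps with ++.
    case class PlayerStats(FirstName: String, LastName: String, Country: String,
                           matchandscore: Map[String, Int])

    val rows = List(
      PlayerStats("Cristiano", "Ronaldo", "Portugal",  Map("Match1" -> 1)),
      PlayerStats("Cristiano", "Ronaldo", "Portugal",  Map("Match2" -> 1)),
      PlayerStats("Lionel",    "Messi",   "Argentina", Map("Match1" -> 1))
    )

    val merged: List[PlayerStats] = rows
      .groupBy(p => (p.FirstName, p.LastName))  // like keyBy + groupByKey
      .values
      .map(_.reduce((a, b) => a.copy(matchandscore = a.matchandscore ++ b.matchandscore)))
      .toList

    merged.foreach(println)
    ```

    Note that `++` keeps the right-hand value on duplicate keys; here each match name appears once per player, so the merge is unambiguous.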
    

    【Discussion】:

    • @Bhushan Glad it helped; you can accept or upvote so future readers know it was useful.
    【Solution 2】:

    First convert each line into a key-value pair, e.g. (Cristiano, rest of the row), then apply groupByKey (reduceByKey also works). After grouping, transform the key-value pairs into your class by assembling the values. The well-known word-count program illustrates the pattern:

    http://spark.apache.org/examples.html
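    The word-count pattern referenced above maps directly onto this problem: key each row by the player, then reduce. A plain-Scala sketch, with collection methods standing in for groupByKey/reduceByKey:

    ```scala
    // Word-count-style aggregation: key each CSV row by player, then sum goals.
    val lines = Seq(
      "Cristiano,Ronaldo,Portugal,Match1,1",
      "Cristiano,Ronaldo,Portugal,Match2,1",
      "Lionel,Messi,Argentina,Match1,1"
    )

    val totalGoals: Map[String, Int] = lines
      .map(_.split(","))
      .map(f => (f(0) + " " + f(1), f(4).toInt))   // (player, goals) pairs
      .groupBy(_._1)                               // like groupByKey
      .map { case (player, pairs) => (player, pairs.map(_._2).sum) }  // like reduceByKey(_ + _)

    println(totalGoals)
    ```

    In actual Spark the last two steps collapse into a single `reduceByKey(_ + _)` on the pair RDD, exactly as in the linked word-count example.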

    【Discussion】:

      【Solution 3】:

      You can try the following:

      val file = sc.textFile("myfile.csv")

      import spark.implicits._                            // required for the toDF conversion

      val df = file.map(line => line.split(","))          // split each line by comma
                   .filter(fields => fields(0) != "FirstName")  // drop the header row
                   .map(fields =>                         // build one row per match
                     (fields(0), fields(1), fields(2), Map(fields(3) -> fields(4).toInt)))
                   .toDF("FirstName", "LastName", "Country", "MatchAndScore")
      
      df.schema
      // res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true))
      
      df.show
      
      +---------+--------+---------+----------------+
      |FirstName|LastName|  Country|   MatchAndScore|
      +---------+--------+---------+----------------+
      |Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
      |Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
      |Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
      |Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
      |   Lionel|   Messi|Argentina|Map(Match1 -> 1)|
      |   Lionel|   Messi|Argentina|Map(Match2 -> 2)|
      |   Lionel|   Messi|Argentina|Map(Match3 -> 1)|
      |   Lionel|   Messi|Argentina|Map(Match4 -> 2)|
      +---------+--------+---------+----------------+
      

      【Discussion】:
