【Question title】: How to store variables from a text file and manipulate its contents: Spark RDD / Scala
【Posted】: 2020-05-28 10:30:30
【Question】:

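I can't work out how to read the data into manageable variables, or how to manipulate it to retrieve the highest and lowest sales figures.

Problem: compute the highest- and lowest-selling genre by global sales (where global sales = NA_Sales + EU_Sales + JP_Sales). Print the result to the terminal with println.

Sample output: Highest selling genre: Shooter Global sales: 27.57 Lowest selling genre: Strategy Global sales: 0.23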

//Create a case class to represent the 9 columns
case class Sales (Name: String, Platform: String, Year: Int, Genre: String, Publisher: String, NA_Sales: Double, EU_Sales: Double, JP_Sales: Double, Other_Sales: Double)

//Generate a sales schema based upon our class above
import org.apache.spark.sql.Encoders
val salesSchema = Encoders.product[Sales].schema


//Using our data schema we can load the Sales data as a Dataframe
val salesDF = spark.read.option("header", "true").schema(salesSchema).csv("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")

//Convert the DataFrame to a Dataset
val salesDS = salesDF.as[Sales]

The data is formatted as follows:

Gran Turismo 3: A-Spec;PS2;2001;Racing;Sony Computer Entertainment;6.85;5.09;1.87;1.16
Call of Duty: Modern Warfare 3;X360;2011;Shooter;Activision;9.03;4.28;0.13;1.32
Pokemon Yellow: Special Pikachu Edition;GB;1998;Role-Playing;Nintendo;5.89;5.04;3.12;0.59
Call of Duty: Black Ops;X360;2010;Shooter;Activision;9.67;3.73;0.11;1.13
Pokemon HeartGold/Pokemon SoulSilver;DS;2009;Action;Nintendo;4.4;2.77;3.96;0.77
High Heat Major League Baseball 2003;PS2;2002;Sports;3DO;0.18;0.14;0;0.05
Panzer Dragoon;SAT;1995;Shooter;Sega;0;0;0.37;0
Corvette;GBA;2003;Racing;TDK Mediactive;0.2;0.07;0;0.01

【Comments】:

Tags: scala apache-spark

【Solution 1】:

The following approach may help.

Case class to generate schema

case class Sales (Name: String, Platform: String, Year: Int, Genre: String, Publisher: String,
NA_Sales: Double, EU_Sales: Double, JP_Sales: Double, Other_Sales: Double)

Read the data

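    val spark = sqlContext.sparkSession
    val implicits = spark.implicits
    import implicits._
    import org.apache.spark.sql.catalyst.ScalaReflection
    // col and sum are used in the aggregation step; StructType is used for the schema cast
    import org.apache.spark.sql.functions.{col, sum}
    import org.apache.spark.sql.types.StructType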

    val data =
      """
        |Gran Turismo 3: A-Spec;PS2;2001;Racing;Sony Computer Entertainment;6.85;5.09;1.87;1.16
        |Call of Duty: Modern Warfare 3;X360;2011;Shooter;Activision;9.03;4.28;0.13;1.32
        |Pokemon Yellow: Special Pikachu Edition;GB;1998;Role-Playing;Nintendo;5.89;5.04;3.12;0.59
        |Call of Duty: Black Ops;X360;2010;Shooter;Activision;9.67;3.73;0.11;1.13
        |Pokemon HeartGold/Pokemon SoulSilver;DS;2009;Action;Nintendo;4.4;2.77;3.96;0.77
        |High Heat Major League Baseball 2003;PS2;2002;Sports;3DO;0.18;0.14;0;0.05
        |Panzer Dragoon;SAT;1995;Shooter;Sega;0;0;0.37;0
        |Corvette;GBA;2003;Racing;TDK Mediactive;0.2;0.07;0;0.01
      """.stripMargin

    val ds = spark.read
      .schema(ScalaReflection.schemaFor[Sales].dataType.asInstanceOf[StructType])
      .option("sep", ";")
      .csv(data.split("\n").toSeq.toDS())

    ds.show(false)
    ds.printSchema()

Result

+---------------------------------------+--------+----+------------+---------------------------+--------+--------+--------+-----------+
|Name                                   |Platform|Year|Genre       |Publisher                  |NA_Sales|EU_Sales|JP_Sales|Other_Sales|
+---------------------------------------+--------+----+------------+---------------------------+--------+--------+--------+-----------+
|Gran Turismo 3: A-Spec                 |PS2     |2001|Racing      |Sony Computer Entertainment|6.85    |5.09    |1.87    |1.16       |
|Call of Duty: Modern Warfare 3         |X360    |2011|Shooter     |Activision                 |9.03    |4.28    |0.13    |1.32       |
|Pokemon Yellow: Special Pikachu Edition|GB      |1998|Role-Playing|Nintendo                   |5.89    |5.04    |3.12    |0.59       |
|Call of Duty: Black Ops                |X360    |2010|Shooter     |Activision                 |9.67    |3.73    |0.11    |1.13       |
|Pokemon HeartGold/Pokemon SoulSilver   |DS      |2009|Action      |Nintendo                   |4.4     |2.77    |3.96    |0.77       |
|High Heat Major League Baseball 2003   |PS2     |2002|Sports      |3DO                        |0.18    |0.14    |0.0     |0.05       |
|Panzer Dragoon                         |SAT     |1995|Shooter     |Sega                       |0.0     |0.0     |0.37    |0.0        |
|Corvette                               |GBA     |2003|Racing      |TDK Mediactive             |0.2     |0.07    |0.0     |0.01       |
+---------------------------------------+--------+----+------------+---------------------------+--------+--------+--------+-----------+

root
 |-- Name: string (nullable = true)
 |-- Platform: string (nullable = true)
 |-- Year: integer (nullable = false)
 |-- Genre: string (nullable = true)
 |-- Publisher: string (nullable = true)
 |-- NA_Sales: double (nullable = false)
 |-- EU_Sales: double (nullable = false)
 |-- JP_Sales: double (nullable = false)
 |-- Other_Sales: double (nullable = false)
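
The loader in the question sets no separator, so Spark splits on commas by default, while the sample rows are separated by semicolons. As a minimal sketch (not the answer author's code), the same `sep` option can be applied when reading from a file path instead of an in-memory string. The names `ReadSales`, `readSales`, and `hasHeader` are hypothetical, the temporary file stands in for the real HDFS path, and whether the real file has a header row is an assumption to check. The block assumes the Spark SQL library is on the classpath.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

case class Sales(Name: String, Platform: String, Year: Int, Genre: String, Publisher: String,
                 NA_Sales: Double, EU_Sales: Double, JP_Sales: Double, Other_Sales: Double)

object ReadSales {
  // Hypothetical helper: loads a ';'-separated sales file as a typed Dataset.
  // hasHeader is an assumption; set it to match whether the file starts with a header row.
  def readSales(spark: SparkSession, path: String, hasHeader: Boolean): Dataset[Sales] = {
    val encoder = Encoders.product[Sales]
    spark.read
      .option("sep", ";")                   // the sample rows use ';', not the default ','
      .option("header", hasHeader.toString)
      .schema(encoder.schema)               // schema derived from the case class
      .csv(path)
      .as[Sales](encoder)
  }

  def main(args: Array[String]): Unit = {
    // Local master is for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("read-sales").getOrCreate()
    // Two sample rows from the question, written to a temporary file for the demo.
    val sample =
      "Call of Duty: Black Ops;X360;2010;Shooter;Activision;9.67;3.73;0.11;1.13\n" +
      "Panzer Dragoon;SAT;1995;Shooter;Sega;0;0;0.37;0\n"
    val tmp = Files.createTempFile("vgsales", ".csv")
    Files.write(tmp, sample.getBytes(StandardCharsets.UTF_8))
    readSales(spark, tmp.toString, hasHeader = false).show(false)
    spark.stop()
  }
}
```

In spark-shell, where `spark` already exists, only the body of `readSales` would be needed, called with the real file path.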

Get the lowest- and highest-selling genre

    // global sales per row = NA_Sales + EU_Sales + JP_Sales, then summed per genre
    val processedDF = ds.withColumn("global_sale", col("NA_Sales") + col("EU_Sales") + col("JP_Sales"))
      .groupBy("Genre")
      .agg(sum("global_sale").as("global_sale_by_genre"))

    println("Lowest selling :: " + processedDF.orderBy(col("global_sale_by_genre").asc).head()
      .getValuesMap(Seq("Genre", "global_sale_by_genre")).mkString(", "))
    println("Highest selling :: " + processedDF.orderBy(col("global_sale_by_genre").desc).head()
      .getValuesMap(Seq("Genre", "global_sale_by_genre")).mkString(", "))

Result

Lowest selling :: Genre -> Sports, global_sale_by_genre -> 0.32
Highest selling :: Genre -> Shooter, global_sale_by_genre -> 27.32
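
As a cross-check of these totals, here is a minimal plain-Scala sketch that needs no Spark: it splits each sample row on `;`, sums NA_Sales + EU_Sales + JP_Sales per genre, and picks the largest and smallest totals. The names `GenreTotals` and `totalsByGenre` are hypothetical and the rows are the eight sample rows from the question. It is not the answer author's method, only the same computation on ordinary collections.

```scala
object GenreTotals {
  // Sample rows copied from the question; fields are separated by ';'.
  val lines: Seq[String] = Seq(
    "Gran Turismo 3: A-Spec;PS2;2001;Racing;Sony Computer Entertainment;6.85;5.09;1.87;1.16",
    "Call of Duty: Modern Warfare 3;X360;2011;Shooter;Activision;9.03;4.28;0.13;1.32",
    "Pokemon Yellow: Special Pikachu Edition;GB;1998;Role-Playing;Nintendo;5.89;5.04;3.12;0.59",
    "Call of Duty: Black Ops;X360;2010;Shooter;Activision;9.67;3.73;0.11;1.13",
    "Pokemon HeartGold/Pokemon SoulSilver;DS;2009;Action;Nintendo;4.4;2.77;3.96;0.77",
    "High Heat Major League Baseball 2003;PS2;2002;Sports;3DO;0.18;0.14;0;0.05",
    "Panzer Dragoon;SAT;1995;Shooter;Sega;0;0;0.37;0",
    "Corvette;GBA;2003;Racing;TDK Mediactive;0.2;0.07;0;0.01"
  )

  // Genre is column 3; NA_Sales, EU_Sales and JP_Sales are columns 5, 6 and 7.
  // Other_Sales (column 8) is left out, following the question's definition of global sales.
  def totalsByGenre(rows: Seq[String]): Map[String, Double] =
    rows
      .map(_.split(";"))
      .map(f => f(3) -> (f(5).toDouble + f(6).toDouble + f(7).toDouble))
      .groupBy(_._1)
      .map { case (genre, pairs) => genre -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val totals = totalsByGenre(lines)
    val (topGenre, topSales) = totals.maxBy(_._2)
    val (lowGenre, lowSales) = totals.minBy(_._2)
    // Format to two decimals to hide floating-point noise in the sums.
    println(f"Highest selling genre: $topGenre Global sales: $topSales%.2f")
    println(f"Lowest selling genre: $lowGenre Global sales: $lowSales%.2f")
  }
}
```

On the sample rows this should agree with the DataFrame result above (Shooter highest, Sports lowest). The same split, pair, and sum-by-key shape carries over to an RDD, for example with `sc.textFile`, `map`, and `reduceByKey`.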

【Discussion】: