【Title】: Spark Scala Data Frame to have multiple aggregation of single Group By [duplicate]
【Posted】: 2019-06-17 17:34:10
【Question】:

I need a Spark Scala DataFrame with multiple aggregations on a single groupBy. For example:

val groupped = df.groupBy("firstName", "lastName").sum("Amount").toDF()

But what if I need Count, Sum, Max, and so on?

/* The below does not work, but this is the intention:
val groupped = df.groupBy("firstName", "lastName").sum("Amount").count().toDF()
*/

Output of groupped.show():

-------------------------------------------------
| firstName | lastName | Amount | count | Max | Min |
-------------------------------------------------

【Comments】:

  • // Compute the max age and average salary, grouped by department and gender: ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" )). See the groupBy examples in the docs: spark.apache.org/docs/2.3.0/api/scala/…
  • @user10958683 True.. it is a duplicate, but Zaks's answer is more readable
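The Map-based agg style mentioned in the comment above can be sketched as follows. This is a minimal runnable sketch; the local SparkSession setup, the sample rows, and the column names are my assumptions, not from the original thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("map-agg").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the columns the comment refers to.
val ds = Seq(
  ("IT", "M", 3000.0, 30),
  ("IT", "F", 4000.0, 40)
).toDF("department", "gender", "salary", "age")

// Map-style agg: each entry maps a column name to an aggregate function name.
// The resulting columns are named "avg(salary)" and "max(age)".
val result = ds.groupBy($"department", $"gender")
  .agg(Map("salary" -> "avg", "age" -> "max"))

result.show()
```

This style is concise, but it does not let you alias the output columns; the `agg(sum(...).alias(...), ...)` form in the answer below gives more control over column names.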

Tags: scala apache-spark apache-spark-sql


【Solution 1】:
import org.apache.spark.sql.functions._
import spark.implicits._  // required for .toDF on a local Seq

case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF

val groupped = df.groupBy("firstName", "lastName").agg(
     sum("Amount"),
     mean("Amount"),
     stddev("Amount"),
     count(lit(1)).alias("numOfRecords")
   )

groupped.show()  // or display(groupped) on Databricks
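To produce the exact columns in the question's desired output (Amount, count, Max, Min), the same single agg call can carry max and min with aliases. A minimal sketch; the local SparkSession setup and sample rows are my assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("agg-example").getOrCreate()
import spark.implicits._

val df = Seq(("me", "zack", 100), ("me", "zack", 50))
  .toDF("firstName", "lastName", "Amount")

// One groupBy, several aggregations; aliases give the exact column names wanted.
val groupped = df.groupBy("firstName", "lastName").agg(
  sum("Amount").alias("Amount"),
  count(lit(1)).alias("count"),
  max("Amount").alias("Max"),
  min("Amount").alias("Min")
)

groupped.show()
```

For the two sample rows above, the single output row has Amount = 150, count = 2, Max = 100, Min = 50.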

【Discussion】:
