【Question title】: Spark - Select sum and all columns of joined dataset
【Posted】: 2018-10-05 14:58:40
【Question】:

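I have two tables, Employees(Id, Name) and EmployeeSalary(EmployeeId, Designation, Salary). An employee can hold several designations in the company and therefore have several salaries. How do I get the EmployeeId, the Name, the sum of salaries, and a Seq of all designations?

What I have tried so far is: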

    employeeDS
      .join(employeeSalaryDS, employeeDS.col("Id").equalTo(employeeSalaryDS.col("EmployeeId")), "left_outer")
      .groupBy(employeeDS.col("Id"))
      .agg(sum("Salary") as "Sum of salaries")

【Question comments】:

    Tags: scala apache-spark aggregate-functions


    【Answer 1】:

    Something like this:

    scala> val dfe = Seq((101,"John"),(102,"Mike"), (103,"Paul"), (104,"Tom")).toDF("id","name")
    dfe: org.apache.spark.sql.DataFrame = [id: int, name: string]
    
    scala> val dfes = Seq((101,"Dev", 4000),(102,"Designer", 4000),(102,"Architect", 5000), (103,"Designer",6000), (104,"Consultant",8000), (104,"Supervisor",9000), (104,"PM",10000) ).toDF("id","desig","salary")
    dfes: org.apache.spark.sql.DataFrame = [id: int, desig: string ... 1 more field]
    
    scala> dfe.join(dfes, dfe.col("id").equalTo(dfes.col("id")),"left_outer").groupBy(dfe.col("Id")).agg(sum("Salary") as "Sum of salaries", collect_list('desig as "desig_list")).show(false)
    +---+---------------+-----------------------------------+
    |Id |Sum of salaries|collect_list(desig AS `desig_list`)|
    +---+---------------+-----------------------------------+
    |101|4000           |[Dev]                              |
    |103|6000           |[Designer]                         |
    |102|9000           |[Architect, Designer]              |
    |104|27000          |[PM, Supervisor, Consultant]       |
    +---+---------------+-----------------------------------+
    
    
    scala>
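In the transcript above, the alias is applied inside `collect_list(...)`, so the column header becomes the whole expression rather than `desig_list`, and `name` is dropped because only `Id` is grouped on. A minimal sketch of one way to get the id, name, salary sum, and designation list in a single row per employee, assuming the same sample data and a local `SparkSession` (the column names `sum_of_salaries` and `desig_list` are illustrative choices, not from the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sum}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("salary-agg-sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the answer above.
val dfe = Seq((101, "John"), (102, "Mike"), (103, "Paul"), (104, "Tom")).toDF("id", "name")
val dfes = Seq(
  (101, "Dev", 4000), (102, "Designer", 4000), (102, "Architect", 5000),
  (103, "Designer", 6000), (104, "Consultant", 8000),
  (104, "Supervisor", 9000), (104, "PM", 10000)
).toDF("id", "desig", "salary")

// Joining on Seq("id") keeps a single id column; grouping by id and name keeps the name.
// The aliases are applied outside the aggregate functions.
val result = dfe.join(dfes, Seq("id"), "left_outer").groupBy("id", "name").agg(
  sum("salary").as("sum_of_salaries"),
  collect_list("desig").as("desig_list")
)

result.orderBy("id").show(false)
```

Because the join is `left_outer`, an employee with no salary rows would still appear, with a null sum and an empty list. The order of elements inside `desig_list` is not guaranteed.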
    

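    【Comments】:

    • This works. What if I want to select the entire EmployeeSalary table?
    • I didn't follow. The join above already uses the entire salary table, without any filter.
    • collect_list only selects one column here. What changes are needed to select the whole EmployeeSalary table (including designation and salary)?
    • In that case you need to store the result above in a separate ds and join it with the salary table again. Use a window function to get the result.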
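The window-function suggestion in the last comment could be sketched as follows. This is only one possible reading of it, assuming the table and column names from the question, made-up sample rows taken from the answer, and a hypothetical output column `TotalSalary`. Every EmployeeSalary row is kept, and the per-employee total is attached to each row without a second join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("salary-window-sketch").getOrCreate()
import spark.implicits._

// Column names follow the question; the rows are sample data from the answer.
val employees = Seq((101, "John"), (102, "Mike"), (103, "Paul"), (104, "Tom")).toDF("Id", "Name")
val salaries = Seq(
  (101, "Dev", 4000), (102, "Designer", 4000), (102, "Architect", 5000),
  (103, "Designer", 6000), (104, "Consultant", 8000),
  (104, "Supervisor", 9000), (104, "PM", 10000)
).toDF("EmployeeId", "Designation", "Salary")

// One window partition per employee; sum(...).over(...) adds the total to every row
// instead of collapsing the rows the way groupBy does.
val byEmployee = Window.partitionBy("Id")

val withTotals = employees.join(salaries, col("Id") === col("EmployeeId"), "left_outer").withColumn(
  "TotalSalary", sum("Salary").over(byEmployee)
)

withTotals.orderBy("Id", "Designation").show(false)
```

If one row per employee is preferred instead, collecting whole rows with `collect_list(struct(col("Designation"), col("Salary")))` inside a `groupBy` is another option worth trying.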