【Title】: Cannot resolve symbol .withColumn in Spark 2.4
【Posted】: 2020-08-04 23:42:38
【Question】:

Spark: 2.4

The DataFrame contains the average login hours for each employee:

AverageLoginHour|employee
3.392265193     |emp_1
2.833333333     |emp_2
5.638888889     |emp_3
6.909090909     |emp_4
7.361445783     |emp_5

Code:

tds.select("Employee","AverageLoginHour")
    (count("AverageLoginHour").alias("logincnt"))
    (sum("AverageLoginHour").alias("loginsum"))
      .withColumn("TotalEmployeeavg",col("loginsum")/col("logincnt")*100)

Error: Cannot resolve symbol .withcolumn

Expected output:

AverageLoginHour|employee|Totalavg|Remarks
3.392265193     |Emp_1   |5.2     |Below Avg
2.833333333     |Emp_2   |5.2     |Below Avg
5.638888889     |Emp_3   |5.2     |Above Avg
6.909090909     |Emp_4   |5.2     |Above Avg
7.361445783     |Emp_5   |5.2     |Above Avg

If an employee's AverageLoginHour is less than Totalavg, then withColumn should set Remarks to "Below Avg", else "Above Avg".

Please share your suggestions.

【Comments】:

    Tags: apache-spark apache-spark-sql


    【Solution 1】:

    The snippet in the question fails because the parenthesized aggregate expressions chained after `select` are not valid method calls, so the resulting expression is not a `DataFrame` and `.withColumn` cannot be resolved on it. For this use case, apply the built-in `avg` function with a window clause instead:

    Example:

    df.show()
    //+----------------+--------+
    //|AverageLoginHour|employee|
    //+----------------+--------+
    //|     3.392265193|   emp_1|
    //|     2.833333333|   emp_2|
    //|     5.638888889|   emp_3|
    //|     6.909090909|   emp_4|
    //|     7.361445783|   emp_5|
    //+----------------+--------+
    
    
    df.withColumn("Totalavg",avg(col("AverageLoginHour")).over()).
    withColumn("Remarks",when(col("Totalavg") > col("AverageLoginHour"),lit("Below Avg")).otherwise(lit("Above Avg"))).
    show()
    
    //+----------------+--------+------------+---------+
    //|AverageLoginHour|employee|    Totalavg|  Remarks|
    //+----------------+--------+------------+---------+
    //|     3.392265193|   emp_1|5.2270048214|Below Avg|
    //|     2.833333333|   emp_2|5.2270048214|Below Avg|
    //|     5.638888889|   emp_3|5.2270048214|Above Avg|
    //|     6.909090909|   emp_4|5.2270048214|Above Avg|
    //|     7.361445783|   emp_5|5.2270048214|Above Avg|
    //+----------------+--------+------------+---------+
    
    //rounding to 1
    df.withColumn("Totalavg",round(avg(col("AverageLoginHour")).over(),1)).withColumn("Remarks",when(col("Totalavg") > col("AverageLoginHour"),lit("Below Avg")).otherwise(lit("Above Avg"))).show()
    //+----------------+--------+--------+---------+
    //|AverageLoginHour|employee|Totalavg|  Remarks|
    //+----------------+--------+--------+---------+
    //|     3.392265193|   emp_1|     5.2|Below Avg|
    //|     2.833333333|   emp_2|     5.2|Below Avg|
    //|     5.638888889|   emp_3|     5.2|Above Avg|
    //|     6.909090909|   emp_4|     5.2|Above Avg|
    //|     7.361445783|   emp_5|     5.2|Above Avg|
    //+----------------+--------+--------+---------+
    

    Another approach avoids window functions entirely and uses crossJoin to attach the overall average to every row:

    Example:

    val df1=df.selectExpr("avg(AverageLoginHour) as Totalavg")
    df.crossJoin(df1).
    withColumn("Remarks",when(col("Totalavg") > col("AverageLoginHour"),lit("Below Avg")).otherwise(lit("Above Avg"))).
    show()
    //+----------------+--------+------------+---------+
    //|AverageLoginHour|employee|    Totalavg|  Remarks|
    //+----------------+--------+------------+---------+
    //|     3.392265193|   emp_1|5.2270048214|Below Avg|
    //|     2.833333333|   emp_2|5.2270048214|Below Avg|
    //|     5.638888889|   emp_3|5.2270048214|Above Avg|
    //|     6.909090909|   emp_4|5.2270048214|Above Avg|
    //|     7.361445783|   emp_5|5.2270048214|Above Avg|
    //+----------------+--------+------------+---------+
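    As a sanity check, the grand average and the Below/Above Avg labels produced above can be reproduced outside Spark. This is a minimal plain-Python sketch of the same logic, using the sample values from the question:

    ```python
    # Compute the overall average login hours, then label each employee
    # "Below Avg" or "Above Avg" relative to that average.
    hours = {
        "emp_1": 3.392265193,
        "emp_2": 2.833333333,
        "emp_3": 5.638888889,
        "emp_4": 6.909090909,
        "emp_5": 7.361445783,
    }

    # Equivalent of avg(col("AverageLoginHour")).over() on an empty window:
    # one average computed over all rows.
    total_avg = sum(hours.values()) / len(hours)

    # Equivalent of the when/otherwise expression in the Spark code.
    remarks = {
        emp: "Below Avg" if h < total_avg else "Above Avg"
        for emp, h in hours.items()
    }

    print(round(total_avg, 10))  # 5.2270048214, matching the Spark output
    print(remarks)
    ```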
    

    【Discussion】:
