[Title]: Get records based on column max value - in PySpark
[Posted]: 2021-07-02 03:19:04
[Question]:

I have a cars table with the following data:

country  car       price
Germany  Mercedes  30000
Germany  BMW       20000
Germany  Opel      15000
Japan    Honda     20000
Japan    Toyota    15000

I need to get the country, car, and price from the table, taking the row with the maximum price per country:

country  car       price
Germany  Mercedes  30000
Japan    Honda     20000

I have seen a similar question, but its solutions are in SQL; I want the DSL form for a PySpark DataFrame (link just in case: Get records based on column max value).

[Comments]:

    Tags: pyspark bigdata


    [Solution 1]:

    You need row_number and filter to achieve the result, as shown below:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, desc

    df = spark.createDataFrame(
        [
            ("Germany", "Mercedes", 30000),
            ("Germany", "BMW", 20000),
            ("Germany", "Opel", 15000),
            ("Japan", "Honda", 20000),
            ("Japan", "Toyota", 15000),
        ],
        ("country", "car", "price"),
    )

    # Rank the cars within each country by descending price
    df1 = df.withColumn(
        "row_num",
        row_number().over(Window.partitionBy("country").orderBy(desc("price"))),
    )

    # Keep only the top-ranked (most expensive) car per country
    df2 = df1.filter(df1.row_num == 1).drop("row_num")
    

    [Comments]:
