【问题标题】:Pyspark keep only most recent timestamps that meet conditionPyspark keep only most recent timestamps that meet condition
【发布时间】:2022-12-02 06:06:31
【问题描述】:

I have the following dataset:

id  col1  timestamp
1    a    01.01.2022 9:00:00
1    b    01.01.2022 9:01:00
1    c    01.01.2022 9:02:00
1    a    01.01.2022 10:00:00
1    b    01.01.2022 10:01:00
1    d    01.01.2022 10:02:00
2    a    01.01.2022 12:00:00 
2    b    01.01.2022 12:01:00
2    a    01.01.2022 13:00:00
2    c    01.01.2022 13:01:00

What i would like to do is to keep all the timestamps after the last occurrence of 'a' for each id. This is what the dataset will look like:

id  col1  timestamp
1    a    01.01.2022 10:00:00
1    b    01.01.2022 10:01:00
1    d    01.01.2022 10:02:00
2    a    01.01.2022 13:00:00
2    c    01.01.2022 13:01:00

It is important to identify 'a' as the starting point. Any idea on how can I do it? I was thinking of using groupby and take the max timestamp but it only seems to work in some specific situations

【问题讨论】:

    标签: python pyspark group-by


    【解决方案1】:
    spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
    
      w = Window.partitionBy('id')
    ( #column cum_a =1 when col1=a else cum_a=0. Once populated, calculate the cumulative sum of cum_a for every id ordered by timestamp
      df.withColumn('cum_a', sum(when(col('col1')=='a',1).otherwise(0)).over(w.orderBy(to_timestamp('timestamp','dd.MM.yyyy HH:mm:ss'))))
      #Find the maximum cum_a value per id 
     .withColumn('max', max('cum_a').over(w))
      #Filter out where cum_a equals to max value of cum_a per id 
      .where(((col('cum_a')==col('max'))))
      #Drop unwamted intermediatary columns
      .drop('cum_a','max')
    ).show()
    
    
    +---+----+-------------------+
    | id|col1|          timestamp|
    +---+----+-------------------+
    |  1|   a|01.01.2022 10:00:00|
    |  1|   b|01.01.2022 10:01:00|
    |  1|   d|01.01.2022 10:02:00|
    |  2|   a|01.01.2022 13:00:00|
    |  2|   c|01.01.2022 13:01:00|
    +---+----+-------------------+
    

    【讨论】:

      猜你喜欢
      • 2022-12-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2022-12-02
      • 2022-12-19
      • 2022-12-02
      • 2022-12-07
      相关资源
      最近更新 更多