【Question title】:How can two DataFrames get the same groupBy with a window function?
【Posted】:2019-03-14 02:21:41
【Question description】:

I am using PySpark's DataFrame API to analyze data coming from Apache Kafka. I have run into a problem and would appreciate some help.

    from pyspark.sql import functions

    # selected_df is a DataFrame read from Kafka via spark.readStream.format("kafka")...

    windowed_group_1 = selected_df.withWatermark("kafka_time", "10 minutes").groupBy(functions.window("kafka_time", "10 seconds", "5 seconds"))

    windowed_group_2 = selected_df.withWatermark("kafka_time", "10 minutes").groupBy(functions.window("kafka_time", "10 seconds", "5 seconds"))

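Do these two groupBy calls use the same window? Both are created with identical options.

If not, what should I do so that they do?

windowed_group_1 == windowed_group_2

Thanks in advance for your help.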

【Comments】:

    Tags: python dataframe pyspark


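    【Solution 1】:

    This may be what I was looking for: whenever a time window is used, the window function by default aligns its windows to 1970-01-01T00:00:00 as the reference point, so identical window arguments always produce identical window boundaries.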

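    from pyspark.sql import functions as func

    # show() prints the rows and returns None, so keep the aggregated
    # DataFrame in the variable and call show() on it separately.
    a = labeled_df.groupBy(func.window("timestamp", "60 minute"), "proto").count()
    a.show(100, truncate=False)

    b = labeled_df.groupBy(func.window("timestamp", "60 minute"), "proto").count()
    b.show(100, truncate=False)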
    

    The results for a and b are identical:

    a
    +------------------------------------------+---------+-----+
    |window                                    |proto    |count|
    +------------------------------------------+---------+-----+
    |[2010-06-13 08:00:00, 2010-06-13 09:00:00]|UDP      |1803 |
    |[2010-06-13 02:00:00, 2010-06-13 03:00:00]|TCP      |22579|
    |[2010-06-13 09:00:00, 2010-06-13 10:00:00]|TCP      |2637 |
    |[2010-06-13 02:00:00, 2010-06-13 03:00:00]|IPv6-ICMP|453  |
    |[2010-06-13 02:00:00, 2010-06-13 03:00:00]|UDP      |1183 |
    |[2010-06-13 03:00:00, 2010-06-13 04:00:00]|UDP      |1467 |
    
    
    b
    +------------------------------------------+---------+-----+
    |window                                    |proto    |count|
    +------------------------------------------+---------+-----+
    |[2010-06-13 08:00:00, 2010-06-13 09:00:00]|UDP      |1803 |
    |[2010-06-13 02:00:00, 2010-06-13 03:00:00]|TCP      |22579|
    |[2010-06-13 09:00:00, 2010-06-13 10:00:00]|TCP      |2637 |
    |[2010-06-13 02:00:00, 2010-06-13 03:00:00]|IPv6-ICMP|453  |
    |[2010-06-13 02:00:00, 2010-06-13 03:00:00]|UDP      |1183 |
    |[2010-06-13 03:00:00, 2010-06-13 04:00:00]|UDP      |1467 |
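To illustrate why both queries land in the same buckets, here is a minimal pure-Python sketch of epoch-aligned windowing. It is not Spark's actual implementation; the helper name `windows_for`, the use of UTC, and the half-open `[start, end)` convention are assumptions made for this example. It only shows that the window boundaries depend on nothing but the timestamp, the window duration, and the slide, so two groupings built with the same arguments see the same windows.

```python
from datetime import datetime, timedelta, timezone

# Reference point for window alignment (assumed to be the Unix epoch in UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def windows_for(ts, window, slide=None):
    """Return every [start, end) window containing ts, aligned to EPOCH.

    Hypothetical helper for illustration only. `slide` defaults to `window`,
    which gives non-overlapping (tumbling) windows.
    """
    slide = slide or window
    # Latest window start at or before ts on the slide grid anchored at EPOCH.
    start = ts - (ts - EPOCH) % slide
    result = []
    # Walk backwards over earlier starts while the window still covers ts.
    while start + window > ts:
        result.append((start, start + window))
        start -= slide
    return sorted(result)


ts = datetime(2010, 6, 13, 8, 17, 42, tzinfo=timezone.utc)

# Tumbling 60-minute window, as in the answer: one bucket per timestamp.
hourly_a = windows_for(ts, timedelta(minutes=60))
hourly_b = windows_for(ts, timedelta(minutes=60))
print(hourly_a == hourly_b)

# Sliding 10-second window with a 5-second slide, as in the question:
# each timestamp falls into two overlapping buckets.
for start, end in windows_for(ts, timedelta(seconds=10), timedelta(seconds=5)):
    print(start.time(), end.time())
```

Note that comparing the grouped objects themselves with `==` in Python says nothing about the windows; what matters is that the window arguments (column, duration, slide) are the same in both calls.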
    

    【Comments】:
