【Question Title】: Fetch week start date and week end date from Date
【Posted】: 2021-10-08 20:58:11
【Question Description】:

Considering a week that starts on Sunday and ends on Saturday, I need to get the week start date and week end date from a given date.

I referred to this post, but it takes Monday as the start of the week. Is there any built-in function in Spark that solves this?
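For reference, the desired Sunday-to-Saturday behavior can be sketched outside Spark with plain Python `datetime` (a minimal illustration of the requirement, not a Spark solution):

```python
from datetime import date, timedelta

def week_bounds(d: date) -> tuple:
    # Python's weekday(): Monday=0 ... Sunday=6.
    # Shift so that Sunday maps to offset 0 and Saturday to offset 6.
    offset = (d.weekday() + 1) % 7
    start = d - timedelta(days=offset)   # the Sunday of d's week
    end = start + timedelta(days=6)      # the following Saturday
    return start, end

# 2020-07-13 is a Monday -> week runs 2020-07-12 (Sun) to 2020-07-18 (Sat)
print(week_bounds(date(2020, 7, 13)))
```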

【Question Discussion】:

    Tags: pyspark apache-spark-sql


    【Solution 1】:

    Find the day of the week, then derive the new columns with selectExpr, taking Sunday as the week start date:

    from pyspark.sql import functions as F
    
    
    df_b = spark.createDataFrame([('1','2020-07-13')],[ "ID","date"])
    df_b = df_b.withColumn('day_of_week', F.dayofweek(F.col('date')))
    df_b = df_b.selectExpr('*', 'date_sub(date, day_of_week-1) as week_start')
    df_b = df_b.selectExpr('*', 'date_add(date, 7-day_of_week) as week_end')
    
    df_b.show()
    
    +---+----------+-----------+----------+----------+
    | ID|      date|day_of_week|week_start|  week_end|
    +---+----------+-----------+----------+----------+
    |  1|2020-07-13|          2|2020-07-12|2020-07-18|
    +---+----------+-----------+----------+----------+
    

    Spark SQL update

    First create a temporary view from the dataframe:

    df_a.createOrReplaceTempView("df_a_sql")
    

    Then run the query:

    %sql
    select *, date_sub(date, dayofweek-1) as week_start,
    date_add(date, 7-dayofweek) as week_end
    from
    (select *, dayofweek(date) as dayofweek
    from df_a_sql) T
    

    Output

    +---+----------+---------+----------+----------+
    | ID|      date|dayofweek|week_start|  week_end|
    +---+----------+---------+----------+----------+
    |  1|2020-07-13|        2|2020-07-12|2020-07-18|
    +---+----------+---------+----------+----------+
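The arithmetic above relies on Spark's `dayofweek` convention (1 = Sunday ... 7 = Saturday). A quick pure-Python check of the same `date_sub`/`date_add` offsets, assuming that convention:

```python
from datetime import date, timedelta

def spark_dayofweek(d: date) -> int:
    # Spark's dayofweek: 1 = Sunday, ..., 7 = Saturday
    # Python's isoweekday: 1 = Monday, ..., 7 = Sunday
    return d.isoweekday() % 7 + 1

d = date(2020, 7, 13)                       # a Monday
dow = spark_dayofweek(d)                    # 2
week_start = d - timedelta(days=dow - 1)    # date_sub(date, dayofweek-1)
week_end = d + timedelta(days=7 - dow)      # date_add(date, 7-dayofweek)
print(dow, week_start, week_end)            # 2 2020-07-12 2020-07-18
```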
    

    【Discussion】:

    • Can this be done in SQL?
    • Are you using spark-sql? Then yes, we can. Add some more information - e.g. what IDE/language are you using?
    • Yes, I am using Spark SQL. But how does the language come into play here? It is just a SQL query.
    • Sorry for the late reply; I have updated my answer with the spark-sql code, please check. This is a normal MySQL function as well, so the same logic should work in other SQL dialects.
    【Solution 2】:

    Perhaps this helps -

    Load the test data

       val df = spark.sql("select cast('2020-07-12' as date) as date")
        df.show(false)
        df.printSchema()
    
        /**
          * +----------+
          * |date      |
          * +----------+
          * |2020-07-12|
          * +----------+
          *
          * root
          * |-- date: date (nullable = true)
          */
    

    Week starting on Sunday and ending on Saturday

    
        // week starting from SUNDAY and ending SATURDAY
        df.withColumn("week_end", next_day($"date", "SAT"))
          .withColumn("week_start", date_sub($"week_end", 6))
          .show(false)
    
        /**
          * +----------+----------+----------+
          * |date      |week_end  |week_start|
          * +----------+----------+----------+
          * |2020-07-12|2020-07-18|2020-07-12|
          * +----------+----------+----------+
          */
    

    Week starting on Monday and ending on Sunday

    
        // week starting from MONDAY and ending SUNDAY
        df.withColumn("week_end", next_day($"date", "SUN"))
          .withColumn("week_start", date_sub($"week_end", 6))
          .show(false)
    
        /**
          * +----------+----------+----------+
          * |date      |week_end  |week_start|
          * +----------+----------+----------+
          * |2020-07-12|2020-07-19|2020-07-13|
          * +----------+----------+----------+
          */
    

    Week starting on Tuesday and ending on Monday

        // week starting from TUESDAY and ending MONDAY
        df.withColumn("week_end", next_day($"date", "MON"))
          .withColumn("week_start", date_sub($"week_end", 6))
          .show(false)
    
        /**
          * +----------+----------+----------+
          * |date      |week_end  |week_start|
          * +----------+----------+----------+
          * |2020-07-12|2020-07-13|2020-07-07|
          * +----------+----------+----------+
          */
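Note the corner case raised in the discussion below: Spark's `next_day` returns the first matching day strictly *after* the input date, so when the input already falls on the target day (e.g. Saturday 2016-06-25 with target SAT) the computed week_end jumps a week ahead. A pure-Python sketch of that behavior and one possible fix (shift the input back one day before calling `next_day`):

```python
from datetime import date, timedelta

SAT = 5  # Python's weekday(): Monday=0 ... Saturday=5, Sunday=6

def next_day_strict(d: date, target: int) -> date:
    # Mimics Spark's next_day: first matching weekday strictly after d
    delta = (target - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=delta)

def week_end_fixed(d: date, target: int) -> date:
    # Shift back one day so a date already on the target day maps to itself
    return next_day_strict(d - timedelta(days=1), target)

print(next_day_strict(date(2016, 6, 25), SAT))  # 2016-07-02 (wrong week)
print(week_end_fixed(date(2016, 6, 25), SAT))   # 2016-06-25 (correct)
```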
    

    【Discussion】:

    • Something seems wrong. If the date is 2020-07-12, then week_start should be 2020-07-12 and week_end should be 2020-07-18, but I get 2020-07-05 and 2020-07-11.
    • Had not tested these corner cases. Thanks for pointing it out.
    • @ben, please check the update; I think this is more generic and can be used for weeks starting on any DAY
    • @downvoters, could you please check the update. Let me know if it doesn't work for you.
    • Your answer still doesn't work. Try the date "2016-06-25".
    【Solution 3】:

    Find the week start and end dates in a PySpark dataframe, with Monday as the first day of the week.

    from pyspark.sql.functions import col, dayofweek, expr, when

    def add_start_end_week(dataframe, timestamp_col, StartDate, EndDate):
        """
        Function:
            Get the start date and the end date of the week
        Args:
            dataframe: Spark dataframe
            timestamp_col: timestamp column from which to calculate the start and end dates
            StartDate: column name for the week start date
            EndDate: column name for the week end date
        """
        dataframe = dataframe.withColumn(
            'day_of_week', dayofweek(col(timestamp_col)))
        # Start of the week (Monday as first day)
        dataframe = dataframe.withColumn(StartDate, when(col("day_of_week") > 1,
                                         expr("date_add(date_sub({}, day_of_week-1), 1)".format(timestamp_col))).
                                         otherwise(expr("date_sub({}, 6)".format(timestamp_col))))
        # End of the week
        dataframe = dataframe.withColumn(EndDate, when(col("day_of_week") > 1,
                                         expr("date_add(date_add({}, 7-day_of_week), 1)".format(timestamp_col))).
                                         otherwise(col(timestamp_col)))

        return dataframe
    

    Validating the above function:

    from pyspark.sql.functions import col, dayofweek, expr, when

    df = spark.createDataFrame([('2021-09-26',),('2021-09-25',),('2021-09-24',),('2021-09-23',),('2021-09-22',),('2021-09-21',),('2021-09-20',)], ['dt'])
    dataframe = df.withColumn('day_of_week', dayofweek(col('dt')))
    # Start of the week (Monday as first day)
    dataframe = dataframe.withColumn('StartDate',when(col("day_of_week")>1,expr("date_add(date_sub(dt,day_of_week-1),1)")).otherwise(expr("date_sub(dt,6)")))
    # End of the week
    dataframe = dataframe.withColumn('EndDate',when(col("day_of_week")>1,expr("date_add(date_add(dt,7-day_of_week),1)")).otherwise(col("dt")))
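As a sanity check outside Spark, the same Monday-first logic can be replicated with plain `datetime` (every date from 2021-09-20 through 2021-09-26 should map to that same Monday-to-Sunday week):

```python
from datetime import date, timedelta

def monday_week_bounds(d: date) -> tuple:
    # Python's weekday(): Monday=0, so subtracting it lands on Monday
    start = d - timedelta(days=d.weekday())
    return start, start + timedelta(days=6)  # Monday through Sunday

for day in range(20, 27):
    d = date(2021, 9, day)
    print(d, *monday_week_bounds(d))  # all rows: 2021-09-20 .. 2021-09-26
```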
    

    【Discussion】: