【Posted】: 2020-10-02 06:27:27
【Problem description】:
I have a dataframe containing start_date, end_date, and sales_target. I added code to work out the number of quarters between the two dates, so that I can use a UDF to split sales_target across that many quarters.
df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])
+----------+----------+------------+
|start_date| end_date |sales_target|
+----------+----------+------------+
|2020-01-01|2020-12-31| 15|
|2020-04-01|2020-12-31| 11|
|2020-07-01|2020-12-31| 3|
+----------+----------+------------+
Below is the dataframe after computing the number of quarters and splitting sales_target with the UDF:
spark.sql('select *, round(months_between(end_date, start_date)/3) as num_qtrs from df_temp').createOrReplaceTempView("df_temp")
spark.sql("select *, st_udf(cast(sales_target as integer), cast(num_qtrs as integer)) as sales_target_n from df_temp").createOrReplaceTempView("df_temp")
+----------+----------+--------+---------------+
|start_date| end_date |num_qtrs|sales_target_n |
+----------+----------+--------+---------------+
|2020-01-01|2020-12-31| 4| [4,4,4,3] |
|2020-04-01|2020-12-31| 3| [4,4,3] |
|2020-07-01|2020-12-31| 2| [2,1] |
+----------+----------+--------+---------------+
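For reference, a minimal sketch of what the splitting logic inside st_udf might look like; the function name and exact splitting rule are assumptions inferred from the sample output above (larger shares first):

```python
def split_target(total, num_qtrs):
    """Split an integer target into num_qtrs parts, as evenly as possible,
    with the larger parts first (e.g. 15 over 4 quarters -> [4, 4, 4, 3])."""
    base, rem = divmod(total, num_qtrs)
    # 'rem' quarters get one extra unit; the rest get the base share
    return [base + 1] * rem + [base] * (num_qtrs - rem)
```

In Spark this would be registered with something like spark.udf.register("st_udf", split_target, ArrayType(IntegerType())) before the SQL above is run.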
After exploding sales_target_n, I am able to get the following result:
+----------+----------+--------+-------------+---------------+------------------+
|start_date| end_date |num_qtrs|sales_target |sales_target_n | sales_target_new |
+----------+----------+--------+-------------+---------------+------------------+
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 3 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 4 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 4 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 3 |
|2020-07-01|2020-12-31| 2| 3 | [2,1] | 2 |
|2020-07-01|2020-12-31| 2| 3 | [2,1] | 1 |
+----------+----------+--------+-------------+---------------+------------------+
I need help adding a different start/end date to each exploded row based on the num_qtrs value. I need to end up with a dataframe like the one below.
+----------+----------+--------+-------------+------------------+--------------+--------------+
|start_date| end_date |num_qtrs|sales_target | sales_target_new |new_start_date| new_end_date |
+----------+----------+--------+-------------+------------------+--------------+--------------+
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 4 |2020-01-01 |2020-03-31 |
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 4 |2020-04-01 |2020-06-30 |
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 4 |2020-07-01 |2020-09-30 |
|2020-01-01|2020-12-31| 4| [4,4,4,3] | 3 |2020-10-01 |2020-12-31 |
|2020-04-01|2020-12-31| 3| [4,4,3] | 4 |2020-04-01 |2020-06-30 |
|2020-04-01|2020-12-31| 3| [4,4,3] | 4 |2020-07-01 |2020-09-30 |
|2020-04-01|2020-12-31| 3| [4,4,3] | 3 |2020-10-01 |2020-12-31 |
|2020-07-01|2020-12-31| 2| [2,1] | 2 |2020-07-01 |2020-09-30 |
|2020-07-01|2020-12-31| 2| [2,1] | 1 |2020-10-01 |2020-12-31 |
+----------+----------+--------+-------------+------------------+--------------+--------------+
Can someone help me with a pyspark code example to achieve the expected result above?
【Discussion】:
-
What is your Spark version?
-
Yes, it's Databricks Runtime 6.3 with Apache Spark 2.4.4 and Python 3.7.3. Thanks.
Tags: python sql pyspark apache-spark-sql mysql-python