[Question Title]: pyspark - read df by row to search in another df
[Posted]: 2021-12-26 14:11:06
[Question]:

I am new to pyspark and need help searching in a df.
I have df1 with student data as follows:

+---------+----------+--------------------+
|studentid|   course |  registration_date |
+---------+----------+--------------------+
|      348|         2|     15-11-2021     |
|      567|         1|     05-11-2021     |
|      595|         3|     15-10-2021     |
|      580|         2|     06-11-2021     |
|      448|         4|     15-09-2021     |
+---------+----------+--------------------+

df2 has information about the registration periods as follows:

+--------+------------+------------+
| period | start_date |  end_date  |
+--------+------------+------------+
|       1| 01-09-2021 | 15-09-2021 |
|       2| 16-09-2021 | 30-09-2021 |
|       3| 01-10-2021 | 15-10-2021 |
|       4| 16-10-2021 | 31-10-2021 |
|       5| 01-11-2021 | 15-11-2021 |
|       6| 16-11-2021 | 30-11-2021 |
+--------+------------+------------+

I need to iterate over df1 row by row, take each student's registration_date, and look it up in df2 with the condition df2.start_date <= registration_date <= df2.end_date. The result should be a new df as follows:

+---------+----------+--------------------+--------+------------+------------+
|studentid|   course |  registration_date | period | start_date |  end_date  |
+---------+----------+--------------------+--------+------------+------------+
|      348|         2|     15-11-2021     |       5| 01-11-2021 | 15-11-2021 |
|      567|         1|     05-11-2021     |       5| 01-11-2021 | 15-11-2021 |
|      595|         3|     15-10-2021     |       3| 01-10-2021 | 15-10-2021 |
|      580|         2|     06-11-2021     |       5| 01-11-2021 | 15-11-2021 |
|      448|         4|     15-09-2021     |       1| 01-09-2021 | 15-09-2021 |
+---------+----------+--------------------+--------+------------+------------+

[Comments]:

    Tags: python dataframe apache-spark pyspark


    [Solution 1]:

    You can specify the join condition as a compound expression instead of iterating row by row.

    Working example:

    from datetime import datetime
    from pyspark.sql import functions as F


    # df1: student registrations, with dd-MM-yyyy strings parsed into dates
    df = spark.createDataFrame([
        (348, 2, datetime.strptime("15-11-2021", "%d-%m-%Y")),
        (567, 1, datetime.strptime("05-11-2021", "%d-%m-%Y")),
        (595, 3, datetime.strptime("15-10-2021", "%d-%m-%Y")),
        (580, 2, datetime.strptime("06-11-2021", "%d-%m-%Y")),
        (448, 4, datetime.strptime("15-09-2021", "%d-%m-%Y")),]
    , ("studentid", "course", "registration_date",)).withColumn("registration_date", F.to_date(F.col("registration_date")))

    # df2: registration periods, each defined by a [start_date, end_date] interval
    df2 = spark.createDataFrame([
        (1, datetime.strptime("01-09-2021", "%d-%m-%Y"), datetime.strptime("15-09-2021", "%d-%m-%Y")),
        (2, datetime.strptime("16-09-2021", "%d-%m-%Y"), datetime.strptime("30-09-2021", "%d-%m-%Y")),
        (3, datetime.strptime("01-10-2021", "%d-%m-%Y"), datetime.strptime("15-10-2021", "%d-%m-%Y")),
        (4, datetime.strptime("16-10-2021", "%d-%m-%Y"), datetime.strptime("31-10-2021", "%d-%m-%Y")),
        (5, datetime.strptime("01-11-2021", "%d-%m-%Y"), datetime.strptime("15-11-2021", "%d-%m-%Y")),
        (6, datetime.strptime("16-11-2021", "%d-%m-%Y"), datetime.strptime("30-11-2021", "%d-%m-%Y")),]
    , ("period", "start_date", "end_date")).withColumn("start_date", F.to_date(F.col("start_date"))).withColumn("end_date", F.to_date(F.col("end_date")))

    # Range join: match each registration_date to the period interval containing it
    df.join(df2, (df2["start_date"] <= df["registration_date"]) & (df["registration_date"] <= df2["end_date"])).show()


    Output:

    +---------+------+-----------------+------+----------+----------+
    |studentid|course|registration_date|period|start_date|  end_date|
    +---------+------+-----------------+------+----------+----------+
    |      348|     2|       2021-11-15|     5|2021-11-01|2021-11-15|
    |      567|     1|       2021-11-05|     5|2021-11-01|2021-11-15|
    |      595|     3|       2021-10-15|     3|2021-10-01|2021-10-15|
    |      448|     4|       2021-09-15|     1|2021-09-01|2021-09-15|
    |      580|     2|       2021-11-06|     5|2021-11-01|2021-11-15|
    +---------+------+-----------------+------+----------+----------+
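
    For readers new to Spark, the join condition above is just an interval lookup. A minimal plain-Python sketch of the same logic, with no Spark required (the helper name `find_period` is hypothetical, not from the original answer):

```python
from datetime import date

# Same period table as df2, as (period, start_date, end_date) tuples.
periods = [
    (1, date(2021, 9, 1), date(2021, 9, 15)),
    (2, date(2021, 9, 16), date(2021, 9, 30)),
    (3, date(2021, 10, 1), date(2021, 10, 15)),
    (4, date(2021, 10, 16), date(2021, 10, 31)),
    (5, date(2021, 11, 1), date(2021, 11, 15)),
    (6, date(2021, 11, 16), date(2021, 11, 30)),
]

def find_period(reg_date):
    """Return the period id whose [start, end] interval contains reg_date, else None."""
    for period, start, end in periods:
        if start <= reg_date <= end:
            return period
    return None

print(find_period(date(2021, 11, 15)))  # 5
print(find_period(date(2021, 9, 15)))   # 1
```

    In Spark itself, the same condition can also be written a bit more compactly with `Column.between`, e.g. `df.join(df2, df["registration_date"].between(df2["start_date"], df2["end_date"]))`.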
    

    [Discussion]:
