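【Question title】: Left join two pandas dataframes under multiple conditions
【Posted】: 2020-12-18 19:30:56
【Question】:

I have two dataframes. One holds the search queries users made in a web shop (102377 rows); the other holds the clicks users made from those searches (8004 rows).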

queries:
    index   term                         timestamp
    ...
    10      tight                        2018-09-27 20:09:23
    11      differential pressure        2018-09-27 20:09:30
    12      soot pump                    2018-09-27 20:09:32
    13      gas pressure                 2018-09-27 20:09:46
    14      case                         2018-09-27 20:11:29
    15      backpack                     2018-09-27 20:18:35
    ...

clicks:
    index   term             timestamp               artnr
    ...
    245     soot pump        2018-09-27 20:09:25    9150.0
    246     dungarees        2018-09-27 20:10:38    7228.0
    247     db23             2018-09-27 20:10:40    7966.0
    248     db23             2018-09-27 20:10:55    7971.0
    249     sealing blister  2018-09-27 20:12:05    7971.0
    250     backpack         2018-09-27 20:18:40    8739.0
    ...

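What I want to do is join the clicks onto the queries. If queries.term equals clicks.term and the difference clicks.timestamp - queries.timestamp is less than 10 and greater than 0 seconds, the term in the queries dataframe should be replaced with the artnr from the clicks dataframe, so that it looks like this: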

queries:
    index   term                         timestamp
    ...
    10      tight                        2018-09-27 20:09:23
    11      differential pressure        2018-09-27 20:09:30
    12      9150.0                       2018-09-27 20:09:32
    13      gas pressure                 2018-09-27 20:09:46
    14      case                         2018-09-27 20:11:29
    15      8739.0                       2018-09-27 20:18:35
    ...

My first approach was the following:

df_Q['term'] = np.where(((((df_CS.timestamp-df_Q.timestamp).dt.total_seconds() <= 10.0) & 
                       (df_CS.timestamp-df_Q.timestamp).dt.total_seconds() >= 0) & 
                       (df_CS.term.str == df_Q.term.str)), df_CS['artnr'], df_CS['term'])

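But this only produced the following error:

ValueError: operands could not be broadcast together with shapes (102377,) (8004,) (8004,)

Does anyone know how to solve this with a left join or some other approach?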
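
The shapes in the error message hint at the cause: the condition appears to have one entry per query row (102377), while both replacement arguments have one entry per click row (8004), and np.where cannot line arrays of those lengths up element by element. A minimal sketch that reproduces the same kind of failure, using small made-up arrays (query_seconds and click_seconds are hypothetical stand-ins, not the real data):

```python
import numpy as np

# Hypothetical stand-ins for the two frames: 5 "queries" and 3 "clicks".
query_seconds = np.array([0, 7, 9, 23, 126])
click_seconds = np.array([2, 75, 77])

try:
    # The condition has shape (5,), both choices have shape (3,),
    # so NumPy cannot broadcast them to a common shape.
    np.where(query_seconds > 5, click_seconds, click_seconds)
    broadcast_error = None
except ValueError as exc:
    broadcast_error = str(exc)

print(broadcast_error)
```

This is why a row-by-row comparison of the two frames cannot work here: the rows first have to be paired up by a join.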

【Comments】:

  • For soot pump, the click comes 7 seconds before the query. For backpack, however, the click comes 5 seconds after the query. Do you want before, after, or both?

Tags: python pandas dataframe left-join data-mining


【Solution 1】:
import pandas as pd

queries = pd.DataFrame({'term': ['tight', 'differential pressure', 'soot pump', 'gas pressure', 'case', 'backpack'],
                        'timestamp': ['2018-09-27 20:09:23', '2018-09-27 20:09:30', '2018-09-27 20:09:32', '2018-09-27 20:09:46', '2018-09-27 20:11:29', '2018-09-27 20:18:35']})
print(queries)
                    term            timestamp
0                  tight  2018-09-27 20:09:23
1  differential pressure  2018-09-27 20:09:30
2              soot pump  2018-09-27 20:09:32
3           gas pressure  2018-09-27 20:09:46
4                   case  2018-09-27 20:11:29
5               backpack  2018-09-27 20:18:35

clicks = pd.DataFrame({'term': ['soot pump', 'dungarees', 'db23', 'db23', 'sealing blister', 'backpack'],
                       'timestamp': ['2018-09-27 20:09:25', '2018-09-27 20:10:38', '2018-09-27 20:10:40', '2018-09-27 20:10:55', '2018-09-27 20:12:05', '2018-09-27 20:18:40'],
                       'artnr':[9150.0, 7228.0, 7966.0, 7971.0, 7971.0, 8739.0]})
print(clicks)
              term            timestamp   artnr
0        soot pump  2018-09-27 20:09:25  9150.0
1        dungarees  2018-09-27 20:10:38  7228.0
2             db23  2018-09-27 20:10:40  7966.0
3             db23  2018-09-27 20:10:55  7971.0
4  sealing blister  2018-09-27 20:12:05  7971.0
5         backpack  2018-09-27 20:18:40  8739.0

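First, convert the timestamp columns to datetime and sort both dataframes by timestamp (pd.merge_asof() requires sorted keys)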

queries['timestamp'] = pd.to_datetime(queries['timestamp'])
clicks['timestamp'] = pd.to_datetime(clicks['timestamp'])

queries.sort_values('timestamp', ascending=True, inplace=True)
clicks.sort_values('timestamp', ascending=True, inplace=True)

Then use pd.merge_asof() to join on the 'term' column, matching only when the 'timestamp' values are within 10 seconds of each other

df = pd.merge_asof(
     queries, # left data
     clicks, # right data
     on="timestamp", # column used to compare times
     by="term", # column that must match exactly
     tolerance=pd.Timedelta("10s"), # maximum allowed time difference
     direction='forward', # match only clicks at or after the query timestamp
     )
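
The question asks for a difference strictly greater than 0 seconds, whereas merge_asof by default also accepts a click with exactly the same timestamp as the query. If that case should be excluded, the allow_exact_matches parameter can be set to False. A small sketch, assuming two made-up rows (the frames left and right and the article number 1111.0 are hypothetical):

```python
import pandas as pd

# Hypothetical two-row example: the first click has exactly the same
# timestamp as its query, the second click comes 5 seconds after its query.
left = pd.DataFrame({
    "term": ["case", "backpack"],
    "timestamp": pd.to_datetime(["2018-09-27 20:11:29", "2018-09-27 20:18:35"]),
})
right = pd.DataFrame({
    "term": ["case", "backpack"],
    "timestamp": pd.to_datetime(["2018-09-27 20:11:29", "2018-09-27 20:18:40"]),
    "artnr": [1111.0, 8739.0],  # 1111.0 is an invented article number
})

strict = pd.merge_asof(
    left,
    right,
    on="timestamp",
    by="term",
    tolerance=pd.Timedelta("10s"),
    direction="forward",
    allow_exact_matches=False,  # do not match a click at the identical timestamp
)
print(strict)
```

Here only the backpack row receives an artnr; the case row stays NaN because its click is not strictly later than the query.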

If no match is found, the 'artnr' column will be NA. So use the non-NA values of 'artnr' to replace 'term'

df.loc[df['artnr'].notna(), 'term'] = df['artnr']  # .loc avoids chained assignment
print(df)

                    term           timestamp   artnr
0                  tight 2018-09-27 20:09:23     NaN
1  differential pressure 2018-09-27 20:09:30     NaN
2              soot pump 2018-09-27 20:09:32     NaN
3           gas pressure 2018-09-27 20:09:46     NaN
4                   case 2018-09-27 20:11:29     NaN
5                   8739 2018-09-27 20:18:35  8739.0
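
As the comment under the question points out, the soot pump click happens 7 seconds before the query, so a forward-only search leaves that row unmatched. If clicks on either side of the query should count (an assumption: the question text asks only for clicks after the query, while its expected output also replaces soot pump), direction='nearest' is one possible variant. A sketch using the same sample data (the name both is just for illustration):

```python
import pandas as pd

queries = pd.DataFrame({
    "term": ["tight", "differential pressure", "soot pump",
             "gas pressure", "case", "backpack"],
    "timestamp": pd.to_datetime([
        "2018-09-27 20:09:23", "2018-09-27 20:09:30", "2018-09-27 20:09:32",
        "2018-09-27 20:09:46", "2018-09-27 20:11:29", "2018-09-27 20:18:35"]),
})
clicks = pd.DataFrame({
    "term": ["soot pump", "dungarees", "db23", "db23",
             "sealing blister", "backpack"],
    "timestamp": pd.to_datetime([
        "2018-09-27 20:09:25", "2018-09-27 20:10:38", "2018-09-27 20:10:40",
        "2018-09-27 20:10:55", "2018-09-27 20:12:05", "2018-09-27 20:18:40"]),
    "artnr": [9150.0, 7228.0, 7966.0, 7971.0, 7971.0, 8739.0],
})

# Both frames are already sorted by timestamp, as merge_asof requires.
both = pd.merge_asof(
    queries,
    clicks,
    on="timestamp",
    by="term",
    tolerance=pd.Timedelta("10s"),
    direction="nearest",  # accept the closest click before or after the query
)

# Replace the term with the article number wherever a click was matched.
matched = both["artnr"].notna()
both["term"] = both["term"].astype(object)  # column will hold strings and floats
both.loc[matched, "term"] = both.loc[matched, "artnr"]
print(both[["term", "timestamp"]])
```

With this variant both the soot pump and the backpack rows pick up an article number, at the cost of also accepting clicks that precede the query.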

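【Comments】:

  • When I run the code I get: "MergeError: key must be integer, timestamp or float"
  • I think the timestamp column is in string format. Use df['timestamp'] = pd.to_datetime(df['timestamp']) to convert the timestamp column in both dataframes to datetime format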
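
A quick way to check whether this is the cause is to inspect the column's dtype before and after the conversion. A minimal sketch, assuming a small made-up frame whose timestamps were read in as strings (df, was_string and is_datetime are illustrative names):

```python
import pandas as pd

# Hypothetical frame whose timestamps were loaded as plain strings.
df = pd.DataFrame({"timestamp": ["2018-09-27 20:09:23", "2018-09-27 20:09:30"]})
was_string = not pd.api.types.is_datetime64_any_dtype(df["timestamp"])

# Convert the strings to real datetime values.
df["timestamp"] = pd.to_datetime(df["timestamp"])
is_datetime = pd.api.types.is_datetime64_any_dtype(df["timestamp"])

print(was_string, is_datetime)
```

The same conversion has to be applied to both dataframes before calling pd.merge_asof().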