【Question title】: Can one use comparisons to merge two pandas data-frames?
【Posted】: 2014-10-30 13:31:20
【Question】:

With the following command:

pandas.merge(df_1, df_2, left_on=['date'], right_on=['from_date'])

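I merge a row from each of the two tables whenever the value in the date column of the first table equals the value in the from_date column of the second table.

Now I want to make it slightly more complex. I need to merge a row from the first table with a row from the second table if the value in the date column of the first table is greater than or equal to the value in the from_date column of the second table and smaller than the value in the upto_date column of the second table.

In SQL, one would use something like: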

select
    *
from
    table_1
join
    table_2
on
    table_1.date >= table_2.from_date
    and
    table_1.date <  table_2.upto_date

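Is it possible to do this in pandas?

【Comments】:

  • Could you provide a short example of df1 and df2?
  • Since the values you are joining on are no longer unique, the merge may not behave the way you expect. If you simply want to add the two tables together, you could look at .join or .concat
  • Possible duplicate of stackoverflow.com/questions/23508351/… There is also a GitHub issue proposing conditional joins for pandas DataFrames (github.com/pydata/pandas/issues/7480)
  • I wonder whether a non-SQL solution would be easier (i.e. parse + merge in Python).

Tags: python pandas merge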


【Solution 1】:

pandasql is a very useful tool for querying pandas DataFrames using SQLite query syntax.


Here is an example similar to what you describe.

Imports

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# StringIO lives in the standard library; pandas.io.parsers no longer exposes it
from io import StringIO

import pandas as pd
from pandasql import sqldf

# helper func useful for saving keystrokes
# when running multiple queries
def dbGetQuery(q):
    return sqldf(q, globals())

Fake some data

sample_a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,75
2014-01-21 00:00:00,49
2014-01-23 00:00:00,28
2014-01-24 00:00:00,91
2014-01-25 00:00:00,88
2014-01-27 00:00:00,98
2014-01-28 00:00:00,39
2014-01-29 00:00:00,90
2014-01-30 00:00:00,63
2014-01-31 00:00:00,77
"""

sample_b = """from_date,to_date,measure
2014-01-02 00:00:00,2014-01-06 00:00:00,89
2014-01-03 00:00:00,2014-01-07 00:00:00,80
2014-01-04 00:00:00,2014-01-05 00:00:00,44
2014-01-05 00:00:00,2014-01-12 00:00:00,68
2014-01-06 00:00:00,2014-01-11 00:00:00,62
2014-01-07 00:00:00,2014-01-14 00:00:00,5
2014-01-08 00:00:00,2014-01-09 00:00:00,23
"""

Read the datasets to create two DataFrames

df1 = pd.read_csv(StringIO(sample_a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(sample_b), parse_dates=['from_date', 'to_date'])

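Write the SQL query

Note that this one uses the SQLite BETWEEN operator, which is inclusive at both ends, so a timepoint equal to to_date still matches. If you want the half-open range from the question, you can swap it out and use something like ON timepoint >= from_date AND timepoint < to_date.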

query = """
SELECT
    DATE(df1.timepoint) AS timepoint
    , DATE(df2.from_date) AS start
    , DATE(df2.to_date) AS end
    , df1.measure AS measure_a
    , df2.measure AS measure_b
FROM
    df1 
INNER JOIN df2
    ON df1.timepoint BETWEEN 
        df2.from_date AND df2.to_date
ORDER BY
    df1.timepoint;
"""

Run the query using the helper function

df3 = dbGetQuery(query)

df3
     timepoint       start         end  measure_a  measure_b
0   2014-01-03  2014-01-02  2014-01-06          5         89
1   2014-01-03  2014-01-03  2014-01-07          5         80
2   2014-01-04  2014-01-02  2014-01-06         73         89
3   2014-01-04  2014-01-03  2014-01-07         73         80
4   2014-01-04  2014-01-04  2014-01-05         73         44
5   2014-01-05  2014-01-02  2014-01-06         40         89
6   2014-01-05  2014-01-03  2014-01-07         40         80
7   2014-01-05  2014-01-04  2014-01-05         40         44
8   2014-01-05  2014-01-05  2014-01-12         40         68
9   2014-01-06  2014-01-02  2014-01-06         45         89
10  2014-01-06  2014-01-03  2014-01-07         45         80
11  2014-01-06  2014-01-05  2014-01-12         45         68
12  2014-01-06  2014-01-06  2014-01-11         45         62
13  2014-01-08  2014-01-05  2014-01-12          2         68
14  2014-01-08  2014-01-06  2014-01-11          2         62
15  2014-01-08  2014-01-07  2014-01-14          2          5
16  2014-01-08  2014-01-08  2014-01-09          2         23
17  2014-01-09  2014-01-05  2014-01-12         96         68
18  2014-01-09  2014-01-06  2014-01-11         96         62
19  2014-01-09  2014-01-07  2014-01-14         96          5
20  2014-01-09  2014-01-08  2014-01-09         96         23
21  2014-01-10  2014-01-05  2014-01-12         82         68
22  2014-01-10  2014-01-06  2014-01-11         82         62
23  2014-01-10  2014-01-07  2014-01-14         82          5
24  2014-01-11  2014-01-05  2014-01-12         61         68
25  2014-01-11  2014-01-06  2014-01-11         61         62
26  2014-01-11  2014-01-07  2014-01-14         61          5
27  2014-01-12  2014-01-05  2014-01-12         68         68
28  2014-01-12  2014-01-07  2014-01-14         68          5
29  2014-01-13  2014-01-07  2014-01-14          8          5
30  2014-01-14  2014-01-07  2014-01-14         94          5
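
The difference between BETWEEN and the half-open condition from the question is easy to miss, so here is a minimal sketch of it. It does not use pandasql; it assumes only pandas and the standard-library sqlite3 module, and the frames points and ranges as well as the helper run_join are hypothetical names made up for this illustration. Dates are kept as ISO-formatted strings, which SQLite compares correctly as text.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical sample data: ISO date strings sort the same way as the dates do
points = pd.DataFrame({
    "timepoint": ["2014-01-04", "2014-01-05", "2014-01-06"],
    "measure": [73, 40, 45],
})
ranges = pd.DataFrame({
    "from_date": ["2014-01-04"],
    "to_date": ["2014-01-05"],
    "measure": [44],
})


def run_join(condition):
    """Load both frames into an in-memory SQLite database and join on `condition`."""
    with closing(sqlite3.connect(":memory:")) as con:
        points.to_sql("points", con, index=False)
        ranges.to_sql("ranges", con, index=False)
        sql = f"""
            SELECT p.timepoint, r.from_date, r.to_date
            FROM points AS p
            INNER JOIN ranges AS r ON {condition}
            ORDER BY p.timepoint
        """
        return pd.read_sql_query(sql, con)


# BETWEEN keeps the row whose timepoint equals to_date
inclusive = run_join("p.timepoint BETWEEN r.from_date AND r.to_date")

# The half-open form from the question drops that row
half_open = run_join("p.timepoint >= r.from_date AND p.timepoint < r.to_date")
```

With this data, inclusive contains the timepoints 2014-01-04 and 2014-01-05, while half_open contains only 2014-01-04.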

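【Comments】:

  • Python tells me that pandasql has no attribute 'dbGetQuery'. I also can't find any information about this module online. Does this code actually work?
  • I defined dbGetQuery at the top of the answer. It's just a helper function I often write.
【Solution 2】: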

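I think I found a solution. However, I am not sure whether it is elegant or optimal:

# Add a constant key to both frames so that the merge yields the full cross product
df_1['A'] = 'A'
df_2['A'] = 'A'
df = pandas.merge(df_1, df_2, on=['A'])
# Keep only the pairs where date falls in the half-open range [from_date, upto_date)
df = df[(df['date'] >= df['from_date']) & (df['date'] < df['upto_date'])]
del df['A']

Posted on behalf of the question asker.

【Comments】: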
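
On pandas 1.2 or later, the constant helper column is probably unnecessary, because merge accepts how="cross". The following is a minimal sketch of the same cross-product-then-filter idea; the sample frames and the value and label columns are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical sample frames using the column names from the question
df_1 = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-03", "2014-01-05"]),
    "value": [1, 2, 3],
})
df_2 = pd.DataFrame({
    "from_date": pd.to_datetime(["2014-01-01", "2014-01-03"]),
    "upto_date": pd.to_datetime(["2014-01-03", "2014-01-06"]),
    "label": ["first", "second"],
})

# how="cross" (pandas >= 1.2) pairs every row of df_1 with every row of df_2
pairs = df_1.merge(df_2, how="cross")

# Keep the pairs where date lies in the half-open range [from_date, upto_date)
in_range = (pairs["date"] >= pairs["from_date"]) & (pairs["date"] < pairs["upto_date"])
result = pairs[in_range].reset_index(drop=True)
```

The intermediate frame has len(df_1) * len(df_2) rows, so this approach is only practical when that product fits comfortably in memory.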

    【Solution 3】:

    conditional_join from pyjanitor may be helpful for non-equi joins:

    Using @hernamesbarbara's fake data:

    # pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
    import pandas as pd
    import janitor
    
    (df1.conditional_join(
             df2, 
             ('timepoint', 'from_date', '>='), 
             ('timepoint', 'to_date', '<='))
    )
     
             left              right                   
        timepoint measure  from_date    to_date measure
    0  2014-01-03       5 2014-01-02 2014-01-06      89
    1  2014-01-03       5 2014-01-03 2014-01-07      80
    2  2014-01-04      73 2014-01-02 2014-01-06      89
    3  2014-01-04      73 2014-01-03 2014-01-07      80
    4  2014-01-04      73 2014-01-04 2014-01-05      44
    5  2014-01-05      40 2014-01-02 2014-01-06      89
    6  2014-01-05      40 2014-01-03 2014-01-07      80
    7  2014-01-05      40 2014-01-04 2014-01-05      44
    8  2014-01-05      40 2014-01-05 2014-01-12      68
    9  2014-01-06      45 2014-01-02 2014-01-06      89
    10 2014-01-06      45 2014-01-03 2014-01-07      80
    11 2014-01-06      45 2014-01-05 2014-01-12      68
    12 2014-01-06      45 2014-01-06 2014-01-11      62
    13 2014-01-08       2 2014-01-05 2014-01-12      68
    14 2014-01-08       2 2014-01-06 2014-01-11      62
    15 2014-01-08       2 2014-01-07 2014-01-14       5
    16 2014-01-08       2 2014-01-08 2014-01-09      23
    17 2014-01-09      96 2014-01-05 2014-01-12      68
    18 2014-01-09      96 2014-01-06 2014-01-11      62
    19 2014-01-09      96 2014-01-07 2014-01-14       5
    20 2014-01-09      96 2014-01-08 2014-01-09      23
    21 2014-01-10      82 2014-01-05 2014-01-12      68
    22 2014-01-10      82 2014-01-06 2014-01-11      62
    23 2014-01-10      82 2014-01-07 2014-01-14       5
    24 2014-01-11      61 2014-01-05 2014-01-12      68
    25 2014-01-11      61 2014-01-06 2014-01-11      62
    26 2014-01-11      61 2014-01-07 2014-01-14       5
    27 2014-01-12      68 2014-01-05 2014-01-12      68
    28 2014-01-12      68 2014-01-07 2014-01-14       5
    29 2014-01-13       8 2014-01-07 2014-01-14       5
    30 2014-01-14      94 2014-01-07 2014-01-14       5
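
If installing pyjanitor is not an option, the same inclusive conditions can be approximated with plain NumPy broadcasting. This is only a sketch under assumptions, not how conditional_join is implemented: the small frames below are a hypothetical subset in the shape of the fake data, and the _a and _b suffixes are chosen here just to keep the two measure columns apart.

```python
import numpy as np
import pandas as pd

# Hypothetical subset shaped like the fake data used above
df1 = pd.DataFrame({
    "timepoint": pd.to_datetime(["2014-01-03", "2014-01-04", "2014-01-13"]),
    "measure": [5, 73, 8],
})
df2 = pd.DataFrame({
    "from_date": pd.to_datetime(["2014-01-02", "2014-01-04", "2014-01-07"]),
    "to_date": pd.to_datetime(["2014-01-06", "2014-01-05", "2014-01-14"]),
    "measure": [89, 44, 5],
})

t = df1["timepoint"].to_numpy()[:, None]  # shape (len(df1), 1)
lo = df2["from_date"].to_numpy()          # shape (len(df2),)
hi = df2["to_date"].to_numpy()

# Boolean matrix: entry [i, j] is True when row i of df1 lies inside range j of df2
mask = (t >= lo) & (t <= hi)
left_idx, right_idx = np.nonzero(mask)

# Stitch the matching rows together side by side
joined = pd.concat(
    [
        df1.iloc[left_idx].reset_index(drop=True).add_suffix("_a"),
        df2.iloc[right_idx].reset_index(drop=True).add_suffix("_b"),
    ],
    axis=1,
)
```

The boolean mask still has len(df1) * len(df2) entries, but only the matching rows are ever materialized as full DataFrame rows.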
    

    【Comments】:
