【Question title】:Find rows in a DataFrame that partially match conditions
【Posted】:2018-10-18 08:35:33
【Question】:

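Given a DataFrame, what is the best way to find the rows that partially match a given list of values?

Currently I have the given values as rows of one DataFrame (df1). I iterate over those rows and, for each one, apply a function to every row of another DataFrame (df2). The function counts how many values in the df2 row meet the conditions, and I then keep the subset of df2 whose count is above a threshold.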

def partialMatch(row, conditions):
    # Count how many of the four fields in `row` equal the given conditions.
    count = 0
    if row['ResidenceZip'] == conditions['ResidenceZip']:
        count += 1
    if row['FirstName'] == conditions['FirstName']:
        count += 1
    if row['LastName'] == conditions['LastName']:
        count += 1
    if row['Birthday'] == conditions['Birthday']:
        count += 1
    return count

concat_all = []
for i, row in df1.iterrows():
    c = {'ResidenceZip': row['ResidenceZip'], 'FirstName':row['FirstName'], 
         'LastName': row['LastName'],'Birthday': row['Birthday']}
    df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis = 1)
    x1 = df2[df2['count']>=3]
    concat_all.append(x1)

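This works, but it is very slow. Any tips for speeding it up?

For example, running the code on the two DataFrames below, the first row of df1 would return the first three rows of df2, but not the remaining rows.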

df1
    FirstName|LastName | Birthday | ResidenceZip 
    John     |  Doe    | 1/1/2000 |  99999
    Rob      |  A      | 1/1/2010 |  19499

df2
    FirstName|LastName | Birthday | ResidenceZip | count
    John     |  Doe    | 1/1/2000 |  99999       | 4
    John     |  Doe    | 1/1/2000 |  99999       | 4
    John     |  Doex   | 1/1/2000 |  99999       | 3
    Joha     |  Doex   | 1/1/2000 |  99999       | 2
    Joha     |  Doex   | 9/9/2000 |  99999       | 1
    Rob      |  A      | 9/9/2009 |  19499       | 0

【Comments】:

  • If possible, could you provide example input DataFrames and your expected output?

Tags: python pandas


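【Solution 1】:

I'm not sure there is a way to avoid looping over at least one of the DataFrames, but here is an option that may speed things up. Because each row is reduced to a set of values, it does allow accidental matches between FirstName and LastName. That can be avoided by adding a unique prefix to the values (for example "@" for first names and "&" for last names).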

import numpy as np

s1 = [set(x) for x in df1.values]
s2 = [set(x) for x in df2.values]
masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
concat_all = [df2[m] for m in masks]

Output of concat_all:

[  FirstName LastName  Birthday  ResidenceZip
 0      John      Doe  1/1/2000         99999
 1      John      Doe  1/1/2000         99999
 2      John     Doex  1/1/2000         99999,
   FirstName LastName  Birthday  ResidenceZip
 5       Rob        A  9/9/2009         19499]
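The cross-column caveat mentioned in this answer can be made concrete. The sketch below is a variation on the prefix idea, not the answerer's code: instead of string prefixes it tags every value with its column name, so only same-column values can intersect. The helper name `tag_rows` and the `swapped` test row are made up for illustration, and the sample data is assumed to be the same as in this thread.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000',
                                 '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})


def tag_rows(df):
    """Hypothetical helper: one set of (column, value) pairs per row."""
    cols = list(df.columns)
    return [set(zip(cols, row)) for row in df.itertuples(index=False, name=None)]


s1 = tag_rows(df1)
s2 = tag_rows(df2)
# masks[i, j] is True when df1 row i and df2 row j agree in at least 3 columns.
masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
concat_all_tagged = [df2[m] for m in masks]

# A row with first and last name swapped: plain sets see 4 shared values,
# tagged sets only see the 2 values that sit in the same column.
swapped = pd.DataFrame({'FirstName': ['Doe'], 'LastName': ['John'],
                        'Birthday': ['1/1/2000'], 'ResidenceZip': [99999]})
plain_overlap = len(set(df1.values[0]) & set(swapped.values[0]))
tagged_overlap = len(s1[0] & tag_rows(swapped)[0])
print(plain_overlap, tagged_overlap)
```

Tagging with the column name avoids having to pick prefix characters that never occur in the data, at the cost of building one tuple per cell.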

Timings

def Alollz(df1, df2):
    s1 = [set(x) for x in df1.values]
    s2 = [set(x) for x in df2.values]
    masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
    concat_all = [df2[m] for m in masks]
    return concat_all

def SharpObject(df1, df2):
    concat_all = []
    for i, row in df1.iterrows():
        c = {'ResidenceZip': row['ResidenceZip'], 'FirstName':row['FirstName'], 
             'LastName': row['LastName'],'Birthday': row['Birthday']}
        df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis = 1)
        x1 = df2[df2['count']>=3]
        concat_all.append(x1)
    return concat_all

%timeit Alollz(df1, df2)
#785 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit SharpObject(df1, df2)
#3.56 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And on larger data:

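import pandas as pd

# you should never grow DataFrames like this in a loop
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
for i in range(7):
    df1 = pd.concat([df1, df1])
    df2 = pd.concat([df2, df2])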

%timeit Alollz(df1, df2)
#132 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit SharpObject(df1, df2)
#6.88 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
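The `%timeit` lines above are IPython magics and will not run in a plain Python script. As a rough sketch, similar measurements can be taken with the standard-library `timeit` module. The names `time_call` and `slow_sum` are hypothetical, and `slow_sum` is only a stand-in workload; in practice one would pass `Alollz` or `SharpObject` together with the two DataFrames.

```python
import timeit


def time_call(func, *args, repeat=3, number=5):
    """Hypothetical helper: best average time per call, in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    # Each entry is the total time for `number` consecutive calls.
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number


def slow_sum(n):
    """Stand-in workload, used only to keep the example self-contained."""
    total = 0
    for i in range(n):
        total += i
    return total


elapsed = time_call(slow_sum, 10_000)
print(f"{elapsed * 1e6:.1f} µs per call")
```

Taking the minimum over the repeats is a common convention, since the slower runs usually reflect interference from other processes rather than the code itself.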


    【Solution 2】:

    Use NumPy's isin function:

    import numpy as np
    
    df1_vals = df1.values
    df2_vals = df2.values
    df1_rows = range(df1_vals.shape[0])
    
    concat_all = \
        [df2[np.add.reduce(np.isin(df2_vals, df1_vals[row]), axis=1) >= 3] for row in df1_rows]
    

    Here is the DataFrame setup:

    import pandas as pd
    
    df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                        'LastName': ['Doe', 'A'],
                        'Birthday': ['1/1/2000', '9/9/2009'],
                        'ResidenceZip': [99999, 19499]})
    
    df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                        'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                        'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000', '1/1/2000', '9/9/2000', '9/9/2009'],
                        'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})
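Like the set-based approach, `np.isin` tests each df2 value against all values of a df1 row, so a value can match in a different column. If that matters, one possible alternative is a column-aligned comparison with NumPy broadcasting. This is a sketch under the assumption that both DataFrames share the four columns listed in `cols`; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000',
                                 '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})

cols = ['FirstName', 'LastName', 'Birthday', 'ResidenceZip']
a = df1[cols].to_numpy()  # shape (len(df1), 4)
b = df2[cols].to_numpy()  # shape (len(df2), 4)

# Broadcasting compares every df1 row with every df2 row, column by column.
# counts[i, j] = number of columns in which df1 row i equals df2 row j.
counts = (a[:, None, :] == b[None, :, :]).sum(axis=2)

# One boolean mask over df2 per df1 row, keeping rows with at least 3 matches.
concat_all = [df2[row_mask] for row_mask in counts >= 3]
print(counts)
```

The intermediate boolean array has `len(df1) * len(df2) * len(cols)` entries, so for large inputs it may be necessary to process df1 in chunks.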
    
