【Question title】: Comparing Two DataFrames and Loop through them (to test a condition)
【Posted】: 2020-03-05 17:44:40
【Question】:

I am trying to "join" two DataFrames based on a condition.

Condition:

if df1.Year == df2.Year & 
df1.Date >= df2.BeginDate or df1.Date <= df2.EndDate &
df1.ID == df2.ID 
#if the condition is True, I would love to add an extra column (binary) to df1, something like
#df1.condition = Yes or No.

My data looks like this:

df1: 

Year     Week     ID   Date
2020      1      123   2020-01-01 00:00:00
2020      1      345   2020-01-01 00:00:00
2020      2      123   2020-01-07 00:00:00
2020      1      123   2020-01-01 00:00:00


df2: 

Year    BeginDate               EndDate                 ID
2020    2020-01-01  00:00:00    2020-01-02  00:00:00    123
2020    2020-01-01  00:00:00    2020-01-02  00:00:00    123
2020    2020-01-01  00:00:00    2020-01-02  00:00:00    978
2020    2020-09-21  00:00:00    2020-01-02  00:00:00    978


end_df: #Expected output 

Year     Week     ID   Condition
2020      1      123     True  #Year is matching, week1 is between the dates, ID is matching too 
2019      1      345     False #Year is not matching
2020      2      187     False # ID is not matching 
2020      1      123     True  # Same as first row. 

I thought I could solve this by looping over both DataFrames:

for row in df1.iterrrows(): 
    for row2 in df2.iterrows(): 
         if row['Year'] == row2['Year2']:
              if row['ID] == row2['ID']: 
                  .....
                  .....
                   row['Condition'] = True 
         else: 
            row['Condition'] = False 

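But... this results in one error after another.

Really looking forward to seeing how you would solve this. Thanks in advance!

Update 1

I created a loop. However, this loop takes a very long time (and I am not sure how to add the values to a new column).

Note that in df1 I created a "Date" column (in the same format as the begin and end dates in df2).

Now the key question: how do I add the True value (at the end of the loop) to df1, in an extra column?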

for index, row in df1.iterrows():
      row['Year'] = str(row['Year'])

      for index1, row1 in df2.iterrows():
          row1['Year'] = str(row1['Year'])


          if row['Year'] == row1['Year']:
                  row['ID'] = str(row['ID']) 
                  row1['ID'] = str(row1['ID']) 


                  if row['ID'] == row1['ID']:

                        if row['Date'] >= row1['BeginDate'] and row['Date'] <= row1['EndDate']:
                              print("I would like to add this YES to df1 in an extra column")

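Edit 2

Trying @davidbilla's solution: it looks like the "condition" column is not behaving well. As you can see, it matches even when df1.Year != df2.Year. Note that df2 is sorted by ID (so all identical unique numbers should be present).

【Comments】:

  • The df1 data is missing the date column.
  • Added it to the example. Thanks for mentioning it, @cs95.

Tags: python pandas numpy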


【Solution 1】:

I guess you are expecting something like this, if you are trying to match the DataFrames row by row (i.e. compare row 1 of df1 with row 1 of df2):

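# Python's `or` cannot combine two Series (it raises "truth value of a Series is ambiguous"),
# so the date test uses the element-wise `&`, matching the `and` in the question's Update 1.
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID'])&((df1['Date']>=df2['BeginDate'])&(df1['Date']<=df2['EndDate'])), True, False)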

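np.where takes the condition as its first argument; the second argument is the value used where the condition holds, and the third is the value used where it fails.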
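
As a minimal sketch of that row-by-row behaviour, the snippet below uses two hypothetical toy frames (`left` and `right`, not the question's data) that share the same index, so row 0 is compared with row 0, row 1 with row 1, and so on:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with identical indexes (0..2), so rows pair up by position.
left = pd.DataFrame({"Year": [2020, 2020, 2019], "ID": [123, 345, 123]})
right = pd.DataFrame({"Year": [2020, 2020, 2020], "ID": [123, 123, 123]})

# Element-wise AND of two boolean Series: one True/False per row.
mask = (left["Year"] == right["Year"]) & (left["ID"] == right["ID"])

# np.where(condition, value_if_true, value_if_false) picks a value per row.
left["condition"] = np.where(mask, True, False)
print(left)
```

Since the two values here are just True and False, assigning `mask` directly would give the same column; np.where becomes more useful when the values are something like "Yes" and "No".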

Edit 1: based on your sample dataset

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[2020,1,123],[2020,1,345],[2020,2,123],[2020,1,123]],
                   columns=['Year','Week','ID'])
df2 = pd.DataFrame([[2020,'2020-01-01  00:00:00','2020-01-02  00:00:00',123],
                    [2020,'2020-01-01  00:00:00','2020-01-02  00:00:00',123],
                    [2020,'2020-01-01  00:00:00','2020-01-02  00:00:00',978],
                    [2020,'2020-09-21  00:00:00','2020-01-02  00:00:00',978]],
                   columns=['Year','BeginDate','EndDate','ID'])
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID']),True, False)
# &((df1['Date']>=df2['BeginDate'])or(df1['Date']<=df2['EndDate'])) - removed this condition as the df has no Date field
print(df1)

Output:

   Year  Week   ID  condition
0  2020     1  123       True
1  2020     1  345      False
2  2020     2  123      False
3  2020     1  123      False

Edit 2: comparing one row of df1 against all rows of df2

df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))

This takes df1['Year'] and compares it against all values of df2['Year'].
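
One thing worth keeping in mind with this approach: each `isin` call checks its column on its own, so a Year found in one df2 row and an ID found in a different df2 row still count as a match. The sketch below illustrates this with hypothetical toy frames (`df1_demo`, `df2_demo`) and contrasts it with a pair-wise check built on `pd.MultiIndex`, which is one possible way to require both values in the same row:

```python
import pandas as pd

# Hypothetical toy data: 2019 and 123 both occur in df2_demo, but never in the same row.
df1_demo = pd.DataFrame({"Year": [2019], "ID": [123]})
df2_demo = pd.DataFrame({"Year": [2019, 2020], "ID": [999, 123]})

# Column-by-column check: each isin() looks at one column independently.
loose = df1_demo["Year"].isin(df2_demo["Year"]) & df1_demo["ID"].isin(df2_demo["ID"])

# Pair-wise check: the (Year, ID) combination must occur together in one df2_demo row.
pairs = pd.MultiIndex.from_frame(df2_demo[["Year", "ID"]])
strict = pd.MultiIndex.from_frame(df1_demo[["Year", "ID"]]).isin(pairs)

print(loose.tolist(), strict.tolist())
```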

Based on the sample dataset:

df1:

   Year       Date   ID  
0  2020 2020-01-01  123  
1  2020 2020-01-01  345  
2  2020 2020-10-01  123  
3  2020 2020-11-13  123  

df2:

   Year  BeginDate    EndDate   ID
0  2020 2020-01-01 2020-02-01  123
1  2020 2020-01-01 2020-01-02  123
2  2020 2020-03-01 2020-05-01  978
3  2020 2020-09-21 2020-10-01  978

Code changes:

date_range = list(zip(df2['BeginDate'],df2['EndDate']))

def check_date(date):
    for (s,e) in date_range:
        if date>=s and date<=e:
            return True
    return False

df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))
df1['date_compare'] = df1['Date'].apply(lambda x: check_date(x)) # you can directly store this in df1['condition']. I just wanted to print the values so have used a new field
df1['condition'] = (df1['condition']==True)&(df1['date_compare']==True)

Output:

   Year       Date   ID  condition  date_compare
0  2020 2020-01-01  123       True          True    # Year match, ID match and Date is within the range of df2 row 1
1  2020 2020-01-01  345      False          True    # Year match, ID no match
2  2020 2020-10-01  123       True          True    # Year match, ID match, Date is within range of df2 row 4
3  2020 2020-11-13  123      False         False    # Year match, ID match, but Date is not in range of any row in df2

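Edit 3: based on the updated question (earlier I thought it was fine if the three values year, ID and date matched df2 in any rows, not necessarily the same row). I think I now have a better understanding of your requirement.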

df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['Date'] = pd.to_datetime(df1['Date'])

df1['condition'] = False
for idx1, row1 in df1.iterrows():
    match = False
    for idx2, row2 in df2.iterrows():
        if (row1['Year']==row2['Year']) & \
                (row1['ID']==row2['ID']) & \
                (row1['Date']>=row2['BeginDate']) & \
                (row1['Date']<=row2['EndDate']):
            match = True
    df1.at[idx1, 'condition'] = match

Output - Set 1:

DF1:

   Year       Date   ID
0  2020 2020-01-01  123
1  2020 2020-01-01  123
2  2020 2020-01-01  345
3  2020 2020-01-10  123
4  2020 2020-11-13  123

DF2:

   Year  BeginDate    EndDate   ID
0  2020 2020-01-15 2020-02-01  123
1  2020 2020-01-01 2020-01-02  123
2  2020 2020-03-01 2020-05-01  978
3  2020 2020-09-21 2020-10-01  978

DF1 result:

   Year       Date   ID  condition
0  2020 2020-01-01  123       True
1  2020 2020-01-01  123       True
2  2020 2020-01-01  345      False
3  2020 2020-01-10  123      False
4  2020 2020-11-13  123      False

Output - Set 2:

DF1:

   Year       Date        ID
0  2019 2019-01-01   s904112
1  2019 2019-01-01   s911243
2  2019 2019-01-01   s917131
3  2019 2019-01-01  sp986214
4  2019 2019-01-01   s510006
5  2020 2020-01-10   s540006

DF2:

   Year  BeginDate    EndDate       ID
0  2020 2020-01-27 2020-09-02  s904112
1  2020 2020-01-27 2020-09-02  s904112
2  2020 2020-01-03 2020-03-15  s904112
3  2020 2020-04-15 2020-01-05  s904112
4  2020 2020-01-05 2020-05-15  s540006
5  2019 2019-01-05 2019-05-15  s904112

DF1 result:

   Year       Date        ID  condition
0  2019 2019-01-01   s904112      False
1  2019 2019-01-01   s911243      False
2  2019 2019-01-01   s917131      False
3  2019 2019-01-01  sp986214      False
4  2019 2019-01-01   s510006      False
5  2020 2020-01-10   s540006       True
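
If the nested loop from Edit 3 turns out to be slow, one possible variation (a sketch, not part of the original answer) keeps the outer pass over df1 but replaces the inner `iterrows` loop with a single boolean mask over df2. The helper name `has_match` is made up for illustration, and the data is the Set 1 sample from above:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020, 2020],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01",
                            "2020-01-10", "2020-11-13"]),
    "ID": [123, 123, 345, 123, 123],
})
df2 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "BeginDate": pd.to_datetime(["2020-01-15", "2020-01-01", "2020-03-01", "2020-09-21"]),
    "EndDate": pd.to_datetime(["2020-02-01", "2020-01-02", "2020-05-01", "2020-10-01"]),
    "ID": [123, 123, 978, 978],
})

def has_match(row):
    # Hypothetical helper: True if at least one single df2 row satisfies
    # all conditions (same Year, same ID, Date inside [BeginDate, EndDate]).
    hit = (
        (df2["Year"] == row["Year"])
        & (df2["ID"] == row["ID"])
        & (df2["BeginDate"] <= row["Date"])
        & (df2["EndDate"] >= row["Date"])
    )
    return bool(hit.any())

# One call per df1 row; the comparison against all df2 rows is vectorized.
df1["condition"] = df1.apply(has_match, axis=1)
print(df1)
```

This still visits every df1 row in Python, so it is a middle ground rather than a fully vectorized solution.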

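【Comments】:

  • Great, I only get this error (even when I try to run it for just the first expression): raise ValueError("Can only compare same-labeled " "Series objects") ValueError: Can only compare same-labeled Series objects. My data types are similar, and I also used this on both DataFrames: df.sort_index(inplace=True)
  • @Roverflow On which line do you get this error? See Edit 1, which works on the sample dataset you provided.
  • @Roverflow This solution compares row 1 of df1 with row 1 of df2, and so row 4 of df1 with row 4 of df2; the ID values of row 4 in df1 and row 4 in df2 differ. Do you want to loop over all rows in df2 for each row in df1?
  • Did you convert df1['Date'] to timestamps using pd.to_datetime(df1['Date'])?
  • @Roverflow See Edit 3. Earlier I misunderstood your requirement as: each row in df1 should match the three values in any rows of df2. Now I understand that they must match within the same row of df2. Check Edit 3 and let me know if it works for you.
【Solution 2】: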

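The second row of your desired output has Year as 2019, so I assume the second row of df1.Year is also 2019 rather than 2020.

If I understand correctly, you need to merge and then filter out any Date that falls outside the BeginDate to EndDate range. First, df2 contains duplicates and invalid date ranges. We need to drop the duplicates and invalid ranges before merging. An invalid date range is one where BeginDate >= EndDate, which is index 3 of df2.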
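
As a small sketch of just that clean-up step, using the df2 sample from the question (the intermediate names `valid` and `clean` are only for illustration):

```python
import pandas as pd

# df2 as given in the question, with the date columns already parsed.
df2 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "BeginDate": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01", "2020-09-21"]),
    "EndDate": pd.to_datetime(["2020-01-02", "2020-01-02", "2020-01-02", "2020-01-02"]),
    "ID": [123, 123, 978, 978],
})

# Keep only ranges that start before they end (drops index 3).
valid = df2.loc[df2["BeginDate"] < df2["EndDate"]]

# Drop fully identical rows (index 1 repeats index 0).
clean = valid.drop_duplicates()
print(clean)
```

With the sample data this leaves one row per (Year, ID) pair, which is what keeps the left-merge from producing extra rows.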

#convert all date columns of both `df1` and `df2` to datetime dtype
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['BeginDate', 'EndDate']] = df2[['BeginDate', 'EndDate']].apply(pd.to_datetime)

#left-merge on `Year`, `ID` and using `eval` to compute 
#columns `Condition` where `Date` is between `BeginDate` and `EndDate`. 
#Finally assign back to `df1`
df1['Condition'] = (df1.merge(df2.loc[df2.BeginDate < df2.EndDate].drop_duplicates(), 
                                on=['Year','ID'], how='left')
                       .eval('Condition= BeginDate <= Date <= EndDate')['Condition'])

Out[614]:
   Year  Week   ID       Date  Condition
0  2020     1  123 2020-01-01       True
1  2019     1  345 2020-01-01      False
2  2020     2  123 2020-01-07      False
3  2020     1  123 2020-01-01       True
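
One caveat, as far as I can tell: the direct assignment back to `df1` relies on the merge returning exactly one row per df1 row. If df2 holds several different valid ranges for the same (Year, ID), the left-merge yields more rows than df1 and the result no longer lines up. The sketch below is one possible way around that, under that assumption: it keeps df1's row label through the merge and collapses the matches with `groupby(...).any()`. The data is hypothetical and chosen to trigger the multi-range case:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Year": [2020, 2020, 2020],
    "ID": [123, 123, 345],
    "Date": pd.to_datetime(["2020-01-01", "2020-03-10", "2020-01-01"]),
})
# Hypothetical df2 with two distinct valid ranges for the same (Year, ID).
df2 = pd.DataFrame({
    "Year": [2020, 2020],
    "ID": [123, 123],
    "BeginDate": pd.to_datetime(["2020-01-01", "2020-03-01"]),
    "EndDate": pd.to_datetime(["2020-01-02", "2020-03-31"]),
})

# reset_index() stores df1's row label in a column named "index",
# so it survives the merge even when rows are multiplied.
merged = df1.reset_index().merge(df2, on=["Year", "ID"], how="left")

# Rows without a df2 partner have NaT bounds and evaluate to False.
in_range = merged["Date"].between(merged["BeginDate"], merged["EndDate"])

# Collapse back to one value per original df1 row: True if any range matched.
df1["Condition"] = in_range.groupby(merged["index"]).any()
print(df1)
```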

【Comments】:

  • Thanks! It only gives "True" when the ID is present in DF1, not in DF2.