【Posted】: 2020-03-05 17:44:40
【Question】:
I am trying to "join" two DataFrames based on a condition.
Condition:
if df1.Year == df2.Year &
df1.Date >= df2.BeginDate & df1.Date <= df2.EndDate &
df1.ID == df2.ID
# If the condition is True, I would like to add an extra (binary) column to df1, something like
# df1.Condition = Yes or No.
My data looks like this:
df1:
Year  Week  ID   Date
2020  1     123  2020-01-01 00:00:00
2020  1     345  2020-01-01 00:00:00
2020  2     123  2020-01-07 00:00:00
2020  1     123  2020-01-01 00:00:00
df2:
Year  BeginDate            EndDate              ID
2020  2020-01-01 00:00:00  2020-01-02 00:00:00  123
2020  2020-01-01 00:00:00  2020-01-02 00:00:00  123
2020  2020-01-01 00:00:00  2020-01-02 00:00:00  978
2020  2020-09-21 00:00:00  2020-01-02 00:00:00  978
end_df:  # Expected output
Year  Week  ID   Condition
2020  1     123  True       # Year matches, week 1 is between the dates, ID matches too
2019  1     345  False      # Year does not match
2020  2     187  False      # ID does not match
2020  1     123  True       # Same as the first row
My idea was to solve this by looping over both DataFrames:
for _, row in df1.iterrows():
    for _, row2 in df2.iterrows():
        if row['Year'] == row2['Year']:
            if row['ID'] == row2['ID']:
                .....
                .....
                row['Condition'] = True
            else:
                row['Condition'] = False
But... this leads to one error after another.
I am really looking forward to seeing how you would solve this. Thanks in advance!
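For reference, the condition described above could be sketched without an explicit loop by merging on Year and ID and then filtering on the date range. This is only a sketch under assumptions: the sample data is typed in from the tables above (df1 including the Date column), the date range is treated as inclusive, and `row_id` is a hypothetical helper column used to map matches back to df1.

```python
import pandas as pd

# Sample data typed in from the question.
df1 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "Week": [1, 1, 2, 1],
    "ID": [123, 345, 123, 123],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-07", "2020-01-01"]),
})
df2 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "BeginDate": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01", "2020-09-21"]),
    "EndDate": pd.to_datetime(["2020-01-02"] * 4),
    "ID": [123, 123, 978, 978],
})

# Keep df1's row labels in a column so matches can be mapped back after the merge.
# "row_id" is a hypothetical helper name, not something from the question.
left = df1.reset_index().rename(columns={"index": "row_id"})

# Pair every df1 row with the df2 rows that share both Year and ID.
pairs = left.merge(df2, on=["Year", "ID"], how="inner")

# Keep only the pairs whose Date lies within [BeginDate, EndDate] (inclusive).
in_range = pairs[(pairs["Date"] >= pairs["BeginDate"]) & (pairs["Date"] <= pairs["EndDate"])]

# A df1 row satisfies the condition if at least one of its pairs survived the filter.
df1["Condition"] = df1.index.isin(in_range["row_id"])
print(df1)
```

Because duplicate rows in df2 only produce duplicate pairs, they do not change the flag. For large frames the intermediate `pairs` table can grow considerably, so this may need adapting.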
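Update 1
I created a loop. However, it takes a very long time to run (and I am not sure how to add the values to a new column).
Note that I created a "Date" column in df1 (in the same format as the begin and end dates in df2).
Now the key question: how do I add the True value (at the end of the loop) to my df1 (in an extra column)?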
for index, row in df1.iterrows():
    row['Year'] = str(row['Year'])
    for index1, row1 in df2.iterrows():
        row1['Year'] = str(row1['Year'])
        if row['Year'] == row1['Year']:
            row['ID'] = str(row['ID'])
            row1['ID'] = str(row1['ID'])
            if row['ID'] == row1['ID']:
                if row['Date'] >= row1['BeginDate'] and row['Date'] <= row1['EndDate']:
                    print("I would like to add this YES to df1 in an extra column")
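One possible way to get the result of such a loop into df1 is to collect one flag per df1 row in a list and assign that list as a new column after the loop. The sketch below assumes the sample data from the question and an inclusive date range; `flags` and `matched` are hypothetical names. Assigning to `row[...]` inside `iterrows()` generally does not write back to the DataFrame, which is why the values are collected separately here.

```python
import pandas as pd

# Sample data typed in from the question.
df1 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "Week": [1, 1, 2, 1],
    "ID": [123, 345, 123, 123],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-07", "2020-01-01"]),
})
df2 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "BeginDate": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01", "2020-09-21"]),
    "EndDate": pd.to_datetime(["2020-01-02"] * 4),
    "ID": [123, 123, 978, 978],
})

flags = []  # one True/False entry per df1 row, in df1's row order
for _, row in df1.iterrows():
    matched = False
    for _, row1 in df2.iterrows():
        if (row["Year"] == row1["Year"]
                and row["ID"] == row1["ID"]
                and row1["BeginDate"] <= row["Date"] <= row1["EndDate"]):
            matched = True
            break  # one matching df2 row is enough
    flags.append(matched)

# Add the collected flags as the extra column.
df1["Condition"] = flags
print(df1)
```

Writing each value directly with `df1.at[index, "Condition"] = True` inside the loop would be another option. Either way the nested loop stays slow on large frames, since it compares every df1 row against every df2 row.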
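Edit 2
Trying @davidbilla's solution: it looks like the "Condition" column does not behave correctly. As you can see, it matches even when df1.Year != df2.Year. Note that df2 is sorted by ID (so all identical unique numbers should be present).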
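A quick way to check for this kind of false match might be to left-merge the flagged df1 against the (Year, ID) pairs that actually exist in df2 and list the rows that are flagged True without a counterpart. The data below is hypothetical and only illustrates the check; it is not the output of @davidbilla's solution.

```python
import pandas as pd

# Hypothetical df1 that already carries a "Condition" column from an earlier attempt.
# The second row is flagged True although its Year does not occur in df2.
df1 = pd.DataFrame({
    "Year": [2020, 2019],
    "ID": [123, 123],
    "Condition": [True, True],
})
df2 = pd.DataFrame({"Year": [2020], "ID": [123]})

# indicator=True adds a "_merge" column: "both" if the (Year, ID) pair exists in df2,
# "left_only" if it does not.
keys = df2[["Year", "ID"]].drop_duplicates()
check = df1.merge(keys, on=["Year", "ID"], how="left", indicator=True)

# Rows flagged True even though no df2 row shares both Year and ID.
suspicious = check[check["Condition"] & (check["_merge"] == "left_only")]
print(suspicious)
```

If `suspicious` is not empty, Year is probably not being used as a match key, or the Year columns of the two frames differ in dtype (for example string versus integer). Both are guesses that would need to be checked against the real data.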
【Comments】:
-
The df1 data is missing the Date column.
-
I have added it to the example. Thanks for pointing that out, @cs95.