Python - DataFrame - Finding data from one dataframe in another: answers

【Question title】: Python - DataFrame - Finding data from one dataframe in another
【Posted】: 2020-10-17 15:51:27
【Question description】:

I have 2 dataframes. DF1 is the main dataframe, and DF2 lists the holidays each employee has already taken in the current month.

DF1=pd.DataFrame({'Name': ['A','B','C','D'],
   'CurrDate': ['27-Jun', '27-Jun','27-Jun', '27-Jun']})



DF2=pd.DataFrame({'Name': ['A','A','B','B','B','C'],'Holiday': ['27-Jun', '26-Jun','27-Jun','25-Jun','23-Jun','27-Jun']  })

I want to compare 'CurrDate' in DF1 with 'Holiday' in DF2. DF1 should be updated to the date before the holiday. So DF1 would look like:

DF1=pd.DataFrame({'Name': ['A','B','C','D'], 'CurrDate': ['25-Jun', '26-Jun','26-Jun', '27-Jun']})

I am struggling to put the dataframes into a loop.

【Question discussion】:

Tags: python pandas numpy loops dataframe

【Solution 1】:

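Here is a quick-and-dirty model that correctly handles gaps in the holiday schedule. It does not handle the corner cases where a given user has no holidays, or where the first recorded holiday falls before the current date, but I will leave those to you; the basics are all here.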

from datetime import datetime, timedelta
import pandas as pd

datetime_format = '%d-%b'
str2dt = lambda dts: datetime.strptime(dts, datetime_format)

current_date_col_name = 'curr_date'
name_col_name = 'name'
holiday_col_name = 'holiday'

df1 = pd.DataFrame({
    name_col_name: ['A','B','C','D'],
    current_date_col_name: ['27-Jun', '27-Jun','27-Jun', '27-Jun'],
})

# assuming "current_date" can vary by person
# if not, you can just ignore df1
current_dates = {
    row[name_col_name]: str2dt(row[current_date_col_name]) for ind, row in df1.iterrows()
}

holidays_df = pd.DataFrame({
    name_col_name: ['A', 'A', 'B', 'B', 'B', 'B', 'C'],
    holiday_col_name: ['27-Jun', '26-Jun', '27-Jun', '26-Jun', '25-Jun', '22-Jun', '27-Jun']
})


holiday_dt_col_name = 'holiday_datetime'
last_day_worked_col_name = 'last_day_worked'


# convert holiday days to datetime objects
holidays_df[holiday_dt_col_name] = holidays_df[holiday_col_name].apply(str2dt)

one_day = timedelta(days=1)

last_dates_worked = {}
for group_name, gdf in holidays_df.groupby(name_col_name):
    gdf_sorted = gdf.sort_values(by=holiday_dt_col_name, ascending=False)
    current_date = current_dates[group_name]

    prev_date = current_date
    last_date_worked = None
    for ind, row in gdf_sorted.iterrows():
        holiday_date = row[holiday_dt_col_name]
        time_diff = holiday_date - prev_date
        if time_diff < -one_day:
            last_date_worked = holiday_date - (time_diff + one_day)
            break
        prev_date = holiday_date

    if last_date_worked is None:
        last_date_worked = prev_date - one_day

    last_dates_worked[group_name] = last_date_worked

print("Outcome:")
for person, last_date_worked in last_dates_worked.items():
    print(f'{person}: {last_date_worked}')
print()

Outcome:
A: 1900-06-25 00:00:00
B: 1900-06-24 00:00:00
C: 1900-06-26 00:00:00
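
The year 1900 in the outcome above is the default that `datetime.strptime` fills in when the format string has no year field. Below is a minimal sketch of one way to avoid it and to print the result in the question's original style. It assumes the dates belong to 2020 (taken from the post date, not from the data). `parse_day_month`, `format_day_month` and `ASSUMED_YEAR` are hypothetical names introduced only for this illustration.

```python
from datetime import datetime

# Assumption: the thread never states a year; 2020 is taken from the post date.
ASSUMED_YEAR = 2020


def parse_day_month(text, year=ASSUMED_YEAR):
    """Parse a 'DD-Mon' string after attaching an explicit year to it."""
    return datetime.strptime(f'{text}-{year}', '%d-%b-%Y')


def format_day_month(dt):
    """Render a datetime back in the question's 'DD-Mon' style."""
    return dt.strftime('%d-%b')


parsed = parse_day_month('25-Jun')
print(parsed.year)               # the assumed year, not 1900
print(format_day_month(parsed))  # same 'DD-Mon' style as the input
```

Attaching a real year also keeps date arithmetic meaningful around 29 February and across a year boundary, which a placeholder year of 1900 would not.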

【Discussion】:

【Solution 2】:

from datetime import datetime, timedelta
import pandas as pd

datetime_format = '%d-%b'

current_date_col_name = 'curr_date'
name_col_name = 'name'
holiday_col_name = 'holiday'


df1 = pd.DataFrame({
    name_col_name: ['A','B','C','D'],
    current_date_col_name: ['27-Jun', '27-Jun','27-Jun', '27-Jun'],
})


df2 = pd.DataFrame({
    name_col_name: ['A', 'A', 'B', 'B', 'B', 'C'],
    holiday_col_name: ['27-Jun', '26-Jun', '27-Jun', '25-Jun', '23-Jun', '27-Jun']
})


holiday_dt_col_name = 'holiday_datetime'
last_day_worked_col_name = 'last_day_worked'

# convert holiday days to datetime objects
df2[holiday_dt_col_name] = df2[holiday_col_name].apply(
    lambda dts: datetime.strptime(dts, datetime_format))

# get the day before each holiday day
df2[last_day_worked_col_name] = df2[holiday_dt_col_name] - timedelta(days=1)

# keep only the earliest day per person.
# last_worked will be a Series containing the day before each person's first holiday, as datetime objects
last_worked = df2.groupby(name_col_name)[last_day_worked_col_name].min()

# so now dump the datetimes to a Series of strings
last_worked_strs = last_worked.apply(lambda dt: datetime.strftime(dt, datetime_format))
last_worked_strs_df = pd.DataFrame(last_worked_strs)

# join dfs on name
joined_df = df1.join(last_worked_strs_df, on=name_col_name)

# fill current date into na cells
no_holiday_rows = joined_df[last_day_worked_col_name].isna()
joined_df.loc[no_holiday_rows, last_day_worked_col_name] = joined_df.loc[no_holiday_rows, current_date_col_name]

print(joined_df)

Output:

  name curr_date last_day_worked
0    A    27-Jun          25-Jun
1    B    27-Jun          22-Jun
2    C    27-Jun          26-Jun
3    D    27-Jun          27-Jun

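【Discussion】:

Thanks, but if you look, the last_day_worked for 'B' is not correct. It should be 26-Jun. Even though B took leave on the 25th, we don't need to consider that: as of the current date of 27-Jun, his last working date is 26-Jun, not 22-Jun.
Your request doesn't say why you would count some holiday days and not others. B took off June 23 and 25. If you want to programmatically ignore certain days, you will need another field or another rule.
I don't want to consider all the holidays, just get the employee's latest_working_date based on their holiday calendar.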
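
One reading of the rule described in the last comment is: start at the current date and step back one day at a time for as long as that day is in the employee's holiday list. The sketch below follows that reading on the data from the question. It is not either answerer's method. The year 2020 is an assumption taken from the post date, and `to_dt` and `latest_working_date` are hypothetical helper names.

```python
from datetime import datetime, timedelta

import pandas as pd

# Assumption: the data carries no year; 2020 is taken from the post date.
ASSUMED_YEAR = 2020


def to_dt(text):
    """Parse a 'DD-Mon' string using the assumed year."""
    return datetime.strptime(f'{text}-{ASSUMED_YEAR}', '%d-%b-%Y')


def latest_working_date(current, holidays):
    """Step back from `current` while the day is one of `holidays`."""
    day = current
    while day in holidays:
        day -= timedelta(days=1)
    return day


df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                    'CurrDate': ['27-Jun', '27-Jun', '27-Jun', '27-Jun']})
df2 = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B', 'C'],
                    'Holiday': ['27-Jun', '26-Jun', '27-Jun',
                                '25-Jun', '23-Jun', '27-Jun']})

# One set of holiday datetimes per employee.
holidays_by_name = {
    name: {to_dt(h) for h in group}
    for name, group in df2.groupby('Name')['Holiday']
}

# Employees with no rows in df2 (here 'D') get an empty set and keep CurrDate.
df1['CurrDate'] = [
    latest_working_date(to_dt(cur), holidays_by_name.get(name, set())).strftime('%d-%b')
    for name, cur in zip(df1['Name'], df1['CurrDate'])
]

print(df1)
```

With this rule B's holidays on 25-Jun and 23-Jun are ignored because 26-Jun is already a working day, and the resulting CurrDate column matches the DF1 the question asks for.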