【Question title】:Python - DataFrame - Finding data from one dataframe in another
【Posted】:2020-10-17 15:51:27
【Question】:

I have two dataframes. DF1 is the main dataframe, and DF2 lists the holidays each employee has taken in the current month.

DF1=pd.DataFrame({'Name': ['A','B','C','D'],
   'CurrDate': ['27-Jun', '27-Jun','27-Jun', '27-Jun']})



DF2=pd.DataFrame({'Name': ['A','A','B','B','B','C'],'Holiday': ['27-Jun', '26-Jun','27-Jun','25-Jun','23-Jun','27-Jun']  })

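I want to compare 'CurrDate' in DF1 with 'Holiday' in DF2. If 'CurrDate' falls on a holiday, DF1 should be updated to the date before the holiday. So DF1 would look like: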

DF1=pd.DataFrame({'Name': ['A','B','C','D'], 'CurrDate': ['25-Jun', '26-Jun','26-Jun', '27-Jun']})

I'm struggling to put the dataframes into a loop.

【Comments】:

    Tags: python pandas numpy loops dataframe


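    【Solution 1】:

    Here's a quick and dirty model that correctly handles gaps in the holiday schedule. It doesn't handle the edge cases where a given user has no holidays, or where the first recorded holiday falls before the current date, but I'll leave those to you. The basics are all here.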

    from datetime import datetime, timedelta
    import pandas as pd
    
    datetime_format = '%d-%b'
    str2dt = lambda dts: datetime.strptime(dts, datetime_format)
    
    current_date_col_name = 'curr_date'
    name_col_name = 'name'
    holiday_col_name = 'holiday'
    
    df1 = pd.DataFrame({
        name_col_name: ['A','B','C','D'],
        current_date_col_name: ['27-Jun', '27-Jun','27-Jun', '27-Jun'],
    })
    
    # assuming "current_date" can vary by person
    # if not, you can just ignore df1
    current_dates = {
        row[name_col_name]: str2dt(row[current_date_col_name]) for ind, row in df1.iterrows()
    }
    
    holidays_df = pd.DataFrame({
        name_col_name: ['A', 'A', 'B', 'B', 'B', 'B', 'C'],
        holiday_col_name: ['27-Jun', '26-Jun', '27-Jun', '26-Jun', '25-Jun', '22-Jun', '27-Jun']
    })
    
    
    holiday_dt_col_name = 'holiday_datetime'
    last_day_worked_col_name = 'last_day_worked'
    
    
    # convert holiday days to datetime objects
    holidays_df[holiday_dt_col_name] = holidays_df[holiday_col_name].apply(str2dt)
    
    one_day = timedelta(days=1)
    
    last_dates_worked = {}
    for group_name, gdf in holidays_df.groupby(name_col_name):
        # walk each person's holidays from the most recent one backwards
        gdf_sorted = gdf.sort_values(by=holiday_dt_col_name, ascending=False)
        current_date = current_dates[group_name]
    
        prev_date = current_date
        last_date_worked = None
        for ind, row in gdf_sorted.iterrows():
            holiday_date = row[holiday_dt_col_name]
            time_diff = holiday_date - prev_date
            # a gap of more than one day means a working day sits between
            # this holiday and the previous one
            if time_diff < -one_day:
                # equivalent to prev_date - one_day
                last_date_worked = holiday_date - (time_diff + one_day)
                break
            prev_date = holiday_date
    
        # no gap found: the last working day is the day before the earliest holiday
        if last_date_worked is None:
            last_date_worked = prev_date - one_day
    
        last_dates_worked[group_name] = last_date_worked
    
    print("Outcome:")
    for person, last_date_worked in last_dates_worked.items():
        print(f'{person}: {last_date_worked}')
    print()
    
    Outcome:
    A: 1900-06-25 00:00:00
    B: 1900-06-24 00:00:00
    C: 1900-06-26 00:00:00
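
The two edge cases that the answer leaves open (a person with no holidays, and a first recorded holiday that falls before the current date) can be covered by stepping back from the current date one day at a time while that date is a holiday. The following is only a minimal sketch of that idea, not the answerer's code: the helper names `parse` and `last_day_worked` are hypothetical, and the year 2020 is an assumption, since the data contains only day and month.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2020  # assumption: the source data has no year


def parse(dts):
    """Parse a 'DD-Mon' string, attaching the assumed year."""
    return datetime.strptime(f'{dts}-{ASSUMED_YEAR}', '%d-%b-%Y')


def last_day_worked(current_date, holidays):
    """Step back one day at a time while the date is in the holiday set."""
    day = current_date
    while day in holidays:
        day -= timedelta(days=1)
    return day


# same data as in the answer above
current_dates = {name: parse('27-Jun') for name in 'ABCD'}
holiday_strs = {
    'A': ['27-Jun', '26-Jun'],
    'B': ['27-Jun', '26-Jun', '25-Jun', '22-Jun'],
    'C': ['27-Jun'],
}
holiday_sets = {name: {parse(d) for d in days} for name, days in holiday_strs.items()}

result = {
    # people without any recorded holiday (here 'D') get an empty set
    name: last_day_worked(cur, holiday_sets.get(name, set())).strftime('%d-%b')
    for name, cur in current_dates.items()
}
print(result)
```

With this data, A, B and C get the same dates as in the outcome above, and D, who has no holidays, simply keeps the current date. A holiday that is not adjacent to the current date never enters the loop, so it is ignored.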
    

    【Comments】:

      【Solution 2】:
      from datetime import datetime, timedelta
      import pandas as pd
      
      datetime_format = '%d-%b'
      
      current_date_col_name = 'curr_date'
      name_col_name = 'name'
      holiday_col_name = 'holiday'
      
      
      df1 = pd.DataFrame({
          name_col_name: ['A','B','C','D'],
          current_date_col_name: ['27-Jun', '27-Jun','27-Jun', '27-Jun'],
      })
      
      
      df2 = pd.DataFrame({
          name_col_name: ['A', 'A', 'B', 'B', 'B', 'C'],
          holiday_col_name: ['27-Jun', '26-Jun', '27-Jun', '25-Jun', '23-Jun', '27-Jun']
      })
      
      
      holiday_dt_col_name = 'holiday_datetime'
      last_day_worked_col_name = 'last_day_worked'
      
      # convert holiday days to datetime objects
      df2[holiday_dt_col_name] = df2[holiday_col_name].apply(
          lambda dts: datetime.strptime(dts, datetime_format))
      
      # get the day before each holiday day
      df2[last_day_worked_col_name] = df2[holiday_dt_col_name] - timedelta(days=1)
      
      # keep only the earliest day per person.
      # last_worked will be a series containing the day before each person's earliest holiday, as datetime objects
      last_worked = df2.groupby(name_col_name)[last_day_worked_col_name].min()
      
      # so now dump the datetimes to a series of strings
      last_worked_strs = last_worked.apply(lambda dt: datetime.strftime(dt, datetime_format))
      last_worked_strs_df = pd.DataFrame(last_worked_strs)
      
      # join dfs on name
      joined_df = df1.join(last_worked_strs_df, on=name_col_name)
      
      # fill current date into na cells
      no_holiday_rows = joined_df[last_day_worked_col_name].isna()
      joined_df.loc[no_holiday_rows, last_day_worked_col_name] = joined_df.loc[no_holiday_rows, current_date_col_name]
      
      print(joined_df)
      
      Output:
      
        name curr_date last_day_worked
      0    A    27-Jun          25-Jun
      1    B    27-Jun          22-Jun
      2    C    27-Jun          26-Jun
      3    D    27-Jun          27-Jun
      

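      【Comments】:

      • Thanks, but if you look, the last_day_worked for 'B' is incorrect. It should be 26-Jun. Even though B was off on the 25th, we don't need to take that into account: with a current date of 27-Jun, his last working date is 26-Jun, not 22-Jun.
      • Your request doesn't explain why you would count some holiday days and not others. B took off on 23-Jun and 25-Jun. If you want to programmatically ignore some days, you need another field or another rule.
      • I don't want to consider all the holidays, just get the employee's latest_working_date based on their holiday calendar.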
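
One possible reading of the rule described in these comments: only the unbroken run of holidays ending on the current date counts, and the latest working date is the day before that run. Below is a rough pandas sketch under that assumption, using the data from the question. The helper columns `curr`, `hol`, `rank` and `in_run` are made up for illustration, and the year 2020 is assumed because the data has none.

```python
import pandas as pd

DF1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                    'CurrDate': ['27-Jun', '27-Jun', '27-Jun', '27-Jun']})
DF2 = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B', 'C'],
                    'Holiday': ['27-Jun', '26-Jun', '27-Jun', '25-Jun', '23-Jun', '27-Jun']})

year = '2020'  # assumption: the source data has no year
DF1['curr'] = pd.to_datetime(DF1['CurrDate'] + '-' + year, format='%d-%b-%Y')
DF2['hol'] = pd.to_datetime(DF2['Holiday'] + '-' + year, format='%d-%b-%Y')

# attach each person's current date to their holidays, drop future and duplicate days
merged = DF2.merge(DF1[['Name', 'curr']], on='Name')
merged = merged[merged['hol'] <= merged['curr']].drop_duplicates(['Name', 'hol'])

# rank holidays per person, most recent first (0, 1, 2, ...)
merged = merged.sort_values(['Name', 'hol'], ascending=[True, False])
merged['rank'] = merged.groupby('Name').cumcount()

# the i-th most recent holiday belongs to the unbroken run ending on the
# current date exactly when it equals current date minus i days
merged['in_run'] = merged['hol'] == merged['curr'] - pd.to_timedelta(merged['rank'], unit='D')

# run length per person; people without holidays (here 'D') get 0
run_len = merged.groupby('Name')['in_run'].sum()
days_off = DF1['Name'].map(run_len).fillna(0).astype(int)

DF1['CurrDate'] = (DF1['curr'] - pd.to_timedelta(days_off, unit='D')).dt.strftime('%d-%b')
print(DF1[['Name', 'CurrDate']])
```

On the question's data this gives 25-Jun, 26-Jun, 26-Jun and 27-Jun for A, B, C and D, which is the DF1 the question asks for. B's holidays on the 25th and 23rd are ignored because the 26th is a working day that breaks the run.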