【Title】: Filter DataFrame based on matched values in a column, and on min/max timestamps of those matched values
【Posted】: 2026-01-30 10:55:01
【Question】:

I have a list of email addresses that I want to find matches for in an ordered dictionary, which I have converted to a DataFrame.

Here is my list of email addresses:

email_list = ['c@aol.com','g@aol.com','b@aol.com','a@aol.com']

Here is my dictionary turned into a DataFrame (df2):

      sender   type                       _time
0  c@aol.com  email  2020-12-09 19:45:48.013140
1  c@aol.com  email  2020-13-09 19:45:48.013140
2  g@aol.com  email  2020-12-09 19:45:48.013140
3  b@aol.com  email  2020-14-11 19:45:48.013140

I want to create a new DataFrame that shows a column of matched senders, the number of matches (count), a first-seen date, and a last-seen date, all grouped by matched sender. The first-seen date is the minimum timestamp in the _time column for that matched sender, and the last-seen value is the maximum timestamp in the _time column for that matched sender.

Example output after the script runs would look like this:

      sender  count   type                  first_seen                   last_seen
0  c@aol.com      2  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140
1  g@aol.com      1  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140
2  b@aol.com      1  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140
3  a@aol.com      0  email                          NA                          NA

Here is my Python so far:

#Collect list of email addresses I want to find in df2
email_list = ['c@aol.com','g@aol.com','b@aol.com','a@aol.com']

# Turn email list into a dataframe
df1 = pd.DataFrame(email_list, columns=['sender'])

# Collect the table that holds the dictionary of emails sent
email_result_dict = {'sender': ['c@aol.com','c@aol.com','g@aol.com','b@aol.com'], 'type': ['email','email','email','email'], '_time': ['2020-12-09 19:45:48.013140','2020-13-09 19:45:48.013140','2020-12-09 19:45:48.013140','2020-14-09 19:45:48.013140']}

# Turn dictionary into dataframe
df2 = pd.DataFrame.from_dict(email_result_dict)

# Calculate stats
c = df2.loc[df2['sender'].isin(df1['sender'].values)].groupby('sender').size().reset_index()
output = df1.merge(c, on='sender', how='left').fillna(0)
output['first_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmin)] # Get the earliest value in '_time' column
output['last_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmax)] # Get the latest value in '_time' column

# Set the columns of the new dataframe
output.columns = ['sender', 'count','first_seen', 'last_seen']

Any ideas or suggestions on how to get the expected output into a DataFrame? I've tried everything and keep getting stuck on producing the first_seen and last_seen values for each match with a count greater than 0.
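For reference, the whole target computation can be sketched end to end. This is a minimal sketch (the names `stats` and `output` are chosen here, not from the snippet above); note the sample timestamps such as `2020-13-09` are not valid calendar dates, so this compares them as plain strings rather than converting with `pd.to_datetime`:

```python
import pandas as pd

email_list = ['c@aol.com', 'g@aol.com', 'b@aol.com', 'a@aol.com']
df1 = pd.DataFrame({'sender': email_list})
df2 = pd.DataFrame({
    'sender': ['c@aol.com', 'c@aol.com', 'g@aol.com', 'b@aol.com'],
    'type': ['email'] * 4,
    '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
              '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140'],
})

# One aggregation per sender: row count plus min/max of the timestamp strings.
stats = (df2.groupby('sender')
            .agg(count=('_time', 'size'),
                 first_seen=('_time', 'min'),
                 last_seen=('_time', 'max'))
            .reset_index())

# Left-merge onto the full lookup list so unmatched senders keep a row.
output = df1.merge(stats, on='sender', how='left')
output['count'] = output['count'].fillna(0).astype(int)
output['type'] = 'email'
```

Unmatched addresses (a@aol.com here) come out with count 0 and NaN in first_seen/last_seen, matching the NA cells in the expected output.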

【Comments】:

    Tags: python python-3.x pandas dataframe


    【Solution 1】:

    I believe this code will do the trick.

    Creating the data points:

        data = pd.DataFrame()
        data['sender'] = ['c@aol.com','c@aol.com','g@aol.com','b@aol.com']
        data['type'] = 'email'
        data['_time'] = ['2020-12-09 19:45:48.013140','2020-13-09 19:45:48.013140',
                         '2020-12-09 19:45:48.013140','2020-14-11 19:45:48.013140']
    

    Create a new df with the expected columns:

        new_data = pd.DataFrame(columns=['count','first_seen','last_seen','sender','type'])
        new_data['sender'] = list(set(data['sender'].values)) # data from input df
        new_data['type'] = 'email' # constant
    

    Loop over the unique list of senders:

        for j in new_data['sender']:
            temp_data = data[data['sender'] == j] # data with only a particular sender
            new_data.loc[new_data['sender'] == j, 'count'] = len(temp_data) # count

            if len(temp_data) > 1: # if multiple timings for a sender
                timings = list(set(temp_data['_time'])) # get all possible timings for sender
                new_data.loc[new_data['sender'] == j, 'first_seen'] = min(timings)
                new_data.loc[new_data['sender'] == j, 'last_seen'] = max(timings)

            elif len(temp_data) == 1: # if a single timing per sender
                new_data.loc[new_data['sender'] == j, 'first_seen'] = new_data.loc[new_data['sender'] == j, 'last_seen'] = temp_data.iloc[0]['_time']
    

    You will find the desired format in the new_data df.
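    One caveat: because new_data is seeded from the senders present in `data`, addresses with no matches (a@aol.com in the question) never get a row. A sketch of the same loop-based approach, seeded from the lookup list instead so unmatched senders appear with count 0 (names as in the answer above):

```python
import pandas as pd

data = pd.DataFrame({
    'sender': ['c@aol.com', 'c@aol.com', 'g@aol.com', 'b@aol.com'],
    'type': 'email',
    '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
              '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140'],
})
email_list = ['c@aol.com', 'g@aol.com', 'b@aol.com', 'a@aol.com']

# Seed from the lookup list so unmatched senders get a row with count 0.
new_data = pd.DataFrame({'sender': email_list, 'type': 'email'})
new_data['count'] = 0

for j in new_data['sender']:
    temp_data = data[data['sender'] == j]  # rows for this sender only
    new_data.loc[new_data['sender'] == j, 'count'] = len(temp_data)
    if len(temp_data) > 0:
        timings = list(temp_data['_time'])
        new_data.loc[new_data['sender'] == j, 'first_seen'] = min(timings)
        new_data.loc[new_data['sender'] == j, 'last_seen'] = max(timings)
```

    Senders with no rows in `data` keep NaN in first_seen/last_seen, which lines up with the NA cells in the question's expected output.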

    【Comments】:

      【Solution 2】:

      Based on your input df, you can do a Groupby.agg:

      In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()
      
      In [1191]: res
      Out[1191]: 
            sender   type                       _time                                  
                                                  min                         max count
      0  b@aol.com  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140     1
      1  c@aol.com  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140     2
      2  g@aol.com  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140     1
      

      EDIT: To remove the nested columns, do this:

      In [1206]: res.columns = res.columns.droplevel()
      
      In [1207]: res
      Out[1207]: 
                                                  min                         max  count
      0  b@aol.com  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
      1  c@aol.com  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
      2  g@aol.com  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
      

      EDIT-2: Using df1 as well:

      In [1246]: df = df1.merge(df, how='left')
      In [1254]: df.type = df.type.fillna('email')
      
      In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()
      
      In [1260]: res.columns = res.columns.droplevel()
      
      In [1261]: res
      Out[1261]: 
                                                  min                         max  count
      0  a@aol.com  email                         NaN                         NaN      0
      1  b@aol.com  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
      2  c@aol.com  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
      3  g@aol.com  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
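
      Note that `droplevel()` also blanks out the `sender` and `type` labels (their second level is the empty string), which is why those headers are missing in the output above. A variant sketch using named aggregation on the `_time` column, which yields flat, descriptive column names directly and renames min/max to the question's first_seen/last_seen:

```python
import pandas as pd

email_list = ['c@aol.com', 'g@aol.com', 'b@aol.com', 'a@aol.com']
df1 = pd.DataFrame({'sender': email_list})
df2 = pd.DataFrame({
    'sender': ['c@aol.com', 'c@aol.com', 'g@aol.com', 'b@aol.com'],
    'type': 'email',
    '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
              '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140'],
})

# Same shape as EDIT-2: left-merge so every address in df1 keeps a row.
df = df1.merge(df2, how='left')
df['type'] = df['type'].fillna('email')

# Named aggregation: flat columns, no droplevel needed.
# 'count' counts non-NaN values, so unmatched senders come out as 0.
res = (df.groupby(['sender', 'type'])['_time']
         .agg(first_seen='min', last_seen='max', count='count')
         .reset_index())
```

      The result has the columns sender, type, first_seen, last_seen, count, with NaN timestamps and a count of 0 for a@aol.com.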
      

      【Comments】:

      • This works, but is there a way to make the min, max, and count columns their own columns instead of having them nested under _time?
      • @CoderGuru Please see my updated answer under EDIT.
      • Hi, this does exactly what you said; however, it doesn't filter by email_list or df1. It only scans itself for duplicates. For example, a@aol.com should be in the output with a count of 0. Is there any way to achieve that?
      • @CoderGuru Please see my updated answer under EDIT-2.