How to find missing date rows in a sequence using pandas? Answers

【Question title】: How to find missing date rows in a sequence using pandas?
【Posted】: 2019-11-18 16:35:22
【Question】:

I have a dataframe with more than 4 million rows and 30 columns. Below is just a sample of my patient dataframe.

import pandas as pd

df = pd.DataFrame({
    'subject_ID':[1,1,1,1,1,2,2,2,2,2,3,3,3],
    'date_visit':['1/1/2020 12:35:21','1/1/2020 14:35:32','1/1/2020 16:21:20','01/02/2020 15:12:37','01/03/2020 16:32:12',
                 '1/1/2020 12:35:21','1/3/2020 14:35:32','1/8/2020 16:21:20','01/09/2020 15:12:37','01/10/2020 16:32:12',
                 '11/01/2022 13:02:31','13/01/2023 17:12:31','16/01/2023 19:22:31'],
    'item_name':['PEEP','Fio2','PEEP','Fio2','PEEP','PEEP','PEEP','PEEP','PEEP','PEEP','Fio2','Fio2','Fio2']})

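I want to do two things:

1) Find the subjects whose dates are missing from the sequence, along with the missing records

2) Get the count of item_name for each subject

For q2, this is what I tried: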

df.groupby(['subject_ID','item_name']).count()  # this produces output, but the column name is off: why does the count show up under the `date_visit` column?

For q1, this is what I am trying:

df['day'].le(df['shift_date'].add(1))

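I expect my output to look like the following

【Comments】:

What is the definition of a missing date?
E.g. subject_id = 2 has records only for dates 1, 3, 8, 9, 10. From this we can infer that the records for dates 2, 4, 5, 6, 7 are missing.
If you look at subject_id = 1, you can see that he/she has records continuously. There is no break between the dates. That is why Seq_status = Yes, indicating that he/she is in sequence.
@Datanovice - updated the sample dataframe and expected output. There are minor changes.
@SSMK Do you want the missing dates themselves, or just their total count?

Tags: python python-3.x pandas dataframe pandas-groupby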

【Solution 1】:

You can get the first part (the item_name counts per subject) with:

In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name   Fio2  PEEP
subject_ID
1              2     3
2              0     5
3              3     0
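
On the question of why the asker's `count()` result appears under `date_visit`: `GroupBy.count()` reports the number of non-null values in every column that is not a grouping key, and `date_visit` is the only such column left. A minimal sketch on a cut-down, illustrative version of the sample frame, showing `size()` and `pd.crosstab` as alternatives (the column name `n_records` is my own choice, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "subject_ID": [1, 1, 1, 2, 2, 3],
    "date_visit": ["2020-01-01", "2020-01-01", "2020-01-02",
                   "2020-01-01", "2020-01-03", "2022-01-11"],
    "item_name": ["PEEP", "Fio2", "PEEP", "PEEP", "PEEP", "Fio2"],
})

# count() gives non-null counts for each non-grouping column,
# which here is only date_visit.
per_column = df.groupby(["subject_ID", "item_name"]).count()

# size() counts rows per group; reset_index(name=...) names the result.
counts = (
    df.groupby(["subject_ID", "item_name"])
      .size()
      .reset_index(name="n_records")  # "n_records" is a hypothetical name
)

# crosstab builds the wide subject-by-item table directly, filling 0 for absent pairs.
wide = pd.crosstab(df["subject_ID"], df["item_name"])
```

Which form is preferable depends on whether a long table (one row per subject and item) or a wide table (one column per item) is easier to join back onto the rest of the data.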

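Edit:

I think the date formats in your sample output are still somewhat confusing, so I strongly recommend switching everything to the ISO 8601 standard, since that prevents problems like this one. pandas could not parse the 11/01/2022 entry correctly on its own, so I fixed it manually in the example.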
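
As a sketch of what removing the ambiguity could look like, assuming every entry really is day/month/year (an assumption, since the sample mixes styles): passing an explicit `format` to `pd.to_datetime` applies one interpretation to every row instead of guessing per value. The three strings below are illustrative values, not the asker's data.

```python
import pandas as pd

raw = pd.Series([
    "11/01/2022 13:02:31",
    "13/01/2022 17:12:31",
    "16/01/2022 19:22:31",
])

# Explicit day/month/year format: 11/01/2022 is read as 11 January, not 1 November.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")

# Writing the values back out as ISO 8601 strings makes them unambiguous.
iso = parsed.dt.strftime("%Y-%m-%dT%H:%M:%S")
```

With an explicit format, a value that does not fit (for example a month/day/year entry with a day above 12 in the wrong slot) raises an error rather than being silently misread.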

With the dates as I have assumed them, you can find the gaps by grouping and using .resample():

In [73]: df['dates'] = pd.to_datetime(df['date_visit'])

In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")

In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())

In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
   subject_ID      dates
0           2 2020-01-02
1           2 2020-01-04
2           2 2020-01-05
3           2 2020-01-06
4           2 2020-01-07
5           3 2022-01-12
6           3 2022-01-14
7           3 2022-01-15
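
The same kind of gap list can also be built without resampling. The following is a sketch of an alternative approach, not the method used above: for each subject, generate the full daily range between the first and last visit and subtract the days that actually occur. The helper name `missing_days` and the small frame are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "subject_ID": [1, 1, 1, 2, 2, 2],
    "dates": pd.to_datetime([
        "2020-01-01 12:35:21", "2020-01-02 15:12:37", "2020-01-03 16:32:12",
        "2020-01-01 12:35:21", "2020-01-03 14:35:32", "2020-01-05 16:21:20",
    ]),
})

def missing_days(visits: pd.Series) -> pd.DatetimeIndex:
    """Return the calendar days between first and last visit with no record."""
    days = visits.dt.normalize()  # drop the time of day
    full = pd.date_range(days.min(), days.max(), freq="D")
    return full.difference(pd.DatetimeIndex(days.unique()))

# One (subject, day) row per missing calendar day.
rows = [
    (subject, day)
    for subject, visits in df.groupby("subject_ID")["dates"]
    for day in missing_days(visits)
]
missing = pd.DataFrame(rows, columns=["subject_ID", "dates"])
```

Because this only materializes the missing days rather than a full resampled copy of every column, it may be lighter on memory for a frame with millions of rows and 30 columns, though that would need to be measured on the real data.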

You can then add the seq status to the first frame by checking whether each ID appears in this new frame.
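
That last step might look like the sketch below. It assumes a frame of missing dates shaped like `Out[76]` (called `missing` here, a name I chose) and reuses the `Seq_status` column with the value Yes mentioned in the question comments; using "No" for subjects with gaps is my assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"subject_ID": [1, 1, 2, 2, 3]})
missing = pd.DataFrame({
    "subject_ID": [2, 2],
    "dates": pd.to_datetime(["2020-01-02", "2020-01-04"]),
})

# A subject is "in sequence" when its ID never appears among the missing dates.
has_gap = df["subject_ID"].isin(missing["subject_ID"])
df["Seq_status"] = np.where(has_gap, "No", "Yes")
```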

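【Comments】:

Hi, thanks for the response. Upvoted. For the second part, please refer to the dates in the sample dataframe. I would like to get the missing dates for each subject.
Yes, I did see the dates there, but the logic is unclear. I assume those are in European format, as with 13/01, but if subject 3 starts on January 11, why are the missing dates all in November? FWIW, I recommend converting all of these to ISO date format to make them unambiguous.
Sorry, I have updated the expected output and the sample dataframe. The date format is day/month/year.
Sorry for the delay. Marked as the answer.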