【问题标题】:Replace duplicate value with NaN using groupby使用 groupby 将重复值替换为 NaN
【发布时间】:2021-04-28 10:24:57
【问题描述】:

数据集(MWE)

location    date        people_vaccinated   people_fully_vaccinated people_vaccinated_per_hundred
AL          12-01-2021  70861               7270                    1.45
AL          13-01-2021  74792               9245                    1.53
AL          14-01-2021  80480               11366                   1.64
AL          15-01-2021  86956               13488                   1.77
AL          16-01-2021  93797               14202                   1.91
AL          17-01-2021  100638              14917                   2.05
AS          22-01-2021  5627                940                     10.1
AS          23-01-2021  5881                948                     10.56
AS          24-01-2021  7096                948                     12.74
AS          25-01-2021  7096                949                     12.98
AS          26-01-2021  7230                950                     13.23
AS          27-01-2021  8133                950                     14.6

我正在尝试用 NaN 替换列 {people_vaccinated,people_fully_vaccinated,people_vaccinated_per_hundred} 中的重复项,同时在 location 上使用 groupby()。我在网上尝试了一些解决方案,但无法让它们为我工作,所以改为使用以下逻辑

def remove(df , a):
    df['duplicate'] = df[a].shift(1)
    df[a] = df.apply(lambda x: np.nan if x[a] == x['duplicate'] \
                            else x[a], axis=1)
    df = df.drop('duplicate', axis=1)
    return df

dfn = remove(dfn,'people_vaccinated')
dfn = remove(dfn,'people_fully_vaccinated')
dfn = remove(dfn,'people_vaccinated_per_hundred')

当您有连续的空值(超过 2 个)时,上述逻辑将失败。我需要用 NaN 替换重复项(同时保留第一个实例)。做这个的最好方式是什么?从上面的sn-p可以看出people_fully_vaccinated列有重复值

样本输出

location    date        people_vaccinated   people_fully_vaccinated people_vaccinated_per_hundred
AL          12-01-2021  70861               7270                    1.45
AL          13-01-2021  74792               9245                    1.53
AL          14-01-2021  80480               11366                   1.64
AL          15-01-2021  86956               13488                   1.77
AL          16-01-2021  93797               14202                   1.91
AL          17-01-2021  100638              14917                   2.05
AS          22-01-2021  5627                940                     10.1
AS          23-01-2021  5881                948                     10.56
AS          24-01-2021  7096                NaN                     12.74
AS          25-01-2021  NaN                 949                     12.98
AS          26-01-2021  7230                950                     13.23
AS          27-01-2021  8133                NaN                     14.6

【问题讨论】:

    标签: python pandas numpy pandas-groupby


    【解决方案1】:

    mask + duplicated

    我们可以定义一个列名列表,然后在 for 循环 mask 中为每个列创建每个唯一 location 的重复值

    cols = ['people_vaccinated', 'people_fully_vaccinated', 'people_vaccinated_per_hundred']
    for c in cols:
        df[c] = df[c].mask(df.duplicated(['location', c]))
    

       location        date  people_vaccinated  people_fully_vaccinated  people_vaccinated_per_hundred
    0        AL  12-01-2021            70861.0                   7270.0                           1.45
    1        AL  13-01-2021            74792.0                   9245.0                           1.53
    2        AL  14-01-2021            80480.0                  11366.0                           1.64
    3        AL  15-01-2021            86956.0                  13488.0                           1.77
    4        AL  16-01-2021            93797.0                  14202.0                           1.91
    5        AL  17-01-2021           100638.0                  14917.0                           2.05
    6        AS  22-01-2021             5627.0                    940.0                          10.10
    7        AS  23-01-2021             5881.0                    948.0                          10.56
    8        AS  24-01-2021             7096.0                      NaN                          12.74
    9        AS  25-01-2021                NaN                    949.0                          12.98
    10       AS  26-01-2021             7230.0                    950.0                          13.23
    11       AS  27-01-2021             8133.0                      NaN                          14.60
    

    【讨论】:

      【解决方案2】:

      这里尝试使用series.duplicated 创建一个布尔掩码。

      import numpy as np
      
      remove = lambda x: df[x].duplicated(keep='first')  # return boolean mask
      
      df.loc[remove('people_vaccinated'), 'people_vaccinated'] = np.NaN
      

      【讨论】:

        猜你喜欢
        • 2018-12-02
        • 2019-11-10
        • 2018-03-12
        • 1970-01-01
        • 2021-06-27
        • 2018-09-09
        • 2019-05-05
        • 2019-09-29
        • 2017-08-25
        相关资源
        最近更新 更多