Updating a Pandas dataframe with new data whilst retaining existing ID number

【Question title】: Updating a Pandas dataframe with new data whilst retaining existing ID number
【Posted】: 2020-01-03 04:36:48
【Question description】:

I have a Pandas dataframe that can be represented as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4],
                   'gp':['a','a','b','b'],
                   'meta':['one','two','three','four'],
                   'matchvar':['wwww','w ww w','xxxx','xyxx'],
                   'match':[np.nan,'yes',np.nan,'no']})

...which looks like:

   id gp   meta matchvar match
0   1  a    one     wwww   NaN
1   2  a    two   w ww w   yes
2   3  b  three     xxxx   NaN
3   4  b   four     xyxx    no

The data can be grouped on the 'gp' column using groupby:

for g in df.groupby(['gp']):
    print(g[1])

   id gp meta matchvar match
0   1  a  one     wwww   NaN
1   2  a  two   w ww w   yes

   id gp   meta matchvar match
2   3  b  three     xxxx   NaN
3   4  b   four     xyxx    no

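If the last row of a group contains 'yes' in the 'match' column, I want to keep only that last row, but its 'id' column needs to be updated with the value from the previous row.

If the last row of a group contains 'no' in the 'match' column, both rows need to be kept without any changes.

This can be summarised as: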

   id gp meta matchvar match
0   1  a  one     wwww   NaN  --> [row discarded]
1   2  a  two   w ww w   yes  --> 1   1  a  two   w ww w   yes [N.B. id from previous row]

...and:

   id gp   meta matchvar match
2   3  b  three     xxxx   NaN  --> 2   3  b  three     xxxx   NaN
3   4  b   four     xyxx    no  --> 3   4  b   four     xyxx    no

The expected output is therefore a dataframe with the following structure:

   id gp   meta matchvar match
1   1  a    two   w ww w   yes
2   3  b  three     xxxx   NaN
3   4  b   four     xyxx    no

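I can use .last() to keep the last row of a group, but I don't know how to carry over the 'id' value from the previous row.

Any suggestions would be greatly appreciated.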

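【Question discussion】:

Do the 'yes' and 'no' values in the 'match' column only appear in the last row of a group?
Each group contains at most 2 rows, and the last (i.e. second) row holds the 'yes' or 'no' value.
Are there any groups with only one row?
Some groups may have only one row, but I can filter those out beforehand if necessary.
In that case, just separate the 'yes' groups from the 'no' groups, process the ids of the 'yes' groups to pick up the previous row's value, and concat them back together. I posted a solution using this logic.

Tags: python pandas group-by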

【Solution 1】:

Following your logic, and using only vectorized methods to keep the code efficient, we can do the following:

mask_yes = df['match'].eq('yes') # array with True for rows with 'yes'
mask_no = df['match'].eq('no')   # array with True for rows with 'no'

# if the row is 'yes', get the shifted id, else the original id
df['id'] = np.where(mask_yes, df['id'].shift(), df['id']) 

# if a group has 'no' mark all rows as True so we can keep the whole group
mask = df.assign(indicator=mask_no).groupby('gp')['indicator'].transform('any')
# filter on groups with 'no' or only the row 'yes'
df = df[mask | mask_yes]

    id gp   meta matchvar match
1  1.0  a    two   w ww w   yes
2  3.0  b  three     xxxx   NaN
3  4.0  b   four     xyxx    no
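
The ids in the output above come out as floats (1.0, 3.0, 4.0) because shift() puts a NaN in the first row, and the shift is taken over the whole frame rather than within each group. The following is a minimal sketch of one possible variation, not part of the original answer: it shifts inside each 'gp' group and casts back to integers. The names prev_id, new_id, keep and result are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'gp': ['a', 'a', 'b', 'b'],
                   'meta': ['one', 'two', 'three', 'four'],
                   'matchvar': ['wwww', 'w ww w', 'xxxx', 'xyxx'],
                   'match': [np.nan, 'yes', np.nan, 'no']})

mask_yes = df['match'].eq('yes')  # True for rows with 'yes'
mask_no = df['match'].eq('no')    # True for rows with 'no'

# id of the previous row within the same group; NaN on each group's first row
prev_id = df.groupby('gp')['id'].shift()

# use the previous id only on 'yes' rows that actually have a predecessor,
# then cast back to int because the shifted values are floats
new_id = df['id'].where(~mask_yes | prev_id.isna(), prev_id).astype('int64')

# keep every row of a group containing 'no', plus the 'yes' rows themselves
keep = mask_no.groupby(df['gp']).transform('any') | mask_yes
result = df.assign(id=new_id)[keep]
print(result)
```

With this variation a 'yes' row that is alone in its group keeps its own id instead of borrowing one from the neighbouring group. Rows from groups that contain neither 'yes' nor 'no' are still dropped, as in the answer above.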

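【Discussion】:

【Solution 2】:

As you confirmed in the comments that each group has 2 rows, you can try the following logic: create a mask m to separate the 'no' groups from the 'yes' groups. Process the ids of the 'yes' groups, then use drop_duplicates to select the last row of each group and concat to combine the results.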

m = df.match.eq('no').groupby(df.gp).transform('any')
df_yes = (df.assign(id=df.id.shift(fill_value=0))[~m]
            .drop_duplicates('gp', keep='last'))
df_final = pd.concat([df_yes, df[m]])

Out[108]:
   id gp   meta matchvar match
1   1  a    two   w ww w   yes
2   3  b  three     xxxx   NaN
3   4  b   four     xyxx    no
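
Note that concat places the collapsed 'yes' rows first, so the row order above matches the input only because group 'a' happens to precede group 'b'. As a rough sketch, assuming two rows per group as stated, the same logic can be wrapped in a function (collapse_yes_groups is a hypothetical name, and the sample data is made up) with a final sort_index() to restore the original order:

```python
import numpy as np
import pandas as pd

def collapse_yes_groups(df):
    """Hypothetical helper: the logic of this answer plus a final sort_index()."""
    # True for every row of a group that contains a 'no'
    m = df['match'].eq('no').groupby(df['gp']).transform('any')
    # groups without a 'no': shift ids down one row, keep the last row per group
    df_yes = (df.assign(id=df['id'].shift(fill_value=0))[~m]
                .drop_duplicates('gp', keep='last'))
    # concat puts the collapsed rows first, so sort to restore the input order
    return pd.concat([df_yes, df[m]]).sort_index()

# made-up data in which the 'no' group comes first
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'gp': ['a', 'a', 'b', 'b'],
                   'match': [np.nan, 'no', np.nan, 'yes']})
out = collapse_yes_groups(df)
print(out)
```

Using fill_value=0 in shift() keeps the 'id' column as integers, which is why this approach does not produce the float ids seen in Solution 1.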

【Discussion】: