Pandas drop duplicates does not behave as expected (answers)

【Question title】: Pandas drop duplicates does not behave as expected
【Posted】: 2019-11-22 17:15:15
【Question】:

I have a pandas Series whose index contains many duplicates, and I am using drop_duplicates to make its index usable for further slicing of other Series/DataFrames:

In[1]: test
Out[1]: 
5575    21010210
5575    21010210
5577    21010210
5577    21010210
5577    21010210
5583    21010210
5583    21010210
5583    21010210
5586    21010210
5586    21010210
5586    21010210
8545    21010210
8545    21010210
8718    21000102
8718    21000102
8721    21000102
8721    21000102
Name: CC, dtype: object

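When I call test.drop_duplicates(), I expect every existing index label to be kept, just without repeats. For some reason, pandas does not recognize some of these labels as duplicates and simply drops them from the result: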

In[2]: test.drop_duplicates()
Out[2]: 
5575    21010210
8718    21000102
Name: CC, dtype: object

Oddly, if I reset the index first, drop_duplicates works as expected:

In[3]: test.reset_index().drop_duplicates()
Out[3]: 
    index        CC
0    5575  21010210
2    5577  21010210
5    5583  21010210
8    5586  21010210
11   8545  21010210
13   8718  21000102
15   8721  21000102

Why does pandas simply drop some of the index labels in this operation? How can I efficiently remove these duplicates without resetting the index?

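【Comments】:

Do you want to remove duplicates in the index or in the values? These are two different operations. pd.DataFrame.drop_duplicates removes duplicated values (optionally only in specific columns); the method for removing duplicated index labels is pd.Index.drop_duplicates.
You want test.index.drop_duplicates. In your first scenario you are dropping duplicates in the column values; note that index labels 5575 through 8545 all have the same value.
When you reset the index, you pass two columns to drop_duplicates as the criteria; before resetting the index, you were passing only a single column of values.
@Erfan That worked. My only question: test.index.drop_duplicates() returns an Index object. Is there a way to get this behavior on the Series itself? Two possible approaches are test.reset_index().drop_duplicates().set_index() and test.index.drop_duplicates().join(). Are there others?
You want: test[~test.index.duplicated()]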

Tags: python pandas

【Solution 1】:

Here is your pandas Series object:

import pandas as pd

data = [
    21010210, 21010210, 21010210, 21010210, 21010210, 21010210, 
    21010210, 21010210,  21010210, 21010210, 21010210, 21010210, 
    21010210, 21000102, 21000102, 21000102, 21000102
]

idx = [
    5575, 5575, 5577, 5577, 5577, 5583, 5583, 5583, 
    5586, 5586, 5586, 8545, 8545, 8718, 8718, 8721, 8721
]

series = pd.Series(data, index=idx).rename("CC")

print(series)

>>>
5575    21010210
5575    21010210
5577    21010210
5577    21010210
5577    21010210
5583    21010210
5583    21010210
5583    21010210
5586    21010210
5586    21010210
5586    21010210
8545    21010210
8545    21010210
8718    21000102
8718    21000102
8721    21000102
8721    21000102
Name: CC, dtype: int64

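Now, if you run drop_duplicates(), it ignores your index and compares only the values. As the pandas documentation puts it:

"Return DataFrame with duplicate rows removed, optionally only considering certain columns. Indexes, including time indexes are ignored."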

print(series.drop_duplicates())

5575    21010210
8718    21000102
Name: CC, dtype: int64
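
To see why only two rows survive, it can help to look at the boolean mask behind the call. The following is a minimal sketch that rebuilds the same Series and assumes the default keep="first"; the name value_mask is only illustrative. Series.duplicated() flags every value that has already appeared, regardless of the index label it sits under.

```python
import pandas as pd

idx = [5575, 5575, 5577, 5577, 5577, 5583, 5583, 5583,
       5586, 5586, 5586, 8545, 8545, 8718, 8718, 8721, 8721]
data = [21010210] * 13 + [21000102] * 4
series = pd.Series(data, index=idx, name="CC")

# True for every value that already appeared earlier in the Series;
# the index labels play no part in this comparison.
value_mask = series.duplicated()

# Filtering with the inverted mask keeps the same rows as drop_duplicates().
by_value = series[~value_mask]
print(by_value)
print(by_value.equals(series.drop_duplicates()))
```

Only the first occurrence of each of the two distinct values is kept, which is why the labels 5577 through 8545 and 8721 disappear.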

Finally, reset_index() returns a DataFrame in which the previous index is inserted as a column and the index itself is reset:

print(series.reset_index())
    index        CC
0    5575  21010210
1    5575  21010210
2    5577  21010210
3    5577  21010210
4    5577  21010210
5    5583  21010210
6    5583  21010210
7    5583  21010210
8    5586  21010210
9    5586  21010210
10   5586  21010210
11   8545  21010210
12   8545  21010210
13   8718  21000102
14   8718  21000102
15   8721  21000102
16   8721  21000102

"Reset the index of the DataFrame, and use the default one instead."

This means that drop_duplicates() will now consider both columns.

print(series.reset_index().drop_duplicates())
    index        CC
0    5575  21010210
2    5577  21010210
5    5583  21010210
8    5586  21010210
11   8545  21010210
13   8718  21000102
15   8721  21000102
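
If a Series is wanted back rather than a DataFrame, the round trip mentioned in the comments can be completed with set_index. This is a hedged sketch, not a recommendation: it removes duplicated (index, value) pairs, so it matches index-based deduplication only when each label always carries the same value, as happens to be true for this data. The names round_trip, mixed and mixed_result are illustrative, and mixed holds hypothetical values.

```python
import pandas as pd

idx = [5575, 5575, 5577, 5577, 5577, 5583, 5583, 5583,
       5586, 5586, 5586, 8545, 8545, 8718, 8718, 8721, 8721]
data = [21010210] * 13 + [21000102] * 4
series = pd.Series(data, index=idx, name="CC")

# reset_index() names the former index column "index" because the
# original index has no name; rename_axis(None) removes that name again.
round_trip = (
    series.reset_index()
          .drop_duplicates()
          .set_index("index")["CC"]
          .rename_axis(None)
)
print(round_trip)

# Hypothetical data: one label carrying two different values.
# Both (10, 1) and (10, 2) are distinct rows, so label 10 survives twice.
mixed = pd.Series([1, 2, 2], index=[10, 10, 10], name="CC")
mixed_result = mixed.reset_index().drop_duplicates().set_index("index")["CC"]
print(mixed_result)
```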

The most efficient way to do this is:

print(series.loc[~series.index.duplicated()])
5575    21010210
5577    21010210
5583    21010210
5586    21010210
8545    21010210
8718    21000102
8721    21000102
Name: CC, dtype: int64
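
Index.duplicated also accepts a keep argument that controls which occurrence of each label survives. The sketch below uses hypothetical values (not the data from the question) so that the difference between the options is visible; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: label 1 appears three times with different values.
s = pd.Series([10, 11, 12, 20], index=[1, 1, 1, 2], name="CC")

first = s[~s.index.duplicated(keep="first")]      # keep the first row per label
last = s[~s.index.duplicated(keep="last")]        # keep the last row per label
unique_only = s[~s.index.duplicated(keep=False)]  # drop every repeated label

print(first.tolist())        # [10, 20]
print(last.tolist())         # [12, 20]
print(unique_only.tolist())  # [20]

# Grouping on the index level is an alternative to keep="first".
# Unlike the mask, it sorts the labels by default and takes the first
# non-missing value in each group.
grouped_first = s.groupby(level=0).first()
print(grouped_first.tolist())  # [10, 20]
```

The mask-based form preserves the original row order, which is one reason to prefer it when the index is not already sorted.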

【Comments】: