Prevent duplication of rows when performing merge: answers

【Question title】: Prevent duplication of rows when performing merge
【Posted】: 2024-10-26 10:30:02
【Question description】:

I've hit a wall on a data analysis project I'm working on.

Basically, if I have sample CSV 'A':

id   | item_num
A123 |     1
A123 |     2
B456 |     1

And I have sample CSV 'B':

id   | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...

If I perform a merge using Pandas, this is the result:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | Mary had a...
A123 |     1    | ...little lamb.
A123 |     2    | ...little lamb.
B456 |     1    | ...Its fleece...

How can I get it to look like this instead:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb.
B456 |     1    | ...Its fleece...

Here is my code:

import pandas as pd

# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))


# Merge the two DataFrames on their shared column(s).
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))

# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))

# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")

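I would really appreciate any help; I'm very stuck! I'm working with 20,000+ rows.

Thanks.

EDIT: My post was flagged as a possible duplicate. It isn't, because I'm not necessarily trying to add a column; I'm just trying to prevent the description from being multiplied by the number of item_num values attributed to a particular id.

UPDATE, 6/21:

How would I do the merge if the two DFs looked like this?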

id   | item_num | other_col
A123 |     1    | lorem ipsum
A123 |     2    | dolor sit
A123 |     3    | amet, consectetur
B456 |     1    | lorem ipsum

And I have sample CSV 'B':

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb.
B456 |     1    | ...Its fleece...

So that I end up with:

id   | item_num |  other_col  | description
A123 |     1    | lorem ipsum | Mary had a...
A123 |     2    | dolor sit   | ...little lamb.
B456 |     1    | lorem ipsum | ...Its fleece...

Meaning, the row with item_num 3 and "amet, consectetur" in "other_col" would be ignored.

【Question comments】:

Possible duplicate of Adding new column to existing DataFrame in Python pandas
It looks like you want concat or append, rather than merge.

Tags: python python-2.7 python-3.x csv pandas

【Solution 1】:

I would do it this way:

In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))

In [136]: result
Out[136]:
     id  item_num       description
0  A123         1     Mary had a...
1  A123         2   ...little lamb.
2  B456         1  ...Its fleece...

Explanation: we can create a "virtual" item_num column in the B DF to join on:

In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
     id       description  item_num
0  A123     Mary had a...         1
1  A123   ...little lamb.         2
2  B456  ...Its fleece...         1
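
As a self-contained sketch of the same idea: the frames below are rebuilt by hand from the sample CSVs in the question, and the explicit `on=` and `validate=` arguments are additions for illustration, not part of the answer above. It first shows why the plain merge grows (every A123 row in A is paired with every A123 row in B), then joins on both keys.

```python
import pandas as pd

# Sample frames mirroring CSV 'A' and CSV 'B' from the question.
A = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# A plain merge on 'id' pairs every A123 row in A with every A123 row in B,
# giving 2 * 2 + 1 = 5 rows.
naive = A.merge(B, on="id")
print(len(naive))

# Number the rows within each id (1, 2, ...) so B gets its own item_num,
# then join on both keys. validate="one_to_one" makes pandas raise a
# MergeError if either side still has duplicate (id, item_num) pairs.
B_numbered = B.assign(item_num=B.groupby("id").cumcount() + 1)
result = A.merge(B_numbered, on=["id", "item_num"], validate="one_to_one")
print(result)
```

This assumes the rows of B appear in the order in which they should be matched to item_num 1, 2, ... within each id; if that ordering is not guaranteed, B would need to be sorted first.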

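【Comments】:

I wish this worked for me, but it doesn't seem to include any data from one of the CSVs. In fact, the resulting CSV is just a copy of one of the CSVs.
@kabaname, are you sure you assigned the result of the merge back?
Never mind, I got it to produce a result, but it is still multiplying the rows so that the descriptions repeat for 1 and 2, just like in my example. In other words, Mary had a... repeats for both 1 and 2, and then ...little lamb. repeats as well. @maxu
Just thought I'd circle back and tell you that this worked great! Thanks for your help.
Hi @MaxU, I have an update to this question and was wondering if you could offer some insight?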
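
For the update in the question, where both frames already carry item_num, one possible approach is an inner merge on both keys. This is only a sketch under the assumption that the frames look exactly like the samples in the update; it is not taken from any answer in this thread.

```python
import pandas as pd

# Frames rebuilt by hand from the "UPDATE" samples in the question.
A = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Inner join on both keys: rows of A with no matching (id, item_num) in B,
# such as (A123, 3), are dropped from the result.
result = A.merge(B, on=["id", "item_num"], how="inner")
print(result)
```

If the unmatched row should be kept with an empty description instead, `how="left"` would do that.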

【Solution 2】:

I think you need concat

result = pd.concat([df1.set_index('id'), df2.set_index('id')],axis = 1).reset_index()

which gives

    id      item_num    description
0   A123    1           Mary had a...
1   A123    2           ...little lamb
2   B456    1           ...Its fleece...
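
Because `pd.concat(..., axis=1)` aligns rows on the index, the one-liner above appears to rely on both frames having the same `id` labels in the same order. A hedged variant, assuming rows should be paired by their position within each id, adds a per-id counter as a second index level so the index becomes unique. `with_position` and `pos` are hypothetical names introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
df2 = pd.DataFrame({
    "id": ["B456", "A123", "A123"],  # deliberately in a different row order
    "description": ["...Its fleece...", "Mary had a...", "...little lamb."],
})

def with_position(df):
    # 'pos' counts rows within each id (0, 1, 2, ...), so (id, pos) is unique.
    return df.assign(pos=df.groupby("id").cumcount()).set_index(["id", "pos"])

# join="inner" keeps only the (id, pos) pairs present in both frames.
result = (
    pd.concat([with_position(df1), with_position(df2)], axis=1, join="inner")
    .reset_index()
    .drop(columns="pos")
)
print(result)
```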

【Comments】:

I got ValueError: Shape of passed values is (13, 10799), indices imply (13, 6240)

【Solution 3】:

Try indexing your df, then dropping the duplicates:

df = df.set_index(['id', 'item_num']).drop_duplicates()
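
A small sketch of what this line does on the merged frame from the question (rebuilt by hand here as `merged`, which is an assumption about the input). `drop_duplicates` ignores the index, so once `id` and `item_num` are moved into it, only `description` is compared.

```python
import pandas as pd

# The five-row merge result shown in the question.
merged = pd.DataFrame({
    "id": ["A123", "A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 1, 2, 1],
    "description": ["Mary had a...", "Mary had a...",
                    "...little lamb.", "...little lamb.", "...Its fleece..."],
})

# set_index moves id and item_num out of the columns, so drop_duplicates
# compares only 'description' and keeps the first row for each distinct text.
deduped = merged.set_index(["id", "item_num"]).drop_duplicates()

# reset_index turns the index levels back into ordinary columns.
print(deduped.reset_index())
```

On this input the surviving item_num values are 1, 1, 1 rather than the desired 1, 2, 1, and the two key columns only reappear as regular columns after `reset_index()`.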

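【Comments】:

So I tried this, and it seems to have removed two columns and all of the data... but it does solve the duplication problem, since the remaining data is not duplicated the way it was before.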