[Posted]: 2024-10-26 10:30:02
[Problem description]:
I'm stuck on a data analysis project I'm working on.
Basically, suppose I have example CSV 'A':
id | item_num
A123 | 1
A123 | 2
B456 | 1
And example CSV 'B':
id | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...
If I do a merge with Pandas, the result looks like this:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
How can I get this instead:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb...
B456 | 1 | Its fleece...
Here is my code:
import pandas as pd
# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))
# Merge the two DataFrames on the column(s) they share, which here is just "id".
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))
# Let's do a "dedupe" to deal with an issue on how Pandas handles datetime merges
# I read about an issue where if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))
# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")
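I'd really appreciate any help; I'm completely stuck! I'm working with more than 20,000 rows.
Thanks.
Edit: My post was flagged as a possible duplicate. It isn't, because I'm not necessarily trying to add a column. I just want to keep each description from being multiplied by the number of item_num values attributed to a particular id.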
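To make the effect described in the edit concrete, here is a minimal sketch that uses the sample rows from the question instead of the real CSV files. It assumes the descriptions in 'B' are listed in the same order as the item_num rows in 'A' for each id, which the question implies but does not state. The helper column name seq is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (stand-ins for A.csv and B.csv).
first = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Merging on "id" alone is many-to-many: every A123 row in `first`
# is paired with every A123 row in `second`, giving 2*2 + 1*1 = 5 rows.
naive = pd.merge(first, second, on="id")

# Assumption: within each id, the n-th row of `first` belongs with the
# n-th row of `second`. `seq` (hypothetical name) numbers rows per id.
first_keyed = first.assign(seq=first.groupby("id").cumcount())
second_keyed = second.assign(seq=second.groupby("id").cumcount())

# Merging on (id, seq) pairs rows one-to-one; the helper column is then dropped.
paired = first_keyed.merge(second_keyed, on=["id", "seq"]).drop(columns="seq")
print(paired)
```

If the two files are not in matching order, this positional pairing would attach the wrong description, so it is only a sketch under that assumption.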
Update, 6/21:
How would I do the merge if the two DFs looked like this?
id | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum
And example CSV 'B':
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
So that I end up with:
id | item_num | other_col | description
A123 | 1 | lorem ipsum | Mary had a...
A123 | 2 | dolor sit | ...little lamb.
B456 | 1 | lorem ipsum | ...Its fleece...
Meaning the row with item_num 3, which has "amet, consectetur" in 'other_col', would be dropped.
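One way this updated case might be handled, sketched with the sample rows above rather than the real files: since both frames now carry id and item_num, an inner merge on that pair of columns keeps only the rows present in both. The validate argument is an optional guard that assumes each (id, item_num) pair appears at most once per frame.

```python
import pandas as pd

# Sample data copied from the update (stand-ins for the real CSV files).
first = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# An inner join on both key columns drops (A123, 3), which has no match in `second`.
# validate="one_to_one" raises pandas.errors.MergeError if either side has duplicate keys.
merged = first.merge(second, on=["id", "item_num"], how="inner", validate="one_to_one")
print(merged)
```

With how="left" instead, the item_num 3 row would be kept with a missing description, which is not what the update asks for.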
[Discussion]:
- It looks like you want concat or append, not merge.
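A rough sketch of what the concat suggestion could look like for the first pair of sample tables. It assumes both frames have the same number of rows in the same order, which is checked explicitly here; that assumption may not hold for the real 20,000-row files. Note also that DataFrame.append was removed in pandas 2.0, so concat is the option that still applies on current versions.

```python
import pandas as pd

# Sample data copied from the question (stand-ins for A.csv and B.csv).
first = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Assumption: row i of `first` corresponds to row i of `second`.
# Fail loudly if the id columns do not line up row for row.
if first["id"].tolist() != second["id"].tolist():
    raise ValueError("id columns differ; positional concat would mispair rows")

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([first, second[["description"]]], axis=1)
print(side_by_side)
```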
Tags: python python-2.7 python-3.x csv pandas