Importing strings from multiple CSVs into a master CSV in Python

【Question title】: Importing strings from multiple csvs to a master csv in python
【Posted】: 2018-02-04 17:12:23
【Question description】:

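I have a lot of csv files that contain strings. Using Python 3, I want to import the strings from multiple csvs into a master csv, while making sure that no duplicates of what the master csv already contains are added.

I have written some code, but I'm not sure how to write what I print to the master csv, or how to check for duplicates.

My current code is: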

output = []
f = open('example.csv', 'r')
for line in f:
    cells = line.split(",")
    output.append(cells[3])

f.close()

print(output)

Any help would be greatly appreciated.

Thanks in advance.

【Comments on the question】:

Tags: python csv duplicates export-to-csv

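【Solution 1】:

The answer really depends on how big those CSV files are, i.e. how many words you expect to end up with in the master CSV. Based on that, you can write more or less optimized Python code.

First of all, you should have provided some example data, because from what is shown, you take the strings from the fourth column (cells[3]) and put them into the output list.

One solution could look like this: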

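from csv import reader

#  placeholder paths: 'example.csv' comes from the question, 'master.csv' is assumed
MASTER_CSV_FILE = 'master.csv'
NEW_CSV_FILE = 'example.csv'

words = set()

#  open master CSV file in case it already exists and load all words
#  now, this is the part where you didn't give an example of how master CSV should look like
#  I'll assume its just a word per line text file
try:
    with open(MASTER_CSV_FILE, 'r') as f:
        for line in f:
            words.add(line.rstrip('\n'))  # a set has add(), not append()
except FileNotFoundError:
    pass  # no master file yet, so start with an empty set

with open(NEW_CSV_FILE, 'r', newline='') as f:
    for columns in reader(f):
        if len(columns) > 3:  # skip rows that have no fourth column
            words.add(columns[3])

#  here again, I'll just write word per line in MASTER_CSV_FILE
#  (a set is unordered, so the order of the output lines is arbitrary)
with open(MASTER_CSV_FILE, 'w') as f:
    for word in words:
        f.write(word + '\n')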

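My answer is based on the following assumptions:

the master CSV file is really a text file with one word per line (for lack of an example),
each new CSV file always has at least four comma-separated values per row, since the code reads index 3,
you only want to deduplicate the words, not count how often each one occurs.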
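
Since the question is about several source files rather than one, here is a minimal sketch of the same idea extended to a list of files. It keeps the assumptions above (one value per line in the master, values taken from index 3). The function name merge_column_into_master and its parameters are made up for illustration and are not from the original answer.

```python
import csv
import os


def merge_column_into_master(source_paths, master_path, column=3):
    """Append values from `column` of each source CSV that the master lacks.

    Returns the newly added values in the order they were found.
    """
    seen = set()
    # Load what the master already holds (assumed: one value per line).
    if os.path.exists(master_path):
        with open(master_path, newline='', encoding='utf-8') as f:
            seen = {row[0].strip() for row in csv.reader(f) if row}

    added = []
    for path in source_paths:
        with open(path, newline='', encoding='utf-8') as f:
            for row in csv.reader(f):
                if len(row) <= column:
                    continue  # skip short or blank rows
                value = row[column].strip()
                if value and value not in seen:
                    seen.add(value)
                    added.append(value)

    # Append mode leaves the existing master rows untouched.
    with open(master_path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([value] for value in added)
    return added
```

It could be called as merge_column_into_master(glob.glob('*.csv'), 'master.csv') after import glob, provided the master file itself is not matched by the pattern. Appending instead of rewriting the whole file keeps the existing order of the master, unlike writing out a set.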

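【Discussion】:

Sorry for not being clear. I am extracting URLs and adding them to a fairly large master csv. I want no duplicate URLs in the master, and I want the extracted URLs removed from the original csv.
I guess you are building a crawler bot, in which case you should think about what happens if the bot stops unexpectedly. Anyway, unless you are storing other data, you don't need the CSV format to store URLs in the master file.
I additionally store the date and time, which is what causes the headache, so that I can search the csv. Is there another way to do this more efficiently?
With a large number of URLs in the master CSV, and possibly very long ones, you will use a lot of CPU cycles and memory, because all of them have to be loaded before you can tell by string matching whether a URL is new or already processed. You could use a hash function, e.g. md5, and save the URL's hex digest in an additional column. That way you can build an in-memory dictionary keyed by the hash and check whether a new URL has already been processed by looking up its md5 hex digest.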
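
The digest idea from the last comment could be sketched roughly as follows. The row layout (URL, timestamp, md5 hex digest) and the helper names url_digest, load_digests and append_new_urls are assumptions for illustration. The thread does not say how the master csv is actually laid out.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone


def url_digest(url):
    """Return the md5 hex digest of a URL (a lookup key, not a security measure)."""
    return hashlib.md5(url.encode('utf-8')).hexdigest()


def load_digests(master_path):
    """Build a digest -> URL dictionary from the rows already in the master."""
    digests = {}
    if os.path.exists(master_path):
        with open(master_path, newline='', encoding='utf-8') as f:
            for row in csv.reader(f):
                if len(row) >= 3:  # assumed layout: url, timestamp, digest
                    digests[row[2]] = row[0]
    return digests


def append_new_urls(master_path, urls):
    """Append a row for every URL whose digest is not yet in the master."""
    digests = load_digests(master_path)
    added = []
    with open(master_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for url in urls:
            digest = url_digest(url)
            if digest in digests:
                continue  # already processed
            digests[digest] = url
            writer.writerow([url, datetime.now(timezone.utc).isoformat(), digest])
            added.append(url)
    return added
```

Each lookup compares fixed-length 32-character keys regardless of how long the URLs are. For a master file of moderate size, a plain set of the URL strings themselves may be just as practical.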

【Solution 2】:

Here is another approach that might work for you.

import os

import pandas as pd

# Read the contents of every csv file in the current directory.
# I'm assuming all the csv's have the same data format.
# DataFrame.append was removed in pandas 2.0, so the frames are
# collected in a list and combined with pd.concat, which raises
# ValueError if no csv files are found.
frames = [pd.read_csv(f) for f in os.listdir() if f.endswith(".csv")]
df = pd.concat(frames, ignore_index=True)

# Eliminate the duplicates. This will use the values in
# all the columns of the DataFrame to determine whether
# a particular row is a duplicate.
df.drop_duplicates(inplace=True)

Then, if needed, you can use df.to_csv() to write the DataFrame back to a csv file.

Hope that helps.

【Discussion】:
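
If the rows also carry a date and time, as the asker mentions in the comments on Solution 1, comparing all columns will not catch the same URL fetched at different times. A small sketch of deduplicating on one column only is shown here. The column names url and fetched_at and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the combined csv contents.
df = pd.DataFrame({
    "url": ["http://a", "http://b", "http://a"],
    "fetched_at": ["2018-02-01", "2018-02-02", "2018-02-03"],
})

# Compare only the "url" column and keep the first occurrence of each URL.
deduped = df.drop_duplicates(subset="url", keep="first")

# Without a path, to_csv returns the csv text; index=False omits the row index.
csv_text = deduped.to_csv(index=False)
```

Passing a path instead, e.g. deduped.to_csv("master.csv", index=False), writes the result to a file. The file name here is only a placeholder.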