More efficient way to go through a .csv file? (Answers)

【Question title】: More efficient way to go through .csv file?
【Posted】: 2015-08-14 12:58:26
【Question description】:

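I'm trying to parse some dictionaries out of a .CSV file, using two lists from separate .txt files so the script knows what it is looking for. The idea is to find the rows in the .CSV file that match both a Word and an IDNumber, and then pull out a third variable when there is a match. However, the code runs really slowly. Any ideas on how to make it more efficient?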

import csv

IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'

WordsOfInterest_ReadIn = open(WordsOfInterest_filename).read().split('\n')
#IDNumberListtoRead = open(IDNumberList_filename).read().split('\n')

for CurrentIDNumber in open(IDNumberList_filename).readlines():
    for CurrentWord in open(WordsOfInterest_filename).readlines():
        FoundCurrent = 0

        with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if ((row['IDNumber'] == CurrentIDNumber) and (row['Word'] == CurrentWord)):
                    FoundCurrent = 1
                    CurrentProportion= row['CurrentProportion']

            if FoundCurrent == 0:
                CurrentProportion=0
            else:
                CurrentProportion=1
                print('found')

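【Question comments】:

Could you provide an example of how you want the output to look?
This code has O(mn) complexity, where m and n are the numbers of words and IDs in their respective files. No wonder it's really slow. Does it really need to check every possible combination of ID and word?
What is the point of CurrentProportion = row['CurrentProportion'] if it gets set to 0 or 1 before it is used?
How big are dictionary_WordsOfInterest.txt and IDs.txt? Can you read them in one go? If so, I would suggest storing them in a set() and using the in operator (i.e. a = set([1,2,3]); 1 in a). The average lookup time in a set is O(1).
Thanks... CurrentProportion = 1 is just a placeholder for now. I set CurrentProportion to zero because of the output I want: if there is no proportion in the file (because the PID and CurrentWord don't match), then I want to set it to 0.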

Tags: python list csv python-3.x

【Solution 1】:

First, consider loading the file dictionary_individualwords.csv into memory. I guess a Python dictionary is the right data structure for this case.
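As a rough illustration of that idea (not the answerer's own code), the CSV could be read once into a dictionary keyed on the (IDNumber, Word) pair, so that each later lookup is a single dictionary access. The helper names `load_proportions` and `get_proportion` and the sample rows are made up for this sketch; it also assumes each pair appears at most once in the file, since a later duplicate would overwrite an earlier one.

```python
import csv
import io


def load_proportions(csvfile):
    """Read the CSV once and index it by the (IDNumber, Word) pair."""
    lookup = {}
    for row in csv.DictReader(csvfile):
        lookup[(row['IDNumber'], row['Word'])] = row['CurrentProportion']
    return lookup


def get_proportion(lookup, id_number, word, default=0):
    """Return the stored proportion, or `default` when the pair is absent."""
    return lookup.get((id_number, word), default)


# Tiny in-memory stand-in for dictionary_individualwords.csv (made-up data).
sample = io.StringIO(
    "IDNumber,Word,CurrentProportion\n"
    "1001,apple,0.25\n"
    "1002,pear,0.10\n"
)
lookup = load_proportions(sample)

print(get_proportion(lookup, '1001', 'apple'))  # pair present in the CSV
print(get_proportion(lookup, '1001', 'pear'))   # pair absent, falls back to 0
```

With the real file, the `io.StringIO` object would be replaced by `open(Dictionary_filename, newline='', encoding='utf-8')`. Note that the csv module returns every field as a string, so the proportion would still need converting with `float()` before doing arithmetic on it.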

【Comments】:

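【Solution 2】:

You are opening the CSV file N times, where N = (# lines in IDS.txt) * (# lines in dictionary_WordsOfInterest.txt). If the file is not too large, you can avoid this by saving its contents to a dictionary or a list of lists.

In the same way, you are opening dictionary_WordsOfInterest.txt every time you read a new line of IDS.txt.

Also, it seems you are looking for any combination of the possible pairs (CurrentIDNumber, CurrentWord) coming from the txt files. You could, for example, store the IDs in one set and the words in another, and for each row in the csv file check whether both the ID and the word are in their respective sets.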
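A minimal sketch of the set-based approach might look like the following. The function names `read_lines_as_set` and `find_matches` and the sample data are assumptions made for illustration; the column names are taken from the question. Each .txt file is read once into a set, the CSV is scanned once, and pairs that never show up default to 0, as the asker described.

```python
import csv
import io


def read_lines_as_set(textfile):
    """Return the non-empty, stripped lines of a text file as a set."""
    return {line.strip() for line in textfile if line.strip()}


def find_matches(csvfile, ids, words):
    """Scan the CSV once; keep rows whose ID and word are both of interest."""
    matches = {}
    for row in csv.DictReader(csvfile):
        # Set membership tests take O(1) time on average.
        if row['IDNumber'] in ids and row['Word'] in words:
            matches[(row['IDNumber'], row['Word'])] = row['CurrentProportion']
    return matches


# Made-up stand-ins for IDs.txt, dictionary_WordsOfInterest.txt and the CSV.
ids = read_lines_as_set(io.StringIO("1001\n1002\n"))
words = read_lines_as_set(io.StringIO("apple\nplum\n"))
sample_csv = io.StringIO(
    "IDNumber,Word,CurrentProportion\n"
    "1001,apple,0.25\n"
    "1002,pear,0.10\n"
    "1003,apple,0.40\n"
)
matches = find_matches(sample_csv, ids, words)

# Report every (ID, word) pair; pairs not found in the CSV get 0.
for id_number in sorted(ids):
    for word in sorted(words):
        print(id_number, word, matches.get((id_number, word), 0))
```

Stripping each line matters here: lines returned by iterating over a file keep their trailing newline, so an unstripped '1001\n' would never equal the '1001' read from the CSV.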

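【Comments】:

Hi, thanks for the great suggestions. At least for this file, the ID and the Word will definitely both be in the sets; it is just a matter of finding them. I could probably sort them, though. You have clearly pointed me in the right direction on what is slowing the code down, so I will work along those lines.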

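【Solution 3】:

When you use readlines on the .txt files, you are already building an in-memory list from them. You should build those lists first and then parse the csv file only once. Something like: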

import csv

IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'

# Build both lists once. splitlines() drops the trailing '\n' that
# readlines() keeps, which would otherwise make every == comparison fail.
with open(IDNumberList_filename) as f:
    numberlist = f.read().splitlines()
with open(WordsOfInterest_filename) as f:
    wordlist = f.read().splitlines()

# Parse the CSV file only once.
with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for CurrentIDNumber in numberlist:
            for CurrentWord in wordlist:
                # Reset the flag for every (row, ID, word) check.
                FoundCurrent = 0

                if row['IDNumber'] == CurrentIDNumber and row['Word'] == CurrentWord:
                    FoundCurrent = 1
                    CurrentProportion = row['CurrentProportion']

                if FoundCurrent == 0:
                    CurrentProportion = 0
                else:
                    CurrentProportion = 1  # placeholder, as in the question
                    print('found')

Note: untested

【Comments】:

Thanks! I will try working through this and let you know. The .csv has >100,000 rows and 100 columns.