Improving the speed of writing into a txt file in Python: answers

【Question title】: Improving the speed of writing into a txt file python
【Posted】: 2023-03-03 11:38:01
【Question description】:

I am generating a txt file based on the TF-IDF computation for each word.

I am using this code to write the file:

w_writer = open("tf_idf_vectors_stops_2.txt", "w")
for x in xrange(0, len(listPatient)):
    patientId = listPatient[x] #List for patientid
    for words in tdDict_final[patientId]:
        w_writer.write(patent + "," + str(multiListTokens.index(words[0])) + "," + str(words[2]))
        w_writer.write("\n")
w_writer.close()

listPatient is a list of ordered IDs.

 listPatient = ['001', '002', '003', '004']

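tdDict_final is a dictionary whose keys are the IDs and whose values are lists of words together with each word's value.

In the code, the word is referred to as words[0] and its value as words[2], because words[1] is always ":". The format of tdDict_final is shown below.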

 {'001': [('dog', ':', '0.2534879'), ('cat', ':', '0.0133487')],
  '002': [('floor', ':', '0.047589'), ('board', ':', '0.099345')],
  '003': [('key', ':', '0.04993')],
  '004': [('thanks', ':', '0.01479')]}

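tdDict_final contains all the patients in listPatient.

multiListTokens is a list containing many different vocabulary words (tokens).

multiListTokens contains every distinct vocabulary word found in tdDict_final.

The problem is that my code above is very slow when writing the file out.

Is there a way to make writing to the txt file more efficient with the code above?

Many thanks.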

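【Question discussion】:

So patients are getting patented these days?
How big are your list and dictionary? The double loop may be the actual cause of the slowness, rather than writing to disk.
Also, have you timed how long this piece of code takes: multiListTokens.index(words[0])? Just run it inside that double loop, without any writes or anything else, and let us know whether it is much faster or just as slow.
This code should write quickly unless multiListTokens is large. Is the output file correct?
len(multiListTokens) is 444032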
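
One way to act on the timing suggestion above is to run the double loop with only the `.index` lookup and no file writes. The following is a minimal sketch in Python 3, assuming small stand-in data shaped like the structures in the question; `time_index_lookups` is a hypothetical helper name, not something from the original post.

```python
import time

def time_index_lookups(list_patient, td_dict, tokens):
    """Run only the .index lookups from the double loop and time them."""
    start = time.perf_counter()
    indices = []
    for patient_id in list_patient:
        for words in td_dict[patient_id]:
            # list.index scans from the front, so each call is O(len(tokens))
            indices.append(tokens.index(words[0]))
    elapsed = time.perf_counter() - start
    return indices, elapsed

# Stand-in data (assumed for illustration, not the asker's real data)
listPatient = ['001', '002']
tdDict_final = {'001': [('dog', ':', '0.2534879'), ('cat', ':', '0.0133487')],
                '002': [('floor', ':', '0.047589')]}
multiListTokens = ['cat', 'dog', 'floor']

indices, elapsed = time_index_lookups(listPatient, tdDict_final, multiListTokens)
print("lookups: %d, seconds: %.6f" % (len(indices), elapsed))
```

If this loop alone takes about as long as the full script on the real data, the lookups are the likely bottleneck rather than the disk writes.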

Tags: python

【Solution 1】:

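with open("tf_idf_vectors_stops_2.txt", "w") as w_writer:
    # Iterate over the IDs directly instead of indexing with xrange
    for patientId in listPatient:
        for words in tdDict_final[patientId]:
            # words[0] is the token, words[2] its TF-IDF value; %s converts both to text
            w_writer.write("%s,%s,%s\n" % (patientId, multiListTokens.index(words[0]), words[2]))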

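First | You should use the with statement rather than opening the file and then closing it manually. The with statement uses a Python context manager, which means it opens the file as w_writer and then closes it automatically when you are done.

Second | There is no need for xrange above, because you do not use x for anything other than getting patientId from listPatient (patientId = listPatient[x]). You can pull patientId directly from listPatient and use it from there.

Third | Adding strings together with + is notoriously slow in Python. The most efficient way to concatenate (join) strings in Python is to use the join method or string formatting with placeholders (as I have done). Also, you should not call write twice, since you can fold the "\n" into the first write statement.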
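
The ways of building the output line mentioned in the third point can be put side by side. This is a small sketch in Python 3 with made-up sample values; it only shows that the variants produce the same string and how one might time them with `timeit`, without assuming which one is faster on a given machine.

```python
import timeit

# Made-up sample values standing in for one row of output
patient_id, token_index, value = '001', 42, '0.2534879'

def with_plus():
    return patient_id + "," + str(token_index) + "," + value + "\n"

def with_join():
    return ",".join([patient_id, str(token_index), value]) + "\n"

def with_format():
    return "%s,%s,%s\n" % (patient_id, token_index, value)

# All three variants build the same line
lines = [with_plus(), with_join(), with_format()]

for func in (with_plus, with_join, with_format):
    seconds = timeit.timeit(func, number=100000)
    print("%-12s %.4f s per 100000 calls" % (func.__name__, seconds))
```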

【Discussion】:

join versus + for five strings? That really will not make a significant difference. In any case, the suggested string-formatting rewrite is more elegant (though I prefer .format()).
Actually you would use just one join, like this: "".join([patientId, ",", str(multiListTokens.index(words[0])), ",", str(words[2]), "\n"]), and yes, it is noticeably faster.
I am not seeing that for 5 strings joined together: In [16]: %timeit "".join(["asdf", "qwer", "uiop", "zxcv", "vbnm"]) -- The slowest run took 14.50 times longer than the fastest. This could mean that an intermediate result is being cached -- 1000000 loops, best of 3: 388 ns per loop ==== In [17]: %timeit "asdf" + "qwer" + "uiop" + "zxcv" + "vbnm" -- The slowest run took 53.08 times longer than the fastest. This could mean that an intermediate result is being cached -- 100000000 loops, best of 3: 18.9 ns per loop.
Even taking the slowest run, to rule out caching, I get 14.5 * 388 = 5626 ns for the join method, versus 53.08 * 18.9 = 1003 ns for string concatenation with +.
@Niche.P Did this solve your speed problem? How much faster is your code now? To me, the .index method seems to be what causes the slowdown, and it is still there inside the double loop.
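
If the `.index` call is indeed the bottleneck, as the last comment suspects, one common remedy is to build a token-to-position dictionary once, so that each lookup no longer scans a list of 444032 entries. The sketch below is Python 3 and is not part of the original answer; `write_vectors` and `token_position` are hypothetical names, the data is a small stand-in, and it assumes the tokens in multiListTokens are distinct, as the question states.

```python
import io

def write_vectors(out, list_patient, td_dict, tokens):
    """Write 'patientId,tokenIndex,value' lines using a precomputed index map."""
    # Built once: maps each token to its position in the list (assumes distinct tokens)
    token_position = {token: i for i, token in enumerate(tokens)}
    for patient_id in list_patient:
        for words in td_dict[patient_id]:
            # Dictionary lookup replaces the linear scan done by list.index
            out.write("%s,%s,%s\n" % (patient_id, token_position[words[0]], words[2]))

# Stand-in data (assumed for illustration, not the asker's real data)
listPatient = ['001', '002']
tdDict_final = {'001': [('dog', ':', '0.2534879'), ('cat', ':', '0.0133487')],
                '002': [('floor', ':', '0.047589')]}
multiListTokens = ['cat', 'dog', 'floor']

buffer = io.StringIO()  # in-memory stand-in for the output file
write_vectors(buffer, listPatient, tdDict_final, multiListTokens)
output = buffer.getvalue()
print(output)
```

For a real run, the file object from `with open("tf_idf_vectors_stops_2.txt", "w") as w_writer` could be passed in place of the in-memory buffer.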