Best way to write to a file: answers

【Question title】: Best way to write to a file
【Posted】: 2021-09-08 20:17:18
【Question】:

Which way of writing data to a file is better?

# 1 way
whole_data = ""
for file_name in list_of_files:
    r_file = open(file_name, 'r')
    whole_data += r_file.read()
    r_file.close()
with open("destination_file.txt", 'w') as w_file:
    w_file.write(whole_data)


# 2 way
for file_name in list_of_files:
    r_file = open(file_name, 'r')
    with open("destination_file.txt", 'a') as w_file:
        w_file.write(r_file.read())
    r_file.close()

# separate open/close for write
w_file = open("destination_file.txt", 'w')
for file_name in list_of_files:
    with open(file_name, 'r') as r_file:
        w_file.write(r_file.read())
w_file.close()

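Way 1 first collects all the data into one big string and then writes it to the destination file. Way 2 reads from each file and immediately appends the data to the destination file. I have used both ways in my code, but I'm not sure which is better. Do you know the pros and cons of each? If you know a better practice, please share. // Edit: added a third way

【Comments】:

You can check which takes more time, reading or writing the files
using Python's timeit module
And put a with around r_file as well
Sure, I'll do that. But I'm not sure the result will hold for every file size.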

Tags: python python-2.7 file stream read-write

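【Answer 1】:

"The with statement automatically closes the file once execution leaves the with block, even if an error occurs. I strongly recommend using the with statement wherever possible, because it allows for cleaner code and makes handling unexpected errors easier."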
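To make the quoted behavior concrete, here is a minimal sketch, assuming Python 3. The scratch file path and the simulated error are made up for the demonstration and are not from the question.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

try:
    with open(path, "w") as w_file:
        w_file.write("partial data")
        # Simulate something going wrong in the middle of the block.
        raise ValueError("simulated failure inside the with block")
except ValueError:
    pass

# The file object was closed on the way out of the block, despite the error.
closed_after_error = w_file.closed
print(closed_after_error)
```

With a bare open()/close() pair, as in way 1 of the question, the close() call would be skipped when an exception is raised before it, unless it is wrapped in try/finally.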

check this out

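【Comments】:

【Answer 2】:

Intuitively, the second way "feels" faster, but you can always try both and time them.

【Comments】:

【Answer 3】:

The timeit module takes two strings: a statement (stmt) and a setup. It runs the setup code once, then runs the stmt code n times and reports the total time that took (divide by n for the average per run).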

import timeit

setup = '''
from __main__ import list_of_files  # defined earlier by the OP

def func_one():
    whole_data = ""
    for file_name in list_of_files:
        r_file = open(file_name, 'r')
        whole_data += r_file.read()
        r_file.close()
    with open("destination_file.txt", 'w') as w_file:
        w_file.write(whole_data)
'''

stmt = 'func_one()'

print(timeit.timeit(stmt, setup=setup, number=10))  # Total time taken to run this func 10 times

I write to the file 10 times so that timeit can measure a precise value rather than a rounded one.

Similarly, you can time the second way:

setup = '''
from __main__ import list_of_files  # defined earlier by the OP

def func_two():
    for file_name in list_of_files:
        r_file = open(file_name, 'r')
        with open("destination_file.txt", 'a') as w_file:
            w_file.write(r_file.read())
        r_file.close()
'''

stmt = 'func_two()'

print(timeit.timeit(stmt, setup=setup, number=10))  # Total time taken to run this func 10 times

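Then you can compare the times that are printed.

I know this is overkill. But sometimes you can't tell which is faster just by looking at the code.

【Comments】:

This is a very interesting way to measure time. I wonder how timeit.timeit() knows what is in "list_of_files"?
list_of_files is defined in the code before this point. It is defined by the OP.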
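On the question of how timeit sees list_of_files: timeit runs the statement in its own namespace, so names from your script have to be handed over explicitly. A small sketch, assuming Python 3.5+ (where the globals parameter exists). The scratch files and the helper names read_all and open_and_read are invented for the example.

```python
import os
import tempfile
import timeit

# Hypothetical input: three small files in a scratch directory.
tmp_dir = tempfile.mkdtemp()
list_of_files = []
for n in range(3):
    name = os.path.join(tmp_dir, "in_%d.txt" % n)
    with open(name, "w") as f:
        f.write("line %d\n" % n)
    list_of_files.append(name)


def open_and_read(name):
    with open(name, "r") as r_file:
        return r_file.read()


def read_all():
    # Stands in for the code being timed; it uses list_of_files directly.
    return "".join(open_and_read(name) for name in list_of_files)


# Option A: pass a namespace that contains the names used in the statement.
# In a plain script, globals=globals() does the same job.
namespace = {"read_all": read_all}
t_namespace = timeit.timeit("read_all()", globals=namespace, number=100)

# Option B: pass a callable instead of a string, so no name lookup is needed.
t_callable = timeit.timeit(read_all, number=100)

print(t_namespace, t_callable)
```

On older Python versions without the globals parameter, a setup string such as "from __main__ import read_all" serves the same purpose when the code runs as a script.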

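【Answer 4】:

With a limited number of small files I doubt you will notice any difference. But if you use the first method on many very large files, you will consume a lot of memory for essentially no reason, so the second method is definitely more scalable.

That said, you probably don't need to reopen (and implicitly close) the output file on every iteration, which may slow things down depending on the operating system, disk/network performance, and so on. You can refactor the code like this: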

with open("destination_file.txt", 'a') as w_file:
    for file_name in list_of_files:
        with open(file_name, 'r') as r_file:
            w_file.write(r_file.read())
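If individual input files are themselves large, r_file.read() still loads one whole file at a time. One possible variation, not part of the original answer, is to let shutil.copyfileobj copy in fixed-size chunks. The scratch files below are hypothetical, and the destination is opened with 'w' rather than 'a' so the example gives the same result on every run.

```python
import os
import shutil
import tempfile

# Hypothetical input files in a scratch directory.
tmp_dir = tempfile.mkdtemp()
list_of_files = []
for n in range(3):
    name = os.path.join(tmp_dir, "in_%d.txt" % n)
    with open(name, "w") as f:
        f.write("content of file %d\n" % n)
    list_of_files.append(name)

destination = os.path.join(tmp_dir, "destination_file.txt")

# Open the destination once, then stream each source file into it.
with open(destination, "w") as w_file:
    for file_name in list_of_files:
        with open(file_name, "r") as r_file:
            # Copies chunk by chunk instead of one big read().
            shutil.copyfileobj(r_file, w_file)

with open(destination, "r") as f:
    combined = f.read()
print(combined)
```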

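【Comments】:

I think this is the safest solution, because there is no risk of running out of memory. It also has good error handling, for example for a typo in a file name from file_list.

【Answer 5】:

This is a best-effort test harness, but I can't stress enough how little it proves in reality. Over 3-4 runs (10K trials each), each method came out ahead at least once, and only by 0.1-0.2 s (over 10K trials!). That said, I'm running some IO-heavy ML models on my workstation, so others may produce more reliable data. In any case, I'd say this is a matter of syntax preference, and performance is not the main concern.

I made a few syntax changes (nesting the with statements where appropriate) and, after setting up some files, moved each method into a function. If you change the number of lines in each file, as @gimix suggests, you may also see different numbers. As his answer points out, the whole-data method also uses a lot of memory unnecessarily, so that may be the deciding factor for writing clean, performant, future-proof code.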

import timeit

test_files = []

for n in range(100):
    file_name = f'test_file_{n}.txt'
    with open(file_name, 'w') as f:
        for i in range(10):
            f.write(f'{i}\n')
        test_files.append(file_name)


def whole_data():
    data = ""
    for file in test_files:
        with open(file, 'r') as fr:
            data += fr.read()
        
    with open('whole_data_file.txt', 'w') as fw:
        fw.write(data)


def file_by_file():
    with open('line_by_line_file.txt', 'w') as fw:
        for file in test_files:
            with open(file, 'r') as fr:
                fw.write(fr.read())


print('Whole data method:', timeit.timeit("whole_data()", globals=globals(), number=10_000))
# Whole data method: 10.38545351603534
# Whole data method: 10.356000136991497

print('File by file method:', timeit.timeit("file_by_file()", globals=globals(), number=10_000))
# File by file method: 10.356590001960285
# File by file method: 10.507033439003862

Note that if you are not running on an SSD, all of the above may take more than a minute (I'm using a high-performance NVMe SSD).
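Since the timings are so close, the memory argument may be the more useful thing to measure. The following is a rough sketch of how that could be done with the standard tracemalloc module. The file count, file size, output names and the peak_bytes helper are all assumptions chosen for the example, and tracemalloc only sees allocations made through Python's allocator, so treat the numbers as indicative.

```python
import os
import tempfile
import tracemalloc

# Hypothetical test data: 20 files of 100,000 characters each.
tmp_dir = tempfile.mkdtemp()
test_files = []
for n in range(20):
    name = os.path.join(tmp_dir, "test_file_%d.txt" % n)
    with open(name, "w") as f:
        f.write("x" * 100_000)
    test_files.append(name)


def whole_data(out_path):
    data = ""
    for file in test_files:
        with open(file, "r") as fr:
            data += fr.read()
    with open(out_path, "w") as fw:
        fw.write(data)


def file_by_file(out_path):
    with open(out_path, "w") as fw:
        for file in test_files:
            with open(file, "r") as fr:
                fw.write(fr.read())


def peak_bytes(func, out_path):
    # Returns the peak traced memory (in bytes) while func runs.
    tracemalloc.start()
    func(out_path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


whole_path = os.path.join(tmp_dir, "whole_data_file.txt")
by_file_path = os.path.join(tmp_dir, "file_by_file.txt")
peak_whole = peak_bytes(whole_data, whole_path)
peak_by_file = peak_bytes(file_by_file, by_file_path)
print("Whole data peak:", peak_whole, "File by file peak:", peak_by_file)
```

The whole-data version has to hold every file's contents at once, so its peak should grow with the total input size, while the file-by-file version should stay near the size of the largest single file.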

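【Comments】:

The files were probably too small. I will use your code with larger files.

【Answer 6】:

OK, I ran some tests. The results were not what I expected. I had assumed the += operator in the first test would be the first thing to cause memory problems. In fact, memory problems occurred only with the second "way" of saving the file. I wanted to make the operation a bit more complex, so I added a character replacement on the input files.

The only expected result is the time difference between way 1 and way 3 (separate). For only 30 files: way 1: 0.26499 s; way 2: 0.648 s; way 3: 0.242 s.

Way 2 raised "MemoryError" with more than 30 files, so I excluded it from the test.

For all files (222 large files) the result is fairly predictable: way 1: 39 s; way 3: 1.577 s.

Code: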

from os import listdir
from os.path import isfile, join
import time

my_path = r"path_to_files"
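# keep the directory in each name so open() works when my_path is not the current directory
list_of_files = [join(my_path, f) for f in listdir(my_path) if isfile(join(my_path, f))]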
print(len(list_of_files))

repeat = 20


if True:  # 39.381000042 sec
    # truncate destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 1 way
        whole_data = ""
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                tmp = r_file.read().replace('d', 'A')
                whole_data += tmp
        with open("destination_file.txt", 'w') as w_file:
            w_file.write(whole_data)

    print(time.time() - now)  # print time elapsed
    # --------------- 1 way ---------------


if True:  # MemoryError
    # truncate destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 2 way
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                with open("destination_file.txt", 'a') as w_file:
                    tmp = r_file.read().replace('d', 'A')  # MemoryError
                    w_file.write(tmp)

    print(time.time() - now)  # print time elapsed
    # --------------- 2 way ---------------


if True:  # 1.53500008583 sec
    # truncate destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # separate open/close for write
        w_file = open("destination_file.txt", 'w')
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                tmp = r_file.read().replace('d', 'A')
                w_file.write(tmp)
        w_file.close()

    print(time.time() - now)  # print time elapsed
    # --------------- separate ---------------
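All three variants above still call r_file.read() on a whole input file before replacing characters. A possible way to keep memory flat regardless of file size is to process line by line. This is only a sketch: it assumes the inputs are text files with line breaks, it assumes Python 3, and the scratch files stand in for the real inputs.

```python
import os
import tempfile

# Hypothetical input files in a scratch directory.
tmp_dir = tempfile.mkdtemp()
list_of_files = []
for n in range(3):
    name = os.path.join(tmp_dir, "in_%d.txt" % n)
    with open(name, "w") as f:
        f.write("abcd\ndddd\n")
    list_of_files.append(name)

destination = os.path.join(tmp_dir, "destination_file.txt")

with open(destination, "w") as w_file:
    for file_name in list_of_files:
        with open(file_name, "r") as r_file:
            # Iterating over the file yields one line at a time,
            # so only a single line is held in memory.
            for line in r_file:
                w_file.write(line.replace("d", "A"))

with open(destination, "r") as f:
    result = f.read()
print(result)
```

Replacing a single character per line gives the same output as replacing on the whole text. A multi-character pattern that can span a line break would need a different chunking strategy.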

【Comments】:

【Answer 7】:

It looks like the second way is safer and faster when there are many files.

from os import listdir
from os.path import isfile, join
import timeit

my_path = r"./"
list_of_files = [f for f in listdir(my_path) if isfile(join(my_path, f))]
test_files = list_of_files
print(len(test_files), "files ~6.5kB per file")


def whole_data(amount):
    data = ""
    for file in test_files[:amount]:
        with open(file, 'rb') as fr:
            data += str(fr.read())
        
    with open('whole_data_file.txt', 'w') as fw:
        fw.write(data)


def file_by_file(amount):
    with open('line_by_line_file.txt', 'w') as fw:
        for file in test_files[:amount]:
            with open(file, 'rb') as fr:
                fw.write(str(fr.read()))


# all files are taken from the game Witcher 3
# 100 files ~6.5kB per file
print('Whole data method 20/number=100:', timeit.timeit("whole_data(20)",  globals=globals(), number=100))
print('Whole data method 100/number=20:', timeit.timeit("whole_data(100)", globals=globals(), number=20 ))
# Whole data method 20/number=100: 285.6315555
# Whole data method 100/number=20: 495.45210849999995

print('File by file method 20/number=100:', timeit.timeit("file_by_file(20)",  globals=globals(), number=100))
print('File by file method 100/number=20:', timeit.timeit("file_by_file(100)", globals=globals(), number=20 ))
# File by file method 20/number=100: 212.43927700000006
# File by file method 100/number=20: 205.07520319999992
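One thing to be aware of in the benchmark above: in Python 3, calling str() on the bytes returned by a file opened with 'rb' produces the bytes repr (with a b'...' wrapper and escape sequences), not the original content. The timings are still comparable because both functions do the same conversion. If the goal is an exact concatenation of binary files, a sketch like the following may be closer to what is wanted. The file names and contents here are invented for the example.

```python
import os
import tempfile

# Hypothetical binary input files in a scratch directory.
tmp_dir = tempfile.mkdtemp()
test_files = []
for n in range(3):
    name = os.path.join(tmp_dir, "asset_%d.bin" % n)
    with open(name, "wb") as f:
        f.write(bytes([n, 255, 10]))
    test_files.append(name)

out_path = os.path.join(tmp_dir, "combined.bin")

# Read and write in binary mode so the bytes pass through unchanged.
with open(out_path, "wb") as fw:
    for file in test_files:
        with open(file, "rb") as fr:
            fw.write(fr.read())

with open(out_path, "rb") as f:
    combined = f.read()

# For comparison: str() on bytes gives the repr, not decoded text.
as_text = str(b"\x00\xff\n")
print(len(combined), as_text)
```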

【Comments】: