How to resume reading a csv file? Answers

【Question title】: How to resume reading a csv file?
【Posted】: 2019-04-07 15:15:24
【Question description】:

import csv

with open('test.csv', 'r') as f:
   reader = csv.reader(f)
   for i in reader:
      print(i)

CSV

id,name
001,jane
002,winky
003,beli
...

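So far the program only ever reads the csv from the start. If it is restarted, it begins again at the first row, 001. If the program stopped reading at 002, how can I make the next run continue from 003?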

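【Question discussion】:

You have to save your progress in another file.
Does that mean writing the contents to another file and then comparing the two files?
Possible duplicate of CSV read specific row
The simplest way is to save the number of rows you have already read and skip that many rows the next time you start.
@mrHOT It's not a duplicate. Please read the question first.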
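
The "skip that many rows" idea from the comment above could be sketched roughly like this. It only covers the skipping half; how the count gets saved between runs is left open, and read_rows_after and rows_done are names made up for this example.

```python
import csv
from itertools import islice


def read_rows_after(csv_path, rows_done):
    """Return the rows of csv_path that come after the first rows_done rows."""
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        # islice consumes and discards the first rows_done rows of the reader.
        return list(islice(reader, rows_done, None))
```

With the question's file, read_rows_after('test.csv', 3) would skip the header, 001 and 002, so the first row returned is 003. Note that the header line counts as a row here.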

Tags: python python-3.x csv

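【Solution 1】:

To do this, you would have to keep saving the current position in another file every time a row is read from the CSV file, which of course adds some overhead.

I think creating a Context Manager Type for use with the with statement would be a very good way to solve this, and it can keep the overhead down to some extent.

The code below implements a context manager for reading CSV files. It allows a file to be read in full or, if reading is interrupted before the whole file has been read (within the context of a with statement), to be resumed automatically.

This is done by creating a separate "state" file that tracks the last row successfully read. If no exception occurs during reading, the state file is deleted; if one does occur, it is kept. The next time the file is read, the existing state file is detected and used so that reading starts where it previously stopped.

Notably, because each resumable CSV reader is a separate object, you can create and use more than one at a time. While a CSV file is being read, its associated "state" file stays open, so there is no need to open and close it again every time its contents are updated.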

import csv
import os

class ResumableCSVReader:

    def __init__(self, filename):
        self.filename = filename
        self.state_filename = filename + '.state'
        self.csvfile = None
        self.statefile = None

    def __enter__(self):
        self.csvfile = open(self.filename, 'r', newline='')

        try:  # Open and read state file
            with open(self.state_filename, 'r', buffering=1) as statefile:
                self.start_row = int(statefile.read())

        except FileNotFoundError: # No existing state file.
            self.start_row = 0

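        self.statefile = open(self.state_filename, 'w', buffering=1)
        # 'w' mode truncates the file, so record the resume point again right
        # away in case an exception occurs before the next row is read.
        self.statefile.write(str(self.start_row) + '\n')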

        return _CSVReaderContext(self)

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.csvfile:
            self.csvfile.close()
        if self.statefile:
            self.statefile.close()
            if not exc_type:  # No exception?
                os.remove(self.state_filename) # Delete state file.


class _CSVReaderContext:

    def __init__(self, resumable):
        self.resumable = resumable
        self.reader = csv.reader(self.resumable.csvfile)

        # Skip to start row.
        for _ in range(self.resumable.start_row):
            next(self.reader)

        self.current_row = self.resumable.start_row

    def __iter__(self):
        return self

    def __next__(self):
        self.current_row += 1
        row = next(self.reader)

        # Update state file.
        self.resumable.statefile.seek(0)
        self.resumable.statefile.write(str(self.current_row)+'\n')

        return row


if __name__ == '__main__':

    csv_filename = 'resumable_data.csv'

    # Read a few rows and raise an exception.
    try:
        with ResumableCSVReader(csv_filename) as resumable:
            for _ in range(2):
                print('row:', next(resumable))

            raise MemoryError('Forced')  # Cause exception.

    except MemoryError:
        pass  # Expected, suppress to allow test to keep running.

    # CSV file is now closed.

    # Resume reading where left-off and continue to end of file.
    print('\nResume reading\n')

    with ResumableCSVReader(csv_filename) as resumable:
        for row in resumable:
            print('row:', row)

    print('\ndone')

Output:

row: ['id', 'name']
row: ['001', 'jane']

Resume reading

row: ['002', 'winky']
row: ['003', 'beli']

done

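【Discussion】:

【Solution 2】:

To do this you need to keep track of how far into the file you have read so far; file.tell() may come in handy. After that, you can use file.seek() to start reading your file from that point. The code would look something like this: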

def read_from_position(last_position):
    # "file_location" is a placeholder for the path of your file.
    with open("file_location") as file:
        file.seek(last_position)
        file.readline()  # Do what you want with this
        return file.tell()  # this is the updated last position

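You can achieve the same thing in your code by keeping track of the number of rows you read last time and starting your iteration after that many rows.

【Discussion】:

readline() may do read-ahead and buffering behind the scenes, so the result of file.tell() won't necessarily match how far you have got with readline.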
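
For what it's worth, a tell()/seek() version for CSV rows might look like the sketch below. It assumes Python 3 text mode, where (as far as I know) tell() after readline() reports the position just past the line that was read, whereas calling tell() while iterating over the file with a for loop raises OSError. It also assumes no quoted field contains a line break. read_next_row is a made-up name, not something from the answer.

```python
import csv


def read_next_row(csv_path, last_position=0):
    """Read one CSV row starting at the saved position.

    Returns (row, new_position); row is None once the end of the file is reached.
    Assumes every CSV row fits on a single physical line.
    """
    with open(csv_path, newline='') as f:
        f.seek(last_position)      # 0, or a value previously returned by tell()
        line = f.readline()
        new_position = f.tell()    # position just after the line that was read
    if not line:
        return None, last_position
    row = next(csv.reader([line]))  # parse the single line as CSV
    return row, new_position
```

The returned position is what you would write to your progress file and pass back in on the next run.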

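【Solution 3】:

In this case you have to explicitly save the current position every time, which may be a bit expensive computationally, but it works. The code is as follows: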

import csv


def update_last(x):
    with open('last.txt', 'w') as file:
        file.write(str(x))


def get_last():
    try:
        with open('last.txt', 'r') as file:
            return int(file.read().strip())
    except (FileNotFoundError, ValueError):  # no progress file yet, or unreadable contents
        with open('last.txt', 'w') as file:
            file.write('0')
            return 0

with open('your_file.txt', 'r') as f:
    reader = csv.reader(f)
    last = get_last() + 1
    current = 1
    for i in reader:
        if current < last:
            current += 1
            continue
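        print(i)
        # Save the number of the row just handled *before* incrementing;
        # saving it afterwards makes the next run skip one unread row.
        update_last(current)
        current += 1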

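【Discussion】:

pathlib.Path.write_text simplifies a lot of the open/write/close stuff. And have your exception handler in get_last call update_last(0) - DRY!
@PaulMcG: The Path.write_text() method wasn't added until Python 3.5, so not using it keeps this code compatible with more versions... besides, doing it the "old-fashioned" way isn't much work in this particular case (IMO). (In fact, I'm a little surprised they even bothered to add something as trivial as write_text().)
mrHOT: FWIW, I think this is a reasonable approach overall, and your implementation is certainly very concise. That said, opening and closing the last.txt file every time a row is read adds a lot of overhead, especially since it involves multiple calls to the operating system (which tend to be very expensive). One suggestion: don't hardcode the last.txt file name, because doing so prevents the code from being used with more than one CSV file at a time. Also, extra code is needed to delete the file when it's no longer needed...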
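
The suggestions in these comments (Path.write_text, no hardcoded last.txt, deleting the file once it's no longer needed) might be combined roughly as follows. This is only a sketch: progress_path, process_csv, handle_row and the '.last' suffix are all made up for illustration and are not part of the answer.

```python
import csv
from pathlib import Path


def progress_path(csv_path):
    # Hypothetical naming scheme: 'data.csv' -> 'data.csv.last'
    return Path(str(csv_path) + '.last')


def get_last(csv_path):
    """Return the number of rows already handled for this CSV file (0 if unknown)."""
    try:
        return int(progress_path(csv_path).read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def process_csv(csv_path, handle_row):
    """Call handle_row on each row not handled yet, saving progress per CSV file."""
    done = get_last(csv_path)
    state = progress_path(csv_path)
    with open(csv_path, newline='') as f:
        for number, row in enumerate(csv.reader(f), start=1):
            if number <= done:
                continue  # already handled in an earlier run
            handle_row(row)
            state.write_text(str(number))  # row `number` is now done
    # The whole file was read without an exception: progress is no longer needed.
    if state.exists():
        state.unlink()
```

write_text() still opens and closes the progress file for every row, so the overhead mentioned above doesn't go away; the sketch only addresses the hardcoded file name and the cleanup.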

【Solution 4】:

Use the magic of generators:

import csv

def get_rows(infile='test.csv'):
    with open(infile) as f:
        reader = csv.reader(f)
        for row in reader:
            yield row

for id, name in get_rows():
    out = some_complex_business_logic(id, name)
    print(out)

While your complex business logic is running, the generator is paused, and it transparently resumes when you're ready for the next row.
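
To make the pausing visible, here is a small sketch that logs the order in which things happen. It is a variant of get_rows that takes an already open file object and a log list (both assumptions made for this demo), and it uses in-memory data instead of test.csv.

```python
import csv
import io


def get_rows(f, log):
    for row in csv.reader(f):
        log.append('read ' + row[0])
        yield row  # paused here until the caller asks for the next row


def run_demo():
    log = []
    data = io.StringIO('001,jane\n002,winky\n')
    for id_, name in get_rows(data, log):
        log.append('processed ' + id_)  # stand-in for the business logic
    return log
```

run_demo() returns ['read 001', 'processed 001', 'read 002', 'processed 002'], so each row is read only once the previous one has been processed. All of this happens within a single run; the sketch stores nothing on disk.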

【Discussion】: