【Question title】: Cannot iterate over cStringIO
【Posted】: 2018-07-18 07:27:20
【Question】:

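In a script, I write lines to a file, but some of the lines may be duplicates. So I created a temporary cStringIO file-like object, which I call the "intermediate file". I write the lines to the intermediate file first, remove the duplicates, and then write to the real file.

To do that, I wrote a simple for loop that iterates over every line in the intermediate file and drops any duplicates.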

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()

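My problem is that the for loop never executes. I can verify this by setting a breakpoint in the debugger: the loop is simply skipped and the function exits. I even read this answer from this thread and inserted the line cStringIO.OutputType.getvalue(f_temp), but that did not solve my problem.

I don't understand why I cannot read and iterate over my file-like object.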

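【Comments on the question】:

  • Is f_temp a file object? What is the purpose of cStringIO.OutputType.getvalue(f_temp)...?
  • @juanpa.arrivillaga Yes, it is a file-like object. Apparently, the purpose of cStringIO.OutputType.getvalue(f_temp) is to convert the cStringIO file-like object to the Output type so that it can be read. See this comment.

Tags: python stringio cstringio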


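【Answer 1】:

The answer you cited is a little incomplete. It shows how to get the contents of the cStringIO buffer as a string, but getvalue() only returns that string; you still have to do something with it. You could do it like this: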

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # contents = cStringIO.OutputType.getvalue(f_temp)  # From: https://stackoverflow.com/a/40553378/8117081
    contents = f_temp.getvalue()     # simpler approach
    contents = contents.strip('\n')  # remove final newline to avoid adding an extra row
    lines = contents.split('\n')     # convert to iterable

    for line in lines:  # Iterate through the list of lines.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line + '\n')
            lines_seen.add(line)
    f_out.close()

However, it is probably better to use ordinary IO operations on the f_temp "file handle", like this:

def remove_duplicates(f_temp, dir_out):  # f_temp is the cStringIO object.
    """Function to remove duplicates from the intermediate file and write to physical file."""
    lines_seen = set()  # Define a set to hold lines already seen.
    f_out = define_outputs(dir_out)  # Create the real output file by calling function "define_outputs". Note: This function is not shown in my pasted code.

    # move f_temp's pointer back to the start of the file, to allow reading
    f_temp.seek(0)

    for line in f_temp:  # Iterate through the cStringIO file-like object.
        line = compute_md5(line)  # Function to compute the MD5 hash of each line. Note: This function is not shown in my pasted code.
        if line not in lines_seen:  # Not a duplicate line (based on MD5 hash, which is supposed to save memory).
            f_out.write(line)
            lines_seen.add(line)
    f_out.close()
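
To see why the seek(0) call matters, here is a minimal sketch of how the read/write position behaves. It assumes Python 3, where the cStringIO module no longer exists, so it uses io.StringIO as a stand-in; the variable names are illustrative only.

```python
import io

buf = io.StringIO()
buf.write("string 1\n")
buf.write("string 2\n")

# After writing, the position sits at the end of the buffer.
end_position = buf.tell()

# getvalue() returns the whole buffer as a string; it does not move the position.
contents = buf.getvalue()
position_after_getvalue = buf.tell()

# Iterating from the end yields nothing, which is why the loop body never ran.
lines_before_seek = list(buf)

# Rewinding to the start makes the lines readable again.
buf.seek(0)
lines_after_seek = list(buf)
```

In this sketch lines_before_seek is empty while lines_after_seek holds both lines, which matches the behavior described in the question.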

Here is a test of remove_duplicates (it works with either version above):

import cStringIO

def define_outputs(dir_out):
    return open('/tmp/test.txt', 'w') 

def compute_md5(line):
    return line

f = cStringIO.StringIO()
f.write('string 1\n')
f.write('string 2\n')
f.write('string 1\n')
f.write('string 2\n')
f.write('string 3\n')

remove_duplicates(f, 'tmp')
with open('/tmp/test.txt', 'r') as f:
    print(str([row for row in f]))
# ['string 1\n', 'string 2\n', 'string 3\n']
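
One thing to watch: the test stubs compute_md5 to return the line unchanged. If the real compute_md5 returns a digest, the functions above would write the digest to the output instead of the original text. Assuming the goal is to keep the original lines and only use the hash for duplicate detection, a rough sketch might look like this. It is written for Python 3 with io.StringIO standing in for cStringIO, and dedupe_lines is a hypothetical helper name, not something from the question.

```python
import hashlib
import io

def dedupe_lines(source, sink):
    """Copy each distinct line from source to sink, keeping first occurrences."""
    seen_digests = set()
    source.seek(0)  # rewind so that iteration starts at the first line
    for line in source:
        # Store only the fixed-size digest, not the full line, in the set.
        digest = hashlib.md5(line.encode("utf-8")).digest()
        if digest not in seen_digests:
            sink.write(line)  # write the original line, not its hash
            seen_digests.add(digest)

buf = io.StringIO()
for text in ("string 1\n", "string 2\n", "string 1\n", "string 3\n"):
    buf.write(text)

out = io.StringIO()
dedupe_lines(buf, out)
result = out.getvalue()
```

Keeping the line and its digest in separate variables avoids overwriting the text that should be written out.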

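【Comments】:

  • f_temp.seek(0) works! Thank you! I have one more quick question. Since f_temp (or any cStringIO object) is a "file-like" object, is it necessary to call f_temp.close() once I have finished reading all the lines?
  • I would certainly close it when you are done. For a file or a StringIO, the garbage collector automatically frees the associated resources when the last reference goes out of scope, but relying on that is not good form. It is better to close the object explicitly once you are finished with it. This matters especially if you are creating and closing a lot of them in quick succession. The easiest way to do this is usually a with clause on the open step.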
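
As a small sketch of the with-clause idea in the last comment: contextlib.closing wraps any object that has a close() method, so it should also suit in-memory buffers. The example assumes Python 3 and uses io.StringIO in place of cStringIO.

```python
import io
from contextlib import closing

# closing() calls buf.close() when the block exits, even if an error is raised.
with closing(io.StringIO()) as buf:
    buf.write("string 1\n")
    buf.seek(0)
    lines = list(buf)

is_closed = buf.closed
```

After the block, is_closed is True and any further read on buf raises ValueError.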