【Title】: Splitting large text file into smaller text files by line numbers using Python
【Posted】: 2013-04-23 18:44:49
【Question】:

I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I would like to write a Python script that splits really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600 would have lines 301-600, and so on until enough small files have been made to contain all the lines of the big file.

Any suggestions on the easiest way to accomplish this using Python would be greatly appreciated.

【Comments】:

    Tags: python file split lines


    【Solution 1】:

    Use the itertools grouper recipe:

    from itertools import zip_longest
    
    def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(fillvalue=fillvalue, *args)
    
    n = 300
    
    with open('really_big_file.txt') as f:
        for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
            with open('small_file_{0}'.format(i * n), 'w') as fout:
                fout.writelines(g)
    

    The advantage of this method over storing each line in a list is that it works with iterables line by line, so it never has to store an entire small_file in memory at once.

    Note that the last file in this case will be small_file_100200 but will only go up to line 100000. This happens because fillvalue='', meaning I write nothing to the file when there are no more lines left to write, since the group sizes do not divide evenly. You can fix this by writing to a temp file and then renaming it afterwards, instead of naming it first as I have done. Here is how that can be done.

    import os, tempfile
    
    with open('really_big_file.txt') as f:
        for i, g in enumerate(grouper(n, f, fillvalue=None)):
            # create the temp file in the current directory so that os.rename
            # below stays on the same filesystem
            with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
                for j, line in enumerate(g, 1): # count number of lines in group
                    if line is None:
                        j -= 1 # don't count this line
                        break
                    fout.write(line)
            os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))
    

    This time fillvalue=None and I go through each line checking for None; when it occurs, I know the process has finished, so I subtract 1 from j so as not to count the filler, and then write the file.

    【Comments】:

    【Solution 2】:
    lines_per_file = 300  # Lines on each small file
    lines = []  # Stores lines not yet written on a small file
    lines_counter = 0  # Same as len(lines)
    created_files = 0  # Counting how many small files have been created
    
    with open('really_big_file.txt') as big_file:
        for line in big_file:  # Go through the whole big file
            lines.append(line)
            lines_counter += 1
            if lines_counter == lines_per_file:
                idx = lines_per_file * (created_files + 1)
                with open('small_file_%s.txt' % idx, 'w') as small_file:
                    # Write all lines on small file (each line already ends with '\n')
                    small_file.write(''.join(lines))
                lines = []  # Reset variables
                lines_counter = 0
                created_files += 1  # One more small file has been created
        # After for-loop has finished
        if lines_counter:  # There are still some lines not written on a file?
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write them on a last small file
                small_file.write(''.join(lines))
            created_files += 1
    
    print('%s small files (with %s lines each) were created.' % (created_files,
                                                                  lines_per_file))
    

    【Comments】:

    • The only issue is that with this approach you have to store each small_file in memory at once before writing it, which may or may not be a problem. Of course you could fix that by writing it to the file line by line instead.
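    • For reference, a minimal sketch of that line-by-line variant (same 300-line chunks and small_file_<N>.txt naming; it is essentially what 【Solution 4】 below does):
    
    lines_per_file = 300
    small_file = None
    
    with open('really_big_file.txt') as big_file:
        for lineno, line in enumerate(big_file):
            if lineno % lines_per_file == 0:
                # a new chunk starts: close the previous small file, if any,
                # and open the next one, named after the last line it can hold
                if small_file:
                    small_file.close()
                small_file = open('small_file_%s.txt' % (lineno + lines_per_file), 'w')
            small_file.write(line)  # write each line as it is read; nothing is buffered
        if small_file:
            small_file.close()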
    【Solution 3】:

    The way I do it is more understandable and uses fewer shortcuts, in order to give you a further understanding of how and why this works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what they are doing.

    Since you posted no code, I decided to do it this way, as you may not be familiar with anything beyond basic Python syntax, and the way you asked the question made it seem as though you had not tried anything and had no idea how to approach the problem.

    Here are the steps to do this in basic python:

    First, you should read your file into a list for safekeeping:

    my_file = 'really_big_file.txt'
    hold_lines = []
    with open(my_file,'r') as text_file:
        for row in text_file:
            hold_lines.append(row)
    

    Second, you need to set up a way of creating new files by name! I would suggest a loop along with a couple of counters:

    outer_count = 1
    line_count = 0
    sorting = True
    while sorting:
        count = 0
        increment = (outer_count-1) * 300
        left = len(hold_lines) - increment
        file_name = "small_file_" + str(outer_count * 300) + ".txt"
    

    Third, inside that loop you need nested loops that will save the correct rows into an array:

    hold_new_lines = []
        if left < 300:
            while count < left:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
            sorting = False
        else:
            while count < 300:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
    

    Last thing: again inside your first loop, you need to write the new file and do the final counter increment so that your loop will run again and write a new file.

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
    

    Note: if the number of lines in your file is not divisible by 300, the name of the last file will not correspond to its last line.
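    
    If that matters, one possible fix (a sketch only, reusing this answer's own file_name and line_count variables) is to rename the last file once the loop has finished:
    
    import os
    
    # after the while loop: line_count is the total number of lines written,
    # while file_name still names the last small file that was created
    corrected_name = "small_file_" + str(line_count) + ".txt"
    if corrected_name != file_name:
        os.rename(file_name, corrected_name)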

    It is important to understand why these loops work. You have it set up so that on the next pass of the loop, the name of the file you write to changes, because the name depends on a changing variable. This is a very useful scripting tool for accessing, opening, writing and organizing files, etc.

    In case you cannot follow what is going on in the loops, here is the function in its entirety:

    my_file = 'really_big_file.txt'
    sorting = True
    hold_lines = []
    with open(my_file,'r') as text_file:
        for row in text_file:
            hold_lines.append(row)
    outer_count = 1
    line_count = 0
    while sorting:
        count = 0
        increment = (outer_count-1) * 300
        left = len(hold_lines) - increment
        file_name = "small_file_" + str(outer_count * 300) + ".txt"
        hold_new_lines = []
        if left < 300:
            while count < left:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
            sorting = False
        else:
            while count < 300:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
        outer_count += 1
        with open(file_name,'w') as next_file:
            for row in hold_new_lines:
                next_file.write(row)
    

    【Comments】:

      【Solution 4】:
      lines_per_file = 300
      smallfile = None
      with open('really_big_file.txt') as bigfile:
          for lineno, line in enumerate(bigfile):
              if lineno % lines_per_file == 0:
                  if smallfile:
                      smallfile.close()
                  small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
                  smallfile = open(small_filename, "w")
              smallfile.write(line)
          if smallfile:
              smallfile.close()
      

      【Comments】:

        【Solution 5】:
        import csv
        import os
        import re
        
        # Note: this answer uses Python 2 style file modes ('rb'/'ab') with the
        # csv module; on Python 3 open the files in text mode with newline=''.
        MAX_CHUNKS = 300
        
        
        def writeRow(idr, row):
            with open("file_%d.csv" % idr, 'ab') as file:
                writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
                writer.writerow(row)
        
        def cleanup():
            for f in os.listdir("."):
                if re.search("file_.*", f):
                    os.remove(os.path.join(".", f))
        
        def main():
            cleanup()
            with open("large_file.csv", 'rb') as results:
                r = csv.reader(results, delimiter=',', quotechar='\"')
                idr = 1
                for i, x in enumerate(r):
                    temp = i + 1
                    if not (temp % (MAX_CHUNKS + 1)):
                        idr += 1
                    writeRow(idr, x)
        
        if __name__ == "__main__": main()
        

        【Comments】:

        • Hey, small question: would you mind explaining why you use quotechar='\"'? Thanks.
        • I was using it because in my case I had a different quote character ( | ). You can skip setting it; the default quote character is (").
        • For anyone who cares about speed: a CSV file with 98500 records (about 13MB in size) was split by this code in about 2.31 seconds. I'd say that's pretty good.
        【Solution 6】:

        I had to do the same thing with a 650000 line file.
        
        Use the enumerate index and integer division (//) by the chunk size.
        
        When that number changes, close the current file and open a new one.
        
        This is a python3 solution using format strings.

        chunk = 50000  # number of lines from the big file to put in small file
        this_small_file = open('./a_folder/0', 'a')
        
        with open('massive_web_log_file') as file_to_read:
            for i, line in enumerate(file_to_read.readlines()):
                file_name = f'./a_folder/{i // chunk}'
                print(i, file_name)  # a bit of feedback that slows the process down a lot
        
                if file_name == this_small_file.name:
                    this_small_file.write(line)
        
                else:
                    # a new chunk starts here: close the finished file, open the
                    # next one, and write the current line into that new file
                    this_small_file.close()
                    this_small_file = open(f'{file_name}', 'a')
                    this_small_file.write(line)
        
        this_small_file.close()  # close the last small file
        

        【Comments】:

        • You can get a significant speedup by commenting out print(i, file_name).
        • Also, file_to_read.readlines() could be changed to just file_to_read...
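        • Putting both of those suggestions together, a rough sketch of the same loop without readlines() and without the per-line print (same chunk size and ./a_folder layout assumed) might look like:
        
        chunk = 50000
        this_small_file = open('./a_folder/0', 'a')
        
        with open('massive_web_log_file') as file_to_read:
            for i, line in enumerate(file_to_read):  # iterate lazily, no readlines()
                file_name = f'./a_folder/{i // chunk}'
                if file_name != this_small_file.name:
                    # a new chunk starts: switch to the next output file first
                    this_small_file.close()
                    this_small_file = open(file_name, 'a')
                this_small_file.write(line)
        
        this_small_file.close()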
        【Solution 7】:

        Set files to the number of files you want to split the main file into; in my case I wanted to get 10 files from my main file.

        files = 10
        with open("data.txt", "r") as data:
            emails = data.readlines()
            batchs = int(len(emails)/files)  # lines per output file
            for id, log in enumerate(emails):
                fileid = id/batchs
                file = open("minifile{file}.txt".format(file=int(fileid)+1), 'a+')
                file.write(log)
                file.close()  # close before the next line reopens a file
        

        【Comments】:

        • Thanks @JoeVenner, I tried this approach but it gets slow for big files.
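        • The slowdown most likely comes from reopening the output file for every single line; a hedged sketch of the same idea that writes each batch in one go (keeping files = 10 and the minifile naming) could look like:
        
        files = 10
        with open("data.txt", "r") as data:
            emails = data.readlines()
        
        batch = len(emails) // files
        for i in range(files):
            # the last file also takes any leftover lines when the split is uneven
            part = emails[i * batch:] if i == files - 1 else emails[i * batch:(i + 1) * batch]
            with open("minifile{}.txt".format(i + 1), "w") as out:
                out.writelines(part)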
        【Solution 8】:

        Here is a very easy way to do it if, for example, you want to split it into 2 files:

        with open("myInputFile.txt",'r') as file:
            lines = file.readlines()
        
        with open("OutputFile1.txt",'w') as file:
            for line in lines[:int(len(lines)/2)]:
                file.write(line)
        
        with open("OutputFile2.txt",'w') as file:
            for line in lines[int(len(lines)/2):]:
                file.write(line)
        

        Making this dynamic would be:

        with open("inputFile.txt",'r') as file:
            lines = file.readlines()
        
        Batch = 10
        end = 0
        for i in range(1,Batch + 1):
            if i == 1:
                start = 0
            increase = int(len(lines)/Batch)
            end = end + increase
            with open("splitText_" + str(i) + ".txt",'w') as file:
                for line in lines[start:end]:
                    file.write(line)
            
            start = end
        

        【Comments】:

          【Solution 9】:

          In Python, files are simple iterators. That gives the option of iterating over them multiple times, always continuing from the last position the previous iterator reached. Keeping this in mind, we can use islice to get the next 300 lines of the file each time through a continuous loop. The tricky part is knowing when to stop. For this we will "sample" the file for the next line, and once it is exhausted we can break the loop:

          from itertools import islice
          
          lines_per_file = 300
          with open("really_big_file.txt") as file:
              i = 1
              while True:
                  try:
                      checker = next(file)
                  except StopIteration:
                      break
                  with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
                      out_file.write(checker)
                      for line in islice(file, lines_per_file-1):
                          out_file.write(line)
                  i += 1
          

          【Comments】:

            【Solution 10】:
            with open('really_big_file.txt') as infile:
                file_line_limit = 300
                counter = -1
                file_index = 0
                outfile = None
                for line in infile.readlines():
                    counter += 1
                    if counter % file_line_limit == 0:
                        # close old file
                        if outfile is not None:
                            outfile.close()
                        # create new file
                        file_index += 1
                        outfile = open('small_file_%03d.txt' % file_index, 'w')
                    # write to file
                    outfile.write(line)
                # close the last file after the loop
                if outfile is not None:
                    outfile.close()
            

            【Comments】:

            • Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.