【Question title】: Using Python to process a chunk of a file every x lines
【Posted】: 2011-07-19 17:29:14
【Question】:

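What I am trying to do is read z lines after every interval of y lines in the file DATA.txt, and then run the function find on those lines. That is: skip the first y lines; read the next z lines; run find on the lines just read; skip the next y lines; and repeat for the length of the file (which is passed in as sys.argv[1]).

With the code below I get a huge number of blank lines in the variable lines, and I don't know why. I can provide the function find if needed, but I thought it was simpler to leave it out.

If someone wants to suggest a completely different approach, I would be just as happy with that as with a fix to the existing code, as long as I understand what is going on.

EDIT: I had left out some parentheses, but adding them does not fix the problem.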

import sys
import operator
import linecache
def find(arg)
    ...
x=0
while x<int(sys.argv[1]):
   x+=1 
   if mod(x, y)==0:
       for i in range(x,x+z):
           block=linecache.getline('DATA.txt', i)
           g = open('tmp','a+')
           g.write(block)
           linecache.clearcache()
           lines=g.read()
           find(lines)
           g.close()
   else:
       pass
g.close()
f.close()

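【Comments on the question】:

  • Nit-pick: linecache.clearcache on its own does nothing except reference the function object. I guess you meant linecache.clearcache()
  • @Maimon I don't think linecache is the right tool for this job. You need a generator or an iterator that runs through the file, not a heavyweight function that appears to reopen the file every time a line is needed.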
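
The generator suggested in the last comment might look like the following minimal sketch. It is written for Python 3, and the name iter_blocks and its parameters skip and take are illustrative assumptions, not code from the thread. It walks the input once and never reopens the file.

```python
def iter_blocks(lines, skip, take):
    """Yield lists of `take` lines, skipping `skip` lines before each block.

    `lines` can be any iterable of lines, for example an open file object.
    """
    if take < 1:
        raise ValueError("take must be at least 1")
    period = skip + take
    block = []
    for index, line in enumerate(lines):
        # Position of this line inside the repeating skip/take cycle.
        if index % period >= skip:
            block.append(line)
            if len(block) == take:
                yield block
                block = []
    if block:
        # Shorter trailing block when the input ends in the middle of a block.
        yield block


# Usage with in-memory data standing in for the lines of DATA.txt.
sample = ["line %d\n" % n for n in range(1, 25)]
for chunk in iter_blocks(sample, skip=7, take=3):
    print(chunk)
```

With a real file, each yielded block could be joined and passed on, for example find("".join(block)), assuming find accepts a single string.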

Tags: python text-extraction


【Solution 1】:

Edit: try the following; I think I now have a better idea of what you are trying to do.

g = open('tmp','a+')
while x<int(sys.argv[1]):
   x+=1 
   if mod(x, y)==0:
       curr = g.tell()
       for i in range(x,x+z):
           block=linecache.getline('DATA.txt', i)
           g.write(block)
           linecache.clearcache()
       g.seek(curr)
       lines = g.read()
       find(lines)
   else:
       pass
g.close()

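【Comments】:

  • The problem with that fix is that the code now runs find on only one line of data at a time, which is not what was intended.
  • Thank you very much, I really appreciate you taking the time; you should know that your help is going towards a good cause.
【Solution 2】: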

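Maimon, your initial code gets the indexing wrong. Andrew's code is wrong too, because he took your code as his starting point.

Look at the result of Andrew's code, in which I have commented out the lines that involve g: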

import sys
import operator
import linecache

x=0
y=7  # to skip
z=3  # to print

#g = open('tmp','a+')
while x<23:
    x+=1
    print 'x==',x
    if operator.mod(x, y)==0:
        #curr = g.tell()
        for i in range(x,x+z):
            block=linecache.getline('poem.txt', i)
            print 'block==',repr(block)
            #g.write(block)
            linecache.clearcache()
            #g.seek(curr)
            #lines = g.read()
            #find(lines)

    else:
        pass

#g.close()

applied to a file named 'poem.txt' containing 24 lines:

1 In such a night, when every louder wind
2 Is to its distant cavern safe confined;
3 And only gentle Zephyr fans his wings,
4 And lonely Philomel, still waking, sings;
5 Or from some tree, famed for the owl's delight,
6 She, hollowing clear, directs the wand'rer right:
7 In such a night, when passing clouds give place,
8 Or thinly veil the heav'ns' mysterious face;
9 When in some river, overhung with green,
10 The waving moon and trembling leaves are seen;
11 When freshened grass now bears itself upright,
12 And makes cool banks to pleasing rest invite,
13 Whence springs the woodbind, and the bramble-rose,
14 And where the sleepy cowslip sheltered grows;
15 Whilst now a paler hue the foxglove takes,
16 Yet checkers still with red the dusky brakes
17 When scattered glow-worms, but in twilight fine,
18 Shew trivial beauties watch their hour to shine;
19 Whilst Salisb'ry stands the test of every light,
20 In perfect charms, and perfect virtue bright:
21 When odors, which declined repelling day,
22 Through temp'rate air uninterrupted stray;
23 When darkened groves their softest shadows wear,
24 And falling waters we distinctly hear;

the result is:

x== 1
x== 2
x== 3
x== 4
x== 5
x== 6
x== 7
block== '7 In such a night, when passing clouds give place,\n'
block== "8 Or thinly veil the heav'ns' mysterious face;\n"
block== '9 When in some river, overhung with green,\n'
x== 8
x== 9
x== 10
x== 11
x== 12
x== 13
x== 14
block== '14 And where the sleepy cowslip sheltered grows;\n'
block== '15 Whilst now a paler hue the foxglove takes,\n'
block== '16 Yet checkers still with red the dusky brakes\n'
x== 15
x== 16
x== 17
x== 18
x== 19
x== 20
x== 21
block== '21 When odors, which declined repelling day,\n'
block== "22 Through temp'rate air uninterrupted stray;\n"
block== '23 When darkened groves their softest shadows wear,\n'
x== 22
x== 23
x== 24
x== 25

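I chose y=7 as the number of lines to skip, yet line 7 is printed.

Moreover, after printing the 3 lines 7-8-9 (z=3 was chosen), instead of continuing at 10, 11, 12..., the count carries on with 8, 9, 10... The next 3 lines printed are then 14-15-16, whereas they should be the lines 7 + 3 lines after the first ones, that is, lines 11-12-13.

In fact, if 7 lines are to be skipped and then 3 lines printed, the printed lines must be: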
8-9-10
18-19-20
28-29-30
and so on.

Am I right?
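
The arithmetic behind that list can be checked with a few lines of Python 3. This is only a sketch under the interpretation stated above (skip y lines, take z lines, repeat); the helper name expected_line_numbers is made up for illustration.

```python
def expected_line_numbers(total_lines, skip, take):
    """Return the 1-based line numbers of every block that should be processed."""
    if take < 1:
        raise ValueError("take must be at least 1")
    blocks = []
    first = skip + 1  # first line after the initial skip
    while first <= total_lines:
        last = min(first + take - 1, total_lines)
        blocks.append(list(range(first, last + 1)))
        first += skip + take  # jump past this block and the following skip
    return blocks


print(expected_line_numbers(24, 7, 3))  # [[8, 9, 10], [18, 19, 20]]
print(expected_line_numbers(30, 7, 3))  # [[8, 9, 10], [18, 19, 20], [28, 29, 30]]
```

Each block starts skip + take lines after the previous one, which is why the starting lines are 8, 18, 28 rather than the multiples of 7 produced by the mod test.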

Edit 1

My solution is:

def chunk_reading(filepath,y,z,x=0):
    # x : number of lines to skip before the periodical treatment
    # y : number of lines to periodically skip
    # z : number of lines to periodically print
    with open(filepath) as f:
        try:
            for sk in xrange(x):
                f.next()
            while True:
                try:
                    for i in xrange(y):
                        print 'i==',i
                        f.next()
                    for j in xrange(z):
                        print 'j==',j
                        print repr(f.next())
                except StopIteration:
                    break
        except StopIteration:
            print 'Not enough lines before the lines to print'


chunk_reading('poem.txt',7,3)

which produces:

i== 0
i== 1
i== 2
i== 3
i== 4
i== 5
i== 6
j== 0
"8 Or thinly veil the heav'ns' mysterious face;\n"
j== 1
'9 When in some river, overhung with green,\n'
j== 2
'10 The waving moon and trembling leaves are seen;\n'
i== 0
i== 1
i== 2
i== 3
i== 4
i== 5
i== 6
j== 0
'18 Shew trivial beauties watch their hour to shine;\n'
j== 1
"19 Whilst Salisb'ry stands the test of every light,\n"
j== 2
'20 In perfect charms, and perfect virtue bright:\n'
i== 0
i== 1
i== 2
i== 3
i== 4

Edit 2

The solution above works even for very large files that cannot be held in RAM.

The following one can be used for files of limited size:

def slice_reading(filepath,y,z,x=0):
    # x : number of lines to skip before the periodical treatment
    # y : number of lines to periodically skip
    # z : number of lines to periodically print
    with open(filepath) as f:
        lines = f.readlines()
        lgth = len(lines)

    if lgth > x+y:
        for a in xrange(x+y,lgth,y+z):
            print lines[a:a+z]
    else:
        print 'Not enough lines before lines to print'


slice_reading('poem.txt',7,3,5)

Result:

['13 Whence springs the woodbind, and the bramble-rose,\n', '14 And where the sleepy cowslip sheltered grows;\n', '15 Whilst now a paler hue the foxglove takes,\n']
['23 When darkened groves their softest shadows wear,\n', '24 And falling waters we distinctly hear;']

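【Comments】:

  • Thank you for pointing out the error, and for the correction. This approach is not only more accurate but also much faster.
  • @Maimon Thank you. It is not needed for your problem, but let me point out the existence of islice(), a very interesting function for cutting chunks out of an iterable, which may be useful to you in other cases.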
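
For readers curious about the islice() mentioned in the last comment, here is one way it could be applied to the same skip/take pattern. This is a Python 3 sketch, not part of the answer; the function name blocks_with_islice is hypothetical, while the parameters mirror the x, y, z of the answer (offset, skip, take).

```python
from itertools import islice


def blocks_with_islice(lines, skip, take, offset=0):
    """Yield blocks of up to `take` lines, skipping `skip` lines before each."""
    if take < 1:
        raise ValueError("take must be at least 1")
    it = iter(lines)
    # Discard the initial offset once.
    list(islice(it, offset))
    while True:
        skipped = list(islice(it, skip))
        if len(skipped) < skip:
            return  # the input ended during a skip
        block = list(islice(it, take))
        if not block:
            return  # the input ended right after a skip
        yield block


# Usage with numbers standing in for the 24 lines of poem.txt.
for block in blocks_with_islice(range(1, 25), skip=7, take=3, offset=5):
    print(block)
```

Because islice consumes the shared iterator lazily, only one skip or one block is held in memory at a time.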
【Solution 3】:

I think your problem may be the line lines=g.read. It should be lines=g.read()
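
The difference between g.read and g.read() can be seen in the short Python 3 sketch below, which also shows a second effect that may be relevant to the question: right after a write on a file opened in 'a+' mode, the position is at the end of the file, so read() has nothing left to return until the file is rewound. The temporary path is an assumption made for the demonstration.

```python
import os
import tempfile

# Scratch file in a fresh temporary directory (stands in for 'tmp').
path = os.path.join(tempfile.mkdtemp(), "tmp")

with open(path, "a+") as g:
    g.write("first line\n")
    bound_method = g.read   # no parentheses: only a reference to the method
    at_end = g.read()       # a real call, but the position is at end of file
    g.seek(0)               # rewind to the start of the file
    after_seek = g.read()   # now the written text comes back

print(callable(bound_method))  # True
print(repr(at_end))            # ''
print(repr(after_seek))        # 'first line\n'
```

Solution 1 addresses the same issue by saving the position with g.tell() before writing and calling g.seek(curr) before reading.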

【Comments】:

【Solution 4】:

In the "different approach" category, I offer this:

    """
    Read lines from DATA.txt, first skipping 3 lines, then printing 2,
    then skipping 3 more, and so on.
    """

    def my_print(l):
        if (my_print.skip_counter > 0):
            my_print.skip_counter -= 1
        else:
            if (my_print.print_counter > 0):
                my_print.print_counter -= 1
                print l,
            else:
                # Block finished: reset both counters and treat this
                # line as the first line of the next skip.
                my_print.skip_counter = my_print.skip_size
                my_print.print_counter = my_print.print_size
                my_print(l)

    my_print.skip_size = 3
    my_print.skip_counter = my_print.skip_size

    my_print.print_size = 2
    my_print.print_counter = my_print.print_size

    data = open('DATA.txt')
    for line in data:
        my_print(line)

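The first way to improve this would be to wrap my_print() in a class (with your x and y as member variables). Then, if you want something really "pythonic", you could look into generators.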
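
As a rough illustration of the class idea, the counters that my_print stores as function attributes could become instance attributes. The sketch below is Python 3 and the names SkipThenTake and keep_next are invented for this example; it only decides whether the current line belongs to a block, leaving printing or processing to the caller.

```python
class SkipThenTake:
    """Stateful selector: skip `skip_size` lines, keep `take_size`, repeat."""

    def __init__(self, skip_size, take_size):
        if take_size < 1:
            raise ValueError("take_size must be at least 1")
        self.skip_size = skip_size
        self.take_size = take_size
        self.position = 0  # index inside the current skip/take cycle

    def keep_next(self):
        """Return True if the next line falls in the 'take' part of the cycle."""
        keep = self.position >= self.skip_size
        self.position = (self.position + 1) % (self.skip_size + self.take_size)
        return keep


# Usage: numbers stand in for the lines of DATA.txt.
selector = SkipThenTake(skip_size=3, take_size=2)
kept = [n for n in range(1, 11) if selector.keep_next()]
print(kept)  # [4, 5, 9, 10]
```

Keeping the state on an instance means several files can be processed independently, each with its own selector.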

【Comments】:
