【Question title】: How to read a subset of records from a warc file
【Posted】: 2015-08-01 07:37:11
【Question】:

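I am trying to parse a .warc file from Common Crawl in Python.

Since the file is large, I would like to start with a sample/subset of the first few records.

How can I truncate the file to only the first X lines while preserving the existing line feed/carriage return characters?

Here is what I have already tried:

  1. head -n 250 oldfile > newfile. This removed some of the carriage returns needed to parse the file. When I try to use this file in my Hadoop job (which reads it with the warc package), I get the following error: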

      Traceback (most recent call last):
          File "test.py", line 46, in <module>
            TagGrabber.run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
            mr_job.execute()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
            super(MRJob, self).execute()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 151, in execute
            self.run_job()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 214, in run_job
            runner.run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/runner.py", line 464, in run
            self._run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 173, in _run
            self._invoke_step(step_num, 'mapper')
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 264, in _invoke_step
            self.per_step_runner_finish(step_num)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 152, in per_step_runner_finish
            self._wait_for_process(proc_dict, step_num)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 268, in _wait_for_process
            (proc_dict['args'], returncode, ''.join(tb_lines)))
        Exception: Command ['sh', '-ex', 'setup-wrapper.sh', '/var/cc-mrjob/venv/bin/python', 'test.py', '--step-num=0', '--mapper', '/tmp/test.root.20150520.071726.549519/input_part-00000'] returned non-zero exit status 1:
        Traceback (most recent call last):
          File "test.py", line 46, in <module>
            TagGrabber.run()
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 461, in run
            mr_job.execute()
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 470, in execute
            self.run_mapper(self.options.step_num)
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 535, in run_mapper
            for out_key, out_value in mapper(key, value) or ():
          File "/var/cc-mrjob/mrcc.py", line 33, in mapper
            for i, record in enumerate(f):
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
            record = self.read_record()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
            header = self.read_header(fileobj)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
            raise IOError("Bad version line: %r" % version_line)
        IOError: Bad version line: 'WARC/1.0\n'
    
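  2. Same as #1, but using the tail command.

  3. Same as #1, followed by tr/sed to replace any missing newline or ^M (carriage return) characters. The warc package still complains that an expected carriage return or newline is not in place.
  4. unix2dos oldfile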

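【Comments】:

  • Looking at the warc Python lib, it does not read the whole .warc file at once; it reads one record at a time. What do you need the truncation for? An honest question: is it perhaps for transfer over a network or something similar?
  • Adding to the previous comment ("it does not read the whole .warc"): reading the first N records with just the warc library is very simple: islice(warc_file, N), if that is what you are looking for.
  • @Ilja Thanks, this is exactly what I was looking for. Could you add it as an answer?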

Tags: python webarchive warc


【Solution 1】:

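It is hard to handle the line breaks correctly, because a .warc file can also contain binary data. Truncating can also produce a corrupt .warc file, because, for example, the Python library trusts that the Content-Length header is valid.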
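
To make the Content-Length point concrete, here is a minimal stdlib-only sketch. It does not use the warc package; `build_record`, `first_lines` and `declared_and_actual` are hypothetical helpers written for this illustration, and the record layout (CRLF-terminated header lines, a blank line, the payload, then two CRLFs) is a simplified assumption about the format. It shows that cutting a record by line count, as `head -n` does, can leave fewer payload bytes than the header declares.

```python
import io

CRLF = b"\r\n"


def build_record(payload):
    """Build a minimal WARC-style record (hypothetical helper for this demo)."""
    headers = [
        b"WARC/1.0",
        b"WARC-Type: resource",
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # Header lines, a blank line, the payload, then two CRLFs as a separator.
    return CRLF.join(headers) + CRLF + CRLF + payload + CRLF + CRLF


def first_lines(data, n):
    """Mimic `head -n`: keep only the first n newline-terminated lines."""
    stream = io.BytesIO(data)
    return b"".join(stream.readline() for _ in range(n))


def declared_and_actual(data):
    """Return (declared Content-Length, payload bytes actually present)."""
    header_block, _, rest = data.partition(CRLF + CRLF)
    declared = None
    for line in header_block.split(CRLF)[1:]:  # skip the version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    return declared, len(rest[:declared])


if __name__ == "__main__":
    payload = b"line one\nline two\n\x00\x01binary\nline four\n"
    record = build_record(payload)
    print(declared_and_actual(record))                  # whole record
    print(declared_and_actual(first_lines(record, 6)))  # cut by line count
```

A reader that trusts the declared length will try to consume payload bytes that are no longer in the truncated file, which is one way a line-based cut ends up corrupt.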

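The warc Python library reads only one record at a time from the .warc file (it avoids reading the whole file into memory at once), so a subset can be processed with Python alone. For example: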

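import warc
from itertools import islice

N = 10  # number of records to process
warc_file = warc.open('/path/to/file.warc')
# islice stops after N records, so the rest of the file is never read
for record in islice(warc_file, N):
    do_stuff_with(record)  # placeholder for your own processing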
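
If a smaller file on disk is still wanted, one option is to copy whole records rather than lines. The following is a rough stdlib-only sketch, not part of the warc package: `read_record` and `copy_first_records` are hypothetical names, and it assumes an uncompressed file in which every record starts with a `WARC/` version line and carries a Content-Length header.

```python
import io


def read_record(stream):
    """Read one raw record from a binary stream; return None at end of file."""
    line = stream.readline()
    # Skip the blank separator lines left over from the previous record.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None
    if not line.startswith(b"WARC/"):
        raise ValueError("Bad version line: %r" % line)

    parts = [line]
    length = None
    while True:
        line = stream.readline()
        if not line:
            raise ValueError("Unexpected end of file inside record header")
        parts.append(line)
        if line in (b"\r\n", b"\n"):  # blank line ends the header block
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("Record has no Content-Length header")

    # Read the payload by byte count, so binary data and stray newlines are safe.
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("Record payload is truncated")
    parts.append(payload)
    return b"".join(parts)


def copy_first_records(src, dst, n):
    """Copy up to n whole records from src to dst; return the number copied."""
    copied = 0
    while copied < n:
        record = read_record(src)
        if record is None:
            break
        dst.write(record + b"\r\n\r\n")  # re-add the record separator
        copied += 1
    return copied
```

Both streams would need to be opened in binary mode, for example `open('/path/to/file.warc', 'rb')` for the source and a second file opened with `'wb'` for the destination. For gzip-compressed input, wrapping the source with `gzip.open(path, 'rb')` should work in principle, although the output written here would be uncompressed.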
