【Question title】: Extract Data Dump From Freebase in Python
【Posted】: 2018-07-09 11:47:15
【Question】:

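Using the Freebase Triples data dump (freebase-rdf-latest.gz) downloaded from the website, what is the best way to open and read this file in Python in order to extract information, say, about companies and businesses?

As far as I know, there are packages that cover both steps: opening the gz file in Python and reading an RDF file. I just don't know how to put this together...

My failed attempt in Python 3.6: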

import gzip

with gzip.open('freebase-rdf-latest.gz','r') as uncompressed_file:
       for line in uncompressed_file.read():
           print(line)

Once I have the XML-like structure I can get the information by parsing it, but I cannot read the file.

【Comments】:

Tags: python parsing freebase gzip


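【Solution 1】:

The problem is that the attempt calls `read()` with no size argument, which decompresses the entire file at once and holds the result in memory. For a file this large, a more practical approach is to decompress the file a little at a time and stream the result.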

#!/usr/bin/env python3
import io
import zlib

def stream_unzipped_bytes(filename):
    """
    Generator function, reads gzip file `filename` and yields
    uncompressed bytes.

    This function answers your original question, how to read the file,
    but its output is a generator of bytes so there's another function
    below to stream these bytes as text, one line at a time.
    """
    with open(filename, 'rb') as f:
        wbits = zlib.MAX_WBITS | 16  # 16 requires gzip header/trailer
        decompressor = zlib.decompressobj(wbits)
        fbytes = f.read(16384)
        while fbytes:
            chunk = decompressor.decompress(decompressor.unconsumed_tail + fbytes)
            if chunk:  # skip empty chunks so consumers only ever see real data
                yield chunk
            fbytes = f.read(16384)


def stream_text_lines(gen):
    """
    Generator wrapper function, `gen` is a bytes generator.
    Yields one line of text at a time.
    """
    buf = b''  # defined up front so the except block works even if gen is empty
    try:
        buf = next(gen)
        while buf:
            lines = buf.splitlines(keepends=True)
            # yield all but the last line, because this may still be incomplete
            # and waiting for more data from gen
            for line in lines[:-1]:
                yield line.decode()
            # set buf to end of prior data, plus next from the generator.
            # do this in two separate calls in case gen is done iterating,
            # so the last output is not lost.
            buf = lines[-1]
            buf += next(gen)
    except StopIteration:
        # yield the final data
        if buf:
            yield buf.decode()


# Sample usage, using the stream_text_lines generator to stream
# one line of RDF text at a time
bytes_generator = stream_unzipped_bytes('freebase-rdf-latest.gz')
for line in stream_text_lines(bytes_generator):
    # do something with `line` of text
    print(line, end='')
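
As an alternative to the zlib approach above, the standard `gzip` module can also stream: opening the file in text mode and iterating over the file object reads one line at a time. The sketch below uses that technique and splits each line into a triple. It assumes the dump holds one triple per line with tab-separated fields and a trailing `.` field; the function names `iter_triples` and `filter_by_predicate` are hypothetical, and matching on a predicate fragment is only one possible way to narrow the data down to business-related facts.

```python
import gzip


def iter_triples(path):
    """Yield (subject, predicate, object) tuples from a gzipped triples file.

    Assumption: one triple per line, fields separated by tabs,
    with a final '.' field that is ignored here.
    """
    # 'rt' opens the stream in text mode; iterating the file object reads
    # and decompresses one line at a time instead of loading everything.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            yield parts[0], parts[1], parts[2]


def filter_by_predicate(triples, fragment):
    """Lazily keep only triples whose predicate contains `fragment`."""
    return (t for t in triples if fragment in t[1])
```

For example, iterating over `filter_by_predicate(iter_triples('freebase-rdf-latest.gz'), 'business.')` would yield only triples whose predicate contains `business.`; that fragment is a guess at how business-related predicates are named and should be checked against a few real lines from the dump.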

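【Discussion】:

  • Not working. I used "stream_text_lines(stream_unzipped_bytes(file_name))" and it is still running.
  • Hmm, strange... If you run the sample code exactly as written, does it produce lines of text? This is a very large file, so processing all of the data may take quite a while, but it started printing results for me immediately.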
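
One way to tell "still running" apart from "producing nothing" is to pull only the first few lines from the generator and stop. A minimal sketch, where `preview` is a hypothetical helper name that is not part of the answer's code:

```python
from itertools import islice


def preview(lines, n=5):
    """Return the first `n` items from any iterator of lines, then stop."""
    # islice consumes at most n items, so the rest of the stream is never read.
    return list(islice(lines, n))
```

Calling `preview(stream_text_lines(stream_unzipped_bytes('freebase-rdf-latest.gz')))` should return after reading only the start of the file if decompression is working.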