Inverted index in Python with spaCy as the tokenizer and a persistent relation to the original documents

【Question title】: Inverted index in Python with spaCy as tokenization and persistent relation to original documents
【Posted】: 2017-03-23 14:47:34
【Question】:

I want to build an inverted index in Python, using the excellent https://spacy.io/ library to tokenize the words.

They provide a nice example of how to run the preprocessing in parallel and end up with a list of documents that are ready to be indexed.

import spacy

# Assumes an English model is installed; 'en' was the shortcut name in spaCy 1.x.
nlp = spacy.load('en')

texts = [u'One document.', u'...', u'Lots of documents']
# .pipe streams input, and produces streaming output
iter_texts = (texts[i % 3] for i in range(100000000))
for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50, n_threads=4)):
    assert doc.is_parsed
    if i == 30:
        break
    print(i)
    print(doc)

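What I don't understand so far is how to maintain the relation to the original document (file path / URL) with this approach, i.e. how to store it as an additional attribute of each document.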

【Comments】:

Tags: python nlp inverted-index spacy

【Answer 1】:

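You might find the doc.user_data dictionary useful. Note that it is currently not serialized in the doc.to_bytes() output, so you will need to store it separately. Serializing as a tuple (pickle.dumps(doc.user_data), doc.to_bytes()) would probably work.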

【Comments】:

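But how do I populate it with values (i.e. the file path) while nlp.pipe(iter_texts, batch_size=50, n_threads=4) is running?
Is it possible to "tweak" the pipe method so that it actually accepts some ID value?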

【Answer 2】:

The solution is here: https://github.com/explosion/spaCy/issues/172

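import itertools

import spacy

# Assumes an English model is installed; 'en' was the shortcut name in spaCy 1.x.
nlp = spacy.load('en')

def gen_items():
    # Each item is an (id, text) pair; the prints show when items are consumed.
    print("Yield 0")
    yield (0, 'Text 0')
    print("Yield 1")
    yield (1, 'Text 1')
    print("Yield 2")
    yield (2, 'Text 2')

# Split the single stream into two iterators: one for the IDs, one for the texts.
gen1, gen2 = itertools.tee(gen_items())
ids = (id_ for (id_, text) in gen1)
texts = (text for (id_, text) in gen2)
docs = nlp.pipe(texts, batch_size=50, n_threads=4)
# nlp.pipe yields docs in input order, so zipping re-attaches each ID to its doc.
for id_, doc in zip(ids, docs):
    print(id_, doc.text)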
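
To tie this back to the inverted index from the question, the (id_, doc) pairs can be folded into a mapping from each token to the IDs of the documents that contain it. The following is only a rough sketch and is not taken from the linked issue: `build_inverted_index` and `fake_pipe` are hypothetical names, the file paths are made up, and `fake_pipe` is a whitespace-splitting stand-in for `nlp.pipe` so that the example runs without spaCy installed. With spaCy you would pass the real pipe and collect `token.text` from each doc instead.

```python
import itertools
from collections import defaultdict


def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: one token list per text, in input order."""
    for text in texts:
        yield text.lower().split()


def build_inverted_index(items, pipe=fake_pipe):
    """Map each token to the set of document IDs (e.g. file paths) that contain it.

    `items` is an iterable of (id, text) pairs.
    """
    # Same tee/zip pattern as above: IDs travel alongside the texts.
    gen1, gen2 = itertools.tee(items)
    ids = (id_ for id_, _ in gen1)
    texts = (text for _, text in gen2)

    index = defaultdict(set)
    for id_, tokens in zip(ids, pipe(texts)):
        for token in tokens:
            index[token].add(id_)
    return index


# Made-up paths, used only for illustration.
items = [
    ("docs/a.txt", "One document"),
    ("docs/b.txt", "Lots of documents"),
    ("docs/c.txt", "One of many"),
]
index = build_inverted_index(items)
print(sorted(index["one"]))  # ['docs/a.txt', 'docs/c.txt']
print(sorted(index["of"]))   # ['docs/b.txt', 'docs/c.txt']
```

Because the ID is whatever you put in the first slot of each pair, a file path or URL works just as well as an integer. Note that this only works as long as the pipe yields exactly one result per input, in input order.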

【Comments】: