Inverted search in Python: answers

【Question title】: Inverted search in Python
【Posted】: 2017-03-19 00:05:10
【Question description】:

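I am trying to build an inverted search (an inverted index) as part of a map reduce job, and I have completed the first part (the mapper). The output of the first part looks like this (the header is for reference only and is not part of the mapper's actual output):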

word     frequency     document
------------------------------
tire        1           car
headlight   1           shop
tire        1           car
gas         1           gasstation
beer        1           gasstation
headlight   1           car
tire        1           shop

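This is what I am trying to work out:

Which document each word is found in, and how often (for example, tire is found twice in the car document).

So far I have tried using a dictionary to get the documents in which each word is found, but I cannot connect that to the counts. Below is the output I get: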

{'car':[tire,tire,headlight],'shop':[headlight],'gasstation':[gas,beer]}

Expected:

tire           {'car':2,'shop':1}
headlight      {'car':1, 'shop':1}

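【Comments】:

Note that "expected" is not a placeholder for "desired" output. Why do you expect this output? Where is the code you expect to generate it? Please give a minimal reproducible example.
Look at the Counter class.

Tags: python python-2.7 dictionary inverted-index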

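【Solution 1】:

What you want is to reduce to a dict: you have to group the elements of your list by word and then by document.

Suppose the output of your mapper is a list of dictionaries like this: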

mapped_data = [
    { 'word': 'tire', 'frequency': 1, 'document': 'car' },
    { 'word': 'headlight', 'frequency': 1, 'document': 'shop' }
]
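If the mapper emits whitespace-separated text rows, as in the question, rather than dictionaries, the rows first need to be converted into that shape. The following is only a sketch under that assumption: `parse_mapper_rows` and `sample_rows` are hypothetical names, not part of the original answer, and the header row is assumed to be absent from the real output.

```python
def parse_mapper_rows(rows):
    """Turn 'word frequency document' text rows into dicts (hypothetical helper)."""
    parsed = []
    for row in rows:
        fields = row.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        word, frequency, document = fields
        parsed.append({
            'word': word,
            'frequency': int(frequency),  # assumes the middle column is an integer
            'document': document,
        })
    return parsed


# Two rows copied from the question's sample output.
sample_rows = [
    'tire        1           car',
    'headlight   1           shop',
]
parsed_rows = parse_mapper_rows(sample_rows)
```

Here `parsed_rows` has the same structure as the `mapped_data` list above.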

Then you can do this:

from functools import reduce  # a builtin on Python 2; this import is required on Python 3

def reducer(accumulated, line):
    # We've never seen this word before: create the dict that stores its documents.
    if line['word'] not in accumulated:
        accumulated[line['word']] = {}

    # We've never seen this word in this document before: initialize the counter.
    if line['document'] not in accumulated[line['word']]:
        accumulated[line['word']][line['document']] = 0

    # Increment the counter by this line's frequency.
    accumulated[line['word']][line['document']] += line['frequency']

    # Return the accumulator so reduce() can pass it to the next call.
    return accumulated

result = reduce(reducer, mapped_data, {})

Given the full mapper output from the question, this produces the expected result:

{
    'tire': {
        'car': 2,
        'shop': 1
    },
    'headlight': {
        ...
    },
    ...
}
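The same grouping can also be written with the `Counter` class mentioned in the comments. This is an alternative sketch, not the answer's method: `build_inverted_index` and `sample_records` are hypothetical names, and the records are the seven rows from the question's table rewritten as dictionaries.

```python
from collections import Counter, defaultdict


def build_inverted_index(records):
    """Map each word to a Counter of {document: total frequency} (hypothetical helper)."""
    index = defaultdict(Counter)
    for record in records:
        # Missing words and missing documents both start at zero automatically.
        index[record['word']][record['document']] += record['frequency']
    return index


# The seven rows from the question's sample mapper output.
sample_records = [
    {'word': 'tire', 'frequency': 1, 'document': 'car'},
    {'word': 'headlight', 'frequency': 1, 'document': 'shop'},
    {'word': 'tire', 'frequency': 1, 'document': 'car'},
    {'word': 'gas', 'frequency': 1, 'document': 'gasstation'},
    {'word': 'beer', 'frequency': 1, 'document': 'gasstation'},
    {'word': 'headlight', 'frequency': 1, 'document': 'car'},
    {'word': 'tire', 'frequency': 1, 'document': 'shop'},
]
index = build_inverted_index(sample_records)

# A Counter compares equal to a plain dict with the same items.
assert index['tire'] == {'car': 2, 'shop': 1}
assert index['headlight'] == {'car': 1, 'shop': 1}
```

Because `defaultdict(Counter)` creates the inner counters on demand, the two "never seen before" checks in the reducer are not needed in this variant.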

【Comments】: