Python: Dictionary of list of lists (answers)

【Question title】: Python: Dictionary of list of lists
【Posted】: 2011-04-21 02:35:05
【Question】:

def makecounter():
     return collections.defaultdict(int)

class RankedIndex(object):
  def __init__(self):
    self._inverted_index = collections.defaultdict(list)
    self._documents = []
    self._inverted_index = collections.defaultdict(makecounter)


def index_dir(self, base_path):
    num_files_indexed = 0
    allfiles = os.listdir(base_path)
    self._documents = os.listdir(base_path)
    num_files_indexed = len(allfiles)
    docnumber = 0
    self._inverted_index = collections.defaultdict(list)

    docnumlist = []
    for file in allfiles: 
            self.documents = [base_path+file] #list of all text files
            f = open(base_path+file, 'r')
            lines = f.read()

            tokens = self.tokenize(lines)
            docnumber = docnumber + 1
            for term in tokens:  
                if term not in sorted(self._inverted_index.keys()):
                    self._inverted_index[term] = [docnumber]
                    self._inverted_index[term][docnumber] +=1                                           
                else:
                    if docnumber not in self._inverted_index.get(term):
                        docnumlist = self._inverted_index.get(term)
                        docnumlist = docnumlist.append(docnumber)
            f.close()
    print '\n \n'
    print 'Dictionary contents: \n'
    for term in sorted(self._inverted_index):
        print term, '->', self._inverted_index.get(term)
    return num_files_indexed
    return 0

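Running this code raises an IndexError: list index out of range.

The code above builds a dictionary index that stores each term as a key and, as its value, the list of document numbers in which the term occurs. For example, if the term "cat" occurs in documents 1.txt, 5.txt and 7.txt, the dictionary will have: cat

Now I have to modify it to add term frequency, so that if the word cat occurs twice in document 1, three times in document 5 and once in document 7: expected result: term

I have played around with the code, but nothing works. I don't know how to modify this data structure to achieve the above.

Thanks in advance.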

【Comments】:

Tags: python information-retrieval

【Solution 1】:

First, use a factory. Start with:

def makecounter():
    return collections.defaultdict(int)

Later, use

self._inverted_index = collections.defaultdict(makecounter)

and, as the for term in tokens: loop, use:

for term in tokens:
    self._inverted_index[term][docnumber] += 1

This leaves a dict in each self._inverted_index[term], such as

{1:2,5:3,7:1}

in your example. Since you want a list of lists in each self._inverted_index[term] instead, add the following after the loop ends:

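# Turn each {docnumber: count} dict into [[docnumber, count], ...], sorted by docnumber.
self._inverted_index = dict((t, [[d, v[d]] for d in sorted(v)])
                            for t, v in self._inverted_index.items())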

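Once made (this way or any other; I'm just showing one simple way to construct it!), this data structure will of course be as awkward to use as you needlessly made it hard to construct (the dict of dicts is more useful, and easier both to use and to construct), but hey, one man's meat &c ;-).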
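
The steps above can be put together as a minimal sketch. It assumes the documents are already tokenized; the `docs` sample data and the `build_index` and `to_list_of_lists` helper names are invented for illustration and do not come from the question. It is written for Python 3, whereas the question's code uses Python 2 print statements.

```python
import collections


def makecounter():
    # Factory: each new term gets its own {docnumber: count} mapping.
    return collections.defaultdict(int)


def build_index(docs):
    """Build {term: {docnumber: frequency}} from a list of token lists.

    `docs` is a hypothetical stand-in for the tokenized files; document
    numbers start at 1, as in the question.
    """
    inverted_index = collections.defaultdict(makecounter)
    for docnumber, tokens in enumerate(docs, start=1):
        for term in tokens:
            inverted_index[term][docnumber] += 1
    return inverted_index


def to_list_of_lists(inverted_index):
    # Convert each inner dict to [[docnumber, frequency], ...], sorted by docnumber.
    return {t: [[d, v[d]] for d in sorted(v)] for t, v in inverted_index.items()}


# Hypothetical sample data: three already-tokenized documents.
docs = [["cat", "cat", "dog"], ["dog"], ["cat"]]
index = build_index(docs)
print(dict(index["cat"]))              # {1: 2, 3: 1}
print(to_list_of_lists(index)["cat"])  # [[1, 2], [3, 1]]
```

Keeping the dict of dicts as the working structure and converting only at the end means the counting loop never has to search a list for an existing document number.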

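【Discussion】:

I've made the changes you suggested. I realize your approach is simpler and cleaner than implementing a dictionary of lists of lists. However, it currently gives me an error; I have edited the code above.
@csguy, in your indexdir method (assuming it is one; the indentation you posted above is all wrong) you completely destroy whatever was previously assigned to self._inverted_index by assigning your previous, wrong data structure to it, which makes your edit to the code totally irrelevant. You do realize that when you do self.a = b, it no longer matters in the least what, if anything, was previously assigned to self.a, right?!
I see where the problem is, but since I don't really understand your implementation, I've decided to stick with my approach, i.e. a dictionary of lists of lists, even though it's overly complicated.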

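【Solution 2】:

Here is a general algorithm you can use, though you will need to adapt some of your code to fit it. It produces a dictionary that holds, for each file, a dictionary of word counts.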

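filedicts = {}
for file in allfiles:
  # Bind the per-file dict to a name so the inner loop can update it.
  filedict = filedicts[file] = {}

  for term in terms:  # terms: the tokens of the current file
    filedict.setdefault(term, 0)
    filedict[term] += 1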
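
A self-contained variant of this idea, assuming each file's tokens are already available in memory, might look like the sketch below. The `tokens_by_file` mapping and its contents are invented for illustration, and the code is written for Python 3.

```python
# Hypothetical input: file name -> list of tokens found in that file.
tokens_by_file = {
    "1.txt": ["cat", "cat", "dog"],
    "5.txt": ["cat", "cat", "cat"],
}

filedicts = {}
for file, terms in tokens_by_file.items():
    filedict = filedicts[file] = {}  # word counts for this file
    for term in terms:
        filedict.setdefault(term, 0)
        filedict[term] += 1

print(filedicts["1.txt"])  # {'cat': 2, 'dog': 1}
```

Note that this structure is keyed by file first and term second, the reverse of the question's term-to-documents index, so finding every document that contains a given term means checking each inner dictionary.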

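【Discussion】:

【Solution 3】:

Perhaps you could create a simple class for (docname, frequency).

Your dict can then hold lists of this new data type. You could also build a list of lists, but a separate data type would be cleaner.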
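
One way to sketch such a data type is with `collections.namedtuple`. The answer does not name a particular class mechanism, so this is only one possible choice. The `Posting` name and the `cat_counts` sample values (taken from the question's cat example) are assumptions for illustration, and the code is written for Python 3.

```python
import collections

# Hypothetical record type for one (docname, frequency) pair.
Posting = collections.namedtuple("Posting", ["docname", "frequency"])

# Hypothetical per-document counts for the term "cat", following the question's example.
cat_counts = {"1.txt": 2, "5.txt": 3, "7.txt": 1}

# The index maps each term to a list of Posting records, sorted by document name.
index = {
    "cat": [Posting(name, count) for name, count in sorted(cat_counts.items())],
}

for posting in index["cat"]:
    print(posting.docname, posting.frequency)
```

Named fields make lookups such as `posting.frequency` self-describing, where a list of lists would need positional access like `pair[1]`.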

【Discussion】: