Create a Corpus Containing the Vocabulary of Words: Answers

【Question title】: Create a Corpus Containing the Vocabulary of Words
【Posted】: 2020-01-18 21:23:24
【Question description】:

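I am computing the inverse_document_frequency for every word in my dictionary of documents, and I have to show the top 5 documents ranked by their score for a query. But I am stuck in the loop that builds the corpus containing the vocabulary of words in the documents. Please help me improve my code. This block of code reads my files and removes punctuation and stop words from them: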

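# Assumed definitions (not shown in the original post)
from string import punctuation
from nltk.corpus import stopwords

# A set gives fast membership tests (requires the NLTK 'stopwords' data)
english_stopwords = set(stopwords.words('english'))

def wordList(doc):
    """
    1: Remove Punctuation
    2: Remove Stop Words
    3: return List of Words
    """
    # `with` closes the file even if reading fails
    with open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r', encoding="utf8", errors='ignore') as file:
        text = file.read().strip()
    nopunc = ''.join(char for char in text if char not in punctuation)
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]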

This block of code stores the names of all the files in my folder:

from pathlib import Path

file_names = []
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)

This block of code creates the dictionary of documents I am working on:

documents = {}
for i in file_names:
    documents[i] = wordList(i)

The code above runs well and fast, but the following block takes a huge amount of time to build the corpus. How can I improve it?

# create a corpus containing the vocabulary of words in the documents
corpus = []  # a list that will store words of the vocabulary
for doc in documents.values():  # iterate through documents
    for word in doc:  # go through each word in the current doc
        if word not in corpus:
            corpus.append(word)  # add word in corpus if not already added

This code creates a dictionary that stores the document frequency of every word in the corpus:

df_corpus = {}  # document frequency for every word in corpus
for word in corpus:
    k = 0  # initial document frequency set to 0
    for doc in documents.values():  # iterate through documents
        if word in doc:  # doc is already a list of words, so no .split() is needed
            k += 1
    df_corpus[word] = k

It has been building the corpus for 2 hours and is still running. Please help me improve my code. This is the dataset I am using: https://drive.google.com/open?id=1D1GjN_JTGNBv9rPNcWJMeLB_viy9pCfJ

【Comments on the question】:

The link you shared does not open; please share some sample data if possible.
@qaiser check it now
Facing the same problem...
@qaiser drive.google.com/file/d/1D1GjN_JTGNBv9rPNcWJMeLB_viy9pCfJ/…

Tags: python nltk information-retrieval

【Solution 1】:

How about making the corpus a set instead of a list? Then you also do not need the extra if.

corpus = set() # a set that will store words of the vocabulary
for doc in documents.values(): #iterate through documents 
    corpus.update(doc) #add word in corpus if not already added
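
To illustrate the suggestion on a small scale, here is a minimal sketch with a hypothetical toy `documents` dictionary (the real one is built from the text files in the question). It shows the `update` loop next to an equivalent one-liner using `set().union`. Membership tests on a set take constant time on average, whereas `word not in corpus` on a list scans the whole list on every check, which is likely why the list version slows down as the vocabulary grows.

```python
# Toy stand-in for the real `documents` dict (file name -> list of words).
# The file names and words here are made up for illustration.
documents = {
    "a.txt": ["corpus", "words", "corpus"],
    "b.txt": ["words", "vocabulary"],
}

# Loop form: update() adds every word and silently drops duplicates.
corpus = set()
for doc in documents.values():
    corpus.update(doc)

# Equivalent one-liner: the union of all word lists.
corpus_alt = set().union(*documents.values())

print(sorted(corpus))
```

If the order of the vocabulary matters later on, `sorted(corpus)` gives a stable list to work from.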

【Comments】:

I am using a set now, but it makes no difference because of the for loop.
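
The remaining slow part may well be the document-frequency loop from the question, which scans every document's word list once for every vocabulary word. A possible sketch, assuming `documents` maps file names to word lists as in the question: turn each document into a set once and count with `collections.Counter`, so each document is visited a single time. The toy data and the plain `log(N / df)` form of the inverse document frequency are assumptions for illustration and are not taken from the original post.

```python
import math
from collections import Counter

# Toy stand-in for the real `documents` dict (file name -> list of words).
documents = {
    "a.txt": ["corpus", "words", "corpus"],
    "b.txt": ["words", "vocabulary"],
}

# Document frequency: in how many documents does each word appear?
df_corpus = Counter()
for doc in documents.values():
    df_corpus.update(set(doc))  # set() so a word counts once per document

# Assumed IDF variant: log(N / df), with N the number of documents.
n_docs = len(documents)
idf = {word: math.log(n_docs / df) for word, df in df_corpus.items()}
```

With this approach the keys of `df_corpus` already form the vocabulary, so a separate corpus-building pass is not strictly needed.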