【Question Title】: Ignoring duplicate words in a python dictionary
【Posted】: 2011-07-25 14:00:58
【Question】:

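I have a Python script that takes '.html' files, removes the stop words, and returns all the other words in a Python dictionary. But if the same word occurs in multiple files, I want it to be returned only once. That is, the result should contain the non-stop words, each of them only once.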

import os
import re

# `path` and `stopwordfile` are assumed to be defined elsewhere in the script.

def run():
    filelist = os.listdir(path)
    regex = re.compile(r'.*<div class="body">(.*?)</div>.*', re.DOTALL | re.IGNORECASE)
    reg1 = re.compile(r'<\/?[ap][^>]*>', re.DOTALL | re.IGNORECASE)
    quotereg = re.compile(r'&quot;', re.DOTALL | re.IGNORECASE)
    puncreg = re.compile(r'[^\w]', re.DOTALL | re.IGNORECASE)
    with open(stopwordfile, 'r') as f:
        stopwords = f.read().lower().split()
    filewords = {}

    # Keep only the .html files
    htmlfiles = []
    for file in filelist:
        if file.endswith('.html'):
            htmlfiles.append(file)
    totalfreq = {}

    for file in htmlfiles:
        with open(os.path.join(path, file), 'r') as f:
            words = f.read().lower()
        # Extract the body div, then replace entities, tags and punctuation with spaces
        words = regex.findall(words)[0]
        words = quotereg.sub(' ', words)
        words = reg1.sub(' ', words)
        words = puncreg.sub(' ', words)
        words = words.strip().split()

        # Remove every occurrence of each stop word
        for w in stopwords:
            while w in words:
                words.remove(w)

        freq = {}
        for w in words:
            words = words  # no-op: duplicates are not removed here
        print(words)

if __name__ == '__main__':
    run()

【Comments】:

    Tags: python regex dictionary duplicates stop-words


    【Solution 1】:

    Use a set. Just add every word you find to the set; it ignores duplicates.

    Assuming you have an iterator that returns each word in a file (this is for plain text; HTML would be more complicated):

    def words(filename):
        with open(filename) as wordfile:
            for line in wordfile:
                for word in line.split():
                    yield word
    

    Then getting them into a set is simple:

    wordlist = set(words("words.txt"))
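
To make the "ignores duplicates" behaviour concrete, here is a minimal sketch that adds words to a set one at a time. The sample sentence and the name `unique_words` are made up for illustration and are not part of the original answer.

```python
# Hypothetical sample text; any iterable of words behaves the same way.
sample = "the cat saw the other cat"

unique_words = set()
for word in sample.split():
    # Adding a word that is already in the set leaves the set unchanged.
    unique_words.add(word)

# A set has no defined order, so sort it only to get a stable display.
print(sorted(unique_words))
```

Six words go in, but only the four distinct ones remain, which is the same thing `set(words("words.txt"))` does in a single call.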
    

    If you have multiple files, do it like this:

    wordlist = set()
    wordfiles = ["words1.txt", "words2.txt", "words3.txt"]
    
    for wordfile in wordfiles:
        wordlist |= set(words(wordfile))
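
A self-contained variant of the multi-file loop might look like the sketch below. It assumes plain-text input, writes two throwaway files with made-up contents so it can run anywhere, and uses `set.update()`, which accepts any iterable, instead of building an intermediate set for each file. The helper name `unique_words` and the file names are hypothetical.

```python
import os
import tempfile

def words(filename):
    """Yield each whitespace-separated word in a plain-text file."""
    with open(filename) as wordfile:
        for line in wordfile:
            for word in line.split():
                yield word

def unique_words(filenames):
    """Collect the distinct words found across all of the given files."""
    found = set()
    for filename in filenames:
        # update() consumes the generator directly; no temporary set is built.
        found.update(words(filename))
    return found

# Two throwaway files with made-up contents that share the word "apple".
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "apple banana apple"), ("b.txt", "apple cherry")]:
    file_path = os.path.join(tmpdir, name)
    with open(file_path, "w") as handle:
        handle.write(text)
    paths.append(file_path)

result = unique_words(paths)
print(sorted(result))
```

Both `|=` and `update()` give the same result; `update()` just skips the extra `set(...)` construction per file.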
    

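    You can also use a set for your stop words. Then you can simply subtract them from the word list after the fact, which will probably be faster than checking each word against the stop words before adding it.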

    stopwords = set(["a", "an", "the"])
    wordlist -= stopwords
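
For the HTML case in the question, the same idea could be applied roughly as follows. This is only a sketch under simplifying assumptions: the pages are given as in-memory strings rather than files, all tags are stripped with a crude regex instead of extracting the `<div class="body">` section, and the names `html_words`, `TAG`, `ENTITY`, `NON_WORD` and the sample pages are invented for illustration.

```python
import re

TAG = re.compile(r"<[^>]+>")     # crude tag stripper, adequate for a sketch only
ENTITY = re.compile(r"&\w+;")    # character entities such as &quot;
NON_WORD = re.compile(r"[^\w]")  # same idea as puncreg in the question

def html_words(html):
    """Return the set of lowercased words in an HTML string."""
    text = TAG.sub(" ", html.lower())
    text = ENTITY.sub(" ", text)
    text = NON_WORD.sub(" ", text)
    return set(text.split())

# Made-up sample pages that share the words "quick" and "fox".
pages = [
    '<div class="body"><p>The quick fox</p></div>',
    '<div class="body"><p>A quick &quot;brown&quot; fox</p></div>',
]
stopwords = set(["a", "an", "the"])

wordlist = set()
for page in pages:
    wordlist |= html_words(page)  # union keeps each word once across pages
wordlist -= stopwords             # drop the stop words in one step at the end
print(sorted(wordlist))
```

Regex-based tag stripping breaks on malformed or unusual markup, so a real HTML parser would be the more robust choice for anything beyond a quick script.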
    

    【Comments】:
