[Question Title]: How to create a vocabulary from a list of strings in a fast manner in Python
[Posted]: 2020-03-14 10:53:06
[Question]:

I have a problem that I have solved, but not in an efficient way. I have a list of strings, which are captions for images. From this list I need to build a dictionary containing:

  • each word, if that word occurs 5 times or more in the list
  • a simple id for that word

So my vocabulary, a Python dictionary, will contain word:id entries.

First, I have a helper function that splits a string into tokens, or words:

import re

def split_sentence(sentence):
    return list(filter(lambda x: len(x) > 0, re.split(r'\W+', sentence.lower())))
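For example, the helper lowercases the input and splits on runs of non-word characters (the sample sentence here is made up):

```python
import re

def split_sentence(sentence):
    # keep only non-empty tokens after splitting on non-word characters
    return list(filter(lambda x: len(x) > 0, re.split(r'\W+', sentence.lower())))

print(split_sentence("A dog, a cat\nand a bird!"))
# ['a', 'dog', 'a', 'cat', 'and', 'a', 'bird']
```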

Then I generate the vocabulary like this, which works:

def generate_vocabulary(train_captions):
    """
    Return {token: index} for all train tokens (words) that occur 5 times or more, 
        `index` should be from 0 to N, where N is a number of unique tokens in the resulting dictionary.
    """  
    #join the list of captions into one string
    string = ' '.join([str(elem) for elem in train_captions])

    #split the string into tokens (individual words) with the helper above
    individual_words = split_sentence(string)

    #collect the words that occur 5 times or more in that string
    more_than_5 = list(set([x for x in individual_words if individual_words.count(x) >= 5]))

    #generate ids
    ids = [i for i in range(0, len(more_than_5))]

    #build the vocabulary (dictionary)
    vocab = dict(zip(more_than_5, ids))

    return {token: index for index, token in enumerate(sorted(vocab))}
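For illustration, a minimal run on a made-up caption list (the captions are invented; only words occurring 5+ times survive):

```python
import re

def split_sentence(sentence):
    return list(filter(lambda x: len(x) > 0, re.split(r'\W+', sentence.lower())))

def generate_vocabulary(train_captions):
    string = ' '.join([str(elem) for elem in train_captions])
    individual_words = split_sentence(string)
    # one list.count() call per token, each a full scan: quadratic overall
    more_than_5 = list(set([x for x in individual_words if individual_words.count(x) >= 5]))
    vocab = dict(zip(more_than_5, range(len(more_than_5))))
    return {token: index for index, token in enumerate(sorted(vocab))}

captions = ["a dog and a cat", "a bird and a frog", "a fish and a fox",
            "a cow and a goat", "a hen and a pig"]
print(generate_vocabulary(captions))
# {'a': 0, 'and': 1}   ('a' occurs 10 times, 'and' 5 times; all others fewer)
```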

That code works like a charm for relatively small caption lists. However, for lists thousands of entries long (e.g. 80,000), it takes forever. I have now been running this code for an hour.

Is there any way to speed up my code? How can I compute the more_than_5 variable faster?

EDIT: I forgot to mention that in a few specific members of this list of strings there are \n characters at the beginning of some sentences. Is it possible to remove this character from my list and then apply the algorithm again?

[Comments]:

    Tags: python python-3.x python-2.7


    [Solution 1]:

    You can count the word occurrences once, using Counter from the collections package, instead of counting them at every step of the list comprehension.

    import re
    from collections import Counter
    
    def split_sentence(sentence):
        return list(filter(lambda x: len(x) > 0, re.split(r'\W+', sentence.lower())))
    
    def generate_vocabulary(train_captions, min_threshold):
        """
        Return {token: index} for all train tokens (words) that occur min_threshold times or more, 
            `index` should be from 0 to N, where N is a number of unique tokens in the resulting dictionary.
        """  
        #convert the list of whole captions to one string
        concat_str = ' '.join([str(elem).strip('\n') for elem in train_captions]) 
        #divide the string tokens (individual words), by calling the split_sentence function 
        individual_words = split_sentence(concat_str)
        #create a list of words that happen min_threshold times or more in that string  
        condition_keys = sorted([key for key, value in Counter(individual_words).items() if value >= min_threshold])
        #generate the vocabulary(dictionary)
        result = dict(zip(condition_keys, range(len(condition_keys))))
        return result
    
    train_captions = ['Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been.',
                      'I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.',
                      'Almost nothing was more annoying than having our wasted time wasted on something not worth wasting it on.']
    
    generate_vocabulary(train_captions, min_threshold=5)
    # {'a': 0, 'because': 1, 'catholic': 2, 'i': 3, 'was': 4} 
    

    [Discussion]:

    • Thank you very much! Just one other thing that I forgot to mention. At the beginning of some strings in the list there are \n characters. How can I remove that particular symbol (keeping the rest of the sentence) and apply the algorithm again? Thanks again.
    • @mad, you can use the str.strip() method to remove '\n' from the beginning and end of a string. The answer has been updated.
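To illustrate that comment: str.strip('\n') removes newlines only from the ends of a string, leaving interior ones alone (sample strings invented):

```python
# leading/trailing newlines are removed
print("\na dog and a cat\n".strip('\n'))   # 'a dog and a cat'
# interior newlines are untouched
print("a dog\nand a cat".strip('\n'))      # 'a dog\nand a cat'
```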
    [Solution 2]:

    As @Eduard Ilyasov said, the Counter class is the best tool when you need to count things.

    Here is my solution:

    import re
    import collections
    
    original_text = (
        "I say to you today, my friends, though, even though ",
        "we face the difficulties of today and tomorrow, I still have ",
        "a dream. It is a dream deeply rooted in the American ",
        "dream. I have a dream that one day this nation will rise ",
        'up, live out the true meaning of its creed: "We hold these ',
        'truths to be self-evident, that all men are created equal."',
        "",
        "I have a dream that one day on the red hills of Georgia ",
        "sons of former slaves and the sons of former slave-owners ",
        "will be able to sit down together at the table of brotherhood. ",
        "I have a dream that one day even the state of ",
        "Mississippi, a state sweltering with the heat of injustice, ",
        "sweltering with the heat of oppression, will be transformed ",
        "into an oasis of freedom and justice. ",
        "",
        "I have a dream that my four little chi1dren will one day ",
        "live in a nation where they will not be judged by the color ",
        "of their skin but by the content of their character. I have ",
        "a dream… I have a dream that one day in Alabama, ",
        "with its vicious racists, with its governor having his lips ",
        "dripping with the words of interposition and nullification, ",
        "one day right there in Alabama little black boys and black ",
        "girls will he able to join hands with little white boys and ",
        "white girls as sisters and brothers. "
        )
    
    def split_sentence(sentence):
        return (x.lower() for x in re.split(r'\W+', sentence.strip()) if x)
    
    def generate_vocabulary(train_captions):
        word_count = collections.Counter()
    
        for current_sentence in train_captions:
            word_count.update(split_sentence(str(current_sentence)))
    
        return {key: value for (key, value) in word_count.items() if value >= 5}
    
    print(generate_vocabulary(original_text))
    

    I made a couple of assumptions that you didn't specify:

    • I did not assume that a word could span two sentences
    • I kept the fact that your captions are not always strings. If you know they always will be, you can simplify the code by changing word_count.update(split_sentence(str(current_sentence))) to word_count.update(split_sentence(current_sentence))
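To illustrate the second assumption, the str() wrapper is what keeps a non-string caption (the stray number below is invented) from raising an AttributeError on .strip():

```python
import collections
import re

def split_sentence(sentence):
    return (x.lower() for x in re.split(r'\W+', sentence.strip()) if x)

word_count = collections.Counter()
# captions may contain non-string entries, e.g. a stray number
for caption in ["a dog and a cat", 42, "a bird"]:
    word_count.update(split_sentence(str(caption)))  # str() handles the non-string entry

print(word_count)
# Counter({'a': 3, 'dog': 1, 'and': 1, 'cat': 1, '42': 1, 'bird': 1})
```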

    [Discussion]:
