Count word frequency of all words in a file: answers

【Question title】: Count word frequency of all words in a file
【Posted】: 2019-05-27 19:15:32
【Question description】:

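I have a text file from which I have removed symbols and stop words.

I have also tokenized it (split it into a list of all the words), in case a list is easier to work with.

I would like to create a .csv file containing the frequency of every word (long format), sorted in descending order. How should I go about it?

I thought about looping through the list like this: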

longData = pandas.DataFrame([], index=[], columns=['Frequency'])
for word in tokenizedFile:
    if word in longData.index:
         longData.loc[word]=longData.loc[word]+1
    else:
         wordFrame = pandas.DataFrame([1], index=[word])
         longData.append(wordFrame)

But this seems very inefficient and wasteful.

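【Question discussion】:

Does your solution work?
Thinking out loud here, but I think: words = list(set(tokenizedFile)), then tokens = np.asarray(tokenizedFile), then you iterate (for word in words:) and count the instances of each word with num_instances = len(np.where(tokens == word)[0]). From there you can build a dictionary or a DataFrame to store the number of instances of each word.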
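
The per-word loop in the comment above can probably be collapsed into a single NumPy call. The sketch below swaps in `np.unique(..., return_counts=True)` for the explicit loop; it is not what the commenter wrote. The `tokenized_file` list is a made-up stand-in for the asker's token list.

```python
import numpy as np

# Hypothetical stand-in for the asker's tokenizedFile list.
tokenized_file = ["b", "a", "b", "c", "b", "a"]

tokens = np.asarray(tokenized_file)

# np.unique returns the distinct words (sorted) and how often each one occurs.
words, counts = np.unique(tokens, return_counts=True)

# Reorder from most to least frequent; a stable sort keeps ties in alphabetical order.
order = np.argsort(-counts, kind="stable")
word_counts = dict(zip(words[order].tolist(), counts[order].tolist()))
```

This scans the token array once instead of once per distinct word, which is where the loop version would spend most of its time.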

Tags: python python-3.x pandas text nltk

【Solution 1】:

A Counter would work well here:

    from collections import Counter
    import pandas as pd

    c = Counter(tokenizedFile)  # maps each word to its number of occurrences
    longData = pd.DataFrame(list(c.values()), index=list(c.keys()), columns=['Frequency'])
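
To get from the Counter to the descending-order .csv the question asks for, one option is `Counter.most_common()`, which already returns the pairs sorted by count. This is a minimal sketch: the token list, the column names `Word` and `Frequency`, and the in-memory buffer are illustrative assumptions, not part of the original answer.

```python
from collections import Counter
import io

import pandas as pd

# Hypothetical stand-in for the asker's tokenizedFile list.
tokenized_file = ["apple", "pear", "apple", "fig", "pear", "apple"]

counts = Counter(tokenized_file)

# most_common() yields (word, count) pairs ordered from most to least frequent.
long_data = pd.DataFrame(counts.most_common(), columns=["Word", "Frequency"])

# Written to an in-memory buffer for demonstration; pass a file path
# (for example "word_counts.csv") to to_csv() to create a real file.
buffer = io.StringIO()
long_data.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Keeping the words in a regular column rather than the index means `index=False` produces a plain two-column CSV.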

【Discussion】:

【Solution 2】:

If your text is a list of strings like the one below:

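from sklearn.feature_extraction import text

texts = [
    'this is the first text',
    'this is the second text',
    'and this is the last text the have two word text',
]

# Instantiate the vectorizer.
cv = text.CountVectorizer()

# Learn the vocabulary from the texts.
cv.fit(texts)

# One row per text, one column per vocabulary word, holding the counts.
vectors = cv.transform(texts).toarray()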

You will need to explore more of the parameters.

【Discussion】:

【Solution 3】:

You can use Series.str.extractall() and Series.value_counts(). Assuming file.txt is the path to the file with symbols and stop words already removed:

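import pandas as pd

# Read the file into a one-column DataFrame; the default column label is 0.
# Note: some newer pandas versions reject sep='\n'; if so, read the lines with open() instead.
df = pd.read_csv('file.txt', sep='\n', header=None)

# Extract every word into its own row, then count the occurrences.
words_count = df[0].str.extractall(r'(\w+)')[0].value_counts()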

The result above, words_count, is a Series; you can convert it to a DataFrame with:

df_new = words_count.to_frame('words_count')
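
If `read_csv` does not accept `sep='\n'` in your pandas version (some newer releases appear to raise an error for it), the same extractall/value_counts pipeline can run on lines read with plain `open()`. This is a sketch under that assumption; the temporary file and its contents are made up so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create a small throwaway file standing in for file.txt.
    path = os.path.join(tmp, "file.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("red green red\nblue red green\n")

    # One Series element per line of the file.
    with open(path, encoding="utf-8") as f:
        lines = pd.Series(f.read().splitlines())

# Extract every word into its own row, then count occurrences (descending by default).
words_count = lines.str.extractall(r"(\w+)")[0].value_counts()
df_new = words_count.to_frame("words_count")
```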

【Discussion】:

【Solution 4】:

If anyone is still struggling, you can try the following:

import pandas as pd

# tokenizedFile is a list, so lower() has to be applied to each word.
df = pd.DataFrame({"words": [word.lower() for word in tokenizedFile]})
value_count = df["words"].value_counts()  # count of every word, sorted in descending order
# Store the words and their respective counts in a new DataFrame:
# value_count.keys() are the words, value_count.values are the counts.
vocabulary_df = pd.DataFrame({"words": value_count.keys(), "count": value_count.values})

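What this does:

It takes the list of words (tokenizedFile) and converts every word to lowercase. It then creates a column titled words whose data is all the words in the file.
The value_count variable uses the value_counts method that pandas provides to store how many times each word occurs in our df DataFrame. By default the result is sorted in descending order of count.
The last line of code creates a new vocabulary_df that stores all the words and their counts in a new DataFrame (value_count itself is a Series). Here, value_count.keys() contains the words and value_count.values contains the count of each word.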
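
The three steps described above can also be chained into one expression. The sketch below is an alternative phrasing, not the answer's own code; the token list is a made-up example with mixed case to show the lowercasing step.

```python
import pandas as pd

# Hypothetical stand-in for the asker's tokenizedFile list.
tokenized_file = ["Data", "data", "Word", "DATA", "word", "file"]

# Lowercase each token so "Data", "data" and "DATA" count as the same word.
words = pd.Series([w.lower() for w in tokenized_file])

# value_counts() sorts by count (descending); rename_axis/reset_index turn the
# resulting Series into a two-column DataFrame.
vocabulary_df = words.value_counts().rename_axis("words").reset_index(name="count")
```

From here, `vocabulary_df.to_csv(...)` with `index=False` would give the long-format file the question asks for.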

Hope this helps someone along the way. :)

【Discussion】: