Using pandas to count sentences and words inside a csv

【Question title】: Using pandas to count sentences and words inside a csv
【Posted】: 2020-04-25 16:25:27
【Question】:

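I am trying to create a Python program that goes through a csv file chosen by the user and prints the total number of sentences (delimited by a period or a newline) as well as the total number of words.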

Insert file

Total sentences: 3

Total words: 15

Total unique words: 12

data = pd.read_csv('dundun.csv', sep='\t')
words = data['sentences'].str.split(expand=True)
word_count = {}
for word in words:
    count = word_count.get(word, 0)
    count += 1
    word_count[word] = count
print(word_count)

I am trying this code, but it gives me the wrong output when counting words. My csv looks like:

【Comments on the question】:

Tags: python pandas csv word-count

【Solution 1】:

For a DataFrame df, count the sentences in each row:

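from nltk.tokenize import sent_tokenize  # requires NLTK and its Punkt tokenizer data
# Tokenize each review into sentences, then store the number of sentences per row.
df['review_sentence_count'] = df['reviews'].apply(sent_tokenize).apply(len)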

Count the words after removing punctuation:

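import string
from nltk.tokenize import word_tokenize
# Strip punctuation so that it is not counted as separate tokens.
# Note: this overwrites the original 'reviews' column.
df['reviews'] = df['reviews'].str.translate(str.maketrans('', '', string.punctuation))
# Tokenize each cleaned review and store the number of words per row.
df['review_word_count'] = df['reviews'].apply(word_tokenize).apply(len)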
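NLTK's tokenizers rely on tokenizer data that has to be downloaded separately. If that is not an option, a rough pandas-only approximation of the same two columns could look like the sketch below. It is not the answer's method: the sample rows and the helper `count_sentences` are hypothetical, and sentences are simply assumed to end at `.`, `!`, `?` or a newline, which is cruder than `sent_tokenize` (abbreviations such as "e.g." would be over-counted).

```python
import re
import string

import pandas as pd

# Hypothetical sample data; only the column name 'reviews' comes from the answer.
df = pd.DataFrame({
    "reviews": [
        "Great phone. Battery lasts long!",
        "Not bad",
        "Too pricey... Would not buy again.",
    ]
})


def count_sentences(text):
    """Count non-empty chunks between sentence terminators (assumed: . ! ? newline)."""
    return sum(1 for part in re.split(r"[.!?\n]+", text) if part.strip())


# Per-row sentence count.
df["review_sentence_count"] = df["reviews"].apply(count_sentences)

# Per-row word count: drop punctuation, split on whitespace, take the list length.
table = str.maketrans("", "", string.punctuation)
df["review_word_count"] = df["reviews"].str.translate(table).str.split().str.len()

print(df[["review_sentence_count", "review_word_count"]])
```

Unlike the snippet above it, this version leaves the `reviews` column untouched, so the original punctuation is still there when the frame is written out.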

Write the data, including the new columns, to a csv:

df.to_csv('./data/dataset.csv')
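One detail worth knowing here: by default `to_csv` also writes the DataFrame index as an unnamed first column, and it does not create a missing `./data` directory. A small sketch, assuming the index is not wanted in the output, with a made-up one-row frame and an in-memory buffer standing in for the file path:

```python
import io

import pandas as pd

# Hypothetical one-row frame with the kind of columns built above.
df = pd.DataFrame({"reviews": ["Not bad"], "review_word_count": [2]})

# In-memory stand-in for './data/dataset.csv' so the sketch runs anywhere.
buffer = io.StringIO()

# index=False leaves out the row index, so only the real columns are written.
df.to_csv(buffer, index=False)

header = buffer.getvalue().splitlines()[0]
print(header)
```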

【Comments】:

【Solution 2】:

Try using:

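import string
# Total words: split each row on whitespace and sum the token counts.
nwords = data['sentences'].str.split().map(len).sum()
# Total sentences: counts non-null rows, so this assumes one sentence per row.
nsentences = data['sentences'].count()
# Unique words: strip punctuation from every token, then de-duplicate with a set.
table = str.maketrans('', '', string.punctuation)
nunique_words = len({w.translate(table) for row in data['sentences'].str.split() for w in row})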
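As for why the loop in the question prints the wrong counts: iterating over a DataFrame yields its column labels, and `str.split(expand=True)` produces columns labelled 0, 1, 2, ..., so the dictionary ends up tallying those integers rather than words. A per-word frequency table can instead be built by flattening the tokens first. The sketch below is only illustrative: the three rows are made up (only the `sentences` column name comes from the question), and lowercasing the tokens is an added assumption so that "The" and "the" count as the same word.

```python
import string
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the question's file.
data = pd.DataFrame({
    "sentences": ["The cat sat.", "The dog sat too.", "A bird flew away"]
})

# What the question's loop actually iterates over: the column labels.
expanded = data["sentences"].str.split(expand=True)
labels = list(expanded)
print(labels)

# One token per row, punctuation removed, lowercased (assumption: case-insensitive counts).
table = str.maketrans("", "", string.punctuation)
tokens = data["sentences"].str.split().explode().str.translate(table).str.lower()

word_count = Counter(tokens)
print(word_count)
print("Total words:", sum(word_count.values()))
print("Total unique words:", len(word_count))
```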

【Comments】:

Let us continue this discussion in chat.