【Posted】: 2017-04-02 20:22:33
【Question】:
I have a set of data in a text file, and I want to build a frequency table based on predefined words (drive, street, i, lives). Here is an example:
ID | Text
---|--------------------------------------------------------------------
1 | i drive to work everyday in the morning and i drive back in the evening on main street
2 | i drive back in a car and then drive to the gym on 5th street
3 | Joe lives in Newyork on NY street
4 | Tod lives in Jersey city on NJ street
This is the output I want:
ID | drive | street | i | lives
----|--------|----------|------|-------
1 | 2 | 1 | 2 | 0
2 | 2 | 1 | 1 | 0
3 | 0 | 1 | 0 | 1
4 | 0 | 1 | 0 | 1
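This is the code I am currently using. It gives me the count of every word, but that does not solve my problem: I want to use a set of predefined words to find the counts shown above.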
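from collections import Counter
from nltk.corpus import stopwords

# Raw string so that backslashes in the Windows path (e.g. \f) are not treated as escapes.
xy = open(r'C:\Python\data\file.txt').read().split()
# Lowercase every token; a list (not a generator) so it can be iterated more than once.
xyz = [w.lower() for w in xy]
stopset = set(stopwords.words('english'))
# Drop English stop words (list-comprehension version).
filtered_words = [word for word in xyz if word not in stopset]
# The same filtering again, written as an explicit loop.
filtered_words = []
for word in xyz:
    if word not in stopset:
        filtered_words.append(word)
print(Counter(filtered_words))
print(len(filtered_words))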
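For reference, the per-row table described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive solution: it assumes each data row has the form "ID | Text" (header and separator lines already removed), that words are separated by whitespace with no punctuation attached, and that stop words are not filtered (otherwise "i" would be dropped). The names `TARGET_WORDS`, `SAMPLE_LINES` and `count_target_words` are made up for this illustration.

```python
from collections import Counter

# Assumption: the predefined words from the question, in the desired column order.
TARGET_WORDS = ["drive", "street", "i", "lives"]

# Assumption: data rows only, each shaped like "ID | Text" as in the example table.
SAMPLE_LINES = [
    "1 | i drive to work everyday in the morning and i drive back in the evening on main street",
    "2 | i drive back in a car and then drive to the gym on 5th street",
    "3 | Joe lives in Newyork on NY street",
    "4 | Tod lives in Jersey city on NJ street",
]


def count_target_words(lines, targets):
    """Return {row_id: [count of each target word]} for 'ID | Text' lines."""
    table = {}
    for line in lines:
        # Split on the first "|" only: left part is the ID, right part is the text.
        row_id, _, text = line.partition("|")
        # Lowercase and split on whitespace, as the code above does.
        counts = Counter(text.lower().split())
        # A Counter returns 0 for missing keys, so absent words need no special case.
        table[row_id.strip()] = [counts[word] for word in targets]
    return table


if __name__ == "__main__":
    table = count_target_words(SAMPLE_LINES, TARGET_WORDS)
    print("ID | " + " | ".join(TARGET_WORDS))
    for row_id, counts in table.items():
        print(row_id + " | " + " | ".join(str(c) for c in counts))
```

To read from the real file instead, the lines could be taken from `open(r'C:\Python\data\file.txt')` after skipping the header rows. If the text contains punctuation, a tokenizer or a regular expression would likely be needed in place of `split()`.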
【Discussion】:
-
Why do you have a list comprehension followed by a manual version of the same thing?
-
What output does the code produce?
-
Counter({'street': 4, 'drive': 4, 'back': 2, 'lives': 2, 'main': 1, 'morning': 1, 'nj': 1, '5th': 1, 'tod': 1, 'everyday': 1, 'newyork': 1, 'jersey': 1, 'joe': 1, 'city': 1, 'gym': 1, 'ny': 1, 'car': 1, 'evening': 1, 'work': 1})
-
@AlexHall - I didn't understand what you mean.
Tags: python python-3.x word-count word-frequency