【Posted】: 2021-05-29 15:21:40
【Problem description】:
These are my imports:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
I have a df similar to this:
df = pd.DataFrame({'comments': ['Daniel is really cool',
'Daniel is the most',
'We had such a',
'Very professional operation',
'Lots of bookcases']})
I then run the following:
df['tokenized'] = df['comments'].apply(word_tokenize)
df['tagged'] = df['tokenized'].apply(pos_tag)
df['lower_tagged'] = df['tokenized'].apply(lambda lt: [word.lower() for word in lt]).apply(pos_tag)
The column I'm interested in is lower_tagged:
0 [(daniel, NN), (is, VBZ), (really, RB), (cool,...
1 [(daniel, NN), (is, VBZ), (the, DT), (most, RBS)]
2 [(we, PRP), (had, VBD), (such, JJ), (a, DT)]
3 [(very, RB), (professional, JJ), (operation, NN)]
4 [(lots, NNS), (of, IN), (bookcases, NNS)]
I'm trying to implement a function that returns a list of the 1000 most common nouns in the lower_tagged column.
The expected result should look something like:
nouns = ['daniel', 'operation', 'bookcases', 'lots']
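To show the shape of the data I'm working with, here is a minimal sketch of the kind of flatten-and-count logic I'm after, using plain lists of (word, tag) tuples in place of the DataFrame column (the top_nouns name and the n parameter are my own; n=1000 would cap the real result):

```python
from collections import Counter

# lower_tagged rows as shown above: one list of (word, POS) pairs per comment
lower_tagged = [
    [('daniel', 'NN'), ('is', 'VBZ'), ('really', 'RB'), ('cool', 'JJ')],
    [('daniel', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('most', 'RBS')],
    [('we', 'PRP'), ('had', 'VBD'), ('such', 'JJ'), ('a', 'DT')],
    [('very', 'RB'), ('professional', 'JJ'), ('operation', 'NN')],
    [('lots', 'NNS'), ('of', 'IN'), ('bookcases', 'NNS')],
]

def top_nouns(tagged_rows, n=1000):
    # Flatten the per-row (word, tag) pairs and keep any noun tag (NN, NNS, ...)
    counts = Counter(word
                     for row in tagged_rows
                     for word, tag in row
                     if tag.startswith('NN'))
    # most_common(n) returns the n highest-count (word, count) pairs
    return [word for word, _ in counts.most_common(n)]

print(top_nouns(lower_tagged))
# → ['daniel', 'operation', 'lots', 'bookcases']
```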
One approach I tried is the following:
lower_tag = df['lower_tagged']
print([t[0] for t in lower_tag if t[1] == 'NN'])
However, this only returns an empty list. Another approach I tried:
def list_nouns(df):
    s = lower_tag
    nouns = [word for word, pos in pos_tag(word_tokenize(s)) if pos.startswith('NN')]
    return nouns
However, I get this error: expected string or bytes-like object
Apologies for the long post - any suggestions would be greatly appreciated, as I've been stuck on this for a while! Thanks
【Discussion】: